Som v 5.0


This release has the goal to do "the fastest vm".  To reach this goal, the vm uses three-address instruction format (t-code, som v 1.8) because it offers the lowest number of instruction executed (noi). Therefore in terms of performance, the noi will be smallest amongst all previous Som releases.  Because running time is directly varied with noi, it will also be "the fastest som". This work is based on Som v19 series of experiments.

T2-code design

Three-address instruction format was introduced in som v 1.8 in the new year day of 2006.  It crammed 4 fields into one 32-bit word. Its t-code has two short-comings:
1)  restricted range of globals.  It has 8-bit argument, therefore local+global is max 256.
2)  decoding 3-arg is slow.

To cure all short-comings above, the new instruction set should be a 3-address with fully decoded arguments, each 32-bit. Therefore
1)  all arguments are 32-bit, this cure som-v18 restriction on arguments, such as, small immediate 8/16-bit, small global 64..255.
2)  it is fully decoded, interpreting is fast.

However it will take 4x space in code segment.  I have two main designs:
  1. one with no restriction on addressing mode, so no distinction between local/global/immediate is made.  The instruction size is large (3 big addresses, upto 128 bits).  The call instruction needs to save/restore locals because they are not local to a stack frame.  It also will have the smallest noi.
  2. the other makes distinction between local and global/immediate.  It has several modes: r = r op m/i, r = r op r etc.  Because locals are distinct, a stack frame can be used to create new set of locals.  It will not need to save/restore locals. Global and immediate can be the same.  The instruction size will be 64 bits.
There are many variations between these two designs. Design 1 is simpler and easier to make a correct code (code generator is simple).  The size of code segment is not the main concern. The "call" will be a bit slower (than alternative design).  I will leave more complex optimisation (design 2) for a future release.

To compromise space requirement, the instruction format uses almost fully decoded instruction format, minimum decoding is required when executing instructions. The instruction format is 96-bit (d = a op b):
 d:24 op:8, a:32, b:32

See here for the detailed discussion of t2-code instruction set. <t2-code-design> Beside the new instruction set, som v5 has a better compiler. <v5-code-opt>  The story how to cross-compile code from one existing vm to a new vm is quite interesting and educated. <v5-crosscompile>

Som v 5 performance

v5 performance compare to som42 (the current best) running two sets of benchmarks: small and medium.  The small benchmark is a set of small programs that have been used as Birthday benchmarks, therefore the result can be compared to all previous releases since som v 1.0.  The medium programs are newly created, they emphasise realistic programs that do some real work.  Some part of the compiler had also been extracted into these programs: lexer, parser.  Their size are around 5 to 10 times larger than the small programs.

Results  v5 vs som42  (v5/som42)

noi             60%
runtime         90%   
code size       87%    
                        som42   v5
Mips small programs      62     43            
Mips all programs        50     34                    

See detailed benchmarks here. <v5-performance>

Discussion

The noi figure is as good as expected.  noi is reduced 40% compared to one-address som42. However, it does not translated into runtime speedup.  Only 10% speedup of v5 compared to som42. The surprise result is the size of code, although v5 instruction format is 96-bit, it is more compact than som42 which is 64-bit.  It means the code has higher density. This should be investigate further.  

The small program benchmark can compare with Birthday benchmark.  Som v19v design is guided by queen2 program.  v5 vm is follows from v19v, in this respect it does quite well.  For queen2 runtime, this is the fastest runtime achieved to date (queen2 x100, 286 ms, 68 Mips, on HP Compaq nc6400 Core 2 T7200 2GHz with 1G RAM).

5 December 2009
Long live the King