Som v 5.0
This release has the goal to do "the fastest vm". To reach this
goal, the vm uses three-address instruction format (t-code, som v 1.8)
because it offers the lowest number of instruction executed (noi).
Therefore in terms of performance, the noi will be smallest amongst all
previous Som releases. Because running time is directly varied
with noi, it will also be "the fastest som". This work is based on Som
v19 series of experiments.
T2-code design
Three-address instruction format was introduced in som v 1.8 in the new
year day of 2006. It crammed 4 fields into one 32-bit word. Its
t-code has two short-comings:
1) restricted range of
globals. It has 8-bit argument, therefore local+global is max
256.
2) decoding 3-arg is slow.
To cure all short-comings above, the new instruction set should be a
3-address with fully decoded arguments, each 32-bit. Therefore
1) all arguments are 32-bit, this
cure som-v18 restriction on arguments, such as, small immediate
8/16-bit, small global 64..255.
2) it is fully decoded, interpreting is fast.
However it will take 4x space in code segment. I have two main
designs:
- one with no restriction on addressing mode, so no distinction
between local/global/immediate is made. The instruction size is
large (3 big addresses, upto 128 bits). The call instruction
needs to save/restore locals because they are not local to a stack
frame. It also will have the smallest noi.
- the other makes distinction between local and
global/immediate. It has several modes: r = r op m/i, r = r op r
etc. Because locals are distinct, a stack frame can be used to
create new set of locals. It will not need to save/restore
locals. Global and immediate can be the same. The instruction
size will be 64 bits.
There are many variations between these two designs. Design 1 is
simpler and easier to make a correct code (code generator is
simple). The size of code segment is not the main concern. The
"call" will be a bit slower (than alternative design). I will
leave more complex optimisation (design 2) for a future release.
To compromise space requirement, the instruction format uses almost
fully decoded instruction format, minimum decoding is required when
executing instructions. The instruction format is 96-bit (d = a op b):
d:24 op:8, a:32, b:32
See here for the detailed
discussion of t2-code instruction set. <t2-code-design> Beside the new
instruction set, som v5 has a better compiler. <v5-code-opt> The story how to
cross-compile code from one existing vm to a new vm is quite
interesting and educated. <v5-crosscompile>
Som v 5 performance
v5 performance compare to som42 (the current best) running two sets of
benchmarks: small and medium. The small benchmark is a set of
small programs that have been used as Birthday benchmarks, therefore
the result can be compared to all previous releases since som v
1.0. The medium programs are newly created, they emphasise
realistic programs that do some real work. Some part of the
compiler had also been extracted into these programs: lexer,
parser. Their size are around 5 to 10 times larger than the small
programs.
Results v5 vs som42 (v5/som42)
noi
60%
runtime
90%
code size
87%
som42 v5
Mips small programs
62 43
Mips all programs
50 34
See detailed benchmarks here. <v5-performance>
Discussion
The noi figure is as good as expected. noi is reduced 40%
compared to one-address som42. However, it does not translated into
runtime speedup. Only 10% speedup of v5 compared to som42. The
surprise result is the size of code, although v5 instruction format is
96-bit, it is more compact than som42 which is 64-bit. It means
the code has higher density. This should be investigate further.
The small program benchmark can compare with Birthday benchmark.
Som v19v design is guided by queen2 program. v5 vm is follows
from v19v, in this respect it does quite well. For queen2
runtime, this is the fastest runtime achieved to date (queen2 x100, 286
ms, 68 Mips, on HP Compaq nc6400 Core 2 T7200 2GHz with 1G RAM).
5 December 2009
Long live the King