Report on Som birthday 2008
It is almost four years since the release of the first Som
system. Som v.1.0 is released on Christmas of 2004. It is a good
time to look back and see what has been done.
The first serie of Som is written entirely in C. They are v.1.0, v.1.5,
v.1.7 and finally v.1.8. The instruction set of the virtual
machine starts with s-code. S-code is a stack-based instruction
set. It is simple and beautiful. The 3-address format is
introduced (t-code) in v.1.7 and continue to v.1.8. T-code is aimed for
compact size of the executable
and for performance by reducing the number of instruction executed to
complete a task.
The second serie of Som is written in Som itself, except the virtual
machine. They are v.2.0 and v.2.4. This shows that Som
system (the language and the virtual machine) is complete enough to be
self-replicated. The virtual machine has been written in Som to
demonstrate this task. The virtual machine implemented in C is
used for performance reason. This virtual machine is engineered
for performance.
With the conclusion of the second serie, I was beginning to be
interested more in the performance issue. T-code is good but it
is quite complex. The third serie of Som experiments with this
issue. The one-address format is introduced (sx-code). It
is a small extension of s-code and is quite natural. Som v.3.1 refines
the sx-code further and includes a few special codes that use the
top-of-stack register in the virtual machine to speed up the execution.
The pre-decode format, separating the opcode and its argument, results
in a much faster virtual machine. The third serie is fast, but it comes
with a penalty, the complexity. The instruction set has grown to
almost 100 instructions. The original s-code has only 40
instruction.
The fourth serie of Som, the current one, addreses this issue.
The new instruction set, u-code is introduced. It is aimed to be
as simple as the original s-code and to be as fast as sx-code.
The u-code is a major departure from stack-based instruction. By
careful inspection of sx-code in v.3.1, it is noticed that the stack
depth is shallow in the execution of most benchmark programs.
Therefore, the depth of one is perhaps suitable. Hence, u-code is an
accumulator based and it is stack-less. This allows the virtual
machine is be slightly faster and it is much "cleaner".
Benchmarking
To measure quantitatively, the evolution of Som system, all systems are
benchmarking on the same set of programs. Four programs are used:
bubble, matmul, queen and quick. (queen2 is also used to include to
effect of macro but it is not applicable to v.1.0 and v.2.0 and they
have no macro capability).
bubble - sort 20 items, 20..1 to 1..20
matmul - multiply two 8 by 8 integer matrices
queen - solve all 8-queen (92 solutions)
quick - sort 100 items, 100..1 to 1..100
Also, the measurement of the evolution of the som-compiler is done
separately. The second through the fourth serie are measured (the
first serie is written in C so it is not meaningful to be compared
with). They are used to compile the som-compiler of v.2.0 (all
modules are concatenate into one file).
Two measurements are collected, the number of instruction executed and
the actual runtime. The runtime is measured by the time utility,
clock() of C language instrumented into the virtual machine. The
resolution is in millisecond. Three time measurements are
collected and the results are averaged.
The number of instruction executed indicates the effectiveness of the
instruction set pluses the quality of the code generator. It shows the
effect of the different instruction set on the performance. It is
a reliable indicator as it is platform independent. The actual runtime
is what measured the real performance but it is too dependent on the
machine used. Therefore, the figures are calculated to be
relative. For general benchmarks, they are relative to Som v.1.0.
The results are averaged from all benchmarks. For the compiler
benchmark, they are relative to Som v.2.0.
Result
All benchmarks
noi relative to v1 (vx/v1)
noi
v1 v1.5 v1.8 v2 v2.4 v3 v3.1
v4 v4.1
average 1.00 1.00 0.44 1.00 1.00
0.62 0.59 0.62 0.59
runtime speedup compared to
v1 (1-vx/v1)
on Dell v1 v1.5 v1.8
v2 v2.4 v3 v3.1 v4 v4.1
speedup 0.00 0.09 0.64 0.42 0.39
0.49 0.74 0.69 0.73
The number of instruction executed (noi) of v.1.0, v.1.5, v.2.0 and
v.2.4 are the same. They are s-code. The t-code of v.1.8 is less than
half, so t-code is very effective. The sx-code is also fast, its
noi from v.3.0 is only 62% and v.3.1 is even better at 59%. The
serie 4, u-code, is similarly effective compared to sx-code.
When look at the running time, t-code is 64% faster than v.1.0 (around
2.7x). The virtual machine of serie 2, v.2.0 and v.2.4 (two vm
are the same), is fast. It is around 40% faster than v.1.0, or
1.6x. The serie 3 sx-code has impressive result. The
virtual machine of v.3.0 is 49% faster, or almost 2x. The new virtual
machine for v.3.1 (with separate op and arg and top-of-stack register)
is the fastest. It is 74% faster, or 3.8x. The serie 4, the
virtual machine of u-code v.4.0, is 69% faster and the refined u-code
of v.4.1 is as fast as the best v.3.1 at 73%, or 3.8x.
Compiler benchmark
relative to v2
noi (vx/v2)
runtime speedup (1-vx/v2)
v2 v2.4 v3 v3.1 v4 v4.1
noi 1.00
0.38 0.20 0.19 0.23 0.18
speedup 0.00 0.39 0.64 0.66 0.66
0.66
In terms of noi (the number of instruction executed to compile the
program), the newer compilers are better with the latest one (v.4.1) is
only 18% of v.2.0 (it is very impressive, just 1/5). A lot of code
improvement has been done on these compilers. The runtime speedup
is also reflexed these improvements, with v.4.1 at 66% faster, or
almost 3x. The compiler does a lot of i/o so in term of the performance
of a virtual machine, it may be better than this figure.
To give some absolute number on the performance of the virtual machine,
the runtime for general benchmark is reported as Kilo instructions per
second. The machine that run these benchmark is Dell D500 laptop with
1.3GHz Pentium M (single cpu) with 1Gbyte memory running Windows XP
(SP2). The compiler is lcc-win32 (version Oct 2007) with no code
optimisation. These figures reflexed the effect of the instruction set
and the implementation of its virtual machine.
on Dell v1 v1.5 v1.8
v2 v2.4 v3 v3.1 v4 v4.1
kips 2545 2740
3219 4033 4100 3575 5750 5368 5564
The jump from v.3 to v.3.1 (3575 Kips to 5750) is the result of the
engineering of the virtual machine. It is interesting to note
that the serie 4, which employs u-code, the figure is not as good as
v.3.1 (sx-code). However, in terms of running time, they are
similar. It may be just the variance in the time measurement (the
variance is high).
3 Aug 2008