Report on Som birthday 2008

It is almost four years since the release of the first Som system.  Som v.1.0 is released on Christmas of 2004. It is a good time to look back and see what has been done.   

The first serie of Som is written entirely in C. They are v.1.0, v.1.5, v.1.7 and finally v.1.8.  The instruction set of the virtual machine starts with s-code. S-code is a stack-based instruction set.  It is simple and beautiful. The 3-address format is introduced (t-code) in v.1.7 and continue to v.1.8. T-code is aimed for compact size of the executable and for performance by reducing the number of instruction executed to complete a task.

The second serie of Som is written in Som itself, except the virtual machine.  They are v.2.0 and v.2.4.  This shows that Som system (the language and the virtual machine) is complete enough to be self-replicated. The virtual machine has been written in Som to demonstrate this task.  The virtual machine implemented in C is used for performance reason.  This virtual machine is engineered for performance.

With the conclusion of the second serie, I was beginning to be interested more in the performance issue.  T-code is good but it is quite complex.  The third serie of Som experiments with this issue.  The one-address format is introduced (sx-code).  It is a small extension of s-code and is quite natural. Som v.3.1 refines the sx-code further and includes a few special codes that use the top-of-stack register in the virtual machine to speed up the execution. The pre-decode format, separating the opcode and its argument, results in a much faster virtual machine. The third serie is fast, but it comes with a penalty, the complexity.  The instruction set has grown to almost 100 instructions.  The original s-code has only 40 instruction.

The fourth serie of Som, the current one, addreses this issue.  The new instruction set, u-code is introduced.  It is aimed to be as simple as the original s-code and to be as fast as sx-code.  The u-code is a major departure from stack-based instruction. By careful inspection of sx-code in v.3.1, it is noticed that the stack depth is shallow in the execution of most benchmark programs.  Therefore, the depth of one is perhaps suitable. Hence, u-code is an accumulator based and it is stack-less.  This allows the virtual machine is be slightly faster and it is much "cleaner".

Benchmarking

To measure quantitatively, the evolution of Som system, all systems are benchmarking on the same set of programs.  Four programs are used: bubble, matmul, queen and quick. (queen2 is also used to include to effect of macro but it is not applicable to v.1.0 and v.2.0 and they have no macro capability).

bubble - sort 20 items, 20..1 to 1..20
matmul - multiply two 8 by 8 integer matrices
queen  - solve all 8-queen (92 solutions)
quick  - sort 100 items, 100..1 to 1..100

Also, the measurement of the evolution of the som-compiler is done separately.  The second through the fourth serie are measured (the first serie is written in C so it is not meaningful to be compared with).  They are used to compile the som-compiler of v.2.0 (all modules are concatenate into one file).

Two measurements are collected, the number of instruction executed and the actual runtime.  The runtime is measured by the time utility, clock() of C language instrumented into the virtual machine. The resolution is in millisecond.  Three time measurements are collected and the results are averaged.

The number of instruction executed indicates the effectiveness of the instruction set pluses the quality of the code generator. It shows the effect of the different instruction set on the performance.  It is a reliable indicator as it is platform independent. The actual runtime is what measured the real performance but it is too dependent on the machine used.  Therefore, the figures are calculated to be relative.  For general benchmarks, they are relative to Som v.1.0. The results are averaged from all benchmarks. For the compiler benchmark, they are relative to Som v.2.0.

Result

All benchmarks

noi relative to v1  (vx/v1)

noi     v1   v1.5 v1.8 v2   v2.4 v3   v3.1 v4   v4.1
average 1.00 1.00 0.44 1.00 1.00 0.62 0.59 0.62 0.59

runtime speedup compared to v1  (1-vx/v1)

on Dell v1   v1.5 v1.8 v2   v2.4 v3   v3.1 v4   v4.1
speedup 0.00 0.09 0.64 0.42 0.39 0.49 0.74 0.69 0.73

The number of instruction executed (noi) of v.1.0, v.1.5, v.2.0 and v.2.4 are the same. They are s-code. The t-code of v.1.8 is less than half, so t-code is very effective.  The sx-code is also fast, its noi from v.3.0 is only 62% and v.3.1 is even better at 59%.  The serie 4, u-code, is similarly effective compared to sx-code.  

When look at the running time, t-code is 64% faster than v.1.0 (around 2.7x).  The virtual machine of serie 2, v.2.0 and v.2.4 (two vm are the same), is fast.  It is around 40% faster than v.1.0, or 1.6x.  The serie 3 sx-code has impressive result.  The virtual machine of v.3.0 is 49% faster, or almost 2x. The new virtual machine for v.3.1 (with separate op and arg and top-of-stack register) is the fastest.  It is 74% faster, or 3.8x. The serie 4, the virtual machine of u-code v.4.0, is 69% faster and the refined u-code of v.4.1 is as fast as the best v.3.1 at 73%, or 3.8x.

Compiler benchmark

relative to v2  
noi (vx/v2)
runtime speedup (1-vx/v2)

        v2   v2.4 v3   v3.1 v4   v4.1
noi     1.00 0.38 0.20 0.19 0.23 0.18
speedup 0.00 0.39 0.64 0.66 0.66 0.66

In terms of noi (the number of instruction executed to compile the program), the newer compilers are better with the latest one (v.4.1) is only 18% of v.2.0 (it is very impressive, just 1/5). A lot of code improvement has been done on these compilers.  The runtime speedup is also reflexed these improvements, with v.4.1 at 66% faster, or almost 3x. The compiler does a lot of i/o so in term of the performance of a virtual machine, it may be better than this figure.

To give some absolute number on the performance of the virtual machine, the runtime for general benchmark is reported as Kilo instructions per second. The machine that run these benchmark is Dell D500 laptop with 1.3GHz Pentium M (single cpu) with 1Gbyte memory running Windows XP (SP2). The compiler is lcc-win32 (version Oct 2007) with no code optimisation. These figures reflexed the effect of the instruction set and the implementation of its virtual machine.

on Dell v1   v1.5 v1.8 v2   v2.4 v3   v3.1 v4   v4.1
kips    2545 2740 3219 4033 4100 3575 5750 5368 5564

The jump from v.3 to v.3.1 (3575 Kips to 5750) is the result of the engineering of the virtual machine.  It is interesting to note that the serie 4, which employs u-code, the figure is not as good as v.3.1 (sx-code).  However, in terms of running time, they are similar. It may be just the variance in the time measurement (the variance is high).

3 Aug 2008