Sx-code
A new extension of s-code, zero+one address. The aim is to
improve the execution speed of the interpreter. As the one-address
forms reduce the number of instructions by 30-40%, it should be
faster than plain s-code.
Will it be faster than t-code? t-code is a very compact instruction
format (three-address); however, in a virtual machine, decoding
three-address instructions is slow (I think). That is why som-v16
(with a carefully engineered VM) is as fast as som-v17, which uses
t-code. However, decoding a zero+one-address instruction is exactly
the same as decoding zero-address s-code, as the instruction has two
fields: op, arg. In a sense, we get the one-address form for "free".
If the VM is as fast as som-v16's, then by reducing the number of
executed instructions by 30-40%, the new VM will be faster.
Extended s-code (sx-code)
All binary operators are extended with a one-address form that
accesses the local frame, so in many cases the sequence
"get.x get.y bop" becomes "get.x bop.y". The immediate mode stores a
literal in the argument of the instruction; the sequence
"get.x lit.y bop" becomes "get.x bopi.y". To blend one-address into
zero-address, arg = 0 indicates top-of-stack addressing.
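The zero+one-address dispatch above can be sketched as follows. This is a minimal illustration in Python, not the actual VM; the tuple encoding, the 1-indexed local slots, and the tiny opcode subset are assumptions made for the example. The point is that one handler covers both forms: arg = 0 pops the top of stack, any other arg reads a local.

```python
def run(code, local):
    """Execute a tiny sx-code-like subset: ('get', n), ('lit', v), ('add', arg).

    arg = 0 on a binary op means "right operand is top of stack"
    (the zero-address case); otherwise arg names a local slot.
    """
    stack = []
    for op, arg in code:
        if op == "get":
            stack.append(local[arg])        # push local slot arg
        elif op == "lit":
            stack.append(arg)               # push literal
        elif op == "add":
            b = stack.pop() if arg == 0 else local[arg]  # arg=0: TOS
            a = stack.pop()
            stack.append(a + b)
    return stack

# Zero-address: "get.x get.y add"   -> run([("get",1),("get",2),("add",0)], ...)
# One-address:  "get.x add.y"       -> run([("get",1),("add",2)], ...)
# Both compute x + y; the second executes one fewer instruction.
```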
The immediate mode is used quite often: addi is obviously used a
lot; eqi is used in x == n; bandi is used in masking bits; shli and
shri are used often because the shift amount is usually a constant.
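The "get.x lit.n bop" to "get.x bopi.n" rewrite can be pictured as a peephole pass over the instruction stream. A sketch, under assumptions: in reality the compiler presumably emits the immediate form directly, the tuple encoding is illustrative, and only a few bop names are listed.

```python
# Binary ops that have an immediate (bopi) variant -- illustrative subset.
BOPS = {"add", "sub", "band", "shl", "shr", "eq"}

def fold_immediates(code):
    """Fold 'lit.n' followed by a zero-address bop into 'bopi.n'."""
    out = []
    for op, arg in code:
        if out and op in BOPS and arg == 0 and out[-1][0] == "lit":
            _, n = out.pop()
            out.append((op + "i", n))   # e.g. ("addi", n), ("shli", n)
        else:
            out.append((op, arg))
    return out

# "get.x lit.3 add" becomes "get.x addi.3": one instruction shorter,
# and the literal rides in the arg field that is decoded anyway.
```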
The load/store index instructions are extended to store the base
address in the argument; the sequence "get.base get.index ldx"
becomes "get.index ldx.base". When the base is global, a new
instruction "ldy.base" is used. The argument order for store index
differs from s-code: the sequence "get.base get.index get.val stx"
becomes "get.index get.val stx.base"; when the base is global,
"sty.base" is used. There is no use for the old zero-address
"ldx/stx", as the base address is always in either a local or a
global.
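The indexed load/store semantics can be sketched like this. Assumptions made for illustration: the base slot holds a Python list standing in for an array, locals are a dict keyed by slot number, and ldy/sty would be the same code against a globals table.

```python
def step(op, arg, stack, local):
    """One-address indexed access: arg names the local slot holding the base."""
    if op == "ldx":                     # "get.index ldx.base"
        idx = stack.pop()
        stack.append(local[arg][idx])   # push base[index]
    elif op == "stx":                   # "get.index get.val stx.base"
        val = stack.pop()               # value was pushed last
        idx = stack.pop()
        local[arg][idx] = val           # base[index] = value

# Note the operand order: for stx the index is pushed before the value,
# which is the reverse of the old s-code "get.base get.index get.val stx".
```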
To optimise the for-loop, an "efor" instruction is introduced.
"efor.x" does the following:
  x++
  push(x <= adj(x))
where x is a local and adj(x) holds the terminal value of x. The
usual sequence at the end of a for-loop, "inc.x get.x get.end le
jt.loop", becomes "efor.x jt.loop", where adj(x) is end. The
compiler must allocate adj(x) accordingly.
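A sketch of the efor semantics. The text does not pin down where adj(x) lives, so the example assumes the compiler puts it in the slot right after x (x + 1); that layout is an assumption, not something the note specifies.

```python
def efor(x, local, stack):
    """'efor.x': x++, then push(x <= adj(x)).

    Assumption for this sketch: adj(x), the loop's terminal value,
    is allocated in the local slot x + 1.
    """
    local[x] += 1
    stack.append(local[x] <= local[x + 1])

# A loop "for x = 1 to 3" ends each iteration with "efor.x jt.loop":
# efor pushes True while x is still in range, False when it passes adj(x),
# replacing the five-instruction tail "inc.x get.x get.end le jt.loop".
```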
Instruction encoding
The instructions are arranged so that grouping is easy.
bop-zero-arg:  add..shr      (1..16)   (17..20 reserved)
bop-one-arg-v: add+20        (21..36)  (37..40 reserved)
bop-one-arg-i: add+40        (41..56)  (57..60 reserved)
other:         get..calli    (61..82)
zero-arg:      not case end  (83..85)

bop is:   add sub mul div band bor bxor mod
          eq ne lt le ge gt shl shr
other is: get put ld st ldx stx ldy sty
          jmp jt jf call ret - efor
          inc dec lit ads sys fun calli
          not case end
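The payoff of this grouping is cheap decoding: the handler and the operator index fall out of range checks and a single subtraction. A sketch, assuming the numeric bases in the listing above (add = 1, one-arg-v = add+20, one-arg-i = add+40); the class names are illustrative.

```python
ADD, BOP_V, BOP_I = 1, 21, 41   # group bases from the encoding table

def classify(opcode):
    """Map a numeric opcode to (instruction class, operator index).

    The same operator index (0 = add .. 15 = shr) is recovered from all
    three bop groups by subtracting the group base, so one table of
    sixteen operator handlers serves bop, bop-v and bopi alike.
    """
    if 1 <= opcode <= 16:
        return ("bop", opcode - ADD)       # zero-address
    if 21 <= opcode <= 36:
        return ("bop-v", opcode - BOP_V)   # one-address, local operand
    if 41 <= opcode <= 56:
        return ("bop-i", opcode - BOP_I)   # immediate operand
    return ("other", opcode)               # get..calli, not/case/end
```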
The opcode "icArray" is eliminated; syscall 14 is used instead. "fun"
and "calli" are not executable; they are markers in the code segment.
Comparison with t-code and s-code
In terms of dynamic instruction count (noi), measured over seven
small programs, one medium program (aes), and one large program (the
compiler), sx-code is 30% less than s-code. t-code is still 30% less
than sx-code (t-code is very compact; it is a three-address format).
However, som v 3.0 (the sx-code vm) is carefully engineered. In terms
of running time (using only the 8-queen benchmark), it is 3 times
faster than som v 1.7 (the t-code vm). The t-code vm itself is much
faster than the original s-code vm, som v 1.5. Therefore som v 3.0 is
the fastest s-code family virtual machine to date.
(from som-v16/som-v16y/doc/som16y-performance.txt)
t-code noi is 64% less than s-code (more than half)
t-code noi is 31% less than sx-code
noi:   s-code -- 30% --> sx-code -- 30% --> t-code
speed: s-code <-- 2x -- t-code <-- 3x -- sx-code
s-code-vm from som-v15
t-code-vm from som-v17
sx-code-vm from som-v3
4 Mar 2007