Sx-code
A new extension of s-code, zero+one address. The aim is to
improve the execution speed of the interpreter. As the one-address
forms reduce the number of instructions by 30-40%, it should be
faster than plain s-code.
Will it be faster than t-code? t-code is a very compact instruction
format (three-address); however, in a virtual machine, decoding
three-address instructions is slow (I think). That is why som-v16
(with a carefully engineered VM) is as fast as som-v17, which uses
t-code. However, decoding a zero+one-address instruction is exactly
the same as decoding zero-address s-code, as the instruction has two
fields: op, arg. In a sense, we get the one-address form for "free".
If the VM is as fast as som-v16's, then by reducing the number of
executed instructions by 30-40%, the new VM will be faster.
Extended s-code (sx-code)
All binary operators are extended with a one-address form that
accesses the local frame, so in many cases the sequence
"get.x get.y bop" becomes "get.x bop.y". The immediate mode stores a
literal in the argument of the instruction; the sequence
"get.x lit.y bop" becomes "get.x bopi.y". To blend one-address into
zero-address, arg = 0 indicates top-of-stack addressing.
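The zero+one-address dispatch above can be sketched as follows. This is a minimal illustration in Python, not the actual VM; the tuple encoding, the 1-indexed local slots, and the tiny opcode subset are assumptions made for the example. The point is that one handler covers both forms: arg = 0 pops the top of stack, any other arg reads a local.

```python
def run(code, local):
    """Execute a tiny sx-code-like subset: ('get', n), ('lit', v), ('add', arg).

    arg = 0 on a binary op means "right operand is top of stack"
    (the zero-address case); otherwise arg names a local slot.
    """
    stack = []
    for op, arg in code:
        if op == "get":
            stack.append(local[arg])        # push local slot arg
        elif op == "lit":
            stack.append(arg)               # push literal
        elif op == "add":
            b = stack.pop() if arg == 0 else local[arg]  # arg=0: TOS
            a = stack.pop()
            stack.append(a + b)
    return stack

# Zero-address: "get.x get.y add"   -> run([("get",1),("get",2),("add",0)], ...)
# One-address:  "get.x add.y"       -> run([("get",1),("add",2)], ...)
# Both compute x + y; the second executes one fewer instruction.
```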
The immediate mode is used quite often: addi is obviously used a
lot; eqi is used in x == n; bandi is used in masking bits; shli and
shri are used often because the shift amount is usually a constant.
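The "get.x lit.n bop" to "get.x bopi.n" rewrite can be pictured as a peephole pass over the instruction stream. A sketch, under assumptions: in reality the compiler presumably emits the immediate form directly, the tuple encoding is illustrative, and only a few bop names are listed.

```python
# Binary ops that have an immediate (bopi) variant -- illustrative subset.
BOPS = {"add", "sub", "band", "shl", "shr", "eq"}

def fold_immediates(code):
    """Fold 'lit.n' followed by a zero-address bop into 'bopi.n'."""
    out = []
    for op, arg in code:
        if out and op in BOPS and arg == 0 and out[-1][0] == "lit":
            _, n = out.pop()
            out.append((op + "i", n))   # e.g. ("addi", n), ("shli", n)
        else:
            out.append((op, arg))
    return out

# "get.x lit.3 add" becomes "get.x addi.3": one instruction shorter,
# and the literal rides in the arg field that is decoded anyway.
```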
The load/store index instructions are extended to store the base
address in the argument; the sequence "get.base get.index ldx"
becomes "get.index ldx.base". When the base is global, a new
instruction "ldy.base" is used. The argument order for store index
differs from s-code: the sequence "get.base get.index get.val stx"
becomes "get.index get.val stx.base"; when the base is global,
"sty.base" is used. There is no use for the old zero-address
"ldx/stx", as the base address is always in either a local or a
global.
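The indexed load/store semantics can be sketched like this. Assumptions made for illustration: the base slot holds a Python list standing in for an array, locals are a dict keyed by slot number, and ldy/sty would be the same code against a globals table.

```python
def step(op, arg, stack, local):
    """One-address indexed access: arg names the local slot holding the base."""
    if op == "ldx":                     # "get.index ldx.base"
        idx = stack.pop()
        stack.append(local[arg][idx])   # push base[index]
    elif op == "stx":                   # "get.index get.val stx.base"
        val = stack.pop()               # value was pushed last
        idx = stack.pop()
        local[arg][idx] = val           # base[index] = value

# Note the operand order: for stx the index is pushed before the value,
# which is the reverse of the old s-code "get.base get.index get.val stx".
```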
To optimise the for-loop, an "efor" instruction is introduced.
"efor.x" does the following:
  x++
  push(x <= adj(x))
where x is a local and adj(x) holds the terminal value of x. The
usual sequence at the end of a for-loop, "inc.x get.x get.end le
jt.loop", becomes "efor.x jt.loop", where adj(x) is end. The
compiler must allocate adj(x) accordingly.
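A sketch of the efor semantics. The text does not pin down where adj(x) lives, so the example assumes the compiler puts it in the slot right after x (x + 1); that layout is an assumption, not something the note specifies.

```python
def efor(x, local, stack):
    """'efor.x': x++, then push(x <= adj(x)).

    Assumption for this sketch: adj(x), the loop's terminal value,
    is allocated in the local slot x + 1.
    """
    local[x] += 1
    stack.append(local[x] <= local[x + 1])

# A loop "for x = 1 to 3" ends each iteration with "efor.x jt.loop":
# efor pushes True while x is still in range, False when it passes adj(x),
# replacing the five-instruction tail "inc.x get.x get.end le jt.loop".
```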
Instruction encoding
The instructions are arranged so that grouping is easy.
bop-zero-arg:  add..shr      (1..16)   (17..20 reserved)
bop-one-arg-v: add+20        (21..36)  (37..40 reserved)
bop-one-arg-i: add+40        (41..56)  (57..60 reserved)
other:         get..calli    (61..82)
zero-arg:      not case end  (83..85)

bop is:   add sub mul div band bor bxor mod
          eq ne lt le ge gt shl shr
other is: get put ld st ldx stx ldy sty
          jmp jt jf call ret - efor
          inc dec lit ads sys fun calli
          not case end
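The payoff of this grouping is cheap decoding: the handler and the operator index fall out of range checks and a single subtraction. A sketch, assuming the numeric bases in the listing above (add = 1, one-arg-v = add+20, one-arg-i = add+40); the class names are illustrative.

```python
ADD, BOP_V, BOP_I = 1, 21, 41   # group bases from the encoding table

def classify(opcode):
    """Map a numeric opcode to (instruction class, operator index).

    The same operator index (0 = add .. 15 = shr) is recovered from all
    three bop groups by subtracting the group base, so one table of
    sixteen operator handlers serves bop, bop-v and bopi alike.
    """
    if 1 <= opcode <= 16:
        return ("bop", opcode - ADD)       # zero-address
    if 21 <= opcode <= 36:
        return ("bop-v", opcode - BOP_V)   # one-address, local operand
    if 41 <= opcode <= 56:
        return ("bop-i", opcode - BOP_I)   # immediate operand
    return ("other", opcode)               # get..calli, not/case/end
```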
The opcode "icArray" is eliminated; syscall 14 is used instead. "fun"
and "calli" are not executable; they are markers in the code segment.
Comparison with t-code and s-code
In terms of dynamic instruction count (noi), measured over seven
small programs, one medium program (aes), and one large program (the
compiler), sx-code is 30% less than s-code. t-code is still 30% less
than sx-code (t-code is very compact; it is a three-address format).
However, som v 3.0 (the sx-code vm) is carefully engineered. In terms
of running time (using only the 8-queen benchmark), it is 3 times
faster than som v 1.7 (the t-code vm). The t-code vm itself is much
faster than the original s-code vm, som v 1.5. Therefore som v 3.0 is
the fastest s-code family virtual machine to date.
(from som-v16/som-v16y/doc/som16y-performance.txt)
t-code noi is 64% less than s-code (more than half)
t-code noi is 31% less than sx-code
noi:   s-code -- 30% --> sx-code -- 30% --> t-code
speed: s-code <-- 2x -- t-code <-- 3x -- sx-code
s-code-vm from som-v15
t-code-vm from som-v17
sx-code-vm from som-v3
4 Mar 2007