som v 4.0

The 2007 series of Som is very exciting.  The releases are fast, with a new instruction set and an improved compiler.  With the performance comes complexity.  The sx-code of som v 3.1 has 93 instructions and a complicated code conversion to make use of the tos register.  The most ambitious project is som v 3.5, which employs variable-length code (it is not yet completely implemented).

I want to retain the performance of the 2007 series, but I really want to make the instruction set as simple as the original s-code (at least in terms of the number of instructions).  To this end I designed an accumulator-based instruction set with a one-address format.

u-code design

This is for the som 4.0 virtual machine.  The aims are:

1  compact instruction set
2  fast
3  clean semantics

Toward these goals, the instruction set has:

1  fewer than 50 instructions
2  one-address format, with a few two-address instructions
3  fully decoded format
4  AC (accumulator) based

bop    :  add sub mul div band bor bxor mod
          eq ne lt le gt ge 
bim    :  addi subi shli shri
data   :  ld st get put lit ads
vector :  ldx/ stx/ ldy/ sty/ 
control:  jmp jt jf jle/ case/ call callt ret
extra  :  fun/ sys inc dec not push pusha

total  43 instructions  
(but fun is not really an instruction.  It is a marker)
The two-address instructions are marked with /.

To stop the execution, use "sys x" instead of "end".

Format

  one-address  op:32  arg:32
  two-address  op:32  arg1:24,arg2:8
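
As a rough illustration of the two-address layout, the Python sketch below packs arg1 and arg2 into one 32-bit argument word.  Only the field widths (24 and 8 bits) come from this note.  The field order (arg1 in the upper bits), the unsigned treatment and the function names are assumptions made for the example.

```python
# Sketch only: field order and unsigned fields are assumptions;
# the note specifies just the widths arg1:24, arg2:8.

def pack_args(arg1: int, arg2: int) -> int:
    """Pack a 24-bit arg1 and an 8-bit arg2 into one 32-bit word."""
    if not 0 <= arg1 < (1 << 24):
        raise ValueError("arg1 does not fit in 24 bits")
    if not 0 <= arg2 < (1 << 8):
        raise ValueError("arg2 does not fit in 8 bits")
    return (arg1 << 8) | arg2


def unpack_args(word: int):
    """Split a 32-bit argument word back into (arg1, arg2)."""
    return (word >> 8) & 0xFFFFFF, word & 0xFF
```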

Semantics

v is M[fp+v]

bop        ::  AC = AC op v

vector

ldy v ads  ::  AC = M[ads+v]
sty v ads  ::  M[ads+v] = AC
ldx v1 v2  ::  AC = M[v1+v2]
stx v1 v2  ::  M[v1+v2] = AC

control

jt ads     ::  if AC != 0 pc = ads
jf ads     ::  if AC == 0 pc = ads
jle v ads  ::  if AC <= v pc = ads
case v lo  ::  if AC >= v >= lo skip 2+2*(v-lo)

extra

inc v      ::  v++, AC = v
dec v      ::  v--, AC = v
push v     ::  sp++, M[sp] = v
pusha      ::  sp++, M[sp] = AC
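
The semantics above can be tried out with a small interpreter.  The following Python sketch covers only a handful of instructions and is not the som v 4.0 implementation.  It assumes that instructions are Python tuples, that pc counts instructions rather than memory words, and that "lit", "get", "put" and "sys" behave as commented (their semantics are not spelled out in this section).

```python
# Minimal sketch of a few u-code instructions (assumptions in the comments).

def run(code, mem, fp=0, sp=0):
    """Execute code until 'sys'; return (AC, mem)."""
    ac, pc = 0, 0
    while True:
        op, *args = code[pc]
        pc += 1
        if op == "lit":              # assumed: AC = constant
            ac = args[0]
        elif op == "get":            # assumed: AC = v
            ac = mem[fp + args[0]]
        elif op == "put":            # assumed: v = AC
            mem[fp + args[0]] = ac
        elif op == "add":            # bop: AC = AC op v
            ac = ac + mem[fp + args[0]]
        elif op == "sub":
            ac = ac - mem[fp + args[0]]
        elif op == "inc":            # v++, AC = v
            mem[fp + args[0]] += 1
            ac = mem[fp + args[0]]
        elif op == "dec":            # v--, AC = v
            mem[fp + args[0]] -= 1
            ac = mem[fp + args[0]]
        elif op == "jmp":
            pc = args[0]
        elif op == "jt":             # if AC != 0 pc = ads
            if ac != 0:
                pc = args[0]
        elif op == "jf":             # if AC == 0 pc = ads
            if ac == 0:
                pc = args[0]
        elif op == "jle":            # two-address: if AC <= v pc = ads
            if ac <= mem[fp + args[0]]:
                pc = args[1]
        elif op == "push":           # sp++, M[sp] = v
            sp += 1
            mem[sp] = mem[fp + args[0]]
        elif op == "pusha":          # sp++, M[sp] = AC
            sp += 1
            mem[sp] = ac
        elif op == "sys":            # assumed: used here only to stop
            return ac, mem
        else:
            raise ValueError("unknown instruction: " + op)


# Sum 1..n with the inc/jle loop; locals: i = 1, n = 2, s = 3.
SUM_LOOP = [
    ("lit", 0), ("put", 3),          # s = 0
    ("lit", 1), ("put", 1),          # i = 1
    ("get", 3), ("add", 1), ("put", 3),   # loop (index 4): s = s + i
    ("inc", 1),                      # i++, AC = i
    ("jle", 2, 4),                   # if AC <= n goto loop
    ("get", 3),
    ("sys", 0),
]
```

Calling run(SUM_LOOP, mem) with n stored in mem[2] leaves the sum in AC.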

usage

for loop

...
inc i
jle v loop

case

lit hi
case v lo
jmp else
jmp case1
...
jmp casen
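
The address arithmetic of "case" can be checked with a few lines of Python.  This is a sketch under assumptions: next_pc is taken to be the word address of the "jmp else" that follows "case", and each instruction is taken to occupy two words (op and arg), which is why the table stride is 2.  The function name is made up for the example.

```python
def case_target(next_pc: int, ac: int, v: int, lo: int) -> int:
    """Return the pc after 'case v lo'; next_pc is the address of 'jmp else'."""
    if lo <= v <= ac:                       # AC holds hi (loaded by 'lit hi')
        return next_pc + 2 + 2 * (v - lo)   # skip 'jmp else' and earlier entries
    return next_pc                          # out of range: execute 'jmp else'
```

With lo = 3 and hi = 5, a selector of 3 lands on "jmp case1" (next_pc + 2), and a selector of 6 falls through to "jmp else".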

comment

Two-addressing removes the need for a Base register in vector instructions.  "jle" is a variation of "efor".  It is simpler but needs a modified "inc v" to work.  It does not require decrementing the initial index, and there is no "hidden" adjacent variable.  "case" is the result of a long process of refinement.  I think I got a good compromise here, using AC for the "hi" value and keeping the "jmp" instructions in the jump table.  This design trades code size for simplicity of semantics.

Activation record

The "usual" (s-code style) AR is as follows:

   hi

...     <- sp
retads
fp'     <- fp
v1
...
vn    

   lo

v is M[fp-v].  This backward order requires renaming of local variables.  The rename process scans the code after the code body is completely generated, because additional local variables may be allocated during code generation.

If the order is forward then renaming is not necessary.

   hi

...     <- sp
retads
fp'
vn
...
v1    
        <- fp
   lo

To know where fp' and retads are, the size of the activation record must be known.  It is recorded as the argument of the "ret" instruction.  To create a new AR, two arguments are used: the arity and the number of local variables.  They are recorded as the two arguments of the "fun" instruction.
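
One possible reading of the forward-order layout is sketched below in Python.  It is not the som v 4.0 code.  The assumptions are: fp points one slot below v1 (so v1 is M[fp+1]), the caller has already pushed the parameters so that they become v1..v_arity, the local count excludes the parameters, and the helper names enter and leave are made up for the example.

```python
def enter(mem, sp, fp, retads, arity, nlocals):
    """Build a frame; the caller has already pushed `arity` parameters."""
    new_fp = sp - arity               # pushed parameters become v1..v_arity
    size = arity + nlocals            # assumed: nlocals excludes parameters
    mem[new_fp + size + 1] = fp       # fp' (saved frame pointer)
    mem[new_fp + size + 2] = retads   # return address
    return new_fp + size + 2, new_fp  # new sp, new fp


def leave(mem, fp, size):
    """'ret size': the frame size locates fp' and retads above the locals."""
    old_fp = mem[fp + size + 1]
    retads = mem[fp + size + 2]
    return fp, old_fp, retads         # sp, fp and pc of the caller
```

Because the frame size is an argument of "ret", no renaming pass is needed to find fp' and retads.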

- Code generation for an accumulator machine

Without a stack, a temporary register must be allocated to store the intermediate result.  For example
  
  a = b + c - d

Going from left to right, b + c must be stored in t, then t - d is computed and the result is put into a.  This is done in genbop.
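
The idea can be sketched in Python as follows.  This is a hypothetical reconstruction, not the actual genbop: expressions are nested tuples (op, left, right), variables are symbolic names rather than frame offsets, and every nested operation is simply given a fresh temporary.

```python
import itertools


def gen_assign(dest, expr):
    """Emit u-code for dest = expr; expr is a name or (op, left, right)."""
    code = []
    temps = ("t%d" % i for i in itertools.count(1))

    def operand(e):
        # A leaf is used directly; a nested operation goes into a temporary.
        if isinstance(e, str):
            return e
        t = next(temps)
        gen_bop_sketch(e, t)
        return t

    def gen_bop_sketch(e, target):
        op, left, right = e
        l = operand(left)
        r = operand(right)
        code.extend([("get", l), (op, r), ("put", target)])

    if isinstance(expr, str):
        code.extend([("get", expr), ("put", dest)])
    else:
        gen_bop_sketch(expr, dest)
    return code


# a = b + c - d
EXAMPLE = gen_assign("a", ("sub", ("add", "b", "c"), "d"))
```

For the example this yields get b, add c, put t1, get t1, sub d, put a.  The adjacent "put t1 get t1" pair is the kind of sequence a later optimisation pass could remove.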

- parameter passing

Passing parameters to a function requires special treatment.  The space in SS[] (where the activation records are stored) is used as a "virtual stack" to pass parameters to a new frame.  New instructions are created for this task: push, pushv, pushi.  With parameters pushed to a virtual stack, the tail-call instruction (callt) is revived, as it is appropriate here.  It is far simpler than trying to generate code to pass parameters back to the old frame and do a jump.  Callt is faster too (it is another "big" instruction, in line with our philosophy that a "big" instruction does more work in one instruction).  The last parameter is passed through A, occasionally saving one push instruction.  Read more on parameter passing in doc/sv-code.txt.

Optimisation (macro and/or)

Besides simple peephole optimisations such as:

get.a put.b => mov.b.a
lit.n put.a => mov.a.n
not jf => jt
not jt => jf
lit.0 eqv.x jf => get.x jt   and its family 
jmp.x to jmp.y => jmp.y ...
jmp.x to ret   => ret
lit.1 jt => jmp  (while 1)
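
Three of these rules ("not jf", "not jt" and "lit.1 jt") can be sketched as a rewrite over adjacent pairs.  The Python below is an illustration under assumptions, not the som compiler: code is a list of tuples, and labels are kept as separate ("label", n) entries so that a pair with a jump target between its two instructions is never merged.

```python
def peephole(code):
    """Apply 'not jf => jt', 'not jt => jf' and 'lit.1 jt => jmp' once."""
    out = []
    i = 0
    while i < len(code):
        a = code[i]
        b = code[i + 1] if i + 1 < len(code) else None
        if b is not None and a == ("not",) and b[0] in ("jf", "jt"):
            flipped = "jt" if b[0] == "jf" else "jf"
            out.append((flipped, b[1]))   # the 'not' is absorbed by the jump
            i += 2
        elif b is not None and a == ("lit", 1) and b[0] == "jt":
            out.append(("jmp", b[1]))     # 'while 1': the test always succeeds
            i += 2
        else:
            out.append(a)
            i += 1
    return out
```

Note that these rewrites change the value left in AC, so the sketch assumes AC is not used after the jump.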

there are complex cascades of jumps created by the macro expansion of and/or.  Good code optimisation here improves performance significantly, for example in the 8-queen benchmark.  The and/and optimisation was already done in som v 3.0.  However, the or/or optimisation was missing.  We start the explanation with a simple case.

: and a b = if a b else 0
: or a b = if a 1 else b
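
The two macro definitions can be read as a recursive expansion into jumps.  The Python sketch below is an illustration only: a leaf such as "a" stands for whatever code leaves that value in AC, labels are ("label", n) entries, and the numbering scheme (two fresh labels per operator, taken after the first operand is expanded) is an assumption.

```python
import itertools


def expand(expr, code=None, labels=None):
    """Expand nested ('and'|'or', x, y) tuples into a flat jump sequence."""
    if code is None:
        code, labels = [], itertools.count(1)
    if isinstance(expr, str):
        code.append((expr,))           # placeholder: code leaving expr in AC
        return code
    op, x, y = expr
    expand(x, code, labels)
    l1, l2 = next(labels), next(labels)
    if op == "and":                    # if x y else 0
        code.append(("jf", l1))
        expand(y, code, labels)
        code.extend([("jmp", l2), ("label", l1), ("lit", 0), ("label", l2)])
    else:                              # or: if x 1 else y
        code.extend([("jf", l1), ("lit", 1), ("jmp", l2), ("label", l1)])
        expand(y, code, labels)
        code.append(("label", l2))
    return code
```

For example, expand(("and", "a", "b")) gives a, jf 1, b, jmp 2, label 1, lit 0, label 2.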

- and a b

a jf.1 b jmp.2 <1> lit.0 <2>

- and (and a b) c

<--- and a b -------------> 
a jf.1 b jmp.2 <1> lit.0 <2> jf.3 c jmp.4 <3> lit.0 <4>

We recognise the pattern:  jf.1 to lit.0 jf.3 => jf.3 ...
because lit.0 jf always jumps.

We cannot do anything to jmp.2 to jf.3.  If we move jf.3 left then it will be incorrect when it does not jump.  

The ideal code is

a jf.1 b jf.1 c jmp.2 <1> lit.0 <2>

But that requires the code generator to be clever.

Now the more difficult case of or/or.

- or a b

a jf.1 lit.1 jmp.2 <1> b <2>

- or (or a b) c

<----- or a b ------------>
a jf.1 lit.1 jmp.2 <1> b <2> jf.3 lit.1 jmp.4 <3> c <4>

Recognising that: 

lit.1 jmp.2 to jf.3 lit.1 jmp.4 => lit.1 jmp.4

This requires one look back and three look forwards.

Even with different association the code sequence remains the same.

- or a (or b c)

                       <-----  or b c --------------->
a jf.1 lit.1 jmp.2 <1> b <2> jf.3 lit.1 jmp.4 <3> c <4>

If the cascade is a mix of and/or:

- and (or a b) c

<------ or a b ----------->
a jf.1 lit.1 jmp.2 <1> b <2> jf.3 c jmp.4 <3> lit.0 <4>

The optimisable sequence is a difficult one.

lit.1 jmp.2 to jf.3 =>  jmp.3 

Other mixed situations do not produce any new pattern.  In summary, there are 3 cases:

1  cascade and:   jx to lit.0 jf.y => jx.y
2  cascade or:    lit.1 jmp to jf lit.1 jmp.y  => lit.1 jmp.y
3  or with other: lit.1 jmp to jf <z>  => jmp.z

These have been implemented in som v 4.0.
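
As an illustration of case 1 only, the Python sketch below retargets a jump whose destination executes "lit.0 jf.y".  It is a single pass over a list of tuples with ("label", n) entries and is not the som v 4.0 optimiser.  Labels between the two instructions are skipped when following the flow, as in the and/and example above.  The sketch follows the rule as stated and does not check what the code at y expects in AC.

```python
def thread_and_jumps(code):
    """Case 1: a jump to 'lit 0; jf y' is retargeted to y."""
    where = {ins[1]: i for i, ins in enumerate(code) if ins[0] == "label"}

    def next_two(i):
        # First two real instructions at or after index i, ignoring labels.
        return [ins for ins in code[i:] if ins[0] != "label"][:2]

    out = []
    for ins in code:
        if ins[0] in ("jmp", "jt", "jf"):
            follow = next_two(where[ins[1]])
            if (len(follow) == 2 and follow[0] == ("lit", 0)
                    and follow[1][0] == "jf"):
                ins = (ins[0], follow[1][1])   # lit.0 jf always jumps
        out.append(ins)
    return out


# and (and a b) c, as expanded above
AND_AND = [("a",), ("jf", 1), ("b",), ("jmp", 2), ("label", 1), ("lit", 0),
           ("label", 2), ("jf", 3), ("c",), ("jmp", 4), ("label", 3),
           ("lit", 0), ("label", 4)]
```

On AND_AND only the first jump changes: jf.1 becomes jf.3, while jmp.2 is left alone, as discussed above.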

Conclusion

The som v 4.0 compiler resembles som v 3.0 more than som v 3.1, because v 3.1 has complex handling of conversion and forward calls.  The v 4.0 symbol table is much better than in any previous version (see doc/som-v35-symtab.txt).  Allocating temporary variables turned out to be much simpler than expected.  The optimisation is more thorough than in any previous version.

1 July 2008
