S-code optimisation


S-code is intentionally kept simple and minimal to make it easy to change, to experiment with, and to retarget to a new processor or new applications. Most of the optimisations described here have been done for a variety of reasons.  However, the basic Som system employs just a few of these optimisations, specifically the following:

inc v, dec v
short cut jmp to ret, jmp to jmp

Experiment

These are the lessons learned from the work on "nibble coding":

1) a primitive is about 2 times faster than a user-defined function
2) code improvement (not counting nibble coding) amounts to about a 10% size saving
3) size saving is related to increased speed, as the number of instructions to be executed is reduced

Fact 3 is interesting.  It sounds ordinary but it is not obvious.  Reducing the size by extending the instruction set, and nibble coding by combining two instructions into one, make each new instruction more complex, hence slower to execute.  In fact, since the semantics do not change, a new instruction does exactly what several old instructions did, so the amount of work is the same.  However, the interpreter is implemented as a big while/switch loop:

while no exception
  fetch op code at ip
  decode and execute by
  switch op
     case add : do ...
     ....
  ip = ip + 1

Each trip through the loop incurs some overhead.  Reducing the number of instructions also reduces the number of trips through this loop.  Hence the new instruction coding (compression) makes it faster too.
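To make that overhead concrete, here is a minimal sketch of such a dispatch loop in C.  The names and opcode numbers (run, LIT, ADD, RET) are illustrative only, not the actual Som interpreter:

#include <stdint.h>

enum { LIT = 1, ADD = 2, RET = 3 };        /* illustrative opcodes */

int32_t run(int32_t *code, int32_t *stack)
{
  int32_t ip = 0, sp = -1;
  for (;;) {                               /* one trip per instruction  */
    switch (code[ip]) {                    /* fetch, decode and execute */
    case LIT:                              /* push the next code word   */
      stack[++sp] = code[ip + 1];
      ip += 2;
      break;
    case ADD:                              /* pop two, push their sum   */
      sp--;
      stack[sp] = stack[sp] + stack[sp + 1];
      ip += 1;
      break;
    case RET:
      return stack[sp];
    }
  }
}

Every instruction, simple or combined, pays the same fetch-and-dispatch cost at the top of the loop, which is why folding two instructions into one wins even though the combined instruction does the same total work.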

Inc, dec

get lv, lit 1, add, put lv  ->  inc lv

and vice versa for dec.  Increment occurs quite often, and replacing the whole sequence with a single instruction is a significant saving.

It is observed (in the project A1c-self) that the sequence:

ld ip, lit 1, add, st ip

occurs frequently and cannot be optimised to inc x because the argument is not a local variable.  The sequence "Lit n, add" or "Lit n, sub" can be substituted by "Addi n", where n can be positive or negative (negative for sub).
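A minimal sketch of such a peephole pass in C, assuming a fixed two-cell [op][arg] encoding to keep the sketch short (the real S-code is encoded more compactly) and illustrative opcode numbers:

enum { LIT = 1, ADD = 2, SUB = 3, ADDI = 4, NOP = 5 };  /* illustrative */

/* Rewrite "Lit n, add" to "Addi n" and "Lit n, sub" to "Addi -n".
   The freed cells are padded with nop; a real pass would compact the
   code and repair jump offsets that cross the rewritten span. */
void peep_addi(int *code, int ncells)
{
  for (int i = 0; i + 3 < ncells; i += 2) {
    if (code[i] == LIT && (code[i + 2] == ADD || code[i + 2] == SUB)) {
      int k = (code[i + 2] == SUB) ? -code[i + 1] : code[i + 1];
      code[i]     = ADDI;  code[i + 1] = k;    /* combined instruction */
      code[i + 2] = NOP;   code[i + 3] = NOP;  /* fill the hole        */
    }
  }
}

The inc/dec rule above has the same shape: recognise get lv, lit 1, add, put lv and emit the single instruction inc lv.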

The goal is a reasonably small and simple set of primitives, and to show that good performance can be achieved with such a small set.

The optimised codes exist for performance reasons; they are not an absolute necessity.  The markers are not executable; they are for the compiler's internal use, to help generate the correct executable.  This set is more or less a typical stack-virtual-machine instruction set.  It is not very different from the JVM's; the JVM set is much larger, with more data types.  Translating to the JVM should be straightforward.

Sequence of transformations

1. change break to jump
scan for the matching innermost loop end of the break: Efor, Ewhile, Ret, End, and patch a jmp to that location. A case is not a loop; a break inside a case must break the enclosing loop, or return from the function call. Strictly speaking this is not an optimisation; it is ordinary code generation.

2. short cut jmp to jmp, jmp to ret

3. ewhile to nop

4. a = a + 1  => inc a  (if a is local),
and vice versa for dec

5. combine conditional with jump, repeat the pass twice
conditional: Eq, Lt, Le, Ge, Gt
jump: Jt, Jf
to: Jeq, Jlt, Jle, Jge, Jgt
Lit 0, Eq => Eq0
Eq0, Jt   => Jf
Eq0, Jf   => Jt
Jmp 1     => nop
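Step 2 works by following each jump to its final target.  A minimal sketch in C, again with illustrative opcode names and the same fixed two-cell [op][target] layout as above:

enum { JMP = 10, RET = 11 };   /* illustrative opcodes */

/* Short-circuit jmp-to-jmp chains and turn jmp-to-ret into ret.
   Targets are absolute cell indices; the hop bound guards against
   jump cycles. */
void peep_jumps(int *code, int ncells)
{
  for (int i = 0; i < ncells; i += 2) {
    if (code[i] != JMP) continue;
    int t = code[i + 1], hops = 0;
    while (code[t] == JMP && hops++ < 16)  /* follow the chain  */
      t = code[t + 1];
    if (code[t] == RET)
      code[i] = RET;                       /* jmp to ret -> ret */
    else
      code[i + 1] = t;                     /* retarget the jmp  */
  }
}

Step 5 is the same pair-rewriting shape as the Addi pass, applied twice so that chained rules compose: for example, Lit 0, Eq, Jt becomes Eq0, Jt on the first pass and Jf on the second.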

Seeing how eval() is implemented in A1c makes me realise that Eq0 is in fact equivalent to Not:

xNot:    SS[sp] = b == 0  ip = ip+1
xEq0:    SS[sp] = b == 0  ip = ip+1

hence Eq0 is eliminated and all optimisations use Not instead, for example, Not, Jt => Jf etc.

Instructions for "for" loop

"for i start end body" means
i = start
while i <= end
  body
  i = i + 1

The two special instructions for implementing the for loop are Ifor and Efor.  Ideally, Ifor and Efor would each have three arguments: the two local variables used to test "i <= end" and the jump offset.  However, to conform to the one-argument format, the trick is to use a "pair" of consecutive local variables, so that only one argument needs to be specified.  To remove the "offset" argument, Ifor/Efor v1 v2 offset is split into two instructions: Ifor/Efor followed by a jump.  When the Ifor/Efor test evaluates to false, the next instruction is skipped.

This "test, and skip the next instruction if false" style is used for Efor.  Ifor initialises the index variable, and also calculates the end value and stores it in the "end" variable.  This allows the end value to be evaluated only once.  The assumption that the end value is never changed in the body of the loop must hold.

Ifor lv1  (start end -- )
followed by jmp end_for

uses the pair of locals lv1, lv2 to store the index and end values:

lv1 = start  (index)
lv2 = end

Efor lv1  ( -- )
followed by jmp begin_for

uses the pair of locals lv1, lv2 that store the index and end values

Ifor takes two items, start and end, from the stack and stores their values in lv1 and lv2.  If lv1 > lv2 it jumps out of the loop by executing the next instruction; otherwise it skips the next instruction.  Efor sits at the end of the body.  It increments lv1 and tests it: if lv1 <= lv2 it executes the next instruction, which is jmp begin_for; otherwise the next instruction is skipped, hence the for loop exits.
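A minimal sketch of how an interpreter might evaluate these two instructions, assuming a fixed two-cell [op][arg] encoding so that skipping the following jmp means stepping over two extra cells; the names (code, stack, locals, ip, sp) are illustrative, not the actual Som eval():

enum { IFOR = 20, EFOR = 21 };   /* illustrative opcode numbers */

void step_for(int *code, int *stack, int *locals, int *ip, int *sp)
{
  int v = code[*ip + 1];             /* v names the local pair (v, v+1) */
  switch (code[*ip]) {
  case IFOR:                         /* followed by "jmp end_for"       */
    locals[v + 1] = stack[(*sp)--];  /* end value, evaluated only once  */
    locals[v]     = stack[(*sp)--];  /* start value becomes the index   */
    if (locals[v] > locals[v + 1])
      *ip += 2;                      /* fall through to the exit jump   */
    else
      *ip += 4;                      /* skip it and enter the body      */
    break;
  case EFOR:                         /* followed by "jmp begin_for"     */
    if (++locals[v] <= locals[v + 1])
      *ip += 2;                      /* execute the back jump           */
    else
      *ip += 4;                      /* skip it: exit the for loop      */
    break;
  }
}

With this layout the body pays only Efor plus the back jump per iteration.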

    ...
  get i
  get end
  le
  jf exit
  <loop>  
  body
  get end
  inc-skip-lt i  // i++, if tos < i skip next
  jmp loop
  <exit>

It reduces five codes to three.  If we can make "end" adjacent to i (allocating "end" becomes more complex) then it reduces to two codes:

    ...
  <loop>
  body
  efor i        // i++, if end < i skip next
  jmp loop
  <exit>

These new primitives reduce the static size by another 2%, and the dynamic size by 8%.  They are also faster because the "end" expression is evaluated and stored in a local variable just once, in the initialisation by Ifor, so testing "i <= end" is faster than evaluating the expression every time around the loop.

How rename works

For Ifor and Efor, the index is a pair of locals: one for the index, one for the end value.  An extra local variable must be assigned for the end value of each for index.  To use the one-argument format, the two locals of a pair must be adjacent.  For example, Ifor 2 means local 2 is the index and local 3 is the end value.  The following steps assign locals for Ifor and Efor:

1)  The index is relocated to the end of the current locals.  An adjacent local is also reserved.  (lvx keeps track of this number.)

2)  When rename() is called at the end of a function definition, the number of locals will be too large by n, where n is the number of for_index; lvx will be lv + 2*n.  This is because each for_index leaves one redundant local behind (its old slot).  This is readjusted by shifting all locals that are not for_index down to take the place of the for_index slots (now that all for_index are at the end), and renaming every for_index by -n.

3)  To renumber in reverse order n..1, every name is renamed to i' = n - i + 1, where n is the total number of locals (including all for_index).

Example
to fn a b | i j k =
  for i . . .
  for j . . .

At first the locals are:
1 2 3 4 5
a b i j k

when the for_index locals are created, i and j are relocated to be after k, and the names of i, j now reflect the new slots (6 and 8, both larger than lv, which is 5):

1 2 6 8 5 6     7     8     9
a b i j k i_for i_end j_for j_end

the numbers are adjusted to remove the redundancy (the for_index names by -2, and the non-for_index are shifted down):

1 2 4 6 3 4     5     6     7
a b i j k i_for i_end j_for j_end

now renumber in reverse order:

7 6 4 2 5 4     3     2     1
a b i j k i_for i_end j_for j_end

The whole rename is done in one loop, testing for the for_index condition by checking whether lvname[i] > lv.

/* d is the number of for_index,
   n is the total number of locals,
   lv is the number of ordinary locals,
   lvx counts the lvname[] entries */
for( i=1, k=i; i<=lvx; i++, k++){
  if(lvname[i] > lv){      /* a for_index: shift its name down by d */
    c = lvname[i] - d;
    k--;                   /* do not advance the non-index counter  */
  }else
    c = k;                 /* a non-for_index: next compacted slot  */
  lvname[i] = n - c + 1;   /* renum n to 1 */
}
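As a check, a small hypothetical driver that runs this loop on the a b i j k example above (lv = 5, d = 2, n = 7, lvx = 9) prints the final row 7 6 4 2 5 4 3 2 1:

#include <stdio.h>

int main(void)
{
  /* entry 0 unused: locals are numbered from 1 */
  int lvname[10] = {0, 1, 2, 6, 8, 5, 6, 7, 8, 9};
  int lv = 5, d = 2, n = 7, lvx = 9;
  int i, k, c;
  for (i = 1, k = i; i <= lvx; i++, k++) {
    if (lvname[i] > lv) { c = lvname[i] - d; k--; }
    else c = k;
    lvname[i] = n - c + 1;
  }
  for (i = 1; i <= lvx; i++)
    printf("%d ", lvname[i]);   /* expect: 7 6 4 2 5 4 3 2 1 */
  printf("\n");
  return 0;
}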

The disadvantage is that a for_index is constrained to be a local only, never a passed parameter: the renaming affects the order of the locals and would tamper with the order of the passed parameters, therefore this is prohibited.

Other optimisations

Evaluate A1x  20 Nov 2002

A1x is an experiment to compress the size of the code.

The lessons learned from this experiment are:

1  Implementing an instruction as a primitive (built-in) is as much as 2 times faster than as a user-defined function.  The following data supports this conclusion:

the number of instructions executed in the benchmarks

         bubble  hanoi  matmul  perm  queen  quick  sieve
base      11805   2175    1813  4762  10509   4746   3364
+ prim     6995   1341    1685  3453   4178   2266   1862
1 - A/B    0.41   0.38    0.07  0.27   0.60   0.52   0.45

average reduction 0.39 (about 1.6 times faster)

2  code combination
-  inc x, dec x: 5% faster
-  combined conditional jumps: 5% faster
-  plus addi, subi, eq0, eq1

3  short circuit jmp: 5% faster
-  jmp to ret -> ret
-  jmp to jmp -> jmp

2 + 3 together are approx. 15% faster

4  "if e1 e2 e3" is very restrictive because it requires the else-part to always be present (nop or syscall 3 must be used often).

The 17 additional instructions in A1x are:

not, ne, le, gt, ge, inc x, dec x, add1, sub1, eq0, eq1,
jeq, jne, jlt, jle, jge, jgt.

more code optimisations:

eq0, jt -> jf
eq0, jf -> jt

profile study of A1n and A1x  20 Nov 2002

the number of instructions executed (x100)

         bubble  hanoi  matmul  perm  queen  quick  sieve
raw       20629   3631    3263  8248  15775   7831   5737
nib       11805   2175    1813  4762  10509   4746   3364
+ prim     6995   1341    1685  3453   4178   2266   1862
+ for      5711   1341    1471  2533   3823   2224   1653

1  nib/raw    avg. 41%   (1 - A/B)
2  +prim/raw       64%   (2 - 1 = 23%)
3  +for/raw        68%   (3 - 2 = 4%)

discussion

1  the new encoding (nibble) reduces the number of instructions by 41%
2  using primitives reduces it by a further 23%
3  for reduces it by a further 4%

4 October 2004