S-code optimisation


S-code is intentionally kept simple and minimal to make it easy to change, to experiment with, and to retarget to a new processor or new applications. Most of the optimisations described here have been done for a variety of reasons.  However, the basic Som system employs just a few of these optimisations, specifically the following:

inc v, dec v
short cut jmp to ret, jmp to jmp

Experiment

These are the lessons learned from the work on "nibble coding":

1) a primitive is about 2 times faster than a user-defined function
2) code improvement (not counting nibble coding) amounts to about a 10% size saving
3) size saving is related to increased speed, as the number of instructions to be executed is reduced

Fact 3 is interesting.  It sounds ordinary but it is not obvious.  Reducing the size by extending the instruction set, and nibble coding by combining two instructions into one, make each new instruction more complex, hence slower to execute.  In fact, since the semantics do not change, a new instruction does exactly what several old instructions did, so the amount of work is the same.  However, the interpreter is implemented as a big while/switch loop:

while no exception
  fetch op code at ip
  decode and execute by
  switch op
     case add : do ...
     ....
  ip = ip + 1

Each trip through the loop incurs some overhead.  Reducing the number of instructions also reduces the number of trips through this loop.  Hence the new instruction coding (compression) makes it faster too.
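To make that overhead concrete, here is a minimal sketch of such a dispatch loop in C.  The names and opcode numbers (run, LIT, ADD, RET) are illustrative only, not the actual Som interpreter:

#include <stdint.h>

enum { LIT = 1, ADD = 2, RET = 3 };        /* illustrative opcodes */

int32_t run(int32_t *code, int32_t *stack)
{
  int32_t ip = 0, sp = -1;
  for (;;) {                               /* one trip per instruction  */
    switch (code[ip]) {                    /* fetch, decode and execute */
    case LIT:                              /* push the next code word   */
      stack[++sp] = code[ip + 1];
      ip += 2;
      break;
    case ADD:                              /* pop two, push their sum   */
      sp--;
      stack[sp] = stack[sp] + stack[sp + 1];
      ip += 1;
      break;
    case RET:
      return stack[sp];
    }
  }
}

Every instruction, simple or combined, pays the same fetch-and-dispatch cost at the top of the loop, which is why folding two instructions into one wins even though the combined instruction does the same total work.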

Inc, dec

get lv, lit 1, add, put lv  ->  inc lv

and vice versa for dec.  Increment occurs quite often, and replacing the whole sequence with a single instruction is a significant saving.

It is observed (in the project A1c-self) that the sequence:

ld ip, lit 1, add, st ip

occurs frequently and cannot be optimised to inc x because the argument is not a local variable.  The sequence "Lit n, add" or "Lit n, sub" can be substituted by "Addi n", where n can be positive or negative (negative for sub).
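A minimal sketch of such a peephole pass in C, assuming a fixed two-cell [op][arg] encoding to keep the sketch short (the real S-code is encoded more compactly) and illustrative opcode numbers:

enum { LIT = 1, ADD = 2, SUB = 3, ADDI = 4, NOP = 5 };  /* illustrative */

/* Rewrite "Lit n, add" to "Addi n" and "Lit n, sub" to "Addi -n".
   The freed cells are padded with nop; a real pass would compact the
   code and repair jump offsets that cross the rewritten span. */
void peep_addi(int *code, int ncells)
{
  for (int i = 0; i + 3 < ncells; i += 2) {
    if (code[i] == LIT && (code[i + 2] == ADD || code[i + 2] == SUB)) {
      int k = (code[i + 2] == SUB) ? -code[i + 1] : code[i + 1];
      code[i]     = ADDI;  code[i + 1] = k;    /* combined instruction */
      code[i + 2] = NOP;   code[i + 3] = NOP;  /* fill the hole        */
    }
  }
}

The inc/dec rule above has the same shape: recognise get lv, lit 1, add, put lv and emit the single instruction inc lv.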

The goal is a reasonably small and simple set of primitives, and to show that good performance can be achieved with such a small set.

The optimised codes exist for performance reasons; they are not an absolute necessity.  The markers are not executable; they are for the compiler's internal use, to help generate the correct executable.  This set is more or less a typical stack-virtual-machine instruction set.  It is not very different from the JVM's; the JVM set is much larger, with more data types.  Translating to the JVM should be straightforward.

Sequence of transformations

1. change break to jump
scan for the matching innermost loop end of the break: Efor, Ewhile, Ret, End, and patch a jmp to that location. A case is not a loop; a break inside a case must break the enclosing loop, or return from the function call. Strictly speaking this is not an optimisation; it is ordinary code generation.

2. short cut jmp to jmp, jmp to ret

3. ewhile to nop

4. a = a + 1  => inc a  (if a is local),
and vice versa for dec

5. combine conditional with jump, repeat the pass twice
conditional: Eq, Lt, Le, Ge, Gt
jump: Jt, Jf
to: Jeq, Jlt, Jle, Jge, Jgt
Lit 0, Eq => Eq0
Eq0, Jt   => Jf
Eq0, Jf   => Jt
Jmp 1     => nop
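Step 2 works by following each jump to its final target.  A minimal sketch in C, again with illustrative opcode names and the same fixed two-cell [op][target] layout as above:

enum { JMP = 10, RET = 11 };   /* illustrative opcodes */

/* Short-circuit jmp-to-jmp chains and turn jmp-to-ret into ret.
   Targets are absolute cell indices; the hop bound guards against
   jump cycles. */
void peep_jumps(int *code, int ncells)
{
  for (int i = 0; i < ncells; i += 2) {
    if (code[i] != JMP) continue;
    int t = code[i + 1], hops = 0;
    while (code[t] == JMP && hops++ < 16)  /* follow the chain  */
      t = code[t + 1];
    if (code[t] == RET)
      code[i] = RET;                       /* jmp to ret -> ret */
    else
      code[i + 1] = t;                     /* retarget the jmp  */
  }
}

Step 5 is the same pair-rewriting shape as the Addi pass, applied twice so that chained rules compose: for example, Lit 0, Eq, Jt becomes Eq0, Jt on the first pass and Jf on the second.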

Seeing how eval() is implemented in A1c makes me realise that Eq0 is in fact equivalent to Not:

xNot:    SS[sp] = b == 0  ip = ip+1
xEq0:    SS[sp] = b == 0  ip = ip+1

hence Eq0 is eliminated and all optimisations use Not instead, for example, Not, Jt => Jf etc.

Instructions for "for" loop

"for i start end body" means
i = start
while i <= end
  body
  i = i + 1

The two special instructions for implementing the for loop are Ifor and Efor.  Ideally, Ifor and Efor would each have three arguments: the two local variables used to test "i <= end" and the jump offset.  However, to conform to the one-argument format, the trick is to use a "pair" of consecutive local variables, so that only one argument needs to be specified.  To remove the "offset" argument, Ifor/Efor v1 v2 offset is split into two instructions: Ifor/Efor followed by a jump.  When the Ifor/Efor test evaluates to false, the next instruction is skipped.

This "test, and skip the next instruction if false" style is used for Efor.  Ifor initialises the index variable, and also calculates the end value and stores it in the "end" variable.  This allows the end value to be evaluated only once.  The assumption that the end value is never changed in the body of the loop must hold.

Ifor lv1  (start end -- )
followed by jmp end_for

uses the pair of locals lv1, lv2 to store the index and end values:

lv1 = start  (index)
lv2 = end

Efor lv1  ( -- )
followed by jmp begin_for

uses the pair of locals lv1, lv2 that store the index and end values

Ifor takes two items, start and end, from the stack and stores their values in lv1 and lv2.  If lv1 > lv2 it jumps out of the loop by executing the next instruction; otherwise it skips the next instruction.  Efor sits at the end of the body.  It increments lv1 and tests it: if lv1 <= lv2 it executes the next instruction, which is jmp begin_for; otherwise the next instruction is skipped, hence the for loop exits.
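A minimal sketch of how an interpreter might evaluate these two instructions, assuming a fixed two-cell [op][arg] encoding so that skipping the following jmp means stepping over two extra cells; the names (code, stack, locals, ip, sp) are illustrative, not the actual Som eval():

enum { IFOR = 20, EFOR = 21 };   /* illustrative opcode numbers */

void step_for(int *code, int *stack, int *locals, int *ip, int *sp)
{
  int v = code[*ip + 1];             /* v names the local pair (v, v+1) */
  switch (code[*ip]) {
  case IFOR:                         /* followed by "jmp end_for"       */
    locals[v + 1] = stack[(*sp)--];  /* end value, evaluated only once  */
    locals[v]     = stack[(*sp)--];  /* start value becomes the index   */
    if (locals[v] > locals[v + 1])
      *ip += 2;                      /* fall through to the exit jump   */
    else
      *ip += 4;                      /* skip it and enter the body      */
    break;
  case EFOR:                         /* followed by "jmp begin_for"     */
    if (++locals[v] <= locals[v + 1])
      *ip += 2;                      /* execute the back jump           */
    else
      *ip += 4;                      /* skip it: exit the for loop      */
    break;
  }
}

With this layout the body pays only Efor plus the back jump per iteration.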

    ...
  get i
  get end
  le
  jf exit
  <loop>  
  body
  get end
  inc-skip-lt i  // i++, if tos < i skip next
  jmp loop
  <exit>

It reduces five codes to three.  If we can make "end" adjacent to i (allocating "end" becomes more complex) then it reduces to two codes:

    ...
  <loop>
  body
  efor i        // i++, if end < i skip next
  jmp loop
  <exit>

These new primitives reduce the static size by another 2%, and the dynamic size by 8%.  They are also faster because the "end" expression is evaluated and stored in a local variable just once, in the initialisation by Ifor, so testing "i <= end" is faster than evaluating the expression every time around the loop.

How rename works

For Ifor and Efor, the index is a pair of locals: one for the index, one for the end value.  An extra local variable must be assigned for the end value of each for index.  To use the one-argument format, the two locals of a pair must be adjacent.  For example, Ifor 2 means local 2 is the index and local 3 is the end value.  The following steps assign locals for Ifor and Efor:

1)  The index is relocated to the end of the current locals.  An adjacent local is also reserved.  (lvx keeps track of this number.)

2)  When rename() is called at the end of a function definition, the number of locals will be too large by n, where n is the number of for_index; lvx will be lv + 2*n.  This is because each for_index leaves one redundant local behind (its old slot).  This is readjusted by shifting all locals that are not for_index down to take the place of the for_index slots (now that all for_index are at the end), and renaming every for_index by -n.

3)  To renumber in reverse order n..1, every name is renamed to i' = n - i + 1, where n is the total number of locals (including all for_index).

Example
to fn a b | i j k =
  for i . . .
  for j . . .

At first the locals are:
1 2 3 4 5
a b i j k

when the for_index locals are created, i and j are relocated to be after k, and the names of i, j now reflect the new slots (6 and 8, both larger than lv, which is 5):

1 2 6 8 5 6     7     8     9
a b i j k i_for i_end j_for j_end

the numbers are adjusted to remove the redundancy (the for_index names by -2, and the non-for_index are shifted down):

1 2 4 6 3 4     5     6     7
a b i j k i_for i_end j_for j_end

now renumber in reverse order:

7 6 4 2 5 4     3     2     1
a b i j k i_for i_end j_for j_end

The whole rename is done in one loop, testing for the for_index condition by checking whether lvname[i] > lv.

/* d is the number of for_index,
   n is the total number of locals,
   lv is the number of ordinary locals,
   lvx counts the lvname[] entries */
for( i=1, k=i; i<=lvx; i++, k++){
  if(lvname[i] > lv){      /* a for_index: shift its name down by d */
    c = lvname[i] - d;
    k--;                   /* do not advance the non-index counter  */
  }else
    c = k;                 /* a non-for_index: next compacted slot  */
  lvname[i] = n - c + 1;   /* renum n to 1 */
}
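As a check, a small hypothetical driver that runs this loop on the a b i j k example above (lv = 5, d = 2, n = 7, lvx = 9) prints the final row 7 6 4 2 5 4 3 2 1:

#include <stdio.h>

int main(void)
{
  /* entry 0 unused: locals are numbered from 1 */
  int lvname[10] = {0, 1, 2, 6, 8, 5, 6, 7, 8, 9};
  int lv = 5, d = 2, n = 7, lvx = 9;
  int i, k, c;
  for (i = 1, k = i; i <= lvx; i++, k++) {
    if (lvname[i] > lv) { c = lvname[i] - d; k--; }
    else c = k;
    lvname[i] = n - c + 1;
  }
  for (i = 1; i <= lvx; i++)
    printf("%d ", lvname[i]);   /* expect: 7 6 4 2 5 4 3 2 1 */
  printf("\n");
  return 0;
}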

The disadvantage is that a for_index is constrained to be a local only, never a passed parameter: the renaming affects the order of the locals and would tamper with the order of the passed parameters, therefore this is prohibited.

Other optimisations

Evaluate A1x  20 Nov 2002

A1x is an experiment to compress the size of the code.

The lessons learned from this experiment are:

1  Implementing an instruction as a primitive (built-in) is as much as 2 times faster than as a user-defined function.  The following data supports this conclusion:

the number of instructions executed in the benchmarks

         bubble  hanoi  matmul  perm  queen  quick  sieve
base      11805   2175    1813  4762  10509   4746   3364
+ prim     6995   1341    1685  3453   4178   2266   1862
1 - A/B    0.41   0.38    0.07  0.27   0.60   0.52   0.45

average reduction 0.39 (about 1.6 times faster)

2  code combination
-  inc x, dec x: 5% faster
-  combined conditional jumps: 5% faster
-  plus addi, subi, eq0, eq1

3  short circuit jmp: 5% faster
-  jmp to ret -> ret
-  jmp to jmp -> jmp

2 + 3 together are approx. 15% faster

4  "if e1 e2 e3" is very restrictive because it requires the else-part to always be present (nop or syscall 3 must be used often).

The 17 additional instructions in A1x are:

not, ne, le, gt, ge, inc x, dec x, add1, sub1, eq0, eq1,
jeq, jne, jlt, jle, jge, jgt.

more code optimisations:

eq0, jt -> jf
eq0, jf -> jt

profile study of A1n and A1x  20 Nov 2002

the number of instructions executed (x100)

         bubble  hanoi  matmul  perm  queen  quick  sieve
raw       20629   3631    3263  8248  15775   7831   5737
nib       11805   2175    1813  4762  10509   4746   3364
+ prim     6995   1341    1685  3453   4178   2266   1862
+ for      5711   1341    1471  2533   3823   2224   1653

1  nib/raw    avg. 41%   (1 - A/B)
2  +prim/raw       64%   (2 - 1 = 23%)
3  +for/raw        68%   (3 - 2 = 4%)

discussion

1  the new encoding (nibble) reduces the number of instructions by 41%
2  using primitives reduces it by a further 23%
3  for reduces it by a further 4%

4 October 2004