S-code optimisation
S-code is intentionally kept simple and minimal so that it is easy to change, to experiment with, and to retarget to a new processor or a new application. Most of the optimisations described here were done for a variety of reasons. The basic Som system, however, employs just a few of these optimisations, specifically the following:
inc v, dec v
short cut jmp to ret, jmp to jmp
Experiment
These are the lessons learned from the work on "nibble coding":
1) a primitive is 2 times faster than a user function
2) code improvement (not counting nibble coding) amounts to a 10% size saving
3) size saving is related to increased speed, as the number of instructions to be executed is reduced
Fact 3 is interesting. It seems ordinary but is not obvious. Reducing the size by extending instructions, and nibble coding by combining two instructions, makes each new instruction more complex, hence taking more time to execute. In fact, as the semantics of the instructions do not change (a new instruction does what several old instructions did), the amount of work is the same.
However, the interpreter is implemented as one big while/switch loop:

    while no exception
        fetch op code at ip
        decode and execute by
            switch op
                case add: do ...
                ...
        ip = ip + 1
Each time through the loop incurs some overhead. Reducing the number of instructions also reduces the number of times through this loop, hence the new instruction coding (compression) makes execution faster too.
Inc, dec
The sequence

    get lv, lit 1, add, put lv  ->  inc lv

and vice versa for dec. Increment is used quite often, and it replaces five instructions, which is a significant saving.
It is observed (in the project A1c-self) that the sequence

    ld ip, lit 1, add, st ip

occurs frequently, and it cannot be optimised to inc x because the argument is not a local variable. The sequence "Lit n, add" or "Lit n, sub" can be substituted by "Addi n", where n can be positive or negative (negative for sub).
The goal is to achieve a reasonably small and simple set of primitives, and to show that good performance is possible with such a small set. The optimised codes exist for performance reasons; they are not an absolute necessity. The markers are not executable; they are for compiler-internal use, to help generate the correct executable. This set is more or less a typical stack-virtual-machine instruction set. It is not too different from the JVM's; the JVM set is much larger, with more data types. Translating to the JVM should be straightforward.
Sequence of transformation
1. change break to jump
   Scan for the matching innermost loop of the break: Efor, Ewhile, Ret, End. Patch a jmp to that location. A case is not a loop; a break in a case must break the enclosing loop, or return from the function call. Strictly speaking this is not an optimisation; it is ordinary code generation.
2. short-cut the jump to jump, jmp to ret
3. ewhile to nop
4. a = a + 1 => inc a (if a is local), and vice versa for dec
5. combine a conditional with a jump, repeated twice
   conditional: Eq, Lt, Le, Ge, Gt
   jump: Jt, Jf
   to: Jeq, Jlt, Jle, Jge, Jgt
   Lit 0, Eq => Eq0
   Eq0, Jt => Jf
   Eq0, Jf => Jt
   Jmp 1 => nop
Seeing how eval() is implemented in A1c makes me realise that Eq0 is in fact equivalent to Not:

    xNot: SS[sp] = b == 0; ip = ip+1
    xEq0: SS[sp] = b == 0; ip = ip+1

Hence Eq0 is eliminated and all optimisations use Not instead, for example Not, Jt => Jf, etc.
Instructions for "for" loop
"for i start end body" means
i
= start
while i <= end
body
i = i + 1
Two special instructions to implement the for loop are Ifor and Efor. Ideally Ifor and Efor should have three arguments: the two local variables to do "i <= end", and the jump offset. However, to conform to the one-argument format, the trick is to use a "pair" of consecutive local variables, so only one argument needs to be specified. To remove the "offset" argument from the instruction, "Ifor/Efor v1 v2 offset" is split into two instructions: Ifor/Efor followed by a "jump". When Ifor/Efor evaluates to false it skips the next instruction. This "test and skip the next instruction if false" style is used for Efor. Ifor does the initialisation of the index variable, calculates the end value, and stores it in the "end" variable. This allows the end value to be evaluated only once. The assumption that the end value is never changed in the body of the loop must hold.
Ifor lv1  (start end -- )
    followed by jmp end_for
    uses the pair of locals lv1, lv2 to store the index and end values
    lv1 = start (index)
    lv2 = end

Efor lv1  ( -- )
    followed by jmp begin_for
    uses the pair of locals lv1, lv2 that store the index and end values
Ifor takes two items, "start, end", from the stack and stores their values in lv1 and lv2. If lv1 > lv2 it jumps out of the loop by executing the next instruction. Otherwise it skips the next instruction. Efor is at the end of the body. It increments lv1, and if lv1 <= lv2 it executes the next instruction, which is jmp begin_for. Otherwise the next instruction is skipped, hence exiting the for loop.
    ...
    get i
    get end
    le
    jf exit
<loop>
    body
    get end
    inc-skip-lt i    // i++, if tos < i skip next
    jmp loop
<exit>
This reduces five codes to three codes. If we can make "end" adjacent to i (allocating "end" will be more complex), it reduces to two codes:
    ...
<loop>
    body
    efor i    // i++, if end < i skip next
    jmp loop
<exit>
These new primitives reduce the static size by another 2%, and the dynamic size by 8%. They are also faster, because "end" is evaluated and stored in a local variable just once, at initialisation by Ifor, so testing "i <= end" is faster than evaluating the expression every time around the loop.
How rename works
For Ifor and Efor, the index is a pair of locals: one for the index and one for the end value. An extra local variable must be assigned for the end value of each for-index. To use the one-argument format, the pair of locals are adjacent; for example, Ifor 2 means local 2 is the index and local 3 is the end value. The following steps assign locals for Ifor, Efor:
1) The index is relocated to the end of the current locals. An adjacent local is also reserved. (lvx keeps track of this number.)
2) When rename() is called at the end of a function definition, the number of extra locals will be too large by n, where n is the number of for-indices; lvx will be lv + 2*n. This is because for each for-index there is one redundant local. This is readjusted by shifting all locals that are not for-indices down to take the place of the for-indices (now that all for-indices are at the end), and renaming all for-indices by -n.
3) To rename n..1, each name i is renamed to i' = n - i + 1, where n is the total number of locals (including all for-indices).
Example

    to fn a b | i j k =
        for i . . .
            for j . . .
At first the locals are:

    1  2  3  4  5
    a  b  i  j  k
When the for-indices are created, i and j are relocated to be after k, with i, j now reflecting the new names (6, 8 are larger than lv, which is 5):

    1  2  6  8  5    6      7      8      9
    a  b  i  j  k  i_for  i_end  j_for  j_end
The numbers are adjusted to remove the redundancy (by -2, and shifting the non-for-indices):

    1  2  4  6  3    4      5      6      7
    a  b  i  j  k  i_for  i_end  j_for  j_end
Now rename to reverse order:

    7  6  4  2  5    4      3      2      1
    a  b  i  j  k  i_for  i_end  j_for  j_end
The whole rename is done in one loop, testing for the for-index condition by checking whether the name is > lv.
    /* d is the number of for_index, n is the total number of locals */
    for( i=1, k=i; i<=lvx; i++, k++ ){
        if( lvname[i] > lv ){
            c = lvname[i] - d;
            k--;
        }else
            c = k;
        lvname[i] = n - c + 1;   /* renum n to 1 */
    }
The disadvantage is that the for-index is constrained to be a local only, not a passed parameter, as the renaming would affect the order of the locals and hence tamper with the order of the passed parameters; therefore this is prohibited.
Other optimisations
Evaluate A1x, 20 Nov 2002
A1x is an experiment to compress the size of the code. The lessons learned from this experiment are:
1 Implementing an instruction as a primitive (built-in) is as much as 2 times faster than implementing it as a user-defined function. The following data supports this conclusion:
the number of instructions executed in the benchmarks

          bubble  hanoi  matmul  perm  queen  quick  sieve
base       11805   2175    1813  4762  10509   4746   3364
+ prim      6995   1341    1685  3453   4178   2266   1862
1 - A/B     0.41   0.38    0.07  0.27   0.60   0.52   0.45

average 0.39 (or 2.6 times faster)
2 code combination
  - inc x, dec x: 5% faster
  - conditional jump: 5% faster
  - plus addi, subi, eq0, eq1
3 short-circuit jmp: 5% faster
  - jmp to ret -> ret
  - jmp to jmp -> jmp
2 + 3 together are approx. 15% faster
4 "if e1 e2 e3" is very restrictive because it requires the else-part to always be present (nop or syscall 3 must be used often).
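The restriction comes from the shape the conditional compiles to; in the style of the listings above, a sketch:

```
        e1
        jf else
        e2
        jmp end
<else>
        e3          // must always be present; hence nop (or syscall 3)
<end>                // when the source has no else-part
```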
The 17 additional instructions in A1x are:
not, ne, le, gt, ge, inc x, dec x, add1, sub1, eq0, eq1,
jeq, jne, jlt, jle, jge, jgt.
More code optimisation:
    eq0, jt -> jf
    eq0, jf -> jt
Profile study of A1n and A1x, 20 Nov 2002
the number of instructions

          bubble  hanoi  matmul  perm  queen (x100)  quick  sieve
raw        20629   3631    3263  8248         15775   7831   5737
nib        11805   2175    1813  4762         10509   4746   3364
+ prim      6995   1341    1685  3453          4178   2266   1862
+ for       5711   1341    1471  2533          3823   2224   1653

1 nib/raw     avg. 41% (1-A/B)
2 +prim/raw        64%    2-1 = 23%
3 +for/raw         68%    3-2 = 4%
discussion
1 the new encoding (nibble) reduces the number of instructions by 41%
2 using primitives reduces the number of instructions by a further 23%
3 "for" reduces the number of instructions by a further 4%
4 October 2004