Optimisation
We perform several optimisations:
language level
    compiler: macro
    code generator: primitives
code generation level
    a = a + 1 -> incv.a, decv.a (already supported in Sx1)
    jmp to jmp -> short cut
    jmp to ret -> ret
    (!= p 0) -> p
    (if (= a 0) x y) -> (if a y x)
    (if (= a 0) x) -> (if a skip x)
instruction set level
    extend instructions: ldxv stxv seqi jne
    change the compiler to use seqi
    ldxv stxv jne can be done in the code generator
microarchitecture level (Sx2)
benchmark programs
    nut.txt    the basic compiler
    nos2.txt   send/receive 100 messages
why?
These two programs represent the normal workload in this project:
compiling programs and running the operating system.
set up of benchmark
nut.txt will compile itself. nos2.txt runs a producer/consumer pair
sending/receiving 100 messages, with 200 task switches.
How?
Run the benchmarks and collect profiles consisting of:
1 the frequency of each instruction used
2 the frequency of each line of the program executed
Profile 1 tells us which instructions to improve. Profile 2 tells us
where the program spends its time.
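Both profiles can be collected by counting in the simulator's fetch
loop. Here is a minimal Python sketch; the (opcode, source-line)
pair format and the straight-line execution are simplifying
assumptions for illustration, not nsim32's actual internals.

```python
from collections import Counter

# Each instruction is an (opcode, source_line) pair -- an assumed
# layout, not the real object format.
def run_with_profile(program):
    inst_freq = Counter()   # profile 1: which instructions to improve
    line_freq = Counter()   # profile 2: where the program spends time
    pc = 0
    while pc < len(program):
        op, line = program[pc]
        inst_freq[op] += 1
        line_freq[line] += 1
        pc += 1             # straight-line execution only, for brevity
    return inst_freq, line_freq

demo = [("Get", 10), ("Lit", 10), ("Eq", 10), ("Call", 11)]
insts, lines = run_with_profile(demo)
print(insts["Get"], lines[10])   # -> 1 3
```

A real simulator would update pc through its jump and call logic; the
counting itself is the same two-line addition to the fetch loop.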
<explain nut.obj gen2.obj nsim32 sx1p noss>
<nsim32 should be renamed to nvm, and sx1p to sx>
baseline
This is the profile of nut.txt, running "sx1p nuts.obj < nut.txt"
(base): the base nut-compiler compiling the nut-compiler.

number of instructions (x1000) used in functions

!=          738
and         963
str=       4414
getName     416
install    1456
-------
           7987
total      8585
Conclusion of the baseline
str= consumes 50% of the cycles, followed by install, !=, and, and
getName. All of these sum to 93% of the total. We should do the
following:
1 support str= in machine code
2 change the symbol table algorithm
This is the profile of nos2.txt.

number of instructions used in functions (anything less than 1000 is
not shown)

!=         1993
or         1295
ei/di      1194
getNext    1992
setNext    2020
setPrev    2020
appendDL   3413
deleteDL   2700
setValue   3005
switchp    3804
findmail   3465
send       3267
receive    2772
produce    1581
consume    1395
--------
          35916   (82% of total)
total     43580
These are mostly accesses to data structures of the form (vec a n) and
(setv a n), where a is a local variable and n is a constant.
Macro
how to do macro
The nut-compiler (nut.txt) is modified to expand macros (nut4.txt).
The compiler itself (the target) has almost all of its access
functions changed into macros (nutm.txt). The compiler nut4.txt is
used to compile the macroed compiler (nutm.txt). Then the macroed
compiler is used to compile the original, nut.txt, for benchmarking.
These are the steps of the work.
1 Produce the compiler that can expand macros.
      nsim32 nut.obj < nut4.txt > nut4.obj
2 Use this compiler to compile the macroed compiler.
      nsim32 nut4.obj < nutm.txt > nutm.obj
3 Generate code from it. The final executable code is the compiler
  with most access functions inlined, nutms.obj.
      nsim32 gen2.obj < nutm.obj > nutms.obj
4 Use this compiler to compile the original nut-compiler to collect
  profiling statistics.
      sx1p nutms.obj < nut.txt
The second benchmark is similar.
1 Compile the macroed nos, nos2m.txt.
      nsim32 nut4.obj < nos2m.txt > nos2m.obj
2 Produce the executable code.
      nsim32 gen2.obj < nos2m.obj > ns2.obj
3 Run nos (with most access functions inlined) under noss.
      noss ns2.obj
The result of macro expansion (inlining) to eliminate calls is:

              nut (x1000)          nos
              inst    cycle        inst     cycle (system:user)        cycle per task switch (system)
base          8585    38033        43580    220915 (49274:171641)      246
macro         6556    26422        27696    120644 (22724:97920)       114
macro/base %  76.4    69.6         63.6     54.6                       46.3
In compilation, the macro version is 30% faster (in terms of cycles);
it is 46% faster in message passing and 54% faster in task switching.
<fig inst freq of base/macro?>
Observing the frequency of instructions used, "call" (plus "ret") is
reduced by 80% in the macro version, and "get" is reduced by 30% (in
getting the parameters). A similar reduction is observed in the nos
benchmark: an 80% reduction in "call" and a 48% reduction in "get".
So macro expansion is highly effective.
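The inlining that nut4.txt performs can be illustrated on list-form
s-expressions. In this Python sketch the list representation, the
MACROS table, and the (vec p 1) slot layout for getNext are all
hypothetical; the real compiler works on nut's own tree form.

```python
# A call to a registered one-line access function is replaced by its
# body, with parameters substituted -- i.e., macro expansion (inlining).
MACROS = {
    # (getNext p) -> (vec p 1)   -- hypothetical slot layout
    "getNext": (["p"], ["vec", "p", 1]),
}

def substitute(body, env):
    """Replace parameter names in 'body' by the caller's arguments."""
    if isinstance(body, list):
        return [substitute(x, env) for x in body]
    return env.get(body, body)

def expand(expr):
    """Recursively expand macro calls in an s-expression."""
    if not isinstance(expr, list):
        return expr
    head, *args = expr
    args = [expand(a) for a in args]
    if head in MACROS:
        params, body = MACROS[head]
        return substitute(body, dict(zip(params, args)))
    return [head] + args

print(expand(["getNext", ["head", "q"]]))
# -> ['vec', ['head', 'q'], 1]
```

The expanded tree then compiles to straight vector-access code, with
no "call", "ret", or "get" for the parameter, which is where the
measured reductions come from.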
Introduce new primitives into the language
Nut is designed to be minimal. Many basic and frequently used
functions are written as user-defined functions, such as !=, >=, <=,
and, or, not. As the processor Sx already has machine instructions to
support these functions, they should be considered builtin operators
of the language. The code generator can be modified to convert calls
to these functions into the associated Sx instructions. This is not
the same as macro expansion, because an Sx instruction does not behave
exactly the same as the equivalent operation written in the
nut-language. Take, for example, the "and" function in nut:
(def and (a b) (if a b 0))
The meaning of this "and" is "if a is true then the result depends on
b, else the result is false". Its semantics are those of "short-cut
and": the arguments are evaluated only as far as needed to know the
result (not always all of them, as in "eager evaluation" semantics).
However, the machine instruction "And" evaluates all arguments, so
there is a difference. Macro expansion preserves the semantics of the
function. The machine primitive is faster, but not always: in cases
where evaluating the arguments is costly, the "short-cut" semantics
may win because it may evaluate fewer arguments.
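The difference can be made concrete with a small Python sketch.
Everything here is illustrative: "costly" stands for an argument
expression with a visible evaluation cost, and the eager version uses
a bitwise & to mirror the Band instruction.

```python
calls = []

def costly(name, value):
    calls.append(name)          # record that this argument was evaluated
    return value

# nut's user-defined and, (def and (a b) (if a b 0)):
# b is evaluated only when a is true (short-cut).
def and_shortcut(a, b_thunk):
    return b_thunk() if a else 0

# The machine instruction Band: both operands were already evaluated
# onto the stack before the instruction runs.
def and_eager(a, b):
    return a & b

r1 = and_shortcut(costly("a", 0), lambda: costly("b", 1))
evaluated_short = list(calls)       # only "a" was evaluated

calls.clear()
r2 = and_eager(costly("a", 0), costly("b", 1))
evaluated_eager = list(calls)       # both "a" and "b" were evaluated

print(r1, evaluated_short)   # -> 0 ['a']
print(r2, evaluated_eager)   # -> 0 ['a', 'b']
```

Both versions return the same value here; the difference is only in
how much argument-evaluation work was done, which is exactly the
trade-off described above.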
Here is how the code generator is modified. The functions !=, >=, <=,
and, or, not are still written as user-defined functions in the source
program, and their uses are compiled into function calls. The code
generation for "call" checks the index of the called function and, for
these functions, generates Sx machine instructions instead of a normal
call.
; e is arglist
(def gencall (arg e) (idx a)
  (do
    (set idx (searchRef arg))
    ...
    (if (= idx yNe) (genop icNe 0 e)
    (if (= idx yLe) (genop icLe 0 e)
    (if (= idx yGe) (genop icGe 0 e)
    (if (= idx yAnd) (genop icBand 0 e)
    (if (= idx yOr) (genop icBor 0 e)
    (if (= idx yNot) (genop icNot 0 e)
    ; else
    (genop icCall idx e))))))))))   ; normal call

; eval arg-list (e), out code op.arg
(def genop (op arg e) ()
  (do
    (while e
      (do
        (eval (head e))
        (set e (tail e))))
    (outa op arg)))
Where yNe, yLe, yGe, yAnd, yOr, yNot are the indexes of the
user-defined functions !=, <=, >=, and, or, not. These indexes are
found in the symbol table, which is read by the code generator. The
instructions icNe, icLe, icGe, icBand, icBor, icNot are the native Sx
instructions for the not-equal, less-than-or-equal,
greater-than-or-equal, bitwise-and, bitwise-or, and logical-not
operations. For example, the following code fragment from nut.txt
(the last line of the function "str="):
...
(and (= c1 0) (= c2 0))))
is normally compiled into a call to the defined function "and":
133 Get 1
134 Lit 0
135 Eq
136 Call and
With this code generator it is compiled into:
133 Get 1
134 Lit 0
135 Eq
136 Band
Let this code generator be "gen5.txt". The following steps produce
the profile statistics.
1 Compile the code generator
      nsim32 nut.obj < gen5.txt > gen5.obj
2 Generate the compiler
      nsim32 gen5.obj < nut.obj > nutp.obj
3 Run nutp.obj to compile nut.txt
      sx1p nutp.obj < nut.txt
Similarly for the operating system.
1 Generate the operating system using the modified code generator,
  gen5.obj.
      nsim32 gen5.obj < nos2.obj > nos2p.obj
2 Run nos2p.obj to collect statistics
      noss nos2p.obj
The result of running the benchmarks using the code generator gen5.txt
is shown below.

                  nut (x1000)          nos
                  inst    cycle        inst     cycle (system:user)        cycle per task switch (system)
base              8585    38033        43580    220915 (49274:171641)      246
macro             6556    26422        27696    120644 (22724:97920)       114
primitive         6855    28987        40688    206450 (44468:161982)      222
primitive/base %  79.8    76.2         93.3     93.4                       90
In terms of speedup (cycles), using only primitives is not as
effective as macro expansion: it is only 24% faster (macro is 30%),
but they are comparable. However, in the message passing benchmark
the primitives are not used much; the speedup is only 7%, and 8% in
task switching (for macro, 46% and 54%). So using primitives is not
as effective in the operating system as macro expansion. Inspecting
the compiler task profile revealed that the reduction in the number of
"call" is 56% and in "get" is 21%, not as much as in macro expansion
(89% and 48% respectively). In the message passing benchmark, the
native instructions "Bor" and "Ne" generated by the modified code
generator are executed only 499 times, merely 1.2% of the total number
of instructions executed.
Improving the quality of code from the code generator
As the next step, the quality of the code from the code generator can
be improved. This method can be applied without changing the
instruction set or the language. Some sequences of operations can be
replaced by shorter sequences without affecting the result, hence
making the code faster.
<more explanation from compiler text>
We elected to do the following code optimisations, which are not
difficult to implement.
1 Introduce inc/dec of local variables, as there are native
  instructions supporting them in the Sx processor (as done in the
  exercise of Chapter 6???).
2 Improve jmp to jmp and jmp to ret, to "short cut" them.
3 Change (!= p 0) to p.
4 Change the conditional (= a 0) in an "if" expression; there are two
  possibilities:
4.1 (if (= a 0) x y) is replaced by (if a y x)
4.2 (if (= a 0) x) is replaced by (if a skip x)
The expression with inc/dec can be detected from the n-code tree.
    (set a (+ a 1))
It is normally compiled into
    get.a lit.1 add put.a
which can be replaced by generating the native code "Inc.a" or
"Dec.a". The last two rules come from the observation that in an "if"
expression the conditional becomes:
    get.a lit.0 eq jf
The sequence "lit.0 eq jf" can be replaced by "jt", so the conditional
becomes:
    get.a jt
Rules 4.1 and 4.2 use this fact.
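These rewrites can also be viewed as a linear peephole pass over the
emitted code. The Python sketch below uses (op, arg) pairs as a
stand-in for real Sx code; this linear form is an illustrative
assumption, since gen6 actually performs the rewrites on the n-code
tree before emission.

```python
# A peephole pass applying rule 1 (inc) and the lit.0-eq-jf rewrite
# over a list of (op, arg) instruction pairs.
def peephole(code):
    out = []
    i = 0
    while i < len(code):
        w = code[i:i + 4]
        # rule 1: get.a lit.1 add put.a  ->  inc.a
        if (len(w) == 4 and w[0][0] == "get" and w[1] == ("lit", 1)
                and w[2][0] == "add" and w[3] == ("put", w[0][1])):
            out.append(("inc", w[0][1]))
            i += 4
        # rules 4.1/4.2: lit.0 eq jf.t  ->  jt.t
        elif (len(w) >= 3 and w[0] == ("lit", 0)
                and w[1][0] == "eq" and w[2][0] == "jf"):
            out.append(("jt", w[2][1]))
            i += 3
        else:
            out.append(code[i])
            i += 1
    return out

src = [("get", "a"), ("lit", 1), ("add", None), ("put", "a"),
       ("get", "a"), ("lit", 0), ("eq", None), ("jf", 12)]
print(peephole(src))
# -> [('inc', 'a'), ('get', 'a'), ('jt', 12)]
```

A real linear pass would also have to check that no jump target lands
inside a replaced window and adjust jump offsets when the code
shrinks; doing the rewrite on the n-code tree, before addresses exist,
avoids both problems.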
The code generator fragment for doing "inc/dec" is as follows.
"genput" is the main function. It is called when the operator "put.a"
is encountered at the beginning of the expression (in n-code):
(put.a (add get.a lit.1)). "isIncDec" checks whether the expression
is an increment/decrement expression. If it is, the native
instruction is generated; otherwise the normal unary-op code is
generated (in "genuop").
(def genuop (op arg e) ()
  (do
    (eval e)
    (outa op arg)))

(def isIncDec (op arg e) ()
  (do
    (if (isATOM e) 0
    (if (!= (head e) (mkATOM op 0)) 0
    (if (!= (arg2 e) (mkATOM xGET arg)) 0
    (if (!= (arg3 e) (mkATOM xLIT 1)) 0
    1))))))

(def genput (op arg e) ()
  (if (isIncDec xADD arg e)
    (outa icInc arg)
  (if (isIncDec xSUB arg e)
    (outa icDec arg)
  ; else
  (genuop op arg e))))
The code fragment for transforming (!= p 0) to p is as follows (at the
function "gencall"), similar to generating the primitives. yNe is the
index of the user-defined function (def != ...). If the call is "!="
and the second argument is "lit.0", the native code is generated;
otherwise the normal code is generated (a function call).

(if (= idx yNe)
  (if (isOpArg (arg2 e) xLIT 0)   ; (!= p 0) -> p
    (eval (head e))
  ; else
  (genop icNe 0 e))
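The short-cut jump pass (rule 2) can be sketched as follows, again on
(op, arg) pairs with absolute jump targets; the representation and
function names are illustrative assumptions, not the actual gen6.txt
code.

```python
# Rule 2: a jmp whose target is another jmp is retargeted to the end
# of the chain, and a jmp landing on ret becomes a ret.
def shortcut_jumps(code):
    def final_target(t):
        seen = set()
        # follow jmp-to-jmp chains, guarding against cycles
        while t not in seen and t < len(code) and code[t][0] == "jmp":
            seen.add(t)
            t = code[t][1]
        return t

    out = []
    for op, arg in code:
        if op == "jmp":
            t = final_target(arg)
            if t < len(code) and code[t][0] == "ret":
                out.append(("ret", None))   # jmp to ret -> ret
            else:
                out.append(("jmp", t))      # jmp to jmp -> short
        else:
            out.append((op, arg))
    return out

code = [("jmp", 2), ("nop", None), ("jmp", 3), ("ret", None)]
print(shortcut_jumps(code))
# -> [('ret', None), ('nop', None), ('ret', None), ('ret', None)]
```

In the example, the jmp at address 0 reaches ret through the jmp at
address 2, so both jumps collapse to ret directly.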
The code fragment for the "short-cut" jump is straightforward and we
will not elaborate on it further. The code fragment for performing
the reduction in an "if" expression is as follows. "genif2" is the
normal "if" generation; "iseq0" checks for an expression of the form
(= a 0).
(def genif e (e1 e2 e3 ads)
  (if (iseq0 (head e))            ; (= a 0)
    (do
      (set e1 (arg2 (head e)))    ; e1 is a
      (set e2 (arg2 e))
      (set e3 (arg3 e))
      (if e3                      ; (if (= a 0) x y) -> (if a y x)
        (genif2 (cons e1 (cons e3 (cons e2 NIL))))
      ; else
      (do                         ; (if (= a 0) x) -> if a skip x
        (eval e1)                 ; a
        (outa icJt 0)
        (set ads (- XP 1))
        (eval e2)                 ; x
        (patch ads (- XP ads)))))
  ; else
  (genif2 e)))                    ; normal if
The code generator gen5.txt is modified with these rules. The result
is a code generator that generates the primitives !=, <=, >=, and, or,
not and, in addition, performs the above improvements: inc, dec,
short-cut jump, the (!= p 0) reduction, and the improved conditional
(if (= a 0) ..). Let this code generator be "gen6.txt".
The steps to run the benchmarks with this new code generator are:
1 Compile the code generator
      nsim32 nut.obj < gen6.txt > gen6.obj
2 Use the new code generator to generate the executable object of the
  compiler.
      nsim32 gen6.obj < nut.obj > nutg.obj
3 Use the new compiler to compile the benchmark.
      sx1p nutg.obj < nut.txt
Similarly for the operating system benchmark.
1 Generate the executable nos
      nsim32 gen6.obj < nos2.obj > nos2g.obj
2 Run the new nos2
      noss nos2g.obj
The result is shown below.

                  nut (x1000)          nos
                  inst    cycle        inst     cycle (system:user)        cycle per task switch (system)
base              8585    38033        43580    220915 (49274:171641)      246
macro             6556    26422        27696    120644 (22724:97920)       114
primitive         6855    28987        40688    206450 (44468:161982)      222
codegen           6412    27483        38599    199088 (44036:155052)      220
codegen/prim %    93.5    94.8         94.8     96.4                       99
The gen6.txt code generation is built on top of the gen5.txt
(primitive) code generation, so its performance improvement is
considered relative to gen5.txt (a kind of further improvement of
gen5). The improved code generation further reduces the cycles by 5%
in the compiler benchmark and by 4% in the message passing benchmark.
It reduces the task switching cycles by only 1%. Inspecting the
profile of the compiler benchmark revealed that "inc" is used 123
times; the instructions in the sequence it replaced, "get.a lit.1 add
put.a", are reduced by the same amount. The profile of the message
passing benchmark is similar, but the "short-cut jump", which has
little effect in the compiler benchmark, is quite effective here: the
number of "jmp" is reduced by 50% (from 800 to 401).
(to be continued at the instruction set level)
29 Aug 2009