Optimisation


We perform several optimisations:

language level
  compiler:  macro
  code generator:  primitives

code generation
  a = a + 1  -> incv.a, decv.a (already supported in Sx1)
  jmp to jmp -> short
  jmp to ret -> ret
  != p 0     -> p
  (if (= a 0) x y) -> (if a y x)
  (if (= a 0) x)   -> if a skip x

instruction set level
  extend instructions: ldxv stxv seqi jne
  change compiler to use seqi
  ldxv stxv jne can be done at code generator

microarchitecture level  sx2

benchmark programs

  nut.txt    basic compiler
  nos2.txt   send/receive 100 messages

why?

these two programs represent the normal workload in this project: compiling programs and running the operating system.

setup of the benchmarks

nut.txt  will compile itself.
nos2.txt runs producer/consumer tasks sending/receiving 100 messages with 200 task switches.

How?

Run the benchmarks and collect profiles consisting of
1 the frequency of each instruction used
2 the frequency of each line of the program executed

(1) tells us which instructions to improve.
(2) tells us where the program spends its time.
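As a concrete illustration, the collection can be sketched in Python (the project's own tools are written in nut; the (line, instruction) trace format and all names here are assumptions of the sketch):

```python
# Illustrative sketch: profiling a benchmark run.  We assume the
# simulator emits one (source_line, instruction) pair per executed
# step; both the names and the trace format are hypothetical.
from collections import Counter

def profile(trace):
    inst_freq = Counter(inst for _, inst in trace)  # (1) which instructions
    line_freq = Counter(line for line, _ in trace)  # (2) which lines
    return inst_freq, line_freq

# a tiny made-up trace
trace = [(10, "get"), (10, "lit"), (10, "eq"), (11, "jf"),
         (12, "get"), (12, "call")]
inst_freq, line_freq = profile(trace)
print(inst_freq.most_common(2))  # hottest instructions: what to improve
print(line_freq.most_common(1))  # hottest lines: where time is spent
```

The two counters correspond directly to (1) and (2) above.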

<explain nut.obj gen2.obj nsim32 sx1p noss>
<nsim32 should be renamed to nvm, and sx1p to sx>

baseline

This is the profile of nut.txt, running "sx1p nuts.obj < nut.txt" (base):
the base nut-compiler compiling the nut-compiler.

number of instructions (x1000) used in functions

!=         738
and        963
str=      4414
getName    416
install   1456
       -------
          7987

total 8585

Conclusion of the baseline

str= consumes 50% of the cycles, followed by install, !=, and, and getName.  All these sum to 93% of the total.

We should do the following:
1 support str= in machine code
2 change symbol table algorithm

This is the profile of nos2.txt.

number of instructions used in functions (anything less than 1000 is not shown)


!=         1993
or         1295
ei/di      1194
getNext    1992
setNext    2020
setPrev    2020
appendDL   3413
deleteDL   2700
setValue   3005
switchp    3804
findmail   3465
send       3267
receive    2772
produce    1581
consume    1395
       --------
          35916
 
these sum to 82% of the total

total 43580

Mostly these are accesses to data structures in the form (vec a n) and (setv a n), where a is a local variable and n is a constant.

Macro

how to do macro

The nut-compiler (nut.txt) is modified to expand macros (nut4.txt).  In the compiler itself (the target), almost all of the access functions are changed into macros (nutm.txt).  The compiler nut4.txt is used to compile the macroed compiler (nutm.txt).  Then the macroed compiler is used to compile the original nut.txt for benchmarking.  These are the steps of the work.

1 nsim32 nut.obj < nut4.txt > nut4.obj

This produces the compiler that can expand macros.

2 nsim32 nut4.obj < nutm.txt > nutm.obj

This compiler is used to compile the macroed compiler.

3 nsim32 gen2.obj < nutm.obj > nutms.obj

and generate code from it.  The final executable code is the compiler with most access functions inlined, nutms.obj.  We use this compiler to compile the original nut-compiler to collect profiling statistics.

4 sx1p nutms.obj < nut.txt

The second benchmark is similar. First, compile the macroed-nos, nos2m.txt.

1 nsim32 nut4.obj < nos2m.txt > nos2m.obj

Then produce the executable code.

2 nsim32 gen2.obj < nos2m.obj > ns2.obj

and run nos (with most access functions inlined) under noss.

3 noss ns2.obj

The result of using macro expansion (inlining) to eliminate calls is:

              nut (x1000)       nos
          inst   cycle     inst   cycle (system:user)    cycle per task switch (system)
base      8585   38033    43580   220915 (49274:171641)  246
macro     6556   26422    27696   120644 (22724:97920)   114

macro/base%
          76.4   69.6      63.6   54.6                   46.3


In compilation, the macro version is 30% faster (in terms of cycles); it is 46% faster in message passing and 54% faster in task switching.

<fig inst freq of base/macro?>

Observing the frequency of instructions used, "call" (plus "ret") is reduced by 80% in the macro version, and "get" is reduced by 30% (in getting the parameters).  A similar reduction is observed in the nos benchmark: an 80% reduction in "call" and a 48% reduction in "get".  So macro expansion is highly effective.

Introduce new primitives into the language

Nut is designed to be minimal.  Many basic and frequently used functions are written as user-defined functions, such as !=, >=, <=, and, or, not.  As the Sx processor already has machine instructions to support these functions, they can be treated as built-in operators of the language.  The code generator can be modified to convert calls to these functions into the associated Sx instructions.  This is not the same as macro expansion, because an Sx instruction does not always behave exactly like the same operation written in the nut-language; consider, for example, the "and" function in nut:

(def and (a b) (if a b 0))

The meaning of this "and" is: if a is true, the result depends on b; otherwise the result is false.  Its semantics are those of the "short-circuit and", where the arguments are evaluated only as far as needed to know the result (rather than always evaluating all arguments, as in "eager evaluation").  The machine instruction "And", however, evaluates all of its arguments, so there is a difference.  Macro expansion preserves the semantics of the function.  The machine primitive will usually be faster, but not always: in cases where evaluating an argument is costly, the short-circuit semantics may be faster because it may evaluate fewer arguments.
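The difference can be sketched in Python (illustrative only; nut itself is not Python, and the names here are hypothetical):

```python
# Illustrative sketch: short-circuit "and" vs an eager machine "And".
evaluated = []

def arg(name, value):
    # stands for a costly argument expression; records its evaluation
    evaluated.append(name)
    return value

# short-circuit semantics (what macro expansion preserves):
# b is never evaluated when a is already false
evaluated.clear()
short = arg("a", 0) and arg("b", 1)
assert evaluated == ["a"]

# eager semantics (as with the machine instruction "And"):
# both arguments are evaluated before the operation
evaluated.clear()
a = arg("a", 0)
b = arg("b", 1)
eager = 1 if a and b else 0
assert evaluated == ["a", "b"]
```

Both compute the same value here; the difference is only in how many arguments get evaluated, which matters when argument evaluation is costly or has side effects.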

Here is how the code generator is modified.  The functions !=, >=, <=, and, or, not are still written as user-defined functions in the source program, and their uses are normally compiled into function calls.  The code generation for "call" checks the index of the called function and, for these functions, generates Sx machine instructions instead of a normal call.

; e is arglist
(def gencall (arg e) (idx a)
  (do
  (set idx (searchRef arg))
  ...
  (if (= idx yNe) (genop icNe 0 e)
  (if (= idx yLe) (genop icLe 0 e)
  (if (= idx yGe) (genop icGe 0 e)
  (if (= idx yAnd) (genop icBand 0 e)
  (if (= idx yOr) (genop icBor 0 e)
  (if (= idx yNot) (genop icNot 0 e)
  ; else
  (genop icCall idx e))))))))))    ; normal call

; eval arg-list (e), out code op.arg
(def genop (op arg e) ()
  (do
  (while e
    (do
    (eval (head e))
    (set e (tail e))))
  (outa op arg)))

yNe, yLe, yGe, yAnd, yOr, yNot are the indexes of the user-defined functions !=, <=, >=, and, or, not.  These indexes are found in the symbol table, which is read by the code generator.  The instructions icNe, icLe, icGe, icBand, icBor, icNot are the native Sx instructions for the not-equal, less-than-or-equal, greater-than-or-equal, bitwise-and, bitwise-or, and logical-not operations.

For example, the following code fragment from nut.txt (the last line of the function "str=")

  ...
  (and (= c1 0) (= c2 0))))

is normally compiled into a call to the defined function "and":

   133 Get 1
   134 Lit 0
   135 Eq
   136 Call and

With this code generator it is compiled into:

   133 Get 1
   134 Lit 0
   135 Eq
   136 Band

Let this code generator be "gen5.txt".  The following steps produce the profile statistics.

1  Compile the code generator

nsim32 nut.obj < gen5.txt > gen5.obj

2  Generate the compiler

nsim32 gen5.obj < nut.obj > nutp.obj

3  Run nutp.obj to compile nut.txt

sx1p nutp.obj < nut.txt

Similarly for the operating system.

1  Generate the operating system using the modified code generator, gen5.obj.

nsim32 gen5.obj < nos2.obj > nos2p.obj

2  Run nos2p.obj to collect statistics

noss nos2p.obj

The result of running benchmarks using the code generator gen5.txt is shown below.

              nut (x1000)       nos
          inst   cycle     inst   cycle (system:user)    cycle per task switch (system)
base      8585   38033    43580   220915 (49274:171641)  246
macro     6556   26422    27696   120644 (22724:97920)   114
primitive 6855   28987    40688   206450 (44468:161982)  222

primitive/base%
          79.8   76.2      93.3   93.4                   90

In terms of speedup (cycles), using only primitives is not as effective as macro expansion: it is only 24% faster (macro is 30%), though they are comparable.  In the message passing benchmark, however, the primitives are not used much; the speedup is only 7%, and 10% in task switching (for macro, 46% and 54%).  So using primitives in the operating system is not as effective as using macro expansion.

Inspecting the compiler task profile reveals that the reduction in the number of "call" is 56%, and of "get" 21%, not as much as in macro expansion (89% and 48% respectively).  In the message passing benchmark, the native instructions "Bor" and "Ne" generated by the modified code generator are executed only 499 times, merely 1.2% of the total number of instructions executed.

Improving the quality of code from the code generator

As the next step, the quality of the code from the code generator can be improved.  This method can be applied without changing the instruction set or the language.  Some sequences of operations can be replaced by shorter sequences without affecting the result, hence making the program faster.

<more explanation from compiler text>

We elected to do the following code optimisations, which are not difficult to implement.

1  Introduce inc/dec of local variables, as there are native instructions for them in the Sx processor (as done in the exercise of Chapter 6???).
2  Short-cut jmp-to-jmp and jmp-to-ret sequences.
3  Change (!= p 0) to p.
4  Change the conditional (= a 0) in an "if" expression; there are two possibilities.
4.1  (if (= a 0) x y) is replaced by (if a y x)
4.2  (if (= a 0) x) is replaced by (if a skip x)

The expression with inc/dec can be detected from the n-code tree.

(set a (+ a 1))

It is normally compiled into:

get.a lit.1 add put.a

It can be replaced by generating the native instructions "Inc.a" and "Dec.a".

The last two rules come from the observation that in an "if" expression, the conditional is compiled into:

get.a lit.0 eq jf

The sequence "lit.0 eq jf" can be replaced by "jt".  So the conditional becomes:

get.a jt

Rules 4.1 and 4.2 use this fact.
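As a sketch (in Python, not nut; the actual implementation works on the n-code tree, as described above), rules 1 and 4 can be viewed as peephole rewrites over a linear instruction list.  The tuple representation is an assumption of this sketch:

```python
# Illustrative peephole pass (instruction tuples are hypothetical).
# Rule 1: get.a lit.1 add put.a  ->  inc.a
# Rule 4: lit.0 eq jf            ->  jt   (invert the branch)
def peephole(code):
    out = []
    i = 0
    while i < len(code):
        w = code[i:i + 4]
        if (len(w) == 4 and w[0][0] == "get" and w[1] == ("lit", 1)
                and w[2][0] == "add" and w[3] == ("put", w[0][1])):
            out.append(("inc", w[0][1]))          # rule 1
            i += 4
        elif (len(w) >= 3 and w[0] == ("lit", 0) and w[1][0] == "eq"
                and w[2][0] == "jf"):
            out.append(("jt", w[2][1]))           # rule 4
            i += 3
        else:
            out.append(code[i])
            i += 1
    return out

code = [("get", "a"), ("lit", 1), ("add", None), ("put", "a"),
        ("get", "a"), ("lit", 0), ("eq", None), ("jf", 10)]
print(peephole(code))
# -> [('inc', 'a'), ('get', 'a'), ('jt', 10)]
```

Eight instructions become three; the result of the computation is unchanged.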

The code generator fragment for "inc/dec" is as follows.  "genput" is the main function.  It is called when the operator "put.a" is encountered at the beginning of an expression (in n-code), such as (put.a (add get.a lit.1)).  "isIncDec" checks whether the expression is an increment/decrement expression.  If it is, the native instruction is generated; otherwise the normal unary-op code is generated (in "genuop").

(def genuop (op arg e) ()
  (do
  (eval e)
  (outa op arg)))

(def isIncDec (op arg e) ()
  (do
  (if (isATOM e) 0
  (if (!= (head e) (mkATOM op 0)) 0
  (if (!= (arg2 e) (mkATOM xGET arg)) 0
  (if (!= (arg3 e) (mkATOM xLIT 1)) 0
  1))))))

(def genput (op arg e) ()
  (if (isIncDec xADD arg e)
    (outa icInc arg)
  (if (isIncDec xSUB arg e)
    (outa icDec arg)
  ; else
  (genuop op arg e))))

The code fragment for transforming (!= p 0) into p is as follows (in the function "gencall"), similar to generating the primitives.  yNe is the index of the user-defined function (def != ...).  If the call is to "!=" and the second argument is "lit.0", then the native code is generated; otherwise the normal code (a function call) is generated.

  (if (= idx yNe)
    (if (isOpArg (arg2 e) xLIT 0) ; (!= p 0) -> p
      (eval (head e))
      ; else
      (genop icNe 0 e))

The code fragment for the "short-cut" jump is straightforward and we will not elaborate it any further.  The code fragment for performing the reduction of the "if" expression is as follows.  "genif2" is the normal "if" generation; "iseq0" checks for an expression of the form (= a 0).

(def genif e (e1 e2 e3 ads)
  (if (iseq0 (head e))         ; (= a 0)
    (do
    (set e1 (arg2 (head e))) ; e1 is a
    (set e2 (arg2 e))
    (set e3 (arg3 e))
    (if e3        ; (if (= a 0) x y) -> (if a y x)           
      (genif2 (cons e1 (cons e3 (cons e2 NIL))))
      ; else
      (do        ; (if (= a 0) x) -> if a skip x
      (eval e1)            ; a
      (outa icJt 0)
      (set ads (- XP 1))
      (eval e2)            ; x
      (patch ads (- XP ads)))))
    ; else
    (genif2 e)))        ; normal if
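Although the short-cut jump code itself is not shown, the idea of rule 2 can be sketched in Python (illustrative only; the instruction tuples, absolute jump targets, and the absence of jump cycles are all assumptions of this sketch):

```python
# Illustrative sketch: short-cutting jump chains.
# A jmp whose target is another jmp is redirected to the final
# target; a jmp whose target is a ret becomes a ret.
def shortcut_jumps(code):
    out = list(code)
    for i, (op, arg) in enumerate(out):
        if op != "jmp":
            continue
        t_op, t_arg = code[arg]
        while t_op == "jmp":        # follow the chain (assumed acyclic)
            arg = t_arg
            t_op, t_arg = code[arg]
        out[i] = ("ret", None) if t_op == "ret" else ("jmp", arg)
    return out

code = [("jmp", 2), ("lit", 1), ("jmp", 4), ("lit", 2), ("ret", None)]
print(shortcut_jumps(code))
# -> [('ret', None), ('lit', 1), ('ret', None), ('lit', 2), ('ret', None)]
```

Each jmp now reaches its final destination in one step, saving a cycle per eliminated intermediate jump.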

The code generator gen5.txt is modified with these rules.  The result is a code generator that generates the primitives !=, <=, >=, and, or, not, together with the above improvements: inc, dec, short-cut jumps, the != reduction, and the improved conditional (if (= a 0) ...).  Let this code generator be "gen6.txt".

The steps to run the benchmarks with this new code generator are:

1  Compile the code generator

nsim32 nut.obj < gen6.txt > gen6.obj

2  Use the new code generator to generate the executable object of the compiler.

nsim32 gen6.obj < nut.obj > nutg.obj

3  Use the new compiler to compile the benchmark.

sx1p nutg.obj < nut.txt

Similarly for the operating system benchmark.

1  Generate the executable nos

nsim32 gen6.obj < nos2.obj > nos2g.obj

2  Run the new nos2

noss nos2g.obj

The result is shown below.

              nut (x1000)       nos
          inst   cycle     inst   cycle (system:user)    cycle per task switch (system)
base      8585   38033    43580   220915 (49274:171641)  246
macro     6556   26422    27696   120644 (22724:97920)   114
primitive 6855   28987    40688   206450 (44468:161982)  222
codegen   6412   27483    38599   199088 (44036:155052)  220

codegen/primitive%
          93.5   94.8      94.8   96.4                   99

The gen6.txt code generation is built on top of the gen5.txt (primitive) code generation, so the performance improvement is considered relative to gen5.txt (a kind of further improvement of gen5).  The improvement in code generation further reduces the cycles by 5% in the compiler benchmark and 4% in the message passing benchmark.  It reduces the task-switching cycles by only 1%.  Inspecting the profile of the compiler benchmark reveals that "inc" is used 123 times, and the instructions in the sequence it replaces, "get.a lit.1 add put.a", are reduced by the same amount.  The profile of the message passing benchmark is similar, but the "short-cut jump", which has little effect in the compiler benchmark, is quite effective here: the number of "jmp" is reduced by 50% (from 800 to 401).

(to be continued at the instruction set level)

29 Aug 2009