Code improvement for NOS applications

The evaluation is based on running two copies (processes) of test-sum.txt on NOS and profiling only the system execution.  The improvement is done in two ways:
1)  modify the NUT compiler to produce better code.  This is done by "inlining" frequently used code, such as functions that access data structures.
2)  create additional s-code to support new primitives, such as combined jumps and a new addressing mode (index + displacement) for Ld/St instructions.

The table below shows the six profiles:

performance optimisation for nut-som NOS system

profile       1      2      3      5      7      9

switchp      28     28     28     28     26     20

Add           6      6      6      6      6      6
Eq           88     62     86      0      0      0
Ldx          31     31     31     31     24     18
Stx          78     78     78     78     24     18
Ret         140      9      9      9      9      9
Array         2      2      2      2      2      2
End           1      1      1      1      1      1
Get         389    184    184    184    119     95
Put          33     33     33     33     31     25
Ld          152    126    152    126    118     94
St           37     37     37     37     35     29
Jmp          31     31      4     30     28     20
Jf           90     90     60      0      0      0
Lit         273    247    245    245    176    140
Call        197     66     66     66     62     50
Sys          34     34     34     34     32     26
Bor           0      0     28      0      0      0
Ne            0      0      2      0      0      0
Jeq           0      0      0      2      2      2
Jne           0      0      0     60     56     44
Lxvd          0      0      0      0      5      5
Sxvd          0      0      0      0     50     44

total      1582   1037   1058    944    780    628


Profile
1    baseline, the basic n-code to s-code translation
2    macro expansion (inlining)
3    replace function calls with primitives: or, !=, <=, >=
5    add s-code: combined jumps Jne, Jeq, Jge, Jgt
7    add s-code for data structure access: Lxvd, Sxvd
9    improve user code, add s-code Inc, and remove dead code

Profile analysis

We measure the execution profile of NOS in order to tune the instruction set and code generation for it.  We did two experiments:

1) inline code using macro expansion.  This is done in the nut25 compiler (producing faster code).

The result (in profile2.txt) shows that the total number of instructions executed in NOS is reduced from 1582 to 1037 (1 - b/a = 34%).  The total number of instructions executed overall goes from 27652 to 27097 (1 - b/a = 2%).  This shows that the optimisation targeted at NOS is effective (34% reduction) but not effective for the user (sum 1000) program (2% total reduction).

2) change some simple user functions, such as or, and, !=, <=, >=, into primitives (in s-code: Bor, Band, Ne, Le, Ge).  To our surprise, it is not as fast as inlining.  The figure is: system 1058 (inline 1037).  Upon close inspection we found that inlining indeed produces faster code because it gives "short-circuit or" behaviour, instead of evaluating all arguments as the Bor primitive does.  See the macro definition:

(defm or (a b) () (if a 1 b))

and when it is called (in switchp):

  (if (or (= status 10) (= status 12)) ...

the inlined code is:

    279 Fun switchp
    280 Sys 5
    281 Ld status
    282 Lit 10
    283 Eq
    284 Jf 287
    285 Lit 1
    286 Jmp 290
    287 Ld status
    288 Lit 12
    289 Eq
    290 Jf 311
    ....

the code using Bor looks like:

    308 Fun switchp
    309 Sys 5
    310 Ld status
    311 Lit 10
    312 Eq
    313 Ld status
    314 Lit 12
    315 Eq
    316 Bor
    317 Jf 338
    ....

The frequency counts of code execution show:

inlining     primitive  +/-  (prim compared to inline)

Add 6        same
Eq 62        Eq 86    +
Ldx 31       same
Stx 78       same
Ret 9        same
Array 2      same
End 1        same
Get 184      same
Put 33       same
Ld 126       Ld 152   +
St 37        same
Jmp 31       Jmp 4    -   
Jf 90        Jf 60    -
Lit 247      Lit 245  -
Call 66      same
Sys 34       same
             Bor 28   +

The increase in Eq and Ld is the result of the non-short-circuit Bor.  However, there is some decrease in Jf and Jmp.  Other primitives that replace an (if ..) pattern will likewise not be much more effective than inlining.
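To make the difference concrete, here is a minimal sketch in C of a Bor primitive on a stack machine.  The names (bor, stack, sp) and the stack layout are assumptions for illustration, not the actual NOS interpreter.  By the time Bor runs, both comparison results are already on the stack, so both Eq (with their Ld/Lit operands) have been executed; the inlined (if a 1 b) instead jumps over the second comparison when the first one is true.

    /* Sketch of a Bor primitive on a stack machine.
       stack, sp and the fixed stack size are illustrative only. */
    static int stack[256];
    static int sp = -1;          /* index of top of stack */

    static void bor(void)
    {
        int b = stack[sp--];     /* second operand: already computed */
        int a = stack[sp];       /* first operand: already computed  */
        stack[sp] = (a || b);    /* both Eq results are combined here */
    }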

The next experiment adds combined jumps: Jeq, Jne.  The system count is 944 (profile5.txt).  The Eq, Ne and Jf counts disappear (they become Jne, Jeq).
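One way to realise the combined jump is sketched in C below as a peephole pass over already generated code.  The opcode values, the flat code layout (an opcode word followed by a target word for a jump) and the idea of fusing after code generation rather than emitting Jne directly in the code generator are assumptions for illustration.

    /* Sketch: fuse  Eq ; Jf target  into  Jne target.
       A real pass would walk instruction by instruction, skipping
       operand words, instead of scanning every word as done here. */
    enum { OP_NOP, OP_EQ, OP_JF, OP_JNE };

    static void fuse_eq_jf(int *code, int n)
    {
        for (int i = 0; i + 2 < n; i++) {
            if (code[i] == OP_EQ && code[i + 1] == OP_JF) {
                code[i]     = OP_JNE;       /* jump if not equal         */
                code[i + 1] = code[i + 2];  /* keep the jump target      */
                code[i + 2] = OP_NOP;       /* compacted in a later pass */
            }
        }
    }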

profile7.txt
We experiment with an additional addressing mode, starting with using a local variable as the base address; we even consider two arguments in one instruction: base plus displacement.  When the index is constant it can be coded into the instruction, saving one literal instruction fetch and execution.

summary of addressing modes

ldx      [ads index -- value]   M[ads+index]
stx      [ads index value --]
lxv.v    [index -- value]       M[get.v + index]
sxv.v    [index value --]
lxvd.v.d [-- value]             M[get.v + d]
sxvd.v.d [value --]        
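A minimal sketch in C of the load side of these modes, assuming data memory M[], an evaluation stack, and locals reached through a frame pointer fp so that get.v corresponds to M[fp + v]; these names and the operand encoding are illustrative, not the actual NOS interpreter.  The point is that lxvd.v.d carries both the local and the constant displacement inside the instruction, so no separate Lit has to be fetched and executed.

    /* Sketch of ldx, lxv.v and lxvd.v.d on a stack machine.
       M, stack, sp, fp and the encodings are assumptions. */
    static int M[4096];          /* data memory                         */
    static int stack[256];       /* evaluation stack                    */
    static int sp = -1;          /* top of stack                        */
    static int fp = 0;           /* frame pointer: local v is M[fp + v] */

    static void ldx(void)            /* [ads index -- value]            */
    {
        int index = stack[sp--];
        int ads   = stack[sp];
        stack[sp] = M[ads + index];
    }

    static void lxv(int v)           /* lxv.v : [index -- value]        */
    {
        int index = stack[sp];
        stack[sp] = M[M[fp + v] + index];
    }

    static void lxvd(int v, int d)   /* lxvd.v.d : [-- value]           */
    {
        stack[++sp] = M[M[fp + v] + d];  /* d comes from the instruction,
                                            not from a Lit on the stack */
    }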

Two new combined jumps are also included: Jgt, Jge.  They are used in while (i < n) and in the for loop for(i=0; i<n; i++).
The system count is 780.  Now the performance is sub-800, which is what we aimed for.  This is about two times faster than the original (1582).

The next experiment does not help the system.  However, it improves the user program by introducing "inc".  This expression can be recognised easily:

(set i (+ i 1)) ->  (inc i)

icInc is already available in s-code; only the code generator needed to be modified.
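A minimal sketch in C of the pattern match assumed in the code generator; the AST node layout and field names are illustrative, not the actual nut25 sources.

    /* Sketch: recognise (set v (+ v 1)) so it can be emitted as inc.v. */
    enum { N_SET, N_ADD, N_VAR, N_CONST };

    typedef struct Node {
        int op;
        int var;                 /* N_SET / N_VAR : local variable slot */
        int val;                 /* N_CONST       : literal value       */
        struct Node *a, *b;      /* children; for N_SET, a is the rhs   */
    } Node;

    static int is_increment(const Node *e)
    {
        return e->op == N_SET
            && e->a->op == N_ADD
            && e->a->a->op == N_VAR   && e->a->a->var == e->var
            && e->a->b->op == N_CONST && e->a->b->val == 1;
    }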

profile9.txt
Add exactly one instruction: inc.v.  Because the user program is faster, the process now terminates earlier; switchp runs only 20 times.  The system count becomes 628, i.e. 628/20 = 31.4 instructions per switch (compare to profile1: 1582/28 = 56.5).  Looking at user execution, it is 18050, versus 24056 in profile7 (a 25% reduction) and 26070 in the first profile (a 31% reduction).

As the final push, we do some code improvement: shortcut jmp-to-jmp and jmp-to-ret, and eliminate dead code (the now unused primitives: or, and, !=, <=, >=).  There is no improvement in terms of speed, but the executable code is smaller: 268 instructions vs. the original 496 instructions, almost half the size.
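A minimal sketch in C of the jmp-to-jmp and jmp-to-ret shortcuts; the opcode values, the flat code layout (Jmp is an opcode word followed by a target word, Ret takes no operand) and the bounded chain following are assumptions for illustration.

    /* Sketch: redirect jmp-to-jmp chains and turn jmp-to-ret into ret.
       A real pass would walk instruction by instruction, skipping
       operand words, instead of scanning every word as done here. */
    enum { OP_NOP, OP_JMP, OP_RET };

    static void shortcut_jumps(int *code, int n)
    {
        for (int i = 0; i + 1 < n; i++) {
            if (code[i] != OP_JMP)
                continue;
            int t = code[i + 1];                 /* current jump target  */
            int hops = 0;
            while (t >= 0 && t + 1 < n && code[t] == OP_JMP && hops++ < n)
                t = code[t + 1];                 /* follow the jmp chain */
            if (t >= 0 && t < n && code[t] == OP_RET) {
                code[i]     = OP_RET;            /* jmp to ret: ret here */
                code[i + 1] = OP_NOP;
            } else {
                code[i + 1] = t;                 /* jmp to jmp: final target */
            }
        }
    }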

Conclusion

Inlining seems to be very effective (no doubt!).  The good point about this is that there is no pressing need for more primitives (which makes implementing a chip a bit easier, since the number of primitives stays small).  The new addressing mode is very effective both in terms of speed and space.

9 Feb 2005
P. Chongstitvatana