Code improvement for NOS applications

The evaluation is based on running two copies (processes) of test-sum.txt on NOS and profiling only the system execution.  The improvement is done in two ways:
1)  modify the NUT compiler to produce better code.  This is done by "inlining" frequently used code, such as functions that access data structures.
2)  create additional s-code to support new primitives, such as combined jumps and a new addressing mode (index + displacement) for Ld/St instructions.

The table below shows the six profiles:

performance optimisation for nut-som NOS system

profile       1      2      3      5      7      9

switchp      28     28     28     28     26     20

Add           6      6      6      6      6      6
Eq           88     62     86      0      0      0
Ldx          31     31     31     31     24     18
Stx          78     78     78     78     24     18
Ret         140      9      9      9      9      9
Array         2      2      2      2      2      2
End           1      1      1      1      1      1
Get         389    184    184    184    119     95
Put          33     33     33     33     31     25
Ld          152    126    152    126    118     94
St           37     37     37     37     35     29
Jmp          31     31      4     30     28     20
Jf           90     90     60      0      0      0
Lit         273    247    245    245    176    140
Call        197     66     66     66     62     50
Sys          34     34     34     34     32     26
Bor           0      0     28      0      0      0
Ne            0      0      2      0      0      0
Jeq           0      0      0      2      2      2
Jne           0      0      0     60     56     44
Lxvd          0      0      0      0      5      5
Sxvd          0      0      0      0     50     44

total      1582   1037   1058    944    780    628


Profile
1    baseline, the basic n-code to s-code translation
2    macro expansion (inlining)
3    replace function calls with primitives: or, !=, <=, >=
5    add s-code: combined jumps Jne, Jeq, Jge, Jgt
7    add s-code for data structure access: Lxvd, Sxvd
9    improve user code, add s-code Inc, and remove dead code

Profile analysis

We measure the execution profile of NOS in order to tune the instruction set and code generation for it.  We did two experiments:

1) inline code using macro expansion.  This is done in the nut25 compiler (producing faster code).

The result (in profile2.txt) shows that the total number of instructions executed in NOS is reduced from 1582 to 1037 (1 - b/a = 34%).  The total number of instructions executed overall goes from 27652 to 27097 (1 - b/a = 2%).  This shows that the optimisation targeted at NOS is effective (34% reduction) but not effective for the user (sum 1000) program (2% total reduction).

2) change some simple user functions, such as or, and, !=, <=, >=, into primitives (in s-code: Bor, Band, Ne, Le, Ge).  To our surprise, it is not as fast as inlining.  The figure is: system 1058 (inline 1037).  Upon close inspection we found that inlining indeed produces faster code because it gives "short-circuit or" behaviour, instead of evaluating all arguments as the Bor primitive does.  See the macro definition:

(defm or (a b) () (if a 1 b))

and when it is called (in switchp):

  (if (or (= status 10) (= status 12)) ...

the inlined code is:

    279 Fun switchp
    280 Sys 5
    281 Ld status
    282 Lit 10
    283 Eq
    284 Jf 287
    285 Lit 1
    286 Jmp 290
    287 Ld status
    288 Lit 12
    289 Eq
    290 Jf 311
    ....

the code using Bor looks like:

    308 Fun switchp
    309 Sys 5
    310 Ld status
    311 Lit 10
    312 Eq
    313 Ld status
    314 Lit 12
    315 Eq
    316 Bor
    317 Jf 338
    ....

The frequency counts of code execution show:

inlining     primitive  +/-  (prim compared to inline)

Add 6        same
Eq 62        Eq 86    +
Ldx 31       same
Stx 78       same
Ret 9        same
Array 2      same
End 1        same
Get 184      same
Put 33       same
Ld 126       Ld 152   +
St 37        same
Jmp 31       Jmp 4    -   
Jf 90        Jf 60    -
Lit 247      Lit 245  -
Call 66      same
Sys 34       same
             Bor 28   +

The increase in Eq and Ld is the result of the non-short-circuit Bor.  However, there is some decrease in Jf and Jmp.  Other primitives that replace an (if ..) pattern will likewise not be much more effective than inlining.
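To make the difference concrete, here is a minimal sketch in C of a Bor primitive on a stack machine.  The names (bor, stack, sp) and the stack layout are assumptions for illustration, not the actual NOS interpreter.  By the time Bor runs, both comparison results are already on the stack, so both Eq (with their Ld/Lit operands) have been executed; the inlined (if a 1 b) instead jumps over the second comparison when the first one is true.

    /* Sketch of a Bor primitive on a stack machine.
       stack, sp and the fixed stack size are illustrative only. */
    static int stack[256];
    static int sp = -1;          /* index of top of stack */

    static void bor(void)
    {
        int b = stack[sp--];     /* second operand: already computed */
        int a = stack[sp];       /* first operand: already computed  */
        stack[sp] = (a || b);    /* both Eq results are combined here */
    }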

The next experiment adds combined jumps: Jeq, Jne.  The system count is 944 (profile5.txt).  The Eq, Ne and Jf counts disappear (they become Jne, Jeq).
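One way to realise the combined jump is sketched in C below as a peephole pass over already generated code.  The opcode values, the flat code layout (an opcode word followed by a target word for a jump) and the idea of fusing after code generation rather than emitting Jne directly in the code generator are assumptions for illustration.

    /* Sketch: fuse  Eq ; Jf target  into  Jne target.
       A real pass would walk instruction by instruction, skipping
       operand words, instead of scanning every word as done here. */
    enum { OP_NOP, OP_EQ, OP_JF, OP_JNE };

    static void fuse_eq_jf(int *code, int n)
    {
        for (int i = 0; i + 2 < n; i++) {
            if (code[i] == OP_EQ && code[i + 1] == OP_JF) {
                code[i]     = OP_JNE;       /* jump if not equal         */
                code[i + 1] = code[i + 2];  /* keep the jump target      */
                code[i + 2] = OP_NOP;       /* compacted in a later pass */
            }
        }
    }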

profile7.txt
We experiment with an additional addressing mode, starting with using a local variable as the base address; we even consider two arguments in one instruction: base plus displacement.  When the index is constant it can be coded into the instruction, saving one literal instruction fetch and execution.

summary of addressing modes

ldx      [ads index -- value]   M[ads+index]
stx      [ads index value --]
lxv.v    [index -- value]       M[get.v + index]
sxv.v    [index value --]
lxvd.v.d [-- value]             M[get.v + d]
sxvd.v.d [value --]        
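A minimal sketch in C of the load side of these modes, assuming data memory M[], an evaluation stack, and locals reached through a frame pointer fp so that get.v corresponds to M[fp + v]; these names and the operand encoding are illustrative, not the actual NOS interpreter.  The point is that lxvd.v.d carries both the local and the constant displacement inside the instruction, so no separate Lit has to be fetched and executed.

    /* Sketch of ldx, lxv.v and lxvd.v.d on a stack machine.
       M, stack, sp, fp and the encodings are assumptions. */
    static int M[4096];          /* data memory                         */
    static int stack[256];       /* evaluation stack                    */
    static int sp = -1;          /* top of stack                        */
    static int fp = 0;           /* frame pointer: local v is M[fp + v] */

    static void ldx(void)            /* [ads index -- value]            */
    {
        int index = stack[sp--];
        int ads   = stack[sp];
        stack[sp] = M[ads + index];
    }

    static void lxv(int v)           /* lxv.v : [index -- value]        */
    {
        int index = stack[sp];
        stack[sp] = M[M[fp + v] + index];
    }

    static void lxvd(int v, int d)   /* lxvd.v.d : [-- value]           */
    {
        stack[++sp] = M[M[fp + v] + d];  /* d comes from the instruction,
                                            not from a Lit on the stack */
    }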

Two new combined jumps are also included: Jgt, Jge.  They are used in while (i < n) and in the for loop for(i=0; i<n; i++).
The system count is 780.  Now the performance is sub-800, which is what we aimed for.  This is about two times faster than the original (1582).

The next experiment does not help the system.  However, it improves the user program by introducing "inc".  This expression can be recognised easily:

(set i (+ i 1)) ->  (inc i)

icInc is already available in s-code; only the code generator needed to be modified.
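A minimal sketch in C of the pattern match assumed in the code generator; the AST node layout and field names are illustrative, not the actual nut25 sources.

    /* Sketch: recognise (set v (+ v 1)) so it can be emitted as inc.v. */
    enum { N_SET, N_ADD, N_VAR, N_CONST };

    typedef struct Node {
        int op;
        int var;                 /* N_SET / N_VAR : local variable slot */
        int val;                 /* N_CONST       : literal value       */
        struct Node *a, *b;      /* children; for N_SET, a is the rhs   */
    } Node;

    static int is_increment(const Node *e)
    {
        return e->op == N_SET
            && e->a->op == N_ADD
            && e->a->a->op == N_VAR   && e->a->a->var == e->var
            && e->a->b->op == N_CONST && e->a->b->val == 1;
    }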

profile9.txt
Add exactly one instruction: inc.v.  Because the user program is faster, the process now terminates earlier; switchp runs only 20 times.  The system count becomes 628, i.e. 628/20 = 31.4 instructions per switch (compare to profile1: 1582/28 = 56.5).  Looking at user execution, it is 18050, versus 24056 in profile7 (a 25% reduction) and 26070 in the first profile (a 31% reduction).

As the final push, we do some code improvement: shortcut jmp-to-jmp and jmp-to-ret, and eliminate dead code (the now unused primitives: or, and, !=, <=, >=).  There is no improvement in terms of speed, but the executable code is smaller: 268 instructions vs. the original 496 instructions, almost half the size.
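A minimal sketch in C of the jmp-to-jmp and jmp-to-ret shortcuts; the opcode values, the flat code layout (Jmp is an opcode word followed by a target word, Ret takes no operand) and the bounded chain following are assumptions for illustration.

    /* Sketch: redirect jmp-to-jmp chains and turn jmp-to-ret into ret.
       A real pass would walk instruction by instruction, skipping
       operand words, instead of scanning every word as done here. */
    enum { OP_NOP, OP_JMP, OP_RET };

    static void shortcut_jumps(int *code, int n)
    {
        for (int i = 0; i + 1 < n; i++) {
            if (code[i] != OP_JMP)
                continue;
            int t = code[i + 1];                 /* current jump target  */
            int hops = 0;
            while (t >= 0 && t + 1 < n && code[t] == OP_JMP && hops++ < n)
                t = code[t + 1];                 /* follow the jmp chain */
            if (t >= 0 && t < n && code[t] == OP_RET) {
                code[i]     = OP_RET;            /* jmp to ret: ret here */
                code[i + 1] = OP_NOP;
            } else {
                code[i + 1] = t;                 /* jmp to jmp: final target */
            }
        }
    }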

Conclusion

Inlining seems to be very effective (no doubt!).  The good point about this is that there is no pressing need for more primitives (which makes implementing a chip a bit easier, since the number of primitives stays small).  The new addressing mode is very effective both in terms of speed and space.

9 Feb 2005
P. Chongstitvatana