Simple code optimisation

example 1 simle sequence
example 2 simple loop
example 3 logical statement
example 4 factoring
example 5 string copy
timing of S2 instruction
conclusion

Performance is important. The goal of this lecture is to give students some understanding of the performance issue at the machine code level. Don't be missled that optimisation at this level yields a big improvement. Usually, optimisation at the high level, such as changing the algorithm, changing the data structure, can yield order of magnitude improvement, while optimisation at the assembly language level results in much less improvement. 10-200% speed improvement can be observed using a simple code optimisation discussed in this lecture.

By inspecting the assembly code that is outputed from a compiler, the code can be improved to make the code smaller and/or faster. This process is called "peep-hole optimisation". The rules are heuristic and simply stated as follows:

1) Eliminate redundant code.
2) Keep variables in registers as much as possible.
3) Substitute long sequence of codes with a shorter one.

For example, the sequence

ld r1 A
st A r1

st A r1
ld r1 A

the second line (emphasized) is redundant and it can be eliminated without changing the meaning of program.

Example 1 simple sequence

A = B + C
C = A + 1
ld    r1 B
ld    r2 C
add   r1 r1 r2
st    A r1
ld    r1 A
add   r1 r1 #1
st    C r1

the line with underline can be eliminated.

Example 2 simple loop

while i < 0
if A > 2 then B = 1
           else i = i +1
:while ld    r1 i
        sub   r0 r1 r0
        jmp   ge exit
        ld    r1 A
        sub   r0 r1 #2
        jmp   le else
        ld    r1 #1
        st    B r1
        jmp   always out
:else   ld    r1 i
        add   r1 r1 #1
        st    i r1
:out    jmp   always while
:exit

Let keep i in r1 , the line "ld r1 i" can be moved out of loop and the pair of ld/st i can be eliminated. Other use of r1 is renamed to r2. The "jmp always out" is changed to while. The result is:

        ld    r1 i
:while sub   r0 r1 r0
        jmp   ge exit
        ld    r2 A
        sub   r0 r2 #2
        jmp   le else
        ld    r2 #1
        st    B r2
        jmp   always while
:else   add   r1 r1 #1
:out    jmp   always while
:exit

Example 3 logical statement

Show a more complicate statement which requires temporary variables.

B = (A >= 0) && (A < 30)

Two temporary variables are needed to store the boolean value or A >= 0 and A < 30. Let T1, T2 be temporary variables.

        ld     r1 A
        sub    r0 r1 r0
        jmp    ge true1
        ld     T1 #0
        jmp    always out1
:true1 ld     T1 #1
:out1
        ld     r1 A
        sub    r0 r1 #30
        jmp    lt true2
        ld     T2 #0
        jmp    always out2
:true2 ld     T2 #1
:out2
        ld     r1 T1
        ld     r2 T2
        and    r1 r1 r2
        st     B r1

Let keep A in r1, T1 in r2 and T2 in r3

        ld     r1 A
        sub    r0 r1 r0
        jmp    ge true1
        add    r2 r0 r0
        jmp    always out1
:true1 add    r2 r0 #1
:out1
        sub    r0 r1 #30
        jmp    lt true2
        add    r3 r0 r0
        jmp    always out2
:true2 add    r3 r0 #1
:out2
        and    r1 r2 r3
        st     B r1

Example 4 factoring

Sometime arithmetic expression can be simplified to reduce the number of operation (factoring).

a = (b*c) + (b*d)
   ld   r1 b
   ld   r2 c
   mul r1 r1 r2
   ld   r2 b
   ld   r3 d
   mul r2 r2 r3
   add r1 r1 r2
   st   a r1

can be factored to

a = b*(c+d)

   ld   r1 c
   ld   r2 d
   add r1 r1 r2
   ld   r2 b
   mul r1 r2 r1
   st   a r1

which reduce two multiplies to one multiply.

As each instruction takes different amount of time to execute, to understand the speed improvement of code optimisation, one must take into account such difference. Let assume the following timing for S2 instruction set:

Timing

ld/st from memory   take    4 clocks
jmp                                   2 clocks
mul, div                             3 clocks
others                                1 clock

Example 5 string copy

Copy string. Assume a string is stored in the array and is terminated by 0. The following program copy string B to string A.

a[100], b[100]
i = 0
c = b[i]
while c != 0
    a[i] = b[i]
    i = i + 1
    c = b[i]

This section of code comes from the output of RZ compiler. c and i are local variables and are accessed through offset from base pointer (bp pointed to an activation record). c is @-1, i is @-2. A base address is 2001, B base address is 2011.

:label_164
   ld   r1 @-1 bp
   sub rz r1 #0
   jmp eq label_216     ;; while(c != 0){

   ld   r1 #2001
   ld   r2 @-2 bp
   add r1 r1 r2
   ld   r2 #2011
   ld   r3 @-2 bp
   add r2 r2 r3
   ld   r2 @0 r2
   st   @0 r1 r2        ;; a[i] = b[i];
   ld   r1 @-2 bp
   add r1 r1 #1
   st   @-2 bp r1       ;; i = i + 1;
   ld   r1 #2011
   ld   r2 @-2 bp
   add r1 r1 r2
   ld   r1 @0 r1
   st   @-1 bp r1        ;; c = b[i];
   jmp always label_164
:label_216

Total 52 clocks per character.

What we do is:
1) using registers as much as possible, c in r5, i in r6, &a in r7, &b in r8
2) use a shorter sequence of code.
when computing the address of an element in an array, instead of using this simple sequence to get the value of b[i]: (the code generator of RZ compiler is simple minded)

   ld   r2 #2011
   ld   r3 @-2 bp
   add r2 r2 r3
   ld   r2 @0 r2

One can use a more appropriate addressing available in S2, the indexing mode:

ld r2 +r8 r6

remember that &b is in r8 and i is in r6. Now the resulting code becomes:

:label_164
   sub rz r5 #0
   jmp eq label_216       ;; while(c != 0){

   ld   r2 +r8 r6
   st   +r7 r6 r5          ;; a[i] = b[i];

   add r6 r6 #1           ;; i = i + 1;

   ld   r5 +r8 r6          ;; c = b[i];
   jmp always label_164

which takes only 19 clocks per charactor. This is almost 3 times (300%) faster than the original one.

Conclusion

A simple code optimisation can yield a satisfactory benefit. It is easy to implement in a compiler, therefore it is actually included in virtually all compilers. However, there are many more optimisation done in a production compiler and many are very processor dependent (this means it works for a particular machine). Don't be carried away by trying to do optimisation at this level too much, remember, optimisation at the high level yields much higher benefit.

last update 25 June 2003