example 1 simle sequence
example 2 simple loop
example 3 logical statement
example 4 factoring
example 5 string copy
timing of S2 instruction
conclusion
Performance is important. The goal of this lecture is to give
students some understanding of the performance issue at the machine code
level. Don't be missled that optimisation at this level yields a
big improvement. Usually, optimisation at the high level, such as
changing the algorithm, changing the data structure, can yield order of
magnitude improvement, while optimisation at the assembly language level
results in much less improvement. 10-200% speed improvement can be
observed using a simple code optimisation discussed in this lecture.
By inspecting the assembly code that is outputed from a compiler, the code can be improved to make the code smaller and/or faster. This process is called "peep-hole optimisation". The rules are heuristic and simply stated as follows:
1) Eliminate redundant code.
2) Keep variables in registers as much as possible.
3) Substitute long sequence of codes with a shorter one.
For example, the sequence
ld r1 Aor
st A r1
st A r1the second line (emphasized) is redundant and it can be eliminated without changing the meaning of program.
ld r1 A
A = B + Cthe line with underline can be eliminated.
C = A + 1ld r1 B
ld r2 C
add r1 r1 r2
st A r1
ld r1 A
add r1 r1 #1
st C r1
while i < 0
if A > 2 then B = 1
else i = i +1:while ld r1 i
sub r0 r1 r0
jmp ge exit
ld r1 A
sub r0 r1 #2
jmp le else
ld r1 #1
st B r1
jmp always out
:else ld r1 i
add r1 r1 #1
st i r1
:out jmp always while
:exit
Let keep i in r1 , the line "ld r1 i" can be moved out of
loop and the pair of ld/st i can be eliminated. Other use of r1 is
renamed to r2. The "jmp always out" is changed to while.
The result is:
ld r1 i
:while sub r0 r1 r0
jmp ge exit
ld r2 A
sub r0 r2 #2
jmp le else
ld r2 #1
st B r2
jmp always while
:else add r1 r1 #1
:out jmp always while
:exit
B = (A >= 0) && (A < 30)Two temporary variables are needed to store the boolean value or A >= 0 and A < 30. Let T1, T2 be temporary variables.
ld r1 A
sub r0 r1 r0
jmp ge true1
ld T1 #0
jmp always out1
:true1 ld T1 #1
:out1
ld r1 A
sub r0 r1 #30
jmp lt true2
ld T2 #0
jmp always out2
:true2 ld T2 #1
:out2
ld r1 T1
ld r2 T2
and r1 r1 r2
st B r1
Let keep A in r1, T1 in r2 and T2 in r3
ld r1 A
sub r0 r1 r0
jmp ge true1
add r2 r0 r0
jmp always out1
:true1 add r2 r0 #1
:out1
sub r0 r1 #30
jmp lt true2
add r3 r0 r0
jmp always out2
:true2 add r3 r0 #1
:out2
and r1 r2 r3
st B r1
a = (b*c) + (b*d)can be factored told r1 b
ld r2 c
mul r1 r1 r2
ld r2 b
ld r3 d
mul r2 r2 r3
add r1 r1 r2
st a r1
a = b*(c+d)
ld r1 cwhich reduce two multiplies to one multiply.
ld r2 d
add r1 r1 r2
ld r2 b
mul r1 r2 r1
st a r1
As each instruction takes different amount of time to execute, to understand the speed improvement of code optimisation, one must take into account such difference. Let assume the following timing for S2 instruction set:
a[100], b[100]This section of code comes from the output of RZ compiler. c and i are local variables and are accessed through offset from base pointer (bp pointed to an activation record). c is @-1, i is @-2. A base address is 2001, B base address is 2011.
i = 0
c = b[i]
while c != 0
a[i] = b[i]
i = i + 1
c = b[i]
:label_164Total 52 clocks per character.
ld r1 @-1 bp
sub rz r1 #0
jmp eq label_216 ;; while(c != 0){
ld r1 #2001
ld r2 @-2 bp
add r1 r1 r2ld r2 #2011
ld r3 @-2 bp
add r2 r2 r3
ld r2 @0 r2
st @0 r1 r2 ;; a[i] = b[i];ld r1 @-2 bp
add r1 r1 #1
st @-2 bp r1 ;; i = i + 1;ld r1 #2011
ld r2 @-2 bp
add r1 r1 r2
ld r1 @0 r1
st @-1 bp r1 ;; c = b[i];
jmp always label_164
:label_216
What we do is:
1) using registers as much as possible, c in r5, i in r6, &a
in r7, &b in r8
2) use a shorter sequence of code.
when computing the address of an element in an array, instead of using
this simple sequence to get the value of b[i]: (the code generator
of RZ compiler is simple minded)
ld r2 #2011One can use a more appropriate addressing available in S2, the indexing mode:
ld r3 @-2 bp
add r2 r2 r3
ld r2 @0 r2
ld r2 +r8 r6
remember that &b is in r8 and i is in r6. Now the resulting code becomes:
:label_164which takes only 19 clocks per charactor. This is almost 3 times (300%) faster than the original one.
sub rz r5 #0
jmp eq label_216 ;; while(c != 0){
ld r2 +r8 r6
st +r7 r6 r5 ;; a[i] = b[i];
add r6 r6 #1 ;; i = i + 1;
ld r5 +r8 r6 ;; c = b[i];
jmp always label_164
last update 25 June 2003