R1 initially the address of the element
F2 scalar value, s
assume element with lowest address is at zero
assume delay slot
loop: LD F0,0(R1) ; F0=array
element
ADDD F4,F0,F2
; add scalar in F2
SD 0(R1),F4
; store result
SUBI R1,R1,#8
; decrement pointer
; 8 bytes per DW
BNEZ R1,loop
; branch R1 != zero
without any scheduling the loop will stall
loop : LD F0,0(R1) 1
stall
2
ADDD F4,F0,F2 3
stall
4
stall
5
SD 0(R1),F4 6
SUBI R1,R1,#8 7
stall
8
BNEZ R1,loop 9
stall
10
one stall for LD
two for ADDD
one for SUBI
one for delayed branch
Eleminate three branches and three decrement of R1. Without scheduling it will stall. This loop run in 28 clocks, each LD has 1 stall, each ADDD has 2, SUBI has 1, the branch 1, plus 14 instruction issue cycles. Total 7 clocks per four elements. Unrolling improves the performance by eliminating overhead instructions.
Schedule the unrolled loop :
loop : LD F0,0(R1)
LD F6,-8(R1)
LD F10,-16(R1)
LD F14,-24(R1)
ADDD F4,F0,F2
ADDD F8,F6,F2
ADDD F12,F10,F2
ADDD F16,F14,F2
SD 0(R1),F4
SD -8(R1),F8
SUBI R1,R1,#32
SD 16(R1),F12
BNEZ R1,loop
SD 8(R1),F16 ; 8-32 = -24
Execution time dropped to 14 clocks, or 3.5 clocks per element.
Int F D X M W
FP F D X M W
Int F D X M W
FP F D X M W
. . .
Superscalar version
loop : LD F0,0(R1)
LD F6,-8(R1)
LD F10,-16(R1)
ADDD F4,F0,F2
LD F14,-24(R1)
ADDD F8,F6,F2
LD F18,-32(R1)
ADDD F12,F10,F2
SD 0(R1),F4
ADDD F16,F14,F2
SD -8(R1),F8
ADDD F20,F18,F2
SD -16(R1),F12
SUBI R1,R1,#40
SD 16(R1),F16
BNEZ R1,loop
SD 8(R1),F20
The unrolled superscalar runs in 12 clocks, 2.4 clocks per element.
Mem Ref 1 | Mem Ref 2 | FP op 1 | FP op 2 | Int op / Branch |
LD F0,0(R1) | LD F6,-8(R1) | |||
LD F10,-16(R1) | LD F14,-24(R1) | |||
LD F18,-32(R1) | LD F22,-40(R1) | ADD F4,F0,F2 | ADD F8,F6,F2 | |
LD F26,-48(R1) | ADD F12,F10,F2 | ADD F16,F14,F2 | ||
ADD F20,F18,F2 | ADD F24,F22,F2 | |||
SD 0(R1),F4 | SD -8(R1),F8 | ADDF28,F26,F2 | ||
SD –16(R1),F12 | SD –24(R1), F16 | |||
SD –32(R1), F20 | SD –40(R1),F24 | SUBI R1,R1,#56 | ||
SD 8(R1), F28 | BNEZ R1,LOOP |