R1 initially the address of the element
F2 scalar value, s
assume element with lowest address is at zero
assume delay slot
loop: LD F0,0(R1)       ; F0=array
element
    ADDD F4,F0,F2      
; add scalar in F2
    SD 0(R1),F4        
; store result
    SUBI R1,R1,#8      
; decrement pointer
                       
; 8 bytes per DW
    BNEZ  R1,loop      
; branch R1 != zero
without any scheduling the loop will stall
loop :  LD F0,0(R1) 1
    stall          
2
    ADDD F4,F0,F2   3
    stall          
4
    stall          
5
    SD 0(R1),F4     6
    SUBI R1,R1,#8   7
    stall          
8
    BNEZ  R1,loop   9
    stall         
10
one stall for LD
two for ADDD
one for SUBI
one for delayed branch
Eleminate three branches and three decrement of R1. Without scheduling it will stall. This loop run in 28 clocks, each LD has 1 stall, each ADDD has 2, SUBI has 1, the branch 1, plus 14 instruction issue cycles. Total 7 clocks per four elements. Unrolling improves the performance by eliminating overhead instructions.
Schedule the unrolled loop :
loop :  LD F0,0(R1)
    LD F6,-8(R1)
    LD F10,-16(R1)
    LD F14,-24(R1)
    ADDD F4,F0,F2
    ADDD F8,F6,F2
    ADDD F12,F10,F2
    ADDD F16,F14,F2
    SD 0(R1),F4
    SD -8(R1),F8
    SUBI R1,R1,#32
    SD 16(R1),F12
    BNEZ R1,loop
    SD 8(R1),F16 ; 8-32 = -24
Execution time dropped to 14 clocks, or 3.5 clocks per element.
Int   F D X M W
FP    F D X M W
Int     F D X M W
FP      F D X M W
. . .
Superscalar version
loop :  LD F0,0(R1)
    LD F6,-8(R1)
    LD F10,-16(R1)     
ADDD F4,F0,F2
    LD F14,-24(R1)     
ADDD F8,F6,F2
    LD F18,-32(R1)     
ADDD F12,F10,F2
    SD  0(R1),F4       
ADDD F16,F14,F2
    SD -8(R1),F8       
ADDD F20,F18,F2
    SD -16(R1),F12
    SUBI R1,R1,#40
    SD 16(R1),F16
    BNEZ R1,loop
    SD 8(R1),F20
The unrolled superscalar runs in 12 clocks, 2.4 clocks per element.
 
| Mem Ref 1 | Mem Ref 2 | FP op 1 | FP op 2 | Int op / Branch | 
| LD F0,0(R1) | LD F6,-8(R1) | |||
| LD F10,-16(R1) | LD F14,-24(R1) | |||
| LD F18,-32(R1) | LD F22,-40(R1) | ADD F4,F0,F2 | ADD F8,F6,F2 | |
| LD F26,-48(R1) | ADD F12,F10,F2 | ADD F16,F14,F2 | ||
| ADD F20,F18,F2 | ADD F24,F22,F2 | |||
| SD 0(R1),F4 | SD -8(R1),F8 | ADDF28,F26,F2 | ||
| SD –16(R1),F12 | SD –24(R1), F16 | |||
| SD –32(R1), F20 | SD –40(R1),F24 | SUBI R1,R1,#56 | ||
| SD 8(R1), F28 | BNEZ R1,LOOP |