performance limitThe term 'superscalar' describes a computer implementation that improves performance by concurrent execution of scalar instructions (more than one instruction per cycle). 'Scalar' processor is a processor that execute one instruction at a time. Superscalar allows concurrent execution of instructions 'in the same' pipeline stage. Superscalar is a machine that is designed to improve the performance of the execution of scalar instructions. (as opposed to vector processors operate on vectors)
instruction vs machine parallelism
instruction issue policy
register renaming
loop unrolling
long instruction word
example : PowerPC601, Pentium
Performance limit of superscalar:
Instruction-issue refers to the process of initiating instruction execution in the processor's functional units. Instruction-issue policy limits affects performance because it determines the processor's 'lookahead' capability; that is, the ability of the processor to examine instructions beyond the current point of execution in hopes of finding independent instructions to execute.
Example
Assume a superscalar capable of fetching and decoding two instructions
at a time, having three separate functional units, two writeback stages.
I1 requires two cycles to execute
I3 and I4 conflict for the same functional unit.
I5 depends on the value produced by I4.
I5 and I6 conflict for a functional unit.
Figure Executing six instructions by different issue policies
I1, I2 RAW
I1, I3 WAW
I2, I3 WAR
I3, I4 RAW
Register renaming to avoid conflict of resources
R3b = R3a op R5a I1
R4b = R3b + 1 I2
R3c = R5a + 1 I3
R7b = R3c op R4b I4
R1 initially the address of the element
F2 scalar value, s
assume element with lowest address is at zero
assume delay slot
loop: LD F0,0(R1) ; F0=array
element
ADDD F4,F0,F2
; add scalar in F2
SD 0(R1),F4
; store result
SUBI R1,R1,#8
; decrement pointer
; 8 bytes per DW
BNEZ R1,loop
; branch R1 != zero
without any scheduling the loop will stall
loop : LD F0,0(R1) 1
stall
2
ADDD F4,F0,F2 3
stall
4
stall
5
SD 0(R1),F4 6
SUBI R1,R1,#8 7
stall
8
BNEZ R1,loop 9
stall
10
one stall for LD
two for ADDD
one for SUBI
one for delayed branch
Eleminate three branches and three decrement of R1. Without scheduling it will stall. This loop run in 28 clocks, each LD has 1 stall, each ADDD has 2, SUBI has 1, the branch 1, plus 14 instruction issue cycles. Total 7 clocks per four elements. Unrolling improves the performance by eliminating overhead instructions.
Schedule the unrolled loop :
loop : LD F0,0(R1)
LD F6,-8(R1)
LD F10,-16(R1)
LD F14,-24(R1)
ADDD F4,F0,F2
ADDD F8,F6,F2
ADDD F12,F10,F2
ADDD F16,F14,F2
SD 0(R1),F4
SD -8(R1),F8
SUBI R1,R1,#32
SD 16(R1),F12
BNEZ R1,loop
SD 8(R1),F16 ; 8-32 = -24
Execution time dropped to 14 clocks, or 3.5 clocks per element.
Int F D X M W
FP F D X M W
Int F D X M W
FP F D X M W
. . .
Superscalar version
loop : LD F0,0(R1)
LD F6,-8(R1)
LD F10,-16(R1)
ADDD F4,F0,F2
LD F14,-24(R1)
ADDD F8,F6,F2
LD F18,-32(R1)
ADDD F12,F10,F2
SD 0(R1),F4
ADDD F16,F14,F2
SD -8(R1),F8
ADDD F20,F18,F2
SD -16(R1),F12
SUBI R1,R1,#40
SD 16(R1),F16
BNEZ R1,loop
SD 8(R1),F20
The unrolled superscalar runs in 12 clocks, 2.4 clocks per element.
Suppose a VLIW could issue two memory references, two FP op, one integer
op or branch in every clock. Unroll loop 7 times, this loop takes
9 clocks assuming no branch delay. It yields 7 results in 9 clocks
or 1.29 clocks per element. The issue rate is 23 operations in 9
clocks or 2.5 operation per clock.
Mem Ref 1 | Mem Ref 2 | FP op 1 | FP op 2 | Int op / Branch |
LD F0,0(R1) | LD F6,-8(R1) | |||
LD F10,-16(R1) | LD F14,-24(R1) | |||
LD F18,-32(R1) | LD F22,-40(R1) | ADD F4,F0,F2 | ADD F8,F6,F2 | |
LD F26,-48(R1) | ADD F12,F10,F2 | ADD F16,F14,F2 | ||
ADD F20,F18,F2 | ADD F24,F22,F2 | |||
SD 0(R1),F4 | SD -8(R1),F8 | ADDF28,F26,F2 | ||
SD –16(R1),F12 | SD –24(R1), F16 | |||
SD –32(R1), F20 | SD –40(R1),F24 | SUBI R1,R1,#56 | ||
SD 8(R1), F28 | BNEZ R1,LOOP |
Example IBM PowerPC601, Intel Pentium
Figure of PowerPC601 and Pentium pipeline
PowerPC 601
has many functional units which have different pipeline:
branch inst. | Fetch | Dispatch Decode Execute Predict | ||||
Integer inst. | Fetch | Dispatch Decode | Execute | Writeback | ||
Load/Store | Fetch | Dispatch Decode | Ads Gen | Cache | Writeback | |
FP inst. | Fetch | Dispatch | Decode | Execute1 | Execute2 | Writeback |
PowerPC601 can issue branch and floating-point inst. out of order.
Branch processing employs fixed rule to reduce stall cycle :
1. Scan the dispatch buffer (8 deep) for branch instructions.
Target address are generated.
2. Determine the outcome of conditional branches :
a. will be taken : for uncond. and for known condition
code and indicate branching
b. will not be taken : for uncond. and for known condition
code and indicate no branching
c. outcome cannot yet be determine : for backward branch
guess taken, for forward branch guess not taken.
The designer did not use branch history for the reason that it will
achieve minimum payoff.
Pentium
has 5 stages pipeline , two integer units
1. Prefetch
2. Decode stage 1 (inst. pairing)
3. Decode stage 2 (ads. gen.)
4. Execute
5. Writeback
Instruction Pairing rule
1. both inst. are simple (hardwired)
2. no RAW, WAW
3. no displacement and immediate operand
Simple inst. : mov, alu r r , shift, inc, dec, pop, lea, jump call jcc near
Branch prediction : dynamic based on history. A branch target
buffer (BTB) stores branch destination address associated with the current
branch instruction. Once the instruction is executed the history
is updated. The BTB is 4-way set associative cache with 256 lines.
Each entry uses the address of branch instruction as a tag. The value
field contains the branch destination address for the last time this branch
was taken and a two-bit history field.