Superscalar

performance limit
instruction vs machine parallelism
instruction issue policy
register renaming
loop unrolling
long instruction word
example: PowerPC 601, Pentium
The term 'superscalar' describes a computer implementation that improves performance by concurrent execution of scalar instructions (more than one instruction per cycle).  A 'scalar' processor is a processor that executes one instruction at a time.  A superscalar processor allows concurrent execution of instructions in the same pipeline stage.  It is a machine designed to improve the performance of scalar instructions (as opposed to vector processors, which operate on vectors).

Performance limit of superscalar:

Instruction parallelism is a measure of the average number of instructions that a superscalar processor might be able to execute at the same time.  Machine parallelism is a measure of the ability of the processor to take advantage of that instruction-level parallelism.  Machine parallelism is determined by the number of instructions that can be fetched and executed at the same time, and by the speed and sophistication of the mechanisms the processor uses to find independent instructions.

Instruction issue refers to the process of initiating instruction execution in the processor's functional units.  The instruction-issue policy affects performance because it determines the processor's 'lookahead' capability; that is, the ability of the processor to examine instructions beyond the current point of execution in the hope of finding independent instructions to execute.

The most aggressive policy uses an 'instruction window': the processor continues to decode instructions regardless of conflicts in the functional units, placing them in the window, and then examines the instructions in the window to find those that can be executed (no resource conflict or dependencies).
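
As a rough illustration, here is a minimal sketch of that idea in C, assuming a simplified, hypothetical instruction descriptor (one destination, two sources, one required functional unit); real issue logic tracks far more state.

    #include <stdbool.h>

    /* Hypothetical descriptor: one destination register, two
       source registers, and the functional unit the instruction
       needs. */
    typedef struct {
        int dst, src1, src2;   /* register numbers         */
        int unit;              /* required functional unit */
        bool issued;
    } Instr;

    /* An instruction can issue only if its functional unit is free
       and it has no RAW, WAW, or WAR hazard against any instruction
       still in flight. */
    bool can_issue(const Instr *in, const Instr *inflight, int n,
                   const bool *unit_busy)
    {
        if (unit_busy[in->unit])
            return false;                         /* resource conflict */
        for (int i = 0; i < n; i++) {
            const Instr *f = &inflight[i];
            if (in->src1 == f->dst || in->src2 == f->dst)
                return false;                     /* RAW */
            if (in->dst == f->dst)
                return false;                     /* WAW */
            if (in->dst == f->src1 || in->dst == f->src2)
                return false;                     /* WAR */
        }
        return true;
    }

    /* Scan the window and return the index of the first issuable
       instruction, or -1 if none is ready this cycle. */
    int find_ready(const Instr *window, int wsize,
                   const Instr *inflight, int n, const bool *unit_busy)
    {
        for (int i = 0; i < wsize; i++)
            if (!window[i].issued &&
                can_issue(&window[i], inflight, n, unit_busy))
                return i;
        return -1;
    }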

Example
Assume a superscalar processor capable of fetching and decoding two instructions at a time, with three separate functional units and two writeback stages.

I1 requires two cycles to execute
I3 and I4 conflict for the same functional unit.
I5 depends on the value produced by I4.
I5 and I6 conflict for a functional unit.

Figure: Executing six instructions under different issue policies

Register renaming

R3 = R3 op R5    I1
R4 = R3 + 1      I2
R3 = R5 + 1      I3
R7 = R3 op R4    I4

I1, I2  RAW on R3 (true data dependence)
I1, I3  WAW on R3 (output dependence)
I2, I3  WAR on R3 (antidependence)
I3, I4  RAW on R3 (true data dependence)

Register renaming is used to avoid these register conflicts: each result is written to a fresh register, so the WAW and WAR hazards disappear and only the true (RAW) dependences remain.

R3b = R3a op R5a    I1
R4b = R3b + 1       I2
R3c = R5a + 1       I3
R7b = R3c op R4b    I4
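
A minimal sketch of the renaming mechanism in C (a hypothetical rename map for illustration; real implementations also recycle physical registers and rename a whole issue group at once):

    #define NUM_ARCH_REGS 32

    static int rename_map[NUM_ARCH_REGS]; /* arch reg -> current physical reg */
    static int next_phys;                 /* next free physical register      */

    /* Start with each architectural register mapped to a distinct
       physical register (the 'a' versions in the notation above). */
    void rename_init(void)
    {
        for (int r = 0; r < NUM_ARCH_REGS; r++)
            rename_map[r] = r;
        next_phys = NUM_ARCH_REGS;
    }

    /* A source operand reads the newest mapping, preserving the
       true (RAW) dependence on the most recent writer. */
    int rename_src(int arch) { return rename_map[arch]; }

    /* A destination gets a fresh physical register, so a later
       write to the same architectural register no longer conflicts
       with earlier writes (WAW) or earlier readers (WAR). */
    int rename_dst(int arch)
    {
        rename_map[arch] = next_phys++;
        return rename_map[arch];
    }

Applying rename_src/rename_dst to I1..I4 in program order reproduces the pattern above: only the RAW dependences (I1 to I2, I3 to I4) remain, so I3 no longer has to wait behind I1 and I2.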

Loop Scheduling / Loop Unrolling

for (i=1; i<=1000; i++)
    x[i] = x[i] + s;

R1 initially holds the address of the element with the highest address
F2 holds the scalar value s
assume the element with the lowest address is at address 8, so the loop exits when R1 reaches zero
assume a delayed branch (one delay slot)

loop: LD F0,0(R1)       ; F0=array element
    ADDD F4,F0,F2       ; add scalar in F2
    SD 0(R1),F4         ; store result
    SUBI R1,R1,#8       ; decrement pointer
                        ; 8 bytes per DW
    BNEZ  R1,loop       ; branch R1 != zero

Without any scheduling, the loop will stall:

loop :  LD F0,0(R1) 1
    stall           2
    ADDD F4,F0,F2   3
    stall           4
    stall           5
    SD 0(R1),F4     6
    SUBI R1,R1,#8   7
    stall           8
    BNEZ  R1,loop   9
    stall          10

one stall for LD
two for ADDD
one for SUBI
one for the delayed branch
giving 10 clocks per element

Schedule the loop to reduce stalls:

loop :  LD F0,0(R1) 1
    SUBI R1,R1,#8   2
    ADDD F4,F0,F2   3
    stall           4
    BNEZ  R1,loop   5
    SD 8(R1),F4     6   ; in delay slot; offset 8 because R1 is already decremented

Loop unrolling (here, four times) can eliminate branches:

loop :  LD F0,0(R1)
    ADDD F4,F0,F2
    SD 0(R1),F4
    LD F6,-8(R1)
    ADDD F8,F6,F2
    SD -8(R1),F8
    LD F10,-16(R1)
    ADDD F12,F10,F2
    SD -16(R1),F12
    LD F14,-24(R1)
    ADDD F16,F14,F2
    SD -24(R1),F16
    SUBI R1,R1,#32
    BNEZ R1,loop

This eliminates three branches and three decrements of R1.  Without scheduling it will still stall: the loop runs in 28 clocks, since each LD has 1 stall, each ADDD has 2, SUBI has 1, and the branch 1 (14 stall cycles), plus 14 instruction issue cycles.  That is 7 clocks per element, down from 10.  Unrolling improves performance by eliminating overhead instructions.
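
At the source level, the four-times-unrolled loop corresponds roughly to this sketch (the function name is illustrative; it assumes the trip count, 1000, is a multiple of 4, as it is here):

    /* Unrolled by 4: one compare/branch and one index update now
       cover four elements instead of one. */
    void add_scalar(double x[], double s)
    {
        for (int i = 1; i <= 1000; i += 4) {
            x[i]     = x[i]     + s;
            x[i + 1] = x[i + 1] + s;
            x[i + 2] = x[i + 2] + s;
            x[i + 3] = x[i + 3] + s;
        }
    }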

Schedule the unrolled loop:

loop :  LD F0,0(R1)
    LD F6,-8(R1)
    LD F10,-16(R1)
    LD F14,-24(R1)
    ADDD F4,F0,F2
    ADDD F8,F6,F2
    ADDD F12,F10,F2
    ADDD F16,F14,F2
    SD 0(R1),F4
    SD -8(R1),F8
    SUBI R1,R1,#32
    SD 16(R1),F12
    BNEZ R1,loop
    SD 8(R1),F16 ; 8-32 = -24

Execution time dropped to 14 clocks, or 3.5 clocks per element.

Superscalar

Assume two instructions can be issued per clock cycle: one can be a load, store, branch, or integer operation, and the other a floating-point operation.

Int   F D X M W
FP    F D X M W
Int     F D X M W
FP      F D X M W
. . .

Superscalar version

loop :  LD F0,0(R1)
    LD F6,-8(R1)
    LD F10,-16(R1)      ADDD F4,F0,F2
    LD F14,-24(R1)      ADDD F8,F6,F2
    LD F18,-32(R1)      ADDD F12,F10,F2
    SD  0(R1),F4        ADDD F16,F14,F2
    SD -8(R1),F8        ADDD F20,F18,F2
    SD -16(R1),F12
    SUBI R1,R1,#40
    SD 16(R1),F16
    BNEZ R1,loop
    SD 8(R1),F20

The unrolled superscalar version runs in 12 clocks, or 2.4 clocks per element.

Long Instruction Word

A superscalar processor uses dynamic scheduling, i.e. the hardware controls instruction issue dynamically.  The LIW (long instruction word) architecture, now VLIW (very long instruction word), instead uses static scheduling: it depends on a compiler to find concurrent instructions and rearrange them into a long instruction word, typically 120-200 bits.  Visualise a processor without an instruction set, just direct control of the hardware, i.e. at the level of a microprogram.  The compiler performs the scheduling of parallel execution.  Since the hardware can have multiple functional units, we can schedule as many of them as possible to execute concurrently.  The limit is instruction parallelism.  A basic block is a sequence of code without branches, on average about 10 lines of assembly.  The number of instructions in a basic block, i.e. straight-line code, must be enough to sustain parallel execution of the functional units.  One simple technique for enlarging it is loop unrolling.  More advanced techniques require inter-block analysis, so-called "trace scheduling", which analyses the sequences of instructions actually executed.

Suppose a VLIW could issue two memory references, two FP operations, and one integer operation or branch in every clock.  Unrolling the loop 7 times, the loop takes 9 clocks assuming no branch delay.  It yields 7 results in 9 clocks, or 1.29 clocks per element.  The issue rate is 23 operations in 9 clocks, or about 2.5 operations per clock.
 
 
Mem Ref 1         Mem Ref 2         FP op 1            FP op 2            Int op / Branch
LD F0,0(R1)       LD F6,-8(R1)
LD F10,-16(R1)    LD F14,-24(R1)
LD F18,-32(R1)    LD F22,-40(R1)    ADDD F4,F0,F2      ADDD F8,F6,F2
LD F26,-48(R1)                      ADDD F12,F10,F2    ADDD F16,F14,F2
                                    ADDD F20,F18,F2    ADDD F24,F22,F2
SD 0(R1),F4       SD -8(R1),F8      ADDD F28,F26,F2
SD -16(R1),F12    SD -24(R1),F16
SD -32(R1),F20    SD -40(R1),F24                                          SUBI R1,R1,#56
SD 8(R1),F28                                                              BNEZ R1,LOOP

Examples: IBM PowerPC 601, Intel Pentium

Figure: PowerPC 601 and Pentium pipelines

PowerPC 601
has many functional units, each with a different pipeline:

branch inst.    Fetch  Dispatch  Decode  Execute   Predict
integer inst.   Fetch  Dispatch  Decode  Execute   Writeback
load/store      Fetch  Dispatch  Decode  Addr Gen  Cache     Writeback
FP inst.        Fetch  Dispatch  Decode  Execute1  Execute2  Writeback

The PowerPC 601 can issue branch and floating-point instructions out of order.  Branch processing employs fixed rules to reduce stall cycles:
1. Scan the dispatch buffer (8 deep) for branch instructions and generate their target addresses.
2. Determine the outcome of conditional branches:
   a. will be taken: unconditional branches, and branches whose condition code is known and indicates branching
   b. will not be taken: branches whose condition code is known and indicates no branching
   c. outcome cannot yet be determined: for backward branches guess taken, for forward branches guess not taken.
The designers did not use branch history, reasoning that it would achieve minimal payoff.
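
Rule 2c, the static guess, is simple enough to state as code.  A sketch, assuming the target address has already been computed in step 1:

    #include <stdbool.h>

    /* Rule 2c: when the condition is not yet known, guess taken
       for backward branches (typically loops) and not taken for
       forward branches. */
    bool guess_taken(unsigned branch_addr, unsigned target_addr)
    {
        return target_addr < branch_addr;   /* backward branch */
    }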

Pentium
has a 5-stage pipeline and two integer units:
1. Prefetch
2. Decode stage 1 (instruction pairing)
3. Decode stage 2 (address generation)
4. Execute
5. Writeback

Instruction pairing rules
1. both instructions are simple (hardwired, not microcoded)
2. no RAW or WAW register dependence between them
3. neither instruction contains both a displacement and an immediate operand

Simple instructions: mov, ALU reg,reg, shift, inc, dec, pop, lea, and near jmp/call/jcc.
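
As a rough model of these rules (hypothetical instruction descriptors; not the actual Pentium decode logic, which examines opcodes and addressing-mode bytes), the pairing check might look like:

    #include <stdbool.h>

    typedef struct {
        bool simple;            /* hardwired, single-cycle class   */
        int  dst, src;          /* register numbers (-1 if unused) */
        bool has_disp, has_imm; /* addressing-mode fields present  */
    } X86Instr;

    /* True if u (first pipe) and v (second pipe) may issue together. */
    bool can_pair(const X86Instr *u, const X86Instr *v)
    {
        if (!u->simple || !v->simple) return false;     /* rule 1 */
        if (v->src == u->dst) return false;             /* rule 2: RAW */
        if (v->dst == u->dst) return false;             /* rule 2: WAW */
        if ((u->has_disp && u->has_imm) ||
            (v->has_disp && v->has_imm)) return false;  /* rule 3 */
        return true;
    }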

Branch prediction: dynamic, based on history.  A branch target buffer (BTB) stores the branch destination address associated with each branch instruction.  Once the branch is executed, its history is updated.  The BTB is a 4-way set-associative cache with 256 lines.  Each entry uses the address of the branch instruction as a tag.  The value field contains the branch destination address from the last time the branch was taken and a two-bit history field.
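
A sketch of the two-bit history mechanism: the entry layout follows the description above, but the counter encoding (0..3, with 2 and 3 predicting taken) is an assumption for illustration.

    #include <stdbool.h>

    /* One BTB entry: tagged by the branch instruction's address,
       holding the last taken target and a two-bit history field. */
    typedef struct {
        unsigned tag;      /* address of the branch instruction  */
        unsigned target;   /* destination last time it was taken */
        unsigned history;  /* two-bit saturating counter, 0..3   */
    } BTBEntry;

    bool predict_taken(const BTBEntry *e)
    {
        return e->history >= 2;
    }

    /* Once the branch executes, update the history: saturate the
       counter toward the actual outcome and remember the most
       recent taken target. */
    void btb_update(BTBEntry *e, bool taken, unsigned target)
    {
        if (taken) {
            if (e->history < 3) e->history++;
            e->target = target;
        } else {
            if (e->history > 0) e->history--;
        }
    }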