Superscalar

performance limit
instruction vs machine parallelism
instruction issue policy
register renaming
long instruction word
examples: PowerPC 601, Pentium

The term 'superscalar' describes a computer implementation that improves performance by executing scalar instructions concurrently (more than one instruction per cycle). A 'scalar' processor is a processor that executes one instruction at a time. A superscalar processor allows several instructions to be in the same pipeline stage at once. It is designed to speed up the execution of scalar instructions, as opposed to a vector processor, which operates on vectors.

The performance of a superscalar processor is limited by:

data dependencies,
procedural dependencies (control hazard),
and resource conflicts.

Instruction parallelism is a measure of the average number of instructions that a superscalar processor might be able to execute at the same time. Machine parallelism is a measure of the ability of the processor to take advantage of that instruction-level parallelism. Machine parallelism is determined by the number of instructions that can be fetched and executed at the same time, and by the speed and sophistication of the mechanisms that the processor uses to find independent instructions.

Instruction issue refers to the process of initiating instruction execution in the processor's functional units. The instruction-issue policy affects performance because it determines the processor's 'lookahead' capability; that is, the ability of the processor to examine instructions beyond the current point of execution in the hope of finding independent instructions to execute. Three issue policies are distinguished:

In-order issue with in-order completion
In-order issue with out-of-order completion
Out-of-order issue with out-of-order completion

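Out-of-order issue decouples decoding from execution: the processor continues to decode regardless of conflicts in the functional units and places the decoded instructions in an 'instruction window'. It then examines the instructions in the window to find those that can be executed (no resource conflict and no dependencies).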

Example
Assume a superscalar processor capable of fetching and decoding two instructions at a time, with three separate functional units and two writeback stages.

I1 requires two cycles to execute
I3 and I4 conflict for the same functional unit.
I5 depends on the value produced by I4.
I5 and I6 conflict for a functional unit.

Figure Executing six instructions by different issue policies
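
One selection step of out-of-order issue on this example can be sketched as follows. This is a minimal sketch, not a cycle-accurate model: the unit names U1 and U2, the Instr record and the function select_issuable are all hypothetical, and the window is assumed to hold I4, I5 and I6 after I1 to I3 have completed.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    name: str
    unit: str                               # functional unit needed (hypothetical names)
    deps: set = field(default_factory=set)  # instructions whose results are needed


def select_issuable(window, completed, busy_units):
    """Pick the instructions in the window that can start executing this cycle."""
    issued = []
    claimed = set(busy_units)       # units already occupied in this cycle
    for instr in window:            # scan the window, oldest instruction first
        if not instr.deps <= completed:
            continue                # data dependency: an operand is not ready
        if instr.unit in claimed:
            continue                # resource conflict: the unit is taken
        claimed.add(instr.unit)
        issued.append(instr.name)
    return issued


# Constraints from the example: I5 depends on I4; I5 and I6 share a unit.
window = [
    Instr("I4", unit="U1"),
    Instr("I5", unit="U2", deps={"I4"}),
    Instr("I6", unit="U2"),
]
issued = select_issuable(window, completed={"I1", "I2", "I3"}, busy_units=set())
print(issued)  # ['I4', 'I6']
```

I6 is issued ahead of I5 because I5 still waits for I4; this is the lookahead that in-order issue cannot provide.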

Register renaming

R3 = R3 op R5    I1
R4 = R3 + 1      I2
R3 = R5 + 1      I3
R7 = R3 op R4    I4

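Dependencies in this sequence:
I1, I2  RAW (I2 reads R3, which I1 writes)
I1, I3  WAW (both write R3)
I2, I3  WAR (I3 writes R3, which I2 reads)
I3, I4  RAW (I4 reads R3, which I3 writes)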
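
The classification above can be reproduced mechanically. The following is a minimal sketch, assuming each instruction is given as a (name, destination, sources) tuple; the function name find_hazards is hypothetical. Besides the four dependencies listed, it also reports a RAW between I2 and I4 through R4.

```python
def find_hazards(program):
    """program: list of (name, dest, sources). Returns (kind, earlier, later, register) tuples."""
    hazards = []
    last_writer = {}   # register -> most recent instruction that wrote it
    readers = {}       # register -> instructions that read it since that write
    for name, dest, sources in program:
        for reg in sources:
            if reg in last_writer:                      # read after write
                hazards.append(("RAW", last_writer[reg], name, reg))
        if dest in last_writer:                         # write after write
            hazards.append(("WAW", last_writer[dest], name, dest))
        for reader in readers.get(dest, []):            # write after read
            hazards.append(("WAR", reader, name, dest))
        for reg in sources:
            readers.setdefault(reg, []).append(name)
        last_writer[dest] = name
        readers[dest] = []   # earlier reads are now ordered through this write
    return hazards


program = [
    ("I1", "R3", ["R3", "R5"]),
    ("I2", "R4", ["R3"]),
    ("I3", "R3", ["R5"]),
    ("I4", "R7", ["R3", "R4"]),
]
hazards = find_hazards(program)
for kind, earlier, later, reg in hazards:
    print(earlier, later, kind, reg)
```

The sketch tracks only the most recent writer of each register, so I4 is reported as depending on I3 for R3 and not on I1.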

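With renaming, every write goes to a freshly allocated register (the suffixes a, b, c denote distinct physical registers). This removes the WAW and WAR dependencies; only the RAW dependencies remain:

R3b = R3a op R5a    I1
R4b = R3b + 1       I2
R3c = R5a + 1       I3
R7b = R3c op R4b    I4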
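
The renaming step can be sketched in a few lines. This is an illustration only, assuming the same (name, destination, sources) tuples and an unlimited supply of register versions labelled a, b, c, ...; real hardware allocates from a finite pool of physical registers. The function name rename is hypothetical.

```python
def rename(program):
    """program: list of (name, dest, sources) using architectural register names."""
    version = {}  # architectural register -> index of its current version (0 = 'a')

    def current(reg):
        return reg + chr(ord("a") + version.setdefault(reg, 0))

    renamed = []
    for name, dest, sources in program:
        new_sources = [current(reg) for reg in sources]  # read the latest versions first
        version[dest] = version.get(dest, 0) + 1         # then allocate a fresh destination
        renamed.append((name, current(dest), new_sources))
    return renamed


program = [
    ("I1", "R3", ["R3", "R5"]),
    ("I2", "R4", ["R3"]),
    ("I3", "R3", ["R5"]),
    ("I4", "R7", ["R3", "R4"]),
]
renamed = rename(program)
for name, dest, sources in renamed:
    print(name, dest, "<-", ", ".join(sources))
# I1 R3b <- R3a, R5a
# I2 R4b <- R3b
# I3 R3c <- R5a
# I4 R7b <- R3c, R4b
```

Sources are renamed before the destination so that an instruction such as I1, which reads and writes R3, reads the old version and writes the new one.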

Long Instruction Word

A superscalar processor uses dynamic scheduling, i.e. the hardware controls the issue of instructions at run time. The LIW (long instruction word) architecture, now usually called VLIW (very long instruction word), uses static scheduling instead: it depends on a compiler to find instructions that can execute concurrently and to pack them into a long instruction word, typically 120-200 bits. Visualise a processor without instructions in the usual sense, just direct control of the hardware, i.e. at the level of a microprogram. The compiler performs the scheduling of parallel execution. Since the hardware has multiple functional units, the compiler can schedule as many of them as it can keep busy to execute concurrently. The limit is instruction parallelism.

A basic block is a sequence of code without branching, on average about 10 lines of assembly. The number of instructions in a basic block, i.e. straight-line code, must be enough to sustain parallel execution of the functional units. One simple technique to enlarge it is loop unrolling. More advanced techniques require inter-block analysis, so-called "trace scheduling", which analyses the sequence of instructions actually executed.
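
Loop unrolling can be illustrated at source level. The sketch below is written in Python only to show the transformation; a VLIW compiler applies it to machine code, and no speed-up is claimed for Python itself. The function names and the unrolling factor of 4 are assumptions for illustration.

```python
def dot_rolled(a, b):
    """One multiply-add per iteration: a short basic block, then a branch."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_unrolled4(a, b):
    """Four independent multiply-adds per iteration: a longer basic block."""
    n = len(a)
    s0 = s1 = s2 = s3 = 0.0
    i = 0
    while i + 4 <= n:
        # The four statements touch different accumulators, so they have no
        # dependencies on each other and could be packed into one long word.
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
        i += 4
    while i < n:                 # leftover iterations when n is not a multiple of 4
        s0 += a[i] * b[i]
        i += 1
    return (s0 + s1) + (s2 + s3)


a = [float(i) for i in range(10)]
b = [2.0] * 10
print(dot_rolled(a, b), dot_unrolled4(a, b))  # 90.0 90.0
```

Using separate accumulators is what makes the unrolled statements independent; with a single accumulator each addition would still depend on the previous one.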

Examples: IBM PowerPC 601, Intel Pentium

Figure of PowerPC601 and Pentium pipeline

PowerPC 601
has several functional units, each with a different pipeline:

Dispatch unit holds instruction buffer
Branch processing unit handles all branch instructions
Floating-point unit
Integer unit

Branch inst.   Fetch   Dispatch/Decode/Execute/Predict
Integer inst.  Fetch   Dispatch/Decode   Execute       Writeback
Load/Store     Fetch   Dispatch/Decode   Address gen.  Cache      Writeback
FP inst.       Fetch   Dispatch          Decode        Execute 1  Execute 2  Writeback

The PowerPC 601 can issue branch and floating-point instructions out of order. Branch processing employs fixed rules to reduce stall cycles:
1. Scan the dispatch buffer (8 deep) for branch instructions. Target addresses are generated.
2. Determine the outcome of each branch:
   a. will be taken: for unconditional branches, and for conditional branches whose condition code is known and indicates branching
   b. will not be taken: for conditional branches whose condition code is known and indicates no branching
   c. outcome cannot yet be determined: for a backward branch guess taken, for a forward branch guess not taken.
The designers did not use branch history because they judged that it would bring only a minimal payoff.
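
The fixed rule of step 2 can be written as a small function. This is a sketch of the rule as stated above, not of the 601 hardware; the function name and parameters are hypothetical, and condition is None when the condition code is not yet known.

```python
def predict_taken(branch_addr, target_addr, unconditional=False, condition=None):
    """Return True if the branch is (or is guessed to be) taken."""
    if unconditional:
        return True                      # case a: always taken
    if condition is not None:
        return condition                 # cases a and b: outcome already known
    # case c: backward branch (e.g. end of a loop) -> guess taken,
    #         forward branch -> guess not taken
    return target_addr < branch_addr


print(predict_taken(0x200, 0x100))                   # backward, unknown: True
print(predict_taken(0x200, 0x300))                   # forward, unknown: False
print(predict_taken(0x200, 0x300, condition=True))   # known outcome: True
```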

Pentium
has a 5-stage pipeline and two integer units:
1. Prefetch
2. Decode stage 1 (instruction pairing)
3. Decode stage 2 (address generation)
4. Execute
5. Writeback

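Instruction pairing rules
1. both instructions are simple (hardwired)
2. no RAW or WAW dependency between them
3. neither instruction contains both a displacement and an immediate operand

Simple instructions: mov, ALU register-to-register, shift, inc, dec, pop, lea, and near jmp, call and jcc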
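
The three pairing rules can be checked as in the sketch below. It is a simplified illustration that reads rule 3 as "neither instruction has both a displacement and an immediate operand"; the Inst record and its fields are hypothetical, and the real decoder applies further restrictions not listed here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Inst:
    simple: bool                      # is it one of the simple (hardwired) instructions?
    dest: Optional[str]               # register written, if any
    sources: Tuple[str, ...] = ()     # registers read
    has_displacement: bool = False
    has_immediate: bool = False


def can_pair(first, second):
    """Return True if the two consecutive instructions may issue together."""
    if not (first.simple and second.simple):
        return False                                  # rule 1
    if first.dest is not None:
        if first.dest in second.sources:
            return False                              # rule 2: RAW
        if first.dest == second.dest:
            return False                              # rule 2: WAW
    for inst in (first, second):
        if inst.has_displacement and inst.has_immediate:
            return False                              # rule 3
    return True


mov_eax_ebx = Inst(simple=True, dest="eax", sources=("ebx",))
inc_ecx = Inst(simple=True, dest="ecx", sources=("ecx",))
add_ecx_eax = Inst(simple=True, dest="ecx", sources=("ecx", "eax"))

print(can_pair(mov_eax_ebx, inc_ecx))      # True: independent simple instructions
print(can_pair(mov_eax_ebx, add_ecx_eax))  # False: add reads eax written by mov
```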

Branch prediction: dynamic, based on history. A branch target buffer (BTB) stores the branch destination address associated with each branch instruction. Once the instruction has executed, its history is updated. The BTB is a 4-way set-associative cache with 256 lines. Each entry uses the address of the branch instruction as a tag. The value field contains the branch destination address from the last time this branch was taken, together with a two-bit history field.
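
A BTB of this shape can be modelled as below. This is a minimal sketch under several assumptions that the text above does not state: the 256 lines are organised as 64 sets of 4 ways, the set index is the branch address modulo 64, replacement is first-in first-out, only taken branches are entered, and the two-bit history is a saturating counter that starts at 3 and predicts taken when it is 2 or 3. The actual Pentium state machine may differ.

```python
class BranchTargetBuffer:
    WAYS = 4
    SETS = 64                # assumption: 256 lines / 4 ways

    def __init__(self):
        # each set holds up to WAYS entries [tag, target, history], oldest first
        self.sets = [[] for _ in range(self.SETS)]

    def _find(self, addr):
        entries = self.sets[addr % self.SETS]
        for entry in entries:
            if entry[0] == addr:          # the branch address is the tag
                return entries, entry
        return entries, None

    def predict(self, addr):
        """Return the predicted target address, or None for 'not taken'."""
        _, entry = self._find(addr)
        if entry is not None and entry[2] >= 2:
            return entry[1]
        return None

    def update(self, addr, taken, target=None):
        """Record the real outcome once the branch has executed."""
        entries, entry = self._find(addr)
        if entry is None:
            if taken:                     # allocate only for taken branches (assumption)
                if len(entries) == self.WAYS:
                    entries.pop(0)        # FIFO replacement (assumption)
                entries.append([addr, target, 3])
            return
        if taken:
            entry[1] = target             # destination from the last time it was taken
            entry[2] = min(3, entry[2] + 1)
        else:
            entry[2] = max(0, entry[2] - 1)


btb = BranchTargetBuffer()
print(btb.predict(0x100))          # None: branch not seen yet
btb.update(0x100, True, 0x80)
print(btb.predict(0x100))          # 128: predicted taken to 0x80
btb.update(0x100, False)
btb.update(0x100, False)
print(btb.predict(0x100))          # None: two misses flip the prediction
```

With a two-bit history a single not-taken outcome, such as a loop exit, does not immediately change the prediction for the next execution of the same branch.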