Superscalar

performance limit
instruction vs machine parallelism
instruction issue policy
register renaming
long instruction word
examples: PowerPC 601, Pentium
The term 'superscalar' describes a computer implementation that improves performance by executing scalar instructions concurrently (more than one instruction per cycle).  A 'scalar' processor is a processor that executes one instruction at a time.  A superscalar processor allows several instructions to be in the same pipeline stage at the same time.  It is designed to improve the performance of the execution of scalar instructions, as opposed to a vector processor, which operates on vectors.

Performance limit of superscalar:

Instruction parallelism is a measure of the average number of instructions that a superscalar processor might be able to execute at the same time.  Machine parallelism is a measure of the ability of the processor to take advantage of that instruction-level parallelism.  Machine parallelism is determined by the number of instructions that can be fetched and executed at the same time, and by the speed and sophistication of the mechanisms that the processor uses to find independent instructions.

Instruction issue refers to the process of initiating instruction execution in the processor's functional units.  The instruction-issue policy affects performance because it determines the processor's 'lookahead' capability; that is, the ability of the processor to examine instructions beyond the current point of execution in the hope of finding independent instructions to execute.

With out-of-order issue, the processor continues decoding regardless of conflicts in the functional units and places the decoded instructions in an 'instruction window'.  It then examines the instructions in the window to find ones that can be executed (no resource conflict and no dependency).

Example
Assume a superscalar processor capable of fetching and decoding two instructions at a time, with three separate functional units and two instances of the writeback stage.  The six instructions I1 to I6 have the following constraints:

I1 requires two cycles to execute.
I3 and I4 conflict for the same functional unit.
I5 depends on the value produced by I4.
I5 and I6 conflict for a functional unit.
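
The effect of the issue policy on this example can be sketched with a small simulation.  It is a simplified model, not the pipeline of the figure: it only tracks the cycle in which each instruction starts executing, and it ignores the decode and writeback stages.  The unit names U1 to U3 and the assignment of instructions to units are assumptions, chosen only to satisfy the constraints listed above.

```python
# Six instructions in program order.  Unit assignment is an assumption
# that satisfies: I3/I4 share a unit, I5/I6 share a unit, I5 needs I4.
PROGRAM = [
    {"name": "I1", "unit": "U1", "latency": 2, "deps": []},
    {"name": "I2", "unit": "U2", "latency": 1, "deps": []},
    {"name": "I3", "unit": "U3", "latency": 1, "deps": []},
    {"name": "I4", "unit": "U3", "latency": 1, "deps": []},
    {"name": "I5", "unit": "U2", "latency": 1, "deps": ["I4"]},
    {"name": "I6", "unit": "U2", "latency": 1, "deps": []},
]


def simulate(program, in_order, issue_width=2):
    """Return {instruction name: cycle in which it starts executing}."""
    pending = list(program)
    unit_free_at = {}  # unit -> first cycle in which it is free again
    done_at = {}       # name -> last cycle of its execution
    issued = {}
    cycle = 0
    while pending:
        cycle += 1
        slots = issue_width
        for ins in list(pending):
            if slots == 0:
                break
            # A dependency is satisfied only if it finished in an earlier cycle.
            ready = all(done_at.get(d, cycle) < cycle for d in ins["deps"])
            free = unit_free_at.get(ins["unit"], 1) <= cycle
            if ready and free:
                issued[ins["name"]] = cycle
                done_at[ins["name"]] = cycle + ins["latency"] - 1
                unit_free_at[ins["unit"]] = cycle + ins["latency"]
                pending.remove(ins)
                slots -= 1
            elif in_order:
                break  # in-order issue: a blocked instruction blocks all later ones
            # out-of-order issue: keep scanning the window for a ready instruction
    return issued


in_order = simulate(PROGRAM, in_order=True)
out_of_order = simulate(PROGRAM, in_order=False)
print(in_order)      # I6 waits behind I5 and starts in cycle 5
print(out_of_order)  # I6 is issued ahead of I4 and I5, in cycle 2
```

In this model the out-of-order policy finishes issuing one cycle earlier, because I6 does not have to wait behind I4 and I5.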
 
Figure: Executing six instructions under different issue policies
 

Register renaming

R3 = R3 op R5            I1
R4 = R3 + 1                 I2
R3 = R5 + 1                 I3
R7 = R3 op R4            I4

I1, I2  RAW
I1, I3  WAW
I2, I3  WAR
I3, I4  RAW
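
The dependencies above can be found mechanically.  The following is a minimal sketch, assuming each instruction is described only by its destination register and its source registers (the tuple layout and the function name are illustrative).  A pair is reported only if no instruction in between rewrites the register in question.

```python
# (label, destination register, source registers) for I1..I4 above.
PROGRAM = [
    ("I1", "R3", ["R3", "R5"]),
    ("I2", "R4", ["R3"]),
    ("I3", "R3", ["R5"]),
    ("I4", "R7", ["R3", "R4"]),
]


def find_dependencies(program):
    """Return a list of (earlier, later, kind, register) tuples."""
    found = []
    for i, (name_a, dst_a, srcs_a) in enumerate(program):
        for j in range(i + 1, len(program)):
            name_b, dst_b, srcs_b = program[j]
            # Registers written by the instructions strictly between a and b.
            between = {dst for _, dst, _ in program[i + 1:j]}
            if dst_a in srcs_b and dst_a not in between:
                found.append((name_a, name_b, "RAW", dst_a))  # read after write
            if dst_b in srcs_a and dst_b not in between:
                found.append((name_a, name_b, "WAR", dst_b))  # write after read
            if dst_a == dst_b and dst_a not in between:
                found.append((name_a, name_b, "WAW", dst_a))  # write after write
    return found


deps = find_dependencies(PROGRAM)
for dep in deps:
    print(dep)
```

Besides the four pairs listed above, this scan also reports I2, I4 RAW (on R4) and I1, I3 WAR (on R3, which I1 reads and I3 overwrites).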

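Register renaming removes the WAR and WAW conflicts, which arise only from reuse of the same register name, by giving each newly produced value its own register.  The RAW dependencies remain.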

R3b = R3a op R5a         I1
R4b = R3b + 1            I2
R3c = R5a + 1            I3
R7b = R3c op R4b         I4
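
The renaming above can be sketched as follows.  This is an illustration of the idea only, not how the hardware allocates registers: it assumes every register starts at version 'a' and that each write creates the next letter (so at most 26 versions per register).  The function name and tuple layout are illustrative.

```python
import string

# (label, destination register, source registers) for I1..I4.
PROGRAM = [
    ("I1", "R3", ["R3", "R5"]),
    ("I2", "R4", ["R3"]),
    ("I3", "R3", ["R5"]),
    ("I4", "R7", ["R3", "R4"]),
]


def rename_registers(program):
    """Give every written value a fresh name; reads use the latest name."""
    version = {}  # register -> index of its current version (0 means 'a')

    def current(reg):
        return reg + string.ascii_lowercase[version.get(reg, 0)]

    renamed = []
    for label, dst, srcs in program:
        # Sources must be looked up before the destination gets a new version,
        # otherwise "R3 = R3 op R5" would read its own result.
        new_srcs = [current(r) for r in srcs]
        version[dst] = version.get(dst, 0) + 1
        renamed.append((label, current(dst), new_srcs))
    return renamed


renamed = rename_registers(PROGRAM)
for label, dst, srcs in renamed:
    print(label, dst, "<-", ", ".join(srcs))
```

After renaming, I3 writes R3c while I1 wrote R3b and I2 reads R3b, so I3 no longer has to wait for I1 or I2.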
 

Long Instruction Word

A superscalar processor uses dynamic scheduling, i.e. the hardware controls the issue of instructions dynamically.  With static scheduling, the LIW (long instruction word) architecture, now usually called VLIW (very long instruction word), depends on a compiler to schedule concurrent operations and pack them into a long instruction word, typically 120-200 bits.  Visualise a processor without instructions, just direct control of the hardware, i.e. at the level of a microprogram.  The compiler performs the scheduling of parallel execution.  Since the hardware can have multiple functional units, the compiler can schedule as many of them as possible to execute concurrently.  The limit is instruction parallelism.

A basic block is defined as a sequence of code without branching, on average about 10 lines of assembly.  The number of instructions in a basic block, i.e. straight-line code, must be large enough to sustain parallel execution of the functional units.  One simple technique to enlarge it is loop unrolling.  More advanced techniques require inter-block analysis, so-called "trace scheduling".  Trace scheduling works by analysing the sequence of instructions executed.
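
Loop unrolling can be illustrated with a dot product.  This is only a sketch of the shape of the code a compiler would produce: Python itself does not run the statements in parallel.  The function names and the unrolling factor of 4 are assumptions made for the illustration.

```python
def dot_rolled(a, b):
    """One multiply-add per iteration: a short basic block, then a branch."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_unrolled4(a, b):
    """Four independent multiply-adds per iteration: a longer basic block."""
    n = len(a)
    s0 = s1 = s2 = s3 = 0.0
    i = 0
    while i + 4 <= n:
        # These four statements do not depend on each other, so a scheduler
        # could place them in the same long instruction word.
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
        i += 4
    while i < n:  # leftover elements when n is not a multiple of 4
        s0 += a[i] * b[i]
        i += 1
    return (s0 + s1) + (s2 + s3)


xs = [float(k) for k in range(1, 11)]
print(dot_rolled(xs, xs), dot_unrolled4(xs, xs))
```

The separate accumulators s0 to s3 matter: with a single accumulator each addition would depend on the previous one (RAW), and the unrolled body would still execute serially.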

Examples: IBM PowerPC 601, Intel Pentium

Figure: PowerPC 601 and Pentium pipelines

PowerPC 601
has several functional units, each with a different pipeline:

Branch inst.    Fetch  Dispatch  Decode  Execute    Predict
Integer inst.   Fetch  Dispatch  Decode  Execute    Writeback
Load/Store      Fetch  Dispatch  Decode  Addr Gen   Cache      Writeback
FP inst.        Fetch  Dispatch  Decode  Execute1   Execute2   Writeback

The PowerPC 601 can issue branch and floating-point instructions out of order.  Branch processing uses fixed rules to reduce stall cycles:
1. Scan the dispatch buffer (8 deep) for branch instructions and generate their target addresses.
2. Determine the outcome of each branch:
   a. will be taken: for unconditional branches, and for conditional branches whose condition code is known and indicates branching
   b. will not be taken: for conditional branches whose condition code is known and indicates no branching
   c. outcome cannot yet be determined: for a backward branch guess taken, for a forward branch guess not taken.
The designers did not use branch history because they judged that it would yield minimal payoff.
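
The fixed rule can be written as a small decision function.  This is a sketch of the rule as stated above, not of the 601 hardware; the function name and its parameters are hypothetical.

```python
def predict_taken(branch_addr, target_addr, unconditional=False, condition=None):
    """Return True if the branch is predicted (or known) to be taken.

    condition is True or False when the condition code is already known,
    and None when the outcome cannot yet be determined.
    """
    if unconditional:
        return True                   # case a: always taken
    if condition is not None:
        return condition              # cases a and b: outcome already known
    # Case c: backward branches (typically loop ends) guess taken,
    # forward branches guess not taken.  A branch to its own address is
    # treated as forward here, which is an arbitrary choice.
    return target_addr < branch_addr


print(predict_taken(0x100, 0x080))                   # backward, unknown: taken
print(predict_taken(0x100, 0x140))                   # forward, unknown: not taken
print(predict_taken(0x100, 0x080, condition=False))  # known: not taken
```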

Pentium
has a 5-stage pipeline and two integer units:
1. Prefetch
2. Decode stage 1  (instruction pairing)
3. Decode stage 2  (address generation)
4. Execute
5. Writeback

Instruction pairing rules
1. both instructions are simple (hardwired)
2. no RAW or WAW dependency between the two instructions
3. neither instruction contains both a displacement and an immediate operand

Simple instructions: mov, alu r,r, shift, inc, dec, pop, lea, jump, call, jcc (near)
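
The pairing rules can be sketched as a check on two adjacent instructions.  Only the three rules listed above are modelled; the real processor applies further restrictions that are not covered here.  The Inst record, its field names and the SIMPLE set are hypothetical, made up for the illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Instruction classes treated as "simple", following the list above.
SIMPLE = {"mov", "alu", "shift", "inc", "dec", "pop", "lea", "jmp", "call", "jcc"}


@dataclass
class Inst:
    op: str                        # instruction class, e.g. "mov"
    dst: Optional[str] = None      # register written, if any
    srcs: Tuple[str, ...] = ()     # registers read
    has_disp: bool = False         # uses a displacement
    has_imm: bool = False          # uses an immediate operand


def can_pair(first, second):
    """Return True if the two instructions may be issued together."""
    # Rule 1: both instructions must be simple.
    if first.op not in SIMPLE or second.op not in SIMPLE:
        return False
    # Rule 2: the second must not read (RAW) or write (WAW) the first's result.
    if first.dst is not None:
        if first.dst in second.srcs or first.dst == second.dst:
            return False
    # Rule 3: neither may carry both a displacement and an immediate.
    for inst in (first, second):
        if inst.has_disp and inst.has_imm:
            return False
    return True


print(can_pair(Inst("mov", "eax", ("ebx",)), Inst("inc", "ecx", ("ecx",))))
print(can_pair(Inst("mov", "eax", ("ebx",)), Inst("alu", "edx", ("eax", "edx"))))
```

The first pair is independent and can be issued together; the second pair has a RAW dependency on eax and cannot.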
 
Branch prediction: dynamic, based on history.  A branch target buffer (BTB) stores the branch destination address associated with each branch instruction.  Once the instruction is executed, its history is updated.  The BTB is a 4-way set-associative cache with 256 lines.  Each entry uses the address of the branch instruction as a tag.  The value field contains the branch destination address from the last time this branch was taken, together with a two-bit history field.
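
A BTB of this shape can be sketched as follows.  The geometry (256 lines, 4 ways, hence 64 sets), the tag and the two-bit history come from the description above.  Everything else is an assumption made for the sketch: the set index is the address modulo the number of sets, the history is a saturating counter that predicts taken when it is 2 or 3, a new entry is allocated only when a branch is taken and starts at 3, and replacement is first-in first-out.

```python
class BranchTargetBuffer:
    def __init__(self, lines=256, ways=4):
        self.ways = ways
        self.num_sets = lines // ways
        # Each set holds up to `ways` entries of the form [tag, target, history].
        self.sets = [[] for _ in range(self.num_sets)]

    def _lookup(self, addr):
        entries = self.sets[addr % self.num_sets]  # assumed index function
        for entry in entries:
            if entry[0] == addr:                   # tag = branch address
                return entries, entry
        return entries, None

    def predict(self, addr):
        """Return the predicted target, or None to predict 'not taken'."""
        _, entry = self._lookup(addr)
        if entry is not None and entry[2] >= 2:
            return entry[1]
        return None

    def update(self, addr, taken, target=None):
        """Record the actual outcome once the branch has executed."""
        entries, entry = self._lookup(addr)
        if entry is None:
            if taken:                              # assumed: allocate on taken only
                if len(entries) == self.ways:
                    entries.pop(0)                 # assumed: FIFO replacement
                entries.append([addr, target, 3])  # assumed initial history
            return
        if taken:
            if target is not None:
                entry[1] = target                  # remember the latest target
            entry[2] = min(3, entry[2] + 1)
        else:
            entry[2] = max(0, entry[2] - 1)


btb = BranchTargetBuffer()
btb.update(0x400, taken=True, target=0x300)
print(btb.predict(0x400))  # a hit with history 3: predicts the stored target
```

With a two-bit history a single not-taken outcome (for example at a loop exit) does not immediately flip the prediction; it takes two in a row.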