The term 'superscalar' describes a computer implementation that improves performance by concurrent execution of scalar instructions, that is, more than one instruction per cycle. A 'scalar' processor executes one instruction at a time. A superscalar machine allows concurrent execution of instructions in the same pipeline stage; it is designed to improve the performance of the execution of scalar instructions (as opposed to vector processors, which operate on vectors).
instruction vs machine parallelism
instruction issue policy
register renaming
long instruction word
example: PowerPC 601, Pentium
Performance limit of superscalar:
Instruction issue refers to the process of initiating instruction execution in the processor's functional units. The instruction-issue policy affects performance because it determines the processor's 'lookahead' capability, that is, its ability to examine instructions beyond the current point of execution in the hope of finding independent instructions to execute.
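As a rough sketch of what lookahead means in practice (invented data structures, not any particular machine's issue logic), the routine below scans a small window of upcoming instructions each cycle and issues up to two whose source registers are not written by an earlier instruction that has not completed yet:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical three-operand instruction: dest = src1 op src2. */
typedef struct { int dest, src1, src2; bool done; } Instr;

#define MAX_WIN 8

/* One issue cycle: scan a lookahead window of 'win' instructions (win <= MAX_WIN)
 * starting at 'pc' and pick up to 'width' instructions whose source registers
 * are not produced by an earlier, still-pending instruction (a RAW hazard).
 * Instructions picked in this cycle complete at the end of the cycle. */
static int issue_cycle(Instr *prog, int n, int pc, int win, int width) {
    bool pick[MAX_WIN] = { false };
    int issued = 0;

    for (int i = pc; i < n && i < pc + win && issued < width; i++) {
        if (prog[i].done) continue;
        bool blocked = false;
        for (int j = pc; j < i; j++) {
            if (!prog[j].done &&                       /* earlier writer still pending */
                (prog[j].dest == prog[i].src1 || prog[j].dest == prog[i].src2))
                blocked = true;
        }
        if (!blocked) {
            pick[i - pc] = true;
            printf("issue I%d\n", i + 1);
            issued++;
        }
    }
    for (int i = pc; i < n && i < pc + win; i++)       /* results visible next cycle */
        if (pick[i - pc]) prog[i].done = true;
    return issued;
}

int main(void) {
    /* Toy program: I2 has a RAW dependence on I1; I3 is independent, so
     * lookahead lets it issue alongside I1 in the first cycle. */
    Instr prog[] = {
        { 3, 1, 2, false },   /* I1: R3 = R1 op R2 */
        { 4, 3, 9, false },   /* I2: R4 = R3 op R9 */
        { 5, 6, 7, false },   /* I3: R5 = R6 op R7 */
    };
    issue_cycle(prog, 3, 0, 4, 2);   /* expected output: issue I1, issue I3 */
    return 0;
}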
Example
Assume a superscalar processor capable of fetching and decoding two instructions at a time, with three separate functional units and two writeback stages.
I1 requires two cycles to execute
I3 and I4 conflict for the same functional unit.
I5 depends on the value produced by I4.
I5 and I6 conflict for a functional unit.
Figure: Executing six instructions by different issue policies
Consider also the four-instruction sequence I1: R3 = R3 op R5, I2: R4 = R3 + 1, I3: R3 = R5 + 1, I4: R7 = R3 op R4. It contains the following dependencies:
I1, I2  RAW on R3 (true data dependency)
I1, I3  WAW on R3 (output dependency)
I2, I3  WAR on R3 (antidependency)
I3, I4  RAW on R3 (true data dependency)
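These classifications follow mechanically from which registers each instruction reads and writes. A small sketch applying the standard definitions (struct fields are illustrative; -1 marks an absent second source). Note that, by the bare definitions, I1/I3 also carries a WAR on R3 in addition to the output dependency listed above:

#include <stdio.h>

typedef struct { int dest, src1, src2; } Instr;   /* -1 = no such operand */

/* Print every dependency of 'later' on 'earlier'. */
static void classify(Instr earlier, Instr later, const char *e, const char *l) {
    if (later.src1 == earlier.dest || later.src2 == earlier.dest)
        printf("%s, %s  RAW on R%d\n", e, l, earlier.dest);
    if (later.dest == earlier.src1 || later.dest == earlier.src2)
        printf("%s, %s  WAR on R%d\n", e, l, later.dest);
    if (later.dest == earlier.dest)
        printf("%s, %s  WAW on R%d\n", e, l, later.dest);
}

int main(void) {
    Instr i1 = { 3, 3, 5 };    /* I1: R3 = R3 op R5 */
    Instr i2 = { 4, 3, -1 };   /* I2: R4 = R3 + 1   */
    Instr i3 = { 3, 5, -1 };   /* I3: R3 = R5 + 1   */
    Instr i4 = { 7, 3, 4 };    /* I4: R7 = R3 op R4 */
    classify(i1, i2, "I1", "I2");   /* RAW on R3 */
    classify(i1, i3, "I1", "I3");   /* WAW on R3 (and also WAR, since I1 reads R3) */
    classify(i2, i3, "I2", "I3");   /* WAR on R3 */
    classify(i3, i4, "I3", "I4");   /* RAW on R3 */
    return 0;
}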
Register renaming avoids these storage (resource) conflicts: each write allocates a new instance of the register (denoted a, b, c, ...), so the WAW and WAR conflicts disappear while the true dependencies remain:
R3b = R3a op R5a I1
R4b = R3b + 1 I2
R3c = R5a + 1 I3
R7b = R3c op R4b I4
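The renaming above can be produced mechanically: keep a map from each architectural register to its most recent instance, resolve an instruction's source registers against the map, then allocate a fresh instance for the destination. A minimal sketch that reproduces the renamed sequence (letter suffixes only; a real machine maps to a physical register file instead):

#include <stdio.h>

#define NREGS 32

/* map[r] = current instance of architectural register r,
 * printed as a suffix letter: 0 -> a, 1 -> b, 2 -> c, ... */
static int map[NREGS];

/* dst = s1 op s2: read the source instances first, then allocate a new
 * instance for the destination (this is what removes WAW/WAR conflicts). */
static void rename3(int dst, int s1, int s2, const char *op) {
    int i1 = map[s1], i2 = map[s2];
    map[dst]++;
    printf("R%d%c = R%d%c %s R%d%c\n",
           dst, 'a' + map[dst], s1, 'a' + i1, op, s2, 'a' + i2);
}

/* dst = s1 <rest>, e.g. "+ 1" with an immediate operand. */
static void rename2(int dst, int s1, const char *rest) {
    int i1 = map[s1];
    map[dst]++;
    printf("R%d%c = R%d%c %s\n", dst, 'a' + map[dst], s1, 'a' + i1, rest);
}

int main(void) {
    rename3(3, 3, 5, "op");   /* I1: R3 = R3 op R5  ->  R3b = R3a op R5a */
    rename2(4, 3, "+ 1");     /* I2: R4 = R3 + 1    ->  R4b = R3b + 1    */
    rename2(3, 5, "+ 1");     /* I3: R3 = R5 + 1    ->  R3c = R5a + 1    */
    rename3(7, 3, 4, "op");   /* I4: R7 = R3 op R4  ->  R7b = R3c op R4b */
    return 0;
}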
Example: IBM PowerPC 601, Intel Pentium
Figure: PowerPC 601 and Pentium pipelines
PowerPC 601
has several functional units, each with its own pipeline:
Branch inst.     : Fetch | Dispatch | Decode | Execute | Predict
Integer inst.    : Fetch | Dispatch | Decode | Execute | Writeback
Load/Store inst. : Fetch | Dispatch | Decode | Addr Gen | Cache | Writeback
FP inst.         : Fetch | Dispatch | Decode | Execute1 | Execute2 | Writeback
The PowerPC 601 can issue branch and floating-point instructions out of order.
Branch processing employs fixed rules to reduce stall cycles (a sketch of the decision logic follows the list):
1. Scan the dispatch buffer (8 entries deep) for branch instructions and
   generate their target addresses.
2. Determine the outcome of each branch:
   a. will be taken: unconditional branches, and conditional branches whose
      condition code is already known and indicates branching
   b. will not be taken: conditional branches whose condition code is already
      known and indicates no branching
   c. outcome cannot yet be determined: guess taken for a backward branch,
      guess not taken for a forward branch.
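The outcome rules amount to a small decision procedure. A minimal sketch of that logic, with invented field names (an illustration of the rules above, not the 601's actual hardware structures):

#include <stdbool.h>
#include <stdio.h>

typedef enum { TAKEN, NOT_TAKEN } Guess;

/* Hypothetical per-branch information available when the branch is scanned. */
typedef struct {
    bool unconditional;   /* unconditional branches are always taken (rule 2a) */
    bool cc_known;        /* condition code already valid at this point        */
    bool cc_says_taken;   /* what the known condition code indicates           */
    long displacement;    /* negative = backward branch, positive = forward    */
} BranchInfo;

static Guess predict(BranchInfo br) {
    if (br.unconditional) return TAKEN;                            /* rule 2a */
    if (br.cc_known) return br.cc_says_taken ? TAKEN : NOT_TAKEN;  /* 2a / 2b */
    return br.displacement < 0 ? TAKEN : NOT_TAKEN;                /* rule 2c */
}

int main(void) {
    BranchInfo loop_back = { false, false, false, -64 };   /* backward, cc unknown */
    printf("%s\n", predict(loop_back) == TAKEN ? "guess taken" : "guess not taken");
    return 0;
}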
The designers did not use branch history because they judged it would yield minimal payoff.
Pentium
has a five-stage pipeline and two integer units:
1. Prefetch
2. Decode stage 1 (inst. pairing)
3. Decode stage 2 (addr. gen.)
4. Execute
5. Writeback
Instruction pairing rules (a sketch of the check follows the list):
1. Both instructions are simple (hardwired, no microcode).
2. No RAW or WAW dependency between them.
3. Neither instruction has both a displacement and an immediate operand.
Simple inst.: mov, alu reg,reg, shift, inc, dec, pop, lea, jmp/call/jcc near
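Reading the three rules as a predicate on a candidate pair gives something like the sketch below; the field names are invented and the real pairing logic has more cases (prefixes, flag dependencies, and so on):

#include <stdbool.h>
#include <stdio.h>

/* Simplified decode-time view of one instruction (fields are hypothetical). */
typedef struct {
    bool simple;            /* hardwired, no microcode        */
    int  dest, src;         /* register numbers, -1 if none   */
    bool has_displacement;
    bool has_immediate;
} Inst;

/* Can 'u' (first of the pair) and 'v' (second) issue together? */
static bool can_pair(Inst u, Inst v) {
    if (!u.simple || !v.simple) return false;                      /* rule 1 */
    if (u.dest != -1 && v.src  == u.dest) return false;            /* rule 2: RAW */
    if (u.dest != -1 && v.dest == u.dest) return false;            /* rule 2: WAW */
    if ((u.has_displacement && u.has_immediate) ||
        (v.has_displacement && v.has_immediate)) return false;     /* rule 3 */
    return true;
}

int main(void) {
    Inst mov_r1_r2 = { true, 1, 2, false, false };   /* mov r1, r2 */
    Inst inc_r3    = { true, 3, 3, false, false };   /* inc r3     */
    Inst inc_r1    = { true, 1, 1, false, false };   /* inc r1     */
    printf("%d\n", can_pair(mov_r1_r2, inc_r3));     /* 1: independent, pairable  */
    printf("%d\n", can_pair(mov_r1_r2, inc_r1));     /* 0: r1 is written then used */
    return 0;
}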
Branch prediction: dynamic, based on history. A branch target
buffer (BTB) stores the branch destination address associated with each
branch instruction. Once the instruction executes, the history
is updated. The BTB is a four-way set-associative cache with 256 lines.
Each entry uses the address of the branch instruction as a tag. The value
field contains the branch destination address for the last time this branch
was taken, plus a two-bit history field.
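A sketch of that lookup-and-update flow, using a two-bit saturating counter for the history field. The sizes follow the text (four ways, 256 lines), but the indexing, replacement, and counter policy here are illustrative assumptions, not the Pentium's actual implementation:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SETS 256
#define WAYS 4

typedef struct {
    bool     valid;
    uint32_t tag;       /* address of the branch instruction                  */
    uint32_t target;    /* destination the last time this branch was taken    */
    uint8_t  history;   /* 2-bit counter: 0,1 = predict not taken; 2,3 = taken */
} BtbEntry;

static BtbEntry btb[SETS][WAYS];

/* Predict at fetch time: a hit whose counter is in a "taken" state supplies
 * the target address; any other case predicts not taken. */
static bool btb_predict(uint32_t pc, uint32_t *target) {
    BtbEntry *set = btb[pc % SETS];
    for (int w = 0; w < WAYS; w++)
        if (set[w].valid && set[w].tag == pc && set[w].history >= 2) {
            *target = set[w].target;
            return true;
        }
    return false;
}

/* Update after the branch executes: bump or decay the 2-bit counter and
 * record the latest taken target (allocation policy kept deliberately simple). */
static void btb_update(uint32_t pc, bool taken, uint32_t target) {
    BtbEntry *set = btb[pc % SETS];
    BtbEntry *e = NULL;
    for (int w = 0; w < WAYS; w++)
        if (set[w].valid && set[w].tag == pc) e = &set[w];
    if (!e) {
        e = &set[0];                       /* fall back to way 0 if the set is full */
        for (int w = 0; w < WAYS; w++)
            if (!set[w].valid) { e = &set[w]; break; }
        *e = (BtbEntry){ true, pc, target, 1 };
    }
    if (taken) {
        if (e->history < 3) e->history++;
        e->target = target;
    } else if (e->history > 0) {
        e->history--;
    }
}

int main(void) {
    uint32_t t = 0;
    btb_update(0x1000, true, 0x2000);      /* a loop branch taken twice in a row */
    btb_update(0x1000, true, 0x2000);
    printf("%d\n", btb_predict(0x1000, &t));   /* 1: predicted taken, t = 0x2000 */
    return 0;
}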