Advanced Pipeline

Pipeline

instruction pipeline (single step ) Fetch Decode Execute Writeback . Each stage executes in one step.
floating-point pipeline (multi step)

Topics

pipeline of the floating point unit
    reservation table
    collision vector
multiple cycle pipeline
dynamic scheduling in pipeline
    scoreboard bookeeping
    Tomasulo
figure CDC6600   IBM360/91

Pipeline of the floating-point unit

Figure a pipelined floating-point combined adder/multiplier unit

Reservation table

Figure the reservation of the combined add/mult FP unit

A reservation table shows a timing diagram of a flow of data through the functional unit. The reservation table is derived directly from the pipeline design. It is used to decide when to launch an operation into the pipeline. Both operations cannot use the same unit at the same time. A collision vector represents the collision information. Position i contains a bit that indicates whether or not a new operation can be launched i units after the first operation has been started.

(a) 1 1 0 0 0 0 (b) 1 0 0 0 0 0 0 0

Figure Collision vectors of (a) multiplication (mul follows by mul) (b) addition (add, add)

Multiple cycle pipeline

For floating-point operations the pipeline will required to operate in multiple cycle. Assume there are multiple functional units :

Integer unit,
FP/integer multiplier,
FP adder,
FP/integer divider.

Figure an example of multi step pipeline

Assume there are two separate register sets : FP and Integer. This simplifies the pipeline control as it reduces hazard detection such as overlapping FP and Integer operations, except for load/store FP and movement between FP/Integer registers. Integer unit handles load/store to both register sets. Assume EX stage is repeated many times to do these operations. No other instruction using functional unit may issue until the previous instruction leaves EX. If an instruction cannot proceed to the EX stage, the entire pipeline behind that instruction will be stalled (to avoid this stall, we need the capability to do out-of-order issue, the topic of next section). The following steps are required to issue a new floating-point instruction :

Check for structural hazard
Check for data hazard
Check for forwarding

Because all FP instructions require different execution time, this caused three complications :

contention for register access at WB stage,
WAR and WAW hazards and
interrupts. (we ignore interrupt).

FP load and FP operation can content for FP register on writes. This can be dealt with using priority scheme at the WB stage. The highest priority instruction can get access to register and all other instructions are stalled. A simple heuristic is to give the longest latency instruction the highest priority. If all instructions read their registers at the same time there will be no WAR hazards. WAW hazards occur because the results can be written in different order. The instructions can be completed inn a different order from the order in which they are issued. For example

DIVF F0 F2 F4
SUBF F0 F8 F10

Assume DIVF takes longer than SUBF to complete. The SUBF will complete first and writes its result before the DIVF. This hazard must be detected and ensure that the result is of executing instructions is correct.

Dynamic scheduling in pipeline

The pipeline fetches and issues an instruction unless there is a data dependence between an instruction already in the pipeline and the fetched instruction. When data hazard occurs the pipeline is stalled. When the software (mostly the compiler) is responsible for scheduling instructions to minimise these stalls, it is called static scheduling. This section will introduce the hardware scheme to reduce the stalls, this is called dynamic scheduling as it can detect the dependencies that are unknown at the compile time.

All the previous pipeline technique that we described use in-order instruction issue. If an instruction is stalled in the pipeline, no later instructions can proceed. When there are multiple functional units, these units could be idle. For example,

DIVF F0 F2 F4
ADDF F10 F0 F8
SUBF F8 F8 F14

The SUBF cannot be issue because the dependence of ADDF on DIVF (RAW on F0). Yet, the SUBF does not depend on any instruction in the pipeline. If an instruction can be executed out-of-order this stall can be eliminated.

We can check data hazards in the ID stage. In order to let an instruction start its execution as soon as its operands is available, the instruction issuing process must be separated from the hazard checking. The pipeline will do out-of-order execution which implies out-of-order completion. We have to split the decode, execution stages into 3 stages :

Issue decode instruction, check for data hazards
Read operands wait until no data hazard, then read operands
Execute

Scoreboard

Scoreboard is a hardware technique that enables instructions to be out-of-order executed when there are sufficient resources and no data dependencies. It is named after the CDC6600 scoreboard (first delivered in 1964 and was considered by many to be the first supercomputer). The goal of a scoreboard is to maintain an execution rate of one instruction per clock cycle by executing an instruction as early as possible (this can be achieve only when there are no structural hazard, i.e. there are sufficient number of resources). The scoreboard is responsible for instruction issue and execution. It does all hazard detection.

Every instruction goes through the scoreboard which keeps all information necessary to detect all hazards. The scoreboard determines when an instruction can read its operands and begin execution. The scoreboard controls when an instruction can write its result into the destination register. All hazard detection and resolution is centralised in the scoreboard.

To illustrate the working of scoreboard let we assume S1 with scoreboard (S1s). Assume S1s has 2 FP multipliers, FP divide, FP add and Integer unit. Integer units handles all load/store, memory references, branches and integer operations. Each instruction goes through 4 steps, which replaces the pipeline stages, as follows:

Issue, when no structural hazard the instruction is issued. Scoreboard updates its internal data structure. It guarantees that there is no WAW.
Read operands, the scoreboard monitors source operands. If 2.1 no active instruction is going to write to it. 2.2 no active functional unit is writing to the register containing the operands. When the source operands are available, the scoreboard tells the functional unit to read its operands and begin execution. RAW is resolved and instructions may be executed out-of-order.
Execution, a functional unit begins execution. It notifies the scoreboard when the result is ready.
Write result, when the functional unit has completed execution, the scoreboard checks or WAR hazards. An instruction is not allowed to write its results when 4.1 there is an instruction that has not read its operands. 4.2 one of the operands is the same register as the result of the completing instruction. 4.3 the other operand was the result of an earlier instruction.

The scoreboard controls the execution of an instruction by communicating with the functional units.

Example, a scoreboard of S1s controls the execution of the following sequence of instructions:

LF F6 34(R2)
LF F2 45(R3)
MULTF F0 F2 F4
SUBF F8 F6 F2
DIVF F10 F0 F6
ADDF F6 F8 F2

The scoreboard has three parts:

1. Instruction status
2. Functional unit status, each FU has the following fields
2.1 Busy
2.2 Op, instruction to be performed
2.3 Dest , destination register
2.4 Src1, Src2, source registers
2.5 P1, P2, number of units producing Src1, Src2
2.6 R1, R2, ready flags for Src1, Src2; they are reset when new values are read so the scoreboard knows that the source operand has been read.
3. Register result status, which FU will write register.

Assuming the execution of the floating-point functional units are : add is 2 clocks, multiply is 10 clocks and divide is 40 clocks. Each instruction that has issued, has an entry in the instruction status table. Once the instruction issues, the record of its operands is kept in the functional unit status table. See the figure, the instruction status says that

1) the first LF has completed
2) the second FL has completed but has not yet written its result.
3) the MULTF, SUBF and DIVF have issued but are stalled, waiting for their operands.

The instruction unit status says that

1) the first multiplier unit is waiting for the integer unit. (RAW on F2).
2) the add unit is waiting for the integer unit. (RAW on F2).
3) the divide unit is waiting for the first multiplier unit. (RAW on F0).
4) the ADDF is stalled due to structural hazard (FU Add is in used by SUBF).

Figure The scoreboard after issuing the first five instructions.

Now assume the MULF and DIVF are proceeded and ready to write results.
There are RAW on

1) the second LF to MULTF and SUBF (on F2)
2) MULTF to DIVF (on F0)
3) SUBF to ADDF (on F8)

There is a WAR between DIVF and ADDF on F6. There is a structural hazard on FU add for ADDF. The DIVF has not yet read its operands. The ADDF has read its operands and is in execution, it was waiting for SUBF (structural hazard). The ADDF cannot write its results because of WAR on F6.

Figure The scoreboard when MULTF and DIVF are ready to write results.

Here is the summary of bookeeping for each instruction execution

Instruction issue

check funtional unit is not busy (functional units status)
and dest is not waiting for the result (register result status)

check busy field in FU = yes
check op field in FU = opcode
fill FU : Dest Src1 Src2
fill P1 P2 with the register result of Src1 Src2
check R1 R2 = not P1, not P2
write the name of FU to register result

Read operands

wait for R1 R2 until ready
read operands
set R1, R2 = No
P1, P2 = 0

Execution complete

wait until functional unit done

Write result

check WAR hazard
when other instruction has this instruction Dest
as Src1 or Src2
for all f
Src1(f), Src2(f) != Dest(FU)
AND
when other instruction has written the register R1, R2
R1 = Yes or R2 = Yes

wait until no harzard
set ready flag
for all f
if P1(f) = FU then R1(f) = Yes
if P2(f) = FU then R2(f) = Yes
reset register result
reset busy field of FU

end scoreboard bookeeping

Figure CDC6600 the first supercomputer. Photo courtesy of Charles Babbage Institute, University of Minnesota.

Tomasulo

Another dynamic scheduling technique similar to scoreboard is Tomasulo. This technique was invented by Robert Tomasulo 1967 for the IBM 360/91 floating-point unit. The key concept is the renaming of registers to avoid WAR and WAW hazards. This function is provided by the reservation stations. A reservation station fetches and buffers an operand as soon as it is available, pending instructions designate the reservation station that will provide their input. When successive writes to a register appear, only the last one is actually used to update the register. As instructions are issued, the register name for pending operands are renamed to the names of the reservation station. This is the main difference between scoreboard and Tomasulo's algorithm. There can be more reservation stations than real registers, the technique can eliminate hazards that could not be eliminated by a compiler.

Two other differences between scoreboard and Tomasulo are : first, hazard detection and execution control are distributed by each reservation station (in scoreboard it is centralised), second, results are passed directly to functional units rather than through registers. A common result bus allows all units waiting for an operand to be loaded simultaneously, this is called the common data bus (CDB).

The steps to execute an instruction :

Issue, Get an instruction from the queue, issue it if there is an empty reservation station, send the operands to the reservation station if they are in the registers. If the operand is a load or store, it can issue if there is an available buffer. If there is no empty reservation station or an empty buffer, then there is a structural hazard. This step also performs the renaming registers.
Execute, if operands are not yet available, monitor CDB waiting for the registers. When an operand is available, it is placed into the corresponding reservation station. When both operands are available, execute the operation. This step checks RAW hazards.
Write result, When the result is available, write it on CDB and from there into the registers and any reservation stations waiting for this result.

Figure A CPU with two floating point functional units each with two reservation stations, and one load one store buffer

Figure IBM 360/91 it is unveiled in 1966. Photo courtesy of IBM.