Advanced Pipeline

Pipeline

instruction pipeline (single step ) Fetch Decode Execute Writeback . Each stage executes in one step.
floating-point pipeline (multi step)

Pipeline of the floating-point unit

Figure a pipelined floating-point combined adder/multiplier unit

Reservation table

Figure the reservation of the combined add/mult FP unit

A reservation table shows a timing diagram of a flow of data through the functional unit. The reservation table is derived directly from the pipeline design. It is used to decide when to launch an operation into the pipeline. Both operations cannot use the same unit at the same time. A collision vector represents the collision information. Position i contains a bit that indicates whether or not a new operation can be launched i units after the first operation has been started.

(a) 1 1 0 0 0 0 (b) 1 0 0 0 0 0 0 0

Figure Collision vectors of (a) multiplication (mul follows by mul) (b) addition (add, add)

Multiple cycle pipeline

For floating-point operations the pipeline will required to operate in multiple cycle. Assume there are multiple functional units :

Integer unit,
FP/integer multiplier,
FP adder,
FP/integer divider.

Figure an example of multi step pipeline

Assume there are two separate register sets : FP and Integer. This simplifies the pipeline control as it reduces hazard detection such as overlapping FP and Integer operations, except for load/store FP and movement between FP/Integer registers. Integer unit handles load/store to both register sets. Assume EX stage is repeated many times to do these operations. No other instruction using functional unit may issue until the previous instruction leaves EX. If an instruction cannot proceed to the EX stage, the entire pipeline behind that instruction will be stalled (to avoid this stall, we need the capability to do out-of-order issue, the topic of next section). The following steps are required to issue a new floating-point instruction :

Check for structural hazard
Check for data hazard
Check for forwarding

Because all FP instructions require different execution time, this caused three complications :

contention for register access at WB stage,
WAR and WAW hazards and
interrupts. (we ignore interrupt).

FP load and FP operation can content for FP register on writes. This can be dealt with using priority scheme at the WB stage. The highest priority instruction can get access to register and all other instructions are stalled. A simple heuristic is to give the longest latency instruction the highest priority. If all instructions read their registers at the same time there will be no WAR hazards. WAW hazards occur because the results can be written in different order. The instructions can be completed inn a different order from the order in which they are issued. For example

DIVF F0 F2 F4
SUBF F0 F8 F10

Assume DIVF takes longer than SUBF to complete. The SUBF will complete first and writes its result before the DIVF. This hazard must be detected and ensure that the result is of executing instructions is correct.

Dynamic scheduling in pipeline

The pipeline fetches and issues an instruction unless there is a data dependence between an instruction already in the pipeline and the fetched instruction. When data hazard occurs the pipeline is stalled. When the software (mostly the compiler) is responsible for scheduling instructions to minimise these stalls, it is called static scheduling. This section will introduce the hardware scheme to reduce the stalls, this is called dynamic scheduling as it can detect the dependencies that are unknown at the compile time.

All the previous pipeline technique that we described use in-order instruction issue. If an instruction is stalled in the pipeline, no later instructions can proceed. When there are multiple functional units, these units could be idle. For example,

DIVF F0 F2 F4
ADDF F10 F0 F8
SUBF F8 F8 F14

The SUBF cannot be issue because the dependence of ADDF on DIVF (RAW on F0). Yet, the SUBF does not depend on any instruction in the pipeline. If an instruction can be executed out-of-order this stall can be eliminated.

We can check data hazards in the ID stage. In order to let an instruction start its execution as soon as its operands is available, the instruction issuing process must be separated from the hazard checking. The pipeline will do out-of-order execution which implies out-of-order completion. We have to split the decode, execution stages into 3 stages :

Issue decode instruction, check for data hazards
Read operands wait until no data hazard, then read operands
Execute

Scoreboard

Scoreboard is a hardware technique that enables instructions to be out-of-order executed when there are sufficient resources and no data dependencies. It is named after the CDC6600 scoreboard (first delivered in 1964 and was considered by many to be the first supercomputer). The goal of a scoreboard is to maintain an execution rate of one instruction per clock cycle by executing an instruction as early as possible (this can be achieve only when there are no structural hazard, i.e. there are sufficient number of resources). The scoreboard is responsible for instruction issue and execution. It does all hazard detection.

Every instruction goes through the scoreboard which keeps all information necessary to detect all hazards. The scoreboard determines when an instruction can read its operands and begin execution. The scoreboard controls when an instruction can write its result into the destination register. All hazard detection and resolution is centralised in the scoreboard.

To illustrate the working of scoreboard let we assume S1 with scoreboard (S1s). Assume S1s has 2 FP multipliers, FP divide, FP add and Integer unit. Integer units handles all load/store, memory references, branches and integer operations. Each instruction goes through 4 steps, which replaces the pipeline stages, as follows:

Issue, when no structural hazard the instruction is issued. Scoreboard updates its internal data structure. It guarantees that there is no WAW.
Read operands, the scoreboard monitors source operands. If 2.1 no active instruction is going to write to it. 2.2 no active functional unit is writing to the register containing the operands. When the source operands are available, the scoreboard tells the functional unit to read its operands and begin execution. RAW is resolved and instructions may be executed out-of-order.
Execution, a functional unit begins execution. It notifies the scoreboard when the result is ready.
Write result, when the functional unit has completed execution, the scoreboard checks or WAR hazards. An instruction is not allowed to write its results when 4.1 there is an instruction that has not read its operands. 4.2 one of the operands is the same register as the result of the completing instruction. 4.3 the other operand was the result of an earlier instruction.

The scoreboard controls the execution of an instruction by communicating with the functional units.

Example, a scoreboard of S1s controls the execution of the following sequence of instructions:

LF F6 34(R2)
LF F2 45(R3)
MULTF F0 F2 F4
SUBF F8 F6 F2
DIVF F10 F0 F6
ADDF F6 F8 F2

The scoreboard has three parts:

1. Instruction status
2. Functional unit status, each FU has the following fields
2.1 Busy
2.2 Op, instruction to be performed
2.3 Dest , destination register
2.4 Src1, Src2, source registers
2.5 P1, P2, number of units producing Src1, Src2
2.6 R1, R2, ready flags for Src1, Src2; they are reset when new values are read so the scoreboard knows that the source operand has been read.
3. Register result status, which FU will write register.

Assuming the execution of the floating-point functional units are : add is 2 clocks, multiply is 10 clocks and divide is 40 clocks. Each instruction that has issued, has an entry in the instruction status table. Once the instruction issues, the record of its operands is kept in the functional unit status table. See the figure, the instruction status says that

1) the first LF has completed
2) the second FL has completed but has not yet written its result.
3) the MULTF, SUBF and DIVF have issued but are stalled, waiting for their operands.

The instruction unit status says that

1) the first multiplier unit is waiting for the integer unit. (RAW on F2).
2) the add unit is waiting for the integer unit. (RAW on F2).
3) the divide unit is waiting for the first multiplier unit. (RAW on F0).
4) the ADDF is stalled due to structural hazard (FU Add is in used by SUBF).

Figure The scoreboard after issuing the first five instructions.

Now assume the MULF and DIVF are proceeded and ready to write results.
There are RAW on

1) the second LF to MULTF and SUBF (on F2)
2) MULTF to DIVF (on F0)
3) SUBF to ADDF (on F8)

There is a WAR between DIVF and ADDF on F6. There is a structural hazard on FU add for ADDF. The DIVF has not yet read its operands. The ADDF has read its operands and is in execution, it was waiting for SUBF (structural hazard). The ADDF cannot write its results because of WAR on F6.

Figure The scoreboard when MULTF and DIVF are ready to write results.