## Pipeline

Principle
Speedup
Stall
* Structural hazard
* Data hazard
* Control hazard
Pipeline performance
Managing pipeline
* Register forwarding
* Branch prediction
* Delay branch
S1 pipeline
The principle of pipeline is to overlap the operation of various functional units therefore reduce their idle time.  Pipeline is one of the  very first technique to increase the performance invented since the early days of computer.  Consider one important cycle in the working of a processor: executing an instruction.  Assume the cycle is broken into 5 stages :

loop
1  instruction fetch         IF
2  instruction decode     ID
3  operand fetch             OF
4  execute                        EX
5  write back                   WB

A processor executes three instructions : i1, i2, i3

12345,12345,12345
horizontal axis is time, if each stage takes 1 unit time (it is not necessary that each stage takes  equal time but for simplicity) total time is 15 units.  If we arrange such that three instructions can be overlapped :
i1 12345
i2  12345
i3   12345
now the total time is just 7 units. At any time, each stage works on different instructions, which can be viewed like this after 3 clocks :
1  2  3  4  5
i3 i2 i1 -  -

### Speedup

In an ideal case, the pipeline always be fully used (the number of instruction is large) hence the speedup is
execution time without pipe / execution time with pipe
execution time with pipe = execution time without pipe/ no. of stage in pipe
speedup = no. of stage in pipe

### How a pipeline is implemented?

Each stage composed of a functional unit follows by a latch.  All latches are synchronised.
(F = functional unit, L = latch)
in -F-L-F-L- out
clk__|___|
As we have seen the number of stage determine the speedup. What is the limit of the number of stage?  When a task is divided into several stages which enable overlap operations total time for executing tasks can be shorten but the delay (setting time of latch) of each stage remains constant.  This delay time is the limit of speedup.  Therefore for a pipeline the latch circuits must be very fast.

Amdahl's law can explain this limit.

### Stall of pipeline

A pipeline is not always be filled.  When the flow of pipeline is interrupted, all stages before interruption are stopped (locked), called stall.  The rest of the pipeline can continue to function.  There are three caused
1  structural hazard
2  data hazard
3  control hazard

Structural hazard
caused by two stages of a pipeline accessing the same functional unit.  Because the required resource is busy the pipeline must be stalled.  (more about this in the section pipeline floating point unit)

Data hazard
caused by dependency of data in a sequence of instructions.

i1 R1 = R2 + R3
i2 R4 = R1 + R5
R1 in i1 must be updated before R1 in i2 is read.  Because of overlapping operation, R1 in i1 is updated at WB stage but R1 in i2 is read in OF stage.
x
i1 1 2 3 4 5
x
i2   1 2 3 4 5
Hence the pipeline must be stalled (i2 wait) until R1 is updated (in i1).  The above situation is called Read After Write (RAW) hazard.  Other possibilities are :
i1 R1 = R2 + R3
i2 R2 = R4 + R5
R2 has WAR.
x
i1 1 2 3 4 5
x
i2   1 2 3 4 5
Because WB is at the later stage than OF in this design, there is no WAR hazard.
- Write after Write (WAW)
i1 R1 = R2 + R3
i2 R1 = R4 + R5
if R1 in i2 is updated before R1 in i1, the result will be in the wrong order.  This hazard is presented in pipelines that write in more than one pipe stage (or allow an instruction to proceed even when a previous instruction is stalled).

Control hazard
caused by transfer of control (jump).  The pipeline must be flushed and start fetching the designated instruction (i*).  Therefore stalled three clocks (or wasted three instructions that already be in the pipeline).

x
i1 1 2 3 4 5
i2   1 2 3 4 5
i3     1 2 3 4 5
i4       1 2 3 4 5
i*         1 2 3 4 5

### Managing pipeline

A stall causes the pipeline performance to degrade.
ideal CPI = CPI without pipeline / pipeline depth
CPI with pipeline = ideal CPI + stall
pipeline speedup = ideal CPI x Pipeline depth / (ideal CPI +stall)
Register Forwarding
To reduce stall caused by data hazard, use register forwarding (or bypass) to handle RAW.

Figure  the ALU with register forwarding

y x
i1 1 2 3 4 5
x
i2   1 2 3 4 5
i1 at EX (y), the result is already available.  Bypass unit check if the result is in the buffer then select to use that result instead of reading from register file.  This based on assumption that i1 EX produce a result at the front half of clock cycle and i2 OF read operand at the back half.
y
i1  4F 4B
x
i2  3F 3B
How deep is the buffer in bypass unit?  consider a sequence of instructions
i1  R1 = R2 + R3
i2  R4 = R1 - R5
i3  R6 = R1 + R7
i4  R8 = R1 - R9
y x
i1 1 2 3 4 5
x
i2   1 2 3 4 5
x
i3     1 2 3 4 5
i4       1 2 3 4 5
with bypass unit, i1 only affect i2.  There is also hazard between i1 and i3, to forward the result from i1 to i3 required a buffer of depth two.  There is no hazard between i1 and i4.

To improve stall caused by control hazard,

1  prefetch both next and target addresses
2  use branch prediction
3  delay branch
Branch prediction
use history of branch to predict the current branch (taken/not taken).  The history is stored in "branch-target buffer".
PC of the instruction to be fetched
v
with one bit of history the rule to decide is simply :
if the previous branch is taken then fetch from this branch target

This prediction will be wrong twice, one when enter the loop and the other when leave the loop.  To improve the prediction, one can increase the state of history bit.  Example, use 2 bits history with four states.  The rule can be :
if wrong prediction twice then change prediction to the other way.

#### Branch-target-buffer

We need to know what address to fetch by the end of  Fetch stage.  That is, we need to know what next PC should be (even the newly fetch instruction is not yet decoded, so we don't even know if that instruction is a branch or not).  Therefore we are predicting the next instruction address and will send it out (next Fetch) before decoding the instruction.  If the instruction is a branch and we know what the next PC is, we can have zero stall on branch.

The main idea is to use a cache memory to store : PC of instruction Fetched and its predicted next PC and history bit.  The steps are as follows:

Figure Steps in handling branch prediction

The history bit says whethere the branch was recently taken or not.  Using one-bit history will predict incorrectly twice in a situation such as when a branch is almost always taken (such as loop back) and then when it is not taken (end of loop).  Using two-bit, a prediction must miss twice in a row before it is changed.

Figure Two-bit prediction

Delay branch
The principle of this method is to use the stall cycle to do useful work.  By moving some instruction to be executed in the stall cycle.  The jump instruction is "delayed" i.e. caused the actual jump by some amount of clock (equal to stall cycle).  Some instruction can be scheduled to be executed in that "delay slot".  The instruction can be taken :

1  from above
2  from target
3  from fall through
from above :
i1           jmp i2
jmp i2       <i1 >
< >          i2
i2
from target :
i1           i2
i2           jmp i2
jmp i1       <i1>
< >          i3
i3
from fall through :
i1           i1
jmp i3       jmp i3
< >          < i2 >
i2           i3
i3

using from target, branch taken should be of high probability.  Using from fall through, branch not taken is of high probability. The instruction in delay slot must be "safe", that is, when the prediction is wrong, executing this instruction should not change the machine state until the outcome of branch is definitely known ( and hence knowing whether this instruction must be executed or not).

One delay slot can be filled with useful instruction about 50% of the time.  The rest can be filled with NOP.

Example  pipeline S1