Evolution of Architecture

Sequential execution
Pipeline (overlapped execution)
Superpipeline
Superscalar
Vector

Sequential execution
I1, I2, I3

Overlap execution (pipeline)
Fetch, Decode, Execute
       Fetch, Decode, Execute
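
A rough way to see the gain from overlapping is to count cycles. The sketch below assumes every stage takes exactly one cycle and that there are no stalls or conflicts; both are simplifications, and the function names are illustrative only.

```python
def sequential_cycles(n_instr: int, n_stages: int) -> int:
    """Each instruction finishes all of its stages before the next one starts."""
    return n_instr * n_stages


def pipelined_cycles(n_instr: int, n_stages: int) -> int:
    """The first instruction fills the pipeline, then one completes per cycle."""
    if n_instr == 0:
        return 0
    return n_stages + (n_instr - 1)


if __name__ == "__main__":
    n, k = 100, 3  # 100 instructions; stages are Fetch, Decode, Execute
    seq = sequential_cycles(n, k)
    pipe = pipelined_cycles(n, k)
    print(f"sequential: {seq} cycles, CPI = {seq / n:.2f}")
    print(f"pipelined:  {pipe} cycles, CPI = {pipe / n:.2f}")
```

Under these assumptions the pipelined CPI approaches 1 as the number of instructions grows, because the fill time of the pipeline is paid only once.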

Because memory was slow, early machines were fetch-limited: the fetch portion took longer than decode and execute. To increase performance, designers made each instruction do "more" during execution.
------ Fetch------- Dec1 Exe1 Dec2 Exe2
-------Fetch------- Dec1 Exe1 Dec2 Exe2

As a result, CPI is large, and cycle time is long because complex circuits are required to execute complex instructions. The chance of conflicts in the pipeline also increases: one instruction stays in the pipeline for a long time, so it can interfere with other instructions.

The invention of cache memory reduced fetch time greatly. Current designs concentrate on reducing CPI and cycle time. By simplifying the execution of each instruction (and the ISA), the pipeline becomes more effective and the circuits can be simpler and faster.
Fetch, decode, execute, writeback
       Fetch, decode, execute, writeback
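
The trade-off between CPI and cycle time can be made concrete with the usual performance relation: execution time = instruction count x CPI x cycle time. The sketch below applies it to two designs. All the numbers are hypothetical, chosen only to illustrate the reasoning; they are not measurements of any real machine.

```python
def execution_time_ns(instr_count: int, cpi: float, cycle_time_ns: float) -> float:
    """Execution time = instruction count x cycles per instruction x cycle time."""
    return instr_count * cpi * cycle_time_ns


if __name__ == "__main__":
    # Hypothetical complex-instruction design: fewer instructions,
    # but a large CPI and a long cycle.
    complex_ns = execution_time_ns(1_000_000, cpi=6.0, cycle_time_ns=10.0)

    # Hypothetical simple-instruction design: more instructions,
    # but CPI close to 1 and a shorter cycle.
    simple_ns = execution_time_ns(1_500_000, cpi=1.2, cycle_time_ns=4.0)

    print(f"complex design: {complex_ns / 1e6:.1f} ms")
    print(f"simple design:  {simple_ns / 1e6:.1f} ms")
    print(f"speedup:        {complex_ns / simple_ns:.1f}x")
```

With these assumed figures the simpler design wins even though it executes more instructions, because both CPI and cycle time shrink.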

Superpipeline
Once pipelining enables CPI to approach 1, the only way to increase speed is to reduce cycle time. To make this possible, the pipeline is divided into finer-grained stages, which reduces the clock time of each stage. This idea is called "superpipeline".

Fet1, fet2, dec1, dec2, wrt1, wrt2
      Fet1, fet2, dec1, dec2, wrt1, wrt2
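
The effect of finer stages on total time can be sketched as follows. The model assumes each base stage is split into `split` equal sub-stages, that there are no stalls, and that each stage pays a fixed latch overhead; the overhead parameter and all numbers are assumptions added for illustration, not part of the notes above.

```python
def superpipelined_time_ns(
    n_instr: int,
    base_stages: int,
    base_cycle_ns: float,
    split: int,
    latch_overhead_ns: float = 0.0,
) -> float:
    """Total time when each base stage is divided into `split` sub-stages."""
    if n_instr == 0:
        return 0.0
    stages = base_stages * split                             # deeper pipeline
    cycle_ns = base_cycle_ns / split + latch_overhead_ns     # shorter clock
    cycles = stages + (n_instr - 1)                          # CPI still tends to 1
    return cycles * cycle_ns


if __name__ == "__main__":
    for split in (1, 2, 4):
        t = superpipelined_time_ns(1000, base_stages=3, base_cycle_ns=6.0,
                                   split=split, latch_overhead_ns=0.5)
        print(f"split = {split}: {t:.0f} ns")
```

In this model CPI stays near 1 while the cycle time falls, but the fixed overhead per stage means the gain from each further split gets smaller.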

Superscalar
To increase performance further, we need to issue more than one instruction per clock cycle. This is called "superscalar".
   Fetch, decode, execute, writeback
   Fetch, decode, execute, writeback
          Fetch, decode, execute, writeback
          Fetch, decode, execute, writeback
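
A minimal cycle count for the diagram above, assuming instructions are issued in groups of `issue_width`, every stage takes one cycle, and no instruction depends on another (an idealization; real programs rarely allow this). The function name is illustrative.

```python
import math


def superscalar_cycles(n_instr: int, n_stages: int, issue_width: int) -> int:
    """Cycles needed when `issue_width` instructions enter the pipeline per clock."""
    if n_instr == 0:
        return 0
    groups = math.ceil(n_instr / issue_width)  # number of issue groups
    return n_stages + (groups - 1)             # fill once, then one group per cycle


if __name__ == "__main__":
    n, k = 100, 4  # Fetch, decode, execute, writeback
    for width in (1, 2, 4):
        cycles = superscalar_cycles(n, k, width)
        print(f"issue width {width}: {cycles} cycles, CPI = {cycles / n:.2f}")
```

Under these assumptions CPI tends toward 1 / issue_width, which is how a superscalar machine gets below CPI = 1.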

Of course, a superpipelined superscalar design is also possible.
   Fet1, fet2, dec1, dec2, wrt1, wrt2
   Fet1, fet2, dec1, dec2, wrt1, wrt2
         Fet1, fet2, dec1, dec2, wrt1, wrt2
         Fet1, fet2, dec1, dec2, wrt1, wrt2
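
Timing diagrams like the ones in these notes can be generated for any combination of stage names and issue width. This is only a drawing aid under the same idealized assumptions (one cycle per stage, no stalls); `timing_diagram` is a hypothetical helper name.

```python
def timing_diagram(stage_names, n_instr, issue_width=1, col=6):
    """Return one text line per instruction; each column is one clock cycle."""
    lines = []
    for i in range(n_instr):
        start = i // issue_width               # cycle in which instruction i is fetched
        cells = [""] * start + list(stage_names)
        lines.append("".join(c.ljust(col) for c in cells).rstrip())
    return lines


if __name__ == "__main__":
    stages = ["Fet1", "fet2", "dec1", "dec2", "wrt1", "wrt2"]
    print("superpipeline, one instruction per clock:")
    print("\n".join(timing_diagram(stages, 4, issue_width=1)))
    print("superpipeline + superscalar, two instructions per clock:")
    print("\n".join(timing_diagram(stages, 4, issue_width=2)))
```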

Summary

- Sequential (non-overlapped execution)
- Pipeline (overlapped execution): CPI --> 1
    instruction pipeline (single step)
    floating-point pipeline (multi step)

Scoreboard and Tomasulo methods are hardware techniques that enable dynamic execution, in which instructions can be rearranged by hardware to execute according to the resources available.
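
The idea of dynamic execution can be illustrated with a toy scheduler. It is neither a scoreboard nor Tomasulo's algorithm. It assumes one instruction issues per cycle, each register is written only once (so only true data dependences matter), and functional units are unlimited. The program and latencies are made up for the example.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    name: str
    dest: str
    srcs: tuple
    latency: int  # cycles from issue until the result is available


def schedule(program, in_order):
    """Return (issue order, cycle at which the last result is available)."""
    # producers[i] = indices of earlier instructions that write i's source registers
    producers, last_writer = [], {}
    for i, ins in enumerate(program):
        producers.append([last_writer[s] for s in ins.srcs if s in last_writer])
        last_writer[ins.dest] = i

    done_at = {}                      # instruction index -> cycle its result is ready
    issued = []
    waiting = list(range(len(program)))
    cycle = 0
    while waiting:
        # In-order issue may only look at the oldest waiting instruction;
        # dynamic issue may pick any waiting instruction whose operands are ready.
        candidates = waiting[:1] if in_order else list(waiting)
        for i in candidates:
            if all(p in done_at and done_at[p] <= cycle for p in producers[i]):
                done_at[i] = cycle + program[i].latency
                issued.append(program[i].name)
                waiting.remove(i)
                break
        cycle += 1
    return issued, max(done_at.values(), default=0)


PROGRAM = [
    Instr("I1", "r1", (), 3),         # slow load
    Instr("I2", "r2", ("r1",), 1),    # needs the result of I1
    Instr("I3", "r3", (), 3),         # independent slow load
    Instr("I4", "r4", ("r3",), 1),    # needs the result of I3
]

if __name__ == "__main__":
    print("in-order:", schedule(PROGRAM, in_order=True))
    print("dynamic: ", schedule(PROGRAM, in_order=False))
```

In this toy run the dynamic version issues I3 while I2 is still waiting for I1, so the second load overlaps with the first instead of queuing behind it.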

- Superpipeline: CPI = 1, reduced cycle time (higher clock rate)
- Superscalar: CPI < 1
- Vector machines reduce fetch time and make the pipeline more effective, but their use is restricted to the class of programs that fit vector computation.
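
One way to see why vector machines reduce fetch time is to count instruction fetches for a loop over an array. The sketch assumes a scalar loop fetches a fixed number of instructions per element, while a vector machine fetches a similar number per strip of `vector_len` elements. The instruction counts and vector length are hypothetical.

```python
import math


def scalar_fetches(n_elems: int, instr_per_iter: int) -> int:
    """A scalar loop fetches its whole body once for every element."""
    return n_elems * instr_per_iter


def vector_fetches(n_elems: int, vector_len: int, instr_per_strip: int) -> int:
    """A vector loop fetches its body once per strip of `vector_len` elements."""
    strips = math.ceil(n_elems / vector_len)
    return strips * instr_per_strip


if __name__ == "__main__":
    n = 1024
    s = scalar_fetches(n, instr_per_iter=6)
    v = vector_fetches(n, vector_len=64, instr_per_strip=6)
    print(f"scalar loop: {s} instruction fetches")
    print(f"vector loop: {v} instruction fetches")
```

The saving only appears when the computation really is the same operation applied across many elements, which is the restriction noted in the last bullet.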