Future Architecture of microprocessors

trend by year 2010
multi-media workload
architectural trend
    trace cache
    hybrid branch prediction
advanced superscalar
super speculative
simultaneous multi-thread
trace processor
vector IRAM
one chip multi processor
RAW
conclusion

Billion transistors chip  (mostly  from IEEE Computer, Sept. 97)

Forecasting technology is full of errors.

Trend by year 2010

Wire delay becomes dominant, forcing hardware to be more distributed (limits any centralized clock).  A compiler is getting better at exploiting parallelism hence, it can be rely on to build the future architecture.  Workloads contain more parallelism (multi-media workload).  The design and validation cost become more limiting factor.

Multi-media workload

  1. vectors of packed 8- 16- 32- or FP size of operand
  2. fine grain data parallelism : many elements with the same processing (filtering, transformation), SIMD (vector), different from superscalar
  3. coarse grain parallelism : multi-thread : encode/decode video audio simultaneously, temporal/spatial parallel
  4. high instruction reference locality to small loops : hand optimized is practical
  5. high memory bandwidth : cache not effective, requires data prefetch and cache bypass
  6. high network bandwidth : ISA towards real-time bitstream processing

Architecture Trend

Range of different architecture are from a more traditional design : scaling up of today superscalar to a non-conventional design : using reconfigurable computing arrays.  The more traditional designs employ some recent advance concept such as trace cache and multiple branch predictors to improve the performance of instuction issued.  The non-conventional designs rely on the combination of compiler and hardware to extract more parallelism from programs.

(from top to bottom : hardware more distributed as wire delay dominate, and programming model is furthur depart from the norm)

Grouping by concept will be :
  1. advanced super scalar, super speculative  :  use wider instruction issues
  2. SMT, Trace : exploit multi-thread
  3. V-IRAM :  incorporate on-chip memory, use vector units
  4. CMP : multi processors on a chip
  5. RAW : reconfigurable hardware

Trace cache

is a I cache, the main purpose is to fetch past a taken branch.  The trace cache is access using the starting address of the next block of instructions.  It stores logically continuous instructions in physically continuous storage.  A cache line stores a segment of the dynamic instruction trace -- upto an issue width -- across multiple taken branches.

Figure  The effect of trace cache continue to gain when the size is increased

Hybrid branch prediction

Instead of using one kind of branch predictor, using several predictors.  Each targeting different classes of branches.  It has advantage that the warm-up time will be shorter (as each predictor is smaller), hence it reacts to change in a program faster.  However, with the same amount of resource, several predictors will each be limited in size.  Sometime the performance depends on critical size of the buffer of predictor.

Advanced super scalar (U. of Michigan)

The main idea is :  bigger, faster, wider (instruction issued).  The key problems are
  1. instruction supply
  2. data memory supply
  3. implementable execution core
The solution :
to improve instruction supply :out-of-order fetch, multi hybrid branch predictor and trace cache
to improve data supply : replicate first level cache, huge on chip cache and data speculation
It will required a large out-of-order instruction window (2000) and banks of functional units.

Figure  The advanced super scalar

Super Speculative (Carnegie Mellon University)

super scalar has diminishing return.
state of the art processors : DEC Alpha 21264, Silicon Graphics MIPS R 10000, PowerPC 604, Intel Pentium Pro, aims 4 IPC , achieve 0.5-1.5 for real world programs.

Scaling a conventional superscalar will have a limit sustained IPC (see fig the bars are 4-, 8-, 16-, 32- issues).  Superspeculative attained the sustained IPC ~ 10 for non-numerical programs.

Figure  Conventional superscalar don't scale up vs gain from superspeculative

Superflow :
instruction flow -- rate of useful instruction fetched, decoded and dispatched
register data flow -- rate of which the results are produced
memory data flow -- rate of data store/retrieve from memory

Instruction flow is improved by using a trace cache.  In a trace cache : history based fetch mechanism, stores dynamic-instruction trace in a cache indexed by fetch address and branch outcome.  Whenever it finds a suitable trace, it dispatches inst. from the trace cache rather than sequential inst. from the inst. cache.
Register data flow : detect and resolve inter-inst. dependency .  Eliminate and
bypass as many dependencies as possible (mechanism such as register renaming).
Mem. data flow : minimize average memory latency.  Prediction of load value, address.

Figure Superflow architecture

Prototype (simulation)

Figure  performance : each bar  prefect cache with unlimited ports, 64K cache with unlimited ports, 64K with 8 ports, 64K with 4 ports, 64K with 2 ports)   The cache with 64K 4 ports seems to be adequate.

Trace Processor (U. of Wisconsin at Madison)

Use multiple, distributed on-chip cores.  Each simultaneously executes  a different trace.  One core executes traces speculatively. microarchitecture
instruction preprocessing : form a trace, check data dependency, reordering, get resource.
trace cache
next trace prediction :  predict multiple branches per cycle
instruction dispatch :  register renaming, predict trace's input data values, data speculation

Figure  each processor issues 4 inst/cycle.  four pro. issue 16 inst/cycle, no benchmark result

Vector IRAM  ( U. of California at Berkeley, D. Patterson)

CPU speed up 60% per year, memory speed up 7% per year.  The gap is filled by cache memory.  However large off chip and cache  has limit. Half area of Alpha chip is cache.  MIPS R5000 compared to MIPS R10000  (out of order speculative)  R10K has 3.43 times more area but performance gain only 1.64 (SpecInt95 rating)

The fact that DRAM can accommodate 30 to 50 times more data than the same chip area devoted to caches (SRAM) so it should be treated as main memory..  On chip memory can support high bandwidth (50 - 200 times) , low latency (5 to 10 times) by using a wide interface and eliminating the delay of pads and buses.

speed up by vectorization

SPECint95  42%
m88ksim  36%
PGP  vector outperforms superscalar, vector uses only 10% die area.
Advantage : V-IRAM

Figure Vector I-RAM

Simultaneous multi thread (SMT)

Augment wide superscalar to execute instructions from multiple threads on control concurrently, dynamically selecting and executing instructions from many active threads simultaneously.  To run multi-thread, it is required to save process states, e.g. Program counter, Registers.  There are 8 duplicated unit for PC and Registers.

Figure SMT

One chip multiple processors CMP (Stanford)

multiple (16) simple, fast processor.  Each couples to a small, fast level-one cache.  All processors share a larger level-two cache. exploiting parallelism
  1. in a single application : instruction level parallelism, data-independent loop iteration : loop level parallelism. These two factors :  instruction window size
  2. thread-level parallelism  : A compile can simulate a single large instruction window by multiple smaller instructio windows -- one for each tread.
  3. process-level parallelism (coarse grain) : independent applications  running in independent processes controlled by OS.
advantage : disadvantage
it is slower than SMT when running code that cannot be multithreadd, because only one processor can be targeted to the task.


Superscalar


CMP

Figure Comparing superscalar, SMT and CMP under similar resource

Figure Performance comparison

Raw (MIT)

Highly parallel, hundreds of very simple processors coupled with small on chip memory. Each processor has a small bank of configurable logic.  It does not use ISA.  A program is compiled to hardware.  The compiler schedules communication.  Limit by compiler.  Whether this type of architecture is effective for the future workload is remained an open question.

Figure Comparing RAW (a) with superscalar (b) and multi processors (c)

Prototype : 64 Xilinx 4013 FPGA (10,000 gates each) 25 MHz
speed up compared to Sun Sparc 20/71
Life  600 X  :  +32X bit level, +32X parallelism, + 22X configurability , -3X slow FPGA clock, -13X communication overhead.

Benchmark
Hardware prototype 25 MHz (Xilinx 4013) compares to software executing on 2.82 SPECint95 SparcStation (Sparc 20/71)
 
Benchmark data width (bits) no. of elements speed up over sw
binary heap 32 15 1.26
bubble sort  32 64 7
DES encryption 64 4 7
integer FFT  3 4 9
Jacobi  16x16 8 256 230
Conway's life 1 1024 597
integer matrix mul 16 16 90
merge sort 32 14 2.6
n queens 1 16 3.96
single-source shortest path 16 16 10
multiplicative shortest path 16 16 14
transitive closure 1 512 398

Conclusion

The rate of progress is very fast, the radical models of architecture that will be dismissed a few years ago are now feasible.  It is interesting to explore the trends that will affect future architectures and the space of these architectures.  On-chip transistor budgets will soon allow virtually anything to be implemented -- the limit will be mainly the designer's imagination.