Future Architecture of microprocessors

The figures are not included here.  Please refer to IEEE Computer, Sept. 1997.

Billion-transistor chips

(excerpt from IEEE Computer, Sept. 97)

Technology forecasting is error-prone.

Trend by year 2010

Architecture Trend

(from top to bottom : hardware becomes more distributed as wire delay dominates, and the programming model departs further from the norm)
  1. advanced super scalar  (16-32 inst./cycle)
  2. super speculative (wide issue, speculating)
  3. simultaneous multi thread (multi task, aggressive pipeline)
  4. trace (multi scalar) (high ILP, separate traces)
  5. vector IRAM (vector + IRAM)
  6. one chip (4-16 processors)
  7. Raw ( > 100 reconfigurable processing elements)

Uniprocessors

Super scalar, super speculative (fine grain), and trace (coarse grain) are
compatible with old binaries.

Advanced super scalar (U. of Michigan)

Super speculative (Carnegie Mellon U.)

Super scalar has diminishing returns.
State-of-the-art processors (DEC Alpha 21264, Silicon Graphics MIPS R10000, PowerPC 604, Intel Pentium Pro) aim for 4 IPC (instructions per cycle) but achieve only 0.5-1.5 on real-world programs.

Superflow : inst. flow, register data flow, memory data flow.
Trace cache : a history-based fetch mechanism that stores dynamic instruction
traces in a cache indexed by fetch address and branch outcome.  Whenever it finds
a suitable trace, it dispatches inst. from the trace cache rather than sequential
inst. from the inst. cache.
Register data flow : detect and resolve inter-inst. dependencies.  Eliminate or
bypass as many dependencies as possible (with mechanisms such as register renaming).
Mem. data flow : minimize average memory latency by predicting load values and addresses.
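
The trace cache lookup described above can be sketched roughly as follows. This is only a toy model, not the design from the article: the names TraceCache, fill, lookup and fetch are hypothetical, the inst. cache is assumed to be a plain dict from address to instruction, and branch outcomes are assumed to be a tuple of taken / not-taken booleans.

```python
class TraceCache:
    """Toy trace cache: maps (fetch address, branch outcomes) to a stored trace."""

    def __init__(self):
        self._lines = {}

    def fill(self, fetch_addr, branch_outcomes, instructions):
        # Store a dynamic instruction trace under its fetch address and the
        # branch outcomes that were observed while the trace executed.
        self._lines[(fetch_addr, tuple(branch_outcomes))] = tuple(instructions)

    def lookup(self, fetch_addr, predicted_outcomes):
        # Return the stored trace, or None when no suitable trace exists.
        return self._lines.get((fetch_addr, tuple(predicted_outcomes)))


def fetch(trace_cache, icache, fetch_addr, predicted_outcomes, width=4):
    """Dispatch from the trace cache on a hit, else fetch sequentially."""
    trace = trace_cache.lookup(fetch_addr, predicted_outcomes)
    if trace is not None:
        return "trace", list(trace)
    # Miss: fall back to sequential instructions from the instruction cache.
    sequential = [icache[a] for a in range(fetch_addr, fetch_addr + width)
                  if a in icache]
    return "icache", sequential
```

A hit returns the whole dynamic trace, which may cross taken branches; a miss returns only the sequential instructions, which is what a conventional fetch unit would supply.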

Prototype (simulation)

Trace Processor (U. of Wisconsin at Madison)

Vector IRAM (U. of California at Berkeley, D. Patterson)

CPU speed improves 60% per year, memory speed 7% per year.  The gap is filled by cache memory.  However, large off-chip caches have limits.  Half the area of the Alpha chip is cache.  Comparing the MIPS R5000 with the MIPS R10000 (out-of-order, speculative) : the R10K has 3.43 times more area but a performance gain of only 1.64 (SPECint95 rating).
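
A small calculation makes these figures concrete. It uses only the numbers quoted above; the function names are hypothetical, and compounding the yearly rates is an assumption about how the 60% and 7% figures should be read.

```python
CPU_GROWTH = 0.60   # CPU speed improvement per year (from the notes)
MEM_GROWTH = 0.07   # memory speed improvement per year (from the notes)


def gap_after(years, cpu=CPU_GROWTH, mem=MEM_GROWTH):
    """Factor by which the CPU/memory speed gap widens after `years`,
    assuming both rates compound annually."""
    return ((1 + cpu) / (1 + mem)) ** years


def perf_per_area(perf_ratio, area_ratio):
    """Relative performance per unit die area of the larger chip."""
    return perf_ratio / area_ratio


if __name__ == "__main__":
    print(f"gap growth per year: {gap_after(1):.2f}x")
    print(f"R10K vs R5000 performance per area: {perf_per_area(1.64, 3.43):.2f}")
```

Under these assumptions the gap widens by roughly 1.5X every year, and the R10K delivers a little under half the performance per unit area of the R5000.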

Intelligent RAM (IRAM) : DRAM can hold 30-50 times more data than cache in the same die area, so it should be treated as main memory.
Advantage :

V-IRAM

Simultaneous multi thread (SMT)

Combines wide-issue super scalar (multiple inst. issued per cycle) with multi thread (hardware state is held for several threads).
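
Why holding state for several threads helps a wide-issue machine can be illustrated with a toy issue-slot count. This is a simplified sketch, not a model from the article: the function names are hypothetical, each thread is described only by how many inst. it has ready in each cycle, and inst. left unissued in one cycle are assumed not to carry over to the next.

```python
def slots_used_single(ready, width):
    """Issue slots filled by one thread on a `width`-issue machine.
    `ready[c]` is the number of ready instructions in cycle c."""
    return sum(min(width, n) for n in ready)


def slots_used_smt(ready_per_thread, width):
    """Issue slots filled when any thread may supply instructions each cycle."""
    return sum(min(width, sum(cycle)) for cycle in zip(*ready_per_thread))


def utilization(used, width, cycles):
    """Fraction of all issue slots that were actually filled."""
    return used / (width * cycles)


if __name__ == "__main__":
    thread_a = [1, 2, 0, 1]   # made-up ready counts, for illustration only
    thread_b = [2, 0, 3, 1]
    width = 4
    alone = slots_used_single(thread_a, width)
    shared = slots_used_smt([thread_a, thread_b], width)
    print(utilization(alone, width, len(thread_a)))
    print(utilization(shared, width, len(thread_a)))
```

With these made-up ready counts, the second thread fills slots the first leaves empty, so more of the issue width is used per cycle.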

One chip multiple processors (Stanford)

advantage :

Raw (MIT)

Prototype : 64 Xilinx 4013 FPGAs (10,000 gates each) at 25 MHz.
Speedup is measured against a Sun SPARC 20/71.
Life benchmark : 600X overall, made up of +32X bit-level operations, +32X parallelism, +22X configurability, -3X slow FPGA clock, -13X communication overhead.
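
The breakdown can be checked by multiplying the gains and dividing by the losses. The sketch below assumes the factors combine multiplicatively (the notes do not say so explicitly); the dictionary and function names are hypothetical.

```python
from math import prod

# Factors quoted in the notes for the Life benchmark.
GAINS = {"bit level": 32, "parallelism": 32, "configurability": 22}
LOSSES = {"slow FPGA clock": 3, "communication overhead": 13}


def net_speedup(gains, losses):
    """Combine the factors, assuming they multiply: product of gains
    divided by product of losses."""
    return prod(gains.values()) / prod(losses.values())


if __name__ == "__main__":
    print(f"net speedup: {net_speedup(GAINS, LOSSES):.0f}X")
```

Under that assumption the product is about 578X, consistent with the rounded 600X figure.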