Future Architecture of Microprocessors
I don't have time to put the figures in. Please refer to IEEE
Computer Sept. 1997.
Billion-Transistor Chips
(excerpt from IEEE Computer, Sept. 97)
Technology forecasting is full of errors.
- "Everything that can be invented has been invented." US Commissioner of Patents, 1899.
- "I think there is a world market for about five computers." Thomas J. Watson Sr., IBM founder, 1943.
- "There is no reason for any individual to have a computer in their home." Ken Olsen, CEO of Digital Equipment Corp., 1977.
- "The current rate of progress can't continue much longer. We're facing fundamental problems that we didn't have to deal with before." Various computer technologists, 1995-1997.
Trends by the year 2010
- The chip: 800 M tr., 1,000 pins, a 1,000-bit bus, 2 GHz clock, 180 W power.
- On-chip wires are much slower than logic gates. A signal crossing the chip may require 20 clocks.
- Design, verification, and testing will consume a large percentage of cost.
- At Intel, validation and test is 40-50% of design cost, and built-in test logic is about 6% of the transistors.
- A fabrication facility now costs $2 billion, 10X more than a decade ago.
Architecture Trends
(from top to bottom: hardware becomes more distributed as wire delay dominates, and the programming model departs further from the norm)
- advanced superscalar (16-32 inst./cycle)
- superspeculative (wide issue with speculation)
- simultaneous multithreading (multitasking, aggressive pipelining)
- trace/multiscalar (high ILP from separate traces)
- vector IRAM (vector unit + IRAM)
- one-chip multiprocessor (4-16 processors)
- Raw (> 100 reconfigurable processing elements)
Uniprocessors
Superscalar, superspeculative (fine grain), and trace (coarse grain) are compatible with old binaries.
Advanced superscalar (U. of Michigan)
- uniprocessor
- large trace cache
- large number of reservation stations
- large number of pipelined functional units
- sufficient on-chip data cache
- sufficient resolution and forwarding logic
- 16-32 inst. per cycle
- reservation stations holding 2,000 inst.
- 24-48 pipelined functional units
- most important: instruction bandwidth, memory bandwidth, and latency
Superspeculative (Carnegie Mellon U.)
Superscalar has diminishing returns. State-of-the-art processors (DEC Alpha 21264, Silicon Graphics MIPS R10000, PowerPC 604, Intel Pentium Pro) aim for 4 IPC but achieve only 0.5-1.5 on real-world programs.
Superflow: instruction flow, register data flow, memory data flow.
Trace cache: a history-based fetch mechanism that stores dynamic instruction traces in a cache indexed by fetch address and branch outcome. Whenever it finds a suitable trace, it dispatches inst. from the trace cache rather than sequential inst. from the instruction cache.
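A minimal sketch of that lookup structure in Python (the class, addresses, and instruction strings are invented for illustration; a real trace cache is a hardware array, not a dictionary):

```python
# Minimal sketch of a trace cache: traces are keyed by their starting fetch
# address plus the outcomes of the branches inside them, so one start address
# can map to several different dynamic traces.
class TraceCache:
    def __init__(self):
        # key: (fetch address, tuple of branch outcomes inside the trace)
        self.traces = {}

    def insert(self, fetch_addr, branch_outcomes, instructions):
        self.traces[(fetch_addr, branch_outcomes)] = instructions

    def lookup(self, fetch_addr, predicted_outcomes):
        # Hit: the whole dynamic trace is dispatched in one step, instead of
        # fetching sequential lines from the instruction cache.
        return self.traces.get((fetch_addr, predicted_outcomes))

tc = TraceCache()
trace = ["ld r1,0(r2)", "beq r1,L1", "add r3,r3,1", "bne r3,L2"]
tc.insert(0x400, (True, False), trace)      # outcomes: taken, not taken
hit = tc.lookup(0x400, (True, False))       # same path -> the whole trace
miss = tc.lookup(0x400, (False, False))     # different path -> None
```

The key point is the compound index: fetch address alone is not enough, because the instructions that follow a fetch address depend on which way its branches went.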
Register data flow: detect and resolve inter-instruction dependencies, and eliminate or bypass as many of them as possible (through mechanisms such as register renaming).
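A simplified sketch of register renaming (register counts and naming are invented; real hardware uses a rename table plus a free list in the decode stage):

```python
# Illustrative register renaming: each write allocates a fresh physical
# register, so false (WAW/WAR) dependencies on the same architectural
# register disappear and only true data dependencies remain.
def rename(instructions, num_arch_regs=8):
    table = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}  # arch -> phys
    next_phys = num_arch_regs
    renamed = []
    for dest, src1, src2 in instructions:
        s1, s2 = table[src1], table[src2]   # sources read current mappings
        table[dest] = f"p{next_phys}"       # dest gets a fresh physical reg
        next_phys += 1
        renamed.append((table[dest], s1, s2))
    return renamed

# Two back-to-back writes to r1 go to different physical registers,
# so both instructions can issue out of order:
print(rename([("r1", "r2", "r3"), ("r1", "r4", "r5")]))
# -> [('p8', 'p2', 'p3'), ('p9', 'p4', 'p5')]
```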
Memory data flow: minimize average memory latency through prediction of load values and addresses.
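One simple form of load value prediction is a last-value predictor, sketched below (illustrative only; the notes do not say which predictor the CMU design uses):

```python
# Last-value load predictor sketch: guess that a load returns the same value
# it returned last time, let dependent instructions run speculatively, and
# verify when the real value arrives.
class LastValuePredictor:
    def __init__(self):
        self.last = {}              # load PC -> last value seen
        self.hits = self.misses = 0

    def predict(self, pc):
        return self.last.get(pc)    # None on a cold miss

    def update(self, pc, actual):
        if self.last.get(pc) == actual:
            self.hits += 1          # speculation was correct
        else:
            self.misses += 1        # mispredict: dependents must re-execute
        self.last[pc] = actual

lvp = LastValuePredictor()
for value in [7, 7, 7, 9]:          # a load that usually returns 7
    guess = lvp.predict(0x400)
    lvp.update(0x400, value)
print(lvp.hits, lvp.misses)         # -> 2 2
```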
Prototype (simulation)
- fetch width: 32 inst.
- reorder buffer: 128 entries
- 64 KB 4-way set-associative cache
- 128 KB fully associative store queue
Trace Processor (U. of Wisconsin at Madison)
- make parallel inst. more visible
- dynamically partition hierarchical parallelism
- incorporate speculation for both control and data
Vector IRAM (U. of California at Berkeley, D. Patterson)
CPU speed improves 60% per year, memory speed only 7% per year. The gap is filled by cache memory, but large off-chip caches have limits. Half the area of the Alpha chip is cache. Compared with the MIPS R5000, the MIPS R10000 (out-of-order, speculative) has 3.43 times the area but only 1.64 times the performance (SPECint95 rating).
Intelligent RAM: DRAM can accommodate 30-50 times more data than cache in the same die area, so it should be treated as main memory.
Advantages:
- higher memory bandwidth
- reduced energy (less driving of high-capacitance off-chip buses)
- fewer memory pins, so more pins can be devoted to I/O (higher I/O bandwidth)
- on-chip memory can reduce processor-memory latency 5-10 times and increase bandwidth 5-20 times
V-IRAM
- vector unit: two load pipes, one store pipe, two ALUs, 8 lanes x 64 bits
- pipelines running at 1 GHz
- peak performance: 16 GFLOPS
- DRAM latency 20 ns, cycle time 4 ns; will meet the 192 Gbyte/s demand from the vector unit
- scalar core: dual issue, with first-level I & D caches
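A quick back-of-the-envelope check, assuming the vector unit is 8 lanes of 64 bits each at 1 GHz and every memory pipe moves a full vector group per cycle (with that reading, the stated 192 Gbyte/s and 16 GFLOPS both come out exactly):

```python
# Consistency check of the V-IRAM figures listed above.
lanes, bits, ghz = 8, 64, 1.0
bytes_per_pipe = lanes * bits // 8 * ghz * 1e9  # one memory pipe, bytes/s

mem_pipes = 3                                   # two load pipes + one store
bandwidth = mem_pipes * bytes_per_pipe          # 192e9 bytes/s = 192 Gbyte/s

alus = 2
gflops = alus * lanes * ghz                     # 16.0 GFLOPS peak
print(bandwidth / 1e9, gflops)                  # -> 192.0 16.0
```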
Simultaneous multithreading (SMT)
Combines a wide-issue superscalar (multiple issue) with multithreading (hardware state held for several threads).
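A toy sketch of the idea, assuming a simple round-robin slot-filling policy (invented for illustration; real SMT designs use smarter fetch heuristics):

```python
# SMT issue sketch: each cycle the shared issue slots are filled from
# whichever threads still have instructions, so one stalled or narrow
# thread does not leave the wide machine idle.
def smt_issue(threads, slots_per_cycle=4):
    queues = [list(t) for t in threads]   # per-thread instruction queues
    schedule = []                         # one list of issued insts per cycle
    while any(queues):
        cycle, progress = [], True
        while len(cycle) < slots_per_cycle and progress:
            progress = False
            for q in queues:              # at most one inst per thread per scan
                if q and len(cycle) < slots_per_cycle:
                    cycle.append(q.pop(0))
                    progress = True
        schedule.append(cycle)
    return schedule

# Two threads share a 4-wide core that neither could fill alone:
print(smt_issue([["A0", "A1", "A2"], ["B0", "B1", "B2"]]))
# -> [['A0', 'B0', 'A1', 'B1'], ['A2', 'B2']]
```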
One-chip multiprocessor (Stanford)
- exploits parallelism at several levels: ILP, TLP, and PLP (process level)
- multiple-issue parallelism is limited by instruction window size
- chip layout will affect architecture: avoid long wires
Advantages:
- area/complexity grows linearly with the number of processors, vs. quadratically with issue width in a superscalar
- shorter cycle time, because of short wires, no bus switching, etc.
- easier (faster) to design, verify, and test
- distributed caches lower the demand on memory bandwidth
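The linear-vs-quadratic point can be made concrete with a crude cost model (the quadratic cost function is my illustrative assumption, standing in for all-to-all bypass and issue-selection paths):

```python
# Crude area/complexity model: the bypass/issue logic of a wide superscalar
# grows roughly with the square of its issue width, while a chip
# multiprocessor adds cores linearly.
def superscalar_cost(issue_width):
    return issue_width ** 2          # all-to-all paths among issue slots

def cmp_cost(cores, issue_width_per_core):
    return cores * superscalar_cost(issue_width_per_core)

# Same total of 16 issue slots, very different complexity:
print(superscalar_cost(16))          # one 16-wide core  -> 256
print(cmp_cost(4, 4))                # four 4-wide cores -> 64
```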
Raw (MIT)
- replicated tiles; the interconnect is synchronous and direct, with short latency
- static scheduling: operands are available when needed, eliminating explicit synchronization
- each tile supports multigranular (bit-, byte-, and word-level) operations
1 billion tr. can make 128 tiles (5 M tr. each), each with:
- 16 KB instruction RAM (static RAM)
- 16 KB switch memory (static RAM)
- 32 KB first-level data memory (static RAM)
- 128 KB DRAM
- interconnect: 30% of area
- switch: single-cycle message injection and receive operations; communication at nearly the same speed as a register read
- switch control: sequences routing instructions
- configurability: at a coarser grain than an FPGA, the compiler can create customized instructions without long software sequences (for example, in the Game of Life a custom instruction reduces 22 cycles to one)
- the compiler is complex
- N tiles can serve as a collection of functional units for exploiting ILP
Prototype: 64 Xilinx 4013 FPGAs (10,000 gates each) at 25 MHz.
Speedup on Life compared with a Sun SPARC 20/71: 600X, from +32X bit-level operations, +32X parallelism, and +22X configurability, less -3X for the slow FPGA clock and -13X for communication overhead.
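The bit-level factor in the Life speedup comes from operating on many cells per word at once. The notes do not describe Raw's actual custom instruction or grid layout, but the flavor can be sketched in plain Python: pack each grid row into one integer and count every cell's neighbors simultaneously with bitwise full adders.

```python
# Bit-parallel Game of Life step: each row is one integer, bit i = column i.
# Bitwise adders sum the 8 neighbor words so a whole row updates at once.
def life_step(grid, width):
    mask = (1 << width) - 1

    def add2(a, b):                 # bitwise half adder: (sum, carry)
        return a ^ b, a & b

    def add3(a, b, c):              # bitwise full adder: (sum, carry)
        return a ^ b ^ c, (a & b) | (c & (a ^ b))

    out = []
    for r, row in enumerate(grid):
        up = grid[r - 1] if r > 0 else 0
        dn = grid[r + 1] if r + 1 < len(grid) else 0
        nbs = [(up << 1) & mask, up, up >> 1,       # three neighbors above
               (row << 1) & mask, row >> 1,         # left and right
               (dn << 1) & mask, dn, dn >> 1]       # three neighbors below
        s0, c0 = add3(*nbs[0:3])
        s1, c1 = add3(*nbs[3:6])
        s2, c2 = add2(*nbs[6:8])
        b0, t = add3(s0, s1, s2)    # bit 0 of the per-cell neighbor count
        u0, u1 = add3(c0, c1, c2)
        b1, v1 = add2(u0, t)        # bit 1
        b2 = u1 ^ v1                # bit 2
        b3 = u1 & v1                # bit 3 (only set when the count is 8)
        # Next state: exactly 3 neighbors, or alive with exactly 2.
        out.append(~b3 & ~b2 & b1 & (b0 | row) & mask)
    return out

# A blinker oscillates between a horizontal and a vertical bar of 3 cells:
blinker = [0b00000, 0b00000, 0b01110, 0b00000, 0b00000]
once = life_step(blinker, 5)        # -> vertical bar in column 2
twice = life_step(once, 5)          # -> the original horizontal blinker
```

With 64-bit words this updates 64 cells per word operation; Raw's custom instruction pushes the same idea down into configurable logic.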