Future Architecture of Microprocessors
I don't have time to put the figures in. Please refer to IEEE
Computer Sept. 1997.
Billion-Transistor Chips
(excerpt from IEEE Computer, Sept. 97)
Technology forecasting is full of errors.
- "Everything that can be invented has been invented." US Commissioner of Patents, 1899.
- "I think there is a world market for about five computers." Thomas J. Watson Sr., IBM founder, 1943.
- "There is no reason for any individual to have a computer in their home." Ken Olsen, CEO of Digital Equipment Corp., 1977.
- "The current rate of progress can't continue much longer. We're facing fundamental problems that we didn't have to deal with before." Various computer technologists, 1995-1997.
Trends by the year 2010
- The chip: 800 M tr., 1,000 pins, a 1,000-bit bus, 2 GHz clock, 180 W power.
- On-chip wires are much slower than logic gates. A signal crossing the chip may require 20 clocks.
- Design, verification, and testing will consume a large percentage of cost.
- At Intel, validation and test is 40-50% of design cost, and built-in test logic is about 6% of the transistors.
- A fabrication facility now costs $2 billion, 10X more than a decade ago.
Architecture Trends
(from top to bottom: hardware becomes more distributed as wire delay dominates, and the programming model departs further from the norm)
- advanced superscalar (16-32 inst./cycle)
- superspeculative (wide issue with speculation)
- simultaneous multithreading (multitasking, aggressive pipelining)
- trace/multiscalar (high ILP from separate traces)
- vector IRAM (vector unit + IRAM)
- one-chip multiprocessor (4-16 processors)
- Raw (> 100 reconfigurable processing elements)
Uniprocessors
Superscalar, superspeculative (fine grain), and trace (coarse grain) are compatible with old binaries.
Advanced superscalar (U. of Michigan)
- uniprocessor
- large trace cache
- large number of reservation stations
- large number of pipelined functional units
- sufficient on-chip data cache
- sufficient resolution and forwarding logic
- 16-32 inst. per cycle
- reservation stations holding 2,000 inst.
- 24-48 pipelined functional units
- most important: instruction bandwidth, memory bandwidth, and latency
Superspeculative (Carnegie Mellon U.)
Superscalar has diminishing returns. State-of-the-art processors (DEC Alpha 21264, Silicon Graphics MIPS R10000, PowerPC 604, Intel Pentium Pro) aim for 4 IPC but achieve only 0.5-1.5 on real-world programs.
Superflow: instruction flow, register data flow, memory data flow.
Trace cache: a history-based fetch mechanism that stores dynamic instruction traces in a cache indexed by fetch address and branch outcome. Whenever it finds a suitable trace, it dispatches inst. from the trace cache rather than sequential inst. from the instruction cache.
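A minimal sketch of that lookup structure in Python (the class, addresses, and instruction strings are invented for illustration; a real trace cache is a hardware array, not a dictionary):

```python
# Minimal sketch of a trace cache: traces are keyed by their starting fetch
# address plus the outcomes of the branches inside them, so one start address
# can map to several different dynamic traces.
class TraceCache:
    def __init__(self):
        # key: (fetch address, tuple of branch outcomes inside the trace)
        self.traces = {}

    def insert(self, fetch_addr, branch_outcomes, instructions):
        self.traces[(fetch_addr, branch_outcomes)] = instructions

    def lookup(self, fetch_addr, predicted_outcomes):
        # Hit: the whole dynamic trace is dispatched in one step, instead of
        # fetching sequential lines from the instruction cache.
        return self.traces.get((fetch_addr, predicted_outcomes))

tc = TraceCache()
trace = ["ld r1,0(r2)", "beq r1,L1", "add r3,r3,1", "bne r3,L2"]
tc.insert(0x400, (True, False), trace)      # outcomes: taken, not taken
hit = tc.lookup(0x400, (True, False))       # same path -> the whole trace
miss = tc.lookup(0x400, (False, False))     # different path -> None
```

The key point is the compound index: fetch address alone is not enough, because the instructions that follow a fetch address depend on which way its branches went.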
Register data flow: detect and resolve inter-instruction dependencies, and eliminate or bypass as many of them as possible (through mechanisms such as register renaming).
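A simplified sketch of register renaming (register counts and naming are invented; real hardware uses a rename table plus a free list in the decode stage):

```python
# Illustrative register renaming: each write allocates a fresh physical
# register, so false (WAW/WAR) dependencies on the same architectural
# register disappear and only true data dependencies remain.
def rename(instructions, num_arch_regs=8):
    table = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}  # arch -> phys
    next_phys = num_arch_regs
    renamed = []
    for dest, src1, src2 in instructions:
        s1, s2 = table[src1], table[src2]   # sources read current mappings
        table[dest] = f"p{next_phys}"       # dest gets a fresh physical reg
        next_phys += 1
        renamed.append((table[dest], s1, s2))
    return renamed

# Two back-to-back writes to r1 go to different physical registers,
# so both instructions can issue out of order:
print(rename([("r1", "r2", "r3"), ("r1", "r4", "r5")]))
# -> [('p8', 'p2', 'p3'), ('p9', 'p4', 'p5')]
```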
Memory data flow: minimize average memory latency through prediction of load values and addresses.
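One simple form of load value prediction is a last-value predictor, sketched below (illustrative only; the notes do not say which predictor the CMU design uses):

```python
# Last-value load predictor sketch: guess that a load returns the same value
# it returned last time, let dependent instructions run speculatively, and
# verify when the real value arrives.
class LastValuePredictor:
    def __init__(self):
        self.last = {}              # load PC -> last value seen
        self.hits = self.misses = 0

    def predict(self, pc):
        return self.last.get(pc)    # None on a cold miss

    def update(self, pc, actual):
        if self.last.get(pc) == actual:
            self.hits += 1          # speculation was correct
        else:
            self.misses += 1        # mispredict: dependents must re-execute
        self.last[pc] = actual

lvp = LastValuePredictor()
for value in [7, 7, 7, 9]:          # a load that usually returns 7
    guess = lvp.predict(0x400)
    lvp.update(0x400, value)
print(lvp.hits, lvp.misses)         # -> 2 2
```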
Prototype (simulation)
- fetch width: 32 inst.
- reorder buffer: 128 entries
- 64 KB 4-way set-associative cache
- 128 KB fully associative store queue
Trace Processor (U. of Wisconsin at Madison)
- make parallel inst. more visible
- dynamically partition hierarchical parallelism
- incorporate speculation for both control and data
Vector IRAM (U. of California at Berkeley, D. Patterson)
CPU speed improves 60% per year, memory speed only 7% per year. The gap is filled by cache memory, but large off-chip caches have limits. Half the area of the Alpha chip is cache. Compared with the MIPS R5000, the MIPS R10000 (out-of-order, speculative) has 3.43 times the area but only 1.64 times the performance (SPECint95 rating).
Intelligent RAM: DRAM can accommodate 30-50 times more data than cache in the same die area, so it should be treated as main memory.
Advantages:
- higher memory bandwidth
- reduced energy (less driving of high-capacitance off-chip buses)
- fewer memory pins, so more pins can be devoted to I/O (higher I/O bandwidth)
- on-chip memory can reduce processor-memory latency 5-10 times and increase bandwidth 5-20 times
V-IRAM
- vector unit: two load pipes, one store pipe, two ALUs, 8 lanes x 64 bits
- pipelines running at 1 GHz
- peak performance: 16 GFLOPS
- DRAM latency 20 ns, cycle time 4 ns; will meet the 192 Gbyte/s demand from the vector unit
- scalar core: dual issue, with first-level I & D caches
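A quick back-of-the-envelope check, assuming the vector unit is 8 lanes of 64 bits each at 1 GHz and every memory pipe moves a full vector group per cycle (with that reading, the stated 192 Gbyte/s and 16 GFLOPS both come out exactly):

```python
# Consistency check of the V-IRAM figures listed above.
lanes, bits, ghz = 8, 64, 1.0
bytes_per_pipe = lanes * bits // 8 * ghz * 1e9  # one memory pipe, bytes/s

mem_pipes = 3                                   # two load pipes + one store
bandwidth = mem_pipes * bytes_per_pipe          # 192e9 bytes/s = 192 Gbyte/s

alus = 2
gflops = alus * lanes * ghz                     # 16.0 GFLOPS peak
print(bandwidth / 1e9, gflops)                  # -> 192.0 16.0
```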
Simultaneous multithreading (SMT)
Combines a wide-issue superscalar (multiple issue) with multithreading (hardware state held for several threads).
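A toy sketch of the idea, assuming a simple round-robin slot-filling policy (invented for illustration; real SMT designs use smarter fetch heuristics):

```python
# SMT issue sketch: each cycle the shared issue slots are filled from
# whichever threads still have instructions, so one stalled or narrow
# thread does not leave the wide machine idle.
def smt_issue(threads, slots_per_cycle=4):
    queues = [list(t) for t in threads]   # per-thread instruction queues
    schedule = []                         # one list of issued insts per cycle
    while any(queues):
        cycle, progress = [], True
        while len(cycle) < slots_per_cycle and progress:
            progress = False
            for q in queues:              # at most one inst per thread per scan
                if q and len(cycle) < slots_per_cycle:
                    cycle.append(q.pop(0))
                    progress = True
        schedule.append(cycle)
    return schedule

# Two threads share a 4-wide core that neither could fill alone:
print(smt_issue([["A0", "A1", "A2"], ["B0", "B1", "B2"]]))
# -> [['A0', 'B0', 'A1', 'B1'], ['A2', 'B2']]
```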
One-chip multiprocessor (Stanford)
- exploits parallelism at several levels: ILP, TLP, and PLP (process level)
- multiple-issue parallelism is limited by instruction window size
- chip layout will affect architecture: avoid long wires
Advantages:
- area/complexity grows linearly with the number of processors, vs. quadratically with issue width in a superscalar
- shorter cycle time, because of short wires, no bus switching, etc.
- easier (faster) to design, verify, and test
- distributed caches lower the demand on memory bandwidth
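The linear-vs-quadratic point can be made concrete with a crude cost model (the quadratic cost function is my illustrative assumption, standing in for all-to-all bypass and issue-selection paths):

```python
# Crude area/complexity model: the bypass/issue logic of a wide superscalar
# grows roughly with the square of its issue width, while a chip
# multiprocessor adds cores linearly.
def superscalar_cost(issue_width):
    return issue_width ** 2          # all-to-all paths among issue slots

def cmp_cost(cores, issue_width_per_core):
    return cores * superscalar_cost(issue_width_per_core)

# Same total of 16 issue slots, very different complexity:
print(superscalar_cost(16))          # one 16-wide core  -> 256
print(cmp_cost(4, 4))                # four 4-wide cores -> 64
```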
Raw (MIT)
- replicated tiles; the interconnect is synchronous and direct, with short latency
- static scheduling: operands are available when needed, eliminating explicit synchronization
- each tile supports multigranular (bit-, byte-, and word-level) operations
1 billion tr. can make 128 tiles (5 M tr. each), each with:
- 16 KB instruction RAM (static RAM)
- 16 KB switch memory (static RAM)
- 32 KB first-level data memory (static RAM)
- 128 KB DRAM
- interconnect: 30% of area
- switch: single-cycle message injection and receive operations; communication at nearly the same speed as a register read
- switch control: sequences routing instructions
- configurability: at a coarser grain than an FPGA, the compiler can create customized instructions without long software sequences (for example, in the Game of Life a custom instruction reduces 22 cycles to one)
- the compiler is complex
- N tiles can serve as a collection of functional units for exploiting ILP
Prototype: 64 Xilinx 4013 FPGAs (10,000 gates each) at 25 MHz.
Speedup on Life compared with a Sun SPARC 20/71: 600X, from +32X bit-level operations, +32X parallelism, and +22X configurability, less -3X for the slow FPGA clock and -13X for communication overhead.
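The bit-level factor in the Life speedup comes from operating on many cells per word at once. The notes do not describe Raw's actual custom instruction or grid layout, but the flavor can be sketched in plain Python: pack each grid row into one integer and count every cell's neighbors simultaneously with bitwise full adders.

```python
# Bit-parallel Game of Life step: each row is one integer, bit i = column i.
# Bitwise adders sum the 8 neighbor words so a whole row updates at once.
def life_step(grid, width):
    mask = (1 << width) - 1

    def add2(a, b):                 # bitwise half adder: (sum, carry)
        return a ^ b, a & b

    def add3(a, b, c):              # bitwise full adder: (sum, carry)
        return a ^ b ^ c, (a & b) | (c & (a ^ b))

    out = []
    for r, row in enumerate(grid):
        up = grid[r - 1] if r > 0 else 0
        dn = grid[r + 1] if r + 1 < len(grid) else 0
        nbs = [(up << 1) & mask, up, up >> 1,       # three neighbors above
               (row << 1) & mask, row >> 1,         # left and right
               (dn << 1) & mask, dn, dn >> 1]       # three neighbors below
        s0, c0 = add3(*nbs[0:3])
        s1, c1 = add3(*nbs[3:6])
        s2, c2 = add2(*nbs[6:8])
        b0, t = add3(s0, s1, s2)    # bit 0 of the per-cell neighbor count
        u0, u1 = add3(c0, c1, c2)
        b1, v1 = add2(u0, t)        # bit 1
        b2 = u1 ^ v1                # bit 2
        b3 = u1 & v1                # bit 3 (only set when the count is 8)
        # Next state: exactly 3 neighbors, or alive with exactly 2.
        out.append(~b3 & ~b2 & b1 & (b0 | row) & mask)
    return out

# A blinker oscillates between a horizontal and a vertical bar of 3 cells:
blinker = [0b00000, 0b00000, 0b01110, 0b00000, 0b00000]
once = life_step(blinker, 5)        # -> vertical bar in column 2
twice = life_step(once, 5)          # -> the original horizontal blinker
```

With 64-bit words this updates 64 cells per word operation; Raw's custom instruction pushes the same idea down into configurable logic.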