Future trends in microprocessors

Future Architecture of microprocessors

trend by year 2010
multi-media workload
architectural trend
trace cache
hybrid branch prediction
advanced superscalar
super speculative
simultaneous multi-thread
trace processor
vector IRAM
one chip multi processor
RAW
conclusion

Billion-transistor chips (mostly from IEEE Computer, Sept. 97)

Forecasting technology is full of errors.

"Everything that can be invented has been invented". US Commissioner of Patents, 1899.
"I think there is a world market for about five computers". Thomas J Watson Sr., IBM founder, 1943.
"There is no reason for any individuals to have a computer in their home". Ken Olsen, CEO of Digital Equipment Corp., 1977.
"The current rate of progress can't continue much longer. We're facing fundamental problems that we didn't have to deal with before". Various computer technologists, 1995-1997.

Trend by year 2010

Wire delay becomes dominant, forcing hardware to be more distributed (it limits any centralized clock). Compilers are getting better at exploiting parallelism; hence they can be relied on when building future architectures. Workloads contain more parallelism (multi-media workloads). Design and validation cost becomes a more limiting factor.

The chip : 800 M transistors, 1000 pins, 1000-bit bus, 2 GHz clock, 180 W power
On-chip wires are much slower than logic gates. A signal across the chip may require 20 clocks.
Design, verification and testing will consume a large percentage of the cost.
At Intel, validation and test account for 40-50% of design cost; built-in test takes 6% of the transistors.
A fabrication facility now costs $2 billion, 10X more than a decade ago.

Multi-media workload

peer-to-peer
real-time
continuous data stream

vectors of packed 8-, 16-, 32-bit or FP-size operands
fine grain data parallelism : many elements with the same processing (filtering, transformation), SIMD (vector), different from superscalar
coarse grain parallelism : multi-thread : encode/decode video and audio simultaneously, temporal/spatial parallelism
high instruction reference locality in small loops : hand optimization is practical
high memory bandwidth : caches are not effective, requires data prefetch and cache bypass
high network bandwidth : ISA moves towards real-time bitstream processing
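
To make the "packed operands" idea concrete, here is a minimal Python sketch of one multimedia-style operation: adding four unsigned 8-bit elements that share one 32-bit word, with saturation at 255. The function names, the lane count and the choice of saturating arithmetic are assumptions for illustration; real multimedia instruction sets do this in a single hardware instruction rather than a loop.

```python
def pack_u8(values):
    """Pack a list of unsigned 8-bit values into one integer word (lane 0 = low byte)."""
    word = 0
    for lane, v in enumerate(values):
        word |= (v & 0xFF) << (8 * lane)
    return word


def unpack_u8(word, lanes=4):
    """Split a packed word back into its 8-bit lanes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(lanes)]


def packed_add_u8_saturating(a, b, lanes=4):
    """Add corresponding 8-bit lanes of a and b, clamping each lane at 255.

    Hardware applies the same operation to every lane in parallel;
    this loop only models the result.
    """
    result = 0
    for lane in range(lanes):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        result |= min(x + y, 0xFF) << shift
    return result


pixels = pack_u8([10, 200, 255, 0])
brighten = pack_u8([20, 100, 1, 0])
summed = unpack_u8(packed_add_u8_saturating(pixels, brighten))
```

Saturation is what makes this suit pixel or audio data: an overflowing lane clamps instead of wrapping around into a wrong value or spilling into its neighbour.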

Architecture Trend

The architectures range from a more traditional design (scaling up today's superscalar) to a non-conventional design (using reconfigurable computing arrays). The more traditional designs employ recent advanced concepts such as the trace cache and multiple branch predictors to improve instruction issue performance. The non-conventional designs rely on the combination of compiler and hardware to extract more parallelism from programs.

(from top to bottom : hardware becomes more distributed as wire delay dominates, and the programming model departs further from the norm)

advanced super scalar (16-32 inst./cycle)
super speculative (wide issue speculating)
simultaneous multi thread SMT (multi task, aggressive pipeline)
trace (multi scalar) (high ILP separate trace)
vector iram (vector + IRAM)
one chip CMP (4-16 processors)
Raw ( > 100 processing elements reconfigurable)

Grouping by concept will be :

advanced super scalar, super speculative : use wider instruction issues
SMT, Trace : exploit multi-thread
V-IRAM : incorporate on-chip memory, use vector units
CMP : multi processors on a chip
RAW : reconfigurable hardware

Trace cache

A trace cache is an instruction cache whose main purpose is to fetch past a taken branch. It is accessed using the starting address of the next block of instructions. It stores logically contiguous instructions in physically contiguous storage. A cache line stores a segment of the dynamic instruction trace, up to an issue width, across multiple taken branches.

Figure The benefit of a trace cache continues to grow as its size is increased
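
The following Python sketch illustrates the storage idea only: a dynamic stream of instruction addresses is cut into lines of at most one issue width, and each line is looked up by its starting address. It assumes word-indexed addresses (the sequential successor of pc is pc + 1), a hypothetical issue width of 8, and lookup by start address alone; a real trace cache also has finite capacity and replacement, which are omitted here.

```python
def build_trace_cache(dynamic_pcs, width=8):
    """Cut the dynamic instruction stream into trace lines keyed by start address."""
    cache = {}
    for start in range(0, len(dynamic_pcs), width):
        line = dynamic_pcs[start:start + width]
        cache.setdefault(line[0], line)  # keep the first trace seen for this address
    return cache


def fetch(cache, pc):
    """Return the stored trace starting at pc, or None on a miss."""
    return cache.get(pc)


def taken_branches_in(line):
    """Count taken branches inside a line: any successor that is not pc + 1."""
    return sum(1 for a, b in zip(line, line[1:]) if b != a + 1)


# Two iterations of a loop: 100..102, taken branch to 200, 201, taken branch back to 100.
stream = [100, 101, 102, 200, 201] * 2
cache = build_trace_cache(stream, width=8)
line = fetch(cache, 100)
```

Here a single lookup at address 100 returns eight instructions spanning two taken branches, whereas a conventional instruction cache fetch would stop at the first taken branch after address 102.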

Hybrid branch prediction

Instead of using one kind of branch predictor, use several predictors, each targeting a different class of branches. The advantage is a shorter warm-up time (as each predictor is smaller), hence the predictor reacts faster to changes in a program. However, with the same amount of resources, each of the several predictors is limited in size. Sometimes the performance depends on a critical size of the predictor's buffer.
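
One common way to combine predictors is a "tournament" arrangement, sketched below in Python under stated assumptions: a bimodal component (2-bit counters indexed by branch address), a gshare-style component (counters indexed by address XOR global history), and a chooser table that learns which component to trust for each branch. The notes do not say which components or selection scheme the cited designs use; the class names, table size and history length here are all hypothetical.

```python
class CounterTable:
    """Table of 2-bit saturating counters; a value of 2 or 3 predicts 'taken'."""

    def __init__(self, size):
        self.size = size
        self.table = [1] * size  # start weakly not-taken

    def predict(self, index):
        return self.table[index % self.size] >= 2

    def update(self, index, taken):
        i = index % self.size
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)


class HybridPredictor:
    def __init__(self, size=64, history_bits=4):
        self.bimodal = CounterTable(size)   # per-branch bias
        self.gshare = CounterTable(size)    # pattern correlated with global history
        self.chooser = CounterTable(size)   # 'taken' here means "trust gshare"
        self.history = 0
        self.mask = (1 << history_bits) - 1

    def predict(self, pc):
        use_gshare = self.chooser.predict(pc)
        if use_gshare:
            return self.gshare.predict(pc ^ self.history)
        return self.bimodal.predict(pc)

    def update(self, pc, taken):
        g_index = pc ^ self.history
        p_bimodal = self.bimodal.predict(pc)
        p_gshare = self.gshare.predict(g_index)
        if p_bimodal != p_gshare:           # train the chooser only on disagreement
            self.chooser.update(pc, p_gshare == taken)
        self.bimodal.update(pc, taken)
        self.gshare.update(g_index, taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy_on_alternating_branch(iterations=200, measured=100):
    """Fraction of correct predictions over the last `measured` outcomes of a T,N,T,N... branch."""
    predictor = HybridPredictor()
    correct = 0
    for i in range(iterations):
        taken = (i % 2 == 0)
        if i >= iterations - measured and predictor.predict(5) == taken:
            correct += 1
        predictor.update(5, taken)
    return correct / measured
```

A strictly alternating branch defeats the bimodal counter, which keeps flipping between its two middle states, while the history-indexed component learns the pattern; the chooser then routes that branch to the component that works.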

Advanced super scalar (U. of Michigan)

The main idea is : bigger, faster, wider (instruction issue). The key problems are

instruction supply
data memory supply
implementable execution core

The solution :
to improve instruction supply : out-of-order fetch, multi-hybrid branch predictor and trace cache
to improve data supply : replicated first level cache, huge on-chip cache and data speculation
It will require a large out-of-order instruction window (2000) and banks of functional units.

uniprocessor
large trace cache
large number of reservation stations
large number of pipelined functional units
sufficient on-chip data cache
sufficient resolution and forwarding logic
16-32 inst. per cycle
reservation stations for 2000 inst.
24-48 pipelined functional units
most important : inst. bandwidth, memory bandwidth and latency

Figure The advanced super scalar

Super Speculative (Carnegie Mellon University)

Superscalar has diminishing returns.
State of the art processors (DEC Alpha 21264, Silicon Graphics MIPS R10000, PowerPC 604, Intel Pentium Pro) aim for 4 IPC but achieve 0.5-1.5 on real-world programs.

Scaling a conventional superscalar runs into a limit on sustained IPC (see the figure; the bars are 4-, 8-, 16- and 32-issue). Superspeculative attains a sustained IPC of ~10 for non-numerical programs.

Figure Conventional superscalar does not scale up vs the gain from superspeculative

Superflow :
instruction flow -- rate at which useful instructions are fetched, decoded and dispatched
register data flow -- rate at which results are produced
memory data flow -- rate at which data is stored to and retrieved from memory

Instruction flow is improved by using a trace cache : a history-based fetch mechanism that stores dynamic instruction traces in a cache indexed by fetch address and branch outcome. Whenever it finds a suitable trace, it dispatches instructions from the trace cache rather than sequential instructions from the instruction cache.
Register data flow : detect and resolve inter-instruction dependencies. Eliminate and bypass as many dependencies as possible (with mechanisms such as register renaming).
Memory data flow : minimize average memory latency. Predict load values and addresses.
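
As an illustration of load value prediction, the simplest scheme is a last-value predictor: guess that a load returns the same value it returned the previous time. The Python sketch below assumes an unbounded table keyed by the load's instruction address. It is not a description of the Superflow mechanism itself, whose details are not given in these notes, and the names are hypothetical.

```python
class LastValuePredictor:
    """Predict that each load returns the value it produced last time."""

    def __init__(self):
        self.last_value = {}  # load instruction address -> last loaded value

    def predict(self, load_pc):
        return self.last_value.get(load_pc)  # None means "no prediction yet"

    def update(self, load_pc, value):
        self.last_value[load_pc] = value


def hit_rate(load_trace):
    """Fraction of loads in (load_pc, value) order whose value was predicted correctly."""
    predictor = LastValuePredictor()
    hits = 0
    for load_pc, value in load_trace:
        if predictor.predict(load_pc) == value:
            hits += 1
        predictor.update(load_pc, value)
    return hits / len(load_trace)


# A load that keeps returning the same constant: only the first access misses.
constant_load = [(0x40, 7)] * 10
rate = hit_rate(constant_load)
```

A correct prediction lets dependent instructions start before the memory access completes; a wrong one must be detected and the dependent work re-executed, which this sketch does not model.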

Figure Superflow architecture

Prototype (simulation)

fetch width 32 inst.
reorder buffer 128 entries
64 K 4-way set assoc. cache
128 K fully assoc. store queue

Figure performance : the bars are a perfect cache with unlimited ports, 64K cache with unlimited ports, 64K with 8 ports, 64K with 4 ports, 64K with 2 ports. The 64K cache with 4 ports seems to be adequate.

Trace Processor (U. of Wisconsin at Madison)

Use multiple, distributed on-chip cores. Each simultaneously executes a different trace. One core executes traces speculatively.

make parallel instructions more visible : must have a large instruction window
dynamically partition hierarchical parallelism : replicated units, control unit must allocate parts of program effectively
incorporate speculation for both control and data : control flow -- branch prediction, data dependency -- speculation on memory addressing hazards

microarchitecture

instruction preprocessing : form a trace, check data dependency, reordering, get resource.
trace cache
next trace prediction : predict multiple branches per cycle
instruction dispatch : register renaming, predict trace's input data values, data speculation

Figure Each processor issues 4 inst/cycle; four processors issue 16 inst/cycle. No benchmark results.

Vector IRAM ( U. of California at Berkeley, D. Patterson)

CPU speed improves 60% per year, memory speed 7% per year. The gap is filled by cache memory. However, large caches, on chip or off chip, have their limits. Half the area of the Alpha chip is cache. Comparing the MIPS R5000 with the MIPS R10000 (out-of-order, speculative) : the R10K has 3.43 times more area but a performance gain of only 1.64 times (SPECint95 rating).
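
The two comparisons above are simple arithmetic, worked out in the Python sketch below. It assumes both growth rates compound annually and stay constant, which is a simplification; the function name is hypothetical.

```python
def gap_after(years, cpu_rate=0.60, mem_rate=0.07):
    """Ratio of CPU speed to memory speed after `years`, starting from parity."""
    return ((1 + cpu_rate) / (1 + mem_rate)) ** years


one_year = gap_after(1)     # the gap widens by about 1.5x every year
ten_years = gap_after(10)   # and by more than 50x over a decade

# R10000 relative to R5000: 1.64x the performance for 3.43x the area,
# so performance per unit area drops to a bit under half.
perf_per_area = 1.64 / 3.43
```

The second figure is the motivation for spending transistors on something other than a bigger out-of-order core.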

DRAM can accommodate 30 to 50 times more data than the same chip area devoted to caches (SRAM), so on-chip DRAM should be treated as main memory. On-chip memory can support high bandwidth (50 - 200 times) and low latency (5 to 10 times) by using a wide interface and eliminating the delay of pads and buses.

speed up by vectorization

SPECint95 42%
m88ksim 36%
PGP vector outperforms superscalar, vector uses only 10% die area.

Advantage :

higher memory bandwidth
reduced energy (less driving of off-chip high-capacitance buses)
fewer pins for memory, therefore more pins can be devoted to I/O (higher I/O bandwidth)
On chip mem. can reduce processor-memory latency 5-10 times and increase bandwidth 5-20 times.

V-IRAM

vector unit : two load units, one store unit, two ALUs, 8 x 64-bit pipelines running at 1 GHz
peak performance 16 GFLOPS, or 128 GOPS when each pipeline is split into multiple 8-bit pipelines for multimedia operations
on-chip memory : 96 Mbytes DRAM, latency 20 ns, cycle 4 ns, will meet the 192 Gbyte/s demand from the vector unit
scalar core : dual issue with first level I&D cache
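
The peak figures quoted above are consistent with a straightforward product of the unit counts, as the Python sketch below checks. This is one reading of how the numbers combine (every ALU and every load/store unit busy on all 8 pipelines each cycle), not a statement taken from the V-IRAM design documents.

```python
CLOCK_HZ = 1_000_000_000   # 1 GHz
PIPELINES = 8              # 8 parallel 64-bit pipelines
ALUS = 2
LOAD_UNITS = 2
STORE_UNITS = 1
BYTES_PER_ELEMENT = 8      # one 64-bit operand

# Peak arithmetic rate: every ALU produces one 64-bit result per pipeline per cycle.
peak_flops = ALUS * PIPELINES * CLOCK_HZ

# Splitting each 64-bit pipeline into 8-bit lanes multiplies the operation count by 8.
subwords = 64 // 8
peak_ops_8bit = peak_flops * subwords

# Memory demand: every load and store unit moves one 64-bit element per pipeline per cycle.
peak_bytes_per_s = (LOAD_UNITS + STORE_UNITS) * PIPELINES * BYTES_PER_ELEMENT * CLOCK_HZ
```

Under these assumptions the products come out to 16 GFLOPS, 128 GOPS and 192 Gbyte/s, matching the list.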

Figure Vector I-RAM

Simultaneous multi thread (SMT)

Augment a wide superscalar to execute instructions from multiple threads of control concurrently, dynamically selecting and executing instructions from many active threads simultaneously. To run multiple threads, the processor must hold per-thread state, e.g. program counter and registers. There are 8 duplicated units for the PC and registers.
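
A toy Python model of why sharing issue slots helps: assume each thread can supply only a few independent instructions per cycle (its ILP limit), while the machine can issue more than that. The function, the fixed-priority thread ordering and the workload numbers are assumptions for illustration; a real SMT selects among threads with a more elaborate fetch and issue policy.

```python
def cycles_to_finish(thread_lengths, per_thread_ilp, issue_width, simultaneous):
    """Count cycles needed to issue every instruction of every thread.

    simultaneous=False: only one thread may issue in a given cycle.
    simultaneous=True:  leftover issue slots are filled from the other threads.
    """
    remaining = list(thread_lengths)
    cycles = 0
    while any(remaining):
        slots = issue_width
        for t in range(len(remaining)):
            if remaining[t] == 0:
                continue
            issued = min(per_thread_ilp, remaining[t], slots)
            remaining[t] -= issued
            slots -= issued
            if not simultaneous or slots == 0:
                break
        cycles += 1
    return cycles


# Four threads of 16 instructions each, 2 independent instructions per thread per cycle,
# on an 8-wide machine.
one_thread_at_a_time = cycles_to_finish([16] * 4, 2, 8, simultaneous=False)
smt = cycles_to_finish([16] * 4, 2, 8, simultaneous=True)
```

With one thread at a time, six of the eight slots sit idle every cycle; with four threads sharing the slots the same work finishes in a quarter of the cycles.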

Figure SMT

One chip multiple processors CMP (Stanford)

Multiple (16) simple, fast processors. Each is coupled to a small, fast level-one cache. All processors share a larger level-two cache.

Simple design
Faster validation
Code explicitly parallel

exploiting parallelism

in a single application : instruction level parallelism, and data-independent loop iterations (loop level parallelism). Both are limited by the instruction window size.
thread-level parallelism : a compiler can simulate a single large instruction window with multiple smaller instruction windows -- one for each thread.
process-level parallelism (coarse grain) : independent applications running in independent processes controlled by OS.

exploit parallelism : ILP, TLP, PLP (process)
multiple-issue parallelism is limited by instruction window size
chip layout will affect architecture : avoid long wires

advantage :

area/complexity grows linearly, vs quadratically in a superscalar
shorter cycle time because of short wires and no bus switches etc.
easier (faster) to design, verify and test
distributed caches lower the demand on memory bandwidth
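
The linear-versus-quadratic claim can be illustrated with a toy cost model in Python. It assumes the issue logic of one core costs the square of its issue width and that cores simply add up; the unit is arbitrary and the model is hypothetical, meant only to show the shape of the trade-off.

```python
def relative_complexity(cores, issue_width_per_core):
    """Toy cost: quadratic in each core's issue width, linear in the number of cores."""
    return cores * issue_width_per_core ** 2


# Three ways to provide a total issue width of 16:
wide_superscalar = relative_complexity(1, 16)   # one 16-issue core
four_core_cmp = relative_complexity(4, 4)       # four 4-issue cores
sixteen_core_cmp = relative_complexity(16, 1)   # sixteen single-issue cores
```

Under this model the same total issue width costs 256, 64 and 16 units respectively, which is the argument for many simple processors.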

disadvantage :
It is slower than SMT when running code that cannot be multithreaded, because only one processor can be targeted to the task.

Figure Comparing superscalar, SMT and CMP under similar resources

Figure Performance comparison

Raw (MIT)

Highly parallel : hundreds of very simple processors coupled with small on-chip memories. Each processor has a small bank of configurable logic. It does not use an ISA; a program is compiled to hardware. The compiler schedules communication, so the architecture is limited by the compiler. Whether this type of architecture is effective for future workloads remains an open question.

Figure Comparing RAW (a) with superscalar (b) and multi processors (c)

replicated tiles; the interconnect is synchronous and direct, with short latency.
static scheduling : operands are available when needed, eliminating explicit synchronization.
each tile supports multi-granular (bit, byte, word level) operations.

1 billion tr. can make 128 tiles (5 M tr. each)
16K bytes inst. RAM ( static ram)
16K bytes switch mem. (static ram)
32K first level data mem. (static ram)
128K DRAM

interconnect 30% of area
switch : single cycle message injection and receive operations. communication nearly same speed as register read.
switch control : sequencing routing instruction.
configurability : at a coarser grain than FPGA; the compiler can create customised instructions instead of using longer software sequences (for example, in Game of Life a custom instruction reduces 22 cycles to one)
Compiler is complex.
N tiles as a collection of functional units for exploiting ILP.

Prototype : 64 Xilinx 4013 FPGAs (10,000 gates each) at 25 MHz
speed up compared to Sun Sparc 20/71
Life 600X : +32X bit level, +32X parallelism, +22X configurability, -3X slow FPGA clock, -13X communication overhead.
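
The Life figure can be checked by multiplying the gains and dividing by the losses, as in the Python sketch below. Treating the five factors as independent multipliers is an assumption; the notes list them without saying how they combine.

```python
gains = {"bit-level operations": 32, "parallelism": 32, "configurability": 22}
losses = {"slow FPGA clock": 3, "communication overhead": 13}


def net_speedup(gains, losses):
    """Multiply all gain factors together and divide by all loss factors."""
    speedup = 1.0
    for factor in gains.values():
        speedup *= factor
    for factor in losses.values():
        speedup /= factor
    return speedup


life_speedup = net_speedup(gains, losses)  # 32 * 32 * 22 / (3 * 13)
```

The product is about 578, which agrees with the rounded "600X" and is close to the measured 597 in the benchmark table.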

Benchmark
Hardware prototype at 25 MHz (Xilinx 4013) compared to software executing on a 2.82 SPECint95 SparcStation (Sparc 20/71)

Benchmark                      data width (bits)   no. of elements   speed up over sw
binary heap                    32                  15                1.26
bubble sort                    32                  64                7
DES encryption                 64                  4                 7
integer FFT                    3                   4                 9
Jacobi 16x16                   8                   256               230
Conway's life                  1                   1024              597
integer matrix mul             16                  16                90
merge sort                     32                  14                2.6
n queens                       1                   16                3.96
single-source shortest path    16                  16                10
multiplicative shortest path   16                  16                14
transitive closure             1                   512               398

Conclusion

The rate of progress is very fast; radical architecture models that would have been dismissed a few years ago are now feasible. It is interesting to explore the trends that will affect future architectures and the space of these architectures. On-chip transistor budgets will soon allow virtually anything to be implemented -- the limit will be mainly the designer's imagination.
