Technology progresses from vacumm tubes to transisters to integrated circuits. The birth of microprocessor around 1970 has great impact on performance of computers. The growth of performance has been highest for microprocessors. Since 1980 the performance double every 2 years. This phenomena is called Moore's law. For example, around 1980 the first IBM PC appeared. Its CPU was a 8088, 16 bits CPU with 8 MHz clock, 16Kbytes of memory, floppy disk and no hard disk. The later model offer 5Mbytes hard disk (so called IBM XT). Today PC equipped with Pentium 32 bit CPU 233 MHz clock, 32Mbyte of memory and 1 Gbyte disk. It performance is around 500 times of the first PC.
Performance = how fast a computer can run
performance = response time ( latency)
performance = throughput (bandwidth)
The fastest machine of today is the ASCI-Red of the deparment of energy,
USA. It is composed of 2048 nodes of Pentium Pro with collective
memory of 600 G bytes. Its peak performance is 1.8 Tflops and it
has run 630 Gflops on 3400 nodes (running simulation of motion of
particles). See detail in H. Karp, E. Lusk and D. Bailey, "1997 Gordon
Bell prize winners",IEEE computer magazine Jan 1998, pp.86-92.
Amdalh's law
speedup = performance with enhancement use / performance
without enhancement use
speedup = execution time without enhancement use / execution
time with enhancement use
if enhancement is used only fraction_enhancement
execution time new = execution time old ( (1-fraction_enhancement)
+
fraction_enhancement/speedup )
speedup overall = 1 / ((1-fraction_enhancement) + fraction_en / speedup )
Therefore the limit is at how much the enhancement has been used. In achieving speedup by parallelization, Amdalh's law predicts that speedup will be limit by the sequential part of the program. Let see some numerical example.
Example : a computer has an enhancement with 10 times speedup. That enhancement is used only 40% of the time. What is the overall speedup?
speedup overall = 1/ ((1-0.4) + 0.4/10 ) = 1.56
Please note that Amdalh's law apply only with the problem of fixed size. When problem size much much larger than the machine, Amdalh's law doesn't applied. This is why the massively parallel machine is still possible.
Example : comparing CPU A and CPU B, A with "compare then branch" instruction sequence, B has special combined "compare&branch". A has 25% faster clock. For CPU A, 20% of instruction is "branch" and hench another 20% is the accompanied "compare". "Branch" takes 2 clocks and all other instructions take 1 clock. "compare&branch" takes 2 clocks. Both cpu run the same program. Which is faster?
CPU time A = num. of instruction A x CPI A x cycletime A
= n.o.i A x (( .20 x 2 ) + ( .8 x 1 ) ) x cycletime A
= 1.2 x n.o.i x cycletime A
"compare" are not executed in CPU B so 20% of 80% = 25% of instructions
are now branching taking 2 clocks and the rest 75% take 1 clock.
CPI B = .25 x 2 + .75 x 1 = 1.25
CPU time B = .8 x n.o.i A x 1.25 x 1.25 cycletime A
= 1.25 n.o.i A x cycletime A
Therefore A, with shorter cycle time, is faster than B, which executes fewer instructions.
Now if the designer reworks CPU B and reduces the clock cycle time so that now A cycle time is only 10% faster. Which CPU is faster now?
CPU time B = .8 x n.o.i A x 1.25 x 1.1 cycletime A
= 1.1 n.o.i A x cycletime A
So now CPU B is faster.
Calculation of CPI (Some assembly language is required)
In order to appreciate the effect of different instruction set, understanding
of assembly language is required. An example of assembly language
programming is illustrated as follows.
Suppose a hypothetical machine has the typical instruction set composed of : load, store, compare, increment, jump condition. It has an index register (x) and a set of general purpose register (r0..r7).
Find max of array[i], i=1..N
max equ ...
array equ ...
load x, #1
load r0, array, x
store r0, max
; max = array[1]
load x, #2
; x keeps i
loop cmp x, #N
jump GT exit
; i <= N
load r0, max
load r1, array, x
cmp r0, r1
; max < array[i]
jump GE skip
store r1, max
skip inc x
jump loop
exit END
We count the number of instruction being executed to calculate CPI.
let N=3, array[] = 1,2,3
frequency clock
load 3+4
2
store 1+2
2
cmp 4
1
jump 4
2
inc
3
1
total 21 instructions, 35 clocks. CPI = 35/21 = 1.67