Vector machines provide high-level operations that work on vectors -- linear arrays of numbers. Vector operations have several important properties that solve most of the problems above:
The primary components of a vector machine are:
- vector registers, each with two read ports and one write port (the CRAY-1 has only a single-port register file but uses some clever implementation techniques);
- vector functional units, which can start a new operation on every clock cycle;
- a vector load/store unit, through which words can be moved between the vector registers and memory at a bandwidth of one word per clock cycle after an initial latency;
- a set of scalar registers, which provide data as input to the vector functional units as well as computed addresses to pass to the vector load/store unit.
(fig. characteristics of several vector-register architectures)
(excerpt from Hennessy & Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann Pub. Inc., 1990, pp. 351-357.)
Most vector computers have a pipelined structure, and multiple pipelines may be provided to increase performance. In price/performance terms, vector computers can deliver one to two orders of magnitude more throughput on vector computations than serial computers of equal cost. But this advantage is limited to the problems that fit the architecture -- i.e., the problems that can be structured as a sequence of vector operations.
There are two major approaches to designing the memory system for vector machines:
M0 ---|
M1 ---|   stream A
M2 ---|------------->
M3 ---|                 Pipeline Adder ---|
M4 ---|------------->                     |
M5 ---|   stream B                        |
M6 ---|<--------------- stream C ---------|
M7 ---|
Figure: eight 3-port memory modules
module   contents
M0       a0  b6  c4
M1       a1  b7  c5
M2       a2  b0  c6
M3       a3  b1  c7
M4       a4  b2  c0
M5       a5  b3  c1
M6       a6  b4  c2
M7       a7  b5  c3
Figure: physical layout of the three vectors in memory
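The layout above follows a simple skewing rule, which can be read off the figure: element a(i) sits in module i, b(i) in module (i+2) mod 8, and c(i) in module (i+4) mod 8. A minimal Python sketch (the offsets are inferred from the figure, not stated in the text) reproduces the table:

```python
# Skewed allocation of three 8-element vectors across 8 memory modules.
# Rule inferred from the figure: a[i] -> module i, b[i] -> module (i+2) % 8,
# c[i] -> module (i+4) % 8, so the three concurrent streams never collide.
MODULES = 8

def module_of(vector, i):
    offset = {"a": 0, "b": 2, "c": 4}[vector]
    return (i + offset) % MODULES

layout = {m: [] for m in range(MODULES)}
for v in "abc":
    for i in range(MODULES):
        layout[module_of(v, i)].append(f"{v}{i}")

for m in range(MODULES):
    print(f"M{m}: {' '.join(layout[m])}")  # M0: a0 b6 c4, etc.
```

Running it prints exactly the module contents shown in the figure, e.g. M0 holds a0, b6, c4.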
clock:      0    1    2    3    4    5    6    7    8    9   10   11   12
pipe 4:                               0    1    2    3    4    5    6    7
pipe 3:                          0    1    2    3    4    5    6    7
pipe 2:                     0    1    2    3    4    5    6    7
pipe 1:                0    1    2    3    4    5    6    7
M7:                             RB5  RB5  RA7  RA7   W3   W3
M6:                        RB4  RB4  RA6  RA6   W2   W2
M5:                   RB3  RB3  RA5  RA5   W1   W1
M4:              RB2  RB2  RA4  RA4   W0   W0
M3:         RB1  RB1  RA3  RA3
M2:    RB0  RB0  RA2  RA2                                               W6
M1:         RA1  RA1                       RB7  RB7             W5   W5
M0:    RA0  RA0                       RB6  RB6             W4   W4
Figure: timing diagram for vector addition in a 4-stage pipeline
(R = read, W = write; A, B are the input vectors, C the result)
The two input vectors are allocated to modules so that no conflicts occur. At clock 0, M0 and M2 initiate READs of the first elements of vectors A and B. These elements appear at the pipeline inputs at clock 2, and the corresponding output appears at the end of clock 5. At clock 1, M1 and M3 initiate READs of the second elements of the input vectors. At clock 5, the first output emerges from the ALU pipeline; in the next clock, clock 6, M5 and M6 are busy reading A5 and A6. M5 delivers A5 at the beginning of clock 7, and M6 delivers A6 at the beginning of clock 8. At clock 6, M4 initiates a WRITE to put away C0, and M0 initiates a READ of B6. Note how the arrangement of the timing enables all operations to proceed without a collision. In reality, the pipeline is never as well behaved as in ideal examples like this one.
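The schedule described above can be checked mechanically. A minimal sketch, assuming the parameters from the figure (2-clock memory cycle, reads RA(i) starting at clock i in module i, RB(i) starting at clock i in module (i+2) mod 8, and writes W(i) starting at clock i+6 in module (i+4) mod 8), verifies that no module is ever asked to do two things at once:

```python
# Verify the memory-module schedule for the pipelined vector add:
# each module access occupies the module for 2 clocks; the start clocks
# and module assignments follow the timing diagram above.
from collections import defaultdict

MODULES, N, CYCLE = 8, 8, 2

busy = defaultdict(list)  # module -> list of (start, end, op)
for i in range(N):
    busy[i % MODULES].append((i, i + CYCLE, f"RA{i}"))            # read A[i]
    busy[(i + 2) % MODULES].append((i, i + CYCLE, f"RB{i}"))      # read B[i]
    busy[(i + 4) % MODULES].append((i + 6, i + 6 + CYCLE, f"W{i}"))  # write C[i]

conflicts = 0
for m, ops in busy.items():
    ops.sort()
    for (s1, e1, _), (s2, _, _) in zip(ops, ops[1:]):
        if s2 < e1:          # next access starts before the previous ends
            conflicts += 1
print("conflicts:", conflicts)  # prints: conflicts: 0
```

The collision-free result depends entirely on the skewed allocation; placing all of A, B, and C starting at module 0 with no offsets would make the read and write streams contend.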
(excerpt from H. Stone. High performance computer architecture, 3rd ed. Addison Wesley, 1993, pp. 292-307)
Let's take a typical vector problem:

    Y = a * X + Y

where X and Y are vectors and a is a scalar. This is the so-called SAXPY or DAXPY (Single-precision or Double-precision A*X Plus Y) loop that forms the inner loop of the Linpack benchmark. Linpack is a collection of linear algebra routines; the Gaussian elimination portion of Linpack is the segment used as the benchmark. SAXPY represents a small piece of the Gaussian elimination code, though it takes most of the time in the benchmark.
Let's assume that the number of elements, or length, of a vector register (64) matches the length of the vector operation we are interested in. Let's define f0-f7 as floating-point registers. The starting addresses of X and Y are in rx and ry, and let's assume we have an immediate addressing mode in S1.
DAXPY in S1x:

    load  f0,a        ; load scalar a
    mov   rx,r4
    add   r4,#512     ; last address to load
loop:
    load  (rx),f2     ; load X(i)
    fmul  f2,f0       ; a * X(i)
    load  (ry),f4     ; load Y(i)
    fadd  f4,f2       ; a * X(i) + Y(i)
    store f4,(ry)     ; store into Y(i)
    add   rx,#8       ; increment index to X
    add   ry,#8       ; increment index to Y
    cmp   r4,rx       ; compute bound
    jnz   loop        ; check if done
DAXPY in S1v:

    load   f0,a       ; load scalar a
    vload  v1,rx      ; load vector X
    vmul   v2,f0,v1   ; vector-scalar mul: v2 = f0 * v1
    vload  v3,ry      ; load vector Y
    vadd   v4,v2,v3   ; vector add: v4 = v2 + v3
    vstore v4,ry      ; store the result
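The same computation can be sketched in plain Python to make the contrast concrete: the element-by-element loop mirrors the S1x style, while the single whole-vector expression mirrors the S1v style (both function names are ours, for illustration only):

```python
# DAXPY: Y = a*X + Y, written two ways.
def daxpy_scalar(a, x, y):
    # S1x style: one multiply, one add, one store per element, in a loop.
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y

def daxpy_vector(a, x, y):
    # S1v style: one "vector" operation over all elements at once.
    return [a * xi + yi for xi, yi in zip(x, y)]

x = [1.0] * 64
y = [2.0] * 64
print(daxpy_scalar(2.0, x, y[:]) == daxpy_vector(2.0, x, y))  # prints: True
```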
Comparing the two code segments, it is easy to see that the vector machine greatly reduces the dynamic instruction bandwidth: S1v executes only 6 instructions versus almost 600 for S1x (for 64-element vectors). Another important difference is the frequency of pipeline interlocks. In S1x, every fadd must wait for the fmul, and every store must wait for the fadd. On the vector machine, each vector instruction operates on all the vector elements independently; thus, pipeline stalls are required only once per vector operation, rather than once per vector element. In this example, the number of pipeline stalls on S1x will be about 64 times higher than on S1v.
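The "almost 600" figure follows directly from the listings above: the S1x loop body executes 9 instructions per element plus 3 instructions of setup, while S1v issues 6 instructions total. A quick check:

```python
# Dynamic instruction counts for a 64-element DAXPY, from the listings above.
N = 64
scalar_count = 3 + 9 * N   # setup (load, mov, add) + 9-instruction loop body
vector_count = 6           # load, vload, vmul, vload, vadd, vstore
print(scalar_count, vector_count)  # prints: 579 6
```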
The time for a single vector operation of length n is:

    start-up time + n * initiation rate
Example: The start-up time for a vector multiply is 10 clocks, and the initiation rate is one per clock. What is the number of clocks per result for a 64-element vector?

    clocks per result = total time / vector length
                      = (start-up time + 64 * initiation rate) / 64
                      = (10 + 64) / 64 = 1.16 clocks per result

What determines the start-up time and the initiation rate?
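The arithmetic above generalizes to any start-up time and vector length; a one-line helper (our own, for illustration) makes the formula explicit:

```python
# Clocks per result for a vector operation:
# total time = start-up time + n * initiation rate, amortized over n results.
def clocks_per_result(startup, initiation_rate, n):
    return (startup + n * initiation_rate) / n

print(round(clocks_per_result(10, 1, 64), 2))   # prints: 1.16
print(round(clocks_per_result(10, 1, 1000), 2)) # prints: 1.01 (start-up amortized away)
```

Note how the start-up cost matters less and less as the vector length grows, which is why long vectors approach the ideal one clock per result.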
Example: The CRAY-1 has the following vector unit start-up times (sustained rate is the time per element for a set of vector operations):

    operation   start-up time (clocks)
    FP add       6
    FP mul       7
    FP div      20
    FP load     12

with a 4-clock stall on a RAW hazard.
We can chart the time line:

            start at   complete at
    vmul       0       7 + 64 = 71
    vadd       1       1 + 6 + 64 = 71
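The completion times in the chart follow one rule: an operation issued at clock t with start-up s over n elements finishes at t + s + n. A small sketch, using the CRAY-1 start-up figures quoted above:

```python
# Completion clock for a vector operation: issue clock + start-up + length.
def complete_at(start, startup, n=64):
    return start + startup + n

print(complete_at(0, 7))  # vmul issued at clock 0: prints 71
print(complete_at(1, 6))  # vadd issued at clock 1: prints 71
```

Both operations finish at clock 71: the add's smaller start-up exactly compensates for its one-clock-later issue.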
In 1981, CDC started shipping the CYBER-205. The 205 had the same basic architecture as the STAR, but offered improved performance all around as well as expandability of the vector unit with up to four vector pipelines, each with multiple functional units and a wide load/store pipe that provided multiple words per clock. The peak performance of the CYBER-205 greatly exceeded the performance of the CRAY-1. However, on real programs, the performance difference was much smaller.
In 1983, Cray shipped the first CRAY X-MP, with an improved clock rate (9.5 ns versus 12.5 ns on the CRAY-1), better chaining support, and multiple memory pipelines. The CRAY-2, a completely new design configurable with up to four processors, was introduced later. It has a much faster clock than the X-MP, but also much deeper pipelines. In general, it is faster than the CRAY X-MP only on problems that require its very large main memory.
In 1983, the Japanese computer vendors entered the supercomputer marketplace, starting with the Fujitsu VP100 and VP200 and later expanding to include the Hitachi S810 and the NEC SX/2. These machines have proved to be close to the CRAY X-MP in performance. In general, these three machines have much higher peak performance than the CRAY X-MP, though because of large start-up overhead their typical performance is often lower.
In 1988, Cray Research introduced the CRAY Y-MP, a bigger and faster version of the CRAY X-MP. The Y-MP allows up to 8 processors and lowers the cycle time to 6 ns.
(from Hennessy & Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Pub. Inc., 1990, pp. 393-395.)