- Clock cycle time -- the clock cycle time can be decreased by making the pipelines deeper, but very deep pipelining can eventually slow down a processor.
- Instruction fetch and decode rate -- this limits fetching and issuing to no more than a few instructions per clock cycle.

Vector machines provide high-level operations that work on vectors -- linear arrays of numbers. Vector operations have several important properties that address most of the problems above:

- The computation of each result is independent of the computation of previous results, allowing a very deep pipeline without generating any data hazards. Essentially, the absence of data hazards is determined by the compiler or programmer.
- A single vector instruction is equivalent to executing an entire loop. Thus, the instruction bandwidth requirement is reduced.
- Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control hazards that would normally arise from the loop branch are nonexistent.

The primary components of a vector machine are:

- Vector registers -- each must have two read ports and one write port (the CRAY-1 has only single-port registers but uses some clever implementation techniques).
- Vector functional units that can start a new operation on every clock cycle.
- A vector load/store unit that can move words between the vector registers and memory with a bandwidth of one word per clock cycle, after an initial latency.
- A set of scalar registers that provide data as input to the vector functional units, and that are also used to compute addresses to pass to the vector load/store unit.

(fig. characteristics of several vector-register architectures)

(excerpt from Hennessy & Patterson, Computer architecture : quantitative approach. Morgan Kaufmann Pub. Inc. 1990. pp.351-357.)

interpret as

Most vector computers have a pipelined structure. Multiple pipelines may also be provided to increase performance. The price/performance ratio of vector computers can be one to two orders of magnitude better in throughput for vector computations when compared to serial computers of equal cost. But this advantage is limited to the problems that fit the architecture -- i.e., problems that can be structured as a sequence of vector operations.

Two major approaches to designing the memory system for vector machines:

- Using independent memory modules in main memory to support concurrent access to independent data (à la interleaving).
- Using an intermediate high-speed memory (à la cache).
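The interleaved approach can be sketched in a few lines of Python. This is a minimal illustration, not any particular machine's address map; the helper name `module_of` and the module count of 8 are my own choices, matching the figures below.

```python
# Minimal sketch of low-order interleaving (assumption: 8 word-addressed
# memory modules). Consecutive addresses map to successive modules, so a
# stride-1 vector sweep keeps all modules busy concurrently.
def module_of(addr: int, num_modules: int = 8) -> int:
    """Return the module that holds the word at `addr`."""
    return addr % num_modules

# Eight consecutive words land in eight distinct modules:
modules = [module_of(a) for a in range(8)]
print(modules)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With 8 independent modules, up to 8 accesses to a stride-1 vector can be in flight at once, which is what makes the one-word-per-clock bandwidth of the load/store unit sustainable.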

```
M0---|
M1---|  stream A
M2---|----------->
M3---|              Pipeline Adder ---|
M4---|----------->                    |
M5---|  stream B                      |
M6---|<----------- stream C ----------|
M7---|
```

Figure: eight 3-port memory modules

```
M0   a0  b6  c4
M1   a1  b7  c5
M2   a2  b0  c6
M3   a3  b1  c7
M4   a4  b2  c0
M5   a5  b3  c1
M6   a6  b4  c2
M7   a7  b5  c3
```

Figure: physical layout of three vectors in the memory
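The layout in the figure can be generated mechanically. A small sketch (my own helper names; the assumption is that vectors a, b, c are skewed across the 8 modules by 0, 2, and 4 positions so that reads of a and b and writes of c never collide on a module):

```python
# Each vector is placed round-robin across the modules, offset by a
# per-vector skew: element i of a vector with skew s lives in module
# (i + s) mod 8.
def module_for(element: int, skew: int, num_modules: int = 8) -> int:
    return (element + skew) % num_modules

layout = {m: [] for m in range(8)}
for name, skew in (("a", 0), ("b", 2), ("c", 4)):
    for i in range(8):
        layout[module_for(i, skew)].append(f"{name}{i}")

print(layout[0])  # ['a0', 'b6', 'c4'] -- the first row of the figure
```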

```
clock    0   1   2   3   4   5   6   7   8   9  10  11  12
pipe 1           0   1   2   3   4   5   6   7
pipe 2               0   1   2   3   4   5   6   7
pipe 3                   0   1   2   3   4   5   6   7
pipe 4                       0   1   2   3   4   5   6   7
M0      RA0 RA0                 RB6 RB6         W4  W4
M1          RA1 RA1                 RB7 RB7         W5  W5
M2      RB0 RB0 RA2 RA2                                 W6
M3          RB1 RB1 RA3 RA3
M4              RB2 RB2 RA4 RA4 W0  W0
M5                  RB3 RB3 RA5 RA5 W1  W1
M6                      RB4 RB4 RA6 RA6 W2  W2
M7                          RB5 RB5 RA7 RA7 W3  W3
```

Figure: timing diagram for vector addition in a pipeline (4 stages)

R = read, W = write; A, B = input vectors; C = result vector

The two input vectors are allocated to modules so that no conflicts occur. At clock 0, M0 and M2 initiate READs of the first elements of vectors A and B. These elements appear at the pipeline inputs at clock 2, and the corresponding output appears at the end of clock 5. At clock 1, M1 and M3 initiate READs of the second elements of the input vectors. At clock 5, the first output emerges from the ALU pipeline. In the next clock, clock 6, M5 and M6 are busy reading A5 and A6: M5 delivers A5 at the beginning of clock 7, and M6 delivers A6 at the beginning of clock 8. Also at clock 6, M4 initiates a WRITE of C0 and M0 initiates a READ of B6. Note how the arrangement of the timing enables all operations to proceed without a collision. In reality, a pipeline is never as well behaved as in this ideal example.
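The collision-freedom claimed above can be checked mechanically. A sketch under the stated assumptions (two-clock module accesses; the skewed layout from the earlier figure, with A_i in module i, B_i in module (i+2) mod 8, and C_i in module (i+4) mod 8; first write at clock 6):

```python
# Busy intervals (inclusive clock ranges) for each of the 8 modules:
# module m reads A_m at clocks m..m+1, reads the B element it holds two
# clocks earlier in the schedule, and writes the C element it holds once
# the pipelined adder delivers it.
def busy_intervals(m: int, n: int = 8):
    read_a = (m, m + 1)               # RA_m
    i_b = (m - 2) % n                 # module m holds B_{(m-2) mod 8}
    read_b = (i_b, i_b + 1)           # RB
    i_c = (m - 4) % n                 # module m holds C_{(m-4) mod 8}
    write_c = (6 + i_c, 7 + i_c)      # W
    return [read_a, read_b, write_c]

def overlap(p, q):
    return not (p[1] < q[0] or q[1] < p[0])

conflict_free = all(
    not overlap(p, q)
    for m in range(8)
    for j, p in enumerate(busy_intervals(m))
    for q in busy_intervals(m)[j + 1:]
)
print(conflict_free)  # True -- no module serves two operations at once
```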

(excerpt from H. Stone. High performance computer architecture, 3rd ed. Addison Wesley, 1993, pp. 292-307)

Let's take a typical vector problem:

Y = a * X + Y

where X and Y are vectors and a is a scalar. This is the so-called SAXPY or DAXPY (Single-precision or Double-precision A * X Plus Y) loop that forms the inner loop of the Linpack benchmark. Linpack is a collection of linear algebra routines; the Gaussian elimination portion of Linpack is the segment used as the benchmark. SAXPY represents a small piece of the elimination code, though it takes most of the time in the benchmark.

Let's assume that the number of elements, or length, of a vector register (64) matches the length of the vector operation we are interested in. Let's define f0-f7 as floating-point registers. The starting addresses of X and Y are in rx and ry, and let's assume we have an immediate addressing mode in S1.

__DAXPY in S1x__

```
        load  f0, a        ; load scalar a
        mov   rx, r4
        add   r4, #512     ; last address to load
loop:   load  (rx), f2     ; load X(i)
        fmul  f2, f0       ; a * X(i)
        load  (ry), f4     ; load Y(i)
        fadd  f4, f2       ; a * X(i) + Y(i)
        store f4, (ry)     ; store into Y(i)
        add   rx, #8       ; increment index to X
        add   ry, #8       ; increment index to Y
        cmp   r4, rx       ; compute bound
        jnz   loop         ; check if done
```

__DAXPY in S1v__

```
        load   f0, a       ; load scalar a
        vload  v1, rx      ; load vector X
        vmul   v2, f0, v1  ; vector-scalar mul: v2 = f0 * v1
        vload  v3, ry      ; load vector Y
        vadd   v4, v2, v3  ; vector add: v4 = v2 + v3
        vstore v4, ry      ; store the result
```

Comparing the two code segments, it is easy to see that the vector machine greatly reduces the dynamic instruction bandwidth: S1v executes only 6 instructions versus almost 600 for S1x (for 64-element vectors). Another important difference is the frequency of pipeline interlocks. In S1x, every fadd must wait for the fmul, and every store must wait for the fadd. On the vector machine, each vector instruction operates on all the vector elements independently. Thus, pipeline stalls are required only once per vector operation, rather than once per vector element. In this example, the pipeline-stall frequency on S1x will be about 64 times higher than on S1v.
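S1x and S1v compute the same result; only the instruction-level structure differs. A minimal Python analogue of the two styles (helper names are mine; 64-element vectors assumed, as above):

```python
# Element-by-element loop, mirroring the scalar S1x code:
# one multiply/add pair and one store per loop trip.
def daxpy_scalar(a, x, y):
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y

# Whole-vector expression, mirroring the one-instruction-per-operation
# style of the S1v code.
def daxpy_vector(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

x = [float(i) for i in range(64)]
y = [1.0] * 64
assert daxpy_vector(2.0, x, y) == daxpy_scalar(2.0, x, list(y))
```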

Two parameters characterize the execution time of a vector operation:

- vector start-up time
- initiation rate

total time = start-up time + n * initiation rate

**Example** The start-up time for a vector multiply is 10 clocks, and the initiation rate is one per clock. What is the number of clocks per result for a 64-element vector?

clocks per result = total time / vector length

= (start-up time + 64 * initiation rate) / 64

= (10 + 64) / 64

= 1.16 clocks per result
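The formula above can be wrapped into a small helper. A sketch under the stated assumptions (fully pipelined unit; function names are my own):

```python
# Execution-time model: total clocks = start-up time + n * initiation rate.
def total_clocks(startup: int, n: int, initiation_rate: float = 1.0) -> float:
    return startup + n * initiation_rate

def clocks_per_result(startup: int, n: int, initiation_rate: float = 1.0) -> float:
    return total_clocks(startup, n, initiation_rate) / n

print(round(clocks_per_result(10, 64), 2))  # 1.16, as in the example
```

Note how the start-up cost is amortized over the vector length: for very long vectors the cost per result approaches the initiation rate.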

Let's consider register-to-register operations (and therefore ignore memory latency):

start-up time = depth of the functional-unit pipeline, i.e. the time to get the first result

initiation rate = rate at which a functional unit can accept new operands; for a fully pipelined unit, one per clock

Pipeline depth is determined by the type of operation and the clock cycle time.

**Example** The CRAY-1 has the following vector unit characteristics. Sustained rate is the time per element for a set of vector operations.

```
operation   start-up time (clocks)
FP add       6
FP mul       7
FP div      20
FP load     12
```

Stall: 4 clocks on a RAW hazard.

We can chart the time line:

```
        start at | complete at
vmul        0    | 7 + 64 = 71
vadd        1    | 1 + 6 + 64 = 71
```

Because the two vector operations are independent, both instructions run concurrently most of the time. The sustained rate is one element per clock, so this sequence executes 128 FLOPs in 71 clocks

= 1.8 FLOPs per clock.
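The chart can be reproduced with the same timing model. A sketch under the stated assumptions (start-up time equals pipeline depth, one result per clock thereafter, the two independent instructions issuing one clock apart, 64-element vectors):

```python
# Completion clock of a 64-element vector instruction:
# issue clock + start-up (pipeline depth) + 64 elements at one per clock.
def completes_at(issue: int, startup: int, n: int = 64) -> int:
    return issue + startup + n

vmul_done = completes_at(0, 7)   # 0 + 7 + 64 = 71
vadd_done = completes_at(1, 6)   # 1 + 6 + 64 = 71
flops_per_clock = 2 * 64 / max(vmul_done, vadd_done)
print(round(flops_per_clock, 1))  # 1.8
```

Because the multiply and the add overlap almost completely, the sequence finishes in 71 clocks rather than the 142 a serial schedule would take.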

In 1981, CDC started shipping the CYBER-205. The 205 had the same basic architecture as the STAR, but offered improved performance all around as well as expandability of the vector unit with up to four vector pipelines, each with multiple functional units and a wide load/store pipe that provided multiple words per clock. The peak performance of the CYBER-205 greatly exceeded the performance of the CRAY-1. However, on real programs, the performance difference was much smaller.

In 1983, Cray shipped the first CRAY X-MP, with an improved clock rate (9.5 ns versus 12.5 ns on the CRAY-1), better chaining support, and multiple memory pipelines. The CRAY-2, a completely new design configurable with up to four processors, was introduced later. It has a much faster clock than the X-MP, but also much deeper pipelines. In general, it is faster than the CRAY X-MP only on problems that require its very large main memory.

In 1983, the Japanese computer vendors entered the supercomputer marketplace, starting with the Fujitsu VP100 and VP200 and later expanding to include the Hitachi S810 and the NEC SX/2. These machines have proved to be close to the CRAY X-MP in performance. In general, these three machines have much higher peak performance than the CRAY X-MP, though because of large start-up overhead, their typical performance is often lower than that of the CRAY X-MP.

In 1988, Cray Research introduced the CRAY Y-MP, a bigger and faster version of the CRAY X-MP. The Y-MP allows up to 8 processors and lowers the cycle time to 6 ns.

(from Hennessy & Patterson, "Computer architecture: a quantitative
approach", Morgan Kaufmann Pub. Inc., 1990, pp.393-395.)