- Clock cycle time -- the clock cycle time can be decreased by making the pipelines deeper, but very deep pipelining can eventually slow down a processor.
- Instruction fetch and decode rate -- this limits fetching and issuing to no more than a few instructions per clock cycle.

Vector machines provide high-level operations that work on vectors -- linear arrays of numbers. Vector operations have several important properties that address most of the problems above:

- The computation of each result is independent of the computation of previous results, allowing a very deep pipeline without generating any data hazards. Essentially, the absence of data hazards is determined by the compiler or programmer.
- A single vector instruction is equivalent to executing an entire loop. Thus, the instruction bandwidth requirement is reduced.
- Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control hazards that would normally arise from the loop branch are nonexistent.

The primary components of a vector machine are:

- Vector registers -- each must have two read ports and one write port (the CRAY-1 has only single-port registers but uses some clever implementation techniques).
- Vector functional units that can start a new operation on every clock cycle.
- A vector load/store unit that can move words between the vector registers and memory with a bandwidth of one word per clock cycle, after an initial latency.
- A set of scalar registers that provide data as input to the vector functional units, and that are also used to compute addresses to pass to the vector load/store unit.

(fig. characteristics of several vector-register architectures)

(excerpt from Hennessy & Patterson, Computer architecture : quantitative approach. Morgan Kaufmann Pub. Inc. 1990. pp.351-357.)

interpret as

Most vector computers have a pipelined structure. Multiple pipelines may also be provided to increase performance. The price/performance ratio of vector computers can be one to two orders of magnitude better in throughput for vector computations when compared to serial computers of equal cost. But this advantage is limited to the problems that fit the architecture -- i.e., problems that can be structured as a sequence of vector operations.

Two major approaches to designing the memory system for vector machines:

- Using independent memory modules in main memory to support concurrent access to independent data (à la interleaving).
- Using an intermediate high-speed memory (à la cache).
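The interleaved approach can be sketched in a few lines of Python. This is a minimal illustration, not any particular machine's address map; the helper name `module_of` and the module count of 8 are my own choices, matching the figures below.

```python
# Minimal sketch of low-order interleaving (assumption: 8 word-addressed
# memory modules). Consecutive addresses map to successive modules, so a
# stride-1 vector sweep keeps all modules busy concurrently.
def module_of(addr: int, num_modules: int = 8) -> int:
    """Return the module that holds the word at `addr`."""
    return addr % num_modules

# Eight consecutive words land in eight distinct modules:
modules = [module_of(a) for a in range(8)]
print(modules)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With 8 independent modules, up to 8 accesses to a stride-1 vector can be in flight at once, which is what makes the one-word-per-clock bandwidth of the load/store unit sustainable.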

```
M0---|
M1---|  stream A
M2---|----------->
M3---|              Pipeline Adder ---|
M4---|----------->                    |
M5---|  stream B                      |
M6---|<----------- stream C ----------|
M7---|
```

Figure: eight 3-port memory modules

```
M0   a0  b6  c4
M1   a1  b7  c5
M2   a2  b0  c6
M3   a3  b1  c7
M4   a4  b2  c0
M5   a5  b3  c1
M6   a6  b4  c2
M7   a7  b5  c3
```

Figure: physical layout of three vectors in the memory
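The layout in the figure can be generated mechanically. A small sketch (my own helper names; the assumption is that vectors a, b, c are skewed across the 8 modules by 0, 2, and 4 positions so that reads of a and b and writes of c never collide on a module):

```python
# Each vector is placed round-robin across the modules, offset by a
# per-vector skew: element i of a vector with skew s lives in module
# (i + s) mod 8.
def module_for(element: int, skew: int, num_modules: int = 8) -> int:
    return (element + skew) % num_modules

layout = {m: [] for m in range(8)}
for name, skew in (("a", 0), ("b", 2), ("c", 4)):
    for i in range(8):
        layout[module_for(i, skew)].append(f"{name}{i}")

print(layout[0])  # ['a0', 'b6', 'c4'] -- the first row of the figure
```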

```
clock    0   1   2   3   4   5   6   7   8   9  10  11  12
pipe 1           0   1   2   3   4   5   6   7
pipe 2               0   1   2   3   4   5   6   7
pipe 3                   0   1   2   3   4   5   6   7
pipe 4                       0   1   2   3   4   5   6   7
M0      RA0 RA0                 RB6 RB6         W4  W4
M1          RA1 RA1                 RB7 RB7         W5  W5
M2      RB0 RB0 RA2 RA2                                 W6
M3          RB1 RB1 RA3 RA3
M4              RB2 RB2 RA4 RA4 W0  W0
M5                  RB3 RB3 RA5 RA5 W1  W1
M6                      RB4 RB4 RA6 RA6 W2  W2
M7                          RB5 RB5 RA7 RA7 W3  W3
```

Figure: timing diagram for vector addition in a pipeline (4 stages)

R = read, W = write; A, B = input vectors; C = result vector

The two input vectors are allocated to modules so that no conflicts occur. At clock 0, M0 and M2 initiate READs of the first elements of vectors A and B. These elements appear at the pipeline inputs at clock 2, and the corresponding output appears at the end of clock 5. At clock 1, M1 and M3 initiate READs of the second elements of the input vectors. At clock 5, the first output emerges from the ALU pipeline. In the next clock, clock 6, M5 and M6 are busy reading A5 and A6: M5 delivers A5 at the beginning of clock 7, and M6 delivers A6 at the beginning of clock 8. Also at clock 6, M4 initiates a WRITE of C0 and M0 initiates a READ of B6. Note how the arrangement of the timing enables all operations to proceed without a collision. In reality, a pipeline is never as well behaved as in this ideal example.
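The collision-freedom claimed above can be checked mechanically. A sketch under the stated assumptions (two-clock module accesses; the skewed layout from the earlier figure, with A_i in module i, B_i in module (i+2) mod 8, and C_i in module (i+4) mod 8; first write at clock 6):

```python
# Busy intervals (inclusive clock ranges) for each of the 8 modules:
# module m reads A_m at clocks m..m+1, reads the B element it holds two
# clocks earlier in the schedule, and writes the C element it holds once
# the pipelined adder delivers it.
def busy_intervals(m: int, n: int = 8):
    read_a = (m, m + 1)               # RA_m
    i_b = (m - 2) % n                 # module m holds B_{(m-2) mod 8}
    read_b = (i_b, i_b + 1)           # RB
    i_c = (m - 4) % n                 # module m holds C_{(m-4) mod 8}
    write_c = (6 + i_c, 7 + i_c)      # W
    return [read_a, read_b, write_c]

def overlap(p, q):
    return not (p[1] < q[0] or q[1] < p[0])

conflict_free = all(
    not overlap(p, q)
    for m in range(8)
    for j, p in enumerate(busy_intervals(m))
    for q in busy_intervals(m)[j + 1:]
)
print(conflict_free)  # True -- no module serves two operations at once
```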

(excerpt from H. Stone. High performance computer architecture, 3rd ed. Addison Wesley, 1993, pp. 292-307)

Let's take a typical vector problem:

Y = a * X + Y

where X and Y are vectors and a is a scalar. This is the so-called SAXPY or DAXPY (Single-precision or Double-precision A * X Plus Y) loop that forms the inner loop of the Linpack benchmark. Linpack is a collection of linear algebra routines; the Gaussian elimination portion of Linpack is the segment used as the benchmark. SAXPY represents a small piece of the elimination code, though it takes most of the time in the benchmark.

Let's assume that the number of elements, or length, of a vector register (64) matches the length of the vector operation we are interested in. Let's define f0-f7 as floating-point registers. The starting addresses of X and Y are in rx and ry, and let's assume we have an immediate addressing mode in S1.

__DAXPY in S1x__

```
        load  f0, a        ; load scalar a
        mov   rx, r4
        add   r4, #512     ; last address to load
loop:   load  (rx), f2     ; load X(i)
        fmul  f2, f0       ; a * X(i)
        load  (ry), f4     ; load Y(i)
        fadd  f4, f2       ; a * X(i) + Y(i)
        store f4, (ry)     ; store into Y(i)
        add   rx, #8       ; increment index to X
        add   ry, #8       ; increment index to Y
        cmp   r4, rx       ; compute bound
        jnz   loop         ; check if done
```

__DAXPY in S1v__

```
        load   f0, a       ; load scalar a
        vload  v1, rx      ; load vector X
        vmul   v2, f0, v1  ; vector-scalar mul: v2 = f0 * v1
        vload  v3, ry      ; load vector Y
        vadd   v4, v2, v3  ; vector add: v4 = v2 + v3
        vstore v4, ry      ; store the result
```

Comparing the two code segments, it is easy to see that the vector machine greatly reduces the dynamic instruction bandwidth: S1v executes only 6 instructions versus almost 600 for S1x (for 64-element vectors). Another important difference is the frequency of pipeline interlocks. In S1x, every fadd must wait for the fmul, and every store must wait for the fadd. On the vector machine, each vector instruction operates on all the vector elements independently. Thus, pipeline stalls are required only once per vector operation, rather than once per vector element. In this example, the pipeline-stall frequency on S1x will be about 64 times higher than on S1v.
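S1x and S1v compute the same result; only the instruction-level structure differs. A minimal Python analogue of the two styles (helper names are mine; 64-element vectors assumed, as above):

```python
# Element-by-element loop, mirroring the scalar S1x code:
# one multiply/add pair and one store per loop trip.
def daxpy_scalar(a, x, y):
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y

# Whole-vector expression, mirroring the one-instruction-per-operation
# style of the S1v code.
def daxpy_vector(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

x = [float(i) for i in range(64)]
y = [1.0] * 64
assert daxpy_vector(2.0, x, y) == daxpy_scalar(2.0, x, list(y))
```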

Two parameters characterize the execution time of a vector operation:

- vector start-up time
- initiation rate

total time = start-up time + n * initiation rate

**Example** The start-up time for a vector multiply is 10 clocks, and the initiation rate is one per clock. What is the number of clocks per result for a 64-element vector?

clocks per result = total time / vector length

= (start-up time + 64 * initiation rate) / 64

= (10 + 64) / 64

= 1.16 clocks per result
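The formula above can be wrapped into a small helper. A sketch under the stated assumptions (fully pipelined unit; function names are my own):

```python
# Execution-time model: total clocks = start-up time + n * initiation rate.
def total_clocks(startup: int, n: int, initiation_rate: float = 1.0) -> float:
    return startup + n * initiation_rate

def clocks_per_result(startup: int, n: int, initiation_rate: float = 1.0) -> float:
    return total_clocks(startup, n, initiation_rate) / n

print(round(clocks_per_result(10, 64), 2))  # 1.16, as in the example
```

Note how the start-up cost is amortized over the vector length: for very long vectors the cost per result approaches the initiation rate.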

Let's consider register-to-register operations (and therefore ignore memory latency):

start-up time = depth of the functional-unit pipeline, i.e. the time to get the first result

initiation rate = rate at which a functional unit can accept new operands; for a fully pipelined unit, one per clock

Pipeline depth is determined by the type of operation and the clock cycle time.

**Example** The CRAY-1 has the following vector unit characteristics. Sustained rate is the time per element for a set of vector operations.

```
operation   start-up time (clocks)
FP add       6
FP mul       7
FP div      20
FP load     12
```

Stall: 4 clocks on a RAW hazard.

We can chart the time line:

```
        start at | complete at
vmul        0    | 7 + 64 = 71
vadd        1    | 1 + 6 + 64 = 71
```

Because the two vector operations are independent, both instructions run concurrently most of the time. The sustained rate is one element per clock, so this sequence executes 128 FLOPs in 71 clocks

= 1.8 FLOPs per clock.
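The chart can be reproduced with the same timing model. A sketch under the stated assumptions (start-up time equals pipeline depth, one result per clock thereafter, the two independent instructions issuing one clock apart, 64-element vectors):

```python
# Completion clock of a 64-element vector instruction:
# issue clock + start-up (pipeline depth) + 64 elements at one per clock.
def completes_at(issue: int, startup: int, n: int = 64) -> int:
    return issue + startup + n

vmul_done = completes_at(0, 7)   # 0 + 7 + 64 = 71
vadd_done = completes_at(1, 6)   # 1 + 6 + 64 = 71
flops_per_clock = 2 * 64 / max(vmul_done, vadd_done)
print(round(flops_per_clock, 1))  # 1.8
```

Because the multiply and the add overlap almost completely, the sequence finishes in 71 clocks rather than the 142 a serial schedule would take.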

In 1981, CDC started shipping the CYBER-205. The 205 had the same basic architecture as the STAR, but offered improved performance all around as well as expandability of the vector unit with up to four vector pipelines, each with multiple functional units and a wide load/store pipe that provided multiple words per clock. The peak performance of the CYBER-205 greatly exceeded the performance of the CRAY-1. However, on real programs, the performance difference was much smaller.

In 1983, Cray shipped the first CRAY X-MP, with an improved clock rate (9.5 ns versus 12.5 ns on the CRAY-1), better chaining support, and multiple memory pipelines. The CRAY-2, a completely new design configurable with up to four processors, was introduced later. It has a much faster clock than the X-MP, but also much deeper pipelines. In general, it is faster than the CRAY X-MP only on problems that require its very large main memory.

In 1983, the Japanese computer vendors entered the supercomputer marketplace, starting with the Fujitsu VP100 and VP200 and later expanding to include the Hitachi S810 and the NEC SX/2. These machines have proved to be close to the CRAY X-MP in performance. In general, these three machines have much higher peak performance than the CRAY X-MP, though because of large start-up overhead, their typical performance is often lower than that of the CRAY X-MP.

In 1988, Cray Research introduced the CRAY Y-MP, a bigger and faster version of the CRAY X-MP. The Y-MP allows up to 8 processors and lowers the cycle time to 6 ns.

(from Hennessy & Patterson, "Computer architecture: a quantitative
approach", Morgan Kaufmann Pub. Inc., 1990, pp.393-395.)