How to program a vector machine

To use the vector unit, the program must be "vectorised", i.e. transform the loop into vector operations.

If VL is vector length of the machine

            for i = 1 to VL
              c(i) = a(i) + b(i)

can be transformed into one vector operation

            vadd v0 v1 v2   where v0 holds c, v1 = a, v2 = b

There are several consideration for programming vector machines :

Vector length

when the data lenght is not equal to vector length of a machine, the loop can be transformed into loop of vector operations, each of maximum length plus one loop for the rest of element.  This is called "strip-mine".   Let VLR be the vector length register that holds the number of element in a vector operation, MVL be the maximum length of vector unit (VLR <= MVL).
The following program can be strip-mined

            for i = 1 to n
              y(i) = a * x(i) + y(i)

to

            low = 1
            VL = ( n mod MVL )
            for j = 0 to ( n / MVL )
              for i = low to low + VL -1
                 y(i) = a * x(i) + y(i)
              low = low + VL
              VL = MVL

The inner loop (for i = .. ) can be vectorised with length VL.  VL is equal to (n mod MVL) the first round through the loop, and becomes MVL for the rest.

Vector stride

A vector stride is the distance separate elements that are to be merge into a single vector.

The following program does matrix multiply
            for i = 1 to 100
              for j = 1 to 100
                a(i,j) = 0.0
                for k = 1 to 100
                  a(i,j) = a(i,j) + b(i,k) * c(k,j)

this program can be vectorised to multiply each row of b by each column of c.  The inner loop is strip-mined with k.  How adjacent elements in b and c are addressed?

Example for a 100x100 matrix
For row major order b(i,j) is adjacent to b(i,j+1)
(1,1) (1,2) ... (1,100) (2,1) (2,2) ... (2,100) ... (100,100)
and col major order b(i,j) is adjacent to b(i+1,j)
(1,1) (2,1) ... (100,1) (1,2) (2,2) ... (100,2) ... (100,100)

For row major order c has a stride of 100, b has a stride of one.

Once a vector is loaded into a vector register it had logically adjacent elements.  This enables the machine to handle the non-unit stride such as c above.  Vector load for non unit stride is complicate and can caused memory bank conflict.

Loop - carried dependency

The following program is not vectorisable

            1 for i = 1 to 100
            2    a(i+1) = a(i) + b(i)
            3    b(i+1) = b(i) + a(i+1)

This is because in the loop the computation used the value of an earlier iteration.  a(i+1) = a(i).  Also in line 3 has RAW on line 2 for a(i+1)

Figure the level of vectorisation on the Perfect Club benchmarks running with CRAY X-MP
 
Benchmark name FP op FP op in vector mode
ADM  23% 68%
DYFESM 26% 27%
FLO52  41% 100%
MDG  28% 27%
MG3D  31% 86%
OCEAN  28% 58%
QCD  14% 1%
SPICE  16% 17%
TRACK  9% 23%
TRFD  22% 10%
These benchmarks are large, real scientific applications.  The wide variation of level of vectorisation can be observed.  There are also large variation how well the compilers do in vectorising programs.  To get best performance from a vector machine, a significant modification or a rewrite of a program is needed.

Improving performance of a vector machine

Chaining

It is invented by Seymour Cray and introduced in CRAY-1.  When vector operations are running in sequence, the first result once completed is immediately make available to the next operation.  Chaining allows vector operations to run in parallel.  A sustained rate of more than one operations per clock can be achieve even when the operations are dependent.

Consider a vector sequence

vmul v1 v2 v3
vadd v4 v1 v5
Figure Timings for a sequence of dependent vector opreation
 

Conditional statement

From the table above the level of vectorisation is not very high.  There are two reasons : first, the presence of conditional statement which prevent loop from vectorising and second sparse matrix.  The following loop can not be vectorised :
for i = 1 to 64
  if ( a(i) != 0 ) then a(i) = a(i) - b(i)
the inner loop can be vectorised if we can selectively run it for the element which a(i) != 0.  This can be achieved using Vector Mask.  The vector mask can be loaded from the vector test instruction and any vector operation will operate on the element whose the corresponding entry in the vector mask is 1.  The above loop can be vectorised using vector mask as follows : ra, rb are the starting address of a, b
lv v1 ra         ; load vector a into v1
lv v2 rb         ; load vector b
ld f0 #0         ; f0 = 0
vsnes f0 v1      ; set VM to 1 if v1(i) != 0
vsub v1 v1 v2    ; subtract under vector mask
cvm              ; set vector mask to all 1
sv ra v1         ; store the result to a
This loop contains sparse matrix
for i =1 to n
  a( k(i) ) = a( k(i) ) + c( m(i) )
the sum of sparse vector on array a and c using index vector k and m to designate the non zero elements of a and c (a and c must have the same number of non zero elements -- n ).

For a sparse matrix, the supporting operation is called "scatter-gather".  A gather operation takes an index vector and fetches the vector whose elements are at the addresses given by base-address + offsets in the index vector.  After completing the computation, the vector  can be store in expanded form using a scatter store with the same index vector.  Suppose we have the instruction lvi (load vector indexed) and svi (store vector indexed), ra, rc, rk and rm contain the starting address of vector a, c, k, m.  The inner loop of the above program can be vectorised as follows :

lv vk rk             ; load k
lvi va (ra + vk )    ; load a ( k(i) )
lv vm rm             ; load m
lvi vc (rc + vm )    ; load c ( m(i) )
vadd va va vc        ;  add
svi (ra + vk ) va    ; store a( k(i) )

Vector reduction

Reduction is a loop that reduces an array to a single value by repeated application of an operation.  For example a dot product :
dot = 0.0
for i = 1 to 64
  dot = dot + a(i) * b(i)
This loop has a loop-carried dependency on dot and cannot be vectorised.  If we split the loop to separate out the vectorisable part :
for i = 1 to 64
  dot(i) = a(i) * b(i)
for i = 2 to 64
  dot(1) = dot(1) + dot(i)
the variable dot has been expanded into a vector, this is called "scalar expansion".

Another technique is called "recursive doubling".  A loop with recurrence is transformed using adding sequences of progressively shorter vectors -- two 32- element vectors and then two 16-element vectors, and so on.  It is faster than doing all operations in scalar mode.  An example of doing the second loop above with recursive doubling :

len = 32
for j = 1 to 6
  for i = 1 to len
    dot(i) = dot(i) + dot( i + len )
  len = len / 2
When the loop is done the sum is in dot(1).

A simple model of vector performance

                Tn = Tbase +    ( n/MVL + 1 ) * ( Tloop + Tstart ) + n * Telement

Tn total running time
Telement time to process one element
Tloop overhead for scalar code to strip-mine
Tstart vector start up time
Tbase overhead to compute starting address and set up vector control, occur once for one vector operation.

For CRAY-1  Tbase 10 clocks, Tloop 15 clocks.
What is the total running time for SAXPY using CRAY-1 with 64 element vectors?
 
SAXPY start at complete at
lv v1 rx 0 12 + 64 = 76
vmul v2 s1 v1 12 + 1 = 13 13 + 7 + 64 = 84
lv v3 ry 76 + 1 = 77 7 7 + 12 + 64 = 153
vadd v4 v2 v3 77 + 1 + 12 = 90 90 + 6 + 64 = 163
sv ry v4 160 + 1 + 4 = 165  165 + 12 + 64 = 241
Tstart = 241 - 64 * Telement
= 241 - 192 = 49
calculate the other way from the pipeline latency
12 + 7 + 12 + 6 + 12 = 49

MVL = 64, Tloop = 15, Tbase = 10, Telement = 3

Tn = 10 + ( n/64 + 1 ) * (15+ 49 ) + 3 n
= 4 n + 74

We can calculate the performance of a vector machine by

            R* = lim n -> * ( operation per iteration * clock rate  / clock cycle per iteration )

R* is MFLOPS rate on infinite vector length

Peak performance = number of FLOPS per iteration * clock rate / Telement
Rn = number of FLOPS per iteration * clock rate / Tn
For SAXPY code above, for 66 element vectors
T66 = 10 + 2 * (15 + 49) + 66 * 3
= 336
assume our vector processor has 80 MHz clock
R66 = 2 * 66 * 80 / 336 = 31.4 MFLOPS