If VL is the vector length of the machine, the loop
for i = 1 to VL
c(i) = a(i) + b(i)
can be transformed into a single vector operation
vadd v0 v1 v2    ; v0 holds c, v1 holds a, v2 holds b
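As a concrete point of reference, here is a minimal C sketch of the same loop (the function name and 0-based indexing are assumptions made for illustration); because each iteration writes a different c[i] and reads only a[i] and b[i], all the iterations are independent and, for n <= VL, the whole loop corresponds to one vector add.

   /* Element-wise add: every iteration is independent, so for n <= VL
      the loop maps onto a single vector add (0-based indexing in C). */
   void vec_add(int n, const double *a, const double *b, double *c)
   {
       for (int i = 0; i < n; i++)
           c[i] = a[i] + b[i];
   }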
There are several considerations when programming vector machines. The first is strip-mining: when the vector length n exceeds the maximum vector length (MVL) of the machine, the loop must be split into strips of at most MVL elements. For example, the SAXPY loop
for i = 1 to n
   y(i) = a * x(i) + y(i)
is transformed into
low = 1
VL = ( n mod MVL )
for j = 0 to ( n / MVL )
   for i = low to low + VL - 1
      y(i) = a * x(i) + y(i)
   low = low + VL
   VL = MVL
The inner loop (for i = ...) can be vectorised with length VL. VL equals (n mod MVL) on the first pass through the loop and becomes MVL for all remaining passes.
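A C sketch of the same strip-mining transformation is shown below (MVL, the function name and 0-based indexing are assumptions made for illustration); the first strip covers the n mod MVL leftover elements and every later strip is a full MVL elements.

   #define MVL 64   /* assumed maximum vector length */

   /* Strip-mined SAXPY: y = a*x + y for vectors longer than MVL. */
   void saxpy_stripmined(int n, double a, const double *x, double *y)
   {
       int low = 0;
       int vl = n % MVL;                        /* odd-sized first strip */
       for (int j = 0; j <= n / MVL; j++) {
           for (int i = low; i < low + vl; i++) /* vectorisable inner loop */
               y[i] = a * x[i] + y[i];
           low += vl;
           vl = MVL;                            /* remaining strips are full length */
       }
   }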
The following program performs matrix multiplication:
for i = 1 to 100
   for j = 1 to 100
      a(i,j) = 0.0
      for k = 1 to 100
         a(i,j) = a(i,j) + b(i,k) * c(k,j)
This program can be vectorised by multiplying each row of b with each column of c; the inner loop (over k) is strip-mined. How are adjacent elements of b and c addressed?
Example for a 100x100 matrix:
In row-major order, b(i,j) is adjacent to b(i,j+1):
(1,1) (1,2) ... (1,100) (2,1) (2,2) ... (2,100) ... (100,100)
and in column-major order, b(i,j) is adjacent to b(i+1,j):
(1,1) (2,1) ... (100,1) (1,2) (2,2) ... (100,2) ... (100,100)
With row-major order, the accesses to c in the inner loop have a stride of 100 elements, while the accesses to b have a stride of one.
Once a vector is loaded into a vector register, its elements are logically adjacent. This lets the machine handle non-unit strides such as the accesses to c above. However, vector loads with non-unit stride are more complicated and can cause memory bank conflicts.
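A C sketch of the inner-product loop with explicit row-major indexing (N = 100, the flat arrays and the function name are assumptions) makes the two strides visible: the access to b advances by one element per iteration of k, while the access to c advances by N elements.

   #define N 100

   /* Inner loop of the matrix multiply with row-major (C-style) storage:
      b[i*N + k] has stride 1 in k; c[k*N + j] has stride N in k. */
   double dot_row_col(const double *b, const double *c, int i, int j)
   {
       double sum = 0.0;
       for (int k = 0; k < N; k++)
           sum += b[i*N + k] * c[k*N + j];   /* unit-stride load * stride-N load */
       return sum;
   }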
The following loop cannot be vectorised:
1 for i = 1 to 100
2    a(i+1) = a(i) + b(i)
3    b(i+1) = b(i) + a(i+1)
This is because the computation uses a value produced by an earlier iteration: the a(i+1) written in one iteration is the a(i) read in the next. In addition, line 3 has a RAW dependence on line 2 through a(i+1).
Figure: Level of vectorisation of the Perfect Club benchmarks running on the CRAY X-MP.
Benchmark name | FP operations | FP operations in vector mode |
ADM            | 23%           | 68%  |
DYFESM         | 26%           | 27%  |
FLO52          | 41%           | 100% |
MDG            | 28%           | 27%  |
MG3D           | 31%           | 86%  |
OCEAN          | 28%           | 58%  |
QCD            | 14%           | 1%   |
SPICE          | 16%           | 17%  |
TRACK          | 9%            | 23%  |
TRFD           | 22%           | 10%  |
Consider a vector sequence
vmul v1 v2 v3
vadd v4 v1 v5
in which the vadd uses v1, the result of the vmul.
Figure: Timings for a sequence of dependent vector operations.
for i = 1 to 64
   if ( a(i) != 0 ) then a(i) = a(i) - b(i)
This loop can be vectorised if we can selectively run it only for the elements for which a(i) != 0. This can be achieved using a vector mask. The vector mask is set by a vector test instruction, and any subsequent vector operation operates only on the elements whose corresponding entry in the vector mask is 1. The loop above can be vectorised using the vector mask as follows (ra and rb are the starting addresses of a and b):
lv v1 ra      ; load vector a into v1
lv v2 rb      ; load vector b
ld f0 #0      ; f0 = 0
vsnes f0 v1   ; set VM(i) to 1 if v1(i) != f0
vsub v1 v1 v2 ; subtract under the vector mask
cvm           ; reset the vector mask to all 1s
sv ra v1      ; store the result to a
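The effect of the vector-mask sequence can be sketched in C as follows (the explicit mask array and the function name are assumptions used only to mirror what the hardware does); only the elements whose mask entry is 1 are updated.

   /* C sketch of execution under a vector mask (assumes n <= 64):
      build the mask with a vector compare, then subtract only where it is 1. */
   void masked_sub(int n, double *a, const double *b)
   {
       unsigned char mask[64];          /* models the VM register, one entry per element */
       for (int i = 0; i < n; i++)
           mask[i] = (a[i] != 0.0);     /* vsnes: set mask where a(i) != 0 */
       for (int i = 0; i < n; i++)
           if (mask[i])                 /* vsub executes only under the mask */
               a[i] = a[i] - b[i];
   }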
The following loop works on sparse data: it computes the sum of sparse vectors a and c, using index vectors k and m to designate the non-zero elements of a and c (a and c must have the same number of non-zero elements, n):
for i = 1 to n
   a( k(i) ) = a( k(i) ) + c( m(i) )
For sparse matrices, the supporting operations are called "scatter-gather". A gather operation takes an index vector and fetches the vector whose elements are at the addresses given by the base address plus the offsets in the index vector. After the computation completes, the result can be stored back in expanded form using a scatter store with the same index vector. Suppose we have the instructions lvi (load vector indexed) and svi (store vector indexed), and ra, rc, rk and rm contain the starting addresses of vectors a, c, k and m. The inner loop of the above program can be vectorised as follows:
lv vk rk ; load k
lvi va (ra + vk ) ; load a ( k(i) )
lv vm rm ; load m
lvi vc (rc + vm ) ; load c ( m(i) )
vadd va va vc ; add
svi (ra + vk ) va ; store a( k(i) )
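A C sketch of the same gather/compute/scatter pattern (the array and index-vector names follow the pseudocode; 0-based indices, n <= 64 and the function name are assumptions):

   /* Sparse update a[k[i]] += c[m[i]] expressed as gather, add, scatter. */
   void sparse_add(int n, double *a, const double *c,
                   const int *k, const int *m)
   {
       double va[64], vc[64];                          /* models vector registers, n <= 64 */
       for (int i = 0; i < n; i++) va[i] = a[k[i]];    /* lvi: gather a(k(i)) */
       for (int i = 0; i < n; i++) vc[i] = c[m[i]];    /* lvi: gather c(m(i)) */
       for (int i = 0; i < n; i++) va[i] += vc[i];     /* vadd */
       for (int i = 0; i < n; i++) a[k[i]] = va[i];    /* svi: scatter back to a */
   }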
Consider the dot-product loop
dot = 0.0
for i = 1 to 64
   dot = dot + a(i) * b(i)
This loop has a loop-carried dependence on dot and cannot be vectorised directly. If we split the loop to separate out the vectorisable part:
for i = 1 to 64
   dot(i) = a(i) * b(i)
for i = 2 to 64
   dot(1) = dot(1) + dot(i)
the variable dot has been expanded into a vector; this transformation is called "scalar expansion". The first loop is fully vectorisable, and only the final reduction remains scalar.
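A C sketch of the split loop after scalar expansion (64 elements and the function name are assumed):

   /* Dot product split into a vectorisable multiply loop plus a scalar reduction. */
   double dot_scalar_expansion(const double *a, const double *b)
   {
       double dot[64];                   /* "dot" expanded into a vector */
       for (int i = 0; i < 64; i++)
           dot[i] = a[i] * b[i];         /* fully vectorisable */
       for (int i = 1; i < 64; i++)
           dot[0] += dot[i];             /* remaining scalar reduction */
       return dot[0];
   }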
Another technique is called "recursive doubling". A loop with a recurrence is transformed into a sequence of additions of progressively shorter vectors -- two 32-element vectors, then two 16-element vectors, and so on. This is faster than doing all the operations in scalar mode. An example of computing the second loop above with recursive doubling:
len = 32
for j = 1 to 6
   for i = 1 to len
      dot(i) = dot(i) + dot( i + len )
   len = len / 2
When the loop is done, the sum is in dot(1).
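The same reduction written as a C sketch (64 elements assumed); each pass adds the upper half of the live region into the lower half, so after six passes the sum is in dot[0] (dot(1) in the 1-based pseudocode).

   /* Recursive doubling: 64 -> 32 -> 16 -> ... -> 1 partial sums. */
   double reduce_recursive_doubling(double *dot)      /* dot has 64 elements */
   {
       for (int len = 32; len >= 1; len /= 2)
           for (int i = 0; i < len; i++)              /* vectorisable, length len */
               dot[i] += dot[i + len];
       return dot[0];
   }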
The total running time of a strip-mined vector loop can be modelled with the following terms:
Tn        total running time for a vector of length n
Telement  time to process one element
Tloop     loop overhead of the scalar strip-mining code
Tstart    vector start-up time
Tbase     overhead to compute the starting addresses and set up the vector control; incurred once per vector operation
For the CRAY-1, Tbase = 10 clocks and Tloop = 15 clocks.
What is the total running time for SAXPY on the CRAY-1 with 64-element vectors?
SAXPY instruction | starts at | completes at |
lv v1 rx      | 0                 | 12 + 64 = 76 |
vmul v2 s1 v1 | 12 + 1 = 13       | 13 + 7 + 64 = 84 |
lv v3 ry      | 76 + 1 = 77       | 77 + 12 + 64 = 153 |
vadd v4 v2 v3 | 77 + 1 + 12 = 90  | 90 + 6 + 64 = 160 |
sv ry v4      | 160 + 1 + 4 = 165 | 165 + 12 + 64 = 241 |
MVL = 64, Tloop = 15, Tbase = 10, Telement = 3, and Tstart = 12 + 7 + 12 + 6 + 12 = 49 (the sum of the start-up times in the table above). Using n/64 + 1 as an upper bound on the number of strips,
Tn = Tbase + ( n/64 + 1 ) * ( Tloop + Tstart ) + Telement * n
   = 10 + ( n/64 + 1 ) * ( 15 + 49 ) + 3n
   = 4n + 74
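A small C sketch of this running-time model, using the constants above (the function name is illustrative; the n/MVL + 1 strip count is the same upper bound used in the formula):

   /* Tn = Tbase + (n/MVL + 1)*(Tloop + Tstart) + Telement*n  =  4n + 74
      for Tbase=10, Tloop=15, Tstart=49, Telement=3, MVL=64. */
   int saxpy_cycles(int n)
   {
       const int MVL = 64, Tbase = 10, Tloop = 15, Tstart = 49, Telement = 3;
       return Tbase + (n / MVL + 1) * (Tloop + Tstart) + Telement * n;
   }
   /* Example: saxpy_cycles(64) = 10 + 2*64 + 3*64 = 330 = 4*64 + 74. */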
We can calculate the performance of a vector machine by
R∞ = lim (n → ∞) [ ( operations per iteration * clock rate ) / ( clock cycles per iteration ) ]
R∞ is the MFLOPS rate on a vector of infinite length.
Peak performance = ( number of FLOPs per iteration * clock rate ) / Telement
For the SAXPY code above, with 64-element vectors:
Rn = ( number of FLOPs per iteration * clock rate ) / ( clock cycles per iteration ) = ( 2 * n * clock rate ) / Tn
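A worked instance of these formulas as a C sketch; the 80 MHz clock rate assumed for the CRAY-1 and the 2 FLOPs per SAXPY iteration (one multiply, one add) are assumptions of this example, not stated above.

   /* MFLOPS estimates for SAXPY using the Tn = 4n + 74 model. */
   #include <stdio.h>

   int main(void)
   {
       const double clock_mhz = 80.0;        /* assumed CRAY-1 clock rate */
       const double flops_per_iter = 2.0;    /* multiply + add per iteration */

       /* R_infinity: cycles per iteration tend to 4 as n grows (from Tn = 4n + 74). */
       double r_inf = flops_per_iter * clock_mhz / 4.0;

       /* R64: 64 iterations take T64 = 4*64 + 74 = 330 cycles. */
       double t64 = 4.0 * 64 + 74;
       double r64 = flops_per_iter * 64 * clock_mhz / t64;

       printf("R_inf = %.1f MFLOPS, R64 = %.1f MFLOPS\n", r_inf, r64);  /* 40.0 and about 31.0 */
       return 0;
   }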