If VL is the vector length of the machine, the loop

```
for i = 1 to VL
    c(i) = a(i) + b(i)
```

can be transformed into one vector operation

```
vadd v0 v1 v2
```

where v0 holds c, v1 holds a, and v2 holds b.

There are several considerations for programming vector machines:

The following program can be strip-mined

```
for i = 1 to n
    y(i) = a * x(i) + y(i)
```

to

```
low = 1
VL = ( n mod MVL )
for j = 0 to ( n / MVL )
    for i = low to low + VL - 1
        y(i) = a * x(i) + y(i)
    low = low + VL
    VL = MVL
```

The inner loop (`for i = ..`) can be vectorised with length
VL. VL is equal to (n mod MVL) on the first trip through the outer loop,
and becomes MVL for the rest.
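The strip-mining logic can be checked with a short scalar model. This is a sketch in Python (the function name, MVL value, and test data are hypothetical, chosen for illustration); the inner `for` loop stands in for one vector operation of length `vl`:

```python
# Strip-mined SAXPY: process n elements in chunks of at most MVL.
# The first (short) strip handles n mod MVL elements; later strips are full.
def strip_mined_saxpy(a, x, y, mvl):
    n = len(x)
    low = 0                    # 0-based indexing in Python
    vl = n % mvl or mvl        # first strip: n mod MVL (or MVL if n divides evenly)
    while low < n:
        for i in range(low, low + vl):   # this inner loop is the "vector" op
            y[i] = a * x[i] + y[i]
        low += vl
        vl = mvl               # all remaining strips use the full vector length
    return y

x = list(range(1, 131))        # n = 130 with MVL = 64: strips of 2, 64, 64
y = [1.0] * 130
result = strip_mined_saxpy(2.0, x, y, 64)
```

With n = 130 and MVL = 64, the strips have lengths 2, 64, and 64, matching the (n mod MVL) rule above.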

The following program does matrix multiply

```
for i = 1 to 100
    for j = 1 to 100
        a(i,j) = 0.0
        for k = 1 to 100
            a(i,j) = a(i,j) + b(i,k) * c(k,j)
```

This program can be vectorised to multiply each row of b by each column of c. The inner loop is strip-mined over k. How are adjacent elements in b and c addressed?

**Example** for a 100x100 matrix

In row-major order, b(i,j) is adjacent to b(i,j+1):

(1,1) (1,2) ... (1,100) (2,1) (2,2) ... (2,100) ... (100,100)

and in column-major order, b(i,j) is adjacent to b(i+1,j):

(1,1) (2,1) ... (100,1) (1,2) (2,2) ... (100,2) ... (100,100)

In row-major order, as k varies in the inner loop, c has a stride of 100 and b has a stride of one.

Once a vector is loaded into a vector register, its elements are logically adjacent. This enables the machine to handle a non-unit stride such as that of c above. A vector load with non-unit stride is complicated and can cause memory bank conflicts.
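The address arithmetic behind these strides can be illustrated in Python (a sketch; the flat offset function models row-major memory layout for a 100x100 matrix, 0-based):

```python
# In row-major order, element (i, j) of an N x N matrix lives at
# flat offset i * N + j.
N = 100

def offset(i, j):
    return i * N + j

# Walking along a row of b (varying k in b(i,k)) moves by 1 element...
row_stride = offset(0, 1) - offset(0, 0)

# ...while walking down a column of c (varying k in c(k,j)) moves by N.
col_stride = offset(1, 0) - offset(0, 0)
```

So the vector load of a b row is unit-stride, while the vector load of a c column must skip 100 elements between accesses.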

```
1   for i = 1 to 100
2       a(i+1) = a(i) + b(i)
3       b(i+1) = b(i) + a(i+1)
```

This loop cannot be vectorised, because the computation uses a value from an earlier iteration: `a(i+1)` depends on `a(i)`, a loop-carried dependence. Also, line 3 has a RAW dependence on line 2 for `a(i+1)`.
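The effect of the dependence can be seen by comparing the serial loop with a naive statement-at-a-time "vectorisation" that reads the original values of a (a Python sketch with made-up data; the function names are illustrative):

```python
# Serial execution: each iteration sees the a(i+1) and b(i+1) just written.
def serial(a, b):
    a, b = a[:], b[:]
    for i in range(len(a) - 1):
        a[i + 1] = a[i] + b[i]
        b[i + 1] = b[i] + a[i + 1]
    return a, b

# Naive vectorisation: compute every a(i+1) in one step from the ORIGINAL
# a and b, ignoring the loop-carried dependence through a.
def naive_vector(a, b):
    a2 = a[:1] + [a[i] + b[i] for i in range(len(a) - 1)]
    b2 = b[:1] + [b[i] + a2[i + 1] for i in range(len(b) - 1)]
    return a2, b2

a = [1.0, 0.0, 0.0, 0.0]
b = [2.0, 0.0, 0.0, 0.0]
```

Running both on the same inputs gives different answers, which is exactly why the compiler must reject this loop.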

**Figure** The level of vectorisation on the Perfect Club benchmarks
running on the CRAY X-MP

| Benchmark name | FP ops | FP ops in vector mode |
| --- | --- | --- |
| ADM | 23% | 68% |
| DYFESM | 26% | 27% |
| FLO52 | 41% | 100% |
| MDG | 28% | 27% |
| MG3D | 31% | 86% |
| OCEAN | 28% | 58% |
| QCD | 14% | 1% |
| SPICE | 16% | 17% |
| TRACK | 9% | 23% |
| TRFD | 22% | 10% |

Consider a vector sequence

```
vmul v1 v2 v3
vadd v4 v1 v5
```

**Figure** Timings for a sequence of dependent vector operations

The following inner loop can be vectorised if we can selectively run it only for the elements for which the condition holds:

```
for i = 1 to 64
    if ( a(i) != 0 ) then a(i) = a(i) - b(i)
```

Using a vector mask, this loop can be implemented as:

```
lv    v1 ra     ; load vector a into v1
lv    v2 rb     ; load vector b
ld    f0 #0     ; f0 = 0
vsnes f0 v1     ; set VM to 1 if v1(i) != 0
vsub  v1 v1 v2  ; subtract under vector mask
cvm             ; set vector mask to all 1s
sv    ra v1     ; store the result to a
```
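The mask mechanism can be modelled in Python: build a boolean mask from the condition, then apply the subtraction only where the mask bit is set (a sketch with hypothetical data):

```python
a = [5.0, 0.0, 7.0, 0.0, 2.0]
b = [1.0, 1.0, 1.0, 1.0, 1.0]

# like vsnes: mask bit i is set where a(i) != 0
mask = [x != 0.0 for x in a]

# like vsub under vector mask: subtract only where the mask bit is 1,
# leave the element unchanged where it is 0
a = [x - y if m else x for x, y, m in zip(a, b, mask)]
```

Every element still flows through the "pipeline"; the mask only controls whether the result is written back, just as on the vector machine.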

The following loop computes the sum of sparse vectors stored in arrays a and c, using index vectors k and m to designate the non-zero elements of a and c (a and c must have the same number of non-zero elements, n):

```
for i = 1 to n
    a( k(i) ) = a( k(i) ) + c( m(i) )
```

For sparse matrices, the supporting operations are called "scatter-gather".
A gather operation takes an index vector and fetches the vector whose elements
are at the addresses given by base address + offsets from the index vector.
After completing the computation, the vector can be stored in expanded
form using a scatter store with the same index vector. Suppose we
have the instructions `lvi` (load vector indexed) and `svi`
(store vector indexed), and `ra`, `rc`, `rk`, and `rm` contain the
starting addresses of vectors a, c, k, and m. The inner loop of the above
program can be vectorised as follows:

```
lv  vk rk         ; load k
lvi va (ra + vk)  ; load a( k(i) )
lv  vm rm         ; load m
lvi vc (rc + vm)  ; load c( m(i) )
vadd va va vc     ; add
svi (ra + vk) va  ; store a( k(i) )
```
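The gather/add/scatter sequence corresponds directly to indexed reads and writes. A Python sketch (the data and index vectors are made up, with 0-based indices):

```python
a = [0.0] * 8
c = [0.0] * 8
k = [1, 4, 6]     # positions of the non-zero elements of a
m = [0, 2, 5]     # positions of the non-zero elements of c (same count n = 3)
for i, v in zip(k, [10.0, 20.0, 30.0]):
    a[i] = v
for i, v in zip(m, [1.0, 2.0, 3.0]):
    c[i] = v

va = [a[i] for i in k]                   # lvi: gather a( k(i) )
vc = [c[i] for i in m]                   # lvi: gather c( m(i) )
vsum = [x + y for x, y in zip(va, vc)]   # vadd on the dense registers
for i, v in zip(k, vsum):                # svi: scatter back to a( k(i) )
    a[i] = v
```

Note the addition itself runs on dense vector registers; only the loads and the store touch memory through the index vector.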

```
dot = 0.0
for i = 1 to 64
    dot = dot + a(i) * b(i)
```

This loop has a loop-carried dependency on dot and cannot be vectorised as written. If we split the loop to separate out the vectorisable part:

```
for i = 1 to 64
    dot(i) = a(i) * b(i)
for i = 2 to 64
    dot(1) = dot(1) + dot(i)
```

the variable dot has been expanded into a vector; this is called "scalar expansion". The first loop now vectorises, while the second loop (the reduction) remains serial.

Another technique is called "**recursive doubling**". A loop
with a recurrence is transformed by adding sequences of progressively
shorter vectors: two 32-element vectors, then two 16-element vectors,
and so on. This is faster than doing all the operations in scalar mode.
An example of performing the second loop above with recursive doubling:

```
len = 32
for j = 1 to 6
    for i = 1 to len
        dot(i) = dot(i) + dot( i + len )
    len = len / 2
```

When the loop is done, the sum is in dot(1).
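The recursive-doubling reduction can be checked directly. A Python sketch of the same loop (0-based, so the final sum lands in dot[0]; the data is a stand-in for the products a(i) * b(i)):

```python
# Reduce 64 partial products to a single sum by repeatedly adding the
# upper half of the vector into the lower half: 32+32, 16+16, ..., 1+1.
dot = [float(i) for i in range(64)]   # stand-in for a(i) * b(i)
expected = sum(dot)

length = 32
for _ in range(6):                    # 6 halvings: len = 32, 16, 8, 4, 2, 1
    for i in range(length):
        dot[i] = dot[i] + dot[i + length]
    length //= 2
# the sum now sits in dot[0]
```

Each halving is one vector add of length `length`, so the whole reduction costs 6 vector operations instead of 63 scalar additions.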

Tn : total running time for a vector of length n

Telement : time to process one element

Tloop : loop overhead of the scalar strip-mining code

Tstart : vector start-up time

Tbase : overhead to compute the starting address and set up the vector
control; occurs once per vector operation.

For the CRAY-1, Tbase = 10 clocks and Tloop = 15 clocks.

What is the total running time for SAXPY on the CRAY-1 with 64-element
vectors?

| SAXPY | starts at | completes at |
| --- | --- | --- |
| lv v1 rx | 0 | 12 + 64 = 76 |
| vmul v2 s1 v1 | 12 + 1 = 13 | 13 + 7 + 64 = 84 |
| lv v3 ry | 76 + 1 = 77 | 77 + 12 + 64 = 153 |
| vadd v4 v2 v3 | 77 + 1 + 12 = 90 | 90 + 6 + 64 = 160 |
| sv ry v4 | 160 + 1 + 4 = 165 | 165 + 12 + 64 = 241 |

Tstart = 241 - 192 = 49

(the 192 is 64 elements at Telement = 3 clocks per element)

Calculated the other way, from the pipeline latencies:

12 + 7 + 12 + 6 + 12 = 49

MVL = 64, Tloop = 15, Tbase = 10, Telement = 3

Tn = 10 + ( floor( n/64 ) + 1 ) * ( 15 + 49 ) + 3 n

≈ 4 n + 74

We can calculate the performance of a vector machine by

R∞ = lim (n → ∞) ( operations per iteration * clock rate / clock cycles per iteration )

R∞ is the MFLOPS rate for an infinite vector length.

For the SAXPY code above, with 66-element vectors:

Peak performance = number of FLOPS per iteration * clock rate / Telement

Rn = number of FLOPS for n iterations * clock rate / Tn

T66 = 10 + 2 * (15 + 49) + 66 * 3

= 336

Assuming our vector processor has an 80 MHz clock,

R66 = 2 * 66 * 80 / 336 = 31.4 MFLOPS
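The timing model can be reproduced in a few lines of Python (a sketch using the constants above; `flops_per_iter = 2` because SAXPY does one multiply and one add per element):

```python
# CRAY-1 strip-mining timing model from the text
MVL, T_loop, T_base, T_start, T_element = 64, 15, 10, 49, 3

def T(n):
    # strips executed = floor(n / MVL) + 1, matching the strip-mined code
    return T_base + (n // MVL + 1) * (T_loop + T_start) + T_element * n

clock_mhz = 80
flops_per_iter = 2                            # one multiply + one add
n = 66
R_n = flops_per_iter * n * clock_mhz / T(n)   # sustained MFLOPS for length n
```

This reproduces T66 = 336 and R66 ≈ 31.4 MFLOPS, and for n a multiple of 64 the formula collapses to the 4n + 74 approximation above.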