The main method to use many cores together to run one program is "divide
and conquer". Dividing means partitioning the data among the cores. All
cores run concurrently, hence they will need to "sync" at some point so
that all partial results can be combined. Let us walk through a
simple example.
Let us sum all the elements in an array of size N. Here is the single-core
version.
sum()
s = 0
for i = 0; i < N; i++
s += ax[i]
return s
To divide the data, we can use the idea of "striping". Assume we
have 4 cores. The first core will sum indices 0, 4, 8, ...
The second core handles indices 1, 5, 9, .... That is, core m will do
indices m, m+4, ... So, the "stride" is the number of cores
used. Here is the multicore version of the above program.
// all cores share sum() function, assume K cores
// starting from index m, stride K
sum(m)
s = 0
for i = m; i < N; i += K
s += ax[i]
return s
// partial sum, global
s0, s1, s2, s3
// now each core
core0()
s0 = sum(0)
sync()
s = s0 + s1 + s2 + s3
print(s)
core1()
s1 = sum(1)
sync()
core2()
s2 = sum(2)
sync()
core3()
s3 = sum(3)
sync()
After all cores finish their tasks, they sync; then core0 continues to
combine the partial sums into the final result. Let us look at
another similar example: finding the maximum element in an array.
pmax(m)
mx = ax[m]
  for i = m + K; i < N; i += K
if mx < ax[i]
mx = ax[i]
return mx
// partial results, global
mx0, mx1, mx2, mx3
// now each core
core0()
mx0 = pmax(0)
  sync()
mx = max(mx0, mx1)
mx = max(mx, mx2)
mx = max(mx, mx3)
print(mx)
core1()
mx1 = pmax(1)
  sync()
core2() ...
Let us try a more complicated problem: matrix multiplication. Here is
the definition of matrix multiplication.
c[i][j] = sum_k a[i][k] * b[k][j]
The term sum_k a[i][k] * b[k][j] is called the
inner product, or the dot product. Here is the code for it.
Let the square matrices have rank N.
// let a, b, c be global
a[N][N], b[N][N], c[N][N]
inner(i,j)
m = 0
for k = 0; k < N; k++
m += a[i][k] * b[k][j]
return m
Now a single core version is:
matmul()
for i = 0; i < N; i++
for j = 0; j < N; j++
c[i][j] = inner(i,j)
How should we divide the data for multicore? One straightforward
way is not to divide the work inside inner() but to have each core compute
whole inner products, striding over the rows. For example, let N = 4 and K = 4.
// shared function for all cores, starting at row m
do_inner(m)
for i = m; i < N; i += K
for j = 0; j < N; j++
c[i][j] = inner(i,j)
core0()
do_inner(0)
sync()
core1()
do_inner(1)
sync()
core2()
do_inner(2)
sync()
core3()
do_inner(3)
sync()
suma-rz40.txt sums all elements in an array using four cores (compile with rz40).
The rz40.zip compiler will generate correct code for the s30 simulator. The compiler has these built-in functions:
core0()
core1()
....
sync()
last update 6 Nov 2017