Multi-core processor programming

The main method to use many cores together to run one program is "divide and conquer".  Dividing means to divide data for each core.  All cores run concurrently hence they will need to "sync" at the some point so that all partial results can be unified.  Let us walk through a simple example.

Let do sum of all elements in an array of size N.  Here is the single core version.

sum()
  s = 0
  for i = 0; i < N; i++
     s += ax[i]
  return s


To divide the data, we can use the idea of "stripping".  Assume we have 4 cores.  The first core will sum from index 0, 4, 8, ...  The second core performs index 1, 5, 9 .... That is, each core will do the index i, i+4, ...  So, the "stride"  is the number of core used.  Here is the multicore version of the above program.

// all cores share sum() function, assume K cores

// starting from index m, stride K
sum(m)
  s = 0
  for i = m; i < N; i += K
     s += ax[i]
  return s

// partial sum, global

s0, s1, s2, s3

// now each core

core0()
  s0 = sum(0)
  sync()
  s = s0 + s1 + s2 + s3
  print(s)

core1()
  s1 = sum(1)
  sync()

core2()
  s2 = sum(2)
  sync()

core3()
  s3 = sum(3)
  sync()


After all cores finish its task, they sync, then, core0 continue to process the partial sum to give the final result.  Let look at another similar example, find the maximum element in an array.

pmax(m)
  mx = ax[m]
  for i = m+1; i < N; i += K
    if mx < ax[i]
      mx = ax[i]
  return mx

// partial results, global

mx0, mx1, mx2, mx3

// now each core

core0()
  mx0 = pmax(0)
  sync
  mx = max(mx0, mx1)
  mx = max(mx, mx2)
  mx = max(mx, mx3)
  print(mx)

core1()
  mx1 = pmax(1)
  sync

core2() ...


Let us try a more complicate problem, matrix multiplication.  Here is the definition of matrix multiplication.

  c[i][j] = sum_k a[i][k] * b[k][j]

The part of sum_k a[i][k] * b[k][j]  is called the inner product, or the dot product.  Here is the code for it.  Let the square matrix has rank N.

//  let a, b, c be global
a[N][N], b[N][N], c[N][N]

inner(i,j)
  m = 0
  for k = 0; k < N; k++
    m += a[i][k] * b[k][j]
  return m


Now a single core version is:

matmul()
  for i = 0; i < N; i++
    for j = 0; j < N; j++
      c[i][j] = inner(i,j)


How should we divide the data for multicore?  One straight forward way is not divide the work in inner() but consider divide the work for each processor to do one inner.  For example, let N = 4 and K = 4.

//  share function for all cores, start at index m

do_inner(m)
  for i = m; i < N; i += K
    for j = 0; j < N; j++
       c[i][j] = inner(i,j)

core0()
  do_inner(0)
  sync()

core1()
  do_inner(1)
  sync()

core2()
  do_inner(2)
  sync()

core3()
  do_inner(3)
  sync()

Example programs

suma-rz40.txt     sum all elements in an array using four cores (compile with rz40)

Tools

The rz40.zip  compiler will generate a correct code for s30 simulator.  The compiler has these built-in functions:

core0()
core1()
....
sync()


last update 6 Nov 2017