Example of NPU programming

Reduction


We want to "sum" across PEs.  Let rx[i] denotes register x on PE[i]. Write a program to sum all r2 over all PEs. Assume there are 4 PEs (0..3). Assume there are initial values in r2[.] (all PEs).

We must send one r2 to other PEs in order to "sum" them.

str 2       ;  LDS = R[2]
bc 3 0      ;  R[3] = LDS[0] 
add 4 2 3   ;  R[4] = R[2] + R20  ; R[3] is R[2] of PE0


Now look at the value of R[4] of each PE

R40  is  R20 + R20
R41  is  R20 + R21
R42  is  R20 + R22
R43  is  R20 + R23

If we now send R41 to sum with other PEs R[4]

str 4
bc 5 1      ;  R[5] = R41
add 6 5 4   ;  R[6] = R41 + R[4]


the value of R[6] are

R60  is  (R20 + R21) + (R20 + R20)
R61  is  (R20 + R21) + (R20 + R21)
R62  is  (R20 + R21) + (R20 + R22)
R63  is  (R20 + R21) + (R20 + R23)

str 6
bc 7 2      ; R[7] = R62
add 8 7 6   ; R[8] = R62 + R[6]


and R[8] are

R80  ...
R81
R82
R83  is  (R20+R21) + (R20+R22) + (R20+R21) + (R20+R23)

We need to get rid of (R20+R21) once, and R20 twice then we will get the desired result.  That value (R20+R21)+(R20+R20) is in R60.

str 6
bc 9 0      ; R[9] = R60
sub 10 8 9  ; done!


The result is in R10[3].

This is rather convolute and not easy to understand.
Another way to do it (which is easy to understand) is to move all R2 to one PE and sum them there.  Let us PE0.

Move R21 to R30, R22 to R40, R23 to R50

str 2
bc 3 1     ;  R30 = R21
bc 4 2     ;  R40 = R22
bc 5 3     ;  R50 = R23
add 6 2 3  ;  R60 = R20 + R21
add 6 6 4  ;  R60 = R20 + R21 + R22
add 6 6 5  ;  R60 = R20 + R21 + R22 + R23


The lesson learn is that reduction across PEs can not be done in parallel.  We are restricted by the "broadcast" way to move data around between PEs. 

If we have a way to communicate between near neighbour in parallel then we can do "divide and conquer".  For example,

If we can sum   R20 + R21  at the same time as R22 + R23, then we add these two together.  It takes only two steps.  Each step reduces the work by half. So it takes log_2 n steps to reduce n elements.

See the actual code here and the screen dump of running it on npu4 simulator.


D:\prabhas\bag\npu\npu4\test>npasm < reduce1.txt > reduce1.obj

D:\prabhas\bag\npu\npu4\test>npu < reduce1.obj
R[0] 110 121 132 143
R[1] 0 0 0 0
R[2] 11 22 33 44
R[3] 22 22 22 22
LDS  11 22 33 44
execute 14 instructions 106 cycles

D:\prabhas\bag\npu\npu4\test>

last update 6 July 2013