Example of NPU programming

Reduction

We want to "sum" across PEs. Let rx[i] denotes register x on PE[i]. Write a program to sum all r2 over all PEs. Assume there are 4 PEs (0..3). Assume there are initial values in r2[.] (all PEs).

We must send one r2 to other PEs in order to "sum" them.

str 2       ; LDS = R[2] bc 3 0      ; R[3] = LDS[0] add 4 2 3   ; R[4] = R[2] + R20 ; R[3] is R[2] of PE0

Now look at the value of R[4] of each PE

R40 is R20 + R20
R41 is R20 + R21
R42 is R20 + R22
R43 is R20 + R23

If we now send R41 to sum with other PEs R[4]

str 4 bc 5 1      ; R[5] = R41 add 6 5 4   ; R[6] = R41 + R[4]

the value of R[6] are

R60 is (R20 + R21) + (R20 + R20)
R61 is (R20 + R21) + (R20 + R21)
R62 is (R20 + R21) + (R20 + R22)
R63 is (R20 + R21) + (R20 + R23)

str 6 bc 7 2      ; R[7] = R62 add 8 7 6   ; R[8] = R62 + R[6]

and R[8] are

R80 ...
R81
R82
R83 is (R20+R21) + (R20+R22) + (R20+R21) + (R20+R23)

We need to get rid of (R20+R21) once, and R20 twice then we will get the desired result. That value (R20+R21)+(R20+R20) is in R60.

str 6 bc 9 0      ; R[9] = R60 sub 10 8 9 ; done!

The result is in R10[3].

This is rather convolute and not easy to understand.
Another way to do it (which is easy to understand) is to move all R2 to one PE and sum them there. Let us PE0.

Move R21 to R30, R22 to R40, R23 to R50

str 2 bc 3 1     ; R30 = R21 bc 4 2     ; R40 = R22 bc 5 3     ; R50 = R23 add 6 2 3 ; R60 = R20 + R21 add 6 6 4 ; R60 = R20 + R21 + R22 add 6 6 5 ; R60 = R20 + R21 + R22 + R23

The lesson learn is that reduction across PEs can not be done in parallel. We are restricted by the "broadcast" way to move data around between PEs.

If we have a way to communicate between near neighbour in parallel then we can do "divide and conquer". For example,

If we can sum   R20 + R21 at the same time as R22 + R23, then we add these two together. It takes only two steps. Each step reduces the work by half. So it takes log_2 n steps to reduce n elements.

See the actual code here and the screen dump of running it on npu4 simulator.

D:\prabhas\bag\npu\npu4\test>npasm < reduce1.txt > reduce1.obj

D:\prabhas\bag\npu\npu4\test>npu < reduce1.obj
R[0] 110 121 132 143
R[1] 0 0 0 0
R[2] 11 22 33 44
R[3] 22 22 22 22
LDS 11 22 33 44
execute 14 instructions 106 cycles

D:\prabhas\bag\npu\npu4\test>

last update 6 July 2013