We want to "sum" across PEs. Let rx[i] denotes register x on PE[i]. Write
a program to sum all r2 over all PEs. Assume there are 4 PEs (0..3).
Assume there are initial values in r2[.] (all PEs).
We must send one r2 to other PEs in order to "sum" them.
str 2 ; LDS = R[2]
bc 3 0 ; R[3] = LDS[0]
add 4 2 3 ; R[4] = R[2] + R20 ; R[3] is R[2] of PE0
Now look at the value of R[4] of each PE
R40 is R20 + R20
R41 is R20 + R21
R42 is R20 + R22
R43 is R20 + R23
If we now send R41 to sum with other PEs R[4]
str 4
bc 5 1 ; R[5] = R41
add 6 5 4 ; R[6] = R41 + R[4]
the value of R[6] are
R60 is (R20 + R21) + (R20 + R20)
R61 is (R20 + R21) + (R20 + R21)
R62 is (R20 + R21) + (R20 + R22)
R63 is (R20 + R21) + (R20 + R23)
str 6
bc 7 2 ; R[7] = R62
add 8 7 6 ; R[8] = R62 + R[6]
and R[8] are
R80 ...
R81
R82
R83 is (R20+R21) + (R20+R22) + (R20+R21) + (R20+R23)
We need to get rid of (R20+R21) once, and R20 twice then we will get the
desired result. That value (R20+R21)+(R20+R20) is in R60.
str 6
bc 9 0 ; R[9] = R60
sub 10 8 9 ; done!
The result is in R10[3].
This is rather convolute and not easy to understand.
Another way to do it (which is easy to understand) is to move all R2 to
one PE and sum them there. Let us PE0.
Move R21 to R30, R22 to R40, R23 to R50
str 2
bc 3 1 ; R30 = R21
bc 4 2 ; R40 = R22
bc 5 3 ; R50 = R23
add 6 2 3 ; R60 = R20 + R21
add 6 6 4 ; R60 = R20 + R21 + R22
add 6 6 5 ; R60 = R20 + R21 + R22 + R23
The lesson learn is that reduction across PEs can not be done in
parallel. We are restricted by the "broadcast" way to move data around
between PEs.
If we have a way to communicate between near neighbour in parallel then we
can do "divide and conquer". For example,
If we can sum R20 + R21 at the same time as R22 + R23, then we add
these two together. It takes only two steps. Each step reduces the work
by half. So it takes log_2 n steps to reduce n elements.
See the actual code here and the screen dump of running it on npu4 simulator.
D:\prabhas\bag\npu\npu4\test>npasm < reduce1.txt >
reduce1.obj
D:\prabhas\bag\npu\npu4\test>npu < reduce1.obj
R[0] 110 121 132 143
R[1] 0 0 0 0
R[2] 11 22 33 44
R[3] 22 22 22 22
LDS 11 22 33 44
execute 14 instructions 106 cycles
D:\prabhas\bag\npu\npu4\test>
last update 6 July 2013