We want to "sum" across PEs.  Let rx[i] denotes register x on PE[i]. Write
      a program to sum all r2 over all PEs. Assume there are 4 PEs (0..3).
      Assume there are initial values in r2[.] (all PEs).
      
      We must send one r2 to other PEs in order to "sum" them.
      
      str 2       ;  LDS = R[2]
        bc 3 0      ;  R[3] = LDS[0]  
        add 4 2 3   ;  R[4] = R[2] + R20  ; R[3] is R[2] of PE0
      
      Now look at the value of R[4] of each PE
      
      R40  is  R20 + R20
      R41  is  R20 + R21
      R42  is  R20 + R22
      R43  is  R20 + R23
      
      If we now send R41 to sum with other PEs R[4]
      
      str 4
        bc 5 1      ;  R[5] = R41
        add 6 5 4   ;  R[6] = R41 + R[4]
      
      the value of R[6] are
      
      R60  is  (R20 + R21) + (R20 + R20)
      R61  is  (R20 + R21) + (R20 + R21)
      R62  is  (R20 + R21) + (R20 + R22) 
      R63  is  (R20 + R21) + (R20 + R23)
      
      str 6
        bc 7 2      ; R[7] = R62
        add 8 7 6   ; R[8] = R62 + R[6]
      
      and R[8] are
      
      R80  ...
      R81
      R82
      R83  is  (R20+R21) + (R20+R22) + (R20+R21) + (R20+R23)
      
      We need to get rid of (R20+R21) once, and R20 twice then we will get the
      desired result.  That value (R20+R21)+(R20+R20) is in R60.
      
      str 6
        bc 9 0      ; R[9] = R60
        sub 10 8 9  ; done!
      
      The result is in R10[3].
      
      This is rather convolute and not easy to understand.
      Another way to do it (which is easy to understand) is to move all R2 to
      one PE and sum them there.  Let us PE0.
      
      Move R21 to R30, R22 to R40, R23 to R50
      
      str 2
        bc 3 1     ;  R30 = R21
        bc 4 2     ;  R40 = R22
        bc 5 3     ;  R50 = R23
        add 6 2 3  ;  R60 = R20 + R21
        add 6 6 4  ;  R60 = R20 + R21 + R22
        add 6 6 5  ;  R60 = R20 + R21 + R22 + R23
      
      The lesson learn is that reduction across PEs can not be done in
      parallel.  We are restricted by the "broadcast" way to move data around
      between PEs.  
      
      If we have a way to communicate between near neighbour in parallel then we
      can do "divide and conquer".  For example, 
      
      If we can sum   R20 + R21  at the same time as R22 + R23, then we add
      these two together.  It takes only two steps.  Each step reduces the work
      by half. So it takes log_2 n steps to reduce n elements.
    
See the actual code here and the screen dump of running it on npu4 simulator.
      D:\prabhas\bag\npu\npu4\test>npasm < reduce1.txt >
        reduce1.obj
      
      D:\prabhas\bag\npu\npu4\test>npu < reduce1.obj
      R[0] 110 121 132 143
      R[1] 0 0 0 0
      R[2] 11 22 33 44
      R[3] 22 22 22 22
      LDS  11 22 33 44
      execute 14 instructions 106 cycles
      
      D:\prabhas\bag\npu\npu4\test>
      
      last update 6 July 2013