See NPU in action (online demonstration: computing the Mandelbrot set)
This is a simple GPU with four 32-bit cores: 4 Processing Elements (PE, or
cores). Each PE has 32 registers, one ALU and a Local data store (LS). It
also includes a random number generator (32 bits, 4x8). There are 16K x 32
bits of memory. The memory interfaces with the processor through the Memory
Interface (MI), which connects to the Local Store (LS). The LS communicates
with all PEs in parallel. Instructions have a fixed size of 32 bits.
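As a mental model, the hardware above can be sketched in Python. The names, and modelling the Local Store as one word per PE, are illustrative assumptions, not part of the simulator:

```python
# Illustrative sketch of the NPU state described above.
# Assumptions: one LS word per PE; names are mine, not the simulator's.
N_PES = 4                  # four 32-bit processing elements
N_REGS = 32                # registers per PE
MEM_WORDS = 16 * 1024      # 16K x 32-bit main memory

class NPUState:
    def __init__(self):
        self.mem = [0] * MEM_WORDS                        # shared main memory
        self.regs = [[0] * N_REGS for _ in range(N_PES)]  # per-PE registers
        self.ls = [0] * N_PES                             # Local Store, one slot per PE
        self.pc = 0                                       # single shared program counter

npu = NPUState()
```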
The organisation of NPU is not unlike a multicore processor. All
processing elements (PEs) are similar to ordinary cores. However, they
share the same Program memory (stored program), Program counter and
Instruction Register: all PEs run the same instruction. This affects
programming in a big way. Program memory and Main memory share the same
address space.
Each PE has three distinct units: Register, ALU and Address Unit (AU).
The registers of each PE connect to its Local Store (LS), which is an
interface to the main memory. The AU outputs the effective address (in the
case of indexed addressing) to the Memory Interface unit (MI).
The Memory Interface is a big highway connecting Main Memory to the Local
Stores. Accessing main memory is similar to a single-core processor; that is,
a load/store instruction can access one location at a time. So, moving data
between Main memory and LS is a serial operation. Once all data are in
LS, moving LS to R is simultaneous for all PEs.
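The serial M-to-LS transfers followed by the parallel LS-to-R move can be sketched as follows (a toy model, assuming one LS word per PE and 4 PEs):

```python
# Toy model: 4 PEs, each with one Local Store word and 32 registers.
mem = {104: 1, 105: 2, 106: 3, 107: 4}       # main memory (sparse)
ls = [0, 0, 0, 0]                             # Local Store, one word per PE (assumed)
regs = [[0] * 32 for _ in range(4)]           # 32 registers per PE

# "ld pe @addr" is serial: each instruction moves ONE word from M to one LS.
for pe, addr in enumerate(range(104, 108)):
    ls[pe] = mem[addr]

# "ldr 0" is parallel: every PE moves its LS word into r0 in the same step.
for pe in range(4):
    regs[pe][0] = ls[pe]
```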
There are two instructions unique to NPU (with respect to a CPU). Load
wide (ldw) sends a word of M to all LS at once. Broadcast (bc) sends the LS
value of one PE to a register of all PEs. This allows PEs to exchange data
without going through main memory.
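A sketch of how ldw and bc might behave, following the way the example programs later in this document use them (the exact semantics here are my assumptions):

```python
# Toy model of "ldw" and "bc" (semantics inferred from the examples below).
mem = {100: 7}
ls = [0, 0, 0, 0]                    # one Local Store word per PE (assumed)
regs = [[0] * 32 for _ in range(4)]

def ldw(addr):
    """'ldw @addr': one word of main memory goes to ALL Local Stores at once."""
    for pe in range(4):
        ls[pe] = mem[addr]

def bc(reg, src_pe):
    """'bc reg pe': the LS value of one PE goes to register reg of ALL PEs."""
    for pe in range(4):
        regs[pe][reg] = ls[src_pe]

ldw(100)          # every LS now holds M[100]
ls[2] = 99        # pretend PE2 placed a value in its LS (via "str")
bc(5, 2)          # r5 of every PE now holds PE2's value
```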
Instruction format (32 bits): op:8 a1:14 a2:5 a3:5
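Packing and unpacking this format can be illustrated in Python. The field order (op in the top bits, a3 in the lowest) is an assumption, not stated by the text:

```python
# Sketch of the 32-bit format op:8 a1:14 a2:5 a3:5 (8 + 14 + 5 + 5 = 32).
# Assumption: op occupies the most significant bits.
def encode(op, a1, a2, a3):
    assert op < 2**8 and a1 < 2**14 and a2 < 2**5 and a3 < 2**5
    return (op << 24) | (a1 << 10) | (a2 << 5) | a3

def decode(word):
    return (word >> 24, (word >> 10) & 0x3FFF, (word >> 5) & 0x1F, word & 0x1F)
```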
<to be updated>
A source file consists of a code section and a data section, each
terminated by .end:
<code>
.end
<data>
.end
; comment until end of line
ld 0 @100
add r3 r1 r2
:loop
jmp @loop
...
.end
@ads
10 20 30 ...
.end
A = B * C

Let A be at @100..@103, B at @104..@107 and C at @108..@111; use R[2] for
A, R[0] for B, R[1] for C.

ld 0 @104
ld 1 @105
ld 2 @106
ld 3 @107 ; load B from Mem to LS
ldr 0 ; move LS to R[0]
ld 0 @108
ld 1 @109
ld 2 @110
ld 3 @111 ; load C from Mem to LS
ldr 1 ; move LS to R[1]
mul 2 0 1 ; R[2] = R[0] * R[1] all cores
str 2 ; move R[2] to LS
st 0 @100
st 1 @101
st 2 @102
st 3 @103 ; store LS to Mem
sys 4 ; stop simulation
.end
; data ; initialise Mem
@100
0 0 0 0 ; A
1 2 3 4 ; B
2 3 4 5 ; C
.end
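The same computation in plain Python, one lane per PE (the memory layout mirrors the data section above):

```python
# A = B * C with the layout above: A at 100..103, B at 104..107, C at 108..111.
mem = [0] * 112
mem[100:104] = [0, 0, 0, 0]   # A
mem[104:108] = [1, 2, 3, 4]   # B
mem[108:112] = [2, 3, 4, 5]   # C

# In hardware the "mul" happens on all 4 PEs in the same cycle;
# here we just loop over the lanes.
for pe in range(4):
    mem[100 + pe] = mem[104 + pe] * mem[108 + pe]
```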
NPU instructions can perform loop iteration using jump, jump-if-zero and
jump-if-not-zero (jmp, jz, jnz). Because NPU is SIMD (single instruction,
multiple data), the zero/not-zero condition must be true for all PEs (they
work in lock-step). For example, the program below loops using R[2] as the
counter i.

We show the first program that uses all PEs to do the same task. (Quite a
waste, but the program is easy to understand.)

i = 0
s = 0
while ax[i] != 0
  s = s + ax[i]
  i++
; r1 s, r2 i, r3 ax[i], r5 &ax
clr 2 ; i = 0
clr 1 ; s = 0
clr 5
addi 5 5 #100 ; base &ax
:loop
ldx 0 5 2 ; get ax[i] to all cores
ldx 1 5 2
ldx 2 5 2
ldx 3 5 2
ldr 3 ; to r3
jz 3 @exit ; ax[i] == 0 ?
add 1 1 3 ; s += ax[i]
inc 2 ; i++
jmp @loop
:exit
sys 4
.end
@100 ; ax[.]
1 2 3 4 5 0
.end
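In Python terms: every PE traverses the same data, so the jz condition becomes true on all PEs in the same iteration, which is what makes the lock-step branch safe here:

```python
# All 4 PEs load the SAME element each iteration, so the exit condition
# (ax[i] == 0) holds on every PE at the same time -- safe in lock-step.
ax = [1, 2, 3, 4, 5, 0]
s = [0] * 4            # r1 of each PE
i = 0                  # r2, identical on every PE

while True:
    v = ax[i]          # every PE fetches the same ax[i]
    if v == 0:         # condition is identical across PEs (jz)
        break
    for pe in range(4):
        s[pe] += v     # each PE adds the same value -- "quite a waste"
    i += 1
```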
; process 4 elements at once, keep intermediate result in registers
; fetch next data 4 elements, stride 4, repeat
; fetching an element by ldx ls r1 r2, where r1 is base ads, r2 is index
; all r1 are the same base, all r2 have different starting index
:main
; initialize base and index, r1 base, r2 index
ldw @100 ; @100 stores base address
ldr 1 ; r1 -- base address
ld 0 @101 ; @101 stores 0 initial index
ld 1 @102 ; @102 stores 1
ld 2 @103 ; @103 stores 2
ld 3 @104 ; @104 stores 3
ldr 2 ; r2 -- index
clr 4 ; r4 -- sum = 0
addi 8 4 #2 ; r8 -- loop count #2
; fetch n elements
:loop
ldx 0 1 2 ; ax[i]
ldx 1 1 2
ldx 2 1 2
ldx 3 1 2
ldr 3 ; r3 = ax[i]
add 4 4 3 ; sum += ax[i]
addi 2 2 #4 ; index += 4
dec 8 ; loop count --
jnz 8 @loop
; now partial sum is in r4
; how to sum all r4s together
; accumulate it in r6
; broadcast each r4 to r5 and r6 += r5
clr 6 ; r6 -- bigsum = 0
str 4
bc 5 0 ; all r5 = r4_pe0
add 6 6 5
bc 5 1 ; r5s = r4_pe1
add 6 6 5
bc 5 2 ; r5s = r4_pe2
add 6 6 5
bc 5 3 ; r5s = r4_pe3
add 6 6 5
str 6
st 0 @105 ; @105 stores result
sys 4
.end
@100
106 0 1 2 3 0 ; base address, constant 0,1,2,3, result
11 22 33 44 ; @106 ax[.]
55 66 77 88
.end
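The stride-4 partial sums and the final reduction can be checked in plain Python (register names in the comments refer to the program above):

```python
# Stride-4 partial sums, then a broadcast reduction, mirroring the program above.
ax = [11, 22, 33, 44, 55, 66, 77, 88]   # 8 elements: 2 rounds of 4
r4 = [0] * 4                            # per-PE partial sum (r4)
index = [0, 1, 2, 3]                    # per-PE starting index (r2)

for _ in range(2):                      # r8 loop count = 2
    for pe in range(4):
        r4[pe] += ax[index[pe]]         # ldx / ldr 3 / add 4 4 3
    index = [i + 4 for i in index]      # addi 2 2 #4: stride 4

# Reduction: broadcast each PE's partial sum and accumulate (bc / add 6 6 5).
bigsum = 0                              # r6 on every PE
for pe in range(4):
    bigsum += r4[pe]
```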
Let us walk through the program. "ld 0 @101" loads the index stored at
M[101] to LS[0], and so on for the other indexes. When all indexes are
fetched, "ldr 2" moves them to r2 of all PEs.

"ldx 0 1 2" gets one element, ax[0]. The other "ldx" do the same but with a
different index: ax[1], ax[2], ax[3]. Then "ldr 3" moves them to r3 of all
PEs. Each PE now does "add 4 4 3", the sum of its element. In one round, 4
elements have been processed. In the next round, the index is incremented
by 4, so PE0 gets ax[4], PE1 ax[5], PE2 ax[6], PE3 ax[7].

"str 4" moves the partial sum of each PE to its LS. "bc 5 0" moves this
value to r5 of all PEs. (This is how we can get the value of a register of
another PE.) Do the sum by "add 6 6 5". Repeat this for the other 3 partial
sums in PE1, PE2, PE3. To get the final result, which is now in r6, "str 6"
moves r6 to LS and then LS to memory M[105]. This final operation is
called "reduction".

The simulator loads an object file and provides commands to set and inspect
its state. "s r3 17" sets the value of R[3] of all PEs. "s m100 50" sets
M[100] = 50. "s pc 10" sets PC to 10. The dump command inspects the memory:
"d 100 10" shows 10 locations of the memory starting from 100. "r" shows
all registers of all PEs.
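A toy interpreter for these set/dump commands (an illustrative sketch; the real simulator's command set may differ in details):

```python
# Minimal sketch of the simulator's set/dump commands (illustrative only).
state = {"pc": 0, "mem": [0] * 200, "regs": [[0] * 32 for _ in range(4)]}

def cmd(line):
    p = line.split()
    if p[0] == "s" and p[1] == "pc":               # s pc 10  -> PC = 10
        state["pc"] = int(p[2])
    elif p[0] == "s" and p[1].startswith("m"):     # s m100 50 -> M[100] = 50
        state["mem"][int(p[1][1:])] = int(p[2])
    elif p[0] == "s" and p[1].startswith("r"):     # s r3 17  -> R[3] = 17 on all PEs
        for pe in range(4):
            state["regs"][pe][int(p[1][1:])] = int(p[2])
    elif p[0] == "d":                              # d 100 10 -> dump 10 words from 100
        a, n = int(p[1]), int(p[2])
        return state["mem"][a:a + n]

cmd("s m100 50")
cmd("s r3 17")
cmd("s pc 10")
```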