Natural Graphics Processing Unit  (NPU)

This is a simple GPU with four 32-bit cores .  It contains 4 Processing Elements (PE or core). Each PE has 32 registers,  one ALU and Local data store  (LDS).  It also includes a random number generator  32 bits 4x8.  It has 16Kx32 bits of memory. The memory is interface with processor with BUF (buffer unit) connected to LDS.  LDS communicates to all PEs in parallel.  NPU operates in Single Instruction Multiple Data (SIMD) mode.  That is, every PE runs the same program in synchrony.  So, NPU has only one control unit.  It has one program counter (PC) and one Instruction Register (IR).  Its instruction has fixed size of 32 bits.  


NPU diagram

Instruction format

op:8 a1:14 a2:5 a3:5

a1 is ads  or #n 14 bits
a1 is ls, r3
a2 is r1
a3 is r2

ls 0,1,2,3
r 0..31

Data
ld ls @ads      LDS[ls] = M[ads]     ls = 0,1,2,3
st ls @ads      M[ads] = LDS[ls]
ldr r           R[r] = LDS         r = 0..31
str r           LDS = R[r]
ldx ls r1 r2    LDS[ls] = M[R[r1]+R[r2]]   load indirect (index)
stx ls r1 r2    M[R[r1]+R[r2]] = LDS[ls]   store index
ldw @ads        LDS = M[ads]         load wide
bc r ls         R[r] = LDS[ls]       broadcast

ALU operations
add r3 r1 r2    R[r3] = R[r1] + R[r2]
sub r3 r1 r2    R[r3] = R[r1] - R[r2]
mul r3 r1 r2    R[r3] = R[r1] * R[r2]
ashr r3 r1 #n   R[r3] = R[r1] >> n
addi r3 r1 #n   R[r3] = R[r1] + n
and r3 r1 r2    R[r3] = R[r1] & R[r2]
or  r3 r1 r2    R[r3] = R[r1] | R[r2]
xor r3 r1 r2    R[r3] = R[r1] ^ R[r2]
lt r3 r1 r2     R[r3] = R[r1] <  R[r2]
le r3 r1 r2     R[r3] = R[r1] <= R[r2]
eq r3 r1 r2     R[r3] = R[r1] == R[r2]

Control
jmp @ads        pc = ads
jz r @ads       if R[r] == 0, pc = ads
jnz r @ads      if R[r] != 0, pc = ads

Special
rnd r           R[r] = random
mv_t r3 r1 r2   if R[r3] != 0, R[r1] = R[r2]  move if true

pseudo
sys 4           stop simulation

equivalent
inc r           addi r r #1
dec r           addi r r #-1
clr r           xor r r r
mov r3 r1       addi r3 r1 #0

LDS can be named "bus interface" (to join a narrow 32-bit bus, with a wide 32x4 -bit bus to cores) or "broadcast unit" because it can do broadcast one LDS to all cores. 

bc r ls          LDS[ls] -> all R[r]   broadcast
ldw @ads         M[ads] -> all LDS     load wide


Sample program

Mutiply two vectors,  each vector has 4 components. We strip components across cores.
A = B * C  

Let A is at @100..103, B at @104..107, C at @108..111, use R[2] for A, R[0] for B, R[1] for C.

ld 0 @104
ld 1 @105
ld 2 @106
ld 3 @107    ; load B from Mem to LDS
ldr 0        ; move LDS to R[0]
ld 0 @108
ld 1 @109
ld 2 @110
ld 3 @111    ; load C from Mem to LDS
ldr 1        ; move LDS to R[1]
mul 2 0 1    ; R[2] = R[0] * R[1]  all cores
str 2        ; move R[2] to LDS
st 0 @100
st 1 @101
st 2 @102
st 3 @103    ; store LDS to Mem
sys 4        ; stop simulation
.end
; data       ; initialise Mem
@100
0 0 0 0      ; A
1 2 3 4      ; B
2 3 4 5      ; C
.end

NPU instruction can perform loop iteration using jump, jump-if-zero, jump-if-not-zero (jmp, jz, jnz).  Because NPU is SIMD the condition zero/not-zero must be true for all PEs (they work in lock-step). For example, if we want to loop n times: use R[2]

    clr 2
    addi 2 2 #n
:loop
    ; perform some action
    dec 2
    jnz 2 @loop

Accessing an array

ldx and stx  are instructions for accessing an array.  They use LDS to indirectly address memory.  The access to memory must be "serialised", that is, each core takes turn to access memory. 

    ldx ls r1 r2             LDS[ls] = M[ R[r1]+R[r2] ]


R[r1] + R[r2] of core ls is used as an address to read memory into LDS[ls].

    stx ls r1 r2             M[ R[r1] + R[r2] ] = LDS[ls]

Similarly, stx uses R[r1]+R[r2] of core ls to address memory and store LDS[ls] to that address.

Example program to sum elements in an array.  The array is terminated with 0.
i = 0
s = 0
while ax[i] != 0
    s = s + ax[i]
    i++
We show a program that uses all cores to do the same task. (Quite a waste, but the program is easy to understand)

; r1 s, r2 i, r3 ax[i], r5 &ax

  clr 2        ; i = 0
  clr 1        ; s = 0
  clr 5   
  addi 5 5 #100 ; base &ax
:loop
  ldx 0 5 2    ; get ax[i] to all cores
  ldx 1 5 2
  ldx 2 5 2
  ldx 3 5 2
  ldr 3        ; to r3
  jz 3 @exit    ; ax[i] == 0 ?
  add 1 1 3    ; s += ax[i]
  inc 2        ; i++
  jmp @loop
:exit
  sys 4
.end

@100        ; ax[.]
1 2 3 4 5 0
.end

How to use Assembler and Simulator

The assembler for NPU is  "npasm".
c:> npasm < mul.txt > mul.obj

generate "mul.obj", an object code file.  To run this object:
c:> npu < mul.obj

The screen will shows internal registers after executing each instruction, like this:
0: ld 104 0 0
R[0] 0 0 0 0
R[1] 0 0 0 0
R[2] 0 0 0 0
R[3] 0 0 0 0
LDS  1 0 0 0
1: ld 105 1 0
R[0] 0 0 0 0
R[1] 0 0 0 0
R[2] 0 0 0 0
R[3] 0 0 0 0
LDS  1 2 0 0
. . .
4: ldr 0 0 0
R[0] 1 2 3 4
R[1] 0 0 0 0
R[2] 0 0 0 0
R[3] 0 0 0 0
LDS  1 2 3 4
...
10: mul 2 0 1
R[0] 1 2 3 4
R[1] 2 3 4 5
R[2] 2 6 12 20
R[3] 0 0 0 0
LDS  2 3 4 5
. . .
17: sys 4 0 0
R[0] 1 2 3 4
R[1] 2 3 4 5
R[2] 2 6 12 20
R[3] 0 0 0 0
LDS  2 6 12 20
execute 18 instructions 149 cycles


The simulator has defined 2000 words of memory.  The limit of a run is 1000 instructions. (You can change by recompile the simulator).


last update 16 Dec 2012