Natural Graphics Processing Unit (NPU)

See NPU in action (online demonstration)

The NPU is a simple GPU with four 32-bit cores, called Processing Elements (PE). Each PE has 32 registers, one ALU and a Local Store (LS). The chip also has a random number generator (32 bits, 4x8) and 16Kx32 bits of memory. The memory is interfaced to the processor through the Memory Interface (MI), which connects to the Local Store (LS); the LS communicates with all PEs in parallel. Instructions have a fixed size of 32 bits.

The organisation of the NPU is not unlike that of a multicore processor. Each processing element (PE) is similar to a general-purpose core. However, all PEs share the same program memory (stored program), program counter and instruction register, so every PE runs the same instruction. This affects programming in a big way. Program memory and main memory share the same address space.

Each PE has three distinct units: registers, an ALU and an Address Unit (AU). The registers of each PE connect to its Local Store (LS), which is the interface to main memory. For indexed addressing, the AU sends the effective address to the Memory Interface (MI).

The Memory Interface is a wide highway connecting main memory to the Local Store. Access to main memory is similar to a single-core processor: a load/store instruction can access one location at a time, so moving data to and from the PEs is a serial operation. Once all the data are in the LS, moving from LS to registers happens simultaneously for all PEs.

There are two instructions unique to the NPU (with respect to a CPU). Load wide (ldw) sends one memory word to all LS slots at once. Broadcast (bc) sends one PE's LS value to a register of every PE. This allows PEs to exchange data without going through main memory.
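The organisation can be summarised with a short C sketch of the machine state and a few of the data-movement instructions (they are listed in full under "Data" below). This is only an illustration; the structure and the names (npu_t, do_ld, ...) are hypothetical and are not taken from the simulator source.

    /* Hypothetical sketch of the NPU state and a few data-movement
       instructions; for illustration only, not the simulator source. */
    #include <stdint.h>

    #define NPE   4          /* number of processing elements (cores) */
    #define NREG  32         /* registers per PE                      */
    #define MSIZE 16384      /* 16K x 32-bit shared memory            */

    typedef struct {
        int32_t r[NPE][NREG];   /* private registers of each PE       */
        int32_t ls[NPE];        /* Local Store, one word per PE       */
        int32_t m[MSIZE];       /* main memory (program + data)       */
        int     pc;             /* single shared program counter      */
    } npu_t;

    /* ld ls @ads : serial, fills one LS slot per instruction */
    void do_ld(npu_t *s, int ls, int ads)   { s->ls[ls] = s->m[ads]; }

    /* ldr r : parallel, every PE copies its own LS word into R[r] */
    void do_ldr(npu_t *s, int r) {
        for (int pe = 0; pe < NPE; pe++) s->r[pe][r] = s->ls[pe];
    }

    /* ldw @ads : load wide, the same memory word goes to every LS slot */
    void do_ldw(npu_t *s, int ads) {
        for (int pe = 0; pe < NPE; pe++) s->ls[pe] = s->m[ads];
    }

    /* bc r ls : broadcast, LS[ls] is copied into R[r] of every PE */
    void do_bc(npu_t *s, int r, int ls) {
        for (int pe = 0; pe < NPE; pe++) s->r[pe][r] = s->ls[ls];
    }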

(figure: npu)

Instruction format

op:8 a1:14 a2:5 a3:5

a1 is ads or #n (14 bits), or ls, or r3
a2 is r1
a3 is r2

ls = 0..3
r  = 0..31
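As an illustration, a 32-bit instruction word could be unpacked as below. The field order (op in the most significant bits, a3 in the least) is an assumption made for the sketch; the actual encoding used by asm4 may differ.

    /* Hypothetical decode of a 32-bit instruction word, assuming the
       fields are packed op:8 | a1:14 | a2:5 | a3:5 from the top bit down. */
    #include <stdint.h>

    typedef struct { unsigned op, a1, a2, a3; } insn_t;

    insn_t decode(uint32_t w) {
        insn_t i;
        i.op = (w >> 24) & 0xff;     /* opcode, 8 bits                 */
        i.a1 = (w >> 10) & 0x3fff;   /* ads or #n (or ls, r3), 14 bits */
        i.a2 = (w >>  5) & 0x1f;     /* r1, 5 bits                     */
        i.a3 =  w        & 0x1f;     /* r2, 5 bits                     */
        return i;
    }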

Data
ld ls @ads      LS[ls] = M[ads]     ls = 0,1,2,3
st ls @ads      M[ads] = LS[ls]
ldr r           R[r] = LS         r = 0..31
str r           LS = R[r]
ldx ls r1 r2    LS[ls] = M[R[r1]+R[r2]]   load indirect (index)
stx ls r1 r2    M[R[r1]+R[r2]] = LS[ls]   store index
ldw @ads        LS = M[ads]         load wide
bc r ls         R[r] = LS[ls]       broadcast


ALU operations
add r3 r1 r2    R[r3] = R[r1] + R[r2]
sub r3 r1 r2    R[r3] = R[r1] - R[r2]
mul r3 r1 r2    R[r3] = R[r1] * R[r2]
ashr r3 r1 #n   R[r3] = R[r1] >> n
addi r3 r1 #n   R[r3] = R[r1] + n
and r3 r1 r2    R[r3] = R[r1] & R[r2]
or  r3 r1 r2    R[r3] = R[r1] | R[r2]
xor r3 r1 r2    R[r3] = R[r1] ^ R[r2]
lt r3 r1 r2     R[r3] = R[r1] <  R[r2]
le r3 r1 r2     R[r3] = R[r1] <= R[r2]
eq r3 r1 r2     R[r3] = R[r1] == R[r2]
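Because all PEs execute the same instruction in lock-step, every ALU operation acts on all four cores at once. A sketch of how add would be applied, reusing the hypothetical npu_t structure from the sketch above:

    /* add r3 r1 r2, SIMD-style: the same operation runs on every PE */
    void do_add(npu_t *s, int r3, int r1, int r2) {
        for (int pe = 0; pe < NPE; pe++)
            s->r[pe][r3] = s->r[pe][r1] + s->r[pe][r2];
    }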


Control
jmp @ads        pc = ads
jz r @ads       if R[r] == 0, pc = ads
jnz r @ads      if R[r] != 0, pc = ads


Special
rnd r           R[r] = random
mvt r3 r1 r2    if R[r3] != 0, R[r1] = R[r2]  move if true
Pseudo
sys 4           stop simulation

Equivalent
inc r           addi r r #1
dec r           addi r r #-1
clr r           xor r r r
mov r3 r1       addi r3 r1 #0


The LS can also be seen as a "bus interface" (it joins the narrow 32-bit memory bus to the wide 32x4-bit bus to the cores) or as a "broadcast unit", because it can broadcast one LS value to all cores.

bc r ls          LS[ls] -> all R[r]   broadcast
ldw @ads         M[ads] -> all LS     load wide

Assembly language

A source file is composed of two sections: code and data. The structure of a source file is as follows:
<code>
.end
<data>
.end

Code section

A line in the code section is either an instruction (an opcode followed by its arguments) or a label. A label begins with ':'. An argument is a number or an address (prefixed with '@'); an address can refer to a label.
op arg*
:label
arg is
  num, @num, @label

A comment starts with ';' and lasts until the end of the line; comments are ignored by the assembler.
; comment until end of line

Data section

The data section consists of set-address commands and data (numbers). A set-address command begins with '@' followed by the address.
@ads   
num*    

Sample program

Multiply two vectors, each with 4 components. The components are striped across the cores, one component per PE.
A = B * C  
Let A be at @100..103, B at @104..107 and C at @108..111; use R[2] for A, R[0] for B and R[1] for C.
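For comparison only (this C fragment is not part of the NPU toolchain), the whole task is the element-wise loop below; on the NPU there is no loop, PE i simply computes element i.

    int A[4], B[4] = {1, 2, 3, 4}, C[4] = {2, 3, 4, 5};
    for (int i = 0; i < 4; i++)
        A[i] = B[i] * C[i];      /* on the NPU, PE i computes element i */

The NPU program: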
ld 0 @104
ld 1 @105
ld 2 @106
ld 3 @107    ; load B from Mem to LS
ldr 0        ; move LS to R[0]
ld 0 @108
ld 1 @109
ld 2 @110
ld 3 @111    ; load C from Mem to LS
ldr 1        ; move LS to R[1]
mul 2 0 1    ; R[2] = R[0] * R[1]  all cores
str 2        ; move R[2] to LS
st 0 @100
st 1 @101
st 2 @102
st 3 @103    ; store LS to Mem
sys 4        ; stop simulation
.end
; data       ; initialise Mem
@100
0 0 0 0      ; A
1 2 3 4      ; B
2 3 4 5      ; C
.end
NPU instructions can perform loop iteration using jump, jump-if-zero and jump-if-not-zero (jmp, jz, jnz). Because the NPU is SIMD (single instruction, multiple data), the zero/not-zero condition must hold for all PEs (they work in lock-step). For example, to loop n times using R[2]:

    clr 2
    addi 2 2 #n
:loop
    ; perform some action
    dec 2
    jnz 2 @loop

Accessing an array

ldx and stx are the instructions for accessing an array. They address memory indirectly through a pair of registers and transfer the data through the LS. The memory access must be "serialised", that is, each core takes its turn to access memory.

    ldx ls r1 r2             LS[ls] = M[ R[r1]+R[r2] ]


R[r1] + R[r2] of core ls is used as an address to read memory into LS[ls].

    stx ls r1 r2             M[ R[r1] + R[r2] ] = LS[ls]

Similarly, stx uses R[r1]+R[r2] of core ls to address memory and store LS[ls] to that address.
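In the same illustrative C style as before (reusing the hypothetical npu_t structure from the earlier sketch), ldx and stx could be modelled as:

    /* ldx ls r1 r2 : core ls forms the address from its own registers */
    void do_ldx(npu_t *s, int ls, int r1, int r2) {
        int ads = s->r[ls][r1] + s->r[ls][r2];
        s->ls[ls] = s->m[ads];
    }

    /* stx ls r1 r2 : store LS[ls] to the address formed by core ls */
    void do_stx(npu_t *s, int ls, int r1, int r2) {
        int ads = s->r[ls][r1] + s->r[ls][r2];
        s->m[ads] = s->ls[ls];
    }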

An example program sums the elements of an array; the array is terminated with 0. In pseudocode:
i = 0
s = 0
while ax[i] != 0
    s = s + ax[i]
    i++
The program below uses all PEs to do the same task (quite a waste, but easy to understand).
; r1 s, r2 i, r3 ax[i], r5 &ax

  clr 2        ; i = 0
  clr 1        ; s = 0
  clr 5   
  addi 5 5 #100 ; base &ax
:loop
  ldx 0 5 2    ; get ax[i] to all cores
  ldx 1 5 2
  ldx 2 5 2
  ldx 3 5 2
  ldr 3        ; to r3
  jz 3 @exit    ; ax[i] == 0 ?
  add 1 1 3    ; s += ax[i]
  inc 2        ; i++
  jmp @loop
:exit
  sys 4
.end

@100        ; ax[.]
1 2 3 4 5 0
.end
Another example shows how to use vectorization with the NPU. The interesting bit is the reduction at the end that totals the partial sums; it uses the "bc" instruction to send a value from one PE to all the others. The code is a bit cumbersome: the only immediate form is addi, and an immediate gives the same value to every PE, so the per-PE constants are kept in main memory.
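The idea of the program, written as a plain C sketch for explanation only (not part of the toolchain): each PE p starts at index p, strides by 4 and keeps a partial sum in a register; the four partial sums are then combined into one total.

    int ax[8] = {11, 22, 33, 44, 55, 66, 77, 88};
    int partial[4] = {0, 0, 0, 0};    /* one r4 per PE                    */
    for (int p = 0; p < 4; p++)       /* on the NPU the PEs run in parallel */
        for (int i = p; i < 8; i += 4)
            partial[p] += ax[i];
    int bigsum = 0;                   /* r6                               */
    for (int p = 0; p < 4; p++)       /* done with str/bc/add on the NPU  */
        bigsum += partial[p];

The NPU version: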

;  process n elements at once, keep intermediate result in registers
;  fetch next data n elements, stride n, repeat

;  fetching an element by ldx ls r1 r2, where r1 is base ads, r2 is index
;  all r1 are the same base, all r2 have different starting index

:main
  ;  initialize base and index, r1 base, r2 index

  ldw @100         ; @100 stores base address
  ldr 1            ; r1 -- base address
  ld 0 @101        ; @101 stores 0  initial index
  ld 1 @102        ; @102 stores 1
  ld 2 @103        ; @103 stores 2
  ld 3 @104        ; @104 stores 3
  ldr 2            ; r2 -- index

  clr 4            ; r4 -- sum = 0
  addi 8 4 #2      ; r8 -- loop count #2

  ;  fetch n elements
:loop
  ldx 0 1 2        ;  ax[i]
  ldx 1 1 2  
  ldx 2 1 2
  ldx 3 1 2
  ldr 3            ;  r3 = ax[i]
  add 4 4 3        ;  sum += ax[i]
  addi 2 2 #4      ;  index += 4
  dec 8            ;  loop count --
  jnz 8 @loop

  ; now partial sum is in r4
  ; how to sum all r4s together
  ; accumulate it in r6
  ; broadcast each r4 to r5 and r6 += r5

  clr 6            ;  r6 -- bigsum = 0
  str 4
  bc 5 0           ;  all r5 = r4_pe0
  add 6 6 5
  bc 5 1           ; r5s = r4_pe1
  add 6 6 5
  bc 5 2           ; r5s = r4_pe2
  add 6 6 5
  bc 5 3           ; r5s = r4_pe3
  add 6 6 5
  str 6
  st 0 @105        ; @105 stores result
  sys 4
.end

@100
 106 0 1 2 3 0     ; base address, constant 0,1,2,3, result
 11 22 33 44       ; @106  ax[.]
 55 66 77 88
.end

How to use Assembler and Simulator

The assembler for NPU is "asm4".
c:> asm4 < mul.txt > mul.obj

This generates "mul.obj", an object code file.  To run this object file:
c:> sim4 mul.obj
load object to 17
>t

The screen shows the internal registers after each instruction is executed, like this:
0: ld 104 0 0
R[0] 0 0 0 0
R[1] 0 0 0 0
R[2] 0 0 0 0
R[3] 0 0 0 0
LS  1 0 0 0
>t
1: ld 105 1 0
R[0] 0 0 0 0
R[1] 0 0 0 0
R[2] 0 0 0 0
R[3] 0 0 0 0
LS  1 2 0 0
. . .
4: ldr 0 0 0
R[0] 1 2 3 4
R[1] 0 0 0 0
R[2] 0 0 0 0
R[3] 0 0 0 0
LS  1 2 3 4
...
10: mul 2 0 1
R[0] 1 2 3 4
R[1] 2 3 4 5
R[2] 2 6 12 20
R[3] 0 0 0 0
LS  2 3 4 5
. . .
17: sys 4 0 0
R[0] 1 2 3 4
R[1] 2 3 4 5
R[2] 2 6 12 20
R[3] 0 0 0 0
LS  2 6 12 20
execute 18 instructions 149 cycles


The display shows information about the execution. The first line shows the PC and the instruction (in decoded form). A few registers are displayed; each row shows one register across all PEs. The last line shows the Local Store. The simulator defines 2000 words of memory, and the limit of a run is 10000 instructions (both can be changed by recompiling the simulator).

Controlling the simulator

You can control the execution of the simulator.  The following commands are available:
g - go
t - single step
b ads - set breakpoint
s [rn,mn,pc] v - set
d ads n - dump
r - show register
q - quit
h - help

The set command sets various values in the PEs: "s r3" sets the value of R[3] of all PEs, "s m100 50" sets M[100] = 50, and "s pc 10" sets the PC to 10. The dump command inspects memory: "d 100 10" shows 10 locations of memory starting from 100. "r" shows all registers of all PEs.

Tools

The package includes the assembler, the simulator and sample programs:  npsim4-1.zip

References

Thammasan, N. and Chongstitvatana, P., "Design of a GPU-styled Softcore on Field Programmable Gate Array," Int. Joint Conf. on Computer Science and Software Engineering (JCSSE), 30 May - 1 June 2012, pp. 142-146. (pdf)
Chongstitvatana, P., "Putting General Purpose into a GPU-style Softcore," Int. Conf. on Embedded Systems and Intelligent Technology, Jan 13-15, 2013, Thailand. (preprint)

last update 14 Nov 2017