See NPU in action (online demonstration: computing the Mandelbrot set)
This is a simple GPU with four 32-bit cores: 4 Processing Elements (PE, or
cores). Each PE has 32 registers, one ALU and a Local data store (LS). It
also includes a random number generator (32 bits, 4x8). There are 16K x 32
bits of memory. The memory interfaces with the processor through the Memory
Interface (MI), which connects to the Local Store (LS). The LS communicates
with all PEs in parallel. Instructions have a fixed size of 32 bits.
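As a mental model, the hardware above can be sketched in Python. The names, and modelling the Local Store as one word per PE, are illustrative assumptions, not part of the simulator:

```python
# Illustrative sketch of the NPU state described above.
# Assumptions: one LS word per PE; names are mine, not the simulator's.
N_PES = 4                  # four 32-bit processing elements
N_REGS = 32                # registers per PE
MEM_WORDS = 16 * 1024      # 16K x 32-bit main memory

class NPUState:
    def __init__(self):
        self.mem = [0] * MEM_WORDS                        # shared main memory
        self.regs = [[0] * N_REGS for _ in range(N_PES)]  # per-PE registers
        self.ls = [0] * N_PES                             # Local Store, one slot per PE
        self.pc = 0                                       # single shared program counter

npu = NPUState()
```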
The organisation of NPU is not unlike a multicore processor. All
processing elements (PEs) are similar to ordinary cores. However, they
share the same Program memory (stored program), Program counter and
Instruction Register: all PEs run the same instruction. This affects
programming in a big way. Program memory and Main memory share the same
address space.
Each PE has three distinct units: Register, ALU and Address Unit (AU).
The registers of each PE connect to its Local Store (LS), which is an
interface to the main memory. The AU outputs the effective address (in the
case of indexed addressing) to the Memory Interface unit (MI).
The Memory Interface is a big highway connecting Main Memory to the Local
Stores. Accessing main memory is similar to a single-core processor; that is,
a load/store instruction can access one location at a time. So, moving data
between Main memory and LS is a serial operation. Once all data are in
LS, moving LS to R is simultaneous for all PEs.
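The serial M-to-LS transfers followed by the parallel LS-to-R move can be sketched as follows (a toy model, assuming one LS word per PE and 4 PEs):

```python
# Toy model: 4 PEs, each with one Local Store word and 32 registers.
mem = {104: 1, 105: 2, 106: 3, 107: 4}       # main memory (sparse)
ls = [0, 0, 0, 0]                             # Local Store, one word per PE (assumed)
regs = [[0] * 32 for _ in range(4)]           # 32 registers per PE

# "ld pe @addr" is serial: each instruction moves ONE word from M to one LS.
for pe, addr in enumerate(range(104, 108)):
    ls[pe] = mem[addr]

# "ldr 0" is parallel: every PE moves its LS word into r0 in the same step.
for pe in range(4):
    regs[pe][0] = ls[pe]
```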
There are two instructions unique to NPU (with respect to a CPU). Load
wide (ldw) sends a word of M to all LS at once. Broadcast (bc) sends the LS
value of one PE to a register of all PEs. This allows PEs to exchange data
without going through main memory.
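A sketch of how ldw and bc might behave, following the way the example programs later in this document use them (the exact semantics here are my assumptions):

```python
# Toy model of "ldw" and "bc" (semantics inferred from the examples below).
mem = {100: 7}
ls = [0, 0, 0, 0]                    # one Local Store word per PE (assumed)
regs = [[0] * 32 for _ in range(4)]

def ldw(addr):
    """'ldw @addr': one word of main memory goes to ALL Local Stores at once."""
    for pe in range(4):
        ls[pe] = mem[addr]

def bc(reg, src_pe):
    """'bc reg pe': the LS value of one PE goes to register reg of ALL PEs."""
    for pe in range(4):
        regs[pe][reg] = ls[src_pe]

ldw(100)          # every LS now holds M[100]
ls[2] = 99        # pretend PE2 placed a value in its LS (via "str")
bc(5, 2)          # r5 of every PE now holds PE2's value
```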
Instruction format (32 bits): op:8 a1:14 a2:5 a3:5
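Packing and unpacking this format can be illustrated in Python. The field order (op in the top bits, a3 in the lowest) is an assumption, not stated by the text:

```python
# Sketch of the 32-bit format op:8 a1:14 a2:5 a3:5 (8 + 14 + 5 + 5 = 32).
# Assumption: op occupies the most significant bits.
def encode(op, a1, a2, a3):
    assert op < 2**8 and a1 < 2**14 and a2 < 2**5 and a3 < 2**5
    return (op << 24) | (a1 << 10) | (a2 << 5) | a3

def decode(word):
    return (word >> 24, (word >> 10) & 0x3FFF, (word >> 5) & 0x1F, word & 0x1F)
```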
<to be updated>
A source file consists of a code section and a data section, each
terminated by .end:
<code>
.end
<data>
.end
; comment until end of line
ld 0 @100
add r3 r1 r2
:loop
jmp @loop
...
.end
@ads
10 20 30 ...
.end
A = B * C

Let A be at @100..@103, B at @104..@107 and C at @108..@111; use R[2] for
A, R[0] for B, R[1] for C.

ld 0 @104
ld 1 @105
ld 2 @106
ld 3 @107 ; load B from Mem to LS
ldr 0 ; move LS to R[0]
ld 0 @108
ld 1 @109
ld 2 @110
ld 3 @111 ; load C from Mem to LS
ldr 1 ; move LS to R[1]
mul 2 0 1 ; R[2] = R[0] * R[1] all cores
str 2 ; move R[2] to LS
st 0 @100
st 1 @101
st 2 @102
st 3 @103 ; store LS to Mem
sys 4 ; stop simulation
.end
; data ; initialise Mem
@100
0 0 0 0 ; A
1 2 3 4 ; B
2 3 4 5 ; C
.end
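The same computation in plain Python, one lane per PE (the memory layout mirrors the data section above):

```python
# A = B * C with the layout above: A at 100..103, B at 104..107, C at 108..111.
mem = [0] * 112
mem[100:104] = [0, 0, 0, 0]   # A
mem[104:108] = [1, 2, 3, 4]   # B
mem[108:112] = [2, 3, 4, 5]   # C

# In hardware the "mul" happens on all 4 PEs in the same cycle;
# here we just loop over the lanes.
for pe in range(4):
    mem[100 + pe] = mem[104 + pe] * mem[108 + pe]
```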
NPU instructions can perform loop iteration using jump, jump-if-zero and
jump-if-not-zero (jmp, jz, jnz). Because NPU is SIMD (single instruction,
multiple data), the zero/not-zero condition must be true for all PEs (they
work in lock-step). For example, the program below loops using R[2] as the
counter i.

We show the first program that uses all PEs to do the same task. (Quite a
waste, but the program is easy to understand.)

i = 0
s = 0
while ax[i] != 0
  s = s + ax[i]
  i++
; r1 s, r2 i, r3 ax[i], r5 &ax
clr 2 ; i = 0
clr 1 ; s = 0
clr 5
addi 5 5 #100 ; base &ax
:loop
ldx 0 5 2 ; get ax[i] to all cores
ldx 1 5 2
ldx 2 5 2
ldx 3 5 2
ldr 3 ; to r3
jz 3 @exit ; ax[i] == 0 ?
add 1 1 3 ; s += ax[i]
inc 2 ; i++
jmp @loop
:exit
sys 4
.end
@100 ; ax[.]
1 2 3 4 5 0
.end
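In Python terms: every PE traverses the same data, so the jz condition becomes true on all PEs in the same iteration, which is what makes the lock-step branch safe here:

```python
# All 4 PEs load the SAME element each iteration, so the exit condition
# (ax[i] == 0) holds on every PE at the same time -- safe in lock-step.
ax = [1, 2, 3, 4, 5, 0]
s = [0] * 4            # r1 of each PE
i = 0                  # r2, identical on every PE

while True:
    v = ax[i]          # every PE fetches the same ax[i]
    if v == 0:         # condition is identical across PEs (jz)
        break
    for pe in range(4):
        s[pe] += v     # each PE adds the same value -- "quite a waste"
    i += 1
```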
; process 4 elements at once, keep intermediate result in registers
; fetch next data 4 elements, stride 4, repeat
; fetching an element by ldx ls r1 r2, where r1 is base ads, r2 is index
; all r1 are the same base, all r2 have different starting index
:main
; initialize base and index, r1 base, r2 index
ldw @100 ; @100 stores base address
ldr 1 ; r1 -- base address
ld 0 @101 ; @101 stores 0 initial index
ld 1 @102 ; @102 stores 1
ld 2 @103 ; @103 stores 2
ld 3 @104 ; @104 stores 3
ldr 2 ; r2 -- index
clr 4 ; r4 -- sum = 0
addi 8 4 #2 ; r8 -- loop count #2
; fetch n elements
:loop
ldx 0 1 2 ; ax[i]
ldx 1 1 2
ldx 2 1 2
ldx 3 1 2
ldr 3 ; r3 = ax[i]
add 4 4 3 ; sum += ax[i]
addi 2 2 #4 ; index += 4
dec 8 ; loop count --
jnz 8 @loop
; now partial sum is in r4
; how to sum all r4s together
; accumulate it in r6
; broadcast each r4 to r5 and r6 += r5
clr 6 ; r6 -- bigsum = 0
str 4
bc 5 0 ; all r5 = r4_pe0
add 6 6 5
bc 5 1 ; r5s = r4_pe1
add 6 6 5
bc 5 2 ; r5s = r4_pe2
add 6 6 5
bc 5 3 ; r5s = r4_pe3
add 6 6 5
str 6
st 0 @105 ; @105 stores result
sys 4
.end
@100
106 0 1 2 3 0 ; base address, constant 0,1,2,3, result
11 22 33 44 ; @106 ax[.]
55 66 77 88
.end
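The stride-4 partial sums and the final reduction can be checked in plain Python (register names in the comments refer to the program above):

```python
# Stride-4 partial sums, then a broadcast reduction, mirroring the program above.
ax = [11, 22, 33, 44, 55, 66, 77, 88]   # 8 elements: 2 rounds of 4
r4 = [0] * 4                            # per-PE partial sum (r4)
index = [0, 1, 2, 3]                    # per-PE starting index (r2)

for _ in range(2):                      # r8 loop count = 2
    for pe in range(4):
        r4[pe] += ax[index[pe]]         # ldx / ldr 3 / add 4 4 3
    index = [i + 4 for i in index]      # addi 2 2 #4: stride 4

# Reduction: broadcast each PE's partial sum and accumulate (bc / add 6 6 5).
bigsum = 0                              # r6 on every PE
for pe in range(4):
    bigsum += r4[pe]
```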
Let us walk through the program. "ld 0 @101" loads the index stored at
M[101] to LS[0], and so on for the other indexes. When all indexes are
fetched, "ldr 2" moves them to r2 of all PEs.

"ldx 0 1 2" gets one element, ax[0]. The other "ldx" do the same but with a
different index: ax[1], ax[2], ax[3]. Then "ldr 3" moves them to r3 of all
PEs. Each PE now does "add 4 4 3", the sum of its element. In one round, 4
elements have been processed. In the next round, the index is incremented
by 4, so PE0 gets ax[4], PE1 ax[5], PE2 ax[6], PE3 ax[7].

"str 4" moves the partial sum of each PE to its LS. "bc 5 0" moves this
value to r5 of all PEs. (This is how we can get the value of a register of
another PE.) Do the sum by "add 6 6 5". Repeat this for the other 3 partial
sums in PE1, PE2, PE3. To get the final result, which is now in r6, "str 6"
moves r6 to LS and then LS to memory M[105]. This final operation is
called "reduction".

The simulator loads an object file and provides commands to set and inspect
its state. "s r3 17" sets the value of R[3] of all PEs. "s m100 50" sets
M[100] = 50. "s pc 10" sets PC to 10. The dump command inspects the memory:
"d 100 10" shows 10 locations of the memory starting from 100. "r" shows
all registers of all PEs.
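A toy interpreter for these set/dump commands (an illustrative sketch; the real simulator's command set may differ in details):

```python
# Minimal sketch of the simulator's set/dump commands (illustrative only).
state = {"pc": 0, "mem": [0] * 200, "regs": [[0] * 32 for _ in range(4)]}

def cmd(line):
    p = line.split()
    if p[0] == "s" and p[1] == "pc":               # s pc 10  -> PC = 10
        state["pc"] = int(p[2])
    elif p[0] == "s" and p[1].startswith("m"):     # s m100 50 -> M[100] = 50
        state["mem"][int(p[1][1:])] = int(p[2])
    elif p[0] == "s" and p[1].startswith("r"):     # s r3 17  -> R[3] = 17 on all PEs
        for pe in range(4):
            state["regs"][pe][int(p[1][1:])] = int(p[2])
    elif p[0] == "d":                              # d 100 10 -> dump 10 words from 100
        a, n = int(p[1]), int(p[2])
        return state["mem"][a:a + n]

cmd("s m100 50")
cmd("s r3 17")
cmd("s pc 10")
```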