Natural Graphics Processing Unit (NPU)

See NPU in action (online demonstration)

The NPU is a simple GPU with four 32-bit cores, called Processing Elements (PE). Each PE has 32 registers, one ALU and a Local Store (LS). The chip also has a random number generator (32 bits, 4x8) and 16Kx32 bits of memory. The memory is interfaced to the processor through the Memory Interface (MI), which connects to the Local Store (LS); the LS communicates with all PEs in parallel. Instructions have a fixed size of 32 bits.

The organisation of the NPU is not unlike that of a multicore processor. Each processing element (PE) is similar to a general-purpose core. However, all PEs share the same program memory (stored program), program counter and instruction register, so every PE runs the same instruction. This affects programming in a big way. Program memory and main memory share the same address space.

Each PE has three distinct units: registers, an ALU and an Address Unit (AU). The registers of each PE connect to its Local Store (LS), which is the interface to main memory. For indexed addressing, the AU sends the effective address to the Memory Interface (MI).

The Memory Interface is a wide highway connecting main memory to the Local Store. Access to main memory is similar to a single-core processor: a load/store instruction can access one location at a time, so moving data to and from the PEs is a serial operation. Once all the data are in the LS, moving from LS to registers happens simultaneously for all PEs.

There are two instructions unique to the NPU (with respect to a CPU). Load wide (ldw) sends one memory word to all LS slots at once. Broadcast (bc) sends one PE's LS value to a register of every PE. This allows PEs to exchange data without going through main memory.
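The organisation can be summarised with a short C sketch of the machine state and a few of the data-movement instructions (they are listed in full under "Data" below). This is only an illustration; the structure and the names (npu_t, do_ld, ...) are hypothetical and are not taken from the simulator source.

    /* Hypothetical sketch of the NPU state and a few data-movement
       instructions; for illustration only, not the simulator source. */
    #include <stdint.h>

    #define NPE   4          /* number of processing elements (cores) */
    #define NREG  32         /* registers per PE                      */
    #define MSIZE 16384      /* 16K x 32-bit shared memory            */

    typedef struct {
        int32_t r[NPE][NREG];   /* private registers of each PE       */
        int32_t ls[NPE];        /* Local Store, one word per PE       */
        int32_t m[MSIZE];       /* main memory (program + data)       */
        int     pc;             /* single shared program counter      */
    } npu_t;

    /* ld ls @ads : serial, fills one LS slot per instruction */
    void do_ld(npu_t *s, int ls, int ads)   { s->ls[ls] = s->m[ads]; }

    /* ldr r : parallel, every PE copies its own LS word into R[r] */
    void do_ldr(npu_t *s, int r) {
        for (int pe = 0; pe < NPE; pe++) s->r[pe][r] = s->ls[pe];
    }

    /* ldw @ads : load wide, the same memory word goes to every LS slot */
    void do_ldw(npu_t *s, int ads) {
        for (int pe = 0; pe < NPE; pe++) s->ls[pe] = s->m[ads];
    }

    /* bc r ls : broadcast, LS[ls] is copied into R[r] of every PE */
    void do_bc(npu_t *s, int r, int ls) {
        for (int pe = 0; pe < NPE; pe++) s->r[pe][r] = s->ls[ls];
    }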

(figure: npu)

Instruction format

op:8 a1:14 a2:5 a3:5

a1 is ads or #n (14 bits), or ls, or r3
a2 is r1
a3 is r2

ls = 0..3
r  = 0..31
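As an illustration, a 32-bit instruction word could be unpacked as below. The field order (op in the most significant bits, a3 in the least) is an assumption made for the sketch; the actual encoding used by asm4 may differ.

    /* Hypothetical decode of a 32-bit instruction word, assuming the
       fields are packed op:8 | a1:14 | a2:5 | a3:5 from the top bit down. */
    #include <stdint.h>

    typedef struct { unsigned op, a1, a2, a3; } insn_t;

    insn_t decode(uint32_t w) {
        insn_t i;
        i.op = (w >> 24) & 0xff;     /* opcode, 8 bits                 */
        i.a1 = (w >> 10) & 0x3fff;   /* ads or #n (or ls, r3), 14 bits */
        i.a2 = (w >>  5) & 0x1f;     /* r1, 5 bits                     */
        i.a3 =  w        & 0x1f;     /* r2, 5 bits                     */
        return i;
    }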

Data
ld ls @ads      LS[ls] = M[ads]     ls = 0,1,2,3
st ls @ads      M[ads] = LS[ls]
ldr r           R[r] = LS         r = 0..31
str r           LS = R[r]
ldx ls r1 r2    LS[ls] = M[R[r1]+R[r2]]   load indirect (index)
stx ls r1 r2    M[R[r1]+R[r2]] = LS[ls]   store index
ldw @ads        LS = M[ads]         load wide
bc r ls         R[r] = LS[ls]       broadcast


ALU operations
add r3 r1 r2    R[r3] = R[r1] + R[r2]
sub r3 r1 r2    R[r3] = R[r1] - R[r2]
mul r3 r1 r2    R[r3] = R[r1] * R[r2]
ashr r3 r1 #n   R[r3] = R[r1] >> n
addi r3 r1 #n   R[r3] = R[r1] + n
and r3 r1 r2    R[r3] = R[r1] & R[r2]
or  r3 r1 r2    R[r3] = R[r1] | R[r2]
xor r3 r1 r2    R[r3] = R[r1] ^ R[r2]
lt r3 r1 r2     R[r3] = R[r1] <  R[r2]
le r3 r1 r2     R[r3] = R[r1] <= R[r2]
eq r3 r1 r2     R[r3] = R[r1] == R[r2]
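Because all PEs execute the same instruction in lock-step, every ALU operation acts on all four cores at once. A sketch of how add would be applied, reusing the hypothetical npu_t structure from the sketch above:

    /* add r3 r1 r2, SIMD-style: the same operation runs on every PE */
    void do_add(npu_t *s, int r3, int r1, int r2) {
        for (int pe = 0; pe < NPE; pe++)
            s->r[pe][r3] = s->r[pe][r1] + s->r[pe][r2];
    }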


Control
jmp @ads        pc = ads
jz r @ads       if R[r] == 0, pc = ads
jnz r @ads      if R[r] != 0, pc = ads


Special
rnd r           R[r] = random
mvt r3 r1 r2    if R[r3] != 0, R[r1] = R[r2]  move if true
Pseudo
sys 4           stop simulation

Equivalent
inc r           addi r r #1
dec r           addi r r #-1
clr r           xor r r r
mov r3 r1       addi r3 r1 #0


The LS can also be seen as a "bus interface" (it joins the narrow 32-bit memory bus to the wide 32x4-bit bus to the cores) or as a "broadcast unit", because it can broadcast one LS value to all cores.

bc r ls          LS[ls] -> all R[r]   broadcast
ldw @ads         M[ads] -> all LS     load wide

Assembly language

A source file is composed of two sections: code and data. The structure of a source file is as follows:
<code>
.end
<data>
.end

Code section

A line in the code section is either an instruction (an opcode followed by its arguments) or a label. A label begins with ':'. An argument is a number or an address (prefixed with '@'); an address can refer to a label.
op arg*
:label
arg is
  num, @num, @label

A comment starts with ';' and lasts until the end of the line; comments are ignored by the assembler.
; comment until end of line

Data section

The data section consists of set-address commands and data (numbers). A set-address command begins with '@' followed by the address.
@ads   
num*    

Sample program

Multiply two vectors, each with 4 components. The components are striped across the cores, one component per PE.
A = B * C  
Let A be at @100..103, B at @104..107 and C at @108..111; use R[2] for A, R[0] for B and R[1] for C.
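For comparison only (this C fragment is not part of the NPU toolchain), the whole task is the element-wise loop below; on the NPU there is no loop, PE i simply computes element i.

    int A[4], B[4] = {1, 2, 3, 4}, C[4] = {2, 3, 4, 5};
    for (int i = 0; i < 4; i++)
        A[i] = B[i] * C[i];      /* on the NPU, PE i computes element i */

The NPU program: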
ld 0 @104
ld 1 @105
ld 2 @106
ld 3 @107    ; load B from Mem to LS
ldr 0        ; move LS to R[0]
ld 0 @108
ld 1 @109
ld 2 @110
ld 3 @111    ; load C from Mem to LS
ldr 1        ; move LS to R[1]
mul 2 0 1    ; R[2] = R[0] * R[1]  all cores
str 2        ; move R[2] to LS
st 0 @100
st 1 @101
st 2 @102
st 3 @103    ; store LS to Mem
sys 4        ; stop simulation
.end
; data       ; initialise Mem
@100
0 0 0 0      ; A
1 2 3 4      ; B
2 3 4 5      ; C
.end
NPU instructions can perform loop iteration using jump, jump-if-zero and jump-if-not-zero (jmp, jz, jnz). Because the NPU is SIMD (single instruction, multiple data), the zero/not-zero condition must hold for all PEs (they work in lock-step). For example, to loop n times using R[2]:

    clr 2
    addi 2 2 #n
:loop
    ; perform some action
    dec 2
    jnz 2 @loop

Accessing an array

ldx and stx are the instructions for accessing an array. They address memory indirectly through a pair of registers and transfer the data through the LS. The memory access must be "serialised", that is, each core takes its turn to access memory.

    ldx ls r1 r2             LS[ls] = M[ R[r1]+R[r2] ]


R[r1] + R[r2] of core ls is used as an address to read memory into LS[ls].

    stx ls r1 r2             M[ R[r1] + R[r2] ] = LS[ls]

Similarly, stx uses R[r1]+R[r2] of core ls to address memory and store LS[ls] to that address.
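In the same illustrative C style as before (reusing the hypothetical npu_t structure from the earlier sketch), ldx and stx could be modelled as:

    /* ldx ls r1 r2 : core ls forms the address from its own registers */
    void do_ldx(npu_t *s, int ls, int r1, int r2) {
        int ads = s->r[ls][r1] + s->r[ls][r2];
        s->ls[ls] = s->m[ads];
    }

    /* stx ls r1 r2 : store LS[ls] to the address formed by core ls */
    void do_stx(npu_t *s, int ls, int r1, int r2) {
        int ads = s->r[ls][r1] + s->r[ls][r2];
        s->m[ads] = s->ls[ls];
    }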

An example program sums the elements of an array; the array is terminated with 0. In pseudocode:
i = 0
s = 0
while ax[i] != 0
    s = s + ax[i]
    i++
The program below uses all PEs to do the same task (quite a waste, but easy to understand).
; r1 s, r2 i, r3 ax[i], r5 &ax

  clr 2        ; i = 0
  clr 1        ; s = 0
  clr 5   
  addi 5 5 #100 ; base &ax
:loop
  ldx 0 5 2    ; get ax[i] to all cores
  ldx 1 5 2
  ldx 2 5 2
  ldx 3 5 2
  ldr 3        ; to r3
  jz 3 @exit    ; ax[i] == 0 ?
  add 1 1 3    ; s += ax[i]
  inc 2        ; i++
  jmp @loop
:exit
  sys 4
.end

@100        ; ax[.]
1 2 3 4 5 0
.end
Another example shows how to use vectorization with the NPU. The interesting bit is the reduction at the end that totals the partial sums; it uses the "bc" instruction to send a value from one PE to all the others. The code is a bit cumbersome: the only immediate form is addi, and an immediate gives the same value to every PE, so the per-PE constants are kept in main memory.
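The idea of the program, written as a plain C sketch for explanation only (not part of the toolchain): each PE p starts at index p, strides by 4 and keeps a partial sum in a register; the four partial sums are then combined into one total.

    int ax[8] = {11, 22, 33, 44, 55, 66, 77, 88};
    int partial[4] = {0, 0, 0, 0};    /* one r4 per PE                    */
    for (int p = 0; p < 4; p++)       /* on the NPU the PEs run in parallel */
        for (int i = p; i < 8; i += 4)
            partial[p] += ax[i];
    int bigsum = 0;                   /* r6                               */
    for (int p = 0; p < 4; p++)       /* done with str/bc/add on the NPU  */
        bigsum += partial[p];

The NPU version: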

;  process n elements at once, keep intermediate result in registers
;  fetch next data n elements, stride n, repeat

;  fetching an element by ldx ls r1 r2, where r1 is base ads, r2 is index
;  all r1 are the same base, all r2 have different starting index

:main
  ;  initialize base and index, r1 base, r2 index

  ldw @100         ; @100 stores base address
  ldr 1            ; r1 -- base address
  ld 0 @101        ; @101 stores 0  initial index
  ld 1 @102        ; @102 stores 1
  ld 2 @103        ; @103 stores 2
  ld 3 @104        ; @104 stores 3
  ldr 2            ; r2 -- index

  clr 4            ; r4 -- sum = 0
  addi 8 4 #2      ; r8 -- loop count #2

  ;  fetch n elements
:loop
  ldx 0 1 2        ;  ax[i]
  ldx 1 1 2  
  ldx 2 1 2
  ldx 3 1 2
  ldr 3            ;  r3 = ax[i]
  add 4 4 3        ;  sum += ax[i]
  addi 2 2 #4      ;  index += 4
  dec 8            ;  loop count --
  jnz 8 @loop

  ; now partial sum is in r4
  ; how to sum all r4s together
  ; accumulate it in r6
  ; broadcast each r4 to r5 and r6 += r5

  clr 6            ;  r6 -- bigsum = 0
  str 4
  bc 5 0           ;  all r5 = r4_pe0
  add 6 6 5
  bc 5 1           ; r5s = r4_pe1
  add 6 6 5
  bc 5 2           ; r5s = r4_pe2
  add 6 6 5
  bc 5 3           ; r5s = r4_pe3
  add 6 6 5
  str 6
  st 0 @105        ; @105 stores result
  sys 4
.end

@100
 106 0 1 2 3 0     ; base address, constant 0,1,2,3, result
 11 22 33 44       ; @106  ax[.]
 55 66 77 88
.end

How to use Assembler and Simulator

The assembler for NPU is "asm4".
c:> asm4 < mul.txt > mul.obj

This generates "mul.obj", an object code file.  To run this object file:
c:> sim4 mul.obj
load object to 17
>t

The screen shows the internal registers after each instruction is executed, like this:
0: ld 104 0 0
R[0] 0 0 0 0
R[1] 0 0 0 0
R[2] 0 0 0 0
R[3] 0 0 0 0
LS  1 0 0 0
>t
1: ld 105 1 0
R[0] 0 0 0 0
R[1] 0 0 0 0
R[2] 0 0 0 0
R[3] 0 0 0 0
LS  1 2 0 0
. . .
4: ldr 0 0 0
R[0] 1 2 3 4
R[1] 0 0 0 0
R[2] 0 0 0 0
R[3] 0 0 0 0
LS  1 2 3 4
...
10: mul 2 0 1
R[0] 1 2 3 4
R[1] 2 3 4 5
R[2] 2 6 12 20
R[3] 0 0 0 0
LS  2 3 4 5
. . .
17: sys 4 0 0
R[0] 1 2 3 4
R[1] 2 3 4 5
R[2] 2 6 12 20
R[3] 0 0 0 0
LS  2 6 12 20
execute 18 instructions 149 cycles


The display shows information about the execution. The first line shows the PC and the instruction (in decoded form). A few registers are displayed; each row shows one register across all PEs. The last line shows the Local Store. The simulator defines 2000 words of memory, and the limit of a run is 10000 instructions (both can be changed by recompiling the simulator).

Controlling the simulator

You can control the execution of the simulator.  The following commands are available:
g - go
t - single step
b ads - set breakpoint
s [rn,mn,pc] v - set
d ads n - dump
r - show register
q - quit
h - help

The set command sets various values in the PEs: "s r3" sets the value of R[3] of all PEs, "s m100 50" sets M[100] = 50, and "s pc 10" sets the PC to 10. The dump command inspects memory: "d 100 10" shows 10 locations of memory starting from 100. "r" shows all registers of all PEs.

Tools

The package includes the assembler, the simulator and sample programs:  npsim4-1.zip

References

Thammasan, N. and Chongstitvatana, P., "Design of a GPU-styled Softcore on Field Programmable Gate Array," Int. Joint Conf. on Computer Science and Software Engineering (JCSSE), 30 May - 1 June 2012, pp. 142-146. (pdf)
Chongstitvatana, P., "Putting General Purpose into a GPU-style Softcore," Int. Conf. on Embedded Systems and Intelligent Technology, Jan 13-15, 2013, Thailand. (preprint)

last update 14 Nov 2017