Final examination questions 
2190250  computer architecture and organiztion, 2190250

Three questions, duration 1:30 hour.

-----------------------
This is an "open-book" exam.  Students are allowed to bring any textbooks or information on paper with them into the examination room.  

Students are NOT allow to use any calculator, computer, or cell-phone during exam.
--------------------------


1) Memory system design

Given a list of 32-bit memory address references, given as word 
addresses.

0x03, 0xb4, 0x2b, 0x02, 0xbf, 0x58, 0xbe, 0x0e, 0xb5, 
0x2c, 0xba, 0xfd

1.1) For each of these references, identify the binary word address, the tag, and the index given a direct-mapped cache with 16 one-word blocks. Also list whether each reference is a hit or a miss, assuming the cache is initially empty.

1.2) For each of these references, identify the binary word address, 
the tag, the index, and the offset given a direct-mapped cache with 8 two-word blocks. Also list if each reference is a hit or a miss, assuming 
the cache is initially empty.

2)  Graphic Processing Unit


Matrix multiplication

Two square matrices (n x n) are multiplied together:

  c[i][j] = sum (k = 0,n) ( a[i][[k] * b[k][j] )


each term of c[i][j] is called "inner product"

when we draw the picture of "inner product" like this

a
[ x x x .. ]
[ x ....   ]
[ x ...    ]

a[i][k] is the "column" (vertical) of a

b 
[ x x x... ]
[ x ...    ]
[ x...     ]

b[k][j] is the "row" (horizontal" of b

so if we arrange "column" if a, and "row" of b properly we can do the "inner product" with 4-core NPU:

Let a column of a is at @100..103, a row of b is at @104..107, use R[2] for partial result, R[0] for a, R[1] for b.


*** Your task is to fill in line 1, 2, 3 (below)

ld 0 @104
ld 1 @105
ld 2 @106
ld 3 @107    ; load a from Mem to LDS
ldr 0        ; move LDS to R[0]
ld 0 @108
ld 1 @109
ld 2 @110
ld 3 @111    ; load C from Mem to LDS
ldr 1        ; move LDS to R[1]
mul 2 0 1    ; R[2] = R[0] * R[1]  all cores 

Now we have partial result in each R[2].  We just add all of them (R[2] of 4 cores) together to have the final result.

To send one R[2] to another core, we use "broadcast" instruction.  Let move R[2] of core 1 to all R[3]. We then do the same to move R[2] of core 2 to R[4], and R[3] of core 3 to R[5].  Now we have all partial result in R[2], R[3], R[4], R[5], we sum all of them to get the final result. We use R[2] of core 0 to store the result.  (We ignore all other core)

str 2        ; all R[2] to LDS
bc 3 1       ; LDS[1], R[2] of core 1, to all R[3]
......            (line 1)
......            (line 2)
add 2 2 3
add 2 2 4
......            (line 3)

The final result is stored in R[2] of core 0.

**** Your task is the fill in the line 1, 2, 3.

Given

the instruction set of 4-core NPU.

3)  Future of computing

The current super computer consists of millions of cores as shown in top500.org list.  Many of these systems also include GPU.  These super computers consume huge amount of energy.  From the perspective of evolution of computer design, what is the future of these machines?  Can we increase the size, more core, faster GPU?  Where is the limit?  

What do you think will be the super computer of the next 50 years? What should be the characteristic of the future machine?  Extrapolate from the current technology that we use to build computer,  what technology should be use for the future computer?

Express your opinion about these questions, give some reason about your idea and some realistic explanation of the technology you mention.

Please write at least about 200 words for your answer. (that is about a text that is twice as long as this question).