Final Examination
CPE600 Computer Architecture
KMUTT 1999

1) A direct-mapped cache has 4 lines; each line contains two words. Draw the picture of this cache after running this program to the end (show only the address tag):

    for i = 0; i < 3; i++
        s = s + i

which is translated to S1 assembly code:

code
    ads   code
    0           load r1 23
    1           store r1 20
    2           store r1 21
    3           load r3 22
    4           mov r1 r2      ;; r2 = r1
    5     loop: add r2 r1      ;; r2 = r1 + r2
    6           inc r1
    7           cmp r1 r3
    8           jmp LT loop
    9           store r1 20
    10          store r2 21
    END

data
    20    i
    21    s
    22    3
    23    0

register usage :
    r1    i
    r2    s
    r3    3

2) Discuss the advantages and disadvantages of using split instruction/data caches versus using a unified instruction/data cache.

3) A virtual memory system has a page size of 1024 words, eight virtual pages, and four physical page frames. The page table is as follows:

    virtual page number    page frame number
    0                      3
    1                      1
    2                      -
    3                      -
    4                      2
    5                      -
    6                      0
    7                      -

a) Make a list of all virtual addresses that will cause page faults.
b) What are the main memory addresses for the following virtual addresses: 0, 3728, 1023, 1024, 1025, 7800, 4096?

4) Describe the behaviour of the branch predictors which have the following state diagrams: fig 11.23 p 426.

5) A superscalar processor has 2 ALUs with a 5-stage pipeline (Fetch, Decode, Execute, Memory, Register) and an unlimited number of ports connected to its cache. The cache has a latency of one clock, i.e. a request made in the current cycle gets its data in the next cycle.

    instruction number
    1           ld r1 100
    2           ld r2 101
    3           ld r3 102
    4           add r1 r2 r3
    5           add r4 r5 r6
    6           sub r2 r1 r5
    7           mul r2 r4 r1
    8           sub r6 r2 r4
    9           jump Z exit
    10          st r1 100
    11    exit: st r2 101
    12          st r6 102
    END

data
    100   x
    101   y
    102   z
    103   w

Write down the complete schedule (time diagram of the pipelines) of running the above program to the end. Each instruction has 3 operands: destination, source1, source2, with the meaning dest = source1 op source2. Assume the jump in line 9 is taken and that load/store instructions are issued concurrently except when they have a dependency.
How many clocks does it take to run this program?

6) Consider the following vector code running on an 80-MHz version of S1v for a fixed vector length of 64:

    vload  v1, r0
    vmul   v2,v1,v3
    vadd   v4,v1,v3
    vstore v2,r1
    vstore v4,r2

Ignore all strip-mining overhead, but assume that the store latency must be included in the time to perform the loop. The entire sequence produces 64 results.

a) Assuming no chaining and a single memory pipeline, how many clock cycles per result (counting both stores as one result) does this vector sequence require?
b) If the vector sequence is chained, how many clock cycles per result does this sequence require?
c) Suppose S1v had three memory pipelines and chaining. If there were no bank conflicts in the accesses for the above loop, how many clock cycles are required per result for this sequence?

7) The following parameters are defined for a disk system:

    ts = seek time; average time to position the head over the track
    r  = rotation speed of the disk, in revolutions per second
    n  = number of bits per sector
    N  = capacity of a track, in bits
    ta = time to access a sector

Develop a formula for ta as a function of the other parameters.

8) Suppose you have half a billion transistors on a chip (5 x 10^8) and each S1-class processor requires one million transistors to implement. Suggest how you would build a very powerful version of S1 given a budget of half a billion transistors. Draw the processor organization diagram, describe its function, and give reasons to support your argument that it is going to be very fast.
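
Note on question 3: the page-table lookup described there can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the examination. The names PAGE_SIZE, PAGE_TABLE and translate are hypothetical; the page size and the table contents are copied from the question, and an unmapped entry ("-") is assumed to mean a page fault.

```python
PAGE_SIZE = 1024  # words per page, as stated in question 3

# Page table from question 3: virtual page number -> page frame number.
# None stands for the "-" entries (page not resident in main memory).
PAGE_TABLE = {0: 3, 1: 1, 2: None, 3: None, 4: 2, 5: None, 6: 0, 7: None}


def translate(virtual_address):
    """Return the main memory address, or None if the access page-faults."""
    # Split the address into a virtual page number and an offset in the page.
    vpn, offset = divmod(virtual_address, PAGE_SIZE)
    # Assumption: a page outside the table is treated like a missing page.
    frame = PAGE_TABLE.get(vpn)
    if frame is None:
        return None
    # The offset is unchanged; only the page number is replaced by the frame.
    return frame * PAGE_SIZE + offset


if __name__ == "__main__":
    for va in (0, 3728, 1023, 1024, 1025, 7800, 4096):
        pa = translate(va)
        print(va, "->", "page fault" if pa is None else pa)
```

The sketch keeps the offset within the page untouched and swaps only the page number for the frame number, which is the whole of the translation asked for in part b). Every address inside a page marked "-" takes the page-fault branch, which corresponds to the list asked for in part a).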