Projects in  computer architecture 2000

Due date Monday 28th August 2000

Problem definition

The objective is to design or modify a machine and run one or two benchmark programs on its simulator and report its performance (CPI). Basically what you have to do is to "design" a machine, i.e. its instruction set and/or its behaviour (microstep),  modify/write a simulator and run some benchmark program chosen from my Stanford integer benchmark suite.  If you cannot make the simulation to work  you can submit your design and simulate it by hand.

There are 15 problems ranging from creative to mechanistic process.  The reward will be proportional to the "quality" of the solution and the "difficulty" of the chosen problem.  A team of two persons chooses one problem. Please submit your choice as soon as possible via email.   Please state 2 choices and the name and ID both of person in the team.  I will confirm your choice once a day replying to your sending address.  I read my mail around 9am. everyday.  Please check the project status to see which one is already taken. (update daily).

The project must be handed in not later than the deadline.  No late submission will be accepted.

Project list

1  Superscalar S1 with 2 ALUs
Add extra ALU to S1.  You can use non-pipe or pipe version.  Invent a way to issue two instructions at the same time when possible.

2  LIW version of S1
Redesign S1 to have LIW capability.  You have to determine what kind of additional functional units you want to add to improve the performance (depend on your benchmark programs).

3  S1 with  Scoreboard
Assume S1 has multifunctional units : FPmult1, FPmult2, FPadd, FPdiv, Integer.  Simulate Scoreboard running this program :
    LF F6,34(R2)
    LF F2,45(R3)
    MULTF F0,F2,F4
    SUBF F8,F6,F2
    DIVF F10,F0,F6
    ADDF F6,F8,F2

(Read Henessy & Patterson's book)

4  S1 with Tomasulo
Assume S1 has FP adder, FP multiplier, with 3 and 2 reservation stations, Load buffer, Store buffer with 6 and 3 entries.  Simulate S1 with Tomasulo running this program :
    LF F6,34(R2)
    LF F2,45(R3)
    MULTF F0,F2,F4
    SUBF F8,F6,F2
    DIVF F10,F0,F6
    ADDF F6,F8,F2

(Read Hennessy & Patterson's book)

5  S1p with branch prediction
Add branch prediction capability to S1 pipe.   You have to decide the method to do branch prediction, branch-target buffer.

6  S1p with delay branch
Add delay branch capability to S1 pipe.  Examine your benchmark programs.  How much the delay slot
can be usefully filled?

7  Stack machine ISA
Design a stack machine, its instruction set must be stack oriented (no register!).  Have a look at my research paper which I designed a stack machine   http://www.cp.eng.chula.ac.th/faculty/pjw/r1/R1.htm
at the section "intermediate code specification".  You can also have a look at Java chip called PicoJava at http://www.cp.eng.chula.ac.th/faculty/pjw/teaching/ca/JavaVM/picojava.pdf

8  Minimum instruction set CPU
Design a processor with minimum number of instructions.  It must be able to run at least a benchmark program completely (that is, it must have enough instruction to implement a benchmark program).  You should not worry too much about the ISA being absolute minumum, however you should try to make its ISA as small as you can.

9  Fastest Matrix Multiplication S1
Modify S1 such that it can run Matmul.c  fastest.  There are many ways to do this : you can modify the instruction set (add some special instruction) or add functional units or modify organization (such as two pipelines).

10  Comparing S1 with 2,3,4 pipeline stages
Design S1 with 2,3,4 stages pipeline.  Compare its performance with S1p which has 5 stages pipe.  Please note that in this case one stage of the pipeline will take several clocks to be completed.

11  S1 microprogram with 2 formats microprogram
Modify S1m  to use 2 formats microprogram to shorten the width of a microprogram word.  After observing that ALU functions, Memory control, are never activated at the same time as Bus transfer (Dest, Src, SelR), the following formats  are suggested :
format 1  first bit is 0 ,  Bus transfer
    0 , Dest : 5,  Src : 6, SelR : 3, Cond : 4, Goto : 5   = 24 bits
format 2  first bit is 1,  ALU and Mem control
    1, Sel2R : 1, ALU : 4, Mctl : 2, PC+1 : 1, undef : 6, Cond : 4, Goto : 5 = 24 bits
which reduce the width from 31 bits to 24 bits without performance penalty.  Sel2R is a control bit to select 2 registers for ALU input.  All the rest are similar to S1m.  Write a new microprogram using this narrower microprogram word.  Write simulator to run it.  Run a benchmark program under this new simulator.  Remember that the machine code (object code) doesn't change at all running under this new simulator or the original S1.
(Read Stalling's text book for an example of two-format microprogram)

12  Using microprogram as instructions directly.
Consider that there is  no "instruction set", no program counter (but microprogram counter), no instruction fetch in the normal sense.  Your machine and "program" is THE microprogram itself.  You have to add some fields into microprogram word such as : R0, R1, R2, ADS which hold the appropriate values.  Can you pipeline this machine? (pipeline microprogram itself).

13  Add Floating point instructions to S1
Add the following FP instruction to S1 : fadd, fmult, fdiv.  The FP number in your design is a 32 bits word and a set of FP register (32 bits) is needed.  In writing the simulator you don't have to do IEEE Floating point arithmetic yourself.  You can use data type in C to do it for you, i.e. you can multiply, divide, add the floating point number in C.

Benchmark programs (choose one)
A) running the program to find square root.
Using Newton - Raphson, or so called "successive approximation" method.  Let x be a guess square root of a then
                x n+1 = 0.5 ( xn + a/xn )
Iterate this 7 times and the precision will always be better than 24 bits.
B) evaluate sin x
                sin x = x - x3/3! + x5/5! - x7/7! + ...
using only four terms (not very accurate), where x is expressed in radians and maximum is pi/2.

14  Change S1 to 32 bits word
Design new ISA, instruction set, instruction formats.  This machine is essentially S1 with 32 bits instruction and data. You should add some new instructions to make your machine run the benchmark program faster.  Write its simulator and run benchmark

15  Change S1 instruction set to 3 registers format
You must change ALL S1 instructions that are appropriate and add the "immediate" value to some instruction.
These are the example of immediate instructions :

Immediate mode : addi, cmpi, inci, storeri.

    addi r1,N       r1 = r1 + N

Index mode :   loadx, storex

    loadrx (r1),r2,offset      (r1)+ offset -> r2
    storex r1, (r2), offset    r1 -> (r2) + offset

Write microstep, modify simulator and run benchmark programs.

Benchmark Programs

The benchmark suite is Stanford Integer benchmark.  They are collection of small interger programs supposed to test CPU integer performance.  These programs are suitable for students' excercise and NOT realistic by today standard :  qsort.c  queen.c  sieve.c  hanoi.c,  matmul.c  bubble.c  perm.c

Stanford integer benchmark suite
(run to measure dynamic instruction count of R1   26 Aug 98)

bubble sort 100..1  to 1..100  global a[100], N=100
hanoi 5 disks from 1 to 3, global num[4], D = 5
matmul mul 10 x 10 matrix, global a[100], b[100], c[100]  a = b x c
perm permute N, global val[4], id, N = 4
qsort quick sort 100..1 to 1..100, global a[100], N = 100
queen soln of all 8-queen, global Q,Z,D, col[8], d45[15], d135[15],
 queen[8], soln,  run find(0);
sieve find prime less than N, global p[1000], N = 1000
 original N = 10000 but is too large for 16-bit applications.

How to do the project

1) You have to design an instruction set with enough instruction to execute some benchmark program (no I/O).

2) You have to design "micro architecture", i.e. the internal structure of a cpu and write its "micro step".

3) A set of benchmark program (Stanford integer benchmark) written in C is provided.  To run a benchmark program you have to convert it to an assembly level program (in the instruction set of your design).  You don't have to run the whole program. You must choose some portion of programs to measure your design. The most important portion that determine the runtime of a program is its "innest loop".  The benchmark should be taken from several parts of high level program and should be at least 10-20 lines of assembly code in total.  Choose the benchmark that will illustrate the capability of your design.

4) To validate (check that the design is correct) and evaluate (measure how fast your cpu is) the design, you can either :

  1. Write (modify) a simulator and run the benchmark programs to count the number of clocks required to complete the tasks  (you will earn extra bonus for doing the simulation) or
  2. Estimate the number of clocks by hand.  You need to make sure that you count the right thing.
5) You must hand in a report containing the following sections :
  1. Motivation behind your design (why you did it that way).
  2. Instruction set details : opcode, opcode format, number of clock required by each instruction.
  3. Microarchitecture and its microstep.
  4. Your benchmark program ( in your assembly language) and why you choose this particular part of the program.  Programs should be well commented so that I can read and understand what it does.
  5. How you validate and evaluate your design.
  6. Performance of your design (Cycle Per Instruction)
  7. Conclusion, what you learn from this project.

How I evaluate your project

I will look for the "quality" of your work including: Any question regarding the project, please contact me promptly.