There are 15 problems ranging from creative to mechanistic process. The reward will be proportional to the "quality" of the solution and the "difficulty" of the chosen problem. A team of two persons chooses one problem. Please submit your choice as soon as possible via email. Please state 2 choices and the name and ID both of person in the team. I will confirm your choice once a day replying to your sending address. I read my mail around 9am. everyday. Please check the project status to see which one is already taken. (update daily).
The project must be handed in not later than the deadline. No late submission will be accepted.
2 LIW version of S1
Redesign S1 to have LIW capability. You have to determine what
kind of additional functional units you want to add to improve the performance
(depend on your benchmark programs).
3 S1 with Scoreboard
Assume S1 has multifunctional units : FPmult1, FPmult2, FPadd, FPdiv,
Integer. Simulate Scoreboard running this program :
LF F6,34(R2)
LF F2,45(R3)
MULTF F0,F2,F4
SUBF F8,F6,F2
DIVF F10,F0,F6
ADDF F6,F8,F2
(Read Henessy & Patterson's book)
4 S1 with Tomasulo
Assume S1 has FP adder, FP multiplier, with 3 and 2 reservation stations,
Load buffer, Store buffer with 6 and 3 entries. Simulate S1 with
Tomasulo running this program :
LF F6,34(R2)
LF F2,45(R3)
MULTF F0,F2,F4
SUBF F8,F6,F2
DIVF F10,F0,F6
ADDF F6,F8,F2
(Read Hennessy & Patterson's book)
5 S1p with branch prediction
Add branch prediction capability to S1 pipe. You have to
decide the method to do branch prediction, branch-target buffer.
6 S1p with delay branch
Add delay branch capability to S1 pipe. Examine your benchmark
programs. How much the delay slot
can be usefully filled?
7 Stack machine ISA
Design a stack machine, its instruction set must be stack oriented
(no register!). Have a look at my research paper which I designed
a stack machine http://www.cp.eng.chula.ac.th/faculty/pjw/r1/R1.htm
at the section "intermediate code specification". You can also
have a look at Java chip called PicoJava at http://www.cp.eng.chula.ac.th/faculty/pjw/teaching/ca/JavaVM/picojava.pdf
8 Minimum instruction set CPU
Design a processor with minimum number of instructions. It must
be able to run at least a benchmark program completely (that is, it must
have enough instruction to implement a benchmark program). You should
not worry too much about the ISA being absolute minumum, however you should
try to make its ISA as small as you can.
9 Fastest Matrix Multiplication S1
Modify S1 such that it can run Matmul.c fastest. There
are many ways to do this : you can modify the instruction set (add some
special instruction) or add functional units or modify organization (such
as two pipelines).
10 Comparing S1 with 2,3,4 pipeline stages
Design S1 with 2,3,4 stages pipeline. Compare its performance
with S1p which has 5 stages pipe. Please note that in this case one
stage of the pipeline will take several clocks to be completed.
11 S1 microprogram with 2 formats microprogram
Modify S1m to use 2 formats microprogram to shorten the width
of a microprogram word. After observing that ALU functions, Memory
control, are never activated at the same time as Bus transfer (Dest, Src,
SelR), the following formats are suggested :
format 1 first bit is 0 , Bus transfer
0 , Dest : 5, Src : 6, SelR : 3, Cond : 4,
Goto : 5 = 24 bits
format 2 first bit is 1, ALU and Mem control
1, Sel2R : 1, ALU : 4, Mctl : 2, PC+1 : 1, undef
: 6, Cond : 4, Goto : 5 = 24 bits
which reduce the width from 31 bits to 24 bits without performance
penalty. Sel2R is a control bit to select 2 registers for ALU input.
All the rest are similar to S1m. Write a new microprogram using this
narrower microprogram word. Write simulator to run it. Run
a benchmark program under this new simulator. Remember that the machine
code (object code) doesn't change at all running under this new simulator
or the original S1.
(Read Stalling's text book for an example of two-format microprogram)
12 Using microprogram as instructions directly.
Consider that there is no "instruction set", no program counter
(but microprogram counter), no instruction fetch in the normal sense.
Your machine and "program" is THE microprogram itself. You have to
add some fields into microprogram word such as : R0, R1, R2, ADS which
hold the appropriate values. Can you pipeline this machine? (pipeline
microprogram itself).
13 Add Floating point instructions to S1
Add the following FP instruction to S1 : fadd, fmult, fdiv.
The FP number in your design is a 32 bits word and a set of FP register
(32 bits) is needed. In writing the simulator you don't have to do
IEEE Floating point arithmetic yourself. You can use data type in
C to do it for you, i.e. you can multiply, divide, add the floating point
number in C.
Benchmark programs (choose one)
A) running the program to find square root.
Using Newton - Raphson, or so called "successive approximation" method.
Let x be a guess square root of a then
x n+1 = 0.5 ( xn + a/xn )
Iterate this 7 times and the precision will always be better than 24
bits.
B) evaluate sin x
sin x = x - x3/3! + x5/5! - x7/7! + ...
using only four terms (not very accurate), where x is expressed in
radians and maximum is pi/2.
14 Change S1 to 32 bits word
Design new ISA, instruction set, instruction formats. This machine
is essentially S1 with 32 bits instruction and data. You should add some
new instructions to make your machine run the benchmark program faster.
Write its simulator and run benchmark
15 Change S1 instruction set to 3 registers format
You must change ALL S1 instructions that are appropriate and add the
"immediate" value to some instruction.
These are the example of immediate instructions :
Immediate mode : addi, cmpi, inci, storeri.
addi r1,N r1 = r1 + N
Index mode : loadx, storex
loadrx (r1),r2,offset
(r1)+ offset -> r2
storex r1, (r2), offset r1
-> (r2) + offset
Write microstep, modify simulator and run benchmark programs.
Stanford integer benchmark suite
(run to measure dynamic instruction count of R1 26 Aug
98)
bubble sort 100..1 to 1..100 global a[100], N=100
hanoi 5 disks from 1 to 3, global num[4], D = 5
matmul mul 10 x 10 matrix, global a[100], b[100], c[100] a =
b x c
perm permute N, global val[4], id, N = 4
qsort quick sort 100..1 to 1..100, global a[100], N = 100
queen soln of all 8-queen, global Q,Z,D, col[8], d45[15], d135[15],
queen[8], soln, run find(0);
sieve find prime less than N, global p[1000], N = 1000
original N = 10000 but is too large for 16-bit applications.
2) You have to design "micro architecture", i.e. the internal structure of a cpu and write its "micro step".
3) A set of benchmark program (Stanford integer benchmark) written in C is provided. To run a benchmark program you have to convert it to an assembly level program (in the instruction set of your design). You don't have to run the whole program. You must choose some portion of programs to measure your design. The most important portion that determine the runtime of a program is its "innest loop". The benchmark should be taken from several parts of high level program and should be at least 10-20 lines of assembly code in total. Choose the benchmark that will illustrate the capability of your design.
4) To validate (check that the design is correct) and evaluate (measure how fast your cpu is) the design, you can either :