Projects in advanced topic in digital systems  2001

Project evaluation

I evaluate your projects according to three criteria:
  1. the design : problem solving,  the design concept, how you solve the problem
  2. the experiment:  the method, how you conduct the experiment (hand simulate or running simulation), how you choose the benchmarks, the amount of experimental data, all of these are boiled to how convincing your data is.
  3. the report:  writing, discussion of the result, the completeness.
The evaluation is reported Design/Experiment/Report
A = 10, B+ = 8, B = 6   Max total AAA = 30

Summary

I am happy with all projects.  They exceed my expectation.  Many ideas are well thought of and the experiments are well done.  The results are illuminating.  Please send your report in machine readable form (PS or PDF are good for browsing, .doc is also acceptable) (email me or drop the disk at my office) I will post them on the course page for a limited time so that you can learn from your friends' work.  (note the ordering of the reports is random)

Matinee : Direct microprogram  AAA   report
Sunee : Improved microprogram to increase the speed.   AAA    report
Yodthong : VLIW.   AAB+   report
Timaporn : two-level microprogram for s2.  AAA   report
Shisanu : dynamic scheduling.  AAA   report
Tanom : Pipelined DLX   BBA  report
Supaporn : Improve S2 ISA by combining the frequently used instruction.   BAA    report
Netnapa : Improve S2 addressing mode.    AAA    report
Benjaporn : Add new instruction to S2    B+B+A    report
Chaiwat : Branch prediction   AAA   report
Prapon : Include Tomasulo into S2    AAA   report
Chatchawit : Redundant interleaved memory  AAA  report

Matinee : Direct microprogram  AAA

Design: the design of microprogram adding extra fields for register numbers and constant.
Experiment: Using bubble-sort program, convert high-level language into assembly and then write the direct microprogram (it is at Register Transfer Level).  the number of clock is calculated by hand.
Result:   Reduce 1401 clocks to 403 clocks (speed up 3.48).  Impressive result.
Comment:  you misscalculate the number of instruction when you do CPI. the number of instruction is the "dynamic" count, i.e. counting the instructions that are executed, not the "static" count (counting from the program).  You can recognise this anomaly easily from the CPI that you got(42).  The CPI can not be that high, usually you will get between 4-8 on the machine in our class.

Sunee : Improved microprogram to increase the speed.   AAA

Design: pipeline microprogram, (overlap fetch and execution).  Must modify register read-write in one cycle and separate data and address bus in data path to allow concurrent operations.
Experiment: Run seven benchmarks, use RZ compiler assembler and modified simulator to collect all statistics.
Result: Average over all seven benchmarks the CPI reduces 16.5%  (from 5.21 to 4.35)
comment:  I like your problem solving, especially the overlap between fetch and write-back.  The result is convincing.

Yodthong : VLIW.   AAB+

Design: Propose a transformation method from assembly language to VLIW. The proposed machine has 1 load/store unit, 2 ALUs, 1 branch unit.  The method is interesting.
Experiment: Schedule the code from assembly language to VLIW by hand. Run full program of bubble sort.
Result: The result is good, the speed up is 1.8.
Comment:  The writing, explanation of many details are not clear.

Timaporn : two-level microprogram for s2.  AAA

Design: the aim is to compress the SIZE of microprogram. Proposing 3 methods (gradually more complex scheme): nanoprogramming, micro-subroutine and multi-nanoprogram.
1) nanoprogramming, construct "nanoprogram" which is the pointer to the "microprogram".  This has the effect of reducing the "repeat" microprogram.
2) micro-subroutine, to achieve a larger grain of compression, "microprogram" becomes a pointer to micro-subroutine (one pointer to a sequence of microoperations).
3) multi-nanoprogramming, to achieve even higher compression, multiple format is used.  "nanoprogram" is the pointer to many formats microprogram.
Experiment: All proposed methods are microcoded and run in the simulator.  All seven benchmarks are run.
Result:
                        original nano         subroutine     multi-nano
size in bits (% reduce) 1357     1148 (15.4)  998 (26.5)     926 (31.8)
CPI                     5.2      10.4         6              10.4

Comment: The size saving will be much larger than this if the microprogram is larger (more instruction, more complex instructions).  VAX has 32K words microprogram compared to S2 59 words.

Shisanu : dynamic scheduling.  AAA

Design: comparing scoreboard, Tomasulo, and basic pipeline.
Experiment: using a simulator (dlxview) running three small synthetic benchmarks FP ops programs.  Scoreboard and Tomasulo FP units are not piped.  The experiments are carried out varying many parameters.
Result:  The basic pipe is quite good.  Scoreboard turns out to be worst.  The performance limit is due to data dependency.
benchmarks                 pipe     scoreboard     Tomasulo
FP-mul, int add no loop      27     34              29
+ FP-div                     40     41              37
+ loop                       423    593             238

Tanom : Pipelined DLX   BBA

Design: 5 stages DLX pipeline in the book
Experiment: Using two programs, hand simulate
1)  out[i] = w1.x1 + w2.x2 + w3.x3
2)  memcopy
Result:
benchmark    speed up (compared to non-pipe)
pgm1          3.3
memcopy       3.6

Supaporn : Improve S2 ISA by combining the frequently used instruction.   BAA

Design: Emphasise on reducing the dynamic instruction count.  The new instruction is "compare and branch".
Experiment: two benchmarks: max, bubble compiled from high-level, get the output assembly language and modify them to include new instruction.  Run simulation to collect statistics.
Result:
benchmarks reduction of instruction count
max          9.8%
bubble       6.3%
comment:  should calculate the number of clocks.  As the new instruction can save the number of instruction but it also increase the number of clock of its own.  Of course number of clock of the new instruction is lower and the old one combined.

Netnapa : Improve S2 addressing mode.    AAA

Design: Analyse the program (reverse character string) and found the following addressing modes are used most:
1) load/store displacement
2) add immediate register
3) jump uncondition

Invent new instruction based on the following codes:
1: ld r1 @-1 bp
2: add r1 r1 #1
3: st @0 r1 r2  // a[i] = j
4: ld r1 @-1 bp  // get j
5: add r1 r1 #10 // j = j + 10
6: st @-1 bp r1  // put j

line 3,4,5  can be replace by  store and increment (j = j + 10)

stinc @0 r1 r2 #10

also another new instruction:  swap d1(r1) d2(r2)
use in swapping the two chars.
Experiment: Compile reverse from high-level language and modify the output assembly language to include the new instructions.  Statistics is calculated by hand.
Result: This improvement reduces the number of clocks from 884 to 539 clocks.  CPI reduced from 5.23 to 5.10.

Benjaporn : Add new instruction to S2    B+B+A

Design: Using the idea of replacing a sequence of old instructions by a new instruction to improve the speed of execution (as explained in the paper "Chongstitvatana, P., Post processing optimization of byte-code instructions by extension of its virtual machine, Proc. of 20th Symposium of Electrical Engineering, Bangkok, Thailand, 1997." However, the context of the paper is the virtual machine to execute byte-code.  The new instruction is "increment a variable" (in the memory)
Experiment: The benchmar program, summing an array is compiled from high-level language and the generated assembly is modified to include the new instruction.  The statistics is collected by running in the modified simulator.
Result: The reduction of the number of clocks is 16.4%  CPI from 5.29 (original) to 5.05 (with new instructions)
comment: The result is the number of clock is 1353.  Therefore the amount of reduction of clocks is 1619 - 1353 = 266 clocks.  I cannot find this number (266) from your calculation.  I did the calculation by hand and the result is inconsistent with your calculation. Let's try to calculate it.  From the code segment (1) in your report page 8.  the following code:
ld r1 @-1 r29
add r1 r1 #1
st @-1 r29 r1
This is the segment you want to replace by inc v which is 5 clocks.  The above segment consumes, 6+5+6 = 17 clocks.  It appeared 19 times = total 17 x 19 = 323 clocks.  The inc v is 5 clocks, therefore total 95 clocks.  The difference is 323 - 95 = 228.  Therefore I do not trust your number.

Chaiwat : Branch prediction   AAA

Design: Comparing six scheme of branch predictions:
1)  always taken
2)  always not-taken
3)  heuristic
4)  branch history table 1-bit
4.1)  initial value: taken
4.2)  initial value: not-taken
4.3)  initial value: heuristic
5)  branch history table 2-bit
5.1)  initial value: taken
5.2)  initial value: not-taken
6)  most frequently branch
6.1)  global
6.2)  per address
The group 1,2,3 are static.  the group 4,5,6 are dynamic.
Experiment: Run all seven benchmarks collecting all statistics about branching.
Result: overall correct prediction is 30-70%.  static is worse than dynamic.  method 4 is worse than method 5.  The best of the lot is most frequently branch, per address.
comment:   the report is thorough and well written.  It should be noted that all branches in the benchmarks are forward branch.   not surprisingly because the high-level language has only while-loop which in turn compiled into forward conditional branch (the only branch backward is unconditional) You report seven benchmarks separately which is nice but hard to read. You should find a way to aggregate all seven results into one (perhaps
using weighted average etc.)

Prapon : Include Tomasulo into S2    AAA

Design: To overcome the effect of out-of-order execution, the flags cannot be global, the result of comparison must be store locally in a register.  The ISA is modified:
1)  compare puts its result into a register:
cmp rd1 rs1 #n
cmp rd1 rs2 rs2
2)  jump takes the result from a register
jmp cond (rd1) rs1

Experiment: The experiment is carried out using a simulator written in Verilog HDL (to simulate parallelism).  Run all seven benchmarks.  A tool is written to transform S2 object code into Verilog file to be run in the simulation.
Result: The average reduction of CPI from 5.2 to 2.2  (speed up more than twice).

Chatchawit : Redundant interleaved memory  AAA

Design: Using predictive interleaved memory and static scheduling by compiler to schedule access to memory (load/store) before the data is used, hence hiding the latency.  The memory are banked (with redundancy) and interleaved access to increase the bandwidth.  It helps in the case of contiguous address access.  Static scheduling reduces the complexity of hardware (processor).  The compiler will schedule for branch prediction when the correctness of the result does not depend on the correctness of prediction.

Program memory and data memory are separated.  Program memory is accessed sequentially, data memoryis accessed randomly.

ISA design, include register forwarding (done by compiler).  Branch and compare instructions store the result in register.  use delay branch.

Experiment: The experiment uses two benchmark memcopy and Ai = Bi + Ci + Di.  The performance calculation is done by hand.  Assuming the latency is 5 clocks hence use 5 delay slots.
Result:
benchmark    speed up
memcopy      1.14
sum          1.91

Comment:  Elegant design (no cache is required as the system hides the latency).  the memory is 4 times larger.  As all static scheduler, it will work well on some program (that the access pattern can be predicted). Finally, the compiler must be clever to use this scheme effectively.