Projects in Advanced Topics in Digital Systems 2001
Project evaluation
I evaluate your projects according to three criteria:
- the design: problem solving, the design concept, how you solve the
problem.
- the experiment: the method, how you conduct the experiment (hand
simulation or running a simulator), how you choose the benchmarks, the
amount of experimental data; all of these boil down to how convincing
your data is.
- the report: the writing, the discussion of the results, the
completeness.
The evaluation is reported as Design/Experiment/Report.
A = 10, B+ = 8, B = 6. The maximum total (AAA) is 30.
Summary
I am happy with all the projects. They exceed my expectations. Many
ideas are well thought out and the experiments are well done. The
results are illuminating. Please send your report in machine-readable
form (PS or PDF are good for browsing; .doc is also acceptable), either
by email or by dropping a disk at my office. I will post the reports on
the course page for a limited time so that you can learn from your
friends' work. (Note: the ordering of the reports is random.)
Matinee    : Direct microprogram                                        AAA    report
Sunee      : Improved microprogram to increase the speed                AAA    report
Yodthong   : VLIW                                                       AAB+   report
Timaporn   : Two-level microprogram for S2                              AAA    report
Shisanu    : Dynamic scheduling                                         AAA    report
Tanom      : Pipelined DLX                                              BBA    report
Supaporn   : Improve S2 ISA by combining frequently used instructions   BAA    report
Netnapa    : Improve S2 addressing modes                                AAA    report
Benjaporn  : Add new instructions to S2                                 B+B+A  report
Chaiwat    : Branch prediction                                          AAA    report
Prapon     : Include Tomasulo into S2                                   AAA    report
Chatchawit : Redundant interleaved memory                               AAA    report
Matinee : Direct microprogram AAA
Design: a direct microprogram, adding extra fields for register
numbers and constants.
Experiment: Using a bubble-sort program, convert the high-level
language into assembly and then write the direct microprogram (at the
register transfer level). The number of clocks is calculated by hand.
Result: Reduced from 1401 clocks to 403 clocks (speedup 3.48). An
impressive result.
Comment: you miscalculated the number of instructions when you
computed the CPI. The instruction count must be the "dynamic" count,
i.e. counting the instructions that are actually executed, not the
"static" count (counting from the program listing). You can recognise
this anomaly easily from the CPI you got (42). The CPI cannot be that
high; you will usually get between 4 and 8 on the machine in our class.
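To make the point concrete, here is a minimal sketch (Python) of the
calculation. The 403-clock figure comes from the result above; the
static and dynamic instruction counts are made-up numbers purely for
illustration:
# CPI must use the dynamic instruction count (instructions executed),
# not the static count (instructions in the listing).
clocks = 403          # total clocks for the bubble-sort run (from the result above)
static_count = 10     # hypothetical: instructions in the program listing
dynamic_count = 80    # hypothetical: instructions actually executed (loops repeat)

wrong_cpi = clocks / static_count   # inflated, like the CPI of 42 in the report
cpi = clocks / dynamic_count        # the correct figure, typically 4-8 on this machine
print(f"static (wrong) CPI = {wrong_cpi:.1f}, dynamic CPI = {cpi:.1f}")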
Sunee : Improved microprogram to increase the speed. AAA
Design: a pipelined microprogram (overlapping fetch and execution).
This requires modifying the registers to read and write in one cycle,
and separating the data and address buses in the datapath to allow
concurrent operations.
Experiment: Run seven benchmarks, using the RZ compiler, the
assembler, and a modified simulator to collect all statistics.
Result: Averaged over all seven benchmarks, the CPI is reduced by
16.5% (from 5.21 to 4.35).
Comment: I like your problem solving, especially the overlap between
fetch and write-back. The result is convincing.
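A minimal timing sketch of why the overlap pays off; the per-phase
clock counts and the microinstruction count are hypothetical, not taken
from the report:
# Two-phase model: without overlap each microinstruction pays fetch + execute;
# with overlap the fetch of the next microinstruction hides behind execution.
FETCH, EXEC = 1, 3        # hypothetical clocks per phase
n = 100                   # hypothetical number of microinstructions executed

serial = n * (FETCH + EXEC)                # fetch, then execute, one at a time
pipelined = FETCH + n * max(FETCH, EXEC)   # next fetch overlaps current execution
print(serial, pipelined, round(serial / pipelined, 2))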
Yodthong : VLIW. AAB+
Design: Proposes a method for transforming assembly language into
VLIW code. The proposed machine has 1 load/store unit, 2 ALUs, and 1
branch unit. The method is interesting.
Experiment: Schedule the assembly code into VLIW words by hand. Run
the full bubble-sort program.
Result: The result is good; the speedup is 1.8.
Comment: The writing and the explanation of many details are not
clear.
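For reference, a minimal sketch (Python) of greedy slot packing for a
1 load/store + 2 ALU + 1 branch machine; the opcode names and their
classification are illustrative, and a real scheduler must also respect
data dependences:
SLOTS = {"mem": 1, "alu": 2, "branch": 1}

def unit(op):
    # classify an opcode into a functional-unit slot class (illustrative)
    if op in ("ld", "st"):
        return "mem"
    if op in ("jmp", "jz"):
        return "branch"
    return "alu"

def pack(instrs):
    # greedy packing: start a new VLIW word whenever a slot class is full
    words, current, used = [], [], {k: 0 for k in SLOTS}
    for ins in instrs:
        u = unit(ins.split()[0])
        if used[u] == SLOTS[u]:
            words.append(current)
            current, used = [], {k: 0 for k in SLOTS}
        current.append(ins)
        used[u] += 1
    if current:
        words.append(current)
    return words

code = ["ld r1 @0 r2", "add r3 r1 #1", "add r4 r3 r3",
        "st @0 r2 r4", "jmp loop"]
for word in pack(code):
    print(word)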
Timaporn : Two-level microprogram for S2. AAA
Design: the aim is to compress the SIZE of the microprogram. Three
methods are proposed (gradually more complex schemes): nanoprogramming,
micro-subroutines, and multi-nanoprogramming.
1) nanoprogramming: construct a "nanoprogram" whose entries are
pointers into the "microprogram". This has the effect of removing
repeated microinstructions.
2) micro-subroutines: to achieve a coarser grain of compression, the
"microprogram" becomes a pointer to a micro-subroutine (one pointer to a
whole sequence of micro-operations).
3) multi-nanoprogramming: to achieve even higher compression, multiple
formats are used; the "nanoprogram" points to microprograms of several
formats.
Experiment: All proposed methods are microcoded and run in the
simulator. All seven benchmarks are run.
Result:
                            original   nano          subroutine   multi-nano
  size in bits (% reduce)   1357       1148 (15.4)   998 (26.5)   926 (31.8)
  CPI                       5.2        10.4          6            10.4
Comment: The size saving would be much larger if the microprogram
were larger (more instructions, more complex instructions). The VAX has
a 32K-word microprogram, compared to 59 words for S2.
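A minimal sketch (Python) of the first scheme, nanoprogramming: keep
one copy of each distinct microinstruction and turn the microprogram
into pointers. The control-word patterns and widths below are made up
for illustration:
micro = ["fetch", "decode", "read_reg", "alu_add", "write_reg",
         "fetch", "decode", "read_reg", "alu_sub", "write_reg"]

nanostore = sorted(set(micro))                  # one copy of each distinct microinstruction
pointers = [nanostore.index(m) for m in micro]  # the microprogram becomes indices

MICRO_WIDTH = 23                                # hypothetical control-word width in bits
POINTER_WIDTH = max(1, (len(nanostore) - 1).bit_length())

original_bits = len(micro) * MICRO_WIDTH
compressed_bits = (len(pointers) * POINTER_WIDTH
                   + len(nanostore) * MICRO_WIDTH)
print(original_bits, compressed_bits)           # the saving grows with repetition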
Shisanu : Dynamic scheduling. AAA
Design: comparing the scoreboard, Tomasulo, and a basic pipeline.
Experiment: using a simulator (dlxview) to run three small synthetic
FP benchmark programs. The scoreboard and Tomasulo FP units are not
pipelined. The experiments are carried out while varying many
parameters.
Result: The basic pipeline is quite good. The scoreboard turns out to
be the worst. The performance limit is due to data dependences.
  benchmark                    pipe   scoreboard   Tomasulo
  FP-mul, int add, no loop     27     34           29
  + FP-div                     40     41           37
  + loop                       423    593          238
Tanom : Pipelined DLX BBA
Design: the 5-stage DLX pipeline from the book.
Experiment: Hand simulate two programs:
1) out[i] = w1.x1 + w2.x2 + w3.x3
2) memcopy
Result:
  benchmark   speedup (compared to non-pipelined)
  pgm1        3.3
  memcopy     3.6
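These numbers line up with the usual pipeline speedup estimate; a
minimal sketch (Python) with a hypothetical stall figure, not taken
from the report:
# speedup ~= pipeline depth / (1 + average stall clocks per instruction)
depth = 5                  # the five DLX stages
stalls_per_instr = 0.5     # hypothetical average from data and control hazards
print(round(depth / (1 + stalls_per_instr), 2))   # ~3.3, near the measured 3.3-3.6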
Supaporn : Improve S2 ISA by combining frequently used instructions. BAA
Design: The emphasis is on reducing the dynamic instruction count.
The new instruction is "compare and branch".
Experiment: two benchmarks, max and bubble, are compiled from the
high-level language; the output assembly is modified to include the new
instruction. Run the simulation to collect statistics.
Result:
  benchmark   reduction in instruction count
  max         9.8%
  bubble      6.3%
Comment: you should also calculate the number of clocks. The new
instruction reduces the instruction count, but it takes clocks of its
own; of course, the new instruction takes fewer clocks than the old
instructions it replaces combined.
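A minimal sketch (Python) of the clock-count comparison the comment
asks for; all the clock counts and the occurrence count are
hypothetical:
# Compare clocks of the old compare-then-branch pair against the new
# combined "compare and branch" instruction over its dynamic occurrences.
cmp_clocks, br_clocks = 5, 4   # hypothetical clocks of the old two instructions
cmp_br_clocks = 6              # hypothetical clocks of the combined instruction
occurrences = 100              # hypothetical dynamic count of the replaced pair

saved = occurrences * (cmp_clocks + br_clocks - cmp_br_clocks)
print(f"clocks saved: {saved}")   # positive only if the new instruction is cheaper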
Netnapa : Improve S2 addressing modes. AAA
Design: Analysed the program (reversing a character string) and found
that the following addressing modes are used most:
1) load/store with displacement
2) add immediate to register
3) unconditional jump
New instructions are invented based on the following code:
1: ld r1 @-1 bp
2: add r1 r1 #1
3: st @0 r1 r2 // a[i] = j
4: ld r1 @-1 bp // get j
5: add r1 r1 #10 // j = j + 10
6: st @-1 bp r1 // put j
Lines 3, 4, 5 can be replaced by a store-and-increment (j = j + 10):
stinc @0 r1 r2 #10
Another new instruction, swap d1(r1) d2(r2), is used for swapping the
two characters.
Experiment: Compile reverse from the high-level language and modify
the output assembly to include the new instructions. The statistics are
calculated by hand.
Result: This improvement reduces the number of clocks from 884 to
539. The CPI is reduced from 5.23 to 5.10.
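A minimal sketch (Python) of the kind of peephole rewrite used here,
collapsing a load / add-immediate / store of the same variable into one
combined instruction. The matched pattern and the "stinc" spelling are
simplified and illustrative, not exactly the segment shown above:
def peephole(lines):
    # collapse "ld X; add X X #n; st X" into one combined instruction
    out, i = [], 0
    while i < len(lines):
        if (i + 2 < len(lines)
                and lines[i].startswith("ld ")
                and lines[i + 1].startswith("add ")
                and lines[i + 2].startswith("st ")):
            n = lines[i + 1].split()[-1]            # the immediate, e.g. "#10"
            rest = lines[i + 2].split(" ", 1)[1]    # operands of the store
            out.append(f"stinc {rest} {n}")         # hypothetical combined form
            i += 3
        else:
            out.append(lines[i])
            i += 1
    return out

code = ["ld r1 @-1 bp", "add r1 r1 #10", "st @-1 bp r1"]
print(peephole(code))   # -> ['stinc @-1 bp r1 #10']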
Benjaporn : Add new instructions to S2. B+B+A
Design: Uses the idea of replacing a sequence of old instructions
with a new instruction to improve the speed of execution, as explained
in the paper "Chongstitvatana, P., Post processing optimization of
byte-code instructions by extension of its virtual machine, Proc. of
20th Symposium of Electrical Engineering, Bangkok, Thailand, 1997."
However, the context of that paper is a virtual machine executing
byte-code. The new instruction is "increment a variable" (in memory).
Experiment: The benchmark program, summing an array, is compiled from
the high-level language and the generated assembly is modified to
include the new instruction. The statistics are collected by running in
the modified simulator.
Result: The number of clocks is reduced by 16.4%. The CPI goes from
5.29 (original) to 5.05 (with the new instruction).
Comment: Your result gives 1353 as the number of clocks, so the
reduction is 1619 - 1353 = 266 clocks. I cannot find this number (266)
in your calculation. I did the calculation by hand and the result is
inconsistent with yours. Let's try to calculate it. From code segment
(1) on page 8 of your report, the following code:
ld r1 @-1 r29
add r1 r1 #1
st @-1 r29 r1
This is the segment you want to replace by inc v, which takes 5 clocks.
The segment above consumes 6 + 5 + 6 = 17 clocks. It appears 19 times,
for a total of 17 x 19 = 323 clocks. The inc v takes 5 clocks, for a
total of 19 x 5 = 95 clocks. The difference is 323 - 95 = 228 clocks.
Therefore I do not trust your number.
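The same hand check, written out (Python) with the figures above:
ld_clocks, add_clocks, st_clocks = 6, 5, 6   # clocks of the replaced segment
inc_clocks = 5                               # clocks of the new "inc v"
occurrences = 19                             # dynamic occurrences of the segment

old = (ld_clocks + add_clocks + st_clocks) * occurrences   # 17 x 19 = 323
new = inc_clocks * occurrences                             #  5 x 19 =  95
print(old - new)   # 228 clocks saved, not the 266 implied by the report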
Chaiwat : Branch prediction AAA
Design: Comparing six branch prediction schemes:
1) always taken
2) always not-taken
3) heuristic
4) branch history table 1-bit
4.1) initial value: taken
4.2) initial value: not-taken
4.3) initial value: heuristic
5) branch history table 2-bit
5.1) initial value: taken
5.2) initial value: not-taken
6) most frequent branch
6.1) global
6.2) per address
Schemes 1-3 are static; schemes 4-6 are dynamic.
Experiment: Run all seven benchmarks, collecting all statistics about
branching.
Result: the overall correct prediction rate is 30-70%. Static is
worse than dynamic. Method 4 is worse than method 5. The best of the
lot is the most-frequent-branch scheme, per address.
Comment: the report is thorough and well written. It should be noted
that all conditional branches in the benchmarks are forward branches.
This is not surprising, because the high-level language has only a
while-loop, which is compiled into a forward conditional branch (the
only backward branch is unconditional). You report the seven benchmarks
separately, which is nice but hard to read. You should find a way to
aggregate all seven results into one (perhaps using a weighted average,
etc.).
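For reference, a minimal sketch (Python) of scheme 5, a 2-bit branch
history table with one saturating counter per entry, indexed by branch
address. The table size and the outcome trace are illustrative only:
TABLE_BITS = 4                    # hypothetical: a 16-entry table

def predict_and_update(table, pc, taken):
    i = pc % (1 << TABLE_BITS)    # index by low address bits
    prediction = table[i] >= 2    # states 0,1 -> predict not taken; 2,3 -> taken
    # saturating 2-bit update toward the actual outcome
    table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
    return prediction

table = [1] * (1 << TABLE_BITS)   # initial value "not taken" (case 5.2)
trace = [(64, True), (64, True), (64, False), (64, True)]   # (pc, taken)
correct = sum(predict_and_update(table, pc, t) == t for pc, t in trace)
print(f"correct predictions: {correct}/{len(trace)}")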
Prapon : Include Tomasulo into S2. AAA
Design: To cope with out-of-order execution, the flags cannot be
global; the result of a comparison must be stored locally in a register.
The ISA is modified:
1) compare puts its result into a register:
cmp rd1 rs1 #n
cmp rd1 rs1 rs2
2) jump takes its condition from a register:
jmp cond (rd1) rs1
Experiment: The experiment is carried out using a simulator written
in Verilog HDL (to simulate the parallelism). All seven benchmarks are
run. A tool was written to transform S2 object code into a Verilog file
to be run in the simulation.
Result: The CPI is reduced on average from 5.2 to 2.2 (a speedup of
more than two).
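A minimal sketch of the rationale: with out-of-order completion a
single global flag can be overwritten before its branch reads it,
whereas per-register compare results cannot collide. The Python below
only illustrates the idea; it is not the Verilog model:
# With one global flag, a later-completing compare overwrites an earlier one.
flag = (3 < 5)      # the second compare happens to complete first
flag = (7 < 2)      # the first compare completes later and clobbers the flag
print("both branches now see", flag)

# With the modified ISA each compare names its own destination register.
r1 = (7 < 2)        # cmp r1 ...   result kept in r1
r2 = (3 < 5)        # cmp r2 ...   result kept in r2
print("jmp cond(r1) sees", r1, "; jmp cond(r2) sees", r2)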
Chatchawit : Redundant interleaved memory. AAA
Design: Use predictive interleaved memory and static scheduling by
the compiler to schedule memory accesses (load/store) before the data is
used, hence hiding the latency. The memory is banked (with redundancy)
and accessed in an interleaved fashion to increase the bandwidth. This
helps in the case of contiguous address accesses. Static scheduling
reduces the complexity of the hardware (the processor). The compiler
schedules for branch prediction when the correctness of the result does
not depend on the correctness of the prediction.
Program memory and data memory are separated: program memory is
accessed sequentially, data memory is accessed randomly.
ISA design: register forwarding is included (done by the compiler).
Branch and compare instructions store their results in registers.
Delayed branches are used.
Experiment: The experiment uses two benchmarks, memcopy and
A[i] = B[i] + C[i] + D[i]. The performance calculation is done by hand.
Assuming a latency of 5 clocks, 5 delay slots are used.
Result:
  benchmark   speedup
  memcopy     1.14
  sum         1.91
Comment: An elegant design (no cache is required, as the system hides
the latency), although the memory is 4 times larger. As with all static
scheduling, it will work well on some programs (those whose access
pattern can be predicted). Finally, the compiler must be clever to use
this scheme effectively.
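A minimal sketch (Python) of the interleaving idea: consecutive word
addresses map to different banks, so contiguous accesses overlap while
repeated hits to one bank serialise. The bank count is hypothetical;
the 5-clock latency matches the assumption above:
BANKS = 4            # hypothetical number of banks
LATENCY = 5          # clocks per bank access (the 5-clock assumption above)

def clocks_for_stream(addrs):
    # low-order interleaving: bank = address mod BANKS; one request issued
    # per clock; a request to a busy bank waits until that bank is free
    busy_until = [0] * BANKS
    t = 0
    for a in addrs:
        b = a % BANKS
        start = max(t, busy_until[b])
        busy_until[b] = start + LATENCY
        t += 1
    return max(busy_until)

print(clocks_for_stream(range(16)))      # 16 contiguous accesses overlap across banks
print(clocks_for_stream([0, 4, 8, 12]))  # 4 same-bank accesses fully serialise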