2110793 Advanced Topics in Digital
Systems 2016
Class meet on Wed and Fri 9:30-11:00, Eng 3 room 324
Syllabus
This course concentrates on the components and architecture of modern
supercomputers including the increasing role of Graphic Processing
Units. Operating systems and cloud computing will be discussed.
Assesment
homework
20%
midterm project 30%
final
50%
Announcements
30 Jan 2016 The s21 tool has been extensively
updated, including all documents. Please use the newest one, s2cal2.zip
10 Feb 2016 s2cal3.zip
is the simulator with trace generation.
27 Mar 2016 s21i.zip
S21 with interrupt
2 Apr 2016 s21mos.zip
S21 with MOS functions
15 Apr 2016 s21mos2.zip
MOS in assembly, s21mos3.zip
MOS with semaphore.
17 Apr 2016 s30.zip
Multicore simulator, with all goodies
s30-1.zip
20 Apr 2016 Final
project is announced. s30 multicore
lecture is updated
1 May 2016 Explanation of s30
multicore is updated. Update package s30-2.zip
with improved assembler and tracer.
. . .
previous lecture 2015
2013
Topics
Processor
Instruction level parallelism
Memory subsystems
Graphic Processing Units
Multi-core processor
Concurrency
Exascale system
Lecture
The development of modern processor.
Performance metrics.
We discuss Instruction Level Parallelism based on Chapter 3 of CAQA,
Hennessy & Pattern (5th Ed).
The cache discussion is based on Appendix B of CAQA, Hennessy & Pattern
(5th Ed).
Processors: S21 assembly language
S21 instruction set
(updated 2 Apr)
Cache memory
Graphic
Processing Unit
Multitask: S21 with
interrupt Multitask OS
Operating systems and task scheduling basic
OS concepts
semaphore: flash demo from
William Stallings, U. of Queensland
lecture from Matt Welsh, Harvard, semaphores.pdf
semaphore in MOS and
examples
Multicore processor: S30
multicore (updated) instruction set sheet S30
instruction
Additional: Memoize Function
(for implementing a dynamic programming problem)
Homework
1) Do the calculator project
2) Write S2 assembly for matrix multiplication (integer). Check
that your program produces the correct answer.
3) Generate trace from your program in 2). Keep it in a text
file.
4) Experiment with cache design, using your own trace.
5) Write S21 program with interrupt to run two processes in ISR.
6) Try s21mos, write your own
concurrent program. Make sure you do not share any variable between
processes. Change DEL and observe the change in the number of interrupts.
7) Convert all "trap ..." that support MOS into S2 assembly language.
Implement these functions: newp(), newStack(), enqueue(), terminate(),
nextp(). (a day of work)
8) Try s21mos2 package, run the
demo with two processes (all in assembly). Write a three processes
with non-share data and run it.
9) Try s21mos3 package, run two
demos, one with protected shared variable (mos5.txt),
the second one shows two processes synchronise (mos-sync.txt)
10) Use s30-1.zip package,
compare countm.txt with mos2.txt
(in terms of number of instruction). compare sync.txt
with mos-sync.txt and finally,
study sync2.txt (run mos version
with s21mos3).
11) Write a program to sum up all elements in an array, using two
cores. You can divide array or use "stripping" to pick each element for each
core. Finally, you need to sync two cores and add the final sum.
Tools
S2.1 assembler and simulator s21.zip
s21cal2.zip s21cal3.zip
(with trace)
S2.1 with interrupt for multitask
s21i.zip
S2.1 with MOS s21mos.zip
S2.1 with MOS in assembly s21mos2.zip
, MOS with semaphore s21mos3.zip
S3.0 Multicore simulator s30.zip;
with wfi, intx, sync, semaphore s30-1.zip
update package s30-2.zip
with assembler and better tracer.
NPU simulator, assembler npsim4.zip
Code for Cache simulation to be used in Cache design assignment. cache.zip
Transistor level simulation of an antique CPU. 6502 is used in the
iconic Apple II machine.
http://visual6502.org/JSSim/index.html
Update s30-2 package
R[0] now is free, unlike s2.1 where it is always zero. Return address
is now in R[31] instead of in a special register. These instructions
are now built-in (not in "trap"): ei, di, pushm, popm, cid,
wfi, intx, sync.
Some instructions have been renamed:
pushm is savr, popm is resr.
Some instructions now is not
needed: savt, rest
(as R[31] can be moved explicitly with mv
).
Some OS supports are still trap functions: newsem, wait, signal
.
Finally, interrupt works for all cores.
Project
1) Small project to practice assembly language programming
2) Hand-in a report of your homework and cache experiment.
3) Final project, select a parallel algorithm that is not too
difficult to implement in assembly language and not too trivial (like sum an
array). Write a multicore program to run that algorithm. Demonstrate
your speed up using many cores (such as 4, 8 cores) compared to single
core. Write a report of your project. You can discuss your
choice of algorithm with me. Everybody must choose a different
algorithm.
Your reports are available here: (I will make them online until
31 May 2016)
3.1) Believe Network, believe propagation (a
dynamic programming problem) update with C code
3.2) Sorting by Cellular Automata
3.3) Merge sort
3.4) Clustering
3.5) Quicksort
Thank you every one for your super participation in this class. I am
more than happy to have a chance to work with all of you.
The calculator
Write a program to behave as a calculator. There are two internal
registers: X, Y. They behave as a stack of depth two. Top of
stack is X. The calculator accepts input from the keyboard
one-line-at-a-time. There are five operations on this
calculator: + - * / . The basic four functions are obvious and operate
on integers (32 bits). The . does print the value of X to the
calculator screen. The input to the calculator is the sequence of
Reserve Polish Notation (RPN) which is suitable to be performed with a stack
storage. Here is an example of the input sequence:
123
456
+
.
The calcuator "579" on the screen.
How to do it
I make a special "trap" instruction to perform the input function.
trap 3 returns two values. The first one is the type of the input, 1
for number, 2 for operator (i.e. + - * / .). This will be returned in
R[30]. The second one is the value of the input. It will be
returned in R[29]. The value for number will be the integer. The
value for operator will be the ASCII code of the first character of that
input. (this intends to make the input more flexible, so you can use any
printable character).
Here is an example. The following program will read input and print
its ASCII value of the first character. It will continue forever until
you break the execution.
.code 0
:loop
trap 3 ;; get input
mv r30 r29 ;; move value to r30
trap 1 ;; print it
out
jmp loop
.end
PS I have modified "trap 1" to print out an integer on a new line of
its own and decorate with <..>
to make it easy to
distinguish output from input. The s21 simulator has been extensively
cleanup. The most important change is the "st" instruction has the order of
argument change to be similar to "ld", i.e. ld r1 ads, st r1 ads
.
(it is the reverse of the original version).
future work
1) Read the article Kogges, "the top in flops" (link below) and
prepare to discuss it in class.
2) Read and prepare to discuss The
Opportunities and Challenges of Exascale Computing
Reference
Exaflop machine The full report TR-2008-13.pdf
(3 Mb)
The Opportunities and
Challenges of Exascale Computing, (pdf 2 Mb) US Department of Energy,
Fall 2010
Kogge,
P. "The tops in flops", IEEE spectrum, vol.48, issue 2, Feb 2011, pp.48-54,
(local pdf )
Digital Object Identifier : 10.1109/MSPEC.2011.5693074
Thailand Digital Economy Proposed
law, Debate, Banker
worries
Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 5th
ed., Elsevier, 2012.
last update 23 May
2016