2110793  Advanced Topics in Digital Systems   2016 


Class meet on Wed and Fri 9:30-11:00, Eng 3 room 324

Syllabus

This course concentrates on the components and architecture of  modern supercomputers including the increasing role of Graphic Processing Units.  Operating systems and cloud computing will be discussed.

Assesment

homework            20%
midterm project   30%
final                     50%

Announcements

  30 Jan 2016    The s21 tool has been extensively updated, including all documents.  Please use the newest one,  s2cal2.zip
  10 Feb 2016    s2cal3.zip  is the simulator with trace generation.
  27 Mar 2016   s21i.zip   S21 with interrupt
  2  Apr 2016    s21mos.zip   S21 with MOS functions
  15 Apr 2016    s21mos2.zip   MOS in assembly,  s21mos3.zip   MOS with semaphore.
  17 Apr 2016    s30.zip    Multicore simulator,  with all goodies  s30-1.zip 
  20 Apr 2016    Final project is announced.  s30 multicore lecture is updated
  1 May 2016     Explanation of s30 multicore is updated. Update package s30-2.zip   with improved assembler and tracer.
. . .

previous lecture  2015  2013

Topics

Processor
  Instruction level parallelism
Memory subsystems
Graphic Processing Units
Multi-core processor
Concurrency
Exascale system

Lecture

The development of modern processor.
Performance metrics.
We discuss Instruction Level Parallelism based on Chapter 3 of CAQA, Hennessy & Pattern (5th Ed).
The cache discussion is based on Appendix B of CAQA, Hennessy & Pattern (5th Ed).
Processors:  S21 assembly language    S21 instruction set   (updated 2 Apr)
Cache memory
Graphic Processing Unit
Multitask:   S21 with interrupt    Multitask OS
Operating systems and task scheduling  basic OS concepts
    semaphore:  flash demo from William Stallings, U. of Queensland
    lecture from Matt Welsh, Harvard,   semaphores.pdf  
    semaphore in MOS and examples
Multicore processor:     S30 multicore  (updated)  instruction set sheet  S30 instruction
Additional:   Memoize Function  (for implementing a dynamic programming problem)

Homework

1)  Do the calculator project
2)  Write S2 assembly for matrix multiplication (integer).  Check that your program produces the correct answer.
3)  Generate trace from your program in 2).  Keep it in a text file.
4)  Experiment with cache design, using your own trace.
5)  Write S21 program with interrupt to run two processes in ISR.
6)  Try s21mos, write your own concurrent program.  Make sure you do not share any variable between processes. Change DEL and observe the change in the number of interrupts.
7)  Convert all "trap ..." that support MOS into S2 assembly language. Implement these functions: newp(), newStack(), enqueue(), terminate(), nextp().  (a day of work)
8)  Try s21mos2  package, run the demo with two processes (all in assembly).  Write a three processes with non-share data and run it.
9)  Try s21mos3  package, run two demos, one with protected shared variable (mos5.txt), the second one shows two processes synchronise (mos-sync.txt
10)  Use s30-1.zip  package, compare  countm.txt  with mos2.txt  (in terms of number of instruction).  compare sync.txt  with mos-sync.txt  and finally, study  sync2.txt  (run mos version with s21mos3).
11)  Write a program to sum up all elements in an array, using two cores. You can divide array or use "stripping" to pick each element for each core.  Finally, you need to sync two cores and add the final sum.

Tools

S2.1  assembler and simulator  s21.zip    s21cal2.zip  s21cal3.zip (with trace)
S2.1 with interrupt for multitask    s21i.zip
S2.1 with MOS   s21mos.zip
S2.1 with MOS in assembly  s21mos2.zip ,   MOS with semaphore  s21mos3.zip
S3.0  Multicore simulator  s30.zip;   with wfi, intx, sync, semaphore  s30-1.zip
      update package s30-2.zip  with assembler and better tracer.

NPU simulator, assembler  npsim4.zip
Code for Cache simulation to be used in Cache design assignment.  cache.zip
Transistor level simulation of an antique CPU.  6502 is used in the iconic Apple II machine.
    http://visual6502.org/JSSim/index.html

Update s30-2 package

R[0] now is free, unlike s2.1 where it is always zero.  Return address is now in R[31] instead of in a special register.  These instructions are now built-in (not in "trap"):  ei, di, pushm, popm, cid, wfi, intx, sync. Some instructions have been renamed:  pushm is savr, popm is resr.  Some instructions now is not needed: savt, rest (as R[31] can be moved explicitly with mv).  Some OS supports are still trap functions: newsem, wait, signal.  Finally, interrupt works for all cores.

Project

1)  Small project to practice assembly language programming
2)  Hand-in a report of your homework and cache experiment.
3)  Final project,  select a parallel algorithm that is not too difficult to implement in assembly language and not too trivial (like sum an array). Write a multicore program to run that algorithm.  Demonstrate your speed up using many cores (such as 4, 8 cores) compared to single core.  Write a report of your project.  You can discuss your choice of algorithm with me.  Everybody must choose a different algorithm. 
Your reports  are available here:  (I will make them online until 31 May 2016)
3.1)     Believe Network, believe propagation  (a dynamic programming problem)  update with C code
3.2)     Sorting by Cellular Automata
3.3)     Merge sort
3.4)     Clustering
3.5)     Quicksort

Thank you every one for your super participation in this class.  I am more than happy to have a chance to work with all of you.

The calculator

Write a program to behave as a calculator.  There are two internal registers: X, Y.  They behave as a stack of depth two.  Top of stack is X.  The calculator accepts input from the keyboard one-line-at-a-time.   There are five operations on this calculator: + - * / .  The basic four functions are obvious and operate on integers (32 bits).  The . does print the value of X to the calculator screen.  The input to the calculator is the sequence of Reserve Polish Notation (RPN) which is suitable to be performed with a stack storage.  Here is an example of the input sequence:

123
456
+
.


The calcuator "579" on the screen.

How to do it

I make a special "trap" instruction to perform the input function.  trap 3  returns two values. The first one is the type of the input, 1 for number, 2 for operator (i.e. + - * / .). This will be returned in R[30].  The second one is the value of the input.  It will be returned in R[29].  The value for number will be the integer.  The value for operator will be the ASCII code of the first character of that input. (this intends to make the input more flexible, so you can use any printable character).

Here is an example.  The following program will read input and print its ASCII value of the first character.  It will continue forever until you break the execution.

.code 0
:loop
  trap 3         ;; get input
  mv r30 r29     ;; move value to r30
  trap 1         ;; print it out
  jmp loop
.end

PS  I have modified "trap 1" to print out an integer on a new line of its own and decorate with <..> to make it easy to distinguish output from input.  The s21 simulator has been extensively cleanup. The most important change is the "st" instruction has the order of argument change to be similar to "ld", i.e.  ld r1 ads, st r1 ads. (it is the reverse of the original version).

future work

1)   Read the article Kogges, "the top in flops" (link below) and prepare to discuss it in class.
2)   Read and prepare to discuss The Opportunities and Challenges of Exascale Computing

Reference

Exaflop machine    The full report    TR-2008-13.pdf  (3 Mb)
The Opportunities and Challenges of Exascale Computing, (pdf 2 Mb) US Department of Energy, Fall 2010
Kogge, P. "The tops in flops", IEEE spectrum, vol.48, issue 2, Feb 2011, pp.48-54,   (local pdf )
    Digital Object Identifier : 10.1109/MSPEC.2011.5693074
Thailand Digital Economy   Proposed law, Debate, Banker worries
Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 5th ed., Elsevier, 2012.
last update  23 May 2016