Compilation scheme



A source program is compiled into postfix codes.  The control flow becomes jumps.

notation:  op:arg   /label

if e1 e2 e3     =>  e1 jf:a e2 jmp:b /a e3 /b
while e1 e2     =>  jmp:a /b e2 /a e1 jt:b
for e1 e2 e3    =>  .... [*** for loop]
case e0 ecase   =>  case lit:low lit:hi jmp:else jump-table ecase
v = e           =>  e st:v  (global var)
                =>  e put:v (local var)
v               =>  ld:v    (global var)
                =>  get:v   (local var)
f arg*          =>  arg* call:f
n               =>  lit:n
e1 binop e2     =>  e1 e2 binop
unaryop e       =>  e unaryop

array index

v[i] = e        =>  ld/get:v i e stx
 ... = v[i]     =>  ld/get:v i ldx ...


How to compile for

"for i start end body" means

i = start
while i <= end
  body
  i = i + 1

A new local variable is created to be used to store "end".   In terms of speed, it is faster because the "end" is evaluated and stored in a local variable just once in the initialization, so testing "if i <= end" is faster than evaluate the expression every time around the loop.

For the parser, most work is easy (writing a grammar for a parser generator).  The code generation is simple too.  A new local is created to store "stop" value, let's call it "end". The simple form is: (assuming i is local)

  lit start
  put i  
  lit stop
  put end
  /loop
  get i
  get end
  le
  jf exit
  body
  inc i
  jmp loop
  /exit

This is inefficient as it executes two jumps (jf, jmp) per loop. The following form requires only one jump in the loop.

    lit start
  put i
  lit stop
  put end
  jmp test
  /loop
  body
  inc i
  /test
  get i
  get end
  le
  jt loop

This is also applicable to while loop.

How to compile case

grammar

'enum '{ [num] label1 ... labeln '}
'case ex0 '{ caselist '}
caselist ->
  label ': ex caselist |
  'else ': ex |      
  nil

'else case must be the last case (if present)

Example

enum { 10 add sub mul div }

case op
  add : doadd
  sub : dosub
  ...
  else : error

To use symbolic constant, lex() grabs var and looks it up, if its type is "constant" then lex returns NUM.  

Compiling 'case generates table-lookup instruction of the form:

<case, else, n, value1, goto1, ... valuen, goton>

follows by the code of each caseitem.  The code for each caseitem is ended with jump to exit. (perhaps using "break" which is transformed into jump)

where n is the number of caselist, else is the default goto, value is the case label.  The jump list is sorted according to value to enable the matching (searching) in log n time using for example binary search.  This is very similar to lookup instruction in JVM.

The above compare-and-jump table is the most general form. It can handle all cases of "case".  However it is not the most efficient method.  A jump table can be constructed that take constant time in searching.  When the case label is densed, which is usually the case, the index can be used to access the jump table directly without search.  

<case2, low, hi, else, goto1...goton>

where low, hi is the range of label.  If the index is out of range then use "else".  hi-low+1 is the size of jumptable.  index-lo is used to access the gotos.  This is the form we used in Som compiler. Each case action is compiled into a normal sequence of code ended with jmp:end.  The entry into the jumptable is the code "jmp" to the corresponding ex.

case e0 ecase   =>  
  e0
  case
  lit:low
  lit:hi
  jmp:else
  jmp:1
  ...
  jmp:n        // jump table
  /1
  e1
  jmp:end      // each case action
  /2
  ...
  /n
  en
  jmp:end
  /else
  e-else       // default action
  /end

Code generation for case

Using 'case required 'enum to generate label. However, enum is processed by the parser and code generator does not have to know about it. Generating code for case is a bit complicate but executing it by an interpreter is a simple O(1) (indexed table look up) operation.

Assume the whole case statement has already been parsed and the number of caselist is known, low and high labels are known. The jump table is filled with all entries between low and hi.

algorithm gencase

1 genex(ex0)
2 outs(icCase)
3 look into the block and get low, hi labels.
4 outa(icLit,low)
5 outa(icLit,hi)
6 mark address here a1 (jumbtable starts at a1+1)
7 outa(icJmp,0)  this is the jump to else case
now generate an empty jumptable from low to hi
8 for i=low i<=hi i++
    outa(icJmp,0)
generate each ex in caselist except exelse
each ex
8.1  get num, patch jumptable at num-low to
       jump to this address
8.2  genex(ex)
8.3  outa(icJmp,backchain)  
backchain is used to patch its location to eloc in the end.
9 genex(exelse)
10 patch jump to exelse at a1
11 patch any entry in the jumbtable that has not been update (ads = 0) to jump to exelse.
12 here is eloc, backpatch the backchain to jump to eloc.

End of gencase

Labels are generated from enum.  enum is processed entirely in the parser using symbol table to hold the enum labels as global symbols with enumerated values (unique integers). Jump table is filled with the entries from smallest label to largest label so to make it a direct jump with a key without searching.  If label is not densed the size of jump table will be large.  Another alternative which might make code generation easier is to use "label-jump" pair in the jump table and use search (linear or binary) to find a matched label.  

the code sequence is:

    ex0
  case2
  <jump table>
  label1, jmp ex1
  ...
  labeln, jmp exn
  -1, jmp exelse
  <end jump table>
  ex1, jmp exit
  ...
  exn, jmp exit
  exelse
  <exit>

algorithm gencase2

1  genex(ex0)
2  outs(icCase2)
3  mark this a1
4  count no. of case
5  make room for jump table
6  for each ex
7    get label, update jump table
8      putcode(a1,icLit,label)
9      putcode(a1+1,icJmp,here)
10     a1 = a1 + 2
11     genex(ex)
12     outa(icJmp,exit)
13     exit is not known yet
14 if last ex is not else
15   update jump table with -1, jmp exit
16 here is exit
17 backpatch jmp exit in the code

To eliminate backpatch loop, exit can be fixed in the beginning:

    ex0
  case2
  <exit>
  jmp exit
  <jump table>  
  ...

All jmp exit in the ex body will jump to this location.  After code improvement "improv2", the code will be shortcut.

interactive mode

To make the programming system interactive, that is, the input source program can be typed in and the compilation translates the source into S-codes which then are executed immediately, all expressions must be stored in the code segment whether they are defined functions or not.  We need some markers in the code segment to mark out the defined function so that when a S-code file is loaded it is possible to know which section is to be executed immediately and which section is the function definition.  The symbol table is not required to execute the S-code, however the presence of a symbol table facilitates the debugging.  

A defined function is marked in CS by "fun m" and ended with "ret n". The layout in CS is like this:

to name arg* e  =>   fun m, ... body ..., ret n.

Toplevel

At the toplevel, evaluating an expression returns a value which may or may not be used. The unused values accumulate, they are purged when "ret" is performed.  

e   =>  e end

The action at the toplevel is as follows.  Once S-code is loaded (or having been translated into CS), the execution begins.  The state of computation is created causing changes in DS and SS.  The "end" will return the control back to the console.

[to write more on the compiler itself]

27 Sept 2004