s21

The set of tools composed of 
rz compiler   
rzvm  rz virtual machine (rz interpreter)
as2  s2 assember
s2   s2 simulator  

rz compiler translates an input file in rz language into an executable byte-code file and a s2 assembly language file.  The byte-code can be executed using rzvm interpreter.  The s2 assembly language can be translated (using as2) into machine codes (in a loadable format) that can be executed using s2 simulator.

rz source file -->[rz] --> file.out --> [rzvm] --> output
                           file.lst  (listing of byte-codes)
                           file.s2  --> [as2] --> file.obj --> [s2] --> output (out.txt)
                                                  file.lis  (listing of s2 with locations)
rz compiler tools kit

The tools kit composed of 4 main parts: lex, parse, compiler, code generator.  The first three are independent of target machine code.  lex has lexgen.c that generates token.c  automatically from an input specification.  parse has pgen.c that generates parse.c from a grammar.  lex is almost independent from other modules and has only one export function token().  token() is a recogniser and accepts special symbols (delimiter) and keywords).  string, identifier, and number require to be handled separately in lex.c to simplify lexgen.c.  (it is really simple).  parse and compiler are tightly coupled although I have tried to make them as self-contained as possible.  These generators (lexgen, pgen) achieve 4:1 reduction ratio in writing code for a compiler.  That means it saves upto 4 times amount of writing line of codes.  In my experience of using these tools, it reduces the error in writing a recursive descend compiler and makes it easier to debug too (as the grammar is more compact and easier to understand than codes).  However, by the simple nature of tools, it requires intimate understanding of pgen.c in order to make change to grammar and the generated byte-code.

rz language

RZ is a descendant of R1, a concurrent language for small control applications.  (Have a look at full report and implementation from my research work web page).  rz is aimed to be a teaching language for system programming and computer architecture subjects, which emphasises a small language that can be used to illustrate all the "inner" working parts of a computer system (compilation, code generation, ISA simulation), in other words from a high level language down to logic gates and bits of a processor.  rz has stripped out all the concurrency features of R1, retaining the byte-code and interpreter.  The language syntax has the look of C.  rz language can be summarised as follows:

-  it has only integer as primitive data.  (16-bit or 32-bit depends on the target machine)
-  Variables are either global or local.  Global variables must be declared before their use.  Local variables are automatic (not required to declare).
-  Global variables can be array.  The size of array must be known at compile time.  An array has only one dimension.  Local variables can be only scalar.
-  reserved words are:  if, else, while, return, print.
-  operators are:  +, -, *, /, ==, !=, <, <=, >, >=, !, &&, ||, * (indirection), & (address).
(for C programmer, please note, no: for, break, do, missing many operators especially ++, --)

It is easier just to look at an example to know most of the syntax.

Example:  a matrix multiplication.  A matrix (2 dimensions) is converted into one dimension array by the index calculation i*N + j  where i,j are 2 dimension index 0..N-1 and N are the size of matrix NxN.

// matrix multiplication 

//  N = 4, simulate matrix by (i,j) = i*N + j
N, a[16], b[16], c[16]; 

index(i,j){
  return i*N + j;
}

matmul() {
  i = 0;
  while( i<N ){
    j = 0;
    while( j<N ){
      s = 0;
      k = 0;
*      while( k<N ){
*        s = s + a[index(i,k)] * b[index(k,j)];
*        k = k + 1;
*      }
      c[index(i,j)] = s;
      j = j + 1;
    }
    i = i + 1;
  }
}

The call by reference can be achived using the * and & operators just like in C.

rz byte-code

rz byte-code is a post-fix zero address code.  Some code has parameters (so it can be called not a zero address, or one address, more likely to be a half address) for performance reason.  The full detail can be found in the R1 final report.  The following is the code from the inner loop of matrix multiplication program above (the lines with *).

  270 Rval 1 	;;   i
  273 Rvalg 1   ;;   N
  276 Lt 	;; while i<N
  277 Jz 331  	;; exit while loop
  280 Lval 2    ;;   &s, for s = ...
  283 Rval 2    ;;   s
  286 Lvalg 2   ;;   &a[]
  289 Rval 4    ;;   i
  292 Rval 1 	;;   k
  295 Call 200  ;;   index(i,k)
  298 Index 	;;   &a[index(i,k)]
  299 Fetch 	;;   a[index(i,k)]
  300 Lvalg 18 	;;   &b
  303 Rval 1 	;;   k
  306 Rval 3 	;;   j
  309 Call 200 
  312 Index 	;;   similar to a[]
  313 Fetch 	;;   b[index(k,j)]
  314 Mul 
  315 Add 
  316 Set 	;;   s=s+a[..]*b[..]
  317 Lval 1 
  320 Rval 1 
  323 Lit 1 
  326 Add 
  327 Set 	;;   k = k + 1
  328 Jmp 270   ;; end while loop
  331   . . .

A bit of explanation is in order.  Rval, Rvalg is the right value of a local (global) variable, similarly Lval, Lvalg the left value (or the address).  Lt, Mul, Add, Set are binary operators with obvious meaning (Set takes an address and a value and assign the value to that address).  Fetch takes an address and gets the value. Index is just adding the offset with the base address to get the address of the element in an array. Jz ads, jumps if top of stack is zero, Jmp is an unconditional jump.

How rz compiler works 

The compiler is a simple recursive descend with one symbol look ahead (LL1) parser.  The compiler is one pass, that is, it scan the source file only once.  While it scans the source it builds up the symbol table which stores important information required to generate an executable output (such as the reference to a symbol, its address).  

Declaration of a global variable will allocate a place in the data segment.  Declaration of a function similarly will allocate the subsequent codes to the code segment and its reference is the current location of the code segment.  Inside a scope of a function definiton, any symbol is either global or local.  When the parser finds an identifier, it looks up in the symbol table, first the local symbol, and then the global symbol, if not found that identifier is deemed to be a local variable and is inserted into the local symbol table.  (This is a correct behaviour because if the symbol is global, then it must already been found in the global symbol table as the syntax of the language stated that the global variable must be declared before its use hence it must already been in the symbol table).  The reference to a local variable is just a number, ordering in order of its occurance.  This reference is really an offset into an activation record which is created at runtime.  An activation record is a dynamic data structure used to store the information required in a function call, called it a context.  This context includes a return address to the caller, all the local variables and the actual parameters (the parameters passed from the caller to callee) and perhaps some other useful information, depends on the implementation of the run-time system for a language execution.

While the compiler is compiling some line of source code, some information is not known, such as the reference to a function that is not yet defined, hence the correct address to generate the code for a function call is not yet determined.  To achieve one pass, the technique of "coming back to complete the work later" is used.  (this is an obvious and popular technique but easily overlooked).  The compiler generates the code for a function call using the index to the symbol table as its address.  When all lines of the source program are translated, the symbol table will contain a complete information of all symbols (including the correct references).  The byte-code output is scanned from the beginning to the end and the reference of function calls are updated with the information from the (now complete) symbol table.  The update is easy because the reference in the function call in the byte-code is the index to the symbol table.

While compiling, some consistency checking is easy to do, such as, the number of arguments of a function call.  This information is stored in the symbol table.  As a symbol is inserted into the symbol table when it is first encountered, the number of arguments is known at that time.  Subsequently, when the name of function appears again in the source, the number of arguments for that call can be checked against what is known in the symbol table.  The type of a variable is checked in a similar manner.  Counting is not easily done in a grammar, therefore it is done in a special routine (in stmt.c).  

The compiler and the tools set accompanied with it are designed with emphasise on simplicity (aim for teaching purpose). (All sources are totally about 2500 lines of code including the target assembly code generator). However, as simple as they are, they are powerful enough to construct a working compiler with adequate checking to enable it for a practical use in many applications.  Some short falls are still remainded, notably the lack of "error-recovery" when the syntax error occurs.  In all, this is (to my mind) a good balance between usefulness and simplicity.

(to write more on code generation, code optimisation, and also how lexgen and pgen really work)

4 January 2002
Prabhas Chongstitvatana
