RZ compiler tools kit

RZ-language
Tools set    source  RZ-S2 tools set  v.1.1    (June 2003)
RZ compiler tools kit    
How RZ compiler works
Hand-on: a session on RZ-S2

S2 processor and assembly language programming
S2 code generation   ( a more detailed version )
S2 assembler
S2 simulator
Modification to S2 microarchitecture
S2 slightly improved instruction set version 2.1   (March 2007)

Tools set

The s2 tools set composed of RZ compiler (rz), S2 assembler (as2) and s2 simulator (s2).  You write the source code in RZ language (a small language, C look alike). RZ compiler translates an input file in rz language into an executable byte-code file and a s2 assembly language file.  The byte-code can be executed using rzvm interpreter.  The s2 assembly language can be translated (using as2) into machine codes (in a loadable format) that can be executed using s2 simulator.  RZ itself comes with an interpreter (rzvm), therefore the source code in RZ can be tested independent of s2 ISA using this interpreter.

rz source file
  |
  V
[rz] --> file.out --> [rzvm] --> output
         file.lst  (1)
         file.s2
           |
           V
         [as2] --> file.obj --> [s2] --> output (out.txt)
                   file.lis  (2)

(1) listing of byte-codes
(2) listing of s2 with locations

RZ compiler tools kit

The tools kit composed of 4 main parts: lex, parse, compiler, code generator.  The first three are independent of target machine code.  lex is lexgen.c that generates token.c  automatically from an input specification.  parse is pgen.c that generates parse.c from a grammar.  lex is almost independent from other modules and has only one export function token().  token() is a recogniser and accepts special symbols (delimiter) and keywords.  String, identifier, and number require to be handled separately in lex.c to simplify lexgen.c.  (It is really simple).  Parser and compiler are tightly coupled although I have tried to make them as self-contained as possible.

These generators (lexgen, pgen) achieve 4:1 reduction ratio in writing code for a compiler (grammar is 1/4 of the size of the parser generated from it).  That means it saves upto 4 times amount of writing line of codes.  In my experience of using these tools, it reduces the error in writing a recursive descend compiler and makes it easier to debug too. This is because the grammar is more compact and easier to understand than codes.  However, by the simple nature of tools, it requires intimate understanding of pgen.c in order to make change to grammar and the generated byte-code.

RZ compiler produces a number of output files:

file.s2     is the assembly language of s2
file.out    is the byte code of RZ virtual machine.
file.lst     is the listing of byte code output in symbolic form.

The byte code file can be "executed" using the RZ interpreter (rzvm).  So, testing the program can be done independent of s2 ISA.  The format and the meaning of byte codes and other information related to the language can be found in the R1 project page.
 

How RZ compiler works

The compiler is a simple recursive descend with one symbol look ahead parser.  The compiler is one pass, that is, it scans the source file only once.  While it scans the source it builds up the symbol table which stores important information required to generate an executable output (such as the reference to a symbol, its address).

Variable declaration

Declaration of a global variable will allocate a place in the data segment.  Declaration of a function similarly will allocate the subsequent codes to the code segment and its reference is the current location of the code segment.  Inside a scope of a function definiton, any symbol is either global or local.  When the parser finds an identifier, it looks up in the symbol table, first the local symbol, and then the global symbol, if not found that identifier is deemed to be a local variable and is inserted into the local symbol table.  (This is a correct behaviour because if the symbol is global, then it must already been found in the global symbol table as the syntax of the language stated that the global variable must be declared before its use hence it must already been in the symbol table).  The reference to a local variable is just a number, ordering in order of its occurance.  This reference is really an offset into an activation record which is created at runtime. 

Activation record

An activation record is a dynamic data structure used to store the information required in a function call, called it a context.  This context includes a return address to the caller, all the local variables and the actual parameters (the parameters passed from the caller to callee) and perhaps some other useful information, depends on the implementation of the run-time system for a language execution.

Function Call

While the compiler is compiling some line of source code, some information is not known, such as the reference to a function that is not yet defined, hence the correct address to generate the code for a function call is not yet determined.  To achieve one pass, the technique of "coming back to complete the work later" is used.  (this is an obvious and popular technique but easily overlooked).  The compiler generates the code for a function call using the index to the symbol table as its address.  When all lines of the source program are translated, the symbol table will contain a complete information of all symbols (including the correct references).  The byte-code output is scanned from the beginning to the end and the reference of function calls are updated with the information from the (now complete) symbol table.  The update is easy because the reference in the function call in the byte-code is the index to the symbol table.

While compiling, some consistency checking is easy to do, such as, the number of arguments of a function call.  This information is stored in the symbol table.  As a symbol is inserted into the symbol table when it is first encountered, the number of arguments is known at that time.  Subsequently, when the name of function appears again in the source, the number of arguments for that call can be checked against what is known in the symbol table.  The type of a variable is checked in a similar manner.  Counting is not easily done in a grammar, therefore it is done in a special routine (in stmt.c).

The compiler and the tools set accompanied with it are designed with emphasise on simplicity (aim for teaching purpose). (All sources are totally about 2500 lines of code including the target assembly code generator). However, as simple as they are, they are powerful enough to construct a working compiler with adequate checking to enable it for a practical use in many applications.  Some short falls are still remainded, notably the lack of "error-recovery" when the syntax error occurs.  In all, this is (to my mind) a good balance between usefulness and simplicity.

(to write more on code generation, code optimisation, and also how lexgen and pgen really work)

S2 code generation

The s2 code generation can be described more as a "binary translation" from the R1 byte code to s2 assembly language.  Each byte code is translated into a sequence of s2 assembly.  Some register allocation and peephole optimisation are preformed during the translation.  The calling convention follows R1 virtual machine convention which generates the activation record containing local variables in reverse order.  The call is transformed into a sequence of s2 assembly language.

S2 assembler

The output from RZ compiler (file.s2) must be assembled by an assembler to produce  the machine code.  The output from the assembler is a loadable file format suitable for the s2 simulator.  The assembly language of s2 consisted of s2 ISA and pseudo command (.s, .a, .c, .w, .e).  In general, the input for the assembler is as follows:

symbolic declaration

This section declares the binding of symbolic names to their value.  The value must be a numeric constant.
.s
temp 101
N 10
. . .

code section

.c
:label  add r1 r2 r3
 jmp always exit
. . .
:exit trap stop
Each line of code can be proceeded with a label (prefix with :) which its name can be used as an operand to an instruction.  An instruction is written is the form:

        opcode dest  source1  source2 . . .

The addressing mode can be expressed as:

ld   r1 temp      means   absolute address
ld   r1 @10 r2    means   ld  r1,10(r2)
ld   r1 +r2 r3    means   ld  r1,(r2+r3)
ld   r1 #n        means   load immediate

add  r1 r2 r3     means   r1 = r2 + r3
add  r1 r2 #10    means   r1 = r2 + 10

where all operands can be symbolic (prefix with null, @, +, #).

data section

Declare the constants or the variables used in the program.  Each symbol occupied one word.  The address of a variable can be declared using the lable.
.w
10
20
temp
:ads variableA
ads  ;; this is the constant which is the address of variableA
The pseudo command .a sets the address.  All sections can be ordered in any sequences.  The last line of a program is .e  The assembler produces the loadable file (file.obj) which the format is suitable to be used by the s2 simulator.  The assembler also produces a listing file (file.lis) for cross reference purpose (for debugging).  The listing tells where all labels are.  The predefined symbols in the assembler are:

        r0 .. r31, all opcodes, always, eq, neq, lt, le, ge, gt, stop.

The symbols are not case sensitive (they are all translated into uppercase inside the hash table).  The assembler does not rigorously check all validity of the assembly language input file.

How the assembler work (link)

S2 simulator

The simulator reads in an object code and controls its execution.  The interface to the command line is:

t               step, execute one instruction at the current pc
g              run from the current pc until it stops (by trap stop or a breakpoint)
b ads        set break point
c              clear all break points
d ads n      display the memory n words started from ads
r               display all registers
m a1 a2 n   move memory n words from a1 to a2.
                this can be used to fill the memory when a1+n and a2 are overlapped.
s  dest value  set the following destination:
   r0..r31
   m0..m20000
   pc, s, z
z 0/1         turn on/off the trace
o             display the output buffer
h             display summary of commands
q             quit   and store the output buffer to file "out.txt"

The memory is limited at 20000 locations. The usual interaction with the simulator is set the breakpoint to some line, execute the program to that point, inspect the values in register or memory, and continue.

 In the loop, the break point is usually set to some place within the loop, once it reaches there, to continue, uses step to step over that line (t) and continue (g), the program will go through the loop, coming back to the break point again.  Once the program stops the simulator will report the number of instruction executed and the CPI (calculated from the assumption about the number of cycle required by each instruction).  To get more measurement such as the frequency of use of each instruction, you have to modify the simulator (which can be done easily).

Modification to S2 microarchitecture

The instruction ld st should NOT affect flags.  As the version 1.0 of s2 microarchitecture uses ALU to calculate the effective address, flags are effected.  I have modified it slightly by using an additional adder to perform address calculation instead of using the ALU.  The adder inputs tie to the register bank and the data bus (using the same input as ALU).  Its output to the data bus (buffer by a suitable buffer) which can transferred to MAR.  The cycle is as follows:
<LOADINDEX>
IFETCH
MAR = ADDER(R[IR:R2],R[IR:R3])
MDR = M[MAR]
R[IR:R1] = MDR
This is actually one clock faster than using ALU (as it is not necessary to store the output to T).  Compare this with the original microsteps.

last update  20 July 2011