RZ-language
Tools set source RZ-S2 tools set v.1.1 (June 2003)
RZ compiler tools kit
How RZ compiler works
Hand-on: a session on RZ-S2
S2 processor and assembly language programming
S2 code generation ( a more detailed version )
S2 assembler
S2 simulator
Modification to S2 microarchitecture
S2 slightly improved instruction set version 2.1 (March 2007)
rz source file
|
V
[rz] --> file.out --> [rzvm] --> output
file.lst (1)
file.s2
|
V
[as2] --> file.obj
--> [s2] --> output (out.txt)
file.lis
(2)
(1) listing of byte-codes
(2) listing of s2 with locations
These generators (lexgen, pgen) achieve 4:1 reduction ratio in writing code for a compiler (grammar is 1/4 of the size of the parser generated from it). That means it saves upto 4 times amount of writing line of codes. In my experience of using these tools, it reduces the error in writing a recursive descend compiler and makes it easier to debug too. This is because the grammar is more compact and easier to understand than codes. However, by the simple nature of tools, it requires intimate understanding of pgen.c in order to make change to grammar and the generated byte-code.
RZ compiler produces a number of output files:
file.s2 is the
assembly language of s2
file.out is the byte code of RZ virtual machine.
file.lst is the listing of byte code output in
symbolic
form.
The byte code file can be "executed" using the RZ interpreter
(rzvm).
So, testing the program can be done independent of s2 ISA. The
format
and the meaning of byte codes and other information related to the
language can be found in the R1 project
page.
Variable declaration
Declaration of a global variable will allocate a place in the data
segment.
Declaration of a function similarly will allocate the subsequent codes
to the code segment and its reference is the current location of the
code
segment. Inside a scope of a function definiton, any symbol is
either
global or local. When the parser finds an
identifier, it looks up in the symbol table, first the local symbol,
and then the global symbol, if not found that identifier is deemed to
be
a local variable and is inserted into the local symbol table.
(This
is a correct behaviour because if the symbol is global, then it must
already
been found in the global symbol table as the syntax of
the language stated that the global variable must be declared before
its use hence it must already been in the symbol table). The
reference
to a local variable is just a number, ordering in order of its
occurance.
This reference is really an offset into an activation record which is
created
at runtime.
Activation record
An activation record is a dynamic data structure used to store the information required in a function call, called it a context. This context includes a return address to the caller, all the local variables and the actual parameters (the parameters passed from the caller to callee) and perhaps some other useful information, depends on the implementation of the run-time system for a language execution.
Function Call
While the compiler is compiling some line of source code, some information is not known, such as the reference to a function that is not yet defined, hence the correct address to generate the code for a function call is not yet determined. To achieve one pass, the technique of "coming back to complete the work later" is used. (this is an obvious and popular technique but easily overlooked). The compiler generates the code for a function call using the index to the symbol table as its address. When all lines of the source program are translated, the symbol table will contain a complete information of all symbols (including the correct references). The byte-code output is scanned from the beginning to the end and the reference of function calls are updated with the information from the (now complete) symbol table. The update is easy because the reference in the function call in the byte-code is the index to the symbol table.
While compiling, some consistency checking is easy to do, such as, the number of arguments of a function call. This information is stored in the symbol table. As a symbol is inserted into the symbol table when it is first encountered, the number of arguments is known at that time. Subsequently, when the name of function appears again in the source, the number of arguments for that call can be checked against what is known in the symbol table. The type of a variable is checked in a similar manner. Counting is not easily done in a grammar, therefore it is done in a special routine (in stmt.c).
The compiler and the tools set accompanied with it are designed with emphasise on simplicity (aim for teaching purpose). (All sources are totally about 2500 lines of code including the target assembly code generator). However, as simple as they are, they are powerful enough to construct a working compiler with adequate checking to enable it for a practical use in many applications. Some short falls are still remainded, notably the lack of "error-recovery" when the syntax error occurs. In all, this is (to my mind) a good balance between usefulness and simplicity.
(to write more on code generation, code optimisation, and also how lexgen and pgen really work)
.s
temp 101
N 10
. . .
.cEach line of code can be proceeded with a label (prefix with :) which its name can be used as an operand to an instruction. An instruction is written is the form:
:label add r1 r2 r3
jmp always exit
. . .
:exit trap stop
opcode dest source1 source2 . . .
The addressing mode can be expressed as:
ld r1 temp means absolute addresswhere all operands can be symbolic (prefix with null, @, +, #).
ld r1 @10 r2 means ld r1,10(r2)
ld r1 +r2 r3 means ld r1,(r2+r3)
ld r1 #n means load immediateadd r1 r2 r3 means r1 = r2 + r3
add r1 r2 #10 means r1 = r2 + 10
.wThe pseudo command .a sets the address. All sections can be ordered in any sequences. The last line of a program is .e The assembler produces the loadable file (file.obj) which the format is suitable to be used by the s2 simulator. The assembler also produces a listing file (file.lis) for cross reference purpose (for debugging). The listing tells where all labels are. The predefined symbols in the assembler are:
10
20
temp
:ads variableA
ads ;; this is the constant which is the address of variableA
r0 .. r31, all opcodes, always, eq, neq, lt, le, ge, gt, stop.
The symbols are not case sensitive (they are all translated into uppercase inside the hash table). The assembler does not rigorously check all validity of the assembly language input file.
t
step,
execute
one
instruction at the current pc
g
run
from
the
current pc until it stops (by trap stop or a breakpoint)
b ads set break point
c
clear
all
break
points
d ads n display the memory n words
started
from ads
r
display
all
registers
m a1 a2 n move memory n words from a1 to a2.
this
can
be
used to fill the memory when a1+n and a2 are overlapped.
s dest value set the following destination:
r0..r31
m0..m20000
pc, s, z
z 0/1 turn on/off the
trace
o
display
the
output
buffer
h
display
summary
of
commands
q
quit
and
store
the output buffer to file "out.txt"
The memory is limited at 20000 locations. The usual interaction with the simulator is set the breakpoint to some line, execute the program to that point, inspect the values in register or memory, and continue.
In the loop, the break point is usually set to some place within the loop, once it reaches there, to continue, uses step to step over that line (t) and continue (g), the program will go through the loop, coming back to the break point again. Once the program stops the simulator will report the number of instruction executed and the CPI (calculated from the assumption about the number of cycle required by each instruction). To get more measurement such as the frequency of use of each instruction, you have to modify the simulator (which can be done easily).
<LOADINDEX>This is actually one clock faster than using ALU (as it is not necessary to store the output to T). Compare this with the original microsteps.
IFETCH
MAR = ADDER(R[IR:R2],R[IR:R3])
MDR = M[MAR]
R[IR:R1] = MDR
last update 20 July 2011