nut compiler
The whole package is "nut31.zip". It contained nut31 compiler, nsim31
simulator, nut-compiler in nut (in the directory /test/nut.txt).
I am going to describe nut compiler which is written in nut. We will
use "nut" as a common language to describe software system used in this
course. A compiler can be written using the target language (of the
compiler). Writing a compiler with its own target language demonstrates
two points:
1) that the target language is not trivial. At least it can
be used to write some complex program such as a compiler. This
shows a kind of "completeness" of the language.
2) that the understanding of meaning the language is complete
enough to use it to write such a non-trivial program.
and I have my favourite third point:
3) that it is beautiful. In a sense that the language is
self-describing.
Writing a compiler with the target language is not new (for example you
can write a C compiler in C and compile your C compiler into the
executable code using any existing C compiler). (1) It has been
practiced especially in the early days of computer software
development. Some "conceptual difficulty" must be overcome concerning
the confusion of the "compiler" and the "compiled" program (since we
can use the compiler to compile itself!). The run-time facilities
are always posing difficulty as the memory is shared between the
compiler and the compiled program. However, these points are not
a priority in our study. With some experience, you will soon
overcome all these hurdles by yourself.
(1) The question arises as how the first compiler was written?
compiler
input source program (in
nut)
output n-code
code generator
input n-code
output machine code of a
specific processor
To develop "nut-in-nut" compiler, we will use a special version of nut
compiler (written in C), "nut31". This version of nut-compiler
has many additional operators that support compilation. To run
the compiler, a special version of the "n-code virtual machine" (or the
interpreter for n-code), called "nsim31" is used. nsim31 contains
many supporting functions to facilitate compilation (which are
cumbersome to write in nut or which we are not interested in
discussing).
Next, I will explain the output of the compiler. It is important
to understand the n-code, it is what the compiler must produce.
N-code
An example
the source program (to be compiled):
(def
add1 x () (+ x 1))
(def main () ()
(sys 1 (add1 2))
where "sys 1 x" is a system call (analogous to OS call for I/O) that
will print an integer x to the screen.
the n-code of the above program in "human-readable" form is:
add1
(fun.1.1 (+ get.1 lit.1 ))
main
(fun.0.0 (sys.1 (call.17 lit.2 )))
and in "absolute" object form (as in the output file a.obj):
22
22
2 1 16 1 0
4 1 14 1 2
6 1 6 0 4
8 0 0 6 0
10 1 19 257 8
12 1 16 2 0
14 1 13 10 12
16 0 0 14 0
18 1 20 1 16
20 0 0 18 0
22 1 19 0 20
0
2
17 add1 3 10 1 1
19 main 3 22 0 0
There are three parts in the object file:
1) the code
2) the data (the line contains a zero)
3) the symbol table (use for debugging purpose)
The code is a contiguous block of memory. The code "main" started at
22. The format of the object code is:
address type op arg next
type 0 is dot-pair
type 1 is atom
op arg is the operator
next is the address of the next cell
Some explanation:
22
1 19 0 20
says it is an atom "fun.0" next is 20
20
0 0 18 0
says it is a list (dot-pair), the head pointed to 18 and the next is
NIL.
18
1 20 1 16
says it is an atom "sys.1", next is 16 (the argument of sys.1)
16
0 0 14 0
says it is a list, the head is 14, next is NIL.
14
1 13 10 12
says it is an atom "call.10", next is 12 (the argument of the function)
12
1 16 2 0
says it is an atom "lit.2", next is NIL.
at 10, is the "fun.x" (the function "add1") etc.
Compiler
You can see the whole compiler in "nut.txt"
(in
the directory /test/). The whole compiler is about 500 lines. The
short pseudo code to show the "overview"
of the compiler is in "nut3-compiler.txt" in /doc directory.
The compiler has four main functions:
main
readinfile
parse
resolve
outobj
"readinfile" reads the source program from a standard input stream
(stdin in Unix). The compiler reads the whole input at once and
keeps it in a big array of characters (maximum input size is 50 Kbytes).
"parse" is the main parser that scans the input stream and generates
n-code.
"resolve" performs renaming and binding of the actual code to generate
the executable code.
"outobj" prints out the object file to a standard output (stdout).
We will concentrate on "parse". To understand "resolve" you need
to know the run-time system which is the topic of the next week lecture.
Nut has a trivial syntax (by design) hence the parser for nut is very
simple. Our parser is a "recursive descent" parser. It
calls parsing routines recursively with one look ahead symbol and never
backtrack (it is called LL(1) parser). This is a simple and
straightforward kind of parser. You can read more about this kind
of parser in any standard textbook in compiler.
Before we look into the parser, we need to be able to scan the input
which is the stream of characters and forms "token". This is
called "lexical analyser". There are only three separators in
nut: space, "(" and ")". "tokenise" is one of the system call
that is implemented in the "nsim". It passes the token as a
string of characters (string of nut language). Here is the sample
use of "tokenise":
(let
tok)
(def tokenise () ()
(set tok (sys 3))
(def testtok () ()
(do
(tokenise)
(while (!= (vec tok 0) EOF)
(do
(prstr tok)
(space)
(tokenise)))))
where "prstr" is a function to print a nut string. The end of
file is signified by the first character of the returned string to be
EOF (127). When run testtok with the example stream, here is the
output:
(
def add1 x ( ) ( + x 1 ) ) ( def main ( ) ( ) ( sys 1 ( add1 2 ) )
Now we take a look at the parser
parse
tokenise
while not EOF
expect "("
tokenise
if token ==
"def"
parseDef
if token ==
"let"
parseLet
if token ==
"enum"
parseEnum
tokenise
The parser gets a token and call "parseDef" or "parseLet" or
"parseEnum" according to the token then loops until the end of file.
parseDef
tokenise
; get fun name
parseNL
; get formal arg
parseNL
; get local
tokenise
e = parseExp
; get body
tokenise
; skip ")"
update symtab
out (fun.k e)
"parseDef" parses the "header" of the function definition then the main
part is "parseExp" to parse the body of the function definition.
The "header" of a function definition composed of:
(def
add1 x () ...)
"add1" the function name
"x" the list of formal parameters
"()" the list of local variables
the list of formal and locals is parsed by "parseNL" (parse name list)
which will store the formal and local names to the symbol table and
gives them the references as a running number 1..n.
For sake of clarity, let's assume that the name list will not be an
atom.
parseNL
tokenise
while token != ")"
installLocal
token
tokenise
"parseNL" merely gets a name and stores it in the symbol table using
"installLocal" until exhausting the list (found the token ")" ).
Before getting into "parseExp", let's study the symbol table.
Symbol table
The symbol table is a one-dimension array of the "entry". Each
entry has five fields: name, type, value, arity, lv. "name" is a
pointer to a string, the "symbol". "type" shows the type of symbol:
2 tyVAR is a local
variable
3 tyFUN is a function
4 tyOP is an operator
5 tyOPX is an op that has one special argument
6 tySYS is "sys.x"
7 tyUD is undefined
8 tyGVAR is a global variable
9 tyXX --
10 tyENUM is an enum symbol
"value" stores the value of the symbol, which is a reference for
function, local/global variable, the opcode of an operator, the syscall
number of "sys.x" or the value of an enum symbol.
"arity" and "lv" are for the function symbol, storing its arity and
total number of local variables.
"install nm" does a scan for "nm" in the symbol table (a sequential
search, for simplicity). If it is already present, returns the
index to that entry. If it is a new symbol, insert it into the
symbol table and returns its index.
Initially, all keywords are primed into the symbol table. They
are treated the same as any other symbol.
;
it a list, number, string or names
parseExp
if token == "("
tokenise
nm = parseName
e = parseEL
out (nm e)
if isNumber token
n = atoi token
out lit.n
if isString token
e = makestring
token+1
out str.e
parseName
; it is OP, OPX, VAR, FUN
"parseExp" parse an expression in nut language. An expression can
be: a name (such as a variable), a number, a constant string, a list
(function application).
For a list, "parseExp" uses two auxilliary functions to parse:
parseName, parseEL (expression list).
parseName
get symbol token
v = symbol.val
switch type
OP: out v
VAR: out get.v
GVAR: out ld.v
FUN: out
call.idx
OPX:
tokenise ; get var name
get symbol token
v2
= symbol.val
if
type == VAR
switch v
SET: out put.v2
SETV: out stx.v2
VEC: out ldx.v2
else if type == GVAR
switch v
SET: out st.v2
SETV: out sty.v2
VEC: out ldy.v2
SYS:
tokenise
k
= atoi token
out sys.k
ENUM:
out lit.v
"parseName" takes one token and depends on its type, it outputs an
appropriate n-code.
OP an operator, out
opcode
VAR a local var, out get.ref
GVAR a global var, out ld.ref
FUN a fun, out call.ref
SYS a system call, gets sys num, out sys.k
ENUM a enum symbol, out lit.ref
OPX a special op
OPX takes the next token as an "unevaluated name", i.e. a reference of
that name, not its value. Three operators are OPX: "set", "setv" and
"vec". Two possibilities for the next token, either it is a local
variable or a global variable.
local out put/stx/ldx.ref
global out st/sty/ldy.ref
parseEL
tokenise
if token == ")" return NIL
e = parseExp
e2 = parseEL
out (e e2)
The last piece of the parser, "parseEL" recursively parses the rest of
the list.
How to compile and
run nut-compiler
I have written "nut31" in C to be the compiler that will compile
nut-compiler (in the source file "nut.txt"). nut31 outputs "a.obj".
c:>nut31
< nut.txt
!=
(fun.2.2 (if (= get.1 get.2
)lit.0 lit.1 ))
and
(fun.2.2 (if get.1 get.2 lit.0 ))
...
main
(fun.0.0 (do (call.79 )(call.124
)(call.157 )(call.29 )(st.3 (sys.9 ))(call.145 )(call.155 )(call.156
)(call.23 )))
A lot of "human-readable" n-code are displayed on the screen. It
can be visually check whether it is correct (no error message).
The object code can be executed under the "nsim31" simulator.
nsim31 loads "a.obj" by default (to save "stdin" for nut-compiler to
use to read its source) and then starts the compilation.
c:>nsim31
< t2.txt
add1
(fun.1.1 (+ get.1 lit.1 ))
main
(fun.0.0 (sys.1 (call.75 lit.2 )))
9392 9392
9372 1 16 1 0
9374 1 14 1 9372
9376 1 6 0 9374
9378 0 0 9376 0
9380 1 19 257 9378
9382 1 16 2 0
9384 1 13 9380 9382
9386 0 0 9384 0
9388 1 20 1 9386
9390 0 0 9388 0
9392 1 19 0 9390
0
3
add1 3 9380 1 1
x 2 1 0 0
main 3 9392 0 0
The whole output can be redirect to a file, then cut only the code
segment to be the output "a.obj". This object code can be executed
under "nsim31".
c:>nsim31
< t2.txt > t2.obj
edit "t2.obj" to eliminate surplus listing at the beginning.
c:>ren
a.obj nut.obj
c:>copy t2.obj a.obj
rename the previous "a.obj" which is the n-code of nut-compiler to save
it in other name, "nut.obj", and prepare t2.obj to be executed under
"nsim31" by renaming it to "a.obj". Now, run it.
c:>nsim31
3
c:>
End of lecture
Prabhas Chongstitvatana
22 June 2006