nut compiler 



The whole package is "nut31.zip". It contained nut31 compiler, nsim31 simulator, nut-compiler in nut (in the directory /test/nut.txt).

I am going to describe nut compiler which is written in nut. We will use "nut" as a common language to describe software system used in this course. A compiler can be written using the target language (of the compiler). Writing a compiler with its own target language demonstrates two points:

1)  that the target language is not trivial.  At least it can be used to write some complex program such as a compiler.  This shows a kind of "completeness" of the language.
2)  that the understanding of meaning the language is complete enough to use it to write such a non-trivial program.

and I have my favourite third point:

3)  that it is beautiful. In a sense that the language is self-describing.

Writing a compiler with the target language is not new (for example you can write a C compiler in C and compile your C compiler into the executable code using any existing C compiler). (1) It has been practiced especially in the early days of computer software development. Some "conceptual difficulty" must be overcome concerning the confusion of the "compiler" and the "compiled" program (since we can use the compiler to compile itself!).  The run-time facilities are always posing difficulty as the memory is shared between the compiler and the compiled program.  However, these points are not a priority in our study.  With some experience, you will soon overcome all these hurdles by yourself.  

(1) The question arises as how the first compiler was written?

compiler
  input source program (in nut)
  output n-code

code generator
  input n-code
  output machine code of a specific processor

To develop "nut-in-nut" compiler, we will use a special version of nut compiler (written in C), "nut31".  This version of nut-compiler has many additional operators that support compilation.  To run the compiler, a special version of the "n-code virtual machine" (or the interpreter for n-code), called "nsim31" is used.  nsim31 contains many supporting functions to facilitate compilation (which are cumbersome to write in nut or which we are not interested in discussing).

Next, I will explain the output of the compiler.  It is important to understand the n-code, it is what the compiler must produce.

N-code

An example

the source program (to be compiled):

(def add1 x () (+ x 1))
(def main () ()
  (sys 1 (add1 2))

where "sys 1 x" is a system call (analogous to OS call for I/O) that will print an integer x to the screen.

the n-code of the above program in "human-readable" form is:

add1
(fun.1.1 (+ get.1 lit.1 ))
main
(fun.0.0 (sys.1 (call.17 lit.2 )))

and in "absolute" object form (as in the output file a.obj):

22 22
2 1 16 1 0
4 1 14 1 2
6 1 6 0 4
8 0 0 6 0
10 1 19 257 8
12 1 16 2 0
14 1 13 10 12
16 0 0 14 0
18 1 20 1 16
20 0 0 18 0
22 1 19 0 20
0

2
17 add1 3 10 1 1
19 main 3 22 0 0

There are three parts in the object file:
1)  the code
2)  the data (the line contains a zero)
3)  the symbol table (use for debugging purpose)

The code is a contiguous block of memory. The code "main" started at 22.  The format of the object code is:  

address type op arg next

type 0 is dot-pair
type 1 is atom
op arg is the operator
next is the address of the next cell

Some explanation:

22 1 19 0 20

says  it is an atom "fun.0" next is 20

20 0 0 18 0

says it is a list (dot-pair), the head pointed to 18 and the next is NIL.

18 1 20 1 16

says it is an atom "sys.1", next is 16 (the argument of sys.1)

16 0 0 14 0

says it is a list, the head is 14, next is NIL.

14 1 13 10 12

says it is an atom "call.10", next is 12 (the argument of the function)

12 1 16 2 0

says it is an atom "lit.2", next is NIL.

at 10, is the "fun.x" (the function "add1") etc.

Compiler

You can see the whole compiler in "nut.txt" (in the directory /test/). The whole compiler is about 500 lines.  The short pseudo code to show the "overview" of the compiler is in "nut3-compiler.txt" in /doc directory.

The compiler has four main functions:

main
  readinfile
  parse
  resolve
  outobj

"readinfile" reads the source program from a standard input stream (stdin in Unix).  The compiler reads the whole input at once and keeps it in a big array of characters (maximum input size is 50 Kbytes).

"parse" is the main parser that scans the input stream and generates n-code.

"resolve" performs renaming and binding of the actual code to generate the executable code.

"outobj" prints out the object file to a standard output (stdout).

We will concentrate on "parse".  To understand "resolve" you need to know the run-time system which is the topic of the next week lecture.

Nut has a trivial syntax (by design) hence the parser for nut is very simple.  Our parser is a "recursive descent" parser.  It calls parsing routines recursively with one look ahead symbol and never backtrack (it is called LL(1) parser). This is a simple and straightforward kind of parser.  You can read more about this kind of parser in any standard textbook in compiler.

Before we look into the parser, we need to be able to scan the input which is the stream of characters and forms "token".  This is called "lexical analyser".  There are only three separators in nut: space, "(" and ")".  "tokenise" is one of the system call that is implemented in the "nsim".  It passes the token as a string of characters (string of nut language).  Here is the sample use of "tokenise":

(let tok)

(def tokenise () ()
  (set tok (sys 3))

(def testtok () ()
  (do
  (tokenise)
  (while (!= (vec tok 0) EOF)
    (do
    (prstr tok)
    (space)
    (tokenise)))))

where "prstr" is a function to print a nut string.  The end of file is signified by the first character of the returned string to be EOF (127).  When run testtok with the example stream, here is the output:

( def add1 x ( ) ( + x 1 ) ) ( def main ( ) ( )  ( sys 1 ( add1 2 ) )

Now we take a look at the parser

parse
  tokenise
  while not EOF
    expect "("
    tokenise
    if token == "def"
      parseDef
    if token == "let"
      parseLet
    if token == "enum"
      parseEnum
    tokenise

The parser gets a token and call "parseDef" or "parseLet" or "parseEnum" according to the token then loops until the end of file.

parseDef
  tokenise    ; get fun name
  parseNL    ; get formal arg
  parseNL    ; get local
  tokenise
  e = parseExp    ; get body
  tokenise    ; skip ")"
  update symtab
  out (fun.k e)

"parseDef" parses the "header" of the function definition then the main part is "parseExp" to parse the body of the function definition.  The "header" of a function definition composed of:

(def add1 x () ...)
       

"add1" the function name
"x"    the list of formal parameters
"()"   the list of local variables

the list of formal and locals is parsed by "parseNL" (parse name list) which will store the formal and local names to the symbol table and gives them the references as a running number 1..n.

For sake of clarity, let's assume that the name list will not be an atom.

parseNL
  tokenise
  while token != ")"
    installLocal token
    tokenise

"parseNL" merely gets a name and stores it in the symbol table using "installLocal" until exhausting the list (found the token ")" ).

Before getting into "parseExp", let's study the symbol table.

Symbol table

The symbol table is a one-dimension array of the "entry".  Each entry has five fields: name, type, value, arity, lv.  "name" is a pointer to a string, the "symbol". "type" shows the type of symbol:

2  tyVAR   is a local variable
3  tyFUN   is a function
4  tyOP    is an operator
5  tyOPX   is an op that has one special argument
6  tySYS   is "sys.x"
7  tyUD    is undefined
8  tyGVAR  is a global variable
9  tyXX    --
10 tyENUM  is an enum symbol

"value" stores the value of the symbol, which is a reference for function, local/global variable, the opcode of an operator, the syscall number of "sys.x" or the value of an enum symbol.

"arity" and "lv" are for the function symbol, storing its arity and total number of local variables.

"install nm" does a scan for "nm" in the symbol table (a sequential search, for simplicity).  If it is already present, returns the index to that entry.  If it is a new symbol, insert it into the symbol table and returns its index.

Initially, all keywords are primed into the symbol table.  They are treated the same as any other symbol.

;  it a list, number, string or names
parseExp
  if token == "("
    tokenise
    nm = parseName
    e = parseEL
    out (nm e)
  if isNumber token
    n = atoi token
    out lit.n
  if isString token
    e = makestring token+1
    out str.e
  parseName    ; it is OP, OPX, VAR, FUN

"parseExp" parse an expression in nut language.  An expression can be: a name (such as a variable), a number, a constant string, a list (function application).  

For a list, "parseExp" uses two auxilliary functions to parse: parseName, parseEL (expression list).  

parseName
  get symbol token
  v = symbol.val
  switch type
    OP: out v
    VAR: out get.v
    GVAR: out ld.v
    FUN: out call.idx
    OPX:
      tokenise        ; get var name
      get symbol token
      v2 = symbol.val
      if type == VAR
        switch v
          SET: out put.v2
          SETV: out stx.v2
          VEC:  out ldx.v2
      else if type == GVAR
        switch v
          SET: out st.v2
          SETV: out sty.v2
          VEC:  out ldy.v2
    SYS:
      tokenise
      k = atoi token
      out sys.k
    ENUM:
      out lit.v

"parseName" takes one token and depends on its type, it outputs an appropriate n-code.

OP    an operator, out opcode
VAR    a local var, out get.ref
GVAR    a global var, out ld.ref
FUN    a fun, out call.ref
SYS    a system call, gets sys num, out sys.k
ENUM    a enum symbol, out lit.ref
OPX    a special op

OPX takes the next token as an "unevaluated name", i.e. a reference of that name, not its value. Three operators are OPX: "set", "setv" and "vec".  Two possibilities for the next token, either it is a local variable or a global variable.

local    out put/stx/ldx.ref
global    out st/sty/ldy.ref

parseEL
  tokenise
  if token == ")" return NIL
  e = parseExp
  e2 = parseEL
  out (e e2)

The last piece of the parser, "parseEL" recursively parses the rest of the list.

How to compile and run nut-compiler

I have written "nut31" in C to be the compiler that will compile nut-compiler (in the source file "nut.txt"). nut31 outputs "a.obj".

c:>nut31 < nut.txt

!=
(fun.2.2 (if (= get.1 get.2 )lit.0 lit.1 ))
and
(fun.2.2 (if get.1 get.2 lit.0 ))
...
main
(fun.0.0 (do (call.79 )(call.124 )(call.157 )(call.29 )(st.3 (sys.9 ))(call.145 )(call.155 )(call.156 )(call.23 )))

A lot of "human-readable" n-code are displayed on the screen.  It can be visually check whether it is correct (no error message).  The object code can be executed under the "nsim31" simulator.  nsim31 loads "a.obj" by default (to save "stdin" for nut-compiler to use to read its source) and then starts the compilation.  

c:>nsim31 < t2.txt

add1
(fun.1.1 (+ get.1 lit.1 ))
main
(fun.0.0 (sys.1 (call.75 lit.2 )))

9392 9392
9372 1 16 1 0
9374 1 14 1 9372
9376 1 6 0 9374
9378 0 0 9376 0
9380 1 19 257 9378
9382 1 16 2 0
9384 1 13 9380 9382
9386 0 0 9384 0
9388 1 20 1 9386
9390 0 0 9388 0
9392 1 19 0 9390
0
3
add1 3 9380 1 1
x 2 1 0 0
main 3 9392 0 0


The whole output can be redirect to a file, then cut only the code segment to be the output "a.obj". This object code can be executed under "nsim31".

c:>nsim31 < t2.txt > t2.obj

edit "t2.obj" to eliminate surplus listing at the beginning.

c:>ren a.obj nut.obj
c:>copy t2.obj a.obj

rename the previous "a.obj" which is the n-code of nut-compiler to save it in other name, "nut.obj", and prepare t2.obj to be executed under "nsim31" by renaming it to "a.obj". Now, run it.
 
c:>nsim31
3
c:>

End of lecture

Prabhas Chongstitvatana
22 June 2006