nut compiler  
I am going to describe nut compiler which is written in nut. We will
use "nut" as a common language to describe software system used in this
course. A compiler can be written using the target language (of the
compiler). Writing a compiler with its own target language demonstrates
two points:
  -  
that the target language is not trivial.  At least it can
be used to write some complex program such as a compiler.  This
shows a kind of "completeness" of the language.
 
  -  
that the understanding of meaning the language is complete
enough to use it to write such a non-trivial program.
   
and I have my favourite third point:
  - that it is beautiful. In a sense that the language is
self-describing.
 
Writing a compiler with the target language is not new (for example you
can write a C compiler in C and compile your C compiler into the
executable code using any existing C compiler). (1) It has been
practiced especially in the early days of computer software
development. Some "conceptual difficulty" must be overcome concerning
the confusion of the "compiler" and the "compiled" program (since we
can use the compiler to compile itself!).  The run-time facilities
are always posing difficulty as the memory is shared between the
compiler and the compiled program.  However, these points are not
a priority in our study.  With some experience, you will soon
overcome all these hurdles by yourself.  
(1) The question arises as how the first compiler was written?
compiler
  input source program (in
nut)
  output n-code
code generator
  input n-code
  output machine code of a
specific processor
To develop "nut-in-nut" compiler, we will use a version of nut
compiler (written in C), "nutc", as a compiler tool.  This version
of nut-compiler
has many additional operators that support compilation.  To run
the compiler, a version of the "n-code virtual machine" (or the
interpreter for n-code), called "nvm" is used.  "nutc" contains
many supporting functions to facilitate compilation (which are
cumbersome to write in nut or which we are not interested in
discussing).
Next, I will explain the output of the compiler.  It is important
to understand the n-code, it is what the compiler must produce.
N-code
An example
the source program (to be compiled):
(def
add1 x () (+ x 1))
(def main () ()
  (sys 1 (add1 2))
where "sys 1 x" is a system call (analogous to OS call for I/O) that
will print an integer x to the screen. 
the n-code of the above program in "human-readable" form is:
add1
(fun.1.1 (+ get.1 lit.1 ))
main
(fun.0.0 (sys.1 (call.17 lit.2 )))
and in "absolute" object form (as in the output file a.obj):
22
22
2 1 16 1 0
4 1 14 1 2
6 1 6 0 4
8 0 0 6 0
10 1 19 257 8
12 1 16 2 0
14 1 13 10 12
16 0 0 14 0
18 1 20 1 16
20 0 0 18 0
22 1 19 0 20
0
2
17 add1 3 10 1 1
19 main 3 22 0 0
There are three parts in the object file:
1)  the code
2)  the data (the line contains a zero)
3)  the symbol table (use for debugging purpose)
The code is a contiguous block of memory. The code "main" started at
22.  The format of the object code is:  
address type op arg next
type 0 is dot-pair
type 1 is atom
op arg is the operator
next is the address of the next cell
Some explanation:
22
1 19 0 20
says  it is an atom "fun.0" next is 20
20
0 0 18 0
says it is a list (dot-pair), the head pointed to 18 and the next is
NIL.
18
1 20 1 16
says it is an atom "sys.1", next is 16 (the argument of sys.1)
16
0 0 14 0
says it is a list, the head is 14, next is NIL.
14
1 13 10 12
says it is an atom "call.10", next is 12 (the argument of the function)
12
1 16 2 0
says it is an atom "lit.2", next is NIL.
at 10, is the "fun.x" (the function "add1") etc.
Compiler
You can see the whole compiler in
(in
the directory /nut-compiler:  lib.nut 
nut-h.nut  symtab.nut  parse.nut  nut.nut ). The whole compiler is about
600 lines.  The
short pseudo code to show the "overview"
of the compiler.
The compiler has four main functions:
main
  readinfile
  parse
  resolve
  outobj
"readinfile" reads the source program from a standard input stream
(stdin in Unix).  The compiler reads the whole input at once and
keeps it in a big array of characters (maximum input size is 50 Kbytes).
"parse" is the main parser that scans the input stream and generates
n-code.
"resolve" performs renaming and binding of the actual code to generate
the executable code.
"outobj" prints out the object file to a standard output (stdout).
We will concentrate on "parse".  To understand "resolve" you need
to know the run-time system.
Nut has a trivial syntax (by design) hence the parser for nut is very
simple.  Our parser is a "recursive descent" parser.  It
calls parsing routines recursively with one look ahead symbol and never
backtrack. This is a simple and
straightforward kind of parser.  You can read more about this kind
of parser in any standard textbook in compiler.
Before we look into the parser, we need to be able to scan the input
which is the stream of characters and forms "token".  This is
called "lexical analyser".  There are only three separators in
nut: space, "(" and ")".  "tokenise" is one of the system call
that is implemented in the "nsim".  It passes the token as a
string of characters (string of nut language).  Here is the sample
use of "tokenise":
(let
tok)
(def tokenise () ()
  (set tok (sys 3))
(def testtok () ()
  (do
  (tokenise)
  (while (!= (vec tok 0) EOF)
    (do
    (prstr tok)
    (space)
    (tokenise)))))
where "prstr" is a function to print a nut string.  The end of
file is signified by the first character of the returned string to be
EOF (127).  When run testtok with the example stream, here is the
output:
(
def add1 x ( ) ( + x 1 ) ) ( def main ( ) ( )  ( sys 1 ( add1 2 ) )
Now we take a look at the parser
parse
  tokenise
  while not EOF
    expect "("
    tokenise
    if token ==
"def"
     
parseDef
    if token ==
"let"
     
parseLet
    if token ==
"enum"
     
parseEnum
    tokenise
The parser gets a token and call "parseDef" or "parseLet" or
"parseEnum" according to the token then loops until the end of file.
parseDef
  tokenise  
     ; get fun name
  parseNL  
      ; get formal arg
  parseNL  
      ; get local
  tokenise
  e = parseExp  
 ; get body
  tokenise  
     ; skip ")"
  update symtab
  out (fun e)
"parseDef" parses the "header" of the function definition then the main
part is "parseExp" to parse the body of the function definition. 
The "header" of a function definition composed of:
(def
add1 x () ...)
       
"add1" the function name
"x"    the list of formal parameters
"()"   the list of local variables
the list of formal and locals is parsed by "parseNL" (parse name list)
which will store the formal and local names to the symbol table and
gives them the references as a running number 1..n.
For sake of clarity, let's assume that the name list will not be an
atom.
parseNL
  tokenise
  while token != ")"
    installLocal
token
    tokenise
"parseNL" merely gets a name and stores it in the symbol table using
"installLocal" until exhausting the list (found the token ")" ).
Before getting into "parseExp", let's study the symbol table.
Symbol table
The symbol table is a one-dimension array of the "entry".  Each
entry has five fields: name, type, value, arity, lv.  "name" is a
pointer to a string, the "symbol". "type" shows the type of symbol:
2  tyVAR   is a local
variable 
3  tyFUN   is a function
4  tyOP    is an operator
5  tyOPX   is an op that has one special argument
6  tySYS   is "sys.x"
7  tyUD    is undefined
8  tyGVAR  is a global variable
9  tyXX    --
10 tyENUM  is an enum symbol
"value" stores the value of the symbol, which is a reference for
function, local/global variable, the opcode of an operator, the syscall
number of "sys.x" or the value of an enum symbol.
"arity" and "lv" are for the function symbol, storing its arity and
total number of local variables.
"install nm" does a scan for "nm" in the symbol table (a sequential
search, for simplicity).  If it is already present, returns the
index to that entry.  If it is a new symbol, insert it into the
symbol table and returns its index.
Initially, all keywords are primed into the symbol table.  They
are treated the same as any other symbol.
; 
it a list, number, string or names
parseExp
  if token == "("
    tokenise
    nm = parseName
    e = parseEL
    out (nm e)
  if isNumber token
    n = atoi token
    out lit.n
  if isString token
    e = makestring
token+1
    out str.e
  parseName  
 ; it is OP, OPX, VAR, FUN
"parseExp" parse an expression in nut language.  An expression can
be: a name (such as a variable), a number, a constant string, a list
(function application).  
For a list, "parseExp" uses two auxilliary functions to parse:
parseName, parseEL (expression list).  
parseName
  get symbol token
  v = symbol.val
  switch type
    OP: out v
    VAR: out get.v
    GVAR: out ld.v
    FUN: out
call.idx
    OPX:
     
tokenise        ; get var name
     
get symbol token
      v2
= symbol.val
      if
type == VAR
       
switch v
         
SET: out put.v2
         
SETV: out stx.v2
         
VEC:  out ldx.v2
     
else if type == GVAR
       
switch v
         
SET: out st.v2
         
SETV: out sty.v2
         
VEC:  out ldy.v2
    SYS:
     
tokenise
      k
= atoi token
     
out sys.k
    ENUM:
     
out lit.v
"parseName" takes one token and depends on its type, it outputs an
appropriate n-code.
OP    an operator, out
opcode
VAR    a local var, out get.ref
GVAR    a global var, out ld.ref
FUN    a fun, out call.ref
SYS    a system call, gets sys num, out sys.k
ENUM    a enum symbol, out lit.ref
OPX    a special op
OPX takes the next token as an "unevaluated name", i.e. a reference of
that name, not its value. Three operators are OPX: "set", "setv" and
"vec".  Two possibilities for the next token, either it is a local
variable or a global variable.
local    out put/stx/ldx.ref
global    out st/sty/ldy.ref
parseEL
  tokenise
  if token == ")" return NIL
  e = parseExp
  e2 = parseEL
  out (e e2)
The last piece of the parser, "parseEL" recursively parses the rest of
the list.
How to compile and
run nut-compiler
I have written "nutc" in C to be the compiler that will compile
nut-compiler. nutc outputs "a.obj".  Because the whole compiler is
separated into 5 files, we use "batch" command to process them. 
The batch command is like this (it is "makenut.bat" file):
type
lib.nut nut-h.nut symtab.nut parse.nut nut.nut > t1.nut
nutc t1.nut
nvm a.obj < test.nut
del t1.nut
We run the batch file, it first concatenate all files into one
temporary file (t1.nut), compile it using "nutc".  The output is
"a.obj", it is the compiler object (executable under "nvm"). 
Then, we use it to compile a test file.
c:>
makenut
!=
(fun.2.2 (if (= get.1 get.2
)lit.0 lit.1 ))
and
(fun.2.2 (if get.1 get.2 lit.0 ))
...
main
(fun.0.0 (do (call.79 )(call.124
)(call.157 )(call.29 )(st.3 (sys.9 ))(call.145 )(call.155 )(call.156
)(call.23 )))
A lot of "human-readable" n-code are displayed on the screen.  It
can be visually check whether it is correct (no error message). 
The object code can be executed under the "nvm" nut virtual
machine. 
c:>nvm
a.obj
< test.nut
add1
(fun.1.1 (+ get.1 lit.1 ))
main
(fun.0.0 (sys.1 (call.75 lit.2 )))
9392 9392
9372 1 16 1 0
9374 1 14 1 9372
9376 1 6 0 9374
9378 0 0 9376 0
9380 1 19 257 9378
9382 1 16 2 0
9384 1 13 9380 9382
9386 0 0 9384 0
9388 1 20 1 9386
9390 0 0 9388 0
9392 1 19 0 9390
0
3
add1 3 9380 1 1
x 2 1 0 0
main 3 9392 0 0
The whole output can be redirect to a file, then cut only the code
segment to be the output "test.obj".
c:>nvm
a.obj
< test.nut > test.obj
edit "test.obj" to eliminate surplus listing at the beginning.
c:>copy
a.obj nut.obj
rename the previous "a.obj" which is the n-code of nut-compiler to save
it in other name, "nut.obj", and prepare test.obj to be executed under
"nvm".  Now, run it.
 
c:>nvm test.obj
3
c:>
End 
Prabhas Chongstitvatana
10 Nov 2009