Summary of the class on compiler


There are three parts: lexical analyser, parser, and code generation

lexcial analyser

lexical analyser changes stream of input text into tokens.
regular expression is used to specify "word" in the language.
the pattern of regex defines set of words in the language.
skill in defining words is:
  a.  given a regex write down the set of words
  b.  know how to write a specification, given set of words, write its regex
note:  writing a regex is considered to be both rigorous and creative. One has to strive for a balance between simple and complex.
implementation of lex
a regex can be translated into an automaton.  (similar to a finite state machine)
a set of automaton is composed from individual automaton, it becomes Non-deterministic Finite Automata, NFA.
to implement NFA, we need to transform it into DFA (deterministic finite automata, which has many more states than NFA, but it is realizable.)
the main parser uses lexical analyser by calling "next token" function.

parser

the syntax of a language is specified by a grammar. 
Context Free Grammar and a type of grammar (which is simple) that is used to specify a computer language.
there are two methods to parse a sentence:
  1. recursive descence parser
  2. LL(1)

skill in defining syntax:
  a.  given a grammar, write set of sentences (this can be infinite)
  b.  given a language, write its CFG. (this is similar to write a spec. in engineering job)
practical grammar
in order to implement a parser, the grammar must be of some specific form.
it has to avoid "left recursion".

recursive descence parser

it is the most simple way to write a parser for small grammar.
you write it by hands. (other methods are automated method)
every non-terminal symbol becomes a function
the body of function is "call" to other non-terminal symbol and "match" to terminal symbol.
in this way, the whole parser becomes a nested recursive call, started from starting symbol.
to implement "optional" part, one needs a lookahead to see the next symbol before making a decision. 
the "match" function compares the current token with the non-terminal symbol in the grammar and return yes/no.  If it is successful (return yes) it will advance to the next token (call next token).

LL(1)

it is an algorithm that use grammar and a push-down stack to parse a sentence.
the grammar must be in correct form (which we will know "after" we construct the parser).
in order to write an LL(1) parser, one need to compute
  a.  first set
  b.  follow set
first and follow set can be compute efficiently.
with these two sets, we can draw a parsing table.
use the parsing table, with push-down stack, we can parse a sentence.
to check whether the grammar is LL(1) compatible, after we create the parsing table, we check each entry in the table, it must has only one rule.
we can rewrite the grammar so that it is  LL(1) compatable most of the time.

code generation

while we parse, we construct an auxilliary data structure called "parse tree".
the parse tree represents the input source.
an interpreter can be written so that given a parse tree as its input, its can "run" (execute) that program.
this interpreter "traverses" the parse tree and perform the function according to the "operator" in the parse tree (such as +).
to create a code generator, we change the interpreter one step further.
instead of "run" the parse tree, the code generator outputs the assembly language string that is equivalent to "run".
for example a parse tree of A+2 is (+ A 2)
the code generator outputs the following sequence of assembly instruction for 3-address format as follows:
assume A is in R1
2 is in R2
the result is in R3

   load R1, A
   mov R2, #2
   add R3, R1, R2

other parsing methods

two methods that we learn are the top down method.
it is limited to the grammar that is compatible.
some syntax can not be handle by LL(1).
there are many more methods that are "bottom up" methods.
they are more powerful and at the same time more complex.

end