There are three parts: lexical analyser, parser, and code generation
lexcial analyser
lexical analyser changes stream of input text into tokens.
regular expression is used to specify "word" in the language.
the pattern of regex defines set of words in the language.
skill in defining words is:
a. given a regex write down the set of words
b. know how to write a specification, given set of words,
write its regex
note: writing a regex is considered to be both rigorous and
creative. One has to strive for a balance between simple and complex.
implementation of lex
a regex can be translated into an automaton. (similar to a finite
state machine)
a set of automaton is composed from individual automaton, it becomes
Non-deterministic Finite Automata, NFA.
to implement NFA, we need to transform it into DFA (deterministic finite
automata, which has many more states than NFA, but it is realizable.)
the main parser uses lexical analyser by calling "next token" function.
parser
the syntax of a language is specified by a grammar.
Context Free Grammar and a type of grammar (which is simple) that is used
to specify a computer language.
there are two methods to parse a sentence:
1. recursive descence parser
2. LL(1)
skill in defining syntax:
a. given a grammar, write set of sentences (this can be
infinite)
b. given a language, write its CFG. (this is similar to write
a spec. in engineering job)
practical grammar
in order to implement a parser, the grammar must be of some specific form.
it has to avoid "left recursion".
recursive descence parser
it is the most simple way to write a parser for small grammar.
you write it by hands. (other methods are automated method)
every non-terminal symbol becomes a function
the body of function is "call" to other non-terminal symbol and "match" to
terminal symbol.
in this way, the whole parser becomes a nested recursive call, started
from starting symbol.
to implement "optional" part, one needs a lookahead to see the next symbol
before making a decision.
the "match" function compares the current token with the non-terminal
symbol in the grammar and return yes/no. If it is successful (return
yes) it will advance to the next token (call next token).
LL(1)
it is an algorithm that use grammar and a push-down stack to parse a
sentence.
the grammar must be in correct form (which we will know "after" we
construct the parser).
in order to write an LL(1) parser, one need to compute
a. first set
b. follow set
first and follow set can be compute efficiently.
with these two sets, we can draw a parsing table.
use the parsing table, with push-down stack, we can parse a sentence.
to check whether the grammar is LL(1) compatible, after we create the
parsing table, we check each entry in the table, it must has only one
rule.
we can rewrite the grammar so that it is LL(1) compatable most of
the time.
code generation
while we parse, we construct an auxilliary data structure called "parse
tree".
the parse tree represents the input source.
an interpreter can be written so that given a parse tree as its input, its
can "run" (execute) that program.
this interpreter "traverses" the parse tree and perform the function
according to the "operator" in the parse tree (such as +).
to create a code generator, we change the interpreter one step further.
instead of "run" the parse tree, the code generator outputs the assembly
language string that is equivalent to "run".
for example a parse tree of A+2 is (+ A 2)
the code generator outputs the following sequence of assembly instruction
for 3-address format as follows:
assume A is in R1
2 is in R2
the result is in R3
load R1, A
mov R2, #2
add R3, R1, R2
other parsing methods
two methods that we learn are the top down method.
it is limited to the grammar that is compatible.
some syntax can not be handle by LL(1).
there are many more methods that are "bottom up" methods.
they are more powerful and at the same time more complex.
end