Parser generator
To simplify the writing of a compiler, a parser generator (pgen) is
used to generate the parser. pgen accepts a LL(1) grammar with
decorations and generates a recursive descent parser. Currently, pgen
generates both C and Som languages (there are some differences in two
versions but principles are the same).
Using a parser generator reduces the amount of writing code to a
quarter (1/4) of writing the parser by hand. The evidence can be
seen from the Som-grammar (120 lines) to Som-parser (464 lines), the
ratio of grammar:C-parser = 1:4. This amount includes the code
for declaration and statements into the number of lines of grammar
(part of parser that must be coded by hand). If this is excluded
the ratio will be much higher, the grammar for som-som1 is just 60
lines.
pgen itself is a very limited. There is no error recovery, the first
grammatical error in the source file will stop the parser. It is
aimed to be a simple generator that does a minimum processing. It is
mostly driven by LL(1) grammar. It scans an input file linearly
just once. This design choice requires many decorations in the
grammar to allow knowing what-will-come-next. However, with all these
shortcomings it is really small, 100 lines of C.
The input LL(1) grammar file is in the form:
A -> B | C #
B -> a B |
nil #
where A..Z is non-terminal, a..z is terminal, # denotes the end of the
current rule, nil denotes empty clause. the parser can be
embedded with $str (action routine)
A -> B $str . . .
The general form of a recursive descent parser is
int
A(){
if( tok == a ) { //
check a terminal symbol
... do action
lex(); // get
next symbol
return 1;
}
if( B() )
{ // check a non-term symbol, it is
a call
// it gets next sym if accept
... do action
return
1;
}
return
0; // reject,
// return
1; // if the grammar is nil, accept
}
Checking for a terminal symbol is done by:
if( tok == term ) { ... lex(); }
for the first symbol. The subsequent symbol is done by:
expect(term, error-message); ...
lex();
as the token is already available.
Checking for a non-terminal symbol is a function call to that
non-terminal function, and subsequent non-terminal symbol is a call to
"commit( call() )" that fails when the call() fails.
C -> a b D #
int
C(){
if(tok == a) {
lex();
expect(b,err1);
commit(D());
return 1;
}
}
for the grammar
A -> B | C #
the output parser is:
int
A() {
//
loop:
if( B() ) { return 1; }
if( C() ) { return 1; }
return 0;
}
The right recursive grammar is turned into a loop:
B -> a B | nil #
int
B() {
loop:
if( tok == a ) { //
a is terminal
lex();
goto loop;
}
return 1;
}
The action routine is embeded into the parser:
C -> D
$doaction; E | F #
int
C() {
// loop:
if( D() ) {
doaction;
commit(E());
return 1;
}
if( F() ) { return 1; }
return 0;
}
The parser generator also generates the header file (.h for C) that
stores function prototypes of all non-terminal symbols.
input Grammar
must be careful about name space. All symbols are global.
$
out this line through to the parser
$ it is used to put some
initialisation into the parser
// the comment line
notation
-> separate head and right hand
| alternative clause
nil epsilon rule
# end of the current rule
[ list of local var ]
$xxx xxx is the action routine
tkXXX the terminal symbol (as defined in lex)
Example
//
this is the example of grammar
$ #include "myhead.h"
A -> B C | D #
E [ v ] -> ... | nil #
F -> tkTO $dofun ... #
...
END // end of input
Local variables in a
rule
Values can be passed "within" a rule to aid "remembering" some
values to be used later in a rule. To do this local variables can
be declared and used in the action routine. Values cannot be
passed to other rules, as this will turn a variable into a global
variable which makes a code not re-entrant. A non-terminal function
must be re-entrant because it must allow recursive call. Here is
an example of the use of a local variable (from som-som1 grammar):
term
[ a ] ->
tkIDEN
$a=search(tokstring); $doiden(a); |
...
Local variables are turned into a declaration of local variables in the
parser.
int
term(){
int a;
....
a=search(tokstring);
lex();
doiden(a);
return 1;
...
}
The current token is stored in "tok" and its string is in
"tokstring". Many action routines use these variables so the
"lex()" to get next token must come after the $action so it does not
affect the current "tok" and "tokstring".
Pseudo code
Here is the pseudo code of pgen:
pgen
main
readspec
scan input until "END"
begin with $
outputline
begin with //
ignore comment line
dogen
dogen
out function head
get local var list
get "->"
doRHS
doRHS
scan until "#" //
end of current rule
// always keep head and
last
term
if
isfirst
out "if tok == ..."
else
out "expect ..."
trylex
$x
out x
next
|
if
last != "goto"
out "return 1"
next
nil
out "return 1"
next
non-term
if
thisterm == head
out "goto loop"
else
if isfirst
out "if thisterm() .."
else
out "commit .."
next
if last != "nil"
out "return 1,
return 0"
end doRHS
trylex
next
if $x
out x lex()
next
else
out lex()
next
scan next token
End pgen
Example
the input file, grammar:
//
this is a part of som-som1 grammar
$ #include "parse.h"
top ->
tkTO fundef | ex #
fundef ->
tkIDEN $setfname(); args
$setarity(); tkEQ ex $dofun(); #
args ->
tkIDEN
$enterLocal(tokstring); args |
tkBAR $ypush(lv); local |
$ypush(lv); nil #
local ->
tkIDEN
$enterLocal(tokstring); local | nil #
END
and the output parser:
#include
"parse.h"
int top() {
loop:
if (tok == tkTO) {
lex();
commit(fundef());
return 1;
}
if (ex()) {
return 1;
}
return 0;
}
int fundef() {
loop:
if (tok == tkIDEN) {
setfname();
lex();
commit(args());
setarity();
expect(tkEQ,
"missing tkEQ");
lex();
commit(ex());
dofun();
return 1;
}
return 0;
}
int args() {
loop:
if (tok == tkIDEN) {
enterLocal(tokstring);
lex();
goto loop;
}
if (tok == tkBAR) {
ypush(lv);
lex();
commit(local());
return 1;
}
ypush(lv);
return 1;
}
int local() {
loop:
if (tok == tkIDEN) {
enterLocal(tokstring);
lex();
goto loop;
}
return 1;
}
with the following parse.h :
int
top(void);
int fundef(void);
int args(void);
int local(void);
Some remark
1 "nil" clause can have action routines. The action routine
must come before "nil", "nil" must be the last symbol.
2 if lex() is needed, an empty action routine can be used, $.
3 tokenise is moved only when consumes a terminal symbol.
4 the current structure of parser generator does not allow a
clean separation between the grammar, action routines and the
generator. Writing an action routine requires knowing how the
parser generator work in details.
3 October 2004
P. Chongstitvatana