After a long time of contemplating, I think it is a good time to rewrite the parser generator. First, the old parser generator needs an update to generate a multi-way branch using switch and case. Second, writing a new Som program will be refreshing (I am also teaching a compiler course this semester :-) ). A number of ideas have been implemented in the lex in the vm of som 4.2; now it is time to make use of them. The result is quite nice: a 700-line program works well and generates a better parser (slightly more efficient than the old hand-tuned one).
The grammar of the parser that is the input to the generator is described as follows:
grammar -> 'string | rule | 'eof
rule    -> 'id rule2
rule2   -> '= es | '[ var
var     -> 'id var | '] '= es
es      -> e1 es | '| es | '%
e1      -> 'id | 'string
'string is passed through; it is the "action" part of the generator. If 'id is a terminal symbol, it is a token name (tkEQ ...). If 'id is a nonterminal symbol, it is a rule name that appears on the left hand side of another rule. '| indicates alternatives. '% terminates a rule. The terminal 'nil indicates an empty match (always matches).
Example: this grammar:

args = tkIDEN "enterLocal tokvalue" args |
       tkBAR "ypush Nlv" local |
       "ypush Nlv" nil %

ex = tkBB "ypush MARK" exs tkBE "doblock" |
     ex1 %
is turned into:

to args =
  while tok == tkIDEN
    enterLocal tokvalue
    lex
  if tok == tkBAR
    ypush Nlv
    lex
    commit local
    1 break
  ypush Nlv
  1

to ex =
  if tok == tkBB
    ypush MARK
    lex
    commit exs
    expect tkBE
    doblock
    lex
    1 break
  if ex1
    1 break
  0
A rule always returns 0 (fail) or 1 (success). As this is really an LL(1) parser, returning 0 means an error. An alternative consists of several "match" tokens. The first match is handled differently from the rest. If a token starts with "tk" (tkXX), it is a terminal symbol; otherwise it is a nonterminal symbol.
match token (first):
  tkXX    -> if tok == tkXX
  nonterm -> if nonterm
match token (other):
  tkXX    -> expect tkXX
  nonterm -> commit nonterm
A match token is followed by "lex" to move over the input token, but if there is a string (pass through), "lex" will be placed after the string.
Example:

ex1 = tkIF ex0 ex ... |
      tkBREAK "ypush newatom" ...

to ex1 =
  if tok == tkIF
    lex
    commit ex0
    commit ex
    ...
  if tok == tkBREAK
    ypush newatom
    lex
    ...
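The translation of one match token can be sketched in Python. This is only a rough model I wrote for illustration; gen_match and its parameters are my own names, not part of the actual Som source:

```python
def gen_match(token, is_first, action=None):
    """Emit parser lines for one match token.

    token    -- "tkXX" (terminal) or a rule name (nonterminal)
    is_first -- True for the first match of an alternative
    action   -- optional pass-through string, placed before "lex"
    """
    terminal = token.startswith("tk")
    lines = []
    if is_first:
        lines.append(f"if tok == {token}" if terminal else f"if {token}")
    else:
        lines.append(f"expect {token}" if terminal else f"commit {token}")
    if terminal:
        if action:               # the action string goes first,
            lines.append(action) # so "lex" lands after it
        lines.append("lex")
    return lines

print(gen_match("tkIF", True))             # first terminal, no action
print(gen_match("tkBE", False, "doblock")) # later terminal with action
print(gen_match("ex0", False))             # later nonterminal
```

These three calls reproduce the lines shown for ex1 and ex above.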
An alternative ends with "1 break", and a rule ends with "0", except when the last alternative is "nil"; then the rule ends with "1" (always match).
Example:

ex = ... | ex1 %

to ex =
  ...
  if ex1
    1 break
  0
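These two ending rules can be captured in a tiny sketch (assemble_rule is a hypothetical helper of my own, not in the generator):

```python
def assemble_rule(name, alt_bodies, last_is_nil):
    """alt_bodies: the generated lines of each alternative, in order."""
    lines = [f"to {name} ="]
    for body in alt_bodies:
        lines.extend(body)
    # the closing line: "1" (always match) when the last alternative
    # is nil, otherwise "0" (fail)
    lines.append("1" if last_is_nil else "0")
    return lines

print(assemble_rule("ex", [["if ex1", "1 break"]], last_is_nil=False))
```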
When an alternative is recursive, a "while" loop is used.
Example:

local = tkIDEN "enterLocal tokvalue" local |
        nil %

to local =
  while tok == tkIDEN
    enterLocal tokvalue
    lex
  1
When there is a multi-way choice and the first set consists entirely of terminals, a more efficient branch, "case", is used.
Example:

bop = tkPLUS "ypush tok" |
      tkMINUS "ypush tok" |
      tkSTAR "ypush tok" %

to bop =
  case tok
    tkPLUS:
      ypush tok lex
      1 break
    tkMINUS:
      ypush tok lex
      1 break
    tkSTAR:
      ypush tok lex
      1 break
  0
The following paragraphs are a short description of how the generator works. It is best read side-by-side with the source code (this is only the main part; for the complete source please see the package somv42a.zip). The grammar is read by a lex syscall in the vm (of the som 4.2 vm). This lex knows Som tokens. (If the built-in lex is not used, one can use the lex written in Som from token-s.txt of som 4.1 and earlier.) Two lists are built: one for the "header" (pass through) and the other for all rules. "to grammar" and "to rule" form the recursive descent parser for the parser grammar. All rules have this form:

((lhs1 (var) (alt1) (alt2) ...)
 (lhs2 (var) (alt1) (alt2) ...) ...)
Each symbol is tagged (type.value), where type is one of TERM, NONTERM, STRING, NIL, and value is a pointer to its print-string. "to doiden" does the tagging. "to doalt" builds the list of alternatives and rules.
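An assumed Python model of the tagging: the classification follows the text ("tk" prefix = terminal, quotes = pass-through string), but the function name and tuple shape are mine, not the Som source:

```python
TERM, NONTERM, STRING, NIL = "TERM", "NONTERM", "STRING", "NIL"

def tag(sym):
    """Tag one grammar symbol as a (type, print-string) pair."""
    if sym == "nil":
        return (NIL, sym)                # empty match
    if sym.startswith('"'):
        return (STRING, sym.strip('"'))  # quoted action string
    if sym.startswith("tk"):
        return (TERM, sym)               # token name, e.g. tkEQ
    return (NONTERM, sym)                # a rule name

# one alternative of: local = tkIDEN "enterLocal tokvalue" local | nil %
alt = [tag("tkIDEN"), tag('"enterLocal tokvalue"'), tag("local")]
print(alt)
```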
Generation of a parser from the list of rules is done rule by rule by "to genarule". It extracts the "lhs" (head) of the rule and generates the alternatives using "to genalt". "to genalt" checks each alternative for recursion and sets the flag "loopflag". It then uses "to genalt2" to do the work. "to genalt2" generates each match one by one. It handles the first match specially (via "to genone1"). It ignores the last match if it is recursive. Finally, it generates the ending properly according to the occurrence of "nil" (signalled by "nilflag").
"to genone1" and
"to
genone" do the real work of generating each match.
They use "to trylex" and "lexflag"
to place "lex" after a string behind a terminal symbol properly. The
match of type "TERM" sets "lexflag". The
"NIL" match sets "nilflag".
So three global flags
are used to co-ordinate
the parser generation: loopflag, lexflag, and nilflag.
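The flag co-ordination can be sketched in Python. This is my reconstruction from the description above, not the Som source (the real work is split across "to genalt2", "to genone1", "to genone" and "to trylex"); symbols are the (type, value) pairs described earlier, and the sketch reproduces the generated lines of the earlier examples:

```python
def gen_alt(alt, lhs):
    """Generate the lines for one alternative of rule `lhs`.
    alt is a list of (type, value) pairs, type in TERM/NONTERM/STRING/NIL."""
    out = []
    loopflag = alt[-1] == ("NONTERM", lhs)   # tail recursion -> while loop
    matches = alt[:-1] if loopflag else alt  # ignore the recursive match
    lexflag = False   # a terminal was matched but "lex" not yet emitted
    first = True      # the first match is handled differently
    for typ, val in matches:
        if typ == "STRING":       # pass-through action goes before "lex"
            out.append(val)
            continue
        if typ == "NIL":          # empty match: nothing to emit here
            continue
        if lexflag:               # flush the pending "lex"
            out.append("lex")
            lexflag = False
        if typ == "TERM":
            if first:
                out.append(("while" if loopflag else "if") + f" tok == {val}")
            else:
                out.append(f"expect {val}")
            lexflag = True        # delay "lex" past any action string
        else:                     # NONTERM
            out.append(f"if {val}" if first else f"commit {val}")
        first = False
    if lexflag:
        out.append("lex")
    if not loopflag:
        out.append("1 break")     # (a rule-final nil alternative gets "1")
    return out

# the recursive alternative of args: tkIDEN "enterLocal tokvalue" args
print(gen_alt([("TERM", "tkIDEN"), ("STRING", "enterLocal tokvalue"),
               ("NONTERM", "args")], "args"))
```

Applied to the alternatives of args and ex above, this yields exactly the while/if bodies shown in the generated output.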
As for the coding style, in my opinion "to genalt2" is quite brilliant, especially the way it handles the loopflag by stopping before the last match. It is quite a good design.
The main goal of this new parser generator is to generate a "multiway branch" (switch, case), which compiles into much faster code. Previously, this part of the parser was hand-coded. To decide whether a rule needs a multiway branch, there are three factors:
1) it must not contain recursion;
2) it must have more than two choices (otherwise if..then..else is sufficient);
3) all choices except the last one must have tkXX as the first set.
This check is done in "to ismulti", which also uses "to chkRecursive". "to genmulti" is complicated in the last choice (the "else" part). If it is tkXX, it is handled like the other choices; otherwise it becomes "else: ...".
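The three-factor decision can be sketched as follows (names and data shapes are my assumptions, reusing the (type, value) pairs from earlier; this is not the actual "to ismulti"):

```python
def is_multi(rule_name, alts):
    """Decide whether a rule qualifies for a case-branch.
    alts: list of alternatives, each a list of (type, value) pairs."""
    # 1) it must not contain recursion
    if any(alt[-1] == ("NONTERM", rule_name) for alt in alts):
        return False
    # 2) it must have more than two choices
    if len(alts) <= 2:
        return False
    # 3) every choice except the last starts with a terminal tkXX
    return all(alt[0][0] == "TERM" for alt in alts[:-1])

bop = [[("TERM", "tkPLUS"),  ("STRING", "ypush tok")],
       [("TERM", "tkMINUS"), ("STRING", "ypush tok")],
       [("TERM", "tkSTAR"),  ("STRING", "ypush tok")]]
print(is_multi("bop", bop))   # the bop rule above qualifies
```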
Example:

ex1 = ...
      tkENUM tkBB elist tkBE "ypush NIL" |
      tkBREAK "ypush newatom OPER tkBREAK" |
      exas %

to ex1 =
  case tok
    ...
    tkENUM: ...
    tkBREAK: ...
    else:
      if exas
        1 break
  0
It should be fine to use "commit exas" there, but it causes a parsing error. This is due to the way "exas" works (and this was the most difficult bug to track down). In the end, this is a good parser generator in ~700 lines of code.
"to genforward"
is
required to generate a "header file" (for forward declaration) because
other functions need it (stmt-s.txt). One must manually cut that part
into a
separate file "parse-h-s.txt".
To generate a parser, do:

> som42 pgen1.txt
> som42 -x pgen1.obj > parser.txt

and do a bit of clean-up on "parser.txt".
22 sept 2009