Parser Generator v2

 

Input grammar

Implementation

Housekeeping

 

After contemplating for a long time, I think it is a good time to rewrite the parser generator.  First of all, the old parser generator needs to be updated so that it generates a multi-way branch using switch and case.  Secondly, writing a new Som program will be refreshing.  (I am also teaching a compiler course this semester.)  A number of ideas have been implemented in the lex in the VM of Som 4.2, so now is the time to make use of them.  The result is quite nice: a 700-line program that works well and generates a better parser (slightly more efficient than the old hand-tuned one).

 

Grammar

The grammar that is the input to the generator is described as follows:

 

  grammar -> 'string | rule | 'eof

 rule -> 'id rule2

 rule2 ->  '= es | '[ var

 var -> 'id var | '] '= es

 es -> e1 es | '| es | '%

 e1 -> 'id | 'string

 

'string is passed through; it is the "action" part of the generator.  If 'id is a terminal symbol, it is a token name (tkEQ ...).  If 'id is a nonterminal symbol, it is the name of a rule that appears on the left-hand side of another rule. '| separates alternatives. '% terminates a rule. The terminal 'nil indicates an empty match (it always matches).

 

Example

 

This grammar:

 

  args =

    tkIDEN "enterLocal tokvalue" args |

    tkBAR "ypush Nlv" local |

    "ypush Nlv" nil %

  ex =

    tkBB "ypush MARK" exs tkBE "doblock" |

    ex1 %

 

is turned into:

 

  to args =

    while tok == tkIDEN

      enterLocal tokvalue

      lex

    if tok == tkBAR

      ypush Nlv

      lex

      commit local

      1 break

    ypush Nlv

    1

 

  to ex =

    if tok == tkBB

      ypush MARK

      lex

      commit exs

      expect tkBE

      doblock

      lex

      1 break

    if ex1

      1 break

    0

 

A rule always returns 0 (fail) or 1 (success). As this is really an LL(1) parser, returning 0 means an error.  An alternative consists of several "match" tokens.  The first match is handled differently from the rest.  If a token starts with "tk" (tkXX), it is a terminal symbol; otherwise it is a nonterminal symbol.

 

match token

 

first

   tkXX     ->  if tok == tkXX

   nonterm  ->  if nonterm

 

other

   tkXX     ->  expect tkXX

   nonterm  ->  commit nonterm

 

A match token is followed by "lex" to move over this input token, but if there is a string (pass-through), "lex" will be placed after the string.

 

Example

 

  ex1 =   

    tkIF ex0 ex ... |

    tkBREAK "ypush newatom"

 

  to ex1 =

    if tok == tkIF

      lex

      commit ex0

      commit ex

      ...

    if tok == tkBREAK

      ypush newatom

      lex

      ...

 

An alternative ends with "1 break" and a rule ends with "0", except when the last alternative is "nil"; then the rule ends with 1 (always match).

 

Example

 

  ex = ... | ex1 %

 

  to ex =

    ...

    if ex1

      1 break

    0 

 

When an alternative is recursive, a "while" loop is used.

 

Example

 

  local =

    tkIDEN "enterLocal tokvalue" local |

    nil %

 

  to local =

    while tok == tkIDEN

      enterLocal tokvalue

      lex

    1

 

When there are multiple choices and the first sets are all terminals, a more efficient branch, "case", is used.

 

Example

 

  bop =

    tkPLUS "ypush tok" |

    tkMINUS "ypush tok" |

    tkSTAR "ypush tok" %

 

  to bop =

    case tok

      tkPLUS:

        ypush tok  lex 1 break

      tkMINUS:

        ypush tok  lex 1 break

      tkSTAR:

        ypush tok  lex 1 break

     0

 

Implementation

 

The following paragraphs are a short description of how the generator works.  It is best read side-by-side with the source code (this is only the main part; for the complete source please see the package somv42a.zip).  The grammar is read by a lex syscall in the VM of Som 4.2.  This lex knows Som tokens.  (If the built-in lex is not used, one can use the lex written in Som from token-s.txt in Som 4.1 and earlier.)  Two lists are built: one for the "header" (pass-through) and the other for all rules. "to grammar" and "to rule" form the recursive descent parser of the parser grammar.  All rules have this form:

 

  ((lhs1 (var)(alt1)(alt2)...) (lhs2 (var)(alt1)(alt2)...)...)

 

Each symbol is tagged (type.value) where

  type = TERM NONTERM STRING NIL

  value = pointer to its print-string

 

"to doiden" does the tagging. "to doalt" does creating the list of alternatives and rules.

 

Generation of a parser from the list of rules is done rule by rule in "to genarule".  It extracts the "lhs" (head) of the rule and generates the alternatives using "to genalt". "to genalt" checks each alternative for recursion and sets the flag "loopflag". It then uses "to genalt2" to do the work.

 

"to genalt2" generates each match one by one.  It handles the first match (by "to genone1"). It ignores the last match if it is recursive.  Finally, it generates the end properly according to the occurrence of "nil" (signalled by "nilflag").

 

"to genone1" and "to genone" do the real work of generating each match.  They use "to trylex" and "lexflag" to place "lex" after a string behind a terminal symbol properly. The match of type "TERM" sets "lexflag".  The "NIL" match sets "nilflag".

 

So three global flags are used to co-ordinate the parser generation: loopflag, lexflag, and nilflag.
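
As a rough illustration of how these flags interact, the following Python sketch (hypothetical names; the real code is Som, organised as genarule/genalt/genalt2/genone) walks one alternative and emits the lines shown in the earlier examples. Nesting and indentation of the generated code are omitted, and this is not the actual implementation:

  # Sketch of genalt2/genone: emit the parser lines for one alternative.
  def gen_alternative(alt, loopflag, emit):
      lexflag = False                             # a terminal was matched, "lex" still pending
      nilflag = False                             # the alternative is the empty match "nil"
      items = alt[:-1] if loopflag else alt       # ignore the trailing recursive call
      for i, (kind, text) in enumerate(items):
          if kind == STRING:
              emit(text)                          # pass-through action comes first
              if lexflag:
                  emit("lex")                     # then advance over the matched token
                  lexflag = False
              continue
          if lexflag:
              emit("lex")                         # no action string followed the terminal
              lexflag = False
          if kind == TERM:
              emit(("while " if loopflag and i == 0 else
                    "if " if i == 0 else "expect ") +
                   (f"tok == {text}" if i == 0 else text))
              lexflag = True
          elif kind == NONTERM:
              emit(("if " if i == 0 else "commit ") + text)
          elif kind == NIL:
              nilflag = True
      if lexflag:
          emit("lex")
      if not loopflag:
          emit("1" if nilflag else "1 break")     # nil ends the whole rule with 1

Running this sketch on the "args" and "ex" rules above reproduces the generated bodies shown earlier (the final "0" of each rule is added by the caller).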

 

As for the coding style, in my opinion "to genalt2" is quite brilliant, especially the way it handles the loopflag by stopping before the last match.  It is quite a good design.

 

The main goal of this new parser generator is to generate a "multiway branch" (switch/case), which compiles into much faster code.  Previously, this part of the parser was hand-coded.  To decide whether a rule needs a multiway branch, there are three factors:

  1)  it must not contain recursion

  2)  it has more than two choices (otherwise if..then..else is sufficient).

  3)  all choices except the last one must have a tkXX as their first set.

 

This check is done in "to ismulti", which also uses "to chkRecursive".  "to genmulti" is complicated by the last choice (the "else" part): if it is a tkXX then it is handled like the other choices; otherwise it becomes "else: ...". A sketch of the check is shown below.
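
A minimal Python sketch of that decision (hypothetical names, reusing the TERM/NONTERM tags from the sketch above; the real check is "to ismulti" with "to chkRecursive") might look like this:

  # A rule gets a case/switch only if it has no recursion, has more than
  # two choices, and every choice but the last one begins with a terminal.
  def contains_recursion(lhs, alt):
      # stand-in for "to chkRecursive": does the alternative call its own rule?
      return any(kind == NONTERM and text == lhs for kind, text in alt)

  def is_multi(lhs, alts):
      if any(contains_recursion(lhs, alt) for alt in alts):   # factor 1
          return False
      if len(alts) <= 2:                                      # factor 2
          return False
      return all(alt[0][0] == TERM for alt in alts[:-1])      # factor 3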

 

Example

 

  ex1 =

    ...

    tkENUM tkBB elist tkBE "ypush NIL" |

    tkBREAK "ypush newatom OPER tkBREAK" |

    exas %

 

  to ex1 =

    case tok

      ...

      tkENUM: ...

      tkBREAK: ...

      else:

        if exas

           1 break

     0

 

It should be fine to use "commit exas", but it causes a parsing error.  This is due to the way "exas" works (and this was the most difficult bug to track down).  In the end, this is a good parser generator in ~700 lines of code.

 

Housekeeping

 

"to genforward" is required to generate a "header file" (for forward declaration) because other functions need it (stmt-s.txt). One must manually cut that part into a separate file "parse-h-s.txt".  To generate a parser do:

 

> som42 pgen1.txt

> som42 -x pgen1.obj > parser.txt

 

and do a bit of cleanup on "parser.txt".

 

22 sept 2009