Parser Generator v2

 

Input grammar

Implementation

Housekeeping

 

After contemplating for a long time, I think it is a good time to rewrite the parser generator.  First of all, the old parser generator needs to be updated so that it generates a multi-way branch using switch and case.  Secondly, writing a new Som program will be refreshing.  (I am also teaching a compiler course this semester.)  A number of ideas have been implemented in the lex in the VM of Som 4.2, so now is the time to make use of them.  The result is quite nice: a 700-line program that works well and generates a better parser (slightly more efficient than the old hand-tuned one).

 

Grammar

The grammar that is the input to the generator is described as follows:

 

  grammar -> 'string | rule | 'eof

 rule -> 'id rule2

 rule2 ->  '= es | '[ var

 var -> 'id var | '] '= es

 es -> e1 es | '| es | '%

 e1 -> 'id | 'string

 

'string is passed through; it is the "action" part of the generator.  If 'id is a terminal symbol, it is a token name (tkEQ ...).  If 'id is a nonterminal symbol, it is the name of a rule that appears on the left-hand side of another rule. '| separates alternatives. '% terminates a rule. The terminal 'nil indicates an empty match (it always matches).

 

Example

 

This grammar:

 

  args =

    tkIDEN "enterLocal tokvalue" args |

    tkBAR "ypush Nlv" local |

    "ypush Nlv" nil %

  ex =

    tkBB "ypush MARK" exs tkBE "doblock" |

    ex1 %

 

is turned into:

 

  to args =

    while tok == tkIDEN

      enterLocal tokvalue

      lex

    if tok == tkBAR

      ypush Nlv

      lex

      commit local

      1 break

    ypush Nlv

    1

 

  to ex =

    if tok == tkBB

      ypush MARK

      lex

      commit exs

      expect tkBE

      doblock

      lex

      1 break

    if ex1

      1 break

    0

 

A rule always returns 0 (fail) or 1 (success). As this is really an LL(1) parser, returning 0 means an error.  An alternative consists of several "match" tokens.  The first match is handled differently from the rest.  If a token starts with "tk" (tkXX), it is a terminal symbol; otherwise it is a nonterminal symbol.

 

match token

 

first

   tkXX     ->  if tok == tkXX

   nonterm  ->  if nonterm

 

other

   tkXX     ->  expect tkXX

   nonterm  ->  commit nonterm

 

A match token is followed by "lex" to move over this input token, but if there is a string (pass-through), "lex" will be placed after the string.

 

Example

 

  ex1 =   

    tkIF ex0 ex ... |

    tkBREAK "ypush newatom"

 

  to ex1 =

    if tok == tkIF

      lex

      commit ex0

      commit ex

      ...

    if tok == tkBREAK

      ypush newatom

      lex

      ...

 

An alternative ends with "1 break" and a rule ends with "0", except when the last alternative is "nil"; then the rule ends with 1 (always match).

 

Example

 

  ex = ... | ex1 %

 

  to ex =

    ...

    if ex1

      1 break

    0 

 

When an alternative is recursive, a "while" loop is used.

 

Example

 

  local =

    tkIDEN "enterLocal tokvalue" local |

    nil %

 

  to local =

    while tok == tkIDEN

      enterLocal tokvalue

      lex

    1

 

When there are multiple choices and the first sets are all terminals, a more efficient branch, "case", is used.

 

Example

 

  bop =

    tkPLUS "ypush tok" |

    tkMINUS "ypush tok" |

    tkSTAR "ypush tok" %

 

  to bop =

    case tok

      tkPLUS:

        ypush tok  lex 1 break

      tkMINUS:

        ypush tok  lex 1 break

      tkSTAR:

        ypush tok  lex 1 break

     0

 

Implementation

 

The following paragraphs are a short description of how the generator works.  It is best read side-by-side with the source code (this is only the main part; for the complete source please see the package somv42a.zip).  The grammar is read by a lex syscall in the VM of Som 4.2.  This lex knows Som tokens.  (If the built-in lex is not used, one can use the lex written in Som from token-s.txt in Som 4.1 and earlier.)  Two lists are built: one for the "header" (pass-through) and the other for all rules. "to grammar" and "to rule" form the recursive descent parser of the parser grammar.  All rules have this form:

 

  ((lhs1 (var)(alt1)(alt2)...) (lhs2 (var)(alt1)(alt2)...)...)

 

Each symbol is tagged (type.value) where

  type = TERM NONTERM STRING NIL

  value = pointer to its print-string

 

"to doiden" does the tagging. "to doalt" does creating the list of alternatives and rules.

 

Generation of a parser from the list of rules is done rule by rule in "to genarule".  It extracts the "lhs" (head) of the rule and generates the alternatives using "to genalt". "to genalt" checks each alternative for recursion and sets the flag "loopflag". It then uses "to genalt2" to do the work.

 

"to genalt2" generates each match one by one.  It handles the first match (by "to genone1"). It ignores the last match if it is recursive.  Finally, it generates the end properly according to the occurrence of "nil" (signalled by "nilflag").

 

"to genone1" and "to genone" do the real work of generating each match.  They use "to trylex" and "lexflag" to place "lex" after a string behind a terminal symbol properly. The match of type "TERM" sets "lexflag".  The "NIL" match sets "nilflag".

 

So three global flags are used to co-ordinate the parser generation: loopflag, lexflag, and nilflag.
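
As a rough illustration of how these flags interact, the following Python sketch (hypothetical names; the real code is Som, organised as genarule/genalt/genalt2/genone) walks one alternative and emits the lines shown in the earlier examples. Nesting and indentation of the generated code are omitted, and this is not the actual implementation:

  # Sketch of genalt2/genone: emit the parser lines for one alternative.
  def gen_alternative(alt, loopflag, emit):
      lexflag = False                             # a terminal was matched, "lex" still pending
      nilflag = False                             # the alternative is the empty match "nil"
      items = alt[:-1] if loopflag else alt       # ignore the trailing recursive call
      for i, (kind, text) in enumerate(items):
          if kind == STRING:
              emit(text)                          # pass-through action comes first
              if lexflag:
                  emit("lex")                     # then advance over the matched token
                  lexflag = False
              continue
          if lexflag:
              emit("lex")                         # no action string followed the terminal
              lexflag = False
          if kind == TERM:
              emit(("while " if loopflag and i == 0 else
                    "if " if i == 0 else "expect ") +
                   (f"tok == {text}" if i == 0 else text))
              lexflag = True
          elif kind == NONTERM:
              emit(("if " if i == 0 else "commit ") + text)
          elif kind == NIL:
              nilflag = True
      if lexflag:
          emit("lex")
      if not loopflag:
          emit("1" if nilflag else "1 break")     # nil ends the whole rule with 1

Running this sketch on the "args" and "ex" rules above reproduces the generated bodies shown earlier (the final "0" of each rule is added by the caller).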

 

As for the coding style, in my opinion "to genalt2" is quite brilliant, especially the way it handles the loopflag by stopping before the last match.  It is quite a good design.

 

The main goal of this new parser generator is to generate a "multiway branch" (switch/case), which compiles into much faster code.  Previously, this part of the parser was hand-coded.  To decide whether a rule needs a multiway branch, there are three factors:

  1)  it must not contain recursion

  2)  it has more than two choices (otherwise if..then..else is sufficient).

  3)  all choices except the last one must have a tkXX as their first set.

 

This check is done in "to ismulti", which also uses "to chkRecursive".  "to genmulti" is complicated by the last choice (the "else" part): if it is a tkXX then it is handled like the other choices; otherwise it becomes "else: ...". A sketch of the check is shown below.
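
A minimal Python sketch of that decision (hypothetical names, reusing the TERM/NONTERM tags from the sketch above; the real check is "to ismulti" with "to chkRecursive") might look like this:

  # A rule gets a case/switch only if it has no recursion, has more than
  # two choices, and every choice but the last one begins with a terminal.
  def contains_recursion(lhs, alt):
      # stand-in for "to chkRecursive": does the alternative call its own rule?
      return any(kind == NONTERM and text == lhs for kind, text in alt)

  def is_multi(lhs, alts):
      if any(contains_recursion(lhs, alt) for alt in alts):   # factor 1
          return False
      if len(alts) <= 2:                                      # factor 2
          return False
      return all(alt[0][0] == TERM for alt in alts[:-1])      # factor 3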

 

Example

 

  ex1 =

    ...

    tkENUM tkBB elist tkBE "ypush NIL" |

    tkBREAK "ypush newatom OPER tkBREAK" |

    exas %

 

  to ex1 =

    case tok

      ...

      tkENUM: ...

      tkBREAK: ...

      else:

        if exas

           1 break

     0

 

It should be fine to use "commit exas", but it causes a parsing error.  This is due to the way "exas" works (and this was the most difficult bug to track down).  In the end, this is a good parser generator in ~700 lines of code.

 

Housekeeping

 

"to genforward" is required to generate a "header file" (for forward declaration) because other functions need it (stmt-s.txt). One must manually cut that part into a separate file "parse-h-s.txt".  To generate a parser do:

 

> som42 pgen1.txt

> som42 -x pgen1.obj > parser.txt

 

and do a bit of cleanup on "parser.txt".

 

22 sept 2009