parser generator

The grammar of the parser is described as follows:

  grammar ->  'string | rule | 'eof
  rule -> 'id '= es
  es ->  e1 es | '| es | '%
  e1 ->  'string | 'id

'string is passed through.  It is the "action" part of the generator.  If 'id  is a terminal symbol, it is a token name (tkEQ ...).  If 'id is a nonterminal symbol, it is the rule name appeared on the left hand side of other rules. '| indicates alternatives. '% terminates a rule. The terminal 'nil indicates empty match (always match).

example 

this grammar:

  args =
    tkIDEN "enterLocal tokvalue" args |
    tkBAR "ypush Nlv" local |
    "ypush Nlv" nil %
  ex = 
    tkBB "ypush MARK" exs tkBE "doblock" | 
    ex1 %

turned into:

  to args =
    while tok == tkIDEN
      enterLocal tokvalue
      lex
    if tok == tkBAR
      ypush Nlv
      lex
      commit local
      1 break
    ypush Nlv
    1

  to ex =
    if tok == tkBB
      ypush MARK
      lex
      commit exs
      expect tkBE
      doblock
      lex
      1 break
    if ex1
      1 break
    0

A rule always return 0 (fail) or 1 (success). As this is really an LL(1) parser, returns 0 means an error.  An alternative consists of several "match" tokens.  The first match is handled differently from the rest.  If a token starts with "tk" (tkXX), is a terminal symbol, otherwise it is a nonterminal symbol.

match token

first
   tkXX     ->  if tok == tkXX
   nonterm  ->  if nonterm

other
   tkXX     ->  expect tkXX
   nonterm  ->  commit nonterm

A match token is followed by "lex" to move it over this input token but if there is a string (pass through), "lex" is will be placed after the string.

example

  ex1 =    
    tkIF ex0 ex ... |
    tkBREAK "ypush newatom"

  to ex1 =
    if tok == tkIF
      lex
      commit ex0
      commit ex
      ...
    if tok == tkBREAK
      ypush newatom
      lex
      ...

An alternative is ended with "1 break" and a rule is ended with "0" except when the last one is "nil", then it is 1 (always match).

example

  ex = ... | ex1 %

  to ex = 
    ...
    if ex1
      1 break
    0  

When an alternative is recursive, "while" loop is used.

example

  local = 
    tkIDEN "enterLocal tokvalue" local |
    nil %

  to local =
    while tok == tkIDEN
      enterLocal tokvalue
      lex
    1

When there is multi-choice and the first set are all terminals then a more efficient branch, "case", is used.

example

  bop = 
    tkPLUS "ypush tok" | 
    tkMINUS "ypush tok" |
    tkSTAR "ypush tok" % 

  to bop =
    case tok
      tkPLUS:
        ypush tok  lex 1 break
      tkMINUS:
        ypush tok  lex 1 break
      tkSTAR: 
        ypush tok  lex 1 break
     0

Implementation

Grammar is read by a lex syscall in vm (of som 4.2 vm).  This lex knows Som tokens.  (If the built-in lex is not used then one can use lex in som from token-s.txt from som 4.1 and earlier). Two lists are built, one is for the "header" (pass through) and the other is for all rules. "to grammar" and "to rule" are the recursive descent parser of the parser grammar.  All rules have this form:

  ((lhs1 (var)(alt1)(alt2)...) (lhs2 (var)(alt1)(alt2)...)...)

Each symbol is tagged (type.value) where 
  type = TERM NONTERM STRING NIL
  value = pointer to its print-string

"to doiden" does the tagging. "to doalt" does creating the list of alternatives and rules.

Generation of a parser from the list of rules is done for each rule by "to genarule".  It extracts the "lhs" (head) of the rule and generates alternatives using "to genalt". "to genalt" check each alternative for recursion and set the flag "loopflag". It then uses "to genalt2" to do the work.

"to genalt2" generates each match one by one.  It handles the first match (by "to genone1"). It ignores the last match if it is recursive.  Finally, it generates the end properly according to the occurrence of "nil" (signalled by "nilflag").

"to genone1" and "to genone" do the real work of generating each match.  They use "to trylex" and "lexflag" to place "lex" after a string behind a terminal symbol properly. The match of type "TERM" sets "lexflag".  The "NIL" match sets "nilflag".

So three global flags are used to co-ordinate the parser generation: loopflag, lexflag, and nilflag.

About the coding style, in my opinion, "to genalt2" is quite brilliant especially the way it handles the loopflag by stopping before the last match.  It is quite a good design.

The main goal of this new parser generator is to generate "multiway branch" (switch, case).  It will be compiled into a much faster code.  Previously, the parser has been hand-coded in this part.  To decide if a rule needs a multiway branch, there are 3 factors:
  1)  it must not contain recursion
  2)  it has more than two choices (otherwise if..then else is sufficient).
  3)  all choices except the last one must has tkXXX as the first set.

This check is done in "to ismulti" which also uses "to chkRecursive".  "to genmulti" is complex in the last choice (the "else" part.  If it is tkXX then it is similar to other choice otherwise it will be "else: ...". 

example

  ex1 =
    ...
    tkENUM tkBB elist tkBE "ypush NIL" |
    tkBREAK "ypush newatom OPER tkBREAK" |
    exas %

  to ex1 =
    case tok
      ...
      tkENUM: ...
      tkBREAK: ...
      else:
        if exas 
           1 break
     0

It should be fine to do "commit exas" but it causes parsing error. This dues to the way "exas" works. (and this is the most difficult bug to track down).  In the end, this is a good parser generator in ~700 lines of code.

housekeeping

"to genforward" is required to generate a "header file" (for forward declaration) because other functions need it (stmt-s.txt). One must manually cut that part into a separate file "parse-h-s.txt".  To generate a parser do:

> som42 pgen1.txt 
> som42 -x pgen1.obj > parser.txt

and do a bit of clean up at "parser.txt". 

22 sept 2009
