Parser generator 



To simplify the writing of a compiler, a parser generator (pgen) is used to generate the parser.  pgen accepts a LL(1) grammar with decorations and generates a recursive descent parser. Currently, pgen generates both C and Som languages (there are some differences in two versions but principles are the same).

Using a parser generator reduces the amount of writing code to a quarter (1/4) of writing the parser by hand.  The evidence can be seen from the Som-grammar (120 lines) to Som-parser (464 lines), the ratio of grammar:C-parser = 1:4.  This amount includes the code for declaration and statements into the number of lines of grammar (part of parser that must be coded by hand).  If this is excluded the ratio will be much higher, the grammar for som-som1 is just 60 lines.

pgen itself is a very limited. There is no error recovery, the first grammatical error in the source file will stop the parser.  It is aimed to be a simple generator that does a minimum processing. It is mostly driven by LL(1) grammar.  It scans an input file linearly just once.  This design choice requires many decorations in the grammar to allow knowing what-will-come-next. However, with all these shortcomings it is really small, 100 lines of C.

The input LL(1) grammar file is in the form:

A -> B | C #
B -> a B | nil #

where A..Z is non-terminal, a..z is terminal, # denotes the end of the current rule, nil denotes empty clause.  the parser can be embedded with $str (action routine)

A -> B $str . . .

The general form of a recursive descent parser is

int A(){
  if( tok == a ) {  // check a terminal symbol
    ... do action
    lex();          // get next symbol
    return 1;
  }
  if( B() ) {       // check a non-term symbol, it is a call
                    // it gets next sym if accept
    ... do action
    return 1;         
  }
  return 0;         // reject,
//  return 1;       // if the grammar is nil, accept
}

Checking for a terminal symbol is done by:

  if( tok == term ) { ... lex(); }

for the first symbol.  The subsequent symbol is done by:

  expect(term, error-message); ... lex();

as the token is already available.  

Checking for a non-terminal symbol is a function call to that non-terminal function, and subsequent non-terminal symbol is a call to "commit( call() )" that fails when the call() fails.

  C -> a b D #

int C(){
  if(tok == a) {
    lex();
    expect(b,err1);
    commit(D());
    return 1;
  }
}

for the grammar

   A -> B | C #

the output parser is:

int A() {
//  loop:      
  if( B() ) { return 1; }
  if( C() ) { return 1; }
  return 0;
}

The right recursive grammar is turned into a loop:

   B -> a B | nil #

int B() {
  loop:
  if( tok == a ) {  // a is terminal
    lex();
    goto loop;
  }
  return 1;
}

The action routine is embeded into the parser:

  C -> D $doaction; E | F #

int C() {
//  loop:
  if( D() ) {
    doaction;
    commit(E());
    return 1;
  }
  if( F() ) { return 1; }
  return 0;
}

The parser generator also generates the header file (.h for C) that stores function prototypes of all non-terminal symbols.

input Grammar

must be careful about name space.  All symbols are global.

$ out this line through to the parser
$ it is used to put some initialisation into the parser
// the comment line

notation

-> separate head and right hand
|  alternative clause
nil  epsilon rule
#  end of the current rule
[ list of local var ]
$xxx   xxx is the action routine
tkXXX  the terminal symbol (as defined in lex)

Example

// this is the example of grammar
$  #include "myhead.h"

A -> B C | D #  
E [ v ] ->  ... | nil #  
F -> tkTO $dofun ... #
...

END  // end of input

Local variables in a rule

Values can be passed "within" a rule to aid "remembering"  some values to be used later in a rule.  To do this local variables can be declared and used in the action routine.  Values cannot be passed to other rules, as this will turn a variable into a global variable which makes a code not re-entrant. A non-terminal function must be re-entrant because it must allow recursive call.  Here is an example of the use of a local variable (from som-som1 grammar):

term [ a ] ->
  tkIDEN $a=search(tokstring); $doiden(a); |
  ...

Local variables are turned into a declaration of local variables in the parser.

int term(){
  int a;
  ....
  a=search(tokstring);
  lex();
  doiden(a);
  return 1;
  ...
}

The current token is stored in "tok" and its string is in "tokstring".  Many action routines use these variables so the "lex()" to get next token must come after the $action so it does not affect the current "tok" and "tokstring".

Pseudo code

Here is the pseudo code of pgen:

pgen

main
  readspec
  scan input until "END"
    begin with $ outputline
    begin with // ignore comment line
    dogen

dogen
  out function head
  get local var list
  get "->"
  doRHS

doRHS
  scan until "#"  // end of current rule
  // always keep head and last
    term
      if isfirst
        out "if tok == ..."
      else  
        out "expect ..."
      trylex
    $x
      out x
      next
    |
      if last != "goto"
        out "return 1"
      next    
    nil
      out "return 1"
      next
    non-term
      if thisterm == head
        out "goto loop"
      else
        if isfirst
          out "if thisterm() .."
        else
          out "commit .."
        next

  if last != "nil"
    out "return 1, return 0"

end doRHS

trylex
  next
  if $x
    out x lex()
    next
  else
    out lex()
    
next
  scan next token

End pgen

Example

the input file, grammar:

// this is a part of som-som1 grammar

$ #include "parse.h"

top ->
  tkTO fundef | ex #
fundef ->
  tkIDEN $setfname(); args $setarity(); tkEQ ex $dofun(); #
args ->    
  tkIDEN $enterLocal(tokstring); args |
  tkBAR $ypush(lv); local |
  $ypush(lv); nil #
local ->
  tkIDEN $enterLocal(tokstring); local | nil #

END

and the output parser:

#include "parse.h"

int top() {
  loop:
  if (tok == tkTO) {
    lex();
    commit(fundef());
    return 1;
  }
  if (ex()) {
    return 1;
  }
  return 0;
}

int fundef() {
  loop:
  if (tok == tkIDEN) {
    setfname();
    lex();
    commit(args());
    setarity();
    expect(tkEQ, "missing tkEQ");
    lex();
    commit(ex());
    dofun();
    return 1;
  }
  return 0;
}

int args() {
  loop:
  if (tok == tkIDEN) {
    enterLocal(tokstring);
    lex();
    goto loop;
  }
  if (tok == tkBAR) {
    ypush(lv);
    lex();
    commit(local());
    return 1;
  }
  ypush(lv);
  return 1;
}

int local() {
  loop:
  if (tok == tkIDEN) {
    enterLocal(tokstring);
    lex();
    goto loop;
  }
  return 1;
}

with the following parse.h :

int top(void);
int fundef(void);
int args(void);
int local(void);

Some remark

1  "nil" clause can have action routines.  The action routine must come before "nil", "nil" must be the last symbol.
2  if lex() is needed, an empty action routine can be used, $.
3  tokenise is moved only when consumes a terminal symbol.
4  the current structure of parser generator does not allow a clean separation between the grammar, action routines and the  generator.  Writing an action routine requires knowing how the parser generator work in details.

3 October 2004
P. Chongstitvatana