Lexical generator 

Lexical generator is a program that outputs a lexical analyser function from an input binding.  The input file is a binding of tokens to their values (symbolic).  See the following example:

* tkSTAR
/ tkSLASH
- tkMINUS
+ tkPLUS
= tkEQ
== tkEQEQ
...

The generator outputs two files: token.c, token.h. The "token.c"  is a function, "int token() {...}".  token() scans a text buffer using "getC()" to find a defined token (reserved word), it returns 1 if accept, 0 if reject.  It assumes the first character is already in CH, the next character is pointed to by cp.

This is the output file, token.c:

/* automatically generated from lexgen */
#include "compile.h"
int token() {
switch(CH) {
case '*':  accept(tkSTAR,1); break;
case '/':  accept(tkSLASH,1); break;
case '-':  accept(tkMINUS,1); break;
case '+':  accept(tkPLUS,1); break;
case '=':
  getC();
  switch(CH) {
  case '=':  accept(tkEQEQ,2); break;
  default: accept(tkEQ,1);
  }
  break;
...
case 't':
  getC();
  switch(CH) {
  case 'o':
    getC();
    if( isLetterOrDigit(CH) ) return 0;
     accept(tkTO,2);
    break;
  default: return 0; /* reject */
  }
  break;
...

The token.h is a binding of an integer value to the symbolic name.  The first integer starts at 50.  The output file is:

#define tkSTAR 50
#define tkSLASH 51
#define tkMINUS 52
...

Pseudo code

Here is the short pseudo code for the lexical generator.

main
  readspec
  writedoth
  initG
  mergeK
  gen

readspec
  get string (the token)
  get name   (symbolic name)

writedoth
  output  #define name i  where i is the running number

graph (really a tree) is a list of nodes
data stucture of each node
    char info
    int next
    int choice
    int prop

buildG  from string s1
  // build a graph (linear), each node contains a char
  // the head node has Propnum, the index to string
  node = getanode  *s1
  s1++
  loop to extend node for each char in s1
     nextG = getanode *s1
     s1++
  set last node propG = Propnum

mergeK k g
  // merge k-th string to graph g
  set Propnum = k
  mergeG string[k] to g

mergeg  
  // extend g with string s
  check all choices in g
    if match and endofstring, tryinsertE return
    if match and endgraph, extendnext return
  // exhaust all choices
  insertchoice

insertchoice p1 p2
  // insert p1 into choice of p2

tryinsertE
  if g is nil return
  find last choice of g
  insert $ at last choice

extendnext s g  
  // extend g with graph of s
  a = buildG s
  nextG g = a

gen
  outcase
  if endgraph
    outaccept
  else
    out getC(); switch(CH)
    gen nextG
    out break;
  check choice
  if nil
    if is $  
       gen choice
    else
       outdefaultY
  else
    outdefaultN

The following paragraph explains the behaviour of the generator. It has two main functions: build graph, and generate C program.

notation:  info:propnum

a -> next
|
choice

1)  Build graph

The main routine is mergeG

mergeG *s g   // merge string s into graph g

check all choices
  each choice match info with *s until no match *1

*1 two possible cases
  a)  end *s first, insert $ at choice of nextG
  b)  end g first, extend this graph with the rest of string
      with $ as the first choice of this extended part.

a) tryinsertE  // insert $

a -> b -> c:3 -/

4 "ab"

a -> b -> c:3 -/
          |
          $:4 -/

b) extendnext *s g  // extend g with string s

5 "abcde"

a -> b -> c:3  -> d -> e:5 -/
          |       |
          $:4 -/  $:3 -/

insertchoice p1 p2
  insert p1 as the first choice of p2

before

p2 -> ....
|
x ...

after

p2 -> ...
|
p1 -> ...
|
x ...

more examples

3 "abc"

a -> b -> c:3 -/

4 "de"

a -> b -> c:3 -/
|
d -> e:4 -/

5 "def"

a -> b -> c:3 -/
|
d -> e:4 -> f:5 -/
            |
            $:4 -/

6 "ab"

a -> b -> c:3 -/
|         |
|         $:6 -/          
|
d -> e:4 -> f:5 -/
            |
            $:4 -/

2)  Generate C program

Each character is a case, "case 'c':".  In the same level is the choice. Within each case is the next character in the same choice.  When scanning each character in a graph, "getC();" is generated.  At the end of each path in graph, check input, accept if it is a separator, "accept(token,length);",  else reject, "if( isLetterOrDigit(CH) ) return 0;".  In each "switch(CH)", if there is no match (exhaust all choices) then reject, "default: return 0;". If $ is found in a choice then "default: accept(token,length);".

Example

a -> b -> c:5 -/
          |
          $:6 -/
result in :

switch(CH) {
case 'a':
  getC();
  switch(CH) {
  case 'b':
    getC();
    switch(CH) {
    case 'c':
      getC();
      if( isLetterOrDigit(CH) ) return 0; /* reject */
      accept(tkABC,3);
      break;
    default: accept(tkAB,2);  
    }
    break;
  default: return 0;   /* reject */
  }
  break;
default: return 0;  /* reject */
}

PS.  This program is written beautifully.  The logic is clear and the code is compact. mergeG() is quite a gem.

1 October 2004