Lexical generator
Lexical generator is a program that outputs a lexical analyser function
from an input binding. The input file is a binding of tokens to
their values (symbolic). See the following example:
*
tkSTAR
/ tkSLASH
- tkMINUS
+ tkPLUS
= tkEQ
== tkEQEQ
...
The generator outputs two files: token.c, token.h. The "token.c"
is a function, "int token() {...}".
token() scans a text buffer using "getC()" to find a defined token
(reserved word), it returns 1 if accept, 0 if reject. It assumes
the first character is already in CH, the next character is pointed to
by cp.
This is the output file, token.c:
/*
automatically generated from lexgen */
#include "compile.h"
int token() {
switch(CH) {
case '*': accept(tkSTAR,1);
break;
case '/':
accept(tkSLASH,1); break;
case '-':
accept(tkMINUS,1); break;
case '+': accept(tkPLUS,1);
break;
case '=':
getC();
switch(CH) {
case '=':
accept(tkEQEQ,2); break;
default: accept(tkEQ,1);
}
break;
...
case 't':
getC();
switch(CH) {
case 'o':
getC();
if(
isLetterOrDigit(CH) ) return 0;
accept(tkTO,2);
break;
default: return 0; /*
reject */
}
break;
...
The token.h is a binding of an integer value to the symbolic
name. The first integer starts at 50. The output file is:
#define
tkSTAR 50
#define tkSLASH 51
#define tkMINUS 52
...
Pseudo code
Here is the short pseudo code for the lexical generator.
main
readspec
writedoth
initG
mergeK
gen
readspec
get string (the token)
get name
(symbolic name)
writedoth
output #define name
i where i is the running number
graph (really a tree) is a list
of nodes
data stucture of each node
char info
int next
int choice
int prop
buildG from string s1
// build a graph (linear),
each node contains a char
// the head node has
Propnum, the index to string
node = getanode *s1
s1++
loop to extend node for
each char in s1
nextG =
getanode *s1
s1++
set last node propG =
Propnum
mergeK k g
// merge k-th string to
graph g
set Propnum = k
mergeG string[k] to g
mergeg
// extend g with string s
check all choices in g
if match and
endofstring, tryinsertE return
if match and
endgraph, extendnext return
// exhaust all choices
insertchoice
insertchoice p1 p2
// insert p1 into choice
of p2
tryinsertE
if g is nil return
find last choice of g
insert $ at last choice
extendnext s g
// extend g with graph of s
a = buildG s
nextG g = a
gen
outcase
if endgraph
outaccept
else
out getC();
switch(CH)
gen nextG
out break;
check choice
if nil
if is $
gen choice
else
outdefaultY
else
outdefaultN
The following paragraph explains the behaviour of the generator. It has
two main functions: build graph, and generate C program.
notation: info:propnum
a
-> next
|
choice
1) Build graph
The main routine is mergeG
mergeG
*s g // merge string s into graph g
check all choices
each choice match info with *s until no match *1
*1 two possible cases
a) end *s first, insert $ at choice of nextG
b) end g first, extend this graph with the rest of string
with $ as the first choice of this
extended part.
a) tryinsertE // insert $
a
-> b -> c:3 -/
4 "ab"
a -> b -> c:3 -/
|
$:4 -/
b) extendnext *s g //
extend g with string s
5 "abcde"
a
-> b -> c:3 -> d -> e:5 -/
| |
$:4 -/ $:3 -/
insertchoice p1 p2
insert p1 as the first choice of p2
before
p2
-> ....
|
x ...
after
p2
-> ...
|
p1 -> ...
|
x ...
more examples
3 "abc"
a
-> b -> c:3 -/
4 "de"
a
-> b -> c:3 -/
|
d -> e:4 -/
5 "def"
a
-> b -> c:3 -/
|
d -> e:4 -> f:5 -/
|
$:4 -/
6 "ab"
a
-> b -> c:3 -/
|
|
|
$:6 -/
|
d -> e:4 -> f:5 -/
|
$:4 -/
2) Generate C
program
Each character is a case, "case 'c':". In the same level is the
choice. Within each case is the next character in the same
choice. When scanning each character in a graph, "getC();" is
generated. At the end of each path in graph, check input, accept
if it is a separator, "accept(token,length);", else reject, "if(
isLetterOrDigit(CH) ) return 0;". In each "switch(CH)", if there
is no match (exhaust all choices) then reject, "default: return 0;". If
$ is found in a choice then "default: accept(token,length);".
Example
a
-> b -> c:5 -/
|
$:6 -/
result in :
switch(CH)
{
case 'a':
getC();
switch(CH) {
case 'b':
getC();
switch(CH) {
case 'c':
getC();
if( isLetterOrDigit(CH) ) return 0; /* reject */
accept(tkABC,3);
break;
default:
accept(tkAB,2);
}
break;
default: return
0; /* reject */
}
break;
default: return 0; /*
reject */
}
PS. This program is written beautifully. The logic is clear
and the code is compact. mergeG()
is quite a gem.
1 October 2004