Top Down Parser

This note explains an implementation of a naive top-down parser in C. I use list as main data structure. The way it works is more like a LISP programming in C (if you know LISP). I think it is quite natural this way. The reason I do not use "Artificial Intelligence" language such as LISP, Prolog or Python (as recommended by Peter Norvig) is that it takes too much time to introduce a new programming language to a class that is already crammed with AI contents.

Introduction to list

A list contains variable number of elements (which can also be another list). For illustration I will use only "name" (a string) and "number" (integer) as the basic data element (or called it an atom as opposed to a list). The function name("...") creates a string element and number(..) creates a number element. NIL is a special atom denoting an empty element. Construction of a list is by "cons"ing elements together. The last element of a list is a NIL.

a = cons(number(1), cons(number(2),NIL))

(1 2)

b = cons(number(3), cons(name("abc"),NIL))

(3 "abc")

c = cons(a,cons(b,NIL))

c = ((1 2) (3 "abc"))

Selector picks a list apart. head() picks the first element. tail() produces the list without the first element.

head(a) is 1

tail(a) is (2)

Remember this invariant. If x is a list,

cons(head(x), tail(x)) == x

Data structure of the parser

Two new kinds of atoms are introduced: rule, lex. makerule(S) produces S as an atom. makelex(V) creates an atom V. (where S, V are defined as constants in C).

A rule is stored as a list of (left-hand-side right-hand-side). When rhs has more than one rule, it is a list of rules.

S -> NP VP is (S (NP VP))

A grammar is a list of rule.

S -> NP VP
NP -> ART N
is ((S (NP VP)) (NP (ART N)))

A lexicon is a list of lex and name.

N -> dogs, man is (N ("dogs" "man"))

An example of grammar

Grammar

S -> NP VP
NP -> ART N
NP -> ART ADJ N
VP -> V
VP -> V NP

Lexicon

cried : V
dogs: N, V
the: ART
old: ADJ, N
man: N, V

This above grammar and lexicon are stored as an association list. An association is a pair of key and attribute. For a grammar, key is the rule name. attribute is its right-hand-side. For a lexicon, key is lexicon name, attribute is list of words.

greammar ((S (NP VP)) (NP (ART N)(ART ADJ N)) (VP (V)(V NP))) lexicon ((V ("cried" "dogs" "man")) (N ("dogs" "old" "man")) (ART ("the"))(ADJ ("old")))

Before we can run the parser algorithm, we need to be able to
1) search a rule for its right-hand-side (in order to rewrite a symbol)
2) search a lexicon for its dictionary of words (in order to match it with the input word)

Searching an association list

Given a key and an asso list, returns the attribute corresponding to the key. second(x) is defined as head(tail(x))

asso(key, ls)
   if ls == NIL return NIL
   a = head(ls)       // a is the first pair
   if head(a) == key return second(a)
   asso(key, tail(ls))

We recurse on the rest of the asso list. input the grammar above

asso(NP, grammar) returns ((ART N)(ART ADJ N))
asso(ART,lexicon) returns ("the")

Rewrite a rule

Rewriting is substituting a symbol (lhs) by its right-hand-side rules. If there are more than one alternative, we want to generate all alternatives. Here is an example

NP -> ART N
NP -> ART ADJ N

In the grammar it is stored like this.

(NP (ART N) (ART ADJ N))

If we have a state:

(NP VP)

We substitute NP by its rhs. The output is:

((ART N VP) (ART ADJ N VP))

The substitution is written as:

subs( state )
rhs = asso(head(state), grammar)
return subs2(rhs, tail(state))

subs2( rules, state )
if rules == NIL return NIL
return cons(merge(head(rules),state),
subs2(tail(rules),state))

Let's do the above example:

state is (NP VP)

subs( (NP VP ) )
rhs of NP is ((ART N) (ART ADJ N))
subs2( rhs, (VP) )
     a = merge( (ART N), (VP) )
     b = subs2( ((ART ADJ N)), (VP))
     cons(a,b)

output: ((ART N VP) (ART ADJ N VP))

where merge( L1, L2) produces (L1 L2)
merge( (ART N), (VP )) => (ART N VP)

Parser

In the actual parser, a list that stored all possible alternatives is called "states". A state contains rules and input. For example, at the beginning, "states" are:

(((S) ("the" "dogs" "cried")))

The starting symbol is (S). The input is ("the" "dogs" "cried"). A starting state is ((S)("the"...)) and there is no alternative. The first element of "states" is called the current state.

Now the main parser is:

1. if states == NIL return NO
2. s1 = head(states)          // s1 is the current state
3. if success(s1) return YES
    // let t by the first symbol
4. if t is a lexicon
5.    rhs = asso(t,lexicon)
6.    if( match(rhs,head(input)) )
7.      states = cons(...,tail(states))
8.    else         states = tail(states) // back up, do alternatives
9. else if t is a rule
10.   rhs = asso(t, grammar)
11.   states = merge(subs2(rhs,tail(rule),input),tail(states))
12. else
      states = tail(states)    // all fail, back up

We repeat these steps until YES or NO.

Line 1 checks if there is no more alternative, it fails. Line 3 checks if the current state is a success return YES. Now check the first symbol. Line 4-8 are for a lexicon. Line 6 checks the matching of rhs of the lexicon with the first word of input. It they matched, the next state is created by throwing away the first symbol (t) and the first word of input (Line 7). If they do not match, do the alternative by throwing away the current state and get the next state from the list of all alternatives (Line 8).

Line 9-11 are for a rule. Line 11 does a rewrite of the current symbol (t) using the substitution function (subs2). Line 12 is a catch-all when something wrong, just throwaway the current state and back up.

Let feed an input and see the parser in action.

states = (((S )("the" "dogs" "cried" )))
parse( states )
2> s1 = ((S )("the" "dogs" "cried" ))
   t = S
   t is a rule
10> rhs of S is ((NP VP))
11> states = merge(
     subs2((NP VP), NIL, ("the" "dogs" ..)), NIL))

Line 11, merge(subs2(..),.. ) with tail(states) which is currently NIL. The subs2(rhs,NIL,("the"..)) rewrites S -> (NP VP). The result is the states(((NP VP )("the" "dogs" "cried" ))) and then repeat from Line 1 again.

The next symbol of the current state is NP. It is a rule.

10> rhs: NP -> ((ART N)(ART ADJ N))
11> states = merge(
subs2(((ART N)(ART ADJ N)),(VP),("the"...)),NIL)

The states are now(((ART N VP )("the" "dogs" "cried" ))((ART ADJ N VP )("the" "dogs" "cried" )))

The next round, the first symbol is ART. It is a lexicon.

5> rhs: ART ("the")
6> match(("the"),"the")
7> generate the next states

The states are now(((N VP )("dogs" "cried" ))((ART ADJ N VP )("the" "dogs" "cried" )))

Notice the current state (the head of states). ART has been throwaway including the first word of input ("the"). The steps are repeated and succeed in the end. You can read the actual program of the parser here.

last update 21 Nov 2012