Natural Language Processing

This note explains how to write a top-down parser of natural language.  A parsing algorithm uses grammatical rules to search for a combination of rules that describe the structure of the input sentence.  A top-down parser starts from the starting rule and rewrite it step by step into symbols that match words in the input sentence.

This is an example to illustrate the relationship between words and their structure.

Grammar

S -> NP VP
NP -> ART N
NP -> ART ADJ N
VP -> V
VP -> V NP

Lexicon

cried : V
dogs: N, V
the: ART
old: ADJ, N
man: N, V

I will "draw" structure as a list.  The first element of the list is the name of rule.

"The dogs cried"

(S (NP (ART the) (N dogs)) (VP cried))

"The old man cried"

(S (NP (ART the) (ADJ old) (N man)) (VP cried))

A naive top-down algorithm works on the list structure.  The whole parsing state is kept in a list which each item is a possible parse. The first element is the current state.  The rest of the list keep all alternative parse.  An element stores the rule and the associate input. Given the input "The dogs cried", here is the starting state:

(S (The dogs cried))

Rewriting is to change the first rule (symbol) to its right-hand rule.  S -> NP VP

((NP VP) (The dogs cried))

If the first symbol is a lexicon (V, N, ART, ADJ), match it with the first word in the input.  If match then throw away both lexicon and word.  Keep rewriting until all matches or fail.

Next rewrite is NP -> ART N

((ART N VP) (The dogs cried))

Now the first symbol is a lexicon, match it with the input.

ART matches "The"

((N VP)(dogs cried))

...

When the current state fails, try another alternative from the list of all possible parses.  If there is nothing left, then the whole parse fails.

Let's Alt denote the list of all possible parse, Cur denotes the current state, first() picks the first element of the list, rest() returns the list without the first element. Start with Alt = ((S (the dogs cried)).

Here is the naive top-down parsing algorithm:

1 If Alt is empty, fail
2 Cur = first(Alt), if Cur has empty rule and empty input then succeed.
3 Generate next possible state
3.1  if first(Cur) is a lexicon L
          if match  L with input then next state is throw away L and first word of input
          else backup by goto step 2
3.2  if first(Cur) is a rule
          next state is rewrite that rule, add all alternatives to Alt, goto step 2

Here is the result from running the above algorithm in full:

(((S )("the" "dogs" "cried" )))
rule: S -> ((NP VP ))
(((NP VP )("the" "dogs" "cried" )))
rule: NP -> ((ART N )(ART ADJ N ))
(((ART N VP )("the" "dogs" "cried" ))((ART ADJ N VP )("the" "dogs" "cried" )))
lex: ART -> ("the" )
match: "the"
(((N VP )("dogs" "cried" ))((ART ADJ N VP )("the" "dogs" "cried" )))
lex: N -> ("man" "dogs" "old" )
match: "dogs"
(((VP )("cried" ))((ART ADJ N VP )("the" "dogs" "cried" )))
rule: VP -> ((V )(V NP ))
(((V )("cried" ))((V NP )("cried" ))((ART ADJ N VP )("the" "dogs" "cried" )))
lex: V -> ("man" "cried" "dogs" )
match: "cried"
yes

The following example shows how the parser tries alternatives. When the current rule does not match, the alternative is used.  There are a lot of "backup" from this example:

(((S )("the" "old" "man" "cried" )))
rule: S -> ((NP VP ))
(((NP VP )("the" "old" "man" "cried" )))
rule: NP -> ((ART N )(ART ADJ N ))
(((ART N VP )("the" "old" "man" "cried" ))((ART ADJ N VP )("the" "old" "man" "cried" )))
lex: ART -> ("the" )
match: "the"
(((N VP )("old" "man" "cried" ))((ART ADJ N VP )("the" "old" "man" "cried" )))
lex: N -> ("man" "dogs" "old" )
match: "old"
(((VP )("man" "cried" ))((ART ADJ N VP )("the" "old" "man" "cried" )))
rule: VP -> ((V )(V NP ))
(((V )("man" "cried" ))((V NP )("man" "cried" ))((ART ADJ N VP )("the" "old" "man" "cried" )))
lex: V -> ("man" "cried" "dogs" )
match: "man"
( ((V NP )("man" "cried" ))((ART ADJ N VP )("the" "old" "man" "cried" )))
(((V NP )("man" "cried" ))((ART ADJ N VP )("the" "old" "man" "cried" )))
lex: V -> ("man" "cried" "dogs" )
match: "man"
(((NP )("cried" ))((ART ADJ N VP )("the" "old" "man" "cried" )))
rule: NP -> ((ART N ART )(ART ADJ N ADJ ART ))
(((ART N ART )("cried" ))((ART ADJ N ADJ ART )("cried" ))((ART ADJ N VP )("the" "old" "man" "cried" )))
lex: ART -> ("the" )
(((ART ADJ N ADJ ART )("cried" ))((ART ADJ N VP )("the" "old" "man" "cried" )))
lex: ART -> ("the" )
(((ART ADJ N VP )("the" "old" "man" "cried" )))
lex: ART -> ("the" )
match: "the"
(((ADJ N VP )("old" "man" "cried" )))
lex: ADJ -> ("old" )
match: "old"
(((N VP )("man" "cried" )))
lex: N -> ("man" "dogs" "old" )
match: "man"
(((VP )("cried" )))
rule: VP -> ((V )(V NP V ))
(((V )("cried" ))((V NP V )("cried" )))
lex: V -> ("man" "cried" "dogs" )
match: "cried"
yes

My next article explains how to write a top-down parser that shows the results like above.

last update 20 Nov 2012