Example of Writing Grammar to Specify a Language
We will do one example to learn how to write a grammar of a language.
Here is an assembly language source file.
; test xas
symbol
sp r10 ; stack pointer
base 100
dd 2
code 100
add r1 r2
jmp exit
ld base
ld 200
:exit
add r2 r3
inc r4 1
inc r3 dd
jt exit
data 200
10 20
end
We will write a grammar to specify this ASM langauge. Assuming that
the comment line that begins with ";" will be removed by the scanner.
Here is the grammar.
program = symsec codesec datasec
symsec = tkSYMBOL sympair
sympair = tkNAME symval sympair |
nil
codesec = tkCODE tkNUMBER codeline
codeline = tkLABEL oparg | oparg | nil
oparg = opargs codeline
datasec = tkDATA tkNUMBER dataline
dataline =
tkNUMBER
dataline |
tkEND
Where tkXXX
denotes a terminal symbol; nil
is an
empty string.
These non-terminal symbols are special:
symval
the scanner gets either: a name, a register name (r0, r1...), a constant
opargs
parse the line: opcode operand ... (the machine instruction). We
will not get into the details here.
These terminal symbols are more general:
tkNAME
match any non-reserved word (reserved words are all terminal symbols),
also keep the string in the symbol table.
tkLABEL
match string begins with ":" (a label), also keep it in the symbol table
tkNUMBER
match an
integer
ASM grammar reads this way.
A program consists of three sections: symbol section, code section and data
section.
The symbol section begins with "symbol" follows by any number of symbol
pair.
A symbol pair consists of a name and a symbol-value.
The code section begins with "code" follows by a number follows by any
number of code-line.
A code-line is either
1) begin with "label" follows by "opcode-arguments" or
2) "opcode-arguments" (no label)
Where "opcode-arguments" is the actual machine code. We will not get
into details here.
The data section begins with "data" follows by a number follows by any
number of data-line terminated by "end"
A data-line is a number.
Please note how we write a right recursive grammar to represent "any number
of ...". We need a "nil" to denote a finite repeat (that it will
end). Here is the grammar for "any number of sympair".
symsec = tkSYMBOL sympair
sympair = tkNAME symval sympair |
nil
From this grammar we can generate a parser (in any language)
automatically. Here is an excerpt from the grammar of ASM
language (in C).
int program(void){
if( symsec() ){
commit(codesec());
commit(datasec());
return 1;
}
return 0;
}
int symsec(void){
if( tok == tkSYMBOL ){
mylex();
commit(sympair());
return 1;
}
return 0;
}
int sympair(void){
loop:
if( tok == tkNAME ){
pusht(tokstring);
mylex();
commit(symval());
goto loop;
}
return 1;
}
. . .
Where mylex()
is the scan function. It reads the next
token. tok
is the current token. commit()
calls a function and makes sure that it returns true. "pusht()
"
is an extra "action routine" that is defined outside the grammar. It
is necessary so that the parser can generate output. In this particular
example, the parser saves the string of the token in a stack to be used
later.
Good luck with your grammar writing!
last update 2 Sept 2014