Example of Writing Grammar to Specify a Language

We will do one example to learn how to write a grammar of a language.  Here is an assembly language source file.

; test xas

symbol
  sp r10    ; stack pointer
  base 100
  dd 2
code 100
  add r1 r2
  jmp exit
  ld base
  ld 200
:exit
  add r2 r3
  inc r4 1
  inc r3 dd
  jt exit
data 200
  10 20
end

We will write a grammar to specify this ASM langauge.  Assuming that the comment line that begins with ";" will be removed by the scanner.  Here is the grammar.

program = symsec codesec datasec
symsec = tkSYMBOL sympair
sympair = tkNAME symval sympair | nil
codesec = tkCODE tkNUMBER codeline
codeline = tkLABEL oparg | oparg | nil
oparg = opargs codeline
datasec = tkDATA tkNUMBER dataline
dataline =  tkNUMBER dataline | tkEND

Where tkXXX denotes a terminal symbol; nil is an empty string.

These non-terminal symbols are special:
    symval        the scanner gets either: a name, a register name (r0, r1...), a constant
    opargs        parse the line: opcode operand ...  (the machine instruction).  We will not get into the details here.

These terminal symbols are more general:
    tkNAME         match any non-reserved word  (reserved words are all terminal symbols), also keep the string in the symbol table.
    tkLABEL       match string begins with ":" (a label), also keep it in the symbol table
    tkNUMBER     match an integer

ASM grammar reads this way.
A program consists of three sections: symbol section, code section and data section. 
The symbol section begins with "symbol" follows by any number of symbol pair. 
A symbol pair consists of a name and a symbol-value.
The code section begins with "code" follows by a number follows by any number of code-line.
A code-line is either
  1)  begin with "label" follows by "opcode-arguments"  or
  2)  "opcode-arguments"   (no label)
Where "opcode-arguments" is the actual machine code.  We will not get into details here.
The data section begins with "data" follows by a number follows by any number of data-line terminated by "end"
A data-line is a number.

Please note how we write a right recursive grammar to represent "any number of ...".  We need a "nil" to denote a finite repeat (that it will end).  Here is the grammar for "any number of sympair".
symsec = tkSYMBOL sympair
sympair = tkNAME symval sympair | nil

From this grammar we can generate a parser (in any language) automatically.   Here is an excerpt from the grammar of ASM language (in C).

int program(void){
    if( symsec() ){
        commit(codesec());
        commit(datasec());
        return 1;
    }
    return 0;
}
int symsec(void){
    if( tok == tkSYMBOL ){
        mylex();
        commit(sympair());
        return 1;
    }
    return 0;
}
int sympair(void){
loop:
    if( tok == tkNAME ){
        pusht(tokstring);
        mylex();
        commit(symval());
        goto loop;
    }
    return 1;
}
. . .

Where mylex() is the scan function. It reads the next token.  tok is the current token.  commit() calls a function and makes sure that it returns true. "pusht()" is an extra "action routine" that is defined outside the grammar.  It is necessary so that the parser can generate output. In this particular example, the parser saves the string of the token in a stack to be used later.

Good luck with your grammar writing!

last update 2 Sept 2014