Example of Writing Grammar to Specify a Language
    We will do one example to learn how to write a grammar of a language. 
    Here is an assembly language source file.
    
    ; test xas
      
      symbol
        sp r10    ; stack pointer
        base 100
        dd 2
      code 100
        add r1 r2
        jmp exit
        ld base
        ld 200
      :exit
        add r2 r3
        inc r4 1
        inc r3 dd
        jt exit
      data 200
        10 20
      end
    
    We will write a grammar to specify this ASM langauge.  Assuming that
    the comment line that begins with ";" will be removed by the scanner. 
    Here is the grammar.
    
    program = symsec codesec datasec
      symsec = tkSYMBOL sympair
      sympair = tkNAME symval sympair |
        nil
      codesec = tkCODE tkNUMBER codeline
      codeline = tkLABEL oparg | oparg | nil 
      oparg = opargs codeline
      datasec = tkDATA tkNUMBER dataline 
      dataline =  tkNUMBER dataline |
        tkEND 
      
    
    Where tkXXX denotes a terminal symbol; nil is an
    empty string.
    
    These non-terminal symbols are special: 
        symval       
    the scanner gets either: a name, a register name (r0, r1...), a constant
        opargs       
    parse the line: opcode operand ...  (the machine instruction).  We
    will not get into the details here.
    
    These terminal symbols are more general:
        tkNAME        
    match any non-reserved word  (reserved words are all terminal symbols),
    also keep the string in the symbol table.
        tkLABEL      
    match string begins with ":" (a label), also keep it in the symbol table
        tkNUMBER     match an
    integer
    
    ASM grammar reads this way.
    A program consists of three sections: symbol section, code section and data
    section.  
    The symbol section begins with "symbol" follows by any number of symbol
    pair.  
    A symbol pair consists of a name and a symbol-value.
    The code section begins with "code" follows by a number follows by any
    number of code-line.
    A code-line is either 
      1)  begin with "label" follows by "opcode-arguments"  or
      2)  "opcode-arguments"   (no label) 
    Where "opcode-arguments" is the actual machine code.  We will not get
    into details here.
    The data section begins with "data" follows by a number follows by any
    number of data-line terminated by "end"
    A data-line is a number.
    
    Please note how we write a right recursive grammar to represent "any number
    of ...".  We need a "nil" to denote a finite repeat (that it will
    end).  Here is the grammar for "any number of sympair".
    symsec = tkSYMBOL sympair
      sympair = tkNAME symval sympair |
        nil
    
    From this grammar we can generate a parser (in any language)
    automatically.   Here is an excerpt from the grammar of ASM
    language (in C).
    
    int program(void){
          if( symsec() ){
              commit(codesec());
              commit(datasec());
              return 1;
          }
          return 0;
      }
      int symsec(void){
          if( tok == tkSYMBOL ){
              mylex();
              commit(sympair());
              return 1;
          }
          return 0;
      }
      int sympair(void){
      loop:
          if( tok == tkNAME ){
              pusht(tokstring);
              mylex();
              commit(symval());
              goto loop;
          }
          return 1;
      }
        . . .
        
      
    Where mylex() is the scan function. It reads the next
    token.  tok is the current token.  commit()
    calls a function and makes sure that it returns true. "pusht()"
    is an extra "action routine" that is defined outside the grammar.  It
    is necessary so that the parser can generate output. In this particular
    example, the parser saves the string of the token in a stack to be used
    later.
    
    Good luck with your grammar writing!
    
    last update 2 Sept 2014