Compiler: A short introduction

Aim: We learn how a compiler work. We study theories behind the translation (compile) methods. We write a specification of a simple language (a grammar). We build a simple compiler from prefabricated parts (my compiler here).

Theory of Grammar, especially Context-Free Grammar ( ppt )
Parser, top down, LL(1) ( ppt )
Write a grammar
Compiler package rz37ds-1.zip
S2 processor and its instruction set

Building a compiler

The first run, compile the whole compiler and test that it works.
1) unzip the rz37ds-1.zip package.
2) the compiler is in the directory /comp
3) use your tools (C compiler) to compile it. Let the compiler executable be "rz37".
4) this compiler compiles an input file of Rz language

see what is Rz language here

and it produces an assembly language output (to be translate to machine codes and run on an appropriate processor).

Example

Rz input file:
// a test file, Rz language

ax[10]

findmax(a)
max = a[0]
i = 1
while( i < 10 )
    if( max < a[i] )
       max = a[i]
    i = i + 1
return max

main()
print(findmax(&ax))

// -------------------

run this command

c>test/rz37 max.txt

and you will get this output on the screen. It is the assembly language of S2 processor. You can read about S2 processor and its instruction set here.

.symbol ... ax 5000 .code 100 ... st r1 20 jal rads main trap r0 #0 ... :findmax st r1 @1 fp ... pop sp r1 ; (= #2 (vec #1 0 )) mov r5 #0 ld r4 +r1 r5 mov r2 r4 ; (= #3 1 ) mov r3 #1 ; while ; (< #3 10 ) jmp L102 :L103 ; if ; (< #2 (vec #1 #3 )) ld r5 +r1 r3 lt r4 r2 r5 jf r4 L104 ; (= #2 (vec #1 #3 )) ld r4 +r1 r3 mov r2 r4 :L104 ; (= #3 (+ #3 1 )) add r3 r3 #1 :L102 lt r4 r3 #10 jt r4 L103 ; end while ; (return #3 ) mov retval r3 ... ld r1 @1 fp ret rads ; function :main st r1 @1 fp add fp fp #2 st rads @0 fp ; (print (call findmax (& ax ))) mov r1 #ax push sp r1 jal rads findmax trap r28 #1 ld rads @0 fp sub fp fp #2 ld r1 @1 fp ret rads .data 5000 .end ; ---------------------------

Rebuilding the compiler for a new language

To understand the compiler, we start with the first phase of compilation, lexical analyser (scanner). Scanner reads the input stream and recognises the words of the language (words are defined by Regular Expression). It returns "token", an internal representation (just integer in our compiler).

You try the scanner to see how it works by running this function. mylex() is the function that reads input and returns token.

void testlex(void){ while(tok != tkEOF){ prtoken(tok); printf(" '%d ",tok); mylex(); } }

int main( int argc, char *argv[] ){

      ...

          readinfile(source);

          testlex();

          return 0;

      }

This function is inserted into the main compiler file "compile.c". Rebuild the compiler and run it on an input file (above). This is the output on the screen. The word is displayed follow by its integer representation.

D:\prabhas\bag\rz\rz37ds\test>rz37 max.txt ax '10 [ '74 10 '11 ] '75 ; '71 findmax '10 ( '72 a '10 ) '73 { '76 max '10 = '5 4 a '10 [ '74 0 '11 ] '75 ; '71 i '10 = '54 1 '11 ; '71 while '80 ( '72 i '10 < '62 10 '11 ) '73 { '76 if '78 ( '72 max '10 < '62 a '10 [ '74 i '10 ] '75 ) '73 { '76 max '10 = '54 a '10 [ '74 i '10 ] '75 ; '71 } '77 i '10 = '54 i '10 + '53 1 '11 ; '71 } '77 return '81 i '10 ; '71 } '77 main '10 ( '72 ) '73 { '76 print '82 ( '72 findmax '10 ( '72 & '56 ax '10 ) '73 ) '73 ; '71 } '77 D:\prabhas\bag\rz\rz37ds\test>

The next exploration is to change the parser so that it can read a new input language. Let's start with defining the language. Our language consists of just an expression (similar to expression in an ordinary programming language). These are examples of sentences in this language.

a + b;

a + 1 + b;

To define a language, we write a Context Free Grammar of that language.

pass = stmt pass | tkEOF ; parse each statement until end of file stmt = expr tkSEMI ; each stmt is expression follows by ; expr = term exprs ; each expr is a term or a term, binary operator, another term exprs = bop term exprs | nil term = tkIDEN | tkNUMBER ; each term is an identifier or a number bop = tkPLUS ; operator is +

We use a parser generator (directory /pgen-c/pgen.exe) to generate a parser in C language from this grammar. The input grammar needs a few additional details, you can see in the file "grammar.txt". The parser generator reads input grammar from the file "grammar.txt". The output ("parse.c") looks like this:

D:\prabhas\bag\rz\rz37ds\pgen-c\ds>pgen > parse.c

// parser generated from grammar
...
int pass(void){
loop:
if( stmt() ){
goto loop;}
if( tok == tkEOF ){
mylex();
return 1; }
return 0;
}
int stmt(void){
...
}

Now, we rebuild the compiler by replacing the original "parse.c" by this new "parse.c". We have to change mylex() in "lex.c" (to skip the preprocessing step). Change main() and rebuild the compiler and use it to compile our language input.

file "lex.c"

void mylex(void){ prtoken(tok); printf(" "); mylex2(); }

file "compile.c"

int main( int argc, char *argv[] ){ ... readinfile(source); f = pass(); if( f ) printf("yes\n"); return 0; }

Here is the output on the screen:

D:\prabhas\bag\rz\rz37ds\test>rz37 t1.txt a + 1 + b ; eof yes

< to be continued >

last update 11 Aug 2013