Self Generating Systems

How to generate an assembler from scratch

We will generate an assembler without depending on any external tools (such as a compiler).  We start by writing a loader in machine code.  The processor we used is a virtual machine that accept s-code. So we write the loader in s-code.

Here is the first "image" presented in a pseudo code form:

global var
CH
CP
CS  

isNum c
  (c >= 48) & (c <= 57)

isSpace c
  (c == 32) | (c == 10) | (c < 0)

readc { c }
  c = getchar()
//  printc c
  CH = c
  ret c

readi c { n pos }
  n = c - 48
  pos = n
  while ! isSpace readc
    n = n*10 + CH - 48
  if pos != 0
    ret n
  else
    ret 0-n

// it reads input and write it to screen
boot {op arg i}
  op = readi readc
  while op != 0
    arg = readi readc
//    CS[i] = (arg << 8) | op
    print op printc 32
    print arg printc 10
//    i++
    op = readi readc

The "boot" function will read input from a standard input stream.   This program is hand translated into "assembly" form of s-code.  You can see it here ( boot1-s.txt )  and is hand compiled into "s-code" (an s-object file)  suitable to be executed by S-code virtual machine here  (  boot1.obj ).  At the end of the object file, we append an input stream, it is the test input for this loader, "boot".  Now, run it using Som virtual machine:

> som < boot1.obj
CP = 124 DP = 2000
32 17
23 0
38 1

You will see the output on the screen, reflected the input stream it was reading.  This ensure us that all written s-code work correctly.  Now we begin the second stage, write a lexical analyser to read symbols.

// after the basic functions are working
// now we write boot2 to be used to load
// subsequent functions
// i is set to the appropriate address
// (after boot2)

boot2 {op arg i}
  CS = 0
  i = 130
  op = readi readc
  while op != 0
    arg = readi readc
    CS[i] = (arg << 8) | op
    i++
    op = readi readc


Before we go further, we will test this "boot2" run reading in a simple program "print 1 + 2" and execute it to make sure our loader can load and run a program correctly.   The s-code is here (  boot2-s.txt    boot2.obj  )

CP = 159 DP = 2000
3

Now we write the lexical analyser.  It needs a symbol table to store symbols and associated values.  The symbol table uses TRIE data structure which is built on-line while reading the input stream.  See the function lex, how it is used.

// the second stage is to write
// a scanner that can read symbols
// and store them in a symbol table
// the scanner is called "lexical analyser"

// this is the symbol table
// ??? explain trie symbol table

freecell = 0    // pointer to free sym
sym = 2000    // symbol table, array 1000
tkval = 0       // value of current token

// a cell has 4 fields: char,right,next,atr
newcell c { k }
  k = freecell
  freecell = freecell + 4
  sym[k] = c
  ret k

search i c     // if sym[i]=0 insert c at i
  if sym[i] == 0
    sym[i] = newcell c
  ret sym[i]

// return index to symtab, 0 if EOF, 1 if num
lex { i }
  while isSpace readc
    if CH < 0 break  // skip blank stop at EOF
  if isNum CH
    tkval = readi CH
    ret 1         // it is a number
  i = 0
  while ! isSpace CH    // check sep, EOF
    i = search i+2 CH    // next of i
    while sym[i] != CH    // match right
      i = search i+1 CH // right of i
    CH = readc        // read next
  ret i

// we test lex by appending some symbols
// at the end and read them until found 0

testlex { i }
  sym = array 100
  i = lex
  while tkval != 0
    print i printc 32
    i = lex

// now we use "boot2" to read these codes
// into code segment and call testlex

// example of input to testlex
1 2 aa bb x
0

This section of program is appended to the boot2.obj.   You can see it here ( boot3-s.txt    boot3.obj ).  We test it by feeding some symbols (number and names), it will output the associated values, 1 for numbers, 2..n for names.

> som < boot3.obj
CP = 159 DP = 2000
1 1 4 12 16 1

At this stage, the scanner is working correctly, it can read symbols from input stream and outputs the associated values (the value of symbols).  We use it to write an assembler.

// now is the final stage
// lex starts working
// we build symbol table by reading
// "associative pair" symbol-value
// into the table

initsym { i j }
  freecell = 4
  sym = array 1000
  i = lex             // read sym
  while i > 1
    j = lex          // read value
    sym[i+3] = tkval  // set value
    i = lex          // read sym

// with a symbol table
// we can write an assembler

outc op arg
  CS[CP] = (arg << 8) | op
  CP = CP + 1

asm { i j op }         // a tiny assembler
  i = lex
  while i > 1
    op = sym[i+3]
    tkval = 0
    if op > 23 j = lex
    outc op tkval
    i = lex

main
  initsym
  CS = 301000
  CP = 370
  asm
  print fac 6
 
The assembler is very simple.  It reads input stream and output the associated values.  The input line composed of two fields: operational code (instruction) and one argument (if any).  The assembler stores appropritate code to the code segment at M[301000].  We test it by feeding at the input stream, a factorial program. After the assembler reads and assembles the factorial program, it executes it.  The code is here (  boot4-s.txt    boot4.obj ).  As before, the new code is appended to the previous one.

Here is the input stream.  The first section of the input stream is the initialised symbol (reserved words) then followed by the input program to be assembled.

add 1 sub 2 mul 3 div 4
and 5 or 6  xor 7 not 8
eq 9  ne 10 lt 11
le 12 gt 14 ge 13
shl 15 shr 16 mod 17
ldx 18 stx 19
array 22 end 23

get 24 put 25 ld 26 st 27
jmp 28 jt 29 jf 30
lit 31 call 32
inc 34 dec 35 sys 36
fun 38 ret 40
0

fun 1
get 1
lit 0
eq
jf 3
lit 1
ret 2
get 1
get 1
lit 1
sub
call 370
mul
ret 2
0

Try to run the whole thing;

> som < boot4.obj
CP  = 159  DP =  2000
720

The output displays the result of  calling  "print fac 6".   The whole pseudo code in one file is here (  boot.txt  ).   You can see the whole serie of output of the final stage here ( out.txt ).  It also echo all the read input stream. 

Enjoy the brain teaser !
Prabhas Chongstitvatana
last update 30 Mar 2007