Self Generating Systems
How to generate an assembler from
scratch
We will generate an assembler without depending on any external tools
(such as a compiler). We start by writing a loader in machine
code. The processor we used is a virtual machine that accept
s-code. So we write the loader in s-code.
Here is the first "image" presented in a pseudo code form:
global
var
CH
CP
CS
isNum c
(c >= 48) & (c
<= 57)
isSpace c
(c == 32) | (c == 10) | (c
< 0)
readc { c }
c = getchar()
// printc c
CH = c
ret c
readi c { n pos }
n = c - 48
pos = n
while ! isSpace readc
n = n*10 + CH
- 48
if pos != 0
ret n
else
ret 0-n
// it reads input and write it to
screen
boot {op arg i}
op = readi readc
while op != 0
arg = readi
readc
// CS[i] = (arg
<< 8) | op
print op
printc 32
print arg
printc 10
// i++
op = readi
readc
The "boot" function will read input from a standard input
stream. This program is hand translated into "assembly"
form of s-code. You can see it here (
boot1-s.txt ) and is hand compiled into "s-code" (an
s-object file) suitable to be executed by S-code virtual machine
here ( boot1.obj ). At the
end of the object file, we append an input stream, it is the test input
for this loader, "boot". Now, run it using Som virtual machine:
>
som < boot1.obj
CP
= 124 DP = 2000
32 17
23 0
38 1
You will see the output on the screen, reflected the input stream it
was reading. This ensure us that all written s-code work
correctly. Now we begin the second stage, write a lexical
analyser to read symbols.
//
after the basic functions are working
// now we write boot2 to be used
to load
// subsequent functions
// i is set to the appropriate
address
// (after boot2)
boot2 {op arg i}
CS = 0
i = 130
op = readi readc
while op != 0
arg = readi
readc
CS[i] = (arg
<< 8) | op
i++
op = readi
readc
Before we go further, we will test this "boot2" run reading in a simple
program "print 1 + 2" and execute it to make sure our loader can load
and run a program correctly. The s-code is here ( boot2-s.txt boot2.obj )
CP = 159 DP = 2000
3
Now we write the lexical analyser. It needs a symbol table to
store symbols and associated values. The symbol table uses TRIE
data structure which is built on-line while reading the input
stream. See the function lex, how it is used.
//
the second stage is to write
// a scanner that can read symbols
// and store them in a symbol
table
// the scanner is called "lexical
analyser"
// this is the symbol table
// ??? explain trie symbol table
freecell = 0 //
pointer to free sym
sym = 2000 //
symbol table, array 1000
tkval =
0 // value of current token
// a cell has 4 fields:
char,right,next,atr
newcell c { k }
k = freecell
freecell = freecell + 4
sym[k] = c
ret k
search i c //
if sym[i]=0 insert c at i
if sym[i] == 0
sym[i] =
newcell c
ret sym[i]
// return index to symtab, 0 if
EOF, 1 if num
lex { i }
while isSpace readc
if CH < 0
break // skip blank stop at EOF
if isNum CH
tkval = readi
CH
ret
1 // it is a number
i = 0
while ! isSpace
CH // check sep, EOF
i = search i+2
CH // next of i
while sym[i]
!= CH // match right
i
= search i+1 CH // right of i
CH =
readc // read next
ret i
// we test lex by appending some
symbols
// at the end and read them until
found 0
testlex { i }
sym = array 100
i = lex
while tkval != 0
print i printc
32
i = lex
// now we use "boot2" to read
these codes
// into code segment and call
testlex
// example of input to testlex
1 2 aa bb x
0
This section of program is appended to the boot2.obj. You
can see it here ( boot3-s.txt
boot3.obj ). We test it by feeding some
symbols (number and names), it will output the associated values, 1 for
numbers, 2..n for names.
>
som < boot3.obj
CP = 159 DP = 2000
1 1 4 12 16 1
At this stage, the scanner is working correctly, it can read symbols
from input stream and outputs the associated values (the value of
symbols). We use it to write an assembler.
//
now is the final stage
// lex starts working
// we build symbol table by
reading
// "associative pair"
symbol-value
// into the table
initsym { i j }
freecell = 4
sym = array 1000
i =
lex
// read sym
while i > 1
j =
lex // read value
sym[i+3] =
tkval // set value
i =
lex // read sym
// with a symbol table
// we can write an assembler
outc op arg
CS[CP] = (arg << 8)
| op
CP = CP + 1
asm { i j op }
// a tiny assembler
i = lex
while i > 1
op = sym[i+3]
tkval = 0
if op > 23
j = lex
outc op tkval
i = lex
main
initsym
CS = 301000
CP = 370
asm
print fac 6
The assembler is very simple. It reads input stream and output
the associated values. The input line composed of two fields:
operational code (instruction) and one argument (if any). The
assembler stores appropritate code to the code segment at
M[301000]. We test it by feeding at the input stream, a factorial
program. After the assembler reads and assembles the factorial program,
it executes it. The code is here ( boot4-s.txt
boot4.obj ). As before, the new code is
appended to the previous one.
Here is the input stream. The first section of the input stream
is the initialised symbol (reserved words) then followed by the input
program to be assembled.
add
1 sub 2 mul 3 div 4
and 5 or 6 xor 7 not 8
eq 9 ne 10 lt 11
le 12 gt 14 ge 13
shl 15 shr 16 mod 17
ldx 18 stx 19
array 22 end 23
get 24 put 25 ld 26 st 27
jmp 28 jt 29 jf 30
lit 31 call 32
inc 34 dec 35 sys 36
fun 38 ret 40
0
fun 1
get 1
lit 0
eq
jf 3
lit 1
ret 2
get 1
get 1
lit 1
sub
call 370
mul
ret 2
0
Try to run the whole thing;
>
som < boot4.obj
CP = 159 DP =
2000
720
The output displays the result of calling "print fac
6". The whole pseudo code in one file is here ( boot.txt ). You can see the
whole serie of output of the final stage here ( out.txt
). It also echo all the read input stream.
Enjoy the brain teaser !
Prabhas Chongstitvatana
last update 30 Mar 2007