Som version 2.0 : Som-in-Som
Introduction
This version is an evolution of Som system to be
self-replicating. That is the system can generate itself with
minimum support from a host language. This is achieved by writing
most of Som system (lexical analyser, compiler, code generator) in
Som. Only the eval function must be written in the host language
(C). This approach has been used to port a compiler to a new platform
since the early days of computer science, for example, Pascal
compiler. However, the interesting point is not "porting" a
system to another platform. It is the ability to "self-replicate"
that I try to achieve.
This Som system comprises of the whole Som in object format (som-code,
"som.obj") that is loaded into the memory and is executed by eval()
(written in C). This image plus eval-in-C works as Som compiler.
The first image "som.obj" is generated by Som itself. However,
this is not done in this release. The present "som.obj" is
generated from a modified Som v1. There are minor differences
that are needed to be resolved before it is truely self-replicate.
Som-in-Som
The source in Som comprises of
lib.txt
library, mostly named syscalls
string-s.txt string
functions to handle Som-string
compile-h-s.txt main header and global variables
declaration
list-s.txt
list functions : cons, car, cdr
symtab5-s.txt symbol table
token-s.txt lexical
analyser
parse-h-s.txt parser header, forward
function declaration
stmt-s.txt functions
to support parser
parse.som parser,
generated from a similar generator in som v1
icode-s.txt functions to
support generating som-code
gencode-s.txt som-code generator
main-s.txt main functions
The improvements in this version
1 The symbol table now uses hash function. This is a
big improvement from v1 that used two tables with linear search.
Now it is one table with hash search (constant time). See
symtab5-s.txt for details.
2 Tail-call elimination is done by compiler. Now, the
instruction "callt" is eliminated as the compiler detects the tail-call
and generates a proper parameter passing and jump back to the beginning
of function body.
3 Immediate line does not generate any code in the object.
Instead, a snapshot of memory is used. Previously, the code
generated from immediate lines try to "replay" the events that lead to
a particular memory snapshot, hence, we can throw away the "replay" and
use the snapshot itself. The benefit is that we hope to eliminate
many complex coding and house keeping that arise from the need to make
the "replay" accurate and also the need to keep all immediate lines to
the end before code generation.
4 eval-C has been made reentrant. The subtle point in
implementation is to switch between user space (CS') and system space
(CS) for eval-C to work in user space. (see eval-C, syscall
5). This makes it possible for user programs to call eval.
Finding out where CS' is, is not straight forward as when som-som is
loaded it must be initialised and started before CS' is known.
There is no way for C to know when that happens. eval-in-som has
been written and is working. It is very clean way to execute
everything is user space. However, it is doubly-interpret
(vm-over-vm) so it is not efficient. Writing a translator eval-in-som
to eval-in-C has been attempted. The result is still very
experimental, not complete enough for public release.
Finer points of the
implementation
1 Symbol table
One hash table is possible by recognising that shadowing global by
local can be achieved by storing the shadowed (GVAR,ref) in the same
attribute of local, using the field (arity,lv). One table is
conceptually cleaner. The code is actually not shorter than two
tables, compare symtab4-s.txt and symtab5-s.txt, very similar.
However one table is faster as there is no linear search.
This is done in symtab5-s.txt
The following is an explanation of symtab5-s.txt:
name will never be deleted, only its attribute, when it is only local,
is deleted. To denote the undefined name, attr = 0. symbol
table will have two fields: symname, symatr. symname[]
stores pointer to name-strings, allocated normally and never
deleted. symatr[] stores pointer to (type,ref,arity,lv).
The tuple (type,ref,arity,lv) is allocated and only the tuple with type
local and no shadowing can be freed. A free tuple
list is kept to reuse it.
install s1
search and insert s1 into
symname.
return index. if s1 is new, its symatr = 0
enterLocal s1
install s1
if symatr = 0 it is
a new name
get new tuple
instantiate it
it is not new
check type,
must be GVAR else error. Can not shadow other (such as func)
move
(GVAR,ref) to (arity,lv), fill (LOCAL,ref)
record this
index in the localvar list
cleanlocal
for i 0 lv-1
get
attribute tuple from localvar list
must be
type LOCAL (as it is in localvar list)
check
(arity,lv)
if arity
!= 0 it is shadowing a gvar
move (arity,lv) to (type,ref)
else
it is pure local, delete it (move tuple to free list), symatr[.] = 0
lv = 0
2 Tail-call
elimination
See example in "hanoi.lst", the last call to mov function is turned
into :
19 Fun
mov
20 Get 4
. . .
64 Get
4 // mov n-1 other t2
65 Lit 1
66 Sub
67 Get 1
68 Get 2
69 Put 2
70 Put 3
71 Put 4
72 Jmp 20
The instruction "callt" is eliminated. Recognising tail-call is not
straight forward. The first attempt to do it was to change "call"
to "callt" at the parse tree, then generating som-code would be
straight forward. However, after successfully coding, it is
recognised that this simple idea is incorrect. The last call may
not be tail-call optimisable. See the example:
to
fac n =
if (n == 0) 1
else n * fac n - 1
"fac n -1" is the last call in the parse tree but it can not be changed
into a loop as it is not the tail-call.
The solution came from a much simple approach, by looking at the
generated som-code of a normal function definition. The last
instruction is "ret" and what is the tail call? It is the "call"
before "ret", simple ! Then changing "call" to passing
parameters and jumping to the beginning is easy. The remaining
task is to update the "jmp, jf, jt" instructions that may be pointed to
the old "ret" location to the new (now it is a bit further down)
location.
Sometime, looking at the problem is a different view makes the solution
so obvious.
3 Snapshot
Snapshot is the image of allocated memory at the end of source
processing (compiling function definitions and interpreting the
immediate lines). Its size depends on the way the program is
written, the behaviour of programs before taking the snapshot.
The size can be reduced if the program is written in such a way that
all dynamic allocation has not been occurred. This can be
accomplished by writing all "array" within function definitions instead
of immediate lines, hence the xalloc() will occur at run-time instead
of at compile-time. One thing that is stored in the image is the string
constants occurred in the source file.
Snapshot is included with .obj file. The loader can reinitialise
CS and DS into a proper state before beginning the execution of Som
programs. When generating .obj file, the first function to be
executed is "main", if "main" is defined.
How to use som-v2
Som-v2 can be used only in batch mode. Options are available to
produce listing and object files. The object file can be loaded
and executed. The format of an object file is different from
som-v1. See the file "obj-format.txt". Presently, the only
input/output from Som system is stdin and stdout so the command line
must redirect a file to/from Som input/output.
Input/Output functions through syscall:
print, printc getchar, gets
Examples in "test" directory
bubble.txt bubble sort
20..1 to 1..20
array.txt fill in an array 0..9
and print it
hanoi.txt solve tower of hanoi for
3 disks
matmul.txt multiply two 4 by 4 matrices
perm.txt permute the number 0,1,2,3
queen.txt solve all solutions (92
boards) of 8-queen
quick.txt quick sort 20..1 to 1..20
sieve.txt find all prime numbers
less than 1000 (168 primes)
Sample session
The executable "som.exe" is in "test" directory. The image
"som.obj" must be in the same directory. To display a short help
message:
C:\som-v2\test> som -?
som <
file
run
som -x <
obj run from obj
som -o < file >
obj produce obj
som -l < file >
lst produce lst
This will run the quicksort:
C:\som-v2\test> som <
quick.txt
+print
+space
+nl
+printc
Warning: N new global
Warning: a new global
+inita
+show
+swap
+partition
+quicksort
+main
20 19 18 17 16 15 14 13 12 11 10
9 8 7 6 5 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12 13 14
15 16 17 18 19 20
To produce an object file and run it:
C:\som-v2\test> som -o <
array.txt > array.obj
C:\som-v2\test> som -x <
array.obj
0 1 2 3 4 5 6 7 8 9
To produce a listing (som-code):
C:\som-v2\test> som -l <
array.txt
1 Call main
2 End
3 Fun print
4 Get 1
5 Sys 1
6 Ret 2
7 Fun space
8 Lit 32
9 Sys 2
10 Ret 1
11 Fun fill
12 Lit 0
13 Put 1
...
54 Ld max
55 Call pr
56 Ret 1
Caveat
This version is still not self-replicate. It is very close. It is
possible to make changes to the compiler itself (in "som" directory)
and regenerate "som.obj". Some subtle difficulty must be overcome
so that the output "som.obj" is exactly the same as the first image
"som.obj".
31 December 2004
Happy New Year!
P. Chongstitvatana