Som version 2.0 : Som-in-Som


Introduction

This version is an evolution of Som system to be self-replicating.  That is the system can generate itself with minimum support from a host language.  This is achieved by writing most of Som system (lexical analyser, compiler, code generator) in Som.  Only the eval function must be written in the host language (C). This approach has been used to port a compiler to a new platform since the early days of computer science, for example, Pascal compiler.  However, the interesting point is not "porting" a system to another platform.  It is the ability to "self-replicate" that I try to achieve.

This Som system comprises of the whole Som in object format (som-code, "som.obj") that is loaded into the memory and is executed by eval() (written in C). This image plus eval-in-C works as Som compiler.  The first image "som.obj" is generated by Som itself.  However, this is not done in this release.  The present "som.obj" is generated from a modified Som v1.  There are minor differences that are needed to be resolved before it is truely self-replicate.

Som-in-Som

The source in Som comprises of

lib.txt                  library, mostly named syscalls
string-s.txt           string functions to handle Som-string
compile-h-s.txt    main header and global variables declaration
list-s.txt              list functions : cons, car, cdr
symtab5-s.txt      symbol table
token-s.txt          lexical analyser
parse-h-s.txt       parser header, forward function declaration
stmt-s.txt           functions to support parser
parse.som          parser, generated from a similar generator in som v1
icode-s.txt         functions to support generating som-code
gencode-s.txt     som-code generator
main-s.txt          main functions

The improvements in this version

1   The symbol table now uses hash function.  This is a big improvement from v1 that used two tables with linear search.  Now it is one table with hash search (constant time).  See symtab5-s.txt for details.

2   Tail-call elimination is done by compiler.  Now, the instruction "callt" is eliminated as the compiler detects the tail-call and generates a proper parameter passing and jump back to the beginning of function body.  

3  Immediate line does not generate any code in the object.  Instead, a snapshot of memory is used.  Previously, the code generated from immediate lines try to "replay" the events that lead to a particular memory snapshot, hence, we can throw away the "replay" and use the snapshot itself.  The benefit is that we hope to eliminate many complex coding and house keeping that arise from the need to make the "replay" accurate and also the need to keep all immediate lines to the end before code generation.

4  eval-C has been made reentrant.  The subtle point in implementation is to switch between user space (CS') and system space (CS) for eval-C to work in user space.  (see eval-C, syscall 5).  This makes it possible for user programs to call eval.  Finding out where CS' is, is not straight forward as when som-som is loaded it must be initialised and started before CS' is known.  There is no way for C to know when that happens.  eval-in-som has been written and is working.  It is very clean way to execute everything is user space.  However, it is doubly-interpret (vm-over-vm) so it is not efficient. Writing a translator eval-in-som to eval-in-C has been attempted.  The result is still very experimental, not complete enough for public release.

Finer points of the implementation

1  Symbol table

One hash table is possible by recognising that shadowing global by local can be achieved by storing the shadowed (GVAR,ref) in the same attribute of local, using the field (arity,lv).  One table is conceptually cleaner.  The code is actually not shorter than two tables, compare symtab4-s.txt and symtab5-s.txt, very similar.  However one table is faster as there is no linear search.   This is done in symtab5-s.txt

The following is an explanation of symtab5-s.txt:

name will never be deleted, only its attribute, when it is only local, is deleted.  To denote the undefined name, attr = 0.  symbol table will have two fields:  symname, symatr.  symname[] stores pointer to name-strings, allocated normally and never deleted.  symatr[] stores pointer to (type,ref,arity,lv).  The tuple (type,ref,arity,lv) is allocated and only the tuple with type local and no shadowing   can be freed.  A free tuple list is kept to reuse it.  

install s1
  search and insert s1 into symname. 
  return index.  if s1 is new, its symatr = 0


enterLocal s1
  install s1
  if symatr = 0  it is a new name
    get new tuple instantiate it
  it is not new
    check type, must be GVAR else error. Can not shadow other (such as func)
    move (GVAR,ref) to (arity,lv), fill (LOCAL,ref)
    record this index in the localvar list

cleanlocal
  for i 0 lv-1
     get attribute tuple from localvar list
     must be type LOCAL (as it is in localvar list)
     check (arity,lv)  
     if arity != 0  it is shadowing a gvar
        move (arity,lv) to (type,ref)
     else
        it is pure local, delete it (move tuple to free list), symatr[.] = 0
  lv = 0

2  Tail-call elimination

See example in "hanoi.lst", the last call to mov function is turned into :

     19 Fun mov
     20 Get 4
     . . .
     64 Get 4    // mov n-1 other t2
     65 Lit 1
     66 Sub
     67 Get 1
     68 Get 2
     69 Put 2
     70 Put 3
     71 Put 4
     72 Jmp 20

The instruction "callt" is eliminated. Recognising tail-call is not straight forward.  The first attempt to do it was to change "call" to "callt" at the parse tree, then generating som-code would be straight forward.  However, after successfully coding, it is recognised that this simple idea is incorrect.  The last call may not be tail-call optimisable.  See the example:

to fac n =
  if (n == 0) 1
  else n * fac n - 1

"fac n -1" is the last call in the parse tree but it can not be changed into a loop as it is not the tail-call.

The solution came from a much simple approach, by looking at the generated som-code of a normal function definition.  The last instruction is "ret" and what is the tail call?  It is the "call" before "ret", simple !   Then changing "call" to passing parameters and jumping to the beginning is easy.  The remaining task is to update the "jmp, jf, jt" instructions that may be pointed to the old "ret" location to the new (now it is a bit further down) location.  

Sometime, looking at the problem is a different view makes the solution so obvious.

3  Snapshot

Snapshot is the image of allocated memory at the end of source processing (compiling function definitions and interpreting the immediate lines).  Its size depends on the way the program is written, the behaviour of programs before taking the snapshot.  The size can be reduced if the program is written in such a way that all dynamic allocation has not been occurred.  This can be accomplished by writing all "array" within function definitions instead of immediate lines, hence the xalloc() will occur at run-time instead of at compile-time. One thing that is stored in the image is the string constants occurred in the source file.

Snapshot is included with .obj file.  The loader can reinitialise CS and DS into a proper state before beginning the execution of Som programs.  When generating .obj file, the first function to be executed is "main", if "main" is defined.

How to use som-v2

Som-v2 can be used only in batch mode.  Options are available to produce listing and object files.  The object file can be loaded and executed.  The format of an object file is different from som-v1.  See the file "obj-format.txt".  Presently, the only input/output from Som system is stdin and stdout so the command line must redirect a file to/from Som input/output.  

Input/Output functions through syscall:
  print, printc getchar, gets

Examples in "test" directory

bubble.txt        bubble sort  20..1  to 1..20
array.txt        fill in an array 0..9 and print it
hanoi.txt        solve tower of hanoi for 3 disks
matmul.txt    multiply two 4 by 4 matrices
perm.txt        permute the number 0,1,2,3
queen.txt        solve all solutions (92 boards) of 8-queen
quick.txt        quick sort 20..1 to 1..20
sieve.txt        find all prime numbers less than 1000 (168 primes)

Sample session

The executable "som.exe" is in "test" directory.  The image "som.obj" must be in the same directory.  To display a short help message:

C:\som-v2\test> som -?

som < file            run
som -x < obj          run from obj
som -o < file > obj   produce obj
som -l < file > lst   produce lst

This will run the quicksort:

C:\som-v2\test> som < quick.txt
+print
+space
+nl
+printc
Warning: N new global
Warning: a new global
+inita
+show
+swap
+partition
+quicksort
+main
20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

To produce an object file and run it:

C:\som-v2\test> som -o < array.txt > array.obj

C:\som-v2\test> som -x < array.obj
0 1 2 3 4 5 6 7 8 9

To produce a listing (som-code):

C:\som-v2\test> som -l < array.txt
1 Call main
2 End
3 Fun print
4 Get 1
5 Sys 1
6 Ret 2
7 Fun space
8 Lit 32
9 Sys 2
10 Ret 1
11 Fun fill
12 Lit 0
13 Put 1
...
54 Ld max
55 Call pr
56 Ret 1

Caveat

This version is still not self-replicate. It is very close.  It is possible to make changes to the compiler itself (in "som" directory) and regenerate "som.obj".  Some subtle difficulty must be overcome so that the output "som.obj" is exactly the same as the first image "som.obj".

31 December 2004
Happy New Year!
P. Chongstitvatana