som version 2.0

This version is an evolution of Som system to be self-replicating.  That is the system can generate itself with minimum support from a host language.  This is achieved by writing most of Som system (lexical analyser, compiler, code generator) in Som.  Only the eval function must be written in the host language (C). This approach has been used to port a compiler to a new platform since the early days of computer science, for example, Pascal compiler.  However, the interesting point is not "porting" a system to another platform.  It is the ability to "self-replicate" that I try to achieve.

This Som system comprises of the whole Som in object format (som-code, "som.obj") that is loaded into the memory and is executed by eval() (written in C). This image plus eval-in-C works as Som compiler.  The first image "som.obj" is generated by Som itself.  However, this is not done in this release.  The present "som.obj" is generated from a modified Som v1.  There are minor differences that are needed to be resolved before it is truely self-replicate.

Som-in-Som

The source in Som comprises of

lib.txt		library, mostly named syscalls
string-s.txt	string functions to handle Som-string
compile-h-s.txt	main header and global variables declaration
list-s.txt		list functions : cons, car, cdr
symtab5-s.txt	symbol table
token-s.txt	lexical analyser
parse-h-s.txt	parser header, forward function declaration
stmt-s.txt		functions to support parser
parse.som	parser, generated from a similar generator in som v1
icode-s.txt	functions to support generating som-code
gencode-s.txt	som-code generator
main-s.txt		main functions

The improvements in this version

1   The symbol table now uses hash function.  This is a big improvement from v1 that used two tables with linear search.  Now it is one table with hash search (constant time).  See symtab5-s.txt for details.

2   Tail-call elimination is done by compiler.  Now, the instruction "callt" is eliminated as the compiler detects the tail-call and generates a proper parameter passing and jump back to the beginning of function body.  

3  Immediate line does not generate any code in the object.  Instead, a snapshot of memory is used.  Previously, the code generated from immediate lines try to "replay" the events that lead to a particular memory snapshot, hence, we can throw away the "replay" and use the snapshot itself.  The benefit is that we hope to eliminate many complex coding and house keeping that arise from the need to make the "replay" accurate and also the need to keep all immediate lines to the end before code generation.

4  eval-C must be made reentrant.  The subtle point in implementation is to switch between user space (CS') and system space (CS) for eval-C to work in user space.  (see eval-C, syscall 5).  Finding out where CS' is, is not straight forward as when som-som is loaded it must be initialised and started before CS' is known.  There is no way for C to know when that happens.

Finer points of the implementation

1  Symbol table

One hash table is possible by recognising that shadowing global by local can be achieved by storing the shadowed (GVAR,ref) in the same attribute of local, using the field (arity,lv).  One table is conceptually cleaner.  The code is actually not shorter than two tables, compare symtab4-s.txt and symtab5-s.txt, very similar.  However one table is faster as there is no linear search.   This is done in symtab5-s.txt

The following is an explanation of symtab5-s.txt:

name will never be deleted, only its attribute, when it is only local, is deleted.  To denote the undefined name, attr = 0.  symbol table will have two fields:
symname, symatr.  symname[] stores pointer to name-strings, allocated normally and never deleted.  symatr[] stores pointer to (type,ref,arity,lv).  The tuple (type,ref,arity,lv) is allocated and only the tuple with type local and no shadowing   can be freed.  A free tuple list is kept to reuse it.  

install s1
  search and insert s1 into symname.  return index.  if s1 is new, its symatr = 0

enterLocal s1
  install s1
  if symatr = 0  it is a new name 
    get new tuple instantiate it
  it is not new
    check type, must be GVAR else error. Can not shadow other (such as func)
    move (GVAR,ref) to (arity,lv), fill (LOCAL,ref)
    record this index in the localvar list

cleanlocal
  for i 0 lv-1
     get attribute tuple from localvar list
     must be type LOCAL (as it is in localvar list)
     check (arity,lv)  
     if arity != 0  it is shadowing a gvar
        move (arity,lv) to (type,ref)
     else
        it is pure local, delete it (move tuple to free list), symatr[.] = 0
  lv = 0

2  Tail-call elimination

See example in "hanoi.lst", the last call to mov function is turned into :

     19 Fun mov
     20 Get 4
     . . . 
     64 Get 4	// mov n-1 other t2
     65 Lit 1
     66 Sub
     67 Get 1
     68 Get 2
     69 Put 2
     70 Put 3
     71 Put 4
     72 Jmp 20

Recognising tail-call is not straight forward.  The first attempt to do it was to change "call" to "callt" at the parse tree, then generating s-code would be straight forward.  However, after successfully coding the "lastitem" in list.c, it is recognised that this simple idea is incorrect.  The last call may not be tail-call optimisable.  See the example:

to fac n =
  if (n == 0) 1
  else n * fac n - 1

"fac n -1" is the last call in the parse tree but it can not be changed into a loop as it is not the tail-call. 

The solution came from a much simple approach, by looking at the generated s-code of a normal function definition.  The last instruction is "ret" and what is the tail call?  It is the "call" before "ret", simple !   Then changing "call" to passing parameters and jumping to the beginning is easy.  The remaining task is to update the "jmp, jf, jt" instructions that may be pointed to the old "ret" location to the new (now it is a bit further down) location.  

Sometime, looking at the problem is a different view makes the solution so obvious.

3  Snapshot

Snapshot is the image of allocated memory at the end of source processing (compiling function definitions and interpreting the immediate lines).  Its size depends on the way the program is written, the behaviour of programs before taking the snapshot.  The size can be reduced if the program is written in such a way that all dynamic allocation has not been occurred.  This can be accomplished by writing all "array" within function definitions instead of immediate lines, hence the xalloc() will occur at run-time instead of at compile-time. One thing that is stored in the image is the string constants occurred in the source file.

Snapshot is included with .obj file.  The loader can reinitialise CS and DS into a proper state before beginning the execution of Som programs.  When generating .obj file, the first function to be executed is "main", if "main" is defined. 

How to use som-v2

Som-v2 can be used only in batch mode.  Options are available to produce listing and object files.  The object file can be loaded and executed.  The format of an object file is different from som-v1.  See the file "obj-format.txt".  Presently, the only input/output from Som system is stdin and stdout so the command line must redirect a file to/from Som input/output.  

Input/Output functions through syscall:
  print, printc getchar, gets

Examples in "test" directory

bubble.txt		bubble sort  20..1  to 1..20
array.txt		fill in an array 0..9 and print it
hanoi.txt		solve tower of hanoi for 3 disks
matmul.txt	multiply two 4 by 4 matrices
perm.txt		permute the number 0,1,2,3
queen.txt		solve all solutions (92 boards) of 8-queen
quick.txt		quick sort 20..1 to 1..20
sieve.txt		find all prime numbers less than 1000 (168 primes)

Sample session

The executable "som.exe" is in "test" directory.  The image "som.obj" must be in the same directory.  To display a short help message:

C:\som-v2\test> som -?

som < file                run
som -x < obj            run from obj
som -o < file > obj   produce obj
som -l < file > lst     produce lst

This will run the quicksort:

C:\som-v2\test> som < quick.txt
+print
+space
+nl
+printc
Warning: N new global
Warning: a new global
+inita
+show
+swap
+partition
+quicksort
+main
20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

To produce an object file and run it:

C:\som-v2\test> som -o < array.txt > array.obj

C:\som-v2\test> som -x < array.obj
0 1 2 3 4 5 6 7 8 9

To produce a listing (som-code):

C:\som-v2\test> som -l < array.txt
1 Call main
2 End
3 Fun print
4 Get 1
5 Sys 1
6 Ret 2
7 Fun space
8 Lit 32
9 Sys 2
10 Ret 1
11 Fun fill
12 Lit 0
13 Put 1
...
54 Ld max
55 Call pr
56 Ret 1

Caveat

This version is still not self-replicate. It is very close.  It is possible to make changes to the compiler itself (in "som" directory) and regenerate "som.obj".  Some subtle difficulty must be overcome so that the output "som.obj" is exactly the same as the first image "som.obj".

31 December 2004
Happy New Year!
P. Chongstitvatana

