Report on Som language

Motivation
Operators
Grammar of Som language
System calls
String
Tuple
A syntax discussion
Readability
More example and language tutorial

Motivation

The aim of Som language is for teaching. It has been used in computer architecture class to teach how high level programming languages and machine codes are related. The whole language translation process is simple enough that students can modify it to generate code for their projects easily. Som has a familiar syntax, infix operators, and it is designed to be minimal. The internal code is a linear code similar to byte-code of popular languages such as Java. It has a simple static memory model for efficiency, and it also has dynamic allocation for flexibility. It is interactive, an expression can be typed in and is evaluated immediately. See the following example:

> print 2 + 3 nl
5
> to sq x = x * x
> print sq 4 nl
16
>

The basic element in Som is an expression. An expression returns a value, except for assignment which does not return any value. A variable is evaluated to its value. Som has a minimal set of operators as it is intended to be used as a teaching tool. It has small set of reserved words:

to, if, else, while, for, case, break, enum.

Operators are:

arithmetic: + - * / % (mod)
logic: & (bit-and) | (bit-or) ! (bit-not) ^ (bit-xor) << (shift left) >> (arithmetic shift right)
relation: == != < <= >= >
assignment: =
other: array (memory allocation)

Som has three types of variable: global, local, and array. A global variable is automatically defined when it is first appeared. A local variable's scope is in the function that it is defined. An array variable has its space allocated by calling the "array" operators and assigns the return value to the array variable. If an array variable is global, it is static. The compiler knows the base address at compile time and can perform some optimisation to achieve faster execution. An array that is defined inside a function definition, i.e. local, is dynamic. Its space is allocated in the heap. The life time of a dynamic memory depends on the use. When there is no reference to it, it is said to be garbage. The run-time system may support garbage collection. At present, Som system has no garbage collection. Som's programming environment allows using indentation for grouping expressions.

Example: a program to solve tower of hanoi problem

num = array 4     // a global array variable

// define function "mov" with 3 arguments: n, from, t
// and one local variable: other

to mov n from t | other =
if n == 1
    num[from] = num[from] - 1
    num[t] = num[t] + 1
else
    other = 6 - from - t
    mov n-1 from other
    mov 1 from t
    mov n-1 other t

interactive mode

disk = 3
num[0] = 0
num[1] = disk
num[2] = 0
num[3] = 0
mov disk 1 3

Grammar of Som language

notation:

* zero or more times
+ one or more times
[..] optional
' constant symbol

Indentation is used for grouping instead of '{ '}

toplevel -> 'to fundef | ex
fundef -> id args '= ex
args -> id* ['| id+ ]
ex -> '{ ex* '} | ex1

ex1 ->
'if ex0 [ 'else ex ] |
'while ex0 ex |
'for lvar ex0 ex0 ex |
'break |
'case ex0 caselist |
'enum '{ [ number ] id+ '}
id '= ex0 |
ex0

caselist -> caseitem | '{ caseitem+ '}
caseitem -> number ': ex | 'else ': ex

ex0 -> term term*
term ->
number | id | vec |
fun ex0* |
'! ex0 |
'array ex0 |
'( ex0 ')

vec -> id '[ ex0 ']
bop -> '+ | '- | '* | '/ | '& | '| | '^ |
'== | '!= | '< | '<= | '>= | '> | '% | '<< | '>>

There are several interesting points about Som grammar. First, it is expression-based, therefore an expression returns a value except: assignment, for, while. Second, the syntax allows very compact writing, with minimum number of separator and parentheses, for example, a semicolon at the end of statement is not necessary as all operators have known arity. Third, the language has very small vocaburary, this makes it very easy to learn. Recursion is quite natural in Som.

to fib n =
if n < 3
1
else
(fib n - 1) + (fib n - 2)

Here are some elegant examples. They show how to define some primitives using only: if , ==, <.

to and x y = if x y else 0
to or x y = if x 1 else y
to not x = if x 0 else 1
to eq x y = x == y
to neq x y = not ( x == y )
to lt x y = x < y
to le x y = or ( x < y ) ( x == y )
to gt x y = not ( le x y )
to ge x y = not ( x < y )

For, break, and case

In "for" loop, the index variable must be a local variable. "for i start end body" means

i = start
while i <= end
body
i = i + 1

See the following example of the use of "for":

// fill in an array and print it
max = 10
N = array max

// array is passed by reference
to fill ar n | i =
for i 0 n-1
ar[i] = i

to pr ar n | i =
for i 0 n-1
print ar[i] space

fill N max pr N max

The "break" has three meanings:
1) break for loop
2) break while loop
3) force return a function call.
This is a rather nice semantics as it is very consistent.

The case construction is an efficient way for a multiway branch. The label in each case is an integer. To help readability, the enum is used to give labels their symbolic names.

// 10 is the start number
enum
10 a1 a2 a3 a4 a5

to ff a =
case a
    a1: 1
    a2: 2
    a3: 3
    else: 99

print ff 13

The case label must be number, ":" allows syntax check to catch any error and to make it look more familiar. A label (id) is stored in the symbol table with a unique reference. The option [number] is used to set the starting reference otherwise it is zero. "case" is used to make an efficient inner interpreter loop in decode and dispatch each instruction.

while running
case opcode
    1 : add
    2 : sub
    ...
    else : error undef opcode

As the compare and jump in "case" uses log(n) search strategy it is faster than linear search using if..else. In fact, in this implementation, "case" uses a constant time (indexing) to go to the matched label.

while running
if opcode == 1 add
else if opcode == 2 sub
...
else error undef opcode

System calls

To enable input/output and other system functions, Som uses a primitive "syscall". Syscall has a variable number of arguments, the first one is a constant, the number that identifies the system function. Syscall is used to implement library functions such as print, printc, loadfile etc. The implementation of syscall is dependent on the platform. For a PC, syscall is implemented with the implementation language (C in this version). The following is the list of available system functions:

1 print integer
2 print character
3 get character
19 load file

example how syscall is used in the library

to print x = syscall {1 x}
to printc c = syscall {2 c}
to getchar = syscall {3}
to loadfile fn = syscall {19 fn} // fn is som-string
to nl = syscall {2 10} // 10 is a newline char

String

A string is an array of integer. An integer is 32-bit. It contains at most 4 characters, right pads with 0, terminates by an integer 0 (an extra one). This is called a packed string. It is a compromise between the ease of manipulating a string (insert and delete a character is slow) and the space efficiency (storing each character in a 32-bit integer is expensive). See the following program for string manupilation, string copy.

// copy s1 = s2
to strcpy s1 s2 | i =
i = 0
while s2[i] != 0
s1[i] = s2[i]
i = i + 1
s1[i] = 0

s1 = array 20
strcpy s1 "test string"

The compiler translated a constant string in the program text into a constant pointed to data segment storing the string. A string constant is handled at compile time. The most convenient way to implement this is to represent string as an array of integer and the pointer to this string is an address of the som-memory. The operator must be type-specific, i.e. the operator knows what type its argument is. The pointer to string is just an integer.

Tuple

Tuple is a list of variable number of arguments. There is a place where it is required to use variable number of arguments for a function call. That is, call to a system function, "syscall". The original idea is to use open stack coding. This is easy and does not require any special treatment in parsing. However, open stack coding is not good. See how confusing it can be in this code (from eval-s.txt):

xArray:           // be careful open stack coding !!!
pop             // get n from user stack
a = syscall 8   // alloc M
push a

syscall 8 needs one argument and returns one value. The complexity arises because of open stack coding which does not allow putting argument to syscall like this:

a = syscall 8 pop

This is syntax error because function call must know the number of argument.

Why syscall has variable number of argument?

syscall is an escape hatch. It is there to allow new commands to be added to the language without changing the compiler. Only the interpreter needed to be updated. Therefore the form (number of argument, whether it outputs any value) of a particular syscall is not known when writing the compiler. An alternative to open stack coding is to use "tuple". A tuple is a special syntax form that enclose a list of arguments (similar to "block" enclosing a list of expressions). The token "{" "}" are used and they are similar to "block".

tuple -> { ex0 ex0 ... }

A syntax discussion

I love LISP. Its syntax is trivial and very flexible. However, it is not easy to get the matching parentheses right without the help from the editor. I also love functional programming languages. Many languages such as Haskell have very elegant syntax. To make a language easy to write, I try to minimise the number of parentheses. The rules to reduce the amount of parenthesis are:

1) do not use ( ) with arguments of function call. The arity of any function is known.
2) use infix, left association, and override with ( )

With these two simple rules, most ( ) are eliminated. Only 'block' must be decided. The symbols such as {} can be used. ( ) should not be used to group statements as it is ambiguous between precedency and grouping. The lexical analyser is made so that indentation is recognised and is converted into the proper {}, so in the program text {} are never necessary.

While writing the grammar, some decision has been made about the 'implicit' meaning such as:

the precedency of Uop, whether it should be higher than Bop? What is preferable for "fetch a + 2", either "(fetch a) + 2" or "fetch (a + 2)".
the form for "store", it can be Bop. A symbol can be used to make it consistent with other Bops, for example e1 <- e2. However to reduce the strangeness of symbol, I decide to use a function form "store e1 e2".
array definition, the recognition that "array" is a dynamic allocator prompt me to use it as a Uop. id = array size. Other alternative is a declaration "array id size".

e -> [ e1* ] | e1
e1 -> 'if e e e | 'while e e ... | e Bop t
t -> number | ... | '( e ')

Consider "if cond e e", should the "condition" be restricted to non-block? similarly for the "while e e". Another consideration is the use of ( ). Here I use "(e)" which allow any expression to be ( ). Again, should ( ) be restricted to non-block i.e. ( ) should not be used over {}? I decide to allow full flexibility. However, semantic is not clear. For example for assignment statement "v = e", e must return value, should a block return a value (perhaps the last e)? What to do with some e that does not return value? The obvious case is "v = e". The value returned by "if e1 e2 e3" depends on what e2, e3 return or it may not return any value. User defined functions may or may not return any value. If "v = e" should return a value (so that cascading them is possible, a = b = 3) then during execution in a loop the evaluation stack will grow. The situation is not too bad because the stack will be clear upon returning from a function call.

Readability

How easy it is to read a program? This is very much dependent on prior experience. It is a matter of syntax or "form" of language. Three major types of syntax (based on the concept of operator): prefix, infix, postfix. Most of us grow up to be familiar with infix syntax; (a * 2) + b. For us, this is easier to read than prefix syntax; (* a (+ 2 b)), or the postfix syntax; a 2 b + *. The meaning of three forms are the same. However, the difficulty of parsing them are different. Infix syntax requires precedency of operators for correct association and needs parentheses to override that precedency. Grammar can be written to deal with precedency for infix. On the other hand, parsing of prefix and postfix expression is trivial. Parsing the infix and prefix naturally results in a structure of parse tree while postfix results in a linear sequence. However, a prefix language although it is trival to parse, it tends to have a lot of parentheses especially on the far right hand of the expression which is hard to get it right without the help from an editor that can match parentheses automatically. See the following example of a prefix language (this example is a part of a parser):

(define parseEL ()
(begin
    (tokenise)
    (if (str= token $rparen) 0
    (mkExplist (parseExp) (parseEL)))))

The example of real languages are; prefix language: LISP, postfix language: FORTH, Postscript.

The "model" of language also affect its "form". The current language distinguishes between "statement" and "expression". A statement becomes a special property where as an expression has well-defined mathematical meaning, evaluating an expression returns a value. A language can have expression as the only basic unit. This will make it more compact. Consider the following example:

(if x y 0) = if( x ) then return y; else return 0;

We are more familiar with the right hand side (C syntax) than the left hand side (LISP). However you can notice that the left hand side is much more compact than the right hand side. The meaning is "evaluate x if true then evaluate y else evaluate 0". The value returned is the value of the last evaluated expression. There is no need to explicitly "return".

Let us consider an example of adding one to a variable.

infix:     a = a + 1
prefix:    (= a (+ a 1))
postfix:   &a a 1 + =

For infix and prefix, the operator "=" (assign) treats its first argument "a" as special, it is an address. For postfix this must be done explicitly using another operator "&". You can not write it the other way. The postfix expression must be understood using the "model" of stack. The central concept is the evaluation stack. Evaluate a variable push its value into the stack. An operator takes its argument from stack and pushes its result back to stack. "Form" also affects the way an operator works. This is an infix language (C):

a[1] = a[2] + 1;

actually means *(&a + 1) = *(&a + 2) + 1;

The "=" here is not the same meaning as in a = a + 1 because it takes the left hand argument as an expression which must be evaluated to give a value as address where as the "=" in a = a + 1 takes a simple value directly. The parser must know this difference.