Report on Som language
Motivation
The aim of Som language is for teaching. It has been used in computer
architecture class to teach how high level programming languages and
machine codes are related. The whole language translation process is
simple enough that students can modify it to generate code for their
projects easily. Som has a familiar syntax, infix operators, and
it is designed to
be minimal. The internal code is a
linear code similar to byte-code of popular languages such as Java. It
has a simple static memory model for efficiency, and it also has
dynamic allocation for flexibility. It is interactive, an
expression can be typed in and is evaluated immediately. See the
following example:
>
print
2 + 3 nl
5
> to sq x = x * x
> print sq 4 nl
16
>
The basic element in Som is an expression. An expression returns a
value, except for assignment which does not return any value. A
variable is evaluated to its value. Som has a minimal set of
operators as it is intended to be used as a teaching tool. It has
small set of reserved words:
to,
if,
else, while, for, case, break, enum.
Operators are:
arithmetic: + - * / % (mod)
logic: & (bit-and) | (bit-or) ! (bit-not) ^ (bit-xor) << (shift left) >> (arithmetic shift
right)
relation: == != < <= >=
>
assignment: =
other: array
(memory allocation)
Som has three types of variable: global, local, and array. A global
variable is automatically defined when it is first appeared. A
local variable's scope is in the function that it is defined. An
array
variable has its space allocated by calling the "array" operators and assigns
the return value to the array variable. If an array variable is
global, it is static. The compiler knows the base
address at compile time and can perform some optimisation to achieve
faster execution. An array that is defined inside a function
definition, i.e. local, is dynamic. Its space is allocated in the
heap. The life time of a dynamic memory depends on the use.
When there is no reference to it, it is said to be garbage. The
run-time system may support garbage collection. At present, Som system
has no garbage collection. Som's programming environment allows using
indentation for grouping
expressions.
Example: a program to solve tower of hanoi problem
num
=
array 4 // a global array variable
// define function "mov" with 3
arguments: n, from, t
// and one local variable: other
to mov n from t | other
=
if n == 1
num[from] =
num[from] - 1
num[t] =
num[t] + 1
else
other = 6 -
from - t
mov n-1 from
other
mov 1 from t
mov n-1 other t
interactive mode
disk
=
3
num[0] = 0
num[1] = disk
num[2] = 0
num[3] = 0
mov disk 1 3
Grammar of Som
language
notation:
* zero or more times
+ one or more times
[..] optional
' constant symbol
Indentation is used for grouping instead of '{ '}
toplevel
->
'to fundef | ex
fundef -> id args '= ex
args ->
id* ['| id+ ]
ex -> '{ ex* '} |
ex1
ex1
->
'if ex0 [ 'else ex ] |
'while ex0 ex |
'for lvar ex0 ex0 ex |
'break |
'case ex0 caselist |
'enum '{ [ number ] id+ '}
id '= ex0 |
ex0
caselist
->
caseitem | '{ caseitem+ '}
caseitem -> number ': ex |
'else ': ex
ex0 -> term term*
term ->
number | id | vec |
fun ex0* |
'! ex0 |
'array ex0 |
'( ex0 ')
vec -> id '[ ex0 ']
bop -> '+ | '- | '* | '/ |
'& | '| | '^ |
'== | '!= | '< | '<=
| '>= | '> | '% | '<< | '>>
There are several interesting points about Som grammar. First, it
is expression-based, therefore an expression returns a value
except: assignment, for, while. Second, the syntax allows very
compact writing, with minimum number of separator and parentheses, for
example, a semicolon at the end of statement is not necessary as all
operators have known arity. Third, the language has very small
vocaburary, this makes it very easy to learn. Recursion is quite
natural in Som.
to
fib
n =
if n < 3
1
else
(fib n - 1) +
(fib n - 2)
Here are some elegant examples. They show how to define some primitives
using only: if , ==, <.
to
and
x y = if x y else 0
to or x y = if x 1 else y
to not x = if x 0 else 1
to eq x y = x == y
to neq x y = not ( x == y )
to lt x y = x < y
to le x y = or ( x < y ) ( x
== y )
to gt x y = not ( le x y )
to ge x y = not ( x < y )
For, break, and case
In "for" loop, the index
variable must be a local variable. "for i start end body" means
i
= start
while i <= end
body
i = i + 1
See the following example of the use of "for":
//
fill
in an array and print it
max = 10
N = array max
// array is passed by reference
to fill ar n | i =
for i 0 n-1
ar[i] = i
to pr ar n | i =
for i 0 n-1
print ar[i]
space
fill N max pr N max
The "break" has three
meanings:
1) break for loop
2) break while loop
3) force return a function call.
This is a rather nice semantics as it is very consistent.
The case construction is an efficient way for a multiway branch.
The label in each case is an integer. To help readability, the
enum is used to give labels their symbolic names.
//
10
is the start number
enum
10 a1 a2 a3 a4 a5
to ff a =
case a
a1: 1
a2: 2
a3: 3
else: 99
print ff 13
The case label must be number, ":" allows syntax check to catch any
error and to make it look more familiar. A label (id) is stored in the
symbol table with a unique reference. The option [number] is used to
set the starting reference otherwise it is zero. "case" is used
to make an efficient
inner interpreter loop in decode and dispatch each instruction.
while
running
case opcode
1 : add
2 : sub
...
else : error
undef opcode
As the compare and jump in "case" uses log(n) search strategy it is
faster than linear search using if..else. In fact, in this
implementation, "case" uses a constant time (indexing) to go to the
matched label.
while
running
if opcode == 1 add
else if opcode == 2
sub
...
else error undef opcode
System calls
To enable input/output and other system functions, Som uses a primitive
"syscall". Syscall has a variable number of arguments, the first
one is a constant, the number that identifies the system
function. Syscall is used to implement library functions such as
print, printc, loadfile etc. The implementation of syscall is
dependent on the platform. For a PC, syscall is implemented with
the implementation language (C in this version). The following is
the list of available system functions:
1 print integer
2 print character
3 get character
19 load file
example how syscall is used in the library
to
print
x = syscall {1 x}
to printc c = syscall {2 c}
to getchar = syscall {3}
to loadfile fn = syscall {19
fn} // fn is som-string
to nl = syscall {2
10}
// 10 is a newline char
String
A string is an array of integer. An integer is 32-bit. It
contains at most 4 characters, right pads with 0, terminates by an
integer 0 (an extra one). This is called a packed string. It is a
compromise between the ease of manipulating a string (insert and delete
a character is slow) and the space efficiency (storing each character
in a 32-bit integer is expensive). See the following program for string
manupilation,
string copy.
//
copy
s1 = s2
to strcpy s1 s2 | i =
i = 0
while s2[i] != 0
s1[i] = s2[i]
i = i + 1
s1[i] = 0
s1 = array 20
strcpy s1 "test string"
The compiler translated a constant string in the program text into a
constant pointed to data segment storing the string. A string
constant is
handled at compile time. The most convenient way to implement
this
is to represent string as an array of integer and the pointer to this
string is an address of the som-memory. The operator must be
type-specific, i.e. the operator knows what type its argument is.
The pointer to string is just an integer.
Tuple
Tuple is a list of variable number of arguments. There is a place
where it is required to use variable number of arguments for a function
call. That is, call to a system function, "syscall". The
original idea is to use open stack coding. This is easy and does
not require any special treatment in parsing. However, open stack
coding is not good. See how confusing it can be in this code
(from eval-s.txt):
xArray:
// be careful open stack
coding !!!
pop
// get n from user
stack
a = syscall 8
// alloc M
push a
syscall 8 needs one argument and returns one value. The
complexity arises because of open stack coding which does not allow
putting argument to syscall like this:
a = syscall 8 pop
This is syntax error because function call must know the number of
argument.
Why syscall has
variable number of argument?
syscall is an escape hatch. It is there to allow new commands to
be added to the language without changing the compiler. Only the
interpreter needed to be updated. Therefore the form (number of
argument, whether it outputs any value) of a particular syscall is not
known when writing the compiler. An alternative to open stack
coding is to use "tuple". A tuple is a
special syntax form that enclose a list of arguments (similar to
"block" enclosing a list of expressions). The token
"{" "}" are used and they are similar to "block".
tuple
->
{ ex0 ex0 ... }
A syntax discussion
I love LISP. Its syntax is trivial and very flexible.
However, it is not easy to get the matching parentheses right without
the help from the editor. I also love functional programming
languages. Many languages such as Haskell have very elegant
syntax. To make a language easy to write, I try to minimise the
number of parentheses. The rules to reduce the amount of
parenthesis are:
1) do not use ( ) with arguments of function call. The
arity
of any function is known.
2) use infix, left association, and override with ( )
With these two simple rules, most ( ) are eliminated. Only
'block'
must be decided. The symbols such as {} can be used. ( )
should not be used to group statements as it is ambiguous between
precedency and grouping. The lexical analyser is made so that
indentation is recognised and is converted into the proper {}, so in
the program text {} are never necessary.
While writing the grammar, some decision has been made about the
'implicit' meaning such as:
- the precedency of Uop, whether it should be higher than
Bop? What is preferable for "fetch a + 2", either "(fetch a) + 2"
or "fetch (a + 2)".
- the form for "store", it can be Bop. A symbol can be used to make
it consistent with other Bops, for example e1 <- e2. However to
reduce the strangeness of symbol, I decide to use a function form "store e1 e2".
- array definition, the recognition that "array" is a dynamic
allocator prompt me to use it as a Uop. id = array size. Other
alternative is a declaration "array
id
size".
e
-> [ e1* ] | e1
e1 -> 'if e e e | 'while e e
... | e Bop t
t -> number | ... | '( e ')
Consider "if cond e e", should the "condition" be restricted to
non-block? similarly for the "while e e". Another
consideration is the use of ( ). Here I use "(e)" which allow any
expression to be ( ). Again, should ( ) be restricted to
non-block
i.e. ( ) should not be used over {}? I decide to allow full
flexibility. However, semantic is not clear. For example for
assignment statement "v = e", e must return value, should a block
return a value (perhaps the last e)? What to do with some e that
does not return value? The obvious case is "v = e". The
value returned by "if e1 e2 e3" depends on what e2, e3 return or it may
not return any value. User defined functions may or may not
return any value. If "v = e" should return a value (so that
cascading them is possible, a = b = 3) then during execution in a loop
the evaluation stack will grow. The situation is not too bad because
the stack will be clear upon returning from a function call.
Readability
How easy it is to read a program? This is very much dependent on
prior experience. It is a matter of syntax or "form" of
language. Three major types of syntax (based on the concept of
operator): prefix, infix, postfix. Most of us grow up to be
familiar with infix syntax; (a * 2) + b. For us, this is easier
to read than prefix syntax; (* a (+ 2 b)), or the postfix syntax; a 2 b
+ *. The meaning of three forms are the same.
However, the difficulty of parsing them are different. Infix
syntax requires precedency of operators for correct association and
needs parentheses to override that precedency. Grammar can be
written to deal with precedency for infix. On the other hand,
parsing of prefix and postfix expression is trivial. Parsing the
infix and prefix naturally results in a structure of parse tree while
postfix results in a linear sequence. However, a prefix language
although it is trival to parse, it tends to have a lot of parentheses
especially on the far right hand of the expression which is hard
to get it right without the help from an editor that can match
parentheses automatically. See the following example of a prefix
language (this example is a part of a parser):
(define
parseEL
()
(begin
(tokenise)
(if (str=
token $rparen) 0
(mkExplist
(parseExp) (parseEL)))))
The example of real languages are; prefix language: LISP, postfix
language: FORTH, Postscript.
The "model" of language also affect its "form". The current
language distinguishes between "statement" and "expression". A
statement becomes a special property where as an expression has
well-defined mathematical meaning, evaluating an expression returns a
value. A language can have expression as the only basic
unit. This will make it more compact. Consider the
following example:
(if
x
y 0) = if( x ) then return y; else return 0;
We are more familiar with the right hand side (C syntax) than the left
hand side (LISP). However you can notice that the left hand side
is much more compact than the right hand side. The meaning is "evaluate
x if true then evaluate y else evaluate 0". The value returned is
the value of the last evaluated expression. There is no need to
explicitly "return".
Let us consider an example of adding one to a variable.
infix:
a
= a + 1
prefix: (= a (+
a 1))
postfix: &a a 1 +
=
For infix and prefix, the operator "=" (assign) treats its first
argument "a" as special, it is an address. For postfix this
must be done explicitly using another operator "&". You can
not write it the other way. The postfix expression must be
understood using the "model" of stack. The central concept is the
evaluation stack. Evaluate a variable push its value into the
stack. An operator takes its argument from stack and pushes its
result back to stack. "Form" also affects the way an operator
works. This is an infix language (C):
a[1]
=
a[2] + 1;
actually means *(&a + 1) = *(&a + 2) + 1;
The "=" here is not the same meaning as in a = a + 1 because it takes the
left hand argument as an expression which must be evaluated to give a
value as address where as the "=" in a = a + 1 takes a simple value
directly. The parser must know this difference.
More example and
language tutorial
Example 1 Hello world
prints
"hello
world" // prints is in the string lib
Example 2 modulo (it is a built-in operator)
to
mod
m n = m - (n * (m / n))
Example 3 quick sort
enum
20 N
a = array N
to swap i j | t =
t = a[i]
a[i] = a[j]
a[j] = t
to partition p r | x i j flag =
x = a[p]
i = p - 1
j = r + 1
flag = 1
while flag
j = j - 1
while a[j]
> x
j
= j - 1
i = i + 1
while a[i]
< x
i
= i + 1
if (i < j)
swap i j else flag = 0
j
to quicksort p r | q =
if p < r
q = partition
p r
quicksort p q
quicksort q+1 r
quicksort 0 (N - 1)
Example 4 string manipulation (part of the string lib)
//
copy
s1 = s2
to strcpy s1 s2 | i =
i = 0
while s2[i] != 0
s1[i] = s2[i]
i = i + 1
s1[i] = 0
// print string s1, s1 is
som-string
to prints s1 | a i c1 c2 c3 c4 =
i = 0
a = s1[i]
while a != 0
c4 = a &
255
c3 = (a
>> 8) & 255
c2 = (a
>> 16) & 255
c1 = (a
>> 24) & 255
if c1 != 0
printc c1
if c2 != 0
printc c2
if c3 != 0
printc c3
if c4 != 0
printc c4
i = i + 1
a = s1[i]
// b is a constant string, c is a
dynamic array
to teststring | b c =
b = "123456"
c = array 10
stcpy c b
prints c
27 Sept 2004
last update 8 Jan 2011