Non-linear machine code
We will illustrate the concept using an example from an internal form
of an executable code (i.e. machine code) of a high level language
A. The language A has its external form closely resembly its
internal form (which we will call A-code).
We will call a textual language (a program) a surface language.
This surface language will be transformed into an "internal form" by a
process called "parsing (or compiling). This internal form will
be executed by a machine by the process we called "evaluation".
For example, a computer translates a source code inot a machine code
which can be executed on a computer.
surface language (program)
|
v
------
parsing
------
|
v
internal forms
(machine code)
|
v
----------
evaluating (by virtual machine)
----------
|
v
result
Now, I will describe a language. This language is defined by S.
Kamin in the chapter 1 of his textbook.
(* Prof. Samuel Kamin of the department of computer science, University
of Illinois at Urbana Champagne, Programming Languages: An
interpreter-based approach, Addison-Wesley, 1990.)
The beauty of this language stems from its smallness and its
elegance. There are 12 words which are already defined (called
reserved words). Only one form of syntax rule is required, using
only two characters as syntactic features (the left and right
parenthesis). The grammar for this language can be written down
in just a few lines. Despite of its look of a toy-language, I
will illustrate the beauty of its completeness by showing that the
whole executable system including a parser and an evaluator can be
completely written in itself (about 500 lines of executable program
written in itself). In fact, I am cheating a bit here as I have
to add 3 more predefined words to create and manipulate data
structure.
This is a procedural (imperative) language. You can contrast this
language with any standard procedural language such as C and you will
see the stark difference of size! (counted by the number of
reserved words and the number of grammar rule describing its
syntax). I will denote this language as A1 and its internal
form as A0. All values produce in this language are in the domain
of integer. I will describe three aspects of the language.
1 its lexicon -- the words in the language
2 its syntax -- the composition of words into sentences
3 its semantic -- the meaning of a program
A1 lexicon
There are 12 predefined words which are used to compose programs.
set if while begin + - * / = <
> print
The first four words {set if while begin} are called control-op and the
rest are called value-op. The value-op takes its arguments
(mostly two, except print) and evaluates all arguments before applying
its operator. The operators have obvious meaning, such as + is
add etc. The control-op is different, it selectively evaluates
its arguments. The meaning of these operations will be explained
in the section semantic.
A1 syntax
Every sentences in A1 are expression which return values. An
expression has the form
(op e)
where e denotes an expression, op can be any reserved word or
user-defined words. The control-op has the following syntax
(set name e)
(if e1 e2 e3)
(while e1 e2)
(begin e1 e2 ... en)
The name of a variable and user-defined words can be any string of
characters except the reserved words. (We arbritarily limit the
length of a name to 15 characters in this implementation). The
syntax for defining a user function (non-built-in) is
(define name (formals) e )
where formals is the list of formal parameters.
The grammar for A1 is as follows. ( * denotes zero or more
repetition, 'sym terminal symbol)
toplevel -> e | define-op
e
-> name | control-op | value-op
control-op -> '( 'set name e
') |
'( 'if e1 e2 e3 ') |
'( 'while e1 e2 ') |
'( 'begin e1 e2 ... en ')
value-op -> '( op args ')
define-op -> '( 'define name
'( formals ') e ')
op
-> '+ | '- | '* | '/ | '= | '< | '> | 'print |
name
args
-> name* | number*
formals -> name*
name
-> variable names or user-defined names
number -> integer
A1 semantic
At the toplevel, it is a read-eval-print loop. A user types in an
expression or define-op and the parser translates it into an internal
form and then evaluates and returns its value. Examples will make
this clear:
-> (+ 2 3) user
input
5
computer reply
-> (define square (x) (* x x ))
square
-> (square 4)
16
->
The value-op evaluates all of its arguments before applying the
operator. The control-op treats its arguments in a different way.
(set name e)
"set" assigns a value of an expression e to the variable "name".
A varialbe can be local or global. A variable is local when its
name is listed in the formal parameters of the current function
otherwise it is global. The returned value is the value of e.
(if e1 e2 e3)
"if" evaluates e1 and if its value is non-zero (true) it evaluates e2
otherwise evaluates e3. The returned value is the value of the
last expression it evaluates.
(while e1 e2)
"while" is an iterative operator. It evaluates e1, if its value
is non-zero it evaluates e2. This process is repeated until e1
returns zero. The returned value is the value of e2 before the
loop terminate.
(begin e1 e2 ... en)
"begin" is a sequencing operator. It evaluates e1 e2 ... en
sequentially and returns the value of en.
(define name (formals) e )
The define-op is used to define a user-defined
function. You may notice that there is only the formal parameters
and there is no local variables. One way to have local
variables is to define an "auxilliary function" which has extra
parameters which will be used as local variables.
Example
(define isNumber (s1) (isNumber2
s1 0))
(define isNumber2 (s1 i) ... )
isNumber has only one parameter (s1) and we define isNumber2 to have
one extra parameter (i) that is used as a local variable (an index into
a string).
Programming styles
It is educational to read through the self-interpret program (lib.k,
data.k, base.k, eval.k)
(* the file name extension .k is used to honor Prof. Kamin, the creator
of the language)
You will see quite a number of distinc styles that make a program looks
much cleaner than a program in an ordinary language. (We are not going
to debate whether one style is better than another). Because
every expression return a value, this leads to a very natural way to
express a cascade-test construct.
(if test1 action1
(if test2 action2
. . .
defaut-action)...)
actionx is an expression hence returns a value without having to assign
this value to a variable as in C.
if (test1) n = action1;
else if (test2) n = action2;
. . .
else n = default-action;
return n;
Because of this semantic, there is no need to have a special word to
return a value either ("return" in C, assign to the function name in
Pascal).
You can see the beautiful construct such as
(define and (x y) (if x y 0))
Try this in your favorite language and compare!
Iteration vs Recursion
The following examples contrast two styles.
(define findName2 (name i found)
(begin
(while (and
(<= i numNames) (not found))
(if (str= (def-name-at i) name)
(set found 1)
(set i (+ 1 i))))
(if found i
0)))
(define findName3 (name i)
(if (> i numNames) 0
(if (str= (def-name-at i)
name) i
(findName3 name (+ 1 i)))))
"findName" searches a name in the symbol table (def-name).
"findName2" is iterative and uses "found" to break while loop.
"i" is set to "i + 1" for the next iteration. "findName3" is
recursive, "i + 1" is passed as a parameter to the next
recursion. Please note the absence of "set" in the recursive
version.
The next example is the function "atoi" which converts a string such as
"-1234" into its value -1234.
(define atoi4 (s1 i m)
(begin
(if (= 45
(fetch s1)) (set i 1) 0)
(while (!= 0
(fetch-at s1 i))
(begin
(set m (+ (* 10 m) (- (fetch-at s1 i) 48)))
(set i (+ 1 i))))
(if (= 45
(fetch s1))
(-
0 m)
m)))
(define atoi (s1)
(if (= 45 (fetch s1))
(- 0 (atoi2 (+
1 s1) 0))
(atoi2 s1 0)))
(define atoi2 (s1 m)
(if (= 0 (fetch s1))
m
(atoi2 (+ 1
s1) (+ (* 10 m) (- (fetch s1) 48)))))
"atoi4" uses iteration with "i" as an index of character and "m" as a
local variable storing the converted value. "atoi" and "atoi2"
are the recursive version. "atoi" handles the negative sign and
calls "atoi2" to convert the string. You can see the simplicity
of the structure in the recursive version and the lack of "set".
Internal forms
When an expression in a source language is entered, it is transformed
(compiled) into an internal form before it is evaluated
(executed). The internal form is considered to be machine code of
a virtual machine of A language. This internal form has the structure
in the form of a tree (inverted,
the root is at the top).
Suppose we have a function power(x y) which raises x to the power y.
(define power (x y)
(if (= 0 y) 1
(if (= 1 y) x
(* x (power x (- y 1))))))
The expression defining the body of power can be drawn as:
if
/ |\
/ | \
= 1 if
/ \ /|\
0 y / | \
= x *
/ \ / \
1 y x power
/ \
1 -
/ \
y 1
A general purpose linked structure is used to represent this tree
structure, called "list". List composed from two kinds of nodes:
dot-pair and atom. dot-pair stored two components; first
component is a pointer to an element of the list and second
component is a link to other dot-pair. atom stored information
(or element of list). See the following example: (atom is
shown in CAPITOL letter) and list is (.). / is NULL pointer.
(A
B C)
.--.--.-/
| | |
A B C
(A (B C))
.--.-/
| |
A .--.-/
| |
B C
In our machine code, the data structure of the tree composed of two
kinds of nodes: extype and exltype. "extype" (expression type)
stores symbols and "exltype" (expression list type) stores links.
(extype is atom, exltype is dot-pair).
extype
exltype
|-----------|
|---------|
| type
|
| head |
|-----------|
|---------|
| {optr, num,
varble}
| tail |
|-----------|
|---------|
| args |
|-----------|
"extype" has three fields: type, {optr, num, varble}, args. The
first field "type" stores its type: valexp, lvexp, gvexp, apexp.
"valexp" is the constant type and the value is stored in the second
field "num". "lvexp" "gvexp" denote local variables and global
variables. Their references to the variables are stored in the
second field "varble". "apexp" is the function application
type. The reference to the function is stored in the second field
"optr" and its list of arguments is stored in the third field
"args". You may notice that extype stores four kinds of data that
shared the same places. It is called union in C and variant
record in Pascal.
"exltype" is the linked-list node with two fields: head and tail.
Now we will draw the previous program (power) in this concrete
form. The type of "extype" is denoted by V (valexp), L (lvexp), G
(gvexp) and A (apexp). The exltype node is drawn as follows:
|
v
A if .
|
.--------.------.---|
|
| |
A =
. V 1 A if .
|
|
.---.--|
.--------.--------.----|
|
|
|
| |
V 0 L y A =
. L x A * .
|
|
.----.----| . . . .
| |
V 1 L y
The value of {optr and varble) refers to the symbol table, which stores
all attributes of a name (predefined-function, user-defined function,
global variables). We will not go into details of the
implementation such as the symbol table, the data segment. You
can study it from the program itself.
Data structures
How all these data structures are implemented in our system, which
provide only scalar values in integer domain? For example, how to
implement a pointer (so that we can have linked-list and others)?
The trick is simple. If we know the "base" of the data, then the
reference to the data is just
the offset from the base. This offset is not the "address", it is
an ordinary integer. In order to provide aggregate of data, we
need an array, which is a data structure accessible by an index.
The index is the offset that we use as a reference to data.
To have an array we need 3 more words: array, fetch, store in this
syntax.
(array name size)
(fetch name)
(store name value)
"array" allocates memory of "size", where "size" is an expression, for
example (* 4 10) or 40. What really is "name"? It is a
global variable, just like the variable that is defined and set its
value by "set", for example (set name 1) and the name stores the
reference to that memory block.
name
memory
|-------| |-------|
| |------->| value |
|-------| |-------|
| |
| . . . |
The memory in our system is a heap, a large block of memory with the
base address at "heap". To address any where in the heap we use
"ref" which is an index into this array of integer. If the heap
is defined in C, int heap[MAXMEM] then "ref" is heap[ref].
memory
heap -->
|---------|
| |
| |
|.........|
ref -->
|variable |
|.........|
| |
| |
. . .
"fetch" evaluates its argument ("name") gets its value, which is the
"ref" to the data segment and using this reference to get the
value. This "indirection" is called dereferencing. "store"
similary performs storing a value into a name indirectly.
With "fetch" and "store" you can define access functions to your
user-defined data structure. A calculation on pointer to a
variable becomes an ordinary arithmetic on integer because the
reference is just an integer. For example, you can define
indexing into an array:
(define fetch-at (name i) (fetch
(at name i)))
(define store-at (name i value)
(store (at name i) value))
(define at (base index) (+ base
index))
You can use them to copy one array to another (in the example copy y to
x)
(set N 10)
(array a1 N)
(array a2 N)
(define array-copy (x y n)
(if (= 0 n) 0
(begin
(store-at y n (fetch-at x n))
(array-copy x y (- n 1))))))
(array-copy a1 a2 N)
Another example is the access to internal forms, which are composed of
extype and exltype nodes. These nodes are aggregated type and can
be constructed as struct, union in C or record, variant record in
Pascal. We allocate the memory of size 3 integers for the extype
node and 2 integers for the exltype node. The following
definitions show how to create access functions for them.
(set MEMMAX ...
(array heap MEMMAX)
Page 2 e-type ...
Then, we can use these access function to manipulate the data
structure, such as
(store (e-type e) valexp)
(store (e-num e) n)
(define car (el) (fetch (e-head
el) )))
...
(if (= lvexp (fetch (e-type (car
args)))) . . .
How eval works?
The evaluator (function eval) is the machine that executes the internal
forms (A0). The "machine code" is the extype node. The
extype node has four types { valexp, lvexp, gvexp, apexp }. The
node type valexp returns the value stored in its field "num". The
node type lvexp and gvexp return values from local data and global data
with the reference in the "varble" field respectively.
node
type
return
valexp
(fetch (e-num e))
lvexp
(getval (fetch (e-varble e)) rho)
gvexp
(getcal (fetch (e-varble e)) globalEv)
The global data is allocated from the data segment when the variable is
defined. The local data is dynamic and is allocated from the stack
segment. The local data is created when passing the actual
parameters to a function and is destroyed when exit from the
function. Because the function call has the behavious of last-in
first-out (LIFO) as the earliest call will exit the last, a stack
structure is suitable for allocating the local data for function calls.
Using a stack gains a huge benefit of automatic reclamation of the
memory when the local data is not longer in used. (You may think
this is obvious but this is the beauty of it. Think about other
alternative way of storing local data such as linked-list. The
local data once ceased to exist will
have to be reclaim by some method).
stack
rho -->
| |
|-------|
local1 -->
| |
|-------|
local2 -->
| |
| . . . |
The access functions to variables are defined as
(define assign (name num
rho) (store-at rho name num))
(define getval (name rho)
(fetch-at rho name))
The global and local data can be handled in the same way except that
the global data is in the data segment (its base is "data") and the
local data is in the stack segement (its base is "stack").
Encoding of machine code
A list can be serialised and stored into a file. It becomes a
file of number and looks similar to ordinary machine code. Only
the information part is stored, the link part (dot-pair) needs not to
be stored, the structure of list is embeded into the encoding using the
number of element of lists as prefix. For example:
(A B C) => 3 1 2
3
the first 3 is the number of element of this list. 1, 2, 3
encoded A, B, C.
(A (B C)) => 2 1 2 2 3
The full encoding of A-code is a bit more complicate but not relevant
to our discussion. (see c/base.h and doc/format.txt )
17 Nov 2004
P. Chongstitvatana