Non-linear machine code

topics

A lexicon
A syntax
A semantic
local variables
programming styles
iteration vs recursion
internal forms
how to create data structures
integer -> offset
access function: array, fetch, store
how eval work
encoding of machine code

We will illustrate the concept using an example from an internal form of an executable code (i.e. machine code) of a high level language A. The language A has its external form closely resembly its internal form (which we will call A-code).

We will call a textual language (a program) a surface language. This surface language will be transformed into an "internal form" by a process called "parsing (or compiling). This internal form will be executed by a machine by the process we called "evaluation". For example, a computer translates a source code inot a machine code which can be executed on a computer.

surface language (program)
     |
     v
   ------
   parsing
   ------
     |
     v
internal forms (machine code)
     |
     v
   ----------
   evaluating    (by virtual machine)
   ----------
     |
     v
    result

Now, I will describe a language. This language is defined by S. Kamin in the chapter 1 of his textbook.
(* Prof. Samuel Kamin of the department of computer science, University of Illinois at Urbana Champagne, Programming Languages: An interpreter-based approach, Addison-Wesley, 1990.)

The beauty of this language stems from its smallness and its elegance. There are 12 words which are already defined (called reserved words). Only one form of syntax rule is required, using only two characters as syntactic features (the left and right parenthesis). The grammar for this language can be written down in just a few lines. Despite of its look of a toy-language, I will illustrate the beauty of its completeness by showing that the whole executable system including a parser and an evaluator can be completely written in itself (about 500 lines of executable program written in itself). In fact, I am cheating a bit here as I have to add 3 more predefined words to create and manipulate data structure.

This is a procedural (imperative) language. You can contrast this language with any standard procedural language such as C and you will see the stark difference of size! (counted by the number of reserved words and the number of grammar rule describing its syntax).   I will denote this language as A1 and its internal form as A0. All values produce in this language are in the domain of integer. I will describe three aspects of the language.

1 its lexicon -- the words in the language
2 its syntax -- the composition of words into sentences
3 its semantic -- the meaning of a program

A1 lexicon

There are 12 predefined words which are used to compose programs.

set if while begin + - * / = < > print

The first four words {set if while begin} are called control-op and the rest are called value-op. The value-op takes its arguments (mostly two, except print) and evaluates all arguments before applying its operator. The operators have obvious meaning, such as + is add etc. The control-op is different, it selectively evaluates its arguments. The meaning of these operations will be explained in the section semantic.

A1 syntax

Every sentences in A1 are expression which return values. An expression has the form

(op e)

where e denotes an expression, op can be any reserved word or user-defined words. The control-op has the following syntax

(set name e)
(if e1 e2 e3)
(while e1 e2)
(begin e1 e2 ... en)

The name of a variable and user-defined words can be any string of characters except the reserved words. (We arbritarily limit the length of a name to 15 characters in this implementation). The syntax for defining a user function (non-built-in) is

(define name (formals) e )

where formals is the list of formal parameters.

The grammar for A1 is as follows. ( * denotes zero or more repetition, 'sym terminal symbol)

toplevel -> e | define-op
e         -> name | control-op | value-op
control-op -> '( 'set name e ') |
              '( 'if e1 e2 e3 ') |
              '( 'while e1 e2 ') |
              '( 'begin e1 e2 ... en ')
value-op -> '( op args ')
define-op -> '( 'define name '( formals ') e ')
op        -> '+ | '- | '* | '/ | '= | '< | '> | 'print | name
args      -> name* | number*
formals -> name*
name      -> variable names or user-defined names
number    -> integer

A1 semantic

At the toplevel, it is a read-eval-print loop. A user types in an expression or define-op and the parser translates it into an internal form and then evaluates and returns its value. Examples will make this clear:

-> (+ 2 3)   user input
5            computer reply
-> (define square (x) (* x x ))
square
-> (square 4)
16
->

The value-op evaluates all of its arguments before applying the operator. The control-op treats its arguments in a different way.

(set name e)

"set" assigns a value of an expression e to the variable "name". A varialbe can be local or global. A variable is local when its name is listed in the formal parameters of the current function otherwise it is global. The returned value is the value of e.

(if e1 e2 e3)

"if" evaluates e1 and if its value is non-zero (true) it evaluates e2 otherwise evaluates e3. The returned value is the value of the last expression it evaluates.

(while e1 e2)

"while" is an iterative operator. It evaluates e1, if its value is non-zero it evaluates e2. This process is repeated until e1 returns zero. The returned value is the value of e2 before the loop terminate.

(begin e1 e2 ... en)

"begin" is a sequencing operator. It evaluates e1 e2 ... en sequentially and returns the value of en.

(define name (formals) e )

The define-op is used to define a user-defined function. You may notice that there is only the formal parameters and there is no local variables.   One way to have local variables is to define an "auxilliary function" which has extra parameters which will be used as local variables.

Example

(define isNumber (s1) (isNumber2 s1 0))
(define isNumber2 (s1 i) ... )

isNumber has only one parameter (s1) and we define isNumber2 to have one extra parameter (i) that is used as a local variable (an index into a string).

Programming styles

It is educational to read through the self-interpret program (lib.k, data.k, base.k, eval.k)
(* the file name extension .k is used to honor Prof. Kamin, the creator of the language)

You will see quite a number of distinc styles that make a program looks much cleaner than a program in an ordinary language. (We are not going to debate whether one style is better than another). Because every expression return a value, this leads to a very natural way to express a cascade-test construct.

(if test1 action1
(if test2 action2
. . .
defaut-action)...)

actionx is an expression hence returns a value without having to assign this value to a variable as in C.

if (test1) n = action1;
else if (test2) n = action2;
. . .
else n = default-action;
return n;

Because of this semantic, there is no need to have a special word to return a value either ("return" in C, assign to the function name in Pascal).

You can see the beautiful construct such as

(define and (x y) (if x y 0))

Try this in your favorite language and compare!

Iteration vs Recursion

The following examples contrast two styles.

(define findName2 (name i found)
(begin
    (while (and (<= i numNames) (not found))
      (if (str= (def-name-at i) name)
        (set found 1)
        (set i (+ 1 i))))
    (if found i 0)))

(define findName3 (name i)
(if (> i numNames) 0
(if (str= (def-name-at i) name) i
(findName3 name (+ 1 i)))))

"findName" searches a name in the symbol table (def-name). "findName2" is iterative and uses "found" to break while loop. "i" is set to "i + 1" for the next iteration. "findName3" is recursive, "i + 1" is passed as a parameter to the next recursion. Please note the absence of "set" in the recursive version.

The next example is the function "atoi" which converts a string such as "-1234" into its value -1234.

(define atoi4 (s1 i m)
(begin
    (if (= 45 (fetch s1)) (set i 1) 0)
    (while (!= 0 (fetch-at s1 i))
      (begin
        (set m (+ (* 10 m) (- (fetch-at s1 i) 48)))
        (set i (+ 1 i))))
    (if (= 45 (fetch s1))
      (- 0 m)
      m)))

(define atoi (s1)
(if (= 45 (fetch s1))
    (- 0 (atoi2 (+ 1 s1) 0))
    (atoi2 s1 0)))

(define atoi2 (s1 m)
(if (= 0 (fetch s1))
    m
    (atoi2 (+ 1 s1) (+ (* 10 m) (- (fetch s1) 48)))))

"atoi4" uses iteration with "i" as an index of character and "m" as a local variable storing the converted value. "atoi" and "atoi2" are the recursive version. "atoi" handles the negative sign and calls "atoi2" to convert the string. You can see the simplicity of the structure in the recursive version and the lack of "set".

Internal forms

When an expression in a source language is entered, it is transformed (compiled) into an internal form before it is evaluated (executed). The internal form is considered to be machine code of a virtual machine of A language. This internal form has the structure in the form of a tree (inverted,
the root is at the top).

Suppose we have a function power(x y) which raises x to the power y.

(define power (x y)
(if (= 0 y) 1
(if (= 1 y) x
(* x (power x (- y 1))))))

The expression defining the body of power can be drawn as:

             if
             / |\
           /   | \
          =    1   if
         / \       /|\
        0   y    / | \
                =   x   *
               / \     / \
              1   y   x   power
                          / \
                         1   -
                            / \
                           y   1

A general purpose linked structure is used to represent this tree structure, called "list". List composed from two kinds of nodes: dot-pair and atom. dot-pair stored two components; first component is a pointer to an element of the list and second component is a link to other dot-pair. atom stored information (or element of list). See the following example: (atom is shown in CAPITOL letter) and list is (.). / is NULL pointer.

(A B C)

.--.--.-/
| | |
A B C

(A (B C))

.--.-/
| |
A .--.-/
| |
B C

In our machine code, the data structure of the tree composed of two kinds of nodes: extype and exltype. "extype" (expression type) stores symbols and "exltype" (expression list type) stores links. (extype is atom, exltype is dot-pair).

                extype                            exltype
             |-----------|                      |---------|
             | type     |                      | head   |
             |-----------|                      |---------|
             | {optr, num, varble}              | tail   |
             |-----------|                      |---------|
             | args     |
             |-----------|

"extype" has three fields: type, {optr, num, varble}, args. The first field "type" stores its type: valexp, lvexp, gvexp, apexp. "valexp" is the constant type and the value is stored in the second field "num". "lvexp" "gvexp" denote local variables and global variables. Their references to the variables are stored in the second field "varble". "apexp" is the function application type. The reference to the function is stored in the second field "optr" and its list of arguments is stored in the third field "args". You may notice that extype stores four kinds of data that shared the same places. It is called union in C and variant record in Pascal.

"exltype" is the linked-list node with two fields: head and tail. Now we will draw the previous program (power) in this concrete form. The type of "extype" is denoted by V (valexp), L (lvexp), G (gvexp) and A (apexp). The exltype node is drawn as follows:

|
v
A if .
      |
      .--------.------.---|
      |        |      |
     A = .    V 1    A if .
         |                |
         .---.--|         .--------.--------.----|
         |   |            |        |        |
        V 0 L y         A = .     L x      A * .
                             |                  |
                             .----.----|       . . . .
                             |    |
                            V 1   L y

The value of {optr and varble) refers to the symbol table, which stores all attributes of a name (predefined-function, user-defined function, global variables). We will not go into details of the implementation such as the symbol table, the data segment. You can study it from the program itself.

Data structures

How all these data structures are implemented in our system, which provide only scalar values in integer domain? For example, how to implement a pointer (so that we can have linked-list and others)? The trick is simple. If we know the "base" of the data, then the reference to the data is just
the offset from the base. This offset is not the "address", it is an ordinary integer. In order to provide aggregate of data, we need an array, which is a data structure accessible by an index. The index is the offset that we use as a reference to data.

To have an array we need 3 more words: array, fetch, store in this syntax.
(array name size)
(fetch name)
(store name value)

"array" allocates memory of "size", where "size" is an expression, for example (* 4 10) or 40. What really is "name"? It is a global variable, just like the variable that is defined and set its value by "set", for example (set name 1) and the name stores the reference to that memory block.

       name              memory

       |-------|        |-------|
       |       |------->| value |
       |-------|        |-------|
                        |       |
                        | . . . |

The memory in our system is a heap, a large block of memory with the base address at "heap". To address any where in the heap we use "ref" which is an index into this array of integer. If the heap is defined in C, int heap[MAXMEM] then "ref" is heap[ref].

             memory
heap --> |---------|
            |         |
            |         |
            |.........|
   ref --> |variable |
            |.........|
            |         |
            |         |
              . . .

"fetch" evaluates its argument ("name") gets its value, which is the "ref" to the data segment and using this reference to get the value. This "indirection" is called dereferencing. "store" similary performs storing a value into a name indirectly.

With "fetch" and "store" you can define access functions to your user-defined data structure. A calculation on pointer to a variable becomes an ordinary arithmetic on integer because the reference is just an integer. For example, you can define indexing into an array:

(define fetch-at (name i) (fetch (at name i)))
(define store-at (name i value) (store (at name i) value))
(define at (base index) (+ base index))

You can use them to copy one array to another (in the example copy y to x)
(set N 10)
(array a1 N)
(array a2 N)
(define array-copy (x y n)
(if (= 0 n) 0
     (begin
        (store-at y n (fetch-at x n))
        (array-copy x y (- n 1))))))
(array-copy a1 a2 N)

Another example is the access to internal forms, which are composed of extype and exltype nodes. These nodes are aggregated type and can be constructed as struct, union in C or record, variant record in Pascal. We allocate the memory of size 3 integers for the extype node and 2 integers for the exltype node. The following definitions show how to create access functions for them.

(set MEMMAX ...
(array heap MEMMAX)

Page 2 e-type ...

Then, we can use these access function to manipulate the data structure, such as

(store (e-type e) valexp)
(store (e-num e) n)

(define car (el) (fetch (e-head el) )))

...
(if (= lvexp (fetch (e-type (car args)))) . . .

How eval works?

The evaluator (function eval) is the machine that executes the internal forms (A0). The "machine code" is the extype node. The extype node has four types { valexp, lvexp, gvexp, apexp }. The node type valexp returns the value stored in its field "num". The node type lvexp and gvexp return values from local data and global data with the reference in the "varble" field respectively.

node type               return
valexp          (fetch (e-num e))
lvexp           (getval (fetch (e-varble e)) rho)
gvexp           (getcal (fetch (e-varble e)) globalEv)

The global data is allocated from the data segment when the variable is defined. The local data is dynamic and is allocated from the stack segment. The local data is created when passing the actual parameters to a function and is destroyed when exit from the function. Because the function call has the behavious of last-in first-out (LIFO) as the earliest call will exit the last, a stack structure is suitable for allocating the local data for function calls.

Using a stack gains a huge benefit of automatic reclamation of the memory when the local data is not longer in used. (You may think this is obvious but this is the beauty of it. Think about other alternative way of storing local data such as linked-list. The local data once ceased to exist will
have to be reclaim by some method).

             stack
   rho --> |       |
            |-------|
local1 --> |       |
            |-------|
local2 --> |       |
            | . . . |

The access functions to variables are defined as

(define assign (name num rho) (store-at rho name num))
(define getval (name rho) (fetch-at rho name))

The global and local data can be handled in the same way except that the global data is in the data segment (its base is "data") and the local data is in the stack segement (its base is "stack").

Encoding of machine code

A list can be serialised and stored into a file. It becomes a file of number and looks similar to ordinary machine code. Only the information part is stored, the link part (dot-pair) needs not to be stored, the structure of list is embeded into the encoding using the number of element of lists as prefix. For example:

(A B C) => 3 1 2 3

the first 3 is the number of element of this list. 1, 2, 3 encoded A, B, C.

(A (B C)) => 2 1 2 2 3

The full encoding of A-code is a bit more complicate but not relevant to our discussion. (see c/base.h and doc/format.txt )

17 Nov 2004
P. Chongstitvatana