Bootstrapping

A 60-minute talk at the research group meeting on 12 September 2002, Chulalongkorn University.
Why a new language?
Readability
Internal representation
Bootstrapping
Accessibility

Why a new language?

I would like to talk about computer language design and implementation.  However, this is a huge topic.  I want to cover only a small part of it, namely, how to create a new language and a new environment, starting from an existing system and gradually growing to be independent from it.  This process is known as "bootstrapping".

Why would one want to create a new language?  A new language usually arises from the demands of new applications.  The existing languages may not fit the new constraints, whether it is an embedded system that requires a small memory footprint, a language that resides inside a software application (called an embedded language), or a language used as glue to hold many software components together (called a scripting language).  The yet unseen applications of the future will have requirements that existing languages do not meet.  Language also evolves through time.  As our understanding of computer languages grows from experience, we will be able to design better languages that are safer, easier to use, and more powerful.

Readability

How easy is it to read a program?  This depends very much on prior experience.  It is a matter of syntax, the "form" of a language.  There are three major types of syntax (based on the position of the operator): prefix, infix, and postfix.  Most of us grow up familiar with infix syntax: (a * 2) + b.  For us, this is easier to read than the prefix syntax (* a (+ 2 b)) or the postfix syntax a 2 b + *.  The meanings of the three forms are the same; however, the difficulty of parsing them differs.  Infix syntax requires operator precedence for correct association and needs parentheses to override that precedence.  A grammar that deals with precedence for infix is not simple.  On the other hand, parsing prefix and postfix expressions is trivial.  Parsing infix and prefix naturally results in a parse tree, while parsing postfix results in a linear sequence.  A prefix language, although trivial to parse, tends to have a lot of parentheses, especially at the far right-hand end of an expression, which is hard to get right without help from an editor that matches parentheses automatically.  See the following example of a prefix language (this example is part of a parser):
(define parseEL ()
  (begin
    (tokenise)
    (if (str= token $rparen) 0
      (mkExplist (parseExp) (parseEL)))))
Examples of real languages: prefix language: LISP; postfix languages: FORTH, PostScript.
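The claim that prefix syntax is trivial to parse can be illustrated with a short sketch (in Python; the function names are mine, not from the talk):

```python
# Minimal recursive-descent parser for a prefix language such as (* a (+ 2 b)).
# Tokens are parentheses, operators, and atoms; the grammar needs no
# precedence rules, which is why the parser is so short.

def tokenise(src):
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        children = []
        while tokens[0] != ")":
            children.append(parse(tokens))
        tokens.pop(0)          # discard the ")"
        return children
    return tok                 # an atom: operator name, variable, or number

tree = parse(tokenise("(* a (+ 2 b))"))
# tree is ['*', 'a', ['+', '2', 'b']] -- the parse tree falls out directly
```

Note that there is no precedence table anywhere: the parentheses in the source carry all the structure.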

The "model" of a language also affects its "form".  Most current languages distinguish between a "statement" and an "expression".  A statement has special properties, whereas an expression has a well-defined mathematical meaning: evaluating an expression returns a value.  A language can have the expression as its only basic unit.  This makes it more compact.  Consider the following example:

        (if x y 0)    =    if (x) return y; else return 0;

We are more familiar with the right-hand side (C syntax) than the left-hand side (LISP).  However, notice that the left-hand side is much more compact.  The meaning is "evaluate x; if true then evaluate y, else evaluate 0".  The value returned is the value of the last evaluated expression.  There is no need for an explicit "return".
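A tiny evaluator makes the point concrete: "if" is just another expression that yields a value.  This is a sketch in Python (the environment and helper names are mine, not part of the talk):

```python
# Sketch of an evaluator in which "if" is an expression, not a statement:
# evaluating (if x y 0) returns the value of the last branch evaluated,
# so no explicit "return" is needed.

def ev(exp, env):
    if isinstance(exp, str):               # an atom: variable or number
        return env[exp] if exp in env else int(exp)
    op = exp[0]
    if op == "if":                         # (if cond then else)
        _, c, t, e = exp
        return ev(t, env) if ev(c, env) else ev(e, env)
    if op == "+":
        return ev(exp[1], env) + ev(exp[2], env)

print(ev(["if", "x", "y", "0"], {"x": 1, "y": 42}))   # prints 42
```

Because every construct returns a value, expressions compose freely, e.g. an "if" can sit inside a "+".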

Let us consider an example of adding one to a variable.

infix:  a = a + 1
prefix: (= a (+ a 1))
postfix: &a a 1 + =
For infix and prefix, the operator "=" (assign) treats its first argument "a" specially: it is an address.  For postfix this must be done explicitly using another operator, "&"; you cannot write it any other way.  The postfix expression must be understood using the "model" of a stack.  The central concept is the evaluation stack.  Evaluating a variable pushes its value onto the stack.  An operator takes its arguments from the stack and pushes its result back onto the stack.  "Form" also affects the way an operator works.  Consider this infix language (C):

                a[1] = a[2] + 1;
actually means  *(&a + 1) = *(&a + 2) + 1;

The "=" here does not have the same meaning as in a = a + 1, because it takes its left-hand argument as an expression that must be evaluated to give a value used as an address, whereas the "=" in a = a + 1 takes a simple name directly.  The parser must know this difference.
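The stack model of the postfix form "&a a 1 + =" can be sketched as follows (a Python illustration; the token conventions are mine, assuming "&" marks the next token as an address):

```python
# Stack-model sketch of the postfix assignment "&a a 1 + =".
# A plain variable pushes its value; "&" pushes the address of the next
# token (here simply the variable's name); "=" pops a value and an
# address and stores the value into memory.

def run_postfix(code, mem):
    stack = []
    i = 0
    while i < len(code):
        tok = code[i]
        if tok == "&":                  # push the address of the next token
            i += 1
            stack.append(code[i])
        elif tok == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif tok == "=":                # pop value, then address, and store
            value = stack.pop()
            addr = stack.pop()
            mem[addr] = value
        elif tok in mem:                # variable: push its value
            stack.append(mem[tok])
        else:
            stack.append(int(tok))      # literal number
        i += 1
    return mem

mem = run_postfix(["&", "a", "a", "1", "+", "="], {"a": 5})
# mem["a"] is now 6
```

Tracing the stack makes the difference explicit: "&a" pushes the address, "a" pushes the value 5, "+" leaves 6, and "=" consumes both the address and the value.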

Internal representation

For prefix and infix language a natural form as a result of parsing is the parse tree.

a = a + 1   and    (= a (+ a 1 ))

        =
       / \
      a   +
         / \
        a   1

This parse tree can be evaluated directly using "graph reduction".  Each operator is evaluated by taking its arguments from its children and substituting the node with the result.

        =           =           (a=a+1)
       / \         / \
      a   +       a  (a+1)
         / \
        a   1
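The reduction above can be sketched as a recursive evaluator (a Python illustration; the node representation as tuples is my own choice):

```python
# Sketch of evaluating the parse tree of "a = a + 1" by reduction.
# Each interior node is (op, left, right); reducing a node replaces it
# by its value, working bottom-up exactly as in the diagram above.

def reduce_tree(node, mem):
    if isinstance(node, str):              # a leaf: variable or number
        return mem[node] if node in mem else int(node)
    op, left, right = node
    if op == "+":
        return reduce_tree(left, mem) + reduce_tree(right, mem)
    if op == "=":                          # left child is taken as an address
        mem[left] = reduce_tree(right, mem)
        return mem[left]

mem = {"a": 5}
reduce_tree(("=", "a", ("+", "a", "1")), mem)
# mem["a"] is now 6
```

The recursion itself plays the role of the substitution steps: each call returns the value that replaces its subtree.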

A parse tree can be represented using a list structure (a list is easier because each node has a fixed number of children, namely two).

.--  is a node with two children.
|

This is called the dot operator.  The end of a list is marked by "/" or NIL.

( a  b )
  .--.-- /   or    (. a (. b NIL))
  |  |
  a  b

( a  b ( c d ) )
 
  .--.--.-- /
  |  |  |
  a  b  .--.-- /
        |  |
        c  d

or  (. a (. b (. (. c (. d NIL)) NIL)))
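The dotted-pair structure above can be sketched with two-field cells (a Python illustration; representing a cell as a tuple is my own convention):

```python
# Dotted-pair sketch: each cell is a node with exactly two children,
# and NIL marks the end of a list, mirroring (. a (. b NIL)) above.

NIL = None

def cons(head, tail):
    return (head, tail)

def to_list(cell):
    """Flatten a cons chain back into a Python list, for display."""
    out = []
    while cell is not NIL:
        head, cell = cell
        out.append(to_list(head) if isinstance(head, tuple) else head)
    return out

# ( a b ( c d ) )  ==  (. a (. b (. (. c (. d NIL)) NIL)))
cells = cons("a", cons("b", cons(cons("c", cons("d", NIL)), NIL)))
# to_list(cells) gives ['a', 'b', ['c', 'd']]
```

Because every cell has exactly two fields, the allocator only ever needs one node size, which keeps dynamic allocation simple.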

The tree structure requires dynamic allocation of storage (allocate a node and link it into the structure).  A postfix language is already in linear form; its internal representation is contiguous storage.  However, while a control structure can be represented directly in a parse tree, a control structure in the linear form of a postfix language must be converted into a series of jumps.

          if e1 e2 e3

a parse tree

  .--.-- /
  |  |
  if .--.--.--/
     |  |  |
     e1 e2 e3
a linear code

      e1 jump-if-false-1 e2 jump-2  #1  e3 #2

where #1, #2 are labels

Evaluating a parse tree is most naturally written in a recursive style; the thread of the program handles the intermediate values of the tree reduction automatically.  In evaluating linear code, an evaluation stack is more natural: the evaluator is written using iteration (loops), and the intermediate values are stored on the stack explicitly.
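The conversion of "if e1 e2 e3" into jumps, and its iterative stack-based evaluation, can be sketched together (a Python illustration; the opcode names "jf" for jump-if-false and "jmp" are mine):

```python
# Sketch: flatten (if e1 e2 e3) into linear code with jumps, then run it
# with an iterative evaluator and an explicit stack.  This mirrors
#   e1 jump-if-false-1 e2 jump-2 #1 e3 #2
# with the labels resolved into instruction indices.

def compile_if(e1, e2, e3):
    # e1..e3 are lists of instructions that each leave a value on the stack
    code = list(e1)
    code.append(("jf", len(code) + len(e2) + 2))   # skip e2 and the jmp
    code += e2
    code.append(("jmp", len(code) + len(e3) + 1))  # skip e3
    code += e3
    return code

def run(code):
    stack, pc = [], 0
    while pc < len(code):
        op = code[pc]
        if op[0] == "push":
            stack.append(op[1]); pc += 1
        elif op[0] == "jf":                        # pop; jump if false
            pc = op[1] if not stack.pop() else pc + 1
        elif op[0] == "jmp":
            pc = op[1]
    return stack.pop()

prog = compile_if([("push", True)], [("push", "e2")], [("push", "e3")])
# run(prog) gives "e2"; with ("push", False) as the condition it gives "e3"
```

Note that all intermediate values live on the explicit stack and the control flow lives in the program counter: the two things the recursive tree evaluator got for free from the host language.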

Bootstrapping

Now comes the most interesting part of this talk: how does a new language start its life?  To implement a new language, we use tools that exist on the current machine.  The new language's compiler, interpreter, etc. are written in an existing language and run as applications on the existing operating system.  Of course, the result, that is, the internal representation, can be further translated into executable code for a target machine.  This renders the executable code dependent on the machine on which it is to be used.  The most popular method of starting a new life for a language is to implement a "virtual machine".

A virtual machine takes an internal representation, evaluates it, and returns the result.  In other words, the virtual machine "runs" code written in an internal form (whether a parse tree or linear code).  Java is a well-known example: the Java language is compiled into Java byte-code (a linear code) and is executed on the Java virtual machine (most browsers embed a JVM).  This makes the internal representation independent of any particular machine.

So we start off writing a compiler for the new language in our favourite existing language (such as Java, C, C++, or Pascal).  In writing a compiler, we need to write a lexical analyser and a parser.  The best way to write them is to use tools such as a lexical analyser generator and a parser generator (such as LEX and YACC).  This allows us to write only the grammar of the language, which results in a big compression (in the number of lines): a grammar is much smaller than the parser, at least 1:4, and I have sometimes seen a 1:10 compression ratio.  The big plus is that the output is correct with respect to the input grammar (assuming the generator is correct).  However, debugging a grammar is not easy, mostly because we tend to be weak with regard to formal languages.

Once the new language is up and running, it is ready to be used to write applications.  However, this arrangement leaves all the tools (compiler, interpreter, debugger, etc.) written not in the new language, hence dependent on other existing tools (such as the compiler used to write them).  To grow the new system out of this dependency, the tools themselves can be rewritten in the new language.  This makes the new system self-reliant: everything is expressed in terms of its own language.  It is, in a sense, "complete".  This is a beautiful idea.  Consider the following scenario:

1) language X : compiler/X of Y  --> language Y running on VM/X
2) language Y running on VM/X: compiler/Y of Y  --> language Y running on ?

Rather than compiling into platform-dependent machine code, we use a virtual machine.  Initially, this VM is written in X and enables Y programs to be run; phase 1 is then complete.  Using this runnable platform, tools such as a compiler can be written in Y.  The programs are developed and run on the virtual machine written in X.  The final phase is to have all programs written in Y.  The question is what they will run on.  The obvious answer is on VM/X; otherwise, all programs would have to be compiled into the machine code of the platform used.  What is desirable is to have the VM written in Y itself, hence it will be "complete".  One way to do this, as shown by the team who developed Squeak (a Smalltalk system written in Smalltalk), is to write the VM in Y, debug it on VM/X, then write a program to translate VM/Y into language X.  This VM is denoted VM/X'.  VM/X' differs from VM/X in that it is directly translated from VM/Y, hence small details about data structures and other supporting routines will differ.  VM/X' is essentially VM/Y expressed in X.  The translation program may or may not be difficult to write, depending on the difference between the languages X and Y.

Accessibility

To probe a little further, consider how the internal representation is stored.  The storage of a programming system consists of a data segment (ds), a code segment (cs), and a stack segment (ss).  Under a language X, only the data segment is accessible to programmers; the code segment and stack segment are managed by the system and are not accessible to programmers.  Let mem:P be all the storage of a program P, and let ^ denote accessibility.

        eq 1) mem:P = ds:P^X, cs:P, ss:P

This equation means that the data of P is accessible to programmers of language X, the internal form of program P is stored in the code segment, and when P is run, its activation record is in the stack segment.  In the first phase, the virtual machine (VM) is written in language X (X = C, C++, Java, Pascal, etc.) and runs on a typical machine.  The storage of the VM is:

        eq 2)  mem:VM = ds:VM, cs:VM, ss:VM

Because the VM is written in X, all of its storage is accessible to X.

        eq 3)  mem:VM^X

If there is a program AY running on VM, the storage of AY resides in the storage of VM.

        eq 4)  mem:AY = ds:AY, cs:AY, ss:AY
        eq 5)  mem:VM = mem:AY
               ds:VM = ds:AY, cs:VM = cs:AY, ss:VM = ss:AY

As the VM implements the language Y, only ds:AY is accessible to Y.

        eq 6)  ds:AY^Y

Now, in the final phase, the VM is written in Y, so all the storage, ds:VM, cs:VM, ss:VM, is accessible to Y.

This shows that all memory of the virtual machine written in Y is accessible to the language Y, including the code segment and data segment of language Y itself.  Hence the manipulation of the VM is expressible in language Y.  This is how a programmer gains control over every aspect of running a program in Y; it makes a program that helps debugging possible.

If VM/Y is not compiled into machine code, the only way to run it is on top of VM/X.  What is the minimum functionality of VM/X that must exist to support running VM/Y?  A VM consists of three main parts: a reader, a parser, and an executor.  The reader reads input (a source program) and turns it into tokens.  The parser takes tokens and translates them into an internal form (a parse tree or linear code).  The executor evaluates the internal form (running the code).

If the reader and parser written in Y have already been transformed into their internal form, then only an executor is necessary to run them.  So the executor (call it EX/X) is the minimum functionality required to run VM/Y.  It is necessary to have a utility (written in X) that loads the internal form of VM/Y into the data segment of EX/X (ds:EX/X) to bootstrap the system.
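The minimal executor plus loader can be sketched as follows (a Python illustration; the instruction format, opcodes, and class names are mine, not from any real system):

```python
# Bootstrap sketch: EX/X is only an instruction loop; the "image" is
# VM/Y already translated into internal (linear) form.  The loader puts
# the image into the executor's data segment and then the loop runs it.

class Executor:                        # the minimal EX/X
    def __init__(self):
        self.ds = []                   # data segment, holds the loaded image
        self.ss = []                   # evaluation stack

    def load(self, image):
        self.ds = list(image)          # the loader utility, written in X

    def run(self):
        pc = 0
        while pc < len(self.ds):
            op, arg = self.ds[pc]
            if op == "push":
                self.ss.append(arg)
            elif op == "add":
                self.ss.append(self.ss.pop() + self.ss.pop())
            pc += 1
        return self.ss.pop()

ex = Executor()
ex.load([("push", 2), ("push", 3), ("add", None)])
# ex.run() returns 5
```

Everything above the executor, including the reader and parser of Y, is just data in ds:EX/X; the executor itself never needs to parse anything.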

The complete picture of the storage, at the level of an application AY written in language Y running on VM/Y on top of EX/X on a computer, is (the lowest level is mem:EX):

mem:EX = ds:EX^X, cs:EX, ss:EX
mem:VM/Y = ds:VM/Y, cs:VM/Y, ss:VM/Y
mem:AY = ds:AY^Y, cs:AY, ss:AY
ds:EX^X = mem:VM/Y^Y
mem:VM/Y = mem:AY
The leap of understanding comes from recognising that although only ds:AY^Y holds directly, since mem:VM/Y^Y and mem:VM/Y = mem:AY, it follows that mem:AY^Y (including cs:AY^Y and ss:AY^Y).

14 September 2002
P. Chongstitvatana