IOI Computer Olympic

Year 2001

Prabhas Chongstitvatana
Department of computer engineering
Chulalongkorn university

Last year lecture (probabilistic algorithm)

Round 1 Beauty of programs

Synopsis

I want to show you the beauty of programs. Programs can be think of as an asthetic object. To appreciate it you should know the reason of its design and construction. I am going to show you a small yet beautiful programming design. I will argue about its beauty (why and how). You should play with it and read its listing.
You will learn by doing more than by listening. Of course, having a teacher to tell you what is good is useful.

What is this thing that computer obeys?
What is a language?
A language
Lexicon
Syntax
Semantic
Programming styles
Iteration vs recursion
Internal forms
Data structure : access functions
How eval works?
It sounds like an infinite regression!
next ....

What is this thing that computer obeys?

Programming is still an art. It requires skill, which is acquired through a lot of practice. Its foundation lays in mathematics. The study of programs as an object in itself is interesting and useful. By such study we can understand more thoroughly the relationship between a program and the result we want it to accomplish. It is my intention in this short lecture to initiate you towards the study of programs. Hopefully, to give you some insight into programming but my higher hope is to make you appreciate programs as beautiful man-made objects.

I will start with thinking about the obvious (which is a good way to exercise your imagination because it needs a lot of imagination to describe the obvious everday-things). All of you have done quite a number of programming exercises. You all have used tools in the computer when you go through your task of programming such as editor, integrated environment, command line etc. Most of the time you type something into a computer either the program text or the commands to perform some tasks.

I want to ask an obvious question:

"What is it that you type into a computer via its keyboard?"

This question is not so obvious to answer. I would like all of you to take some time to think about how you are going to answer it and write your answer down. (Stop here a minute and write down your answer). Most of the answers I have heard fall into one of the following categories:

1) It is the communicating medium between human (programmers) and machines.
2) It is the commands that we use to control machines.
3) It is the relationship between our ideas what we want machines to do and the physical devices, i.e. the program text is transformed into bits and then to signals in which the hardware know how to perform the task.
4) It is the expression of our idea that we want machines to accomplish in terms that machines can accept and follow.

The above examples are by no means exhaustive. It shows that the "meaning" of the text that we typed into a computer can be viewed from many different perspectives. This "meaning" exists in the mind of human-programmers of the machines. I will discuss how this meaning arises from the text.

The way I want to take you to this subject is to get away from your familiar surrounding and put you into a strange land. This will make you alert and ready to know new things. I am going to do this by not using any computer language that you are familiar with such as C, Pascal, C++, Java etc. Instead, I will introduce a new language which is very small. I will use it as our subject of study.

We do not want to study how to program in yet another new language. We want to study how a language is created and how it works. We will limit our discussion to the language that is represented by a text. Although languages can be represented in many other forms such as pictures, graphs, or even human voice, a text is the most popular form in programming practice (and almost the only form exists today).

What is a language?

In talking about languages, we want to be a bit more precise. I will introduce some notions and terminology to define what a language is.

An "alphabet" is a finite nonempty set. The elements of an alphabet /Sigma are called "letters" and "symbols". A "word" over an alphabet /Sigma is a finite string consisting of zero or more letters of /Sigma, whereby the same letter may occur several times. The string consisting of zero letter is called the "empty word", written "nil". For example, nil, 0, 1, 010, 1100 are words over the alphabet /Sigma = {0,1}. The set of all words over an alphabet /Sigma is denoted by /Sigma-star. The set /Sigma-star is infinite for any /Sigma. Subsets of /Sigma-star are referred to as "(formal) languages" over /Sigma.

The above paragraph sounds very "mathematical". Indeed, it is, it is the usual definition of languages from a branch of mathematics called "formal language theory" which is one of the more useful mathematical theory in computer science. The set of /Sigma, its elements and strings of its elements could be called a vocabulary, words and sentences respectively in our usual sense of human-languages.

We will call a textual language (a program) a surface language. This surface language will be transformed into an "internal form" by a process called "parsing (or compiling). This internal form will be executed by a machine by the process we called "evaluation". For example, a computer translates a source code into a machine code, which can be executed on a computer.

surface language (program)
     |
     v
   ------
   parsing
   ------
     |
     v
internal forms
     |
     v
   ----------
   evaluating
   ----------
     |
     v
    result

The process how a computer accepts programs and executes it can be drawn as the following diagram.

|---> reader
|       |
|       v
|     tokeniser
|       |
|       v
|     parser
|       |
|       v
|---- eval

"reader" reads sentences of program from the input device, such as a keyboard or a file. A program is composed of string of characters. "tokeniser" groups characters into words. "parser" transforms sequences of words into internal forms. "eval" evaluates internal forms and prints the result (executes a program). This loop is called read-eval-print loop. It depicts programming interactively. This type of programming is called "interpretive" as opposed to "compilative" or using a compiler (which you do edit-compile-run). Using a compiler is currently the most popular form of programming.

A language

Now, I will describe a language. This language is defined by S. Kamin in the chapter 1 of his textbook.
(* Prof. Samuel Kamin of the department of computer science, University of Illinois at Urbana Champagne, Programming Languages: An interpreter-based approach, Addison-Wesley, 1990.)

The beauty of this language stems from its smallness and its elegance. There are 12 words, which are already defined (called reserved words). Only one form of syntax rule is required, using only two characters as syntactic features (the left and right parenthesis). The grammar for this language can be written down in just a few lines. Despite of its look of a toy-language, I will illustrate the beauty of its completeness by showing that the whole executable system including a parser and an evaluator can be completely written in itself. If you want to see it now, it is "here" (about 500 lines of executable program written in itself). In fact, I am cheating a bit here, as I have to add 3 more predefined words to create and manipulate data structure.

This is a procedural (imperative) language. You can contrast this language with any standard procedural language such as C and you will see the stark difference of size! (counted by the number of reserved words and the number of grammar rule describing its syntax).

I will denote this language as A1 and its internal form as A0. Later, I will introduce a more syntax-friendly version of A1, called A2. All values produce in this language is in the domain of integer. I will describe three aspects of the language.

1 its lexicon -- the words in the language
2 its syntax -- the composition of words into sentences
3 its semantic -- the meaning of a program

A1 lexicon

There are 12 predefined words, which are used to compose programs.

set if while begin + - * / = < > print

The first four words {set if while begin} are called control-op and the rest are called value-op. The value-op takes its arguments (mostly two, except print) and evaluates all arguments before applying its operator. The operators have obvious meaning, such as + is add etc. The control-op is different, it selectively evaluates its arguments. The meaning of these operations will be explained in the section semantic.

A1 syntax

Every sentence in A1 are expression which return values. An expression has the form

(op e)

where e denotes an expression, op can be any reserved word or user-defined words. The control-op has the following syntax

        (set name e)
        (if e1 e2 e3)
        (while e1 e2)
        (begin e1 e2 ... en)

The name of a variable and user-defined words can be any string of characters except the reserved words. (We arbitrarily limit the length of a name to 15 characters in this implementation). The syntax for defining a user function (non-builtin) is

(define name (formals) e )

where formals is the list of formal parameters.

The grammar for A1 is as follows. ( * denotes zero or more repetition)

toplevel ->     e | define-op
e ->            name | control-op | value-op
control-op ->   "(" "set" name e ")" |
                "(" "if" e1 e2 e3 ")" |
                "(" "while" e1 e2 ")" |
                "(" "begin" e1 e2 ... en ")"
value-op ->     "(" op args ")"
define-op ->    "(" "define" name "(" formals ")" e ")"
op ->           "+" | "-" | "*" | "/" | "=" | "<" | ">" | "print" | name
args ->         name* | number*
formals ->      name*
name ->         variable names or user-defined names
number ->       integer

A1 semantic

At the toplevel, it is a read-eval-print loop. A user types in an expression or define-op and the parser translates it into an internal form and then evaluates and returns its value. Examples will make this clear:

-> (+ 2 3) user input
5 computer reply
-> (define square (x) (* x x ))
square
-> (square 4)
16
->

The value-op evaluates all of its arguments before applying the operator. The control-op treats its arguments in a different way.

(set name e)

"set" assigns a value of an expression e to the variable "name". A variable can be local or global. A variable is local when its name is listed in the formal parameters of the current function otherwise it is global. The returned value is the value of e.

(if e1 e2 e3)

"if" evaluates e1 and if its value is non-zero (true) it evaluates e2 otherwise evaluates e3. The returned value is the value of the last expression it evaluates.

(while e1 e2)

"while" is an iterative operator. It evaluates e1, if its value is non-zero it evaluates e2. This process is repeated until e1 returns zero. The returned value is the value of e2 before the loop terminates.

(begin e1 e2 ... en)

"begin" is a sequencing operator. It evaluates e1 e2 ... en sequentially and returns the value of en.

(define name (formals) e )

The define-op is used to define a user-defined function. You may notice that there are only the formal parameters and there are no local variables. One way to have local variables is to define an "auxilliary function" which has extra parameters which will be used as local variables.

Example

(define isNumber (s1) (isNumber2 s1 0))
(define isNumber2 (s1 i) ... )

isNumber has only one parameter (s1) and we define isNumber2 to have one extra parameter (i) that is used as a local variable (an index into a string).

Programming styles

It is educational to read through the self-interpret program (lib.k, data.k, base.k, eval.k)
(* the file name extension .k is used to honor Prof. Kamin, the creator of the language)

You will see quite a number of distinct styles that make a program looks much cleaner than a program in an ordinary language. (We are not going to debate whether one style is better than another). Because every expression returns a value, this leads to a very natural way to express a cascade-test construct.

        (if test1 action1
        (if test2 action2
        . . .
        defaut-action)...)

actionx is an expression hence returns a value without having to assign this value to a variable as in C.

        if (test1) n = action1;
        else if (test2) n = action2;
        . . .
        else n = default-action;
        return n;

Because of this semantic, there is no need to have a special word to return a value either ("return" in C, assign to the function name in Pascal).

You can see the beautiful construct such as

(define and (x y) (if x y 0))

Try this in your favorite language and compare!

Iteration vs Recursion

The following examples contrast two styles.

(define findName2 (name i found)
(begin
    (while (and (<= i numNames) (not found))
      (if (str= (def-name-at i) name)
        (set found 1)
        (set i (+ 1 i))))
    (if found i 0)))

(define findName3 (name i)
(if (> i numNames) 0
(if (str= (def-name-at i) name) i
(findName3 name (+ 1 i)))))

"findName" searches a name in the symbol table (def-name). "findName2" is iterative and uses "found" to break while loop. "i" is set to "i + 1" for the next iteration. "findName3" is recursive, "i + 1" is passed as a parameter to the next recursion. Please note the absence of "set" in the recursive version.

The next example is the function "atoi" which converts a string such as "-1234" into its value -1234.

(define atoi4 (s1 i m)
(begin
    (if (= 45 (fetch s1)) (set i 1) 0)
    (while (!= 0 (fetch-at s1 i))
      (begin
        (set m (+ (* 10 m) (- (fetch-at s1 i) 48)))
        (set i (+ 1 i))))
    (if (= 45 (fetch s1))
      (- 0 m)
      m)))

(define atoi (s1)
(if (= 45 (fetch s1))
(- 0 (atoi2 (+ 1 s1) 0))
(atoi2 s1 0)))

(define atoi2 (s1 m)
(if (= 0 (fetch s1))
m
(atoi2 (+ 1 s1) (+ (* 10 m) (- (fetch s1) 48)))))

"atoi4" uses iteration with "i" as an index of character and "m" as a local variable storing the converted value. "atoi" and "atoi2" are the recursive version. "atoi" handles the negative sign and calls "atoi2" to convert the string. You can see the simplicity of the structure in the recursive version and the lack of "set".

You may think that recursion consumes more memory and runs slower than iteration. Let us expose more details of this argument. First, the memory concern, most procedural languages use stack to store all local variables and actual parameters. Recursive call will consumes this stack where as iteration does not. However, for the case that the recursive call is the last function executed in a user-defined function, so called "tail-recursion", this stack growing can be eliminated.

stack   |-----| activation record for call n
|     |     |
|     |-----|
v     |     | activation record for call n+1
growing |-----|

We can eliminate the activation record of the next call (n+1) by realising that the call is the last function executed hence all local variables and parameters of the current activation need not to be saved (as they are not used anymore). The actual parameters of the next recursive call can substitute the current activation record in-place. A parser or a compiler can recognise tail-recursion and performs this optimisation (for simplicity, our parser does not).

Second, the speed concern, the speed of recursive call can be slow due to the overhead of a function call. A function call requires calculating a number of pointers to adjust the stack (see applyUserfun2 for details). Where as for the iteration the loop can be achieved by "jumping" which is a cheaper operation than a call. It depends on the implementation how much this difference will be. In our implementation, the difference is not much because both are interpreted, not compiling into the actual machine codes.

Internal forms

When an expression (in a source language, or A1) is entered via a keyboard, it is transformed into an internal form before it is evaluated (executed). This internal form has the structure in the form of a tree (inverted,
the root is at the top). We call this internal form distinct from the surface language as it is a lower level, A0. One surface language may have different internal forms and different surface languages may have the same internal form. You can think of an internal form as a machine language and a surface language as a high level language. However, an internal form is unlike a machine language. It is not directly executable in any processor (except you want to design a special processor for it). There is a program that takes an internal form and runs it. This program is called in many names: an interpreter, a virtual machine or an evaluator in our language.

Suppose we have a function power(x y) which raises x to the power y.

(define power (x y)
(if (= 0 y) 1
(if (= 1 y) x
(* x (power x (- y 1))))))

The expression defining the body of power can be drawn as:

              if
             / |\
           /   | \
          =    1   if
         / \       /|\
        0   y    / | \
                =   x   *
               / \     / \
              1   y   x   power
                          / \
                         1   -
                            / \
                           y   1

The data structure of the tree composed of two kinds of nodes: extype and exltype. "extype" (expression type) stores symbols and "exltype" (expression list type) stores links.

                extype                            exltype
             |-----------|                      |---------|
             | type     |                      | head   |
             |-----------|                      |---------|
             | {optr, num, varble}              | tail   |
             |-----------|                      |---------|
             | args     |
             |-----------|

"extype" has three fields: type, {optr, num, varble}, args. The first field "type" stores its type: valexp, lvexp, gvexp, apexp. "valexp" is the constant type and the value is stored in the second field "num". "lvexp" "gvexp" denote local variables and global variables. Their references to the variables are stored in the second field "varble". "apexp" is the function application type. The reference to the function is stored in the second field "optr" and its list of arguments is stored in the third field "args". You may notice that extype stores four kinds of data that shared the same places. It is called union in C and variant record in Pascal.

"exltype" is the linked-list node with two fields: head and tail. Now we will draw the previous program (power) in this concrete form. The type of "extype" is denoted by V (valexp), L (lvexp), G (gvexp) and A (apexp). The exltype node is drawn as follows. The linked-list is drawn as

      .--- tail
      |
    head

with the end of list drawn as .---|

|
v
A if .
      |
      .--------.---------.---|
      |        |         |
     A = .    V 1       A if .
         |                   |
         .---.--|            .--------.--------.----|
         |   |               |        |        |
        V 0 L y            A = .     L x      A * .
                                |                  |
                                .----.----|       . . . .
                                |    |
                               V 1   L y

The value of {optr and varble) refers to the symbol table, which stores all attributes of a name predefined-function, user-defined function, global variables). We will not go into details of the implementation such as the symbol table, the data segment. You can study it from the program itself.

Data structures

How all these data structures are implemented in our system, which provide only scalar values in integer domain? For example, how to implement a pointer (so that we can have linked-list and others)? The trick is simple. If we know the "base" of the data, then the reference to the data is just the offset from the base. This offset is not the "address", it is an ordinary integer. In order to provide aggregate of data, we need an array,
which is a data structure accessible by an index. The index is the offset that we use as a reference to data.

To have an array we need 3 more words: array, fetch, store in this syntax.
        (array name size)
        (fetch name)
        (store name value)

"array" allocates memory of "size", where "size" is an expression, for example (* 4 10) or 40. What really is "name"? It is a global variable, just like the variable that is defined and set its value by "set", for example
(set name 1) and the name stores the reference to that memory block.

name memory

       |-------|        |-------|
       |       |------->| value |
       |-------|        |-------|
                        |       |
                        | . . . |

The memory in our system is a heap, a large block of memory with the base address at "heap". To address any where in the heap we use "ref" which is an index into this array of integer. If the heap is defined in C,
int heap[MAXMEM] then "ref" is heap[ref].

             memory
heap --> |---------|
            |         |
            |         |
            |.........|
   ref --> |variable |
            |.........|
            |         |
            |         |
              . . .

"fetch" evaluates its argument ("name") gets its value, which is the "ref" to the data segment and using this reference to get the value. This "indirection" is called dereferencing. "store" similarly performs storing
a value into a name indirectly.

With "fetch" and "store" you can define access functions to your user-defined data structure. A calculation on pointer to a variable becomes an ordinary arithmetic on integer because the reference is just an integer. For example, you can define indexing into an array:

(define fetch-at (name i) (fetch (at name i)))
(define store-at (name i value) (store (at name i) value))
(define at (base index) (+ base index))

You can use them to copy one array to another (in the example copy y to x)

(set N 10)
(array a1 N)
(array a2 N)
(define array-copy (x y n)
(if (= 0 n) 0
     (begin
        (store-at y n (fetch-at x n))
        (array-copy x y (- n 1))))))

(array-copy a1 a2 N)

Another example is the access to internal forms, which are composed of extype and exltype nodes. These nodes are aggregated type and can be constructed as struct, union in C or record, variant record in Pascal. We allocate the memory of size 3 integers for the extype node and 2 integers for the exltype node. The following definitions show how to create access functions for them.

(set MEMMAX 10000)

(array heap MEMMAX)

(define e-type (e) (+ heap e))
(define e-num (e) (+ heap (+ 1 e)))
(define e-optr (e) (+ heap (+ 1 e)))
(define e-args (e) (+ heap (+ 2 e)))
(define e-varble (e) (+ heap (+ 1 e)))
(define e-head (e) (+ heap e))
(define e-tail (e) (+ heap (+ 1 e)))

Then, we can use these access function to manipulate the data structure, such as

(store (e-type e) valexp)
(store (e-num e) n)

(define car (el) (fetch (e-head el) )))

...
(if (= lvexp (fetch (e-type (car args)))) . . .

How eval works?

The evaluator (function eval) is the machine that executes the internal forms (A0). The "machine code" is the extype node. The extype node has four types { valexp, lvexp, gvexp, apexp }. The node type valexp returns the value stored in its field "num". The node type lvexp and gvexp return values from local data and global data with the reference in the "varble" field respectively.

node type return

valexp          (fetch (e-num e))
lvexp           (getval (fetch (e-varble e)) rho)
gvexp           (getcal (fetch (e-varble e)) globalEv)

The global data is allocated from the data segment when the variable is defined. The local data is dynamic and is allocated from the stack segment. The local data is created when passing the actual parameters to a function and is destroyed when exit from the function. Because the function call has the behaviour of last-in first-out (LIFO) as the earliest call will exit the last, a stack structure is suitable for allocating the local data for function calls.

Using a stack gains a huge benefit of automatic reclamation of the memory when the local data is not longer in used. (You may think this is obvious but this is the beauty of it. Think about other alternative way of storing
local data such as linked-list. The local data once ceased to exist will have to be reclaim by some method).

             stack
   rho --> |       |
            |-------|
local1 --> |       |
            |-------|
local2 --> |       |
            | . . . |

The access functions to variables are defined as

(define assign (name num rho) (store-at rho name num))
(define getval (name rho) (fetch-at rho name))

The global and local data can be handled in the same way except that the global data is in the data segment (its base is "data") and the local data is in the stack segment (its base is "stack").

It sounds like an infinite regression!

(under construction)

Coming soon

The smallest language
epilogue
my contribution