How the assembler works

The assembler works in two passes.  The first pass collects all symbols and resolves references to them.  The second pass generates the code.  The symbol table is a central data structure shared between the two passes; it stores the symbols and their references.  Before using the assembler, one must understand the syntax of the assembly language.

syntax of the assembly language

;; comment
.s	;; define symbol
symbol value
. . .
.a n	;; set address to n
.c	;; code segment
:label	op opr1 opr2 ...
. . .	
.w 	;; data segment
sym sym ...
.e	;; end of program

.s .a .c .w  can occur in any order.  .e is the last line of the program.

opr -> n #n @n +n #sym @sym +sym sym

The convention for operand ordering is: op dest source.  Each operand is written with a prefix that identifies its addressing mode, which simplifies the assembler.

ld r1, 10(r2)     is written as  ld r1 @10 r2
ld r1, (r2+r3)          "        ld r1 +r2 r3
ld r1, #200             "        ld r1 #200
add r1, r2, r3          "        add r1 r2 r3
add r1, r2, #20         "        add r1 r2 #20
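
The prefix convention can be sketched in a few lines of Python.  This is only an illustration, not the assembler's actual code: the function names and the mode names ("plain" in particular) are assumptions made for the example, and only unsigned decimal numbers are recognised.

```python
# Hypothetical sketch: identify the addressing mode of an operand from its prefix.
# The mode names below are assumptions; "plain" stands for an unprefixed n or sym.
PREFIX_MODES = {
    "#": "immediate",
    "@": "displacement",
    "+": "index",
}

def classify_operand(text):
    """Return (mode, body, is_number) for one operand such as '@10' or 'r1'."""
    mode = PREFIX_MODES.get(text[0], "plain")
    body = text[1:] if mode != "plain" else text
    is_number = body.isdigit()          # unsigned decimal only in this sketch
    return mode, body, is_number

def split_instruction(line):
    """Split a line such as 'ld r1 @10 r2' into the opcode and classified operands."""
    op, *operands = line.split()
    return op, [classify_operand(o) for o in operands]

print(split_instruction("ld r1 @10 r2"))
# ('ld', [('plain', 'r1', False), ('displacement', '10', True), ('plain', 'r2', False)])
```

Because the mode is visible from the first character of the operand, no look-ahead or parenthesis matching is needed, which is the simplification the notation is aiming for.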

The assembler does not check for every illegal combination of opcode, addressing mode and operands.

Assembly starts by scanning the input file and collecting all symbols into the symbol table.

symbol table

Several predefined symbols are already in the table: the opcodes, r0..r31, the conditionals, and stop.  The predefined opcodes are: ld st jmp jal jr add sub mul div and or xor shl shr trap.  The conditionals are: always eq neq lt le ge gt.
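
As an illustration, the predefined part of the table could be set up as below.  This is a sketch under assumptions: the table layout (name mapped to a kind and a value) is invented for the example, register numbers are assumed to be 0..31, and the numeric encodings of opcodes, conditionals and stop are not given in this note, so they are left unset rather than guessed.

```python
# Hypothetical sketch of preloading the symbol table with the predefined symbols.
OPCODES = "ld st jmp jal jr add sub mul div and or xor shl shr trap".split()
CONDITIONALS = "always eq neq lt le ge gt".split()

def predefined_symbols():
    """Return a dict: name -> (kind, value).  Kinds are names made up for this sketch."""
    table = {}
    for name in OPCODES:
        table[name] = ("op", None)        # encoding not stated in this note
    for i in range(32):
        table[f"r{i}"] = ("reg", i)       # assumption: rN has value N
    for name in CONDITIONALS:
        table[name] = ("cond", None)      # encoding not stated in this note
    table["stop"] = ("special", None)     # encoding not stated in this note
    return table

table = predefined_symbols()
print(len(table), table["r31"])
```

User symbols collected in pass 1 would then be added to the same table, so a lookup does not need to distinguish predefined names from user-defined ones.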

pass 1
collect symbols and resolve reference
build symbol table
store token list
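
A minimal sketch of the symbol-collecting part of pass 1 is given below.  It is not the assembler's own code.  It assumes that every instruction and every item in a .w line occupies one word, that numbers are decimal, and that a label is the first word of a line and starts with a colon; the function name pass1 and the demo program are made up for the example.

```python
# Hypothetical sketch of pass 1: walk the source, track the address, record symbols.
def pass1(source):
    symbols = {}
    address = 0
    section = None
    for raw in source.splitlines():
        line = raw.split(";;", 1)[0].strip()    # drop comment and blanks
        if not line:
            continue
        words = line.split()
        head = words[0]
        if head == ".e":                        # end of program
            break
        if head == ".a":                        # set address to n
            address = int(words[1])
        elif head in (".s", ".c", ".w"):        # switch section
            section = head
        elif section == ".s":                   # "symbol value" pair
            symbols[words[0]] = int(words[1])
        elif section == ".c":
            if head.startswith(":"):            # label takes the current address
                symbols[head[1:]] = address
                words = words[1:]
            if words:                           # assumption: one word per instruction
                address += 1
        elif section == ".w":
            address += len(words)               # assumption: one word per item
    return symbols

DEMO = """;; demo program (hypothetical)
.s
lv1 4
.a 100
.c
:start ld r1 #200
       add r1 r2 #20
:done  add r1 r2 r3
.e
"""
print(pass1(DEMO))   # {'lv1': 4, 'start': 100, 'done': 102}
```

Since every label has an address by the end of this walk, a reference to a label that appears later in the file can be resolved without a third pass.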

The token list is an array of tokens.  Each token stores its type, mode, reference and line number (the line number in the source file).  The line number is used when reporting errors.  The type is one of: sym num op dot.  The mode is the addressing mode: absolute, displacement, index, immediate, reg-reg, reg-imm, special.

For example, ld r1 @lv1 base  generates a list of four tokens:
{type,mode,ref} 
{ {op,disp,ld}, {sym,reg,r1}, {sym,disp,lv1}, {sym,reg,base} }
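
The token record and the use of its line number can be sketched as follows.  This is an illustration under assumptions: the Token class, the helper undefined_symbols and the line number 7 are made up, and ref holds the symbol name here, whereas the real assembler may store a reference into the symbol table instead.

```python
from dataclasses import dataclass

@dataclass
class Token:
    type: str   # sym, num, op or dot
    mode: str   # addressing mode
    ref: str    # in this sketch, the name of the symbol
    line: int   # source line number, used in error reports

def undefined_symbols(tokens, symbols):
    """Return one error message per sym token that is missing from the symbol table."""
    return [f"line {t.line}: undefined symbol {t.ref}"
            for t in tokens
            if t.type == "sym" and t.ref not in symbols]

# The token list from the example above, assumed to come from source line 7.
tokens = [
    Token("op", "disp", "ld", 7),
    Token("sym", "reg", "r1", 7),
    Token("sym", "disp", "lv1", 7),
    Token("sym", "reg", "base", 7),
]
symbols = {"r1": 1, "base": 2}          # lv1 deliberately left undefined
print(undefined_symbols(tokens, symbols))
# ['line 7: undefined symbol lv1']
```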

pass 2
generate code from token list

The output format, intended for the loader of the simulator, is as follows:

a num              set address
{l,d,x} num+       instruction
w num              defined word
e                  end of file
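
To show how simple this format is to read, here is a sketch of a reader for it.  It is not the simulator's loader: the function name read_object is made up, numbers are assumed to be decimal, and the fields of an instruction record are kept as a plain list without interpreting them.

```python
# Hypothetical sketch: read the assembler output into (kind, numbers) records.
def read_object(text):
    records = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind = fields[0]
        if kind == "e":                          # end of file: stop reading
            break
        if kind not in ("a", "l", "d", "x", "w"):
            raise ValueError(f"unknown record type: {kind}")
        records.append((kind, [int(f) for f in fields[1:]]))
    return records

# The numbers in this sample are arbitrary and only show the shape of the records.
sample = "a 100\nl 0 200\nw 5\ne\n"
print(read_object(sample))
# [('a', [100]), ('l', [0, 200]), ('w', [5])]
```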

s2 instruction format

L-format  op:5 r1:5 ads:22
D-format  op:5 r1:5 r2:5 disp:17
X-format  op:5 r1:5 r2:5 r3:5 xop:12

l num num 
d num num num 
x num num num num

ads and disp are sign-extended to 32 bits.
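
The field widths and the sign extension can be illustrated with a short sketch for the L-format.  The function names are made up, and the bit layout is an assumption: the fields are taken to be packed in the order listed, with op in the most significant bits.

```python
# Hypothetical sketch: pack and unpack an L-format word (op:5 r1:5 ads:22).
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def pack_l(op, r1, ads):
    # assumption: op occupies bits 31..27, r1 bits 26..22, ads bits 21..0
    return ((op & 0x1F) << 27) | ((r1 & 0x1F) << 22) | (ads & 0x3FFFFF)

def unpack_l(word):
    op = (word >> 27) & 0x1F
    r1 = (word >> 22) & 0x1F
    ads = sign_extend(word & 0x3FFFFF, 22)      # 22-bit field widened to a signed value
    return op, r1, ads

word = pack_l(3, 7, -4)
print(unpack_l(word))            # (3, 7, -4)
print(sign_extend(0x1FFFF, 17))  # -1, a 17-bit disp field of all ones
```

Python integers are unbounded, so sign_extend returns a signed value; masking the result with 0xFFFFFFFF gives the 32-bit pattern a simulator register would hold.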

4 December 2001
Prabhas Chongstitvatana
