Stack Frame Caching
An alternative design of sx, for performance improvement, employs stack
frame caching. In an analysis of the profile of execution of
instructions at the level of clock cycle, most frequently used
instructions are:
- get (ldstatic in java)
- lit, ld, add
- st, put, ldx, stx
notable is the "inc v" instruction which is not generated from this
compiler (gen2.txt). It is used often but will not be included in
this experiment.
To improve the performance, the effort should be spent on improving
these instructions. The microprogram for these instructions are:
Let the shorthand notation of push/pop be
push
x is
sp+1->sp
x->mW(sp)
pop x is
mR(sp)->x
sp-1->sp
get
push ts
alu(fp-arg)->tbus,
mR(tbus)->ts
lit
push ts
arg->ts
ld
push ts
mR(arg)->ts
add
pop ff
alu(ts op ff)->ts
st
ts->mW(arg)
pop ts
put
alu(fp-arg)->tbus,
ts->mW(tbus)
pop ts
ldx {ads idx}
pop ff
alu(ts+ff)->tbus,
mR(tbus)->ts
stx {ads idx val}
pop nx
pop ff
alu(nx+ff)->tbus,
ts->mW(tbus)
pop ts
There are two key ideas:
- push/pop can be done in one clock if sp can be
incremented/decremented independent of alu and they can achieved
pre-increment and post-decrement at the proper phi2 of the cycle.
- to improve "get", the most frequently used instruction, the local
variable must be stored in a register instead of memory as push/pop
also accesses memory. If it is done properly "get" will take just
one clock.
get
$1 push ts, $2
v[arg]->ts
where $1 is phi1 and $2 is phi2, v[] is the cache register. the
old value ts is pushed into memory at $1, before the new value from
v[arg] is written to ts at $2.
push/pop
To push a register to memory in one clock, the "sp+1" must appear at
address bus from the beginning of $1, ts is presented to data bus at
the same time, at the begining of $2 memory write signal is ended (it
is assumed that the value is written into memory here), the value of
"sp+1" is also written to sp at this time.
Poping a register can be done in one cycle. The value "sp" is
presented to the address bus at $1. The memory is read. At $2,
the data is latched to a register, at the same time, "sp-1" is written
to sp (post-decrement).
lit
$1 push ts, $2 arg->ts
ld
push ts
mR(arg)->ts
"ld" cannot be done in one cycle as it reads the memory twice, the
first one for push ts, the second one for the value.
add, 2 cycles
st, 2 cycles
put
ts->v[arg], pop ts
ldx, 2 cycles
stx, 4 cycles
to improve get, the stack frame must be cached. The first thing
that must be done is improving the SP.
implementing sp unit
The sp unit performs pre-increment $1, post-decrement $2, and load
value from bus $2.
sp+1
||<--------------
||
|
<-- ||<---- sp
--||---- +/- 1 <--
|
||<-bus |
|
|
-------------------------
all mux are asserted at $1, latching the register sp is at $2.
Stack frame
A number of registers are used to cache part of stack frame. The
stack frame remains unchange from the original design, however,
lv1..lvn are cached into v1..vn the cache registers. In caching
the stack frame, the call/ret must save/restore these registers to the
memory.
call
* save v to the current stack
frame
push ts (flush stack)
create a new frame
save fp' and return address
* cache v from the new frame
update sp
ret
restore return address, sp
restore the old frame
* cache v of this current frame
(restore old v)
if it is "ret" pop ts
the lines with * are the additional work that must be done to do stack
frame caching.
To simplify the logic, the save/cache v will be of fixed size, 1..4.
save
v
alu(fp-n)->fp
vn->mW(fp),
alu(fp+1)->fp
...
v1->mW(fp),
alu(fp+1)->fp ; fp at right place
cache v
alu(fp-n)->fp
mR(fp)->vn,
alu(fp+1)->fp
...
mR(fp)->v1,
alu(fp+1)->fp ; fp at right place
if the size of caching register is n then the extra cycle in call/ret
instruction is 3(n+1).
13 July 2006