Stack Frame Caching


An alternative design of sx, for performance improvement, employs stack frame caching.  In an analysis of the profile of execution of instructions at the level of clock cycle, most frequently used instructions are:
  1. get  (ldstatic in java)
  2. lit, ld, add
  3. st, put, ldx, stx
notable is the "inc v" instruction which is not generated from this compiler (gen2.txt).  It is used often but will not be included in this experiment.

To improve the performance, the effort should be spent on improving these instructions. The microprogram for these instructions are:

Let the shorthand notation of  push/pop be
push x is   
  sp+1->sp
  x->mW(sp)

pop x is
  mR(sp)->x
  sp-1->sp

get
  push ts
  alu(fp-arg)->tbus, mR(tbus)->ts

lit
  push ts
  arg->ts

ld
  push ts
  mR(arg)->ts

add
  pop ff
  alu(ts op ff)->ts

st
  ts->mW(arg)
  pop ts

put
  alu(fp-arg)->tbus, ts->mW(tbus)
  pop ts

ldx  {ads idx}
  pop ff
  alu(ts+ff)->tbus, mR(tbus)->ts

stx  {ads idx val}
  pop nx
  pop ff
  alu(nx+ff)->tbus, ts->mW(tbus)
  pop ts

There are two key ideas:

  1. push/pop can be done in one clock if  sp can be incremented/decremented independent of alu and they can achieved pre-increment and post-decrement at the proper phi2 of the cycle.
  2. to improve "get", the most frequently used instruction, the local variable must be stored in a register instead of memory as push/pop also accesses memory.  If it is done properly "get" will take just one clock.
get
  $1 push ts, $2 v[arg]->ts

where $1 is phi1 and $2 is phi2, v[] is the cache register.  the old value ts is pushed into memory at $1, before the new value from v[arg] is written to ts at $2.

push/pop

To push a register to memory in one clock, the "sp+1" must appear at address bus from the beginning of $1, ts is presented to data bus at the same time, at the begining of $2 memory write signal is ended (it is assumed that the value is written into memory here), the value of "sp+1" is also written to sp at this time.

Poping a register can be done in one cycle.  The value "sp" is presented to the address bus at $1. The memory is read.  At $2, the data is latched to a register, at the same time, "sp-1" is written to sp (post-decrement).

lit
  $1 push ts, $2 arg->ts

ld

  push ts
  mR(arg)->ts   

"ld" cannot be done in one cycle as it reads the memory twice, the first one for push ts, the second one for the value.

add, 2 cycles
st, 2 cycles

put
  ts->v[arg], pop ts

ldx, 2 cycles
stx, 4 cycles

to improve get, the stack frame must be cached.  The first thing that must be done is improving the SP.

implementing sp unit


The sp unit performs pre-increment $1, post-decrement $2, and load value from bus $2.

           sp+1
     ||<--------------
     ||               |
 <-- ||<---- sp --||---- +/- 1 <--
          |       ||<-bus         |
          |                       |
          -------------------------

all mux are asserted at $1, latching the register sp is at $2.

Stack frame

A number of registers are used to cache part of stack frame.  The stack frame remains unchange from the original design, however, lv1..lvn are cached into v1..vn the cache registers.  In caching the stack frame, the call/ret must save/restore these registers to the memory.

call
* save v to the current stack frame
  push ts (flush stack)
  create a new frame
  save fp' and return address
* cache v from the new frame
  update sp

ret
  restore return address, sp
  restore the old frame
* cache v of this current frame (restore old v)
  if it is "ret" pop ts

the lines with * are the additional work that must be done to do stack frame caching.

To simplify the logic, the save/cache v will be of fixed size, 1..4.

save v
  alu(fp-n)->fp
  vn->mW(fp), alu(fp+1)->fp
  ...
  v1->mW(fp), alu(fp+1)->fp  ; fp at right place

cache v
  alu(fp-n)->fp
  mR(fp)->vn, alu(fp+1)->fp
  ...
  mR(fp)->v1, alu(fp+1)->fp  ; fp at right place

if the size of caching register is n then the extra cycle in call/ret instruction is 3(n+1).

13 July 2006