Memory system design

A constraint of the von Neumann architecture is that a single memory module of conventional design can access no more than one word during each cycle of the memory clock.

memory cycle time

to read a word
send address
wait until data is valid (READ)
enable data to the bus

to write a word
send address
send data
command WRITE
wait until cycle complete

types of memory cell : RAM — static RAM (SRAM) and dynamic RAM (DRAM)
newer memory/bus technologies : page mode, nibble mode, EDO, RAMBUS, etc.

To increase memory bandwidth (bits/bytes per second)
* reduce cycle time
* increase word size
* concurrent access (banking, interleaving)
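Concurrent access can be sketched as follows — a minimal illustration assuming low-order interleaving and a hypothetical four-bank memory (the bank count and addresses are invented for the example):

```python
def interleave(addr, n_banks=4):
    """Low-order interleaving: consecutive word addresses map to
    consecutive banks, so sequential accesses can overlap."""
    return addr % n_banks, addr // n_banks  # (bank, offset within bank)

# A sequential stream touches every bank in turn:
for addr in range(8):
    bank, offset = interleave(addr)
    print(f"word {addr} -> bank {bank}, offset {offset}")
```

With n banks, up to n accesses can be in flight at once, so a sequential stream sees up to n times the bandwidth of a single module.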

memory hierarchy
processor --- cache --- main memory --- secondary memory

principle of locality of references
* temporal
* spatial  (array, vector, code segment)

advantage : average memory access time is nearly equal to the speed of the fastest memory, whereas the average unit cost of the memory system approaches the cost of the cheapest memory.

figure graph of access time vs cost of memory
 

Cache

cache operation : read request, write request.
cache event : load hit, load miss, store hit, store miss
write operation : write through, write back.
cache performance : hit ratio h = (number of references found in the cache) / (total references).  An example of how the hit ratio affects performance, given the speeds of the cache and main memory: average access time ta = h·tc + (1 − h)·tm
tc = cache access time, tm = miss penalty (time to read a block from main memory)
h = 0.5 to 0.95 step 0.05
tc = 10 ns, tm = 100 ns
ta = 55, 50, 46, 41, 37, 32, 28, 23, 19, 14
tc = 50 ns, tm = 100 ns
ta = 75, 72, 70, 67, 65, 62, 60, 57, 55, 52
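The rows above follow from ta = h·tc + (1 − h)·tm with values truncated to whole nanoseconds; a small sketch reproducing them (integer arithmetic is used so the truncation matches the notes exactly):

```python
def avg_access_time(h_pct, tc, tm):
    """ta = h*tc + (1-h)*tm, with the hit ratio h given as a whole
    percent; result truncated to whole nanoseconds."""
    return (h_pct * tc + (100 - h_pct) * tm) // 100

for tc, tm in ((10, 100), (50, 100)):
    row = [avg_access_time(h, tc, tm) for h in range(50, 100, 5)]
    print(f"tc={tc} ns, tm={tm} ns -> ta = {row}")
# tc=10 ns: [55, 50, 46, 41, 37, 32, 28, 23, 19, 14]
# tc=50 ns: [75, 72, 70, 67, 65, 62, 60, 57, 55, 52]
```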

Cache Organisation
* Fully Associative
* Direct Map
* Set Associative

replacement policy
* random
* least frequently used
* least recently used
* optimal

Caches are classified according to their "organisation".  They differ mainly in the answers to four questions:
1  where can a block be placed in the upper level?
2  how is a block found if it is in the upper level?
3  which block should be replaced on a miss?
4  what happens on a write?

Q1
If each block has only one place it can appear in the cache, the cache is called "direct mapped".
If a block can be placed anywhere, it is "fully associative".
If a block can be placed in a restricted set of places, it is "set associative".

Q2
The cache keeps an address tag with each block; the tag is compared against the requested address to decide whether the block is present.
A direct mapped cache compares exactly one tag.
A fully associative cache compares all tags at once.
A set associative cache compares all tags within the selected set at once.
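The tag check relies on splitting the address into tag, index, and offset fields. A minimal sketch for a hypothetical direct-mapped cache (16-byte lines, 128 lines; these sizes are invented for illustration):

```python
LINE_BITS = 4    # log2 of the 16-byte line -> byte offset within the line
INDEX_BITS = 7   # log2 of 128 lines        -> which line to probe

def split_address(addr):
    """Split a byte address into the (tag, index, offset) fields
    used to locate a block in a direct-mapped cache."""
    offset = addr & ((1 << LINE_BITS) - 1)
    index = (addr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (LINE_BITS + INDEX_BITS)
    return tag, index, offset

# The cache compares exactly one stored tag: the one at `index`.
tag, index, offset = split_address(0x12345)
```

A set-associative cache splits the address the same way, but the index selects a set and all tags in that set are compared in parallel.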

Q3
replacement policy
direct mapped cache has no choice.
the others may use random, Least Recently Used (LRU), Least Frequently Used (LFU), or Optimal.

Q4
two options :write through, write back.
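The difference shows up in how much traffic repeated stores generate to main memory. A toy one-line model, invented purely for illustration:

```python
class TinyCache:
    """One-line cache model contrasting the two write policies."""
    def __init__(self, policy):
        self.policy = policy          # "through" or "back"
        self.line = None              # (addr, value, dirty)
        self.memory_writes = 0        # traffic to main memory

    def store(self, addr, value):
        if self.policy == "through":
            self.line = (addr, value, False)
            self.memory_writes += 1   # every store goes to memory
        else:                         # write back
            if self.line and self.line[0] != addr and self.line[2]:
                self.memory_writes += 1         # evicting a dirty line
            self.line = (addr, value, True)     # mark dirty; memory stale

wt, wb = TinyCache("through"), TinyCache("back")
for _ in range(10):
    wt.store(0x100, 1)
    wb.store(0x100, 1)
# write through: 10 memory writes; write back: 0, the line is still dirty
```

Write back trades memory traffic for bookkeeping: the dirty line must be written out when it is eventually evicted.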

associative memory (Content Addressable Memory, CAM)
retrieve the "content" of memory by association: "is there any cell that holds the value 105?"
With conventional memory (RAM), which must be accessed by address, retrieval by association requires scanning the contents.  If there is no ordering among the contents, the scan takes O(n); if they are kept sorted, binary search takes O(log n).  Associative memory answers in O(1).
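In software, the closest analogue of a CAM lookup is a hash-based set: an O(n) scan versus an O(1) expected-time membership test (a sketch; the cell values are invented):

```python
cells = [17, 42, 105, 3, 99]

# RAM-style: memory is accessed by address, so finding a value
# means scanning every cell -- O(n).
def contains_scan(memory, value):
    return any(cell == value for cell in memory)

# CAM-style: all cells compare against the key "at once" -- O(1);
# a hash-based set is the nearest software analogue.
cam = set(cells)

assert contains_scan(cells, 105) and 105 in cam
```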

Example on-chip cache of Intel 486
4 Way set associative, line size 16 bytes, cache size 8Kbytes
Pseudo-LRU, 3 bits per set : r0, r1, r2    lines (ways) within a set : s0, s1, s2, s3

most recent access    bit set
s0 or s1              r0
s0                    r1
s2                    r2
if r0 r1 = 00   replace s0
if r0 r1 = 01   replace s1
if r0 r2 = 10   replace s2
if r0 r2 = 11   replace s3
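The update and replacement rules can be sketched as a small model of one 4-way set. This is an illustrative reconstruction: clearing a bit on an access to the opposite line is an assumption of the sketch, following the usual pseudo-LRU scheme.

```python
class PLRU4:
    """3-bit pseudo-LRU (r0, r1, r2) for one 4-way set s0..s3."""
    def __init__(self):
        self.r0 = self.r1 = self.r2 = 0

    def touch(self, way):
        """Update the bits on an access (hit or fill) to line s0..s3."""
        if way in (0, 1):
            self.r0 = 1                     # s0/s1 pair most recently used
            self.r1 = 1 if way == 0 else 0  # within the pair: s0 vs s1
        else:
            self.r0 = 0                     # s2/s3 pair most recently used
            self.r2 = 1 if way == 2 else 0  # within the pair: s2 vs s3

    def victim(self):
        """Pick the line to replace on a miss:
        r0 r1 = 00 -> s0, 01 -> s1; r0 r2 = 10 -> s2, 11 -> s3."""
        if self.r0 == 0:
            return 0 if self.r1 == 0 else 1
        return 2 if self.r2 == 0 else 3

s = PLRU4()
s.touch(0)                 # r0=1, r1=1 -> victim comes from the s2/s3 pair
assert s.victim() == 2
```

Pseudo-LRU approximates true LRU with only 3 bits per set, instead of the full ordering true LRU would have to maintain over four lines.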

Address Trace
To measure the performance of a cache memory, the miss ratio is measured from an address trace.  Several problems arise:
1. the workload on the trace may not be representative
2. the initialisation transient may grossly affect the evaluation
3. the trace may be too short to obtain accurate measurement.

The length of the trace is important for accuracy, as is the treatment of cache initialisation (whether cold-start misses should count).  If the cache has 1000 lines (assume one block per line) and the miss ratio is 1%, then for every line to miss once — just to initialise the cache — requires a trace of 100,000 addresses.
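The warm-up arithmetic can be checked directly (a trivial sketch using the example's numbers):

```python
lines = 1000         # cache lines, one block each, all cold at the start
miss_ratio = 0.01    # steady state: one miss per 100 references
# every line must miss once before the cache is warm, so the
# initialisation transient alone consumes:
warmup_refs = int(lines / miss_ratio)
print(warmup_refs)   # 100000 addresses just to initialise the cache
```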

(For each quadrupling of cache size, the required trace length increases by a factor of about 8.)  Typical simulations cover hundreds of milliseconds of execution at most.