Cache memory

General idea

    Cache memory is a "small" fast memory that resides "between" the main memory and the processor.  It is much faster than the main memory, so having a cache reduces the average access time.  Because the cache is much smaller than the main memory, it holds a "copy" of only some part of the main memory.  When the processor finds the data it wants in the cache, the access is called a "hit"; otherwise the data has to be fetched from the main memory, and the access is called a "miss".

    The goal is to organise the cache memory so as to maximise the number of "hits".
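
    As a rough illustration (the numbers here are assumed, not taken from any particular machine): if a hit takes 1 ns, a miss costs an extra 60 ns to go to the main memory, and 95% of accesses are hits, the average access time is about 1 + 0.05 x 60 = 4 ns, much closer to the speed of the cache than to the speed of the main memory.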

How it works

   The main memory can be considered as divided into many cache-sized blocks (the actual cache can keep only a few of these blocks at a time).  The full address is divided into two parts: the bits that address a location within the cache, and the bits that form the tag.  The TAG storage keeps the "block number" of the block of main memory that is currently held in the cache.  In other words, the TAG is used to remember which block is stored in the cache memory.
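
   A minimal sketch in C of this address split (the bit widths and the example address are assumptions for illustration, not from a real machine):

    #include <stdio.h>
    #include <stdint.h>

    #define CACHE_BITS 20                /* bits that address inside the cache      */
    #define TAG_BITS   10                /* remaining bits form the "block number"  */

    int main(void)
    {
        uint32_t address    = 0x2ABCDEF0;                          /* a 30-bit example address */
        uint32_t cache_part = address & ((1u << CACHE_BITS) - 1);  /* low 20 bits              */
        uint32_t tag        = address >> CACHE_BITS;               /* high 10 bits             */
        printf("tag (block number) = %u, cache part = 0x%05X\n", tag, cache_part);
        return 0;
    }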

Organisation
   Direct Map
   Fully Associative
   Set Associative

Direct Map

    The simplest cache organisation is to let part of the address "directly" select a slot in the TAG storage.  When an access occurs, the tag part of the address is compared with the "block number" stored in that slot; if they are the same, it is a hit, otherwise it is a miss.

    Let us illustrate with an example:

   The main memory is 1G bytes (30-bit address).  The cache is 1M bytes (20-bit address).  The cache is organised as lines of 4 bytes each (256K lines x 4 bytes).  For a direct map cache we need a TAG storage with 256K slots, each slot storing a 10-bit "block number".

   Let the access address be "xyzabcd", where "xyz" is the leftmost 10 bits and "abcd" is the remaining 20 bits.   "abcd" is used as the address into the 1M cache: its leftmost 18 bits select a line (one of 256K lines) and its rightmost 2 bits select a byte within the line.   "xyz" (10 bits) is the block number to be matched against the corresponding slot in the TAG.  When an access occurs, the bottom 20 bits of the address go to the cache, with the 18 line-index bits selecting one of the 256K lines.  The top 10 bits are compared with the block number in the corresponding TAG slot.  If they are the same, the access is a hit and the processor can retrieve the data from the cache.  If it is a miss, the data is fetched from the main memory and also written to the cache, and the "block number" in the TAG slot is updated.
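
   The following C sketch models only the TAG storage (plus a valid bit) for this example, with a 10-bit tag, an 18-bit line index and a 2-bit byte offset; the data array and the actual fetch from main memory are left out, and the function name is illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    #define OFFSET_BITS 2
    #define INDEX_BITS  18
    #define LINES       (1u << INDEX_BITS)   /* 256K lines of 4 bytes = 1 MB   */

    static uint16_t tag_store[LINES];        /* 10-bit "block numbers"         */
    static bool     valid[LINES];

    /* Returns true on a hit; on a miss the TAG slot is updated as if the
       block had just been fetched from main memory. */
    bool access_cache(uint32_t address)      /* address is 30 bits wide        */
    {
        uint32_t index = (address >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint16_t tag   = address >> (OFFSET_BITS + INDEX_BITS);  /* top 10 bits */

        if (valid[index] && tag_store[index] == tag)
            return true;                     /* hit: read the data from cache   */

        tag_store[index] = tag;              /* miss: fetch block, remember it  */
        valid[index] = true;
        return false;
    }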

   Direct map organisation is economical to implement.  However, it is restrictive because accesses can conflict at the same cache line even though other slots are available.  To improve this situation, the access to the TAG storage is changed to be "associative".

[Figure: direct map cache]

Fully associative

    The cache can be used more effectively if the "block number" can be placed in any slot of the TAG instead of being forced into one particular place as in the direct map.  To do this the TAG storage must be "associative".  This means that accessing the "block number" is akin to answering the question "where does xyz exist in the TAG?"  That is the same as using the "content" to address the memory.

  With "associative" TAG storage, the part "xyz" can be placed at any empty slot.  Therefore the hit rate is better than the direct map organisation.   The disadvantage is that "associative" memory is expensive to implement.  The third alternative cache organisation is the hybrid of direct map and fully associative.

Set associative

    Rather than having the full freedom of associative memory, a number of direct map caches can be used in parallel (a so-called set, or n-way).  All TAGs are compared in parallel, and there is flexibility in choosing one of n places to store the data.

  This scheme keeps the simplicity of the direct map while being flexible in the use of the available slots (up to n choices).
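
  A minimal n-way sketch (the number of ways, the number of sets and the function name are assumptions): the set index selects one set directly, then the n TAGs of that set are compared, in hardware all at once:

    #include <stdint.h>
    #include <stdbool.h>

    #define N_WAYS 4                         /* 4-way set associative           */
    #define N_SETS (1u << 10)                /* illustrative number of sets     */

    static uint32_t tag_store[N_SETS][N_WAYS];
    static bool     valid[N_SETS][N_WAYS];

    bool set_assoc_lookup(uint32_t set_index, uint32_t tag)
    {
        for (int way = 0; way < N_WAYS; way++)        /* up to n places to look  */
            if (valid[set_index][way] && tag_store[set_index][way] == tag)
                return true;                          /* hit in one of the ways  */
        return false;                                 /* miss: choose a way to   */
    }                                                 /* fill (e.g. by LRU)      */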

Write policy

  write through
  write back

    When every write to the cache is also written to the main memory (write through), it is simple to keep the cache and the main memory consistent.  This can be important in some applications.  However, always having to write to the main memory slows down the access.

    In the write back scheme, the main memory is updated only when there is a miss on that slot.  To do this, an extra bit per slot is needed to keep track of whether the slot has been written to or not; this is called the "dirty" bit.  When a write to the cache occurs, the slot's dirty bit is set.  When a miss occurs, the dirty bit is inspected, and if it is set, the content of the cache line is written back to the main memory.  This method reduces the number of writes to the main memory to the minimum.  However, the state of the main memory and the cache memory may not be the same at all times.
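
    A minimal write-back sketch (the sizes and the function names are assumptions; the actual transfer to main memory is only indicated by a comment): a write merely sets the dirty bit, and the old block is written back only when a dirty line is replaced on a miss:

    #include <stdint.h>
    #include <stdbool.h>

    #define LINES (1u << 10)                 /* illustrative number of lines    */

    static uint32_t tag_store[LINES];
    static bool     valid[LINES], dirty[LINES];

    /* A write that hits in the cache only marks the line as dirty. */
    void write_hit(uint32_t index)
    {
        dirty[index] = true;                 /* cache is now newer than memory  */
    }

    /* On a miss, a dirty victim is flushed before the new block takes over. */
    void handle_miss(uint32_t index, uint32_t new_tag)
    {
        if (valid[index] && dirty[index]) {
            /* write the old block (tag_store[index]) back to main memory here */
            dirty[index] = false;
        }
        tag_store[index] = new_tag;          /* fetch the new block             */
        valid[index] = true;
    }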

Instruction/Data cache

As the access pattern of instructions is different from that of data, it can be better to separate the instruction cache from the data cache (to get a better hit rate for both).  However, to maximise the use of the memory in a cache, a "unified" cache can be used.  A unified cache stores both instructions and data.  One disadvantage of separate instruction/data caches is that there will be unused "fragments" in both caches because of their fixed sizes.  A unified cache can reduce this waste.

Real world examples

The latest (as of 2016) Intel sixth generation Core i5-6200U has 2 cores, 4 threads and a TDP of 15W.  It has 3 levels of cache: 128K bytes of L1 (2 x 32KB instruction + 2 x 32KB data, 8-way set associative), 512K bytes of L2 (2 x 256KB, 4-way) and 3072K bytes of L3 (12-way).
http://ark.intel.com/products/88193/Intel-Core-i5-6200U-Processor-3M-Cache-up-to-2_80-GHz
http://www.cpu-world.com/CPUs/Core_i5/Intel-Core%20i5-6200U%20Mobile%20processor.html

The Apple A9 (in the iPhone 6s) is a dual-core processor. It has 64KB L1 caches (separate instruction and data), a 3MB L2 and a 4MB L3.
http://www.anandtech.com/show/9686/the-apple-iphone-6s-and-iphone-6s-plus-review/4
http://anandtech.com/show/9825/correcting-a9s-l3-cache


last update 5 Feb 2016