Garbage collection

"Garbage" is the memory that has been allocated from heap. The life of this memory may be longer than the function that create it. For example, a list that has been created inside a function then returned as an output of that function.  In this case this list is still "alive" after the function that creates it has been terminate

Therefore, we cannot "deallocated" or free this memory in the way as the memory that has been allocated on the stack.

However, if we keep allocating memory from the heap without freeing any memory that is not used anymore, eventually we run out of memory.

Many imperative language has function to allocate and free memory from the heap.  However, "freeing" memory by the programmer may have errors.  These errors mostly come from freeing the memory that has not been allocated.  The other possible case is forget to free the memory that is not used causing "memory leak" and running out of memory after the program run for some time.

These kind of "bugs" are difficult to find and fix.

Modern language allows programmers to allocate memory but performs the "free" automatically.

To do that, a special function called "garbage collector" is called during runtime to "free" the memory that is not used anymore.

When "garbage collector" is called?

The most simple situation is when there is no more memory when the memory allocator is called.  We can also call it periodically which make runtime behaviour more predictable.

How the garbage collector work?

Throughout the evolution of programming languages, there are many variations of garbage collectors.  Two basic types are
1) mark and sweep collector
2) reference count collector

Mark and sweep collector

Here are the steps:
1.  mark the memory in the heap that is still "alive"
2.  scan the whole heap (or sweep it) and reallocate those "alive" memory to a new space (yes, we need to have extra space to do so).
3.  free everything in the heap.
4.  update the address of all variables.

step 1.  To find out which variables are "alive" we look into all activation records.  All variables that are in them are alive.   To mark them, consider that each chunk of allocated memory has the "header". This header contains markers for this purpose.

step 2.  Sweep the heap.  Heap is a big chunk of memory.  Each allocated memory are allocated from the heap and the free space in heap is shrinking.  The heap keeps track of its free space.  Scanning from the beginning, we can navigate the block of allocated memory.  The header of each block contains the size of the block. We use this information of move from one block to another block.  To reallocate the live memory we use "extra" space and copy each live block that location. We must remember new location of each block.  That requires an "address map" table each entry contains the old address and the new address of the reallocated block.

step 3.  Now that all live memory are moved to a save place.  We can free the heap.  Finally, all blocks are moved back to the heap and "address map" is updated.

step 4.  Look into all activation records and use "address map" the update the address of the relevant variables.

Disadvantage of mark and sweep collector

It takes a long time to collector all garbage at once. Hence, the runtime is difficult to predict.  This makes the running user program seems like it pauses for quite a while when the garbage collector works. 

Reference count collector

In this scheme, all memory allocated from heap has a "counter" field.  The function of this counter is to keep track to the life of the memory.  The counter is increased when there are other variable that shares this memory. The counter is decreased when the variable that uses this memory is dead.  When the count reaches zero, this memory can be freed. 

To update the counter, the compiler needs to insert a code to do it at the right place.  If the update is not absolutely correct then memory error can happen.  There are many situation that is difficult to determine how to update the counter.

Disadvantage of reference count collector

It can be expensive to update the counter. When a variable enter or leave its scope the counter needs to be update. Determine when to update the counter has to be absolutely correct. 

The runtime behaviour of this collector is predictable and freeing a memory is inexpensive.  There is no need to sweep the heap or scan for liveness of variables.

Improvement of garbage collector

The above two schemes are the basic one. There are many variations that aim to improve the runtime behaviour.

Generational collector

In mark and sweep collector, we can think of the heap as many heaps that have different lifetime.  The memory that is new is more likely to be used and free.  The memory that survives sweeping is considered to be old. Then, this memory is migrate to the other heap.  Sweeping this heap should occur less often than sweeping the young heap.

More reading

First from wiki
https://en.wikipedia.org/wiki/Garbage_collection_(computer_science)

then classic papers
1)
https://arxiv.org/pdf/1505.00017.pdf
Comparative Analysis of Classic Garbage-Collection Algorithms for a Lisp-like Language
Tyler Hannan, Chester Holtz, Jonathan Liao  (2015)

2)
https://courses.cs.washington.edu/courses/cse590p/05au/p50-bacon.pdf
A Unified Theory of Garbage Collection
David F. Bacon, Perry Cheng, V.T. Rajan  (2004)

discussion
https://news.ycombinator.com/item?id=14823054
3)
super classic

https://dl.acm.org/doi/pdf/10.1145/1317203.1317205

Overview of garbage collection in symbolic computing
by TJ McEntee ยท 1987

You should try to implement it !
The base code is here (no garbage collection).  You can add in.

Base language
https://www.cp.eng.chula.ac.th/%7Eprabhas/project/recursive-programming/rec-programming.htm


last update 11 March 2024