Assignment 6: Garbage Collection

Due date: 15th of May, 2014

(graded, individual assignment)

In the sixth graded assignment, you will implement a mark and sweep garbage collector for the L3 virtual machine.

Introduction

Individual Assignment

This is an individual assignment. To get the source code, you need to log into the Repository interface and select “Advanced Compiler Construction 2014 Assignment 6” as the group. You will be given a new group and repository (acc2014i-groupXX) and you will find the reference source code there.

ASM Lowering and Virtual Machine

The initial commit for the 6th assignment contains two elements:

  • the compiler, with reference implementations of the phases so far, a CPS register allocator and a CPS to assembly language transformer in l3/compiler/src/l3/CPSToASMTranslator.scala;
  • a virtual machine for executing the assmbly code, in the l3/vm directory.

The virtual machine contains two main components:

  • the interpreter, located in l3/vm/src/engine.c (and with the header file l3/vm/src/engine.h);
  • the memory system, implemented in l3/vm/src/memory_nofree.c and l3/vm/src/memory_mark_n_sweep.c (and with the header file l3/vm/src/memory.h).

The virtual machine also contains basic infrastructure like a makefile and 4 tests you can use to check your GC implementation.

Getting Started

To start, clone your repository:

 git clone git@larasrv09.epfl.ch:acc2014i-groupXX 

Now, you will have the vm directory in l3/. Go there and run make:

 $ cd vm/ $ make Use the following targets:  - 'make vm' to use your mark-sweep GC  - 'make test' to test the VM  - 'make clean' to clean the VM $ make test rm -rf bin mkdir -p bin cc -std=c99 -g -Wall -O3  src/engine.c src/fail.c src/loader.c src/main.c src/memory.c -o bin/vm Queens test failed! Bignums test failed! Pascal test failed! Maze test failed! 

You can also run the tests individually, since they’re included in l3/examples:

 $ bin/vm ../examples/queens.asm  enter size (0 to exit)> 10 Error: no memory left (block of size 2 requested) 

Your task is simple: make the tests pass by implementing a mark and sweep garbage collector instead of the no-GC memory system currently implemented in l3/vm/src/memory.c.

By the way, the challenge for the previous assignment was running maze for size 10 and seed 10. You can now see the exact code patterns that you were generating and tweak them. ūüôā

Eclipse Projects

In case you plan to develop using Eclipse, you will have to install the Eclipse C Development Tools (you can use a stand-alone installation of Eclipse or you can add CDT along with the Scala IDE in Eclipse).

To create the Eclipse project, follow these steps:

  1. From the menu, select File > New > Project …
  2. From “C/C++”, select “C Project”
  3. Name the project “l3-vm”, uncheck “Use default location” and set the location to <your clone of the l3 repo>/vm
  4. When you’re done, hit Finish (no need to tweak the other steps)
  5. Once the project is created, right-click it and click “Properties”. In the properties window, go to “C/C++ Build” -> “Settings” and in the “Tool Settings” tab navigate to “* C Compiler” > “Miscellaneous”. Once there, add “-std=c99” to the “Other flags” field. Now the project should compile correctly.

Okay, that’s it with the intro. Now for real work.

The Memory

We first breifly describe how the VM engine interacts with the memory manager and subsequently detail the implementation that is required for this assignement.¬† When the virtual machine is started, the first call is to memory_setup.¬† This allows the memory manager to allocate the total memory (the amount can be changed using the -m <size> command line option of the vm). The “total_size” parameter passed to this method indicates the total number of bytes needed. The method memory_get_start returns a pointer to the beginning of this allocated memory. The total memory is used to store the program code and the heap.

The virtual machine then loads the program code into the memory. Afterwards, the method memory_set_heap_start is called, indicating the first address that directly follows the program code. The memory starting from this address can be used by the memory manager to store its data structures and for allocating blocks in the heap: 

GC Diagram 1

The function memory_allocate is invoked when a heap block needs to be allocated.

Virtual and Physical Addresses

  • The memory can be seen¬†as an array of 32-bit values referred to as words, which is represented by the type¬†value_t¬†in the VM. Every entry in the array can contain arbitrary 32 bit information: tagged values, virtual addresses (explained below), block headers or registers (which are also stored in the heap);
  • There are two kinds of addresses in the VM:
    • Physical addresses, represented as values of type¬†value_t*, are pointers to the elements of the memory array.¬†
    • Virtual addresses, represented as values of type¬†value_t, are relative to the starting (physical) address of the memory and only point to individual words, so their last two bits are always 0. This satisfies the tagging requirement for references, which states that they should end in bits 00;
  • The pointers stored in the heap are always virtual addresses. The addr_v_to_p¬†and¬†addr_p_to_v methods in the file¬†engine.c convert between the two address types. You have to copy the conversion functions in your gc (do not remove the static modifier from the existing ones, this will prevent inlining and make your virtual machine slower!);
  • While physical addresses are addresses of words in the memory, virtual addresses are virtual machine pointers. It is essential that addresses stored in the heap are virtual addresses and not physical addresses (i.e, pointers to the elements of the memory array)!;
  • For efficiency reasons, the pointers stored in the base registers (result of¬†engine_get_base_register) are physical addresses;
  • The size parameter passed to the memory_allocate function is numbers of¬† words that have to be allocated.

The Garbage Collector

You are required to write a memory manager that¬†implements the interfaces defined in the “memory.h” file and¬†performs mark-sweep garbage collection. Below we provide an overview of the data structures that have to be implemented and also a few tips and tricks.¬†

Block Headers

As discussed in the lectures, memory is allocated and freed in chucks of words referred to as blocks. Each block has a tag and a size which are passed as parameters to the “block-alloc” primitive.¬†We refer to the starting address of a block as a “block pointer”. Use the first word of a block to store the tag and size of the block (which are referred to as block headers). Therefore, to allocate a block of size n you will need n+1 words.

Free lists

You are required to implement a free list that maintains a list of free blocks. Use the tag “255” to indicate that a block is free. The second word of a free block can be used to store the address of the next element of the free list.

There is a slight trick with free lists. A free list entry contains at least 2 words. But the library allocates blocks of size 0 (and thus 1 word) that your GC can later free. This produces 1-word entries in the free list, which won’t work. To overcome this, the easiest solution is to allocated at least two words, even for blocks of size 0.

The free list is used to allocate blocks. When having multiple free list entries, you should pick the smallest one that fits the necessary size.

Pointer Bitmap

You need to maintain a bitmap from pointers (physical addresses) to bits. This bitmap is used for two purposes:

(a) to determine if a value stored in the heap is a valid block pointer. Even though we use tagged values, it is possible that, during an arithmetic operation, a non-tagged value is left in one of the registers. Since the registers of our virtual machine are also stored in the heap, we may confuse non-tagged value for a virtual address in the marking phase. So although it looks like a virtual address, the untagged value may point anywhere, including invalid locations and in the middle of blocks. In order to prevent the GC from following incorrect addresses, we mark the beginning of each block in the bitmap.
(b) during marking, we reset the bit for the visited blocks, to prevent scanning the same block over and over again (and prevent infinite loops in case of circular data structures). This way, when we encounter a pointer to a block that was already scanned, this will not be a valid pointer anymore, according to (a).

During the sweeping phase, you will also have to merge successive entries in the free list, so the heap memory does not get too fragmented. Also don’t forget to update the bitmap by resetting the bits for swept blocks and setting them for the live blocks.

To complete the above picture, the memory layout would actually look like this:

GC Diagram 2

Testing

You can either use the make test command to execute the tests automatically, or you can run each individual test by hand (good for debugging):

 $ bin/vm ../examples/asm/queens.asm  enter size (0 to exit)> 10 Error: no memory left (block of size 16 requested) 
The expected sizes your program should run on are:
  • for queens.asm: without GC – breaks at 9, GC works with 15
  • for bignums.asm: without GC – breaks at 250, GC works with 1000
  • for pascal.asm: without GC – breaks at 90, GC works with 200
  • for maze.asm: without GC – breaks at 7, GC works with 20
 
Debugging
 
When debugging you may want to trigger a garbage collection early on. To do so, print the code size and adjust the memory with¬†bin/vm -m <bytes>¬†such that the size is just above the code size + twice the bitmap size. This will let you trigger the garbage collection early, when the program hasn’t yet allocated too much space, allowing easier debugging.
Implement a procedure to check the data strcuture invariants

For ease of debugging, it is recommended that you implement a procedure that traverses every block in the heap and checks for the correctness of the headers and the freelist. To traverse all blocks in the heap you can start from the first block and use the block size (stored in the header) to jump to the start of the next block and so on. Also, use “assert”s (defined in assert.h header file) in all places in the code where you think an invariant must hold. This will greatly help in debugging. Of couse, this is only recommended and not mandatory.

The assignment will be evaluated based on the correctness of the implementation, its memory efficiency and execution time.

Grading

Here are several points to keep in mind:

  • neither of the examples should take more than 1s to run (best running times are in the order of hundreds of miliseconds);
  • we may test your assignment through valgrind or address sanitizer, so you should be very careful about the way you allocate, access and free memory (you can also run your assignment through valgrind/with address sanitizer on your own);
  • we will use the reference Makefile to compile and test your GC: please make sure that¬†“make test” runs correctly on your machine and if there are additional flags necessary, remind us in the “memory_mark_n_sweep.c” file;
  • we discurage any use of floating-point arithmetics for memory management. Using methods such as ceil, floor, pow or divisions in floating-point is dangerous when dealing with integer numbers;
  • we encourage you to write code that compiles across platforms. This includes using VALUE_BITS (defined in “vmtypes.h”) and CHAR_BIT (defined in <limits.h>) and avoiding any cast from a pointer X* to a value or integer type (value_t, int, long int, unit, size_t …). It is safe to cast form one pointer type to another (X* to Y*, such as from (void*) to (char*) or (value_t*)).

These criteria are can influence your grade, so please take them into account!