Outline#

In this week’s lab you will:

  1. Practice reverse engineering x86-64 assembly and hand-optimizing assembly code
  2. Learn about some performance considerations behind data structure design
  3. Learn about function pointers in C
  4. Further practice the basics of C

This lab is a bit shorter than the previous two, and hopefully gives you some space to catch up on completing them if you are behind.

Preparation#

Open the lab3 directory in your local copy of your fork of the lab pack 1 template repo in VS Code and continue on with the lab.

If you don’t have a clone or fork, go back to the lab 1 preparation section and follow the instructions.

Exercise 1: C to x86-64 Assembly#

We have already discussed the basics of x86-64 assembly and converting C code to x86-64 assembly in the lectures. It is now time to put that knowledge into practice. In the first exercise below, we provide you with the C code for a recursive function, and your job is to compile the C code using gcc with all optimizations disabled.

Open the file src/jacobsthal.c and run make assembly to compile the C code with the -O0 flag. The -O0 flag turns off all optimizations, generating assembly code that closely resembles the C code. Run the executable and try to understand what the program is doing. Note down the output of the program for a few input values.

Next, open the generated assembly file and try to fully understand the key function, namely jacobsthal. You can ignore the startup and finalization code that interacts with the operating system and assembler directives. You need to focus only on the call to jacobsthal in main and the body of the jacobsthal function. Do you see how the recursive calls in the C code are translated into assembly? Can you set n to a small value (e.g., 2) and draw the call stack to gain insight into how recursion works? Note that call stack is a fancy name for all active stack frames (one corresponding to each active function) on the program’s stack.

Optimising Assembly, and using GDB for Assembly#

You may notice that the code generated by the compiler is far from efficient, let alone optimal. In the exercises below, we ask you to optimize the assembly code by hand. You will need to generate the executable binary from your hand-optimized assembly code. You should also ensure that the executable binary generated from your assembly code produces the same output as that produced by the executable generated from the original C code.

With complicated optimisations in particular you may need help from GDB to debug your code; fortunately, GDB has commands to help you debug at the assembly level.

  • This program has arguments, so you will need to start GDB like this: gdb --args NAME_OF_EXECUTABLE ARG1 ARG2. You can also provide arguments to the program using set args ARG1 ARG2 as a command inside GDB itself.
  • Start by placing a breakpoint on main with b main and running with r.
  • Now, type si to step through the program one instruction at a time. gdb will print the location of the next instruction to be executed. Compare this to s, which steps through one source-level statement (here, one C statement) at a time.
  • Use ni to step through the program one instruction at a time while stepping over call instructions. Compare this to n, which steps through one source-level statement while stepping over calls, i.e., a call is treated as a single statement rather than stepped into.
  • It is useful to print the register values at any given point. You can type info registers to print all registers.
  • You can examine memory at the address held in a register with x/x $rax. The first x stands for examine and the second x selects hex (the printing format); you can see other formats using help x. To print a register’s value directly, the p command works on most versions of gdb. Try p $rax.
  • (Optional) The display command enables automatic displaying of registers each time gdb stops at a breakpoint or after a step. Try display $rax.

Again, for further help you can refer to the GDB cheat sheet.

If you are using Apple Silicon (e.g., an M1/M1 Pro/M2 CPU), there are two things you should note.

First, the default GCC compiler settings produce the wrong type of assembly (specifically, ARMv8 assembly instead of x86-64). We have corrected this for you in the Makefile (at the top where it checks if you are using Mac and then adds the flag -arch x86_64 to ASMFLAGS if so).

Secondly, as in Lab 2, you will need to use LLDB instead of GDB. As stated before, this resource can help you convert commands, but here are some tips specific to this exercise:

  • To provide program arguments, you should enter settings append target.run-args 4 2, which appends to the target.run-args setting. Of course, the more convenient option of simply providing the arguments when invoking lldb still exists, as in gdb.
  • si won’t print the assembly instructions by default - you will need to use the disassemble command to see the assembly in your terminal. It is highly recommended to carefully read the help disassemble page and try out different options to work out your display preference here.
  • Printing the register values can be done with register read instead of info registers.
  • The x/x syntax in lldb reads only from the memory of the target process; if you want to print a specific register, you should instead use register read NAME - try register read rax.

Your next task is to optimize the assembly code by hand. A quick inspection will reveal a lot of low-hanging fruit! Has the compiler made efficient use of the stack space (i.e., main memory)? Has it made efficient use of the general-purpose registers? Are there any instructions you can omit without changing the semantics of the C program? You can generate an executable binary from your assembly code with the make jacobsthal command. Study the make assembly and make jacobsthal commands carefully to see which sources they use to generate the executable file.

Exercise 2: Memory Locality in Data Structures#

Moving on from x86, we will now take a closer look at data structures. Later in the lectures, you will study the CPU’s memory hierarchy in great detail. It is one thing to know the theory, but a key aspect of this course is that understanding what the CPU is doing under the hood can inform how you design programs to perform well. The following case study provides insight into how data structures interact with the memory hierarchy of modern computer systems.

Consider the following structure that represents a point in a 3-dimensional space,

struct point3D {
  int i;
  int j;
  int k;
};

// array of structs
struct point3D points[LEN];

Suppose we have an array of several 3D points (call the array points) and we want to sum all the points along the k dimension. Each point3D element of the points array is 12 bytes in size (4 bytes for i, 4 bytes for j, and 4 bytes for k). Recall that members of a structure are placed contiguously in memory. Recall also that elements of an array are also placed contiguously in memory. The k members of two 3D points, points[0] and points[1], are therefore 12 bytes apart in memory. In fact, each time the CPU accesses a point in the k dimension, it jumps to an address 12 bytes away from the previous access.

We would now like to discuss an aspect of the memory system in modern CPUs. The main memory in modern computers is built using Dynamic Random Access Memory (DRAM) technology. Each DRAM cell uses a capacitor to store charge. Storing and retrieving charge to/from capacitors is slower than a different memory technology, namely Static Random Access Memory (SRAM), which is built using flip-flops (cross-coupled inverters). SRAM is expensive but fast; DRAM is cheap but slow. Therefore, modern computers use a memory hierarchy that consists of a small amount of fast SRAM (called a cache) and a large amount of slow DRAM. The cache is tightly integrated with the CPU, i.e., it is present on the same die (chip) as the CPU. Main memory, on the other hand, resides off-chip, outside the CPU die.

On a memory read operation, the CPU first checks the fast SRAM cache for the data. Only if the data is not found in the cache does the CPU send the request to the slow main memory (DRAM). A cache hit means the CPU finds the requested data in the cache; a cache miss means the data is not found there. Since DRAM accesses are slow, the CPU tries to be efficient and fetches much more than a single 64-bit word: typically, it loads 64 bytes from memory at a time. This data is then stored in the SRAM cache. The expectation is that if the CPU needs the word at address A at a given point in time, it will likely need the neighboring word at address A+1 soon afterward. Perhaps surprisingly, a lot of real-world programs exhibit this so-called spatial locality in their memory references.

The amount of data loaded from memory on each access is not related to the size of the CPU’s registers. Instead, it is the CPU’s cache line size: the amount of memory loaded into each cache entry. For example, early Intel Pentium 3 processors were 32-bit processors (i.e., 32-bit registers and ALU) with 512 KB caches using a 64-byte cache line size.

Let’s go back to the points array. Now that we know more about the underlying memory system of a modern computer, we can be smart about how we sum all the points along the k dimension in the 3D space. As a start, we can change the representation of our points in space. Consider the following declaration,

struct pointarray3D {
    int i[LEN];
    int j[LEN];
    int k[LEN];
};

Instead of placing the points in the k dimension 12 bytes apart, as in the array of structs approach, we have placed all the points along the k dimension next to each other. This second representation is called the struct of arrays approach. By placing the points next to each other, we expect the CPU to find the k points in the fast SRAM cache far more often. As a result, the number of times the CPU has to wait for slow main memory drops significantly, and a program using the second approach runs faster than one using the array of structs approach.

Open the programs src/aos.c and src/soa.c and complete the definitions of the sum functions. Test your code for correctness. You can measure the time it takes for each program to finish using the time command. For example, if you type time ./program_name on the terminal, the time command reports the total time the program takes to run. You should only look at the real component of the output. What do you notice?

Exercise 3: Function Pointers#

This is a more advanced C feature that will become more relevant when covering concurrency later in the semester.

Function pointers are pointers that point to functions, as the name would suggest. They are used to treat functions as regular data. This means that it is possible to define pointers to functions, which can be assigned, placed in arrays, passed to functions, and even returned by functions. Unlike regular pointers, the type of a function pointer is described in terms of a return value and parameters that the function accepts. Declarations for function pointers look as follows:

int (*match)(int *key1, int *key2);

The above declaration means that we can set match to point to any function that accepts two int pointers and returns an integer. If we have a function match_int as below,

int match_int (int *k1, int *k2) {
	if (*k1 == *k2) return 1;
	return 0;
}

We can set match to point to the above function with the following statement:

match = match_int;

To execute a function referenced by the function pointer, we simply use the function pointer where we would normally use the function itself. For example,

int x = 10;
int y = 12;
int val = match(&x, &y);

Function pointers are useful for encapsulating functions into data structures. Typically, a function pointer is made a member of a data structure, and the pointer is used to invoke one of many possible functions based on the type of data stored in the structure. The Advanced Exercises section at the end of this handout helps you explore this use of function pointers further.

Open the file src/reduction.c and read the comments to fill in the missing parts of the code. The compiled code should reduce the samples array using either reduce1 or reduce2, depending on whether the user runs the compiled binary with the -r1 or -r2 argument on the command prompt. If you write the code properly, you will observe the output 55 for -r1 and 3628800 for -r2.

Push your work to GitLab to test your implementation.

Exercise 4: Practice Exercises#

Having now covered most of C’s basic features, we provide you with a set of programming exercises to help you practice them further. Feel free to ask tutors for help if you get stuck!

Write the program tail, which prints the last n lines of its input. By default, n is 5, but it can be changed by an optional argument, so that ./tail n prints the last n lines. The program should behave gracefully regardless of the input or the value of n. You can store the lines in a two-dimensional array of fixed size. Note that we do not provide you a template for this program. You need to write this program from scratch.

If you’d like a challenge, you can also grow the array lengths on demand, which requires dynamic memory allocation with malloc and exposes the true benefit of dynamic memory.

Push your changes to tail.c to verify that it passes the CI test.

The next exercise explores the power of function pointers with a sorting program that either sorts lines input by the user (via keyboard) lexicographically or numerically. Specifically, if the optional argument -n is given, the program will sort the input lines numerically. A sort typically consists of three parts: (1) a comparison that determines the ordering of a pair of objects (e.g., numbers), (2) an exchange that reverses their order, and (3) a sorting algorithm that makes a sequence of comparisons and exchanges until the objects are in a proper order. Note that the sorting algorithm is independent of the comparison and exchange operations, so by passing different comparison and exchange functions to it, we can arrange to sort by different criteria. Let’s explore this decoupling of concerns via function pointers in the exercise below.

Open the source file src/qsort.c. The source code reads lines from the input and sorts them lexicographically using the quick sort algorithm. You do not need to understand quick sort to solve this problem, but if you have the time to explore it, that would be great! Run the program, type a bunch of lines, signal EOF (e.g., Ctrl-D), and observe the output. Now, we would like to add a -n option to the program that sorts the lines entered from the keyboard numerically. Write a function called numcmp that takes two input strings, converts them to double values, and returns -1 if the first argument is less than the second, 1 if the first argument is greater than the second, and 0 otherwise. You can use the atof utility function to convert a string to a floating-point value. If the user enters -n at the command prompt, the program should compare the input lines numerically. To test numerical sort, you can input one integer per line from the keyboard and then signal EOF. Note that you will need to change the line in the main function that calls quick_sort.

Note that the src/qsort.c program makes use of generic void* pointers. Normally, C allows assignments only between pointers of the same type. A generic pointer, declared as void *, can be assigned to and from a pointer of any object type. Once again, generic pointers are useful for implementing data structures, which we will explore in the future.
