In this week’s lab you will learn techniques for bug hunting with fuzzing and static analysis, and use some of the toolchains that surround them.

Prerequisites

Docker - for running AFL++ (a specific fuzzer) and the target program image
Familiarity with the C language and gdb (the GNU debugger). Don’t worry if you have none: most of the tutorial involves reading code, and specific gdb commands are provided as hints. Contact your tutor if you get stuck!

Note: If you are building AFL++ from scratch (instead of pulling the Docker image as suggested in the tutorial), using Linux is also recommended.

Background

Any sufficiently large codebase will have bugs related to security, functionality, performance, and usability. To find these, we can perform:

Manual Code Review - Using logs, test case results, audits, and best coding practices enforced via linting, we can manually track down the sources of errors as they arise. However, this approach is limited by the manual labour involved and by the difficulty of finding subtle bugs whose causes are complex; such bugs can go undiscovered for a long time. A popular example is the Log4J vulnerability, which had gone unnoticed for almost 8 years!
Static/Dynamic Analysis - Using software to automatically analyse the code. Static analysis evaluates code without executing it, in contrast with dynamic analysis, which is performed while the program executes.

Different analyses can be combined for various situations. For example, if we have no access to the source code, we can decompile / reverse engineer the compiled code to get the assembly version, on which we can do further analysis.

Note: in this lab we focus on finding security bugs, not on performance or usability issues.

Fuzzing

Fuzzing is an automated program testing technique where:

We feed the program malformed inputs
We monitor the program for crashes

Take the following example of the foo function

double foo(double a, int c) {
    if (c > -10 && c < 10) {
        return a / c;
    }
    
    return a + c;
}

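In the program above, c being 0 leads to a division by zero. (Because a is a double, C evaluates this as a floating-point division that yields an infinity or NaN rather than trapping; the same mistake with integer operands would typically raise SIGFPE.) However, the division branch is reached for only a tiny fraction of the inputs: just 19 of the 2 ^ 32 possible values of c lie strictly between -10 and 10. The total state space is 2 ^ 64 * 2 ^ 32 = 2 ^ 96, since the input is a 64-bit double and a 32-bit int. The question arises: which inputs do we select, and how? Generally, probabilistic sampling is used, where every input is selected with a certain probability.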

Considerations

Making “progress” - Whether heuristics are steering the inputs towards different parts of the program, to maximise the chance of finding a bug.
Determining whether a bug has been reached - The CPU and OS can detect crashes and emit an appropriate signal such as SIGSEGV (segmentation fault), SIGFPE (divide by 0), etc. However, not all bugs are detected by the CPU/OS, such as software behaving in unintended ways without raising a signal; examples include an infinite loop, or a program that fails to exit on a kill signal because of a faulty signal handler.
Maximising the throughput of random input generation.
Developing good heuristics for generating malformed inputs.

Input Generation

Mutational

It requires a set of seed inputs, which the fuzzer randomly mutates to produce new inputs. For example, if the seed inputs are hi and pass, a sample set of inputs could be

hi
hihi
hi123
_2hi_
pass
passhi
...

Advantages

Requires no a priori knowledge about the program structure
Very fast in generating new inputs

Disadvantages

Generated inputs may fail the parsing stage of the main program, so little “progress” may be achieved

Generational

Generate inputs from a model/grammar of the target program’s input language. For example, if we define the grammar for valid C in BNF for the fuzzer, it could generate the following inputs

void f() {}
void fx(){  ;}
void f(int a) { a = 3333333333333333333333333; }
....

Advantages

Generates more “valid” inputs according to the underlying input grammar
Does not require an initial set of seeds

Disadvantages

Requires domain-specific knowledge of the target to write the grammar

Fuzzing categories

Blackbox - No knowledge of the program structure; the target is treated as a black box. This leads to high throughput, but low progress.
Greybox - Leverages light-weight program instrumentation to extract and monitor runtime information, which is used as feedback to guide input generation. The specific runtime information includes control-flow coverage (which conditional jumps were taken) and data-flow coverage (how data is manipulated by the target program).
Whitebox - Applies heavy-weight program analysis to collect runtime information. It collects a lot of information to better guide the fuzzer towards more coverage, but at the cost of a huge runtime overhead. One example is symbolic execution; you can read more on that in [3].

Google uses ClusterFuzz to fuzz all its products for security and stability.

In this lab, we will be using one of its fuzzing engines, AFL++ (American Fuzzy Lop plus plus). AFL++ is the successor to AFL, a very popular mutational, coverage-guided greybox fuzzer. You give it:

A program to fuzz
A set of initial files

In return, it gives you:

An instrumented program for running coverage-guidance
Crashes!

It works by first instrumenting the binary, i.e. adding specific machine code to the program, without changing its behaviour, so that runtime information can be collected (the instrumentation is inserted statically, at build time). We then run this instrumented binary with some seed inputs and, based on how the program behaves on each input, the fuzzer generates further new inputs.

Figure: the fuzzing process [2]

Aside: Compiler-based Instrumentation

int f(int *a, int elem, long len) {
    int ret = -1;
    for (int i = 0; i < len; i++) {
        if (a[i] == elem) {
            ret = 1;
            break;
        }
        a[i]++;
    }
    return ret;
}

Try to figure out the bug in the code above.

Executing the code above will probably not produce a crash, even under fuzzing. To catch such bugs we can use instrumentation provided by the compiler itself, such as AddressSanitizer (ASan). To enable it, pass the compiler flag -fsanitize=address when compiling programs with gcc; with CMake, the same flag needs to be added to the compile and link options.

Activity: AFL++ Tutorial

Note: Some parts are taken from [4], so for troubleshooting, please see the README

Cloning the sample program

A sample program with a potential bug on specific input strings is available here. Clone the repository and try a manual code review for potential bugs before moving on to the next step.

Create AFL++ Docker Container

We will take a guided tour of AFL++ and test out its capabilities. To get a precompiled version, you can pull the image directly from Docker Hub:

docker pull aflplusplus/aflplusplus

To launch the Docker image:

$ docker run -ti -v $HOME:/home aflplusplus/aflplusplus
$ export HOME="/home"

To check that AFL++ is available, run afl-fuzz without arguments; it should print its usage information.

At this point, you should be in the AFLplusplus directory (check with pwd). Once there, build the AFL++ executables with make (add the -j flag if your PC has multiple cores). This should take several minutes.

Create Target Docker Container

Within the CLI of the host machine, type docker ps, which will show the CONTAINER ID of the running AFL++ container.
Run docker commit <container-id> to commit the changes / settings to a new image.
- This will output the SHA256 hash of the committed image. Copy the first 7-10 characters of the hash.
To start the AFL++ container with the target code, first navigate to the top directory of your clone of this repository (the clone should be on your host machine):

https://gitlab.cecs.anu.edu.au/comp2120/2023/comp2120-tut6-fuzz.

Then type

$ docker run --rm -it -v $(pwd):/<name of the directory you are adding to the container> <the commit hash that you copied in the previous step>

This maps the target program’s directory on the host to a directory within the container (for simplicity, you can also use $(pwd) as the mapped directory).

For example, suppose you cloned the repository above into the ~/tmp/comp2120-tut6-fuzz folder, you want it to appear inside the container as the /alex folder, and your commit hash was sha256:91b215fce97b80bcac21348d905406b947d7a8bc76a4cc4ccae0bc1461dc1071. Go into ~/tmp/comp2120-tut6-fuzz and run this exact command:

docker run --rm -it -v $(pwd):/alex 91b215fc

and you will find yourself inside the container again, but with /alex mapped to the host folder where you checked out the repository.

Running target executable

From the previous step, you should now be in the /AFLplusplus directory

Navigate to the directory of the test program
Create a build directory and change into it
```
$ mkdir build && cd build
```
Add AFL++ tooling to the compiler for your executable (note that CC and CXX are environment variables that cmake reads to choose the C and C++ compilers used for the build):
```
CC=/AFLplusplus/afl-clang-fast CXX=/AFLplusplus/afl-clang-fast++ cmake ..
```

afl-clang-fast/++ is just one example of the compilers you can use with AFL++; different compilers have different advantages. You can use any of the compilers within /AFLplusplus, and the value of CXX is always the same as the value of CC with ++ appended. You can read more about the different compilers and their advantages in the AFL++ docs.

Use the make build system to compile the simple_crash executable
```
$ make
```
Now, we will use the following flags to run AFL++
- -i - the seeds directory, which contains the initial sample inputs. AFL++ tests these and, based on the program’s control flow, updates its inputs. We will create the seeds/ directory alongside build/, and populate it with a random input using the dd utility
```
$ cd ..
$ mkdir seeds && cd seeds
$ dd if=/dev/urandom of=seed_i bs=64 count=10                  # One seed would be enough for this program
```
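- -s - a fixed seed for the random number generator, so that the generation of new inputs is deterministic. For the exercise, we will use 123
- -o - the output directory, which will hold the crashes, hangs, queue, etc.
- -m - the memory limit for the target; none disables it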

The final command to run is as follows

/AFLplusplus/afl-fuzz -i <full path for the seeds directory> -s 123 -o out -m none -- ./simple_crash

After a while, we should see some crashes.

Go to the out/default/crashes folder and analyse the crashes, starting from id:0000.... The final step is to fix the corresponding errors.

You can use gdb and, from within it, use input redirection to run the program on the crashing input, for example r < out/default/crashes/<crash_id>. gdb will show the signal emitted on the crash. For more debugging information, also add set(CMAKE_BUILD_TYPE Debug) to the CMakeLists.txt file. Finally, there is also compiler instrumentation (see the aside above for more information).

With this, we have only scratched the surface of AFL++’s capabilities, while also learning about build systems and using Docker images for ease of use. Notice that, thanks to the Docker image, you didn’t have to install any external dependencies for AFL++ or for compiling the target program.

Activity: AFL++ Exercises

Now that we have a taste of what AFL can do, let’s fuzz for potential real-world software vulnerabilities.

Try the exercise on fuzzing xpdf-3.0.2 in Exercise 1

Try to complete more exercises from the Fuzzing 101 repository

References

[1] Some parts on fuzzing concepts are borrowed from Software Security (COMP3703) slides under Fuzzing

[2] Steelix: Program-State Based Binary Fuzzing

[3] Symbolic Execution

[4] AFL Tutorial

[5] AFL Exercises
