Assignment 2: Automated Unit Test Generation Using Large Language Models (LLMs) [30 MARKS]

  • Due date: Thursday 16th April 2026, 23:55
  • Assignment Weighting: 30% of total grade
  • Expected Workload: 20-40 hours
  • Hurdle: Not a hurdle
  • Type: Individual
  • Submission: Submit through Canvas—see further instructions below.
  • Policies: For late submission, plagiarism, and other policies, see the policies page.

Introduction#

In the previous assignment, we explored how to use traditional tools like EvoSuite for automated unit test generation. However, these traditional program analysis tools come with notable limitations. For example, their underlying algorithms (search-based [2, 5, 8], constraint-based [4, 6, 14], random-based [9]) can be complex and challenging to fully understand or optimize. They also rely on predefined rules that may not cover all real-world scenarios; once the code falls outside these rules, the tool may fail to generate effective tests. Moreover, these tools require complex environment configurations (e.g., JDK 8, evosuite.jar, evosuite-runtime.jar, and other dependencies), which makes the test generation process more cumbersome. Additionally, the generated tests often lack readability and meaningfulness, making developers reluctant to adopt them [1].

Quote from a research paper discussing the shortcomings of Search-Based Software Testing (SBST):

“Widely used test-suite generation tools such as EvoSuite [7] use a Search Based Software Testing (SBST) approach in which test inputs are randomly generated and mutated to maximize coverage of the software unit under test. However, SBST approaches struggle to generate high coverage test inputs in many cases, such as when branch conditions depend on specific values or states that are difficult to resolve with randomized inputs and heuristics. In a large scale study of SBST on 110 widely used open source projects, Fraser et al. observed that more than 25% of the tested software classes had less than 20% coverage [7].”

— Ryan et al. “Code-Aware Prompting: A Study of Coverage-Guided Test Generation… [10].”

Recent advances in large language models (LLMs) offer a new paradigm, overcoming the limitations of traditional methods by treating code as text and analyzing its structure and semantics rather than relying on predefined rules or heuristics [15, 3, 12, 11, 13, 10]. By providing a focal method and optionally additional code context (e.g., the class containing the focal method), along with a set of instructions (commonly referred to as a prompt), LLMs can generate test cases effectively.

In this assignment, we will explore how to leverage LLMs to automatically generate test cases. Then, we will analyze the quality and coverage of these tests and identify their limitations. For a deeper understanding of the underlying concepts, students are encouraged to refer to the papers listed in the references or explore related research through Google Scholar.


Some questions in this assignment may require you to perform multiple actions on your computer, such as downloading a Docker image. In Assignment 1, you were required to include screenshots as evidence that you completed these types of tasks. In this assignment, however, providing full evidence may require many screenshots or extremely long ones. Instead of submitting many screenshots or a very long screenshot, you may include a partial screenshot and clearly indicate where the complete result is stored so that we can view and verify it. For example, Task 2.1 requires you to “store the extracted data in a CSV file.” In this case, you may include a screenshot showing only part of the CSV file and indicate the location of the CSV file within your submission directory so that we can access and review the full result.

Make sure to carefully read the instructions for each task.

Task 1: Design a Prompt to Generate Unit Tests for a Single Focal Method [8 MARKS]#

In this task, you will design a prompt that can automatically generate unit tests based on the provided focal method.

Task 1.1. Environment Setup [1/8 MARKS]#

To begin, ensure the following environment setup:

  1. Install a Python IDE and a Java IDE, such as PyCharm for Python development and IntelliJ IDEA for Java, or VS Code for both.
  2. Install the OpenAI Python package and its dependencies using pip. A minimal sanity-check sketch is shown after this list.
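
The following is one way to verify the setup; it is a minimal sketch that assumes the openai Python package (1.x) is installed and that your API key is available in the OPENAI_API_KEY environment variable.

# sanity_check.py -- minimal sketch to confirm the OpenAI SDK is installed and
# the API key is configured; assumes the key is set in OPENAI_API_KEY.
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# A single trivial request; if this prints a short reply, the setup works.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(response.choices[0].message.content)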

Task 1.2. Using SootUp to Extract Code Information [3/8 MARKS]#

Listing 1 presents the focal method sortArray, which belongs to the TheArray class within a tutorial package. To generate high-quality test cases for this method, we need to extract three key pieces of code-related knowledge:

  • Fully Qualified Name (FQN): It uniquely identifies the method, preventing conflicts with other methods of the same name across different packages. Providing the FQN helps the LLM correctly reference the focal method within the project structure. The format of the FQN is packageName.ClassName.MethodName(ParameterTypes). For example, the FQN for sortArray is tutorial.TheArray.sortArray(int[]).
  • Signature: It defines the input and return types, helping the LLM generate appropriate test inputs. The format of the Signature is returnType MethodName(ParameterTypes). For example, the Signature for sortArray is void sortArray(int[]).
  • Jimple Code Representation: Jimple is an intermediate representation (IR) of Java bytecode that provides a structured, low-level view of method execution. It explicitly details variable assignments, conditional branches, and control flow, making it useful for analyzing program behavior. Listing 2 provides the Jimple code for sortArray, showing the execution steps in a structured format.

In this task, you should use the SootUp library to extract all three types of code knowledge (FQN, method signature, and Jimple code) for the sortArray function.

SootUp is a modernized version of the Soot static analysis framework. It enables comprehensive program analysis by extracting method-level details, control-flow graphs, and bytecode representations. For more details on how to use SootUp, visit the official website for guidelines. Please use SootUp version 2.0.0.

package tutorial;

public class TheArray {
    // Sorts the given array of integers in ascending order.
    public void sortArray(int[] array) {
        int n = array.length;
        for (int i = 0; i < n-1; i++) {
            for (int j = 0; j < n-i-1; j++) {
                if (array[j] > array[j+1]) {
                    int temp = array[j];
                    array[j] = array[j+1];
                    array[j+1] = temp;
                }
            }
        }
    }
}

Listing 1: Focal Method sortArray in package tutorial

{
    int[] array;
    tutorial.TheArray this;
    unknown $stack10, $stack11, $stack12, $stack13, $stack14, $stack6, $stack7, $stack8, $stack9, i, j, n, temp;

    this := @this: tutorial.TheArray;
    array := @parameter0: int[];
    n = lengthof array;
    i = 0;

    label1:
    $stack6 = n - 1;
    if i >= $stack6 goto label5;
    j = 0;

    label2:
    $stack7 = n - i;
    $stack8 = $stack7 - 1;
    if j >= $stack8 goto label4;
    $stack11 = array[j];
    $stack9 = j + 1;
    $stack10 = array[$stack9];
    if $stack11 <= $stack10 goto label3;
    temp = array[j];
    $stack12 = j + 1;
    $stack13 = array[$stack12];
    array[j] = $stack13;
    $stack14 = j + 1;
    array[$stack14] = temp;

    label3:
    j = j + 1;
    goto label2;

    label4:
    i = i + 1;
    goto label1;

    label5:
    return;
}

Listing 2: Jimple Code of sortArray

Task 1.3. Implement a Simple Prompt [2/8 MARKS]#

Having obtained the three types of code knowledge, in this step you will use this information to guide the LLM to generate unit tests.

  1. Your task is to create a prompt that includes the following components [1 MARK]:
    • Task Description: Clearly describe the test generation task and all of the code knowledge that will be provided.
    • Three Examples: Provide three examples, each consisting of an input and an output. The input is the focal method information (i.e., FQN, Signature, Jimple Code Representation), and the output is the generated test code. These three examples help the model understand the format of the input and output, and how to use the three types of code knowledge to generate tests.
    • User Input: Include the three types of code knowledge for the focal method for which tests need to be generated. Note: When calling the API, use the proper chat format (e.g., separate system, user, and assistant roles); do not put the task description, examples, and current input all in one “user” message.
  2. Use the OpenAI library to call the GPT-4o-mini model and pass the prompt to it to obtain the generated test code (a sketch of such a chat-format call is shown after this list) [1 MARK].
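
For illustration, below is a minimal sketch of a chat-format call using the openai Python SDK. The task description wording and the three example input/output strings are placeholders that you must replace with your own content.

# prompt_call.py -- sketch of the Task 1.3 chat-format prompt. Assumes the
# openai Python SDK; the task description wording and the three example
# input/output strings below are placeholders for your own content.
from openai import OpenAI

client = OpenAI()

TASK_DESCRIPTION = (
    "You are given a focal Java method described by its Fully Qualified Name "
    "(FQN), its Signature, and its Jimple code representation. Generate a "
    "JUnit test class for the method."
)

# Three examples, each supplied as a user/assistant pair so the model sees
# both the expected input format and the expected output format.
EXAMPLES = [
    ("<FQN, Signature, Jimple of example method 1>", "<test code for example 1>"),
    ("<FQN, Signature, Jimple of example method 2>", "<test code for example 2>"),
    ("<FQN, Signature, Jimple of example method 3>", "<test code for example 3>"),
]

messages = [{"role": "system", "content": TASK_DESCRIPTION}]
for example_input, example_output in EXAMPLES:
    messages.append({"role": "user", "content": example_input})
    messages.append({"role": "assistant", "content": example_output})

# Current focal method: the Task 1.2 output for sortArray.
messages.append({"role": "user", "content": (
    "FQN: tutorial.TheArray.sortArray(int[])\n"
    "Signature: void sortArray(int[])\n"
    "Jimple:\n<paste the Jimple code from Listing 2 here>"
)})

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)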

Task 1.4. Design a Prompt Template [2/8 MARKS]#

  1. Create a Prompt Template: Design a prompt template that can automatically process a given focal method. The template should dynamically construct a suitable prompt and generate corresponding unit test code. To achieve this, write a program that reads the prompt template and replaces placeholders with the focal method’s extracted information (FQN, Signature, and Jimple) from your CSV or extractor output; a minimal sketch is shown after this list. [1 MARK]
  2. Test the Template: Test the template with three focal methods to ensure it performs well across different input scenarios (for example, whether the generated tests are runnable). [1 MARK]
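
A minimal sketch of such a template-filling program is shown below; the template path and the {FQN}, {SIGNATURE}, and {JIMPLE} placeholder names are illustrative choices, not requirements.

# prompt_template.py -- sketch of filling a prompt template with a focal
# method's extracted information. The template path and the {FQN},
# {SIGNATURE}, and {JIMPLE} placeholder names are illustrative choices.
from pathlib import Path

TEMPLATE_PATH = Path("Data/Prompts/prompt_template/template.txt")

def build_prompt(fqn: str, signature: str, jimple: str) -> str:
    # Read the template and substitute the three code-knowledge placeholders.
    template = TEMPLATE_PATH.read_text(encoding="utf-8")
    return (template.replace("{FQN}", fqn)
                    .replace("{SIGNATURE}", signature)
                    .replace("{JIMPLE}", jimple))

# Example usage with the sortArray method from Task 1.2.
prompt = build_prompt(
    fqn="tutorial.TheArray.sortArray(int[])",
    signature="void sortArray(int[])",
    jimple="<Jimple code extracted by SootUp>",
)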

Task 2: Automated Test Generation [10 MARKS]#

In this task, you will extend the automated prompt-based test generation process to cover multiple focal methods across three Java classes: org/apache/commons/codec/net/URLCodec.java, org/apache/commons/codec/binary/BinaryCodec.java, and org/apache/commons/codec/net/RFC1522Codec.java from the commons-codec project (buggy version 18) in Defects4J. The goal is to analyze all methods in the selected classes and generate runnable unit tests for each method.

Task 2.1. Extracting Method Information [3/10 MARKS]#

  • Download the codec_18_buggy project from Defects4J (a checkout sketch is shown after this list).
  • Use the SootUp tool to extract the following three key pieces of method information:
    • Fully Qualified Name (FQN)
    • Signature
    • Jimple Code Representation
  • Store the extracted data in a CSV file in which each row represents a method with the following three columns: | FQN | Signature | Jimple Code Representation |
    Note: The FQN must include ParameterTypes (e.g. ClassName.methodName(int,String)) so that overloaded methods are uniquely identified; omitting parameter types will produce duplicate FQNs and cause ambiguity.
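
As one possible starting point, the sketch below checks out the buggy Codec 18 version with the Defects4J command-line tool and creates the CSV with the required header. It assumes the defects4j executable is on your PATH; the directory and file names are illustrative, and the rows themselves should be produced by your SootUp-based extractor.

# checkout_and_csv.py -- sketch: check out the buggy Codec 18 version with the
# Defects4J CLI and create the CSV with the three required columns. Assumes
# the defects4j executable is on the PATH; the directory and file names are
# illustrative, and the rows are appended by your SootUp-based extractor.
import csv
import subprocess

WORK_DIR = "codec_18_buggy"

# Equivalent to: defects4j checkout -p Codec -v 18b -w codec_18_buggy
subprocess.run(
    ["defects4j", "checkout", "-p", "Codec", "-v", "18b", "-w", WORK_DIR],
    check=True,
)

# Write the CSV header expected by Tasks 2.2-2.4.
with open("Test_Data.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerow(["FQN", "Signature", "Jimple Code Representation"])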

Task 2.2. Generating Unit Test Code Using LLM [3/10 MARKS]#

  • SootUp may generate method information for the whole project. You should programmatically remove any rows in the CSV file where the method does not belong to one of the following classes: “URLCodec”, “BinaryCodec”, or “RFC1522Codec”. This can be done by checking whether the Fully Qualified Name (FQN) of the method contains the corresponding class name (e.g. org.apache.commons.codec.net.URLCodec, org.apache.commons.codec.binary.BinaryCodec, org.apache.commons.codec.net.RFC1522Codec). A filtering-and-generation sketch is shown after this list.
  • Use the prompt template designed in Task 1.4 to generate unit test code for each method listed in the CSV file.
  • Add a new column to store the generated test code. After this step, the CSV file will have four columns: | FQN | Signature | Jimple Code Representation | Generated Code |
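
The following sketch shows one way to filter the CSV and add the Generated Code column; it assumes the CSV produced in Task 2.1 and the build_prompt helper sketched in Task 1.4, and the file, module, and function names are illustrative.

# generate_tests.py -- sketch: keep only methods from the three target classes
# and add a "Generated Code" column. Assumes the Test_Data.csv from Task 2.1
# and the build_prompt helper sketched in Task 1.4; file, module, and function
# names here are illustrative.
import csv

from openai import OpenAI

from prompt_template import build_prompt  # the Task 1.4 sketch

client = OpenAI()

TARGET_CLASSES = (
    "org.apache.commons.codec.net.URLCodec",
    "org.apache.commons.codec.binary.BinaryCodec",
    "org.apache.commons.codec.net.RFC1522Codec",
)

def generate_test(prompt: str) -> str:
    # One gpt-4o-mini call returning the raw generated test code.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Keep only rows whose FQN starts with one of the three target class names.
with open("Test_Data.csv", newline="", encoding="utf-8") as f:
    rows = [row for row in csv.DictReader(f)
            if any(row["FQN"].startswith(cls + ".") for cls in TARGET_CLASSES)]

for row in rows:
    prompt = build_prompt(row["FQN"], row["Signature"],
                          row["Jimple Code Representation"])
    row["Generated Code"] = generate_test(prompt)

# Rewrite the CSV with the new fourth column.
with open("Test_Data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["FQN", "Signature",
                                           "Jimple Code Representation",
                                           "Generated Code"])
    writer.writeheader()
    writer.writerows(rows)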

Task 2.3. Formatting and Saving the Test Code [2/10 MARKS]#

  • The raw generated code may contain natural language descriptions or special characters that prevent it from compiling successfully. Therefore, format the generated test code so that it is properly structured and syntactically correct. The reformatted version is referred to as Code After Formatting (only remove the natural language and special characters that prevent compilation; do not edit the test code itself). Store the tests for each focal method in a separate .java file (one file per focal method; do not put all methods’ tests in one file). To avoid files being overwritten when a class contains overloaded methods (same method name but different parameter types), name each test file with the class name, method name, and parameter types, e.g., Calculator_add_int_int_Test.java. After this step, the CSV file will have five columns: | FQN | Signature | Jimple Code Representation | Generated Code | Code After Formatting |
  • Save the formatted test code as a .java file. Ensure that the class name matches the file name and save the file under src/test/java/org/apache/commons/codec/. Before saving, clear this directory to remove any pre-existing developer-written test files. A formatting-and-saving sketch is shown after this list. After this step, the CSV file will have six columns: | FQN | Signature | Jimple Code Representation | Generated Code | Code After Formatting | Saved Path |
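
Below is a minimal formatting-and-saving sketch; the regular expressions, helper names, and directory handling are illustrative assumptions and will likely need adjusting to the artefacts your model actually produces.

# format_and_save.py -- sketch: format one generated test and save it as a
# separate .java file per focal method. The regular expressions, helper names,
# and directory handling are illustrative assumptions, not required choices.
import re
import shutil
from pathlib import Path

TEST_DIR = Path("codec_18_buggy/src/test/java/org/apache/commons/codec")

def clear_test_dir() -> None:
    # Remove pre-existing developer-written tests before saving the LLM tests.
    if TEST_DIR.exists():
        shutil.rmtree(TEST_DIR)
    TEST_DIR.mkdir(parents=True)

def strip_non_code(generated: str) -> str:
    # Keep only the Java code: prefer a Markdown-fenced java block if one is
    # present, otherwise return the raw text with surrounding whitespace removed.
    match = re.search(r"```(?:java)?\s*(.*?)```", generated, flags=re.DOTALL)
    return (match.group(1) if match else generated).strip()

def test_file_name(fqn: str) -> str:
    # Build ClassName_methodName_paramTypes_Test.java from the FQN,
    # e.g. Calculator.add(int,int) -> Calculator_add_int_int_Test.java.
    name, params = fqn.split("(", 1)
    class_name, method_name = name.rsplit(".", 2)[-2:]
    param_part = re.sub(r"[^A-Za-z0-9]+", "_", params.rstrip(")")).strip("_")
    parts = [class_name, method_name] + ([param_part] if param_part else []) + ["Test"]
    return "_".join(parts) + ".java"

def save_test(fqn: str, formatted_code: str) -> Path:
    # Rename the public test class to match the file name, then write the file.
    file_name = test_file_name(fqn)
    class_name = file_name[: -len(".java")]
    code = re.sub(r"public\s+class\s+\w+", f"public class {class_name}",
                  formatted_code, count=1)
    TEST_DIR.mkdir(parents=True, exist_ok=True)
    path = TEST_DIR / file_name
    path.write_text(code, encoding="utf-8")
    return path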

Task 2.4. Running the Tests and Evaluating Coverage [2/10 MARKS]#

  • Execute the generated test files within the codec_18_buggy project (a Defects4J-based sketch is shown after this list). If any test code contains syntax errors that prevent it from running, remove it from the final test suite, or refine your prompt and formatting code to generate as many error-free test cases as possible.
  • Measure line and branch coverage to assess the effectiveness of the generated tests.
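
One possible way to drive this step is through the Defects4J command-line interface, as sketched below. It assumes defects4j is on your PATH and that the generated tests have replaced the developer-written tests under the project's test directory (Task 2.3); the exact coverage labels printed by defects4j coverage may differ slightly from "line" and "branch".

# run_and_measure.py -- sketch: compile the project plus generated tests, run
# them, and measure coverage via the Defects4J CLI. Assumes defects4j is on
# the PATH and the generated tests are saved under the checkout's test
# directory, as described in Task 2.3.
import subprocess

WORK_DIR = "codec_18_buggy"

def d4j(*args: str) -> subprocess.CompletedProcess:
    # Run one Defects4J command inside the checkout and echo its output.
    result = subprocess.run(["defects4j", *args], cwd=WORK_DIR,
                            capture_output=True, text=True)
    print(result.stdout, result.stderr)
    return result

d4j("compile")   # compiles the project and the generated test sources
d4j("test")      # runs the tests; failing tests are listed in the output
d4j("coverage")  # reports line and condition (branch) coverage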

Task 3: Analyzing the Results [12 MARKS]#

In this task, you will analyze the outcomes of the automated test generation process, evaluate the quality of the generated tests, and explore ways to improve their effectiveness.

  1. How many tests did the LLM generate? How many were removed due to syntax errors? Of those that contain no syntax errors, how many pass and how many fail? What issues are present in the tests that fail to execute? [4/12 MARKS]
  2. How do you organize your prompt to guide the model to generate tests? Can the structure of the prompt be further improved? Discuss whether examples are necessary, considering both their potential benefits (e.g., clarifying format and intent) and drawbacks (e.g., overfitting, reduced diversity), and whether they are needed when the prompt is already well-structured, and why. [4/12 MARKS]
  3. The current prompt provides three types of code knowledge (FQN, Signature, and Jimple Code) for the focal method. Discuss whether this information is sufficient for generating effective tests and what limitations it may have. Then discuss what types of additional context could help and how they might improve test generation. [4/12 MARKS]

Submission Instructions#

You must submit your finalised report as a PDF via Canvas.

Submission Requirements:

  1. Deadline: The lab report is due on Thursday 16th April 2026 at 23:55.
  2. Report Content: Submit a well-structured lab report that thoroughly documents all steps taken to complete Tasks 1–3, together with experimental results and answers to all assigned questions. For each task, state the paths to your key code files and generated files (e.g., CSV, saved test files) so that markers can locate them. In addition, compress the entire project directory into a single .zip file and attach it as an appendix.
  3. File Naming Convention: Name your submission as:
    Lab_2_u0000000.pdf
    

    Replace ‘u0000000’ with your university ID.

  4. Appendix Directory Structure: Ensure that your submitted project follows the directory structure below:
    firstname_lastname_u0000000
    |-- Lab_2_u0000000.pdf
    |-- codec_18_buggy
    |   |-- ...
    |-- LLM_Test_Gen
    |   |-- Data
    |   |   |-- Prompts
    |   |   |   |-- prompt_for_sortArrays
    |   |   |   |   |-- ...
    |   |   |   |-- prompt_template
    |   |   |   |   |-- ...
    |   |   |-- Test_Data.csv
    |   |-- Java_Scripts
    |   |   |-- ...
    |   |-- Python_Scripts
    |   |   |-- ...
    
  5. Grading: This lab is worth 30 marks and accounts for 30% of the total course assessment.

References#

[1] Mohammad Moein Almasi et al. “An Industrial Evaluation of Unit Test Generation: Finding Real Faults in a Financial Application”. In: 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP) (2017), pp. 263–272.

[2] Arianna Blasi et al. “Call Me Maybe: Using NLP to Automatically Generate Unit Test Cases Respecting Temporal Constraints”. In: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (2022).

[3] Yinghao Chen et al. “ChatUniTest: A Framework for LLM-Based Test Generation”. In: SIGSOFT FSE Companion.

[4] Christoph Csallner, Nikolai Tillmann, and Yannis Smaragdakis. “DySy: Dynamic Symbolic Execution for Invariant Inference”. In: 2008 ACM/IEEE 30th International Conference on Software Engineering (2008), pp. 281–290.

[5] Pedro Delgado-Pérez et al. “InterEvo-TR: Interactive Evolutionary Test Generation With Readability Assessment”. In: IEEE Transactions on Software Engineering 49 (2023), pp. 2580–2596.

[6] Michael D. Ernst et al. “The Daikon system for dynamic detection of likely invariants”. In: Sci. Comput. Program. 69 (2007), pp. 35–45.

[7] Gordon Fraser and Andrea Arcuri. “EvoSuite: automatic test suite generation for object-oriented software”. In: ESEC/FSE ‘11. 2011.

[8] Mark Harman and Phil McMinn. “A Theoretical and Empirical Study of Search-Based Testing: Local, Global, and Hybrid Search”. In: IEEE Transactions on Software Engineering 36 (2010), pp. 226–247.

[9] Carlos Pacheco et al. “Feedback-Directed Random Test Generation”. In: 29th International Conference on Software Engineering (ICSE’07) (2007), pp. 75–84.

[10] Gabriel Ryan et al. “Code-Aware Prompting: A Study of Coverage-Guided Test Generation in Regression Setting using LLM”. In: Proc. ACM Softw. Eng. 1 (2024), pp. 951–971.

[11] Max Schäfer et al. “An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation”. In: IEEE Transactions on Software Engineering 50 (2023), pp. 85–105.

[12] Junjie Wang et al. “Software Testing With Large Language Models: Survey, Landscape, and Vision”. In: IEEE Transactions on Software Engineering 50 (2023), pp. 911–936.

[13] Zejun Wang et al. “HITS: High-coverage LLM-based Unit Test Generation via Method Slicing”. In: 2024 39th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2024), pp. 1258–1268.

[14] Xusheng Xiao et al. “Characteristic studies of loop problems for structural test generation via symbolic execution”. In: 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2013), pp. 246–256.

[15] Zhiqiang Yuan et al. “Evaluating and Improving ChatGPT for Unit Test Generation”. In: Proc. ACM Softw. Eng. 1 (2024), pp. 1703–1726.
