Assignment 3: Fuzzing Algorithm for Optimal Unit Test Generation [40 MARKS]
- Due date: Thursday 21st May 2026, 23:55 (Week 12)
- Assignment Weighting: 40% of total grade
- Expected Workload: 20-40 hours
- Hurdle: yes (you must earn at least 50%, i.e., 20 marks or more)
- Type: Group
- Submission: Submit through Canvas—see further instructions below.
- Policies: For late submission, plagiarism, and other policies, see the policies page.
Make sure to carefully read the instructions for each task.
Introduction
In Assignment 2, we experimented with guiding large language models (LLMs) to generate Java unit tests by supplying three pieces of information: the method’s fully qualified name (FQN), its signature, and a Jimple IR representation. While this approach produced some valid tests, overall line and branch coverage often remained incomplete, and a noticeable share of generated tests failed to compile or run. Those limitations have many contributing factors (you should have examined several of them in your Assignment 2 report), including but not limited to: too little contextual information for the LLM to build correct setup; prompts whose layout and wording did not make constraints and output format easy to follow; generated inputs that were not diverse enough to exercise a broad range of behaviors; and a tendency to discard faulty output outright instead of driving a systematic compile–repair loop.
To address these limitations, we draw inspiration from the core ideas of fuzz testing. Traditional fuzzing systematically mutates candidate inputs at scale and leverages feedback signals (e.g., increased coverage, program crashes, or differential behaviors) to guide subsequent input generation. Its primary objective is to continuously explore new execution paths, rather than being confined to a narrow set of plausible inputs. LLM-based test generation can follow a similar paradigm. The model first generates test cases, which are then executed to collect various forms of feedback (e.g., compiler diagnostics and coverage gaps). This feedback is subsequently fed back into the model to guide the repair, extension, or restructuring of existing tests.
In this assignment, you are required to implement this idea end-to-end by designing an iterative, feedback-driven framework for LLM-based test generation. Specifically, you need to:
- Analyze the source code of Java projects to determine which types of code information are required to generate high-quality unit tests for each focal method, and implement an automated extractor to collect this information.
- Design a fuzzing-inspired strategy to guide the model in generating diverse test inputs, and specify how feedback will steer later rounds.
- Design a prompt template that incorporates the extracted context and encodes your strategy with clear instructions, rules, and placeholders.
- Feed each filled prompt to the LLM to generate test code, and format the generated code into executable test cases.
- Build an iterative feedback loop by compiling and executing each generated test. Use feedback signals (e.g., syntax errors, runtime failures, and coverage gaps) to inform the next round of prompting—repairing, extending, or retargeting tests—rather than discarding failing output without learning from it.
- Analyze the results by computing metrics such as line coverage, branch coverage, test pass rate, and bug-identification outcomes, and present them in a structured format (e.g., tables).
- Write up your work as a research paper. By the end of this assignment, you will have built a robust, end-to-end framework that leverages LLMs to produce high-quality, runnable unit tests with minimal human intervention.
Task 1: Identify Crucial Code Information and Design Code Information Extractor [6 MARKS]
In this task, you will investigate three target Java classes, each from a different Defects4J buggy project.
- Commons Codec (buggy version 18): org/apache/commons/codec/binary/StringUtils.java
- Commons Collections (buggy version 27): org/apache/commons/collections4/map/MultiValueMap.java
- Commons Compress (buggy version 45): org/apache/commons/compress/archivers/tar/TarUtils.java
1. Identify Crucial Code Information [2/6 MARKS]. Determine which pieces of code knowledge are essential for generating high-quality unit tests for methods in these classes. For each piece of knowledge you identify, provide:
- Code Knowledge: the name of the information (e.g., Jimple code).
- Role: how it supports test generation (for instance, by revealing control-flow paths or setup requirements).
- Target Methods: the specific types of focal methods that depend on this knowledge.
Example:
- Code Knowledge: Jimple code
- Role: Exposes internal control-flow and variable interactions to help the model cover all branches.
- Target Methods: for example, all methods, private methods, or methods declared in abstract classes.
2. Design Code Information Extractor [4/6 MARKS]. Using the code information types you identified above, build an automatic extractor that traverses each project, collects this information for each relevant focal method, and writes it to a CSV file.
Your CSV must satisfy the following requirements:
- Each row represents one focal method.
- There is a mandatory FQN column holding the focal method’s fully qualified name.
- There is one column for each code information type you identified in Identify Crucial Code Information above (for example, signature, JimpleCode, comments, modifiers, and so on).
You are free to choose any program-analysis library. For instance, you might use SootUp to extract FQNs, method signatures, and Jimple IR, and JavaParser to obtain the abstract syntax tree (AST), comments, or other syntactic information as needed. A minimal extractor sketch using JavaParser follows.
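To make the requirements concrete, here is a minimal sketch of such an extractor (illustrative only), assuming JavaParser is on the classpath and a recent JDK for the tooling. The column set, CSV escaping, and source-root path are placeholders; replace them with the code information types you actually identified.

import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.body.ClassOrInterfaceDeclaration;
import com.github.javaparser.ast.body.MethodDeclaration;

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Walks one project's source tree and writes a CSV row per focal method.
// Columns beyond FQN are examples only; use the ones you identified.
public class MethodInfoExtractor {

    public static void main(String[] args) throws IOException {
        Path srcRoot = Path.of(args[0]); // e.g. codec_18_buggy/src/main/java (hypothetical layout)
        try (PrintWriter csv = new PrintWriter("Test_Data.csv");
             Stream<Path> files = Files.walk(srcRoot)) {
            csv.println("FQN,Signature,Javadoc");
            files.filter(p -> p.toString().endsWith(".java"))
                 .forEach(p -> extract(p, csv));
        }
    }

    private static void extract(Path file, PrintWriter csv) {
        try {
            CompilationUnit cu = StaticJavaParser.parse(file);
            String pkg = cu.getPackageDeclaration()
                           .map(pd -> pd.getNameAsString() + ".").orElse("");
            for (MethodDeclaration md : cu.findAll(MethodDeclaration.class)) {
                String cls = md.findAncestor(ClassOrInterfaceDeclaration.class)
                               .map(ClassOrInterfaceDeclaration::getNameAsString)
                               .orElse("");
                String fqn = pkg + cls + "." + md.getNameAsString();
                String sig = md.getDeclarationAsString(true, true, true);
                String doc = md.getJavadocComment()
                               .map(c -> c.getContent().replaceAll("[\\r\\n\"]", " "))
                               .orElse("");
                // Naive CSV quoting; a real pipeline should use a CSV library.
                csv.printf("%s,\"%s\",\"%s\"%n", fqn, sig, doc);
            }
        } catch (Exception e) { // parse or I/O problems
            System.err.println("Skipping " + file + ": " + e.getMessage());
        }
    }
}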
Task 2: Fuzzing-Inspired Strategy [5 MARKS]
To generate diverse test inputs that trigger different execution paths, design your own fuzzing-inspired strategy. This is an open-ended design task: no single strategy is required; any design is acceptable as long as it is technically coherent and clearly implemented. Your approach should:
- Define your strategy and rationale. Clearly explain what information you use to guide generation (for example, but not limited to, control flow, data flow, types, exceptions, API contracts, constants, or prior failures), and why that information should improve diversity and effectiveness.
- Specify how diversity is created. Describe how your method explores different behaviors (e.g., target selection, mutation rules, partitioning, or iterative prompting design), and how generated tests avoid collapsing into near-duplicate inputs.
- Implement a concrete workflow. Provide a reproducible implementation plan that maps your strategy into executable steps and artifacts (e.g., input records, prompt fields, intermediate metadata, and generation rounds). You may use any suitable data format (CSV or equivalent) as long as it supports your strategy.
- Explain feedback usage. Describe how execution feedback (such as compilation diagnostics, runtime failures, or coverage gaps) is incorporated into subsequent iterations. A skeleton of one possible loop is sketched after this list.
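As a reference point only (not a required design), the following skeleton shows one way such a loop can be organised: each generation unit pairs a focal method with a behavioral target, and feedback from executed tests seeds the next round. All type and method names here are hypothetical placeholders for your own Task 3 prompt filling, Task 4 model call, and Task 5 harness.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical skeleton of a feedback-driven generation loop.
public class FuzzLoopSketch {

    record GenerationUnit(String fqn, String target, int round) {}
    record Feedback(boolean passed, List<String> uncoveredTargets) {}

    static final int MAX_ROUNDS = 3;

    void run(List<GenerationUnit> seeds) {
        Deque<GenerationUnit> queue = new ArrayDeque<>(seeds);
        while (!queue.isEmpty()) {
            GenerationUnit unit = queue.poll();
            String test = generateTest(unit);   // fill prompt, call GPT-4o-mini (Tasks 3-4)
            Feedback fb = compileAndRun(test);  // collect coverage and failures (Task 5)
            if (unit.round() >= MAX_ROUNDS) continue;
            // Diversity pressure: each still-uncovered branch or line becomes
            // a fresh target, steering later rounds away from explored paths.
            for (String target : fb.uncoveredTargets()) {
                queue.add(new GenerationUnit(unit.fqn(), target, unit.round() + 1));
            }
        }
    }

    String generateTest(GenerationUnit unit) { return ""; }          // stub
    Feedback compileAndRun(String testSource) {                      // stub
        return new Feedback(true, List.of());
    }
}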
Task 3: Design a Prompt Template [4 MARKS]
The prompt must carry both the extracted code context from Task 1 and instructions that implement your fuzzing-inspired strategy from Task 2 (e.g., how the model should diversify inputs or follow per-round targets). It is essential to describe the test-generation task clearly and to define the exact input/output format the pipeline will fill automatically. In this task, you are required to write a structured prompt that includes the following components:
- Task Description. Use directives such as @persona, @terminology, and @instruction to explain who the model is, what it should do, and any rules it must follow. You may base your structure on Listing 1, but you must adapt it so that @terminology and @instruction reflect the code information from Task 1 and the strategy-specific guidance from Task 2.
- User Input. Specify placeholders for each piece of extracted code information (e.g., #{FQN}#, #{JimpleCode}#, etc.) and for any strategy fields your Task 2 workflow requires (e.g., a target description, partition label, or seed metadata), which will be filled for each generation unit.
Note: Your prompt must be zero-shot: do not include examples (such as a sample focal method paired with sample test code). Rely on clear instructions, terminology definitions, and the placeholders filled from your pipeline.
After designing the template, exercise it on a variety of focal methods to check robustness. The foundation model for this task is GPT-4o-mini. You may add richer extracted fields and stricter rules to improve quality, still without embedding examples in the prompt.
Listing 1: Example Structured Prompt (illustrative only). The use of CFGPath here is one possible way to pass a generation target; your own @terminology, placeholders, and rules must stay consistent with Task 1 (what you extract) and Task 2 (how you steer diversity).
Test Generator {
@persona {
You are an expert in Java programming with a focus on test generation;
Your task is to generate test cases for a focal method based on provided contextual information;
}
@terminology {
focal_method_source_code: The source code of the method under test (focal method);
focal_method_info_in_the_project: The class and method signature of the focal method in its project, which helps generate import statements for testing;
CFGPath: The control flow graph path of the focal method, indicating the specific execution flow to be tested;
generated_test_code: The generated test code for the focal method to cover the specified CFGPath;
}
@instruction {
@command: Given the focal_method_source_code, focal_method_info_in_the_project, and CFGPath, analyze the provided information comprehensively to write the generated_test_code for the focal method, ensuring the CFGPath is covered. Include all necessary imports.
@rule1: When generating test cases to cover a throw statement, use a try-catch block to handle the exception.
@rule2: You don't need to rewrite the focal method in the test code.
@rule3: Use JUnit 4 and JDK 8 for writing the test cases.
@rule4: If it is necessary to test for "throws Exception", use try-catch block to handle the exception, rather than using assertThrows.
@rule: Please follow rules 1-4 strictly!
@format {
@input: ###focal_method_source_code, ###focal_method_info_in_the_project, ###CFGPath
@output: ###generated_test_code
}
}
}
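Mechanically, filling such a template is a simple string substitution over the #{...}# placeholders. A minimal sketch (the placeholder syntax comes from the task description above; everything else is illustrative):

import java.util.Map;

// Fills a prompt template by replacing each #{Key}# placeholder with the
// value extracted for the current generation unit.
public class PromptFiller {
    public static String fill(String template, Map<String, String> fields) {
        String prompt = template;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            prompt = prompt.replace("#{" + field.getKey() + "}#", field.getValue());
        }
        return prompt;
    }
}

For example, fill(template, Map.of("FQN", fqn, "JimpleCode", jimple)) produces one concrete prompt per generation unit read from your CSV.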
Task 4: Unit Test Generation [4 MARKS]
Task 4.1. Generating Unit Test Code Using LLM [2/4 MARKS]
Generate tests for all methods in the specified target classes from the three projects:
- Commons Codec (buggy version 18): org/apache/commons/codec/binary/StringUtils.java
- Commons Collections (buggy version 27): org/apache/commons/collections4/map/MultiValueMap.java
- Commons Compress (buggy version 45): org/apache/commons/compress/archivers/tar/TarUtils.java
Important constraints:
- Do not generate tests for methods in any other classes.
- You must use GPT-4o-mini for this task (no other model is allowed).
- Take the strategy outputs from Task 2 (e.g., one or more generation units per focal method) and use the prompt template from Task 3 to generate unit test code for each unit. Each prompt should include the focal method’s context and any strategy-specific fields so that the model is steered toward diverse and meaningful test behaviors. (A minimal sketch of the API call appears after this list.)
- Add a new column, “Generated Code”, to your CSV and store the raw test code returned by the model.
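The client code is left to you; the sketch below shows one minimal way to call GPT-4o-mini over the public OpenAI chat-completions HTTP endpoint using the JDK's HttpClient. Error handling, retries, and response parsing (use a real JSON library to extract choices[0].message.content) are omitted; the ad-hoc escaping is for the sketch only.

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal GPT-4o-mini client; returns the raw JSON response body.
public class LlmClient {
    private static final String ENDPOINT = "https://api.openai.com/v1/chat/completions";

    public static String complete(String prompt, String apiKey)
            throws IOException, InterruptedException {
        String body = "{\"model\":\"gpt-4o-mini\",\"messages\":"
                + "[{\"role\":\"user\",\"content\":" + jsonString(prompt) + "}]}";
        HttpRequest request = HttpRequest.newBuilder(URI.create(ENDPOINT))
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    // Very small JSON string escaper; prefer a real JSON library in practice.
    private static String jsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"")
                       .replace("\n", "\\n").replace("\r", "\\r")
                       .replace("\t", "\\t") + "\"";
    }
}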
Task 4.2. Formatting and Saving the Test Code [2/4 MARKS]
- The raw generated code may contain natural language descriptions or special characters that prevent it from compiling successfully. Thus, format each generated test so that it adheres to Java syntax and style conventions, and rename its test class as needed to guarantee a unique class name and avoid naming collisions. Record the reformatted version in a new CSV column named “Code After Formatting”.
- Save the formatted test code as a .java file within the corresponding Defects4J project’s test source tree (e.g., under that project’s src/test/java/), using a package path that matches your test class and allows the project build to compile and run it. Do not place tests from Collections or Compress into the Codec project tree. Before saving generated tests for a project, clear only the test directories you use in that project so you do not mix in stale developer tests. Add a column, “Saved Path”, containing the path to each saved file relative to that project’s root. (A sketch of the formatting step follows this list.)
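A minimal sketch of the fence-stripping and renaming step; the regexes are deliberately simple and will need hardening for real model output (nested fences, multiple classes, package declarations):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Post-processes raw model output: keep only the fenced code body (if any)
// and rename the first declared class so every saved file is unique.
public class TestFormatter {
    private static final Pattern FENCE =
            Pattern.compile("```(?:java)?\\s*(.*?)```", Pattern.DOTALL);
    private static final Pattern CLASS_NAME =
            Pattern.compile("\\bclass\\s+(\\w+)");

    public static String format(String raw, String uniqueClassName) {
        Matcher fence = FENCE.matcher(raw);
        String code = fence.find() ? fence.group(1) : raw;
        return CLASS_NAME.matcher(code)
                .replaceFirst("class " + uniqueClassName);
    }
}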
Task 5: Feedback Loop [5 MARKS]
LLM-generated tests may fail either because they are not runnable (e.g., syntax/type/import errors) or because they do not yet exercise enough program behavior. Therefore, your feedback loop should include both compilation repair and behavior-guided refinement:
- Phase A: Compilation Repair Loop (Runnability)
- Compile every generated test file for the three target classes/projects.
- If compilation fails, record the failing code and compiler diagnostics.
- Use a repair prompt (failing test + compiler errors) to obtain a revised version, then recompile. (A compile-and-diagnose sketch appears at the end of this task.)
- Repeat repair for up to three iterations, or stop earlier once the test compiles.
- Phase B: Behavior-Guided Improvement Loop (Effectiveness)
- Execute runnable tests and collect behavioral feedback (e.g., runtime failures and coverage gaps such as uncovered branches/lines).
- Use this feedback to drive the next generation round (e.g., input mutation, prompt retargeting, or targeted regeneration for under-exercised behaviors), rather than relying only on syntax repair.
- Final Artifacts
- Add a new CSV column, “Runnable Test Code”:
- For tests that compile after formatting, set “Runnable Test Code” to “Code After Formatting”.
- For repaired tests, set “Runnable Test Code” to the final compilable revision.
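For Phase A, a minimal sketch of compiling one test file and collecting diagnostics with the JDK's built-in compiler API is shown below. In practice you may prefer to compile inside the Defects4J checkout (e.g., via defects4j compile) so the project classpath is set up for you; the classpath argument here is the part you would otherwise reconstruct yourself.

import javax.tools.Diagnostic;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;

// Compiles one generated test file and returns null on success, or the
// compiler error messages to embed in the next repair prompt.
public class CompileChecker {
    public static String compile(File testFile, String classpath) throws IOException {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
        try (StandardJavaFileManager fm =
                     javac.getStandardFileManager(diagnostics, null, null)) {
            boolean ok = javac.getTask(null, fm, diagnostics,
                    List.of("-classpath", classpath), null,
                    fm.getJavaFileObjects(testFile)).call();
            if (ok) return null; // compiled cleanly
            return diagnostics.getDiagnostics().stream()
                    .filter(d -> d.getKind() == Diagnostic.Kind.ERROR)
                    .map(Object::toString)
                    .collect(Collectors.joining("\n"));
        }
    }
}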
Task 6: Analyzing the Results [6 MARKS]
- Finalize the test suite. After Task 5 (Feedback Loop), remove any remaining files with syntax errors from the final test suite.
- Measure per-class effectiveness. For each of the three target classes in Task 1, report all of the following in a per-class table:
- Branch coverage: total branches, covered branches, and branch coverage (covered/total).
- Line coverage: total lines, covered lines, and line coverage (covered/total).
- Pass rate: number of runnable tests executed for that class, number of passed tests, and pass rate (passed/executed).
- Identified bugs: whether your generated tests expose the class’s buggy behavior; include the corresponding failing test(s) and a brief explanation of why the failure indicates the bug.
For example, if a target class has 76 total branches with 71 covered, its branch coverage is 93.42% (71/76); if it has 118 total lines with 111 covered, its line coverage is 94.07% (111/118).
- Scoring breakdown for Task 6 (6 marks total):
- Coverage: 4 marks
- Pass rate: 1 mark
- Bug identification: 1 mark
- Coverage scoring (4 marks). Calculate overall branch coverage as the total covered branches across the three target classes divided by the total branches across the three classes, and overall line coverage analogously for lines. Assign up to 2 marks for branch coverage and up to 2 marks for line coverage using the same thresholds below:
- 10–50%: 0.5 marks
- 50–80%: 1 mark
- 80–90%: 1.5 marks
- 90–100%: 2 marks
- Pass-rate scoring (1 mark). Report per-class pass rate and overall pass rate (passed runnable tests / executed runnable tests). Use the following rubric:
- 0–30%: 0 marks
- 30–60%: 0.5 marks
- 60–100%: 1 mark
- Bug-identification scoring (1 mark). Each target class contains a known Defects4J fault. For each class, state whether your generated tests expose that bug and give failing-test evidence (test name or file, assertion/stack trace, and a one-line explanation linking the failure to the faulty behavior).
For marking purposes, a bug counts as identified only if the failure is tied to a test that targets the defective focal method in that class:
- org/apache/commons/codec/binary/StringUtils.java: method equals
- org/apache/commons/collections4/map/MultiValueMap.java: method deserialize
- org/apache/commons/compress/archivers/tar/TarUtils.java: method formatLongOctalOrBinaryBytes
Use the following rubric:
- 0 bugs identified: 0 marks
- 1–2 bugs identified: 0.5 marks
- 3 bugs identified: 1 mark
Task 7: Draft a Research Paper [10 MARKS]
You must present your work as a short research paper, written in LaTeX and compiled to PDF. The paper should contain the following sections, in order:
- Abstract. Provide a short summary (typically one paragraph) of the problem, your approach, main results, and conclusions so that a reader can quickly understand what the paper is about.
- Introduction. Give a brief overview of the paper: what problem you address, what approach you take, and what the reader will find in the rest of the document.
- Motivation. Clearly state the problem you are solving. For example: when using LLMs to generate unit tests out of the box, the model often lacks necessary code knowledge (e.g., control flow, dependencies, or structure). Describe what kinds of knowledge you identified as missing and how providing that knowledge (e.g., via extraction and prompting) addresses the problem.
- Approach. Describe your method in pipeline order: how you extract code information (Task 1), how you design the fuzzing-inspired strategy (Task 2), how you encode that strategy in the prompt template (Task 3), how you generate and format tests (Task 4), and how you use the feedback loop (Task 5). Include an overview figure (e.g., a pipeline diagram) and explain each component in text.
- Evaluation. Report your experimental setup (e.g., which buggy projects and target classes you used, which model) and the results. Present evidence that your method is effective, e.g., per-class coverage and pass-rate tables, bug-identification outcomes, and a brief analysis of what worked and what did not.
- Limitations. Summarise the current shortcomings of your approach (e.g., scalability, types of methods or errors that remain difficult, or dependence on the model or tooling).
- Conclusions. Summarise your contributions and findings, and optionally suggest directions for future work.
Submission Instructions
You must submit three items:
- Research paper (PDF). The research paper written for Task 7 (Abstract, Introduction, Motivation, Approach, Evaluation, Limitations, Conclusions). Name the file research_paper.pdf (or Lab_3_u0000000_research_paper.pdf if you include your ID).
  - References: You may cite any references needed to support your arguments or to acknowledge prior work whose ideas you build on. For example, related work you might find useful includes:
    - A³-CodGen: A Repository-Level Code Generation Framework for Code Reuse with Local-Aware, Global-Aware, and Third-Party-Library-Aware;
    - Navigating the Labyrinth: Path-Sensitive Unit Test Generation with Large Language Models.
    Citation of these is optional and not required.
  - arXiv (optional): You may optionally upload your research paper to arXiv or another preprint server to gain visibility and feedback.
  - Workshop: The top five papers may be selected for presentation at a workshop. Higher-quality submissions may receive a higher score.
- Lab report (PDF). A report that thoroughly documents all steps taken to complete Tasks 1–6, together with experimental results (e.g., coverage tables, pass rates), answers to assigned questions, and any supplementary analysis. Name the file Lab_3_u0000000.pdf, replacing u0000000 with your university ID.
- Code and project (ZIP). Compress the entire project directory (including code, data, and scripts) into a single .zip file so that markers can reproduce your work. Name the zip file consistently with your submission (e.g. Lab_3_u0000000.zip).
Deadline: All three items are due on Thursday, 21st May 2026, at 23:55.
Appendix directory structure: When you unpack your zip, the contents should follow the structure below (the research paper and lab report PDFs may be placed at the top level of the zip or in the folder as shown):
firstname_lastname_u0000000
|-- Lab_3_u0000000.pdf (lab report)
|-- research_paper.pdf (Task 7 research paper)
|-- codec_18_buggy
| |-- ...
|-- collections_27_buggy
| |-- ...
|-- compress_45_buggy
| |-- ...
|-- LLM_Test_Gen
| |-- Data
| | |-- ...
| | |-- Test_Data.csv
| |-- Scripts
| | |-- ...
Grading: This assignment is worth 40 marks and accounts for 40% of the total course assessment (consistent with the task marks listed above).