Week 8 Tutorial: Analysis of LLM-Generated Tests
In this week’s tutorial you will:
- Manually fix buggy tests generated by an LLM
- Analyse why LLMs sometimes produce buggy or incorrect tests (hallucination)
- Discuss ways to reduce hallucination (prompting, post-processing, iterative generation)
- Assign groups for Assignment 3 (3 students per group) and confirm your group details
Tutorials are one hour. Work in pairs or small groups where suggested.
Prerequisites
- At least one LLM-generated test file (e.g. from Week 7 or Assignment 2) that contains at least one buggy or failing test.
- A Java project and test runner (e.g. JUnit, Maven/IDE) to run and fix tests.
Reference: LLM Hallucinations: Why They Happen, How to Spot Them, and How to Reduce the Risk
Outline (1 hour)
| Part | Activity | Time (guide) |
|---|---|---|
| 1 | Fix buggy LLM-generated tests | ~15 min |
| 2 | Analyse causes of hallucination | ~15 min |
| 3 | Strategies to reduce hallucination | ~15 min |
| 4 | Assign groups (3 students per group) for Assignment 3 | ~15 min |
Activity 1: Manually fix buggy tests (~15 min)
Task 1.1: Run and identify failures
- Take one or two test methods (or a small test class) that were generated by an LLM (e.g. from your Week 7 pipeline or A2).
- Run the tests (e.g. `mvn test`, or run them from your IDE).
- Note which tests fail or error (e.g. compilation error, wrong assertion, wrong API usage).
- List the symptoms: e.g. “wrong expected value”, “calls non-existent method”, “wrong exception type”.
Do the failures look like “typos” and small mistakes, or like the model “invented” behaviour that doesn’t match the real code?
Task 1.2: Fix the tests by hand
For each failing test:
- Locate the cause (wrong assertion, wrong method name, wrong type, etc.).
- Fix it so that the test compiles and passes, while still testing the intended behaviour where possible.
- Write a one-line note (e.g. in a comment) describing what was wrong: e.g. “LLM used assertEquals(5, …) but correct value is 4” or “LLM assumed method X existed; replaced with Y”.
Keep fixes minimal so you can later compare “what the LLM did wrong” vs “what a human had to change”.
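As a concrete illustration, the sketch below shows a "wrong expected value" bug and its minimal hand-fix. The class under test (`WordCounter`) and the values are hypothetical, not from your project; the point is the one-line fix plus the one-line note.

```java
// Hypothetical class under test (illustrative only).
public class WordCounter {
    // Counts whitespace-separated words; null or blank input counts as 0.
    public static int count(String text) {
        if (text == null || text.isBlank()) return 0;
        return text.trim().split("\\s+").length;
    }

    public static void main(String[] args) {
        // Buggy LLM-generated assertion (wrong expected value):
        //   assertEquals(5, WordCounter.count("one two  three four"));
        // Minimal fix: the correct value is 4 — the double space is a single
        // separator. Note: "LLM used assertEquals(5, ...) but correct value is 4".
        int actual = WordCounter.count("one two  three four");
        if (actual != 4) throw new AssertionError("expected 4 but was " + actual);
        System.out.println("fixed test passes: count = " + actual);
    }
}
```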
Activity 2: Analyse why LLMs generate buggy tests (~15 min)
Task 2.1: Categorise the bugs
As a group or pair, classify the bugs you found into categories. For example:
- Wrong API or signature: e.g. wrong method name, wrong number/type of arguments.
- Wrong expected value: e.g. wrong constant, wrong order of arguments in assertion.
- Wrong control flow: e.g. wrong exception type, missing setup.
- Syntactic / style: e.g. wrong import, wrong JUnit annotation.
- Hallucination: the model “invented” behaviour (e.g. return value, method, or class) that does not exist in the real code.
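Several of these categories can be made concrete in one place. The snippet below (hypothetical class and values) shows a wrong-API, a wrong-expected-value, and a wrong-exception-type bug side by side, with the correct behaviour demonstrated:

```java
import java.util.List;

// Illustrative (hypothetical) snippets matching the bug categories above.
public class BugCategories {
    public static int head(List<Integer> xs) {
        if (xs.isEmpty()) throw new IllegalArgumentException("empty list");
        return xs.get(0);
    }

    public static void main(String[] args) {
        // Wrong API: an LLM might call xs.first(), which java.util.List does not have.
        // Wrong expected value: assertEquals(2, head(List.of(1, 2))) — the correct value is 1.
        // Wrong exception type: a test expecting NoSuchElementException, when the
        // code actually throws IllegalArgumentException:
        try {
            head(List.of());
            throw new AssertionError("expected an exception");
        } catch (IllegalArgumentException expected) {
            System.out.println("correct exception type: IllegalArgumentException");
        }
    }
}
```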
Discuss: Which of these are “hallucination” in the sense of the model being confident but wrong? Which are more like ambiguity (e.g. underspecified prompt) or context limits (e.g. model didn’t have enough code context)?
Task 2.2: Why do hallucinations happen?
Read or recall the main ideas from the reference on LLM hallucinations. Then briefly answer:
- Why might an LLM produce a plausible-looking test that doesn’t match the real code? (e.g. training on similar but different APIs, no execution feedback, limited context.)
- How can you spot such tests? (e.g. run tests, use compiler, check method existence, compare to source.)
If the LLM had been given more code context (e.g. full method body, return type, exceptions), would that reduce some of the bugs you saw? Which ones?
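One quick way to spot a hallucinated API call, sketched below with reflection: check whether a method the generated test invokes actually exists on the target class. The helper name `hasMethod` and the `stripAll` example are illustrative assumptions, not a real API.

```java
// Sketch: verify that a method an LLM-generated test calls actually exists.
public class MethodExistsCheck {
    public static boolean hasMethod(Class<?> cls, String name, Class<?>... params) {
        try {
            cls.getMethod(name, params);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // String.strip() exists (Java 11+); String.stripAll() does not —
        // a plausible-looking hallucination.
        System.out.println(hasMethod(String.class, "strip"));     // true
        System.out.println(hasMethod(String.class, "stripAll"));  // false
    }
}
```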
Activity 3: Ways to reduce hallucination (~15 min)
Task 3.1: Three directions
Discuss and jot down one concrete idea for each of the following (in the context of test generation):
- Prompting: How could you change the prompt (e.g. instructions, examples, format) to reduce wrong API use or wrong expected values?
- Post-processing: What automated checks could you run on the generated code before accepting it? (e.g. compile with `javac`, run a linter, check that method names exist in the project.)
- Iterative generation: If the first attempt fails (e.g. compile error or test failure), how could you feed the error back to the LLM and ask for a fixed version? (We will implement a simple version in Week 10.)
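A minimal post-processing gate can be sketched with the standard `javax.tools` compiler API: write the generated test to a temp file and reject it if it does not compile. The class name `CompileGate` and the sample sources are hypothetical; note that `getSystemJavaCompiler()` requires a JDK (it returns null on a bare JRE).

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of a compile-before-accept check for LLM-generated test code.
public class CompileGate {
    public static boolean compiles(String className, String source) throws IOException {
        Path dir = Files.createTempDirectory("llm-tests");
        Path file = dir.resolve(className + ".java");
        Files.writeString(file, source);
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        // run(...) returns 0 on success, nonzero on compile errors
        // (diagnostics go to System.err when the err stream is null).
        return javac.run(null, null, null, file.toString()) == 0;
    }

    public static void main(String[] args) throws IOException {
        String ok  = "public class GenTest { int x = 1; }";
        String bad = "public class GenTest { int x = obj.noSuchMethod(); }";
        System.out.println(compiles("GenTest", ok));   // true
        System.out.println(compiles("GenTest", bad));  // false
    }
}
```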
Share one idea per category with another group. Which approach do you think would help most for the bugs you saw in Activity 1?
Task 3.2: Quick design (optional)
Sketch a one-sentence pipeline:
“First we … then we … if it fails we …”
that uses at least one of: better prompting, post-processing, or iterative regeneration.
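The "first we … if it fails we …" loop can be sketched in a few lines. `callLlm` and the acceptance check are stand-ins (hypothetical) for a real LLM call and a real compile/run gate; the toy `main` fakes an LLM that produces broken code once, then a fixed version.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Predicate;

// Sketch of an iterative-regeneration pipeline for generated tests.
public class RetryPipeline {
    public static String generate(Function<String, String> callLlm,
                                  Predicate<String> accepts,
                                  String prompt, int maxAttempts) {
        String code = callLlm.apply(prompt);
        for (int attempt = 1; attempt < maxAttempts && !accepts.test(code); attempt++) {
            // Feed the failing attempt back and ask for a fixed version.
            code = callLlm.apply(prompt + "\nPrevious attempt failed, please fix:\n" + code);
        }
        return accepts.test(code) ? code : null;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Toy stand-in: the fake "LLM" produces broken code once, then fixed code.
        Function<String, String> fakeLlm =
            p -> calls.incrementAndGet() == 1 ? "broken" : "fixed";
        System.out.println(generate(fakeLlm, c -> c.equals("fixed"), "write a test", 3));
        // prints "fixed" after one retry
    }
}
```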
Activity 4: Assign groups for Assignment 3 (~15 min)
Task 4.1: Form your group (3 students per group)
- Form your Assignment 3 group (exactly 3 students per group).
- Choose teammates thoughtfully and confirm everyone can commit for the rest of the semester.
- Once formed, your group should remain unchanged for the semester (unless staff approve an exception).
Task 4.2: Submit group details to your tutor
Provide your tutor with:
- Full name and student ID of each member
Task 4.3: Check the group list / resolve issues
If your tutor cannot find your name on their list for any reason, message the instructors on EdStem. A group list may also be made available on the course site.
Reminder: Assignment 3 includes a group presentation in the final tutorial. Plan early and divide responsibilities.