Week 6 Use SootUp to Automatically Extract Code Information

In this week’s lab you will:

  1. Recall what FQN, method signature, and Jimple mean and why they matter for test generation.
  2. Implement a small SootUp pipeline: inspect focal classes and export every method in a compiled tree to CSV (ProjectMethodsCsvExporter, with a Class column).
  3. Export method-level triples to a CSV with correct quoting for multi-line Jimple.
  4. Run a Python script that reads the CSV and asks gpt-4o-mini to draft Java 8 + JUnit 4 tests per row, writing a CSV.
  5. Parse java fences from that CSV and write .java files into your test source tree, then run your tests.

The tutorial is designed for about one hour.

Prerequisites

  1. Week 5 material: SootUp on the classpath, JavaView, resolving a class, iterating over SootMethod objects, inspecting bodies.
  2. Target project compiled to a target/classes-style directory.
  3. Java + Maven for the SootUp driver project.
  4. Activities 4–5: Python 3.8+, pip install openai pandas tqdm, plus an OpenAI API key. Activity 5 also uses re (standard library).

Suggested workspace layout (you create these)

| Folder / file (suggestion) | Role |
| --- | --- |
| A small target Maven module (e.g. Tutorial/) | Contains tutorial.TheArray, tutorial.Calculator |
| A SootUp Maven module (Week 5 style) | Paste SingleClassMethodInspector and ProjectMethodsCsvExporter below into src/main/java/... (same package so helpers are shared). |
| A Python folder | Activities 4–5: e.g. generate_tests_from_csv.py, format.py |

Outline (1 hour)

| Part | Activity | Time (guide) |
| --- | --- | --- |
| 1 | Definitions + hand trace: FQN, signature, Jimple for TheArray.sortArray | ~10 min |
| 2 | SootUp: two focal classes → console (SingleClassMethodInspector) | ~10 min |
| 3 | Whole compiled tree → CSV (ProjectMethodsCsvExporter, RFC 4180 escaping) | ~15 min |
| 4 | CSV → LLM-generated tests (Java 8 + JUnit 4 prompt, generated_code column) | ~15 min |
| 5 | Strip fences → write test .java files | ~10 min |

Activity 1: FQN, signature, and Jimple (~10 min)

Task 1.1: Why these three fields?

  • FQN (method): Locates the focal method in the project: declaring class type + name + parameter types, e.g. tutorial.TheArray.sortArray(int[]). It disambiguates overloads and ties prompts to a single symbol.
  • Signature: Summarises how to call the method — return type, name, parameter types (e.g. void sortArray(int[])). In SootUp we use MethodSignature.getSubSignature().toString() for a Java-like string.
  • Jimple: A readable IR over bytecode (locals, assignments, goto, calls). It exposes control flow and data flow without decompiling to full Java, which is often enough context for LLMs or for your own analyses.
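
To make the FQN convention concrete, here is a minimal Python sketch that splits a method FQN string of the form used in this lab into declaring class, method name, and parameter types. The helper name split_method_fqn and the MethodFqn tuple are hypothetical, introduced only for illustration; they are not part of SootUp.

```python
from typing import List, NamedTuple


class MethodFqn(NamedTuple):
    """Hypothetical container for the three parts of a method FQN."""
    declaring_class: str
    method_name: str
    parameter_types: List[str]


def split_method_fqn(fqn: str) -> MethodFqn:
    """Split 'pkg.Class.method(T1,T2)' into class, name, and parameter types."""
    # Everything before the first "(" is "declaring class + '.' + method name".
    head, _, params = fqn.partition("(")
    params = params.rstrip(")")
    # The method name is the part after the last dot of the head.
    declaring_class, _, method_name = head.rpartition(".")
    parameter_types = [p.strip() for p in params.split(",")] if params else []
    return MethodFqn(declaring_class, method_name, parameter_types)


if __name__ == "__main__":
    print(split_method_fqn("tutorial.TheArray.sortArray(int[])"))
    print(split_method_fqn("tutorial.Calculator.add(int,int)"))
```

Because the parameter types are part of the string, two overloads of the same method name produce different FQNs, which is what lets a prompt refer to exactly one method.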

Task 1.2: Example class

Create a small Maven module under package tutorial, then compile it so target/classes/tutorial/TheArray.class exists.

Listing 1a — Source: src/main/java/tutorial/TheArray.java (bubble-style sortArray).

package tutorial;

public class TheArray {
    public void sortArray(int[] array) {
        int n = array.length;
        for (int i = 0; i < n - 1; i++) {
            for (int j = 0; j < n - i - 1; j++) {
                if (array[j] > array[j + 1]) {
                    int temp = array[j];
                    array[j] = array[j + 1];
                    array[j + 1] = temp;
                }
            }
        }
    }
}

Listing 1b — Source: src/main/java/tutorial/Calculator.java.

package tutorial;

public class Calculator {
    public int add(int a, int b) {
        return a + b;
    }

    public int divide(int a, int b) {
        if (b == 0) {
            throw new IllegalArgumentException("Cannot divide by zero");
        }
        return a / b;
    }
}

Illustrative extraction for sortArray:

| Field | Example |
| --- | --- |
| FQN | tutorial.TheArray.sortArray(int[]) |
| Signature | void sortArray(int[]) |

Listing 2 — Jimple (structure is stable; temporary/unknown local names may vary with SootUp version):

{
    int[] array;
    tutorial.TheArray this;
    unknown $stack10, $stack11, $stack12, $stack13, $stack14, $stack6, $stack7, $stack8, $stack9, i, j, n, temp;

    this := @this: tutorial.TheArray;
    array := @parameter0: int[];
    n = lengthof array;
    i = 0;

    label1:
    $stack6 = n - 1;
    if i >= $stack6 goto label5;
    j = 0;

    label2:
    $stack7 = n - i;
    $stack8 = $stack7 - 1;
    if j >= $stack8 goto label4;
    // ... swap logic ...
    label5:
    return;
}

Activity 2: Extract two focal classes with SootUp (~10 min)

Task 2.1: Maven driver and SootUp dependencies

Create a separate Maven project (or module) for your SootUp drivers, following the same pattern as Week 5. In pom.xml, paste the following <dependency> blocks inside the project-level <dependencies> element (next to any existing dependencies, not inside <dependencyManagement> unless you know you need a BOM).

The listing below pins org.soot-oss artifacts at 2.0.0 so versions stay consistent. For this lab’s bytecode + JavaView workflow, the sample code relies especially on sootup.core, sootup.java.core, and sootup.java.bytecode.frontend; the other modules add source / Jimple / APK frontends, call graphs, and analyses—useful if you extend the pipeline, and safe to keep as a single copy-paste set.

<dependency>
    <groupId>org.soot-oss</groupId>
    <artifactId>sootup.core</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.soot-oss</groupId>
    <artifactId>sootup.java.core</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.soot-oss</groupId>
    <artifactId>sootup.java.sourcecode.frontend</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.soot-oss</groupId>
    <artifactId>sootup.java.bytecode.frontend</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.soot-oss</groupId>
    <artifactId>sootup.jimple.frontend</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.soot-oss</groupId>
    <artifactId>sootup.apk.frontend</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.soot-oss</groupId>
    <artifactId>sootup.callgraph</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.soot-oss</groupId>
    <artifactId>sootup.analysis.intraprocedural</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.soot-oss</groupId>
    <artifactId>sootup.analysis.interprocedural</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.soot-oss</groupId>
    <artifactId>sootup.qilin</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.soot-oss</groupId>
    <artifactId>sootup.codepropertygraph</artifactId>
    <version>2.0.0</version>
</dependency>

Task 2.2: One or more classes to stdout

Copy everything in the block below into your SootUp Maven project (same dependency setup as Week 5), e.g. src/main/java/org/example/SingleClassMethodInspector.java. Change the package if needed (and keep the CSV exporter in the same package so it can call escapeCsvField, etc.).

Pipeline in words:

  1. PathBasedAnalysisInputLocation.create(classesDir, SourceType.Application)
  2. new JavaView(inputLocation)
  3. For each requested class FQN: view.getClasses().filter(...) → that class
  4. For each method: print FQN, signature, Jimple (or <no body>)

main: The listing below shows a minimal pattern: hard-code your own classesDir, then call printMethodsForClass once per focal type (here tutorial.TheArray and tutorial.Calculator). To cover more classes, add further calls or loop over a String[] of FQNs in your own driver.

package org.example;

import sootup.core.inputlocation.AnalysisInputLocation;
import sootup.core.model.SootMethod;
import sootup.core.model.SourceType;
import sootup.core.signatures.MethodSignature;
import sootup.core.types.Type;
import sootup.java.bytecode.frontend.inputlocation.PathBasedAnalysisInputLocation;
import sootup.java.core.JavaSootClass;
import sootup.java.core.views.JavaView;

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Collectors;

/**
 * Loads compiled bytecode under a directory and prints FQN, signature, and Jimple for every method
 * in one named class. Shared helpers are reused by {@link ProjectMethodsCsvExporter}.
 */
public final class SingleClassMethodInspector {

    private SingleClassMethodInspector() {}

    public static JavaView createView(Path projectClassesDir) {
        AnalysisInputLocation inputLocation =
                PathBasedAnalysisInputLocation.create(projectClassesDir, SourceType.Application);
        return new JavaView(inputLocation);
    }

    /** Method FQN: declaring class + name + parameter types. */
    public static String methodFqn(MethodSignature sig) {
        String methodName = sig.getName();
        String paramStr =
                sig.getParameterTypes().stream().map(Type::toString).collect(Collectors.joining(","));
        return sig.getDeclClassType() + "." + methodName + "(" + paramStr + ")";
    }

    public static String methodSignature(MethodSignature sig) {
        return sig.getSubSignature().toString();
    }

    public static String jimpleOrPlaceholder(SootMethod method) {
        if (!method.hasBody()) {
            return "<no body>";
        }
        return method.getBody().toString();
    }

    /** RFC 4180–style CSV field quoting; used by {@link ProjectMethodsCsvExporter}. */
    static String escapeCsvField(String value) {
        if (value == null) {
            return "\"\"";
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    /**
     * @param classesDir root of compiled outputs (e.g. {@code Tutorial/target/classes})
     * @param classFqn fully qualified class name (e.g. {@code tutorial.TheArray})
     */
    public static void printMethodsForClass(Path classesDir, String classFqn) {
        JavaView view = createView(classesDir);
        JavaSootClass sootClass =
                view.getClasses()
                        .filter(c -> c.getType().getFullyQualifiedName().equals(classFqn))
                        .findFirst()
                        .orElseThrow(
                                () ->
                                        new IllegalArgumentException(
                                                "Class not found in view: "
                                                        + classFqn
                                                        + " (check classesDir: "
                                                        + classesDir.toAbsolutePath()
                                                        + ")"));

        System.out.println("======== Single class — all methods ==========");
        System.out.println("Class: " + sootClass.getType().getFullyQualifiedName());
        System.out.println();

        for (SootMethod method : sootClass.getMethods()) {
            MethodSignature sig = method.getSignature();
            System.out.println("---");
            System.out.println("method_fqn:  " + methodFqn(sig));
            System.out.println("signature:   " + methodSignature(sig));
            System.out.println("jimple:");
            System.out.println(jimpleOrPlaceholder(method));
        }
    }

    /**
     * Edit the hard-coded paths below, then run this {@code main}.
     */
    public static void main(String[] args) throws IOException {
        Path classesDir =
                Paths.get(
                                "/path/to/your/Tutorial/target/classes")
                        .toAbsolutePath()
                        .normalize();
        printMethodsForClass(classesDir, "tutorial.TheArray");
        printMethodsForClass(classesDir, "tutorial.Calculator");
        System.out.println();
    }
}

Replace /path/to/your/Tutorial/target/classes with the absolute path to your compiled output.


Activity 3: Whole project → CSV (~15 min)

Task 3.1: Motivation

Activity 2 only covers the classes you name by hand. For automated extraction you usually want every method in a given compiled tree, without hard-coding class names.

Task 3.2: CSV schema

Headers produced by ProjectMethodsCsvExporter:

| Column | Meaning |
| --- | --- |
| Class | Declaring type FQN (e.g. tutorial.TheArray); first column |
| FQN | Method FQN (same convention as Activity 2) |
| Signature | getSubSignature().toString() |
| Jimple Code Representation | Body toString(), or <no body> |

Jimple often contains commas, double quotes, and newlines, so every field must be quoted and any internal double quote doubled (RFC 4180). SingleClassMethodInspector.escapeCsvField implements this; do not hand-write naive comma-separated output.
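
The quoting rule can be checked independently of Java. The sketch below is a Python mirror of the same rule (wrap each field in double quotes, double any internal quote), followed by a round trip through Python's standard csv module. The class name tutorial.Demo and the Jimple-like body are made-up sample data, not SootUp output.

```python
import csv
import io
from typing import List, Optional


def escape_csv_field(value: Optional[str]) -> str:
    """Quote a field RFC 4180 style: wrap in quotes, double inner quotes."""
    if value is None:
        return '""'
    return '"' + value.replace('"', '""') + '"'


def write_row(fields: List[Optional[str]]) -> str:
    """Join quoted fields with commas and terminate the record with CRLF."""
    return ",".join(escape_csv_field(f) for f in fields) + "\r\n"


def parse_rows(text: str) -> List[List[str]]:
    """Parse CSV text; newline='' leaves embedded line breaks untouched."""
    return list(csv.reader(io.StringIO(text, newline="")))


# Made-up multi-line body containing a comma, quotes, and newlines.
SAMPLE_BODY = '{\n    $stack1 = "a,b";\n    return;\n}'

if __name__ == "__main__":
    line = write_row(["tutorial.Demo", "void demo()", SAMPLE_BODY])
    rows = parse_rows(line)
    print(len(rows), "record;", "body intact:", rows[0][2] == SAMPLE_BODY)
```

A naive ",".join(fields) on the same data would split the body at its comma and at every newline, which is why the exporter quotes every field.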

Task 3.3: Whole tree to CSV

Copy the entire class below into the same Maven module and same package as SingleClassMethodInspector (it calls SingleClassMethodInspector.createView, .escapeCsvField, .methodFqn, etc.), e.g. src/main/java/org/example/ProjectMethodsCsvExporter.java.

Behaviour in short: createView(classesDir) → collect all classes, sort by class FQN → for each class, sort methods by method FQN → write UTF-8 CSV (Class first, then method triple) with RFC 4180–style quoting.

package org.example;

import sootup.core.model.SootMethod;
import sootup.core.signatures.MethodSignature;
import sootup.java.core.JavaSootClass;
import sootup.java.core.views.JavaView;

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/**
 * Traverses every class under a compiled output directory and writes one CSV row per method:
 * Class (declaring type FQN), method FQN, Signature, Jimple Code Representation.
 */
public final class ProjectMethodsCsvExporter {

    private ProjectMethodsCsvExporter() {}

    public static void exportToCsv(Path classesDir, Path csvOut) throws IOException {
        JavaView view = SingleClassMethodInspector.createView(classesDir);

        List<JavaSootClass> classes =
                view.getClasses()
                        .sorted(Comparator.comparing(c -> c.getType().getFullyQualifiedName()))
                        .collect(Collectors.toList());

        try (BufferedWriter w = Files.newBufferedWriter(csvOut, StandardCharsets.UTF_8)) {
            w.write(SingleClassMethodInspector.escapeCsvField("Class"));
            w.write(',');
            w.write(SingleClassMethodInspector.escapeCsvField("FQN"));
            w.write(',');
            w.write(SingleClassMethodInspector.escapeCsvField("Signature"));
            w.write(',');
            w.write(SingleClassMethodInspector.escapeCsvField("Jimple Code Representation"));
            w.newLine();

            for (JavaSootClass sootClass : classes) {
                String classFqn = sootClass.getType().getFullyQualifiedName();
                List<SootMethod> methods = new ArrayList<>(sootClass.getMethods());
                methods.sort(
                        Comparator.comparing(m -> SingleClassMethodInspector.methodFqn(m.getSignature())));

                for (SootMethod method : methods) {
                    MethodSignature sig = method.getSignature();
                    w.write(SingleClassMethodInspector.escapeCsvField(classFqn));
                    w.write(',');
                    w.write(SingleClassMethodInspector.escapeCsvField(
                            SingleClassMethodInspector.methodFqn(sig)));
                    w.write(',');
                    w.write(SingleClassMethodInspector.escapeCsvField(
                            SingleClassMethodInspector.methodSignature(sig)));
                    w.write(',');
                    w.write(SingleClassMethodInspector.escapeCsvField(
                            SingleClassMethodInspector.jimpleOrPlaceholder(method)));
                    w.newLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path classesDir =
                Paths.get("/path/to/your/Tutorial/target/classes").toAbsolutePath().normalize();
        Path csvOut = Paths.get("/path/to/your/extracted_codes.csv").toAbsolutePath().normalize();
        exportToCsv(classesDir, csvOut);
        System.out.println("Wrote CSV: " + csvOut.toAbsolutePath().normalize());
    }
}

Task 3.4: Run and verify

  1. In the target Maven module (e.g. Tutorial): compile it so target/classes is up to date.
  2. Run ProjectMethodsCsvExporter.main from your IDE.
  3. Open the output CSV: confirm the header is Class,FQN,Signature,Jimple Code Representation (each field appears in double quotes when viewed as raw text) and that rows exist for tutorial.TheArray and tutorial.Calculator (and for any other types in the compiled tree).
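
The check in step 3 can also be scripted. The following is a minimal sketch, assuming the header names listed in Task 3.2; the function name count_methods_per_class is hypothetical, and SAMPLE is hand-written stand-in data with placeholder bodies rather than real exporter output.

```python
import csv
import io
from typing import Dict

EXPECTED_HEADER = ["Class", "FQN", "Signature", "Jimple Code Representation"]

# Hand-written stand-in for the exporter's output; bodies are placeholders.
SAMPLE = (
    '"Class","FQN","Signature","Jimple Code Representation"\n'
    '"tutorial.Calculator","tutorial.Calculator.add(int,int)",'
    '"int add(int,int)","{\n    return;\n}"\n'
    '"tutorial.TheArray","tutorial.TheArray.sortArray(int[])",'
    '"void sortArray(int[])","<no body>"\n'
)


def count_methods_per_class(csv_text: str) -> Dict[str, int]:
    """Validate the header row, then count data rows per declaring class."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError(f"Unexpected header: {header!r}")
    counts: Dict[str, int] = {}
    for row in reader:
        counts[row[0]] = counts.get(row[0], 0) + 1
    return counts


if __name__ == "__main__":
    for class_name, n in sorted(count_methods_per_class(SAMPLE).items()):
        print(class_name, n)
```

To run it on your own file, pass the file contents instead of SAMPLE, for example Path("/path/to/your/extracted_codes.csv").read_text(encoding="utf-8") after importing Path from pathlib.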

Activity 4: From CSV to LLM-generated tests (~15 min)

Task 4.1: Motivation

Raw Jimple plus FQN and signature is a compact prompt payload: the model sees signature (types), FQN (location), and behaviour (IR). A simple batch script lets you scale test ideation across hundreds of methods and store results next to the facts you extracted — the same pattern many A2 pipelines use (extract → CSV → prompt → collect generations).

Never commit API keys or paste them into markdown. Use openai_key.txt (local only) or environment variables. If a key was ever committed, rotate it in the OpenAI dashboard.

Task 4.2: What the script does

  1. Read the Activity 3 CSV (must include Class, FQN, Signature, Jimple Code Representation).
  2. Keep only rows whose Class is tutorial.Calculator or tutorial.TheArray (see TARGET_CLASSES in the script; edit the list if you rename types).
  3. For each remaining row, call gpt-4o-mini. The system and user strings must both insist on Java 8 and JUnit 4 only (no org.junit.jupiter.*).
  4. Append generated_code (raw model reply, usually containing a ```java fence) and write generated_code.csv containing only the filtered rows.

Task 4.3: Python script (copy into your own file)

Save the block below as e.g. generate_tests_from_csv.py. Never embed API keys in the script or in handouts — read from openai_key.txt (one line).

from pathlib import Path
from typing import List
import pandas as pd
from openai import OpenAI
from tqdm import tqdm

def chat_complete(messages: List[dict], model: str, api_key: str) -> str:
    client = OpenAI(api_key=api_key)
    response = client.chat.completions.create(model=model, messages=messages)
    return (response.choices[0].message.content or "").strip()


def build_messages(fqn: str, signature: str, jimple: str) -> List[dict]:
    system = (
        "You generate Java **8** (JDK 1.8) source that compiles with -source 1.8 / -target 1.8. "
        "Do not use var, modules, records, text blocks, or any API added after Java 8. "
        "Use **JUnit 4** only: org.junit.Test on public void methods, "
        "org.junit.Assert (e.g. static import assertEquals, assertTrue, assertNotNull, assertArrayEquals). "
        "Never use org.junit.jupiter.*. "
        "Output only the Java source inside ONE markdown fence: ```java ... ```."
    )
    user = f"""### Focal method
FQN: {fqn}
Signature: {signature}
### Jimple
{jimple}
### Task
Write one **JUnit 4** test class: same package as the class under test, `public void` methods annotated with `org.junit.Test`, assertions via `org.junit.Assert`."""
    return [{"role": "system", "content": system}, {"role": "user", "content": user}]


def main() -> None:
    extracted_codes_path = "/path/to/extracted_codes.csv"
    generated_codes_path = "/path/to/generated_code.csv"
    # Rows whose Class is in this list are sent to the API; output CSV only contains these rows.
    TARGET_CLASSES = ["tutorial.Calculator", "tutorial.TheArray"]
    MODEL = "gpt-4o-mini"

    # Read the key from a local, uncommitted file (one line); never hard-code it.
    api_key = Path("openai_key.txt").read_text(encoding="utf-8").strip()
    df = pd.read_csv(extracted_codes_path, encoding="utf-8")
    required = ["Class", "FQN", "Signature", "Jimple Code Representation"]
    for col in required:
        if col not in df.columns:
            raise ValueError(f"Missing {col!r}; got {list(df.columns)}")

    df = df[df["Class"].isin(TARGET_CLASSES)].copy().reset_index(drop=True)
    if df.empty:
        raise ValueError(f"No rows with Class in {TARGET_CLASSES!r}")

    df["generated_code"] = ""
    for i in tqdm(df.index, desc=MODEL):
        row = df.loc[i]
        msgs = build_messages(
            str(row["FQN"]), str(row["Signature"]), str(row["Jimple Code Representation"])
        )
        try:
            df.at[i, "generated_code"] = chat_complete(msgs, model=MODEL, api_key=api_key)
        except Exception as e:
            df.at[i, "generated_code"] = f"<error: {e}>"
    df.to_csv(generated_codes_path, index=False, encoding="utf-8")
    print(f"Wrote {generated_codes_path} ({len(df)} rows)")


if __name__ == "__main__":
    main()

Task 4.4: After running

  • Inspect a few generated_code cells: the model should have emitted JUnit 4, not Jupiter.
  • Adjust extracted_codes_path / generated_codes_path in main if your Activity 3 CSV is not beside the script.
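
Inspecting cells by eye does not scale to many rows, so a rough triage helper can flag the obvious problems first. The sketch below is one possible approach, not part of the lab scripts: the function name classify_generation and its labels are hypothetical, and the checks are plain substring tests that can misjudge unusual replies.

```python
# Built at run time so this listing never contains a literal fence.
FENCE = "`" * 3


def classify_generation(text: str) -> str:
    """Return a coarse label for one generated_code cell."""
    if text.startswith("<error:"):
        # Marker written by the Activity 4 script when the API call failed.
        return "api_error"
    if FENCE + "java" not in text:
        return "no_java_fence"
    if "org.junit.jupiter" in text:
        # JUnit 5 (Jupiter) import, which the prompt forbids.
        return "uses_junit5"
    if "org.junit.Test" not in text and "org.junit.*" not in text:
        return "no_junit4_import"
    return "looks_ok"


if __name__ == "__main__":
    good = FENCE + "java\nimport org.junit.Test;\npublic class T {}\n" + FENCE
    print(classify_generation(good))
    print(classify_generation("<error: timeout>"))
```

Applied to the generated_code column (for example with df["generated_code"].map(classify_generation) in pandas), this gives a quick count of rows that need manual attention before Activity 5.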

Activity 5: From model output to runnable Java files (~10 min)

Task 5.1: Why this step?

Activity 4 stores opaque text per row (often Markdown plus a fenced code block whose language tag is java). To compile in Maven you need plain .java files under src/test/java/... with a valid package and public class name. The helper below heuristically extracts the last such fenced block, rebuilds package, merges imports, and re-wraps the inner class body into public class <name> { ... }.

Regex extraction is fragile (nested braces, multiple classes, the model omits the fence). If get_runnable_code_from_test_code returns 'error', paste that row’s generated_code into the IDE and fix by hand. Always review generated tests before trusting them.
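
Two of these cases are easy to see on a toy input. The sketch below isolates only the fence-matching step, using the same pattern as the reference script; the helper name last_java_block and the sample reply are hypothetical. It returns None where the reference script returns 'error'.

```python
import re
from typing import Optional

# Three backticks, built at run time so the listing has no literal fence.
FENCE = "`" * 3


def last_java_block(reply: str) -> Optional[str]:
    """Return the body of the last java-tagged fenced block, or None."""
    pattern = FENCE + r"java\s*\n(.*?)\n" + FENCE
    blocks = re.findall(pattern, reply, re.DOTALL)
    return blocks[-1] if blocks else None


# A made-up reply with two fenced blocks; the later one should win.
SAMPLE_REPLY = (
    "Draft:\n" + FENCE + "java\nclass A {}\n" + FENCE + "\n"
    "Revised:\n" + FENCE + "java\nclass B {}\n" + FENCE + "\n"
)

if __name__ == "__main__":
    print(last_java_block(SAMPLE_REPLY))
    print(last_java_block("The model answered without a fence."))
```

Taking the last block is a design choice that favours a model's final revision over an earlier draft; a reply with no java-tagged fence yields nothing to extract and has to be handled by hand.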

Task 5.2: Naming test classes

The sample loop derives test_code_class_name from FQN (sanitised characters) so each method gets a distinct file name, e.g. TheArray_sortArray_int_____Test.java. Keep package_name equal to your production package (tutorial here).
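
The renaming can be tried in isolation before running the full script. The sketch below applies the same character replacements, in the same order, as the sample loop; the function name derive_test_class_name and the REPLACEMENTS table are hypothetical packaging of that logic.

```python
# (old, new) pairs applied in order, matching the sample loop's replace chain.
REPLACEMENTS = [
    (".", "_"),
    ("(", "_"),
    (")", ""),
    (",", "_"),
    ("[", "__"),
    ("]", "__"),
    ("$", "__dollarsign__"),
    ("<", "_"),
    (">", "_"),
]


def derive_test_class_name(method_fqn: str, package_name: str) -> str:
    """Turn a method FQN into a valid Java identifier ending in _Test."""
    # Drop the package prefix first, then sanitise the remaining characters.
    name = method_fqn.replace(package_name + ".", "")
    for old, new in REPLACEMENTS:
        name = name.replace(old, new)
    return name + "_Test"


if __name__ == "__main__":
    for fqn in (
        "tutorial.TheArray.sortArray(int[])",
        "tutorial.Calculator.add(int,int)",
    ):
        print(fqn, "->", derive_test_class_name(fqn, "tutorial") + ".java")
```

The mapping is not guaranteed to be unique: add(int,int) and a method literally named add_int taking one int would both become add_int_int, so check for clashes if your project uses underscores in method names.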

Task 5.3: Script (reference)

Adjust generated_codes_path, format_codes_path, and test_dir to your project. The script calls Path(test_dir).mkdir(parents=True, exist_ok=True) so the package folder is created if it does not exist yet. Requires pip install pandas tqdm.

import re
from pathlib import Path

import pandas as pd
from tqdm import tqdm


def get_runnable_code_from_test_code(test_code, package_name, class_name):
    """Take LLM markdown; return one .java file body or 'error'."""
    # Build fence as three backticks + "java" so this Markdown file never embeds a literal fence (breaks Jekyll).
    triple = "`" * 3
    pattern = triple + r"java\s*\n(.*?)\n" + triple
    matches = re.findall(pattern, test_code, re.DOTALL)
    if not matches:
        return "error"
    java_blob = matches[-1]

    import_pattern = r"^import\s.+;$"
    imports = re.findall(import_pattern, java_blob, re.MULTILINE)

    class_pattern = r"class\s+\w+\s*\{([\s\S]*)\}$"
    class_match = re.search(class_pattern, java_blob, re.MULTILINE)
    class_code = class_match.group(1) if class_match else ""

    runnable = "package " + package_name + ";\n"
    runnable += "\n".join(imports)
    # String concat instead of an f-string: Jekyll/Liquid treats two consecutive open-braces as template syntax.
    runnable += "\n\npublic class " + class_name + " {" + class_code + "\n}\n"
    return runnable


if __name__ == "__main__":
    generated_codes_path = "/path/to/generated_code.csv"
    format_codes_path = "/path/to/formatted_code.csv"
    test_dir = "/path/to/Tutorial/src/test/java/tutorial"
    package_name = "tutorial"

    Path(test_dir).mkdir(parents=True, exist_ok=True)

    df = pd.read_csv(generated_codes_path, encoding="utf-8")
    df["runnable_test_code"] = ""
    df["test_code_file_path"] = ""

    for index, row in tqdm(df.iterrows(), total=len(df)):
        FQN = row["FQN"]
        generated_code = row["generated_code"]
        test_code_class_name = (
            FQN.replace(package_name + ".", "")
            .replace(".", "_")
            .replace("(", "_")
            .replace(")", "")
            .replace(",", "_")
            .replace("[", "__")
            .replace("]", "__")
            .replace("$", "__dollarsign__")
            .replace("<", "_")
            .replace(">", "_")
            + "_Test"
        )
        runnable_code = get_runnable_code_from_test_code(
            str(generated_code), package_name, test_code_class_name
        )
        df.at[index, "runnable_test_code"] = runnable_code

        file_path = f"{test_dir}/{test_code_class_name}.java"
        if runnable_code != "error":
            with open(file_path, "w", encoding="utf-8") as f:
                f.write(runnable_code)
        df.at[index, "test_code_file_path"] = file_path

    df.to_csv(format_codes_path, index=False, encoding="utf-8")
    print(f"Wrote {format_codes_path} and files under {test_dir}")

Task 5.4: Run tests

  1. Ensure Tutorial/pom.xml declares junit:junit (JUnit 4) for test scope and maven.compiler.source / maven.compiler.target match Java 8 if that is your baseline.
  2. From Tutorial/: run the tests and fix compile errors (missing imports, wrong package) in the written files as needed.

Learning objectives

By the end of this tutorial you should be able to:

  • Use SootUp to automatically extract FQN, signature, and Jimple for multiple classes and methods
  • Design a code information extractor (inputs, outputs, main loop, filtering)
  • Write extracted code information to CSV files with proper escaping for Jimple
  • Align the extractor and CSV output with A2 Task 2 requirements

References

  • Week 5 (SootUp, view, class, method, FQN, signature, Jimple)
  • SootUp
  • Course A2 specification (Task 2: code information extraction and output format)