Understanding the Limits of LLMs on Graph Problems

Background

Large Language Models (LLMs) have shown impressive abilities in reasoning, coding, and problem-solving. Graphs, however, pose a distinct challenge: they demand structured, algorithmic thinking and reasoning over long-range dependencies. Recent papers have begun exploring LLM performance on graph tasks, including:

  • Can Language Models Solve Graph Problems in Natural Language? (NeurIPS 2023)
  • Can Large Language Models Analyze Graphs like Professionals? A Benchmark, Datasets and Models (NeurIPS 2024)
  • GraphArena: Evaluating and Exploring Large Language Models on Graph Computation (ICLR 2025)

These works show that LLMs can often solve small instances or local tasks, but fail on larger graphs, global properties, and NP-hard problems, frequently producing hallucinations or inconsistent reasoning.

Aim

The aim of this project is to systematically study the strengths and limitations of LLMs on graph reasoning tasks and connect these limitations to the underlying architecture and inductive biases of Transformer-based models. The student will investigate where LLMs work well, where they fail, and which limitations arise fundamentally from their design.

Research Questions

  • Are LLMs better at local reasoning tasks than global ones?
  • How do graph size, density, and structure influence performance?
  • Does prompt format (natural language vs code generation) affect reliability?
  • Which failures stem from architectural constraints such as limited context, lack of recursion, or inability to implement multi-step algorithms?
  • Which limitations appear fixable, and which seem intrinsic to the Transformer architecture?
  • What are potential approaches to overcome these limitations?

Project Description

1. Review of Prior Work and Model Mechanics

The student will summarise the findings of recent graph–LLM evaluation papers and review relevant aspects of Transformer architectures, including attention mechanisms, multi-hop reasoning, and limitations in representing algorithmic computations.

2. Designing a Targeted Evaluation Suite

The student will design a small, manageable set of graph tasks that probe different kinds of reasoning: local vs global tasks, polynomial-time vs NP-hard tasks (the latter restricted to very small instances), and different graph types (synthetic and real-world). Representation effects (natural language vs code prompting) will also be examined; a sketch of how such instances might be generated follows below.
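To make the representation comparison concrete, the sketch below generates a shortest-path instance on a small random graph and renders it both as a natural-language question and as a code-completion task. The task choice, prompt templates, and function names are illustrative assumptions, not a fixed part of the project:

    import networkx as nx

    def make_instance(n: int, p: float, seed: int):
        """Sample a small Erdos-Renyi graph and a source/target pair."""
        g = nx.gnp_random_graph(n, p, seed=seed)
        nodes = sorted(g.nodes)
        return g, nodes[0], nodes[-1]

    def nl_prompt(g: nx.Graph, source: int, target: int) -> str:
        """Encode the instance as a natural-language question."""
        edges = ", ".join(f"({u}, {v})" for u, v in g.edges)
        return (f"A graph has edges {edges}. What is the length of the "
                f"shortest path from node {source} to node {target}?")

    def code_prompt(g: nx.Graph, source: int, target: int) -> str:
        """Encode the same instance as a code-completion task."""
        return (f"edges = {list(g.edges)}\n"
                f"# Print the shortest path length from {source} to {target}.")

    g, s, t = make_instance(n=8, p=0.5, seed=0)
    print(nl_prompt(g, s, t))
    if nx.has_path(g, s, t):
        print(nx.shortest_path_length(g, s, t))  # ground truth for scoring

Keeping the ground truth computable with NetworkX makes automatic scoring of model answers straightforward.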

3. Experimental Pipeline and Evaluation

The student will implement an experimental pipeline using Python and NetworkX. They will test one or two LLMs and evaluate answer correctness, solution feasibility, hallucination (e.g., references to nonexistent nodes or edges), and stability across repeated runs.
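A minimal sketch of such a pipeline is given below, reusing the nl_prompt generator from the earlier sketch. The query_llm stub stands in for whichever model API the project adopts, and the answer parser and metrics are simplifying assumptions:

    import re
    import networkx as nx

    def query_llm(prompt: str) -> str:
        """Placeholder: send the prompt to the chosen LLM and return its reply."""
        raise NotImplementedError("wire up the selected model's API here")

    def extract_int(response: str):
        """Naive parser: take the last integer appearing in the response."""
        matches = re.findall(r"-?\d+", response)
        return int(matches[-1]) if matches else None

    def evaluate(instances, trials: int = 3):
        """Score each instance over repeated trials for correctness and stability."""
        results = []
        for g, s, t in instances:
            truth = nx.shortest_path_length(g, s, t)
            answers = [extract_int(query_llm(nl_prompt(g, s, t)))
                       for _ in range(trials)]
            results.append({
                "correct": sum(a == truth for a in answers) / trials,
                "stable": len(set(answers)) == 1,  # identical answer every trial
            })
        return results

Hallucination checks, for instance verifying that any path the model cites uses only edges that exist in the graph, would slot into the same loop.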

4. Analysis: Architectural Sources of Limitations

Based on these observations, the student will relate empirical failures to underlying Transformer constraints. This includes analysing where attention patterns fail, where iterative computation is missing, and where statistical pattern-matching leads to hallucination.

5. Proposing Methods to Overcome These Limitations

Building on the findings of the previous steps, the student will propose and implement potential methods to overcome these limitations, for example by focusing on a particular class of problems as a toy setting.

Expected Outcomes

  • A structured survey of prior work.
  • A compact benchmark-style test suite for probing LLM limitations.
  • Experimental results mapping strengths and weaknesses across tasks.
  • An explanation of which limitations arise from the architecture vs prompting choices.
  • Proposed methods to resolve these issues.

Requirements

The ideal student should have a strong background in programming, (graph) algorithms, and machine learning.

Project Scope

This project is suitable for a 12-unit student project or a 24-unit honours project over a full academic year.

Contact

If you are interested, please email Ahad N. Zehmakan (ahadn.zehmakan@anu.edu.au) and Jing Jiang (Jing.Jiang@anu.edu.au) with:

  • What aspects of this project interest you most
  • The type of research project you are seeking (6-unit, 12-unit, or 24-unit)
  • A copy of your transcripts and/or CV
  • Any questions you may have