Optimal troubleshooting (and other tractable POMDPs)

Picture of Patrik Haslum Patrik Haslum

14 Feb 2023

“Troubleshooting” is the problem of integrated active diagnosis and repair, in which the repairer must repeatedly select between (imperfect) diagnostic tests and repairs to perform to restore functionality of a faulty system, while minimising the expected outage time and/or repair cost. The problem arises in many technical systems, including power networks, cloud computing, vehicle repair, etc.

Given prior probability distributions over faults and test false positives/negatives, the troubleshooting problem is a partially observable Markov decision problem (POMDP). Finding optimal solutions to POMDPs in general is intractable. However, in practical contexts, the way that faults affect the system and the available tests often exhibits a simple structure, such as a tree. We aim to exploit that structure to identify tractable optimal troubleshooting strategies.

Troubleshooting is not the only problem that is naturally modelled as a POMDP with a specialised, simple structure. For example, the problem known as “attack planning” (automated cyber red-teaming) has also been cast into this framework. In this project, we therefore aim to identify general structures in a POMDP that allow it to be solved to optimality in an efficient way.

Background reading


This is primarily a theory project. Fundamental knowledge of computational complexity theory and probability theory is essential. Some familiarity with AI models and methods for decision making under uncertainty (such as MDPs, POMDPs, Bayesian networks, and similar) is also helpful.

You are on Aboriginal land.

The Australian National University acknowledges, celebrates and pays our respects to the Ngunnawal and Ngambri people of the Canberra region and to all First Nations Australians on whose traditional lands we meet and work, and whose cultures are among the oldest continuing cultures in human history.

arrow-left bars search times arrow-up