Optimal troubleshooting (and other tractable POMDPs)

Picture of patrik-haslum.md Patrik Haslum

1 Dec 2023

“Troubleshooting” is the problem of integrated active diagnosis and repair, in which the repairer must repeatedly select between (imperfect) diagnostic tests and repairs to perform to restore functionality of a faulty system, while minimising the expected outage time and/or repair cost. The problem arises in many technical systems, including power networks, cloud computing, vehicle repair, etc.

Given prior probability distributions over faults and test false positives/negatives, the troubleshooting problem is a partially observable Markov decision problem (POMDP). Finding optimal solutions to POMDPs in general is intractable. However, in practical contexts, the way that faults affect the system and the available tests often exhibits a simple structure, such as a tree. We aim to exploit that structure to identify tractable optimal troubleshooting strategies.

Troubleshooting is not the only problem that is naturally modelled as a POMDP with a specialised, simple structure. For example, the problem known as “attack planning” (automated cyber red-teaming) has also been cast into this framework. In this project, we therefore aim to identify general structures in a POMDP that allow it to be solved to optimality in an efficient way.

Background reading

Requirements

This is primarily a theory project. Fundamental knowledge of computational complexity theory and probability theory is essential. Some familiarity with AI models and methods for decision making under uncertainty (such as MDPs, POMDPs, Bayesian networks, and similar) is also helpful.

arrow-left bars search times