Optimal troubleshooting (and other tractable POMDPs)


"Troubleshooting" is the problem of integrated active diagnosis and repair: the repairer must repeatedly choose which (imperfect) diagnostic test or repair action to perform next in order to restore a faulty system to working order, while minimising the expected outage time and/or repair cost. The problem arises in many technical systems, including power networks, cloud computing and vehicle repair.

Given prior probability distributions over faults and over test false positives/negatives, the troubleshooting problem is a partially observable Markov decision process (POMDP). Finding optimal solutions to POMDPs is intractable in general. In practice, however, the way that faults affect the system and the available tests often exhibits a simple structure, such as a tree. We aim to exploit that structure to identify optimal troubleshooting strategies that can be computed tractably.
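To make the partial observability concrete, the following is a minimal sketch of how a belief over faults might be revised after one imperfect test, using Bayes' rule. It is an illustration only, not the project's method: the function name `update_belief`, the fault names, the single-fault assumption and the false positive/negative rates are all hypothetical.

```python
from typing import Dict, Set


def update_belief(
    belief: Dict[str, float],
    detects: Set[str],
    outcome: bool,
    false_neg: float = 0.1,
    false_pos: float = 0.05,
) -> Dict[str, float]:
    """Bayes update of a fault belief after one imperfect test.

    Assumes exactly one fault is present (an illustrative simplification).
    belief:    prior probability of each candidate fault
    detects:   faults the test is designed to reveal
    outcome:   True if the test reported a problem
    false_neg: P(test reports nothing | fault is one it should reveal)
    false_pos: P(test reports a problem | fault is one it cannot reveal)
    """
    unnormalised = {}
    for fault, prior in belief.items():
        # Probability that the test reports a problem under this fault.
        p_positive = (1.0 - false_neg) if fault in detects else false_pos
        likelihood = p_positive if outcome else (1.0 - p_positive)
        unnormalised[fault] = prior * likelihood

    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the prior")
    return {fault: p / total for fault, p in unnormalised.items()}


# Hypothetical example: three candidate faults, one test aimed at the PSU.
prior = {"psu": 0.5, "cable": 0.3, "disk": 0.2}
posterior = update_belief(prior, detects={"psu"}, outcome=True)
```

A troubleshooting strategy then maps each such belief to the next test or repair. The number of reachable beliefs grows quickly with the number of possible test outcomes, which is one reason structure in the fault and test model matters.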

Troubleshooting is not the only problem that is naturally modelled as a POMDP with a specialised, simple structure. For example, the problem known as "attack planning" (automated cyber red-teaming) has also been cast in this framework. In this project, we therefore aim to identify general structures in a POMDP that allow it to be solved to optimality efficiently.

Background Literature

Updated: 1 June 2019 / Responsible Officer: Dean, CECS / Page Contact: CECS Marketing