Supervised learning requires the specification of a loss function to minimise. While the theory of admissible losses is well developed from both computational and statistical perspectives, it still leaves a panoply of choices. In practice, the choice is typically made in an ad hoc manner. In the hope of making this procedure more principled, the problem of learning the loss function for a downstream task (e.g., classification) has garnered recent interest. However, work in this area has generally been empirical in nature.
Starting from the assumption that Bayes rule is optimal for the loss at hand, and from a general form for proper composite losses, we revisit the SLIsotron algorithm of Kakade et al. (2011) through a novel lens, derive a generalisation based on Bregman divergences, and show how it provides a principled procedure for learning the loss. The resulting BregmanTron algorithm jointly learns the loss along with the classifier. It comes equipped with a simple guarantee of convergence for the loss it learns, and its set of possible outputs comes with a guarantee of agnostic approximability of Bayes rule. Experiments indicate that the BregmanTron substantially outperforms the SLIsotron, and that the loss it learns can be minimised by other algorithms for different tasks, thereby opening the interesting problem of loss transfer between domains. Joint work with Aditya Menon (Google Research).
Richard Nock leads the Machine Learning Research Group at Data61 and is an Honorary Professor at the ANU.