Before applying an AI system to any real-world problem, it is first necessary to explicitly specify an objective for the system to optimize. Whether this is a loss function over a dataset, an MDP to plan in, or a simulator in which to train an RL agent, we must eventually write down a specification that reduces our vague real-world problem to an explicit computational problem. In this thesis, I view writing an explicit problem specification as an act of communication. From this perspective, a problem specification is more effective the better it serves the AI designer's end of communicating the true underlying problem.
We begin by formalizing the problem faced by the designer, which we call the Specification Design Problem. In this problem, the designer must choose from a set of available problem specifications to be solved, aiming to select one whose solution performs well on the designer's true, unspecified problem. The designer's goal in the specification design problem is to minimize the specification error -- the utility the designer loses by simply writing down the wrong problem. We show that specification error decomposes into underspecification error, caused by leaving out helpful information, and misspecification error, caused by specifying incorrect information. Each of these errors can then be analyzed in turn: underspecification error can be characterized through a new measure of the value of information, while misspecification error can be understood through a geometric analysis of the crucial trade-offs of a particular domain. Taken together, these tools allow us to make recommendations to the AI designer, which we demonstrate by determining the conditions under which the designer should stop specifying more information.
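As a rough sketch of the decomposition above (the notation here is purely illustrative, standing in for the precise definitions given later), let $U$ denote the designer's true utility, let $\pi(s)$ denote the solution to specification $s$, let $s^\star$ denote an idealized specification of the true problem, and let $\bar{s}$ denote a hypothetical corrected version of the chosen specification $s$ in which the incorrect information is removed but no new information is added. One can then write the telescoping identity
\[
  \underbrace{U\bigl(\pi(s^\star)\bigr) - U\bigl(\pi(s)\bigr)}_{\text{specification error}}
  \;=\;
  \underbrace{U\bigl(\pi(s^\star)\bigr) - U\bigl(\pi(\bar{s})\bigr)}_{\text{underspecification error}}
  \;+\;
  \underbrace{U\bigl(\pi(\bar{s})\bigr) - U\bigl(\pi(s)\bigr)}_{\text{misspecification error}},
\]
where the first gap reflects the information that was left out and the second reflects the information that was specified incorrectly.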
In the second part, we use this perspective to study a class of problem specifications often used in practice to mitigate misspecification errors. Such specifications, including maximizing worst-case utility and minimizing worst-case regret, are often used to avoid specifying a complex feature of the true problem while producing a system robust to that feature. These approaches contrast with the Bayesian framework, which requires fixing some distribution over that feature. However, despite their practical use, it has yet to be shown that such approaches are strictly necessary: while they improve the performance of current systems, one might hope to eventually replace them with an appropriate model of uncertainty within the Bayesian framework. We provide an example in which a specification resembling maximizing worst-case utility is practically unavoidable, as any explicit specification within the standard Bayesian framework requires specifying the task in excessive and impractical detail. This example shows that, counterintuitively, the optimal choice for a Bayesian AI designer facing a specification design problem may be to create a non-Bayesian agent.
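For concreteness, the three objectives referenced above can be written as follows, with $\theta$ ranging over the unspecified feature of the problem, $p$ a prior over that feature, and $U(\pi, \theta)$ the utility of policy $\pi$ when the feature takes value $\theta$ (again, illustrative notation rather than the definitions used later):
\[
  \pi_{\mathrm{Bayes}} \in \arg\max_{\pi} \; \mathbb{E}_{\theta \sim p}\bigl[U(\pi, \theta)\bigr],
  \qquad
  \pi_{\mathrm{maximin}} \in \arg\max_{\pi} \; \min_{\theta}\, U(\pi, \theta),
\]
\[
  \pi_{\mathrm{minimax\ regret}} \in \arg\min_{\pi} \; \max_{\theta} \Bigl( \max_{\pi'} U(\pi', \theta) - U(\pi, \theta) \Bigr).
\]
Only the Bayesian objective requires the designer to commit to a distribution $p$ over the unspecified feature.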
To match the needs of the AI designer in such specification design problems, I provide an extension to the Bayesian framework that includes these helpful problem specifications. The core difficulty of this extension is that these specifications make what I call self-referential claims -- assertions about how the policy will perform in the world. For instance, maximizing worst-case utility can be interpreted as the claim that ``the world is one of the worst for your policy'', and minimizing worst-case regret can be interpreted as the claim ``the policy is likely to incur high loss''. Treating them as self-referential claims allows us to study these specifications as a class rather than individually. In addition, it enables us to create specifications that use other natural self-referential claims such as ``you are unlikely to be able to fix the engine'', ``you are likely to be successful'', or ``you are unlikely to make money on the stock market''. Such claims are natural in everyday conversation, partly because they efficiently communicate aspects of the problem at hand, describing essential properties of the engine or of market conditions without describing in intricate detail how those systems work. As such, their use is critical to the efficient solution of a wide range of specification design problems.
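One purely illustrative way to see the self-reference in such claims is that they condition the designer's beliefs on the policy itself. For example, under the reading of maximizing worst-case utility as ``the world is one of the worst for your policy'', the induced belief over the unspecified feature $\theta$ (say, ranging over a finite set of possibilities) depends on the policy $\pi$ being evaluated,
\[
  p(\theta \mid \pi) \;\propto\; \mathbf{1}\bigl[\theta \in \arg\min_{\theta'} U(\pi, \theta')\bigr]\, p(\theta),
\]
so that maximizing expected utility under this policy-dependent belief recovers the maximin objective. This sketch is not the formalism developed later, but it conveys why such claims cannot be expressed as ordinary, policy-independent evidence.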
While there are numerous subtleties in integrating self-referential claims into the Bayesian framework, they can be carefully navigated by developing two formalisms analogous to causal and evidential decision theory. We characterize their edge cases and provide general conditions under which these formalisms produce equivalent results. Thus we can use self-referential claims to build the natural, efficient, and robust specifications we need as AI designers while maintaining philosophical and mathematical rigor.
Finally, I discuss how the problem specifications developed in this theory can be solved effectively by scalable RL systems through the automatic design of training environments. We formalize this approach as Unsupervised Environment Design~(UED) and show that it can find policies consistent with the theory of self-referential claims. Moreover, we propose an algorithm, Protagonist Antagonist Induced Regret Environment Design~(PAIRED), which finds minimax regret strategies as the Nash equilibrium of a three-agent system. We also show that this approach promotes transfer of the resulting policy and yields a natural automatic curriculum of increasing complexity.
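As a minimal sketch of PAIRED's three-agent structure (the helpers \texttt{generate\_environment}, \texttt{rollout\_return}, and \texttt{reinforce} are hypothetical placeholders for the environment generator, policy rollouts, and RL updates used in the full algorithm), one iteration looks roughly as follows:
\begin{verbatim}
# Schematic sketch of one PAIRED iteration (hypothetical helper functions;
# the full algorithm performs RL updates for all three agents).
def paired_iteration(env_adversary, protagonist, antagonist):
    # 1. The environment adversary proposes a training environment.
    env = env_adversary.generate_environment()

    # 2. Both student agents act in the proposed environment.
    protagonist_return = rollout_return(protagonist, env)
    antagonist_return = rollout_return(antagonist, env)

    # 3. Regret is approximated by the gap between the two returns.
    regret = antagonist_return - protagonist_return

    # 4. The adversary is rewarded with the regret, the antagonist with its
    #    own return, and the protagonist with its own return (equivalently,
    #    the protagonist is trained to minimize the regret estimate).
    env_adversary.reinforce(regret)
    antagonist.reinforce(antagonist_return)
    protagonist.reinforce(protagonist_return)
\end{verbatim}
At the Nash equilibrium of this game, the adversary proposes the highest-regret environments it can, and the protagonist plays a minimax regret strategy against them.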
Overall, by analyzing the specification design problem, understanding the effectiveness of self-referential claims in problem specifications, and enabling their scalable implementation through unsupervised environment design, this thesis lays the groundwork for a study of problem specification that is well-grounded theoretically, empirically, and practically.