Offline reinforcement learning, a field dedicated to optimizing sequential decision-making strategies using historical data, has found widespread application in real-world scenarios. Recent years have witnessed a surge in research establishing the statistical foundations of offline reinforcement learning, with many studies achieving near-optimal worst-case performance bounds. However, empirical results often significantly outperform these non-adaptive bounds. A comprehensive understanding of which decision processes and behavior policies are inherently more amenable or challenging for offline RL remains elusive. To address this critical challenge, the first part of this thesis delves into instance-dependent offline learning in tabular Markov Decision Processes. We introduce the Adaptive Pessimistic Value Iteration algorithm, which achieves an instance-dependent guarantee while remaining near-optimal in the worst case. This result subsumes a wide spectrum of previous worst-case optimal results, yielding the first instance-dependent guarantee that characterizes the hardness of offline RL.
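To illustrate the pessimism principle underlying such algorithms, the sketch below runs a tabular value-iteration backup in which rarely visited state-action pairs receive a count-based penalty. It is a minimal sketch rather than the exact Adaptive Pessimistic Value Iteration construction: the penalty scale `c` and the empirical model inputs `P_hat`, `r_hat`, and `n` are illustrative assumptions.

```python
import numpy as np

def pessimistic_value_iteration(P_hat, r_hat, n, H, c=1.0):
    """Tabular pessimistic value iteration (illustrative sketch).

    P_hat: (S, A, S) empirical transition probabilities estimated from the data
    r_hat: (S, A) empirical mean rewards
    n:     (S, A) visitation counts of each state-action pair in the dataset
    H:     horizon
    c:     penalty scale (illustrative; the thesis derives data-dependent penalties)
    """
    S, A = r_hat.shape
    # Count-based penalty: rarely visited (s, a) pairs are penalized, so the
    # learned policy avoids actions that the offline data covers poorly.
    bonus = c / np.sqrt(np.maximum(n, 1))
    V = np.zeros(S)
    for _ in range(H):
        Q = r_hat + P_hat @ V - bonus   # pessimistic Bellman backup
        Q = np.clip(Q, 0.0, H)          # keep values in a valid range
        V = Q.max(axis=1)
    policy = Q.argmax(axis=1)
    return Q, V, policy
```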
In the second chapter of the thesis, we extend our study from tabular reinforcement learning to the function approximation regime. Specifically, in the context of linear function approximation, we present the variance-aware pessimistic value iteration (VAPVI) algorithm, which quantifies the uncertainty of training examples through conditional variance reweighting. VAPVI improves upon the best-known existing offline learning bounds. Crucially, our learning bounds are expressed in terms of system-related quantities, offering natural instance-dependent characterizations that previous studies lacked.
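The sketch below is a rough, assumption-laden illustration of the variance-reweighting idea in the linear setting: each regression sample is weighted by an estimated conditional variance, and a pessimistic penalty is formed from the weighted covariance matrix. The variance estimates `sigma2` and the penalty scale `beta` are placeholders, not the exact quantities VAPVI constructs.

```python
import numpy as np

def variance_weighted_pessimistic_q(phi, targets, sigma2, lam=1.0, beta=1.0):
    """Variance-weighted ridge regression with a pessimistic penalty (sketch).

    phi:     (N, d) feature vectors phi(s_i, a_i) of the offline transitions
    targets: (N,)   regression targets r_i + V_next(s_i') for one backup step
    sigma2:  (N,)   estimated conditional variances of the targets (assumed given)
    lam:     ridge regularization strength
    beta:    penalty scale (illustrative; the thesis derives its exact form)
    Returns a function mapping a feature vector to a pessimistic Q-value.
    """
    N, d = phi.shape
    w = 1.0 / np.maximum(sigma2, 1e-8)                        # down-weight noisy samples
    Lambda = lam * np.eye(d) + (phi * w[:, None]).T @ phi     # weighted covariance matrix
    theta = np.linalg.solve(Lambda, (phi * w[:, None]).T @ targets)
    Lambda_inv = np.linalg.inv(Lambda)

    def q_pessimistic(x):
        # Elliptical uncertainty width ||x||_{Lambda^{-1}} serves as the penalty.
        width = np.sqrt(x @ Lambda_inv @ x)
        return x @ theta - beta * width

    return q_pessimistic
```

Intuitively, weighting each sample by its inverse conditional variance concentrates the regression on low-noise transitions, which is what allows the resulting bounds to depend on system-related variance quantities rather than worst-case ranges.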
Furthermore, state-of-the-art algorithms usually leverage powerful function approximators (e.g., neural networks) to alleviate the sample complexity hurdle and achieve better empirical performance. In the third chapter, we broaden our focus to function approximation without imposing specific structural constraints on the function class beyond differentiability. This class naturally encompasses a wide range of models with nonlinear and nonconvex structure. Importantly, we demonstrate the provable efficiency of offline RL with differentiable function approximation through an analysis of the pessimistic fitted Q-learning (PFQL) algorithm. Our findings provide the theoretical underpinnings for understanding various practical heuristics that rely on fitted Q-iteration-style designs.
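As a loose illustration of the fitted Q-iteration backbone that PFQL analyzes, the following PyTorch sketch regresses a differentiable Q-network toward pessimistic Bellman targets. The `penalty` callable stands in for an uncertainty quantifier and is assumed to be supplied; PFQL's actual penalty and finite-horizon formulation differ from this simplified discounted loop.

```python
import torch
import torch.nn as nn

def pessimistic_fitted_q_learning(data, q_net, penalty, num_iters=50, gamma=0.99, lr=1e-3):
    """Fitted Q-iteration with a pessimistic penalty (illustrative sketch).

    data:    dict of tensors with keys 's', 'a', 'r', 's_next' from the offline dataset;
             actions are assumed discrete and q_net(s) returns values for all actions
    q_net:   a differentiable Q-network, e.g. a small nn.Sequential MLP
    penalty: callable penalty(s, a) -> per-sample uncertainty penalty (assumed given;
             this placeholder does not reproduce PFQL's analytically derived quantifier)
    """
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    for _ in range(num_iters):
        with torch.no_grad():
            # Pessimistic Bellman target: the bootstrapped value is reduced by the
            # uncertainty penalty before it is used as the regression target.
            next_q = q_net(data['s_next']).max(dim=1).values
            target = data['r'] + gamma * next_q - penalty(data['s'], data['a'])
        # Fitting step of fitted Q-iteration: squared-error regression to the target.
        q_sa = q_net(data['s']).gather(1, data['a'].unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return q_net
```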
We conclude the thesis by summarizing our contributions and outlining other exciting research directions.