Abstract: Actor-critic algorithms such as DDPG, TD3, and SAC, which are built on Silver's deterministic policy gradient theorem, are among the most successful reinforcement-learning methods, but their mathematical basis is not entirely clear. In particular, the critic networks in these algorithms learn to estimate action-value functions by a “bootstrapping” technique based on Bellman error, and it is unclear why this approach works so well in practice, given that Bellman error is only very loosely related to value error, i.e., to the inaccuracy of the action-value estimate. Here we show that policy training in this class of actor-critic methods depends not on the accuracy of the critic's action-value estimate but on how well the critic estimates the gradient of the action-value, which is better assessed using what we call difference error. We show that this difference error is closely related to the Bellman error, a finding that helps to explain why Bellman-based bootstrapping leads to good policies. Further, we show that value error and difference error exhibit different dynamics along on-policy trajectories through state-action space: value error is a low-pass anticausal (i.e., backward-in-time) filter of Bellman error, and therefore accumulates along trajectories, whereas difference error is a high-pass filter of Bellman error. It follows that techniques which reduce the high-frequency Fourier components of the Bellman error may improve policy training even if they increase the actual size of the Bellman errors. These findings help to explain certain aspects of actor-critic methods that are otherwise theoretically puzzling, such as the use of policy (as distinct from exploratory) noise, and they suggest other measures that may improve these methods.
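Illustrative note (not part of the submission's code): a minimal NumPy sketch of the accumulation claim in the abstract, assuming the standard definitions of value error and Bellman error for a fixed deterministic policy and deterministic dynamics. Under those assumptions the per-step errors satisfy e_t = delta_t + gamma * e_{t+1}, so the value error is an anticausal discounted sum of future Bellman errors along an on-policy trajectory, while the per-step Bellman error is recovered by the (high-pass) difference e_t - gamma * e_{t+1}.

```python
# Hedged sketch: value error accumulates anticausally (backward in time) along a trajectory.
# Assumed definitions: value error   e_t     = Qhat(s_t, a_t) - Q^pi(s_t, a_t)
#                      Bellman error delta_t = Qhat(s_t, a_t) - (r_t + gamma * Qhat(s_{t+1}, pi(s_{t+1})))
# For deterministic dynamics and a fixed deterministic policy these satisfy
# e_t = delta_t + gamma * e_{t+1}, i.e. e_t is a discounted sum of future Bellman errors.
import numpy as np

gamma = 0.99
T = 200
rng = np.random.default_rng(0)

# Hypothetical per-step Bellman errors of some critic along one on-policy trajectory.
delta = rng.normal(scale=0.1, size=T)

# Anticausal accumulation: e_t = sum_{k>=0} gamma^k * delta_{t+k} (terminal error taken as 0).
e = np.zeros(T + 1)
for t in reversed(range(T)):
    e[t] = delta[t] + gamma * e[t + 1]

# The difference e_t - gamma * e_{t+1} recovers the per-step Bellman error exactly.
assert np.allclose(e[:-1] - gamma * e[1:], delta)
print("max |value error|  :", np.abs(e).max())      # typically much larger than ...
print("max |Bellman error|:", np.abs(delta).max())  # ... the individual Bellman errors
```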
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: A) Introduction and scope of the paper: We
have expanded the introduction to clarify the problem we address in
the paper, the purpose of the article (which is to provide theoretical
results that explain certain empirical facts reported in the literature),
and the setting in which we develop our results. We have also modified
the last section of the paper to provide a clearer outlook on our
results, as discussed in part (D) below.
B) Motivation of the difference error: Two
reviewers asked for more motivation for our use of the difference
error. Accordingly, we have rewritten parts of Section 3.1 and provided
a new Appendix A with more details.
C) Figure 3.1, spatial factors affecting
temporal frequency: As requested, we have shortened the figure caption
and added several paragraphs about the figure to the main text. This
revised text provides more details about the objects depicted in this
figure, including the precise functions used for the reward and the
policy, as well as the type of function approximator used as a critic.
D) Figure 4.1, variants of TD3: This example
seems to have caused some confusion. Its purpose was to illustrate
briefly one idea of how our mathematical results might be applied
in future work; it was by no means an experimental evaluation of
our results or a thorough test of a new algorithm. We have now moved
it to a new Appendix C, where we more clearly articulate its purpose.
If the editor and the reviewers still consider it a source of confusion,
we are happy to remove the figure and accompanying text from the paper.
E) We have also rewritten section 4 and parts of other sections to clarify
that this is a theoretical paper whose purpose is to explain
why current techniques and practices lead to the surprisingly
good performance of DPG-based actor-critic algorithms.
We are well aware that there is a gap to bridge between our theoretical
results and algorithm design, and we prefer to address these topics
properly in future work.
We respond to the reviewers separately, below their official reports.
Code: https://github.com/mathfrak-g-hat/Error_Bounds_Dynamics_Bootstrapping_ACRL
Assigned Action Editor: ~Pablo_Samuel_Castro1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 1432