Abstract: Actor-critic algorithms such as DDPG, TD3, and SAC, which are built on Silver's deterministic policy gradient theorem, are among the most successful reinforcement-learning methods, but their mathematical basis is not entirely clear. In particular, the critic networks in these algorithms learn to estimate action-value functions by a “bootstrapping” technique based on Bellman error, and it is unclear why this approach works so well in practice, given that Bellman error is only very loosely related to value error, i.e., to the inaccuracy of the action-value estimate. Here we show that policy training in this class of actor-critic methods depends not on the accuracy of the critic's action-value estimate but on how well the critic estimates the gradient of the action-value, which is better assessed using what we call difference error. We show that this difference error is closely related to the Bellman error, a finding that helps to explain why Bellman-based bootstrapping leads to good policies. Further, we show that value error and difference error exhibit different dynamics along on-policy trajectories through state-action space: value error is a low-pass anticausal (i.e., backward-in-time) filter of Bellman error, and therefore accumulates along trajectories, whereas difference error is a high-pass filter of Bellman error. It follows that techniques which reduce the high-frequency Fourier components of the Bellman error may improve policy training even if they increase the actual size of the Bellman errors. These findings help to explain certain aspects of actor-critic methods that are otherwise theoretically puzzling, such as the use of policy (as distinct from exploratory) noise, and they suggest other measures that may improve these methods.
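Illustrative note (not part of the submission's code): a minimal NumPy sketch of the accumulation claim in the abstract, assuming the standard definitions of value error and Bellman error for a fixed deterministic policy and deterministic dynamics. Under those assumptions the per-step errors satisfy e_t = delta_t + gamma * e_{t+1}, so the value error is an anticausal discounted sum of future Bellman errors along an on-policy trajectory, while the per-step Bellman error is recovered by the (high-pass) difference e_t - gamma * e_{t+1}.

```python
# Hedged sketch: value error accumulates anticausally (backward in time) along a trajectory.
# Assumed definitions: value error   e_t     = Qhat(s_t, a_t) - Q^pi(s_t, a_t)
#                      Bellman error delta_t = Qhat(s_t, a_t) - (r_t + gamma * Qhat(s_{t+1}, pi(s_{t+1})))
# For deterministic dynamics and a fixed deterministic policy these satisfy
# e_t = delta_t + gamma * e_{t+1}, i.e. e_t is a discounted sum of future Bellman errors.
import numpy as np

gamma = 0.99
T = 200
rng = np.random.default_rng(0)

# Hypothetical per-step Bellman errors of some critic along one on-policy trajectory.
delta = rng.normal(scale=0.1, size=T)

# Anticausal accumulation: e_t = sum_{k>=0} gamma^k * delta_{t+k} (terminal error taken as 0).
e = np.zeros(T + 1)
for t in reversed(range(T)):
    e[t] = delta[t] + gamma * e[t + 1]

# The difference e_t - gamma * e_{t+1} recovers the per-step Bellman error exactly.
assert np.allclose(e[:-1] - gamma * e[1:], delta)
print("max |value error|  :", np.abs(e).max())      # typically much larger than ...
print("max |Bellman error|:", np.abs(delta).max())  # ... the individual Bellman errors
```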
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: A) Introduction and scope of the paper: We
have expanded the introduction to clarify the problem we address in
the paper, the purpose of the article (which is to provide theoretical
results that explain certain empirical facts reported in the literature),
and the setting in which we develop our results. We have also modified
the last section of the paper to provide a clearer outlook on our
results, as discussed in part (D) below.
B) Motivation of the difference error: Two
reviewers asked for more motivation for our use of the difference
error. Accordingly, we have rewritten parts of Section 3.1 and provided
a new Appendix A with more details.
C) Figure 3.1, spatial factors affecting
temporal frequency: As requested, we have shortened the figure caption
and added several paragraphs about the figure to the main text. This
revised text provides more details about the objects depicted in this
figure, including the precise functions used for the reward and the
policy, as well as the type of function approximator used as a critic.
D) Figure 4.1, variants of TD3: This example
seems to have caused some confusion. Its purpose was to illustrate
briefly one idea of how our mathematical results might be applied
in future work; it was by no means an experimental evaluation of
our results or a thorough test of a new algorithm. We have now moved
it to a new Appendix C, where we more clearly articulate its purpose.
If the editor and the reviewers still consider it a source of confusion,
we are happy to remove the figure and accompanying text from the paper.
E) We have also rewritten section 4 and parts of other sections to clarify
that this is a theoretical paper whose purpose is to explain
why current techniques and practices lead to the surprisingly
good performance of DPG-based actor-critic algorithms.
We are well aware that there is a gap to bridge between our theoretical
results and algorithm design, and we prefer to address these topics
properly in future work.
We respond to the reviewers separately, below their official reports.
Code: https://github.com/mathfrak-g-hat/Error_Bounds_Dynamics_Bootstrapping_ACRL
Assigned Action Editor: ~Pablo_Samuel_Castro1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 1432