
Provably Efficient Offline Reinforcement Learning With Trajectory-Wise Reward

Published: 01 September 2024

Abstract

The remarkable success of reinforcement learning (RL) heavily relies on observing the reward of every visited state-action pair. In many real-world applications, however, an agent can observe only a score that represents the quality of the whole trajectory, referred to as the trajectory-wise reward. In such a setting, standard RL methods struggle to make good use of the trajectory-wise reward, and policy evaluation can incur large bias and variance errors. In this work, we propose a novel offline RL algorithm, called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED), which decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution and then performs pessimistic value iteration based on the learned proxy rewards. To ensure that the value functions constructed by PARTED are always pessimistic with respect to the optimal ones, we design a new penalty term to offset the uncertainty of the proxy reward. We first show that PARTED achieves an $\tilde{\mathcal{O}}(dH^{3}/\sqrt{N})$ suboptimality for linear MDPs, where $d$ is the feature dimension, $H$ is the episode length, and $N$ is the size of the offline dataset. We further extend our algorithm and results to general large-scale episodic MDPs with neural network function approximation. To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient for general MDPs with trajectory-wise reward.
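
To make the two stages in the abstract concrete, the following is a minimal NumPy sketch under simplifying assumptions that are not taken from the paper: per-step features phi(s_h, a_h) are given as plain vectors, the trajectory return is modeled as the sum of linear per-step proxy rewards fitted by ridge-regularized least squares, and pessimism is imposed by subtracting an elliptical bonus sqrt(phi^T Lambda^{-1} phi) scaled by illustrative coefficients beta and gamma_pen (one for the usual value uncertainty, one for the proxy-reward uncertainty). The function names redistribute_rewards and pessimistic_backward_pass are hypothetical, and the backward pass only evaluates pessimistic values along the logged trajectories rather than reproducing the full PARTED algorithm.

```python
# Minimal sketch (not the authors' implementation) of the two stages described
# in the abstract: least-squares reward redistribution followed by a
# pessimistic backward pass with an uncertainty penalty. All names and the
# penalty coefficients are illustrative assumptions.
import numpy as np

def redistribute_rewards(traj_features, traj_returns, lam=1.0):
    """Fit theta so that the summed per-step proxy rewards <phi(s_h, a_h), theta>
    match each observed trajectory-wise return (ridge-regularized least squares)."""
    # traj_features: (N, H, d) per-step features; traj_returns: (N,) trajectory scores
    Z = traj_features.sum(axis=1)                    # (N, d): features summed over each trajectory
    A = Z.T @ Z + lam * np.eye(Z.shape[1])           # regularized Gram matrix
    return np.linalg.solve(A, Z.T @ traj_returns)    # closed-form least-squares solution

def pessimistic_backward_pass(traj_features, theta, beta, gamma_pen, lam=1.0):
    """Evaluate pessimistic values along the logged trajectories: each step's
    learned proxy reward is discounted by an elliptical uncertainty bonus."""
    N, H, d = traj_features.shape
    phi_flat = traj_features.reshape(N * H, d)
    Lam = phi_flat.T @ phi_flat + lam * np.eye(d)    # empirical feature covariance
    Lam_inv = np.linalg.inv(Lam)

    def bonus(phi):
        return np.sqrt(phi @ Lam_inv @ phi)          # sqrt(phi^T Lambda^{-1} phi)

    Q = np.zeros((N, H))
    for h in reversed(range(H)):                     # backward in the horizon
        for i in range(N):
            phi = traj_features[i, h]
            r_hat = phi @ theta                      # learned per-step proxy reward
            penalty = (beta + gamma_pen) * bonus(phi)  # value + proxy-reward uncertainty
            next_v = Q[i, h + 1] if h + 1 < H else 0.0
            Q[i, h] = max(0.0, r_hat + next_v - penalty)  # truncate below at 0 (values assumed nonnegative)
    return Q

# Toy usage: 100 trajectories of length 5 with 4-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 5, 4))
true_theta = np.array([1.0, -0.5, 0.2, 0.0])
returns = feats.sum(axis=1) @ true_theta + 0.1 * rng.normal(size=100)
theta_hat = redistribute_rewards(feats, returns)
Q_pess = pessimistic_backward_pass(feats, theta_hat, beta=1.0, gamma_pen=0.5)
print(theta_hat.round(2), Q_pess.shape)
```

The closed-form ridge solution mirrors the least-squares reward redistribution described in the abstract; in the paper's analysis the penalty coefficients would be set from confidence radii rather than hand-picked as they are here.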


Published In

IEEE Transactions on Information Theory, Volume 70, Issue 9, September 2024, 636 pages

Publisher

IEEE Press

Publication History

Published: 01 September 2024

Qualifiers

  • Research-article
