Abstract
Hierarchical reinforcement learning (HRL) is a promising way to extend traditional reinforcement learning to more complex tasks, as it alleviates long-horizon reward sparsity and the credit-assignment problem. However, existing HRL methods are trained from scratch for a specific environment and target task each time, which leads to low sample utilization. In addition, an agent's low-level sub-policies interfere with one another when transferred, which degrades policy stability. To address these issues, this paper proposes an HRL method, Relabeling and Policy Distillation of Hierarchical Reinforcement Learning (R-PD-HRL), which integrates meta-learning, shared reward relabeling, and policy distillation to accelerate learning and improve the policy stability of the agent. During training, a reward relabeling module acts on the experience buffer: different reward functions relabel each interaction trajectory so that it can also be used to train other tasks drawn from the same task distribution. At the low level, policy distillation compresses the low-level sub-policies, reducing interference between policies while preserving the behavior of the original sub-policies. Finally, depending on the task, the high-level policy invokes the appropriate low-level policy to make decisions. Experiments in both continuous and discrete state-action environments show that, compared with other methods, the improved sample utilization greatly accelerates learning, with a success rate as high as 0.6.
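The two mechanisms named in the abstract can be sketched briefly. The following Python code is not the authors' implementation; it is a minimal sketch, under stated assumptions, of (1) relabeling a stored trajectory with the reward functions of other tasks from the same task distribution and (2) distilling several low-level sub-policies into a single student policy with a KL objective. All names (Transition, relabel_buffer, distill_step, reward_fns) are hypothetical and introduced only for illustration.

```python
# Hedged sketch (not the paper's code): reward relabeling of a shared buffer
# and KL-based distillation of low-level sub-policies. All identifiers here
# are illustrative assumptions, not names from R-PD-HRL.

from dataclasses import dataclass
from typing import Callable, Dict, List

import torch
import torch.nn.functional as F


@dataclass
class Transition:
    state: torch.Tensor
    action: torch.Tensor
    reward: float
    next_state: torch.Tensor
    done: bool


def relabel_buffer(
    trajectory: List[Transition],
    reward_fns: Dict[str, Callable[[torch.Tensor, torch.Tensor, torch.Tensor], float]],
) -> Dict[str, List[Transition]]:
    """Reuse one interaction trajectory for every task in the distribution
    by recomputing its rewards with each task's reward function."""
    relabeled: Dict[str, List[Transition]] = {}
    for task_name, reward_fn in reward_fns.items():
        relabeled[task_name] = [
            Transition(
                state=t.state,
                action=t.action,
                reward=reward_fn(t.state, t.action, t.next_state),
                next_state=t.next_state,
                done=t.done,
            )
            for t in trajectory
        ]
    return relabeled


def distill_step(
    student: torch.nn.Module,
    teachers: List[torch.nn.Module],
    states: torch.Tensor,
    optimizer: torch.optim.Optimizer,
    temperature: float = 1.0,
) -> float:
    """One distillation update: pull the student's action distribution toward
    each teacher sub-policy's distribution on the same batch of states."""
    optimizer.zero_grad()
    student_log_probs = F.log_softmax(student(states) / temperature, dim=-1)
    loss = torch.zeros((), device=states.device)
    for teacher in teachers:
        with torch.no_grad():
            teacher_probs = F.softmax(teacher(states) / temperature, dim=-1)
        # KL(teacher || student), averaged over the batch
        loss = loss + F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the spirit of the abstract, relabel_buffer would be applied to the shared experience buffer after each rollout so that one trajectory trains every task in the distribution, and distill_step would be run periodically to compress the low-level sub-policies into one network before the high-level policy selects among the resulting behaviors.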
Data availability
The data are not publicly available because they are needed for follow-up research.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (61673084) and a project of the Liaoning Provincial Department of Education (2021LJKZ1180).
Author information
Contributions
The authors made the following contributions. Zou: acted as the main supervisor and project leader of the research, provided financial support, participated in discussions and decisions on the research direction, and made important revisions to the paper. Zhao: designed the overall research framework and experimental scheme, set up and ran the experiments, performed the data analysis and statistical processing, and wrote the methods and results. Gao: carried out the literature review and survey of related research, assisted in designing the research methods and experimental protocols, analyzed and interpreted part of the results, and wrote the introduction and related-work sections. Chen: performed data analysis and statistical processing and assisted in revising and proofreading the paper. Liu: provided many constructive suggestions during revision that helped improve the article. Zhang: assisted in the design and execution of the experiments, analyzed and interpreted part of the results, and participated in revising and proofreading the paper. All listed authors have read and approved the final version of the manuscript and agree to its submission to the journal.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zou, Q., Zhao, X., Gao, B. et al. Relabeling and policy distillation of hierarchical reinforcement learning. Int. J. Mach. Learn. & Cyber. 15, 4923–4939 (2024). https://doi.org/10.1007/s13042-024-02192-6