Abstract
Multi-agent reinforcement learning is a diverse and highly active field of research. Parameter sharing and experience sharing have recently been introduced into multi-agent reinforcement learning to accelerate the training of multiple neural networks and to improve final returns. However, implementing parameter or experience sharing in multi-agent environments can impose additional constraints or computational costs. This work presents a preference-based experience sharing scheme that allows for different policies in environments with weakly homogeneous agents and requires almost no additional computation. In this scheme, the experience replay buffer is augmented with a choice vector indicating each agent's preferred target, so that an agent can learn from the experience collected by other agents that chose the same target. PSE-MADDPG, an off-policy algorithm built on this preference-based experience sharing scheme, is proposed and benchmarked on a multi-target assignment and cooperative navigation mission. Experimental results show that PSE-MADDPG solves the multi-target assignment problem and outperforms two classical deep reinforcement learning algorithms, learning in fewer steps and converging to higher episode rewards. Moreover, PSE-MADDPG relaxes the assumption of strongly homogeneous agents and incurs little additional computational cost.
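As an illustration of the sharing mechanism summarised above, the following minimal Python sketch (not the authors' released implementation; the class and method names are hypothetical) shows how a replay buffer can carry a per-transition choice vector so that agents preferring the same target may sample one another's experience.

```python
import random
from collections import deque


class PreferenceReplayBuffer:
    """Replay buffer whose transitions carry the collecting agent's target choice."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done, choice):
        # 'choice' identifies the preferred target, e.g. a one-hot vector.
        self.buffer.append((obs, action, reward, next_obs, done, tuple(choice)))

    def sample(self, batch_size, choice=None):
        # With a choice given, sample only transitions collected by agents that
        # preferred the same target; otherwise sample from the whole buffer.
        pool = list(self.buffer) if choice is None else [
            t for t in self.buffer if t[5] == tuple(choice)
        ]
        return random.sample(pool, min(batch_size, len(pool)))


# Two agents that both chose target 0 can reuse each other's transitions.
buf = PreferenceReplayBuffer()
buf.add([0.1, 0.2], [0.5], 1.0, [0.2, 0.3], False, choice=[1, 0])
buf.add([0.4, 0.1], [0.3], 0.5, [0.5, 0.2], False, choice=[1, 0])
batch = buf.sample(batch_size=2, choice=[1, 0])
```

In this sketch the filtering by choice vector is the only addition to an ordinary replay buffer, which is consistent with the claim that the scheme adds little computational cost.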
Data availability
The data and materials that support the findings of this study are available from the corresponding author upon reasonable request.
Code availability
The code is available in the GitHub repository 'PSE-MADDPG': https://github.com/guanzhongzx/PSE-MADDPG.
Funding
The first and second authors are supported by the National Natural Science Foundation of China (No. 61790552).
Author information
Authors and Affiliations
Contributions
XZ contributed to designing the method, running the experiments, and writing all sections. ZL contributed to providing feedback and guidance. PZ contributed to the conceptualisation and formal analysis. HL contributed to the methodology. All authors contributed to the results analysis and the manuscript revision.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Compatible function approximation
We consider off-policy MARL methods that learn a deterministic target policy \(\mu _{\theta }(s)\) from trajectories generated by an arbitrary stochastic behaviour policy \(\pi (s, a)\). As in the stochastic case, we seek a critic \(Q^{w}(s, a)\) such that the gradient \(\nabla _a Q^{\mu }(s, a)\) can be replaced by \(\nabla _a Q^w(s, a)\) without affecting the deterministic policy gradient. The following theorem applies to both the on-policy case, \({\mathbb {E}}[\cdot ]={\mathbb {E}}_{s \sim \rho ^{\mu }}[\cdot ]\), and the off-policy case, \({\mathbb {E}}[\cdot ]={\mathbb {E}}_{s \sim \rho ^{\beta }}[\cdot ]\).
Theorem 1
A function approximator \(Q^{w}(s, a)\) is compatible with a deterministic policy \(\mu _{\theta }(s)\), in the sense that \(\nabla _\theta J_{\beta }(\theta ) = {\mathbb {E}}\left[ \left. \nabla _\theta \mu _\theta (s) \nabla _a Q^w(s, a)\right| _{a=\mu _\theta (s)}\right]\), if:
(i) \(\left. \nabla _a Q^w(s, a)\right| _{a=\mu _\theta (s)}=\nabla _\theta \mu _\theta (s)^{\top } w\), and
(ii) \(w\) minimises the mean-squared error \(\text {MSE}(\theta , w)= {\mathbb {E}}\left[ \epsilon (s; \theta , w)^{\top } \epsilon (s; \theta , w)\right]\), where \(\epsilon (s; \theta , w) = \left. \nabla _a Q^w(s, a)\right| _{a=\mu _\theta (s)}-\left. \nabla _a Q^\mu (s, a)\right| _{a=\mu _\theta (s)}\).
Proof
If \(w\) minimises the MSE, then the gradient of \(\text {MSE}(\theta , w)\) with respect to \(w\) must be zero. By condition (i), \(\nabla _w \epsilon (s; \theta , w)=\nabla _\theta \mu _\theta (s)\), so \(0 = \nabla _w \text {MSE}(\theta , w) = 2\,{\mathbb {E}}\left[ \nabla _\theta \mu _\theta (s)\,\epsilon (s; \theta , w)\right]\). Rearranging gives \({\mathbb {E}}\left[ \left. \nabla _\theta \mu _\theta (s) \nabla _a Q^w(s, a)\right| _{a=\mu _\theta (s)}\right] = {\mathbb {E}}\left[ \left. \nabla _\theta \mu _\theta (s) \nabla _a Q^\mu (s, a)\right| _{a=\mu _\theta (s)}\right] = \nabla _\theta J_{\beta }(\theta )\). \(\square\)
For any deterministic policy \(\mu _{\theta }(s)\), there always exists a compatible function approximator of the form \(Q^w(s, a) = \left( a-\mu _\theta (s)\right) ^{\top } \nabla _\theta \mu _\theta (s)^{\top } w + V^v(s)\), where \(V^v(s)\) is any differentiable baseline function independent of the action \(a\); for example, a linear combination of state features, \(V^v(s) = v^{\top }\phi (s)\), with parameters \(v\). Note that such a linear function approximator is of limited use for predicting action values globally, since its value diverges to \(\pm \infty\) for large actions. Nevertheless, it is sufficient for selecting the direction in which the actor should adjust its policy parameters.
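To make the construction concrete, the following numerical sketch (illustrative only; the toy policy, the suppressed state dependence, and all names are assumptions, not part of the paper) checks condition (i) for a critic of this compatible form.

```python
import numpy as np

rng = np.random.default_rng(0)
n_theta, n_action = 5, 2
M = rng.normal(size=(n_action, n_theta))  # fixed mixing matrix of the toy policy


def mu(theta):
    # Toy deterministic policy: actions are a fixed linear mix of the parameters.
    return M @ theta


def jacobian_mu(theta, eps=1e-6):
    # Finite-difference Jacobian d(mu)/d(theta), shape (n_theta, n_action).
    J = np.zeros((n_theta, n_action))
    for i in range(n_theta):
        d = np.zeros(n_theta)
        d[i] = eps
        J[i] = (mu(theta + d) - mu(theta - d)) / (2 * eps)
    return J


def q_w(a, theta, w, baseline=0.0):
    # Compatible critic Q^w(s,a) = (a - mu_theta(s))^T J^T w + V^v(s).
    J = jacobian_mu(theta)
    return (a - mu(theta)) @ (J.T @ w) + baseline


theta = rng.normal(size=n_theta)
w = rng.normal(size=n_theta)
a0 = mu(theta)

# grad_a Q^w at a = mu_theta(s), by central finite differences.
eps = 1e-6
grad_a = np.array([
    (q_w(a0 + eps * e, theta, w) - q_w(a0 - eps * e, theta, w)) / (2 * eps)
    for e in np.eye(n_action)
])

# Condition (i): this gradient equals J^T w, independent of the baseline.
assert np.allclose(grad_a, jacobian_mu(theta).T @ w, atol=1e-5)
```

Because \(Q^w\) is linear in \(a\), the check holds for any baseline value; only the direction term \(\nabla _\theta \mu _\theta (s)^{\top } w\) influences the actor update.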
Appendix B: The pseudocode of RSE-MADDPG
The RSE-MADDPG algorithm is used as a baseline for the proposed method.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zuo, X., Zhang, P., Li, HY. et al. Preference-based experience sharing scheme for multi-agent reinforcement learning in multi-target environments. Evolving Systems 15, 1681–1699 (2024). https://doi.org/10.1007/s12530-024-09587-4