
Preference-based experience sharing scheme for multi-agent reinforcement learning in multi-target environments

  • Original Paper
  • Published in: Evolving Systems

Abstract

Multi-agent reinforcement learning is a varied and highly active field of research. Parameter sharing and experience sharing have recently been introduced into multi-agent reinforcement learning to accelerate the training of multiple neural networks and improve final returns. However, implementing parameter or experience sharing in multi-agent environments can introduce additional constraints or computational costs. This work presents a preference-based experience sharing scheme, which allows for different policies in environments with weakly homogeneous agents and requires barely any additional computational power. In this scheme, the experience replay buffer is augmented with a choice vector indicating each agent's preferred target, and each agent can learn its own policy from experience data collected by other agents that choose the same target. PSE-MADDPG, an off-policy algorithm built on this preference-based experience sharing scheme, is proposed and benchmarked on a multi-target assignment and cooperative navigation mission. Experimental results show that PSE-MADDPG successfully solves the multi-target assignment problem and outperforms two classical deep reinforcement learning algorithms, learning in fewer steps and converging to higher episode rewards. Meanwhile, PSE-MADDPG relaxes the strong-homogeneity assumption on agents and incurs little additional computational cost.
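To make the scheme concrete, the sketch below augments a standard replay buffer with a per-transition choice entry recording the agent's preferred target, so that a minibatch for one agent can be drawn from transitions of every agent that chose the same target. It is a minimal illustration under our own assumptions: the class and method names (PreferenceReplayBuffer, sample_for_target) are hypothetical and do not reproduce the authors' implementation, which is linked under "Code availability" below.

```python
# Minimal sketch of a preference-augmented replay buffer (illustrative only;
# names and layout are assumptions, not the authors' PSE-MADDPG code).
import random
from collections import deque


class PreferenceReplayBuffer:
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done, choice):
        """Store one transition; `choice` is the index of the agent's preferred target."""
        self.storage.append((obs, action, reward, next_obs, done, choice))

    def sample_for_target(self, target, batch_size):
        """Draw a minibatch only from transitions whose agents chose `target`,
        regardless of which agent collected them (the experience-sharing step)."""
        pool = [t for t in self.storage if t[5] == target]
        return random.sample(pool, min(batch_size, len(pool)))
```

An agent preferring target k would then call sample_for_target(k, batch_size) at update time and train on experience gathered by any teammate with the same preference.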

Data availability

The data and materials that support the findings of this study are available from the corresponding author upon reasonable request.

Code availability

The code is available in a GitHub repository named 'PSE-MADDPG': https://github.com/guanzhongzx/PSE-MADDPG.


Funding

The first and second authors are supported by the National Natural Science Foundation of China (No. 61790552).

Author information

Contributions

XZ contributed to designing the method, running the experiments and writing all sections. ZL contributed to providing feedback and guidance. PZ contributed to the conceptualisation and formal analysis. HL contributed to the methodology. All authors contributed to the results analysis and the manuscript revision.

Corresponding author

Correspondence to Xuan Zuo.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Compatible function approximation

We consider off-policy MARL methods that learn a deterministic target policy \(\mu _{\theta }(s)\) from trajectories generated by an arbitrary stochastic behaviour policy \(\pi (s, a)\). As in the stochastic case, we seek a critic \(Q^{w}(s, a)\) such that the gradient \(\nabla _a Q^{\mu }(s, a)\) can be replaced by \(\nabla _a Q^w(s, a)\) without changing the deterministic policy gradient. The following theorem applies to both the on-policy case, \({\mathbb {E}}[\cdot ]={\mathbb {E}}_{s \sim \rho ^{\mu }}[\cdot ]\), and the off-policy case, \({\mathbb {E}}[\cdot ]={\mathbb {E}}_{s \sim \rho ^{\beta }}[\cdot ]\).

Theorem 1

A function approximator \(Q^{w}(s, a)\) is compatible with a deterministic policy \(\mu _{\theta }(s)\), \(\nabla _\theta J_{\beta }(\theta ) = {\mathbb {E}}\left[ \left. \nabla _\theta \mu _\theta (s) \nabla _a Q^w(s, a)\right| _{a=\mu _\theta (s)}\right]\), if:

  1. (i)

    \(\left. \nabla _a Q^w(s, a)\right| _{a=\mu _\theta (s)}=\nabla _\theta \mu _\theta (s)^{\top } w\) and

  2. (ii)

    w minimises the mean-squared error, \(\text {MSE}(\theta , w)= {\mathbb {E}}\left[ \epsilon (s; \theta , w)^{\top } \epsilon (s; \theta , w)\right]\) where \(\epsilon (s; \theta , w) = \left. \nabla _a Q^w(s, a)\right| _{a=\mu _\theta (s)}-\left. \nabla _a Q^\mu (s, a)\right| _{a=\mu _\theta (s)}\).

Proof

If w minimises the MSE, then the gradient of \(\text {MSE}(\theta , w)\) with respect to w must be zero. Using the fact that, by condition (i), \(\nabla _w \epsilon (s; \theta , w)=\nabla _\theta \mu _\theta (s)\), we obtain

$$\begin{aligned} \begin{aligned} \nabla _w \text {MSE}(\theta , w)&= 0 \\ {\mathbb {E}}\left[ \nabla _\theta \mu _\theta (s) \epsilon (s ; \theta , w)\right]&= 0 \\ {\mathbb {E}}\left[ \left. \nabla _\theta \mu _\theta (s) \nabla _a Q^w(s, a)\right| _{a=\mu _\theta (s)}\right]&= {\mathbb {E}}\left[ \left. \nabla _\theta \mu _\theta (s) \nabla _a Q^\mu (s, a)\right| _{a=\mu _\theta (s)}\right] \\&= \nabla _\theta J_\beta \left( \mu _\theta \right) \text{ or } \nabla _\theta J\left( \mu _\theta \right) \end{aligned} \end{aligned}$$

For any deterministic policy \(\mu _{\theta }(s)\), there always exists a compatible function approximator of the form \(Q^w(s, a) = \left( a-\mu _\theta (s)\right) ^{\top } \nabla _\theta \mu _\theta (s)^{\top } w + V^v(s)\), where \(V^v(s)\) may be any differentiable baseline function that is independent of the action a, for example a linear combination of state features, \(V^v(s) = v^{\top }\phi (s)\), with parameters v. Such a linear function approximator is not very useful for predicting action-values globally, since the action value diverges to \(\pm \infty\) for large actions; nevertheless, it is sufficient for selecting the direction in which the actor should adjust its policy parameters. \(\square\)
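As a quick numerical check of condition (i) for this compatible form, the snippet below (not from the paper; the linear policy \(\mu _\theta (s)=\Theta s\), the dimensions and all variable names are illustrative assumptions) compares a finite-difference gradient of \(Q^w\) with respect to the action, evaluated at \(a=\mu _\theta (s)\), against \(\nabla _\theta \mu _\theta (s)^{\top } w\).

```python
# Numerical check of condition (i): grad_a Q^w(s, a) at a = mu_theta(s)
# equals grad_theta mu_theta(s)^T w for the compatible critic above.
# Illustrative sketch only; policy, dimensions and names are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_state, n_action = 4, 2
Theta = rng.normal(size=(n_action, n_state))   # linear policy mu_theta(s) = Theta @ s
w = rng.normal(size=n_action * n_state)        # critic weights, dim = dim(theta)
v = rng.normal(size=n_state)                   # baseline weights, V^v(s) = v^T s

s = rng.normal(size=n_state)
mu = Theta @ s                                 # deterministic action mu_theta(s)
G = np.kron(np.eye(n_action), s).T             # grad_theta mu_theta(s), shape (dim theta, dim a)

def Q(a):
    """Compatible critic Q^w(s, a) = (a - mu)^T G^T w + V^v(s)."""
    return (a - mu) @ (G.T @ w) + v @ s

# Finite-difference gradient of Q with respect to the action, at a = mu.
eps = 1e-6
num_grad = np.array([(Q(mu + eps * e) - Q(mu - eps * e)) / (2 * eps)
                     for e in np.eye(n_action)])
print(np.allclose(num_grad, G.T @ w, atol=1e-5))   # True: condition (i) holds
```

Because \(Q^w\) is linear in the action, the finite-difference gradient matches \(\nabla _\theta \mu _\theta (s)^{\top } w\) up to rounding, which is exactly what condition (i) requires.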

Appendix B: The pseudocode of RSE-MADDPG

The RSE-MADDPG algorithm is used as a baseline for the proposed method.

(Pseudocode of the RSE-MADDPG algorithm: figure not reproduced.)
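Since the pseudocode figure is not reproduced here, the following sketch outlines a generic MADDPG-style update (centralised critics over joint observations and actions, decentralised actors, soft target updates) purely for orientation. It is not the authors' RSE-MADDPG and omits its experience-sharing step; all sizes, names and the dummy minibatch are assumptions.

```python
# Generic MADDPG-style update for all agents (NOT the authors' RSE-MADDPG;
# experience sharing is omitted and every size/name here is illustrative).
import torch
import torch.nn as nn

n_agents, obs_dim, act_dim, batch = 3, 8, 2, 32
gamma, tau = 0.95, 0.01

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actors = [mlp(obs_dim, act_dim) for _ in range(n_agents)]
critics = [mlp(n_agents * (obs_dim + act_dim), 1) for _ in range(n_agents)]
target_actors = [mlp(obs_dim, act_dim) for _ in range(n_agents)]
target_critics = [mlp(n_agents * (obs_dim + act_dim), 1) for _ in range(n_agents)]
for net, tgt in zip(actors + critics, target_actors + target_critics):
    tgt.load_state_dict(net.state_dict())
actor_opts = [torch.optim.Adam(a.parameters(), lr=1e-3) for a in actors]
critic_opts = [torch.optim.Adam(c.parameters(), lr=1e-3) for c in critics]

# Dummy minibatch standing in for samples drawn from the replay buffer.
obs = torch.randn(batch, n_agents, obs_dim)
acts = torch.randn(batch, n_agents, act_dim)
rews = torch.randn(batch, n_agents)
next_obs = torch.randn(batch, n_agents, obs_dim)
dones = torch.zeros(batch, n_agents)

for i in range(n_agents):
    # Critic update: the TD target uses every agent's target actor for next actions.
    with torch.no_grad():
        next_acts = torch.stack([target_actors[j](next_obs[:, j]) for j in range(n_agents)], dim=1)
        tgt_in = torch.cat([next_obs.reshape(batch, -1), next_acts.reshape(batch, -1)], dim=1)
        y = rews[:, i] + gamma * (1 - dones[:, i]) * target_critics[i](tgt_in).squeeze(-1)
    q_in = torch.cat([obs.reshape(batch, -1), acts.reshape(batch, -1)], dim=1)
    critic_loss = nn.functional.mse_loss(critics[i](q_in).squeeze(-1), y)
    critic_opts[i].zero_grad()
    critic_loss.backward()
    critic_opts[i].step()

    # Actor update: ascend the centralised Q with agent i's action recomputed by its actor.
    new_act_i = actors[i](obs[:, i])
    cur_acts = torch.stack([new_act_i if j == i else acts[:, j] for j in range(n_agents)], dim=1)
    pg_in = torch.cat([obs.reshape(batch, -1), cur_acts.reshape(batch, -1)], dim=1)
    actor_loss = -critics[i](pg_in).mean()
    actor_opts[i].zero_grad()
    actor_loss.backward()
    actor_opts[i].step()

    # Soft-update the target networks toward the learned networks.
    for net, tgt in ((actors[i], target_actors[i]), (critics[i], target_critics[i])):
        for p, tp in zip(net.parameters(), tgt.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```

A preference-based variant would additionally tag each stored transition with the agent's chosen target and draw each agent's minibatch from transitions sharing that target, as sketched after the abstract.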

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zuo, X., Zhang, P., Li, HY. et al. Preference-based experience sharing scheme for multi-agent reinforcement learning in multi-target environments. Evolving Systems 15, 1681–1699 (2024). https://doi.org/10.1007/s12530-024-09587-4
