DOI: 10.1145/3431379.3460650
Research article · Open access

Q-adaptive: A Multi-Agent Reinforcement Learning Based Routing on Dragonfly Network

Published: 21 June 2021

Abstract

High-radix interconnects such as Dragonfly and its variants rely on adaptive routing to balance network traffic for optimum performance. Ideally, adaptive routing forwards each packet along the minimal or non-minimal path with the least congestion. In practice, current adaptive routing algorithms estimate path congestion from local information such as output queue occupancy. Using local information to estimate global path congestion is inherently inaccurate because a router has no precise knowledge of link states a few hops away, and this inaccuracy can lead to interconnect congestion. In this study, we present Q-adaptive routing, a multi-agent reinforcement learning routing scheme for Dragonfly systems. Q-adaptive routing enables routers to learn to route autonomously by leveraging advanced reinforcement learning technology. The proposed Q-adaptive routing is highly scalable thanks to its fully distributed nature, which requires no shared information between routers. Furthermore, a new two-level Q-table is designed for Q-adaptive that makes it computationally lightweight and reduces router memory usage by 50% compared with the previous Q-routing. We implement the proposed Q-adaptive routing in the SST/Merlin simulator. Our evaluation results show that Q-adaptive routing achieves up to 10.5% system throughput improvement and 5.2x average packet latency reduction compared with adaptive routing algorithms. Remarkably, Q-adaptive can even outperform the optimal VALn non-minimal routing under the ADV+1 adversarial traffic pattern, with up to 3% system throughput improvement and 75% average packet latency reduction.
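To make the idea concrete, the following Python sketch illustrates the kind of distributed Q-routing-style update that Q-adaptive builds on, with a two-level table in the spirit described above. It is a minimal illustration under assumed details: the class name, the group/router keying of the two-level table, and the use of a neighbor's piggybacked delivery-time estimate are our own assumptions for exposition, not the paper's implementation.

```python
# Hypothetical sketch of a distributed Q-routing-style router agent.
# The two-level table layout below (group-level entries for inter-group traffic,
# router-level entries for intra-group traffic) is an illustrative assumption
# based on the abstract, not the paper's exact data structure.
from collections import defaultdict


class QAdaptiveRouterSketch:
    def __init__(self, neighbors, alpha=0.1):
        self.neighbors = list(neighbors)  # candidate output ports / next-hop routers
        self.alpha = alpha                # learning rate
        # Level 1: estimates keyed by destination *group* (inter-group traffic).
        # Level 2: estimates keyed by destination *router* within the local group.
        self.q_group = defaultdict(lambda: defaultdict(float))
        self.q_local = defaultdict(lambda: defaultdict(float))

    def _table(self, dst_group, dst_router, local_group):
        # Remote destinations are tracked at group granularity, which is what
        # shrinks the table relative to flat per-destination Q-routing.
        if dst_group == local_group:
            return self.q_local[dst_router]
        return self.q_group[dst_group]

    def select_port(self, dst_group, dst_router, local_group):
        # Greedy forwarding: pick the port with the lowest estimated delivery time.
        q = self._table(dst_group, dst_router, local_group)
        return min(self.neighbors, key=lambda n: q[n])

    def update(self, dst_group, dst_router, local_group, next_hop,
               hop_delay, neighbor_best_estimate):
        # Temporal-difference update using only locally observable quantities:
        # the delay incurred on this hop plus the neighbor's own best estimate
        # of the remaining delivery time (e.g., piggybacked on an acknowledgement).
        q = self._table(dst_group, dst_router, local_group)
        target = hop_delay + neighbor_best_estimate
        q[next_hop] += self.alpha * (target - q[next_hop])
```

In this sketch a router calls select_port when forwarding a packet and update when the downstream router returns its current best estimate; no global state is exchanged, mirroring the fully distributed design described in the abstract.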


Cited By

  • (2024) Analysis and prediction of performance variability in large-scale computing systems. The Journal of Supercomputing 80:10, 14978-15005. https://doi.org/10.1007/s11227-024-06040-w (28 Mar 2024)
  • (2023) Adaptive Routing with Hierarchical Reinforcement Learning on Dragonfly Networks. ICC 2023 - IEEE International Conference on Communications, 403-409. https://doi.org/10.1109/ICC45041.2023.10278794 (28 May 2023)
  • (2022) Study of Workload Interference with Intelligent Routing on Dragonfly. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. https://doi.org/10.1109/SC41404.2022.00025 (Nov 2022)
  • (2022) Toward the Development of a Multi-Agent Cognitive Networking System for the Lunar Environment. IEEE Journal of Radio Frequency Identification 6, 269-283. https://doi.org/10.1109/JRFID.2022.3162952 (2022)


Published In

HPDC '21: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing
June 2021, 275 pages
ISBN: 9781450382175
DOI: 10.1145/3431379
This work is licensed under a Creative Commons Attribution 4.0 International License.


Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. dragonfly
  2. hpc
  3. interconnect network
  4. multi-agent reinforcement learning
  5. routing


Conference

HPDC '21

Acceptance Rates

Overall Acceptance Rate: 166 of 966 submissions, 17%

