DOI: 10.1145/3573900.3591119
Research article · Open access

Workload Interference Prevention with Intelligent Routing and Flexible Job Placement on Dragonfly

Published: 21 June 2023

Abstract

Dragonfly is an indispensable interconnect topology for exascale HPC systems. To link tens of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources across the entire system, so network bandwidth is not exclusive to any single job. Because HPC systems typically run multiple workloads concurrently, network competition between co-existing workloads is inevitable. This contention manifests as workload interference, in which a job's network communication can be severely delayed by other jobs. Recent studies show that, compared with currently deployed adaptive routing algorithms, an intelligent routing solution based on reinforcement learning, named Q-adaptive routing, can reduce workload interference. Besides improving routing efficiency, job placement is a simple yet effective way to mitigate workload interference. In this study, we leverage the well-known parallel discrete event simulation toolkit SST to investigate workload interference on Dragonfly, making three contributions. First, we develop an automatic module that bridges SST and an HPC job scheduler for automated simulation configuration and launching. Next, we propose a flexible job placement strategy that mitigates workload interference based on workload communication characteristics. Finally, we extensively examine workload interference under various job placement and routing configurations.
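
The flexible job placement strategy mentioned in the abstract adapts allocation to a workload's communication behavior. The Python sketch below is only an illustration of that general idea, not the authors' implementation; the names DragonflyMachine and place_job are hypothetical. It contrasts two standard policies such a strategy can switch between: contiguous allocation for communication-intensive jobs, which localizes their traffic within as few Dragonfly groups as possible, and round-robin spreading for communication-light jobs, which balances their residual load across groups.

```python
# Hypothetical sketch of a communication-aware placement heuristic on a
# Dragonfly machine. Not the paper's algorithm; names are illustrative.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DragonflyMachine:
    num_groups: int
    nodes_per_group: int
    free: List[List[int]] = field(init=False)  # free node ids, per group

    def __post_init__(self):
        self.free = [
            list(range(g * self.nodes_per_group, (g + 1) * self.nodes_per_group))
            for g in range(self.num_groups)
        ]


def place_job(machine: DragonflyMachine, size: int, comm_intensive: bool) -> List[int]:
    """Return node ids for the job, or an empty list if it cannot be placed."""
    if sum(len(g) for g in machine.free) < size:
        return []
    alloc: List[int] = []
    if comm_intensive:
        # Contiguous placement: fill the groups with the most free nodes first,
        # keeping the job's traffic local to as few groups as possible.
        for group in sorted(machine.free, key=len, reverse=True):
            while group and len(alloc) < size:
                alloc.append(group.pop(0))
            if len(alloc) == size:
                break
    else:
        # Spread placement: take one node per group in round-robin fashion,
        # balancing this job's background traffic across the network.
        while len(alloc) < size:
            for group in machine.free:
                if group and len(alloc) < size:
                    alloc.append(group.pop(0))
    return alloc


if __name__ == "__main__":
    m = DragonflyMachine(num_groups=4, nodes_per_group=8)
    print(place_job(m, 6, comm_intensive=True))    # clustered within one group
    print(place_job(m, 6, comm_intensive=False))   # nodes drawn from all groups
```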


        Published In

        SIGSIM-PADS '23: Proceedings of the 2023 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation
        June 2023
        173 pages
ISBN: 9798400700309
DOI: 10.1145/3573900
This work is licensed under a Creative Commons Attribution 4.0 International License.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 21 June 2023

        Author Tags

        1. high performance computing
        2. interconnect networking
        3. parallel discrete event simulation

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        SIGSIM-PADS '23

        Acceptance Rates

Overall Acceptance Rate: 398 of 779 submissions, 51%

        Article Metrics

• Total Citations: 0
• Total Downloads: 312
• Downloads (Last 12 months): 243
• Downloads (Last 6 weeks): 27

Reflects downloads up to 24 Sep 2024
