DOI: 10.1145/3573900.3591119
Research article · Open access

Workload Interference Prevention with Intelligent Routing and Flexible Job Placement on Dragonfly

Published: 21 June 2023

Abstract

Dragonfly is an indispensable interconnect topology for exascale HPC systems. To link tens of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources across the entire system, so network bandwidth is not exclusive to any single job. Because HPC systems typically run multiple workloads concurrently, network competition between co-existing workloads is inevitable. This contention manifests as workload interference, in which a job's network communication can be severely delayed by other jobs. Recent studies show that, compared with currently deployed adaptive routing algorithms, an intelligent routing solution based on reinforcement learning, named Q-adaptive routing, can reduce workload interference. Besides improving routing efficiency, job placement is a simple yet effective way to mitigate workload interference. In this study, we leverage the well-known parallel discrete event simulation toolkit SST to investigate workload interference on Dragonfly, making three contributions. First, we develop an automatic module that bridges SST and an HPC job scheduler for automated simulation configuration and launching. Next, we propose a flexible job placement strategy that mitigates workload interference based on workload communication characteristics. Finally, we extensively examine workload interference under various job placement and routing configurations.
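
The flexible job placement strategy mentioned in the abstract adapts allocation to a workload's communication behavior. The Python sketch below is only an illustration of that general idea, not the authors' implementation; the names DragonflyMachine and place_job are hypothetical. It contrasts two standard policies such a strategy can switch between: contiguous allocation for communication-intensive jobs, which localizes their traffic within as few Dragonfly groups as possible, and round-robin spreading for communication-light jobs, which balances their residual load across groups.

```python
# Hypothetical sketch of a communication-aware placement heuristic on a
# Dragonfly machine. Not the paper's algorithm; names are illustrative.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DragonflyMachine:
    num_groups: int
    nodes_per_group: int
    free: List[List[int]] = field(init=False)  # free node ids, per group

    def __post_init__(self):
        self.free = [
            list(range(g * self.nodes_per_group, (g + 1) * self.nodes_per_group))
            for g in range(self.num_groups)
        ]


def place_job(machine: DragonflyMachine, size: int, comm_intensive: bool) -> List[int]:
    """Return node ids for the job, or an empty list if it cannot be placed."""
    if sum(len(g) for g in machine.free) < size:
        return []
    alloc: List[int] = []
    if comm_intensive:
        # Contiguous placement: fill the groups with the most free nodes first,
        # keeping the job's traffic local to as few groups as possible.
        for group in sorted(machine.free, key=len, reverse=True):
            while group and len(alloc) < size:
                alloc.append(group.pop(0))
            if len(alloc) == size:
                break
    else:
        # Spread placement: take one node per group in round-robin fashion,
        # balancing this job's background traffic across the network.
        while len(alloc) < size:
            for group in machine.free:
                if group and len(alloc) < size:
                    alloc.append(group.pop(0))
    return alloc


if __name__ == "__main__":
    m = DragonflyMachine(num_groups=4, nodes_per_group=8)
    print(place_job(m, 6, comm_intensive=True))    # clustered within one group
    print(place_job(m, 6, comm_intensive=False))   # nodes drawn from all groups
```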


        Published In

        SIGSIM-PADS '23: Proceedings of the 2023 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation
        June 2023
        173 pages
ISBN: 9798400700309
DOI: 10.1145/3573900
This work is licensed under a Creative Commons Attribution 4.0 International License.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 21 June 2023

        Author Tags

        1. high performance computing
        2. interconnect networking
        3. parallel discrete event simulation

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        SIGSIM-PADS '23

        Acceptance Rates

Overall Acceptance Rate: 398 of 779 submissions, 51%

        Article Metrics

• Total Citations: 0
• Total Downloads: 312
• Downloads (Last 12 months): 243
• Downloads (Last 6 weeks): 27

Reflects downloads up to 24 Sep 2024
