DOI: 10.5555/3571885.3571911
research-article

Study of workload interference with intelligent routing on Dragonfly

Published: 18 November 2022 Publication History

Abstract

The Dragonfly interconnect is a crucial network technology for supercomputers. To support exascale systems, network resources are shared: links and routers are not dedicated to any node pair. While sharing increases link utilization, the resulting network contention often degrades workload performance. Intelligent routing built on reinforcement learning has recently demonstrated higher network throughput with lower packet latency. However, its effectiveness in reducing workload interference is unknown. In this work, we present extensive network simulations to study multi-workload contention under two routing mechanisms, intelligent routing and adaptive routing, on a large-scale Dragonfly system. We develop an enhanced network simulation toolkit, along with a suite of workloads with distinctive communication patterns. We also present two metrics to characterize application communication intensity. Our analysis examines how different workloads interfere with each other under each routing mechanism by inspecting both application-level and network-level metrics, and it yields several key insights.
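As a rough illustration of what routing "built on reinforcement learning" can look like, the sketch below implements a generic tabular Q-routing agent for a single router. It is not the routing scheme or the simulation toolkit evaluated in the paper; the class name `QRouter`, its parameters, and the toy delay values are all hypothetical and chosen only for illustration. Each router keeps an estimate of the delivery time to a destination through each neighbor, forwards packets to the neighbor with the lowest estimate, and nudges that estimate toward the delay it actually observes.

```python
from collections import defaultdict


class QRouter:
    """Toy tabular Q-routing agent for one router (hypothetical, illustrative only)."""

    def __init__(self, neighbors, alpha=0.5, initial_estimate=0.0):
        self.neighbors = list(neighbors)
        self.alpha = alpha  # learning rate in (0, 1]
        # q[dest][neighbor] = estimated time to deliver to dest via that neighbor
        self.q = defaultdict(
            lambda: {n: initial_estimate for n in self.neighbors}
        )

    def best_estimate(self, dest):
        """Lowest estimated delivery time to dest over all neighbors."""
        return min(self.q[dest].values())

    def choose_next_hop(self, dest):
        """Greedy choice: the neighbor with the lowest estimate (ties: first listed)."""
        table = self.q[dest]
        return min(self.neighbors, key=lambda n: table[n])

    def update(self, dest, neighbor, hop_delay, neighbor_estimate):
        """Move the estimate toward the observed hop delay (queueing plus
        transmission) plus the neighbor's own best estimate for dest."""
        old = self.q[dest][neighbor]
        target = hop_delay + neighbor_estimate
        self.q[dest][neighbor] = old + self.alpha * (target - old)
        return self.q[dest][neighbor]


if __name__ == "__main__":
    router = QRouter(["port_a", "port_b"])
    # Feedback after forwarding one packet through each port (toy numbers).
    router.update("node_9", "port_a", hop_delay=4.0, neighbor_estimate=6.0)
    router.update("node_9", "port_b", hop_delay=1.0, neighbor_estimate=1.0)
    print(router.choose_next_hop("node_9"))
```

In this toy run the congested port ends up with the higher estimate, so later packets for `node_9` are steered to the other port. A real deployment would also need exploration and a way to exchange estimates between routers, which this sketch leaves out.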

Supplementary Material

MP4 File (SC22_Presentation_Kang.mp4)
Presentation at SC '22




Published In

SC '22: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2022
1277 pages
ISBN:9784665454445

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Author Tags

  1. Dragonfly
  2. HPC
  3. interconnect network
  4. network interference

Qualifiers

  • Research-article

Conference

SC '22

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Cited By

  • (2024) Surrogate Modeling for HPC Application Iteration Times Forecasting with Network Features. Proceedings of the 38th ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, pp. 93-97. DOI: 10.1145/3615979.3656055. Online publication date: 24-Jun-2024.
  • (2023) Exploring Machine Learning Models with Spatial-Temporal Information for Interconnect Network Traffic Forecasting. Proceedings of the 2023 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, pp. 56-57. DOI: 10.1145/3573900.3593635. Online publication date: 21-Jun-2023.
  • (2023) Machine Learning for Interconnect Network Traffic Forecasting: Investigation and Exploitation. Proceedings of the 2023 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, pp. 133-137. DOI: 10.1145/3573900.3591123. Online publication date: 21-Jun-2023.
  • (2023) Hybrid PDES Simulation of HPC Networks Using Zombie Packets. Proceedings of the 2023 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, pp. 128-132. DOI: 10.1145/3573900.3591122. Online publication date: 21-Jun-2023.
  • Hybrid PDES Simulation of HPC Networks Using Zombie Packets. ACM Transactions on Modeling and Computer Simulation. DOI: 10.1145/3682060.
