DOI: 10.1145/3625549.3658693

Near-Optimal Wafer-Scale Reduce

Published: 30 August 2024

Abstract

Efficient Reduce and AllReduce communication collectives are a critical cornerstone of high-performance computing (HPC) applications. We present the first systematic investigation of Reduce and AllReduce on the Cerebras Wafer-Scale Engine (WSE). This architecture has been shown to achieve unprecedented performance both for machine learning workloads and other computational problems like FFT. We introduce a performance model to estimate the execution time of algorithms on the WSE and validate our predictions experimentally for a wide range of input sizes. In addition to existing implementations, we design and implement several new algorithms specifically tailored to the architecture. Moreover, we establish a lower bound for the runtime of a Reduce operation on the WSE. Based on our model, we automatically generate code that achieves near-optimal performance across the whole range of input sizes. Experiments demonstrate that our new Reduce and AllReduce algorithms outperform the current vendor solution by up to 3.27×. Additionally, our model predicts performance with less than 4% error. The proposed communication collectives increase the range of HPC applications that can benefit from the high throughput of the WSE. Our model-driven methodology demonstrates a disciplined approach that can lead the way to further algorithmic advancements on wafer-scale architectures.
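The model-driven algorithm selection described above can be illustrated with a generic alpha-beta-gamma cost model. This is a sketch, not the paper's actual WSE performance model: the two algorithm variants are textbook stand-ins (binomial tree and Rabenseifner-style reduce-scatter plus gather), and all parameter values are hypothetical, chosen only to show how the predicted-fastest algorithm changes with input size.

```python
import math

def tree_reduce_time(n, p, alpha, beta, gamma):
    """Binomial-tree Reduce: ceil(log2 p) rounds, each sending and
    combining the full n-element message.
    alpha: per-message latency, beta: per-element transfer time,
    gamma: per-element combine (compute) time."""
    rounds = math.ceil(math.log2(p))
    return rounds * (alpha + n * beta + n * gamma)

def reduce_scatter_gather_time(n, p, alpha, beta, gamma):
    """Rabenseifner-style Reduce: a reduce-scatter over n/p-element
    chunks, followed by a gather of the partial results to the root."""
    chunk = n / p
    scatter = (p - 1) * (alpha + chunk * beta + chunk * gamma)
    gather = (p - 1) * (alpha + chunk * beta)
    return scatter + gather

def best_reduce_time(n, p, alpha, beta, gamma):
    """Model-driven selection: pick whichever variant the model
    predicts to be faster for this input size."""
    return min(tree_reduce_time(n, p, alpha, beta, gamma),
               reduce_scatter_gather_time(n, p, alpha, beta, gamma))

if __name__ == "__main__":
    # Hypothetical machine parameters: 16 PEs, latency-dominated
    # small messages, bandwidth-dominated large ones.
    p, alpha, beta, gamma = 16, 1.0, 0.01, 0.001
    for n in (1, 100_000):
        t = tree_reduce_time(n, p, alpha, beta, gamma)
        r = reduce_scatter_gather_time(n, p, alpha, beta, gamma)
        print(f"n={n:>7}: tree={t:.2f}  reduce-scatter+gather={r:.2f}")
```

With these parameters the tree variant is predicted fastest for tiny inputs (few latency terms dominate), while the reduce-scatter variant wins for large inputs (it moves only n/p elements per step), which is the kind of crossover a model-driven code generator exploits.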


Published In

HPDC '24: Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing
June 2024
436 pages
ISBN:9798400704130
DOI:10.1145/3625549
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. communication collectives
  2. message passing
  3. reduction

Qualifiers

  • Research-article

Conference

HPDC '24

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%
