research-article

Design of a scalable InfiniBand topology service to enable network-topology-aware placement of processes

Authors:

D. K. Panda

SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Article No.: 70, Pages 1 - 12

Published: 10 November 2012

Abstract

Over the last decade, InfiniBand has become an increasingly popular interconnect for deploying modern supercomputing systems. However, no detection service exists that can discover the underlying network topology in a scalable manner and expose this information conveniently to runtime libraries and users of high performance computing systems. In this paper, we design a novel and scalable method to detect the InfiniBand network topology using Neighbor-Joining (NJ) techniques. To the best of our knowledge, this is the first time the neighbor-joining algorithm has been applied to the problem of detecting InfiniBand network topology. We also design a network-topology-aware MPI library that takes advantage of the network topology service. The library places the processes of an MPI job in a network-topology-aware manner with the dual aim of increasing intra-node communication and reducing long-distance inter-node communication across the InfiniBand fabric.
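The abstract does not say how neighbor joining is applied to the fabric, so the following is only a minimal sketch of the textbook Saitou-Nei algorithm, run on an assumed matrix of pairwise hop counts between hosts. The function name `neighbor_joining`, the host labels, the generated `sw` node names, and the hop counts are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Textbook Saitou-Nei neighbor joining (illustrative, not the paper's code).

    names: list of distinct leaf labels (must not collide with "sw<N>").
    dist:  dict mapping frozenset({a, b}) -> distance between a and b.
    Returns a list of (node, neighbor, branch_length) edges of an unrooted tree.
    """
    nodes = list(names)
    d = dict(dist)
    edges = []
    next_id = 0

    while len(nodes) > 2:
        n = len(nodes)
        # r[i]: sum of distances from i to every other active node.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Join the pair that minimises Q(i, j) = (n - 2) * d(i, j) - r[i] - r[j].
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        dij = d[frozenset((i, j))]
        # Branch lengths from i and j to the new internal node u.
        li = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        u = f"sw{next_id}"  # hypothetical label for an inferred internal node
        next_id += 1
        edges += [(i, u, li), (j, u, lj)]
        # Distance from u to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [u]

    # Connect the last two active nodes.
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Assumed example: h0 and h1 share one leaf switch, h2 and h3 share another,
# and the two switches are directly linked (2 hops within, 3 hops across).
hosts = ["h0", "h1", "h2", "h3"]
hops = {frozenset(p): 3 for p in combinations(hosts, 2)}
hops[frozenset(("h0", "h1"))] = 2
hops[frozenset(("h2", "h3"))] = 2

for node, neighbor, length in neighbor_joining(hosts, hops):
    print(node, neighbor, length)
```

In this toy input the reconstructed tree groups hosts under the same inferred internal node when they are fewer hops apart, which is the kind of locality information a placement layer could consume. Real fabrics are not always tree-like, so the result should be read as an approximation.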


Information & Contributors

Information

Published In

SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

November 2012

1161 pages

ISBN: 9781467308045

General Chair:
Jeffrey K. Hollingsworth
University of Maryland

Publisher

IEEE Computer Society Press

Washington, DC, United States

Publication History

Published: 10 November 2012

Qualifiers

Research-article

Conference

SC '12

Sponsor:

SIGHPC
SIGARCH
IEEE-CS

SC '12: International Conference for High Performance Computing, Networking, Storage and Analysis

November 10 - 16, 2012

Salt Lake City, Utah

Acceptance Rates

SC '12 Paper Acceptance Rate: 100 of 461 submissions, 22%

Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 13
Total Downloads: 306
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 1

Reflects downloads up to 09 Nov 2024

Citations

Cited By

Jha S, Patke A, Brandt J, Gentile A, Lim B, Showerman M, Bauer G, Kaplan L, Kalbarczyk Z, Kramer W, Iyer R, Bhagwan R, Porter G (2020). Measuring congestion in high-performance datacenter interconnects. Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation, pp. 37-58. Online publication date: 25-Feb-2020.
https://dl.acm.org/doi/10.5555/3388242.3388246
Niethammer C, Rabenseifner R (2019). An MPI interface for application and hardware aware cartesian topology optimization. Proceedings of the 26th European MPI Users' Group Meeting, pp. 1-8. Online publication date: 11-Sep-2019.
https://dl.acm.org/doi/10.1145/3343211.3343217
Luo X, Wu W, Bosilca G, Patinyasakdikul T, Wang L, Dongarra J, Zhao M, Chandra A, Ramakrishnan L (2018). ADAPT. Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, pp. 118-130. Online publication date: 11-Jun-2018.
https://dl.acm.org/doi/10.1145/3208040.3208054
Galvez J, Jain N, Kale L, Gropp W, Beckman P, Li Z, Cazorla F (2017). Automatic topology mapping of diverse large-scale parallel applications. Proceedings of the International Conference on Supercomputing, pp. 1-10. Online publication date: 14-Jun-2017.
https://dl.acm.org/doi/10.1145/3079079.3079104
Domke J, Hoefler T, West J (2016). Scheduling-aware routing for supercomputers. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. Online publication date: 13-Nov-2016.
https://dl.acm.org/doi/10.5555/3014904.3014922
Zhu Y, Eran H, Firestone D, Guo C, Lipshteyn M, Liron Y, Padhye J, Raindel S, Yahia M, Zhang M (2015). Congestion Control for Large-Scale RDMA Deployments. ACM SIGCOMM Computer Communication Review 45:4, pp. 523-536. Online publication date: 17-Aug-2015.
https://dl.acm.org/doi/10.1145/2829988.2787484
Zhu Y, Eran H, Firestone D, Guo C, Lipshteyn M, Liron Y, Padhye J, Raindel S, Yahia M, Zhang M, Uhlig S, Maennel O, Karp B, Padhye J (2015). Congestion Control for Large-Scale RDMA Deployments. Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, pp. 523-536. Online publication date: 17-Aug-2015.
https://dl.acm.org/doi/10.1145/2785956.2787484
Tuncer O, Leung V, Coskun A, Bhuyan L, Chong F, Sarkar V (2015). PaCMap. Proceedings of the 29th ACM on International Conference on Supercomputing, pp. 37-46. Online publication date: 8-Jun-2015.
https://dl.acm.org/doi/10.1145/2751205.2751225
Wu J, Xiong X, Lan Z (2015). Hierarchical task mapping for parallel applications on supercomputers. The Journal of Supercomputing 71:5, pp. 1776-1802. Online publication date: 1-May-2015.
https://dl.acm.org/doi/10.1007/s11227-014-1324-5
Dragojević A, Narayanan D, Hodson O, Castro M, Mahajan R, Stoica I (2014). FaRM. Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation, pp. 401-414. Online publication date: 2-Apr-2014.
https://dl.acm.org/doi/10.5555/2616448.2616486
