
Mapping communication layouts to network hardware characteristics on massive-scale Blue Gene systems

  • Special Issue Paper

Computer Science - Research and Development

Abstract

For parallel applications running on high-end computing systems, which processes of an application get launched on which processing cores is typically determined at application launch time without any information about the application's characteristics. As high-end computing systems continue to grow in scale, however, this approach is becoming increasingly infeasible for achieving the best performance. For example, for systems such as IBM Blue Gene and Cray XT that rely on flat 3D torus networks, process communication often involves network sharing, even for highly scalable applications. This causes the overall application performance to depend heavily on how processes are mapped onto the network. In this paper, we first analyze the impact of different process mappings on application performance on a massive Blue Gene/P system. We then match this analysis with communication patterns that applications can describe prior to being launched. The underlying process management system can use this combined information, in conjunction with the hardware characteristics of the system, to determine the best mapping for the application. Our experiments study the performance of different communication patterns, including 2D and 3D nearest-neighbor communication and structured Cartesian grid communication. Our studies, which scale up to 131,072 cores of the largest BG/P system in the United States (using 80% of the total system size), demonstrate that different process mappings can show significant differences in overall performance, especially at scale. For example, we show that this difference can be as much as 30% for P3DFFT and up to twofold for HALO. Through our proposed model, however, such differences in performance can be avoided so that the best possible performance is always achieved.
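To make the kind of pattern description discussed above concrete, the following minimal sketch uses the standard MPI Cartesian topology interface to express a 3D nearest-neighbor pattern. It is a generic illustration under the assumption of an MPI application, not the pre-launch description mechanism proposed in the paper. Setting the reorder argument to 1 permits the implementation (or, in the paper's model, the process management system) to permute ranks so that logically adjacent grid points are placed on physically adjacent torus nodes.

    /* Minimal sketch (not the paper's mechanism): describe a 3D
     * nearest-neighbor pattern with MPI's Cartesian topology interface
     * and allow the runtime to reorder ranks to match the torus. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int size, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Factor the process count into a 3D logical grid. */
        int dims[3] = {0, 0, 0};
        MPI_Dims_create(size, 3, dims);

        /* Periodic in every dimension, matching a 3D torus; reorder = 1
         * lets the implementation permute ranks for better placement. */
        int periods[3] = {1, 1, 1};
        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

        /* Ranks may have changed; query the new rank and grid coordinates. */
        int crank, coords[3];
        MPI_Comm_rank(cart, &crank);
        MPI_Cart_coords(cart, crank, 3, coords);

        /* Nearest neighbors along dimension 0 (dimensions 1 and 2 are
         * analogous), e.g. as partners for a halo exchange. */
        int left, right;
        MPI_Cart_shift(cart, 0, 1, &left, &right);

        if (crank == 0)
            printf("grid %d x %d x %d, rank 0 neighbors along x: %d and %d\n",
                   dims[0], dims[1], dims[2], left, right);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }

Whether such a virtual topology actually improves locality depends on how the job scheduler and MPI library map ranks onto torus coordinates, which is precisely the mapping decision analyzed in the paper.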


References

  1. IBM Blue Gene Team (2008) Overview of the IBM Blue Gene/P project. IBM J Res Dev 52(1–2):199–220


  2. Cray Research, Inc (1993) Cray T3D system architecture overview

  3. Argonne National Laboratory. PETSc. http://www.mcs.anl.gov/petsc

  4. Kumar S, Huang C, Almasi G, Kale LV (2007) Achieving strong scaling with NAMD on Blue Gene/L. In: IEEE international parallel and distributed processing symposium


  5. Naval Research Laboratory. Naval research laboratory layered ocean model (NLOM). http://www.navo.hpc.mil/Navigator/Fall99_Feature.html

  6. Rabiti C, Smith MA, Kaushik D, Yang WS, Palmiotti G (2008) Parallel method of characteristics on unstructured meshes for the UNIC code. In: PHYSOR, Interlaken, Switzerland, 14–19 Sept 2008


  7. Balaji P, Chan A, Thakur R, Gropp W, Lusk E (2009) Toward message passing for a million processes: characterizing MPI on a massive scale Blue Gene/P. Comput Sci Res Dev, special edition (presented at the International Supercomputing Conference (ISC)); Best Paper Award

  8. Balaji P, Naik H, Desai N (2009) Understanding network saturation behavior on large-scale Blue Gene/P systems. In: Proceedings of the international conference on parallel and distributed systems (ICPADS), Shenzhen, China, 8–11 Dec 2009


  9. Pekurovsky D (2009) P3DFFT webpage, Feb 2009. http://www.sdsc.edu/us/resources/p3dfft/index.php

  10. Wallcraft AJ (1999) The Halo benchmark. http://www.navo.hpc.mil/Navigator/PDFS/Fall1999.pdf

  11. Fischer P, Lottes J, Pointer D, Siegel A (2008) Petascale algorithms for reactor hydrodynamics. J Phys Conf Ser 125(1). doi:10.1088/1742-6596/125/1/012076

  12. Frigo M, Johnson SG (2005) The design and implementation of FFTW3. Proc IEEE 93:216–231


  13. Cooley JW, Tukey JW (1965) An algorithm for the machine calculation of complex Fourier series. Math Comput 19(90):297–301


  14. San Diego Supercomputing Center. P3DFFT. http://www.sdsc.edu/us/resources/p3dfft/

  15. Chan A, Balaji P, Gropp W, Thakur R (2008) Communication analysis of parallel 3D FFT for flat Cartesian meshes on large blue gene systems. In: Proceedings of the IEEE/ACM international conference on high performance computing (HiPC), Bangalore, India, 17–20 Dec 2008


  16. Wallcraft AJ (1991) The NRL layered ocean model users guide. NOARL Report 35, Naval Research Laboratory, Stennis Space Center, MS

  17. Traff J (2002) Implementing the MPI process topology mechanism. In: SC, pp 1–14


  18. Hur J (1999) An approach for torus embedding. In: ICPP, Washington, DC, USA. IEEE Computer Society, Los Alamitos, p 301


  19. Ou C, Ranka S, Fox G (1996) Fast and parallel mapping algorithms for irregular problems. J Supercomput 10(2):119–140


  20. Bokhari S (1981) On the mapping problem. IEEE Trans Comput 30(3):207–214


  21. Bollinger SW, Midkiff S (1991) Heuristic technique for processor and link assignment in multicomputers. IEEE Trans Comput 40(3):325–333


  22. Mansour N, Ponnusamy R, Choudhary A, Fox GC (1993) Graph contraction for physical optimization methods: a quality-cost tradeoff for mapping data on parallel computers. In: ICS, New York, NY, USA. ACM, New York, pp 1–10


  23. Chockalingam T, Arunkumar S (1992) Randomized heuristics for the mapping problem. The genetic approach. In: Parallel computing, pp 1157–1165


  24. Bhanot G, Gara A, Heidelberger P et al. (2005) Optimizing task layout on the Blue Gene/L supercomputer. IBM J Res Dev 49(2–3):489–500. doi:10.1147/rd.492.0489


  25. Almasi G, Archer C, Castanos J et al. (2004) Implementing MPI on the BlueGene/L Supercomputer. In: Euro-Par, pp 833–845


  26. Yu H, Chung I, Moreira J (2006) Topology mapping for Blue Gene/L supercomputer. In: SC, New York, NY, USA. ACM, New York, p 116


  27. Agarwal T, Sharma A, Laxmikant A, Kale LV (2006) Topology-aware task mapping for reducing communication contention on large parallel machines. In: IPDPS, p 122


  28. Smith B, Bode B (2005) Performance effects of node mappings on the IBM BlueGene/L machine. In: Euro-Par, pp 1005–1013


  29. Faraj A, Yuan X, Lowenthal D (2006) STAR-MPI: self tuned adaptive routines for MPI collective operations. In: Proceedings of the 20th annual international conference on supercomputing (ICS), Cairns, Queensland, Australia, pp 199–208



Author information


Corresponding author

Correspondence to Pavan Balaji.

Additional information

This work was supported in part by the National Science Foundation Grant #0702182 and by the Office of Advanced Scientific Computing Research, Office of Science, US Department of Energy, under Contract DE-AC02-06CH11357.


About this article

Cite this article

Balaji, P., Gupta, R., Vishnu, A. et al. Mapping communication layouts to network hardware characteristics on massive-scale Blue Gene systems. Comput Sci Res Dev 26, 247–256 (2011). https://doi.org/10.1007/s00450-011-0168-y
