research-article
Open access

Main Memory in HPC: Do We Need More or Could We Live with Less?

Published: 06 March 2017 Publication History

Abstract

An important aspect of High-Performance Computing (HPC) system design is the choice of main memory capacity. This choice becomes increasingly important now that 3D-stacked memories are entering the market. Compared with conventional Dual In-line Memory Modules (DIMMs), 3D memory chiplets provide better performance and energy efficiency but lower memory capacities. Therefore, the adoption of 3D-stacked memories in the HPC domain depends on whether we can find use cases that require much less memory than is available now.
This study analyzes the memory capacity requirements of important HPC benchmarks and applications. We find that the High-Performance Conjugate Gradients (HPCG) benchmark could be an important success story for 3D-stacked memories in HPC, but High-Performance Linpack (HPL) is likely to be constrained by 3D memory capacity. The study also emphasizes that the analysis of memory footprints of production HPC applications is complex and that it requires an understanding of application scalability and target category, i.e., whether the users target capability or capacity computing. The results show that most of the HPC applications under study have per-core memory footprints in the range of hundreds of megabytes, but we also detect applications and use cases that require gigabytes per core. Overall, the study identifies the HPC applications and use cases with memory footprints that could be provided by 3D-stacked memory chiplets, making a first step toward adoption of this novel technology in the HPC domain.
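The capacity pressure on HPL follows directly from its working set: HPL factorizes a dense N × N double-precision matrix, so roughly 8·N² bytes must fit in the machine's aggregate main memory, and the achievable problem size is bounded by capacity. The short sketch below is a back-of-envelope illustration of that bound, not material from the paper; the node count (1,000) and the per-node capacities assumed for 3D-stacked and DIMM-based memory (16 GiB and 64 GiB) are illustrative assumptions chosen only to make the comparison concrete.

```python
# Back-of-envelope sketch (not from the paper): how memory capacity bounds the
# HPL problem size. HPL's dominant footprint is the dense N x N double-precision
# matrix, i.e. 8 * N^2 bytes spread across all nodes. The node count and
# per-node capacities below are illustrative assumptions only.

def hpl_matrix_gib(n: int) -> float:
    """Total size (GiB) of the dense N x N double-precision HPL matrix."""
    return 8 * n * n / 2**30

def max_hpl_n(nodes: int, gib_per_node: float, fill: float = 0.8) -> int:
    """Largest N whose matrix fits when each node devotes `fill` of its memory to HPL."""
    total_bytes = nodes * gib_per_node * fill * 2**30
    return int((total_bytes / 8) ** 0.5)

if __name__ == "__main__":
    nodes = 1000                                                # assumed machine size
    for label, gib in (("3D-stacked", 16.0), ("DIMM", 64.0)):   # assumed capacities
        n = max_hpl_n(nodes, gib)
        print(f"{label:>10}: {gib:5.1f} GiB/node -> max HPL N ~ {n:,} "
              f"(~{hpl_matrix_gib(n):,.0f} GiB matrix in total)")
```

The 0.8 fill factor mirrors the common practice of leaving headroom for the operating system, MPI buffers, and HPL workspace; it is an assumption, not a value taken from the paper.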

Supplementary Material

TACO1401-03 (taco1401-03.pdf)
Slide deck associated with this paper




Published In

ACM Transactions on Architecture and Code Optimization  Volume 14, Issue 1
March 2017
258 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3058793
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 March 2017
Accepted: 01 December 2016
Revised: 01 November 2016
Received: 01 May 2016
Published in TACO Volume 14, Issue 1


Author Tags

  1. HPCG
  2. HPL
  3. memory capacity requirements
  4. high-performance computing
  5. production HPC applications

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Spanish Ministry of Science and Technology
  • Collaboration Agreement between Samsung Electronics Co., Ltd. and BSC
  • Spanish Government through the Severo Ochoa programme
  • Generalitat de Catalunya
  • Severo Ochoa grant held by Darko Zivanovic (Ministry of Economy and Competitiveness of Spain)
  • European Union’s Horizon 2020 research and innovation programme under the ExaNoDe project


Article Metrics

  • Downloads (last 12 months): 326
  • Downloads (last 6 weeks): 46
Reflects downloads up to 24 Nov 2024

Cited By

  • (2024) Load Balancing with Job-Size Testing: Performance Improvement or Degradation? ACM Transactions on Modeling and Performance Evaluation of Computing Systems 9:2, 1-27. DOI: 10.1145/3651154. Online publication date: 4-Mar-2024.
  • (2024) Software Resource Disaggregation for HPC with Serverless Computing. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 139-156. DOI: 10.1109/IPDPS57955.2024.00021. Online publication date: 27-May-2024.
  • (2024) Exploring Approximate Memory for Energy-Efficient Computing. 2024 ASU International Conference in Emerging Technologies for Sustainability and Intelligent Systems (ICETSIS), 1685-1689. DOI: 10.1109/ICETSIS61505.2024.10459495. Online publication date: 28-Jan-2024.
  • (2024) General resource manager for computationally demanding scientific software (MARE). Engineering with Computers 40:3, 1927-1942. DOI: 10.1007/s00366-023-01890-z. Online publication date: 1-Jun-2024.
  • (2023) Accelerating Performance of GPU-based Workloads Using CXL. Proceedings of the 13th Workshop on AI and Scientific Computing at Scale using Flexible Computing, 27-31. DOI: 10.1145/3589013.3596678. Online publication date: 10-Aug-2023.
  • (2022) STEPS 4.0: Fast and memory-efficient molecular simulations of neurons at the nanoscale. Frontiers in Neuroinformatics 16. DOI: 10.3389/fninf.2022.883742. Online publication date: 26-Oct-2022.
  • (2022) A Case For Intra-rack Resource Disaggregation in HPC. ACM Transactions on Architecture and Code Optimization 19:2, 1-26. DOI: 10.1145/3514245. Online publication date: 7-Mar-2022.
  • (2021) Multi-communication layered HPL model and its application to GPU clusters. ETRI Journal 43:3, 524-537. DOI: 10.4218/etrij.2020-0393. Online publication date: 23-Jun-2021.
  • (2021) Fortran Coarray Implementation of Semi-Lagrangian Convected Air Particles within an Atmospheric Model. ChemEngineering 5:2, 21. DOI: 10.3390/chemengineering5020021. Online publication date: 6-May-2021.
  • (2021) Quantifying server memory frequency margin and using it to improve performance in HPC systems. Proceedings of the 48th Annual International Symposium on Computer Architecture, 748-761. DOI: 10.1109/ISCA52012.2021.00064. Online publication date: 14-Jun-2021.
