Optimizing for KNL Usage Modes When Data Doesn't Fit in MCDRAM

Authors:

Stephen L. Olivier,

Jonathan Berry,

Simon D. Hammond,

Peter M. Kogge

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing

Article No.: 37, Pages 1 - 10

https://doi.org/10.1145/3225058.3225116

Published: 13 August 2018

Abstract

Technologies such as Multi-Channel DRAM (MCDRAM) or High Bandwidth Memory (HBM) provide significantly more bandwidth than conventional memory. This trend has raised questions about how applications should manage data transfers between memory levels. This paper evaluates different usage modes of the MCDRAM in Intel Knights Landing (KNL) manycore processors, using a sorting kernel and a sorting-based streaming benchmark. We develop a performance model for the benchmark and use experimental evidence to demonstrate its correctness. The model projects near-optimal numbers of copy threads for memory-bandwidth-bound computations. On KNL, we demonstrate up to a 1.9X speedup for sort when the problem does not fit in MCDRAM, relative to an OpenMP GNU sort that does not use MCDRAM.
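As a rough illustration of sorting a data set that is larger than the fast memory, the sketch below sorts fixed-size chunks in a staging buffer and then merges the sorted runs. It is not the authors' implementation: the function name `chunked_sort` and the parameter `fast_capacity` are hypothetical, the staging buffer is an ordinary Python list standing in for an MCDRAM allocation, and the copy steps run sequentially rather than on dedicated copy threads overlapped with sorting.

```python
import heapq
from typing import List, Sequence


def chunked_sort(data: Sequence[int], fast_capacity: int) -> List[int]:
    """Sort `data` using a staging buffer of at most `fast_capacity` items.

    Hypothetical sketch: `fast_capacity` stands in for the number of items
    that fit in the fast memory (MCDRAM on KNL).
    """
    if fast_capacity < 1:
        raise ValueError("fast_capacity must be at least 1")

    sorted_runs: List[List[int]] = []
    for start in range(0, len(data), fast_capacity):
        # "Copy in": stage one chunk in the stand-in fast buffer.
        buffer = list(data[start:start + fast_capacity])
        # Sort the chunk while it resides in the fast buffer.
        buffer.sort()
        # "Copy out": keep the sorted run in the large, slower memory.
        sorted_runs.append(buffer)

    # Merge all sorted runs into one sorted output (k-way merge).
    return list(heapq.merge(*sorted_runs))


if __name__ == "__main__":
    print(chunked_sort([5, 3, 9, 1, 7, 2, 8], fast_capacity=3))
```

In this simplified form, the chunk size is the only tuning knob. The abstract's question of how many threads to dedicate to copying arises only once the copy-in and copy-out steps run concurrently with the sort, which this sketch does not attempt.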

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures
2. Theory of computation
  1. Design and analysis of algorithms
    1. Parallel algorithms

Published In

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing

August 2018

945 pages

ISBN:9781450365109

DOI:10.1145/3225058

Copyright © 2018 ACM.

© 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

In-Cooperation

University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Nuclear Security Administration

Conference

ICPP 2018

ICPP 2018: 47th International Conference on Parallel Processing

August 13 - 16, 2018

Eugene, OR, USA

Acceptance Rates

ICPP '18 Paper Acceptance Rate 91 of 313 submissions, 29%

Overall Acceptance Rate 91 of 313 submissions, 29%
