Abstract
In order to fulfill the needs of modern applications, computing systems have become more powerful, heterogeneous and complex. NUMA platforms and emerging high-bandwidth memories offer new opportunities for performance improvements. However, they also increase hardware and software complexity, thus making application performance analysis and optimization an even harder task. The Cache-Aware Roofline Model (CARM) is an insightful, yet simple model designed to address this issue. It provides feedback on potential application bottlenecks and shows how far the application performance is from the achievable hardware upper bounds. However, it does not encompass NUMA systems and next-generation processors with heterogeneous memories. Yet, some application bottlenecks lie in those memory subsystems and would benefit from the CARM insights. In this paper, we fill the missing requirements to extend the CARM to recent large shared-memory systems. We provide the methodology to instantiate and validate the model on a NUMA system as well as on the latest Xeon Phi processor equipped with configurable hybrid memory. Finally, we show the model's ability to exhibit several bottlenecks of such systems, which were not captured by the original CARM.
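For context, the roofline bound underlying the CARM can be sketched as \(F_a(I) = \min\left(F_p,\, B_m \times I\right)\), where \(I\) is the arithmetic intensity in flops per byte, \(F_p\) the peak compute throughput, and \(B_m\) the sustainable bandwidth of memory level \(m\); one roof is drawn per memory level, and the paper extends this set of roofs to the memory subsystems of NUMA and hybrid-memory platforms. The symbol names used here are chosen for illustration only.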
Notes
1. Intel Advisor Roofline - 2017-05-12: https://software.intel.com/en-us/articles/intel-advisor-roofline.
2. Main memory and cache levels.
3. On usual platforms, a cluster is identical to the widely-used definition of a NUMA node. On KNL, there can be two local NUMA nodes near each core (DDR and MCDRAM), hence two NUMA nodes per cluster (see the hwloc sketch after these notes).
4. Network congestion in data networking and queueing theory is the reduced quality of service that occurs when a network node is carrying more data than it can handle.
5. The error is computed as \(\frac{100}{n}\times\sqrt{\sum_{i=1}^{n}\left(\frac{y_i-\hat{y}_i}{\hat{y}_i}\right)^2}\), where \(y_i\) is the validation point at a given arithmetic intensity and \(\hat{y}_i\) is the corresponding roof (see the code sketch after these notes).
6. Remote memory bandwidths are very close to congested bandwidths on this system and we omit the former in the chart to avoid confusion.
7.
8. Product version: Update 2 (build 501009).
9. Lulesh application run parameters: -i 1000 -s 60 -r 4. The application is compiled with ICC 17.0.2 and options: -DUSE_MPI=0 -qopenmp -O3 -xHost.
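As an illustration of note 3, the following minimal sketch (not part of the paper) uses the hwloc library, cited in the references below, to enumerate the NUMA nodes and their local memory capacities; on a KNL configured in flat or hybrid mode it lists both the DDR and the MCDRAM nodes. It assumes hwloc 2.x and is a simplified example, not the instrumentation used by the authors.

#include <hwloc.h>
#include <stdio.h>

int main(void) {
    hwloc_topology_t topo;
    if (hwloc_topology_init(&topo) < 0 || hwloc_topology_load(topo) < 0)
        return 1;
    /* One object per NUMA node; on KNL in flat/hybrid mode this includes
       both the DDR and the MCDRAM nodes mentioned in note 3. */
    int n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
    for (int i = 0; i < n; i++) {
        hwloc_obj_t node = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, i);
        printf("NUMA node #%u: %llu MiB of local memory\n", node->os_index,
               (unsigned long long)(node->attr->numanode.local_memory >> 20));
    }
    hwloc_topology_destroy(topo);
    return 0;
}

It can be built, for instance, with cc list_numa.c $(pkg-config --cflags --libs hwloc), where list_numa.c is a hypothetical file name.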
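Similarly, a minimal sketch of the error metric defined in note 5, expressed in C for illustration only (the function name roof_error is ours, not from the paper):

#include <math.h>
#include <stddef.h>

/* Error (in %) between n validation points y[i] and the corresponding
   roof values yhat[i], following the formula given in note 5. */
double roof_error(const double *y, const double *yhat, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double rel = (y[i] - yhat[i]) / yhat[i];
        sum += rel * rel;
    }
    return (100.0 / (double)n) * sqrt(sum);
}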
References
Blake, G., Dreslinski, R.G., Mudge, T.: A survey of multicore processors. IEEE Signal Process. Mag. 26(6), 26–37 (2009)
Blagodurov, S., Zhuravlev, S., Dashti, M., Fedorova, A.: A case for NUMA-aware contention management on multicore systems. In: 2011 USENIX Annual Technical Conference, Portland, OR, USA, 15–17 June 2011 (2011)
Reinders, J., Jeffers, J., Sodani, A.: Intel Xeon Phi Processor High Performance Programming Knights Landing Edition (2016)
Ziakas, D., Baum, A., Maddox, R.A., Safranek, R.J.: Intel QuickPath Interconnect architectural features supporting scalable system architectures. In: 2010 IEEE 18th Annual Symposium on High Performance Interconnects (HOTI), pp. 1–6. IEEE (2010)
Bull Atos Technologies: Bull Coherent Switch. http://support.bull.com/ols/product/platforms/hw-extremcomp/hw-bullx-sup-node/BCS/index.htm
Ilic, A., Pratas, F., Sousa, L.: Cache-aware roofline model: upgrading the loft. IEEE Comput. Archit. Lett. 13(1), 21–24 (2014)
Williams, S., Waterman, A., Patterson, D.: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65–76 (2009)
Cantalupo, C., Venkatesan, V., Hammond, J., Czurylo, K., Hammond, S.D.: Memkind: an extensible heap memory manager for heterogeneous memory platforms and mixed memory policies. Technical report, Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States) (2015)
Broquedis, F., Clet-Ortega, J., Moreaud, S., Furmento, N., Goglin, B., Mercier, G., Thibault, S., Namyst, R.: hwloc: a generic framework for managing hardware affinities in HPC applications. In: The 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing (PDP 2010), Pisa, Italy. IEEE, February 2010
Kleen, A.: A NUMA API for Linux. Novell Inc. (2005)
Lepers, B., Quema, V., Fedorova, A.: Thread and memory placement on NUMA systems: asymmetry matters. In: 2015 USENIX Annual Technical Conference (USENIX ATC 2015), Santa Clara, CA, pp. 277–289. USENIX Association, July 2015
Chou, C., Jaleel, A., Qureshi, M.K.: CAMEO: a two-level memory organization with capacity of main memory and flexibility of hardware-managed cache. In: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47), Washington, DC, USA, pp. 1–12. IEEE Computer Society (2014)
Bailey, D.H., Barszcz, E., Barton, J.T., Browning, D.S., Carter, R.L., Fatoohi, R.A., Frederickson, P.O., Lasinski, T.A., Simon, H.D., Venkatakrishnan, V., Weeratunga, S.K.: The NAS parallel benchmarks. Int. J. Supercomput. Appl. 5, 63–73 (1991)
Karlin, I., Keasler, J., Neely, R.: Lulesh 2.0 updates and changes. Technical report LLNL-TR-641973, August 2013
Ramos, S., Hoefler, T.: Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL (2017)
The Memkind Library. http://memkind.github.io/memkind
Ilic, A., Pratas, F., Sousa, L.: Beyond the roofline: cache-aware power and energy-efficiency modeling for multi-cores. IEEE Trans. Comput. 66(1), 52–58 (2017)
Doerfler, D., et al.: Applying the roofline performance model to the Intel Xeon Phi Knights Landing processor. In: Taufer, M., Mohr, B., Kunkel, J.M. (eds.) ISC High Performance 2016. LNCS, vol. 9945, pp. 339–353. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46079-6_24
Lorenzo, O.G., Pena, T.F., Cabaleiro, J.C., Pichel, J.C., Rivera, F.F.: Using an extended roofline model to understand data and thread affinities on NUMA systems. Ann. Multicore GPU Program. 1(1), 56–67 (2014)
Hofmann, J., Eitzinger, J., Fey, D.: Execution-cache-memory performance model: introduction and validation. CoRR abs/1509.03118 (2015)
Intel: Intel Advisor Roofline (2017)
Marques, D., Duarte, H., Ilic, A., Sousa, L., Belenov, R., Thierry, P., Matveev, Z.A.: Performance analysis with cache-aware roofline model in intel advisor. In: 2017 International Conference on High Performance Computing Simulation (HPCS), pp. 898–907, July 2017
Acknowledgments
We would like to acknowledge COST Action IC1305 (NESUS) and Atos for funding parts of this work, as well as national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2013.
Some experiments presented in this paper were carried out using the PLAFRIM experimental testbed, developed under the Inria PlaFRIM development action with support from Bordeaux INP, LaBRI, IMB and other entities: Conseil Régional d'Aquitaine, Université de Bordeaux and CNRS (and ANR, in accordance with the Programme d'Investissements d'Avenir, see https://www.plafrim.fr/).
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Denoyelle, N., Goglin, B., Ilic, A., Jeannot, E., Sousa, L. (2018). Modeling Large Compute Nodes with Heterogeneous Memories with Cache-Aware Roofline Model. In: Jarvis, S., Wright, S., Hammond, S. (eds.) High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation. PMBS 2017. Lecture Notes in Computer Science, vol. 10724. Springer, Cham. https://doi.org/10.1007/978-3-319-72971-8_5
DOI: https://doi.org/10.1007/978-3-319-72971-8_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-72970-1
Online ISBN: 978-3-319-72971-8
eBook Packages: Computer Science (R0)