TABARNAC: visualizing and resolving memory access issues on NUMA architectures

Authors: David Beniamine, Matthias Diener, Guillaume Huard, Philippe O. A. Navaux

VPA '15: Proceedings of the 2nd Workshop on Visual Performance Analysis

Article No.: 1, Pages 1 - 9

https://doi.org/10.1145/2835238.2835239

Published: 15 November 2015

Abstract

In modern parallel architectures, memory accesses are a common bottleneck, so optimizing the way applications access memory is an important way to improve both performance and energy consumption. Memory accesses matter even more on NUMA machines, where the access time to data depends on its location in memory. Many efforts have been made to develop adaptive tools that improve memory accesses at runtime by optimizing the mapping of data and threads to NUMA nodes. However, these tools cannot change the memory access pattern of the original application, so code written without considering memory performance might not benefit from them. Moreover, automatic mapping tools take time to converge towards the best mapping, losing optimization opportunities. A deeper understanding of the memory behavior can help optimize it, removing the need for runtime analysis.

In this paper, we present TABARNAC, a tool for analyzing the memory behavior of parallel applications with a focus on NUMA architectures. TABARNAC provides a new visualization of the memory access behavior, focusing on the distribution of accesses by thread and by structure. Such visualization allows the developer to easily understand why performance issues occur and how to fix them. Using TABARNAC, we explain why some applications do not benefit from data and thread mapping. Moreover, we propose several code modifications to improve the memory access behavior of several parallel applications.
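
As a rough illustration of what a "distribution of accesses by thread and by structure" can look like, the sketch below aggregates a synthetic access trace per structure and per page, then reports which thread dominates each page. It is not TABARNAC's implementation: the trace layout `(thread_id, structure, page)`, the structure name `grid`, and both helper functions are assumptions made for this example only.

```python
from collections import Counter, defaultdict


def access_distribution(trace):
    """Count accesses per (structure, page), broken down by thread.

    `trace` is an iterable of (thread_id, structure, page) tuples; this
    record layout is hypothetical and chosen only for the example.
    """
    counts = defaultdict(Counter)
    for thread_id, structure, page in trace:
        counts[(structure, page)][thread_id] += 1
    return counts


def dominant_thread_share(counts):
    """Map each (structure, page) to (dominant thread, its share of accesses)."""
    result = {}
    for key, per_thread in counts.items():
        thread_id, hits = per_thread.most_common(1)[0]
        result[key] = (thread_id, hits / sum(per_thread.values()))
    return result


# Synthetic trace: thread 0 issues most accesses to page 0 of "grid",
# while page 1 is accessed equally by threads 0 and 1.
trace = (
    [(0, "grid", 0)] * 8
    + [(1, "grid", 0)] * 2
    + [(0, "grid", 1)] * 5
    + [(1, "grid", 1)] * 5
)

shares = dominant_thread_share(access_distribution(trace))
for (structure, page), (thread_id, share) in sorted(shares.items()):
    print(f"{structure} page {page}: thread {thread_id} issues {share:.0%} of accesses")
```

In this toy trace, a page dominated by one thread is a natural candidate for placement on that thread's NUMA node, whereas an evenly shared page gains little from any single placement; this is the kind of distinction a per-thread, per-structure view is meant to expose.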

Published In

VPA '15: Proceedings of the 2nd Workshop on Visual Performance Analysis

November 2015

44 pages

ISBN: 9781450340137

DOI: 10.1145/2835238

Conference Chairs:
Peer-Timo Bremer (Lawrence Livermore National Laboratory),
Bernd Mohr (Jülich Supercomputing Centre),
Valerio Pascucci (University of Utah),
Martin Schulz (Lawrence Livermore National Laboratory)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS\DATC: IEEE Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 15, 2015

Austin, Texas

Acceptance Rates

VPA '15 Paper Acceptance Rate 5 of 6 submissions, 83%;

Overall Acceptance Rate 5 of 6 submissions, 83%

Bibliometrics & Citations

Total Citations: 31
Total Downloads: 203
Downloads (Last 12 months): 21
Downloads (Last 6 weeks): 0

Reflects downloads up to 17 Feb 2025

Cited By

Blanco A, Bergel A, Alcocer J (2022). Software Visualizations to Analyze Memory Consumption: A Literature Review. ACM Computing Surveys 55:1 (1-34). Online publication date: 17-Jan-2022.
https://dl.acm.org/doi/10.1145/3485134
Pasqualin D, Diener M, Du Bois A, Pilla M (2022). Sharing-Aware Data Mapping in Software Transactional Memory. Embedded Computer Systems: Architectures, Modeling, and Simulation (481-492). Online publication date: 27-Apr-2022.
https://doi.org/10.1007/978-3-031-04580-6_32
Zhao X, Zhou J, Guan H, Wang W, Liu X, Liu T, Zhou H, Moreira J, Mueller F, Etsion Y (2021). NumaPerf. Proceedings of the 35th ACM International Conference on Supercomputing (52-62). Online publication date: 3-Jun-2021.
https://dl.acm.org/doi/10.1145/3447818.3460361
Sánchez Barrera I, Black-Schaffer D, Casas M, Moretó M, Stupnikova A, Popov M, Ayguadé E, Hwu W, Badia R, Hofstee H (2020). Modeling and optimizing NUMA effects and prefetching with machine learning. Proceedings of the 34th ACM International Conference on Supercomputing (1-13). Online publication date: 29-Jun-2020.
https://dl.acm.org/doi/10.1145/3392717.3392765
Pasqualin D, Diener M, Du Bois A, Pilla M (2020). Online Sharing-Aware Thread Mapping in Software Transactional Memory. 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) (35-42). Online publication date: Sep-2020.
https://doi.org/10.1109/SBAC-PAD49847.2020.00016
Popov M, Jimborean A, Black-Schaffer D, Eigenmann R, Ding C, McKee S (2019). Efficient thread/page/parallelism autotuning for NUMA systems. Proceedings of the ACM International Conference on Supercomputing (342-353). Online publication date: 26-Jun-2019.
https://dl.acm.org/doi/10.1145/3330345.3330376
Trahay F, Selva M, Morel L, Marquet K (2018). NumaMMA. Proceedings of the 47th International Conference on Parallel Processing (1-10). Online publication date: 13-Aug-2018.
https://dl.acm.org/doi/10.1145/3225058.3225094
Valat S, Bouizi O (2018). NUMAPROF, A NUMA Memory Profiler. Euro-Par 2018: Parallel Processing Workshops (159-170). Online publication date: 31-Dec-2018.
https://doi.org/10.1007/978-3-030-10549-5_13
Chang C, Srirama S, Buyya R (2016). Mobile Cloud Business Process Management System for the Internet of Things. ACM Computing Surveys 49:4 (1-42). Online publication date: 20-Dec-2016.
https://dl.acm.org/doi/10.1145/3012000
Diener M, Cruz E, Alves M, Navaux P, Koren I (2016). Affinity-Based Thread and Data Mapping in Shared Memory Systems. ACM Computing Surveys 49:4 (1-38). Online publication date: 5-Dec-2016.
https://dl.acm.org/doi/10.1145/3006385
