Research article, DOI: 10.1145/2835238.2835239

TABARNAC: visualizing and resolving memory access issues on NUMA architectures

Published: 15 November 2015

Abstract

In modern parallel architectures, memory accesses are a common bottleneck, so optimizing the way applications access memory is an important way to improve performance and energy consumption. Memory accesses matter even more on NUMA machines, where the access time to data depends on its location in memory. Many efforts have been made to develop adaptive tools that improve memory accesses at runtime by optimizing the mapping of data and threads to NUMA nodes. However, these tools cannot change the memory access pattern of the original application, so code written without considering memory performance might not benefit from them. Moreover, automatic mapping tools take time to converge towards the best mapping, losing optimization opportunities. A deeper understanding of the memory behavior can help optimize it, removing the need for runtime analysis.
In this paper, we present TABARNAC, a tool for analyzing the memory behavior of parallel applications with a focus on NUMA architectures. TABARNAC provides a new visualization of the memory access behavior, focusing on the distribution of accesses by thread and by structure. This visualization allows the developer to easily understand why performance issues occur and how to fix them. Using TABARNAC, we explain why some applications do not benefit from data and thread mapping. Moreover, we propose several code modifications to improve the memory access behavior of several parallel applications.
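
As a rough illustration of what a "distribution of accesses by thread and by structure" could look like, the sketch below tallies a small, made-up access trace. The record format `(thread_id, structure_name, page_index)`, the `TRACE` data, and both function names are assumptions made for this example; they are not TABARNAC's actual trace format or API.

```python
from collections import Counter, defaultdict

# Hypothetical trace records: (thread_id, structure_name, page_index).
# The format and the values are assumptions for illustration only.
TRACE = [
    (0, "A", 0), (0, "A", 0), (0, "A", 1),
    (1, "A", 0), (1, "A", 2), (1, "B", 0),
    (2, "B", 0), (2, "B", 1), (2, "B", 1),
]


def accesses_by_structure_and_thread(trace):
    """Count accesses per thread for each data structure."""
    counts = defaultdict(Counter)
    for thread_id, structure, _page in trace:
        counts[structure][thread_id] += 1
    return counts


def thread_shares(counts):
    """Turn per-thread counts into fractions of each structure's total accesses."""
    shares = {}
    for structure, per_thread in counts.items():
        total = sum(per_thread.values())
        shares[structure] = {t: n / total for t, n in per_thread.items()}
    return shares


counts = accesses_by_structure_and_thread(TRACE)
shares = thread_shares(counts)
for structure in sorted(shares):
    print(structure, dict(sorted(shares[structure].items())))
```

In this toy trace, a structure whose accesses come mostly from one thread is a plausible candidate for placement on that thread's NUMA node, while a structure accessed evenly by all threads would gain little from any single placement.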

    Published In

    VPA '15: Proceedings of the 2nd Workshop on Visual Performance Analysis
    November 2015
    44 pages
    ISBN: 9781450340137
    DOI: 10.1145/2835238
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Conference

    SC15
    Acceptance Rates

    VPA '15 Paper Acceptance Rate 5 of 6 submissions, 83%;
    Overall Acceptance Rate 5 of 6 submissions, 83%

    Cited By

    • (2022) Software Visualizations to Analyze Memory Consumption: A Literature Review. ACM Computing Surveys 55:1 (1-34). DOI: 10.1145/3485134. Online publication date: 17-Jan-2022
    • (2022) Sharing-Aware Data Mapping in Software Transactional Memory. Embedded Computer Systems: Architectures, Modeling, and Simulation (481-492). DOI: 10.1007/978-3-031-04580-6_32. Online publication date: 27-Apr-2022
    • (2021) NumaPerf. Proceedings of the 35th ACM International Conference on Supercomputing (52-62). DOI: 10.1145/3447818.3460361. Online publication date: 3-Jun-2021
    • (2020) Modeling and optimizing NUMA effects and prefetching with machine learning. Proceedings of the 34th ACM International Conference on Supercomputing (1-13). DOI: 10.1145/3392717.3392765. Online publication date: 29-Jun-2020
    • (2020) Online Sharing-Aware Thread Mapping in Software Transactional Memory. 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) (35-42). DOI: 10.1109/SBAC-PAD49847.2020.00016. Online publication date: Sep-2020
    • (2019) Efficient thread/page/parallelism autotuning for NUMA systems. Proceedings of the ACM International Conference on Supercomputing (342-353). DOI: 10.1145/3330345.3330376. Online publication date: 26-Jun-2019
    • (2018) NumaMMA. Proceedings of the 47th International Conference on Parallel Processing (1-10). DOI: 10.1145/3225058.3225094. Online publication date: 13-Aug-2018
    • (2018) NUMAPROF, A NUMA Memory Profiler. Euro-Par 2018: Parallel Processing Workshops (159-170). DOI: 10.1007/978-3-030-10549-5_13. Online publication date: 31-Dec-2018
    • (2016) Mobile Cloud Business Process Management System for the Internet of Things. ACM Computing Surveys 49:4 (1-42). DOI: 10.1145/3012000. Online publication date: 20-Dec-2016
    • (2016) Affinity-Based Thread and Data Mapping in Shared Memory Systems. ACM Computing Surveys 49:4 (1-38). DOI: 10.1145/3006385. Online publication date: 5-Dec-2016
