ComDetective: a lightweight communication detection tool for threads

Authors:

Muhammad Aditya Sasongko,

Palwisha Akhtar,

Didem Unat

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 18, Pages 1 - 21

https://doi.org/10.1145/3295500.3356214

Published: 17 November 2019

Abstract

Inter-thread communication is a vital performance indicator in shared-memory systems. Prior work on identifying inter-thread communication relied on hardware simulators or binary instrumentation and suffered from inaccuracy or high overheads in both space and time, making it impractical for production use. We propose ComDetective, which produces accurate communication matrices while introducing low runtime and memory overheads, making it practical for production use.
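To make the notion of a communication matrix concrete, the following is a minimal Python sketch, not ComDetective's implementation. It assumes communication has already been observed as hypothetical `(thread_a, thread_b, count)` events and that the matrix is symmetric, with entry `[i][j]` holding the amount of communication seen between threads `i` and `j`. The function name and event format are illustrative assumptions.

```python
def build_comm_matrix(num_threads, events):
    """Accumulate pairwise communication events into a symmetric matrix.

    `events` is an iterable of (thread_a, thread_b, count) tuples.
    This event format is an assumption made for illustration only.
    """
    matrix = [[0] * num_threads for _ in range(num_threads)]
    for a, b, count in events:
        if a == b:
            # A thread touching its own data is not inter-thread communication.
            continue
        matrix[a][b] += count
        matrix[b][a] += count  # keep the matrix symmetric
    return matrix


if __name__ == "__main__":
    sample_events = [(0, 1, 2), (1, 0, 1), (2, 2, 5)]
    for row in build_comm_matrix(3, sample_events):
        print(row)
```

A dense block of large entries near the diagonal of such a matrix would suggest that neighboring threads communicate heavily, which is the kind of pattern a thread-mapping decision could exploit.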

ComDetective employs hardware performance counters to sample memory-access events and uses hardware debug registers to sample communicating pairs of threads. It can classify communication between threads as true or false sharing. Its runtime and memory overheads are only 1.30X and 1.27X, respectively, for the 18 applications studied under a 500K sampling period. Using ComDetective, we produce insightful communication matrices for microbenchmarks, the PARSEC benchmark suite, and several CORAL applications, and compare the generated matrices against their MPI counterparts. Guided by ComDetective, we optimize a few codes and achieve up to 13% speedup.
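The distinction between true and false sharing can be sketched under the usual definition: two threads touching the same cache line share truly when their accessed byte ranges overlap, and falsely when they do not. The Python sketch below is an illustration only and is not taken from the tool. The 64-byte line size, the function name, and the simplification that an access never straddles two cache lines are all assumptions, and the check that at least one access is a write is omitted.

```python
CACHE_LINE_BYTES = 64  # assumed line size; real hardware may differ


def classify_sharing(addr_a, size_a, addr_b, size_b, line_bytes=CACHE_LINE_BYTES):
    """Classify two accesses made by different threads.

    Returns "none" if the accesses fall on different cache lines,
    "true" if they are on the same line and their byte ranges overlap,
    and "false" if they are on the same line but touch disjoint bytes.
    Assumes neither access straddles a cache-line boundary.
    """
    if addr_a // line_bytes != addr_b // line_bytes:
        return "none"
    # Half-open ranges [addr, addr + size) overlap when each starts before the other ends.
    overlap = addr_a < addr_b + size_b and addr_b < addr_a + size_a
    return "true" if overlap else "false"


if __name__ == "__main__":
    print(classify_sharing(0x1000, 8, 0x1004, 4))  # overlapping bytes on one line
    print(classify_sharing(0x1000, 4, 0x1008, 4))  # same line, disjoint bytes
    print(classify_sharing(0x1000, 4, 0x1040, 4))  # different lines
```

In a sampling-based setting, a check of this kind would only run on the sampled accesses, which is one reason such an approach can keep overhead low compared with instrumenting every memory access.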

Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2019

1921 pages

ISBN:9781450362290

DOI:10.1145/3295500

General Chair: Michela Taufer

Program Chairs: Pavan Balaji, Antonio J. Peña

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Artifacts Available

Qualifiers

Research-article

Funding Sources

Türkiye Bilimsel ve Teknolojik Araştirma Kurumu

Conference

SC '19

Sponsor:

SIGHPC

SC '19: The International Conference for High Performance Computing, Networking, Storage, and Analysis

November 17 - 19, 2019

Denver, Colorado

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

Total Citations: 11

Total Downloads: 622

Downloads (Last 12 months): 31

Downloads (Last 6 weeks): 3

Reflects downloads up to 15 Feb 2025
