research-article

ComDetective: a lightweight communication detection tool for threads

Published: 17 November 2019

Abstract

Inter-thread communication is a vital performance indicator in shared-memory systems. Prior work on identifying inter-thread communication employed hardware simulators or binary instrumentation and suffered from inaccuracy or high overheads in both space and time, making it impractical for production use. We propose ComDetective, which produces accurate communication matrices while introducing low runtime and memory overheads, thus making it practical for production use.
ComDetective employs hardware performance counters to sample memory-access events and uses hardware debug registers to sample communicating pairs of threads. ComDetective can classify communication between threads as true or false sharing. Its runtime and memory overheads are only 1.30X and 1.27X, respectively, for the 18 applications studied under a 500K sampling period. Using ComDetective, we produce insightful communication matrices for microbenchmarks, the PARSEC benchmark suite, and several CORAL applications, and compare the generated matrices against their MPI counterparts. Guided by ComDetective, we optimize a few codes and achieve up to 13% speedup.
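
To make the notions of a communication matrix and of true versus false sharing concrete, the following is a minimal sketch, not ComDetective's implementation. It assumes a 64-byte cache line, a pre-collected and time-ordered list of sampled accesses, and that a communication event is counted only when consecutive accesses to the same line come from different threads and at least one of them is a write. The names `classify`, `build_matrices`, and the sample tuple layout are hypothetical.

```python
CACHE_LINE = 64  # bytes; assumed cache-line size, not taken from the paper


def classify(addr_a, size_a, addr_b, size_b, line=CACHE_LINE):
    """Classify two accesses made by different threads.

    Returns "true" if they touch overlapping bytes of the same cache line,
    "false" if they share the line but touch disjoint bytes, and None if
    they fall on different lines. Accesses straddling two lines are not
    handled in this sketch.
    """
    if addr_a // line != addr_b // line:
        return None
    overlap = addr_a < addr_b + size_b and addr_b < addr_a + size_a
    return "true" if overlap else "false"


def build_matrices(samples, num_threads):
    """Accumulate symmetric true- and false-sharing matrices.

    samples: iterable of (tid, addr, size, is_write) in time order
             (a hypothetical layout for sampled memory accesses).
    """
    true_m = [[0] * num_threads for _ in range(num_threads)]
    false_m = [[0] * num_threads for _ in range(num_threads)]
    last = {}  # cache-line index -> most recent access to that line

    for tid, addr, size, is_write in samples:
        line = addr // CACHE_LINE
        prev = last.get(line)
        # Count an event only across threads and only if a write is involved.
        if prev is not None and prev[0] != tid and (is_write or prev[3]):
            kind = classify(prev[1], prev[2], addr, size)
            matrix = true_m if kind == "true" else false_m
            matrix[prev[0]][tid] += 1
            matrix[tid][prev[0]] += 1
        last[line] = (tid, addr, size, is_write)

    return true_m, false_m


if __name__ == "__main__":
    samples = [
        (0, 0x1000, 8, True),   # thread 0 writes bytes 0..7 of a line
        (1, 0x1000, 8, False),  # thread 1 reads the same bytes
        (1, 0x1008, 8, True),   # thread 1 writes bytes 8..15
        (0, 0x1000, 8, False),  # thread 0 reads bytes 0..7 again
    ]
    true_m, false_m = build_matrices(samples, num_threads=2)
    print("true sharing :", true_m)
    print("false sharing:", false_m)
```

In this sketch, entry `[i][j]` of each matrix counts the events observed between threads `i` and `j`, so heavily communicating pairs show up as large off-diagonal values.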



Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN:9781450362290
DOI:10.1145/3295500

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. communication matrix
  2. debug registers
  3. false sharing
  4. hardware performance counters
  5. inter-thread communication
  6. sampling

Qualifiers

  • Research-article

Funding Sources

  • Türkiye Bilimsel ve Teknolojik Araştirma Kurumu

Conference

SC '19
Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%


Cited By

  • (2024) Snoopie: A Multi-GPU Communication Profiler and Visualizer. Proceedings of the 38th ACM International Conference on Supercomputing, 525-536. DOI: 10.1145/3650200.3656597. Online publication date: 30-May-2024
  • (2024) Multi-level Memory-Centric Profiling on ARM Processors with ARM SPE. Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, 996-1005. DOI: 10.1109/SCW63240.2024.00139. Online publication date: 17-Nov-2024
  • (2023) TrivialSpy: Identifying Software Triviality via Fine-grained and Dataflow-based Value Profiling. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-13. DOI: 10.1145/3581784.3607052. Online publication date: 12-Nov-2023
  • (2023) Precise Event Sampling on AMD Versus Intel: Quantitative and Qualitative Comparison. IEEE Transactions on Parallel and Distributed Systems 34:5, 1594-1608. DOI: 10.1109/TPDS.2023.3257105. Online publication date: May-2023
  • (2023) Precise event sampling-based data locality tools for AMD multicore architectures. Concurrency and Computation: Practice and Experience 35:24. DOI: 10.1002/cpe.7707. Online publication date: 3-Apr-2023
  • (2022) Monitoring Collective Communication Among GPUs. Euro-Par 2021: Parallel Processing Workshops, 41-52. DOI: 10.1007/978-3-031-06156-1_4. Online publication date: 9-Jun-2022
  • (2021) CMLB: a Communication-aware and Memory Load Balance Mapping Optimization for Modern NUMA Systems. 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), 579-586. DOI: 10.1109/HPCC-DSS-SmartCity-DependSys53884.2021.00099. Online publication date: Dec-2021
  • (2021) ComScribe: Identifying Intra-node GPU Communication. Benchmarking, Measuring, and Optimizing, 157-174. DOI: 10.1007/978-3-030-71058-3_10. Online publication date: 2-Mar-2021
  • (2021) Characterizing the Sharing Behavior of Applications Using Software Transactional Memory. Benchmarking, Measuring, and Optimizing, 3-21. DOI: 10.1007/978-3-030-71058-3_1. Online publication date: 2-Mar-2021
  • (2020) Performance Modeling and Evaluation of a Production Disaggregated Memory System. Proceedings of the International Symposium on Memory Systems, 223-232. DOI: 10.1145/3422575.3422795. Online publication date: 28-Sep-2020
