SHERIFF: precise detection and automatic mitigation of false sharing

Published: 22 October 2011

Abstract

False sharing is an insidious problem for multithreaded programs running on multicore processors, where it can silently degrade performance and scalability. Previous tools for detecting false sharing are severely limited: they cannot distinguish false sharing from true sharing, have high false positive rates, and provide limited assistance to help programmers locate and resolve false sharing.
This paper presents two tools that attack the problem of false sharing: Sheriff-Detect and Sheriff-Protect. Both tools leverage a framework we introduce here called Sheriff. Sheriff breaks out threads into separate processes, and exposes an API that allows programs to perform per-thread memory isolation and tracking on a per-page basis. We believe Sheriff is of independent interest.
Sheriff-Detect finds instances of false sharing by comparing updates within the same cache lines by different threads, and uses sampling to rank them by performance impact. Sheriff-Detect is precise (no false positives), runs with low overhead (on average, 20%), and is accurate, pinpointing the exact objects involved in false sharing. We present a case study demonstrating Sheriff-Detect's effectiveness at locating false sharing in a variety of benchmarks.
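
The idea of comparing updates within the same cache line can be illustrated with a minimal sketch. This is not Sheriff-Detect's implementation: the 64-byte line size, the 4-byte word granularity, the function names, and the rule "two or more threads wrote disjoint words in one line" are all assumptions made for illustration. The sketch diffs each thread's private copy of a page against a shared snapshot, so it would miss a write that stores an unchanged value, and it does not model the sampling used to rank findings.

```python
from collections import defaultdict

CACHE_LINE = 64  # bytes per cache line (assumed; typical on x86)
WORD = 4         # comparison granularity in bytes (assumed)


def written_words(snapshot, private_copy):
    """Offsets of words that differ between the snapshot and a thread's copy."""
    return {
        off
        for off in range(0, len(snapshot), WORD)
        if snapshot[off:off + WORD] != private_copy[off:off + WORD]
    }


def find_false_sharing(snapshot, copies):
    """Map cache-line index -> {thread id: written offsets} for lines in which
    at least two threads wrote, but never to the same word."""
    writers = defaultdict(lambda: defaultdict(set))
    for tid, copy in copies.items():
        for off in written_words(snapshot, copy):
            writers[off // CACHE_LINE][tid].add(off)

    report = {}
    for line, by_thread in writers.items():
        if len(by_thread) < 2:
            continue  # only one writer: no sharing at all
        sets = list(by_thread.values())
        overlap = any(a & b for i, a in enumerate(sets) for b in sets[i + 1:])
        if not overlap:  # a shared word would indicate true sharing
            report[line] = {tid: sorted(offs) for tid, offs in by_thread.items()}
    return report


# Hypothetical 256-byte "page" with two threads.
page = bytes(256)
t1, t2 = bytearray(page), bytearray(page)
t1[0:4] = (1).to_bytes(4, "little")      # thread 1 writes word 0   (line 0)
t2[8:12] = (2).to_bytes(4, "little")     # thread 2 writes word 8   (line 0)
t1[128:132] = (3).to_bytes(4, "little")  # both threads write word 128 (line 2)
t2[128:132] = (4).to_bytes(4, "little")

report = find_false_sharing(page, {1: bytes(t1), 2: bytes(t2)})
print(report)
```

In this example only line 0 is reported, because the two threads touch different words there; line 2 is skipped because both threads wrote the same word, which the sketch treats as true sharing.
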
Rewriting a program to fix false sharing can be infeasible when source is unavailable, or undesirable when padding objects would unacceptably increase memory consumption or further worsen runtime performance. Sheriff-Protect mitigates false sharing by adaptively isolating shared updates from different threads into separate physical addresses, effectively eliminating most of the performance impact of false sharing. We show that Sheriff-Protect can improve performance for programs with catastrophic false sharing by up to 9×, without programmer intervention.
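
The memory cost of padding mentioned above can be made concrete with a small worked calculation. The 64-byte cache line, the 8-byte object size, and the helper names are assumptions chosen for illustration, not figures from the paper.

```python
CACHE_LINE = 64  # bytes per cache line (assumed)


def padded_size(obj_size, line=CACHE_LINE):
    """Round an object size up to the next multiple of the cache-line size."""
    return -(-obj_size // line) * line


def padding_overhead(obj_size, count, line=CACHE_LINE):
    """Return (packed bytes, padded bytes, growth factor) for `count` objects."""
    packed = obj_size * count
    padded = padded_size(obj_size, line) * count
    return packed, padded, padded / packed


# One million hypothetical 8-byte counters, each padded to its own line.
packed, padded, factor = padding_overhead(8, 1_000_000)
print(packed, padded, factor)
```

Under these assumptions, giving every 8-byte counter its own cache line grows the array from 8 MB to 64 MB, an eightfold increase, which is the kind of cost that can make manual padding undesirable.
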

Published In

ACM SIGPLAN Notices, Volume 46, Issue 10
OOPSLA '11
October 2011
1063 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2076021

OOPSLA '11: Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
October 2011
1104 pages
ISBN: 9781450309400
DOI: 10.1145/2048066
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. false sharing
  2. multi-threaded

Qualifiers

  • Research-article
