research-article

SHERIFF: precise detection and automatic mitigation of false sharing

Authors:

Emery D. Berger

ACM SIGPLAN Notices, Volume 46, Issue 10

Pages 3 - 18

https://doi.org/10.1145/2076021.2048070

Published: 22 October 2011

Abstract

False sharing is an insidious problem for multithreaded programs running on multicore processors, where it can silently degrade performance and scalability. Previous tools for detecting false sharing are severely limited: they cannot distinguish false sharing from true sharing, have high false positive rates, and provide limited assistance to help programmers locate and resolve false sharing.

This paper presents two tools that attack the problem of false sharing: Sheriff-Detect and Sheriff-Protect. Both tools leverage a framework we introduce here called Sheriff. Sheriff breaks out threads into separate processes, and exposes an API that allows programs to perform per-thread memory isolation and tracking on a per-page basis. We believe Sheriff is of independent interest.
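As a rough illustration of per-thread isolation and per-page tracking, the following Python sketch simulates the idea in user space. Each simulated thread works on a private copy of a page, and at a synchronization point only the bytes that differ from a pristine snapshot are committed back to the shared page. The names `PrivatePage`, `commit`, and `PAGE_SIZE` are hypothetical and are not Sheriff's API, and the snapshot-and-diff strategy is an assumption made for this sketch; the framework described above uses separate processes rather than byte arrays.

```python
from typing import List

PAGE_SIZE = 4096  # assumed page size in bytes


class PrivatePage:
    """A simulated thread's private view of one shared page (hypothetical)."""

    def __init__(self, shared: bytearray):
        self.shared = shared
        self.twin = bytes(shared)          # pristine snapshot taken at isolation time
        self.working = bytearray(shared)   # private copy that the thread mutates

    def write(self, offset: int, value: int) -> None:
        self.working[offset] = value

    def dirty_offsets(self) -> List[int]:
        """Offsets whose bytes differ from the snapshot (the tracked updates)."""
        return [i for i in range(len(self.twin)) if self.working[i] != self.twin[i]]

    def commit(self) -> List[int]:
        """Copy only the modified bytes into the shared page; return their offsets."""
        dirty = self.dirty_offsets()
        for i in dirty:
            self.shared[i] = self.working[i]
        # Start a new tracking interval from the current shared state.
        self.twin = bytes(self.shared)
        self.working = bytearray(self.shared)
        return dirty


shared_page = bytearray(PAGE_SIZE)
t1 = PrivatePage(shared_page)
t2 = PrivatePage(shared_page)

t1.write(0, 7)  # simulated thread 1 updates byte 0
t2.write(8, 9)  # simulated thread 2 updates byte 8 of the same page

# Before committing, neither write is visible in the shared page.
visible_before = (shared_page[0], shared_page[8])

dirty1 = t1.commit()
dirty2 = t2.commit()
visible_after = (shared_page[0], shared_page[8])
```

Because each commit copies only the bytes its owner changed, the second commit does not overwrite the first thread's update even though both started from the same snapshot.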

Sheriff-Detect finds instances of false sharing by comparing updates within the same cache lines by different threads, and uses sampling to rank them by performance impact. Sheriff-Detect is precise (no false positives), runs with low overhead (on average, 20%), and is accurate, pinpointing the exact objects involved in false sharing. We present a case study demonstrating Sheriff-Detect's effectiveness at locating false sharing in a variety of benchmarks.
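To make the comparison of updates within a cache line concrete, here is a minimal Python sketch that classifies cache lines from per-thread write sets. It assumes a 64-byte cache line and 4-byte words; the function `find_false_sharing` and its input format are hypothetical. It does not model the sampling or ranking by performance impact, and it treats any line containing a commonly written word as true sharing, which is a simplification.

```python
from collections import defaultdict
from typing import Dict, List, Set

CACHE_LINE = 64  # assumed cache line size in bytes
WORD = 4         # assumed tracking granularity in bytes


def find_false_sharing(writes: Dict[int, Set[int]]) -> List[int]:
    """Return cache-line indices written by 2+ threads at disjoint words.

    `writes` maps a thread id to the set of byte addresses it modified.
    """
    # line index -> thread id -> set of word indices that thread wrote
    per_line: Dict[int, Dict[int, Set[int]]] = defaultdict(lambda: defaultdict(set))
    for tid, addrs in writes.items():
        for addr in addrs:
            per_line[addr // CACHE_LINE][tid].add(addr // WORD)

    falsely_shared = []
    for line, by_thread in sorted(per_line.items()):
        if len(by_thread) < 2:
            continue  # a single writer means no sharing at all
        # A word written by more than one thread indicates true sharing.
        all_words = [w for words in by_thread.values() for w in words]
        if len(all_words) == len(set(all_words)):
            falsely_shared.append(line)
    return falsely_shared


writes = {
    1: {0, 128, 256},  # addresses modified by thread 1
    2: {8, 128, 320},  # addresses modified by thread 2
}
result = find_false_sharing(writes)
```

In this input, line 0 is written by both threads at different words and is flagged, while line 2 (address 128) is written by both threads at the same word and is therefore treated as true sharing.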

Rewriting a program to fix false sharing can be infeasible when source is unavailable, or undesirable when padding objects would unacceptably increase memory consumption or further worsen runtime performance. Sheriff-Protect mitigates false sharing by adaptively isolating shared updates from different threads into separate physical addresses, effectively eliminating most of the performance impact of false sharing. We show that Sheriff-Protect can improve performance for programs with catastrophic false sharing by up to 9×, without programmer intervention.
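The memory cost of padding mentioned above can be seen in a small layout experiment. The sketch below uses Python's `ctypes` to compare a structure holding two adjacent 8-byte counters with a variant in which each counter is padded to its own cache line. The structure names and the 64-byte line size are assumptions chosen for illustration, not taken from the paper's benchmarks.

```python
import ctypes

CACHE_LINE = 64  # assumed cache line size in bytes


class Counters(ctypes.Structure):
    """Two per-thread counters packed next to each other."""
    _fields_ = [("a", ctypes.c_int64), ("b", ctypes.c_int64)]


class PaddedCounters(ctypes.Structure):
    """The same counters, each padded out to its own cache line."""
    _fields_ = [
        ("a", ctypes.c_int64),
        ("_pad_a", ctypes.c_char * (CACHE_LINE - 8)),
        ("b", ctypes.c_int64),
        ("_pad_b", ctypes.c_char * (CACHE_LINE - 8)),
    ]


def same_line(struct_type) -> bool:
    """True if fields a and b fall in one cache line (base assumed line-aligned)."""
    return struct_type.a.offset // CACHE_LINE == struct_type.b.offset // CACHE_LINE


packed_size = ctypes.sizeof(Counters)        # 16 bytes
padded_size = ctypes.sizeof(PaddedCounters)  # 128 bytes
```

Padding separates the two counters into different cache lines, but this particular structure grows from 16 to 128 bytes, which is the kind of memory overhead that can make manual padding undesirable for large arrays of small objects.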

Published In

ACM SIGPLAN Notices Volume 46, Issue 10

OOPSLA '11

October 2011

1063 pages

ISSN: 0362-1340

EISSN: 1558-1160

DOI: 10.1145/2076021

OOPSLA '11: Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
October 2011
1104 pages
ISBN: 9781450309400
DOI: 10.1145/2048066
General Chair: Cristina Videira Lopes, University of California, Irvine, USA
Program Chair: Kathleen Fisher, Tufts University, USA

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
