Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3314221.3314644acmconferencesArticle/Chapter ViewAbstractPublication PagespldiConference Proceedingsconference-collections
research-article
Public Access

Huron: hybrid false sharing detection and repair

Published: 08 June 2019 Publication History

Abstract

Writing efficient multithreaded code that can leverage the full parallelism of underlying hardware is difficult. A key impediment is insidious cache contention issues, such as false sharing. False sharing occurs when multiple threads from different cores access disjoint portions of the same cache line, causing it to go back and forth between the caches of different cores and leading to substantial slowdown.
Alas, existing techniques for detecting and repairing false sharing have limitations. On the one hand, in-house (i.e., offline) techniques are limited to situations where falsely-shared data can be determined statically, and are otherwise inaccurate. On the other hand, in-production (i.e., run-time) techniques incur considerable overhead, as they constantly monitor a program to detect false sharing. In-production repair techniques are also limited by the types of modifications they can perform on the fly, and are therefore less effective.
We present Huron, a hybrid in-house/in-production false sharing detection and repair system. Huron detects and repairs as much false sharing as it can in-house, and relies on its lightweight in-production mechanism for remaining cases. The key idea behind Huron's in-house false sharing repair is to group together data that is accessed by the same set of threads, to shift falsely-shared data to different cache lines. Huron's in-house repair technique can generalize to previously-unobserved inputs. Our evaluation shows that Huron can detect more false sharing bugs than all state-of-the-art techniques, and with a lower overhead. Huron improves runtime performance by 3.82× on average (up to 11×), which is 2.11-2.27× better than the state of the art.

Supplementary Material

WEBM File (p453-khan.webm)
MP4 File (3314221.3314644.mp4)
Video Presentation

References

[1]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: a system for large-scale machine learning. In OSDI, Vol. 16. 265–283.
[2]
Lars Ole Andersen. 1994. Program analysis and specialization for the C programming language. Ph.D. Dissertation. University of Cophenhagen.
[3]
Jennifer M Anderson and Monica S Lam. 1993. Global optimizations for parallelism and locality on scalable parallel machines. In ACM Sigplan Notices, Vol. 28. ACM, 112–125.
[4]
Jon Petter Asen, Jo Inge Buskenes, Carl-Inge Colombo Nilsen, Andreas Austeng, and Sverre Holm. 2014. Implementing capon beamforming on a GP U for real-time cardiac ultrasound imaging. IEEE transactions on ultrasonics, ferroelectrics, and frequency control 61, 1 (2014), 76–85.
[5]
Prithviraj Banerjee, John A Chandy, Manish Gupta, Eugene W Hodges, John G Holm, Antonio Lain, Daniel J Palermo, Shankar Ramaswamy, and Ernesto Su. 1995. The PARADIGM compiler for distributedmemory multicomputers. Computer 28, 10 (1995), 37–47.
[6]
Emery D Berger, Ting Yang, Tongping Liu, and Gene Novark. 2009. Grace: safe multithreaded programming for C/C++. In ACM sigplan notices, Vol. 44. ACM, 81–96.
[7]
Christian Bienia, Sanjeev Kumar, and Kai Li. 2008. PARSEC vs. SPLASH-2: A quantitative comparison of two multithreaded benchmark suites on chip-multiprocessors. In Workload Characterization, 2008. IISWC 2008. IEEE International Symposium on. IEEE, 47–56.
[8]
William J Bolosky and Michael L Scott. 1993. False sharing and its effect on shared memory performance. In Proceedings of the Fourth symposium on Experiences with distributed and multiprocessor systems.
[9]
Milind Chabbi, Shasha Wen, and Xu Liu. 2018. Featherlight on-thefly false-sharing detection. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 152–167.
[10]
Jia Chen. 2018. Andersen’s inclusion-based pointer analysis reimplementation in LLVM. https://github.com/grievejia/andersen . {Online; accessed 16-Nov-2018}.
[11]
Mei-Ling Chiang, Chieh-Jui Yang, and Shu-Wei Tu. 2016. Kernel mechanisms with dynamic task-aware scheduling to reduce resource contention in NUMA multi-core systems. Journal of Systems and Software 121 (2016), 72–87.
[12]
Trishul M Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project Adam: Building an Efficient and Scalable Deep Learning Training System. In OSDI, Vol. 14. 571–582.
[13]
Jyh-Herng Chow and Vivek Sarkar. 1997. False sharing elimination by selection of runtime scheduling parameters. In Parallel Processing, 1997., Proceedings of the 1997 International Conference on. IEEE, 396– 403.
[14]
Daniel Clifford, Hannes Payer, Michael Stanton, and Ben L Titzer. 2015. Memento mori: dynamic allocation-site-based optimizations. In ACM SIGPLAN Notices, Vol. 50. ACM, 105–117.
[15]
Alexander Collins, Tim Harris, Murray Cole, and Christian Fensch. 2015. Lira: Adaptive contention-aware thread placement for parallel runtime systems. In Proceedings of the 5th International Workshop on Runtime and Operating Systems for Supercomputers. ACM, 2.
[16]
Emilio Coppa, Camil Demetrescu, and Irene Finocchi. 2012. Inputsensitive profiling. ACM SIGPLAN Notices 47, 6 (2012), 89–98.
[17]
Emilio Coppa, Camil Demetrescu, Irene Finocchi, and Romolo Marotta. 2013. Multithreaded input-sensitive profiling. arXiv preprint arXiv:1304.3804 (2013).
[18]
Intel Corparation. 2016. Intel (R) 64 and IA-32 Architectures Software Developer’s Manual. Combined Volumes, Dec (2016).
[19]
Florian David, Gaël Thomas, Julia L. Lawall, and Gilles Muller. 2014. Continuously measuring critical section pressure with the free-lunch profiler. In OOPSLA.
[20]
Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. 2012. Large scale distributed deep networks. In Advances in neural information processing systems. 1223–1231.
[21]
Christian DeLozier, Ariel Eizenberg, Shiliang Hu, Gilles Pokam, and Joseph Devietti. 2017. TMI: thread memory isolation for false sharing repair. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 639–650.
[22]
David Dice. 2011. False sharing induced by card table marking. https://blogs.oracle.com/dave/ false-sharing-induced-by-card-table-marking . {Online; last accessed 05-August-2018}.
[23]
Wei Ding and Mahmut Kandemir. 2014. CApRI: CAche-conscious data reordering for irregular codes. ACM SIGMETRICS Performance Evaluation Review 42, 1 (2014), 477–489.
[24]
Ariel Eizenberg, Shiliang Hu, Gilles Pokam, and Joseph Devietti. 2016. Remix: online detection and repair of cache contention for the JVM. In ACM SIGPLAN Notices, Vol. 51. ACM, 251–265.
[25]
Olga Golovanevsky, Alon Dayan, Ayal Zaks, and David Edelsohn. 2010. Trace-based data layout optimizations for multi-core processors. In International Conference on High-Performance Embedded Architectures and Compilers. Springer, 81–95.
[26]
Daniel Goodman, Georgios Varisteas, and Tim Harris. 2017. Pandia: comprehensive contention-sensitive thread placement. In Proceedings of the Twelfth European Conference on Computer Systems. ACM, 254– 269.
[27]
Joseph L Greathouse, Zhiqiang Ma, Matthew I Frank, Ramesh Peri, and Todd Austin. 2011. Demand-driven software race detection using hardware performance counters. In ACM SIGARCH Computer Architecture News, Vol. 39. ACM, 165–176.
[28]
Stephan M Günther and Josef Weidendorfer. 2009. Assessing cache false sharing effects by dynamic binary instrumentation. In Proceedings of the Workshop on Binary Instrumentation and Applications. ACM, 26– 33.
[29]
Changwan Hong, Wenlei Bao, Albert Cohen, Sriram Krishnamoorthy, Louis-Noël Pouchet, Fabrice Rastello, J Ramanujam, and Ponnuswamy Sadayappan. 2016. Effective padding of multidimensional arrays to avoid cache conflict misses. In ACM SIGPLAN Notices, Vol. 51. ACM, 129–144.
[30]
Intel. 2017. Tutorial: Identifying False Sharing - C Sample Code. https: //software.intel.com/en-us/vtune-memory-access-tutorial-linux-c
[31]
Sanath Jayasena, Saman Amarasinghe, Asanka Abeyweera, Gayashan Amarasinghe, Himeshi De Silva, Sunimal Rathnayake, Xiaoqiao Meng, and Yanbin Liu. 2013. Detection of false sharing using machine learning. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 30.
[32]
Tor E Jeremiassen and Susan J Eggers. 1995. Reducing false sharing on shared memory multiprocessors through compile time data transformations. Vol. 30. ACM.
[33]
Ismail Kadayif and Mahmut Kandemir. 2004. Quasidynamic layout optimizations for improving data locality. IEEE Transactions on Parallel & Distributed Systems 11 (2004), 996–1011.
[34]
M Kandemir, A Choudhary, and J Ramanujam. 1998. Improving locality in out-of-core computations using data layout transformations. In International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers. Springer, 359–366.
[35]
Mahmut Kandemir, Alok Choudhary, J Ramanujam, and Prithviraj Banerjee. 1999. A framework for interprocedural locality optimization using both loop and data layout transformations. In Parallel Processing, 1999. Proceedings. 1999 International Conference on. IEEE, 95–102.
[36]
Mahmut Kandemir, Alok Choudhary, J Ramanujam, and Prithviraj Banerjee. 1999. A graph based framework to detect optimal memory layouts for improving data locality. In Parallel Processing, 1999. 13th International and 10th Symposium on Parallel and Distributed Processing, 1999. 1999 IPPS/SPDP. Proceedings. IEEE, 738–743.
[37]
Mahmut Kandemir, Alok Choudhary, Jagannathan Ramanujam, Nagaraj Shenoy, and Prithviraj Banerjee. 1998. Enhancing spatial locality via data layout optimizations. In European Conference on Parallel Processing. Springer, 422–434.
[38]
Mahmut Kandemir, Alok Choudhary, J Ramaujam, and Prithviraj Banerjee. 1999. On reducing false sharing while improving locality on shared memory multiprocessors. In Parallel Architectures and Compilation Techniques, 1999. Proceedings. 1999 International Conference on. IEEE, 203–211.
[39]
Mahmut Kandemir and Ismail Kadayif. 2001. Compiler-directed selection of dynamic memory layouts. In Proceedings of the ninth international symposium on Hardware/software codesign. ACM, 219–224.
[40]
Ken Kennedy and Ulrich Kremer. 1995. Automatic data layout for high performance Fortran. In Proceedings of the 1995 ACM/IEEE conference on Supercomputing. ACM, 76.
[41]
Randall L. Hyde and Brett D. Fleisch. 1996. An Analysis of Degenerate Sharing and False Coherence. 34 (05 1996), 183–195.
[42]
Chris Lattner. 2008. LLVM and Clang: Next generation compiler technology. In The BSD conference. 1–2.
[43]
Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization. IEEE Computer Society, 75.
[44]
Ming Li, Shucheng Yu, Yao Zheng, Kui Ren, and Wenjing Lou. 2013. Scalable and secure sharing of personal health records in cloud computing using attribute-based encryption. IEEE transactions on parallel and distributed systems 24, 1 (2013), 131–143.
[45]
CL Liu. 2009. False sharing analysis for multithreaded programs. Master’s thesis, National Chung Cheng University (2009).
[46]
Tongping Liu and Emery D Berger. 2011. Sheriff: precise detection and automatic mitigation of false sharing. ACM Sigplan Notices 46, 10 (2011), 3–18.
[47]
Tongping Liu and Xu Liu. 2016. Cheetah: detecting false sharing efficiently and effectively. In Proceedings of the 2016 International Symposium on Code Generation and Optimization. ACM, 1–11.
[48]
Tongping Liu, Chen Tian, Ziang Hu, and Emery D. Berger. 2014. PREDATOR: Predictive False Sharing Detection. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’14). ACM, New York, NY, USA, 3–14.
[49]
Tongping Liu, Guangming Zeng, Abdullah Muzahid, et al. 2017. Syncperf: Categorizing, detecting, and diagnosing synchronization performance bugs. In Proceedings of the Twelfth European Conference on Computer Systems. ACM, 298–313.
[50]
Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. 2005. Pin: building customized program analysis tools with dynamic instrumentation. In Acm sigplan notices, Vol. 40. ACM, 190– 200.
[51]
Liang Luo, Akshitha Sriraman, Brooke Fugate, Shiliang Hu, Gilles Pokam, Chris J Newburn, and Joseph Devietti. 2016. LASER: Light, accurate sharing detection and repair. In High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on. IEEE, 261– 273.
[52]
Peter S Magnusson, Magnus Christensson, Jesper Eskilson, Daniel Forsgren, Gustav Hallberg, Johan Hogberg, Fredrik Larsson, Andreas Moestedt, and Bengt Werner. 2002. Simics: A full system simulation platform. Computer 35, 2 (2002), 50–58.
[53]
Joe Mario. 2016. C2C - False Sharing Detection in Linux Perf. https: //joemario.github.io/blog/2016/09/01/c2c-blog/ . {Online; last accessed 04-August-2018}.
[54]
Jason Mars and Lingjia Tang. 2013. Understanding application contentiousness and sensitivity on modern multicores. In Advances in Computers. Vol. 91. Elsevier, 59–85.
[55]
Robert Martin, John Demme, and Simha Sethumadhavan. 2012. TimeWarp: rethinking timekeeping and performance monitoring mechanisms to mitigate side-channel attacks. ACM SIGARCH Computer Architecture News 40, 3 (2012), 118–129.
[56]
Svetozar Miucin, Conor Brady, and Alexandra Fedorova. 2016. End-toend memory behavior profiling with DINAMITE. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 1042–1046.
[57]
Mihir Nanavati, Mark Spear, Nathan Taylor, Shriram Rajagopalan, Dutch T Meyer, William Aiello, and Andrew Warfield. 2013. Whose cache line is it anyway?: operating system support for live detection and repair of false sharing. In Proceedings of the 8th ACM European Conference on Computer Systems. ACM, 141–154.
[58]
Nicholas Nethercote and Julian Seward. 2007. Valgrind: a framework for heavyweight dynamic binary instrumentation. In ACM Sigplan notices, Vol. 42. ACM, 89–100.
[59]
Adrian Nistor, Linhai Song, Darko Marinov, and Shan Lu. 2013. Toddler: Detecting performance problems via similar memory-access patterns. In Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 562–571.
[60]
Aleksey Pesterev, Nickolai Zeldovich, and Robert T Morris. 2010. Locating cache performance bottlenecks using data profiling. In Proceedings of the 5th European conference on Computer systems. ACM, 335–348.
[61]
Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis. 2007. Evaluating mapreduce for multi-core and multiprocessor systems. In High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on. Ieee, 13–24.
[62]
Shai Rubin, Rastislav Bodík, and Trishul Chilimbi. 2002. An efficient profile-analysis framework for data-layout optimizations. In ACM SIGPLAN Notices, Vol. 37. ACM, 140–153.
[63]
Martin Schindewolf. 2007. Analysis of cache misses using SIMICS. Master’s thesis (2007).
[64]
Hovav Shacham, Matthew Page, Ben Pfaff, Eu-Jin Goh, Nagendra Modadugu, and Dan Boneh. 2004. On the Effectiveness of Addressspace Randomization. In Proceedings of the 11th ACM Conference on Computer and Communications Security (CCS ’04). ACM, New York, NY, USA, 298–307.
[65]
Byoungro So, Mary W Hall, and Heidi E Ziegler. 2004. Custom data layout for memory parallelism. In Code Generation and Optimization, 2004. CGO 2004. International Symposium on. IEEE, 291–302.
[66]
Daniel J Sorin, Mark D Hill, and David A Wood. 2011. A primer on memory consistency and cache coherence. Synthesis Lectures on Computer Architecture 6, 3 (2011), 1–212.
[67]
Herb Sutter. 2009. Eliminate false sharing. Dr. Dobb’s Journal 5 (2009).
[68]
O. Temam, E. D. Granston, and W. Jalby. 1993. To copy or not to copy: A compile-time technique for assessing when data copying should be used to eliminate cache conflicts. In Supercomputing ’93:Proceedings of the 1993 ACM/IEEE Conference on Supercomputing. 410–419.
[69]
WikiSysop. {n. d.}. Atomic operations library. https://en.cppreference. com/w/cpp/atomic . {Online; last accessed 07-August-2018}.
[70]
Zhichen Xu, James R Larus, and Barton P Miller. 1997. Shared-memory performance profiling. In ACM SIGPLAN Notices, Vol. 32. ACM, 240– 251.
[71]
Tingting Yu and Michael Pradel. 2018. Pinpointing and repairing performance bottlenecks in concurrent programs. Empirical Software Engineering 23, 5 (2018), 3034–3071.
[72]
Yuanrui Zhang, Wei Ding, Jun Liu, and Mahmut Kandemir. 2011. Optimizing data layouts for parallel computation on multicores. In Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on. IEEE, 143–154.
[73]
Qin Zhao, David Koh, Syed Raza, Derek Bruening, Weng-Fai Wong, and Saman Amarasinghe. 2011. Dynamic cache contention detection in multi-threaded applications. In ACM SIGPLAN Notices, Vol. 46. ACM, 27–38.

Cited By

View all
  • (2024)ParaShareDetect: Dynamic Instrumentation and Runtime Analysis for False Sharing Detection in Parallel Computing2024 4th International Conference on Computer, Control and Robotics (ICCCR)10.1109/ICCCR61138.2024.10585404(230-235)Online publication date: 19-Apr-2024
  • (2023)MemPerf: Profiling Allocator-Induced Performance SlowdownsProceedings of the ACM on Programming Languages10.1145/36228487:OOPSLA2(1418-1441)Online publication date: 16-Oct-2023
  • (2022)Whisper: Profile-Guided Branch Misprediction Elimination for Data Center Applications2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO56248.2022.00017(19-34)Online publication date: Oct-2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PLDI 2019: Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2019
1162 pages
ISBN:9781450367127
DOI:10.1145/3314221
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 June 2019

Permissions

Request permissions for this article.

Check for updates

Badges

Author Tags

  1. False sharing
  2. Performance optimization

Qualifiers

  • Research-article

Funding Sources

Conference

PLDI '19
Sponsor:

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)159
  • Downloads (Last 6 weeks)25
Reflects downloads up to 02 Oct 2024

Other Metrics

Citations

Cited By

View all
  • (2024)ParaShareDetect: Dynamic Instrumentation and Runtime Analysis for False Sharing Detection in Parallel Computing2024 4th International Conference on Computer, Control and Robotics (ICCCR)10.1109/ICCCR61138.2024.10585404(230-235)Online publication date: 19-Apr-2024
  • (2023)MemPerf: Profiling Allocator-Induced Performance SlowdownsProceedings of the ACM on Programming Languages10.1145/36228487:OOPSLA2(1418-1441)Online publication date: 16-Oct-2023
  • (2022)Whisper: Profile-Guided Branch Misprediction Elimination for Data Center Applications2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO56248.2022.00017(19-34)Online publication date: Oct-2022
  • (2020)HangFixProceedings of the 11th ACM Symposium on Cloud Computing10.1145/3419111.3421288(344-357)Online publication date: 12-Oct-2020

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media