Research article, DOI: 10.1145/1629575.1629593

PRES: probabilistic replay with execution sketching on multiprocessors

Published: 11 October 2009

Abstract

Bug reproduction is critically important for diagnosing a production-run failure. Unfortunately, reproducing a concurrency bug on multi-processors (e.g., multi-core) is challenging. Previous techniques either incur large overhead or require new non-trivial hardware extensions.
This paper proposes a novel technique called PRES (probabilistic replay via execution sketching) to help reproduce concurrency bugs on multi-processors. It relaxes the past (perhaps idealistic) objective of "reproducing the bug on the first replay attempt" to significantly lower production-run recording overhead. This is achieved by (1) recording only partial execution information (referred to as "sketches") during the production run, and (2) relying on an intelligent replayer during diagnosis time (when performance is less critical) to systematically explore the unrecorded non-deterministic space and reproduce the bug. With only partial information, our replayer may require more than one coordinated replay run to reproduce a bug. However, after a bug is reproduced once, PRES can reproduce it every time.
We implemented PRES along with five different execution sketching mechanisms. We evaluated them with 11 representative applications, including 4 servers, 3 desktop/client applications, and 4 scientific/graphics applications, with 13 real-world concurrency bugs of different types, including atomicity violations, order violations and deadlocks. PRES (with synchronization or system call sketching) significantly lowered the production-run recording overhead of previous approaches (by up to 4416 times), while still reproducing most tested bugs in fewer than 10 replay attempts. Moreover, PRES scaled well with the number of processors, and its feedback generation from unsuccessful replays proved critical to bug reproduction.
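
The record-then-search idea described in the abstract can be illustrated with a toy model. The sketch below is not the PRES implementation: the two-thread counter program, the "first thread to run" sketch, and all function names are hypothetical, and the search is a plain enumeration that does not model PRES's feedback generation. It only shows how a partial recording narrows the space of schedules, why more than one replay attempt may be needed, and why the bug reproduces every time once a failing schedule is known.

```python
from itertools import permutations

# Toy program (an assumption for illustration): two threads each perform a
# non-atomic increment of a shared counter, i.e. step 0 reads the counter
# and step 1 writes back the value read plus one.

def run(schedule):
    """Execute the toy program under a thread schedule; return the final counter."""
    counter = 0
    local = {}             # value each thread read from the shared counter
    pc = {0: 0, 1: 0}      # next step index of each thread
    for tid in schedule:
        if pc[tid] == 0:
            local[tid] = counter       # step 0: read
        else:
            counter = local[tid] + 1   # step 1: write
        pc[tid] += 1
    return counter

def failed(schedule):
    """The failure is a lost update: the counter should end at 2."""
    return run(schedule) != 2

def record_sketch(schedule):
    """Hypothetical sketch: record only which thread ran first."""
    return schedule[0]

def candidate_schedules(sketch):
    """All distinct interleavings of the two threads that match the sketch."""
    distinct = sorted(set(permutations([0, 0, 1, 1])))
    return [s for s in distinct if s[0] == sketch]

def reproduce(sketch):
    """Try candidate schedules until one fails; return (attempts, schedule)."""
    for attempt, schedule in enumerate(candidate_schedules(sketch), start=1):
        if failed(schedule):
            return attempt, schedule
    return None, None

# Production run: the interleaving below loses an update.
production_schedule = (0, 1, 0, 1)
sketch = record_sketch(production_schedule)

# Diagnosis time: explore the unrecorded part of the schedule.
attempts, full_schedule = reproduce(sketch)
print(f"reproduced after {attempts} attempt(s) with schedule {full_schedule}")
```

Once `reproduce` returns a failing schedule, passing that same schedule to `run` again is deterministic in this model, which mirrors the abstract's claim that a bug reproduced once can be reproduced every time.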


    Published In

    SOSP '09: Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
    October 2009
    346 pages
    ISBN: 9781605587523
    DOI: 10.1145/1629575

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. concurrency bug
    2. replay

    Conference

    SOSP09
    Acceptance Rates

    Overall Acceptance Rate 174 of 961 submissions, 18%
