A "flight data recorder" for enabling full-system multiprocessor deterministic replay

Published: 01 May 2003

Abstract

Debuggers have proven indispensable in improving software reliability. Unfortunately, on most real-life software, debuggers fail to deliver their most essential feature: a faithful replay of the execution. The reason is non-determinism caused by multithreading and non-repeatable inputs. A common solution to faithful replay has been to record the non-deterministic execution. Existing recorders, however, either work only for data-race-free programs or have prohibitive overhead.

As a step towards powerful debugging, we develop a practical low-overhead hardware recorder for cache-coherent multiprocessors, called Flight Data Recorder (FDR). Like an aircraft flight data recorder, FDR continuously records the execution, even on deployed systems, logging the execution for post-mortem analysis.

FDR is practical because it piggybacks on the cache coherence hardware and logs nearly the minimal thread-ordering information necessary to faithfully replay the multiprocessor execution. Our studies, based on simulating a four-processor server with commercial workloads, show that when allocated less than 7% of the system's physical memory, our FDR design can capture the last one second of the execution at modest (less than 2%) slowdown.
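The idea of replaying from logged thread-ordering information can be illustrated with a small software sketch. This is not the FDR hardware design. It is a toy Python model, under stated assumptions, in which each "thread" is a list of loads and stores, a seeded random scheduler stands in for non-deterministic timing, and the recorder logs one edge for every cross-thread conflict on the same address. All names (`run`, `record_and_replay`, `PROGRAMS`) are hypothetical, and the sketch logs every conflict rather than the near-minimal set the abstract describes.

```python
import random
from collections import defaultdict

# Hypothetical toy programs: ("ld", addr) or ("st", addr, value).
PROGRAMS = [
    [("st", "x", 1), ("ld", "y"), ("st", "x", 3)],
    [("st", "y", 2), ("ld", "x"), ("ld", "x")],
]

def run(programs, pick_next, log=None, replay_deps=None):
    """Interleave the threads and return (per-thread loaded values, memory).

    log:         if given, receives ((src_tid, src_ic), (dst_tid, dst_ic))
                 edges for cross-thread conflicting accesses (record mode).
    replay_deps: if given, maps (tid, ic) to the accesses that must have
                 executed before that instruction may run (replay mode).
    """
    mem = {}
    pc = [0] * len(programs)              # per-thread instruction counts
    loads = [[] for _ in programs]
    last_writer = {}                      # addr -> (tid, ic) of last store
    last_readers = defaultdict(list)      # addr -> loads since last store
    while True:
        ready = []
        for t, prog in enumerate(programs):
            if pc[t] >= len(prog):
                continue
            deps = replay_deps.get((t, pc[t]), []) if replay_deps else []
            # A dependence is satisfied once the source thread has
            # executed past the logged instruction count.
            if all(pc[src_t] > src_ic for src_t, src_ic in deps):
                ready.append(t)
        if not ready:
            break
        t = pick_next(ready)
        op = programs[t][pc[t]]
        addr, here = op[1], (t, pc[t])
        # Conflicting predecessors: the last store, plus (for a store)
        # every load issued since that store.
        preds = [last_writer[addr]] if addr in last_writer else []
        if op[0] == "st":
            preds.extend(last_readers[addr])
        if log is not None:
            log.extend((src, here) for src in preds if src[0] != t)
        if op[0] == "ld":
            loads[t].append(mem.get(addr, 0))
            last_readers[addr].append(here)
        else:
            mem[addr] = op[2]
            last_writer[addr] = here
            last_readers[addr] = []
        pc[t] += 1
    return loads, mem

def record_and_replay(programs, record_seed, replay_seed):
    """Record under one schedule, then replay under a different one."""
    log = []
    recorded = run(programs, random.Random(record_seed).choice, log=log)
    deps = defaultdict(list)
    for src, dst in log:
        deps[dst].append(src)
    replayed = run(programs, random.Random(replay_seed).choice,
                   replay_deps=deps)
    return recorded, replayed, log

if __name__ == "__main__":
    recorded, replayed, log = record_and_replay(PROGRAMS, 1, 2)
    print("log entries:", len(log))
    print("replay matches recording:", recorded == replayed)
```

In this sketch the replay scheduler is free to pick any ready thread, yet every load returns the value it returned while recording, because each logged edge forces the conflicting access to complete first. A real recorder would also need to capture non-repeatable inputs and bound the log size, which this toy model leaves out.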

Published In

ACM SIGARCH Computer Architecture News, Volume 31, Issue 2
ISCA 2003
May 2003
422 pages
ISSN: 0163-5964
DOI: 10.1145/871656

ISCA '03: Proceedings of the 30th annual international symposium on Computer architecture
June 2003
432 pages
ISBN: 0769519458
DOI: 10.1145/859618
Conference Chair: Allan Gottlieb
Program Chair: Kai Li

Publisher

Association for Computing Machinery

New York, NY, United States
