Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

ConAir: featherweight concurrency bug recovery via single-threaded idempotent execution

Published: 16 March 2013 Publication History

Abstract

Many concurrency bugs are hidden in deployed software and cause severe failures for end-users. When they finally manifest and become known by developers, they are difficult to fix correctly. To support end-users, we need techniques that help software survive hidden concurrency bugs during production runs. To help developers, we need techniques that fix exposed concurrency bugs.
The state-of-the-art techniques on concurrency-bug fixing and survival only satisfy a subset of four important properties: compatibility, correctness, generality, and performance.We aim to develop a system that satisfies all of these four properties. To achieve this goal, we leverage two observations: (1) rolling back a single thread is sufficient to recover from most concurrency-bug failures; (2) reexecuting an idempotent region, which requires no memory-state checkpoint, is sufficient to recover from many concurrency-bug failures. Our system ConAir includes a static analysis component that automatically identifies potential failure sites, a static analysis component that automatically identifies the idempotent code regions around every failure site, and a code-transformation component that inserts rollback-recovery code around the identified idempotent regions.
We evaluated ConAir on 10 real-world concurrency bugs in widely used C/C++ open-source applications. These bugs cover different types of failure symptoms and root causes. Quantitatively, ConAir helps software survive failures caused by all of these bugs with negligible run-time overhead (<1%) and short recovery time. Qualitatively, ConAir can help recover from failures caused by unknown bugs. It guarantees that program semantics remain unchanged and requires no change to operating systems or hardware.

References

[1]
G. Altekar and I. Stoica. ODR: output-deterministic replay for multicore debugging. In SOSP, 2009.
[2]
A. Aviram, S.-C. Weng, S. Hu, and B. Ford. Efficient system-enforced deterministic parallelism. In OSDI, 2010.
[3]
T. Bergan, N. Hunt, L. Ceze, and S. D. Gribble. Deterministic process groups in dOS. In OSDI, 2010.
[4]
G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, and A. Fox. Microreboot - a technique for cheap recovery. In OSDI, 2004.
[5]
L. Chew and D. Lie. Kivati: Fast detection and prevention of atomicity violations. In EuroSys, 2010.
[6]
R. Chugh, J. W. Voung, R. Jhala, and S. Lerner. Dataflow analysis for concurrent programs using datarace detection. In PLDI, 2008.
[7]
H. Cui, J. Wu, J. Gallagher, H. Guo, and J. Yang. Efficient deterministic multithreading through schedule relaxation. In SOSP, 2011.
[8]
R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently computing static single assignment form and the control dependence graph. Trans. Program. Lang. Syst., 13 (4), Oct. 1991.
[9]
M. de Kruijf and K. Sankaralingam. Idempotent processor architecture. In MICRO, 2011.
[10]
M. de Kruijf and K. Sankaralingam. Idempotent code generation: Implementation, analysis, and evaluation. In CGO, 2013.
[11]
M. de Kruijf, S. Nomura, and K. Sankaralingam. Relax: an architectural framework for software recovery of hardware faults. In ISCA, 2010.
[12]
M. de Kruijf, K. Sankaralingam, and S. Jha. Static analysis and compiler design for idempotent processing. In PLDI, 2012.
[13]
Y. h. Eom and B. Demsky. Self-stabilizing java. In PLDI, 2012.
[14]
J. Erickson, M. Musuvathi, S. Burckhardt, and K. Olynyk. Effective data-race detection for the kernel. In OSDI, 2010.
[15]
M. Ernst, A. Czeisler, W. G. Griswold, and D. Notkin. Quickly detecting relevant program invariants. In ICSE, 2000.
[16]
S. Feng, S. Gupta, A. Ansari, S. A. Mahlke, and D. I. August. Encore: low-cost, fine-grained transient fault recovery. In MICRO, 2011.
[17]
C. Flanagan and S. N. Freund. Atomizer: a dynamic atomicity checker for multithreaded programs. In POPL, 2004.
[18]
C. Flanagan and S. N. Freund. Fasttrack: efficient and precise dynamic race detection. In PLDI, 2009.
[19]
Q. Gao, W. Zhang, Z. Chen, M. Zheng, and F. Qin. 2ndStrike: toward manifesting hidden concurrency typestate bugs. In ASPLOS, 2011.
[20]
P. Godefroid and N. Nagappani. Concurrency at Microsoft -- an exploratory survey. Technical report, MSR-TR-2008--75, Microsoft Research, May 2008.
[21]
006)}krste.ics06M. Hampton and K. Asanović. Implementing virtual memory in a vector processor with software restart markers. In ICS, 2006.
[22]
D. R. Hower, P. Montesinos, L. Ceze, M. D. Hill, and J. Torrellas. Two hardware-based approaches for deterministic multiprocessor replay. Commun. ACM, 52 (6), June 2009.
[23]
G. Jin, L. Song, W. Zhang, S. Lu, and B. Liblit. Automated atomicity-violation fixing. In PLDI, 2011.
[24]
H. Jula, D. Tralamazza, C. Zamfir, and G. Candea. Deadlock immunity: Enabling systems to defend against deadlocks. In OSDI, 2008.
[25]
S. W. Kim, C.-L. Ooi, R. Eigenmann, B. Falsafi, and T. N. Vijaykumar. Exploiting reference idempotency to reduce speculative storage overflow. ACM Trans. Program. Lang. Syst., 28 (5), Sept. 2006.
[26]
S. T. King, G. W. Dunlap, and P. M. Chen. Operating systems with time-traveling virtual machines. In Usenix, 2005.
[27]
C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.
[28]
D. Lee, B. Wester, K. Veeraraghavan, S. Narayanasamy, P. M. Chen, and J. Flinn. Respec: efficient online multiprocessor replayvia speculation and external determinism. In ASPLOS, 2010.
[29]
a}atomrace.padtad08Z. Letko, T. Vojnar, and B. Krena. AtomRace: data race and atomicity violation detector and healer. In PADTAD, 2008.
[30]
N. G. Leveson and C. S. Turner. An investigation of the therac-25 accidents. Computer, 26 (7): 18--41, 1993.
[31]
Z. Li, L. Tan, X. Wang, Y. Zhou, and C. Zhai. An empirical study of bug characteristics in modern open source software. In ASID, 2006.
[32]
T. Liu, C. Curtsinger, and E. D. Berger. Dthreads: efficient deterministic multithreading. In SOSP, 2011.
[33]
S. Lu, J. Tucek, F. Qin, and Y. Zhou. AVIO: detecting atomicity violations via access interleaving invariants. In ASPLOS, 2006.
[34]
S. Lu, S. Park, C. Hu, X. Ma, W. Jiang, Z. Li, R. A. Popa, and Y. Zhou. MUVI: Automatically inferring multi-variable access correlations and detecting related semantic and concurrency bugs. In SOSP, 2007.
[35]
S. Lu, S. Park, E. Seo, and Y. Zhou. Learning from mistakes -- a comprehensive study of real world concurrency bug characteristics. In ASPLOS, 2008.
[36]
B. Lucia and L. Ceze. Finding concurrency bugs with context-aware communication graphs. In MICRO, 2009.
[37]
B. Lucia, J. Devietti, K. Strauss, and L. Ceze. Atom-aid: Detecting and surviving atomicity violations. In ISCA, 2008.
[38]
S. A. Mahlke, W. Y. Chen, R. A. Bringmann, R. E. Hank, W.-M. W. Hwu, B. R. Rau, and M. S. Schlansker. Sentinel scheduling: a model for compiler-controlled speculative execution. ACM Trans. Comput. Syst., 11 (4), Nov. 1993.
[39]
S. Misailovic, D. Kim, and M. Rinard. Parallelizing sequential programs with statistical accuracy tests. MIT-CSAIL-TR-2010-038.
[40]
MySQL. Mysql 5.6 reference manual. http://dev.mysql.com/doc/refman/5.6/en/.
[41]
M. Olszewski, J. Ansel, and S. Amarasinghe. Kendo: efficient deterministic multithreading in software. In ASPLOS, 2009.
[42]
}nasdaqPCWorld. Nasdaq's Facebook Glitch Came From Race Conditions. http://www.pcworld.com/businesscenter/article/255911/ nasdaqs_facebook_glitch_came_from_race_conditions.html.
[43]
S. Qi, N. Otsuki, L. O. Nogueira, A. Muzahid, and J. Torrellas. Pacman: Tolerating asymmetric data races with unintrusive hardware. In HPCA, 2012.
[44]
F. Qin, J. Tucek, J. Sundaresan, and Y. Zhou. Rx: Treating bugs as allergies c a safe method to survive software failures. In SOSP, 2005.
[45]
S. K. Rajamani, G. Ramalingam, V. P. Ranganath, and K. Vaswani. Isolator: dynamically ensuring isolation in comcurrent programs. In ASPLOS, 2009.
[46]
P. Ratanaworabhan, M. Burtscher, D. Kirovski, B. G. Zorn, R. Nagpal, and K. Pattabiraman. Detecting and tolerating asymmetric races. In PPOPP, 2009.
[47]
S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson. Eraser: A dynamic data race detector for multithreaded programs. TOCS, 1997.
[48]
SecurityFocus. Software bug contributed to blackout. http://www.securityfocus.com/news/8016.
[49]
Y. Shi, S. Park, Z. Yin, S. Lu, Y. Zhou, W. Chen, and W. Zheng. Do I use the wrong definition?: DefUse: definition-use invariants for detecting concurrency and sequential bugs. In OOPSLA, 2010.
[50]
S. Sidiroglou, O. Laadan, C. Perez, N. Viennot, J. Nieh, and A. D. Keromytis. Assure: automatic software self-healing using rescue points. In ASPLOS, 2009.
[51]
G. Upadhyaya, S. P. Midkiff, and V. S. Pai. Automatic atomic region identification in shared memory SPMD programs. In OOPSLA, 2010.
[52]
M. Vaziri, F. Tip, and J. Dolby. Associating synchronization constraints with data in an object-oriented language. In POPL, 2006.
[53]
K. Veeraraghavan, P. M. Chen, J. Flinn, and S. Narayanasamy. Detecting and surviving data races using complementary schedules. In SOSP, 2011.
[54]
H. Volos, A. J. Tack, M. M. Swift, and S. Lu. Applying transactional memory to concurrency bugs. In ASPLOS, 2012.
[55]
D. Weeratunge, X. Zhang, and S. Jagannathan. Accentuating the positive: atomicity inference and enforcement using correct executions. In OOPSLA, 2011.
[56]
Z. Yin, D. Yuan, Y. Zhou, S. Pasupathy, and L. N. Bairavasundaram. How do fixes become bugs? In FSE, 2011.
[57]
J. Yu and S. Narayanasamy. A case for an interleaving constrained shared-memory multi-processor. In ISCA, 2009.
[58]
J. Yu and S. Narayanasamy. Tolerating concurrency bugs using transactions as lifeguards. In MICRO, 2010.
[59]
Y. Yu, T. Rodeheffer, and W. Chen. RaceTrack: Efficient detection of data race conditions via adaptive tracking. In SOSP, 2005.
[60]
W. Zhang, C. Sun, and S. Lu. ConMem: Detecting severe concurrency bugs through an effect-oriented approach. In ASPLOS, 2010.
[61]
W. Zhang, J. Lim, R. Olichandran, J. Scherpelz, G. Jin, S. Lu, and T. Reps. ConSeq: Detecting concurrency bugs through sequential errors. In ASPLOS, 2011.

Cited By

View all
  • (2019)Tuning lock-based multicore program based on sliding windows to tolerate data raceThe Journal of Supercomputing10.1007/s11227-019-02921-775:12(7872-7894)Online publication date: 6-Jun-2019
  • (2018)A Comprehensive Study on Bugs in Actor SystemsProceedings of the 47th International Conference on Parallel Processing10.1145/3225058.3225139(1-9)Online publication date: 13-Aug-2018
  • (2018)A systematic survey on automated concurrency bug detection, exposing, avoidance, and fixing techniquesSoftware Quality Journal10.1007/s11219-017-9385-326:3(855-889)Online publication date: 1-Sep-2018
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM SIGARCH Computer Architecture News
ACM SIGARCH Computer Architecture News  Volume 41, Issue 1
ASPLOS '13
March 2013
540 pages
ISSN:0163-5964
DOI:10.1145/2490301
Issue’s Table of Contents
  • cover image ACM Conferences
    ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
    March 2013
    574 pages
    ISBN:9781450318709
    DOI:10.1145/2451116
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 March 2013
Published in SIGARCH Volume 41, Issue 1

Check for updates

Author Tags

  1. bug fixing
  2. concurrency bugs
  3. failure recovery
  4. idempotency
  5. static analysis

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)13
  • Downloads (Last 6 weeks)2
Reflects downloads up to 12 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2019)Tuning lock-based multicore program based on sliding windows to tolerate data raceThe Journal of Supercomputing10.1007/s11227-019-02921-775:12(7872-7894)Online publication date: 6-Jun-2019
  • (2018)A Comprehensive Study on Bugs in Actor SystemsProceedings of the 47th International Conference on Parallel Processing10.1145/3225058.3225139(1-9)Online publication date: 13-Aug-2018
  • (2018)A systematic survey on automated concurrency bug detection, exposing, avoidance, and fixing techniquesSoftware Quality Journal10.1007/s11219-017-9385-326:3(855-889)Online publication date: 1-Sep-2018
  • (2017)10 Years of research on debugging concurrent and multicore softwareSoftware Quality Journal10.1007/s11219-015-9301-725:1(49-82)Online publication date: 1-Mar-2017
  • (2020)PeacenikProceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3373376.3378485(317-333)Online publication date: 9-Mar-2020
  • (2019)I/O dependent idempotence bugs in intermittent systemsProceedings of the ACM on Programming Languages10.1145/33606093:OOPSLA(1-31)Online publication date: 10-Oct-2019
  • (2019)Supporting peripherals in intermittent systems with just-in-time checkpointsProceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3314221.3314613(1101-1116)Online publication date: 8-Jun-2019
  • (2019)Applying Transactional Memory for Concurrency-Bug Failure Recovery in Production RunsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2018.287765630:5(990-1006)Online publication date: 1-May-2019
  • (2018)A Comprehensive Study on Bugs in Actor SystemsProceedings of the 47th International Conference on Parallel Processing10.1145/3225058.3225139(1-9)Online publication date: 13-Aug-2018
  • (2018)SamplerProceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture10.1109/MICRO.2018.00027(231-244)Online publication date: 20-Oct-2018
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media