DOI: 10.1145/1519065.1519084
research-article

Transparent checkpoints of closed distributed systems in Emulab

Published: 01 April 2009

Abstract

Emulab is a testbed for networked and distributed systems experimentation. Two guiding principles of its design are realism and control of experimentation. There is an inherent tension between these goals, however, and in some aspects of the testbed's design, Emulab's implementers favored realism over control. Thus, Emulab provides wide-ranging control over an experiment's environment and initial conditions, but relatively little control over its execution: in particular, it lacks the ability to suspend, preempt, or replay an experiment.
We have extended Emulab with a new means of control over experiment execution: the ability to cleanly checkpoint the execution of the set of nodes and networks that comprise an experiment. Conventional checkpoint mechanisms can easily degrade the fidelity of experiment results as a consequence of checkpoint downtimes, overheads of background state saving, and unintended distributed checkpoint synchronization effects. In this paper, we demonstrate a checkpointing technique that is transparent with respect to the execution of the system under test, almost completely concealing the underlying checkpoint activity.
Building on our checkpoint mechanism, we have implemented two powerful facilities for experiment execution control: the ability to preemptively swap out experiments without losing their run-time state, and the ability to time-travel through the run of a system.
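
One way to picture what concealing checkpoint downtime from the system under test could mean is a clock that stops advancing while a checkpoint is in progress. The sketch below is purely illustrative and is not taken from the paper: the `VirtualClock` class and its `pause`/`resume`/`now` methods are hypothetical names, and the abstract does not describe how Emulab actually implements transparency.

```python
import time


class VirtualClock:
    """Hypothetical guest-visible clock that excludes checkpoint downtime."""

    def __init__(self, real_now=time.monotonic):
        self._real_now = real_now   # callable returning real time in seconds
        self._hidden = 0.0          # total downtime concealed so far
        self._paused_at = None      # real time at which the current pause began

    def pause(self):
        """Freeze virtual time at the start of a checkpoint."""
        if self._paused_at is None:
            self._paused_at = self._real_now()

    def resume(self):
        """Unfreeze virtual time and record the downtime to hide."""
        if self._paused_at is not None:
            self._hidden += self._real_now() - self._paused_at
            self._paused_at = None

    def now(self):
        """Return real time minus all concealed checkpoint downtime."""
        if self._paused_at is not None:
            reference = self._paused_at   # clock is frozen during a checkpoint
        else:
            reference = self._real_now()
        return reference - self._hidden


# Usage with a fake real-time source so the behaviour is easy to follow.
real = [100.0]
clock = VirtualClock(lambda: real[0])
clock.pause()          # checkpoint begins at real time 100
real[0] = 107.0        # the checkpoint takes 7 seconds of real time
clock.resume()         # checkpoint ends
real[0] = 110.0        # 3 more seconds of normal execution
elapsed_virtual = clock.now() - 100.0   # only the 3 uninterrupted seconds
```

In this sketch a program that reads `clock.now()` before and after the checkpoint sees no gap, which is the sense in which the pause is hidden. A real system would also have to account for in-flight network traffic and for clocks on other nodes, which this example does not attempt.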

Published In

EuroSys '09: Proceedings of the 4th ACM European conference on Computer systems
April 2009
342 pages
ISBN: 9781605584829
DOI: 10.1145/1519065
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. distributed checkpointing
  2. emulab
  3. network testbed
  4. transparent checkpointing

Conference

EuroSys '09
EuroSys '09: Fourth EuroSys Conference 2009
April 1 - 3, 2009
Nuremberg, Germany

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%
