research-article

A checkpoint compression study for high-performance computing systems

Published: 01 November 2015

Abstract

As high-performance computing systems continue to increase in size and complexity, higher failure rates and increased overheads for checkpoint/restart (CR) protocols have raised concerns about the practical viability of CR protocols for future systems. Previously, compression has proven to be a viable approach for reducing checkpoint data volumes and, thereby, reducing CR protocol overhead, leading to improved application performance. In this article, we further explore compression-based CR optimization by examining its baseline performance and scaling properties, evaluating whether improved compression algorithms might lead to even better application performance, and comparing checkpoint compression against and alongside other software- and hardware-based optimizations. The highlights of our results are that: (1) compression is a very viable CR optimization; (2) generic, text-based compression algorithms appear to perform near optimally for checkpoint data compression, and faster compression algorithms will not lead to better application performance; (3) compression-based optimizations fare well against and alongside other software-based optimizations; and (4) while hardware-based optimizations outperform software-based ones, they are not as cost effective.

Publisher

Sage Publications, Inc.

United States

Author Tags

  1. Fault tolerance
  2. checkpoint compression
  3. checkpoint/restart

Qualifiers

  • Research-article

Cited By

  • (2020) Compiler aided checkpointing using crash-consistent data structures in NVMM systems. Proceedings of the 34th ACM International Conference on Supercomputing, pp. 1-13. doi:10.1145/3392717.3392755. Online publication date: 29-Jun-2020
  • (2019) GPU snapshot. Proceedings of the ACM International Conference on Supercomputing, pp. 171-183. doi:10.1145/3330345.3330361. Online publication date: 26-Jun-2019
  • (2019) Efficient Checkpointing with Recompute Scheme for Non-volatile Main Memory. ACM Transactions on Architecture and Code Optimization 16:2, pp. 1-27. doi:10.1145/3323091. Online publication date: 29-May-2019
  • (2017) Leveraging near data processing for high-performance checkpoint/restart. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. doi:10.1145/3126908.3126918. Online publication date: 12-Nov-2017
  • (2016) Improving application resilience to memory errors with lightweight compression. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. doi:10.5555/3014904.3014942. Online publication date: 13-Nov-2016
  • (2016) Mini-Ckpts. Proceedings of the 2016 International Conference on Supercomputing, pp. 1-14. doi:10.1145/2925426.2926295. Online publication date: 1-Jun-2016
