research-article

A checkpoint compression study for high-performance computing systems

Authors:

Dewan Ibtesham,

Kurt B Ferreira,

Dorian Arnold

International Journal of High Performance Computing Applications, Volume 29, Issue 4

Pages 387 - 402

https://doi.org/10.1177/1094342015570921

Published: 01 November 2015

Abstract

As high-performance computing systems continue to increase in size and complexity, higher failure rates and increased overheads for checkpoint/restart (CR) protocols have raised concerns about the practical viability of CR protocols for future systems. Previously, compression has proven to be a viable approach for reducing checkpoint data volumes and, thereby, reducing CR protocol overhead, leading to improved application performance. In this article, we further explore compression-based CR optimization by examining its baseline performance and scaling properties, evaluating whether improved compression algorithms might lead to even better application performance, and comparing checkpoint compression against and alongside other software- and hardware-based optimizations. Our key results are that: (1) compression is a very viable CR optimization; (2) generic, text-based compression algorithms appear to perform near optimally for checkpoint data compression, and faster compression algorithms will not lead to better application performance; (3) compression-based optimizations fare well against and alongside other software-based optimizations; and (4) while hardware-based optimizations outperform software-based ones, they are not as cost effective.
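The trade-off the abstract describes, spending CPU time on compression to shrink the data that must be written, can be illustrated with a small sketch. This is not the authors' model or code: it assumes a simple serial compress-then-write cost, and the function names, the choice of `zlib`, and the rates in the usage note are all hypothetical.

```python
import time
import zlib


def measure_compression(data: bytes, level: int = 6):
    """Return (compression_factor, compress_rate_bytes_per_s) for one buffer.

    compression_factor is the fraction of bytes removed (0.0 = no gain).
    """
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid dividing by zero
    factor = 1.0 - len(compressed) / len(data)
    return factor, len(data) / elapsed


def commit_time(size: float, commit_rate: float, factor: float = 0.0,
                compress_rate: float = float("inf")) -> float:
    """Seconds to (optionally) compress and then write one checkpoint.

    Assumed model: compression and writing happen one after the other.
    With the defaults this is just the uncompressed write time.
    """
    return size / compress_rate + (1.0 - factor) * size / commit_rate


def compression_pays_off(factor: float, compress_rate: float,
                         commit_rate: float) -> bool:
    """True when compress-then-write beats writing the raw checkpoint.

    Follows from the assumed model: size/compress_rate must be smaller
    than the write time saved, factor * size / commit_rate.
    """
    return factor > 0 and compress_rate > commit_rate / factor
```

For example, under this assumed model a checkpoint that shrinks by half (`factor=0.5`) is worth compressing only if the compressor runs at more than twice the storage commit rate, which suggests why slow storage favors compression while very fast storage may not.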

Information

Copyright © The Authors 2015.

Publisher

Sage Publications, Inc.

United States

Cited By

Coy T, He S, Ren B, Zhang X, Ayguadé E, Hwu W, Badia R, Hofstee H (2020) Compiler aided checkpointing using crash-consistent data structures in NVMM systems. Proceedings of the 34th ACM International Conference on Supercomputing, pp. 1-13. doi:10.1145/3392717.3392755. Online publication date: 29-Jun-2020
https://dl.acm.org/doi/10.1145/3392717.3392755
Lee K, Sullivan M, Hari S, Tsai T, Keckler S, Erez M, Eigenmann R, Ding C, McKee S (2019) GPU snapshot. Proceedings of the ACM International Conference on Supercomputing, pp. 171-183. doi:10.1145/3330345.3330361. Online publication date: 26-Jun-2019
https://dl.acm.org/doi/10.1145/3330345.3330361
Alshboul M, Elnawawy H, Elkhouly R, Kimura K, Tuck J, Solihin Y (2019) Efficient Checkpointing with Recompute Scheme for Non-volatile Main Memory. ACM Transactions on Architecture and Code Optimization 16:2, pp. 1-27. doi:10.1145/3323091. Online publication date: 29-May-2019
https://dl.acm.org/doi/10.1145/3323091
Agrawal A, Loh G, Tuck J, Mohr B, Raghavan P (2017) Leveraging near data processing for high-performance checkpoint/restart. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. doi:10.1145/3126908.3126918. Online publication date: 12-Nov-2017
https://dl.acm.org/doi/10.1145/3126908.3126918
Levy S, Ferreira K, Bridges P, West J (2016) Improving application resilience to memory errors with lightweight compression. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. doi:10.5555/3014904.3014942. Online publication date: 13-Nov-2016
https://dl.acm.org/doi/10.5555/3014904.3014942
Fiala D, Mueller F, Ferreira K, Engelmann C (2016) Mini-Ckpts. Proceedings of the 2016 International Conference on Supercomputing, pp. 1-14. doi:10.1145/2925426.2926295. Online publication date: 1-Jun-2016
https://dl.acm.org/doi/10.1145/2925426.2926295
