Abstract
The article addresses automatic checkpoint creation and data restoration for jobs that run on a single supercomputer node. It formulates the requirements for checkpoint/restore tools in a supercomputer job management system. Three tools are examined: Berkeley Lab Checkpoint/Restart (BLCR), Checkpoint/Restore In Userspace (CRIU), and Distributed MultiThreaded Checkpointing (DMTCP). DMTCP is shown to meet the stated requirements better than the other two. Experimental estimates of the computational performance of DMTCP and of its impact on efficiency are presented. The problems of integrating checkpointing tools into the SUPPZ job management system used at the JSCC RAS are considered. Recommendations on the practical use of automatic checkpoint/restore tools are given.
ACKNOWLEDGMENTS
The work was carried out at the JSCC RAS as part of the government assignment. The MVS-10P OP supercomputer was used in the research.
Additional information
(Submitted by A. M. Elizarov)
Cite this article
Savin, G.I., Shabanov, B.M., Fedorov, R.S. et al. Checkpointing Tools in a Supercomputer Center. Lobachevskii J Math 41, 2603–2613 (2020). https://doi.org/10.1134/S1995080220120355
DOI: https://doi.org/10.1134/S1995080220120355