Checkpointing Tools in a Supercomputer Center

Lobachevskii Journal of Mathematics

Abstract

The article addresses the problem of automatic checkpoint creation and data restoration for jobs that run on a single supercomputer node. It formulates the requirements for checkpoint and restore tools in a supercomputer job management system. Three tools are examined: Berkeley Lab Checkpoint/Restart (BLCR), Checkpoint Restore In Userspace (CRIU), and Distributed MultiThreaded Checkpointing (DMTCP). DMTCP is shown to meet the stated requirements better than the other two tools. Experimental estimates of computational performance and impact on efficiency for DMTCP are presented. The problems of integrating checkpointing tools into the SUPPZ job management system used at the JSCC RAS are considered, and recommendations on the practical use of automatic checkpoint/restore tools are given.

ACKNOWLEDGMENTS

The work was carried out at the JSCC RAS as part of a government assignment. The MVS-10P OP supercomputer was used in the research.

Author information

Correspondence to G. I. Savin, B. M. Shabanov, R. S. Fedorov, A. V. Baranov or P. N. Telegin.

Additional information

(Submitted by A. M. Elizarov)

Cite this article

Savin, G.I., Shabanov, B.M., Fedorov, R.S. et al. Checkpointing Tools in a Supercomputer Center. Lobachevskii J Math 41, 2603–2613 (2020). https://doi.org/10.1134/S1995080220120355

  • DOI: https://doi.org/10.1134/S1995080220120355
