DataStates: Towards Lightweight Data Models for Deep Learning

  • Conference paper
Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI (SMC 2020)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1315)


Abstract

A key emerging pattern in deep learning applications is the need to capture intermediate DNN model snapshots and to preserve or clone them in order to explore a large number of alternative training and/or inference paths. However, with increasing model complexity and new training approaches that mix data, model, pipeline and layer-wise parallelism, this pattern is challenging to address in a scalable and efficient manner. To this end, this position paper advocates for rethinking how DNN learning models are represented and manipulated. It relies on a broader notion of data states, a collection of annotated, potentially distributed data sets (tensors in the case of DNN models) that AI applications can capture at key moments during runtime and revisit/reuse later. Instead of explicitly interacting with the storage layer (e.g., writing to a file), users “tag” DNN models at key moments during runtime with metadata that expresses attributes and persistency/movement semantics. A high-performance runtime is then responsible for interpreting the metadata and performing the necessary actions in the background, while offering a rich interface to find data states of interest. This approach has benefits at several levels: new capabilities, performance portability, high performance, and scalability.
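
To make the proposed pattern concrete, the sketch below shows how a training loop might interact with such a runtime. It is a minimal, hypothetical example: the `datastates` module and every name on it (`Session`, `capture`, `find`, `materialize`), as well as the metadata keys and semantics values, are invented here for illustration and do not correspond to a published API.

```python
# Illustrative sketch only: `datastates`, `Session`, `capture`, `find` and
# `materialize` are hypothetical names for the kind of interface this
# position paper argues for; they are not an existing library API.
import datastates as ds

def build_model():       ...  # placeholder: any framework-level DNN model
def train_one_epoch(m):  ...  # placeholder: the usual training step
def evaluate(m):         ...  # placeholder: returns validation accuracy

session = ds.Session()        # handle to the background runtime
model = build_model()

for epoch in range(100):
    train_one_epoch(model)
    # "Tag" the current model state with metadata: the runtime interprets
    # the attributes and the persistency/clone semantics asynchronously,
    # in the background, instead of the user writing files explicitly.
    session.capture(
        model,
        attributes={"epoch": epoch, "val_acc": evaluate(model)},
        semantics={"persist": "durable", "clone": "lazy"},
    )

# Later, possibly from a different job: locate data states of interest by
# querying their attributes rather than by remembering file paths.
best = session.find("val_acc > 0.9", order_by="val_acc", limit=1)
model = best[0].materialize()  # reconstruct the captured tensors on demand
```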

Notes

  1. https://www.tensorflow.org/guide/saved_model.

  2. https://www.tensorflow.org/guide/keras/save_and_serialize.
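
For contrast, the guides referenced in the notes above document the conventional, explicit path in TensorFlow/Keras, where the application itself chooses when snapshots are written, where they live, and in which format. A minimal sketch of that pattern, assuming TensorFlow 2.x (the snapshot paths and layer sizes are arbitrary):

```python
import tensorflow as tf

# A small model so there is a concrete state to snapshot.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(10),
])

# Explicit, file-oriented persistence: the application picks the path,
# the format, and the moment of the write.
model.save("snapshots/epoch_05")                 # Keras whole-model save
tf.saved_model.save(model, "exported/epoch_05")  # low-level SavedModel export

# Reusing the snapshot later means knowing exactly where it was written.
restored = tf.keras.models.load_model("snapshots/epoch_05")
```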

Acknowledgments

This material is based upon work supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357.

Author information

Corresponding author

Correspondence to Bogdan Nicolae.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Nicolae, B. (2020). DataStates: Towards Lightweight Data Models for Deep Learning. In: Nichols, J., Verastegui, B., Maccabe, A.B., Hernandez, O., Parete-Koon, S., Ahearn, T. (eds) Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI. SMC 2020. Communications in Computer and Information Science, vol 1315. Springer, Cham. https://doi.org/10.1007/978-3-030-63393-6_8

  • DOI: https://doi.org/10.1007/978-3-030-63393-6_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-63392-9

  • Online ISBN: 978-3-030-63393-6

  • eBook Packages: Computer Science; Computer Science (R0)
