Abstract
A key emerging pattern in deep learning applications is the need to capture intermediate DNN model snapshots and to preserve or clone them in order to explore a large number of alternative training and/or inference paths. However, with increasing model complexity and new training approaches that mix data, model, pipeline, and layer-wise parallelism, this pattern is challenging to address in a scalable and efficient manner. To this end, this position paper advocates rethinking how DNN learning models are represented and manipulated. It relies on a broader notion of data states: collections of annotated, potentially distributed data sets (tensors in the case of DNN models) that AI applications can capture at key moments during runtime and revisit/reuse later. Instead of explicitly interacting with the storage layer (e.g., writing to a file), users “tag” DNN models at key moments during runtime with metadata that expresses attributes and persistency/movement semantics. A high-performance runtime is then responsible for interpreting the metadata and performing the necessary actions in the background, while offering a rich interface to find data states of interest. This approach has benefits at several levels: new capabilities, performance portability, high performance, and scalability.
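To make the tag-and-capture idea concrete, below is a minimal sketch of how such semantics might look from the application side. The `DataStateStore` class, its `capture`/`find` methods, and the tag names are hypothetical illustrations, not the actual DataStates interface; a real runtime would persist snapshots asynchronously to multi-level storage rather than an in-memory list.

```python
# Hypothetical sketch: a "data state" is a snapshot of model tensors plus
# user-supplied metadata. Capture is non-blocking: a background thread
# persists the snapshot while training continues; a simple query interface
# finds previously captured states by their tags.

import copy
import threading
import time
from typing import Any, Dict, List


class DataStateStore:
    """Toy in-memory stand-in for a high-performance data-states runtime."""

    def __init__(self) -> None:
        self._states: List[Dict[str, Any]] = []
        self._lock = threading.Lock()

    def capture(self, tensors: Dict[str, Any], **tags: Any) -> None:
        """Tag a model snapshot; persistence happens in the background."""
        snapshot = copy.deepcopy(tensors)      # clone so training can continue
        threading.Thread(target=self._persist, args=(snapshot, tags)).start()

    def _persist(self, snapshot: Dict[str, Any], tags: Dict[str, Any]) -> None:
        time.sleep(0.01)                       # stands in for asynchronous I/O
        with self._lock:
            self._states.append({"tensors": snapshot, "tags": tags})

    def find(self, **query: Any) -> List[Dict[str, Any]]:
        """Return all captured states whose tags match the query attributes."""
        with self._lock:
            return [s for s in self._states
                    if all(s["tags"].get(k) == v for k, v in query.items())]


# Usage: tag snapshots during training, then revisit one of them later.
store = DataStateStore()
model = {"layer1.weight": [0.1, 0.2], "layer1.bias": [0.0]}
for epoch in range(3):
    # ... a training step would update `model` here ...
    store.capture(model, epoch=epoch, persist="async")

time.sleep(0.1)                                # let background persistence finish
best = store.find(epoch=2)
print(best[0]["tags"])                         # tags of the epoch-2 snapshot
```

In this sketch the application never names files or storage tiers; it only attaches attributes (here `epoch` and `persist`), leaving placement and movement decisions to the runtime, which is the separation of concerns the abstract argues for.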
Acknowledgments
This material is based upon work supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357.