Abstract
A key emerging pattern in deep learning applications is the need to capture intermediate DNN model snapshots and to preserve or clone them in order to explore a large number of alternative training and/or inference paths. However, with increasing model complexity and new training approaches that mix data, model, pipeline, and layer-wise parallelism, this pattern is challenging to address in a scalable and efficient manner. To this end, this position paper advocates rethinking how DNN learning models are represented and manipulated. It relies on a broader notion of data states: collections of annotated, potentially distributed data sets (tensors in the case of DNN models) that AI applications can capture at key moments during runtime and revisit/reuse later. Instead of explicitly interacting with the storage layer (e.g., writing to a file), users “tag” DNN models at key moments during runtime with metadata that expresses attributes and persistency/movement semantics. A high-performance runtime is then responsible for interpreting the metadata and performing the necessary actions in the background, while offering a rich interface to find data states of interest. This approach has benefits at several levels: new capabilities, performance portability, high performance, and scalability.
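To make the tag-and-capture idea concrete, below is a minimal sketch of how such semantics might look from the application side. The `DataStateStore` class, its `capture`/`find` methods, and the tag names are hypothetical illustrations, not the actual DataStates interface; a real runtime would persist snapshots asynchronously to multi-level storage rather than an in-memory list.

```python
# Hypothetical sketch: a "data state" is a snapshot of model tensors plus
# user-supplied metadata. Capture is non-blocking: a background thread
# persists the snapshot while training continues; a simple query interface
# finds previously captured states by their tags.

import copy
import threading
import time
from typing import Any, Dict, List


class DataStateStore:
    """Toy in-memory stand-in for a high-performance data-states runtime."""

    def __init__(self) -> None:
        self._states: List[Dict[str, Any]] = []
        self._lock = threading.Lock()

    def capture(self, tensors: Dict[str, Any], **tags: Any) -> None:
        """Tag a model snapshot; persistence happens in the background."""
        snapshot = copy.deepcopy(tensors)      # clone so training can continue
        threading.Thread(target=self._persist, args=(snapshot, tags)).start()

    def _persist(self, snapshot: Dict[str, Any], tags: Dict[str, Any]) -> None:
        time.sleep(0.01)                       # stands in for asynchronous I/O
        with self._lock:
            self._states.append({"tensors": snapshot, "tags": tags})

    def find(self, **query: Any) -> List[Dict[str, Any]]:
        """Return all captured states whose tags match the query attributes."""
        with self._lock:
            return [s for s in self._states
                    if all(s["tags"].get(k) == v for k, v in query.items())]


# Usage: tag snapshots during training, then revisit one of them later.
store = DataStateStore()
model = {"layer1.weight": [0.1, 0.2], "layer1.bias": [0.0]}
for epoch in range(3):
    # ... a training step would update `model` here ...
    store.capture(model, epoch=epoch, persist="async")

time.sleep(0.1)                                # let background persistence finish
best = store.find(epoch=2)
print(best[0]["tags"])                         # tags of the epoch-2 snapshot
```

In this sketch the application never names files or storage tiers; it only attaches attributes (here `epoch` and `persist`), leaving placement and movement decisions to the runtime, which is the separation of concerns the abstract argues for.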
Acknowledgments
This material is based upon work supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357.