DOI: 10.1145/3394277.3401850
research-article

Deploying Scientific AI Networks at Petaflop Scale on Secure Large Scale HPC Production Systems with Containers

Published: 29 June 2020

Abstract

There is an ever-increasing need for computational power to train complex artificial intelligence (AI) and machine learning (ML) models to tackle large scientific problems. High performance computing (HPC) resources are required to efficiently compute and scale complex models across tens of thousands of compute nodes. In this paper, we discuss the issues associated with deploying machine learning frameworks on large scale secure HPC systems, and describe how we successfully deployed a standard machine learning framework on a secure large scale HPC production system to train a complex three-dimensional convolutional GAN (3DGAN) with petaflop performance. 3DGAN is an example from the high energy physics domain, designed to simulate the energy pattern produced by showers of secondary particles inside a particle detector on various HPC systems.

Published In

PASC '20: Proceedings of the Platform for Advanced Scientific Computing Conference
June 2020
169 pages
ISBN:9781450379939
DOI:10.1145/3394277
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. HPC
  2. Machine learning
  3. high energy physics
  4. high performance AI

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PASC '20
Acceptance Rates

PASC '20 Paper Acceptance Rate 16 of 36 submissions, 44%;
Overall Acceptance Rate 109 of 221 submissions, 49%

Article Metrics

  • Downloads (last 12 months): 16
  • Downloads (last 6 weeks): 2
Reflects downloads up to 22 Nov 2024

Cited By

  • (2023) Intelligent Computing: The Latest Advances, Challenges, and Future. Intelligent Computing 2. DOI: 10.34133/icomputing.0006. Online publication date: 30-Jan-2023
  • (2023) A Joint Study of the Challenges, Opportunities, and Roadmap of MLOps and AIOps: A Systematic Survey. ACM Computing Surveys 56:4 (1-30). DOI: 10.1145/3625289. Online publication date: 21-Oct-2023
  • (2022) VenusAI. Journal of Systems Architecture: the EUROMICRO Journal 128:C. DOI: 10.1016/j.sysarc.2022.102550. Online publication date: 1-Jul-2022
  • (2022) A container-based workflow for distributed training of deep learning algorithms in HPC clusters. Cluster Computing 26:5 (2815-2834). DOI: 10.1007/s10586-022-03798-7. Online publication date: 7-Nov-2022
  • (2021) Deploying Containerized QuanEX Quantum Simulation Software on HPC Systems. 2021 3rd International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC (CANOPIE-HPC) (1-9). DOI: 10.1109/CANOPIEHPC54579.2021.00005. Online publication date: Nov-2021
  • (2021) Characterizing Containerized HPC Applications Performance at Petascale on CPU and GPU Architectures. High Performance Computing (411-430). DOI: 10.1007/978-3-030-78713-4_22. Online publication date: 24-Jun-2021
