Abstract
NVIDIA is defining a High-Performance Computing system architecture called Cloud Native Supercomputing, which provides bare-metal system performance together with security isolation and functional offload capabilities. Cloud Native Supercomputing delivers a cloud-based user experience while maintaining the performance and scalability that supercomputing facilities uniquely deliver. This new set of capabilities is driven by the need to accommodate new scientific workflows that combine traditional simulation with experimental data from the edge and with AI, data analytics, and visualization frameworks in an integrated, and even real-time, fashion. These new workflows stress the system management, security, and non-computational functions of traditional cloud or supercomputing facilities. Specifically, they involve data from untrusted (or non-local) sources, user experiences that range from Jupyter notebooks and interactive jobs to Gordon Bell-class capacity batch runs, and I/O patterns that are unique to the emerging mix of in silico and live data sources. To achieve these objectives, we introduce a new architectural component called the Data Processing Unit (DPU), which in early embodiments is a system-on-a-chip (SoC) that includes an InfiniBand (IB) and Ethernet network adapter, programmable Arm cores, memory, PCI switches, and custom accelerators. The BlueField-1 and BlueField-2 devices are NVIDIA’s first DPU instances. This paper describes the architecture of cloud native supercomputing systems that use DPUs for isolation and acceleration, along with the system services provided by the DPU. These services provide enhanced security through isolation, file-system management capabilities, monitoring, and offloaded support for communication libraries.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Shainer, G. et al. (2022). NVIDIA’s Cloud Native Supercomputing. In: Nichols, J., et al. Driving Scientific and Engineering Discoveries Through the Integration of Experiment, Big Data, and Modeling and Simulation. SMC 2021. Communications in Computer and Information Science, vol 1512. Springer, Cham. https://doi.org/10.1007/978-3-030-96498-6_20
DOI: https://doi.org/10.1007/978-3-030-96498-6_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-96497-9
Online ISBN: 978-3-030-96498-6
eBook Packages: Computer Science, Computer Science (R0)