Abstract
In 2015, CSCS and MeteoSwiss, the Swiss national weather and climate service, deployed the first GPU-accelerated HPC system for numerical weather prediction (NWP), which has been in operation since spring 2016. As part of lifecycle management, a system with eight times the performance, able to support an upgraded model, had to be developed at constant cost. This new system is scheduled to go into operation later in 2020. In recent years, the performance of viable GPUs at a given price has not increased sufficiently. Within a fixed budget envelope, the traditional design for operational NWP, with two fully redundant and self-contained systems, was no longer viable for supporting operations of the 2020–2024 model. We addressed the challenge with a software-defined infrastructure concept borrowed from cloud infrastructure technologies, and designed a single system with built-in redundancies that meets the reliability requirements with only 1.5× the number of (expensive) compute nodes needed for operational NWP. Specifically, the concept of network tenants is introduced to define a production tenant, a failover/research-and-development (R&D) tenant, and a system test-and-development tenant. Moreover, operational resiliency metrics are ensured via transparent migration of components, similar to cloud environments but with subtle differences to ensure bare-metal performance and scaling of MeteoSwiss simulations. In this paper, we describe the process for designing and operating a cloud-technology-driven, high-availability operational HPC service in a cost-effective manner.
Acknowledgments
We would like to thank the GridTools developer team for developing the infrastructure software that enables COSMO to run efficiently on GPUs. Additionally, we would like to thank Felix Thaler and Hannes Vogt for their dedication in porting COSMO to use the GridTools libraries. This work has been partially funded by the PASC program in Switzerland.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Alam, S.R. et al. (2020). Software Defined Infrastructure for Operational Numerical Weather Prediction. In: Nichols, J., Verastegui, B., Maccabe, A., Hernandez, O., Parete-Koon, S., Ahearn, T. (eds) Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI. SMC 2020. Communications in Computer and Information Science, vol 1315. Springer, Cham. https://doi.org/10.1007/978-3-030-63393-6_20
DOI: https://doi.org/10.1007/978-3-030-63393-6_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63392-9
Online ISBN: 978-3-030-63393-6
eBook Packages: Computer Science, Computer Science (R0)