Abstract
To meet the demand for the large computing power required to train complex deep neural networks (DNNs), we establish an AI ecosystem on the Sunway platform that exploits the Sunway series of high-performance computers (HPC). We provide a specially optimized acceleration library for DNN operators on Sunway, namely SWDNNv2, which supports both single precision and half precision. Building on this highly efficient library, we refactor the PyTorch framework to fit the Sunway platform by adopting hardware-specific acceleration and MPI backend support. A lightweight framework with a Python interface, named SWMind, is also developed from scratch to provide higher performance for selected domain models. Techniques for training large models are also discussed, including mixed-precision training and hybrid parallelism. The toolkits in the AI ecosystem have been applied to real projects, such as training a large-scale multi-modality model. We have managed to train a 1-billion-parameter model and achieve performance close to that of the NVIDIA Tesla V100. The high efficiency of SWDNNv2 is demonstrated by the performance of the GEMM operator, which achieves 88.23% and 84.5% of the theoretical FP32 and FP16 peak FLOPS, respectively, on the SW many-core CPU. The evaluation also shows the scalability of the AI framework by training a ResNet-50 model, for which the parallel efficiency reaches 91.51% when scaling to 1024 CPUs.
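To illustrate the training techniques summarized above, the following is a minimal, generic sketch of mixed-precision data-parallel training with static loss scaling over an MPI process group. It uses only the public PyTorch API; the Sunway-specific SWDNNv2 kernels and the refactored backend described in the paper are not shown, and all names in the snippet (train_step, the toy nn.Linear model, the loss_scale value) are illustrative assumptions rather than the authors' code. The sketch also assumes a PyTorch build with MPI support (launched via mpirun) and a device whose kernels support FP16.

    # Sketch: mixed-precision data-parallel training with static loss scaling.
    # Illustrative only; not the authors' SWMind/SWDNNv2 implementation.
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    import torch.nn.functional as F

    def train_step(model, optimizer, inputs, targets, loss_scale=1024.0):
        # One data-parallel training step.
        optimizer.zero_grad()
        outputs = model(inputs.half())                     # forward in FP16
        loss = F.cross_entropy(outputs.float(), targets)   # loss computed in FP32
        (loss * loss_scale).backward()                     # scale to avoid FP16 gradient underflow
        world_size = dist.get_world_size()
        for p in model.parameters():
            if p.grad is not None:
                p.grad.div_(loss_scale)                    # unscale gradients
                # Data parallelism: average gradients across MPI ranks.
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad.div_(world_size)
        optimizer.step()
        return loss.item()

    if __name__ == "__main__":
        dist.init_process_group(backend="mpi")  # requires an MPI-enabled PyTorch build
        model = nn.Linear(1024, 10).half()      # FP16 weights; FP32 master weights omitted for brevity
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        x, y = torch.randn(32, 1024), torch.randint(0, 10, (32,))
        print(train_step(model, optimizer, x, y))

In practice, mixed-precision training also keeps an FP32 master copy of the weights and often adapts the loss scale dynamically; both refinements are omitted here for brevity.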
Acknowledgements
This work is partially supported by the PACMAN (Parallel Architecture and Compiler technology of Mobile, Accelerated, and Networked systems) Laboratory of Tsinghua University. The authors thank Ma Zixuan, Qiu Jiezhong, He Jiaao, and their team for their support and cooperation.
Cite this article
Liu, S., Gao, J., Liu, X. et al. Establishing high performance AI ecosystem on Sunway platform. CCF Trans. HPC 3, 224–241 (2021). https://doi.org/10.1007/s42514-021-00072-x