Abstract
Modern high-performance computing systems commonly adopt the NUMA (Non-Uniform Memory Access) architecture, which alleviates the 'memory wall' problem caused by simultaneous memory accesses from multiple processors by giving each processor independent access to its own local memory. However, large computational workloads, including AI workloads, depend on careful memory placement within this architecture. Linux, for example, allocates memory under the default first-touch (FT) policy, which often leads to frequent remote memory accesses and imbalanced allocation across nodes, degrading the performance of Deep Neural Network (DNN) computation. The root of the problem is that current operating systems cannot accurately detect an application's memory access patterns. In addition, most optimizations in existing DNN computing systems overlook NUMA-specific challenges, such as inter-layer dependencies within DNNs and dependencies between memory blocks, and therefore deliver suboptimal performance gains. To address these issues, this paper proposes a NUMA-aware DNN computing system. The system regularizes the memory access pattern of all DNN layers during computation propagation, replacing costly dynamic memory allocation with static NUMA optimization. We further propose a page-aligned memory allocation strategy that prevents the non-local memory accesses caused by inter-block dependencies. Compared with existing approaches, our system achieves a speedup of up to 1.63x on individual layers and 1.37x end-to-end for DNN computation.
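To make the placement problem concrete, the sketch below illustrates the contrast the abstract draws: instead of relying on Linux's default first-touch policy, a buffer is allocated page-aligned and explicitly bound to one NUMA node with mbind(2) from libnuma. This is a minimal illustration of the underlying mechanism, not the paper's implementation; the helper name alloc_on_node, the 64 MiB buffer size, and the choice of node 0 are assumptions for demonstration only.

```c
/*
 * Minimal sketch (not the paper's system): page-aligned, node-bound
 * allocation on Linux via libnuma, in contrast to the default
 * first-touch (FT) policy. Build: gcc numa_alloc.c -lnuma
 */
#define _GNU_SOURCE
#include <numa.h>      /* numa_available */
#include <numaif.h>    /* mbind, MPOL_BIND */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Allocate `size` bytes aligned to a page boundary and bind the pages
 * to NUMA node `node`. Page alignment ensures one tensor never shares
 * a page with data that belongs on another node, which is the kind of
 * inter-block dependency the abstract describes. */
static void *alloc_on_node(size_t size, int node)
{
    long page = sysconf(_SC_PAGESIZE);
    void *buf = NULL;
    if (posix_memalign(&buf, (size_t)page, size) != 0)
        return NULL;

    unsigned long nodemask = 1UL << node;
    /* Bind pages before they are faulted in; under first-touch they
     * would instead land on whichever node touches them first. */
    if (mbind(buf, size, MPOL_BIND, &nodemask,
              sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");
        free(buf);
        return NULL;
    }
    memset(buf, 0, size);  /* fault pages in now, on the bound node */
    return buf;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }
    size_t size = 64UL << 20;            /* 64 MiB example buffer */
    void *buf = alloc_on_node(size, 0);  /* place on node 0 */
    printf("%s\n", buf ? "buffer bound to node 0" : "allocation failed");
    free(buf);
    return 0;
}
```

Binding before the first touch matters: MPOL_BIND only governs pages that have not yet been faulted in, so the memset deliberately comes after mbind to materialize the pages on the chosen node.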
Acknowledgments
This work is supported by NSFC funding (No. 62002371) and funding (No. WDZC20235250111).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Fang, X., Dong, P., Luo, J., Li, L., Ding, Y., Jiang, Z. (2024). Optimization of NUMA Aware DNN Computing System. In: Huang, DS., Zhang, C., Chen, W. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14865. Springer, Singapore. https://doi.org/10.1007/978-981-97-5591-2_11
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5590-5
Online ISBN: 978-981-97-5591-2