Optimization of NUMA Aware DNN Computing System

  • Conference paper
  • First Online:
  • In: Advanced Intelligent Computing Technology and Applications (ICIC 2024)

Abstract

Modern high-performance computing systems typically adopt the NUMA architecture, a design that mitigates the ‘memory wall’ problem by allowing multiple processors to access their own local memory independently and in parallel. However, the efficiency of large computational workloads, including AI workloads, hinges on careful memory allocation within this architecture. On Linux, for example, the default memory allocation follows the First Touch (FT) policy, which often causes substantial remote memory accesses and imbalanced memory allocation across nodes, degrading the performance of Deep Neural Network (DNN) computations. The primary challenge is that current operating systems cannot accurately detect an application’s memory access pattern. In addition, most optimizations in existing DNN computing systems overlook NUMA-specific challenges, such as those arising from inter-layer dependencies within DNNs and dependencies between memory blocks, and therefore deliver suboptimal performance gains. To address these issues, this paper proposes a NUMA-aware DNN computing system. The system standardizes the memory access pattern across all DNN layers during computation propagation, so that static NUMA optimization techniques minimize the inefficiencies of dynamic memory allocation. Furthermore, we propose a page-aligned memory allocation strategy that prevents the non-local memory accesses caused by inter-block dependencies. Our results show that, compared with existing approaches, our system achieves a maximum single-layer speedup of 1.63x and an overall speedup of 1.37x in DNN computation.
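
As a concrete illustration of the allocation strategy sketched above, the following C snippet shows how a page-aligned tensor block can be bound to a specific NUMA node with Linux's libnuma interface, so that placement no longer depends on the kernel's default first-touch behaviour and no page is shared between blocks on different nodes. This is only a minimal sketch, not the authors' implementation; the helper name alloc_block_on_node and the 4 MiB block size are assumptions made for the example.

```c
/* Minimal sketch: page-aligned, node-local block placement via libnuma.
 * Build with: gcc -O2 numa_block.c -lnuma
 * Illustrative only; not the paper's actual system. */
#include <numa.h>      /* numa_available(), numa_tonode_memory(), ... */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: allocate a page-aligned buffer whose pages are all
 * bound to one NUMA node, so no page straddles two nodes and placement does
 * not depend on which thread happens to touch the memory first. */
static void *alloc_block_on_node(size_t bytes, int node)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t rounded = (bytes + page - 1) / page * page;   /* round up to whole pages */

    void *buf = NULL;
    if (posix_memalign(&buf, page, rounded) != 0)
        return NULL;

    /* Bind the pages to the target node before any thread writes to them,
     * overriding the kernel's default first-touch policy. */
    numa_tonode_memory(buf, rounded, node);
    return buf;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int nodes = numa_num_configured_nodes();
    size_t block_bytes = 4u << 20;            /* assumed 4 MiB tensor block */

    /* One block per node; a worker thread pinned to node i would then read
     * and write only its own block, i.e. only local memory. */
    for (int i = 0; i < nodes; i++) {
        void *block = alloc_block_on_node(block_bytes, i);
        printf("node %d: block at %p\n", i, block);
        free(block);
    }
    return 0;
}
```

In a complete system, each worker thread would also be pinned to its node (for example with numa_run_on_node or a CPU affinity mask) so that the static placement above matches the threads that consume the data.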

Acknowledgments

This work is supported by NSFC funding (No. 62002371) and funding (No. WDZC20235250111).

Author information

Corresponding author

Correspondence to Pan Dong.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Fang, X., Dong, P., Luo, J., Li, L., Ding, Y., Jiang, Z. (2024). Optimization of NUMA Aware DNN Computing System. In: Huang, DS., Zhang, C., Chen, W. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14865. Springer, Singapore. https://doi.org/10.1007/978-981-97-5591-2_11

  • DOI: https://doi.org/10.1007/978-981-97-5591-2_11

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-5590-5

  • Online ISBN: 978-981-97-5591-2

  • eBook Packages: Computer Science, Computer Science (R0)
