Optimization of NUMA Aware DNN Computing System

  • Conference paper
  • First Online:
  • In: Advanced Intelligent Computing Technology and Applications (ICIC 2024)

Abstract

Modern high-performance computing systems typically adopt the NUMA architecture, a design that mitigates the ‘memory wall’ problem by allowing multiple processors to access their own local memory independently and in parallel. However, the efficiency of large computational workloads, including AI workloads, hinges on careful memory allocation within this architecture. On Linux, for example, the default memory allocation follows the First Touch (FT) policy, which often causes substantial remote memory accesses and imbalanced memory allocation across nodes, degrading the performance of Deep Neural Network (DNN) computations. The primary challenge is that current operating systems cannot accurately detect an application’s memory access pattern. In addition, most optimizations in existing DNN computing systems overlook NUMA-specific challenges, such as those arising from inter-layer dependencies within DNNs and dependencies between memory blocks, and therefore deliver suboptimal performance gains. To address these issues, this paper proposes a NUMA-aware DNN computing system. The system standardizes the memory access pattern across all DNN layers during computation propagation, so that static NUMA optimization techniques minimize the inefficiencies of dynamic memory allocation. Furthermore, we propose a page-aligned memory allocation strategy that prevents the non-local memory accesses caused by inter-block dependencies. Our results show that, compared with existing approaches, our system achieves a maximum single-layer speedup of 1.63x and an overall speedup of 1.37x in DNN computation.
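
As a concrete illustration of the allocation strategy sketched above, the following C snippet shows how a page-aligned tensor block can be bound to a specific NUMA node with Linux's libnuma interface, so that placement no longer depends on the kernel's default first-touch behaviour and no page is shared between blocks on different nodes. This is only a minimal sketch, not the authors' implementation; the helper name alloc_block_on_node and the 4 MiB block size are assumptions made for the example.

```c
/* Minimal sketch: page-aligned, node-local block placement via libnuma.
 * Build with: gcc -O2 numa_block.c -lnuma
 * Illustrative only; not the paper's actual system. */
#include <numa.h>      /* numa_available(), numa_tonode_memory(), ... */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: allocate a page-aligned buffer whose pages are all
 * bound to one NUMA node, so no page straddles two nodes and placement does
 * not depend on which thread happens to touch the memory first. */
static void *alloc_block_on_node(size_t bytes, int node)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t rounded = (bytes + page - 1) / page * page;   /* round up to whole pages */

    void *buf = NULL;
    if (posix_memalign(&buf, page, rounded) != 0)
        return NULL;

    /* Bind the pages to the target node before any thread writes to them,
     * overriding the kernel's default first-touch policy. */
    numa_tonode_memory(buf, rounded, node);
    return buf;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int nodes = numa_num_configured_nodes();
    size_t block_bytes = 4u << 20;            /* assumed 4 MiB tensor block */

    /* One block per node; a worker thread pinned to node i would then read
     * and write only its own block, i.e. only local memory. */
    for (int i = 0; i < nodes; i++) {
        void *block = alloc_block_on_node(block_bytes, i);
        printf("node %d: block at %p\n", i, block);
        free(block);
    }
    return 0;
}
```

In a complete system, each worker thread would also be pinned to its node (for example with numa_run_on_node or a CPU affinity mask) so that the static placement above matches the threads that consume the data.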

Acknowledgments

This work is supported by NSFC funding (No. 62002371) and funding (No. WDZC20235250111).

Author information

Corresponding author

Correspondence to Pan Dong.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Fang, X., Dong, P., Luo, J., Li, L., Ding, Y., Jiang, Z. (2024). Optimization of NUMA Aware DNN Computing System. In: Huang, DS., Zhang, C., Chen, W. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14865. Springer, Singapore. https://doi.org/10.1007/978-981-97-5591-2_11

  • DOI: https://doi.org/10.1007/978-981-97-5591-2_11

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-5590-5

  • Online ISBN: 978-981-97-5591-2

  • eBook Packages: Computer Science, Computer Science (R0)
