Balanced segmentation of CNNs for multi-TPU inference

The Journal of Supercomputing

Abstract

In this paper, we propose several alternatives for the segmentation of convolutional neural networks (CNNs), addressing inference on computing architectures composed of multiple Edge TPUs. Specifically, we compare the inference performance of a number of state-of-the-art CNN models, taking as references the inference times on a single TPU and the compiler-based pipelined implementation provided by Google's Edge TPU compiler. Starting from a profile-based segmentation strategy, we introduce further refinements to balance the workload across multiple TPUs, leveraging their cooperative computing power, reducing work imbalance and alleviating the memory access bottleneck caused by the limited amount of on-chip memory per TPU. The observed results yield superlinear speedups with respect to a single TPU, and accelerations of up to \(2.60\times \) compared with the multi-TPU segmentation offered by the compiler.
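To make the balancing idea concrete, the minimal sketch below splits a chain of profiled per-layer times into one contiguous segment per TPU so that the slowest segment, which bounds the pipeline throughput, is minimized. It uses the classic linear-partition dynamic program over hypothetical layer times; it is an illustration of profile-based balancing under these assumptions, not the implementation evaluated in the paper.

# Illustrative sketch only: balance profiled per-layer times across s TPUs
# by splitting the layer sequence into s contiguous segments that minimize
# the slowest segment (the pipeline bottleneck). The per-layer times below
# are hypothetical; the paper derives real ones by profiling on Edge TPUs.
from functools import lru_cache
from itertools import accumulate

def balanced_segments(times: list[float], s: int) -> list[list[int]]:
    """Classic linear-partition DP, O(d^2 * s) for d layers."""
    d = len(times)
    prefix = [0.0, *accumulate(times)]  # prefix[i] = sum of times[:i]

    @lru_cache(maxsize=None)
    def best(i: int, k: int) -> float:
        # Minimal bottleneck for layers i..d-1 split into k segments.
        if k == 1:
            return prefix[d] - prefix[i]
        return min(max(prefix[j] - prefix[i], best(j, k - 1))
                   for j in range(i + 1, d - k + 2))

    # Reconstruct one optimal set of cut points from the DP table.
    segments, i = [], 0
    for k in range(s, 0, -1):
        j = d if k == 1 else next(
            j for j in range(i + 1, d - k + 2)
            if max(prefix[j] - prefix[i], best(j, k - 1)) == best(i, k))
        segments.append(list(range(i, j)))
        i = j
    return segments

layer_ms = [0.4, 1.1, 0.7, 2.3, 0.9, 0.5, 1.8, 0.6]  # hypothetical profile
for tpu, seg in enumerate(balanced_segments(layer_ms, s=3)):
    print(f"TPU {tpu}: layers {seg}, time {sum(layer_ms[i] for i in seg):.1f} ms")

For this hypothetical profile, the three segments carry 2.2, 3.2 and 2.9 ms of work, i.e., a pipeline bottleneck of 3.2 ms instead of the 8.3 ms a single device would spend per inference (ignoring transfer overheads).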


Data availability

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Notes

  1. https://keras.io/api/applications/

  2. https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet/lite

  3. The described experimental conditions are also used throughout the rest of the paper.

  4. https://coral.ai/docs/edgetpu/compiler/#model-segmentation.

  5. The synthetic models used are those that require host memory (after the first performance drop) and can leverage the extra on-chip memories because each of their layers occupies less than 8 MiB (before the fourth performance drop). The case of layers occupying more than 8 MiB, illustrated above, is purely synthetic and of little practical relevance, since it does not occur in real models.

  6. The problem amounts to splitting \(d\) depth levels into \(s\) contiguous segments. This is equivalent to choosing the \(s-1\) separators (which delimit the \(s\) segments) among the \(d-1\) positions between consecutive depth levels; hence there are \(\binom{d-1}{s-1}\) options. A short verification sketch follows these notes.
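As a quick sanity check of the count in note 6 (an illustrative snippet, not part of the paper), the closed form \(\binom{d-1}{s-1}\) can be compared against a brute-force enumeration of all separator choices:

# Verify note 6: splitting d depth levels into s contiguous segments
# equals choosing s-1 separators among the d-1 gaps between levels.
from itertools import combinations
from math import comb

def count_segmentations(d: int, s: int) -> int:
    # Closed form: C(d-1, s-1).
    return comb(d - 1, s - 1)

def enumerate_segmentations(d: int, s: int):
    # Brute force: each segmentation corresponds to s-1 cut positions
    # chosen among the gaps 1..d-1; segments are half-open ranges.
    for cuts in combinations(range(1, d), s - 1):
        bounds = (0, *cuts, d)
        yield [list(range(lo, hi)) for lo, hi in zip(bounds, bounds[1:])]

d, s = 8, 3  # e.g., 8 depth levels split into 3 segments
assert count_segmentations(d, s) == sum(1 for _ in enumerate_segmentations(d, s))
print(count_segmentations(d, s))  # C(7, 2) = 21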


Funding

This work has been partially supported by Grants PID2021-126576NB-I00 and TED2021-130123B-I00, funded by MCIN/AEI/10.13039/501100011033, by “ERDF A way of making Europe” and NextGenerationEU/PRTR, and by the CM under Grant S2018/TCS-4423.

Author information


Contributions

J.V. conducted the design, implementation and evaluation of the segmentation strategies described in the paper and collaborated in the writing of the manuscript. L.C. collaborated in the design and critical analysis of the experiments and in the preparation of the manuscript. F.I. collaborated in the definition and supervision of the research tasks and wrote a substantial part of the manuscript. K.O. contributed to the critical analysis of the experimental results and collaborated in the review and writing of the manuscript.

Corresponding author

Correspondence to Jorge Villarrubia.

Ethics declarations

Conflict of interest

There are no conflicts of interest.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Villarrubia, J., Costero, L., Igual, F.D. et al. Balanced segmentation of CNNs for multi-TPU inference. J Supercomput 81, 60 (2025). https://doi.org/10.1007/s11227-024-06605-9

