Abstract
In this paper, we propose different alternatives for the segmentation of convolutional neural networks (CNNs), addressing inference on computing architectures composed of multiple Edge TPUs. Specifically, we compare the inference performance of a number of state-of-the-art CNN models, taking as references the inference time on a single TPU and the compiler-based pipelined implementation provided by Google's Edge TPU compiler. Starting from a profile-based segmentation strategy, we provide further refinements that balance the workload across multiple TPUs, leveraging their cooperative computing power, reducing work imbalance, and alleviating the memory access bottleneck caused by the limited amount of on-chip memory per TPU. Compared with a single TPU, the observed results yield superlinear speedups, and accelerations of up to \(2.60\times \) compared with the segmentation offered by the compiler targeting multiple TPUs.
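For illustration only, the following minimal sketch (not the authors' implementation) shows the pipelined multi-TPU execution model evaluated in the paper: each compiled segment is bound to its own Edge TPU through the TensorFlow Lite runtime, and intermediate activations are streamed between stages through queues. The segment file names, input shape, and device strings are assumptions.

```python
import queue
import threading

import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

# One compiled segment per Edge TPU (hypothetical file names).
SEGMENTS = ["seg0_edgetpu.tflite", "seg1_edgetpu.tflite"]

def make_interpreter(path, device):
    # ":N" selects the N-th Edge TPU visible to libedgetpu.
    return Interpreter(
        model_path=path,
        experimental_delegates=[load_delegate("libedgetpu.so.1", {"device": device})])

def stage(interp, in_q, out_q):
    # Each pipeline stage owns one TPU: read an activation, run its
    # segment, and forward the result to the next stage.
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    out = interp.get_output_details()[0]
    while True:
        x = in_q.get()
        interp.set_tensor(inp["index"], x)
        interp.invoke()
        out_q.put(interp.get_tensor(out["index"]))

queues = [queue.Queue() for _ in range(len(SEGMENTS) + 1)]
for i, path in enumerate(SEGMENTS):
    threading.Thread(target=stage,
                     args=(make_interpreter(path, f":{i}"), queues[i], queues[i + 1]),
                     daemon=True).start()

# Stream dummy frames; overlapping the segment executions is what yields
# the pipeline speedup (shape/dtype must match the first segment's input).
NUM_INPUTS = 8
for _ in range(NUM_INPUTS):
    queues[0].put(np.zeros((1, 224, 224, 3), dtype=np.uint8))
outputs = [queues[-1].get() for _ in range(NUM_INPUTS)]
```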
Data availability
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
Notes
The described experimental conditions are also used throughout the rest of the paper.
The synthetic models used are those that require host memory (i.e., beyond the first performance drop) and that can leverage the extra on-chip memories because each of their layers occupies less than 8 MiB (i.e., before the fourth performance drop). The case of layers occupying more than 8 MiB, illustrated above, is purely synthetic and of no practical relevance, as it does not occur in real models.
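A minimal sketch of this selection criterion, assuming a hypothetical list of per-layer sizes in bytes (the 8 MiB on-chip capacity is the only figure taken from the text):

```python
ON_CHIP = 8 * 2**20  # ~8 MiB of on-chip memory per Edge TPU

def qualifies(layer_bytes):
    # A synthetic model qualifies if no single layer exceeds the on-chip
    # memory (so extra TPUs can absorb it) while the model as a whole does
    # (so a single TPU would spill to host memory).
    return max(layer_bytes) < ON_CHIP and sum(layer_bytes) > ON_CHIP

# Example: ten 1 MiB layers fit individually but not in aggregate.
assert qualifies([2**20] * 10)
```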
The problem amounts to splitting \(d\) depth levels into \(s\) contiguous segments. This is equivalent to choosing the \(s-1\) separators (which delimit the \(s\) segments) among the \(d-1\) positions between consecutive depth levels; hence, there are \(\binom{d-1}{s-1}\) options.
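The count is easy to check exhaustively; a small sketch follows (the values of \(d\) and \(s\) are example choices, not taken from the paper):

```python
from itertools import combinations
from math import comb

d, s = 6, 3  # example: 6 depth levels, 3 segments

# A segmentation is determined by the s-1 separator positions chosen
# among the d-1 gaps between consecutive depth levels.
splits = list(combinations(range(1, d), s - 1))
assert len(splits) == comb(d - 1, s - 1)  # C(5, 2) = 10

# Materialize one segmentation, e.g. separators (1, 2) ->
# segments [0], [1], [2, 3, 4, 5].
cuts = (0,) + splits[0] + (d,)
segments = [list(range(cuts[i], cuts[i + 1])) for i in range(len(cuts) - 1)]
```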
Funding
This work has been partially supported by Grants PID2021-126576NB-I00 and TED2021-130123B-I00, funded by MCIN/AEI/10.13039/501100011033, by "ERDF A way of making Europe" and NextGenerationEU/PRTR, and by the CM under Grant S2018/TCS-4423.
Author information
Contributions
J.V. conducted the design, implementation, and evaluation of the segmentation strategies described in the paper and collaborated in the writing of the manuscript. L.C. collaborated in the design and critical analysis of the experiments and in the preparation of the manuscript. F.I. collaborated in the definition and supervision of research tasks and wrote a substantial part of the manuscript. K.O. contributed to the critical analysis of the experimental results and collaborated in the review and writing of the manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Villarrubia, J., Costero, L., Igual, F.D. et al. Balanced segmentation of CNNs for multi-TPU inference. J Supercomput 81, 60 (2025). https://doi.org/10.1007/s11227-024-06605-9