Abstract
In this paper, we propose different alternatives for the segmentation of convolutional neural networks (CNNs), addressing inference on computing architectures composed of multiple Edge TPUs. Specifically, we compare the inference performance of a number of state-of-the-art CNN models, taking as references the inference time on a single TPU and the compiler-based pipelined implementation provided by Google's Edge TPU compiler. Starting from a profile-based segmentation strategy, we provide further refinements that balance the workload across multiple TPUs, leveraging their cooperative computing power, reducing work imbalance, and alleviating the memory access bottleneck caused by the limited amount of on-chip memory per TPU. Compared with a single TPU, the observed results yield superlinear speedups, and accelerations of up to \(2.60\times \) compared with the segmentation offered by the compiler targeting multiple TPUs.
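For illustration only, the following minimal sketch (not the authors' implementation) shows the pipelined multi-TPU execution model evaluated in the paper: each compiled segment is bound to its own Edge TPU through the TensorFlow Lite runtime, and intermediate activations are streamed between stages through queues. The segment file names, input shape, and device strings are assumptions.

```python
import queue
import threading

import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

# One compiled segment per Edge TPU (hypothetical file names).
SEGMENTS = ["seg0_edgetpu.tflite", "seg1_edgetpu.tflite"]

def make_interpreter(path, device):
    # ":N" selects the N-th Edge TPU visible to libedgetpu.
    return Interpreter(
        model_path=path,
        experimental_delegates=[load_delegate("libedgetpu.so.1", {"device": device})])

def stage(interp, in_q, out_q):
    # Each pipeline stage owns one TPU: read an activation, run its
    # segment, and forward the result to the next stage.
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    out = interp.get_output_details()[0]
    while True:
        x = in_q.get()
        interp.set_tensor(inp["index"], x)
        interp.invoke()
        out_q.put(interp.get_tensor(out["index"]))

queues = [queue.Queue() for _ in range(len(SEGMENTS) + 1)]
for i, path in enumerate(SEGMENTS):
    threading.Thread(target=stage,
                     args=(make_interpreter(path, f":{i}"), queues[i], queues[i + 1]),
                     daemon=True).start()

# Stream dummy frames; overlapping the segment executions is what yields
# the pipeline speedup (shape/dtype must match the first segment's input).
NUM_INPUTS = 8
for _ in range(NUM_INPUTS):
    queues[0].put(np.zeros((1, 224, 224, 3), dtype=np.uint8))
outputs = [queues[-1].get() for _ in range(NUM_INPUTS)]
```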
Data availability
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
Notes
The described experimental conditions are also used throughout the rest of the paper.
The synthetic models used are those that require host memory (i.e., beyond the first performance drop) and that can leverage the extra on-chip memories because each of their layers occupies less than 8 MiB (i.e., before the fourth performance drop). The case of layers occupying more than 8 MiB, illustrated above, is purely synthetic and of no practical relevance, as it does not occur in real models.
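A minimal sketch of this selection criterion, assuming a hypothetical list of per-layer sizes in bytes (the 8 MiB on-chip capacity is the only figure taken from the text):

```python
ON_CHIP = 8 * 2**20  # ~8 MiB of on-chip memory per Edge TPU

def qualifies(layer_bytes):
    # A synthetic model qualifies if no single layer exceeds the on-chip
    # memory (so extra TPUs can absorb it) while the model as a whole does
    # (so a single TPU would spill to host memory).
    return max(layer_bytes) < ON_CHIP and sum(layer_bytes) > ON_CHIP

# Example: ten 1 MiB layers fit individually but not in aggregate.
assert qualifies([2**20] * 10)
```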
The problem amounts to splitting \(d\) depth levels into \(s\) contiguous segments. This is equivalent to choosing the \(s-1\) separators (which delimit the \(s\) segments) among the \(d-1\) positions between consecutive depth levels; hence, there are \(\binom{d-1}{s-1}\) options.
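The count is easy to check exhaustively; a small sketch follows (the values of \(d\) and \(s\) are example choices, not taken from the paper):

```python
from itertools import combinations
from math import comb

d, s = 6, 3  # example: 6 depth levels, 3 segments

# A segmentation is determined by the s-1 separator positions chosen
# among the d-1 gaps between consecutive depth levels.
splits = list(combinations(range(1, d), s - 1))
assert len(splits) == comb(d - 1, s - 1)  # C(5, 2) = 10

# Materialize one segmentation, e.g. separators (1, 2) ->
# segments [0], [1], [2, 3, 4, 5].
cuts = (0,) + splits[0] + (d,)
segments = [list(range(cuts[i], cuts[i + 1])) for i in range(len(cuts) - 1)]
```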
Funding
This work has been partially supported by Grants PID2021-126576NB-I00 and TED2021-130123B-I00, funded by MCIN/AEI/10.13039/501100011033, by "ERDF A way of making Europe" and NextGenerationEU/PRTR, and by the CM under Grant S2018/TCS-4423.
Author information
Contributions
J.V. conducted the design, implementation, and evaluation of the segmentation strategies described in the paper and collaborated in the writing of the manuscript. L.C. collaborated in the design and critical analysis of the experiments and in the preparation of the manuscript. F.I. collaborated in the definition and supervision of research tasks and wrote a substantial part of the manuscript. K.O. contributed to the critical analysis of the experimental results and collaborated in the review and writing of the manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Villarrubia, J., Costero, L., Igual, F.D. et al. Balanced segmentation of CNNs for multi-TPU inference. J Supercomput 81, 60 (2025). https://doi.org/10.1007/s11227-024-06605-9