DOI: 10.1145/3498361.3538932
CoDL: efficient CPU-GPU co-execution for deep learning inference on mobile devices

Published: 27 June 2022

Abstract

Concurrent inference execution on heterogeneous processors is critical for improving the performance of increasingly heavy deep learning (DL) models. However, existing inference frameworks can only use one processor at a time, or achieve little speedup from concurrent execution compared to using a single processor. This is due to the challenges of 1) reducing data-sharing overhead, and 2) properly partitioning each operator between processors.
To address these challenges, we propose CoDL, a concurrent DL inference framework for the CPU and GPU on mobile devices. It fully utilizes the heterogeneous processors to accelerate each operator of a model. It integrates two novel techniques: 1) hybrid-type-friendly data sharing, which allows each processor to use the data type it handles most efficiently for inference; to further reduce data-sharing overhead, we also propose hybrid-dimension partitioning and operator-chain methods; 2) non-linearity- and concurrency-aware latency prediction, which directs proper operator partitioning by building extremely lightweight but accurate latency predictors for the different processors.
Based on the two techniques, we build the end-to-end CoDL inference framework, and evaluate it on different DL models. The results show up to 4.93× speedup and 62.3% energy saving compared with the state-of-the-art concurrent execution system.
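
The partitioning idea described in the abstract can be made concrete with a small sketch. The snippet below is a hypothetical illustration, not CoDL's actual implementation: it assumes a simple linear latency model per processor (the paper's predictors are non-linearity- and concurrency-aware and far more detailed), a single partitioning dimension, and a fixed data-sharing cost. All names here (LinearLatencyModel, partition_operator, share_overhead_ms) are invented for illustration.

```python
# Hypothetical sketch: pick a per-operator CPU/GPU split that minimizes the
# predicted makespan, charging a fixed data-sharing cost to any co-executed
# split. This is an illustrative simplification, not CoDL's actual models.
from dataclasses import dataclass


@dataclass
class LinearLatencyModel:
    """Toy per-processor latency model: latency_ms = a * work + b."""
    a: float
    b: float

    def predict(self, work: float) -> float:
        return self.a * work + self.b


def partition_operator(total_work, cpu_model, gpu_model,
                       share_overhead_ms, steps=100):
    """Return (cpu_fraction, predicted_latency_ms) for one operator.

    Tries GPU-only, CPU-only, and co-executed splits along one dimension
    (e.g., output channels) and keeps whichever minimizes the predicted
    latency; only co-executed splits pay the data-sharing overhead.
    """
    best_frac, best_lat = 0.0, gpu_model.predict(total_work)   # GPU-only
    cpu_only = cpu_model.predict(total_work)                   # CPU-only
    if cpu_only < best_lat:
        best_frac, best_lat = 1.0, cpu_only
    for i in range(1, steps):
        frac = i / steps
        lat = max(cpu_model.predict(frac * total_work),
                  gpu_model.predict((1 - frac) * total_work)) + share_overhead_ms
        if lat < best_lat:
            best_frac, best_lat = frac, lat
    return best_frac, best_lat


if __name__ == "__main__":
    cpu = LinearLatencyModel(a=0.020, b=0.10)   # slower per unit of work
    gpu = LinearLatencyModel(a=0.008, b=0.60)   # faster, higher fixed cost
    frac, lat = partition_operator(256, cpu, gpu, share_overhead_ms=0.05)
    print(f"CPU fraction: {frac:.2f}, predicted latency: {lat:.2f} ms")
```

Under these assumptions, an operator is co-executed only when the predicted makespan of the split, including the sharing cost, beats both single-processor options, which mirrors the abstract's point that data-sharing overhead and partitioning quality jointly determine whether concurrency pays off.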





      Published In

MobiSys '22: Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services
June 2022, 668 pages
ISBN: 9781450391856
DOI: 10.1145/3498361
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 27 June 2022


      Author Tags

      1. CPU-GPU co-execution
      2. deep learning inference
      3. mobile devices

      Qualifiers

      • Research-article


      Conference

      MobiSys '22

      Acceptance Rates

Overall acceptance rate: 274 of 1,679 submissions, 16%


Bibliometrics & Citations

Article Metrics

• Downloads (last 12 months): 428
• Downloads (last 6 weeks): 23

Reflects downloads up to 26 Sep 2024


Cited By

• (2024) Troy: Efficient Service Deployment for Windows Systems. Chinese Journal of Electronics 33(1), 313-322. DOI: 10.23919/cje.2022.00.405. Online publication date: Jan-2024.
• (2024) Context-aware Multi-Model Object Detection for Diversely Heterogeneous Compute Systems. 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1-6. DOI: 10.23919/DATE58400.2024.10546645. Online publication date: 25-Mar-2024.
• (2024) CARIn: Constraint-Aware and Responsive Inference on Heterogeneous Devices for Single- and Multi-DNN Workloads. ACM Transactions on Embedded Computing Systems 23(4), 1-32. DOI: 10.1145/3665868. Online publication date: 29-Jun-2024.
• (2024) AdaOper: Energy-efficient and Responsive Concurrent DNN Inference on Mobile Devices. Proceedings of the 2024 Workshop on Adaptive AIoT Systems, 19-20. DOI: 10.1145/3662007.3663884. Online publication date: 3-Jun-2024.
• (2024) Deep Learning Inference on Heterogeneous Mobile Processors: Potentials and Pitfalls. Proceedings of the 2024 Workshop on Adaptive AIoT Systems, 1-6. DOI: 10.1145/3662007.3663881. Online publication date: 3-Jun-2024.
• (2024) Reaching the Edge of the Edge: Image Analysis in Space. Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning, 29-38. DOI: 10.1145/3650203.3663330. Online publication date: 9-Jun-2024.
• (2024) Practical Optical Camera Communication Behind Unseen and Complex Backgrounds. Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services, 113-126. DOI: 10.1145/3643832.3661866. Online publication date: 3-Jun-2024.
• (2024) Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous System-on-Chips. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 243-256. DOI: 10.1145/3627535.3638502. Online publication date: 2-Mar-2024.
• (2024) An Adaptive Android Memory Management Based on a Lightweight PSO-LSTM Model. 2024 IEEE Wireless Communications and Networking Conference (WCNC), 1-6. DOI: 10.1109/WCNC57260.2024.10570952. Online publication date: 21-Apr-2024.
• (2024) Graft: Efficient Inference Serving for Hybrid Deep Learning With SLO Guarantees via DNN Re-Alignment. IEEE Transactions on Parallel and Distributed Systems 35(2), 280-296. DOI: 10.1109/TPDS.2023.3340518. Online publication date: 1-Feb-2024.
