DOI: 10.1145/3453688.3461738
Research article, GLSVLSI Conference Proceedings

3U-EdgeAI: Ultra-Low Memory Training, Ultra-Low Bitwidth Quantization, and Ultra-Low Latency Acceleration

Published: 22 June 2021

Abstract

Deep neural network (DNN) based AI applications on the edge require both low-cost computing platforms and high-quality services. However, the limited memory, computing resources, and power budget of edge devices constrain the effectiveness of DNN algorithms, making it challenging to develop edge-oriented AI algorithms and implementations (e.g., accelerators). In this paper, we summarize our recent efforts toward efficient on-device AI development, covering both training and inference, from three aspects. First, we present on-device training with ultra-low memory usage: we propose a novel rank-adaptive tensorized neural network model, which offers orders-of-magnitude memory reduction during training. Second, we introduce an ultra-low bitwidth quantization method for DNN model compression, achieving state-of-the-art accuracy under the same compression ratio. Third, we introduce an ultra-low latency DNN accelerator design, following a software/hardware co-design methodology. This paper emphasizes the importance and efficacy of training, quantization, and accelerator design, and calls for more research breakthroughs in this area to enable AI on the edge.
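As a rough illustration of the first two ideas, the short NumPy sketch below contrasts a dense layer with a tensor-train (TT) factorized layer to show where the orders-of-magnitude memory reduction comes from, and ternarizes a weight matrix using the threshold heuristic from Li and Liu's Ternary Weight Networks, one of the low-bitwidth baselines in this line of work. This is a minimal sketch, not the paper's implementation; the shapes, the rank choice, and the function names are hypothetical.

    # Minimal sketch, not the authors' code: shapes, ranks, and names here
    # are hypothetical choices for illustration.
    import numpy as np

    # --- Ultra-low memory training via tensorization -----------------------
    # A 1024x1024 dense weight matrix is viewed as a (32*32) x (32*32) tensor
    # and stored as a chain of small tensor-train (TT) cores. Small TT ranks
    # shrink the parameter (and training memory) footprint dramatically.

    def tt_parameter_count(in_modes, out_modes, ranks):
        """Parameter count of a TT-factorized matrix; boundary ranks are 1."""
        r = [1] + list(ranks) + [1]
        return sum(r[k] * m * n * r[k + 1]
                   for k, (m, n) in enumerate(zip(in_modes, out_modes)))

    dense = 1024 * 1024
    tt = tt_parameter_count(in_modes=(32, 32), out_modes=(32, 32), ranks=(8,))
    print(f"dense: {dense} params, TT rank 8: {tt} params, {dense // tt}x smaller")

    # --- Ultra-low bitwidth quantization ------------------------------------
    # Ternary quantization maps each weight to {-alpha, 0, +alpha}. This
    # variant uses the Ternary Weight Networks heuristic:
    # threshold = 0.7 * mean(|W|), alpha = mean of the surviving magnitudes.

    def ternarize(w):
        delta = 0.7 * np.abs(w).mean()                 # layer-wise threshold
        mask = np.abs(w) > delta                       # weights kept nonzero
        alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
        return alpha * np.sign(w) * mask               # {-alpha, 0, +alpha}

    w = np.random.randn(4, 4).astype(np.float32)
    print(np.unique(ternarize(w)))                     # at most 3 distinct values

Running it reports a 64x parameter reduction for the TT layer and at most three distinct values in the ternarized weights. The paper's actual methods go further: the training approach adapts the tensor ranks automatically during training, and the quantizer chooses values to minimize quantization loss rather than applying a fixed threshold.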

Supplemental Material

MP4 file: presentation video

Published In

GLSVLSI '21: Proceedings of the 2021 Great Lakes Symposium on VLSI
June 2021, 504 pages
ISBN: 9781450383936
DOI: 10.1145/3453688

Publisher

Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. dnn acceleration
      2. dnn quantization
      3. edge ai
      4. on-device training

      Conference

GLSVLSI '21: Great Lakes Symposium on VLSI 2021
June 22-25, 2021, Virtual Event, USA

      Acceptance Rates

Overall acceptance rate: 312 of 1,156 submissions (27%)

Article Metrics

• Downloads (last 12 months): 23
• Downloads (last 6 weeks): 3
Reflects downloads up to 10 Nov 2024

Cited By

• (2024) Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs. ACM Transactions on Architecture and Code Optimization 21(2), 1-24. DOI: 10.1145/3643682
• (2023) A Novel Low-Power Compression Scheme for Systolic Array-Based Deep Learning Accelerators. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42(4), 1085-1098. DOI: 10.1109/TCAD.2022.3198036
• (2023) Enabling All In-Edge Deep Learning: A Literature Review. IEEE Access 11, 3431-3460. DOI: 10.1109/ACCESS.2023.3234761
• (2022) Layer-Wise Data-Free CNN Compression. 26th International Conference on Pattern Recognition (ICPR), 2019-2026. DOI: 10.1109/ICPR56361.2022.9956237
• (2022) A Survey of State-of-the-art on Edge Computing: Theoretical Models, Technologies, Directions, and Development Paths. IEEE Access 10, 54038-54063. DOI: 10.1109/ACCESS.2022.3176106
