DOI: 10.1145/3620665.3640395
research-article
Open access

Energy Efficient Convolutions with Temporal Arithmetic

Published: 27 April 2024

Abstract

Convolution is an important operation at the heart of many applications, including image processing, object detection, and neural networks. While data movement and coordination operations continue to be important areas for optimization in general-purpose architectures, for computation fused with sensor operation, the underlying multiply-accumulate (MAC) operations dominate power consumption. Non-traditional data encoding has been shown to reduce the energy consumption of this arithmetic, with options including everything from reduced-precision floating point to fully stochastic operation, but all of these approaches start with the assumption that a complete analog-to-digital conversion (ADC) has already been done for each pixel. While analog-to-time converters have been shown to use less energy, arithmetically manipulating temporally encoded signals beyond simple min, max, and delay operations has not previously been possible, meaning operations such as convolution have been out of reach. In this paper we show that arithmetic manipulation of temporally encoded signals is possible, practical to implement, and extremely energy efficient.
The core of this new approach is a negative log transformation of the traditional numeric space into a 'delay space' where scaling (multiplication) becomes delay (addition in time). The challenge lies in dealing with addition and subtraction. We show that these operations can also be done directly in this negative log delay space, that the associative and commutative properties still apply to the transformed operations, and that accurate approximations can be built efficiently in hardware using delay elements and basic CMOS logic elements. Furthermore, we show that these operations can be chained together in space or operated recurrently in time. This approach fits naturally into the staged ADC readout inherent to most modern cameras. To evaluate our approach, we develop a software system that automatically transforms traditional convolutions into delay space architectures. The resulting system is used to analyze and balance error from both a new temporal equivalent of quantization and delay element noise, resulting in designs that improve the energy per pixel of each convolution frame by more than 2× compared to the state of the art while improving the energy-delay product by four orders of magnitude.
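The arithmetic behind the negative log transformation can be illustrated numerically. The following is a minimal floating-point sketch, not the hardware design described in the paper: it assumes a value x > 0 is encoded as the delay d = -ln(x), and all function names (to_delay, delay_mul, delay_add, delay_sub, delay_dot) are hypothetical names chosen for this illustration. It computes the transformed operations exactly, whereas the paper builds approximations from delay elements and CMOS logic, and it does not model quantization or delay noise.

```python
import math


def to_delay(x: float) -> float:
    """Encode a positive value x as a delay d = -ln(x) (assumed encoding)."""
    if x <= 0:
        raise ValueError("only positive values have a finite delay encoding")
    return -math.log(x)


def from_delay(d: float) -> float:
    """Decode a delay back to the value it represents."""
    return math.exp(-d)


def delay_mul(a: float, b: float) -> float:
    """Multiplication of values is addition of delays: -ln(x*y) = a + b."""
    return a + b


def delay_add(a: float, b: float) -> float:
    """Addition of values in delay space.

    -ln(e^-a + e^-b) = min(a, b) - ln(1 + e^-|a - b|)
    """
    return min(a, b) - math.log1p(math.exp(-abs(a - b)))


def delay_sub(a: float, b: float) -> float:
    """Subtraction x - y in delay space; needs x > y, i.e. a < b.

    -ln(e^-a - e^-b) = a - ln(1 - e^-(b - a))
    """
    if a >= b:
        raise ValueError("result would be non-positive; it has no delay encoding")
    return a - math.log1p(-math.exp(-(b - a)))


def delay_dot(pixels, weights) -> float:
    """Multiply-accumulate over positive pixels and weights, in delay space."""
    terms = [delay_mul(to_delay(p), to_delay(w)) for p, w in zip(pixels, weights)]
    acc = terms[0]
    for t in terms[1:]:
        acc = delay_add(acc, t)
    return acc


# One output sample of a convolution with three positive taps.
pixels = [0.5, 0.25, 0.8]
weights = [0.2, 0.3, 0.1]
result = from_delay(delay_dot(pixels, weights))
expected = sum(p * w for p, w in zip(pixels, weights))
print(result, expected)
```

Because delay_add is just ordinary addition viewed through the transformation, it inherits commutativity and associativity, so the accumulation order in delay_dot does not affect the result up to floating-point rounding. Negative weights are outside this sketch; they would need delay_sub or a separate treatment of sign.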

Information & Contributors

Information

Published In

ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2
April 2024
1299 pages
ISBN: 9798400703850
DOI: 10.1145/3620665
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

ASPLOS '24

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 666
  • Downloads (Last 6 weeks): 119
Reflects downloads up to 27 Nov 2024
