
Invited Talk
DOI:10.1145/3218603.3241339

Across the Stack Opportunities for Deep Learning Acceleration

Published: 23 July 2018

Abstract

The combination of growing compute capabilities and the availability of large datasets has led to a rebirth of deep learning. Deep Neural Networks (DNNs) have become state of the art in a variety of machine learning tasks spanning vision, speech, and machine translation. Deep Learning (DL) achieves high accuracy in these tasks at the expense of hundreds of exaops of computation, posing significant challenges to efficient large-scale deployment in both resource-constrained environments and data centers.
One of the key enablers for improving the operational efficiency of DNNs is the observation that, when extracting deep insight from vast quantities of structured and unstructured data, the exactness imposed by traditional computing is not required. Relaxing this "exactness" constraint opens up opportunities for approximate computing across all layers of the system stack.
In this talk we present a multi-TOPS AI core [3] for accelerating deep learning training and inference in systems ranging from edge devices to data centers. We demonstrate that deriving high sustained utilization and energy efficiency from the AI core requires ground-up rethinking to exploit approximate computing across the stack, including algorithms, architecture, programmability, and hardware.
Model accuracy is the fundamental measure of deep learning quality. The compute-engine precision in our AI core is carefully calibrated to realize significant reductions in area and power without compromising numerical accuracy. Our research at the DL algorithms/applications level [2] shows that the precision of both weights and activations can be carefully tuned down to as low as 2 bits for inference; this result guided the choice of compute precisions supported in the architecture and hardware for both training and inference. Similarly, the scalability of distributed DL training is limited by the communication overhead of exchanging gradients and weights after each mini-batch. Our research on gradient compression [1] shows that by selectively sending gradients larger than a threshold, and by further choosing the threshold based on the importance of the gradient, we achieve compression ratios of 40X for convolutional layers and up to 200X for fully-connected layers without losing model accuracy. These results guide the exploration of interconnection network topologies for a system of accelerators built around the AI core.
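To make these two approximation levers concrete, the sketch below illustrates, under simplifying assumptions, a PACT-style clipped low-precision activation quantizer and a threshold-based gradient selection step with residual accumulation. It is a minimal NumPy illustration, not the implementation from [1] or [2]; the function names, parameter values, and sizes are hypothetical.

    import numpy as np

    def pact_quantize(x, alpha=6.0, bits=2):
        # Clip activations to [0, alpha], then quantize uniformly to 2**bits levels,
        # in the spirit of PACT (in the paper, alpha is a learned parameter).
        levels = 2 ** bits - 1
        y = np.clip(x, 0.0, alpha)
        return np.round(y * levels / alpha) * (alpha / levels)

    def threshold_compress(grad, residual, threshold):
        # Transmit only gradient entries whose accumulated magnitude exceeds the
        # threshold; carry the unsent part forward as a residual (a simplified
        # stand-in for AdaComp's adaptive, per-bin threshold selection [1]).
        acc = grad + residual
        mask = np.abs(acc) >= threshold
        indices = np.nonzero(mask)[0]             # sparse indices to transmit
        values = acc[mask]                        # corresponding gradient values
        new_residual = np.where(mask, 0.0, acc)   # unsent gradients, kept locally
        return indices, values, new_residual

    # Toy usage with made-up sizes and thresholds.
    acts = np.abs(np.random.randn(8)) * 3.0
    print(pact_quantize(acts))                    # values snapped to 4 levels in [0, 6]

    grad = np.random.randn(10000) * 0.01
    residual = np.zeros_like(grad)
    idx, vals, residual = threshold_compress(grad, residual, threshold=0.025)
    print(f"sent {idx.size} of {grad.size} gradient entries")

In this toy setting only the few percent of gradient entries above the threshold are transmitted each step, which is the source of the compression ratios discussed above; the residual accumulation ensures that small gradients are not lost but merely deferred.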
Overall, our work shows how the benefits of approximation, exploiting the robustness of algorithms and applications to reduced precision and compressed data communication, can be combined effectively with accelerator architecture and hardware designed to support such reduced-precision computation and compressed communication. Our results demonstrate improved end-to-end efficiency of the DL accelerator across metrics such as high sustained TOPS, TOPS/W, and TOPS/mm², catering to different operating environments for both training and inference.

References

[1]
C.-Y. Chen, J. Choi, D. Brand, A. Agrawal, W. Zhang, and K. Gopalakrishnan. AdaComp: Adaptive residual gradient compression for data-parallel distributed training. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018.
[2]
J. Choi, Z. Wang, S. Venkataramani, P. I. Chuang, V. Srinivasan, and K. Gopalakrishnan. PACT: parameterized clipping activation for quantized neural networks. CoRR, abs/1805.06085, 2018.
[3]
B. Fleischer, S. Shukla, M. Ziegler, J. Silberman, J. Oh, V. Srinivasan, J. Choi, S. Mueller, A. Agrawal, T. Babinsky, N. Cao, C.-Y. Chen, P. Chuang, T. Fox, G. Gristede, M. Guillorn, H. Haynie, M. Klaiber, D. Lee, S.-H. Lo, G. Maier, M. Scheuermann, S. Venkataramani, C. Vezyrtzis, N. Wang, F. Yee, C. Zhou, P.-F. Lu, B. Curran, L. Chang, and K. Gopalakrishnan. A scalable multi-TeraOPS deep learning processor core for AI training and inference. In Proceedings of VLSI Symposium, 2018.

Published In

ISLPED '18: Proceedings of the International Symposium on Low Power Electronics and Design
July 2018
327 pages
ISBN:9781450357043
DOI:10.1145/3218603
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 July 2018

Author Tags

  1. DL Training
  2. DL inference
  3. DNN Optimizations
  4. Deep Learning Accelerators

Qualifiers

  • Invited-talk
  • Research
  • Refereed limited

Conference

ISLPED '18

Acceptance Rates

Overall Acceptance Rate 398 of 1,159 submissions, 34%
