

Neural acceleration for general-purpose approximate programs

Published: 23 December 2014

Abstract

As improvements in per-transistor speed and energy efficiency diminish, radical departures from conventional approaches are needed to continue improvements in the performance and energy efficiency of general-purpose processors. One such departure is approximate computing, where error in computation is acceptable and the traditional robust digital abstraction of near-perfect accuracy is relaxed. Conventional techniques in energy-efficient computing navigate a design space defined by the two dimensions of performance and energy, and traditionally trade one for the other. General-purpose approximate computing explores a third dimension---error---and trades the accuracy of computation for gains in both energy and performance. Techniques to harvest large savings from small errors have proven elusive. This paper describes a new approach that uses machine learning-based transformations to accelerate approximation-tolerant programs. The core idea is to train a learning model to mimic how an approximable region of code---code that can produce imprecise but acceptable results---behaves, and to replace the original code region with an efficient computation of the learned model. We use neural networks to learn code behavior and approximate it. We describe the Parrot algorithmic transformation, which leverages a simple programmer annotation ("approximable") to transform a code region from a von Neumann model to a neural model. After the learning phase, the compiler replaces the original code with an invocation of a low-power accelerator called a neural processing unit (NPU). The NPU is tightly coupled to the processor pipeline to permit profitable acceleration even when small regions of code are transformed. Offloading approximable code regions to NPUs is faster and more energy efficient than executing the original code.
For a set of diverse applications, NPU acceleration provides whole-application speedup of 2.3× and energy savings of 3.0× on average with average quality loss of at most 9.6%. NPUs form a new class of accelerators and show that significant gains in both performance and efficiency are achievable when the traditional abstraction of near-perfect accuracy is relaxed in general-purpose computing.
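The Parrot transformation described above has three phases: observe an approximable region to record input/output pairs, train a neural network on those pairs, and substitute an invocation of the trained model for the original code. A minimal sketch of that flow in plain Python/NumPy follows; the target region, the 1-8-1 network topology, and the names `approximable_region` and `npu_invoke` are illustrative assumptions for this sketch, not the paper's implementation (which trains multilayer perceptrons offline and evaluates them on a hardware NPU).

```python
import numpy as np

rng = np.random.default_rng(0)

def approximable_region(x):
    # Stand-in for a hot, error-tolerant code region (hypothetical example).
    return np.sin(x) * 0.5 + 0.5

# Phase 1 -- observation: record input/output pairs from the original code.
X = rng.uniform(-np.pi, np.pi, size=(2000, 1))
Y = approximable_region(X)

# Phase 2 -- training: fit a tiny multilayer perceptron (1-8-1, sigmoid
# hidden layer) with batch gradient descent via backpropagation.
W1 = rng.normal(0, 0.5, (1, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(3000):
    H = sigmoid(X @ W1 + b1)            # hidden-layer activations
    P = H @ W2 + b2                     # linear output layer
    err = P - Y                         # prediction error
    gW2 = H.T @ err / len(X); gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * H * (1 - H)     # backpropagate through sigmoid
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# Phase 3 -- substitution: the compiler would replace calls to the original
# region with an NPU invocation; the trained model plays that role here.
def npu_invoke(x):
    return sigmoid(x @ W1 + b1) @ W2 + b2

x_test = rng.uniform(-np.pi, np.pi, size=(500, 1))
quality_loss = np.mean(np.abs(npu_invoke(x_test) - approximable_region(x_test)))
```

The residual `quality_loss` is the analogue of the paper's application-level quality metric: the model's output is imprecise but, for an approximation-tolerant region, acceptably close to the original computation.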

Published In

Communications of the ACM, Volume 58, Issue 1, January 2015, 105 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/2688498
Editor: Moshe Y. Vardi
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Short-paper
  • Research
  • Refereed


Article Metrics

  • Downloads (Last 12 months): 502
  • Downloads (Last 6 weeks): 61
Reflects downloads up to 15 Feb 2025

Cited By
  • (2025) Neural acceleration of incomplete factorization preconditioning. Neural Computing and Applications 37:2, 1009-1026. DOI: 10.1007/s00521-024-10392-y. Online publication date: 1-Jan-2025.
  • (2024) HgPCN: A Heterogeneous Architecture for E2E Embedded Point Cloud Inference. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 1588-1600. DOI: 10.1109/MICRO61859.2024.00116. Online publication date: 2-Nov-2024.
  • (2024) Optimal data distribution in FeFET-based computing-in-memory macros. 2024 IEEE International Symposium on Circuits and Systems (ISCAS), 1-5. DOI: 10.1109/ISCAS58744.2024.10558611. Online publication date: 19-May-2024.
  • (2024) Deep neural networks accelerators with focus on tensor processors. Microprocessors and Microsystems 105, 105005. DOI: 10.1016/j.micpro.2023.105005. Online publication date: Mar-2024.
  • (2023) Turaco: Complexity-Guided Data Sampling for Training Neural Surrogates of Programs. Proceedings of the ACM on Programming Languages 7 (OOPSLA2), 1648-1676. DOI: 10.1145/3622856. Online publication date: 16-Oct-2023.
  • (2023) ABM-SpConv-SIMD: Accelerating Convolutional Neural Network Inference for Industrial IoT Applications on Edge Devices. IEEE Transactions on Network Science and Engineering 10:5, 3071-3085. DOI: 10.1109/TNSE.2022.3154412. Online publication date: 1-Sep-2023.
  • (2023) Camera with Artificial Intelligence of Things (AIoT) Technology for Wildlife Camera Trap System. 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC), 0252-0258. DOI: 10.1109/CCWC57344.2023.10099252. Online publication date: 8-Mar-2023.
  • (2023) Neural Acceleration of Graph Based Utility Functions for Sparse Matrices. IEEE Access 11, 31619-31635. DOI: 10.1109/ACCESS.2023.3262453. Online publication date: 2023.
  • (2022) A survey of architectures of neural network accelerators. SCIENTIA SINICA Informationis 52:4, 596. DOI: 10.1360/SSI-2021-0409. Online publication date: 29-Mar-2022.
  • (2022) As-Is Approximate Computing. ACM Transactions on Architecture and Code Optimization 20:1, 1-26. DOI: 10.1145/3559761. Online publication date: 17-Nov-2022.
