Energy-Efficient GPGPU Architectures via Collaborative Compilation and Memristive Memory-Based Computing

Authors:

Amirali Ghofrani,

Miguel Angel Lastras-Montano,

Kwang-Ting Cheng,

Rajesh K. Gupta

DAC '14: Proceedings of the 51st Annual Design Automation Conference

Pages 1 - 6

https://doi.org/10.1145/2593069.2593132

Published: 01 June 2014

Abstract

Thousands of deep and wide pipelines working concurrently make GPGPUs power-hungry parts. Energy-efficiency techniques that employ voltage overscaling increase timing sensitivity to variations, which in turn aggravates the energy-use problem. This paper proposes a method to increase the spatiotemporal reuse of computational effort through a combination of compilation and micro-architectural design. An associative memristive memory (AMM) module is integrated with the floating-point units (FPUs). Together, these enable fine-grained partitioning of values and, with the help of application-specific profile feedback, a search over the space of possible inputs that finds high-frequency sets of values for the FPUs. For every kernel execution, the compiler pre-stores these high-frequency sets of values in AMM modules (each representing partial functionality of its associated FPU) that are evaluated concurrently over two clock cycles. Our simulation results show high hit rates with 32-entry AMM modules, which enable a 36% reduction in the average energy use of the kernel codes. Compared to voltage overscaling, this technique enhances robustness against timing errors with a 39% average energy saving.
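The profile-then-pre-store idea in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware or compiler: the names `profile_top_k` and `AssociativeFPU` are hypothetical, a Python dictionary stands in for the associative memory, and matching is exact on full operand pairs, whereas the paper describes fine-grained partitioning of values. Only the 32-entry table size is taken from the abstract.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

Operands = Tuple[float, float]


def profile_top_k(trace: Iterable[Operands], k: int = 32) -> List[Operands]:
    """Return the k most frequent operand pairs seen in a profiling trace.

    Hypothetical stand-in for the compiler's profile-feedback step.
    """
    return [ops for ops, _count in Counter(trace).most_common(k)]


class AssociativeFPU:
    """Toy model of an FPU paired with a small lookup table.

    Assumption: exact-match lookup on the full operand pair; the real
    AMM module and its matching scheme are not modeled here.
    """

    def __init__(self, op: Callable[[float, float], float],
                 entries: Iterable[Operands]) -> None:
        self.op = op
        # Pre-store results for the profiled high-frequency operand pairs.
        self.table: Dict[Operands, float] = {ops: op(*ops) for ops in entries}
        self.hits = 0
        self.misses = 0

    def execute(self, a: float, b: float) -> float:
        key = (a, b)
        if key in self.table:
            # Hit: reuse the stored result instead of recomputing.
            self.hits += 1
            return self.table[key]
        # Miss: fall back to the full computation.
        self.misses += 1
        return self.op(a, b)

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


if __name__ == "__main__":
    # Synthetic profiling trace, made up for illustration only.
    trace = [(1.0, 2.0)] * 5 + [(3.0, 4.0)] * 3 + [(5.0, 6.0)]
    fpu = AssociativeFPU(lambda a, b: a * b, profile_top_k(trace, k=2))
    for a, b in trace:
        fpu.execute(a, b)
    print(f"hit rate: {fpu.hit_rate():.2f}")
```

In this model a hit skips the computation entirely, which is the source of the reuse; the hit rate it reports depends only on the synthetic trace and says nothing about the energy figures quoted in the abstract.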

Published In

DAC '14: Proceedings of the 51st Annual Design Automation Conference

June 2014

1249 pages

ISBN:9781450327305

DOI:10.1145/2593069

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

EDAC: Electronic Design Automation Consortium
SIGBED: ACM Special Interest Group on Embedded Systems
SIGDA: ACM Special Interest Group on Design Automation
IEEE-CEDA

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

DAC '14

DAC '14: The 51st Annual Design Automation Conference 2014

June 1 - 5, 2014

San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,770 of 5,499 submissions, 32%
