DOI: 10.1145/2464996.2465022

Exploiting uniform vector instructions for GPGPU performance, energy efficiency, and opportunistic reliability enhancement

Published: 10 June 2013

Abstract

State-of-the-art graphics processing units (GPUs) employ single-instruction multiple-data (SIMD) execution to achieve both high computational throughput and energy efficiency. As prior work has shown, SIMD execution contains significant computational redundancy: different execution lanes often operate on the same operand values. Such value locality is referred to as uniform vectors. In this paper, we first show that, besides the redundancy within a uniform vector, different uniform vectors can also hold identical values. We then propose detailed architecture designs to exploit both types of redundancy. For redundancy within a uniform vector, we propose either extending the vector register file with token bits or adding a small separate scalar register file, eliminating both redundant computation and redundant data storage. For redundancy across different uniform vectors, we adopt instruction reuse, originally proposed for CPU architectures, to detect and eliminate the redundancy. Eliminating redundant computation and data storage yields both significant energy savings and performance improvement. Furthermore, we propose leveraging this redundancy to protect arithmetic-logic units (ALUs) and register files against hardware errors. Our detailed evaluation shows that the proposed design has low hardware overhead and achieves performance gains of up to 23.9% (12.0% on average), energy savings of up to 24.8% (12.6% on average), and protection coverage of 21.1% for ALUs and 14.1% for register files.
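
As a software illustration of the uniform-vector notion described above (a minimal sketch only, not the paper's hardware design: the kernel name, input data, and launch configuration are arbitrary choices for demonstration), the CUDA program below counts how many warps operate on an operand whose value is identical across all 32 lanes, using warp shuffle and vote intrinsics:

// Illustrative only: counts warps whose operand vector is "uniform",
// i.e. every lane holds the same value. Assumes blockDim.x is a
// multiple of 32 so the full-warp mask 0xffffffff is valid.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void count_uniform_warps(const int *in, int *uniform_warps, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int v = in[tid];
    // Broadcast lane 0's operand and test whether every lane matches it.
    int lane0_val = __shfl_sync(0xffffffffu, v, 0);
    int is_uniform = __all_sync(0xffffffffu, v == lane0_val);
    // Lane 0 records whether this warp's operand vector was uniform.
    if ((threadIdx.x & 31) == 0 && is_uniform)
        atomicAdd(uniform_warps, 1);
}

int main() {
    const int n = 1 << 20;                    // 32768 warps of 32 lanes
    int *d_in = nullptr, *d_count = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemset(d_in, 0, n * sizeof(int));     // all-zero input: every warp is uniform
    cudaMemset(d_count, 0, sizeof(int));
    count_uniform_warps<<<(n + 255) / 256, 256>>>(d_in, d_count, n);
    int h_count = 0;
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    printf("uniform warps: %d of %d\n", h_count, n / 32);
    cudaFree(d_in);
    cudaFree(d_count);
    return 0;
}

In the paper's proposals, by contrast, this uniformity is detected in hardware, so a single scalar computation and a single stored copy of the value can serve the entire warp instead of being recomputed and replicated across all lanes.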

    Information

    Published In

    ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputing
    June 2013
    512 pages
    ISBN:9781450321303
    DOI:10.1145/2464996
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 10 June 2013

    Author Tags

    1. GPGPU
    2. redundancy

    Qualifiers

    • Research-article

    Conference

    ICS'13: International Conference on Supercomputing
    June 10 - 14, 2013
    Eugene, Oregon, USA

    Acceptance Rates

    ICS '13 Paper Acceptance Rate: 43 of 202 submissions, 21%
    Overall Acceptance Rate: 629 of 2,180 submissions, 29%

    Bibliometrics & Citations

    Article Metrics

    • Downloads (Last 12 months): 10
    • Downloads (Last 6 weeks): 2
    Reflects downloads up to 09 Nov 2024

    Citations

    Cited By

    • (2023) R2D2: Removing ReDunDancy Utilizing Linearity of Address Generation in GPUs. Proceedings of the 50th Annual International Symposium on Computer Architecture, pp. 1-14. DOI: 10.1145/3579371.3589039. Online publication date: 17-Jun-2023.
    • (2022) ValueExpert: exploring value patterns in GPU-accelerated applications. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 171-185. DOI: 10.1145/3503222.3507708. Online publication date: 28-Feb-2022.
    • (2020) Approximate Cache in GPGPUs. ACM Transactions on Embedded Computing Systems, 19(5), pp. 1-22. DOI: 10.1145/3407904. Online publication date: 26-Sep-2020.
    • (2020) GVPROF: A Value Profiler for GPU-Based Clusters. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-16. DOI: 10.1109/SC41405.2020.00093. Online publication date: Nov-2020.
    • (2020) Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 725-737. DOI: 10.1109/MICRO50266.2020.00065. Online publication date: Oct-2020.
    • (2020) DC-Patch: A Microarchitectural Fault Patching Technique for GPU Register Files. IEEE Access, 8, pp. 173276-173288. DOI: 10.1109/ACCESS.2020.3025899. Online publication date: 2020.
    • (2019) An Aging-Aware GPU Register File Design Based on Data Redundancy. IEEE Transactions on Computers, 68(1), pp. 4-20. DOI: 10.1109/TC.2018.2849376. Online publication date: 1-Jan-2019.
    • (2018) An efficient control flow validation method using redundant computing capacity of dual-processor architecture. PLOS ONE, 13(8), e0201127. DOI: 10.1371/journal.pone.0201127. Online publication date: 1-Aug-2018.
    • (2018) Efficiently Managing the Impact of Hardware Variability on GPUs’ Streaming Processors. ACM Transactions on Design Automation of Electronic Systems, 24(1), pp. 1-15. DOI: 10.1145/3287308. Online publication date: 21-Dec-2018.
    • (2018) Scratch That (But Cache This): A Hybrid Register Cache/Scratchpad for GPUs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(11), pp. 2779-2789. DOI: 10.1109/TCAD.2018.2857043. Online publication date: Nov-2018.