Dimensionality-Aware Redundant SIMT Instruction Elimination

Authors:

Roland N. Green,

Timothy G. Rogers

ASPLOS '20: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems

Pages 1327 - 1340

https://doi.org/10.1145/3373376.3378520

Published: 13 March 2020

Abstract

In massively multithreaded architectures, redundantly executing the same instruction with the same operands in different threads is a significant source of inefficiency. This paper introduces Dimensionality-Aware Redundant SIMT Instruction Elimination (DARSIE), a non-speculative instruction skipping mechanism to reduce redundant operations in GPUs. DARSIE uses static markings from the compiler and information obtained at kernel launch time to skip redundant instructions before they are fetched, keeping them out of the pipeline. DARSIE exploits a new observation that there is significant redundancy across warp instructions in multi-dimensional threadblocks.

For minimal area cost, DARSIE eliminates conditionally redundant instructions without any programmer intervention. On increasingly important 2D GPU applications, DARSIE reduces the number of instructions fetched and executed by 23% over contemporary GPUs. Not fetching these instructions results in a geometric mean of 30% performance improvement, while decreasing the energy consumed by 25%.
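
The observation about multi-dimensional threadblocks can be illustrated with a small simulation. The sketch below is not DARSIE's mechanism, which the abstract does not detail. It only assumes a warp size of 32 and CUDA's documented row-major linearization of thread indices (`tid = threadIdx.y * blockDim.x + threadIdx.x`), and checks whether every warp in a threadblock sees the same vector of `threadIdx.x` values. When it does, an instruction whose operands depend only on `threadIdx.x` produces identical results in every warp. The function names `warp_lane_ids` and `x_operands_redundant` are hypothetical and chosen for illustration.

```python
WARP_SIZE = 32  # assumption: warp width of contemporary NVIDIA GPUs


def warp_lane_ids(block_dim_x, block_dim_y, warp_size=WARP_SIZE):
    """Return, for each warp, the (threadIdx.x, threadIdx.y) pair of every lane.

    Threads are linearized row-major (tid = y * blockDim.x + x) and then
    split into consecutive groups of warp_size threads.
    """
    total = block_dim_x * block_dim_y
    warps = []
    for base in range(0, total, warp_size):
        lanes = [
            (tid % block_dim_x, tid // block_dim_x)
            for tid in range(base, min(base + warp_size, total))
        ]
        warps.append(lanes)
    return warps


def x_operands_redundant(block_dim_x, block_dim_y):
    """True if the per-lane threadIdx.x vector is identical in every warp."""
    warps = warp_lane_ids(block_dim_x, block_dim_y)
    x_vectors = {tuple(x for x, _ in lanes) for lanes in warps}
    return len(x_vectors) == 1


if __name__ == "__main__":
    # Whether the redundancy exists depends on the launch-time block shape.
    for dims in [(16, 16), (32, 8), (48, 4), (256, 1)]:
        print(dims, x_operands_redundant(*dims))
```

In this model a 16x16 or 32x8 threadblock makes `threadIdx.x`-only computations identical across warps, while a 48x4 or a one-dimensional 256x1 block does not. This is one way to read "conditionally redundant": the redundancy holds only for certain threadblock dimensions, which are known at kernel launch time.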

Index Terms

Dimensionality-Aware Redundant SIMT Instruction Elimination
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data

Published In

ASPLOS '20: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems

March 2020

1412 pages

ISBN:9781450371025

DOI:10.1145/3373376

General Chair: James Larus (EPFL)

Program Chairs: Luis Ceze (University of Washington), Karin Strauss (Microsoft)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

JUMP Center co-sponsored by SRC and DARPA
Applications Driving Architectures (ADA) Research Center

Conference

ASPLOS '20

ASPLOS '20: Architectural Support for Programming Languages and Operating Systems

March 16 - 20, 2020

Lausanne, Switzerland

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Article Metrics

Total Citations: 9
Total Downloads: 521
Downloads (Last 12 months): 40
Downloads (Last 6 weeks): 2

Reflects downloads up to 10 Nov 2024

Cited By

Lin M, Zhou K, Su P, Aamodt T, Jerger N, Swift M (2023). DrGPUM: Guiding Memory Optimization for GPU-Accelerated Applications. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pp. 164-178. doi:10.1145/3582016.3582044. Online publication date: 25-Mar-2023.
https://dl.acm.org/doi/10.1145/3582016.3582044

Ha D, Oh Y, Ro W, Solihin Y, Heinrich M (2023). R2D2: Removing ReDunDancy Utilizing Linearity of Address Generation in GPUs. Proceedings of the 50th Annual International Symposium on Computer Architecture, pp. 1-14. doi:10.1145/3579371.3589039. Online publication date: 17-Jun-2023.
https://dl.acm.org/doi/10.1145/3579371.3589039

Buduleci C, Gellert A, Florea A (2023). Selective High-Latency Arithmetic Instruction Reuse in Multicore Processors. 2023 27th International Conference on System Theory, Control and Computing (ICSTCC), pp. 410-415. doi:10.1109/ICSTCC59206.2023.10308483. Online publication date: 11-Oct-2023.
https://doi.org/10.1109/ICSTCC59206.2023.10308483

Fukuhara J, Takimoto M, Grosser T, Lee K (2022). Automated kernel fusion for GPU based on code motion. Proceedings of the 23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 151-161. doi:10.1145/3519941.3535078. Online publication date: 14-Jun-2022.
https://dl.acm.org/doi/10.1145/3519941.3535078

Lee S, Hwang S, Kim M, Choi J, Ahn J (2022). Future Scaling of Memory Hierarchy for Tensor Cores and Eliminating Redundant Shared Memory Traffic Using Inter-Warp Multicasting. IEEE Transactions on Computers, pp. 1-12. doi:10.1109/TC.2022.3207134. Online publication date: 2022.
https://doi.org/10.1109/TC.2022.3207134

Buduleci C, Gellert A, Florea A, Matei A (2022). Extending Sniper with Support to Access Operand Values: A Case Study on Reusability Measurement. 2022 23rd International Carpathian Control Conference (ICCC), pp. 70-75. doi:10.1109/ICCC54292.2022.9805869. Online publication date: 29-May-2022.
https://doi.org/10.1109/ICCC54292.2022.9805869

Zhou K, Hao Y, Mellor-Crummey J, Meng X, Liu X, Cuicchi C, Qualters I, Kramer W (2020). GVProf. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-16. doi:10.5555/3433701.3433819. Online publication date: 9-Nov-2020.
https://dl.acm.org/doi/10.5555/3433701.3433819

Zhou K, Hao Y, Mellor-Crummey J, Meng X, Liu X (2020). GVPROF: A Value Profiler for GPU-Based Clusters. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-16. doi:10.1109/SC41405.2020.00093. Online publication date: Nov-2020.
https://doi.org/10.1109/SC41405.2020.00093

Kim H, Ahn S, Oh Y, Kim B, Ro W, Song W (2020). Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 725-737. doi:10.1109/MICRO50266.2020.00065. Online publication date: Oct-2020.
https://doi.org/10.1109/MICRO50266.2020.00065
