Research article
DOI: 10.1145/3373376.3378520

Dimensionality-Aware Redundant SIMT Instruction Elimination

Published: 13 March 2020

Abstract

In massively multithreaded architectures, redundantly executing the same instruction with the same operands in different threads is a significant source of inefficiency. This paper introduces Dimensionality-Aware Redundant SIMT Instruction Elimination (DARSIE), a non-speculative instruction skipping mechanism to reduce redundant operations in GPUs. DARSIE uses static markings from the compiler and information obtained at kernel launch time to skip redundant instructions before they are fetched, keeping them out of the pipeline. DARSIE exploits a new observation that there is significant redundancy across warp instructions in multi-dimensional threadblocks.
For minimal area cost, DARSIE eliminates conditionally redundant instructions without any programmer intervention. On increasingly important 2D GPU applications, DARSIE reduces the number of instructions fetched and executed by 23% over contemporary GPUs. Not fetching these instructions results in a geometric mean of 30% performance improvement, while decreasing the energy consumed by 25%.
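
To make the redundancy observation concrete, the following Python sketch models how the threads of a threadblock are grouped into warps. It assumes the usual CUDA convention that threads are linearized with threadIdx.x varying fastest and then split into warps of 32. The function names and the two example "instructions" are hypothetical and are not taken from the paper. The sketch only illustrates why redundancy across warps can depend on the launch-time block dimensions; it does not model the DARSIE hardware.

```python
WARP_SIZE = 32  # assumed warp width (the value used by NVIDIA GPUs)


def warps(block_dim_x, block_dim_y):
    """Group a 2D threadblock's (x, y) thread indices into warps.

    Threads are linearized with x varying fastest, then split into
    consecutive groups of WARP_SIZE.
    """
    threads = [(x, y) for y in range(block_dim_y) for x in range(block_dim_x)]
    return [threads[i:i + WARP_SIZE] for i in range(0, len(threads), WARP_SIZE)]


def redundant_across_warps(block_dim_x, block_dim_y, instr):
    """Return True if every warp would compute the same vector of values.

    `instr` is a hypothetical per-thread computation taking (x, y).
    """
    vectors = {
        tuple(instr(x, y) for x, y in warp)
        for warp in warps(block_dim_x, block_dim_y)
    }
    return len(vectors) == 1


def col_offset(x, y):
    """Hypothetical instruction that depends only on threadIdx.x."""
    return 4 * x


def row_offset(x, y):
    """Hypothetical instruction that depends only on threadIdx.y."""
    return 4 * y


if __name__ == "__main__":
    print(redundant_across_warps(32, 8, col_offset))   # 2D block, x-only value
    print(redundant_across_warps(32, 8, row_offset))   # 2D block, y-only value
    print(redundant_across_warps(256, 1, col_offset))  # 1D block, x-only value
```

In this model, a 32x8 block places the same threadIdx.x values in every warp, so the x-only computation is identical in all eight warps, whereas the same computation in a 256x1 block differs from warp to warp. Under these assumptions, whether an instruction is redundant across warps is only known once the block dimensions are fixed at kernel launch.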

    Published In

    ASPLOS '20: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems
    March 2020
    1412 pages
    ISBN: 9781450371025
    DOI: 10.1145/3373376
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 March 2020

    Author Tags

    1. gpu
    2. redundant instructions

    Qualifiers

    • Research-article

    Funding Sources

    • JUMP Center co-sponsored by SRC and DARPA
    • Applications Driving Architectures (ADA) Research Center

    Conference

    ASPLOS '20

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 40
    • Downloads (last 6 weeks): 2
    Reflects downloads up to 10 Nov 2024

    Cited By

    • (2023) DrGPUM: Guiding Memory Optimization for GPU-Accelerated Applications. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pages 164-178. DOI: 10.1145/3582016.3582044. Online publication date: 25-Mar-2023
    • (2023) R2D2: Removing ReDunDancy Utilizing Linearity of Address Generation in GPUs. Proceedings of the 50th Annual International Symposium on Computer Architecture, pages 1-14. DOI: 10.1145/3579371.3589039. Online publication date: 17-Jun-2023
    • (2023) Selective High-Latency Arithmetic Instruction Reuse in Multicore Processors. 2023 27th International Conference on System Theory, Control and Computing (ICSTCC), pages 410-415. DOI: 10.1109/ICSTCC59206.2023.10308483. Online publication date: 11-Oct-2023
    • (2022) Automated kernel fusion for GPU based on code motion. Proceedings of the 23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, pages 151-161. DOI: 10.1145/3519941.3535078. Online publication date: 14-Jun-2022
    • (2022) Future Scaling of Memory Hierarchy for Tensor Cores and Eliminating Redundant Shared Memory Traffic Using Inter-Warp Multicasting. IEEE Transactions on Computers, pages 1-12. DOI: 10.1109/TC.2022.3207134. Online publication date: 2022
    • (2022) Extending Sniper with Support to Access Operand Values: A Case Study on Reusability Measurement. 2022 23rd International Carpathian Control Conference (ICCC), pages 70-75. DOI: 10.1109/ICCC54292.2022.9805869. Online publication date: 29-May-2022
    • (2020) GVProf. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-16. DOI: 10.5555/3433701.3433819. Online publication date: 9-Nov-2020
    • (2020) GVPROF: A Value Profiler for GPU-Based Clusters. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-16. DOI: 10.1109/SC41405.2020.00093. Online publication date: Nov-2020
    • (2020) Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 725-737. DOI: 10.1109/MICRO50266.2020.00065. Online publication date: Oct-2020
