DOI: 10.1145/3243176.3243203

Compiler assisted coalescing

Published: 01 November 2018

Abstract

Tightly integrated CPU-GPU systems that share a single virtual address space have significantly improved the programmability of GPUs in recent years. To achieve this, however, every memory access from a GPU has to go through an address translation unit such as the TLB, and the heavy demand on these TLBs can become a significant overhead. Previous proposals have suggested an address coalescing unit that merges multiple accesses to the same page into a single access, significantly reducing pressure on the TLB. However, building perfect coalescing logic in real hardware is not feasible, and a simpler hardware coalescing unit gives up many of the benefits of coalescing.
In this paper, we propose compiler assisted coalescing (CAC), which significantly increases the coalescing capability of GPUs. Our CAC compiler annotates instructions that generate coalescable accesses at compile time, while simple bound checking hardware coalesces these accesses at runtime. We also introduce a translation table to the compute unit pipeline that leverages information passed from the CAC compiler to bypass the TLB, further reducing expensive TLB lookups. Evaluation of our technique on a variety of workloads shows that CAC reduces TLB accesses by 62% and TLB dynamic power by 45%.
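
To make the idea of page-level coalescing concrete, the following is a minimal illustrative sketch and not the paper's hardware or compiler design. It assumes 4 KiB pages and a 64-lane wavefront; the names `coalesced_lookups` and `within_one_page` are hypothetical. The first function counts the distinct pages touched by a group of accesses (the number of TLB lookups an ideal coalescer would issue), and the second is one plausible form of a bound check on an annotated access range.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def page_number(addr, page_size=PAGE_SIZE):
    """Virtual page number of a byte address."""
    return addr // page_size


def coalesced_lookups(addresses, page_size=PAGE_SIZE):
    """Distinct pages touched by a group of accesses.

    With ideal coalescing, each distinct page needs one TLB lookup,
    however many lanes access it.
    """
    return sorted({page_number(a, page_size) for a in addresses})


def within_one_page(base, extent_bytes, page_size=PAGE_SIZE):
    """Hypothetical bound check: does [base, base + extent_bytes)
    stay inside a single page?"""
    if extent_bytes <= 0:
        raise ValueError("extent_bytes must be positive")
    last = base + extent_bytes - 1
    return page_number(base, page_size) == page_number(last, page_size)


# Assumed workload: 64 lanes reading consecutive 4-byte elements.
base = 0x10000
lanes = [base + 4 * lane for lane in range(64)]
print(len(lanes), "accesses ->", len(coalesced_lookups(lanes)), "lookup(s)")
print("fits in one page:", within_one_page(base, 4 * len(lanes)))
```

In this sketch the 64 accesses fall in one page, so a single translation would serve all of them; a range that straddles a page boundary fails the check and would need more than one lookup.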

Published In

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques
November 2018
494 pages
ISBN: 9781450359863
DOI: 10.1145/3243176
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

  • IFIP WG 10.3
  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPUs
  2. compilers

Qualifiers

  • Research-article

Conference

PACT '18

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Article Metrics

  • Downloads (Last 12 months): 27
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 30 Nov 2024

Cited By

  • (2024) Energy-Aware Tile Size Selection for Affine Programs on GPUs. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, 13-27. DOI: 10.1109/CGO57630.2024.10444795. Online publication date: 2-Mar-2024
  • (2023) Turn-based Spatiotemporal Coherence for GPUs. ACM Transactions on Architecture and Code Optimization 20:3, 1-27. DOI: 10.1145/3593054. Online publication date: 19-Jul-2023
  • (2022) Page-Address Coalescing of Vector Gather Instructions for Efficient Address Translation. 2022 IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms (IA3), 1-8. DOI: 10.1109/IA356718.2022.00007. Online publication date: Nov-2022
  • (2021) Increasing GPU Translation Reach by Leveraging Under-Utilized On-Chip Resources. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 1169-1181. DOI: 10.1145/3466752.3480105. Online publication date: 18-Oct-2021
