Compiler assisted coalescing

Authors:

Sooraj Puthoor,

Mikko H. Lipasti

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques

Article No.: 11, Pages 1 - 11

https://doi.org/10.1145/3243176.3243203

Published: 01 November 2018

Abstract

Tightly integrated CPU-GPU systems that share the same virtual address space have significantly improved the programmability of GPUs in recent years. However, to achieve this, every memory access from a GPU has to go through an address translation unit like the TLB and the huge demand on these TLBs can become a significant overhead. Previous proposals have suggested the use of an address coalescing unit that coalesces multiple accesses to the same page into a single access, significantly reducing pressure on the TLB. However, building perfect coalescing logic in real hardware is not feasible and employing a simpler hardware coalescing unit takes away many of the benefits of coalescing.
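
To make the coalescing argument concrete, the following is a minimal sketch rather than the paper's design. It counts TLB lookups for one group of lane addresses under an idealized coalescer (one lookup per distinct page) and under a simpler coalescer. The 4 KiB page size, the function names, and the "merge only with the previous lane" policy are all assumptions made for illustration; the abstract does not specify any of them.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages; the abstract does not state a page size


def page_number(vaddr: int) -> int:
    """Virtual page number of a virtual address."""
    return vaddr // PAGE_SIZE


def perfect_coalesce(addrs):
    """Idealized coalescer: one TLB lookup per distinct page touched."""
    return len({page_number(a) for a in addrs})


def adjacent_coalesce(addrs):
    """Hypothetical simpler coalescer: an access is merged only when the
    immediately preceding lane touched the same page."""
    lookups = 0
    prev_page = None
    for a in addrs:
        page = page_number(a)
        if page != prev_page:
            lookups += 1
        prev_page = page
    return lookups


# Eight lanes whose accesses alternate between two pages.
interleaved = [0x1000, 0x2000, 0x1004, 0x2004, 0x1008, 0x2008, 0x100C, 0x200C]
# Sixty-four lanes reading consecutive 4-byte words within one page.
unit_stride = [0x1000 + 4 * lane for lane in range(64)]

print(len(interleaved), perfect_coalesce(interleaved), adjacent_coalesce(interleaved))
print(len(unit_stride), perfect_coalesce(unit_stride), adjacent_coalesce(unit_stride))
```

For the interleaved pattern, the idealized coalescer needs 2 lookups while the adjacent-only one needs all 8, which is one way a simpler unit can give up most of the benefit. For the unit-stride pattern both need a single lookup.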

In this paper, we propose compiler assisted coalescing (CAC) that significantly increases the coalescing capability of GPUs. Our CAC compiler annotates instructions that generate coalescable accesses at compile time, while simple bound checking hardware coalesces these accesses at runtime. We also introduce a translation table to the compute unit pipeline that leverages information passed from the CAC compiler to bypass the TLB, further reducing expensive TLB lookups. Evaluation of our technique on a variety of workloads shows that CAC reduces the TLB accesses by 62% with a TLB dynamic power reduction of 45%.
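
The division of labor described above can be sketched in software, with heavy caveats. The abstract does not give the annotation encoding, the exact bound check, or the organization of the translation table, so everything below is an assumption: the `coalescable` flag stands in for the compiler annotation, the bound check is modeled as "lowest and highest lane address fall on the same page", and the translation table is a hypothetical small map indexed by a compiler-chosen `slot`.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed 4 KiB pages


@dataclass
class Tlb:
    mapping: dict     # virtual page -> physical page (stand-in for real translations)
    lookups: int = 0  # number of TLB accesses performed so far

    def translate(self, vpage: int) -> int:
        self.lookups += 1
        return self.mapping[vpage]


@dataclass
class ComputeUnit:
    tlb: Tlb
    # Hypothetical translation table: slot -> (virtual page, physical page).
    table: dict = field(default_factory=dict)

    def access(self, addrs, coalescable=False, slot=None):
        """Translate one instruction's lane addresses to physical addresses."""
        if coalescable:
            lo, hi = min(addrs), max(addrs)
            # Assumed bound check: every lane lies on the same page.
            if lo // PAGE_SIZE == hi // PAGE_SIZE:
                vpage = lo // PAGE_SIZE
                entry = self.table.get(slot)
                if entry is not None and entry[0] == vpage:
                    ppage = entry[1]  # table hit: the TLB is bypassed
                else:
                    ppage = self.tlb.translate(vpage)  # one lookup for all lanes
                    if slot is not None:
                        self.table[slot] = (vpage, ppage)
                return [ppage * PAGE_SIZE + a % PAGE_SIZE for a in addrs]
        # Not annotated, or the bound check failed: one lookup per lane.
        return [self.tlb.translate(a // PAGE_SIZE) * PAGE_SIZE + a % PAGE_SIZE
                for a in addrs]


tlb = Tlb(mapping={1: 7, 2: 9})
cu = ComputeUnit(tlb)
lanes = [0x1000 + 4 * lane for lane in range(64)]

cu.access(lanes, coalescable=True, slot=0)  # first use: one TLB lookup, fills the table
cu.access(lanes, coalescable=True, slot=0)  # second use: table hit, no TLB lookup
cu.access(lanes)                            # unannotated: one lookup per lane
print(tlb.lookups)
```

In this toy model the two annotated accesses cost one TLB lookup in total, while the single unannotated access costs 64. The numbers illustrate the mechanism only and are unrelated to the 62% and 45% figures reported in the abstract.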

Index Terms

Compiler assisted coalescing
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
    2. Parallel architectures
      1. Single instruction, multiple data
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Information & Contributors

Information

Published In

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques

November 2018

494 pages

ISBN:9781450359863

DOI:10.1145/3243176

General Chair: Skevos Evripidou (University of Cyprus, Cyprus)

Program Chairs: Per Stenström (Chalmers University of Technology, Sweden), Michael O'Boyle (University of Edinburgh, UK)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

In-Cooperation

IFIP WG 10.3
IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PACT '18

Sponsor:

SIGARCH

PACT '18: International conference on Parallel Architectures and Compilation Techniques

November 1 - 4, 2018

Limassol, Cyprus

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 4
Total Downloads: 302
Downloads (Last 12 months): 27
Downloads (Last 6 weeks): 5

Reflects downloads up to 30 Nov 2024

Citations

Cited By

Jayaweera M, Kong M, Wang Y, Kaeli D, Grosser T, Dubach C, Steuwer M, Xue J, Ottoni G, Quintão Pereira F. (2024). Energy-Aware Tile Size Selection for Affine Programs on GPUs. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, 13-27. DOI: 10.1109/CGO57630.2024.10444795. Online publication date: 2-Mar-2024.
https://dl.acm.org/doi/10.1109/CGO57630.2024.10444795
Puthoor S, Lipasti M. (2023). Turn-based Spatiotemporal Coherence for GPUs. ACM Transactions on Architecture and Code Optimization 20:3, 1-27. DOI: 10.1145/3593054. Online publication date: 19-Jul-2023.
https://dl.acm.org/doi/10.1145/3593054
Takayashiki H, Sato M, Komatsu K, Kobayashi H. (2022). Page-Address Coalescing of Vector Gather Instructions for Efficient Address Translation. 2022 IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms (IA3), 1-8. DOI: 10.1109/IA356718.2022.00007. Online publication date: Nov-2022.
https://doi.org/10.1109/IA356718.2022.00007
Kotra J, LeBeane M, Kandemir M, Loh G. (2021). Increasing GPU Translation Reach by Leveraging Under-Utilized On-Chip Resources. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 1169-1181. DOI: 10.1145/3466752.3480105. Online publication date: 18-Oct-2021.
https://dl.acm.org/doi/10.1145/3466752.3480105
