Pressure-Driven Hardware Managed Thread Concurrency for Irregular Applications

Authors:

John D. Leidel,

Yong Chen

IA3'17: Proceedings of the Seventh Workshop on Irregular Applications: Architectures and Algorithms

Article No.: 7, Pages 1 - 8

https://doi.org/10.1145/3149704.3149705

Published: 12 November 2017

Abstract

Given the increasing importance of efficient data-intensive computing, we find that modern processor designs are not well suited to the irregular memory access patterns found in these algorithms. This research focuses on mapping the compiler's instruction cost scheduling logic to hardware-managed concurrency controls in order to minimize pipeline stalls. In this manner, the hardware modules managing the low-latency thread concurrency can be directly understood by modern compilers. We introduce a thread context switching method that is managed directly via a set of hardware-based mechanisms coupled to the compiler's instruction scheduler. As individual instructions from a thread execute, their respective costs are accumulated into a control register. Once the register reaches a predetermined saturation point, the thread is forced to context switch. We evaluate the performance benefits of our approach using a series of 24 benchmarks, which exhibit performance acceleration of up to 14.6X.
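
The accumulate-and-saturate mechanism described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware design: the per-opcode costs in `COST`, the threshold value, the round-robin choice of the next thread, and clearing the register on switch-in are all hypothetical choices made for illustration.

```python
from collections import deque

# Hypothetical per-opcode costs. The paper couples costs to the compiler's
# instruction scheduler; these specific numbers are invented for illustration.
COST = {"alu": 1, "branch": 2, "store": 4, "load": 8}


def run(threads, threshold):
    """Model pressure-driven context switching over a set of threads.

    threads:   dict mapping a thread id to its list of opcode names.
    threshold: assumed saturation point of the pressure register.
    Returns a list of (thread_id, opcodes_executed_in_that_slice) tuples.
    """
    ready = deque((tid, deque(ops)) for tid, ops in threads.items())
    trace = []
    while ready:
        tid, ops = ready.popleft()
        pressure = 0      # models the control register (assumed cleared on switch-in)
        executed = []
        # Execute until the thread finishes or the register saturates.
        while ops and pressure < threshold:
            op = ops.popleft()
            pressure += COST[op]
            executed.append(op)
        trace.append((tid, executed))
        if ops:           # unfinished thread rejoins the back of the queue
            ready.append((tid, ops))
    return trace


if __name__ == "__main__":
    demo = {
        "t0": ["alu", "load", "alu", "alu"],
        "t1": ["alu", "alu", "alu"],
    }
    for tid, ops in run(demo, threshold=8):
        print(tid, ops)
```

In this model, a thread that issues an expensive instruction such as the hypothetical `load` saturates its register quickly and yields, while a thread of cheap instructions keeps running longer before it is switched out.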

Cited By

Khojasteh H, Tabatabaei H (2023). A Survey on the Proposed Architectures for Efficient Execution of Irregular Applications Using Pipeline Parallelism. 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE), 2080-2087. Online publication date: 24-Jul-2023. https://doi.org/10.1109/CSCE60160.2023.00342

Leidel J, Wang X, Williams B, Chen Y (2020). Toward a Microarchitecture for Efficient Execution of Irregular Applications. ACM Transactions on Parallel Computing 7:4, 1-24. Online publication date: 27-Sep-2020. https://dl.acm.org/doi/10.1145/3418082

Leidel J, Wang X, Chen Y (2018). GoblinCore-64: A RISC-V Based Architecture for Data Intensive Computing. 2018 IEEE High Performance extreme Computing Conference (HPEC), 1-8. Online publication date: Sep-2018. https://doi.org/10.1109/HPEC.2018.8547560


Published In

IA3'17: Proceedings of the Seventh Workshop on Irregular Applications: Architectures and Algorithms

November 2017

78 pages

ISBN: 9781450351362

DOI: 10.1145/3149704

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article
Research
Refereed limited

Conference

SC '17

Sponsor:

SIGHPC

SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 17, 2017

Denver, CO, USA

Acceptance Rates

IA3'17 Paper Acceptance Rate 6 of 22 submissions, 27%;

Overall Acceptance Rate 18 of 67 submissions, 27%
