DOI: 10.1145/3149704.3149705
research-article

Pressure-Driven Hardware Managed Thread Concurrency for Irregular Applications

Published: 12 November 2017

Abstract

Efficient data-intensive computing is increasingly important, yet modern processor designs are not well suited to the irregular memory access patterns found in data-intensive algorithms. This research maps the compiler's instruction cost scheduling logic onto hardware-managed concurrency controls in order to minimize pipeline stalls, so that the hardware modules managing low-latency thread concurrency can be directly understood by modern compilers. We introduce a thread context switching method managed directly by a set of hardware-based mechanisms that are coupled to the compiler's instruction scheduler. As individual instructions from a thread execute, their respective costs are accumulated into a control register. Once the register reaches a predetermined saturation point, the thread is forced to context switch. We evaluate the performance benefits of our approach using a series of 24 benchmarks, which exhibit speedups of up to 14.6X.
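
The accumulate-and-switch mechanism described in the abstract can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the paper's hardware design: the cost table `COST`, the function `run`, the round-robin thread selection, and the choice to clear the register whenever a thread is switched in are all hypothetical and are not taken from the abstract.

```python
from collections import deque

# Hypothetical per-instruction costs. In the paper these would come from the
# compiler's instruction cost model; the values here are illustrative only.
COST = {"alu": 1, "branch": 2, "load": 10, "store": 10}


def run(threads, threshold):
    """Simulate pressure-driven context switching over several threads.

    threads:   list of instruction-class lists, one list per thread.
    threshold: assumed saturation point of the per-thread control register.
    Returns the issue order as a list of (thread_id, instruction) pairs.
    """
    ready = deque((tid, deque(instrs)) for tid, instrs in enumerate(threads))
    trace = []
    while ready:
        tid, instrs = ready.popleft()
        pressure = 0  # assumption: register is cleared when a thread is switched in
        while instrs and pressure < threshold:
            op = instrs.popleft()
            pressure += COST[op]  # accumulate the cost of the issued instruction
            trace.append((tid, op))
        if instrs:
            # Register saturated with work remaining: force a context switch
            # by moving the thread to the back of the ready queue.
            ready.append((tid, instrs))
    return trace


if __name__ == "__main__":
    for tid, op in run([["alu", "alu", "load", "alu"], ["load", "alu"]], 10):
        print(tid, op)
```

In this model a high-cost instruction such as a load saturates the register quickly, so the thread yields right after issuing it, while a run of cheap ALU instructions keeps the thread resident for longer. Real hardware would choose the next thread and the threshold according to its own policy.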

Cited By

  • (2023) A Survey on the Proposed Architectures for Efficient Execution of Irregular Applications Using Pipeline Parallelism. 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE), 2080-2087. DOI: 10.1109/CSCE60160.2023.00342. Online publication date: 24-Jul-2023
  • (2020) Toward a Microarchitecture for Efficient Execution of Irregular Applications. ACM Transactions on Parallel Computing 7:4, 1-24. DOI: 10.1145/3418082. Online publication date: 27-Sep-2020
  • (2018) GoblinCore-64: A RISC-V Based Architecture for Data Intensive Computing. 2018 IEEE High Performance extreme Computing Conference (HPEC), 1-8. DOI: 10.1109/HPEC.2018.8547560. Online publication date: Sep-2018

Published In

IA3'17: Proceedings of the Seventh Workshop on Irregular Applications: Architectures and Algorithms
November 2017
78 pages
ISBN: 9781450351362
DOI: 10.1145/3149704
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Data intensive computing
  2. context switching
  3. irregular algorithms
  4. thread concurrency

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SC '17

Acceptance Rates

IA3'17 Paper Acceptance Rate 6 of 22 submissions, 27%;
Overall Acceptance Rate 18 of 67 submissions, 27%

Article Metrics

  • Downloads (Last 12 months): 7
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 30 Nov 2024