SIMT-X: Extending Single-Instruction Multi-Threading to Out-of-Order Cores

Published: 29 May 2020

Abstract

This work introduces Single Instruction Multi-Thread Express (SIMT-X), a general-purpose Central Processing Unit (CPU) microarchitecture that enables Graphics Processing Unit (GPU)-style SIMT execution across multiple threads of the same program for high throughput, while retaining the latency benefits of out-of-order execution and the programming convenience of homogeneous multi-thread processors. SIMT-X leverages the existing Single Instruction Multiple Data (SIMD) back-end to provide CPU/GPU-like processing on a single core with minimal overhead. We demonstrate that although SIMT-X employs a restricted form of Out-of-Order (OoO) execution, the microarchitecture successfully captures a majority of the benefits of aggressive OoO execution using at most two concurrent register mappings per architectural register, while addressing issues of partial dependencies and supporting a general-purpose Instruction Set Architecture (ISA).
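
As a rough illustration of the SIMT execution model the abstract refers to, the sketch below runs one tiny program for several threads in lockstep, using a per-lane mask when threads diverge at a branch. It is a toy model of generic masked SIMT execution, not the SIMT-X microarchitecture: the function names `simt_execute` and `scalar_execute`, the example program, and the way instruction issues are counted are all assumptions made for this example.

```python
from typing import List, Tuple


def simt_execute(xs: List[int]) -> Tuple[List[int], int]:
    """Run `y = x * 2 if x is even else x + 1` for all threads in lockstep.

    Hypothetical toy model: each list element is one thread (lane), and each
    vector-wide operation counts as a single issued instruction.
    """
    issued = 0
    ys = [0] * len(xs)

    # One instruction evaluates the branch condition for every lane at once.
    taken = [x % 2 == 0 for x in xs]
    issued += 1

    # Taken path: issued once, but only lanes whose mask bit is set update ys.
    if any(taken):
        ys = [x * 2 if m else y for x, m, y in zip(xs, taken, ys)]
        issued += 1

    # Not-taken path: issued once under the complementary mask.
    not_taken = [not m for m in taken]
    if any(not_taken):
        ys = [x + 1 if m else y for x, m, y in zip(xs, not_taken, ys)]
        issued += 1

    return ys, issued


def scalar_execute(xs: List[int]) -> Tuple[List[int], int]:
    """Baseline: every thread issues its own condition and its own branch body."""
    ys = []
    issued = 0
    for x in xs:
        issued += 1  # branch condition for this thread
        ys.append(x * 2 if x % 2 == 0 else x + 1)
        issued += 1  # the one path this thread actually follows
    return ys, issued
```

Under these assumptions, four threads with inputs `[0, 1, 2, 3]` diverge and need three issued instructions in the lockstep model versus eight in the per-thread baseline, while inputs that all follow the same path need only two. The model ignores everything the paper is actually about (register renaming, partial dependencies, out-of-order scheduling) and only shows why sharing instruction issue across threads can raise throughput.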

Information & Contributors

Information

Published In

ACM Transactions on Architecture and Code Optimization  Volume 17, Issue 2
June 2020
169 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3403597
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 May 2020
Online AM: 07 May 2020
Accepted: 01 March 2020
Revised: 01 March 2020
Received: 01 September 2019
Published in TACO Volume 17, Issue 2

Author Tags

  1. SIMT
  2. computer architecture
  3. hardware
  4. microarchitecture
  5. multi-threading
  6. out-of-order

Qualifiers

  • Research-article
  • Research
  • Refereed
