SIMT-X: Extending Single-Instruction Multi-Threading to Out-of-Order Cores

Authors:

Caroline Collange,

André Seznec

ACM Transactions on Architecture and Code Optimization (TACO), Volume 17, Issue 2

Article No.: 15, Pages 1 - 23

https://doi.org/10.1145/3392032

Published: 29 May 2020

Abstract

This work introduces Single Instruction Multi-Thread Express (SIMT-X), a general-purpose Central Processing Unit (CPU) microarchitecture that enables Graphics Processing Unit (GPU)-style SIMT execution across multiple threads of the same program for high throughput, while retaining the latency benefits of out-of-order execution and the programming convenience of homogeneous multi-thread processors. SIMT-X leverages the existing Single Instruction Multiple Data (SIMD) back-end to provide CPU/GPU-like processing on a single core with minimal overhead. We demonstrate that although SIMT-X uses a restricted form of Out-of-Order (OoO) execution, the microarchitecture captures a majority of the benefits of aggressive OoO execution using at most two concurrent register mappings per architectural register, while addressing partial dependencies and supporting a general-purpose Instruction Set Architecture (ISA).
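
To make the general idea of SIMT execution concrete, the following is a minimal Python sketch of how several threads of the same program can share one instruction stream, with a per-thread active mask handling a divergent branch. It illustrates the generic SIMT concept only and is not the SIMT-X microarchitecture; the function name `simt_select` and its parameters are hypothetical and do not come from the article.

```python
from typing import Callable, List, Tuple


def simt_select(
    xs: List[int],
    cond: Callable[[int], bool],
    then_fn: Callable[[int], int],
    else_fn: Callable[[int], int],
) -> Tuple[List[int], int]:
    """Execute `if cond(x): then_fn(x) else: else_fn(x)` for all threads in lockstep.

    Each list element stands for one thread's register value (one SIMD lane).
    Returns the per-thread results and the number of vector instructions issued.
    This is an illustrative model, not the mechanism described in the article.
    """
    # Per-thread branch outcome: True means the thread takes the "then" path.
    mask = [cond(x) for x in xs]
    out = list(xs)
    issued = 0

    # "Then" path: issued once for the whole group if any thread needs it.
    # Only lanes whose mask bit is set write their result.
    if any(mask):
        issued += 1
        out = [then_fn(x) if m else o for x, m, o in zip(xs, mask, out)]

    # "Else" path: issued once if at least one thread did not take the branch.
    # Only lanes whose mask bit is clear write their result.
    if not all(mask):
        issued += 1
        out = [else_fn(x) if not m else o for x, m, o in zip(xs, mask, out)]

    return out, issued
```

For example, `simt_select([1, 2, 3, 4], lambda x: x % 2 == 0, lambda x: x * 10, lambda x: -x)` returns `([-1, 20, -3, 40], 2)`: the four threads diverge, so both paths are issued once each under complementary masks. When all threads agree on the branch, only one path is issued.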

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
    2. Serial architectures
      1. Superscalar architectures

Published In

ACM Transactions on Architecture and Code Optimization Volume 17, Issue 2

June 2020

169 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3403597

Editor:
David Kaeli
Northeastern University, USA

Copyright © 2020 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 May 2020

Online AM: 07 May 2020

Accepted: 01 March 2020

Revised: 01 March 2020

Received: 01 September 2019

Qualifiers

Research-article
Research
Refereed

Funding Sources

Mitacs

Cited By

Kannan N, Wei K, Scott D, Ratnasegar N, Canpolat O, Mai H, Ferdman M (2024). Infrastructure for Exploring SIMT Architecture in General-Purpose Processors. 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 316-318. Online publication date: 5-May-2024.
https://doi.org/10.1109/ISPASS61541.2024.00043
YILAN F, ŞAHİN İ, KOÇ F, URTEKİN L (2023). The Effects of Different Process Parameters of PLA+ on Tensile Strengths in 3D Printer Produced by Fused Deposition Modeling. El-Cezeri Fen ve Mühendislik Dergisi. Online publication date: 21-Jan-2023.
https://doi.org/10.31202/ecjse.1179492
Mishra A, Yadav P, Kim S (2023). Artificial Intelligence Accelerators. Artificial Intelligence and Hardware Accelerators, 1-52. Online publication date: 16-Mar-2023.
https://doi.org/10.1007/978-3-031-22170-5_1
Thuerck D (2020). Supporting Irregularity in Throughput-Oriented Computing by SIMT-SIMD Integration. 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3), 31-35. Online publication date: Nov-2020.
https://doi.org/10.1109/IA351965.2020.00010
