Research Article | Open Access
DOI: 10.1145/3470496.3527411

SIMD2: a generalized matrix instruction set for accelerating tensor computation beyond GEMM

Published: 11 June 2022

Abstract

Matrix-multiplication units (MXUs) are now prevalent in every computing platform. The key attribute that makes MXUs so successful is the semiring structure, which allows tiling for both parallelism and data reuse. Nonetheless, matrix multiplication is not the only algorithm with such attributes. We find that many algorithms share the same structure and differ only in the core operation; for example, using add-minimum instead of multiply-add. Algorithms with a semiring-like structure therefore have the potential to be accelerated by a general-purpose matrix-operation architecture, rather than by conventional MXUs.
In this paper, we propose SIMD2, a new programming paradigm that supports generalized matrix operations with a semiring-like structure. SIMD2 instructions accelerate eight more types of matrix operations in addition to matrix multiplication. Because SIMD2 instructions resemble matrix-multiplication instructions, we are able to build the SIMD2 architecture on top of any MXU architecture with minimal modifications. We developed a framework that emulates and validates SIMD2 using NVIDIA GPUs with Tensor Cores. Across 8 applications, SIMD2 provides up to 38.59× speedup, and more than 6.94× on average, over optimized CUDA programs, with only 5% full-chip area overhead.
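
To make the semiring-like structure concrete, the sketch below swaps GEMM's multiply-add inner loop for the add-minimum operation mentioned above, producing a min-plus (tropical) matrix product of the kind used in all-pairs shortest-path algorithms. This is a plain CUDA illustration of the computational pattern only, not the SIMD2 instruction set or its MXU mapping; the kernel name and problem sizes are our own.

    // A minimal sketch (ours, not the SIMD2 ISA): a min-plus "matrix product"
    // over the tropical semiring. The loop structure is identical to GEMM;
    // only the core operation changes from multiply-add to add-minimum.
    #include <cstdio>
    #include <cfloat>
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void minplus_gemm(const float* A, const float* B, float* C, int n) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float best = FLT_MAX;  // the semiring "zero": identity of min
            for (int k = 0; k < n; ++k) {
                // "multiply" is +, "add" is min; GEMM would use * and + here.
                best = fminf(best, A[row * n + k] + B[k * n + col]);
            }
            C[row * n + col] = best;
        }
    }

    int main() {
        const int n = 256;
        std::vector<float> hA(n * n, 1.0f), hB(n * n, 1.0f), hC(n * n);
        float *dA, *dB, *dC;
        size_t bytes = n * n * sizeof(float);
        cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
        dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
        minplus_gemm<<<grid, block>>>(dA, dB, dC, n);
        cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
        printf("C[0][0] = %f\n", hC[0]);  // expect 2.0 for all-ones inputs
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }

Because the tiling and reuse pattern is unchanged, the same blocking strategies that make GEMM fast apply directly; for example, squaring a graph's adjacency matrix ⌈log₂ n⌉ times under this product yields all-pairs shortest paths, the classic instance of a semiring-like matrix algorithm.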

Cited By

  • LLM-Aided Compilation for Tensor Accelerators. 2024 IEEE LLM Aided Design Workshop (LAD), pages 1-14. DOI: 10.1109/LAD62341.2024.10691748. Online publication date: 28-Jun-2024.
  • LLM-Aided Compilation for Tensor Accelerators. 2024 IEEE LLM Aided Design Workshop (LAD), pages 1-16. DOI: 10.1109/LAD62341.2024.10691720. Online publication date: 28-Jun-2024.

Published In

ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture
June 2022
1097 pages
ISBN: 9781450386104
DOI: 10.1145/3470496

This work is licensed under a Creative Commons Attribution 4.0 International License.

In-Cooperation

  • IEEE CS TCCA: IEEE CS Technical Committee on Computer Architecture

Publisher

Association for Computing Machinery
New York, NY, United States

Funding Sources

  • National Science Foundation (NSF)

Conference

ISCA '22

Acceptance Rates

ISCA '22 paper acceptance rate: 67 of 400 submissions, 17%.
Overall acceptance rate: 543 of 3,203 submissions, 17%.
