Article

Decoding CUDA binary

Authors:

Eddy Z. Zhang

CGO 2019: Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization

Pages 229 - 241

Published: 16 February 2019

Abstract

NVIDIA's software does not offer translation of assembly code to binary for its GPUs, since the specifications are closed-source. This work fills that gap. We develop a systematic method of decoding the Instruction Set Architectures (ISAs) of NVIDIA's GPUs and generating assemblers for different generations of GPUs. Our framework enables cross-architecture binary analysis and transformation. Making the ISA accessible in this manner opens up a world of opportunities for developers and researchers, enabling numerous optimizations and explorations that are unachievable at the source-code level. Our infrastructure has already been adopted in, and has benefited, important applications including performance tuning, binary instrumentation, resource allocation, and memory protection.
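To make the idea of decoding an undocumented ISA more concrete, here is a minimal sketch of one generic approach: given pairs of a disassembled mnemonic and its raw encoding, find the bits that never change for a given mnemonic and treat them as candidate opcode bits. This is an illustration only and not necessarily the method used in the paper. The function name `infer_opcode_masks`, the 16-bit word width, and the sample encodings are all hypothetical and do not correspond to any real NVIDIA instruction format.

```python
from collections import defaultdict


def infer_opcode_masks(samples, width=16):
    """Guess which bits encode the opcode for each mnemonic.

    samples: iterable of (mnemonic, encoding) pairs; encoding is an int.
    Returns {mnemonic: (mask, value)}, where mask has a 1 at every bit
    position that stayed constant across all samples of that mnemonic,
    and value holds the constant bits themselves.
    """
    full = (1 << width) - 1
    first = {}                 # first encoding seen for each mnemonic
    varying = defaultdict(int)  # bits observed to differ between samples
    for mnemonic, word in samples:
        if mnemonic not in first:
            first[mnemonic] = word
        else:
            varying[mnemonic] |= first[mnemonic] ^ word
    return {
        m: (full & ~varying[m], first[m] & ~varying[m] & full)
        for m in first
    }


# Hypothetical toy encodings (assumption): top 4 bits are the opcode,
# the low 12 bits are operands that differ between samples.
samples = [
    ("ADD", 0x1000),
    ("ADD", 0x1FFF),
    ("MOV", 0x2ABC),
    ("MOV", 0x2543),
]
masks = infer_opcode_masks(samples)
for name, (mask, value) in sorted(masks.items()):
    print(f"{name}: mask={mask:#06x} value={value:#06x}")
```

With too few samples, operand bits that happen to coincide would be misread as opcode bits, so a sketch like this only becomes reliable as the number and variety of observed instructions grows.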



Information & Contributors

Information

Published In


CGO 2019: Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization

February 2019

286 pages

ISBN: 9781728114361

General Chair: Mahmut Taylan Kandemir (Penn State University, USA)

Program Chairs: Alexandra Jimborean (Uppsala University, Sweden), Tipp Moseley (Google, USA)

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS: Computer Society

Publisher

IEEE Press

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging


Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%

