research-article

Open access

Helium: lifting high-performance stencil kernels from stripped x86 binaries to halide DSL code

Authors:

Charith Mendis,

Jeffrey Bosboom,

Jonathan Ragan-Kelley,

Saman AmarasingheAuthors Info & Claims

ACM SIGPLAN Notices, Volume 50, Issue 6

Pages 391 - 402

https://doi.org/10.1145/2813885.2737974

Published: 03 June 2015 Publication History

Abstract

Highly optimized programs are prone to bit rot, where performance quickly becomes suboptimal in the face of new hardware and compiler techniques. In this paper we show how to automatically lift performance-critical stencil kernels from a stripped x86 binary and generate the corresponding code in the high-level domain-specific language Halide. Using Halide’s state-of-the-art optimizations targeting current hardware, we show that new optimized versions of these kernels can replace the originals to rejuvenate the application for newer hardware. The original optimized code for kernels in stripped binaries is nearly impossible to analyze statically. Instead, we rely on dynamic traces to regenerate the kernels. We perform buffer structure reconstruction to identify input, intermediate and output buffer shapes. We abstract from a forest of concrete dependency trees which contain absolute memory addresses to symbolic trees suitable for high-level code generation. This is done by canonicalizing trees, clustering them based on structure, inferring higher-dimensional buffer accesses and finally by solving a set of linear equations based on buffer accesses to lift them up to simple, high-level expressions. Helium can handle highly optimized, complex stencil kernels with input-dependent conditionals. We lift seven kernels from Adobe Photoshop giving a 75% performance improvement, four kernels from IrfanView, leading to 4.97× performance, and one stencil from the miniGMG multigrid benchmark netting a 4.25× improvement in performance. We manually rejuvenated Photoshop by replacing eleven of Photoshop’s filters with our lifted implementations, giving 1.12× speedup without affecting the user experience.

References

[1]

Idapro, hexrays. URL http://www.hex-rays.com/idapro/.

[2]

Mcsema: Static translation of x86 into llvm. 2014.

[3]

K. Anand, M. Smithson, K. Elwazeer, A. Kotha, J. Gruen, N. Giles, and R. Barua. A compiler-level intermediate representation based binary analysis and rewriting system. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys ’13, pages 295– 308, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1994-2. URL http://doi.acm.org/10.1145/2465351.2465380.

Digital Library

[4]

J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U.-M. O’Reilly, and S. Amarasinghe. Opentuner: An extensible framework for program autotuning. In International Conference on Parallel Architectures and Compilation Techniques, Edmonton, Canada, August 2014.

Digital Library

[5]

V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A transparent dynamic optimization system. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, PLDI ’00, pages 1–12, New York, NY, USA, 2000. ACM. ISBN 1-58113-199-2. URL http://doi.acm.org/10.1145/349299.

Digital Library

[6]

349303.

[7]

G. Balakrishnan and T. Reps. Analyzing memory accesses in x86 executables. In E. Duesterwald, editor, Compiler Construction, volume 2985 of Lecture Notes in Computer Science, pages 5–23. Springer Berlin Heidelberg, 2004. ISBN 978-3-540-21297-3.

[8]

F. Bellard. QEMU, a fast and portable dynamic translator. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC ’05, pages 41–41, Berkeley, CA, USA, 2005.

Digital Library

[9]

USENIX Association. URL www.qemu.org.

[10]

D. Bruening, Q. Zhao, and S. Amarasinghe. Transparent dynamic instrumentation. In Proceedings of the 8th ACM SIGPLAN/SIGOPS Conference on Virtual Execution Environments, VEE ’12, pages 133– 144, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1176-2. URL http://doi.acm.org/10.1145/2151024.2151043.

Digital Library

[11]

D. Brumley, I. Jager, T. Avgerinos, and E. J. Schwartz. BAP: A binary analysis platform. In Proceedings of the 23rd International Conference on Computer Aided Verification, CAV’11, pages 463–469, Berlin, Heidelberg, 2011. Springer-Verlag. ISBN 978-3-642-22109-5. URL http://dl.acm.org/citation.cfm?id=2032305.2032342.

Digital Library

[12]

S. Campanoni, T. Jones, G. Holloway, V. J. Reddi, G.-Y. Wei, and D. Brooks. HELIX: Automatic parallelization of irregular programs for chip multiprocessing. In Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO ’12, pages 84–93, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1206-6. URL http://doi.acm.org/10.1145/2259016.2259028.

Digital Library

[13]

V. Chipounov and G. Candea. Reverse engineering of binary device drivers with RevNIC. In Proceedings of the 5th European Conference on Computer Systems, EuroSys ’10, pages 167–180, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-577-2. URL http://doi. acm.org/10.1145/1755913.1755932.

Digital Library

[14]

V. Chipounov, V. Kuznetsov, and G. Candea. S2E: A platform for in-vivo multi-path analysis of software systems. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 265–278, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0266-1. URL http://doi.acm.org/10.1145/1950365.1950396.

Digital Library

[15]

K. ElWazeer, K. Anand, A. Kotha, M. Smithson, and R. Barua. Scalable variable and data type detection in a binary rewriter. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, pages 51–60, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2014-6. URL http: //doi.acm.org/10.1145/2491956.2462165.

Digital Library

[16]

A. Fokin, E. Derevenetc, A. Chernov, and K. Troshina. Smartdec: Approaching c++ decompilation. In Proceedings of the 2011 18th Working Conference on Reverse Engineering, WCRE ’11, pages 347– 356, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-0-7695-4582-0. URL http://dx.doi.org/10.1109/WCRE. 2011.49.

Digital Library

[17]

B. Guo, M. Bridges, S. Triantafyllis, G. Ottoni, E. Raman, and D. August. Practical and accurate low-level pointer analysis. In Code Generation and Optimization, 2005. CGO 2005. International Symposium on, pages 291–302, March 2005.

Digital Library

[18]

R. N. Horspool and N. Marovac. An approach to the problem of detranslation of computer programs. The Computer Journal, 23(3): 223–229, 1980.

[19]

S. Kamil, C. Chan, L. Oliker, J. Shalf, and S. Williams. An auto-tuning framework for parallel multicore stencil computations. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–12, April 2010.

[20]

A. Kotha, K. Anand, M. Smithson, G. Yellareddy, and R. Barua. Automatic parallelization in a binary rewriter. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’43, pages 547–557, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4299-7. URL http://dx.doi.org/10.1109/MICRO.2010.27.

Digital Library

[21]

J. Li, C. Wu, and W.-C. Hsu. Dynamic register promotion of stack variables. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’11, pages 21–31, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-1-61284-356-8. URL http://dl.acm.org/citation.cfm? id=2190025.2190050.

Digital Library

[22]

N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’07, pages 89–100, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-633-2. URL http://doi.acm.org/10.1145/ 1250734.1250746.

Digital Library

[23]

S. Paris. Adobe systems. personal communication, 2014.

[24]

J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, 2013.

Digital Library

[25]

. URL http://doi.acm.org/10.1145/2491956.2462176.

[26]

M. Research. Phoenix compiler and shared source common language infrastructure. URL http://www.research.microsoft. com/phoenix.

[27]

D. Song, D. Brumley, H. Yin, J. Caballero, I. Jager, M. G. Kang, Z. Liang, J. Newsome, P. Poosankam, and P. Saxena. BitBlaze: A new approach to computer security via binary analysis. In Proceedings of the 4th International Conference on Information Systems Security. Keynote invited paper., Hyderabad, India, Dec. 2008.

Digital Library

[28]

K. Stock, M. Kong, T. Grosser, L.-N. Pouchet, F. Rastello, J. Ramanujam, and P. Sadayappan. A framework for enhancing data reuse via associative reordering. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pages 65–76, New York, NY, USA, 2014. ACM. ISBN 978- 1-4503-2784-8. URL http://doi.acm.org/10.1145/2594291.

Digital Library

[29]

2594342.

[30]

Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The pochoir stencil compiler. In Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’11, pages 117–128, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0743-7. URL http://doi.acm. org/10.1145/1989493.1989508.

Digital Library

[31]

W. Thies, V. Chandrasekhar, and S. Amarasinghe. A practical approach to exploiting coarse-grained pipeline parallelism in c programs. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40, pages 356–369, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-3047-8. URL http://dx.doi.org/10.1109/MICRO.2007.7.

Digital Library

[32]

H. Vandierendonck, S. Rul, and K. De Bosschere. The paralax infrastructure: Automatic parallelization with a helping hand. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10, pages 389–400, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0178-7. URL http://doi.acm.org/10.1145/1854273.1854322.

Digital Library

[33]

Z. Wang, G. Tournavitis, B. Franke, and M. F. P. O’boyle. Integrating profile-driven parallelism detection and machine-learning-based mapping. ACM Trans. Archit. Code Optim., 11(1):2:1–2:26, Feb. 2014. ISSN 1544-3566. URL http://doi.acm.org/10.1145/ 2579561.

Digital Library

[34]

R. Wilhelm, M. Sagiv, and T. Reps. Shape analysis. In D. Watt, editor, Compiler Construction, volume 1781 of Lecture Notes in Computer Science, pages 1–17. Springer Berlin Heidelberg, 2000. ISBN 978-3-540-67263-0. URL http://dx.doi.org/10.1007/ 3-540-46423-9\_1.

Digital Library

[35]

S. Williams, D. D. Kalamkar, A. Singh, A. M. Deshpande, B. Van Straalen, M. Smelyanskiy, A. Almgren, P. Dubey, J. Shalf, and L. Oliker. Optimization of geometric multigrid for emerging multi-and manycore processors. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 96. IEEE Computer Society Press, 2012.

Digital Library

[36]

Y. Wu. Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, PLDI ’02, pages 210–221, New York, NY, USA, 2002. ACM. ISBN 1-58113-463-0. URL http: //doi.acm.org/10.1145/512529.512555.

Digital Library

[37]

Q. Zhao, R. Rabbah, S. Amarasinghe, L. Rudolph, and W.-F. Wong. Ubiquitous memory introspection. In International Symposium on Code Generation and Optimization, San Jose, CA, Mar 2007. URL http://groups.csail.mit.edu/commit/papers/ 07/zhao-cgo07-umi.pdf.

Digital Library

Cited By

Alrabaee SDebbabi MShirani PWang LYoussef ARahimian ANouh LMouheb DHuang HHanna AAlrabaee SDebbabi MShirani PWang LYoussef ARahimian ANouh LMouheb DHuang HHanna A(2020)Binary Analysis OverviewBinary Code Fingerprinting for Cybersecurity10.1007/978-3-030-34238-8_2(7-44)Online publication date: 1-Mar-2020
https://doi.org/10.1007/978-3-030-34238-8_2
Kanetaka YTakagi HMaeda YFukushima N(2024)SlidingConv: Domain-Specific Description of Sliding Discrete Cosine Transform Convolution for HalideIEEE Access10.1109/ACCESS.2023.334566012(7563-7583)Online publication date: 2024
https://doi.org/10.1109/ACCESS.2023.3345660
Liu ZYuan YWang SXie XMa LCalandrino JTroncoso C(2023)Decompiling x86 deep neural network executablesProceedings of the 32nd USENIX Conference on Security Symposium10.5555/3620237.3620649(7357-7374)Online publication date: 9-Aug-2023
https://dl.acm.org/doi/10.5555/3620237.3620649
Show More Cited By

Index Terms

Helium: lifting high-performance stencil kernels from stripped x86 binaries to halide DSL code
1. Social and professional topics
  1. Professional topics
    1. Management of computing and information systems
      1. Software management
        Software maintenance
2. Software and its engineering
  1. Software creation and management
    1. Software post-development issues

Recommendations

Helium: lifting high-performance stencil kernels from stripped x86 binaries to halide DSL code
PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation

Highly optimized programs are prone to bit rot, where performance quickly becomes suboptimal in the face of new hardware and compiler techniques. In this paper we show how to automatically lift performance-critical stencil kernels from a stripped x86 ...
Automatic code generation and tuning for stencil kernels on modern shared memory architectures

In this paper, we present Patus, a code generation and auto-tuning framework for stencil computations targeted at multi- and manycore processors, such as multicore CPUs and graphics processing units. Patus, which stands for " P arallel A uto tu ned S ...
A Synergetic Approach to Throughput Computing on x86-Based Multicore Desktops

In the era of multicores, many applications that require substantial computing power and data crunching can now run on desktop PCs. However, to achieve the best possible performance, developers must write applications in a way that exploits both ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM SIGPLAN Notices

ACM SIGPLAN Notices Volume 50, Issue 6

PLDI '15

June 2015

630 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/2813885

Editor:
Andy Gill
University of Kansas, Lawrence, KS

Issue’s Table of Contents

PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2015
630 pages
ISBN:9781450334686
DOI:10.1145/2737924
General Chair:
David Grove
IBM Research, USA
,
Program Chair:
Steve Blackburn
Australian National University, Australia

Copyright © 2015 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 June 2015

Published in SIGPLAN Volume 50, Issue 6

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

U.S. Department of Energy

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

24
Total Citations
View Citations
1,434
Total Downloads

Downloads (Last 12 months)230
Downloads (Last 6 weeks)33

Reflects downloads up to 16 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Alrabaee SDebbabi MShirani PWang LYoussef ARahimian ANouh LMouheb DHuang HHanna AAlrabaee SDebbabi MShirani PWang LYoussef ARahimian ANouh LMouheb DHuang HHanna A(2020)Binary Analysis OverviewBinary Code Fingerprinting for Cybersecurity10.1007/978-3-030-34238-8_2(7-44)Online publication date: 1-Mar-2020
https://doi.org/10.1007/978-3-030-34238-8_2
Kanetaka YTakagi HMaeda YFukushima N(2024)SlidingConv: Domain-Specific Description of Sliding Discrete Cosine Transform Convolution for HalideIEEE Access10.1109/ACCESS.2023.334566012(7563-7583)Online publication date: 2024
https://doi.org/10.1109/ACCESS.2023.3345660
Liu ZYuan YWang SXie XMa LCalandrino JTroncoso C(2023)Decompiling x86 deep neural network executablesProceedings of the 32nd USENIX Conference on Security Symposium10.5555/3620237.3620649(7357-7374)Online publication date: 9-Aug-2023
https://dl.acm.org/doi/10.5555/3620237.3620649
Martínez PWoodruff JArmengol-Estapé JBernabé GGarcía JO’Boyle MVerbrugge CLhoták OShen X(2023)Matching Linear Algebra and Tensor Code to Specialized Hardware AcceleratorsProceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction10.1145/3578360.3580262(85-97)Online publication date: 17-Feb-2023
https://dl.acm.org/doi/10.1145/3578360.3580262
Brauckmann APolgreen EGrosser TO'Boyle M(2023)mlirSynth: Automatic, Retargetable Program Raising in Multi-Level IR Using Program Synthesis2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT)10.1109/PACT58117.2023.00012(39-50)Online publication date: 21-Oct-2023
https://doi.org/10.1109/PACT58117.2023.00012
Woodruff JArmengol-Estapé JAinsworth SO'Boyle MJhala RDillig I(2022)Bind the gap: compiling real software to hardware FFT acceleratorsProceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation10.1145/3519939.3523439(687-702)Online publication date: 9-Jun-2022
https://dl.acm.org/doi/10.1145/3519939.3523439
Gerard BGrosser TKong MEgger BSmith A(2022)QRANE: lifting QASM programs to an affine IRProceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction10.1145/3497776.3517775(15-28)Online publication date: 19-Mar-2022
https://dl.acm.org/doi/10.1145/3497776.3517775
Aikawa HEndo TYuki THirofuchi TIkegami T(2022)Efficient Stencil Computation with Temporal Blocking by Halide DSL2022 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)10.1109/ISPA-BDCloud-SocialCom-SustainCom57177.2022.00116(870-877)Online publication date: Dec-2022
https://doi.org/10.1109/ISPA-BDCloud-SocialCom-SustainCom57177.2022.00116
Johnson MPlayne D(2022)A fast and concise parallel implementation of the 8x8 2D forward and inverse DCTs using halideJournal of Parallel and Distributed Computing10.1016/j.jpdc.2022.01.014163(20-29)Online publication date: May-2022
https://doi.org/10.1016/j.jpdc.2022.01.014
Collie BO'Boyle M(2021)Program Lifting using Gray-Box Behavior2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT)10.1109/PACT52795.2021.00012(60-74)Online publication date: Sep-2021
https://doi.org/10.1109/PACT52795.2021.00012
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Issue’s Table of Contents