Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Open access

Helium: lifting high-performance stencil kernels from stripped x86 binaries to halide DSL code

Published: 03 June 2015 Publication History

Abstract

Highly optimized programs are prone to bit rot, where performance quickly becomes suboptimal in the face of new hardware and compiler techniques. In this paper we show how to automatically lift performance-critical stencil kernels from a stripped x86 binary and generate the corresponding code in the high-level domain-specific language Halide. Using Halide’s state-of-the-art optimizations targeting current hardware, we show that new optimized versions of these kernels can replace the originals to rejuvenate the application for newer hardware. The original optimized code for kernels in stripped binaries is nearly impossible to analyze statically. Instead, we rely on dynamic traces to regenerate the kernels. We perform buffer structure reconstruction to identify input, intermediate and output buffer shapes. We abstract from a forest of concrete dependency trees which contain absolute memory addresses to symbolic trees suitable for high-level code generation. This is done by canonicalizing trees, clustering them based on structure, inferring higher-dimensional buffer accesses and finally by solving a set of linear equations based on buffer accesses to lift them up to simple, high-level expressions. Helium can handle highly optimized, complex stencil kernels with input-dependent conditionals. We lift seven kernels from Adobe Photoshop giving a 75% performance improvement, four kernels from IrfanView, leading to 4.97× performance, and one stencil from the miniGMG multigrid benchmark netting a 4.25× improvement in performance. We manually rejuvenated Photoshop by replacing eleven of Photoshop’s filters with our lifted implementations, giving 1.12× speedup without affecting the user experience.

References

[1]
Idapro, hexrays. URL http://www.hex-rays.com/idapro/.
[2]
Mcsema: Static translation of x86 into llvm. 2014.
[3]
K. Anand, M. Smithson, K. Elwazeer, A. Kotha, J. Gruen, N. Giles, and R. Barua. A compiler-level intermediate representation based binary analysis and rewriting system. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys ’13, pages 295– 308, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1994-2. URL http://doi.acm.org/10.1145/2465351.2465380.
[4]
J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U.-M. O’Reilly, and S. Amarasinghe. Opentuner: An extensible framework for program autotuning. In International Conference on Parallel Architectures and Compilation Techniques, Edmonton, Canada, August 2014.
[5]
V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A transparent dynamic optimization system. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, PLDI ’00, pages 1–12, New York, NY, USA, 2000. ACM. ISBN 1-58113-199-2. URL http://doi.acm.org/10.1145/349299.
[6]
[7]
G. Balakrishnan and T. Reps. Analyzing memory accesses in x86 executables. In E. Duesterwald, editor, Compiler Construction, volume 2985 of Lecture Notes in Computer Science, pages 5–23. Springer Berlin Heidelberg, 2004. ISBN 978-3-540-21297-3.
[8]
F. Bellard. QEMU, a fast and portable dynamic translator. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC ’05, pages 41–41, Berkeley, CA, USA, 2005.
[9]
USENIX Association. URL www.qemu.org.
[10]
D. Bruening, Q. Zhao, and S. Amarasinghe. Transparent dynamic instrumentation. In Proceedings of the 8th ACM SIGPLAN/SIGOPS Conference on Virtual Execution Environments, VEE ’12, pages 133– 144, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1176-2. URL http://doi.acm.org/10.1145/2151024.2151043.
[11]
D. Brumley, I. Jager, T. Avgerinos, and E. J. Schwartz. BAP: A binary analysis platform. In Proceedings of the 23rd International Conference on Computer Aided Verification, CAV’11, pages 463–469, Berlin, Heidelberg, 2011. Springer-Verlag. ISBN 978-3-642-22109-5. URL http://dl.acm.org/citation.cfm?id=2032305.2032342.
[12]
S. Campanoni, T. Jones, G. Holloway, V. J. Reddi, G.-Y. Wei, and D. Brooks. HELIX: Automatic parallelization of irregular programs for chip multiprocessing. In Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO ’12, pages 84–93, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1206-6. URL http://doi.acm.org/10.1145/2259016.2259028.
[13]
V. Chipounov and G. Candea. Reverse engineering of binary device drivers with RevNIC. In Proceedings of the 5th European Conference on Computer Systems, EuroSys ’10, pages 167–180, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-577-2. URL http://doi. acm.org/10.1145/1755913.1755932.
[14]
V. Chipounov, V. Kuznetsov, and G. Candea. S2E: A platform for in-vivo multi-path analysis of software systems. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 265–278, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0266-1. URL http://doi.acm.org/10.1145/1950365.1950396.
[15]
K. ElWazeer, K. Anand, A. Kotha, M. Smithson, and R. Barua. Scalable variable and data type detection in a binary rewriter. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, pages 51–60, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2014-6. URL http: //doi.acm.org/10.1145/2491956.2462165.
[16]
A. Fokin, E. Derevenetc, A. Chernov, and K. Troshina. Smartdec: Approaching c++ decompilation. In Proceedings of the 2011 18th Working Conference on Reverse Engineering, WCRE ’11, pages 347– 356, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-0-7695-4582-0. URL http://dx.doi.org/10.1109/WCRE. 2011.49.
[17]
B. Guo, M. Bridges, S. Triantafyllis, G. Ottoni, E. Raman, and D. August. Practical and accurate low-level pointer analysis. In Code Generation and Optimization, 2005. CGO 2005. International Symposium on, pages 291–302, March 2005.
[18]
R. N. Horspool and N. Marovac. An approach to the problem of detranslation of computer programs. The Computer Journal, 23(3): 223–229, 1980.
[19]
S. Kamil, C. Chan, L. Oliker, J. Shalf, and S. Williams. An auto-tuning framework for parallel multicore stencil computations. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–12, April 2010.
[20]
A. Kotha, K. Anand, M. Smithson, G. Yellareddy, and R. Barua. Automatic parallelization in a binary rewriter. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’43, pages 547–557, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4299-7. URL http://dx.doi.org/10.1109/MICRO.2010.27.
[21]
J. Li, C. Wu, and W.-C. Hsu. Dynamic register promotion of stack variables. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’11, pages 21–31, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-1-61284-356-8. URL http://dl.acm.org/citation.cfm? id=2190025.2190050.
[22]
N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’07, pages 89–100, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-633-2. URL http://doi.acm.org/10.1145/ 1250734.1250746.
[23]
S. Paris. Adobe systems. personal communication, 2014.
[24]
J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, 2013.
[25]
. URL http://doi.acm.org/10.1145/2491956.2462176.
[26]
M. Research. Phoenix compiler and shared source common language infrastructure. URL http://www.research.microsoft. com/phoenix.
[27]
D. Song, D. Brumley, H. Yin, J. Caballero, I. Jager, M. G. Kang, Z. Liang, J. Newsome, P. Poosankam, and P. Saxena. BitBlaze: A new approach to computer security via binary analysis. In Proceedings of the 4th International Conference on Information Systems Security. Keynote invited paper., Hyderabad, India, Dec. 2008.
[28]
K. Stock, M. Kong, T. Grosser, L.-N. Pouchet, F. Rastello, J. Ramanujam, and P. Sadayappan. A framework for enhancing data reuse via associative reordering. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pages 65–76, New York, NY, USA, 2014. ACM. ISBN 978- 1-4503-2784-8. URL http://doi.acm.org/10.1145/2594291.
[29]
[30]
Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The pochoir stencil compiler. In Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’11, pages 117–128, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0743-7. URL http://doi.acm. org/10.1145/1989493.1989508.
[31]
W. Thies, V. Chandrasekhar, and S. Amarasinghe. A practical approach to exploiting coarse-grained pipeline parallelism in c programs. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40, pages 356–369, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-3047-8. URL http://dx.doi.org/10.1109/MICRO.2007.7.
[32]
H. Vandierendonck, S. Rul, and K. De Bosschere. The paralax infrastructure: Automatic parallelization with a helping hand. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10, pages 389–400, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0178-7. URL http://doi.acm.org/10.1145/1854273.1854322.
[33]
Z. Wang, G. Tournavitis, B. Franke, and M. F. P. O’boyle. Integrating profile-driven parallelism detection and machine-learning-based mapping. ACM Trans. Archit. Code Optim., 11(1):2:1–2:26, Feb. 2014. ISSN 1544-3566. URL http://doi.acm.org/10.1145/ 2579561.
[34]
R. Wilhelm, M. Sagiv, and T. Reps. Shape analysis. In D. Watt, editor, Compiler Construction, volume 1781 of Lecture Notes in Computer Science, pages 1–17. Springer Berlin Heidelberg, 2000. ISBN 978-3-540-67263-0. URL http://dx.doi.org/10.1007/ 3-540-46423-9\_1.
[35]
S. Williams, D. D. Kalamkar, A. Singh, A. M. Deshpande, B. Van Straalen, M. Smelyanskiy, A. Almgren, P. Dubey, J. Shalf, and L. Oliker. Optimization of geometric multigrid for emerging multi-and manycore processors. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 96. IEEE Computer Society Press, 2012.
[36]
Y. Wu. Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, PLDI ’02, pages 210–221, New York, NY, USA, 2002. ACM. ISBN 1-58113-463-0. URL http: //doi.acm.org/10.1145/512529.512555.
[37]
Q. Zhao, R. Rabbah, S. Amarasinghe, L. Rudolph, and W.-F. Wong. Ubiquitous memory introspection. In International Symposium on Code Generation and Optimization, San Jose, CA, Mar 2007. URL http://groups.csail.mit.edu/commit/papers/ 07/zhao-cgo07-umi.pdf.

Cited By

View all
  • (2020)Binary Analysis OverviewBinary Code Fingerprinting for Cybersecurity10.1007/978-3-030-34238-8_2(7-44)Online publication date: 1-Mar-2020
  • (2024)SlidingConv: Domain-Specific Description of Sliding Discrete Cosine Transform Convolution for HalideIEEE Access10.1109/ACCESS.2023.334566012(7563-7583)Online publication date: 2024
  • (2023)Decompiling x86 deep neural network executablesProceedings of the 32nd USENIX Conference on Security Symposium10.5555/3620237.3620649(7357-7374)Online publication date: 9-Aug-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM SIGPLAN Notices
ACM SIGPLAN Notices  Volume 50, Issue 6
PLDI '15
June 2015
630 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2813885
  • Editor:
  • Andy Gill
Issue’s Table of Contents
  • cover image ACM Conferences
    PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation
    June 2015
    630 pages
    ISBN:9781450334686
    DOI:10.1145/2737924
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 June 2015
Published in SIGPLAN Volume 50, Issue 6

Check for updates

Author Tags

  1. Helium
  2. autotuning
  3. dynamic analysis
  4. image processing
  5. reverse engineering
  6. stencil computation
  7. x86 binary instrumentation

Qualifiers

  • Research-article

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)230
  • Downloads (Last 6 weeks)33
Reflects downloads up to 16 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2020)Binary Analysis OverviewBinary Code Fingerprinting for Cybersecurity10.1007/978-3-030-34238-8_2(7-44)Online publication date: 1-Mar-2020
  • (2024)SlidingConv: Domain-Specific Description of Sliding Discrete Cosine Transform Convolution for HalideIEEE Access10.1109/ACCESS.2023.334566012(7563-7583)Online publication date: 2024
  • (2023)Decompiling x86 deep neural network executablesProceedings of the 32nd USENIX Conference on Security Symposium10.5555/3620237.3620649(7357-7374)Online publication date: 9-Aug-2023
  • (2023)Matching Linear Algebra and Tensor Code to Specialized Hardware AcceleratorsProceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction10.1145/3578360.3580262(85-97)Online publication date: 17-Feb-2023
  • (2023)mlirSynth: Automatic, Retargetable Program Raising in Multi-Level IR Using Program Synthesis2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT)10.1109/PACT58117.2023.00012(39-50)Online publication date: 21-Oct-2023
  • (2022)Bind the gap: compiling real software to hardware FFT acceleratorsProceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation10.1145/3519939.3523439(687-702)Online publication date: 9-Jun-2022
  • (2022)QRANE: lifting QASM programs to an affine IRProceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction10.1145/3497776.3517775(15-28)Online publication date: 19-Mar-2022
  • (2022)Efficient Stencil Computation with Temporal Blocking by Halide DSL2022 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)10.1109/ISPA-BDCloud-SocialCom-SustainCom57177.2022.00116(870-877)Online publication date: Dec-2022
  • (2022)A fast and concise parallel implementation of the 8x8 2D forward and inverse DCTs using halideJournal of Parallel and Distributed Computing10.1016/j.jpdc.2022.01.014163(20-29)Online publication date: May-2022
  • (2021)Program Lifting using Gray-Box Behavior2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT)10.1109/PACT52795.2021.00012(60-74)Online publication date: Sep-2021
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media