Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3173162.3173202acmconferencesArticle/Chapter ViewAbstractPublication PagesasplosConference Proceedingsconference-collections
research-article

Statistical Reconstruction of Class Hierarchies in Binaries

Published: 19 March 2018 Publication History

Abstract

We address a fundamental problem in reverse engineering of object-oriented code: the reconstruction of a program's class hierarchy from its stripped binary. Existing approaches rely heavily on structural information that is not always available, e.g., calls to parent constructors. As a result, these approaches often leave gaps in the hierarchies they construct, or fail to construct them altogether. Our main insight is that behavioral information can be used to infer subclass/superclass relations, supplementing any missing structural information. Thus, we propose the first statistical approach for static reconstruction of class hierarchies based on behavioral similarity. We capture the behavior of each type using a statistical language model (SLM), define a metric for pairwise similarity between types based on the Kullback-Leibler divergence between their SLMs, and lift it to determine the most likely class hierarchy. We implemented our approach in a tool called ROCK and used it to automatically reconstruct the class hierarchies of several real-world stripped C++ binaries. Our results demonstrate that ROCK obtained significantly more accurate class hierarchies than those obtained using structural analysis alone.

References

[1]
Mart'ın Abadi, Mihai Budiu, Úlfar Erlingsson, and Jay Ligatti. Control-flow integrity. In Proceedings of the Conference on Computer and Communications Security, 2005.
[2]
Wolfram Amme, Peter Braun, Franccois Thomasset, and Eberhard Zehendner. Data dependence analysis of assembly code. normalfont In International Journal Parallel Programming, 2000.
[3]
Gogul Balakrishnan and Thomas Reps. Divine: Discovering variables in executables. In Proceedings of the International Conference on Verification, Model Checking, and Abstract Interpretation, 2007.
[4]
Gogul Balakrishnan and Thomas Reps. WYSINWYX: What you see is not what you execute. normalfont In ACM Transactions on Programming Languages and Systems, 2010.
[5]
Thomas Ball, Ella Bounimova, Byron Cook, Vladimir Levin, Jakob Lichtenberg, Con McGarvey, Bohus Ondrusek, Sriram K. Rajamani, and Abdullah Ustuner. Thorough static analysis of device drivers. In Proceedings of the ACM SIGOPS/EuroSys European Conference on Computer Systems, 2006.
[6]
Tiffany Bao, Jonathan Burket, Maverick Woo, Rafael Turner, and David Brumley. Byteweight: Learning to recognize functions in binary code. In USENIX Security Symposium, 2014.
[7]
J. Bergeron, Mourad Debbabi, M. M. Erhioui, and Béchir Ktari. Static analysis of binary code to isolate malicious behaviors. In Proceedings of the Workshop on Enabling Technologies on Infrastructure for Collaborative Enterprises, 1999.
[8]
David Brumley, Ivan Jager, Thanassis Avgerinos, and Edward J. Schwartz. Bap: A binary analysis platform. In Proceedings of the International Conference on Computer Aided Verification, 2011.
[9]
Juan Caballero and Zhiqiang Lin. Type inference on executables. normalfont In ACM Computing Surveys, 2016.
[10]
John A Capra and Mona Singh. Predicting functionally important residues from sequence conservation. normalfont In Bioinformatics, 2007.
[11]
Stanley F Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the Annual Meeting on Association for Computational Linguistics, 1996.
[12]
John G Cleary and Ian H Witten. Data compression using adaptive coding and partial string matching. normalfont In IEEE Transactions on Communications, 1984.
[13]
Yaniv David and Eran Yahav. Tracelet-based code search in executables. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2014.
[14]
Saumya Debray, Robert Muth, and Matthew Weippert. Alias analysis of executable code. In Proceedings of the ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 1998.
[15]
Jack Edmonds. Optimum branchings. Journal of Research of the National Bureau of Standards B, 1967.
[16]
Ran El-Yaniv, Shai Fine, and Naftali Tishby. Agnostic classification of markovian sequences. In Proceedings of the Conference on Advances in Neural Information Processing Systems, 1998.
[17]
Alexander Fokin, Egor Derevenetc, Alexander Chernov, and Katerina Troshina. Smartdec: Approaching c
[18]
decompilation. In Proceedings of the Working Conference on Reverse Engineering, 2011.
[19]
Matthew Fredrikson, Mihai Christodorescu, and Somesh Jha. Dynamic behavior matching: A complexity analysis and new approximation algorithms. In Proceedings of the International Conference on Automated Deduction, 2011.
[20]
Denis Gopan, Evan Driscoll, Ducson Nguyen, Dimitri Naydich, Alexey Loginov, and David Melski. Data-delineation in software binaries and its application to buffer-overrun discovery. In Proceedings of the International Conference on Software Engineering, 2015.
[21]
S. Jha, K. Tan, and R.A. Maxion. Markov chains, classifiers, and intrusion detection. In Proceedings of the IEEE workshop on Computer Security Foundations, 2001.
[22]
Omer Katz, Ran El-Yaniv, and Eran Yahav. Estimating types in binaries using predictive modeling. In Proceedings of the Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2016.
[23]
Slava M Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. normalfont In IEEE Transactions on Acoustics, Speech, and Signal Processing, 1987.
[24]
S. Kullback and R. A. Leibler. On information and sufficiency. In The Annals of Mathematical Statistics, 1951.
[25]
Magnus Madsen, Benjamin Livshits, and Michael Fanning. Practical static analysis of javascript applications in the presence of frameworks and libraries. In Proceedings of the Joint Meeting on Foundations of Software Engineering, 2013.
[26]
Geoffrey Mazeroff, Jens Gregor, Michael Thomason, and Richard Ford. Probabilistic suffix models for API sequence analysis of windows XP applications. In Pattern Recognition, 2008.
[27]
Microsoft Corporation. Skype. https://www.skype.com/en/.
[28]
Microsoft Corporation. Visual studio. https://www.visualstudio.com.
[29]
Alon Mishne, Sharon Shoham, and Eran Yahav. Typestate-based semantic code search over partial programs. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, 2012.
[30]
Alistair Moffat. Implementing the ppm data compression scheme. In IEEE Transactions on Communications, 1990.
[31]
Andre Pawlowski, Moritz Contag, Victor van der Veen, Chris Ouwehand, Thorsten Holz, Herbert Bos, Elias Athanasopoulos, and Cristiano Giuffrida. MARX: Uncovering Class Hierarchies in C
[32]
Programs. In Network and Distributed System Security Symposium, 2017.
[33]
Mario Polino, Andrea Scorti, Federico Maggi, and Stefano Zanero. Jackdaw: Towards automatic reverse engineering of large datasets of binaries. In Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, 2015.
[34]
Mila Dalla Preda, Mihai Christodorescu, Somesh Jha, and Saumya Debray. A semantics-based approach to malware detection. In Proceedings of the Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2007.
[35]
Veselin Raychev, Martin Vechev, and Andreas Krause. Predicting program properties from "big code". In Proceedings of the Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2015.
[36]
Veselin Raychev, Martin Vechev, and Eran Yahav. Code completion with statistical language models. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2014.
[37]
Thomas Reps and Gogul Balakrishnan. Improved memory-access analysis for x86 executables. In Proceedings of the International Conference on Compiler construction, 2008.
[38]
Thomas Reps, Gogul Balakrishnan, and Junghee Lim. Intermediate-representation recovery from low-level code. In Proceedings of the ACM SIGPLAN Symposium on Partial Evaluation and Semantics-based Program Manipulation, 2006.
[39]
Thomas Reps, Gogul Balakrishnan, Junghee Lim, and Tim Teitelbaum. A next-generation platform for analyzing executables. In Proceedings of the Third Asian conference on Programming Languages and Systems, 2005.
[40]
Thomas Reps, Junghee Lim, Aditya Thakur, Gogul Balakrishnan, and Akash Lal. There's plenty of room at the bottom: analyzing and verifying machine code. In Proceedings of the International Conference on Computer Aided Verification, 2010.
[41]
Ronald Rosenfeld. Two decades of statistical language modeling: Where do we go from here? normalfont In Proceedings of the IEEE, 2000.
[42]
Paul Vincent Sabanal and Mark Vincent Yason. Reversing c+. In BlackHat USA, 2007.
[43]
Dominik Schnitzer. Musly: Audio music similarity. http://www.musly.org.
[44]
Hinrich Schütze and Yoram Singer. Part-of-speech tagging using a variable memory markov model. In Proceedings of the Annual Meeting on Association for Computational Linguistics, 1994.
[45]
Edward J Schwartz, J Lee, Maverick Woo, and David Brumley. Native x86 decompilation using semantics-preserving structural analysis and iterative control-flow structuring. normalfont In Proceedings of the USENIX Security Symposium, 2013.
[46]
Venkatesh Srinivasan and Thomas Reps. Recovery of class hierarchies and composition relationships from machine code. In In Proceedings of the International Conference on Compiler construction, 2014.
[47]
Venkatesh Karthik Srinivasan and Thomas Reps. Software-architecture recovery from machine code*. https://minds.wisconsin.edu/handle/1793/65091, 2013.
[48]
Stephen Tu. MINO: Data-driven type inference for python. http://people.csail.mit.edu/stephentu/papers/mino.pdf, 2012.
[49]
Christina Warrender, Stephanie Forrest, and Barak Pearlmutter. Detecting intrusions using system calls: Alternative data models. In Proceedings of the IEEE Symposium on Security and Privacy, 1999.
[50]
Golan Yona and Michael Levitt. Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. In Journal of Molecular Biology, 2002.

Cited By

View all
  • (2022)Finding the Dwarf: Recovering Precise Types from WebAssembly BinariesProceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation10.1145/3519939.3523449(410-425)Online publication date: 9-Jun-2022
  • (2022)Recovering container class types in C++ binariesProceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO53902.2022.9741274(131-143)Online publication date: 2-Apr-2022
  • (2021)Program Obfuscation via ABI DebiasingProceedings of the 37th Annual Computer Security Applications Conference10.1145/3485832.3488017(146-157)Online publication date: 6-Dec-2021
  • Show More Cited By

Index Terms

  1. Statistical Reconstruction of Class Hierarchies in Binaries
      Index terms have been assigned to the content through auto-classification.

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
      March 2018
      827 pages
      ISBN:9781450349116
      DOI:10.1145/3173162
      • cover image ACM SIGPLAN Notices
        ACM SIGPLAN Notices  Volume 53, Issue 2
        ASPLOS '18
        February 2018
        809 pages
        ISSN:0362-1340
        EISSN:1558-1160
        DOI:10.1145/3296957
        Issue’s Table of Contents
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Sponsors

      In-Cooperation

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 19 March 2018

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. class hierarchies
      2. reverse engineering
      3. static binary analysis
      4. x86

      Qualifiers

      • Research-article

      Funding Sources

      • European Research Council
      • Israel Science Foundation

      Conference

      ASPLOS '18

      Acceptance Rates

      ASPLOS '18 Paper Acceptance Rate 56 of 319 submissions, 18%;
      Overall Acceptance Rate 535 of 2,713 submissions, 20%

      Upcoming Conference

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)30
      • Downloads (Last 6 weeks)4
      Reflects downloads up to 17 Nov 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2022)Finding the Dwarf: Recovering Precise Types from WebAssembly BinariesProceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation10.1145/3519939.3523449(410-425)Online publication date: 9-Jun-2022
      • (2022)Recovering container class types in C++ binariesProceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO53902.2022.9741274(131-143)Online publication date: 2-Apr-2022
      • (2021)Program Obfuscation via ABI DebiasingProceedings of the 37th Annual Computer Security Applications Conference10.1145/3485832.3488017(146-157)Online publication date: 6-Dec-2021
      • (2020)Neural reverse engineering of stripped binaries using augmented control flow graphsProceedings of the ACM on Programming Languages10.1145/34282934:OOPSLA(1-28)Online publication date: 13-Nov-2020
      • (2020)iDEA: Static Analysis on the Security of Apple Kernel DriversProceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security10.1145/3372297.3423357(1185-1202)Online publication date: 30-Oct-2020
      • (2020)Devil is Virtual: Reversing Virtual Inheritance in C++ BinariesProceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security10.1145/3372297.3417251(133-148)Online publication date: 30-Oct-2020
      • (2020)Instrumenting Compiler Pipeline to Synthesise Traceable Runtime Memory Layouts in Mixed-critical Applications2020 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW)10.1109/ISSREW51248.2020.00040(73-78)Online publication date: Oct-2020
      • (2019)VPSProceedings of the 35th Annual Computer Security Applications Conference10.1145/3359789.3359797(97-112)Online publication date: 9-Dec-2019
      • (2019)code2vec: learning distributed representations of codeProceedings of the ACM on Programming Languages10.1145/32903533:POPL(1-29)Online publication date: 2-Jan-2019
      • (2018)DebinProceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security10.1145/3243734.3243866(1667-1680)Online publication date: 15-Oct-2018
      • Show More Cited By

      View Options

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media