DOI: 10.1145/1806672.1806678
research-article

Extracting compiler provenance from program binaries

Published: 06 May 2010

Abstract

We present a novel technique that identifies the source compiler of program binaries, an important element of program provenance. Program provenance answers fundamental questions of malware analysis and software forensics, such as whether programs are generated by similar tool chains; it also can allow development of debugging, performance analysis, and instrumentation tools specific to particular compilers. We formulate compiler identification as a structured learning problem, automatically building models to recognize sequences of binary code generated by particular compilers. We evaluate our techniques on a large set of real-world test binaries, showing that our models identify the source compiler of binary code with over 90% accuracy, even in the presence of interleaved code from multiple compilers. A case study demonstrates the use of inferred compiler provenance to augment stripped binary parsing, reducing parsing errors by 18%.

    Published In

    PASTE '10: Proceedings of the 9th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering
    June 2010
    96 pages
    ISBN:9781450300827
    DOI:10.1145/1806672

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. forensics
    2. program provenance
    3. static binary analysis

    Qualifiers

    • Research-article

    Conference

    PASTE '10

    Acceptance Rates

    Overall Acceptance Rate 57 of 159 submissions, 36%

    Article Metrics

    • Downloads (last 12 months): 22
    • Downloads (last 6 weeks): 3
    Reflects downloads up to 12 Nov 2024

    Cited By

    • (2024) Function-Level Compilation Provenance Identification with Multi-Faceted Neural Feature Distillation and Fusion. Electronics 13:9 (1692). DOI: 10.3390/electronics13091692. Online publication date: 27-Apr-2024
    • (2024) Fast Cross-Platform Binary Code Similarity Detection Framework Based on CFGs Taking Advantage of NLP and Inductive GNN. Chinese Journal of Electronics 33:1 (128-138). DOI: 10.23919/cje.2022.00.228. Online publication date: Jan-2024
    • (2024) Identifying Authorship in Malicious Binaries: Features, Challenges & Datasets. ACM Computing Surveys 56:8 (1-36). DOI: 10.1145/3653973. Online publication date: 26-Mar-2024
    • (2024) Compiler Provenance Recovery for Multi-CPU Architectures Using a Centrifuge Mechanism. IEEE Access 12 (34477-34488). DOI: 10.1109/ACCESS.2024.3371499. Online publication date: 2024
    • (2024) ToolPhet: Inference of Compiler Provenance From Stripped Binaries With Emerging Compilation Toolchains. IEEE Access 12 (12667-12682). DOI: 10.1109/ACCESS.2024.3355098. Online publication date: 2024
    • (2023) Improving Security Tasks Using Compiler Provenance Information Recovered At the Binary-Level. Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (2695-2709). DOI: 10.1145/3576915.3623098. Online publication date: 15-Nov-2023
    • (2023) Revisiting Lightweight Compiler Provenance Recovery on ARM Binaries. 2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC) (292-303). DOI: 10.1109/ICPC58990.2023.00044. Online publication date: May-2023
    • (2023) BinAlign: Alignment Padding Based Compiler Provenance Recovery. Information Security and Privacy (609-629). DOI: 10.1007/978-3-031-35486-1_26. Online publication date: 15-Jun-2023
    • (2022) BinProv: Binary Code Provenance Identification without Disassembly. Proceedings of the 25th International Symposium on Research in Attacks, Intrusions and Defenses (350-363). DOI: 10.1145/3545948.3545956. Online publication date: 26-Oct-2022
    • (2022) A Survey of Binary Code Fingerprinting Approaches: Taxonomy, Methodologies, and Features. ACM Computing Surveys 55:1 (1-41). DOI: 10.1145/3486860. Online publication date: 17-Jan-2022
