DOI: 10.1145/1572272.1572287
research-article

Detecting code clones in binary executables

Published: 19 July 2009

Abstract

Large software projects contain significant code duplication, mainly due to copying and pasting code. Many techniques have been developed to identify duplicated code to enable applications such as refactoring, detecting bugs, and protecting intellectual property. Because source code is often unavailable, especially for third-party software, finding duplicated code in binaries becomes particularly important. However, existing techniques operate primarily on source code, and no effective tool exists for binaries.
In this paper, we describe the first practical clone detection algorithm for binary executables. Our algorithm extends an existing tree similarity framework based on clustering of characteristic vectors of labeled trees with novel techniques to normalize assembly instructions and to accurately and compactly model their structural information. We have implemented our technique and evaluated it on Windows XP system binaries totaling over 50 million assembly instructions. Results show that it is both scalable and precise: it analyzed Windows XP system binaries in a few hours and produced few false positives. We believe our technique is a practical, enabling technology for many applications dealing with binary code.
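
To make the idea of normalizing assembly instructions and comparing characteristic vectors more concrete, here is a minimal sketch. It is not the paper's implementation: the register set, the operand classes (`REG`, `MEM`, `VAL`, `OTHER`), the helper names, and the flat count vector are all assumptions made for illustration, and a brute-force Euclidean distance stands in for the clustering step the abstract mentions.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny x86 register set used only for this illustration.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}


def normalize(instruction):
    """Return (opcode, operand classes) with concrete operands abstracted away."""
    opcode, _, rest = instruction.strip().partition(" ")
    classes = []
    for operand in filter(None, (part.strip() for part in rest.split(","))):
        if operand.startswith("["):
            classes.append("MEM")      # memory reference
        elif operand.lower() in REGISTERS:
            classes.append("REG")      # register name
        elif re.fullmatch(r"-?(0x[0-9a-fA-F]+|\d+)", operand):
            classes.append("VAL")      # immediate constant
        else:
            classes.append("OTHER")    # anything this sketch does not model
    return opcode.lower(), tuple(classes)


def characteristic_vector(instructions):
    """Count opcodes and operand classes in a code region (assumed feature set)."""
    counts = Counter()
    for instruction in instructions:
        opcode, classes = normalize(instruction)
        counts["op:" + opcode] += 1
        for cls in classes:
            counts["arg:" + cls] += 1
    return counts


def distance(u, v):
    """Euclidean distance between two sparse count vectors."""
    return math.sqrt(sum((u[k] - v[k]) ** 2 for k in set(u) | set(v)))


region_a = ["mov eax, [ebp+8]", "add eax, 4", "push eax"]
region_b = ["mov ecx, [esi+12]", "add ecx, 16", "push ecx"]
region_c = ["xor eax, eax", "ret"]

vec_a, vec_b, vec_c = map(characteristic_vector, (region_a, region_b, region_c))
print(distance(vec_a, vec_b))  # regions a and b coincide after normalization
print(distance(vec_a, vec_c))  # region c differs in opcodes and operand mix
```

In this sketch, two regions that differ only in register choice and constants map to the same vector, which is the effect normalization is meant to have. A scalable system would replace the pairwise distance with an approximate nearest-neighbor search rather than comparing every pair.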

Published In

ISSTA '09: Proceedings of the eighteenth international symposium on Software testing and analysis
July 2009
306 pages
ISBN: 9781605583389
DOI: 10.1145/1572272

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. binary analysis
  2. clone detection
  3. software tools

Qualifiers

  • Research-article

Conference

ISSTA '09

Acceptance Rates

Overall acceptance rate: 58 of 213 submissions (27%)


