DOI: 10.1145/2635868.2635900
research-article

Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection

Published: 11 November 2014

Abstract

Existing code similarity comparison methods, whether source or binary code based, are mostly not resilient to obfuscation. In the case of software plagiarism, emerging obfuscation techniques have made automated detection increasingly difficult. In this paper, we propose a binary-oriented, obfuscation-resilient method based on a new concept, the longest common subsequence of semantically equivalent basic blocks, which combines rigorous program semantics with fuzzy matching based on the longest common subsequence. We model the semantics of a basic block by a set of symbolic formulas representing the input-output relations of the block. This way, the semantic equivalence (and similarity) of two blocks can be checked by a theorem prover. We then model the semantic similarity of two paths using the longest common subsequence with basic blocks as elements. This novel combination has resulted in strong resilience to code obfuscation. We have developed a prototype, and our experimental results show that our method is effective and practical when applied to real-world software.
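The core idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' prototype. Each basic block is modeled as a hypothetical function from two input values to a tuple of outputs, standing in for the symbolic input-output formulas. The theorem prover is replaced by an exhaustive check over an assumed 4-bit value domain, which is feasible only for a toy example. All names (`blocks_equivalent`, `lcs_length`, the sample blocks) and the normalization of the score by the reference path length are assumptions made for illustration.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

# Hypothetical model: a basic block maps two input values to a tuple of outputs.
Block = Callable[[int, int], Tuple[int, ...]]

MASK = 0xF  # assumed 4-bit values, so every input pair can be enumerated


def blocks_equivalent(b1: Block, b2: Block) -> bool:
    """Stand-in for a theorem prover: compare outputs on all 4-bit input pairs."""
    values = range(MASK + 1)
    return all(b1(x, y) == b2(x, y) for x, y in product(values, repeat=2))


def lcs_length(path_a: Sequence[Block], path_b: Sequence[Block]) -> int:
    """Longest common subsequence of two paths, with blocks as elements
    and semantic equivalence in place of plain equality."""
    m, n = len(path_a), len(path_b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if blocks_equivalent(path_a[i - 1], path_b[j - 1]):
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]


# Reference blocks.
def add(x: int, y: int) -> Tuple[int, ...]:
    return ((x + y) & MASK,)


def double(x: int, y: int) -> Tuple[int, ...]:
    return ((x * 2) & MASK,)


def xor(x: int, y: int) -> Tuple[int, ...]:
    return ((x ^ y) & MASK,)


# Obfuscated variants: different instructions, same input-output relation.
def add_obf(x: int, y: int) -> Tuple[int, ...]:
    return ((x - (~y) - 1) & MASK,)  # x - (~y) - 1 equals x + y


def double_obf(x: int, y: int) -> Tuple[int, ...]:
    return ((x << 1) & MASK,)


def xor_obf(x: int, y: int) -> Tuple[int, ...]:
    return (((x | y) - (x & y)) & MASK,)  # (x | y) - (x & y) equals x ^ y


# An inserted block that matches none of the reference blocks.
def junk(x: int, y: int) -> Tuple[int, ...]:
    return ((x & y) & MASK,)


reference_path = [add, double, xor]
suspect_path = [add_obf, junk, double_obf, xor_obf]

matched = lcs_length(reference_path, suspect_path)
# Assumed scoring: fraction of the reference path recovered in the suspect path.
score = matched / len(reference_path)
```

In this toy setup the inserted `junk` block is skipped by the subsequence matching, while the rewritten blocks still match because the comparison looks at input-output behavior rather than instruction syntax. A real implementation would replace the exhaustive check with queries to a solver over symbolic formulas.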



    Published In

    FSE 2014: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering
    November 2014
    856 pages
    ISBN:9781450330565
    DOI:10.1145/2635868
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Software plagiarism detection
    2. binary code similarity comparison
    3. obfuscation
    4. symbolic execution
    5. theorem proving

    Qualifiers

    • Research-article

    Conference

    SIGSOFT/FSE'14

    Acceptance Rates

    Overall Acceptance Rate 17 of 128 submissions, 13%
