Clone-Slicer: Detecting Domain Specific Binary Code Clones through Program Slicing

Authors:

Guru Venkataramani,

Tian Lan

FEAST '18: Proceedings of the 2018 Workshop on Forming an Ecosystem Around Software Transformation

Pages 27 - 33

https://doi.org/10.1145/3273045.3273047

Published: 15 January 2018

Abstract

In this paper, we present Clone-Slicer, a novel domain-specific code clone detector for binary executables that integrates program slicing with a deep learning based binary code clone modeling framework to increase the number of code clones detected. In particular, we choose pointer analysis for memory safety as our example domain to demonstrate the usefulness of our approach. We evaluate our approach using real-world applications from the SPEC 2006 benchmark suite. Our results show that Clone-Slicer detects up to 43.64% more code clones than prior work and cuts the time-to-solution (the time spent verifying memory bound safety) by 32.96% compared to Clone-Hunter. As future work, we plan to apply Clone-Slicer to different domains and tasks, such as vulnerable program path discovery, and to further improve code clone detection through advanced clustering algorithms. We will also study the cost-benefit tradeoffs of using such advanced algorithms.
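The abstract does not describe the slicing step in detail. As a rough illustration only, the sketch below computes a backward slice over a toy straight-line instruction list, keeping just the instructions that a chosen pointer depends on. The instruction format and the names `Instr` and `backward_slice` are hypothetical and are not taken from Clone-Slicer, which operates on real binary code.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Instr:
    """Toy three-address instruction (hypothetical format): `dest` is defined from `uses`."""
    dest: str
    uses: Tuple[str, ...]
    text: str


def backward_slice(program: List[Instr], criterion: Set[str]) -> List[Instr]:
    """Return the instructions that the variables in `criterion` depend on.

    Assumes straight-line code with no branches, so one reverse pass
    over data dependencies is sufficient.
    """
    relevant = set(criterion)
    kept = []
    for instr in reversed(program):
        if instr.dest in relevant:
            # This definition reaches the criterion, so its operands become relevant.
            relevant.discard(instr.dest)
            relevant.update(instr.uses)
            kept.append(instr)
    kept.reverse()
    return kept


program = [
    Instr("base", (), "base = alloc(64)"),
    Instr("i", (), "i = 0"),
    Instr("sum", (), "sum = 0"),
    Instr("off", ("i",), "off = i * 8"),
    Instr("ptr", ("base", "off"), "ptr = base + off"),
    Instr("sum", ("sum",), "sum = sum + 1"),
    Instr("val", ("ptr",), "val = load ptr"),
]

# Slice with respect to the pointer that is later dereferenced.
slice_for_ptr = backward_slice(program, {"ptr"})
for instr in slice_for_ptr:
    print(instr.text)
```

In this toy example the instructions that only touch `sum` are dropped, leaving the four instructions that compute `ptr`. A domain-specific clone detector could then compare such pointer-related slices instead of whole code regions, which is the general idea the abstract points to.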

Index Terms

Clone-Slicer: Detecting Domain Specific Binary Code Clones through Program Slicing
1. Security and privacy
  1. Software and application security
    1. Software reverse engineering
2. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Formal software verification

Published In

FEAST '18: Proceedings of the 2018 Workshop on Forming an Ecosystem Around Software Transformation

October 2018

39 pages

ISBN:9781450359979

DOI:10.1145/3273045

Program Chairs: Yan Shoshitaishvili (Arizona State University), Mayur Naik (University of Pennsylvania)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGSAC: ACM Special Interest Group on Security, Audit, and Control

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

US Office of Naval Research

Conference

CCS '18

Sponsor:

SIGSAC

CCS '18: 2018 ACM SIGSAC Conference on Computer and Communications Security

October 15 - 19, 2018

Toronto, Canada

Acceptance Rates

Overall Acceptance Rate 4 of 4 submissions, 100%

Cited By

Chen Y, Mei Y, Lan T, Venkataramani G. (2023). Exploring Effective Fuzzing Strategies to Analyze Communication Protocols. Digital Threats: Research and Practice 5:1 (1-22). Online publication date: 4-Oct-2023. https://dl.acm.org/doi/10.1145/3526088

Yu D, Yang Q, Chen X, Chen J, Xu Y. (2023). Graph-based code semantics learning for efficient semantic code clone detection. Information and Software Technology 156:C. Online publication date: 1-Apr-2023. https://dl.acm.org/doi/10.1016/j.infsof.2022.107130

Alazba A, Aljamaan H, Alshayeb M. (2023). Deep learning approaches for bad smell detection: a systematic literature review. Empirical Software Engineering 28:3. Online publication date: 11-May-2023. https://dl.acm.org/doi/10.1007/s10664-023-10312-z

Molloy C, Charland P, Ding S, Fung B. (2022). JARV1S: Phenotype Clone Search for Rapid Zero-Day Malware Triage and Functional Decomposition for Cyber Threat Intelligence. 2022 14th International Conference on Cyber Conflict: Keep Moving! (CyCon) (385-403). Online publication date: 31-May-2022. https://doi.org/10.23919/CyCon55549.2022.9811078

Pizzolotto D, Inoue K. (2022). BinCC: Scalable Function Similarity Detection in Multiple Cross-Architectural Binaries. IEEE Access 10 (124491-124506). Online publication date: 2022. https://doi.org/10.1109/ACCESS.2022.3225100

Salimi S, Kharrazi M. (2022). VulSlicer. Journal of Systems and Software 193:C. Online publication date: 1-Nov-2022. https://dl.acm.org/doi/10.1016/j.jss.2022.111450

Tian Z, Mao H, Huang Y, Tian J, Li J. (2022). Fine-Grained Obfuscation Scheme Recognition on Binary Code. Digital Forensics and Cyber Crime (215-228). Online publication date: 4-Jun-2022. https://doi.org/10.1007/978-3-031-06365-7_13

Zhang H, Sakurai K. (2021). A Survey of Software Clone Detection From Security Perspective. IEEE Access 9 (48157-48173). Online publication date: 2021. https://doi.org/10.1109/ACCESS.2021.3065872

Tang K, Liu F, Shan Z, Zhang C. (2021). Anti-obfuscation Binary Code Clone Detection Based on Software Gene. Data Science (193-208). Online publication date: 10-Sep-2021. https://doi.org/10.1007/978-981-16-5940-9_15

Kreindl J, Bonetta D, Stadler L, Leopoldseder D, Mössenböck H, Marr S. (2020). Multi-language dynamic taint analysis in a polyglot virtual machine. Proceedings of the 17th International Conference on Managed Programming Languages and Runtimes (15-29). Online publication date: 4-Nov-2020. https://dl.acm.org/doi/10.1145/3426182.3426184
