DOI: 10.1145/3273045.3273047
research-article

Clone-Slicer: Detecting Domain Specific Binary Code Clones through Program Slicing

Published: 15 January 2018

Abstract

In this paper, we present Clone-Slicer, a novel domain-specific code clone detector for binary executables that integrates program slicing with a deep-learning-based binary code clone modeling framework to increase the number of code clones detected. We chose pointer analysis for memory safety as our example domain to demonstrate the usefulness of the approach, and evaluated it on real-world applications from the SPEC 2006 benchmark suite. Our results show that Clone-Slicer detects up to 43.64% more code clones than prior work and cuts the time-to-solution (the time spent verifying memory bound safety) by 32.96% compared to Clone-Hunter. As future work, we plan to apply Clone-Slicer to other domains and tasks, such as vulnerable program path discovery, and to further improve code clone detection through advanced clustering algorithms. We will also study the cost-benefit tradeoffs of using such algorithms.

Published In

FEAST '18: Proceedings of the 2018 Workshop on Forming an Ecosystem Around Software Transformation
October 2018
39 pages
ISBN: 9781450359979
DOI: 10.1145/3273045
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. binary analysis
  2. code clones
  3. machine learning
  4. program slicing

Qualifiers

  • Research-article

Funding Sources

  • US Office of Naval Research

Conference

CCS '18
Acceptance Rates

Overall Acceptance Rate 4 of 4 submissions, 100%
