DOI: 10.1145/3377811.3380407
research-article

SLACC: simion-based language agnostic code clones

Published: 01 October 2020

Abstract

Successful cross-language clone detection could enable researchers and developers to create robust language migration tools, facilitate learning additional programming languages once one is mastered, and promote reuse of code snippets over a broader codebase. However, identifying cross-language clones presents special challenges to the clone detection problem. The lack of a common underlying representation between arbitrary languages means detecting clones requires one of the following solutions: 1) a static analysis framework replicated across each targeted language with annotations matching language features across all languages, or 2) a dynamic analysis framework that detects clones based on runtime behavior.
In this work, we demonstrate the feasibility of the latter solution with SLACC, a dynamic analysis approach for cross-language clone detection. Like prior clone detection techniques, we use input/output behavior to match clones, but we overcome limitations of prior work by amplifying the number of inputs and covering more data types, and as a result achieve better clusters than prior attempts. Because clusters are generated from input/output behavior, SLACC supports cross-language clone detection. As an added challenge, we target a statically typed language, Java, and a dynamically typed language, Python. Compared to HitoshiIO, a recent clone detection tool for Java, SLACC retrieves 6 times as many clusters and has higher precision (86.7% vs. 30.7%).
This is the first work to perform clone detection for dynamically typed languages (precision = 87.3%) and the first to perform clone detection across languages that lack a common underlying representation (precision = 94.1%). It provides a first step towards the larger goal of scalable language migration tools.
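
The idea of matching clones by input/output behavior can be illustrated with a minimal Python sketch. It is not SLACC's implementation: the helper names (`make_inputs`, `behavior_signature`, `cluster_by_behavior`), the restriction to integer-list inputs, and the exact-match grouping rule are all assumptions made for illustration. Each candidate function runs on the same generated inputs, and functions whose recorded outputs agree everywhere land in the same cluster.

```python
import random
from collections import defaultdict


def make_inputs(n=64, seed=0):
    """Generate n random integer lists (hypothetical input generator)."""
    rng = random.Random(seed)
    return [
        [rng.randint(-50, 50) for _ in range(rng.randint(0, 8))]
        for _ in range(n)
    ]


def behavior_signature(func, inputs):
    """Record what func returns (or which exception it raises) on each input."""
    signature = []
    for args in inputs:
        try:
            # Pass a copy so one candidate cannot mutate another's input.
            signature.append(("ok", repr(func(list(args)))))
        except Exception as exc:
            signature.append(("err", type(exc).__name__))
    return tuple(signature)


def cluster_by_behavior(funcs, inputs):
    """Group function names whose signatures match exactly (an assumption)."""
    clusters = defaultdict(list)
    for name, func in funcs.items():
        clusters[behavior_signature(func, inputs)].append(name)
    return sorted(sorted(names) for names in clusters.values())


# Three toy candidates: two sum the list in different ways, one does not.
def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total


def sum_builtin(xs):
    return sum(xs)


def max_or_zero(xs):
    return max(xs) if xs else 0


CANDIDATES = {
    "sum_loop": sum_loop,
    "sum_builtin": sum_builtin,
    "max_or_zero": max_or_zero,
}
CLUSTERS = cluster_by_behavior(CANDIDATES, make_inputs())
```

Because the grouping depends only on observed outputs, the same comparison could in principle be applied to functions written in different languages once their outputs are serialized to a common form. The sketch stays within Python and leaves that step out.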



Published In

ICSE '20: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering
June 2020
1640 pages
ISBN: 9781450371216
DOI: 10.1145/3377811
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • KIISE: Korean Institute of Information Scientists and Engineers
  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cross-language analysis
  2. semantic code clone detection

Qualifiers

  • Research-article

Conference

ICSE '20
Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%


Cited By

  • (2024) Vulnerabilities and Security Patches Detection in OSS: A Survey. ACM Computing Surveys 57:1 (1-37). DOI: 10.1145/3694782. Online publication date: 9-Sep-2024
  • (2024) CFlow: Supporting Semantic Flow Analysis of Students' Code in Programming Problems at Scale. Proceedings of the Eleventh ACM Conference on Learning @ Scale (188-199). DOI: 10.1145/3657604.3662025. Online publication date: 9-Jul-2024
  • (2024) Barriers for Students During Code Change Comprehension. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-13). DOI: 10.1145/3597503.3639227. Online publication date: 20-May-2024
  • (2023) C³: Code Clone-Based Identification of Duplicated Components. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1832-1843). DOI: 10.1145/3611643.3613883. Online publication date: 30-Nov-2023
  • (2023) Code Search: A Survey of Techniques for Finding Code. ACM Computing Surveys 55:11 (1-31). DOI: 10.1145/3565971. Online publication date: 9-Feb-2023
  • (2023) RunEx: Augmenting Regular-Expression Code Search with Runtime Values. 2023 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) (139-147). DOI: 10.1109/VL-HCC57772.2023.00024. Online publication date: 3-Oct-2023
  • (2023) Improving Cross-Language Code Clone Detection via Code Representation Learning and Graph Neural Networks. IEEE Transactions on Software Engineering 49:11 (4846-4868). DOI: 10.1109/TSE.2023.3311796. Online publication date: 6-Sep-2023
  • (2023) Language Agnostic Program Conformance Analysis. 2023 4th International Conference for Emerging Technology (INCET) (1-6). DOI: 10.1109/INCET57972.2023.10170698. Online publication date: 26-May-2023
  • (2023) Pathways to Leverage Transcompiler based Data Augmentation for Cross-Language Clone Detection. 2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC) (169-180). DOI: 10.1109/ICPC58990.2023.00031. Online publication date: May-2023
  • (2022) Bind the gap: compiling real software to hardware FFT accelerators. Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation (687-702). DOI: 10.1145/3519939.3523439. Online publication date: 9-Jun-2022
