research-article

SLACC: simion-based language agnostic code clones

Authors:

Kathryn T Stolee

ICSE '20: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering

Pages 210 - 221

https://doi.org/10.1145/3377811.3380407

Published: 01 October 2020

Abstract

Successful cross-language clone detection could enable researchers and developers to create robust language migration tools, facilitate learning additional programming languages once one is mastered, and promote reuse of code snippets over a broader codebase. However, identifying cross-language clones poses special challenges for clone detection. The lack of a common underlying representation between arbitrary languages means detecting clones requires one of the following solutions: 1) a static analysis framework replicated across each targeted language, with annotations matching language features across all languages, or 2) a dynamic analysis framework that detects clones based on runtime behavior.

In this work, we demonstrate the feasibility of the latter solution with SLACC, a dynamic analysis approach for cross-language clone detection. Like prior clone detection techniques, we use input/output behavior to match clones, but we overcome limitations of prior work by amplifying the number of inputs and covering more data types, and as a result achieve better clusters than prior attempts. Because clusters are generated from input/output behavior, SLACC supports cross-language clone detection. As an added challenge, we target a statically typed language, Java, and a dynamically typed language, Python. Compared to HitoshiIO, a recent clone detection tool for Java, SLACC retrieves 6 times as many clusters and has higher precision (86.7% vs. 30.7%).

This is the first work to perform clone detection for dynamically typed languages (precision = 87.3%) and the first to perform clone detection across languages that lack a common underlying representation (precision = 94.1%). It provides a first step towards the larger goal of scalable language migration tools.
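The input/output matching idea in the abstract can be illustrated with a small sketch. This is not SLACC's implementation: the helper names (`behavior_signature`, `cluster_by_behavior`) and the three toy snippets are hypothetical, everything runs in Python only, and a real cross-language setup would additionally need to execute Java snippets and normalize values between the two languages.

```python
from collections import defaultdict


def behavior_signature(func, inputs):
    """Run func on every argument tuple and record its output or exception type."""
    signature = []
    for args in inputs:
        try:
            signature.append(("ok", repr(func(*args))))
        except Exception as exc:  # a crash is also observable behavior
            signature.append(("error", type(exc).__name__))
    return tuple(signature)


def cluster_by_behavior(funcs, inputs):
    """Group functions whose recorded behavior agrees on every shared input."""
    clusters = defaultdict(list)
    for func in funcs:
        clusters[behavior_signature(func, inputs)].append(func.__name__)
    return [sorted(names) for names in clusters.values()]


# Hypothetical snippets: two syntactically different sums and one product.
def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total


def sum_builtin(xs):
    return sum(xs)


def product(xs):
    result = 1
    for x in xs:
        result *= x
    return result


# Each entry is one argument tuple passed to every snippet.
inputs = [([1, 2, 3],), ([],), ([4, -4],), ([2, 5],)]
clusters = cluster_by_behavior([sum_loop, sum_builtin, product], inputs)
print(clusters)
```

With the four inputs above, the two sum variants land in one cluster and `product` in another. With only the single input `[1, 2, 3]`, all three snippets return 6 and would be grouped together, which is one way to see why enlarging the input set matters for cluster quality.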

Index Terms

SLACC: simion-based language agnostic code clones
1. Information systems
  1. Information systems applications
    1. Data mining
      1. Clustering
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Functional languages
        Object oriented languages
    2. Software maintenance tools

Information & Contributors

Information

Published In

ICSE '20: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering

June 2020

1640 pages

ISBN:9781450371216

DOI:10.1145/3377811

General Chairs: Gregg Rothermel (North Carolina State University), Doo-Hwan Bae (KAIST, South Korea)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

In-Cooperation

KIISE: Korean Institute of Information Scientists and Engineers
IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Artifacts Evaluated & Reusable / v1.1

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

ICSE '20

Sponsor:

SIGSOFT

ICSE '20: 42nd International Conference on Software Engineering

June 27 - July 19, 2020

Seoul, South Korea

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 16
Total Downloads: 282
Downloads (Last 12 months): 26
Downloads (Last 6 weeks): 3

Reflects downloads up to 18 Nov 2024

Citations

Cited By

Lin R, Fu Y, Yi W, Yang J, Cao J, Dong Z, Xie F, Li H (2024). Vulnerabilities and Security Patches Detection in OSS: A Survey. ACM Computing Surveys 57:1 (1-37). Online publication date: 9-Sep-2024.
https://dl.acm.org/doi/10.1145/3694782
Zhang A, Tang X, Oney S, Chen Y, Joyner D, Kim M, Wang X, Xia M (2024). CFlow: Supporting Semantic Flow Analysis of Students' Code in Programming Problems at Scale. Proceedings of the Eleventh ACM Conference on Learning @ Scale (188-199). Online publication date: 9-Jul-2024.
https://dl.acm.org/doi/10.1145/3657604.3662025
Middleton J, Ore J, Stolee K, Roychoudhury A, Paiva A, Abreu R, Storey M (2024). Barriers for Students During Code Change Comprehension. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-13). Online publication date: 20-May-2024.
https://dl.acm.org/doi/10.1145/3597503.3639227
Yang Y, Zou Y, Hu X, Lo D, Ni C, Grundy J, Xia X, Chandra S, Blincoe K, Tonella P (2023). C³: Code Clone-Based Identification of Duplicated Components. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1832-1843). Online publication date: 30-Nov-2023.
https://dl.acm.org/doi/10.1145/3611643.3613883
Di Grazia L, Pradel M (2023). Code Search: A Survey of Techniques for Finding Code. ACM Computing Surveys 55:11 (1-31). Online publication date: 9-Feb-2023.
https://dl.acm.org/doi/10.1145/3565971
Zhang A, Chen Y, Oney S (2023). RunEx: Augmenting Regular-Expression Code Search with Runtime Values. 2023 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) (139-147). Online publication date: 3-Oct-2023.
https://doi.org/10.1109/VL-HCC57772.2023.00024
Mehrotra N, Sharma A, Jindal A, Purandare R (2023). Improving Cross-Language Code Clone Detection via Code Representation Learning and Graph Neural Networks. IEEE Transactions on Software Engineering 49:11 (4846-4868). Online publication date: 6-Sep-2023.
https://dl.acm.org/doi/10.1109/TSE.2023.3311796
Reddy M, Bhat S, Chandrashekar N, Venkatraman S, Kanwal P (2023). Language Agnostic Program Conformance Analysis. 2023 4th International Conference for Emerging Technology (INCET) (1-6). Online publication date: 26-May-2023.
https://doi.org/10.1109/INCET57972.2023.10170698
Pinku S, Mondal D, Roy C (2023). Pathways to Leverage Transcompiler based Data Augmentation for Cross-Language Clone Detection. 2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC) (169-180). Online publication date: May-2023.
https://doi.org/10.1109/ICPC58990.2023.00031
Woodruff J, Armengol-Estapé J, Ainsworth S, O'Boyle M, Jhala R, Dillig I (2022). Bind the gap: compiling real software to hardware FFT accelerators. Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation (687-702). Online publication date: 9-Jun-2022.
https://dl.acm.org/doi/10.1145/3519939.3523439
