research-article

How should we measure functional sameness from program source code? an exploratory study on Java methods

Authors:

Shinji Kusumoto

FSE 2014: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering

Pages 294 - 305

https://doi.org/10.1145/2635868.2635886

Published: 11 November 2014

Abstract

Program source code is one of the main targets of software engineering research. A wide variety of research has been conducted on source code, and many studies have leveraged structural, vocabulary, and method signature similarities to measure the functional sameness of source code. In this research, we conducted an empirical study to ascertain how these three similarities should be used to measure functional sameness. We used two large datasets, each of which included approximately 15 million Java method pairs, and measured the three similarities between all the method pairs in each dataset. We then analyzed the relationships between the three similarities to determine how each should be used to detect functionally similar code. The results of our study revealed the following. (1) Method names are not always useful for detecting functionally similar code: only when a small number of methods share a given name are those methods likely to include functionally similar code. (2) Existing file-level, method-level, and block-level clone detection techniques often miss functionally similar code generated by copy-and-paste operations between different projects. (3) When structural similarity is used to detect functionally similar code, many false positives are obtained; however, most of them can be avoided by using a vocabulary similarity in addition to the structural one. (4) Vocabulary similarity is not suitable for detecting functionally similar code among method pairs in the same file, because such method pairs use many of the same program elements, such as private methods or private fields.
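Finding (3) can be illustrated with a minimal sketch. The abstract does not define how the similarities were computed, so everything below is an assumption made for illustration: structural similarity is approximated by comparing token sequences after replacing every identifier with a placeholder, vocabulary similarity is the Jaccard index over identifier names, and the function names, the partial keyword list, and both thresholds are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Partial keyword list, enough for the examples below (assumption, not exhaustive).
JAVA_KEYWORDS = {"int", "void", "for", "while", "if", "else", "return",
                 "new", "public", "private", "static"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def tokens(src):
    """Split Java-like source into identifiers, numbers, and single symbols."""
    return TOKEN.findall(src)

def is_identifier(tok):
    return (tok[0].isalpha() or tok[0] == "_") and tok not in JAVA_KEYWORDS

def vocabulary_similarity(a, b):
    """Jaccard index over the identifier names used in each method."""
    va = {t for t in tokens(a) if is_identifier(t)}
    vb = {t for t in tokens(b) if is_identifier(t)}
    if not va and not vb:
        return 1.0
    return len(va & vb) / len(va | vb)

def structural_similarity(a, b):
    """Token-sequence similarity with identifier names abstracted away."""
    na = ["$id" if is_identifier(t) else t for t in tokens(a)]
    nb = ["$id" if is_identifier(t) else t for t in tokens(b)]
    return SequenceMatcher(None, na, nb, autojunk=False).ratio()

def likely_same_function(a, b, s_min=0.8, v_min=0.5):
    """Require both similarities to pass; thresholds are arbitrary placeholders."""
    return (structural_similarity(a, b) >= s_min
            and vocabulary_similarity(a, b) >= v_min)

sum_a = "int sum(int[] xs) { int total = 0; for (int x : xs) { total += x; } return total; }"
sum_b = "int sum(int[] xs) { int acc = 0; for (int x : xs) { acc += x; } return acc; }"
prod = "int product(int[] fs) { int result = 1; for (int f : fs) { result *= f; } return result; }"

# sum_a and prod share a loop skeleton but no identifiers, so a structure-only
# check would pair them, while the combined check rejects the pair.
print(likely_same_function(sum_a, sum_b), likely_same_function(sum_a, prod))
```

In this toy setting the summing and multiplying methods are structurally near-identical yet compute different things, which is the kind of false positive the added vocabulary check is meant to filter out.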


Index Terms

1. Social and professional topics
  1. Professional topics
    1. Management of computing and information systems
      1. Software management
        Software maintenance
2. Software and its engineering
  1. Software creation and management

Information & Contributors

Information

Published In

FSE 2014: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering

November 2014

856 pages

ISBN:9781450330565

DOI:10.1145/2635868

General Chair: Shing-Chi Cheung (Hong Kong University of Science and Technology, China)

Program Chairs: Alessandro Orso (Georgia Institute of Technology, USA), Margaret-Anne Storey (University of Victoria, Canada)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGSOFT/FSE'14

Sponsor:

SIGSOFT

SIGSOFT/FSE'14: 22nd ACM SIGSOFT Symposium on the Foundations of Software Engineering

November 16 - 21, 2014

Hong Kong, China

Acceptance Rates

Overall Acceptance Rate 17 of 128 submissions, 13%
