DOI: 10.1145/2635868.2635886
research-article

How should we measure functional sameness from program source code? An exploratory study on Java methods

Published: 11 November 2014

Abstract

Program source code is one of the main targets of software engineering research. A wide variety of research has been conducted on source code, and many studies have leveraged structural, vocabulary, and method signature similarities to measure the functional sameness of source code. In this research, we conducted an empirical study to ascertain how these three similarities should be used to measure functional sameness. We used two large datasets, each of which included approximately 15 million Java method pairs, and measured the three similarities between all the method pairs in each dataset. We then analyzed the relationships between the three similarities to determine how each should be used to detect functionally similar code. The results of our study revealed the following. (1) Method names are not always useful for detecting functionally similar code: methods sharing a name are likely to include functionally similar code only if a small number of methods have that name. (2) Existing file-level, method-level, and block-level clone detection techniques often miss functionally similar code generated by copy-and-paste operations between different projects. (3) When we used structural similarity to detect functionally similar code, we obtained many false positives; however, most of these false positives can be avoided by using a vocabulary similarity in addition to a structural one. (4) Using a vocabulary similarity to detect functionally similar code is not suitable for method pairs in the same file, because such method pairs use many of the same program elements, such as private methods or private fields.
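Finding (3) can be illustrated with a small sketch. The abstract does not say how the study computed its similarities, so everything below is an assumption made for illustration: structural similarity is approximated by a `difflib` ratio over token sequences with identifiers and literals abstracted away, vocabulary similarity is a Jaccard index over the words obtained by splitting identifiers on camelCase, and the thresholds `s_min` and `v_min` are hypothetical. It is written in Python for brevity, although the study itself concerns Java methods.

```python
import re
from difflib import SequenceMatcher

# Partial Java keyword list; enough for the toy methods below (assumption).
JAVA_KEYWORDS = {"int", "long", "void", "for", "while", "if", "else", "return"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
WORD_RE = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")

def is_identifier(tok):
    return (tok[0].isalpha() or tok[0] == "_") and tok not in JAVA_KEYWORDS

def structure(src):
    """Token sequence with identifiers and numbers replaced by placeholders."""
    out = []
    for tok in TOKEN_RE.findall(src):
        if is_identifier(tok):
            out.append("ID")
        elif tok[0].isdigit():
            out.append("NUM")
        else:
            out.append(tok)  # keywords and punctuation are kept as-is
    return out

def vocabulary(src):
    """Lower-cased words from identifiers, split on camelCase boundaries."""
    words = set()
    for tok in TOKEN_RE.findall(src):
        if is_identifier(tok):
            words.update(w.lower() for w in WORD_RE.findall(tok))
    return words

def structural_similarity(a, b):
    return SequenceMatcher(None, structure(a), structure(b), autojunk=False).ratio()

def vocabulary_similarity(a, b):
    va, vb = vocabulary(a), vocabulary(b)
    union = va | vb
    return len(va & vb) / len(union) if union else 0.0

def likely_functionally_similar(a, b, s_min=0.8, v_min=0.5):
    """Require BOTH similarities to pass (hypothetical thresholds)."""
    return structural_similarity(a, b) >= s_min and vocabulary_similarity(a, b) >= v_min

# Toy Java methods (invented for this example, not taken from the datasets).
SUM = "int sum(int[] values) { int total = 0; for (int v : values) { total += v; } return total; }"
COUNT = "int count(int[] flags) { int hits = 0; for (int f : flags) { hits += f; } return hits; }"
SUM_VARIANT = "int sum(int[] values) { int total = 0; for (int v : values) { total = total + v; } return total; }"
```

In this toy setup, `SUM` and `COUNT` have identical abstracted token sequences but share no identifier words, so a structure-only check would pair them while the combined check does not. `SUM` and `SUM_VARIANT` differ slightly in structure but share their whole vocabulary, so the combined check accepts them.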


Published In

FSE 2014: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering
November 2014
856 pages
ISBN:9781450330565
DOI:10.1145/2635868

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Clone Detection
  2. Functionally similar code
  3. Method name similarity
  4. Structural similarity
  5. Vocabulary similarity

Qualifiers

  • Research-article

Conference

SIGSOFT/FSE'14

Acceptance Rates

Overall Acceptance Rate 17 of 128 submissions, 13%

Cited By

  • (2023) A systematic literature review on source code similarity measurement and clone detection. Journal of Systems and Software, 204:C. DOI: 10.1016/j.jss.2023.111796. Online publication date: 20-Sep-2023
  • (2021) Enriching API Documentation with Code Samples and Usage Scenarios from Crowd Knowledge. IEEE Transactions on Software Engineering, 47:6 (1299-1314). DOI: 10.1109/TSE.2019.2919304. Online publication date: 1-Jun-2021
  • (2018) Toward refactoring evaluation with code naturalness. Proceedings of the 26th Conference on Program Comprehension (316-319). DOI: 10.1145/3196321.3196362. Online publication date: 28-May-2018
  • (2018) Global-Aware Recommendations for Repairing Violations in Exception Handling. IEEE Transactions on Software Engineering, 44:9 (855-873). DOI: 10.1109/TSE.2017.2716925. Online publication date: 1-Sep-2018
  • (2017) An Exploratory Study of Functional Redundancy in Code Repositories. 2017 IEEE 17th International Working Conference on Source Code Analysis and Manipulation (SCAM) (31-40). DOI: 10.1109/SCAM.2017.21. Online publication date: Sep-2017
  • (2017) Flattening Code for Metrics Measurement and Analysis. 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME) (494-498). DOI: 10.1109/ICSME.2017.65. Online publication date: Sep-2017
  • (2016) An Exploratory Study of Interface Redundancy in Code Repositories. 2016 IEEE 16th International Working Conference on Source Code Analysis and Manipulation (SCAM) (107-116). DOI: 10.1109/SCAM.2016.31. Online publication date: Oct-2016
  • (2015) Measuring software redundancy. Proceedings of the 37th International Conference on Software Engineering - Volume 1 (156-166). DOI: 10.5555/2818754.2818776. Online publication date: 16-May-2015
  • (2015) Toward improving graftability on automated program repair. Proceedings of the 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME) (511-515). DOI: 10.1109/ICSM.2015.7332504. Online publication date: 29-Sep-2015
  • (2015) Measuring Software Redundancy. 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering (156-166). DOI: 10.1109/ICSE.2015.37. Online publication date: May-2015
