On using machine learning to automatically classify software applications into domain categories

Authors:

Mario Linares-Vásquez,

Collin McMillan,

Denys Poshyvanyk,

Mark Grechanik

Empirical Software Engineering, Volume 19, Issue 3

Pages 582 - 618

https://doi.org/10.1007/s10664-012-9230-z

Published: 01 June 2014

Abstract

Software repositories hold applications that are often categorized to improve the effectiveness of various maintenance tasks. Properly categorized applications allow stakeholders to identify requirements related to their applications and to predict maintenance problems in software projects. Manual categorization is expensive, tedious, and laborious, which is why automatic categorization approaches are gaining importance. Unfortunately, for various legal and organizational reasons, the applications' source code is often not available, which makes it difficult to categorize these applications automatically. In this paper, we propose a novel approach that uses Application Programming Interface (API) calls from third-party libraries to automatically categorize the software applications that make these calls. Our approach is general: because API calls can be extracted from both source code and bytecode, it enables different categorization algorithms to be applied to repositories that contain either form. We compare our approach to a state-of-the-art approach that uses machine learning algorithms for software categorization, and we conduct experiments on two large Java repositories: an open-source repository containing 3,286 projects and a closed-source repository with 745 applications, for which the source code was not available. Our contribution is twofold: we propose a new approach that makes it possible to categorize software projects without any source code, using a small number of API calls as attributes, and we carry out a comprehensive empirical evaluation of automatic categorization approaches.
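To make the idea of "API calls as attributes" concrete, the following is a minimal sketch, not the authors' implementation. It assumes each application has already been reduced to the set of API classes it references (extracted from source code or bytecode by some other tool), and it uses a simple nearest-neighbour vote over Jaccard similarity as a stand-in for whichever machine learning algorithms the paper evaluates. The training data, the category names, and the functions `jaccard` and `classify` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical training data: each application is represented only by the
# set of API classes it calls, paired with a known category label.
TRAINING = [
    ({"java.sql.Connection", "java.sql.ResultSet", "javax.sql.DataSource"}, "Database"),
    ({"java.sql.Statement", "java.sql.ResultSet"}, "Database"),
    ({"java.net.Socket", "java.net.URL", "javax.net.ssl.SSLSocket"}, "Networking"),
    ({"java.net.URL", "java.net.HttpURLConnection"}, "Networking"),
]


def jaccard(a, b):
    """Jaccard similarity between two sets of API names (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify(api_calls, training=TRAINING, k=3):
    """Return the majority category among the k most similar training applications."""
    # Python's sort is stable, so ties keep their original training order.
    ranked = sorted(training, key=lambda item: jaccard(api_calls, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # An application with no source text available, only its API usage.
    unknown_app = {"java.sql.Connection", "java.sql.ResultSet", "java.net.URL"}
    print(classify(unknown_app))
```

Because the only input is a set of API names, the same function applies whether those names were recovered from source files or from compiled bytecode, which is the property the abstract highlights.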

On using machine learning to automatically classify software applications into domain categories
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning

Published In

Empirical Software Engineering, Volume 19, Issue 3

June 2014

355 pages

ISSN: 1382-3256

Copyright © 2014 Springer Science+Business Media New York.

Publisher

Kluwer Academic Publishers

United States

Cited By

Dey T, Loungani J, Ivers J, Shang W, Lamothe M, Wan Z (2024) Smarter Project Selection for Software Engineering Research. Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering, pp 12-21. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3663533.3664037
Belgacem H, Li X, Bianculli D, Briand L (2024) Learning-based Relaxation of Completeness Requirements for Data Entry Forms. ACM Transactions on Software Engineering and Methodology 33(3):1-32. Online publication date: 15-Mar-2024.
https://dl.acm.org/doi/10.1145/3635708
Swarna K, Mathews N, Vagavolu D, Chimalakonda S (2024) On the impact of multiple source code representations on software engineering tasks — An empirical study. Journal of Systems and Software 210:C. Online publication date: 1-Apr-2024.
https://dl.acm.org/doi/10.1016/j.jss.2023.111941
Romano A, Wang W (2023) Automated WebAssembly Function Purpose Identification With Semantics-Aware Analysis. Proceedings of the ACM Web Conference 2023, pp 2885-2894. Online publication date: 30-Apr-2023.
https://dl.acm.org/doi/10.1145/3543507.3583235
Ding Z, Li H, Shang W, Chen T (2023) Towards Learning Generalizable Code Embeddings Using Task-agnostic Graph Convolutional Networks. ACM Transactions on Software Engineering and Methodology 32(2):1-43. Online publication date: 30-Mar-2023.
https://dl.acm.org/doi/10.1145/3542944
Liu S, Xie X, Siow J, Ma L, Meng G, Liu Y (2023) GraphSearchNet: Enhancing GNNs via Capturing Global Dependencies for Semantic Code Search. IEEE Transactions on Software Engineering 49(4):2839-2855. Online publication date: 1-Apr-2023.
https://dl.acm.org/doi/10.1109/TSE.2022.3233901
Sas C, Capiluppi A (2023) Multi-granular software annotation using file-level weak labelling. Empirical Software Engineering 29(1). Online publication date: 30-Nov-2023.
https://dl.acm.org/doi/10.1007/s10664-023-10423-7
Hamdi M (2022) Towards a classification of sustainable software development process using manifold machine learning techniques. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 42(6):6183-6194. Online publication date: 1-Jan-2022.
https://dl.acm.org/doi/10.3233/JIFS-212600
Yang Y, Xia X, Lo D, Bi T, Grundy J, Yang X (2022) Predictive Models in Software Engineering: Challenges and Opportunities. ACM Transactions on Software Engineering and Methodology 31(3):1-72. Online publication date: 9-Apr-2022.
https://dl.acm.org/doi/10.1145/3503509
Sas C, Capiluppi A (2022) Antipatterns in software classification taxonomies. Journal of Systems and Software 190:C. Online publication date: 1-Aug-2022.
https://dl.acm.org/doi/10.1016/j.jss.2022.111343
