article

Cross-company defect prediction via semi-supervised clustering-based data filtering and MSTrA-based transfer learning

Authors:

Kwabena Ebo Bennin,

Chuanxiang Ma

Soft Computing - A Fusion of Foundations, Methodologies and Applications, Volume 22, Issue 10

Pages 3461 - 3472

https://doi.org/10.1007/s00500-018-3093-1

Published: 01 May 2018

Abstract

Cross-company defect prediction (CCDP) is a practical approach that trains a prediction model on one or more projects from a source company and then applies the model to a target company. Unfortunately, a large volume of irrelevant cross-company (CC) data usually makes it difficult to build a high-performance prediction model. Moreover, brute-force use of CC data that are poorly related to within-company data may degrade the model's performance. To address these issues, we aim to provide an effective solution for CCDP. First, we propose a novel semi-supervised clustering-based data filtering method (the SSDBSCAN filter) to filter out irrelevant CC data. Second, based on the filtered CC data, we introduce, for the first time, the multi-source TrAdaBoost algorithm (MSTrA), an effective transfer learning method, into CCDP to import knowledge from multiple sources rather than a single one and thereby avoid negative transfer. Experiments on 15 public datasets indicate that (1) the proposed SSDBSCAN filter achieves better overall performance than the compared data filtering methods; (2) the proposed CCDP approach achieves the best overall performance among all tested CCDP approaches; and (3) the proposed CCDP approach performs significantly better than within-company defect prediction models.
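
The filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' SSDBSCAN filter, whose details the abstract does not give. It assumes a much simpler density-reachability rule: a cross-company instance is kept if it can be reached from some within-company instance through a chain of neighbours that are each at most `eps` apart. The function name `filter_cc_by_reachability`, the `eps` parameter, and the toy data are all hypothetical.

```python
import math
from collections import deque


def filter_cc_by_reachability(cc_rows, wc_points, eps):
    """Keep CC rows reachable from any within-company point.

    cc_rows   : list of (features, label) pairs from other companies
    wc_points : list of feature tuples from the target company
    eps       : neighbourhood radius (an assumed, hand-picked value)

    Simplified stand-in for a clustering-based filter, not SSDBSCAN.
    """
    frontier = deque(wc_points)   # points still to be expanded
    kept = set()                  # indices of CC rows accepted so far
    while frontier:
        point = frontier.popleft()
        for i, (features, _label) in enumerate(cc_rows):
            if i not in kept and math.dist(point, features) <= eps:
                kept.add(i)
                # A kept CC row can pull in its own neighbours later.
                frontier.append(features)
    return [cc_rows[i] for i in sorted(kept)]


# Toy data: two CC rows near the WC point, one far away.
cc = [((0.2, 0.1), 1), ((0.6, 0.1), 0), ((5.0, 5.0), 1)]
wc = [(0.0, 0.0)]
kept = filter_cc_by_reachability(cc, wc, eps=0.5)
print(len(kept), "of", len(cc), "CC rows kept")
```

In this toy run the second row lies farther than `eps` from the within-company point but is still kept, because it is within `eps` of the first row; the distant third row is discarded. The filtered rows would then be the input to whatever transfer learner is used downstream.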

Published In

Soft Computing - A Fusion of Foundations, Methodologies and Applications Volume 22, Issue 10

May 2018

337 pages

ISSN:1432-7643

EISSN:1433-7479

Copyright © 2018 Springer-Verlag GmbH Germany, part of Springer Nature.

Publisher

Springer-Verlag

Berlin, Heidelberg

Cited By

Tong H, Zhang D, Liu J, Xing W, Lu L, Lu W, Wu Y (2024) MASTER: Multi-Source Transfer Weighted Ensemble Learning for Multiple Sources Cross-Project Defect Prediction. IEEE Transactions on Software Engineering 50(5):1281-1305. DOI: 10.1109/TSE.2024.3381235. Online publication date: 25-Mar-2024
https://dl.acm.org/doi/10.1109/TSE.2024.3381235
Yang P, Zhu L, Hu W, Keung J, Lu L, Xiang J (2023) The Impact of the bug number on Effort-Aware Defect Prediction: An Empirical Study. Proceedings of the 14th Asia-Pacific Symposium on Internetware, pp 67-78. DOI: 10.1145/3609437.3609458. Online publication date: 4-Aug-2023
https://dl.acm.org/doi/10.1145/3609437.3609458
Benala T, Tantati K (2023) Efficiency of oversampling methods for enhancing software defect prediction by using imbalanced data. Innovations in Systems and Software Engineering 19(3):247-263. DOI: 10.1007/s11334-022-00457-3. Online publication date: 1-Sep-2023
https://dl.acm.org/doi/10.1007/s11334-022-00457-3
Xing Y, Lin W, Lin X, Yang B, Tan Z (2022) Cross-Project Defect Prediction Based on Two-Phase Feature Importance Amplification. Computational Intelligence and Neuroscience 2022. DOI: 10.1155/2022/2320447. Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.1155/2022/2320447
Bai J, Jia J, Capretz L (2022) A three-stage transfer learning framework for multi-source cross-project software defect prediction. Information and Software Technology 150:C. DOI: 10.1016/j.infsof.2022.106985. Online publication date: 1-Oct-2022
https://dl.acm.org/doi/10.1016/j.infsof.2022.106985
Singh D, Sisodia D, Singh P (2019) Multiobjective evolutionary-based multi-kernel learner for realizing transfer learning in the prediction of HIV-1 protease cleavage sites. Soft Computing - A Fusion of Foundations, Methodologies and Applications 24(13):9727-9751. DOI: 10.1007/s00500-019-04487-1. Online publication date: 5-Nov-2019
https://dl.acm.org/doi/10.1007/s00500-019-04487-1
