research-article

Evaluating ensemble imputation in software effort estimation

Published: 15 March 2023

Abstract

Choosing an appropriate missing data (MD) imputation technique for a given software development effort estimation (SDEE) technique is not a trivial task. The impact of MD imputation on the estimation output depends on both the dataset and the SDEE technique used, and no single imputation technique is best in all contexts. An attractive solution is therefore to use more than one imputation technique and combine their results into a final imputation outcome. This concept is called ensemble imputation, and it can significantly improve effort estimation accuracy. This study proposes and constructs 11 heterogeneous ensemble imputation techniques, whose members are two, three, or four of the following single imputation techniques: k-nearest neighbors (KNN), expectation maximization (EM), support vector regression (SVR), and decision trees (DTs). The effects of single and ensemble imputation techniques on SDEE performance were evaluated over six SDEE datasets: COCOMO81, ISBSG, Desharnais, China, Kemerer, and Miyazaki. Five SDEE performance measures were used: standardized accuracy (SA), prediction at 25% (Pred(0.25)), mean balanced relative error (MBRE), mean inverted balanced relative error (MIBRE), and logarithmic standard deviation (LSD). Moreover, we used (1) the Scott-Knott (SK) statistical test to cluster and compare the results, and (2) the Borda count method to rank the SDEE techniques belonging to the best SK cluster.
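The idea of ensemble imputation can be illustrated with a minimal sketch. The abstract does not say how the members' results are combined, so averaging the imputed values cell by cell is an assumption here. The three member imputers below (mean, median, 1-nearest-neighbor) are simple stand-ins chosen to keep the example dependency-free; they are not the KNN, EM, SVR, and DT imputers evaluated in the study, and the toy data and all function names are hypothetical.

```python
from statistics import mean, median

def mean_imputer(rows, col):
    """Fill missing cells of `col` with the mean of the observed values."""
    fill = mean(r[col] for r in rows if r[col] is not None)
    return [fill if r[col] is None else r[col] for r in rows]

def median_imputer(rows, col):
    """Fill missing cells of `col` with the median of the observed values."""
    fill = median(r[col] for r in rows if r[col] is not None)
    return [fill if r[col] is None else r[col] for r in rows]

def nearest_neighbor_imputer(rows, col, ref_col=0):
    """Copy the value from the complete row that is closest on `ref_col`."""
    donors = [r for r in rows if r[col] is not None]
    out = []
    for r in rows:
        if r[col] is None:
            donor = min(donors, key=lambda d: abs(d[ref_col] - r[ref_col]))
            out.append(donor[col])
        else:
            out.append(r[col])
    return out

def ensemble_impute(rows, col, members):
    """Assumed combination rule: average the members' outputs per cell.

    Observed cells are returned unchanged because every member
    leaves them as they are.
    """
    columns = [member(rows, col) for member in members]
    return [mean(values) for values in zip(*columns)]

# Hypothetical toy data: [project size, team size]; None marks a missing value.
projects = [
    [100.0, 4.0],
    [120.0, None],
    [300.0, 10.0],
    [310.0, 12.0],
]

members = [mean_imputer, median_imputer, nearest_neighbor_imputer]
imputed_team_size = ensemble_impute(projects, col=1, members=members)
print(imputed_team_size)
```

Choosing two, three, or four members out of four single imputers gives 6 + 4 + 1 = 11 possible combinations, which matches the 11 ensembles mentioned above.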
The results showed that ensemble imputers significantly improved the performance of SDEE techniques compared to single imputation techniques. We also found that adding one or more members to an ensemble imputer generally led to a significant improvement in SDEE performance. When the improvement is not significant, it is better to use the ensemble imputer with the fewest members because it is less complex. No particular ensemble imputer gave the best results in all contexts. Overall, SVR imputation was the most effective member technique for constructing ensemble imputers for SDEE. Among the SDEE techniques, the DT and SVR variants using ensemble imputation obtained the best results.
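Three of the accuracy measures named in the abstract, together with a Borda-style ranking, can be sketched as follows. The formulas are the commonly used definitions of Pred(0.25), MBRE, and MIBRE and are assumed to match the study's; SA and LSD are omitted for brevity. The Borda scheme shown (n-1 points for the best technique on each measure, 0 for the worst) is one common variant and also an assumption. All function names, technique labels, and numbers are hypothetical.

```python
def pred(actual, estimated, level=0.25):
    """Share of projects whose magnitude of relative error is <= level."""
    hits = sum(abs(a - e) / a <= level for a, e in zip(actual, estimated))
    return hits / len(actual)

def mbre(actual, estimated):
    """Mean balanced relative error: absolute error over min(actual, estimate)."""
    errors = [abs(a - e) / min(a, e) for a, e in zip(actual, estimated)]
    return sum(errors) / len(errors)

def mibre(actual, estimated):
    """Mean inverted balanced relative error: absolute error over max(actual, estimate)."""
    errors = [abs(a - e) / max(a, e) for a, e in zip(actual, estimated)]
    return sum(errors) / len(errors)

def borda_rank(scores_by_measure):
    """Rank techniques by summed Borda points, best first.

    `scores_by_measure` maps measure -> {technique: score}, where a
    higher score is better. Negate error measures such as MBRE or
    MIBRE before passing them in. Ties are not handled specially.
    """
    totals = {}
    for scores in scores_by_measure.values():
        worst_to_best = sorted(scores, key=scores.get)
        for points, technique in enumerate(worst_to_best):
            totals[technique] = totals.get(technique, 0) + points
    return sorted(totals, key=totals.get, reverse=True)

# Hypothetical actual and estimated efforts for three projects.
actual_effort = [100.0, 200.0, 400.0]
estimated_effort = [110.0, 150.0, 400.0]

print(pred(actual_effort, estimated_effort))
print(mbre(actual_effort, estimated_effort))
print(mibre(actual_effort, estimated_effort))

# Hypothetical scores for three techniques on two measures.
example_scores = {
    "Pred(0.25)": {"A": 0.6, "B": 0.8, "C": 0.4},
    "-MBRE": {"A": -0.4, "B": -0.3, "C": -0.5},
}
ranking = borda_rank(example_scores)
print(ranking)
```

MBRE penalizes underestimates more heavily and MIBRE penalizes them less, which is why the two are usually reported together.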

Published In

Empirical Software Engineering  Volume 28, Issue 2
Mar 2023
1389 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 15 March 2023
Accepted: 31 October 2022

Author Tags

  1. Missing data
  2. Imputation
  3. Ensemble
  4. Software development effort estimation
