Optimized fuzzy clustering‐based k‐nearest neighbors imputation for mixed missing data in software development effort estimation

Published: 04 January 2023

Abstract

Context

Software development effort estimation (SDEE) is one of the most challenging aspects of project management. The presence of missing data (MD) in software attributes makes SDEE even more complex. K‐nearest neighbors imputation (KNNI) has been widely used in SDEE to deal with the MD issue. However, classical KNNI has low tolerance to imprecision and uncertainty, especially when dealing with categorical features: it represents software attributes mainly with numbers or classical intervals and relies on similarity measures originally designed for numerical attributes.
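To make the imputation setting concrete, the following is a minimal sketch of nearest-neighbor imputation on mixed data. It is not the FC‐KNNI method evaluated in the paper, nor necessarily the exact KNNI variant it uses as a baseline. The distance (range-normalized difference for numerical attributes, simple mismatch for categorical ones), the function names, and the toy project records are all assumptions made for illustration.

```python
from collections import Counter

def mixed_distance(a, b, numeric_ranges):
    """Average per-attribute distance over attributes observed in both rows."""
    total, used = 0.0, 0
    for key, value in a.items():
        other = b.get(key)
        if value is None or other is None:
            continue  # skip attributes missing in either row
        if key in numeric_ranges:
            # numerical attribute: absolute difference scaled by its range
            total += abs(value - other) / numeric_ranges[key]
        else:
            # categorical attribute: 0 if equal, 1 otherwise
            total += 0.0 if value == other else 1.0
        used += 1
    return total / used if used else float("inf")

def knn_impute(target, donors, attribute, k, numeric_ranges):
    """Fill target[attribute] from the k nearest donors that have a value for it."""
    candidates = [d for d in donors if d.get(attribute) is not None]
    if not candidates:
        raise ValueError("no donor has a value for " + attribute)
    candidates.sort(key=lambda d: mixed_distance(target, d, numeric_ranges))
    values = [d[attribute] for d in candidates[:k]]
    if attribute in numeric_ranges:
        return sum(values) / len(values)          # mean for numerical values
    return Counter(values).most_common(1)[0][0]   # mode for categorical values

# Hypothetical project records (None marks a missing value).
projects = [
    {"size": 100.0, "language": "java", "team": 5.0},
    {"size": 120.0, "language": "java", "team": 7.0},
    {"size": 900.0, "language": "cobol", "team": 30.0},
    {"size": 950.0, "language": "cobol", "team": 28.0},
]
ranges = {"size": 850.0, "team": 25.0}

missing_language = {"size": 110.0, "language": None, "team": 6.0}
missing_team = {"size": 920.0, "language": "cobol", "team": None}
imputed_language = knn_impute(missing_language, projects, "language", 2, ranges)
imputed_team = knn_impute(missing_team, projects, "team", 2, ranges)
```

In this sketch a categorical mismatch is always "all or nothing", which is one simple way to see the crisp treatment of categorical features that the paper argues against.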

Objectives

This paper evaluates the use of an optimized fuzzy clustering‐based KNNI (FC‐KNNI) and compares it with classical KNNI when dealing with mixed data in the context of SDEE.

Methods

We investigate the effect of two imputation techniques (FC‐KNNI and KNNI) on five SDEE techniques: case‐based reasoning, fuzzy case‐based reasoning, support vector regression, multilayer perceptron, and reduced‐error pruning tree. The evaluation is carried out on six publicly available SDEE datasets using two performance measures: standardized accuracy (SA) and Pred(0.25). The Wilcoxon statistical test is also performed to assess the significance of the results.
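The two performance measures can be sketched as follows, assuming their commonly used definitions: Pred(0.25) is the fraction of projects whose magnitude of relative error is at most 0.25, and SA compares the mean absolute error (MAE) of the estimates with the MAE of random guessing. The function names, the exact (non-sampled) computation of the random-guessing baseline, and the toy effort values are assumptions for illustration; the abstract does not state how the paper computes them.

```python
from itertools import combinations

def pred(actual, estimated, level=0.25):
    """Fraction of estimates whose magnitude of relative error is <= level."""
    mres = [abs(a - e) / a for a, e in zip(actual, estimated)]
    return sum(m <= level for m in mres) / len(mres)

def mean_absolute_error(actual, estimated):
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)

def random_guess_mae(actual):
    """Expected MAE when each project is 'estimated' by the actual effort
    of another project drawn uniformly at random (mean over all pairs)."""
    pairs = list(combinations(actual, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def standardized_accuracy(actual, estimated):
    """SA as a ratio in (-inf, 1]; multiply by 100 for a percentage."""
    return 1.0 - mean_absolute_error(actual, estimated) / random_guess_mae(actual)

# Hypothetical actual and estimated efforts for four projects.
actual = [100.0, 200.0, 400.0, 800.0]
estimated = [110.0, 260.0, 380.0, 500.0]

pred_25 = pred(actual, estimated)               # 2 of 4 within 25% -> 0.5
sa = standardized_accuracy(actual, estimated)   # 1 - 97.5 / 383.33...
```

Higher values are better for both measures; an SA near zero would mean the estimates are no better than random guessing.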

Results

The results are promising in the sense that using an imputation technique designed for mixed data is better than reusing methods originally designed for numerical data. We found that FC‐KNNI significantly outperforms KNNI regardless of the SDEE technique and dataset used. Another important finding is that fuzzy case‐based reasoning (F‐CBR) improved the analogy process compared to case‐based reasoning (CBR).

Conclusion

The introduction of fuzzy sets and fuzzy clustering in the analogy process improves its performance in terms of SA and Pred(0.25).

Graphical Abstract

This paper investigates the use of k‐nearest neighbors imputation (KNNI) to deal with missing data in software development effort estimation (SDEE). KNNI, in its classical process, has low tolerance to imprecision and uncertainty, especially when dealing with categorical features. We evaluate the use of an optimized fuzzy clustering‐based KNNI (FC‐KNNI) and compare it with classical KNNI when dealing with mixed data in the context of SDEE. The results are promising in the sense that using an imputation technique designed for mixed data is better than reusing methods originally designed for numerical data.

Information

Published In

Journal of Software: Evolution and Process  Volume 36, Issue 4
April 2024
709 pages
EISSN:2047-7481
DOI:10.1002/smr.v36.4

Publisher

John Wiley & Sons, Inc.

United States

Author Tags

  1. fuzzy logic
  2. imputation
  3. missing data
  4. software development effort estimation

Qualifiers

  • Research-article
