Incomplete-case nearest neighbor imputation in software measurement data

Published: 01 February 2014

Abstract

k nearest neighbor imputation (kNNI) is one of the most popular methods in empirical software engineering for imputing missing values. kNNI typically uses only complete cases as possible donors for imputation (called complete case kNNI or CCkNNI). Though it often produces reasonable results, CCkNNI is severely limited when the amount of missing data is large (and hence the number of complete cases is small). In response, a variant of CCkNNI called incomplete case k nearest neighbor imputation (ICkNNI) has been proposed as an attractive alternative. This work presents a detailed simulation comparing CCkNNI and ICkNNI using two different software measurement datasets. The empirical results show that using incomplete cases often increases the effectiveness of nearest neighbor imputation (especially at higher missingness levels), regardless of the type of missingness (i.e., the distribution of missing values in the data).
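The difference between the two donor pools can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact donor-eligibility rule or distance function, so the function name `knn_impute`, the flag `complete_donors_only`, the use of `None` for a missing value, and the distance (root mean squared difference over the attributes observed in both cases) are all assumptions made for illustration.

```python
import math


def knn_impute(rows, k=2, complete_donors_only=True):
    """Fill each missing value (None) with the mean of its k nearest donors.

    Hypothetical sketch. complete_donors_only=True mimics CCkNNI (only
    complete cases may donate); False mimics ICkNNI (any case that has the
    needed attribute observed may donate).
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for d, donor in enumerate(rows):
                # A donor must be another case with attribute j observed.
                if d == i or donor[j] is None:
                    continue
                if complete_donors_only and any(v is None for v in donor):
                    continue
                # Assumption: compare only attributes observed in both cases.
                shared = [c for c in range(len(row))
                          if row[c] is not None and donor[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - donor[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, donor[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            if nearest:  # leave the value missing if no donor qualifies
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Made-up toy data, not taken from the paper's datasets.
rows = [
    [1.0, 2.0, 3.0],    # complete case
    [1.0, None, 3.5],   # incomplete case that is close to the recipient
    [9.0, 9.0, 9.0],    # complete case that is far from the recipient
    [1.0, 2.0, None],   # recipient: third attribute is missing
]

cc = knn_impute(rows, k=2, complete_donors_only=True)
ic = knn_impute(rows, k=2, complete_donors_only=False)
print(cc[3][2])  # 6.0: forced to average the near and the far complete case
print(ic[3][2])  # 3.25: the nearby incomplete case replaces the far donor
```

With only two complete cases available, the complete-case variant has to use a distant donor, while the incomplete-case variant can draw on a closer case that happens to have a gap elsewhere. In practice the attributes would usually be standardized before distances are computed; that step is omitted here.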

Published In

Information Sciences: an International Journal, Volume 259, Issue
February, 2014
611 pages

Publisher

Elsevier Science Inc.

United States

Author Tags

  1. Complete-case
  2. Incomplete-case
  3. Nearest neighbor imputation
  4. Software measurement data

Qualifiers

  • Article

Cited By

  • (2024) Bayesian ART for incomplete datasets. Applied Soft Computing 163:C. DOI: 10.1016/j.asoc.2024.111865. Online publication date: 1-Sep-2024
  • (2023) Active learning for ordinal classification on incomplete data. Intelligent Data Analysis 27:3 (613-634). DOI: 10.3233/IDA-226664. Online publication date: 1-Jan-2023
  • (2023) Evaluating ensemble imputation in software effort estimation. Empirical Software Engineering 28:2. DOI: 10.1007/s10664-022-10260-0. Online publication date: 15-Mar-2023
  • (2023) Optimized fuzzy clustering‐based k‐nearest neighbors imputation for mixed missing data in software development effort estimation. Journal of Software: Evolution and Process 36:4. DOI: 10.1002/smr.2529. Online publication date: 4-Jan-2023
  • (2022) Causal Feature Selection with Missing Data. ACM Transactions on Knowledge Discovery from Data 16:4 (1-24). DOI: 10.1145/3488055. Online publication date: 8-Jan-2022
  • (2020) Performance Evaluation for Class Center-Based Missing Data Imputation Algorithm. Proceedings of the 2020 9th International Conference on Software and Computer Applications (36-40). DOI: 10.1145/3384544.3384575. Online publication date: 18-Feb-2020
  • (2020) Enhancing data analysis: uncertainty-resistance method for handling incomplete data. Applied Intelligence 50:1 (74-86). DOI: 10.1007/s10489-019-01514-4. Online publication date: 1-Jan-2020
  • (2020) Missing value imputation: a review and analysis of the literature (2006–2017). Artificial Intelligence Review 53:2 (1487-1509). DOI: 10.1007/s10462-019-09709-4. Online publication date: 1-Feb-2020
  • (2020) Fuzzy case‐based‐reasoning‐based imputation for incomplete data in software engineering repositories. Journal of Software: Evolution and Process 32:9. DOI: 10.1002/smr.2260. Online publication date: 3-Sep-2020
  • (2019) Experience. Journal of Data and Information Quality 11:4 (1-38). DOI: 10.1145/3328746. Online publication date: 19-Aug-2019
