Incomplete-case nearest neighbor imputation in software measurement data

Authors:

Jason Van Hulse,

Taghi M. Khoshgoftaar

Information Sciences—Informatics and Computer Science, Intelligent Systems, Applications: An International Journal, Volume 259

Pages 596 - 610

https://doi.org/10.1016/j.ins.2010.12.017

Published: 01 February 2014

Abstract

k nearest neighbor imputation (kNNI) is one of the most popular methods in empirical software engineering for imputing missing values. kNNI typically uses only complete cases as possible donors for imputation (called complete case kNNI or CCkNNI). Though it often produces reasonable results, CCkNNI is severely limited when the amount of missing data is large (and hence the number of complete cases is small). In response, a variant of CCkNNI called incomplete case k nearest neighbor imputation (ICkNNI) has been proposed as an attractive alternative. This work presents a detailed simulation comparing CCkNNI and ICkNNI using two different software measurement datasets. The empirical results show that using incomplete cases often increases the effectiveness of nearest neighbor imputation (especially at higher missingness levels), regardless of the type of missingness (i.e., the distribution of missing values in the data).
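
The difference between the two donor rules can be illustrated with a minimal Python sketch. This is not the authors' implementation. It rests on several assumptions that the abstract does not state: Euclidean distance over the attributes observed in the recipient, imputation by the unweighted mean of the k nearest donors, and an incomplete-case rule under which a donor only needs the target attribute plus every attribute the recipient has. The function name `knn_impute` and the flag `allow_incomplete_donors` are hypothetical.

```python
import numpy as np


def knn_impute(X, k=3, allow_incomplete_donors=False):
    """Impute NaNs in X with the mean of the k nearest donors (illustrative sketch).

    allow_incomplete_donors=False: donors must be fully observed (CCkNNI-style).
    allow_incomplete_donors=True: a donor only needs the target attribute and
    every attribute observed in the recipient (assumed ICkNNI-style rule).
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    complete = observed.all(axis=1)
    out = X.copy()
    for i in np.flatnonzero(~complete):
        shared = observed[i]  # attributes the recipient actually has
        for j in np.flatnonzero(~observed[i]):
            if allow_incomplete_donors:
                eligible = observed[:, j] & observed[:, shared].all(axis=1)
            else:
                eligible = complete
            donors = np.flatnonzero(eligible)
            if donors.size == 0:
                continue  # no usable donor: leave the value missing
            # Distance is computed only on the recipient's observed attributes.
            diff = X[donors][:, shared] - X[i, shared]
            dist = np.sqrt((diff ** 2).sum(axis=1))
            nearest = donors[np.argsort(dist, kind="stable")[:k]]
            out[i, j] = X[nearest, j].mean()
    return out


# Toy data (made up for illustration): rows 0 and 3 are complete cases.
toy = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 20.0, np.nan],
    [2.0, np.nan, np.nan],
    [9.0, 90.0, 900.0],
])
cc = knn_impute(toy, k=1)                                # cc[2, 1] is 10.0
ic = knn_impute(toy, k=1, allow_incomplete_donors=True)  # ic[2, 1] is 20.0
```

In the toy data, row 1 is incomplete but matches row 2 exactly on the first attribute. The complete-case rule cannot use it and falls back to row 0, while the incomplete-case rule can, which is the kind of gain in donor pool the abstract describes at higher missingness levels.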

Published In

Information Sciences: an International Journal Volume 259, Issue

February, 2014

611 pages

ISSN:0020-0255

Copyright © 2011 Elsevier Inc.

Publisher

Elsevier Science Inc.

United States
