A Bayesian Approach for Estimating and Replacing Missing Categorical Data

Published: 01 June 2009

Abstract

We propose a new approach for estimating and replacing missing categorical data. With this approach, the posterior probabilities of a missing attribute value belonging to a certain category are estimated using the simple Bayes method. Two alternative methods for replacing the missing value are proposed: The first replaces the missing value with the value having the estimated maximum probability; the second uses a value that is selected with probability proportional to the estimated posterior distribution. The effectiveness of the proposed approach is evaluated based on some important data quality measures for data warehousing and data mining. The results of the experimental study demonstrate the effectiveness of the proposed approach.
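The two replacement strategies in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`posterior`, `impute_max`, `impute_sampled`), the Laplace smoothing constant `alpha`, and the toy records are all assumptions made for illustration. The sketch estimates the posterior of the missing attribute with a simple (naive) Bayes model, then replaces the missing value either with the most probable category or with a category drawn in proportion to the posterior.

```python
import random
from collections import Counter


def posterior(records, target, query, alpha=1.0):
    """Estimate P(target = c | observed attributes in query) with simple Bayes.

    records: list of dicts used for estimation (rows missing `target` are skipped).
    target:  name of the attribute whose value is missing in `query`.
    query:   dict of attribute values for the record being completed.
    alpha:   Laplace smoothing constant (an assumption, not from the paper).
    """
    complete = [r for r in records if r.get(target) is not None]
    class_counts = Counter(r[target] for r in complete)
    scores = {}
    for category, count in class_counts.items():
        rows = [r for r in complete if r[target] == category]
        score = count / len(complete)  # prior P(category)
        for attr, value in query.items():
            if attr == target or value is None:
                continue  # skip the missing attribute and unobserved values
            domain = {r[attr] for r in complete if r.get(attr) is not None}
            matches = sum(1 for r in rows if r.get(attr) == value)
            # Smoothed estimate of P(attr = value | category)
            score *= (matches + alpha) / (len(rows) + alpha * len(domain))
        scores[category] = score
    total = sum(scores.values())
    return {category: s / total for category, s in scores.items()}


def impute_max(post):
    """First method: the category with the highest estimated posterior."""
    return max(post, key=post.get)


def impute_sampled(post, rng=random):
    """Second method: a category drawn with probability proportional to the posterior."""
    categories = list(post)
    weights = [post[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


# Hypothetical toy data: "play" is the attribute to be imputed.
records = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
]
post = posterior(records, "play", {"outlook": "sunny", "windy": "no", "play": None})
print(post)
print("maximum-probability replacement:", impute_max(post))
print("sampled replacement:", impute_sampled(post, random.Random(0)))
```

The maximum-probability rule is deterministic and always picks the single most likely category, whereas the sampled rule can return any category with nonzero posterior, so repeated imputations follow the estimated distribution rather than collapsing onto its mode.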

Published In

Journal of Data and Information Quality, Volume 1, Issue 1
June 2009
94 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/1515693
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Missing data
  2. data quality
  3. simple Bayes

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2023) Handling Missing Values in Information Systems Research. Information Systems Research 34:1 (5-26). DOI: 10.1287/isre.2022.1104. Online publication date: 1-Mar-2023
  • (2022) An Advisory Student Achievement Model Based on Data Mining Techniques. 2022 5th International Conference on Computing and Informatics (ICCI) (351-355). DOI: 10.1109/ICCI54321.2022.9756121. Online publication date: 9-Mar-2022
  • (2019) Improving Classification Quality in Uncertain Graphs. Journal of Data and Information Quality 11:1 (1-20). DOI: 10.1145/3242095. Online publication date: 4-Jan-2019
  • (2018) The Application of Last Observation Carried Forward Method for Missing Data Estimation in the Context of Industrial Wireless Sensor Networks. 2018 IEEE Asia-Pacific Conference on Antennas and Propagation (APCAP) (1-2). DOI: 10.1109/APCAP.2018.8538147. Online publication date: Aug-2018
  • (2017) A Comparison of Multiple Imputation Methods for Data with Missing Values. Indian Journal of Science and Technology 10:19 (1-7). DOI: 10.17485/ijst/2017/v10i19/110646. Online publication date: 29-Jun-2017
  • (2017) On-line imputation for missing values. 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) (1-5). DOI: 10.1109/CISP-BMEI.2017.8302315. Online publication date: Oct-2017
  • (2017) Cohesion based attribute value matching. 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) (1-5). DOI: 10.1109/CISP-BMEI.2017.8302312. Online publication date: Oct-2017
  • (2017) COSSET+: Crowdsourced Missing Value Imputation Optimized by Knowledge Base. Journal of Computer Science and Technology 32:5 (845-857). DOI: 10.1007/s11390-017-1768-1. Online publication date: 20-Sep-2017
  • (2016) Preserving Patient Privacy When Sharing Same-Disease Data. Journal of Data and Information Quality 7:4 (1-14). DOI: 10.1145/2956554. Online publication date: 6-Oct-2016
  • (2014) General table completion using a bayesian nonparametric model. Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1 (981-989). DOI: 10.5555/2968826.2968936. Online publication date: 8-Dec-2014
