Making generative classifiers robust to selection bias

Published: 12 August 2007
DOI: 10.1145/1281192.1281263

Abstract

This paper presents approaches to semi-supervised learning when the labeled training data and the test data are differently distributed. Specifically, the samples selected for labeling are a biased subset of some general distribution, and the test set consists of samples drawn from either that general distribution or the distribution of the unlabeled samples. An example of the former appears in loan application approval, where samples with repay/default labels exist only for approved applicants and the goal is to model the repay/default behavior of all applicants. An example of the latter appears in spam filtering, in which the labeled samples can be outdated due to the cost of labeling email by hand, but an unlabeled set of up-to-date emails exists and the goal is to build a filter to sort new incoming email.

Most approaches to overcoming such bias in the literature rely on the assumption that samples are selected for labeling depending only on the features, not the labels, a case in which provably correct methods exist. The missing labels are then said to be "missing at random" (MAR). In real applications, however, the selection bias can be more severe. When the MAR conditional independence assumption is not satisfied, the missing labels are said to be "missing not at random" (MNAR), and no learning method is provably always correct.

We present a generative classifier, the shifted mixture model (SMM), with separate representations of the distributions of the labeled samples and the unlabeled samples. The SMM makes no conditional independence assumptions and can model distributions of semi-labeled data sets with arbitrary bias in the labeling. We present a learning method based on the expectation-maximization (EM) algorithm that, while not always able to overcome arbitrary labeling bias, learns SMMs with higher test-set accuracy on real-world data sets (with MNAR bias) than existing learning methods that are proven to overcome MAR bias.
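For concreteness, write s for the indicator that a sample is selected for labeling. The MAR assumption discussed above is the conditional independence P(s = 1 | x, y) = P(s = 1 | x): selection may depend on the features x but not on the label y, while MNAR drops that restriction. The sketch below illustrates only the generic EM machinery for a generative mixture classifier fit to labeled plus unlabeled samples. It is a minimal baseline, not the paper's shifted mixture model (the SMM additionally keeps separate distribution representations for the labeled and the unlabeled samples so that MNAR bias can be modeled), and all function and variable names in it are hypothetical.

```python
# Minimal sketch: EM for a semi-supervised generative (Gaussian mixture)
# classifier. This is NOT the paper's shifted mixture model; the SMM also
# models the selection mechanism, which this baseline omits.
import numpy as np
from scipy.stats import multivariate_normal

def em_mixture_classifier(X_lab, y_lab, X_unl, n_iter=50):
    """Fit one Gaussian per class with EM, initializing from the labeled
    (possibly biased) subset and refining on the unlabeled pool."""
    classes = np.unique(y_lab)
    n_cls, d = len(classes), X_lab.shape[1]
    priors = np.array([np.mean(y_lab == c) for c in classes])
    means = np.array([X_lab[y_lab == c].mean(axis=0) for c in classes])
    covs = np.array([np.cov(X_lab[y_lab == c].T) + 1e-6 * np.eye(d)
                     for c in classes])
    hard = (y_lab[:, None] == classes[None, :]).astype(float)  # one-hot labels
    X_all = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        # E-step: soft class responsibilities for the unlabeled samples.
        lik = np.stack([priors[k] * multivariate_normal.pdf(X_unl, means[k], covs[k])
                        for k in range(n_cls)], axis=1)
        resp = lik / lik.sum(axis=1, keepdims=True)
        R = np.vstack([hard, resp])  # labeled points keep hard assignments
        # M-step: responsibility-weighted updates over all samples.
        Nk = R.sum(axis=0)
        priors = Nk / Nk.sum()
        means = (R.T @ X_all) / Nk[:, None]
        for k in range(n_cls):
            diff = X_all - means[k]
            covs[k] = (R[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return classes, priors, means, covs

def predict(X, classes, priors, means, covs):
    """Assign each sample to the class with the highest joint density."""
    lik = np.stack([priors[k] * multivariate_normal.pdf(X, means[k], covs[k])
                    for k in range(len(classes))], axis=1)
    return classes[np.argmax(lik, axis=1)]
```

Under MAR bias this generic fit can already help, since the unlabeled samples pull the class-conditional estimates toward the general distribution; under MNAR bias it can converge to a biased solution, which is the gap the SMM is designed to close.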



Published In

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN: 9781595936097
DOI: 10.1145/1281192

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. generative classifiers
  2. reject inference
  3. sample selection bias
  4. semi-supervised learning


Conference

KDD07

Acceptance Rates

KDD '07 Paper Acceptance Rate 111 of 573 submissions, 19%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%


