PAC-Bayesian Stochastic Model Selection

David A. McAllester¹

2648 Accesses
106 Citations
Explore all metrics

Abstract

PAC-Bayesian learning methods combine the informative priors of Bayesian methods with distribution-free PAC guarantees. Stochastic model selection predicts a class label by stochastically sampling a classifier according to a “posterior distribution” on classifiers. This paper gives a PAC-Bayesian performance guarantee for stochastic model selection that is superior to analogous guarantees for deterministic model selection. The guarantee is stated in terms of the training error of the stochastic classifier and the KL-divergence of the posterior from the prior. It is shown that the posterior optimizing the performance guarantee is a Gibbs distribution. Simpler posterior distributions are also derived that have nearly optimal performance guarantees.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

References

Barron, A. R. (1991). Complexity regularization with application to artificial neural networks. In G. Roussas (Ed.), Nonparametric functional estimation and related topics (pp. 561–576). Dordrecht: Kluwer Academic Publishers.
Google Scholar
Barron, A. R.,& Cover, T. M. (1991). Minimum complexity density estimation. IEEE Transactions on Information Theory, 37, 1034–1054.
Google Scholar
Buntine, W. (1992). Learning classification trees. Statistics and Computing, 2, 63–73.
Google Scholar
Catoni, O. (To appear a) Gibbs estimators. Probability Theory and Related Fields.
Catoni, O. (To appear b) Universal aggregation rules with sharp oracle inequalities. Annals of Statistics.
Cesa-Bianchi, N., Freund, Y., Helmbold, D. P., Haussler, D., Schapire, R. E., & Warmuth, M. K. (1997). How to use expert advice. JACM, 44:3, 427–485.
Google Scholar
Freund, Y., & Schapire, R. E. (1999). Adaptive game playing using multiplicative weights. Games and Economic Behavior, 79–103.
Freund, Y., Schapire, R. E., Singer, Y., & Warmuth, M. K. (1997). Using and combining predictors that specialize. In COLT-97.
Helmbold, D. P.,& Schapire, R. E. (1997). Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27:1, 51–68.
Google Scholar
Kearns, M., & Mansour, Y. (1998). A fast, bottom-up decision tree pruning algorithm with near-optimal generalization. In Proceedings of the 15th International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.
Google Scholar
Kearns, M., Mansour, Y., Ng, A., & Ron, D. (1995). An experimental and theoretical comparison of model selection methods. In Proceedings of the Eighth ACMConference on Computational Learning Theory (pp. 21–30). New York: ACM Press.
Google Scholar
Linial, N., Mansour, Y., & Rivest, R. (1991). Results on learnability and the Vapnik-Chervonenkis dimension. Information and Computation, 90, 33–49.
Google Scholar
Littlestone, N., & Warmuth, M. (1994). The weighted majority algorithm. Information and Computation, 108:2, 212–261. An extended abstract appeared in COLT-89.
Google Scholar
Lugosi, G., & Zeger, K. (1996). Concept learning using complexity regularization. IEEE Transactions on Information Theory, 42, 48–54.
Google Scholar
McAllester, D. (1998). Some pac-bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (pp. 230–234).
McAllester, D. (1999). Pac-bayesian model averaging. In COLT-99.
Oliver, J. J.,& Hand, D. (1995).Onpruning and averaging decision trees. In Proceedings of the Twelfth International Conference on Machine Learning.
Pereira, F. C., & Singer, Y. (1997). An efficient extension to mixture techniques for prediction and decision trees. In Proceedings of the Tenth Annual Conference on Computational Learning Theory (pp. 114–121). Long version to appear in Machine Learning.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
Google Scholar
Shawe-Taylor, J., & Williamson, R. (1997). A pac analysis of a bayesian estimator. In Proceedings of the Tenth Annual Conference on Computational Learning Theory. New York: ACM Press.
Google Scholar
Yamanishi, K. (1992). Learning non-parametric densities in tyerms of finite-dimensional parametric hypotheses. IEICE Trans. Inf. and Syst., E75-D(4).
Yang, Y. (2000a). Adaptive estimation in pattern recognition by combining different procedures. Statistica Sinica, 10, 1069–1089.
Google Scholar
Yang, Y. (2000b). Mixing strategies for density estimation. Annals of Statistics, 28, 75–87.
Google Scholar

Download references

Author information

Authors and Affiliations

AT&T Shannon Labs, 180 Park Avenue, Florham Park, NJ, 07932-0971, USA
David A. McAllester

Authors

David A. McAllester
View author publications
You can also search for this author in PubMed Google Scholar

Rights and permissions

Reprints and permissions

About this article

Cite this article

McAllester, D.A. PAC-Bayesian Stochastic Model Selection. Machine Learning 51, 5–21 (2003). https://doi.org/10.1023/A:1021840411064

Download citation

Issue Date: April 2003
DOI: https://doi.org/10.1023/A:1021840411064

PAC-Bayesian Stochastic Model Selection

Abstract

Article PDF

Similar content being viewed by others

Stochastic feature mapping for PAC-Bayes classification

SCOD: From Heuristics to Theory

Efficient parameter learning of Bayesian network classifiers

References

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Navigation

PAC-Bayesian Stochastic Model Selection

Abstract

Article PDF

Similar content being viewed by others

Stochastic feature mapping for PAC-Bayes classification

SCOD: From Heuristics to Theory

Efficient parameter learning of Bayesian network classifiers

Explore related subjects

References

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation