Abstract
PAC-Bayesian learning methods combine the informative priors of Bayesian methods with distribution-free PAC guarantees. Stochastic model selection predicts a class label by stochastically sampling a classifier according to a “posterior distribution” on classifiers. This paper gives a PAC-Bayesian performance guarantee for stochastic model selection that is superior to analogous guarantees for deterministic model selection. The guarantee is stated in terms of the training error of the stochastic classifier and the KL-divergence of the posterior from the prior. It is shown that the posterior optimizing the performance guarantee is a Gibbs distribution. Simpler posterior distributions are also derived that have nearly optimal performance guarantees.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Barron, A. R. (1991). Complexity regularization with application to artificial neural networks. In G. Roussas (Ed.), Nonparametric functional estimation and related topics (pp. 561–576). Dordrecht: Kluwer Academic Publishers.
Barron, A. R.,& Cover, T. M. (1991). Minimum complexity density estimation. IEEE Transactions on Information Theory, 37, 1034–1054.
Buntine, W. (1992). Learning classification trees. Statistics and Computing, 2, 63–73.
Catoni, O. (To appear a) Gibbs estimators. Probability Theory and Related Fields.
Catoni, O. (To appear b) Universal aggregation rules with sharp oracle inequalities. Annals of Statistics.
Cesa-Bianchi, N., Freund, Y., Helmbold, D. P., Haussler, D., Schapire, R. E., & Warmuth, M. K. (1997). How to use expert advice. JACM, 44:3, 427–485.
Freund, Y., & Schapire, R. E. (1999). Adaptive game playing using multiplicative weights. Games and Economic Behavior, 79–103.
Freund, Y., Schapire, R. E., Singer, Y., & Warmuth, M. K. (1997). Using and combining predictors that specialize. In COLT-97.
Helmbold, D. P.,& Schapire, R. E. (1997). Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27:1, 51–68.
Kearns, M., & Mansour, Y. (1998). A fast, bottom-up decision tree pruning algorithm with near-optimal generalization. In Proceedings of the 15th International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.
Kearns, M., Mansour, Y., Ng, A., & Ron, D. (1995). An experimental and theoretical comparison of model selection methods. In Proceedings of the Eighth ACMConference on Computational Learning Theory (pp. 21–30). New York: ACM Press.
Linial, N., Mansour, Y., & Rivest, R. (1991). Results on learnability and the Vapnik-Chervonenkis dimension. Information and Computation, 90, 33–49.
Littlestone, N., & Warmuth, M. (1994). The weighted majority algorithm. Information and Computation, 108:2, 212–261. An extended abstract appeared in COLT-89.
Lugosi, G., & Zeger, K. (1996). Concept learning using complexity regularization. IEEE Transactions on Information Theory, 42, 48–54.
McAllester, D. (1998). Some pac-bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (pp. 230–234).
McAllester, D. (1999). Pac-bayesian model averaging. In COLT-99.
Oliver, J. J.,& Hand, D. (1995).Onpruning and averaging decision trees. In Proceedings of the Twelfth International Conference on Machine Learning.
Pereira, F. C., & Singer, Y. (1997). An efficient extension to mixture techniques for prediction and decision trees. In Proceedings of the Tenth Annual Conference on Computational Learning Theory (pp. 114–121). Long version to appear in Machine Learning.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
Shawe-Taylor, J., & Williamson, R. (1997). A pac analysis of a bayesian estimator. In Proceedings of the Tenth Annual Conference on Computational Learning Theory. New York: ACM Press.
Yamanishi, K. (1992). Learning non-parametric densities in tyerms of finite-dimensional parametric hypotheses. IEICE Trans. Inf. and Syst., E75-D(4).
Yang, Y. (2000a). Adaptive estimation in pattern recognition by combining different procedures. Statistica Sinica, 10, 1069–1089.
Yang, Y. (2000b). Mixing strategies for density estimation. Annals of Statistics, 28, 75–87.
Author information
Authors and Affiliations
Rights and permissions
About this article
Cite this article
McAllester, D.A. PAC-Bayesian Stochastic Model Selection. Machine Learning 51, 5–21 (2003). https://doi.org/10.1023/A:1021840411064
Issue Date:
DOI: https://doi.org/10.1023/A:1021840411064