Abstract
We consider on-line density estimation with a parameterized density from the exponential family. The on-line algorithm receives one example at a time and maintains a parameter that is essentially an average of the past examples. After receiving an example the algorithm incurs a loss, which is the negative log-likelihood of the example with respect to the current parameter of the algorithm. An off-line algorithm can choose the best parameter based on all the examples. We prove bounds on the additional total loss of the on-line algorithm over the total loss of the best off-line parameter. These relative loss bounds hold for an arbitrary sequence of examples. The goal is to design algorithms with the best possible relative loss bounds. We use a Bregman divergence to derive and analyze each algorithm. These divergences are relative entropies between two distributions from the exponential family. We also use our methods to prove relative loss bounds for linear regression.
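To make the protocol concrete, the following is a minimal illustrative sketch, not the paper's algorithm or analysis. It instantiates the on-line/off-line comparison for one assumed exponential-family member, a one-dimensional Gaussian with unit variance, so the parameter is the mean, the on-line parameter is a running average of the past examples, and the initial guess of 0 is an arbitrary choice made for the example. The quantity returned is the relative loss that the paper's bounds control for arbitrary sequences.

```python
import numpy as np

def relative_loss_gaussian(xs):
    """Illustrative sketch only: on-line density estimation with a
    unit-variance Gaussian.  The on-line learner predicts each example
    with the running mean of the past examples; the off-line comparator
    uses the single best mean chosen in hindsight."""
    xs = np.asarray(xs, dtype=float)

    def nll(x, mu):
        # negative log-likelihood of x under N(mu, 1)
        return 0.5 * (x - mu) ** 2 + 0.5 * np.log(2 * np.pi)

    online_loss = 0.0
    mean_so_far = 0.0                      # assumed initial parameter before any data
    for t, x in enumerate(xs, start=1):
        online_loss += nll(x, mean_so_far)     # loss on the new example
        mean_so_far += (x - mean_so_far) / t   # fold the example into the running mean

    best_offline_mu = xs.mean()                 # best fixed parameter in hindsight
    offline_loss = nll(xs, best_offline_mu).sum()
    return online_loss - offline_loss           # additional (relative) total loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(relative_loss_gaussian(rng.normal(1.0, 1.0, size=100)))
```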
Cite this article
Azoury, K.S., Warmuth, M.K. Relative Loss Bounds for On-Line Density Estimation with the Exponential Family of Distributions. Machine Learning 43, 211–246 (2001). https://doi.org/10.1023/A:1010896012157