Abstract
We consider on-line density estimation with a parameterized density from the exponential family. The on-line algorithm receives one example at a time and maintains a parameter that is essentially an average of the past examples. After receiving an example the algorithm incurs a loss, which is the negative log-likelihood of the example with respect to the current parameter of the algorithm. An off-line algorithm can choose the best parameter based on all the examples. We prove bounds on the additional total loss of the on-line algorithm over the total loss of the best off-line parameter. These relative loss bounds hold for an arbitrary sequence of examples. The goal is to design algorithms with the best possible relative loss bounds. We use a Bregman divergence to derive and analyze each algorithm. These divergences are relative entropies between two distributions from the exponential family. We also use our methods to prove relative loss bounds for linear regression.
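To make the protocol concrete, the following is a minimal illustrative sketch, not the paper's algorithm or analysis. It instantiates the on-line/off-line comparison for one assumed exponential-family member, a one-dimensional Gaussian with unit variance, so the parameter is the mean, the on-line parameter is a running average of the past examples, and the initial guess of 0 is an arbitrary choice made for the example. The quantity returned is the relative loss that the paper's bounds control for arbitrary sequences.

```python
import numpy as np

def relative_loss_gaussian(xs):
    """Illustrative sketch only: on-line density estimation with a
    unit-variance Gaussian.  The on-line learner predicts each example
    with the running mean of the past examples; the off-line comparator
    uses the single best mean chosen in hindsight."""
    xs = np.asarray(xs, dtype=float)

    def nll(x, mu):
        # negative log-likelihood of x under N(mu, 1)
        return 0.5 * (x - mu) ** 2 + 0.5 * np.log(2 * np.pi)

    online_loss = 0.0
    mean_so_far = 0.0                      # assumed initial parameter before any data
    for t, x in enumerate(xs, start=1):
        online_loss += nll(x, mean_so_far)     # loss on the new example
        mean_so_far += (x - mean_so_far) / t   # fold the example into the running mean

    best_offline_mu = xs.mean()                 # best fixed parameter in hindsight
    offline_loss = nll(xs, best_offline_mu).sum()
    return online_loss - offline_loss           # additional (relative) total loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(relative_loss_gaussian(rng.normal(1.0, 1.0, size=100)))
```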
Cite this article
Azoury, K.S., Warmuth, M.K. Relative Loss Bounds for On-Line Density Estimation with the Exponential Family of Distributions. Machine Learning 43, 211–246 (2001). https://doi.org/10.1023/A:1010896012157