A Bayesian/Information Theoretic Model of Learning to Learn via Multiple Task Sampling

Jonathan Baxter^1,2


Abstract

A Bayesian model of learning to learn by sampling from multiple tasks is presented. The multiple tasks are themselves generated by sampling from a distribution over an environment of related tasks. Such an environment is shown to be naturally modelled within a Bayesian context by the concept of an objective prior distribution. It is argued that for many common machine learning problems, although in general we do not know the true (objective) prior for the problem, we do have some idea of a set of possible priors to which the true prior belongs. It is shown that under these circumstances a learner can use Bayesian inference to learn the true prior by learning sufficiently many tasks from the environment. In addition, bounds are given on the amount of information required to learn a task when it is simultaneously learnt with several other tasks. The bounds show that if the learner has little knowledge of the true prior, but the dimensionality of the true prior is small, then sampling multiple tasks is highly advantageous. The theory is applied to the problem of learning a common feature set, or equivalently a low-dimensional representation (LDR), for an environment of related tasks.
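As a rough illustration of the idea of learning the true prior from many tasks, the following Python sketch may help. It is not taken from the paper. It assumes a toy environment in which each task is a coin with bias theta drawn from a Beta prior, the learner knows only that the true prior is one of three candidates, and each task supplies m coin flips. The names CANDIDATES, TRUE_PRIOR, log_marginal, posterior_over_priors and sample_tasks are all hypothetical, as is the uniform hyper-prior over the candidates.

```python
import math
import random


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal(k, m, a, b):
    """Log Beta-Binomial probability of k heads in m flips under a Beta(a, b) prior."""
    log_choose = math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
    return log_choose + log_beta(k + a, m - k + b) - log_beta(a, b)


def posterior_over_priors(head_counts, m, candidates):
    """Posterior over candidate priors, starting from a uniform hyper-prior.

    Each task contributes one marginal-likelihood factor, so evidence about
    the prior accumulates with the number of tasks sampled.
    """
    log_post = [
        sum(log_marginal(k, m, a, b) for k in head_counts) for a, b in candidates
    ]
    top = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def sample_tasks(n_tasks, m, true_prior, rng):
    """Draw n_tasks coin biases from the true prior and flip each coin m times."""
    a, b = true_prior
    counts = []
    for _ in range(n_tasks):
        theta = rng.betavariate(a, b)
        counts.append(sum(rng.random() < theta for _ in range(m)))
    return counts


# Assumed toy set of possible priors; the last one plays the role of the true prior.
CANDIDATES = [(2.0, 8.0), (5.0, 5.0), (8.0, 2.0)]
TRUE_PRIOR = CANDIDATES[2]

if __name__ == "__main__":
    rng = random.Random(0)
    for n_tasks in (1, 5, 50):
        counts = sample_tasks(n_tasks, m=10, true_prior=TRUE_PRIOR, rng=rng)
        post = posterior_over_priors(counts, 10, CANDIDATES)
        print(n_tasks, [round(p, 3) for p in post])
```

Under these assumptions the posterior mass on the true candidate should tend to grow as more tasks are sampled, which is the qualitative behaviour the abstract describes. The sketch says nothing about the paper's information-theoretic bounds.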



Author information

Authors and Affiliations

Department of Mathematics, London School of Economics, UK
Jonathan Baxter
Department of Computer Science, Royal Holloway College, University of London, UK
Jonathan Baxter


About this article

Cite this article

Baxter, J. A Bayesian/Information Theoretic Model of Learning to Learn via Multiple Task Sampling. Machine Learning 28, 7–39 (1997). https://doi.org/10.1023/A:1007327622663

Issue Date: July 1997
DOI: https://doi.org/10.1023/A:1007327622663
