Nothing Special

Andreas Kirsch

Bio & research interests

Hi there! 👋 I’m Andreas Kirsch.

I’m currently a Research Scientist at Google DeepMind on the Deep Learning Engineering team. Before this, I was a Research Scientist at Midjourney for a year.

I obtained my PhD (“DPhil”) with Prof Yarin Gal in the OATML group at the University of Oxford as a student in the AIMS CDT program. You can reach me via email or anonymously via admonymous.

During my DPhil, my interests were in information theory and its applications: information bottlenecks and active learning using Bayesian deep learning, and uncertainty quantification. I also enjoyed thinking about AI ethics and AI safety: in particular, the ML safety course by the Center for AI Safety was a lot of fun.

My thesis focuses on data subset selection: “Advancing Deep Active Learning & Data Subset Selection: Unifying Principles with Information-Theory Intuitions”.

Originally from Romania, I grew up in Southern Germany. After studying Computer Science and Mathematics at the Technical University of Munich (among other things, reading machine learning under Jürgen Schmidhuber 🎉), I spent a couple of years in Zurich as a software engineer at Google (YouTube Monetization) and worked as a performance research engineer at DeepMind for a year in 2016/17 before spending a gap year as a fellow at Newspeak House. I began my DPhil in September 2018 and submitted my thesis in April 2023.

selected publications

  1. ICLR Blogpost ’24
    Bayesian Model Selection: The Marginal Likelihood, Cross-Validation, and Conditional Log Marginal Likelihood
    Kirsch, Andreas
    In The Third Blogpost Track at ICLR 2024
  2. ICLR Blogpost ’24
    Highlight
    Bridging the Data Processing Inequality and Function-Space Variational Inference
    Kirsch, Andreas
    In The Third Blogpost Track at ICLR 2024
  3. PhD Thesis
    Advancing Deep Active Learning & Data Subset Selection: Unifying Principles with Information-Theory Intuitions
    Kirsch, Andreas
    2023
  4. TMLR
    Black-Box Batch Active Learning for Regression
    Kirsch, Andreas
    Transactions on Machine Learning Research 2023
  5. CVPR 2023
    Highlight
    Deterministic Neural Networks with Appropriate Inductive Biases Capture Epistemic and Aleatoric Uncertainty
    Mukhoti*, Jishnu, Kirsch*, Andreas, van Amersfoort, Joost, Torr, Philip H.S., and Gal, Yarin
    Conference on Computer Vision and Pattern Recognition 2023
  6. AISTATS 2023
    Prediction-Oriented Bayesian Active Learning
    Bickford Smith*, Freddie, Kirsch*, Andreas, Farquhar, Sebastian, Gal, Yarin, Foster, Adam, and Rainforth, Tom
    26th International Conference on Artificial Intelligence and Statistics 2023
  7. TMLR
    Repro. Cert.
    Does ‘Deep Learning on a Data Diet’ reproduce? Overall yes, but GraNd at Initialization does not
    Kirsch, Andreas
    Transactions on Machine Learning Research (Reproducibility Certification) 2023
  8. TMLR
    Unifying Approaches in Active Learning and Active Sampling via Fisher Information and Information-Theoretic Quantities
    Kirsch, Andreas, and Gal, Yarin
    Transactions on Machine Learning Research 2022
  9. ICML 2022
    Prioritized Training on Points that are Learnable, Worth Learning, and not yet Learnt
    Mindermann*, Sören, Brauner*, Jan M, Razzak*, Muhammed T, Sharma*, Mrinank, Kirsch, Andreas, Xu, Winnie, Höltgen, Benedikt, Gomez, Aidan N, Morisot, Adrien, Farquhar, Sebastian, and Gal, Yarin
    In Proceedings of the 39th International Conference on Machine Learning 2022
  10. UDL 2020
    Learning CIFAR-10 with a Simple Entropy Estimator Using Information Bottleneck Objectives
    Kirsch, Andreas, Lyle, Clare, and Gal, Yarin
    In Uncertainty & Robustness in Deep Learning at Int. Conf. on Machine Learning (ICML Workshop) 2020
  11. Preprint
    Unpacking Information Bottlenecks: Unifying Information-Theoretic Objectives in Deep Learning
    Kirsch, Andreas, Lyle, Clare, and Gal, Yarin
    arXiv Preprint 2020
  12. NeurIPS 2019
    BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning
    Kirsch*, Andreas, van Amersfoort*, Joost, and Gal, Yarin
    In Advances in Neural Information Processing Systems 2019

news

Jun 20, 2024

Another year, another update: I have published two blog posts in the ICLR 2024 Blogpost Track while working at Midjourney. I’m grateful for the opportunity to work on this open research on the side.

In particular, one of the blog posts was selected as a Highlight of the blog post track:

  • “Bridging the Data Processing Inequality and Function-Space Variational Inference” Highlight (Kirsch, 2024)

  • “Bayesian Model Selection: The Marginal Likelihood, Cross-Validation, and Conditional Log Marginal Likelihood” (Kirsch, 2024)

Both blog posts are available on the ICLR 2024 Blogpost Track website.

Dec 1, 2023

Another year, another set of papers. This year was dominated by writing up and defending my thesis. I’m also very happy to have joined Midjourney as a Research Scientist at the end of September. Thus, this year was mostly about wrapping up some loose ends into papers:

  1. “Does ‘Deep Learning on a Data Diet’ reproduce? Overall yes, but GraNd at Initialization does not” (Kirsch, 2023)
  2. “Black-Box Batch Active Learning for Regression” (Kirsch, 2023)
  3. “Stochastic Batch Acquisition: A Simple Baseline for Deep Active Learning” (Kirsch et al., 2023)

And finally my thesis: “Advancing Deep Active Learning & Data Subset Selection: Unifying Principles with Information-Theory Intuitions” (Kirsch, 2023)

Dec 1, 2022

Very happy to have published a few papers at TMLR (and co-authored one presented at ICML) this year, and to have co-authored papers, some as joint first author, that we will present at CVPR and AISTATS next year:

  1. “Prioritized Training on Points that are Learnable, Worth Learning, and not yet Learnt”, ICML 2022 (Mindermann* et al., 2022)
  2. “A Note on ‘Assessing Generalization of SGD via Disagreement’”, TMLR (Kirsch & Gal, 2022)
  3. “Unifying Approaches in Active Learning and Active Sampling via Fisher Information and Information-Theoretic Quantities”, TMLR (Kirsch & Gal, 2022)
  4. “Deterministic Neural Networks with Appropriate Inductive Biases Capture Epistemic and Aleatoric Uncertainty”, CVPR 2023 (Highlight) (Mukhoti* et al., 2023)
  5. “Prediction-Oriented Bayesian Active Learning”, AISTATS 2023 (Bickford Smith* et al., 2023)

Several of these can be traced to workshop papers, which we were able to expand and polish into full papers.

Jul 24, 2021

Seven workshop papers at ICML 2021 (five of which are first-author submissions):

Uncertainty & Robustness in Deep Learning

Two papers and posters at the Uncertainty & Robustness in Deep Learning workshop.

SubSetML: Subset Selection in Machine Learning: From Theory to Practice

Four papers (posters, one spotlight) at the SubSetML: Subset Selection in Machine Learning: From Theory to Practice workshop.

Neglected Assumptions In Causal Inference

One paper (poster) at the Neglected Assumptions In Causal Inference workshop.

Feb 23, 2021

Lecture on “Bayesian Deep Learning, Information Theory and Active Learning” for Oxford Global Exchanges. You can download the slides here.

Feb 21, 2021

Deterministic Neural Networks with Appropriate Inductive Biases Capture Epistemic and Aleatoric Uncertainty has been uploaded to arXiv as a preprint. Joint work with Jishnu Mukhoti, together with Joost van Amersfoort, Philip H.S. Torr, and Yarin Gal. We show that a single softmax neural net with minimal changes can beat the uncertainty predictions of Deep Ensembles and other more complex single-forward-pass uncertainty approaches.

Dec 10, 2020

Unpacking Information Bottlenecks: Unifying Information-Theoretic Objectives in Deep Learning was also presented as a poster at the “NeurIPS Europe meetup on Bayesian Deep Learning”.

The poster is available as an image and as a PDF download.

Jul 17, 2020

Two workshop papers have been accepted to the Uncertainty & Robustness in Deep Learning workshop at ICML 2020:

  1. Scalable Training with Information Bottleneck Objectives, and
  2. Learning CIFAR-10 with a Simple Entropy Estimator Using Information Bottleneck Objectives

both together with Clare Lyle and Yarin Gal. The former is based on Unpacking Information Bottlenecks: Unifying Information-Theoretic Objectives in Deep Learning, and the latter is an application of the UIB framework: we can use it to train models that perform well on CIFAR-10 without using a cross-entropy loss at all.

Mar 27, 2020

Unpacking Information Bottlenecks: Unifying Information-Theoretic Objectives in Deep Learning, together with Clare Lyle and Yarin Gal, has been uploaded to arXiv as a preprint. It examines and unifies different Information Bottleneck objectives and shows that we can introduce simple yet effective surrogate objectives without complex derivations.

Sep 4, 2019

BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning got accepted into NeurIPS 2019. See you all in Vancouver!

Follow me on Twitter @blackhc