Abstract
Gaussian variational inference and the Laplace approximation are popular alternatives to Markov chain Monte Carlo that formulate Bayesian posterior inference as an optimization problem, enabling the use of simple and scalable stochastic optimization algorithms. However, a key limitation of both methods is that the solution to the optimization problem is typically not tractable to compute; even in simple settings, the problem is nonconvex. Thus, recently developed statistical guarantees—which all involve the (data) asymptotic properties of the global optimum—are not reliably obtained in practice. In this work, we provide two major contributions: a theoretical analysis of the asymptotic convexity properties of variational inference with a Gaussian family and the maximum a posteriori (MAP) problem required by the Laplace approximation, and two algorithms—consistent Laplace approximation (CLA) and consistent stochastic variational inference (CSVI)—that exploit these properties to find the optimal approximation in the asymptotic regime. Both CLA and CSVI involve a tractable initialization procedure that finds the local basin of the optimum, and CSVI further includes a scaled gradient descent algorithm that provably stays locally confined to that basin. Experiments on nonconvex synthetic and real-data examples show that compared with standard variational and Laplace approximations, both CSVI and CLA improve the likelihood of obtaining the global optimum of their respective optimization problems.
Notes
There are many other properties one might require of a tractable optimization problem, e.g. pseudoconvexity (Crouzeix and Ferland 1982), quasiconvexity (Arrow and Enthoven 1961), or invexity (Ben-Israel and Mond 1986). We focus on convexity as it does not impose overly stringent assumptions on our theory and has stronger implications than each of the aforementioned conditions.
Code for the experiments is available at https://github.com/zuhengxu/Consistent-Stochastic-Variational-Inference.
Available online at http://www.stat.cmu.edu/~ryantibs/statcomp/data/pros.dat.
This dataset contains redshift measurements for 4215 galaxies in the Shapley concentration region and was generously made available by Michael Drinkwater, University of Queensland; it can be downloaded from https://astrostatistics.psu.edu/datasets/Shapley_galaxy.html.
Although the bound in Eq. (47) is loose, the order described by the bound is actually tight. A more detailed analysis can be obtained by approximating the summation with an integral, which yields the same order.
References
Addis, B., Locatelli, M., Schoen, F.: Local optima smoothing for global optimization. Optim. Methods Softw. 20(4–5), 417–437 (2005)
Alquier, P., Ridgway, J.: Concentration of tempered posteriors and of their variational approximations. Ann. Stat. 48(3), 1475–1497 (2020)
Armijo, L.: Minimization of functions having Lipschitz continuous first partial derivatives. Pac. J. Math. 16(1), 1–3 (1966)
Arrow, K., Enthoven, A.: Quasi-concave programming. Econom. J. Econom. Soc. 29, 779–800 (1961)
Balakrishnan, S., Wainwright, M., Yu, B.: Statistical guarantees for the EM algorithm: from population to sample-based analysis. Ann. Stat. 45(1), 77–120 (2017)
Barber, R.F., Drton, M., Tan, K.M.: Laplace approximation in high-dimensional Bayesian regression. In: Barber, R.F., Drton, M., Tan, K.M. (eds.) Statistical Analysis for High-Dimensional Data, pp. 15–36. Springer, Berlin (2016)
Bassett, R., Deride, J.: Maximum a posteriori estimators as a limit of Bayes estimators. Math. Program. 174(1–2), 129–144 (2019)
Bauschke, H., Combettes, P.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, Berlin (2011)
Ben-Israel, A., Mond, B.: What is invexity? ANZIAM J. 28(1), 1–9 (1986)
Bertsekas, D., Tsitsiklis, J.: Gradient convergence in gradient methods with errors. SIAM J. Optim. 10(3), 627–642 (2000)
Bishop, C., Nasrabadi, N.: Pattern Recognition and Machine Learning. Springer, Berlin (2006)
Blei, D., Kucukelbir, A., McAuliffe, J.: Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112(518), 859–877 (2017)
Bottou, L.: Stochastic learning. In: Bousquet, O., von Luxburg, U., Rätsch, G. (eds.) Advanced Lectures on Machine Learning: ML Summer Schools 2003, pp. 146–168. Springer, Berlin (2004)
Boucheron, S., Lugosi, G., Massart, P.: Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford (2013)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Bubeck, S.: Convex Optimization: Algorithms and Complexity. Now Publishers Inc, Delft (2015)
Campbell, T., Li, X.: Universal boosting variational inference. In: Advances in Neural Information Processing Systems (2019)
Craven, B., Glover, B.: Invex functions and duality. J. Austral. Math. Soc. 39(1), 1–20 (1985)
Crouzeix, J.P., Ferland, J.: Criteria for quasi-convexity and pseudo-convexity: relationships and comparisons. Math. Program. 23(1), 193–205 (1982)
Csiszár, I.: Information-type measures of difference of probability distributions and indirect observation. Studia Sci. Math. Hung. 2, 229–318 (1967)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(7), 2121–2159 (2011)
Folland, G.: Real Analysis: Modern Techniques and Their Applications. Wiley, New York (1999)
Forsyth, D., Ponce, J.: Computer Vision: A Modern Approach. Prentice Hall Professional Technical Reference, London (2002)
Gelfand, A., Smith, A.: Sampling-based approaches to calculating marginal densities. J. Am. Stat. Assoc. 85(410), 398–409 (1990)
George, E., McCulloch, R.: Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 88(423), 881–889 (1993)
Ghosal, S., Ghosh, J., van der Vaart, A.: Convergence rates of posterior distributions. Ann. Stat. 28(2), 500–531 (2000)
Gibbs, A., Su, F.E.: On choosing and bounding probability metrics. Int. Stat. Rev. 70(3), 419–435 (2002)
Grendár, M., Judge, G.: Asymptotic equivalence of empirical likelihood and Bayesian MAP. Ann. Stat. 37, 2445–2457 (2009)
Guo, F., Wang, X., Fan, K., Broderick, T., Dunson, D.: Boosting variational inference. In: Advances in Neural Information Processing Systems (2016)
Haddad, R., Akansu, A.: A class of fast Gaussian binomial filters for speech and image processing. IEEE Trans. Signal Process. 39(3), 723–727 (1991)
Hall, P., Pham, T., Wand, M., Wang, S.S.: Asymptotic normality and valid inference for Gaussian variational approximation. Ann. Stat. 39(5), 2502–2532 (2011)
Han, W., Yang, Y.: Statistical inference in mean-field variational Bayes. arXiv:1911.01525 (2019)
Hastings, W.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1), 97–109 (1970)
Hoffman, M., Blei, D., Wang, C., Paisley, J.: Stochastic variational inference. J. Mach. Learn. Res. 14(1), 1303–1347 (2013)
Huggins, J., Kasprzak, M., Campbell, T., Broderick, T.: Validated variational inference via practical posterior error bounds. In: International Conference on Artificial Intelligence and Statistics (2020)
Ionides, E.: Truncated importance sampling. J. Comput. Graph. Stat. 17(2), 295–311 (2008)
Jaiswal, P., Rao, V., Honnappa, H.: Asymptotic consistency of \(\alpha \)-Rényi-approximate posteriors. arXiv:1902.01902 (2019)
Jennrich, R.: Asymptotic properties of non-linear least squares estimators. Ann. Math. Stat. 40(2), 633–643 (1969)
Jordan, M., Ghahramani, Z., Jaakkola, T., Saul, L.: An introduction to variational methods for graphical models. In: Learning in Graphical Models, pp. 105–161. Springer, Berlin (1998)
Kass, R.: The validity of posterior expansions based on Laplace’s method. In: Geisser, S., Hodges, J.S. (eds.) Bayesian and Likelihood Methods in Statistics and Econometrics, pp. 473–487. Elsevier, Amsterdam (1990)
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)
Kingma, D., Welling, M.: Auto-encoding variational Bayes. In: International Conference on Learning Representations (2014)
Kleijn, B.: Bayesian asymptotics under misspecification. PhD thesis, Vrije Universiteit Amsterdam (2004)
Kleijn, B., van der Vaart, A.: The Bernstein–von-Mises theorem under misspecification. Electron. J. Stat. 6, 354–381 (2012)
Kontorovich, A.: Concentration in unbounded metric spaces and algorithmic stability. In: International Conference on Machine Learning (2014)
Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., Blei, D.: Automatic differentiation variational inference. J. Mach. Learn. Res. 18(1), 430–474 (2017)
LeCam, L.: Locally Asymptotically Normal Families of Distributions. University of California Press, Berkeley (1960)
Liese, F., Vajda, I.: Convex Statistical Distances. Teubner, Stuttgart (1987)
Lindeberg, T.: Scale-space for discrete signals. IEEE Trans. Pattern Anal. Mach. Intell. 12(3), 234–254 (1990)
Locatello, F., Dresdner, G., Khanna, R., Valera, I., Rätsch, G.: Boosting black box variational inference. In: Advances in Neural Information Processing Systems (2018)
Meyn, S., Tweedie, R.: Markov Chains and Stochastic Stability. Springer, Berlin (2012)
Miller, J.: Asymptotic normality, concentration, and coverage of generalized posteriors. J. Mach. Learn. Res. 22(168), 1–53 (2021)
Miller, A., Foti, N., Adams, R.: Variational boosting: iteratively refining posterior approximations. In: International Conference on Machine Learning (2017)
Mobahi, H.: Optimization by Gaussian smoothing with application to geometric alignment. PhD thesis, University of Illinois at Urbana-Champaign (2013)
Murphy, K.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2012)
Nesterov, Y.: A method of solving a convex programming problem with convergence rate \({O}(1/k^2)\). In: Doklady Akademii Nauk (1983)
Nixon, M., Aguado, A.: Feature Extraction and Image Processing for Computer Vision. Academic Press, London (2012)
Pinheiro, J., Bates, D.: Unconstrained parametrizations for variance-covariance matrices. Stat. Comput. 6, 289–296 (1996)
Qiao, Y., Minematsu, N.: A study on invariance of \( f \)-divergence and its application to speech recognition. IEEE Trans. Signal Process. 58(7), 3884–3890 (2010)
Rakhlin, A., Shamir, O., Sridharan, K.: Making gradient descent optimal for strongly convex stochastic optimization. In: International Conference on Machine Learning (2012)
Ranganath, R., Gerrish, S., Blei, D.: Black box variational inference. In: Artificial Intelligence and Statistics (2014)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)
Robert, C., Casella, G.: Monte Carlo Statistical Methods. Springer, Berlin (2013)
Roberts, G., Rosenthal, J.: General state space Markov chains and MCMC algorithms. Probab. Surv. 1, 20–71 (2004)
Ryu, E., Boyd, S.: Primer on monotone operator methods. Appl. Comput. Math. 15(1), 3–43 (2016)
Schatzman, M.: Numerical Analysis: A Mathematical Introduction (trans: Taylor, J.). Clarendon Press, Oxford (2002)
Schillings, C., Sprungk, B., Wacker, P.: On the convergence of the Laplace approximation and noise-level-robustness of Laplace-based Monte Carlo methods for Bayesian inverse problems. Numer. Math. 145(4), 915–971 (2020)
Shen, X., Wasserman, L.: Rates of convergence of posterior distributions. Ann. Stat. 29(3), 687–714 (2001)
Shun, Z., McCullagh, P.: Laplace approximation of high dimensional integrals. J. R. Stat. Soc. Ser. B Methodol. 57(4), 749–760 (1995)
Stein, C.: A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In: Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (1972)
van der Vaart, A.: Asymptotic Statistics. Cambridge University Press, Cambridge (2000)
van der Vaart, A., Wellner, J.: Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, Berlin (2013)
Van Erven, T., Harremos, P.: Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inf. Theory 60(7), 3797–3820 (2014)
Vehtari, A., Simpson, D., Gelman, A., Yao, Y., Gabry, J.: Pareto smoothed importance sampling. arXiv:1507.02646 (2015)
Wainwright, M., Jordan, M.: Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc, Delft (2008)
Wang, Y., Blei, D.: Frequentist consistency of variational Bayes. J. Am. Stat. Assoc. 114(527), 1147–1161 (2019)
Xing, E., Jordan, M., Russell, S.: A generalized mean field algorithm for variational inference in exponential families. In: Kanal, L.N., Lemmer, J.F. (eds.) Uncertainty in Artificial Intelligence. Elsevier, Amsterdam (2002)
Yang, Y., Pati, D., Bhattacharya, A.: \(\alpha \)-variational inference with statistical guarantees. Ann. Stat. 48(2), 886–905 (2020)
Zhang, F., Gao, C.: Convergence rates of variational posterior distributions. Ann. Stat. 48(4), 2180–2207 (2020)
Acknowledgements
The authors gratefully acknowledge the support of a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant and Discovery Launch Supplement (Grant No. RGPIN-2019-03962), and a UBC Four Year Doctoral Fellowship.
Appendices
A Details of experiments
A.1 Details of the toy example for smoothed MAP
The underlying synthetic model for Fig. 1 is as follows:
where the data are truly generated from \((X_i)_{i=1}^n \overset{{\text {i.i.d.}}}{\sim }\mathscr {N}(3, 10)\). For smoothed MAP optimization, we use a smoothing constant of \(\alpha _n = 10 n^{-0.3}\) and draw the initial value uniformly from the range \((-50, 50)\). The learning rate for SGD is chosen as \(\gamma _k = 15/(1+ k^{0.9})\).
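Under the smoothing construction analysed in Appendix B.7 (the posterior convolved with a Gaussian kernel of variance \(\alpha _n\)), the gradient of \(\log \hat{\pi }_n\) can be written as a self-normalized expectation over posterior draws. The following is a minimal, illustrative NumPy sketch of such an estimator together with the schedule values above; the placeholder posterior draws and all names are hypothetical, and this is not the implementation used in the experiments.

```python
import numpy as np

def smap_grad(theta, post_samples, alpha):
    # Gradient of log E_W[ N(theta; W, alpha I) ] with W ~ posterior:
    #   grad = E[ phi(theta; W, alpha I) (W - theta)/alpha ] / E[ phi(theta; W, alpha I) ],
    # estimated by a self-normalized Monte Carlo average over posterior draws.
    diffs = post_samples - theta                          # (S, d)
    logw = -0.5 * np.sum(diffs ** 2, axis=1) / alpha      # unnormalized log Gaussian kernel
    w = np.exp(logw - logw.max())                         # stabilized weights
    return (w[:, None] * diffs / alpha).sum(axis=0) / w.sum()

# Hyperparameters from Appendix A.1 (n is a hypothetical sample size here)
rng = np.random.default_rng(0)
n = 100
alpha_n = 10 * n ** (-0.3)                                # smoothing constant
theta = rng.uniform(-50, 50, size=1)                      # uniform initialization in (-50, 50)
gamma = lambda k: 15.0 / (1.0 + k ** 0.9)                 # SGD learning rate schedule
post_samples = rng.normal(3.0, 1.0, size=(2000, 1))       # placeholder posterior draws
step = gamma(1) * smap_grad(theta, post_samples, alpha_n) # one ascent step on log pi_hat
```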
A.2 Algorithm: CSVI (Adam)
A.3 Discussion of sparse regression experiment
In this section, we provide further discussion of the results presented in Fig. 11. Figures 14 and 15 visualize the Gaussian approximations produced by CLA and CSVI. Instead of fitting a single mode, CSVI covers the range of the posterior and fits a Gaussian distribution with larger variance. Even though the performance of CSVI is consistent across runs, it finds a local optimum rather than the global solution. In this case, the reverse KL divergence (the objective function of Gaussian VI) can be limiting. We compare the forward KL of the fitted Gaussian distributions using 32,000 posterior samples obtained from Stan; the comparison suggests that CSVI finds a solution that is better in forward KL.
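As a concrete illustration of how such a comparison can be made, one may score each Gaussian fit by the average negative log-density it assigns to the posterior draws; differences in this score across fits equal differences in forward KL, since the posterior entropy term is common to all fits. A minimal sketch follows, assuming posterior draws (e.g., from Stan) and fitted Gaussian parameters are available; the function name is illustrative and not part of our code release.

```python
import numpy as np
from scipy.stats import multivariate_normal

def forward_kl_score(post_samples, mu, Sigma):
    # KL(pi || q) = -H(pi) - E_pi[log q(theta)]; the entropy H(pi) does not depend
    # on q, so comparing fits by -E_pi[log q] compares their forward KL divergences.
    # Lower score corresponds to smaller forward KL.
    return -np.mean(multivariate_normal.logpdf(post_samples, mean=mu, cov=Sigma))
```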
A.4 Details of the GMM experiment
A.4.1 Variable transformations of Bayesian Gaussian mixture model
To transform \((\theta _1, \dots , \theta _K) \in \varDelta (K)\) and \(\sigma _{kd}\in \mathbb {R}_+\) into unconstrained space, we consider the change of random variables as below:
-
1.
For \(\sigma _{kd}\sim {\mathsf{LogNormal}}(0,1)\), we consider
$$\begin{aligned} \tau _{kd} = \log (\sigma _{kd}) \sim \mathscr {N}(0,1), \end{aligned}$$which has full support on \(\mathbb {R}\).
-
2.
For \(\theta \sim {\mathsf{Dir}}(\alpha _0)\), we use marginalized LogGamma random variables. Note the following relationship between the Gamma and Dirichlet distributions:
$$\begin{aligned}&\left( \lambda _k\right) _{k = 1}^K \overset{{\text {i.i.d.}}}{\sim }{\mathsf{LogGamma}}(\alpha _0, 1)\\&\quad \left( \frac{\exp (\lambda _1)}{\sum _k \exp (\lambda _k)}, \dots , \frac{\exp (\lambda _K)}{\sum _k \exp (\lambda _k)}\right) \sim {\mathsf{Dir}}(\alpha _0), \end{aligned}$$then \(\lambda _k\) is supported on \(\mathbb {R}\).
Therefore, instead of inferring the original parameters, we perform Gaussian variational inference on the posterior distribution of \((\lambda _{1:k}, \mu _{1:k}, \tau _{11:kd})\).
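A minimal NumPy sketch of these two transformations (sampling in the constrained space and mapping to the unconstrained space; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
K, alpha0 = 3, 1.0

# sigma ~ LogNormal(0, 1)  <=>  tau = log(sigma) ~ N(0, 1), supported on all of R
sigma = rng.lognormal(mean=0.0, sigma=1.0, size=K)
tau = np.log(sigma)

# If lambda_k = log(g_k) with g_k ~ Gamma(alpha0, 1) i.i.d. (i.e. LogGamma(alpha0, 1)),
# then softmax(lambda) ~ Dir(alpha0, ..., alpha0); each lambda_k is supported on R.
lam = np.log(rng.gamma(shape=alpha0, scale=1.0, size=K))
theta = np.exp(lam - lam.max())
theta = theta / theta.sum()        # lies in the simplex Delta(K)
```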
A.4.2 Detailed settings for real data experiment
We subsample the Shapley galaxy dataset to \(N = 500\); the goal is to cluster the distribution of galaxies in the Shapley concentration region. In our experiment, we fix the number of components at \(K = 3\) and set \(\alpha _0 = 1\). During inference, we initialize SMAP with random samples from the prior distribution, which are also used as the mean initialization for SVI. In SMAP, we apply the smoothed MAP estimation to a tempered posterior distribution \(\pi _n^\kappa \) with \(\kappa = 1/2\) and set the smoothing constant \(\alpha _n = 3\). The learning rates for the SMAP and VI algorithms are chosen as 0.005 and 0.001, respectively. Similar to the synthetic experiment, CSVI and SVI_Ind use the identity matrix for \(L_0\), while SVI uses a random diagonal matrix for \(L_0\) whose log-diagonal entries are drawn uniformly from the range \((\log 0.1, \log 100)\).
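For concreteness, the two \(L_0\) initialization schemes described above can be sketched as follows; the unconstrained parameter dimension d used here is a hypothetical value, not taken from the experiment code.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 9  # hypothetical: lambda_{1:3}, mu_{1:3}, tau_{1:3} for K = 3 components

# CSVI and SVI_Ind: identity initialization
L0_identity = np.eye(d)

# SVI: random diagonal L_0 with log-diagonal entries uniform on (log 0.1, log 100)
L0_random = np.diag(np.exp(rng.uniform(np.log(0.1), np.log(100.0), size=d)))
```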
B Proofs
B.1 Proof of Theorem 6
Proof
We consider the KL cost for the scaled and shifted posterior distribution. Let \(\tilde{\varPi }_n\) be the Bayesian posterior distribution of \(\sqrt{n} (\theta - \theta _0)\). The KL divergence measures the difference between the distributions of two random variables and is invariant when an invertible transformation is applied to both random variables (Qiao and Minematsu 2010, Theorem 1). Note that \(\tilde{\varPi }_n\) is shifted and scaled from \(\varPi _n\), and that this linear transformation is invertible, so
Let \(\tilde{\mu }_n^\star , \tilde{\varSigma }_n^\star \) be the parameters of the optimal Gaussian variational approximation to \(\tilde{\varPi }_n\), i.e.
and let
Wang and Blei (2019, Corollary 7) shows that under Assumption 1,
Convergence in total variation implies weak convergence, which then implies pointwise convergence of the characteristic function. Denote \(\tilde{\phi }^\star _n(t)\) and \(\phi _n(t)\) to be the characteristic functions of \(\tilde{\mathscr {N}_n^\star }\) and \(\mathscr {N}\left( \varDelta _{n, \theta _{0}}, H_{\theta _0}^{-1}\right) \). Therefore
which implies
Under Assumption 1, van der Vaart (2000, Theorem 8.14) states that
and \(\theta _{\text {MLE},n} \overset{P_{\theta _0}}{\rightarrow } \theta _0\) according to Eq. (9), yielding \(\mu _n^\star {\mathop { \rightarrow }\limits ^{P_{\theta _0}}} \theta _0\).
Finally, since the Cholesky decomposition defines a continuous mapping from the set of positive definite Hermitian matrices to the set of lower triangular matrices with positive diagonals (both sets are equipped with the spectral norm) (Schatzman 2002, p. 295), we have
\(\square \)
B.2 Proof of Theorem 11
Proof
We provide a proof of the result for strong convexity; the result for Lipschitz smoothness follows the exact same proof technique. Note that if \(D'\) does not depend on x, \(F_n(x)\) is \(D'\)-strongly convex if and only if \(F_n(x) - \frac{1}{2}x^TD'x\) is convex. We use this equivalent characterization of strong convexity in this proof.
Note that for \(Z\sim \mathscr {N}(0, I)\),
Define \(\lambda \in [0,1]\), vectors \(\mu , \mu ' \in \mathbb {R}^d\), positive-diagonal lower triangular matrices \(L, L'\in \mathbb {R}^{d\times d}\), and vectors \(x, x'\in \mathbb {R}^{(d+1)d}\) by stacking \(\mu \) and the columns of L and likewise \(\mu '\) and the columns of \(L'\). Define \(x(\lambda ) = \lambda x + (1-\lambda )x'\), \(\mu (\lambda ) = \lambda \mu + (1-\lambda )\mu '\), and \(L(\lambda ) = \lambda L + (1-\lambda ) L'\). Then,
By the D-strong convexity of \(n^{-1}\log \pi _n\),
\(\square \)
B.3 Proof of Proposition 12
Proof
Note that by reparametrization,
where \(Z \sim \mathscr {N}(0,1)\). Using a Taylor expansion,
for some \(\mu '\) between \(\mu \) and \(\mu +\sigma Z\). By the uniform bound on the third derivative and local bound on the second derivative, for any \(\mu \in U\),
The result follows for any \(0<\varepsilon < \varepsilon / \eta \). \(\square \)
B.4 Proof of Theorem 13
Proof
Note that we can split L into columns and express LZ as
where \(L_i\in \mathbb {R}^p\) is the \(i\text {th}\) column of L, and \((Z_i)_{i=1}^d\overset{{\text {i.i.d.}}}{\sim }\mathscr {N}(0,1)\). Denoting \(\nabla ^2 f_n :=\nabla ^2 f_n(\mu +LZ)\) for brevity, the \(2\text {nd}\) derivatives in both \(\mu \) and L are
where we can pass the gradient and Hessian through the expectation by dominated convergence because Z has a normal distribution and \(f_n\) has \(\ell \)-Lipschitz gradients. Stacking these together in block matrices yields the overall Hessian,
Since \(f_n\) has \(\ell \)-Lipschitz gradients, for all \(x\in \mathbb {R}^d\), \(-\ell I \preceq \nabla ^2 f_n(x) \preceq \ell I\). Applying the upper bound and evaluating the expectation yields the Hessian upper bound (and the same technique yields the corresponding lower bound):
To demonstrate local strong convexity, we split the expectation into two parts: one where \(n^{-1/2} LZ\) is small enough to guarantee that \(\Vert \mu +n^{-1/2} LZ - x\Vert ^2 \le r^2\), and the complement. Define
Note that when \(\Vert Z\Vert ^2 \le r_n^2(\mu , L) \),
Then, we may write
Since \(f_n\) has \(\ell \)-Lipschitz gradients and is locally \(\varepsilon \)-strongly convex,
Note that \(A^TA\) has entries 1 and \(n^{-1} Z_i^2\) along the diagonal, as well as \(n^{-1} Z_iZ_j\), \(i\ne j\) and \(n^{-1/2} Z_i\) on the off-diagonals. By symmetry, since Z is an isotropic Gaussian, censoring by \(\mathbb {1}\left[ \Vert Z\Vert ^2 \le \dots \right] \) or \(\mathbb {1}\left[ \Vert Z\Vert ^2 > \dots \right] \) maintains that the off-diagonal expectations are 0. Therefore, the quantity \(\mathbb {E}\left[ \mathbb {1}\left[ \Vert Z\Vert ^2 \le r_n^2(\mu , L) \right] A^TA \right] \) is diagonal with coefficients \(1 - \alpha _n(\mu ,L)\) and \(n^{-1} \beta _n(\mu ,L)\), and \(\mathbb {E}\left[ \mathbb {1}\left[ \Vert Z\Vert ^2 > r_n^2(\mu , L) \right] A^TA \right] \) is diagonal with coefficients \(\alpha _n(\mu ,L)\) and \(n^{-1} \tau _n(\mu ,L)\) where
Note that \(\Vert Z\Vert ^2 \sim \chi ^2_d\); so \(\alpha _n(\mu ,L) = 1-\chi ^2_d(r_n^2(\mu , L) )\) and
Therefore,
\(\square \)
B.5 Proof of Lemma 9
Proof
Given Assumption 1, we know \(f_n\) is twice continuously differentiable. Thus, using the second-order characterization of strong convexity, it is equivalent to show the existence of \(r,\varepsilon >0\) such that
as \(n \rightarrow \infty \). Note that by Weyl’s inequality
Condition 4 of Assumption 1 guarantees that \(H_{\theta _0} \succeq \varepsilon I\) and that there exists a \(\kappa > 0\) such that \(H_\theta \) is continuous in \(B_{\kappa }(\theta _0)\). Hence, there exists \(0<\kappa ' \le \kappa \), such that \(\forall \theta \in B_{\kappa '}(\theta _0),\; H_\theta \succeq \frac{\varepsilon }{2}I\).
We then consider \(\lambda _{\min }\left( \nabla ^2 f_n(\theta ) - H_\theta \right) \). We aim to find a \(0 <r \le \kappa '\) such that \(|\lambda _{\min }\left( \nabla ^2 f_n(\theta ) - H_\theta \right) |\) is sufficiently small. Note that for any fixed \(r > 0\),
Now, we split \(f_n\) into prior and likelihood, yielding that
Given Condition 2 of Assumption 1, for all \(\theta \), \(\pi _0(\theta )\) is positive and \(\nabla ^2 \pi _0(\theta )\) is continuous; and further due to the compactness of \(B_r(\theta _0)\), we have that
Then, it remains to bound the first term and the last term of Eq. (13). For the first term, we aim to use the uniform weak law of large numbers to show its convergence to 0. By Condition 5 of Assumption 1, there exists a \(0 < r_1 \le \kappa '\) and a measurable function g such that for all \( \theta \in B_{r_1}(\theta _0)\) and for all x,
Then, by the compactness of \(B_{r_1}(\theta _0)\), we can apply the uniform weak law of large numbers (Jennrich 1969, Theorem 2), yielding that for all \(i,j \in [d]\),
Since the entrywise convergence of matrices implies the convergence in spectral norm,
For the last term of Eq. (13), by Condition 4 of Assumption 1,
Thus, there exists a sufficiently small \(r_2 > 0\) such that
Then, we combine Eqs. (14) to (16) and pick \(r' = \min (r_1,r_2)\le \kappa '\), yielding that
as \(n \rightarrow \infty \). Then, the local strong convexity is established. Note that we have already shown for all \(\theta \in B_{\kappa '}(\theta _0), H_\theta \succeq \frac{\varepsilon }{2} I\). By Eqs. (17) and (12), we conclude that for all \(\varepsilon \le \frac{\varepsilon }{4}\),
The smoothness argument follows from the same strategy. Weyl’s inequality implies that
By repeating the proof for local smoothness, we obtain that there exists a sufficiently small \(0 < r''\), such that \(\forall \varepsilon > 0\),
as \(n \rightarrow \infty \). Condition 4 and 5 of Assumption 1 yield that
Therefore, there exists an \(\ell >0\) such that
Then, the proof is complete by defining \(r :=\min \{r', r''\}\). \(\square \)
B.6 Proof of Corollary 14
Proof
We begin by verifying the conditions of Theorem 13 for \(f_n\). By Assumption 1, we know that \(f_n\) is twice differentiable. We also know that by Lemma 9, under Assumptions 1 and 2, there exist \(\ell , r', \varepsilon > 0\) such that
By Theorem 6, we know that \(\mu _n^\star {\mathop {\rightarrow }\limits ^{P_{\theta _0}}} \theta _0\), so there exists an \(r'> r>0\) such that
Therefore by Theorem 13, the probability that
and
hold converges to 1 as \(n\rightarrow \infty \), where \(D_n\) and \(\tau _n(\mu ,L)\) are as defined in Eq. (10) and \(x = \mu _n^\star \). Note that the gradient and Hessian in the above expression are taken with respect to a vector in \(\mathbb {R}^{d(d+1)}\) that stacks \(\mu \) and each column of L into a single vector.
Then for all \((\mu , L) \in {\mathscr {B}}_{r,n}\), we have
yielding
Hence \(\forall (\mu , L) \in {\mathscr {B}}_{r,n}\), \(\tau _n(\mu ,L ) \rightarrow 0\) as \(n\rightarrow \infty \), yielding that under sufficiently large n,
Therefore, the probability that for all \((\mu , L) \in {\mathscr {B}}_{r,n}\),
converges in \(P_{\theta _0}\) to 1 as \(n \rightarrow \infty \).
Combining Eqs. (18) to (20), the proof is completed. \(\square \)
B.7 Proof of Theorem 7
B.7.1 Gradient and Hessian derivation
The gradient for smoothed posterior is as follows:
and the Hessian matrix is given by
where \(W,W' \overset{{\text {i.i.d.}}}{\sim }\varPi _{ n}\).
B.7.2 Proof of \(1{\text {st}}\) statement of Theorem 7
Proof of first statement of Theorem 7
To show the MAP estimation for smoothed posterior is asymptotically strictly convex, we will show that
We focus on the first term of Eq. (21), and show that asymptotically it is uniformly smaller than \(\alpha _n^{-1}\) so that the overall Hessian is negative definite. For the denominator of Eq. (21), define \(B_n :=\left\{ W, W' : \max \{\Vert W'-\theta _0\Vert , \Vert W - \theta _0\Vert \} \le \beta _n\right\} \) for any sequence \(\beta _n = o(\alpha _n)\). Then, we have
By minimizing over \( v, v' \in B_n\), the above leads to
For the numerator of the first term of Eq. (21), since \(W, W'\) are i.i.d.,
and since \(\lambda _{\text {max}} \left( (W -W')(W -W')^T\right) = \Vert W -W'\Vert ^2\),
With Eqs. (22) and (23), we can therefore bound the maximal eigenvalue of the Hessian matrix,
We now bound the supremum of this expression over \(\{\theta \in \mathbb {R}^d: \Vert \theta -\theta _0\Vert \le M\}\). Focusing on the exponent within the expectation,
where the inequality is obtained by expanding the quadratic terms and bounding \(\Vert \theta - \theta _0\Vert \) with M. We combine the above bound with Eq. (24) to show that \(\alpha _n^2\lambda _{\text {max}}\left( \nabla ^2 \log \hat{\pi }_n(\theta ) \right) + \alpha _n\) is bounded above by
By multiplying and dividing by \( \exp \left( \frac{ \Vert W-W'\Vert }{\sqrt{\beta _n}}\right) \), one notices that
where the inequality follows from the fact that \(x^2 e^{-x}\) is maximized at \(x=2\) with value \(4e^{-2}\), together with \(\Vert W -W'\Vert \le \Vert W\Vert + \Vert W'\Vert \). Combining this bound with Eq. (25) and noting that \(W, W'\) are i.i.d., Eq. (25) is bounded above by
To show that the Hessian is asymptotically negative definite, it suffices to show that Eq. (26) is \(o_{P_{\theta _0}}(\alpha _n)\). For the terms outside the expectation, \(\beta _n = o(\alpha _n)\) implies that \(2 e^{-2} \beta _n e^{\frac{2\beta _n^2 + 4M\beta _n}{\alpha _n} } = o(\alpha _n)\), and Assumption 1 and Lemma 15 together imply that
so
Therefore, in order to show Eq. (26) is \(o_{P_{\theta _0}}(\alpha _n)\), it is sufficient to show that
The next step is to split the expectation into two regions—\(\Vert W - \theta _0\Vert \le \beta _n\) and \(\Vert W - \theta _0\Vert > \beta _n\)—and bound its value within them separately.
-
1.
When \(\Vert W - \theta _0\Vert \le \beta _n\), the exponent inside the expectation is shrinking uniformly since \(\beta _n = o(\alpha _n)\):
$$\begin{aligned}&\mathbb {E}\left[ 1_{\{ \Vert W - \theta _0\Vert \le \beta _n\}} e^{ \left( \frac{1}{\alpha _n}M + \beta _n^{-1/2} \right) \Vert W - \theta _0\Vert - \frac{1}{2\alpha _n} \Vert W - \theta _0\Vert ^2 } \right] \\&\quad \le \mathbb {E}\left[ 1_{\{ \Vert W - \theta _0\Vert \le \beta _n\}} \right] e^{ \left( \frac{1}{\alpha _n}M + \beta _n^{-1/2} \right) \beta _n} \\&\quad = O_{P_{\theta _0}}(1). \end{aligned}$$ -
2.
When \(\Vert W - \theta _0\Vert > \beta _n\), we take the supremum of the exponent (a quadratic function), which is attained at \(\Vert W - \theta _0\Vert = M + \alpha _n \beta _n^{-1/2}\), yielding the following bound:
$$\begin{aligned}&\left( \frac{1}{\alpha _n}M + \beta _n^{-1/2} \right) \Vert W - \theta _0\Vert - \frac{1}{2\alpha _n} \Vert W - \theta _0\Vert ^2 \nonumber \\&\quad \le \sup _{\Vert v - \theta _0\Vert } \left( \left( \frac{1}{\alpha _n}M + \beta _n^{-1/2} \right) \Vert v - \theta _0\Vert - \frac{1}{2\alpha _n} \Vert v - \theta _0\Vert ^2 \right) \nonumber \\&\quad = \left( \frac{1}{\alpha _n}M + \beta _n^{-1/2} \right) \left( M + \alpha _n \beta _n^{-1/2} \right) \nonumber \\&\qquad - \frac{1}{2\alpha _n} \left( M + \alpha _n \beta _n^{-1/2} \right) ^2 \nonumber \\&\quad =\frac{M^2}{2\alpha _n} + \frac{M}{\beta _n^{1/2} }+ \frac{\alpha _n}{2\beta _n}. \end{aligned}$$(27)This yields
$$\begin{aligned}&\mathbb {E}\left[ 1_{\{ \Vert W - \theta _0\Vert> \beta _n\}} e^{ \left( \frac{1}{\alpha _n}M + \beta _n^{-1/2} \right) \Vert W - \theta _0\Vert - \frac{1}{2\alpha _n} \Vert W - \theta _0\Vert ^2 } \right] \\&\quad \le \varPi _{n}\left( \left\{ W : \Vert W - \theta _0 \Vert > \beta _n\right\} \right) \exp \left( \frac{M^2}{2\alpha _n} + \frac{M}{\beta _n^{1/2} }+ \frac{\alpha _n}{2\beta _n} \right) . \end{aligned}$$Note that it is always possible to choose \(\beta _n = o(\alpha _n)\) with \(\beta _n = \omega (\alpha _n^2)\). With this choice of \(\beta _n\), the dominating term among the three of Eq. (27) is \(\frac{M^2}{2\alpha _n} \). Then by Lemma 15, there exists a sequence \(\beta _n = o(\alpha _n)\) with \(\beta _n = \omega (\alpha _n^2)\) such that the following holds,
$$\begin{aligned} \varPi _n( \left\{ W : \Vert W - \theta _0 \Vert > \beta _n\right\} ) = o_{P_{\theta _0}}\left( \exp \left\{ -\frac{M^2}{2\alpha _n}\right\} \right) , \end{aligned}$$which implies
$$\begin{aligned}&\mathbb {E}\left[ 1_{\{ \Vert W - \theta _0\Vert > \beta _n\}} e^{ \left( \frac{1}{\alpha _n}M + \beta _n^{-1/2} \right) \Vert W - \theta _0\Vert - \frac{1}{2\alpha _n} \Vert W - \theta _0\Vert ^2 } \right] \\&= o_{P_{\theta _0}}(1) . \end{aligned}$$
This finishes the proof. \(\square \)
In the last step of the above proof, we require an exponential tail bound for the posterior \(\varPi _n\). We provide this in the following lemma, following the general proof strategy of van der Vaart (2000, Theorem 10.3). The proof of Lemma 15 involves many probability distributions; thus, for mathematical convenience and explicitness, in the proof of Lemma 15 we use square brackets, \(P\left[ X \right] \), to denote the expectation of a random variable X with respect to a probability distribution P. When taking the expectation of a function of n data points \(f(X_1, \dots , X_n)\), where \((X_i)_{ i =1}^n \overset{{\text {i.i.d.}}}{\sim }P_{\theta }\), we still write \(P_\theta [f]\); here \(P_\theta \) represents the product measure.
Lemma 15
Under Assumption 1, \(\alpha _n^3 n \rightarrow \infty \), there exists a sequence \(\beta _n\) satisfying \(\beta _n = o(\alpha _n)\), \(\beta _n = \omega (\alpha _n^2)\) and \(\beta _n = \omega (n^{-1/2})\) such that for any fixed constant M,
Proof of Lemma 15
In order to show that \(\beta _n\) satisfies the tail probability bound, it suffices to prove that
due to Markov’s inequality (we absorb the \(M^2/2\) constant into \(\alpha _n\) because it does not affect the proof). To achieve this, we take advantage of the existence of a test sequence guaranteed by Assumption 1. By van der Vaart (2000, Lemma 10.6), given the \(1\text {st}\) and the \(2\text {nd}\) conditions of Assumption 1 and the fact that the parameter space \(\mathbb {R}^d\) is \(\sigma \)-compact, there exists a sequence of tests \(\phi _n : \mathscr {X}^n \rightarrow [0,1]\), where \(\mathscr {X}^n\) is the space of \((X_1, \dots , X_n)\), such that as \(n \rightarrow \infty \),
Further, by Kleijn (2004, Lemma 1.2) and van der Vaart (2000, Lemma 10.3), under Assumption 1 and the existence of the above test sequence \(\phi _n\), for every \(M_n \rightarrow \infty \), there exists a constant \(C > 0\) and another sequence of tests \(\psi _n : \mathscr {X}^n\rightarrow [0,1]\) such that for all \(\Vert \theta - \theta _0\Vert > M_n/\sqrt{n}\) and sufficiently large n,
Using \(\psi _n\), we split the expectation as follows:
and we aim to show that both parts converge to 0.
For term (I), the first statement of Eq. (28) implies that \(\exists C >0\) such that
where the first inequality follows from \( \varPi _n( \left\{ W : \Vert W - \theta _0 \Vert > \beta _n\right\} ) \le 1\). Since \(n \alpha _n^3 \rightarrow \infty \), the last bound in the above expression converges to 0.
For term (II), we work with the shifted and scaled posterior distribution. Define \(Z_n = \sqrt{n}(W - \theta _0)\) and \(B_n = \{Z_n: \Vert Z_n\Vert > \sqrt{n\beta _n^2} \}\), and let \(\tilde{\varPi }_{0}\) be the corresponding prior distribution on \(Z_n\) and \(\tilde{\varPi }_n\) be the shifted and scaled posterior distribution, which yields
Let U be a closed ball around 0 with a fixed radius r; restricting \(\tilde{\varPi }_0\) to U defines a probability measure \(\tilde{\varPi }_0^U\), i.e. for every measurable set B, \(\tilde{\varPi }_0^U(B) = \tilde{\varPi }_0(B \cap U)/ \tilde{\varPi }_0(U)\). Write \(P_{n, z}\) for the joint distribution of the n data points \((X_1, \dots , X_n)\) parametrized by \(\theta _0 + z/\sqrt{n}\), and write \(P_{n,U} = \int P_{n, z} \text{ d } \tilde{\varPi }_0^U(z)\) for the marginal distribution of \((X_1, \dots , X_n)\) under \(\tilde{\varPi }_0^U\). The densities of these distributions will be represented using lower case, e.g. \(p_{n,U}(x) = \int p_{n,z}(x) \tilde{\pi }_0^U(z) \text{ d } z\) is the PDF of \(P_{n,U}\). Here, we abuse notation by letting x represent \((x_1, \dots , x_n)\).
We replace \(P_{\theta _0}\) in Eq. (29) with \(P_{n,U}\). Under Assumption 1, by van der Vaart (2000, p. 141), \(P_{n,U} \) and \(P_{\theta _0}\) are mutually contiguous (LeCam 1960), that is, for any statistic \(T_n\) (a Borel function of \(X^n\)), \(T_n {\mathop {\rightarrow }\limits ^{P_{\theta _0}}}0\) iff \(T_n {\mathop {\rightarrow }\limits ^{P_{n,U}}}0\). Thus, considering \(\tilde{\varPi }_0 \left( B_n \right) (1- \psi _n)\) as the statistic \(T_n\), the convergence to 0 of the expression in Eq. (29) is equivalent to
Manipulating the expression of \(P_{n,U}\) and \(\tilde{\varPi }_n \left( B_n \right) \) (we write \( \tilde{\varPi }_n \left( B_n, x\right) \) in the integral and write \(\tilde{\varPi }_n \left( B_n, (X_i)_{i = 1}^n\right) \) in the expectation to make the dependence of posterior on the data explicit),
Note that \(p_{n,U}(x ) = \int p_{n,z}(x) \text{ d } \tilde{\varPi }_0^U(z)\),
Recall that for all measurable set B, \(\tilde{\varPi }_0^U(B) = \tilde{\varPi }_0(B \cap U)/ \tilde{\varPi }_0(U)\), thus
By using Bayes rule, we expand \(\tilde{\varPi }_{n} \left( B_n, x \right) = \frac{\int \mathbb {1}[B_n] p_{n,z}(x) \text{ d } \tilde{\varPi }_0 (z)}{\int p_{n,z}(x)\text{ d }\tilde{\varPi }_0(z)}\),
Note that \(\tilde{\varPi }_n \left( U, x \right) = \frac{\int _U p_{n,z}(x) \text{ d } \tilde{\varPi }_{0}(z)}{\int p_{n,z}(x)\text{ d }\tilde{\varPi }_{0}(z)}\),
By Fubini's theorem and \( \tilde{\varPi }_n \left( U,x \right) \le 1\),
Note that \(P_{n,z} [1- \psi _n] \equiv P_{\theta }[1 - \psi _n]\) for \(\theta = \theta _0 + z/\sqrt{n}\) and that \(\sqrt{n\beta _n^2} \rightarrow \infty \) due to \(\beta _n = \omega (n^{-1/2})\). Thus, we can use the second statement of Eq. (28) to bound \(P_{n,z} [1- \psi _n]\), yielding
We then derive upper bounds for both the fraction and the integral to show the above is \(o\left( e^{-\frac{1}{\alpha _n}}\right) \). For the fraction, we define \(U_n :=\left\{ w\in \mathbb {R}^d: \sqrt{n}(w - \theta _0) \in U \right\} \), then
By Assumption 1, for all \( w \in \mathbb {R}^d, \pi _0(w)\) is positive and continuous, and hence \(\inf _{w\in U_n}\pi _0(w)\) is an increasing sequence that converges to \(\pi _0(\theta _0) > 0\) as \(n \rightarrow \infty \). Thus, there is a constant \(D > 0\) such that for sufficiently large n,
yielding that
For the integral, by splitting \(B_n\) into \(\{\sqrt{n\beta _n^2} < \Vert z_n\Vert \le k \sqrt{n} \}\) and \( \{ \Vert z_n\Vert > k \sqrt{n} \}\) for some positive \( k < 1\),
Then by change of variable to \(w = \frac{1}{\sqrt{n}} z + \theta _0\),
Note that by Assumption 1, \(\pi _0(w)\) is continuous for all \(w \in \mathbb {R}^d\); hence we can choose a sufficiently small k such that \(\pi _0(w)\) is uniformly bounded by a constant C over the region \(\{k \ge \Vert w- \theta _0\Vert > \beta _n \}\). Thus, the above can be bounded above by
where the equality is by change of variable back to \(z = \sqrt{n}(w - \theta _0) \). Then, consider the integral on RHS. Using spherical coordinates, there exists a fixed constant \( D > 0\) such that
where the second equality is by setting \(s = Cr^2\). Note that the integrand of RHS is proportional to the PDF of \(\varGamma (\frac{d}{2}, 1)\). Using the tail properties of the Gamma random variable (Boucheron et al. 2013, p. 28), we have that for some generic constant \(D >0\),
Therefore, for some generic constants \(C , D >0\),
Then, we combine Eqs. (30) and (31), yielding that for some constants \(C, D > 0\) independent of n,
Lastly, it remains to show that there exists a positive sequence \(\beta _n\) satisfying both \( \beta _n = o(\alpha _n)\) and \(\beta _n = \omega (\alpha _n^2)\) such that the RHS converges to 0. The first term always converges to 0 due to \(\alpha _n^3 n \rightarrow \infty \). For the second term, we consider two different cases. If \(\alpha _n = o(n^{-1/6})\), we pick \(\beta _n = n^{-1/3}\), which is both \( o(\alpha _n)\) and \(\omega (\alpha _n^2)\). Then,
where the convergence in the last line is by \(n \alpha _n^3 \rightarrow \infty \Leftrightarrow \frac{1}{\alpha _n} = o(n^{1/3})\). If \(\alpha _n = \omega (n^{-1/6})\), we pick \(\beta _n = \alpha _n^2\). Then,
Since \(\alpha _n = \omega (n^{-1/6})\), \(\frac{1}{\alpha _n} = o(n^{1/6})\) and \( n\alpha _n^4 = \omega (n^{1/3})\), yielding that the above converges to 0 as \(n \rightarrow \infty \). This completes the proof. \(\square \)
B.7.3 Proof of \(2\text {nd}\) statement of Theorem 7
In this section, we show that the smoothed MAP estimator (\(\hat{\theta }_n^\star \)) is also a consistent estimate of \(\theta _0\), but with a convergence rate that is slower than the traditional \(\sqrt{n}\). This is the case because the variance of the smoothing kernel satisfies \(\alpha _n = \omega (n^{-1/3})\), and the convergence rate of \(\hat{\theta }_n^\star \) is determined by \(\alpha _n\) via
Recall that \(\theta _{\text {MLE},n }\) is a \(\sqrt{n}\)-consistent estimate of \(\theta _0\). Thus, it is sufficient to show \(\left\| \hat{\theta }_n^\star - \theta _{\text {MLE},n} \right\| = O_{P_{\theta _0}}\left( \sqrt{\alpha _n}\right) \).
Note that \(\hat{\theta }_n^\star \) and \(\theta _n^\star \) are maximizers of the stochastic processes \(\hat{\pi }_n(\theta )\) and \(\pi _n(\theta )\), respectively, and can therefore be studied in the framework of M-estimation (van der Vaart 2000; van der Vaart and Wellner 2013). A useful tool for establishing the asymptotics of M-estimators is the Argmax Continuous Mapping theorem (van der Vaart and Wellner 2013, Lemma 3.2.1), which is introduced as follows.
Lemma 16
[Argmax Continuous Mapping (van der Vaart and Wellner 2013)] Let \(\{f_n(\theta ) \}\) and \(f(\theta )\) be stochastic processes indexed by \(\theta \), where \(\theta \in \varTheta \). Let \(\hat{\theta }\) be a random element such that almost surely, for every open set G containing \(\hat{\theta }\),
and \(\hat{\theta }_n\) be a random sequence such that almost surely
If \(\sup _{\theta \in \varTheta } |f_n(\theta ) - f(\theta )| = o_P(1)\) as \(n \rightarrow \infty \), then
The proof strategy of the \(2\text {nd}\) statement of Theorem 7 is to apply Lemma 16 in a setting where \(f_n\) is \(\hat{\pi }_n(\theta )\) and f is a Gaussian density. Using the Bernstein–von Mises Theorem 4, we show that \(\hat{\pi }_n(\theta )\) converges uniformly to this Gaussian density, which implies that the MAP of \(\hat{\pi }_n(\theta )\) converges in distribution to the MAP of this Gaussian distribution by the Argmax Continuous Mapping theorem. The detailed proof is as follows.
Proof of Second statement of Theorem 7
Note that
By Theorem 4, we have \(\Vert \theta _{\text {MLE},n} - \theta _{0}\Vert = O_{P_{\theta _0}}(1/\sqrt{n})\). In addition, given that \(\sqrt{\alpha _n } = \omega (1/ \sqrt{n})\), in order to obtain Eq. (32), it suffices to show
Thus, in this proof, we aim to show \(\left\| \hat{\theta }_n^\star - \theta _{\text {MLE},n} \right\| = O_{P_{\theta _0}}(\sqrt{\alpha _n}) \) and it is sufficient to prove
Let \(\xi = \frac{1}{\sqrt{\alpha _n}} \left( \theta - \theta _{\text {MLE},n}\right) \), \(\xi ^*_n = \frac{1}{\sqrt{\alpha _n}} \left( \hat{\theta }_n - \theta _{\text {MLE},n} \right) \), and \(t = \frac{1}{\sqrt{\alpha _n}} \left( w - \theta _{\text {MLE},n}\right) \). Expressing \(\hat{\pi }_n(\theta )\), defined in Eq. (6), in terms of these variables,
Define
where \(\phi (\cdot ; \mu , \varSigma )\) denotes the PDF of \(\mathscr {N}(\mu , \varSigma )\).
By adding and subtracting \(f(\xi )\),
We then apply Lemma 16 to show \(\xi _n^* \overset{d}{\rightarrow }\mathop {\mathrm{arg}\,\mathrm{max}}\nolimits _{\xi } f(\xi )\). We start by verifying a condition of the argmax continuous mapping theorem that
By triangle inequality, for all n,
Below we show that both terms on the RHS converge to 0.
For the first term, note that \(\alpha _n^{d/2} \pi _n(\sqrt{\alpha _n} t+ \theta _{\text {MLE},n})\) is the probability density function of \(\varPi _{\sqrt{\alpha _n} t+ \theta _{\text {MLE},n}}\), which is the posterior distribution parametrized in terms of t. Thus, for all n,
where the inequality is by \( \sup _{\xi ,t} \exp ( -\frac{1}{2}\Vert \xi - t \Vert ^2) \le 1\). Under Assumption 1, the posterior distribution satisfies the Bernstein–von Mises theorem (Theorem 4), i.e.
With the invariance of total variation under reparametrization, we have
This shows the uniform convergence of \(f_n(\xi )\) to \(g_n(\xi )\). Note that in Eq. (35), we state the Bernstein–von Mises theorem with a different centering process; this is allowed when \(\theta _{\text {MLE},n}\) is asymptotically normal (van der Vaart 2000, p. 144), which is ensured by Theorem 4.
For the second term in Eq. (34), note that we can evaluate \(g_n(\xi )\) in closed form since it is a convolution of two Gaussian PDFs, that is
Comparing this to \(f(\xi ) = (2\pi )^{d/2} \phi \left( \xi ; 0, I \right) \), one notices that \( \frac{1}{n \alpha _n}H_{\theta _0}^{-1} + I \rightarrow I\) due to \(\alpha _n^3 n \rightarrow \infty \). Furthermore, for Gaussian distributions, convergence of the parameters implies uniform convergence of the PDFs, yielding that
Thus, we have Eq. (34) converging to 0 as \(n \rightarrow \infty \).
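For reference, the evaluation of \(g_n(\xi )\) above uses the standard identity for the convolution of two Gaussian densities (restated here for completeness): for any \(m \in \mathbb {R}^d\) and positive definite \(\varSigma _1, \varSigma _2\),
$$\begin{aligned} \int _{\mathbb {R}^d} \phi \left( \xi ; t, \varSigma _1\right) \, \phi \left( t; m, \varSigma _2\right) \text{ d }t = \phi \left( \xi ; m, \varSigma _1 + \varSigma _2\right) , \end{aligned}$$
applied here with \(m = 0\), \(\varSigma _1 = I\), and \(\varSigma _2 = \frac{1}{n \alpha _n}H_{\theta _0}^{-1}\).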
Now, we look at \(f(\xi )\) with the goal to apply Lemma 16 and to obtain \(\xi _n^* \overset{d}{\rightarrow }\mathop {\mathrm{arg}\,\mathrm{max}}\nolimits _{\xi } f(\xi )\). Note that
To apply Lemma 16, we need to ensure that for any open set G that contains 0,
This holds by the unimodality of standard Gaussian distribution.
Therefore, with both conditions Eqs. (33) and (36) in hand, we can apply Lemma 16 to conclude that
This completes the proof. \(\square \)
B.8 Proof of Theorem 2
Proof
Lemma 9 implies the existence of a locally asymptotically convex and smooth region \(B_r(\theta _0)\) of \(f_n\); and Theorems 5 and 7 show the consistency of \(\theta _{n}^\star \) and \(\hat{\theta }_n^\star \) to \(\theta _0\) in data-asymptotics. Let \(\theta _\text {init}\) be the initial value of Algorithm 2 satisfying \(\Vert \theta _\text {init} - \hat{\theta }^\star _n\Vert \le \frac{r}{4(\ell + 1)}\). If \(\Vert \hat{\theta }_n^\star - \theta _0\Vert \le r/4\) and \(\Vert \theta _n^\star - \theta _0\Vert \le r/4\),
and \(B_{3r/4}(\theta _n^\star ) \subseteq B_r(\theta _0)\). Combining above shows that there exists \(0 < r':= \frac{3r}{4}\) such that
Then, it remains to show that the iterates produced by Algorithm 2 remain confined in \(B_{r'}(\theta _n^\star )\) as the amount of data grows. Since \(\theta _n^\star \) is the global optimum of \(f_n\), as long as the backtracking line search ensures the decay of the objective value within the locally strongly convex region \(B_{r'}(\theta _n^\star )\) (Boyd and Vandenberghe 2004, p. 465), the gradient norm decays in each iteration and hence the iterates converge to the optimum. Therefore, it is sufficient to show that the first iterate stays inside.
By \(\ell \)-smoothness of \(f_n\) inside \(B_{r'}(\theta _n^\star )\), we have \(\Vert \nabla f_n(\theta _\text {init})\Vert \le \frac{\ell r}{4(\ell +1)}\) and hence
This shows the confinement. The convergence rate of iterates is \(\eta ^k\), where
which follows from the standard theory of gradient descent under \(\varepsilon \)-strong convexity and \(\ell \)-Lipschitz smoothness (Ryu and Boyd 2016, p. 15). This completes the proof. \(\square \)
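For concreteness, a generic sketch of gradient descent with a backtracking (Armijo) line search of the kind invoked above is given below; this is an illustration under the stated smoothness assumptions, not Algorithm 2 verbatim, and all names are hypothetical.

```python
import numpy as np

def backtracking_gd(f, grad, theta0, t0=1.0, a=0.25, b=0.5, iters=100):
    # At each iteration, shrink the step t by a factor b until the Armijo
    # sufficient-decrease condition f(theta - t*g) <= f(theta) - a*t*||g||^2 holds.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        g = grad(theta)
        t = t0
        while f(theta - t * g) > f(theta) - a * t * np.dot(g, g):
            t *= b
        theta = theta - t * g
    return theta

# Example: minimize a strongly convex quadratic
theta_hat = backtracking_gd(lambda x: np.sum(x ** 2), lambda x: 2 * x, np.array([5.0, -3.0]))
```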
B.9 Proof of Theorem 3
Proof of Theorem 3
In this proof, we aim to apply Theorem 17 with
which is closed and convex. Note that in the notation of this theorem, \(x = (\mu ^T, L_1^T, \dots , L_p^T)^T \in \mathbb {R}^{(d+1)d}\) and \(V \in \mathbb {R}^{(d+1)d \times (d+1)d}\) is set to be a diagonal matrix with entries 2 for the \(\mu \) components and \(r/(2\Vert I - L_n^\star \Vert _F)\) for the L components. Therefore,
This setting yields two important facts. First, by Theorems 7 and 6,
yielding that
For \(\Vert \mu _0 - \hat{\theta }_n^\star \Vert ^2 \le \frac{r^2}{32}\), by triangle inequality, the probability that the following inequalities hold converges to 1 in \(P_{\theta _0}\) as \(n \rightarrow \infty \),
Further with \(L_0 = I\), \(J(\mu _0, L_0) \le \frac{3 r^2}{4} \le r^2\). Hence, if we initialize \(L_0 = I \) and \(\mu _0\) such that \(\Vert \mu _0 - \hat{\theta }_n^\star \Vert _2^2 \le \frac{r^2}{32}\),
Second, if \(J \le r^2\) then \(\mu \) is close to the optimal and \(\Vert L\Vert _F\) is not too large, i.e.
yielding that \(\{J(\mu , L) \le r^2 \} \subseteq {\mathscr {B}}_{r,n}\). Recall that
Then by Corollary 14, under Assumptions 1 and 2, the probability of the event that
converges to 1 in \(P_{\theta _0}\) as \(n \rightarrow \infty \).
For brevity, we make the following definitions for the rest of this proof: recall the definition of \(f_n, F_n\) in Eqs. (2) and (3) (we state here again):
Here, \(\phi _n(x, z)\) is the KL cost function with no expectation, and \(\varPhi _n(x)\) is the cost function with the expectation. To match the notation of Theorem 17, we reformulate the scaled gradient estimator defined in Eq. (8) as \(g_n\),
for a diagonal scaling matrix \(R_n(x) \in \mathbb {R}^{d(d+1)\times d(d+1)}\). We define \(R_n(x)\) to have entries 1 for the \(\mu \) components, 1 for the off-diagonal L components, and \(1/(1+(n L_{ii})^{-1})\) for the diagonal L components. Note that \(x \rightarrow \partial \mathscr {X}\) means that \(L_{ii} \rightarrow 0\), ensuring that \(g_n(x,Z)\) has entries \(-1\) for the \(L_{ii}\). Since Z is a standard normal random variable, on the event that \(-\frac{1}{n} \log \pi _n\) has a Lipschitz gradient, the gradient can be passed through the expectation, so the true gradient is defined as below:
Note that the projected stochastic iteration
with \(\varPi _\mathscr {X}(x) :=\mathop {\mathrm{arg}\,\mathrm{min}}\nolimits _{y\in \mathscr {X}} \Vert V(x - y)\Vert ^2\) is equivalent to the iteration described in Algorithm 3. Note that \(\phi _n\) is differentiable only for \(x \in \mathscr {X}^\mathrm {o}\). In the case where \(L_{ii} = 0\) for some \(i \in [d]\), we use continuation via the limit \(\lim _{L_{ii} \rightarrow 0} -\frac{(n L_{ii})^{-1}}{1+ (n L_{ii})^{-1}} = -1\) to evaluate the scaled gradient even though the gradient itself is not defined there. In the following proof, we do not treat these boundary points specially when applying Taylor expansions and taking derivatives.
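To make the scaling concrete, the following is a minimal NumPy sketch of the diagonal matrix \(R_n(x)\) described above, assuming x stacks \(\mu \) followed by the columns of L (an assumed stacking convention) and that \(L_{ii} > 0\); the function name is illustrative.

```python
import numpy as np

def scaling_matrix(L, n):
    # Diagonal entries of R_n(x): 1 for the mu block and the off-diagonal entries
    # of L, and 1 / (1 + (n * L_ii)^{-1}) for the diagonal entries of L (L_ii > 0).
    d = L.shape[0]
    diag = np.ones(d * (d + 1))    # d entries for mu, then d*d entries for the columns of L
    for i in range(d):
        idx = d + i * d + i        # position of L[i, i] when column i is stacked after mu
        diag[idx] = 1.0 / (1.0 + 1.0 / (n * L[i, i]))
    return np.diag(diag)
```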
Next, we apply Theorem 17 to carry out the proof. The rest of the proof consists of two parts: showing the confinement result (statement 2 of Theorem 17) and showing the convergence result (statement 3 of Theorem 17). We prove these two results on the event that Eqs. (37) and (38) hold; since the probability of this event converges in \(P_{ \theta _0}\) to 1 as \(n \rightarrow \infty \), the final result holds with the same limiting probability.
We first show the confinement result by analysing \(\varepsilon (x)\), \(\ell ^2(x)\), and \(\sigma ^2(r)\), which are defined in Eqs. (44) and (45). We aim to show that
-
i.
We can find sufficiently small \(\gamma _k > 0\) such that
$$\begin{aligned} \alpha _{k} = 1 + \mathbb {1}\left[ J(x_k) \le r^2\right] (-2\gamma _k\varepsilon (r) + 2\gamma _k^2\ell ^2(r)) \in (0,1] \end{aligned}$$holds for all \(x\in \mathscr {X}\), i.e.
$$\begin{aligned} \forall x\in \mathscr {X}: J(x)\le r^2, \quad 0 \le 2\gamma _k\varepsilon (x) - 2\gamma _k^2\ell ^2(x) \le 1. \nonumber \\ \end{aligned}$$(39) -
ii.
\(\sigma ^2(r) \rightarrow 0\) as \(n \rightarrow \infty \) to guarantee the SGD iterations are eventually locally confined as \(n \rightarrow \infty \) (based on Theorem 17).
To show the statement i., Eq. (39), we start by deriving a lower bound for \(2\gamma _k\varepsilon (x) - 2\gamma _k^2\ell ^2(x)\). Examine the expression,
where \(\left( \int \cdots \right) = \left( \int _0^1 \nabla ^2 \varPhi _n((1-t)x^\star + tx)\mathrm {d}t \right) \). By splitting \(\varPhi _n\) into the regularization \(I_n(x)\) and the expectation \(F_n(x)\); and defining
the above expression can be written as
Note that by Corollary 14 that \(\frac{\varepsilon }{2} D_n \preceq \nabla ^2 F_n(x) \preceq \ell D_n\) and all the V, \(R_n(x)\) are positive diagonal matrices, leading to
Thus, the above expression can be bounded below by
Now, notice that A(x) and \(R_n(x) D_n\) are both diagonal matrices with non-negative entries,
As long as the entries of A(x) and \(R_n(x) D_n\) are bounded above by a constant for all n, there exists a sufficiently small constant c such that for all \(\gamma _k < c\), both quantities in Eq. (40) are non-negative. Given that for all n and all \(x \in \mathscr {X}\),
we obtain the boundedness of the entries of A(x), \(R_n(x) D_n\). Therefore, we conclude that
It remains to show the second inequality of Eq. (39), i.e.
This is true if
Since \(\gamma _k \rightarrow 0\) as \(k \rightarrow \infty \), the above holds if \(\sup _{x\in \mathscr {X}: J(x)\le r^2} \varepsilon (x)\) is bounded above by a constant that is independent of n. Now, we consider the upper bound for \(\sup _{x\in \mathscr {X}: J(x)\le r^2} \varepsilon (x)\). Expanding \(\varepsilon (x)\),
Split \(\varPhi _n\) into the regularization \(I_n(x)\) and the expectation \(F_n(x)\). For the expectation, by Corollary 14 that \(\nabla ^2 F_n(x) \preceq \ell D_n\) and entries of \(R_n(x)\) are bounded by 1, we have
and for the regularization, note that \(\nabla ^2 I_n\) is a diagonal matrix with 0 for \(\mu \) and off-diagonals of L and \( L_{ii}^{-2}/n\) for diagonals of L, so
Since \(\forall i\in [d], L_{ii}^\star > 0\), Eq. (41) is bounded above by a constant C. Using Weyl's inequality to bound the maximal eigenvalue of the sum of two Hermitian matrices, we conclude that
Therefore, we have completed the proof for statement i., Eq. (39).
Next, we show statement ii. by deriving an upper bound on \(\sigma ^2(r)\). Recall that \(\sigma ^2(r)\) is an upper bound on the fourth moment of
Since the regularizer cancels in this expression, we only consider the expectation part. Note that \(VR_n(x)\) is a diagonal matrix with positive diagonal entries,
Let \(Z_1, Z_2 \) be independent copies of Z; by the tower property of conditional expectation,
By the convexity of \(\Vert \cdot \Vert ^4\) and Jensen’s inequality,
By \(\left\| \nabla \tilde{f}_n(x, Z_1) - \nabla \tilde{f}_n(x, Z_2) \right\| \le \left\| \nabla \tilde{f}_n(x, Z_1)\right\| + \left\| \nabla \tilde{f}_n(x, Z_2) \right\| \) and Minkowski's inequality,
Now, we focus on bounding \(\mathbb {E}\left[ \Vert \nabla \tilde{f}_n(x, Z)\Vert ^4 \right] ^{1/4}\). We examine \(\Vert \nabla \tilde{f}_n(x, Z)\Vert \),
yielding
By the Cauchy–Schwarz inequality,
We then bound these two terms on the RHS separately. We use the sub-Gaussian property of \(\left\| \nabla f_n \left( \mu + \frac{1}{\sqrt{n}}L Z \right) \right\| \) to bound its \(8{\text {th}}\) moment. First, notice that \(\left\| \nabla f_n \left( \mu + \frac{1}{\sqrt{n}}L Z \right) \right\| \) is a \(\max _{i \in [d]} L_{ii} \frac{\ell }{\sqrt{n}}\)-Lipschitz function of Z,
Given that \(\nabla ^2 f_n \preceq \ell I\), the above is bounded by
A Lipschitz function of Gaussian noise is sub-Gaussian (Kontorovich 2014, Theorem 1); that is, if \(Z \sim \mathscr {N}(0, I_d)\) and \(\psi : \mathbb {R}^d \rightarrow \mathbb {R}\) is L-Lipschitz, then
Thus, \(\left\| \nabla f_n \left( \mu + \frac{1}{\sqrt{n}}L Z \right) \right\| \) is \(\frac{4 \ell ^2}{n} \max _{i \in [d]} L_{ii}^2\)-sub-Gaussian. Then note that for a \(\sigma ^2\)-sub-Gaussian random variable \(X \in \mathbb {R}\), for any positive integer \(k \ge 2\), \(\mathbb {E}\left[ |X|^k\right] ^{1/k} \le \sigma e^{1/e} \sqrt{k}\). Hence we obtain
Along with the fact that a Gaussian random variable has finite moments of all orders,
for some constant C, we obtain
and hence
Taking the supremum over \(J(x) \le r^2\), the RHS is bounded above by a universal constant; we therefore conclude that
Therefore, with \(\forall x\in \mathscr {X}: J(x)\le r^2, 0 \le 2\gamma _k\varepsilon (x) - 2\gamma _k^2\ell ^2(x) \le 1\) and \(\sigma ^2(r) \rightarrow 0\) as \(n \rightarrow \infty \), applying Theorem 17 yields the confinement result, i.e.
in \(P_{ \theta _0}\) as \(n \rightarrow \infty \).
Lastly, by statement 3 of Theorem 17, we prove the convergence result by checking
We expand the expression in a similar way,
By splitting \(\varPhi _n\) into the regularization and the expectation, we have
and
We then combine Eqs. (42) and (43) and use Weyl’s inequality to bound the minimal eigenvalue of the summation of two Hermitian matrices, yielding
This gives the convergence result.
The proof is then completed by applying Theorem 17. We know that \(\xi _k\) is strictly positive. Since \(\varepsilon (r) > 0\) and \(\ell (r) \) is bounded above, there exists \(\gamma _k = \varTheta (k^{-\rho }), \rho \in (0.5, 1) \) satisfying the conditions of the theorem. We have that \(\sigma \rightarrow 0\), which makes statement 3 of Theorem 17 become
Even though D is a function of n, n is fixed as Algorithm 3 runs. Since D is invertible,
which is exactly our desired result: as the number of data points \(n\rightarrow \infty \), the probability that Algorithm 3 finds the optimum at a rate of \(k^{-\rho '}\) (as we take more iterations, \(k\rightarrow \infty \)) converges to 1. In other words, variational inference is solved asymptotically at a rate of \(k^{-\rho '}\). \(\square \)
Theorem 17
Let \(\mathscr {X}\subseteq \mathbb {R}^p\) be closed and convex, \(g : \mathscr {X}\times \mathscr {Z}\rightarrow \mathbb {R}^p\) be a function, \(G(x) :=\mathbb {E}\left[ g(x,Z)\right] \) for a random element \(Z\in \mathscr {Z}\), \(x^\star \in \mathscr {X}\) be a point in \(\mathscr {X}\) such that \(G(x^\star ) = 0\), \(V \in \mathbb {R}^{p\times p}\) be invertible, \(J(x) :=\Vert V(x - x^\star )\Vert ^2\), and \(r \ge 0\). Consider the projected stochastic iteration
with independent copies \(Z_k\) of Z, \(\gamma _k \ge 0\), and \(\varPi _\mathscr {X}(x) :=\mathop {\mathrm{arg}\,\mathrm{min}}\nolimits _{y\in \mathscr {X}} \Vert V(x - y)\Vert ^2\). If
-
1.
For all \(k\in {\mathbb {N}}\cup \{0\}\), the step sizes satisfy
$$\begin{aligned} \forall x&\in \mathscr {X}: J(x)\le r^2, \quad 0 \le 2\gamma _k\varepsilon (x) - 2\gamma _k^2\ell ^2(x) \le 1\nonumber \\ \varepsilon (x)&:=\frac{1}{J(x)}(x-x^\star )^TV^TV\left( G(x)- G(x^\star )\right) \nonumber \\ \ell ^2(x)&:=\frac{1}{J(x)}\left\| V\left( G(x)-G(x^\star )\right) \right\| ^2, \end{aligned}$$(44) -
2.
For all \(x\in \mathscr {X}\), \(\left( \mathbb {E}\Vert V(g(x,Z) - G(x))\Vert ^4\right) ^{1/4} \le \tilde{\sigma }(x)\) for \(\tilde{\sigma }: \mathscr {X}\rightarrow \mathbb {R}_{\ge 0}\), and
$$\begin{aligned} \sigma (r)&:=\sup _{x\in \mathscr {X}\,:\, J(x) \le r^2} \tilde{\sigma }(x), \end{aligned}$$(45)
then
-
1.
The iterate \(x_k\) is locally confined with high probability:
$$\begin{aligned} \mathbb {P}\left( J(x_k) \le r^2\right)&\,\ge \frac{\xi ^2_k}{\xi _k^2 + 8\sigma (r)^2\zeta _k}\\ \xi _{k}(r)&:=\max \{0, r^2 - J(x_0) - 2\sigma ^2(r)\sum _{j<k}\gamma ^2_j\}\\ \zeta _{k}(r)&:=r^2\sum _{j<k}\gamma _j^2 + \sigma ^2(r)\sum _{j<k}\gamma _j^4. \end{aligned}$$ -
2.
The iterate \(x_k\) stays locally confined for all \(k\in {\mathbb {N}}\) with high probability:
$$\begin{aligned} \mathbb {P}\left( \sup _{k\in {\mathbb {N}}} J(x_k) \le r^2\right)&\,\ge \frac{\xi ^2}{\xi ^2 + 8\sigma ^2(r)\zeta }\\ \xi (r)&:=\lim _{k\rightarrow \infty }\xi _{k}(r) \quad \zeta (r) :=\lim _{k\rightarrow \infty } \zeta _{k}(r). \end{aligned}$$ -
3.
If additionally
$$\begin{aligned} \inf _{x\in \mathscr {X}: J(x) \le r^2} \varepsilon (x) > 0 \quad \text {and}\quad \gamma _k = \varTheta (k^{-\rho }), \,\, \rho \in (0.5, 1], \end{aligned}$$the iterate \(x_k\) converges to \(x^\star \) with high probability:
$$\begin{aligned} \lim _{k \rightarrow \infty } \mathbb {P}\left( J(x_k) \le k^{-\rho '} \right)&\ge \mathbb {P}\left( \sup _{k\in {\mathbb {N}}} J(x_k) \le r^2\right) , \\&\quad \forall \rho ' \in (0, \rho - 0.5). \end{aligned}$$
Proof
To begin, we show \(\varPi _\mathscr {X}\) is non-expansive,
For all \(x, y \in \mathbb {R}^p\), define \(\langle x, y\rangle _V = x^T V^TV y\). Since V is invertible, \(V^TV\) is symmetric and positive definite, and hence \((\mathbb {R}^p, \langle \cdot , \cdot \rangle _V)\) forms a Hilbert space. Any projection operator of a Hilbert space is non-expansive (Bauschke and Combettes 2011, Proposition 4.4).
Note that \(x^\star = \varPi _{\mathscr {X}}(x^\star )\) and the projection operation is non-expansive, expanding the squared norm yields
Adding and subtracting \(G(x_k)\) in the second and third terms, using the elementary bound \(\left\| a+b\right\| ^2 \le 2\Vert a\Vert ^2 + 2\Vert b\Vert ^2\), and defining
we have that
We now define the filtration of \(\sigma \)-algebras
and the stopped process for \(r > 0\),
Note that \(Y_k\) is \(\mathscr {F}_k\)-measurable, and that \(Y_{k}\) “freezes in place” if \(J(x_k)\) ever jumps larger than \(r^2\); so for all \(t^2 \le r^2\),
Therefore, if we obtain a tail bound on \(Y_k\), it provides the same bound on \(J(x_k)\). Now substituting the stopped process into the original recursion and collecting terms,
Using the notation of Lemma 18, set
By the fourth moment assumption, \(\beta _k\) has variance bounded above by \(\tau _k^2\) conditioned on \(\mathscr {F}_k\), where
Therefore, using the descent Lemma 18,
yielding the first result. Now, since \(Y_{k+1} \le r^2 \implies Y_{k} \le r^2\) for all \(k\ge 0\), the sequence of events \(\left\{ Y_k \le r^2\right\} \) is decreasing. Therefore, the second result follows from
where \(\xi :=\lim _{k\rightarrow \infty }\xi _{k}\) and \(\zeta :=\lim _{k\rightarrow \infty }\zeta _{k}\). Finally, we analyse the conditional tail distribution of \(Y_k\) given that it stays confined, i.e. \(\forall k\ge 0\), \(Y_k\le r^2\). In the notation of Lemma 18, redefine
i.e. \(\bar{\varvec{\alpha }}_{k}\) is the largest possible value of \(\alpha _{k}\) when \(Y_k \le r^2\). So again applying Lemma 18,
Here, \(t_k = \varTheta (k^{-\rho '}), \rho ' \in (0, \rho - 0.5)\) is a decreasing sequence, whose shrinking rate—such that Eq. (46) still converges to 0—will determine the convergence rate of \(Y_k\).
To understand the rate of Eq. (46), the key is to characterize the order of \(\prod _{j< k}\bar{\varvec{\alpha }}_{j}\) and \(\sum _{i = 0}^{k}\gamma _i^2 \prod _{j = i+1}^k \bar{\varvec{\alpha }}_j^2\). Since \(\gamma _k = \varTheta (k^{-\rho })\), \(\rho \in (0.5, 1]\), and \(\varepsilon ' :=\inf _{x\in \mathscr {X}: J(x)\le r^2} \varepsilon (x) > 0\), we know that
for some \(c > 0\), yielding that \(\prod _{j<k}\bar{\varvec{\alpha }}_{j}= \varTheta \left( \exp \left( -Ck^{1-\rho } \right) \right) \) for some \(C > 0\). For the second term, since \(\bar{\varvec{\alpha }}\in (0,1)\),
Similarly, \(\sum _{i = 0}^{k}\gamma _i^2 \prod _{j = i+1}^k \bar{\varvec{\alpha }}_j = \varTheta (k^{1 - 2\rho })\) (see footnote 5). Therefore,
Combined with the fact that \(t_k = \varTheta (k^{- \rho '}), \rho ' \in (0, \rho - 0.5)\), this implies that Eq. (46) is o(1) and hence for all \(\varepsilon > 0, \rho ' \in (0, \rho - 0.5)\),
Therefore, \(\forall \rho ' \in (0, \rho - 0.5)\),
and the result follows. \(\square \)
Lemma 18
(Descent) Suppose we are given a filtration \(\mathscr {F}_k \subseteq \mathscr {F}_{k+1}\), \(k\ge 0\). Let
where \(Y_k \ge 0\) and \(0 \le \alpha _k\le \bar{\varvec{\alpha }}_k\) are \(\mathscr {F}_k\)-measurable, \(\beta _k\) is \(\mathscr {F}_{k+1}\)-measurable and has mean 0 and variance conditioned on \(\mathscr {F}_k\) bounded above by \(\tau _k^2\), and \(Y_0, \tau ^2_k, c_k, \bar{\varvec{\alpha }}_k \ge 0\) are \(\mathscr {F}_0\) measurable. Then,
Proof
Solving the recursion,
So
By Cantelli’s inequality and the fact that the \(i\text {th}\) term in the sum is \(\mathscr {F}_{i+1}\)-measurable,
\(\square \)