The computational asymptotics of Gaussian variational inference and the Laplace approximation

Statistics and Computing

Abstract

Gaussian variational inference and the Laplace approximation are popular alternatives to Markov chain Monte Carlo that formulate Bayesian posterior inference as an optimization problem, enabling the use of simple and scalable stochastic optimization algorithms. However, a key limitation of both methods is that the solution to the optimization problem is typically not tractable to compute; even in simple settings, the problem is nonconvex. Thus, recently developed statistical guarantees—which all involve the (data) asymptotic properties of the global optimum—are not reliably obtained in practice. In this work, we provide two major contributions: a theoretical analysis of the asymptotic convexity properties of variational inference with a Gaussian family and the maximum a posteriori (MAP) problem required by the Laplace approximation, and two algorithms—consistent Laplace approximation (CLA) and consistent stochastic variational inference (CSVI)—that exploit these properties to find the optimal approximation in the asymptotic regime. Both CLA and CSVI involve a tractable initialization procedure that finds the local basin of the optimum, and CSVI further includes a scaled gradient descent algorithm that provably stays locally confined to that basin. Experiments on nonconvex synthetic and real-data examples show that compared with standard variational and Laplace approximations, both CSVI and CLA improve the likelihood of obtaining the global optimum of their respective optimization problems.

Notes

  1. There are many other properties one might require of a tractable optimization problem, e.g. pseudoconvexity (Crouzeix and Ferland 1982), quasiconvexity (Arrow and Enthoven 1961), or invexity (Ben-Israel and Mond 1986). We focus on convexity as it does not impose overly stringent assumptions on our theory and has stronger implications than each of the aforementioned conditions.

  2. Code for the experiments is available at https://github.com/zuhengxu/Consistent-Stochastic-Variational-Inference.

  3. Available online at http://www.stat.cmu.edu/~ryantibs/statcomp/data/pros.dat.

  4. This dataset contains the measurements of the redshifts for 4215 galaxies in the Shapley concentration regions and was generously made available by Michael Drinkwater, University of Queensland; it can be downloaded from https://astrostatistics.psu.edu/datasets/Shapley_galaxy.html.

  5. Although the bound in Eq. (47) is loose, the order described by the bound is actually tight. A more detailed analysis can be obtained by approximating the summation with an integral, which yields the same order.

References

  • Addis, B., Locatelli, M., Schoen, F.: Local optima smoothing for global optimization. Optim. Methods Softw. 20(4–5), 417–437 (2005)

  • Alquier, P., Ridgway, J.: Concentration of tempered posteriors and of their variational approximations. Ann. Stat. 48(3), 1475–1497 (2020)

  • Armijo, L.: Minimization of functions having Lipschitz continuous first partial derivatives. Pac. J. Math. 16(1), 1–3 (1966)

  • Arrow, K., Enthoven, A.: Quasi-concave programming. Econom. J. Econom. Soc. 29, 779–800 (1961)

  • Balakrishnan, S., Wainwright, M., Yu, B.: Statistical guarantees for the EM algorithm: from population to sample-based analysis. Ann. Stat. 45(1), 77–120 (2017)

  • Barber, R.F., Drton, M., Tan, K.M.: Laplace approximation in high-dimensional Bayesian regression. In: Barber, R.F., Drton, M., Tan, K.M. (eds.) Statistical Analysis for High-Dimensional Data, pp. 15–36. Springer, Berlin (2016)

  • Bassett, R., Deride, J.: Maximum a posteriori estimators as a limit of Bayes estimators. Math. Program. 174(1–2), 129–144 (2019)

  • Bauschke, H., Combettes, P.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, Berlin (2011)

  • Ben-Israel, A., Mond, B.: What is invexity? ANZIAM J. 28(1), 1–9 (1986)

  • Bertsekas, D., Tsitsiklis, J.: Gradient convergence in gradient methods with errors. SIAM J. Optim. 10(3), 627–642 (2000)

  • Bishop, C., Nasrabadi, N.: Pattern Recognition and Machine Learning. Springer, Berlin (2006)

  • Blei, D., Kucukelbir, A., McAuliffe, J.: Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112(518), 859–877 (2017)

  • Bottou, L.: Stochastic learning. In: Bousquet, O., von Luxburg, U., Rätsch, G. (eds.) Advanced Lectures on Machine Learning: ML Summer Schools 2003, pp. 146–168. Springer, Berlin (2004)

  • Boucheron, S., Lugosi, G., Massart, P.: Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford (2013)

  • Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

  • Bubeck, S.: Convex Optimization: Algorithms and Complexity. Now Publishers Inc, Delft (2015)

  • Campbell, T., Li, X.: Universal boosting variational inference. In: Advances in Neural Information Processing Systems (2019)

  • Craven, B., Glover, B.: Invex functions and duality. J. Austral. Math. Soc. 39(1), 1–20 (1985)

  • Crouzeix, J.P., Ferland, J.: Criteria for quasi-convexity and pseudo-convexity: relationships and comparisons. Math. Program. 23(1), 193–205 (1982)

  • Csiszár, I.: Information-type measures of difference of probability distributions and indirect observation. Studia Sci. Math. Hung. 2, 229–318 (1967)

  • Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(7), 2121–2159 (2011)

  • Folland, G.: Real Analysis: Modern Techniques and Their Applications. Wiley, New York (1999)

  • Forsyth, D., Ponce, J.: Computer Vision: A Modern Approach. Prentice Hall Professional Technical Reference, London (2002)

  • Gelfand, A., Smith, A.: Sampling-based approaches to calculating marginal densities. J. Am. Stat. Assoc. 85(410), 398–409 (1990)

  • George, E., McCulloch, R.: Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 88(423), 881–889 (1993)

  • Ghosal, S., Ghosh, J., van der Vaart, A.: Convergence rates of posterior distributions. Ann. Stat. 28(2), 500–531 (2000)

  • Gibbs, A., Su, F.E.: On choosing and bounding probability metrics. Int. Stat. Rev. 70(3), 419–435 (2002)

  • Grendár, M., Judge, G.: Asymptotic equivalence of empirical likelihood and Bayesian MAP. Ann. Stat. 37, 2445–2457 (2009)

  • Guo, F., Wang, X., Fan, K., Broderick, T., Dunson, D.: Boosting variational inference. In: Advances in Neural Information Processing Systems (2016)

  • Haddad, R., Akansu, A.: A class of fast Gaussian binomial filters for speech and image processing. IEEE Trans. Signal Process. 39(3), 723–727 (1991)

  • Hall, P., Pham, T., Wand, M., Wang, S.S.: Asymptotic normality and valid inference for Gaussian variational approximation. Ann. Stat. 39(5), 2502–2532 (2011)

  • Han, W., Yang, Y.: Statistical inference in mean-field variational Bayes. arXiv:1911.01525 (2019)

  • Hastings, W.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1), 97–109 (1970)

  • Hoffman, M., Blei, D., Wang, C., Paisley, J.: Stochastic variational inference. J. Mach. Learn. Res. 14(1), 1303–1347 (2013)

  • Huggins, J., Kasprzak, M., Campbell, T., Broderick, T.: Validated variational inference via practical posterior error bounds. In: International Conference on Artificial Intelligence and Statistics (2020)

  • Ionides, E.: Truncated importance sampling. J. Comput. Graph. Stat. 17(2), 295–311 (2008)

  • Jaiswal, P., Rao, V., Honnappa, H.: Asymptotic consistency of \(\alpha \)-Rényi-approximate posteriors. arXiv:1902.01902 (2019)

  • Jennrich, R.: Asymptotic properties of non-linear least squares estimators. Ann. Math. Stat. 40(2), 633–643 (1969)

  • Jordan, M., Ghahramani, Z., Jaakkola, T., Saul, L.: An introduction to variational methods for graphical models. In: Learning in Graphical Models, pp. 105–161. Springer, Berlin (1998)

  • Kass, R.: The validity of posterior expansions based on Laplace’s method. In: Geisser, S., Hodges, J.S. (eds.) Bayesian and Likelihood Methods in Statistics and Econometrics, pp. 473–487. Elsevier, Amsterdam (1990)

  • Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)

  • Kingma, D., Welling, M.: Auto-encoding variational Bayes. In: International Conference on Learning Representations (2014)

  • Kleijn, B.: Bayesian asymptotics under misspecification. PhD thesis, Vrije Universiteit Amsterdam (2004)

  • Kleijn, B., van der Vaart, A.: The Bernstein–von-Mises theorem under misspecification. Electron. J. Stat. 6, 354–381 (2012)

  • Kontorovich, A.: Concentration in unbounded metric spaces and algorithmic stability. In: International Conference on Machine Learning (2014)

  • Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., Blei, D.: Automatic differentiation variational inference. J. Mach. Learn. Res. 18(1), 430–474 (2017)

  • LeCam, L.: Locally Asymptotically Normal Families of Distributions. University of California Press, Berkeley (1960)

  • Liese, F., Vajda, I.: Convex Statistical Distances. Teubner, Stuttgart (1987)

  • Lindeberg, T.: Scale-space for discrete signals. IEEE Trans. Pattern Anal. Mach. Intell. 12(3), 234–254 (1990)

  • Locatello, F., Dresdner, G., Khanna, R., Valera, I., Rätsch, G.: Boosting black box variational inference. In: Advances in Neural Information Processing Systems (2018)

  • Meyn, S., Tweedie, R.: Markov Chains and Stochastic Stability. Springer, Berlin (2012)

  • Miller, J.: Asymptotic normality, concentration, and coverage of generalized posteriors. J. Mach. Learn. Res. 22(168), 1–53 (2021)

  • Miller, A., Foti, N., Adams, R.: Variational boosting: iteratively refining posterior approximations. In: International Conference on Machine Learning (2017)

  • Mobahi, H.: Optimization by Gaussian smoothing with application to geometric alignment. PhD thesis, University of Illinois at Urbana-Champaign (2013)

  • Murphy, K.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2012)

  • Nesterov, Y.: A method of solving a convex programming problem with convergence rate \({O}(k^2)\). In: Doklady Akademii Nauk (1983)

  • Nixon, M., Aguado, A.: Feature Extraction and Image Processing for Computer Vision. Academic Press, London (2012)

  • Pinheiro, J., Bates, D.: Unconstrained parametrizations for variance-covariance matrices. Stat. Comput. 6, 289–296 (1996)

  • Qiao, Y., Minematsu, N.: A study on invariance of \( f \)-divergence and its application to speech recognition. IEEE Trans. Signal Process. 58(7), 3884–3890 (2010)

  • Rakhlin, A., Shamir, O., Sridharan, K.: Making gradient descent optimal for strongly convex stochastic optimization. In: International Coference on International Conference on Machine Learning (2012)

  • Ranganath, R.: Black box variational inference. In: Advances in Neural Information Processing Systems (2014)

  • Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

  • Robert, C., Casella, G.: Monte Carlo Statistical Methods. Springer, Berlin (2013)

  • Roberts, G., Rosenthal, J.: General state space Markov chains and MCMC algorithms. Probab. Surv. 1, 20–71 (2004)

  • Ryu, E., Boyd, S.: Primer on monotone operator methods. Appl. Comput. Math. 15(1), 3–43 (2016)

  • Schatzman, M.: Numerical Analysis: A Mathematical Introduction. Clarendon Press, translation: John Taylor (2002)

  • Schillings, C., Sprungk, B., Wacker, P.: On the convergence of the Laplace approximation and noise-level-robustness of Laplace-based Monte Carlo methods for Bayesian inverse problems. Numer. Math. 145(4), 915–971 (2020)

  • Shen, X., Wasserman, L.: Rates of convergence of posterior distributions. Ann. Stat. 29(3), 687–714 (2001)

  • Shun, Z., McCullagh, P.: Laplace approximation of high dimensional integrals. J. R. Stat. Soc. Ser. B Methodol. 57(4), 749–760 (1995)

  • Stein, C.: A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In: Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (1972)

  • van der Vaart, A.: Asymptotic Statistics. Cambridge University Press, Cambridge (2000)

  • van der Vaart, A., Wellner, J.: Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, Berlin (2013)

  • Van Erven, T., Harremos, P.: Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inf. Theory 60(7), 3797–3820 (2014)

  • Vehtari, A., Simpson, D., Gelman, A., Yao, Y., Gabry, J.: Pareto smoothed importance sampling. arXiv:1507.02646 (2015)

  • Wainwright, M., Jordan, M.: Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc, Delft (2008)

  • Wang, Y., Blei, D.: Frequentist consistency of variational Bayes. J. Am. Stat. Assoc. 114(527), 1147–1161 (2019)

  • Xing, E., Jordan, M., Russell, S.: A generalized mean field algorithm for variational inference in exponential families. In: Kanal, L.N., Lemmer, J.F. (eds.) Uncertainty in Artificial Intelligence. Elsevier, Amsterdam (2002)

  • Yang, Y., Pati, D., Bhattacharya, A.: \(\alpha \)-variational inference with statistical guarantees. Ann. Stat. 48(2), 886–905 (2020)

  • Zhang, F., Gao, C.: Convergence rates of variational posterior distributions. Ann. Stat. 48(4), 2180–2207 (2020)


Acknowledgements

The authors gratefully acknowledge the support of a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant (Grant No. RGPIN-2019-03962) and Discovery Launch Supplement, and a UBC Four Year Doctoral Fellowship.

Author information

Corresponding author

Correspondence to Trevor Campbell.


Appendices

A Details of experiments

A.1 Details of the toy example for smoothed MAP

The underlying synthetic model for Fig. 1 is as follows:

$$\begin{aligned} \theta\sim & {} \frac{1}{5}\mathscr {N}(0, 0.15^2)+ \frac{1}{5}\mathscr {N}(1, 0.1^2) +\frac{1}{5}\mathscr {N}(-4, 0.3^2) \\&+ \frac{1}{5}\mathscr {N}(4, 0.3^2) + \frac{1}{5}\mathscr {N}(-8, 0.1^2) \\ X_i \,|\,\theta&\overset{{\text {i.i.d.}}}{\sim }&\mathscr {N}(\theta , 5000), \end{aligned}$$

where the data are truly generated from \((X_i)_{i=1}^n \overset{{\text {i.i.d.}}}{\sim }\mathscr {N}(3, 10)\). For the smoothed MAP optimization, we use a smoothing constant of \(\alpha _n = 10 n^{-0.3}\) and draw the initial value uniformly from \((-50, 50)\). The learning rate for SGD is chosen as \(\gamma _k = 15/(1+ k^{0.9})\).
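A minimal NumPy sketch of this setup is given below. It is illustrative only: it reads the second argument of \(\mathscr {N}\) as a variance, uses an arbitrary sample size of \(n = 100\), and only constructs the unnormalized log posterior and the schedules above; the smoothed-MAP gradient estimator itself is discussed in Appendix B.7.1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture prior: equal weights, with the means and standard deviations above.
PRIOR_MEANS = np.array([0.0, 1.0, -4.0, 4.0, -8.0])
PRIOR_SDS = np.array([0.15, 0.1, 0.3, 0.3, 0.1])
LIK_VAR = 5000.0  # observation variance

def log_gauss(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_prior(theta):
    comps = log_gauss(theta, PRIOR_MEANS, PRIOR_SDS ** 2)
    return np.logaddexp.reduce(comps) - np.log(len(PRIOR_MEANS))

def log_posterior(theta, x):
    # Unnormalized log posterior for a scalar theta and data array x.
    return log_prior(theta) + np.sum(log_gauss(x, theta, LIK_VAR))

# Data truly generated from N(3, 10) (variance 10); n is illustrative.
n = 100
x = rng.normal(3.0, np.sqrt(10.0), size=n)

alpha_n = 10 * n ** (-0.3)               # smoothing constant
gamma = lambda k: 15.0 / (1 + k ** 0.9)  # SGD learning rate schedule
theta_init = rng.uniform(-50, 50)        # uniform initialization
```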

A.2 Algorithm: CSVI (Adam)

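As a rough illustration only (not the paper's exact pseudocode), the sketch below pairs a single-sample reparameterized gradient of the negative ELBO for a Gaussian \(\mathscr {N}(\mu , LL^T)\), with L lower triangular and positive on the diagonal, with standard Adam updates; in CSVI, \(\mu _0\) would come from the smoothed MAP estimate and \(L_0\) would be set as in the experiment sections. All names, hyperparameters, and the projection step are illustrative assumptions.

```python
import numpy as np

def csvi_adam(grad_log_post, mu0, L0, iters=5000, lr=0.01,
              beta1=0.9, beta2=0.999, eps=1e-8, diag_floor=1e-6, seed=0):
    """Gaussian VI via reparameterized stochastic gradients with Adam updates.

    grad_log_post: gradient of the log target density at a point.
    mu0, L0: initialization (e.g. mu0 from smoothed MAP, L0 the identity).
    """
    rng = np.random.default_rng(seed)
    d = mu0.size
    mu, L = mu0.astype(float), L0.astype(float)
    m = np.zeros(d + d * d)   # Adam first-moment buffer
    v = np.zeros(d + d * d)   # Adam second-moment buffer
    lower = np.tril(np.ones((d, d), dtype=bool))

    for t in range(1, iters + 1):
        z = rng.standard_normal(d)
        theta = mu + L @ z
        g = grad_log_post(theta)

        # Single-sample gradient of the negative ELBO
        #   -E[log pi(mu + L z)] - log|det L| + const.
        g_mu = -g
        g_L = -np.outer(g, z)
        g_L[np.diag_indices(d)] -= 1.0 / np.diag(L)  # entropy term
        g_L[~lower] = 0.0                            # L stays lower triangular

        grad = np.concatenate([g_mu, g_L.ravel()])
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        step = lr * (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + eps)

        mu -= step[:d]
        L -= step[d:].reshape(d, d)
        # Project the diagonal back to positive values.
        np.fill_diagonal(L, np.maximum(np.diag(L), diag_floor))

    return mu, L

# Example: fit a Gaussian to a standard normal target (grad log density = -theta).
mu, L = csvi_adam(lambda th: -th, mu0=np.zeros(2), L0=np.eye(2))
```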

A.3 Discussion of sparse regression experiment

In this section, we provide further discussion of the results presented in Fig. 11. Figures 14 and 15 visualize the Gaussian approximations produced by CLA and CSVI. Instead of fitting a single mode, CSVI covers the full range of the posterior by fitting a Gaussian with larger variance. Although the performance of CSVI is consistent across runs, it finds a local optimum rather than the global solution. In this setting, the reverse KL, the objective function of Gaussian VI, can be limiting. We therefore compare the forward KL of the fitted Gaussian distributions using 32,000 posterior samples obtained from Stan (Fig. 16); the comparison suggests that CSVI finds a solution that is better in forward KL.
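As a sketch of this comparison (assuming posterior draws and the fitted Gaussian parameters are available as arrays), different Gaussian fits q can be ranked by forward KL via a Monte Carlo estimate of the cross-entropy \(\mathbb {E}_\pi [-\log q(\theta )]\), which differs from \(\mathrm {KL}(\pi \Vert q)\) only by the entropy of \(\pi \), a constant that does not depend on q.

```python
import numpy as np
from scipy.stats import multivariate_normal

def forward_kl_up_to_const(post_samples, mu, Sigma):
    """Monte Carlo estimate of E_pi[-log q] for q = N(mu, Sigma).

    Equals KL(pi || q) up to an additive constant (the entropy of pi) that is
    the same for every q, so it can be used to compare Gaussian fits.
    """
    return -multivariate_normal(mean=mu, cov=Sigma).logpdf(post_samples).mean()

# Illustrative usage with stand-in draws; in the experiment these would be
# the 32,000 posterior samples obtained from Stan.
rng = np.random.default_rng(1)
post_samples = rng.standard_normal((32000, 2))
score_cla = forward_kl_up_to_const(post_samples, np.zeros(2), np.eye(2))
score_csvi = forward_kl_up_to_const(post_samples, np.zeros(2), 2 * np.eye(2))
```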

Fig. 14

Visualization of different CLA approximations to the posterior distribution on the real dataset. The grey area depicts the posterior density, and the four Gaussian approximations are shown as contour plots with their corresponding ELBOs

Fig. 15

Visualization of different CSVI approximations to the posterior distribution on the real dataset. The grey area depicts the posterior density, and the four Gaussian approximations are shown as contour plots with their corresponding ELBOs

Fig. 16

Comparison of the forward KL for the sparse regression experiment

A.4 Details of the GMM experiment

A.4.1 Variable transformations of Bayesian Gaussian mixture model

To transform \((\theta _1, \dots , \theta _K) \in \varDelta (K)\) and \(\sigma _{kd}\in \mathbb {R}_+\) into an unconstrained space, we use the following changes of variables:

  1.

    For \(\sigma _{kd}\sim {\mathsf{LogNormal}}(0,1)\), we consider

    $$\begin{aligned} \tau _{kd} = \log (\sigma _{kd}) \sim \mathscr {N}(0,1), \end{aligned}$$

    which has full support on \(\mathbb {R}\).

  2.

    For \(\theta \sim {\mathsf{Dir}}(\alpha _0)\), we use marginalized LogGamma random variables. Note the following relationship between the Gamma and Dirichlet distributions:

    $$\begin{aligned}&\left( \lambda _k\right) _{k = 1}^K \overset{{\text {i.i.d.}}}{\sim }{\mathsf{LogGamma}}(\alpha _0, 1)\\&\quad \left( \frac{\exp (\lambda _1)}{\sum _k \exp (\lambda _k)}, \dots , \frac{\exp (\lambda _K)}{\sum _k \exp (\lambda _k)}\right) \sim {\mathsf{Dir}}(\alpha _0), \end{aligned}$$

    so each \(\lambda _k\) is supported on \(\mathbb {R}\).

Therefore, instead of inferring the original parameters, we perform Gaussian variational inference on the posterior distribution of \((\lambda _{1:k}, \mu _{1:k}, \tau _{11:kd})\).
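A small NumPy sketch of these transformations (sampling in the unconstrained space and mapping back to the constrained parameters) is given below; the values of K and \(\alpha _0\) are taken from the experiment settings and the rest is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K, alpha0 = 3, 1.0

# LogNormal(0, 1) scale parameters: tau = log(sigma) ~ N(0, 1) is unconstrained.
tau = rng.standard_normal(K)
sigma = np.exp(tau)

# Dirichlet(alpha0) weights via marginal LogGamma variables:
# lambda_k = log(G_k) with G_k ~ Gamma(alpha0, 1); softmax(lambda) ~ Dir(alpha0).
lam = np.log(rng.gamma(alpha0, 1.0, size=K))
theta = np.exp(lam - lam.max())
theta = theta / theta.sum()   # theta ~ Dirichlet(alpha0 * ones(K))
```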

A.4.2 Detailed settings for real data experiment

We subsample the Shapley galaxy dataset to \(N = 500\); the goal is to cluster the distribution of galaxies in the Shapley concentration region. In our experiment, we fix the number of components to \(K = 3\) and set \(\alpha _0 = 1\). During inference, we initialize SMAP with random samples from the prior distribution, which are also used as the mean initialization for SVI. In SMAP, we apply the smoothed MAP estimation to a tempered posterior distribution \(\pi _n^\kappa \) with \(\kappa = 1/2\) and set the smoothing constant \(\alpha _n = 3\). The learning rates for the SMAP and VI algorithms are chosen as 0.005 and 0.001, respectively. As in the synthetic experiment, CSVI and SVI_Ind use the identity matrix for \(L_0\), while SVI uses a random diagonal matrix for \(L_0\) whose log-diagonal entries are drawn uniformly from \((\log 0.1, \log 100)\).
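The initialization and tempering choices above can be sketched as follows; the dimension D and the log-posterior placeholder are illustrative assumptions rather than the experiment code.

```python
import numpy as np

rng = np.random.default_rng(0)
kappa = 0.5              # tempering exponent for the smoothed-MAP target
alpha_n = 3.0            # smoothing constant
lr_smap, lr_vi = 0.005, 0.001
D = 9                    # e.g. K = 3 components with (lambda_k, mu_k, tau_k) each

def log_tempered_post(theta, log_post):
    # pi_n^kappa has log density kappa * log pi_n(theta), up to a constant.
    return kappa * log_post(theta)

# CSVI / SVI_Ind: identity Cholesky factor initialization.
L0_identity = np.eye(D)

# SVI: random diagonal L0 with log-diagonal entries uniform on (log 0.1, log 100).
L0_random = np.diag(np.exp(rng.uniform(np.log(0.1), np.log(100.0), size=D)))
```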

B Proofs

B.1 Proof of Theorem 6

Proof

We consider the KL cost for the scaled and shifted posterior distribution. Let \(\tilde{\varPi }_n\) be the Bayesian posterior distribution of \(\sqrt{n} (\theta - \theta _0)\). The KL divergence measures the difference between the distributions of two random variables and is invariant when an invertible transformation is applied to both random variables (Qiao and Minematsu 2010, Theorem 1). Note that \(\tilde{\varPi }_n\) is shifted and scaled from \(\varPi _n\), and that this linear transformation is invertible, so

$$\begin{aligned} \mathrm {KL}\left( \mathscr {N}\left( \mu , n^{-1}LL^T\right) \,\Vert \, \varPi _n\right) = \mathrm {KL}\left( \mathscr {N}\left( \sqrt{n}(\mu - \theta _0), LL^T\right) \,\Vert \, \tilde{\varPi }_n\right) . \end{aligned}$$

Let \(\tilde{\mu }_n^\star , \tilde{\varSigma }_n^\star \) be the parameters of the optimal Gaussian variational approximation to \(\tilde{\varPi }_n\), i.e.

$$\begin{aligned} \tilde{\mu }_n^\star , \tilde{\varSigma }_n^\star = \mathop {\mathrm {arg\,min}}\limits _{\mu \in \mathbb {R}^d,\, \varSigma \succ 0} \mathrm {KL}\left( \mathscr {N}\left( \mu , \varSigma \right) \,\Vert \, \tilde{\varPi }_n\right) , \end{aligned}$$

and let

$$\begin{aligned} \tilde{\mathscr {N}_n}^\star :=\mathscr {N}\left( \tilde{\mu }_n^\star , \tilde{\varSigma }_n^\star \right) = \mathscr {N}\left( \sqrt{n}(\mu _n^\star - \theta _0), L_n^\star L_n^{\star T} \right) . \end{aligned}$$

Wang and Blei (2019, Corollary 7) shows that under Assumption 1,

$$\begin{aligned} \left\| \tilde{\mathscr {N}}_n^\star - \mathscr {N}\left( \varDelta _{n, \theta _{0}}, H_{\theta _0}^{-1}\right) \right\| _{\mathrm {TV}} {\mathop {\rightarrow }\limits ^{P_{\theta _0}}} 0. \end{aligned}$$

Convergence in total variation implies weak convergence, which then implies pointwise convergence of the characteristic function. Let \(\phi ^\star _n(t)\) and \(\phi _n(t)\) denote the characteristic functions of \(\tilde{\mathscr {N}}_n^\star \) and \(\mathscr {N}\left( \varDelta _{n, \theta _{0}}, H_{\theta _0}^{-1}\right) \), respectively. Therefore

$$\begin{aligned} \forall t\in \mathbb {R}^d, \; \frac{\phi ^\star _n(t)}{\phi _n(t)}= & {} \exp \!\left( i ( \sqrt{n}(\mu ^\star _n - \theta _0)-\varDelta _{n, \theta _{0}})^T t\right. \\&\left. - \frac{1}{2} t^T \left( L_n^\star L_n^{\star T} - H_{\theta _0}^{-1} \right) t \right) \\&{\mathop {\longrightarrow }\limits ^{P_{\theta _0}}}&1, \end{aligned}$$

which implies

$$\begin{aligned} \mu _n^\star {\mathop {\rightarrow }\limits ^{P_{\theta _0}}} \frac{1}{\sqrt{n}}\varDelta _{n, \theta _{0}} + \theta _0, \quad \text {and}\quad L_n^\star L_n^{\star T} {\mathop {\rightarrow }\limits ^{P_{\theta _0}}} H^{-1}_{0} = L_0 L_0^T. \end{aligned}$$

Under Assumption 1, van der Vaart (2000, Theorem 8.14) states that

$$\begin{aligned} \Vert \varDelta _{n, \theta _{0}} - \sqrt{n}( \theta _{\text {MLE},n} - \theta _0)\Vert {\mathop { \rightarrow }\limits ^{P_{\theta _0}}} 0, \end{aligned}$$

and \(\theta _{\text {MLE},n} \overset{P_{\theta _0}}{\rightarrow } \theta _0\) according to Eq. (9), yielding \(\mu _n^\star {\mathop { \rightarrow }\limits ^{P_{\theta _0}}} \theta _0\).

Finally, since the Cholesky decomposition defines a continuous mapping from the set of positive definite Hermitian matrices to the set of lower triangular matrices with positive diagonals (both sets are equipped with the spectral norm) (Schatzman 2002, p. 295), we have

$$\begin{aligned} L_n^\star {\mathop { \rightarrow }\limits ^{P_{\theta _0}}} L_0. \end{aligned}$$

\(\square \)

B.2 Proof of Theorem 11

Proof

We provide a proof of the result for strong convexity; the result for Lipschitz smoothness follows the exact same proof technique. Note that if \(D'\) does not depend on x, \(F_n(x)\) is \(D'\)-strongly convex if and only if \(F_n(x) - \frac{1}{2}x^TD'x\) is convex. We use this equivalent characterization of strong convexity in this proof.

Note that for \(Z\sim \mathscr {N}(0, I)\),

$$\begin{aligned}&\mathbb {E}\left[ \frac{1}{2}(\mu +n^{-1/2}LZ)^TD(\mu +n^{-1/2}LZ)\right] \\&\quad =\frac{1}{2}\mu ^TD\mu + \frac{1}{2}{\text {tr}}L^T(n^{-1}D)L. \end{aligned}$$
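Indeed, the cross term vanishes because \(\mathbb {E}[Z] = 0\), and since \(\mathbb {E}[ZZ^T] = I\),

$$\begin{aligned} \mathbb {E}\left[ Z^T L^T D L Z \right] = {\text {tr}}\left( L^T D L\, \mathbb {E}\left[ ZZ^T\right] \right) = {\text {tr}}\left( L^T D L\right) . \end{aligned}$$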

Fix \(\lambda \in [0,1]\), vectors \(\mu , \mu ' \in \mathbb {R}^d\), and positive-diagonal lower triangular matrices \(L, L'\in \mathbb {R}^{d\times d}\), and define the vectors \(x, x'\in \mathbb {R}^{(d+1)d}\) by stacking \(\mu \) with the columns of L, and likewise \(\mu '\) with the columns of \(L'\). Define \(x(\lambda ) = \lambda x + (1-\lambda )x'\), \(\mu (\lambda ) = \lambda \mu + (1-\lambda )\mu '\), and \(L(\lambda ) = \lambda L + (1-\lambda ) L'\). Then,

$$\begin{aligned}&{}F_n(x(\lambda )) - \frac{1}{2}x(\lambda )^T{\text {diag}}(D, n^{-1}D, \dots , n^{-1}D)x(\lambda ) \\= & {} F_n(\mu (\lambda ), L(\lambda )) \\&-\left( \frac{1}{2}\mu (\lambda )^TD\mu (\lambda ) + \frac{1}{2}{\text {tr}}L(\lambda )^T(n^{-1}D)L(\lambda )\right) \\= & {} \mathbb {E}\left[ n^{-1}\log \pi _n(\mu (\lambda )+n^{-1/2}L(\lambda )Z) \right. \\&\left. - \frac{1}{2}(\mu (\lambda ) +n^{-1/2}L(\lambda )Z)^TD(\mu (\lambda )+ n^{-1/2}L(\lambda )Z)\right] . \end{aligned}$$

By the D-strong convexity of \(n^{-1}\log \pi _n\),

$$\begin{aligned}&\le \lambda \left( F_n(\mu , L) - \frac{1}{2}\mu ^TD\mu - \frac{1}{2}{\text {tr}}L^T(n^{-1}D)L\right) + (1-\lambda ) \\&\quad \times \left( F_n(\mu ', L') - \frac{1}{2}\mu '^TD\mu ' - \frac{1}{2}{\text {tr}}L'^T(n^{-1}D)L'\right) \\&\!= \!\lambda \left( F_n(x) - \frac{1}{2}x^T{\text {diag}}(D, n^{-1}D, \dots , n^{-1}D)x\right) \!+\!(1-\lambda ) \\&\quad \times \left( F_n(x') - \frac{1}{2}x'^T{\text {diag}}(D, n^{-1}D, \dots , n^{-1}D)x'\right) . \end{aligned}$$

\(\square \)

B.3 Proof of Proposition 12

Proof

Note that by reparametrization,

where \(Z \sim \mathscr {N}(0,1)\). Using a Taylor expansion,

$$\begin{aligned}&- \mathbb {E}\left[ \frac{\mathrm{d}^2 }{\mathrm{d} \mu ^2} \left( -n^{-1}\log \pi _n (\mu + \sigma Z)\right) \right] \\&\quad = \mathbb {E}\left[ -n^{-1} \log \pi _n^{(2)} (\mu ) - n^{-1}\log \pi _n^{(3)} (\mu ') \cdot \sigma Z \right] , \end{aligned}$$

for some \(\mu '\) between \(\mu \) and \(\mu +\sigma Z\). By the uniform bound on the third derivative and local bound on the second derivative, for any \(\mu \in U\),

$$\begin{aligned}&\mathbb {E}\left[ -n^{-1} \log \pi _n^{(2)} (\mu ) - n^{-1}\log \pi _n^{(3)} (\mu ') \cdot \sigma Z \right] \\&\quad \le -\varepsilon + \eta \sigma \mathbb {E}\left| Z \right| \le -\varepsilon + \eta \sigma . \end{aligned}$$

The result follows for any \(0<\sigma < \varepsilon / \eta \). \(\square \)

B.4 Proof of Theorem 13

Proof

Note that we can split L into columns and express LZ as

$$\begin{aligned} LZ = \sum _{i=1}^d L_i Z_i, \end{aligned}$$

where \(L_i\in \mathbb {R}^d\) is the \(i\text {th}\) column of L, and \((Z_i)_{i=1}^d\overset{{\text {i.i.d.}}}{\sim }\mathscr {N}(0,1)\). Denoting \(\nabla ^2 f_n :=\nabla ^2 f_n(\mu +n^{-1/2}LZ)\) for brevity, the \(2\text {nd}\) derivatives in both \(\mu \) and L are

$$\begin{aligned} \nabla ^2_{\mu \mu } F_n= & {} \mathbb {E}\left[ \nabla ^2 f_n \right] \\ \nabla ^2_{L_iL_j} F_n= & {} n^{-1}\mathbb {E}\left[ Z_i Z_j \nabla ^2 f_n \right] \\ \nabla ^2_{\mu L_i} F_n= & {} n^{-1/2}\mathbb {E}\left[ Z_i\nabla ^2 f_n\right] \end{aligned}$$

where we can pass the gradient and Hessian through the expectation by dominated convergence because Z has a normal distribution and \(f_n\) has \(\ell \)-Lipschitz gradients. Stacking these together in block matrices yields the overall Hessian,

$$\begin{aligned} A&= \left[ \begin{array}{cccc} I&n^{-1/2} Z_1 I&\dots&n^{-1/2} Z_d I \end{array}\right] \in \mathbb {R}^{d \times d(d+1)}\\ \nabla ^2 F_n&= \mathbb {E}\left[ A^T \nabla ^2 f_n A\right] \in \mathbb {R}^{d(d+1)\times d(d+1)}. \end{aligned}$$

Since \(f_n\) has \(\ell \)-Lipschitz gradients, for all \(x\in \mathbb {R}^d\), \(-\ell I \preceq \nabla ^2 f_n(x) \preceq \ell I\). Applying the upper bound and evaluating the expectation yields the Hessian upper bound (and the same technique yields the corresponding lower bound):

$$\begin{aligned} \nabla ^2 F_n&= \mathbb {E}\left[ A^T \nabla ^2 f_n A\right] \\&\preceq \ell \mathbb {E}\left[ A^TA\right] \\&= \ell \left[ \begin{array}{cccc} I &{}\quad 0 &{} \quad 0 &{}\quad 0\\ 0 &{}\quad n^{-1}I &{}\quad 0 &{}\quad 0\\ 0 &{}\quad 0 &{}\quad \ddots &{}\quad 0\\ 0 &{}\quad 0 &{}\quad 0 &{}\quad n^{-1} I \end{array}\right] = \ell D_n. \end{aligned}$$

To demonstrate local strong convexity, we split the expectation into two parts: one where \(n^{-1/2} LZ\) is small enough to guarantee that \(\Vert \mu +n^{-1/2} LZ - x\Vert ^2 \le r^2\), and the complement. Define

$$\begin{aligned} r_n^2(\mu , L) :=n\frac{(r^2 - 2\Vert \mu - x\Vert ^2_2)}{2\Vert L\Vert ^2_F}. \end{aligned}$$

Note that when \(\Vert Z\Vert ^2 \le r_n^2(\mu , L) \),

$$\begin{aligned} \left\| \mu + \frac{1}{\sqrt{n}}LZ - x \right\| _2^2\le & {} 2 \Vert \mu - x\Vert ^2 + 2n^{-1}\Vert LZ\Vert ^2\\\le & {} 2 \Vert \mu - x\Vert ^2 + 2n^{-1}\Vert L\Vert _F^2\Vert Z\Vert ^2\\\le & {} r^2. \end{aligned}$$

Then, we may write

$$\begin{aligned} \nabla ^2 F_n= & {} \mathbb {E}\left[ \mathbb {1}\left[ \Vert Z\Vert ^2 \le r_n^2(\mu , L) \right] A^T\nabla ^2 f_n A\right] \\&+ \mathbb {E}\left[ \mathbb {1}\left[ \Vert Z\Vert ^2 > r_n^2(\mu , L) \right] A^T\nabla ^2 f_n A\right] . \end{aligned}$$

Since \(f_n\) has \(\ell \)-Lipschitz gradients and is locally \(\varepsilon \)-strongly convex,

$$\begin{aligned} \nabla ^2 F_n&\succeq \varepsilon \cdot \mathbb {E}\left[ \mathbb {1}\left[ \Vert Z\Vert ^2 \le r_n^2(\mu , L) \right] A^T A\right] \\&\quad - \ell \cdot \mathbb {E}\left[ \mathbb {1}\left[ \Vert Z\Vert ^2 > r_n^2(\mu , L) \right] A^T A\right] . \end{aligned}$$

Note that \(A^TA\) has entries 1 and \(n^{-1} Z_i^2\) along the diagonal, as well as \(n^{-1} Z_iZ_j\), \(i\ne j\) and \(n^{-1/2} Z_i\) on the off-diagonals. By symmetry, since Z is an isotropic Gaussian, censoring by \(\mathbb {1}\left[ \Vert Z\Vert ^2 \le \dots \right] \) or \(\mathbb {1}\left[ \Vert Z\Vert ^2 > \dots \right] \) maintains that the off-diagonal expectations are 0. Therefore, the quantity \(\mathbb {E}\left[ \mathbb {1}\left[ \Vert Z\Vert ^2 \le r_n^2(\mu , L) \right] A^TA \right] \) is diagonal with coefficients \(1 - \alpha _n(\mu ,L)\) and \(n^{-1} \beta _n(\mu ,L)\), and \(\mathbb {E}\left[ \mathbb {1}\left[ \Vert Z\Vert ^2 > r_n^2(\mu , L) \right] \right. \left. A^TA \right] \) is diagonal with coefficients \(\alpha _n(\mu ,L)\) and \(n^{-1} \tau _n(\mu ,L)\) where

$$\begin{aligned} \alpha _n(\mu ,L)= & {} \mathbb {P}\left( \Vert Z\Vert ^2> r_n^2(\mu , L) \right) \\ \beta _n(\mu ,L)= & {} \mathbb {E}\left[ Z_1^2 \mathbb {1}\left[ \Vert Z\Vert ^2 \le r_n^2(\mu , L) \right] \right] \\= & {} d^{-1}\mathbb {E}\left[ \Vert Z\Vert _2^2 \mathbb {1}\left[ \Vert Z\Vert ^2 \le r_n^2(\mu , L) \right] \right] \\ \tau _n(\mu ,L)= & {} \mathbb {E}\left[ Z_1^2\mathbb {1}\left[ \Vert Z\Vert ^2> r_n^2(\mu , L) \right] \right] \\= & {} d^{-1}\mathbb {E}\left[ \Vert Z\Vert _2^2 \mathbb {1}\left[ \Vert Z\Vert ^2 > r_n^2(\mu , L) \right] \right] . \end{aligned}$$

Note that \(\Vert Z\Vert ^2 \sim \chi ^2_d\); so \(\alpha _n(\mu ,L) = 1-\chi ^2_d(r_n^2(\mu , L) )\) and

$$\begin{aligned} \tau _n(\mu ,L)= & {} \int _{r_n^2(\mu , L) }^{\infty }\mathbb {1}\left[ x\ge 0\right] \frac{1}{2^{(d+2)/2}\varGamma ((d+2)/2)}\\&\times x^{\frac{d+2}{2}-1}e^{-x/2}\mathrm {d}x \\= & {} 1- \chi ^2_{d+2}(r_n^2(\mu , L) )\\ \beta _n(\mu ,L)= & {} 1-\tau _n(\mu ,L). \end{aligned}$$
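The integral expression for \(\tau _n(\mu ,L)\) above follows from the chi-square density identity

$$\begin{aligned} \frac{x}{d}\, f_{\chi ^2_d}(x) = f_{\chi ^2_{d+2}}(x), \qquad x \ge 0, \end{aligned}$$

where \(f_{\chi ^2_d}\) denotes the \(\chi ^2_d\) density; applying it to \(d^{-1}\mathbb {E}\left[ \Vert Z\Vert _2^2 \mathbb {1}\left[ \Vert Z\Vert ^2 > r_n^2(\mu , L) \right] \right] \) yields \(1- \chi ^2_{d+2}(r_n^2(\mu , L) )\).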

Therefore,

$$\begin{aligned} \nabla ^2 F_n&\succeq {\text {diag}}\left( \left( \varepsilon (1-\alpha _n(\mu ,L)) - \ell \alpha _n(\mu ,L)\right) I, \left( \varepsilon n^{-1}(1-\tau _n(\mu ,L)) - \ell n^{-1}\tau _n(\mu ,L)\right) I, \dots , \left( \varepsilon n^{-1}(1-\tau _n(\mu ,L)) - \ell n^{-1}\tau _n(\mu ,L)\right) I\right) \\&= \varepsilon D_n - (\varepsilon + \ell )\, {\text {diag}}\left( \alpha _n(\mu ,L)I, n^{-1}\tau _n(\mu ,L)I, \dots , n^{-1}\tau _n(\mu ,L)I\right) \\&\succeq \varepsilon D_n - (\varepsilon + \ell )\, {\text {diag}}\left( \tau _n(\mu ,L)I, n^{-1}\tau _n(\mu ,L)I, \dots , n^{-1}\tau _n(\mu ,L)I\right) \\&= D_n\left( \varepsilon - \tau _n(\mu ,L) \cdot (\varepsilon + \ell ) \right) . \end{aligned}$$

\(\square \)

B.5 Proof of Lemma 9

Proof

Given Assumption 1, we know that \(f_n\) is twice continuously differentiable. Thus, using the second-order characterization of strong convexity, it suffices to show that there exist \(r,\varepsilon >0\) such that

$$\begin{aligned} {\mathbb {P}}\left( \forall \theta \in B_r(\theta _0), \quad \nabla ^2 f_n(\theta ) \succeq \varepsilon I \right) \rightarrow 1, \end{aligned}$$

as \(n \rightarrow \infty \). Note that by Weyl’s inequality

$$\begin{aligned} \nabla ^2 f_n(\theta )= & {} \nabla ^2 f_n(\theta ) - H_\theta + H_\theta \nonumber \\\succeq & {} \lambda _{\min }\left( \nabla ^2 f_n(\theta ) - H_\theta \right) I + \lambda _{\min }(H_\theta ) I. \end{aligned}$$
(12)

Condition 4 of Assumption 1 guarantees that \(H_{\theta _0} \succeq \varepsilon I\) and that there exists a \(\kappa > 0\) such that \(H_\theta \) is continuous in \(B_{\kappa }(\theta _0)\). Hence, there exists \(0<\kappa ' \le \kappa \), such that \(\forall \theta \in B_{\kappa '}(\theta _0),\; H_\theta \succeq \frac{\varepsilon }{2}I\).

We then consider \(\lambda _{\min }\left( \nabla ^2 f_n(\theta ) - H_\theta \right) \). We aim to find a \(0 <r \le \kappa '\) such that \(|\lambda _{\min }\left( \nabla ^2 f_n(\theta ) - H_\theta \right) |\) is sufficiently small. Note that for any fixed \(r > 0\),

$$\begin{aligned}&\sup _{\theta \in B_r(\theta _0)} \left| \lambda _{\min }\left( \nabla ^2 f_n(\theta ) - H_\theta \right) \right| \\&\quad \le \sup _{\theta \in B_r(\theta _0)} \left\| \nabla ^2 f_n(\theta ) - H_\theta \right\| _2 \\&\quad = \sup _{\theta \in B_r(\theta _0)} \left\| \nabla ^2 f_n(\theta ) - \mathbb {E}_{\theta _0}\left[ -\nabla ^2 \log p_\theta (X) \right] \right. \\&\left. \qquad + \mathbb {E}_{\theta _0}\left[ -\nabla ^2 \log p_\theta (X) \right] - H_\theta \right\| _2\\&\quad \le \! \sup _{\theta \in B_r(\theta _0)} \!\left( \left\| \nabla ^2 f_n(\theta ) \!-\! \mathbb {E}_{\theta _0}\!\left[ -\nabla ^2 \log p_\theta (X) \right] \right\| _2 \right. \\&\left. \qquad + \left\| \mathbb {E}_{\theta _0}\!\left[ -\nabla ^2 \log p_\theta (X) \right] \!-\! H_\theta \right\| _2 \right) . \end{aligned}$$

Now, we split \(f_n\) into prior and likelihood, yielding that

$$\begin{aligned}&\le \sup _{\theta \in B_r(\theta _0)} \left\| -n^{-1}\sum _{i = 1}^n \nabla ^2\log p_\theta (X_i) - \mathbb {E}_{\theta _0}\left[ -\nabla ^2 \log p_\theta (X) \right] \right\| _2 \nonumber \\&\quad \! + \sup _{\theta \in B_r(\theta _0)}\! \Vert -n^{-1}\nabla ^2\log \pi _0(\theta ) \Vert _2 \nonumber \\&\quad + \sup _{\theta \in B_r(\theta _0)}\!\left\| \mathbb {E}_{\theta _0}\left[ -\nabla ^2 \log p_\theta (X) \right] - H_\theta \right\| _2. \end{aligned}$$
(13)

Given Condition 2 of Assumption 1, for all \(\theta \), \(\pi _0(\theta )\) is positive and \(\nabla ^2 \pi _0(\theta )\) is continuous; and further due to the compactness of \(B_r(\theta _0)\), we have that

$$\begin{aligned}&\quad \forall r > 0, \quad \sup _{\theta \in B_r(\theta _0)} \Vert -n^{-1}\nabla ^2\log \pi _0(\theta ) \Vert _2 \rightarrow 0, \quad \nonumber \\&\qquad \text {as } n \rightarrow \infty . \end{aligned}$$
(14)

Then, it remains to bound the first term and the last term of Eq. (13). For the first term, we aim to use the uniform weak law of large numbers to show its convergence to 0. By Condition 5 of Assumption 1, there exists a \(0 < r_1 \le \kappa '\) and a measurable function g such that for all \( \theta \in B_{r_1}(\theta _0)\) and for all x,

$$\begin{aligned} \max _{i,j \in [d]}\left| \left( \nabla ^2 \log p_{\theta }(x) \right) _{i,j} \right|< g(x), \quad \mathbb {E}_{\theta _0}[g(X)] < \infty . \end{aligned}$$

Then, by the compactness of \(B_{r_1}(\theta _0)\), we can apply the uniform weak law of large numbers (Jennrich 1969, Theorem 2), yielding that for all \(i,j \in [d]\),

$$\begin{aligned}&\sup _{\theta \in B_{r_1}(\theta _0)} \left| \left( -n^{-1}\sum _{i = 1}^n \nabla ^2\log p_\theta (X_i)\right) _{i,j}\right. \\&\quad \left. - \left( \mathbb {E}_{\theta _0}\left[ -\nabla ^2 \log p_\theta (X) \right] \right) _{i,j} \right| {\mathop {\rightarrow }\limits ^{P_{\theta _0}}} 0. \end{aligned}$$

Since the entrywise convergence of matrices implies the convergence in spectral norm,

$$\begin{aligned} \sup _{\theta \in B_{r_1}(\theta _0)} \left\| -n^{-1}\sum _{i = 1}^n \nabla ^2\log p_\theta (X_i) - \mathbb {E}_{\theta _0}\left[ -\nabla ^2 \log p_\theta (X) \right] \right\| _2 {\mathop { \rightarrow }\limits ^{P_{\theta _0}}} 0. \end{aligned}$$
(15)

For the last term of Eq. (13), by Condition 4 of Assumption 1,

$$\begin{aligned}&\lim _{r\rightarrow 0} \sup _{\theta \in B_r(\theta _0)} \left\| \mathbb {E}_{\theta _0}\left[ -\nabla ^2 \log p_\theta (X) \right] - H_\theta \right\| _2\\&\quad = \lim _{r\rightarrow 0} \sup _{\theta \in B_r(\theta _0)} \left\| \mathbb {E}_{\theta _0}\left[ -\nabla ^2 \log p_\theta (X) \right] \right. \\&\qquad \left. -\mathbb {E}_\theta \left[ -\nabla ^2 \log p_\theta (X)\right] \right\| _2 \rightarrow 0. \end{aligned}$$

Thus, there exists a sufficiently small \(r_2 > 0\) such that

$$\begin{aligned} \sup _{\theta \in B_{r_2}(\theta _0)} \left\| \mathbb {E}_{\theta _0}\left[ -\nabla ^2 \log p_\theta (X) \right] - H_\theta \right\| _2 \le \frac{\varepsilon }{8}. \end{aligned}$$
(16)

Then, we combine Eqs. (14) to (16) and pick \(r' = \min (r_1,r_2)\le \kappa '\), yielding that

$$\begin{aligned} {\mathbb {P}}\left( \sup _{\theta \in B_{r'}(\theta _0)}\left| \lambda _{\min }\left( \nabla ^2 f_n(\theta ) - H_\theta \right) \right| \le \frac{\varepsilon }{4} \right) \rightarrow 1, \end{aligned}$$
(17)

as \(n \rightarrow \infty \). The local strong convexity then follows: we have already shown that for all \(\theta \in B_{\kappa '}(\theta _0)\), \(H_\theta \succeq \frac{\varepsilon }{2} I\), so by Eqs. (17) and (12),

$$\begin{aligned} \lim _{n \rightarrow \infty } {\mathbb {P}}\left( \forall \theta \in B_{r'}(\theta _0), \quad \nabla ^2 f_n(\theta ) \succeq \frac{\varepsilon }{4} I\right) = 1. \end{aligned}$$

The smoothness argument follows from the same strategy. Weyl’s inequality implies that

$$\begin{aligned} \nabla ^2 f_n(\theta )= & {} \nabla ^2 f_n(\theta ) - H_\theta + H_\theta \\\preceq & {} \lambda _{\max }\left( \nabla ^2 f_n(\theta ) - H_\theta \right) I + \lambda _{\max }(H_\theta ) I. \end{aligned}$$

By repeating the above argument, we obtain that there exists a sufficiently small \(r'' > 0\) such that for all \(\varepsilon > 0\),

$$\begin{aligned} {\mathbb {P}}\left( \sup _{\theta \in B_{r''}(\theta _0)}\left| \Vert \nabla ^2 f_n(\theta )\Vert _2 - \Vert H_\theta \Vert _2 \right| \le \varepsilon \right) \rightarrow 1, \end{aligned}$$

as \(n \rightarrow \infty \). Condition 4 and 5 of Assumption 1 yield that

$$\begin{aligned} \sup _{\theta \in B_{r''}(\theta _0)}\Vert H_{\theta _0}\Vert _2 < \infty . \end{aligned}$$

Therefore, there exists a \(\ell >0\) such that

$$\begin{aligned} \lim _{n \rightarrow \infty } {\mathbb {P}}\left( \forall \theta \in B_{r''}(\theta _0), \quad \nabla ^2 f_n(\theta ) \preceq \ell I\right) = 1. \end{aligned}$$

Then, the proof is complete by defining \(r :=\min \{r', r''\}\). \(\square \)

B.6 Proof of Corollary 14

Proof

We begin by verifying the conditions of Theorem 13 for \(f_n\). By Assumption 1, we know that \(f_n\) is twice differentiable. We also know that by Lemma 9, under Assumptions 1 and 2, there exist \(\ell , r', \varepsilon > 0\) such that

$$\begin{aligned} \mathbb {P}\left( \sup _{\theta }\left\| - n^{-1}\nabla ^2 \log \pi _n(\theta )\right\| _2 > \ell \right)&\rightarrow 0 \\ \mathbb {P}\left( \inf _{\Vert \theta - \theta _0\Vert< r'}\lambda _{\min }\left( -n^{-1}\nabla ^2 \log \pi _n(\theta )\right) < \varepsilon \right)&\rightarrow 0. \end{aligned}$$

By Theorem 6, we know that \(\mu _n^\star {\mathop {\rightarrow }\limits ^{P_{\theta _0}}} \theta _0\), so there exists an \(r'> r>0\) such that

$$\begin{aligned} \mathbb {P}\left( \inf _{\Vert \theta - \mu _n^\star \Vert< r}\lambda _{\min }\left( - n^{-1}\nabla ^2 \log \pi _n(\theta )\right) < \varepsilon \right) \rightarrow 0. \end{aligned}$$

Therefore by Theorem 13, the probability that

$$\begin{aligned} \forall \mu , L, \quad -\ell D_n \preceq&n^{-1} \nabla ^2\mathbb {E}\left[ -\log \pi _n(\mu +1/\sqrt{n}LZ)\right] \preceq \ell D_n,\quad \end{aligned}$$
(18)

and

$$\begin{aligned} \begin{aligned}&\text { for all } L \text { and for } \Vert \mu - \mu _n^\star \Vert ^2 < r^2/2,\\&\quad n^{-1} \nabla ^2\mathbb {E}\left[ -\log \pi _n(\mu +n^{-1/2}LZ)\right] \\&\quad \succeq D_n(\varepsilon - \tau _n(\mu ,L)\cdot (\varepsilon + \ell )) , \end{aligned} \end{aligned}$$
(19)

hold converges to 1 as \(n\rightarrow \infty \), where \(D_n\) and \(\tau _n(\mu ,L)\) are as defined in Eq. (10) and \(x = \mu _n^\star \). Note that the gradient and Hessian in the above expression are taken with respect to a vector in \(\mathbb {R}^{d(d+1)}\) that stacks \(\mu \) and each column of L into a single vector.

Then for all \((\mu , L) \in {\mathscr {B}}_{r,n}\), we have

$$\begin{aligned} \Vert \mu -\mu _n^\star \Vert ^2&\le r^2/4\\ \Vert L-L_n^\star \Vert _F^2&\le 4\Vert I-L_n^\star \Vert _F^2 \implies \Vert L\Vert _F \le 2\Vert I-L_n^\star \Vert _F + \Vert L_n^\star \Vert _F, \end{aligned}$$

yielding

$$\begin{aligned} \frac{r^2 - 2\Vert \mu -\mu _n^\star \Vert ^2}{n^{-1}2\Vert L\Vert ^2_F} \ge \frac{nr^2}{4\left( 2\Vert I-L_n^\star \Vert _F + \Vert L_n^\star \Vert _F\right) ^2}. \end{aligned}$$

Hence \(\forall (\mu , L) \in {\mathscr {B}}_{r,n}\), \(\tau _n(\mu ,L ) \rightarrow 0\) as \(n\rightarrow \infty \), yielding that under sufficiently large n,

$$\begin{aligned} \varepsilon - \tau _n(\mu ,L)\cdot (\varepsilon + \ell ) > \varepsilon /2. \end{aligned}$$

Therefore, the probability that for all \((\mu , L) \in {\mathscr {B}}_{r,n}\),

$$\begin{aligned} \frac{1}{n} \nabla ^2\mathbb {E}\left[ -\log \pi _n(\mu +1/\sqrt{n}LZ)\right] \succeq \frac{\varepsilon }{2} D_n \end{aligned}$$
(20)

converges in \(P_{\theta _0}\) to 1 as \(n \rightarrow \infty \).

Combining Eqs. (18) to (20), the proof is completed. \(\square \)

B.7 Proof of Theorem 7

B.7.1 Gradient and Hessian derivation

The gradient of the log smoothed posterior is as follows:

$$\begin{aligned} \nabla \log \hat{\pi }_n(\theta ) =&\nabla \log \left\{ \mathbb {E}\left[ \exp \left( -\frac{1}{2\alpha _n}\Vert \theta - W\Vert ^2\right) \right] \right\} \\ =&\frac{ \mathbb {E}\left[ \exp \left( -\frac{1}{2\alpha _n}\Vert \theta - W\Vert ^2\right) \left( - \frac{1}{\alpha _n} \right) \left( \theta - W \right) \right] }{\mathbb {E}\left[ \exp \left( -\frac{1}{2\alpha _n}\Vert \theta - W\Vert ^2\right) \right] } , \end{aligned}$$

and the Hessian matrix is given by

$$\begin{aligned} \nabla ^2 \log \hat{\pi }_n(\theta )&= \frac{1}{\alpha _n^2} \frac{\mathbb {E}\left[ e^{\frac{-\Vert \theta - W\Vert ^2-\Vert \theta - W'\Vert ^2}{2\alpha _n}}W(W -W')^T\right] }{\mathbb {E}\left[ e^{-\frac{\Vert \theta - W\Vert ^2}{2\alpha _n}}\right] ^2}\nonumber \\&\quad - \frac{1}{\alpha _n} I, \end{aligned}$$
(21)

where \(W,W' \overset{{\text {i.i.d.}}}{\sim }\varPi _{ n}\).
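Given draws \(W_1, \dots , W_S \overset{{\text {i.i.d.}}}{\sim }\varPi _n\) (e.g. from MCMC), the gradient above can be estimated by a self-normalized Monte Carlo average, since the numerator and denominator are expectations under the same distribution. A minimal sketch, using a numerically stable softmax over the log Gaussian kernels, is:

```python
import numpy as np
from scipy.special import softmax

def grad_log_smoothed_post(theta, W, alpha_n):
    """Monte Carlo estimate of the gradient of log pi-hat_n at theta.

    theta: (d,) point; W: (S, d) array of draws from the posterior Pi_n.
    The same draws estimate both the numerator and the denominator, giving a
    softmax-weighted (self-normalized) average.
    """
    diffs = theta - W                                    # (S, d)
    log_w = -np.sum(diffs ** 2, axis=1) / (2 * alpha_n)  # log Gaussian kernels
    weights = softmax(log_w)
    return -(weights[:, None] * diffs).sum(axis=0) / alpha_n

# Illustrative usage with stand-in posterior draws.
rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 2))
g = grad_log_smoothed_post(np.array([0.5, -0.5]), W, alpha_n=0.5)
```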

B.7.2 Proof of \(1{\text {st}}\) statement of Theorem 7

Proof of first statement of Theorem 7

To show that MAP estimation for the smoothed posterior is asymptotically strictly convex, we will show that

$$\begin{aligned} \lim _{n \rightarrow \infty } {\mathbb {P}}\left( \sup _{\Vert \theta -\theta _0\Vert \le M} \lambda _{\text {max}}\left( \nabla ^2 \log \hat{\pi }_n(\theta ) \right) < 0\right) = 1. \end{aligned}$$

We focus on the first term of Eq. (21), and show that asymptotically it is uniformly smaller than \(\alpha _n^{-1}\) so that the overall Hessian is negative definite. For the denominator of Eq. (21), define \(B_n :=\left\{ W, W' : \max \{\Vert W'-\theta _0\Vert , \Vert W - \theta _0\Vert \} \le \beta _n\right\} \) for any sequence \(\beta _n = o(\alpha _n)\). Then, we have

$$\begin{aligned} \mathbb {E}\left[ e^{-\frac{\Vert \theta - W\Vert ^2}{2\alpha _n}}\right] ^2&= \mathbb {E}\left[ e^{\frac{-\Vert \theta - W\Vert ^2-\Vert \theta - W'\Vert ^2}{2\alpha _n}}1_{B_n}\right] \\&\quad +\mathbb {E}\left[ e^{\frac{-\Vert \theta - W\Vert ^2-\Vert \theta - W'\Vert ^2}{2\alpha _n}}1_{B^c_n}\right] \\&\ge \mathbb {E}\left[ e^{\frac{-\Vert \theta - W\Vert ^2-\Vert \theta - W'\Vert ^2}{2\alpha _n}}1_{B_n}\right] \\&\ge \mathbb {E}\left[ \left( \inf _{v,v' \in B_n} e^{\frac{-\Vert \theta -v\Vert ^2-\Vert \theta -v'\Vert ^2}{2\alpha _n}}\right) 1_{B_n}\right] \\&= \left( \inf _{v,v' \in B_n} e^{\frac{-\Vert \theta -v\Vert ^2-\Vert \theta -v'\Vert ^2}{2\alpha _n}}\right) {\mathbb {P}}(B_n). \end{aligned}$$

By minimizing over \( v, v' \in B_n\), the above leads to

$$\begin{aligned} \mathbb {E}\left[ e^{-\frac{\Vert \theta - W\Vert ^2}{2\alpha _n}}\right] ^2 \ge e^{\frac{-2(\Vert \theta -\theta _0\Vert +\beta _n)^2}{2\alpha _n}}{\mathbb {P}}(B_n). \end{aligned}$$
(22)

For the numerator of the first term of Eq. (21), since \(W, W'\) are i.i.d.,

$$\begin{aligned}&\mathbb {E}\left[ e^{\frac{-\Vert \theta - W\Vert ^2-\Vert \theta - W'\Vert ^2}{2\alpha _n}}W(W -W')^T\right] \\&\quad = \frac{1}{2} \mathbb {E}\left[ e^{\frac{-\Vert \theta - W\Vert ^2-\Vert \theta - W'\Vert ^2}{2\alpha _n}}(W -W')(W -W')^T\right] , \end{aligned}$$

and since \(\lambda _{\text {max}} \left( (W -W')(W -W')^T\right) = \Vert W -W'\Vert ^2\),

$$\begin{aligned}&\lambda _{\text {max}} \left( \mathbb {E}\left[ e^{\frac{-\Vert \theta - W\Vert ^2-\Vert \theta - W'\Vert ^2}{2\alpha _n}}W(W -W')^T\right] \right) \nonumber \\&\quad \le \frac{1}{2} \mathbb {E}\left[ e^{\frac{-\Vert \theta - W\Vert ^2-\Vert \theta - W'\Vert ^2}{2\alpha _n}}\Vert W -W'\Vert ^2\right] . \end{aligned}$$
(23)

With Eqs. (22) and (23), we can therefore bound the maximal eigenvalue of the Hessian matrix,

$$\begin{aligned}&\lambda _{\text {max}}\left( \nabla ^2 \log \hat{\pi }_n(\theta ) \right) \nonumber \\&\quad \le \frac{1}{2 \alpha _n^2 {\mathbb {P}}(B_n)}\mathbb {E}\nonumber \\&\qquad \times \left[ e^{\frac{-\Vert \theta - W\Vert ^2-\Vert \theta - W'\Vert ^2}{2\alpha _n}}e^{\frac{2(\Vert \theta -\theta _0\Vert +\beta _n)^2}{2\alpha _n}}\Vert W -W'\Vert ^2\right] - \frac{1}{\alpha _n}. \end{aligned}$$
(24)

We now bound the supremum of this expression over \(\{\theta \in \mathbb {R}^d: \Vert \theta -\theta _0\Vert \le M\}\). Focusing on the exponent within the expectation,

$$\begin{aligned}&\sup _{\Vert \theta -\theta _0\Vert \le M} \frac{1}{\alpha _n} \left[ 2(\Vert \theta -\theta _0\Vert +\beta _n)^2 -\Vert \theta - W\Vert ^2-\Vert \theta - W'\Vert ^2 \right] \\&\quad = \sup _{\Vert \theta -\theta _0\Vert \le M} \frac{1}{\alpha _n} \left[ 2(\Vert \theta -\theta _0\Vert +\beta _n)^2 -\Vert \theta - \theta _0 + \theta _0 -W\Vert ^2 \right. \\&\qquad \left. -\Vert \theta -\theta _0 + \theta _0-W'\Vert ^2 \right] \\&\quad \le \frac{1}{\alpha _n} \left[ \left( 2 \beta _n^2 + 4 M \beta _n \right) - \left( \Vert \theta _0 - W\Vert ^2 + \Vert \theta _0 - W'\Vert ^2 \right) \right. \\&\qquad \left. + 2M \left( \Vert \theta _0 - W\Vert + \Vert \theta _0 - W'\Vert \right) \right] , \end{aligned}$$

where the inequality is obtained by expanding the quadratic terms and bounding \(\Vert \theta - \theta _0\Vert \) with M. We combine the above bound with Eq. (24) to show that \(\alpha _n^2\lambda _{\text {max}}\left( \nabla ^2 \log \hat{\pi }_n(\theta ) \right) + \alpha _n\) is bounded above by

$$\begin{aligned}&\quad \frac{\beta _n}{2{\mathbb {P}}(B_n)} \!e^{\frac{2\beta _n^2 + 4M\beta _n}{\alpha _n} } \mathbb {E}\nonumber \\&\qquad \!\left[ \!e^{\frac{ 2M \left( \! \Vert \theta _0 - W\Vert + \Vert \theta _0 - W'\Vert \! \right) - \left( \!\Vert \theta _0 - W\Vert ^2 + \Vert \theta _0 - W'\Vert ^2 \!\right) }{\alpha _n}} \frac{\Vert W -W'\Vert ^2}{\beta _n} \!\right] \!.\!\! \end{aligned}$$
(25)

By multiplying and dividing by \( \exp \left( \frac{ \Vert W-W'\Vert }{\sqrt{\beta _n}}\right) \), one notices that

$$\begin{aligned} \frac{\Vert W -W'\Vert ^2}{\beta _n}&= \exp \left( \frac{ \Vert W-W'\Vert }{\sqrt{\beta _n}}\right) \\&\quad \times \exp \left( - \frac{ \Vert W-W'\Vert }{\sqrt{\beta _n}}\right) \frac{\Vert W -W'\Vert ^2}{\beta _n}\\&\quad \le 4e^{-2} \exp \left( \frac{ \Vert W -\theta _0\Vert + \Vert W' - \theta _0\Vert }{\sqrt{\beta _n}}\right) , \end{aligned}$$

where the inequality follows from the facts that \(x^2 e^{-x}\) is maximized at \(x=2\) with value \(4e^{-2}\) and that \(\Vert W -W'\Vert \le \Vert W - \theta _0\Vert + \Vert W' - \theta _0\Vert \). Combining this bound with Eq. (25) and noting that \(W, W'\) are i.i.d., Eq. (25) is bounded above by

$$\begin{aligned} \frac{2 e^{-2} \beta _n }{{\mathbb {P}}(B_n)} e^{\frac{2\beta _n^2 + 4M\beta _n}{\alpha _n} } \mathbb {E}\left[ e^{ \left( \frac{1}{\alpha _n}M + \beta _n^{-1/2} \right) \Vert W - \theta _0\Vert - \frac{1}{2\alpha _n} \Vert W - \theta _0\Vert ^2 } \right] ^2. \end{aligned}$$
(26)

To show that the Hessian is asymptotically negative definite, it suffices to show that Eq. (26) is \(o_{P_{\theta _0}}(\alpha _n)\). For the terms outside the expectation, \(\beta _n = o(\alpha _n)\) implies that \(2 e^{-2} \beta _n e^{\frac{2\beta _n^2 + 4M\beta _n}{\alpha _n} } = o(\alpha _n)\), and Assumption 1 and Lemma 15 together imply that

$$\begin{aligned} {\mathbb {P}}(B_n) = \varPi _{n}\left( \left\{ W : \Vert W - \theta _0 \Vert \le \beta _n\right\} \right) ^2 {\mathop {\rightarrow }\limits ^{P_{\theta _0}}} 1, \end{aligned}$$

so

$$\begin{aligned} \frac{2 e^{-2} \beta _n }{{\mathbb {P}}(B_n)} e^{\frac{2\beta _n^2 + 4M\beta _n}{\alpha _n} } = o_{P_{\theta _0}}(\alpha _n). \end{aligned}$$

Therefore, in order to show Eq. (26) is \(o_{P_{\theta _0}}(\alpha _n)\), it is sufficient to show that

$$\begin{aligned} \mathbb {E}\left[ e^{ \left( \frac{1}{\alpha _n}M + \beta _n^{-1/2} \right) \Vert W - \theta _0\Vert - \frac{1}{2\alpha _n} \Vert W - \theta _0\Vert ^2 } \right] = O_{P_{\theta _0}}(1). \end{aligned}$$

The next step is to split the expectation into two regions—\(\Vert W - \theta _0\Vert \le \beta _n\) and \(\Vert W - \theta _0\Vert > \beta _n\)—and bound its value within them separately.

  1.

    When \(\Vert W - \theta _0\Vert \le \beta _n\), the exponent inside the expectation is shrinking uniformly since \(\beta _n = o(\alpha _n)\):

    $$\begin{aligned}&\mathbb {E}\left[ 1_{\{ \Vert W - \theta _0\Vert \le \beta _n\}} e^{ \left( \frac{1}{\alpha _n}M + \beta _n^{-1/2} \right) \Vert W - \theta _0\Vert - \frac{1}{2\alpha _n} \Vert W - \theta _0\Vert ^2 } \right] \\&\quad \le \mathbb {E}\left[ 1_{\{ \Vert W - \theta _0\Vert \le \beta _n\}} \right] e^{ \left( \frac{1}{\alpha _n}M + \beta _n^{-1/2} \right) \beta _n} \\&\quad = O_{P_{\theta _0}}(1). \end{aligned}$$
  2.

    When \(\Vert W - \theta _0\Vert > \beta _n\), we bound the exponent, a quadratic function of \(\Vert W - \theta _0\Vert \), by its supremum, which is attained at \(\Vert W - \theta _0\Vert = M + \alpha _n \beta _n^{-1/2}\), yielding the following bound:

    $$\begin{aligned}&\left( \frac{1}{\alpha _n}M + \beta _n^{-1/2} \right) \Vert W - \theta _0\Vert - \frac{1}{2\alpha _n} \Vert W - \theta _0\Vert ^2 \nonumber \\&\quad \le \sup _{\Vert v - \theta _0\Vert } \left( \left( \frac{1}{\alpha _n}M + \beta _n^{-1/2} \right) \Vert v - \theta _0\Vert - \frac{1}{2\alpha _n} \Vert v - \theta _0\Vert ^2 \right) \nonumber \\&\quad = \left( \frac{1}{\alpha _n}M + \beta _n^{-1/2} \right) \left( M + \alpha _n \beta _n^{-1/2} \right) \nonumber \\&\qquad - \frac{1}{2\alpha _n} \left( M + \alpha _n \beta _n^{-1/2} \right) ^2 \nonumber \\&\quad =\frac{M^2}{2\alpha _n} + \frac{M}{\beta _n^{1/2} }+ \frac{\alpha _n}{2\beta _n}. \end{aligned}$$
    (27)

    This yields

    $$\begin{aligned}&\mathbb {E}\left[ 1_{\{ \Vert W - \theta _0\Vert> \beta _n\}} e^{ \left( \frac{1}{\alpha _n}M + \beta _n^{-1/2} \right) \Vert W - \theta _0\Vert - \frac{1}{2\alpha _n} \Vert W - \theta _0\Vert ^2 } \right] \\&\quad \le \varPi _{n}\left( \left\{ W : \Vert W - \theta _0 \Vert > \beta _n\right\} \right) \exp \left( \frac{M^2}{2\alpha _n} + \frac{M}{\beta _n^{1/2} }+ \frac{\alpha _n}{2\beta _n} \right) . \end{aligned}$$

    Note that it is always possible to choose \(\beta _n = o(\alpha _n)\) with \(\beta _n = \omega (\alpha _n^2)\). With this choice of \(\beta _n\), the dominating term among the three of Eq. (27) is \(\frac{M^2}{2\alpha _n} \). Then by Lemma 15, there exists a sequence \(\beta _n = o(\alpha _n)\) with \(\beta _n = \omega (\alpha _n^2)\) such that the following holds,

    $$\begin{aligned} \varPi _n( \left\{ W : \Vert W - \theta _0 \Vert > \beta _n\right\} ) = o_{P_{\theta _0}}\left( \exp \left\{ -\frac{M^2}{2\alpha _n}\right\} \right) , \end{aligned}$$

    which implies

    $$\begin{aligned}&\mathbb {E}\left[ 1_{\{ \Vert W - \theta _0\Vert > \beta _n\}} e^{ \left( \frac{1}{\alpha _n}M + \beta _n^{-1/2} \right) \Vert W - \theta _0\Vert - \frac{1}{2\alpha _n} \Vert W - \theta _0\Vert ^2 } \right] \\&= o_{P_{\theta _0}}(1) . \end{aligned}$$

This finishes the proof. \(\square \)

In the last step of the above proof, we require an exponential tail bound for the posterior \(\varPi _n\). We provide this in the following lemma, following the general proof strategy of van der Vaart (2000, Theorem 10.3). The proof of Lemma 15 involves many probability distributions; thus, for mathematical convenience and explicitness, in the proof of Lemma 15 we use square brackets, \(P\left[ X \right] \), to denote the expectation of a random variable X with respect to a probability distribution P. When taking the expectation of a function of n data points \(f(X_1, \dots , X_n)\), where \((X_i)_{ i =1}^n \overset{{\text {i.i.d.}}}{\sim }P_{\theta }\), we still write \(P_\theta [f]\); here \(P_\theta \) represents the product measure.

Lemma 15

Under Assumption 1, \(\alpha _n^3 n \rightarrow \infty \), there exists a sequence \(\beta _n\) satisfying \(\beta _n = o(\alpha _n)\), \(\beta _n = \omega (\alpha _n^2)\) and \(\beta _n = \omega (n^{-1/2})\) such that for any fixed constant M,

$$\begin{aligned} \varPi _n( \left\{ W : \Vert W - \theta _0 \Vert > \beta _n\right\} ) = o_{P_{\theta _0}}\left( \exp \left\{ -\frac{M^2}{2\alpha _n}\right\} \right) . \end{aligned}$$

Proof of Lemma 15

In order to show that \(\beta _n\) satisfies the tail probability bound, it suffices to prove that

$$\begin{aligned} e^{\frac{1}{\alpha _n}} P_{\theta _0}\left[ \varPi _n( \left\{ W : \Vert W - \theta _0 \Vert > \beta _n\right\} ) \right] \rightarrow 0 , \end{aligned}$$

due to Markov's inequality (we absorb the \(M^2/2\) constant into \(\alpha _n\) because it does not affect the proof). To achieve this, we take advantage of the existence of a test sequence implied by Assumption 1. By van der Vaart (2000, Lemma 10.6), given the \(1\text {st}\) and the \(2\text {nd}\) conditions of Assumption 1 and the fact that the parameter space \(\mathbb {R}^d\) is \(\sigma \)-compact, there exists a sequence of tests \(\phi _n : \mathscr {X}^n \rightarrow [0,1]\), where \(\mathscr {X}^n\) is the space of \((X_1, \dots , X_n)\), such that as \(n \rightarrow \infty \),

$$\begin{aligned} P_{\theta _0} [\phi _n] \rightarrow 0,\quad \sup _{\left\| \theta -\theta _{0}\right\| >\varepsilon } P_{\theta }\left[ 1-\phi _{n}\right] \rightarrow 0. \end{aligned}$$

Further, by Kleijn (2004, Lemma 1.2) and van der Vaart (2000, Lemma 10.3), under Assumption 1 and the existence of the above test sequence \(\phi _n\), for every \(M_n \rightarrow \infty \), there exists a constant \(C > 0\) and another sequence of tests \(\psi _n : \mathscr {X}^n\rightarrow [0,1]\) such that for all \(\Vert \theta - \theta _0\Vert > M_n/\sqrt{n}\) and sufficiently large n,

$$\begin{aligned} P_{\theta _0} [\psi _n ]&\le \exp \{- Cn\},\quad P_{\theta }\left[ 1-\psi _{n}\right] \nonumber \\&\le \exp \{- Cn (\Vert \theta - \theta _0 \Vert ^2 \wedge 1)\}. \end{aligned}$$
(28)

Using \(\psi _n\), we split the expectation as follows:

$$\begin{aligned}&e^{\frac{1}{\alpha _n}} \cdot P_{\theta _0} [\varPi _n( \left\{ W : \Vert W - \theta _0 \Vert> \beta _n\right\} )] \\&\quad =\underbrace{e^{\frac{1}{\alpha _n}} \cdot P_{\theta _0} [\varPi _n( \left\{ W : \Vert W - \theta _0 \Vert> \beta _n\right\} ) \psi _n]}_{(\text {I})} \\&\qquad + \underbrace{e^{\frac{1}{\alpha _n}} \cdot P_{\theta _0} [\varPi _n( \left\{ W : \Vert W - \theta _0 \Vert > \beta _n\right\} )(1- \psi _n)]}_{(\text {II})}, \end{aligned}$$

and we aim to show that both terms converge to 0.

For term (I), the first statement of Eq. (28) implies that \(\exists C >0\) such that

$$\begin{aligned} e^{\frac{1}{\alpha _n}} \cdot P_{\theta _0} [ \varPi _n( \left\{ W : \Vert W - \theta _0 \Vert > \beta _n\right\} ) \psi _n]&\le e^{\frac{1}{\alpha _n}} P_{\theta _0}[ \psi _n ]\\&\le e^{\frac{1}{\alpha _n}} e^{-nC} . \end{aligned}$$

where the first inequality follows from \( \varPi _n( \left\{ W : \Vert W - \theta _0 \Vert \right. \left. > \beta _n\right\} ) \le 1\). Since \(n \alpha _n^3 \rightarrow \infty \) implies \(\frac{1}{\alpha _n} = o(n^{1/3}) = o(n)\), the last bound in the above expression converges to 0.

For term (II), we work with the shifted and scaled posterior distribution. Define \(Z_n = \sqrt{n}(W - \theta _0)\) and \(B_n = \{Z_n: \Vert Z_n\Vert > \sqrt{n\beta _n^2} \}\), and let \(\tilde{\varPi }_{0}\) be the corresponding prior distribution on \(Z_n\) and \(\tilde{\varPi }_n\) be the shifted and scaled posterior distribution, which yields

$$\begin{aligned} \begin{aligned}&e^{\frac{1}{\alpha _n}} \cdot P_{\theta _0} [\varPi _n( \left\{ W : \Vert W - \theta _0 \Vert > \beta _n\right\} )(1- \psi _n)] \\&\quad = e^{\frac{1}{\alpha _n}} \cdot P_{\theta _0} \left[ \tilde{\varPi }_n \left( B_n \right) (1- \psi _n) \right] . \end{aligned} \end{aligned}$$
(29)

Let U be a closed ball around 0 with a fixed radius r; restricting \(\tilde{\varPi }_0\) to U defines a probability measure \(\tilde{\varPi }_0^U\), i.e. for every measurable set B, \(\tilde{\varPi }_0^U(B) = \tilde{\varPi }_0(B \cap U)/ \tilde{\varPi }_0(U)\). Write \(P_{n, z}\) for the joint distribution of the n data points \((X_1, \dots , X_n)\) under the parameter \(\theta _0 + z/\sqrt{n}\), and write \(P_{n,U} = \int P_{n, z} \text{ d } \tilde{\varPi }_0^U(z)\) for the marginal distribution of \((X_1, \dots , X_n)\) under \(\tilde{\varPi }_0^U\). Densities of these distributions are denoted in lower case, e.g. \(p_{n,U}(x) = \int p_{n,z}(x) \tilde{\pi }_0^U(z) \text{ d } z\) is the PDF of \(P_{n,U}\). Here we abuse notation by letting x denote \((x_1, \dots , x_n)\).

We now replace \(P_{\theta _0}\) in Eq. (29) with \(P_{n,U}\). Under Assumption 1, by van der Vaart (2000, p. 141), \(P_{n,U} \) is mutually contiguous to \(P_{\theta _0}\) (LeCam 1960); that is, for any statistic \(T_n\) (a Borel function of \(X^n\)), \(T_n {\mathop {\rightarrow }\limits ^{P_{\theta _0}}}0\) iff \(T_n {\mathop {\rightarrow }\limits ^{P_{n,U}}}0\). Thus, taking \(\tilde{\varPi }_n \left( B_n \right) (1- \psi _n)\) as the statistic \(T_n\), the convergence to 0 of the expression in Eq. (29) is equivalent to

$$\begin{aligned} e^{\frac{1}{\alpha _n}} \cdot P_{n, U} \left[ \tilde{\varPi }_n \left( B_n \right) (1- \psi _n) \right] \rightarrow 0 . \end{aligned}$$

Manipulating the expressions of \(P_{n,U}\) and \(\tilde{\varPi }_n \left( B_n \right) \) (we write \( \tilde{\varPi }_n \left( B_n, x\right) \) in the integral and \(\tilde{\varPi }_n \left( B_n, (X_i)_{i = 1}^n\right) \) in the expectation to make the dependence of the posterior on the data explicit),

$$\begin{aligned}&P_{n, U}\left[ \tilde{\varPi }_n \left( B_n,(X_i)_{i = 1}^n\right) (1- \psi _n) \right] \\&\qquad = \int \tilde{\varPi }_n \left( B_n, x \right) (1- \psi _n) \text{ d } P_{n, U}(x)\\&\qquad = \int \tilde{\varPi }_n \left( B_n, x\right) (1- \psi _n) p_{n,U}(x) \text{ d } x. \end{aligned}$$

Note that \(p_{n,U}(x ) = \int p_{n,z}(x) \text{ d } \tilde{\varPi }_0^U(z)\),

$$\begin{aligned}&= \int \tilde{\varPi }_n \left( B_n,x \right) (1- \psi _n) \left( \int p_{n,z}(x) \text{ d } \tilde{\varPi }_0^U(z) \right) \mathrm {d}x . \end{aligned}$$

Recall that for every measurable set B, \(\tilde{\varPi }_0^U(B) = \tilde{\varPi }_0(B \cap U)/ \tilde{\varPi }_0(U)\); thus

$$\begin{aligned}&= \frac{1}{\tilde{\varPi }_0(U)}\int \tilde{\varPi }_n \left( B_n, x \right) (1- \psi _n) \left( \int _U p_{n,z}(x) \text{ d } \tilde{\varPi }_{0}(z) \right) \mathrm {d}x. \end{aligned}$$

By Bayes’ rule, we expand \(\tilde{\varPi }_{n} \left( B_n, x \right) = \frac{\int \mathbb {1}[B_n] p_{n,z}(x) \text{ d } \tilde{\varPi }_0 (z)}{\int p_{n,z}(x)\text{ d }\tilde{\varPi }_0(z)}\),

$$\begin{aligned}&= \frac{\int (1- \psi _n) \left( \int \mathbb {1}[B_n] p_{n,z}(x) \text{ d } \tilde{\varPi }_{0}(z) \right) \left( \int _U p_{n,z}(x) \text{ d } \tilde{\varPi }_0(z) \right) \mathrm {d}x}{\tilde{\varPi }_{0}(U) \int p_{n,z}(x)\text{ d }\tilde{\varPi }_{0}(z)}. \end{aligned}$$

Note that \(\tilde{\varPi }_n \left( U, x \right) = \frac{\int _U p_{n,z}(x) \text{ d } \tilde{\varPi }_{0}(z)}{\int p_{n,z}(x)\text{ d }\tilde{\varPi }_{0}(z)}\),

$$\begin{aligned}&= \frac{1}{\tilde{\varPi }_0(U)} \int \left( \int \mathbb {1}[B_n] p_{n,z}(x) \text{ d } \tilde{\varPi }_{0}(z) \right) (1- \psi _n) \tilde{\varPi }_n \left( U,x \right) \mathrm {d}x. \end{aligned}$$

By Fubini’s theorem and \( \tilde{\varPi }_n \left( U,x \right) \le 1\),

$$\begin{aligned}\le & {} \frac{1}{\tilde{\varPi }_0(U)} \int _{B_n} \left( \int (1 - \psi _n) p_{n,z}(x) \mathrm {d}x\right) \mathrm {d}\tilde{\varPi }_0(z)\\= & {} \frac{1}{\tilde{\varPi }_{0}(U)} \int _{\{ \Vert z\Vert > \sqrt{n\beta _n^2}\}} P_{n,z} [1- \psi _n] \text{ d } \tilde{\varPi }_0(z). \end{aligned}$$

Note that \(P_{n,z} [1- \psi _n] \equiv P_{\theta }[1 - \psi _n]\) for \(\theta = \theta _0 + z/\sqrt{n}\) and that \(\sqrt{n\beta _n^2} \rightarrow \infty \) due to \(\beta _n = \omega (n^{-1/2})\). Thus, we can use the second statement of Eq. (28) to bound \(P_{n,z} [1- \psi _n]\), yielding

$$\begin{aligned}&\frac{1}{\tilde{\varPi }_0(U)} \int _{\{ \Vert z\Vert> \sqrt{n\beta _n^2}\}} P_{n,z} [1- \psi _n] \text{ d } \tilde{\varPi }_0(z)\\&\quad \le \frac{1}{\tilde{\varPi }_0(U)} \int _{\{ \Vert z\Vert > \sqrt{n\beta _n^2}\}} \exp \{- C (\Vert z\Vert ^2 \wedge n)\} \mathrm {d}\tilde{\varPi }_0(z). \end{aligned}$$

We then derive upper bounds for both the fraction and the integral to show the above is \(o\left( e^{-\frac{1}{\alpha _n}}\right) \). For the fraction, we define \(U_n :=\left\{ w\in \mathbb {R}^d: \sqrt{n}(w - \theta _0) \in U \right\} \), then

$$\begin{aligned} \tilde{\varPi }_0(U) =&\varPi _0(U_n) \ge \frac{\pi ^{\frac{d}{2}}}{\varGamma (\frac{d}{2} +1)} \left( n^{-1/2} r\right) ^d\inf _{w\in U_n}\pi _0(w). \end{aligned}$$

By Assumption 1, for all \( w \in \mathbb {R}^d, \pi _0(w)\) is positive and continuous, and hence \(\inf _{w\in U_n}\pi _0(w)\) is an increasing sequence that converges to \(\pi _0(\theta _0) > 0\) as \(n \rightarrow \infty \). Thus, there is a constant \(D > 0\) such that for sufficiently large n,

$$\begin{aligned} \tilde{\varPi }_0(U) \ge D n^{-d/2}, \end{aligned}$$
(30)

yielding that

$$\begin{aligned} \exists C > 0, \quad \text { s.t. } \quad \frac{1}{\tilde{\varPi }_{0}(U)} \le C n^{d/2}. \end{aligned}$$

For the integral, we split \(B_n\) into \(\{\sqrt{n\beta _n^2} < \Vert z\Vert \le k \sqrt{n} \}\) and \( \{ \Vert z\Vert > k \sqrt{n} \}\) for some constant \(0< k < 1\),

$$\begin{aligned}&\int _{\{ \Vert z\Vert> \sqrt{n\beta _n^2}\}} \exp \{- C (\Vert z\Vert ^2 \wedge n)\} \mathrm {d}\tilde{\varPi }_{0}(z)\\&\quad \le \int _{\{k \sqrt{n} \ge \Vert z\Vert > \sqrt{n\beta _n^2} \}} \exp \{- C \Vert z\Vert ^2 \} \mathrm {d}\tilde{\varPi }_{0}(z) + e^{-C k^2 n}. \end{aligned}$$

Then by change of variable to \(w = \frac{1}{\sqrt{n}} z + \theta _0\),

$$\begin{aligned} = \int _{\{k \ge \Vert w- \theta _0\Vert > \beta _n \}} \exp \{- C n \Vert w -\theta _0\Vert ^2 \} \pi _0(w ) \text{ d } w + e^{-C k^2 n}. \end{aligned}$$

Since by Assumption 1 \(\pi _0(w)\) is continuous for all \(w \in \mathbb {R}^d\), we can choose a sufficiently small k such that \(\pi _0(w)\) is uniformly bounded by a constant C over the region \(\{k \ge \Vert w- \theta _0\Vert > \beta _n \}\). Thus, the above is bounded above by

$$\begin{aligned}&C \int _{\{k \ge \Vert w- \theta _0\Vert> \beta _n \}} \exp \{- C n \Vert w -\theta _0\Vert ^2 \} \text{ d } w + e^{-C k^2 n}\\&\quad = C n^{-d/2} \int _{\{k \sqrt{n} \ge \Vert z\Vert> \sqrt{n \beta _n^2} \}} \exp \left\{ -C \Vert z\Vert ^2\right\} \mathrm {d}z + e^{-C k^2 n}\\&\quad \le C n^{-d/2} \int _{\{\Vert z\Vert > \sqrt{n \beta _n^2} \}} \exp \left\{ -C \Vert z\Vert ^2\right\} \mathrm {d}z + e^{-C k^2 n}, \end{aligned}$$

where the equality follows by changing variables back to \(z = \sqrt{n}(w - \theta _0) \). Now consider the integral on the RHS. Using spherical coordinates, there exists a fixed constant \( D > 0\) such that

$$\begin{aligned}&\int _{\{\Vert z\Vert> \sqrt{n \beta _n^2} \}} \exp \left\{ -C \Vert z\Vert ^2\right\} \mathrm {d}z \\&\qquad = D \int _{\{r> \sqrt{n \beta _n^2}\}} e^{-Cr^2} r^{d-1} \mathrm {d}r \\&\qquad = DC^{-d/2} \int _{\{s> C n\beta _n^2 \}} e^{-s} s^{\frac{d}{2} -1} \mathrm {d}s, \end{aligned}$$

where the second equality is by setting \(s = Cr^2\). Note that the integrand on the RHS is proportional to the PDF of a \(\varGamma (\frac{d}{2}, 1)\) random variable. Using the tail properties of Gamma random variables (Boucheron et al. 2013, p. 28), we have that for some generic constant \(D >0\),

$$\begin{aligned} \int _{\{\Vert z\Vert > \sqrt{n \beta _n^2} \}} \exp \left\{ -C \Vert z\Vert ^2\right\} \mathrm {d}z \le D e^{ - C n\beta _n^2}. \end{aligned}$$

Therefore, for some generic constants \(C , D >0\),

$$\begin{aligned}&\int _{\{ \Vert z\Vert > \sqrt{n\beta _n^2}\}} \exp \{- C (\Vert z\Vert ^2 \wedge n)\} \mathrm {d}\tilde{\varPi }_{0}(z)\nonumber \\&\quad \le D n^{-d/2} e^{- C n\beta _n^2} + e^{-C k^2 n}. \end{aligned}$$
(31)

Then, we combine Eqs. (30) and (31), yielding that for some constants \(C, D > 0\) independent of n,

$$\begin{aligned}&e^{\frac{1}{\alpha _n}} \cdot P_{n, U} \left[ \tilde{\varPi }_{n} \left( B_n\right) (1- \psi _n) \right] \\&\quad \le e^{\frac{1}{\alpha _n}} \frac{1}{\tilde{\varPi }_{0}(U)} \int _{\{ \Vert z\Vert > \sqrt{n\beta _n^2}\}} \exp \{- C (\Vert z\Vert ^2 \wedge n)\} \mathrm {d}\tilde{\varPi }_{0}(z)\\&\quad \le e^{\frac{1}{\alpha _n}} C \sqrt{n}^d e^{-C n} + e^{\frac{1}{\alpha _n}} D e^{- C n\beta _n^2} . \end{aligned}$$

Lastly, it remains to show that there exists a positive sequence \(\beta _n\) satisfying both \( \beta _n = o(\alpha _n)\) and \(\beta _n = \omega (\alpha _n^2)\) such that the RHS converges to 0. The first term always converges to 0 due to \(\alpha _n^3 n \rightarrow \infty \). For the second term, we consider two different cases. If \(\alpha _n = o(n^{-1/6})\), we pick \(\beta _n = n^{-1/3}\), which is both \( o(\alpha _n)\) and \(\omega (\alpha _n^2)\). Then,

$$\begin{aligned} e^{\frac{1}{\alpha _n}} D e^{- C n\beta _n^2}&= D \exp \left\{ \alpha _n^{-1} - C n^{1/3}\right\} \\&\longrightarrow 0, \end{aligned}$$

where the convergence in the last line is by \(n \alpha _n^3 \rightarrow \infty \Leftrightarrow \frac{1}{\alpha _n} = o(n^{1/3})\). If \(\alpha _n = \omega (n^{-1/6})\), we pick \(\beta _n = \alpha _n^2\). Then,

$$\begin{aligned} e^{\frac{1}{\alpha _n}} D e^{- C n\beta _n^2} = D \exp \left\{ \alpha _n^{-1} - C n \alpha _n^4 \right\} . \end{aligned}$$

Since \(\alpha _n = \omega (n^{-1/6})\), \(\frac{1}{\alpha _n} = o(n^{1/6})\) and \( n\alpha _n^4 = \omega (n^{1/3})\), yielding that the above converges to 0 as \(n \rightarrow \infty \). This completes the proof. \(\square \)
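As an illustrative aside (not part of the proof), the following Python sketch numerically checks the two step-size regimes used above: in both cases the exponent \(\alpha _n^{-1} - C n \beta _n^2\) diverges to \(-\infty \), so the corresponding bound vanishes. The constant C and the specific rates chosen for \(\alpha _n\) are assumptions made only for this illustration.

```python
import numpy as np

# Sanity check (illustrative only): the exponent 1/alpha_n - C * n * beta_n**2
# should decrease without bound in both regimes discussed above.
C = 1.0                        # arbitrary positive constant (assumed value)
n = np.logspace(2, 10, 9)      # n = 1e2, ..., 1e10

# Regime 1: alpha_n = n^(-1/4), so alpha_n = o(n^(-1/6)) and n * alpha_n^3 -> infinity;
# paired with beta_n = n^(-1/3).
alpha1, beta1 = n ** (-0.25), n ** (-1.0 / 3.0)
# Regime 2: alpha_n = n^(-1/10), so alpha_n = omega(n^(-1/6)); paired with beta_n = alpha_n^2.
alpha2 = n ** (-0.1)
beta2 = alpha2 ** 2

for alpha, beta in [(alpha1, beta1), (alpha2, beta2)]:
    exponent = 1.0 / alpha - C * n * beta ** 2
    print(exponent)            # entries decrease toward -infinity, so exp(exponent) -> 0
```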

1.7.3 B.7.3 Proof of \(2\text {nd}\) statement of Theorem 7

In this section, we show that the smoothed MAP estimator \(\hat{\theta }_n^\star \) is also a consistent estimate of \(\theta _0\), but with a convergence rate that is slower than the standard \(\sqrt{n}\) rate. This is because the variance of the smoothing kernel satisfies \(\alpha _n = \omega (n^{-1/3})\), and the convergence rate of \(\hat{\theta }_n^\star \) is determined by \(\alpha _n\) via

$$\begin{aligned} \Vert \hat{\theta }_n^\star - \theta _{0}\Vert = O_{P_{\theta _0}}(\sqrt{\alpha _n}). \end{aligned}$$
(32)

Recall that \(\theta _{\text {MLE},n }\) is a \(\sqrt{n}\)-consistent estimate of \(\theta _0\). Thus, it is sufficient to show \(\left\| \hat{\theta }_n^\star - \theta _{\text {MLE},n} \right\| = O_{P_{\theta _0}}\left( \sqrt{\alpha _n}\right) \).

Note that \(\hat{\theta }_n^\star \) and \(\theta _n^\star \) are maximizers of the stochastic processes \(\hat{\pi }_n(\theta )\) and \(\pi _n(\theta )\), respectively, which can be studied in the framework of M-estimation (van der Vaart 2000; van der Vaart and Wellner 2013). A useful tool for establishing the asymptotics of M-estimators is the argmax continuous mapping theorem (van der Vaart and Wellner 2013, Lemma 3.2.1), stated as follows.

Lemma 16

[Argmax Continuous Mapping (van der Vaart and Wellner 2013)] Let \(\{f_n(\theta ) \}\) and \(f(\theta )\) be stochastic processes indexed by \(\theta \in \varTheta \). Let \(\hat{\theta }\) be a random element such that almost surely, for every open set G containing \(\hat{\theta }\),

$$\begin{aligned} f(\hat{\theta }) > \sup _{\theta \notin G} f(\theta ), \end{aligned}$$

and let \(\hat{\theta }_n\) be a random sequence such that, almost surely,

$$\begin{aligned} f_n(\hat{\theta }_n) = \sup _{\theta \in \varTheta } f_n(\theta ). \end{aligned}$$

If \(\sup _{\theta \in \varTheta } |f_n(\theta ) - f(\theta )| = o_P(1)\) as \(n \rightarrow \infty \), then

$$\begin{aligned} \hat{\theta }_n \overset{d}{\rightarrow }\hat{\theta }. \end{aligned}$$
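As a toy numerical illustration of this lemma (not part of the paper's argument), the sketch below maximizes a perturbed objective \(f_n\) that converges uniformly to \(f(\theta ) = -\theta ^2\); the maximizers approach \(\mathop {\mathrm{arg}\,\mathrm{max}}_\theta f(\theta ) = 0\). The particular form of \(f_n\) is an assumption made only for this illustration.

```python
import numpy as np

# Toy illustration of the argmax continuous mapping idea: if f_n -> f uniformly and
# f has a well-separated maximizer, then argmax f_n approaches argmax f.
theta = np.linspace(-3.0, 3.0, 200001)
f = -theta ** 2                                                  # limit objective, maximized at 0
for n in [10, 100, 1000, 10000]:
    f_n = -(theta - 1.0 / np.sqrt(n)) ** 2 + np.sin(theta) / n   # assumed perturbed objective
    print(n, theta[np.argmax(f_n)])                              # maximizer tends to 0
```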

The proof strategy of the \(2\text {nd}\) statement of Theorem 7 is to apply Lemma 16 in a setting where \(f_n\) is \(\hat{\pi }_n(\theta )\) and f is a Gaussian density. Using the Bernstein–von Mises Theorem 4, we show that \(\hat{\pi }_n(\theta )\) converges uniformly to this Gaussian density, which implies that the MAP of \(\hat{\pi }_n(\theta )\) converges in distribution to the MAP of this Gaussian distribution by the Argmax Continuous Mapping theorem. The detailed proof is as follows.

Proof of Second statement of Theorem 7

Note that

$$\begin{aligned} \Vert \hat{\theta }_n^\star - \theta _{0}\Vert \le \Vert \hat{\theta }_n^\star - \theta _{\text {MLE},n}\Vert + \Vert \theta _{\text {MLE},n} - \theta _0 \Vert . \end{aligned}$$

By Theorem 4, we have \(\Vert \theta _{\text {MLE},n} - \theta _{0}\Vert = O_{P_{\theta _0}}(1/\sqrt{n})\). In addition, since \(\sqrt{\alpha _n } = \omega (1/ \sqrt{n})\), in order to obtain Eq. (32) it suffices to show

$$\begin{aligned} \left\| \hat{\theta }_n^\star - \theta _{\text {MLE},n} \right\| = O_{P_{\theta _0}}\left( \sqrt{\alpha _n}\right) , \end{aligned}$$

for which it is in turn sufficient to prove

$$\begin{aligned} \frac{1}{\sqrt{\alpha _n}} \left( \hat{\theta }_n^\star - \theta _{\text {MLE},n}\right) {\mathop {\rightarrow }\limits ^{P_{\theta _0}}} 0. \end{aligned}$$

Let \(\xi = \frac{1}{\sqrt{\alpha _n}} \left( \theta - \theta _{\text {MLE},n}\right) \), \(\xi ^*_n = \frac{1}{\sqrt{\alpha _n}} \left( \hat{\theta }_n^\star - \theta _{\text {MLE},n} \right) \), and \(t = \frac{1}{\sqrt{\alpha _n}} \left( w - \theta _{\text {MLE},n}\right) \). Expanding \(\hat{\pi }_n(\theta )\), which is defined in Eq. (6), we obtain

$$\begin{aligned} \xi _n^*= & {} \mathop {\mathrm{arg}\,\mathrm{max}}\limits _{\xi } \hat{\pi }\left( \sqrt{\alpha _n} \xi + \theta _{\text {MLE},n}\right) \\= & {} \mathop {\mathrm{arg}\,\mathrm{max}}\limits _{\xi } \int \pi _n\left( \sqrt{\alpha _n} t+ \theta _{\text {MLE},n}\right) \\&\times \exp \left( -\frac{1}{2\alpha _n}\Vert \sqrt{\alpha _n}\xi - \sqrt{\alpha _n}t \Vert ^2\right) \text{ d } t \\= & {} \mathop {\mathrm{arg}\,\mathrm{max}}\limits _{\xi } \int \alpha _n^{d/2} \pi _n\left( \sqrt{\alpha _n} t+ \theta _{\text {MLE},n}\right) \\&\times \exp \left( -\frac{1}{2}\Vert \xi - t \Vert ^2\right) \text{ d } t. \end{aligned}$$

Define

$$\begin{aligned} f_n(\xi )= & {} \int \alpha _n^{d/2} \pi _n\left( \sqrt{\alpha _n} t+ \theta _{\text {MLE},n}\right) \exp \left( -\frac{1}{2}\Vert \xi - t \Vert ^2\right) \text{ d } t, \\ g_n(\xi )= & {} \int \phi \left( t; 0, \frac{1}{n \alpha _n}H_{\theta _0}^{-1}\right) \exp \left( -\frac{1}{2}\Vert \xi - t \Vert ^2\right) \text{ d } t,\\ f(\xi )= & {} (2\pi )^{d/2} \phi \left( \xi ; 0, I \right) , \end{aligned}$$

where \(\phi (\cdot ; \mu , \varSigma )\) denotes the PDF of \(\mathscr {N}(\mu , \varSigma )\).

By adding and subtracting \(f(\xi )\),

$$\begin{aligned} \xi _n^*= & {} \mathop {\mathrm{arg}\,\mathrm{max}}\limits _{\xi } f_n(\xi )\\= & {} \mathop {\mathrm{arg}\,\mathrm{max}}\limits _{\xi } \left\{ f_n(\xi ) - f(\xi ) + f(\xi )\right\} . \end{aligned}$$

We then apply Lemma 16 to show \(\xi _n^* \overset{d}{\rightarrow }\mathop {\mathrm{arg}\,\mathrm{max}}\nolimits _{\xi } f(\xi )\). We start by verifying a condition of the argmax continuous mapping theorem that

$$\begin{aligned} \lim _{n \rightarrow \infty } \sup _{\xi } |f_n(\xi ) - f(\xi )| = 0. \end{aligned}$$
(33)

By triangle inequality, for all n,

$$\begin{aligned} \sup _{\xi } |f_n(\xi ) - f(\xi )|&\le \sup _{\xi } |f_n(\xi ) - g_n(\xi )| \nonumber \\&+ \sup _{\xi } |g_n(\xi ) - f(\xi )|. \end{aligned}$$
(34)

We now show that both terms on the RHS converge to 0.

For the first term, note that \(\alpha _n^{d/2} \pi _n(\sqrt{\alpha _n} t+ \theta _{\text {MLE},n})\) is the probability density function, in t, of the posterior distribution of \(\frac{1}{\sqrt{\alpha _n}}(W - \theta _{\text {MLE},n})\) with \(W \sim \varPi _n\), i.e. the posterior reparametrized in t. Thus, for all n,

$$\begin{aligned} \sup _{\xi } |f_n(\xi ) - g_n(\xi )| \le \int \left| \alpha _n^{d/2} \pi _n\left( \sqrt{\alpha _n} t+ \theta _{\text {MLE},n}\right) - \phi \left( t; 0, \frac{1}{n \alpha _n}H_{\theta _0}^{-1}\right) \right| \text{ d } t , \end{aligned}$$

where the inequality is by \( \sup _{\xi ,t} \exp ( -\frac{1}{2}\Vert \xi - t \Vert ^2) \le 1\). Under Assumption 1, the posterior distribution admits a Bernstein–von Mises theorem (Theorem 4), namely

$$\begin{aligned} \int \left| \pi _n(\theta ) - \phi \left( \theta ; \theta _{\text {MLE},n}, \frac{1}{n}H_{\theta _0}^{-1}\right) \right| \text{ d } \theta = o_{P_{\theta _0}}(1). \end{aligned}$$
(35)

Since the total variation distance is invariant under reparametrization, we have

$$\begin{aligned} \int \left| \alpha _n^{d/2} \pi _n\left( \sqrt{\alpha _n} t+ \theta _{\text {MLE},n}\right) - \phi \left( t; 0, \frac{1}{n \alpha _n}H_{\theta _0}^{-1}\right) \right| \text{ d } t = o_{P_{\theta _0}}(1). \end{aligned}$$

This shows the uniform convergence of \(f_n(\xi )\) to \(g_n(\xi )\). Note that Eq. (35) states the Bernstein–von Mises theorem with a different centering process than Theorem 4; this is allowed when \(\theta _{\text {MLE},n}\) is asymptotically normal (van der Vaart 2000, p. 144), which is ensured by Theorem 4.

For the second term in Eq. (34), note that we can evaluate \(g_n(\xi )\) in closed form since it is a convolution of two Gaussian PDFs:

$$\begin{aligned} g_n(\xi ) = (2\pi )^{d/2} \phi \left( \xi ; 0, \frac{1}{n \alpha _n}H_{\theta _0}^{-1} + I \right) . \end{aligned}$$

Comparing this with \(f(\xi ) = (2\pi )^{d/2} \phi \left( \xi ; 0, I \right) \), note that \( \frac{1}{n \alpha _n}H_{\theta _0}^{-1} + I \rightarrow I\), since \(\alpha _n^3 n \rightarrow \infty \) implies \(n\alpha _n \rightarrow \infty \). Furthermore, for Gaussian distributions, convergence of the parameters implies uniform convergence of the PDFs, yielding

$$\begin{aligned} \lim _{n \rightarrow \infty } \sup _{\xi } |g_n(\xi ) - f(\xi ) | = 0. \end{aligned}$$

Thus, the RHS of Eq. (34) converges to 0 as \(n \rightarrow \infty \), which establishes Eq. (33).
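The closed form of \(g_n\) used above follows from the standard Gaussian convolution identity; the following one-dimensional sketch checks it numerically by quadrature. The values of the variance and the evaluation point are assumptions made only for this check.

```python
import numpy as np

# Illustrative 1-D check of the identity
#   int phi(t; 0, s2) * exp(-(xi - t)^2 / 2) dt = sqrt(2 * pi) * phi(xi; 0, s2 + 1).
def npdf(x, var):
    return np.exp(-0.5 * x ** 2 / var) / np.sqrt(2.0 * np.pi * var)

s2 = 0.3          # stands in for H^{-1} / (n * alpha_n) in one dimension (assumed value)
xi = 0.7          # arbitrary evaluation point
t = np.linspace(-12.0, 12.0, 200001)
dt = t[1] - t[0]
lhs = np.sum(npdf(t, s2) * np.exp(-0.5 * (xi - t) ** 2)) * dt    # Riemann-sum quadrature
rhs = np.sqrt(2.0 * np.pi) * npdf(xi, s2 + 1.0)
print(lhs, rhs)   # the two values agree up to quadrature error
```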

Now we examine \(f(\xi )\) with the goal of applying Lemma 16 to obtain \(\xi _n^* \overset{d}{\rightarrow }\mathop {\mathrm{arg}\,\mathrm{max}}\nolimits _{\xi } f(\xi )\). Note that

$$\begin{aligned} \mathop {\mathrm{arg}\,\mathrm{max}}\limits _{\xi } f(\xi ) = 0 \quad \text {and} \quad \sup _{\xi }f(\xi ) ={\mathrm{det}}\left( I \right) ^{-1/2} = 1. \end{aligned}$$

To apply Lemma 16, we need to ensure that for any open set G that contains 0,

$$\begin{aligned} f(0)> \sup _{\xi \notin G} f(\xi ). \end{aligned}$$
(36)

This holds by the unimodality of the standard Gaussian density.

Therefore, with both conditions Eqs. (33) and (36) verified, we can apply Lemma 16 to conclude that \(\xi _n^* \overset{d}{\rightarrow } 0\); since the limit is a constant, convergence in distribution implies convergence in probability, and hence

$$\begin{aligned} \frac{1}{\sqrt{\alpha _n}} \left( \hat{\theta }_n^\star - \theta _{\text {MLE},n} \right) {\mathop {\rightarrow }\limits ^{P_{\theta _0}}} 0. \end{aligned}$$

This completes the proof. \(\square \)

1.8 B.8 Proof of Theorem 2

Proof

Lemma 9 implies the existence of a locally asymptotically convex and smooth region \(B_r(\theta _0)\) of \(f_n\), and Theorems 5 and 7 show the consistency of \(\theta _{n}^\star \) and \(\hat{\theta }_n^\star \) for \(\theta _0\) in the data asymptotics. Let \(\theta _\text {init}\) be the initial value of Algorithm 2, satisfying \(\Vert \theta _\text {init} - \hat{\theta }^\star _n\Vert \le \frac{r}{4(\ell + 1)}\). If \(\Vert \hat{\theta }_n^\star - \theta _0\Vert \le r/4\) and \(\Vert \theta _n^\star - \theta _0\Vert \le r/4\), then

$$\begin{aligned}&\Vert \theta _\text {init}- \theta _n^\star \Vert \\&\quad \le \Vert \theta _\text {init} - \hat{\theta }_n^\star \Vert + \Vert \hat{\theta }_n^\star - \theta _0\Vert + \Vert \theta _n^\star - \theta _0\Vert \\&\quad \le \left( \frac{1}{2} + \frac{1}{4(\ell + 1)}\right) r, \end{aligned}$$

and \(B_{3r/4}(\theta _n^\star ) \subseteq B_r(\theta _0)\). Combining the above shows that, with \(r':= \frac{3r}{4} > 0\),

$$\begin{aligned} \lim _{n \rightarrow \infty } {\mathbb {P}}\left( \forall \theta \in B_{r'}(\theta _{n}^\star ), \quad \varepsilon I\preceq \nabla ^2 f_n(\theta ) \preceq \ell I\right) = 1. \end{aligned}$$

It then remains to show that the iterates produced by Algorithm 2 remain confined in \(B_{r'}(\theta _n^\star )\) as the amount of data grows. Since \(\theta _n^\star \) is the global optimum of \(f_n\), and backtracking line search ensures that the objective value decreases within the locally strongly convex region \(B_{r'}(\theta _n^\star )\) (Boyd and Vandenberghe 2004, p. 465), the gradient norm decays in each iteration and hence the iterates converge to the optimum. Therefore, it suffices to show that the first iterate stays inside.

By \(\ell \)-smoothness of \(f_n\) inside \(B_{r'}(\theta _n^\star )\), we have \(\Vert \nabla f_n(\theta _\text {init})\Vert \le \frac{\ell r}{4(\ell +1)}\) and hence

$$\begin{aligned} \lim _{n \rightarrow \infty } {\mathbb {P}}\left( \theta _\text {init}- \nabla f_n(\theta _\text {init})\in B_{r'}(\theta _n^\star ) \right) = 1. \end{aligned}$$

This shows the confinement. The convergence rate of iterates is \(\eta ^k\), where

$$\begin{aligned} 0 \le \eta = \sqrt{\max \left\{ |1 - \frac{2\varepsilon }{\varepsilon + \ell }|, |1 - \frac{2\ell }{\varepsilon + \ell }|\right\} } < 1, \end{aligned}$$

which follows from the standard theory of gradient descent under \(\varepsilon \)-strong convexity and \(\ell \)-Lipschitz smoothness (Ryu and Boyd 2016, p. 15). This completes the proof. \(\square \)
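As a minimal numerical sketch (not the paper's Algorithm 2, which uses backtracking line search), constant-step gradient descent on a quadratic whose Hessian eigenvalues are \(\varepsilon \) and \(\ell \) exhibits the linear convergence invoked above; the constants below are assumed toy values.

```python
import numpy as np

# Gradient descent on f(theta) = 0.5 * theta^T H theta with H = diag(eps, ell):
# the error contracts by a constant factor per iteration (linear convergence).
eps, ell = 0.5, 4.0                       # assumed strong-convexity / smoothness constants
H = np.diag([eps, ell])
theta = np.array([5.0, -3.0])             # initialization inside the "good" basin
step = 2.0 / (eps + ell)                  # classical constant step size
for k in range(1, 26):
    theta = theta - step * (H @ theta)    # gradient step; the optimum is theta* = 0
    if k % 5 == 0:
        print(k, np.linalg.norm(theta))   # geometric decay of the error norm
```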

1.9 B.9 Proof of Theorem 3

Proof of Theorem 3

In this proof, we aim to apply Theorem 17 with

$$\begin{aligned} \mathscr {X}= \{(\mu , L) : \mu \in \mathbb {R}^d, \; L \in \mathbb {R}^{d \times d} \text { lower triangular with non-negative diagonals}\}, \end{aligned}$$

which is closed and convex. Note that in the notation of this theorem, \(x = (\mu ^T, L_1^T, \dots , L_d^T)^T \in \mathbb {R}^{(d+1)d}\) and \(V \in \mathbb {R}^{(d+1)d \times (d+1)d}\) is set to be a diagonal matrix with entries 2 for the \(\mu \) components and \(r/(2\Vert I - L_n^\star \Vert _F)\) for the L components. Therefore,

$$\begin{aligned} J(x) = J(\mu , L) = 4\Vert \mu -\mu _n^\star \Vert ^2 + \frac{r^2}{4\Vert I-L_n^\star \Vert ^2_F}\Vert L-L_n^\star \Vert ^2_F. \end{aligned}$$

This setting yields two important facts. First, by Theorems 7 and 6,

$$\begin{aligned} \hat{\theta }_n^\star {\mathop {\rightarrow }\limits ^{P_{\theta _0}}} \theta _0 \quad \text {and} \quad \mu _n^\star {\mathop {\rightarrow }\limits ^{P_{\theta _0}}} \theta _0, \end{aligned}$$

yielding that

$$\begin{aligned} {\mathbb {P}}\left( \Vert \hat{\theta }_n^\star -\theta _0\Vert + \Vert \mu _n^\star - \theta _0\Vert \le \frac{r}{4\sqrt{2}} \right) \rightarrow 1, \quad \text { as } n \rightarrow \infty . \end{aligned}$$

If \(\Vert \mu _0 - \hat{\theta }_n^\star \Vert ^2 \le \frac{r^2}{32}\), then by the triangle inequality, the probability that the following chain of inequalities holds converges to 1 under \(P_{\theta _0}\) as \(n \rightarrow \infty \),

$$\begin{aligned} \Vert \mu _0 - \mu _n^\star \Vert \le \Vert \mu _0 - \hat{\theta }_n^\star \Vert + \Vert \hat{\theta }_n^\star - \theta _0 \Vert + \Vert \mu _n^\star - \theta _0 \Vert \le \frac{r}{2\sqrt{2}}. \end{aligned}$$

Further with \(L_0 = I\), \(J(\mu _0, L_0) \le \frac{3 r^2}{4} \le r^2\). Hence, if we initialize \(L_0 = I \) and \(\mu _0\) such that \(\Vert \mu _0 - \hat{\theta }_n^\star \Vert _2^2 \le \frac{r^2}{32}\),

$$\begin{aligned} {\mathbb {P}}\left( x_0 \in \{x: J(x) \le r^2 \} \right) \rightarrow 1, \quad \text { as } n \rightarrow \infty . \end{aligned}$$
(37)
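A quick numerical check (illustrative only) of the arithmetic behind Eq. (37): with \(L_0 = I\) and \(\Vert \mu _0 - \mu _n^\star \Vert \le r/(2\sqrt{2})\), the objective J evaluates to at most \(3r^2/4\). The toy values of \(\mu _n^\star \) and \(L_n^\star \) below are assumptions made only for this check.

```python
import numpy as np

# Check that J(mu_0, I) <= 3 r^2 / 4 when ||mu_0 - mu_star|| <= r / (2 sqrt(2)).
rng = np.random.default_rng(3)
d, r = 5, 2.0
mu_star = rng.normal(size=d)
L_star = np.tril(rng.normal(size=(d, d))) + 2.0 * np.eye(d)      # assumed toy optimum
direction = rng.normal(size=d)
mu_0 = mu_star + (r / (2.0 * np.sqrt(2.0))) * direction / np.linalg.norm(direction)
L_0 = np.eye(d)
J = 4.0 * np.sum((mu_0 - mu_star) ** 2) \
    + (r ** 2 / (4.0 * np.linalg.norm(np.eye(d) - L_star) ** 2)) * np.linalg.norm(L_0 - L_star) ** 2
print(J, 0.75 * r ** 2)   # here J equals 3 r^2 / 4, the worst case of the bound
```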

Second, if \(J(\mu , L) \le r^2\) then \(\mu \) is close to the optimum and \(\Vert L\Vert _F\) is not too large, i.e.

$$\begin{aligned} J(\mu ,L) \le r^2&\implies \Vert \mu -\mu _n^\star \Vert ^2 \le r^2/4\\ J(\mu ,L) \le r^2&\implies \Vert L-L_n^\star \Vert _F^2 \le 4\Vert I-L_n^\star \Vert _F^2 \\&\implies \Vert L\Vert _F \le 2\Vert I-L_n^\star \Vert _F + \Vert L_n^\star \Vert _F, \end{aligned}$$

yielding that \(\{J(\mu , L) \le r^2 \} \subseteq {\mathscr {B}}_{r,n}\). Recall that

$$\begin{aligned} {\mathscr {B}}_{r,n}= & {} \left\{ \mu \in \mathbb {R}^d, L\in \mathbb {R}^{d\times d} : \Vert \mu - \mu _n^\star \Vert ^2 \le \frac{r^2}{4}\,\,\right. \nonumber \\&\left. \text {and}\,\, \Vert L-L_n^\star \Vert _F^2 \le 4\Vert I-L_n^\star \Vert _F^2\right\} . \end{aligned}$$

Then by Corollary 14, under Assumptions 1 and 2, the probability of the event that

$$\begin{aligned}&F_n\text { is } \frac{\varepsilon }{2} D_n\text {-strongly convex in}\, \{J(\mu , L) \le r^2 \}\nonumber \\&\quad \text { and globally }\ell D_n\text {-Lipschitz smooth} \end{aligned}$$
(38)

converges to 1 in \(P_{\theta _0}\) as \(n \rightarrow \infty \).

For brevity, we make the following definitions for the rest of this proof, recalling the definitions of \(f_n\) and \(F_n\) in Eqs. (2) and (3), which we restate here:

$$\begin{aligned}&I_n: \mathscr {X}\rightarrow \mathbb {R}, \qquad I_n(x) := -\frac{1}{n}\log {\mathrm{det}}L \\&f_n: \mathbb {R}^d \rightarrow \mathbb {R}, \qquad f_n( \theta ) := -\frac{1}{n}\log \pi _n(\theta )\\&\tilde{f}_n: (\mathscr {X}, \mathbb {R}^d) \rightarrow \mathbb {R}, \qquad \tilde{f}_n(x, Z) := -\frac{1}{n}\log \pi _n\left( \mu + \frac{1}{\sqrt{n}} L Z \right) \\&F_n: \mathscr {X}\rightarrow \mathbb {R}, \qquad F_n(x) := \mathbb {E}\left[ -\frac{1}{n}\log \pi _n\left( \mu + \frac{1}{\sqrt{n}} L Z \right) \right] \\&\phi _n: = I_n + \tilde{f}_n ,\qquad \varPhi _n: = I_n + F_n. \end{aligned}$$

Here, \(\phi _n(x, Z)\) is the KL cost function without the expectation, and \(\varPhi _n(x)\) is the cost function with the expectation. To match the notation of Theorem 17, we write the scaled gradient estimator defined in Eq. (8) as \(g_n\),

$$\begin{aligned} g_n(x,Z) = \left\{ \begin{array}{ll} R_n(x)\nabla \phi _n(x, Z) &{}\quad x \in \mathscr {X}^\mathrm {o}\\ \lim _{y\rightarrow x} R_n(y)\nabla \phi _n(y, Z) &{}\quad x \in \partial \mathscr {X}\end{array}\right. , \end{aligned}$$

for a diagonal scaling matrix \(R_n(x) \in \mathbb {R}^{d(d+1)\times d(d+1)}\). We define \(R_n(x)\) to have entries 1 for the \(\mu \) components, 1 for the off-diagonal L components, and \(1/(1+(n L_{ii})^{-1})\) for the diagonal L components. Note that \(x \rightarrow \partial \mathscr {X}\) means that some \(L_{ii} \rightarrow 0\), in which case the corresponding entry of \(g_n(x,Z)\) converges to \(-1\). Since Z is a standard normal random variable, on the event that \(-\frac{1}{n} \log \pi _n\) has a Lipschitz gradient, the gradient can be exchanged with the expectation, so the true (scaled) gradient is

$$\begin{aligned} G_n(x) := \mathbb {E}\left[ g_n(x,Z) \right] = R_n(x) \nabla \varPhi _n(x). \end{aligned}$$

Note that the projected stochastic iteration

$$\begin{aligned} x_{k+1} = \varPi _\mathscr {X}\left( x_k - \gamma _k g_n(x_k, Z_k)\right) , \quad k \in {\mathbb {N}}\cup \{0\}, \end{aligned}$$

with \(\varPi _\mathscr {X}(x) :=\mathop {\mathrm{arg}\,\mathrm{min}}\nolimits _{y\in \mathscr {X}} \Vert V(x - y)\Vert ^2\), is equivalent to the iteration described in Algorithm 3. Note that \(\phi _n\) is differentiable only for \(x \in \mathscr {X}^\mathrm {o}\). When \(L_{ii} = 0\) for some \(i \in [d]\), we evaluate the scaled gradient by continuation via the limit \(\lim _{L_{ii} \rightarrow 0} -\frac{(n L_{ii})^{-1}}{1+ (n L_{ii})^{-1}} = -1\), even though the gradient itself is not defined there. In the remainder of the proof, we do not treat these boundary points specially when applying Taylor expansions and taking derivatives.
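A minimal sketch of these two ingredients (our own illustration, with hypothetical helper names; it is not the implementation accompanying the paper): the diagonal scaling applied to the gradient entries of L, and the projection onto the set of lower-triangular matrices with non-negative diagonals. Because V is diagonal and the constraint set factorizes over coordinates, this weighted projection reduces to the coordinatewise operation below.

```python
import numpy as np

def scale_matrix_entries(L, n):
    """Per-entry scaling for the gradient w.r.t. L: 1 off the diagonal and
    1 / (1 + (n * L_ii)^(-1)) on the diagonal (tending to 0 as L_ii -> 0)."""
    S = np.ones_like(L)
    i = np.arange(L.shape[0])
    S[i, i] = 1.0 / (1.0 + 1.0 / (n * np.maximum(L[i, i], 1e-12)))  # small guard at the boundary
    return S

def project(mu, L):
    """Projection onto {L lower triangular with non-negative diagonal}: mu is unconstrained,
    the upper triangle of L is zeroed, and negative diagonal entries are clipped to zero."""
    L = np.tril(L)
    i = np.arange(L.shape[0])
    L[i, i] = np.maximum(L[i, i], 0.0)
    return mu, L

# Example usage on random values (toy dimensions).
rng = np.random.default_rng(0)
d, n = 3, 100
mu, L = rng.normal(size=d), rng.normal(size=(d, d))
print(scale_matrix_entries(np.abs(L), n))
print(project(mu, L))
```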

Next, we apply Theorem 17 to carry out the proof. The rest of the proof consists of two parts: showing the confinement result (statement 2 of Theorem 17) and showing the convergence result (statement 3 of Theorem 17). We prove these two results on the event that Eqs. (37) and (38) hold; since the probability of this event converges to 1 under \(P_{ \theta _0}\) as \(n \rightarrow \infty \), the final result holds with the same limiting probability.

We first show the confinement result by analysing \(\varepsilon (x)\), \(\ell ^2(x)\), and \(\sigma ^2(r)\), which are defined in Eqs. (44) and (45). We aim to obtain the following:

  1. i.

    We can find sufficiently small \(\gamma _k > 0\) such that

    $$\begin{aligned} \alpha _{k} = 1 + \mathbb {1}\left[ J(x_k) \le r^2\right] (-2\gamma _k\varepsilon (r) + 2\gamma _k^2\ell ^2(r)) \in (0,1] \end{aligned}$$

    holds for all \(x\in \mathscr {X}\), i.e.

    $$\begin{aligned} \forall x\in \mathscr {X}: J(x)\le r^2, \quad 0 \le 2\gamma _k\varepsilon (x) - 2\gamma _k^2\ell ^2(x) \le 1. \nonumber \\ \end{aligned}$$
    (39)
  2. ii.

    \(\sigma ^2(r) \rightarrow 0\) as \(n \rightarrow \infty \) to guarantee the SGD iterations are eventually locally confined as \(n \rightarrow \infty \) (based on Theorem 17).

To show the statement i., Eq. (39), we start by deriving a lower bound for \(2\gamma _k\varepsilon (x) - 2\gamma _k^2\ell ^2(x)\). Examine the expression,

$$\begin{aligned}&2\gamma _k\varepsilon (x) - 2\gamma _k^2\ell ^2(x) \\&\quad = 2\gamma _k J(x)^{-1}(x-x^\star )^TV^TV R_n(x)\left( \nabla \varPhi _n(x)-\nabla \varPhi _n(x^\star )\right) \\&\qquad \!-\! 2 \gamma _k^2 J(x)^{-1} \left( \nabla \varPhi _n(x)\!-\!\nabla \varPhi _n(x^\star )\right) ^T \\&\qquad \quad \times R^T(x) V^TV R_n(x)\left( \nabla \varPhi _n(x)\!-\!\nabla \varPhi _n(x^\star )\right) \\&\quad = \frac{2 \gamma _k}{J(x)}\left( V(x-x^\star )\right) ^TV R_n(x)\left( \int \cdots \right) V^{-1}\left( V(x-x^\star )\right) \\&\qquad - \frac{2 \gamma _k^2}{J(x)}(V(x-x^\star ))^T V^{-T} \left( \int \cdots \right) ^T \left( V R_n(x) \right) ^2 \left( \int \cdots \right) \\&\qquad \times V^{-1}\left( V(x-x^\star )\right) , \end{aligned}$$

where \(\left( \int \cdots \right) = \left( \int _0^1 \nabla ^2 \varPhi _n((1-t)x^\star + tx)\mathrm {d}t \right) \). By splitting \(\varPhi _n\) into the regularization \(I_n(x)\) and the expectation \(F_n(x)\); and defining

$$\begin{aligned} A(x)&:=V R_n(x)\left( \int _0^1 \nabla ^2 I_n((1-t)x^\star + tx) \mathrm {d}t \right) V^{-1}\\ B(x)&:=V R_n(x)\left( \int _0^1 \nabla ^2 F_n((1-t)x^\star + tx) \mathrm {d}t \right) V^{-1}\\ v(x)&:=V(x-x^\star ), \end{aligned}$$

the above expression can be written as

$$\begin{aligned}&2\gamma _k\varepsilon (x) - 2\gamma _k^2\ell ^2(x) \\&\quad = 2\gamma _k \frac{v(x)^T(A(x)+B(x))v(x) -\gamma _k \Vert (A(x)+B(x))v(x)\Vert ^2 }{\Vert v(x)\Vert ^2}\\&\quad \ge 2\gamma _k \frac{v(x)^TA(x)v(x)\!+\!v(x)^TB(x)v(x) \!-\!2\gamma _k \Vert A(x)v(x)\Vert ^2 \!-\! 2\gamma _k\Vert B(x)v(x)\Vert ^2 }{\Vert v(x)\Vert ^2}\\&\quad \ge 2\gamma _k \left\{ \frac{v(x)^TA(x)v(x)+v(x)^TB(x)v(x) -2\gamma _k \Vert A(x)v(x)\Vert ^2}{\Vert v(x)\Vert ^2} \right. \\&\qquad \left. \frac{ - 2\gamma _k\Vert B(x)(VR_n(x)\ell D_n V^{-1})^{-1}\Vert ^2\Vert VR_n(x)\ell D_n V^{-1} v(x)\Vert ^2 }{\Vert v(x)\Vert ^2} \right\} \\&\quad = 2\gamma _k \left\{ \frac{v(x)^TA(x)v(x)+v(x)^T(B(x)- VR_n(x)\frac{\varepsilon }{2}D_n V^{-1})v(x) }{\Vert v(x)\Vert ^2} \right. \\&\qquad \left. + \frac{ v(x)^T\left( VR_n(x)\frac{\varepsilon }{2}D_n V^{-1}\right) v(x)-2\gamma _k \Vert A(x)v(x)\Vert ^2 }{\Vert v(x)\Vert ^2} \right. \\&\qquad \left. - \frac{ 2\gamma _k\Vert B(x)(VR_n(x)\ell D_n V^{-1})^{-1}\Vert ^2\Vert DR_n(x)\ell D_n V^{-1} v(x)\Vert ^2 }{\Vert v(x)\Vert ^2} \right\} \end{aligned}$$

Note that by Corollary 14, \(\frac{\varepsilon }{2} D_n \preceq \nabla ^2 F_n(x) \preceq \ell D_n\), and both V and \(R_n(x)\) are positive diagonal matrices, leading to

$$\begin{aligned}&B(x)- VR_n(x)\frac{\varepsilon }{2}D_n V^{-1} \succeq 0 I\\&\Vert B(x)(VR_n(x)\ell D_n V^{-1})^{-1}\Vert ^2 \le 1. \end{aligned}$$

Thus, the above expression can be bounded below by

$$\begin{aligned}&2\gamma _k\varepsilon (x) - 2\gamma _k^2\ell ^2(x) \\&\quad \ge 2\gamma _k \left\{ \frac{v(x)^TA(x)v(x)+ v(x)^T\left( VR_n(x)\frac{\varepsilon }{2}D_n V^{-1}\right) v(x) }{\Vert v(x)\Vert ^2} \right. \\&\qquad \left. \frac{-2\gamma _k \Vert A(x)v(x)\Vert ^2 - 2\gamma _k\Vert VR_n(x)\ell D_n V^{-1} v(x)\Vert ^2 }{\Vert v(x)\Vert ^2} \right\} \\&\quad = \frac{2}{\Vert v(x)\Vert ^2} v(x)^T \left\{ \left[ \gamma _k A(x) - 2\gamma _k^2 A^2(x) \right] \right. \\&\qquad \left. + \frac{1}{2}\left[ \varepsilon \gamma _k R_n(x) D_n - 4 \ell ^2 \left( \gamma _k R_n(x) D_n \right) ^2 \right] \right\} v(x)\\&\quad \ge 2\lambda _{\min } \left( \left[ \gamma _k A(x) - 2\gamma _k^2 A^2(x) \right] \right. \\&\quad \left. + \frac{1}{2}\left[ \varepsilon \gamma _k R_n(x) D_n - 4 \ell ^2 \left( \gamma _k R_n(x) D_n \right) ^2 \right] \right) . \end{aligned}$$

Now, notice that A(x) and \(R_n(x) D_n\) are both diagonal matrices with non-negative entries,

$$\begin{aligned}&\gamma _k A(x) - 2\gamma _k^2 A^2(x) = \gamma _k A(x) \left( I - 2\gamma _k A(x) \right) \nonumber \\&\quad \varepsilon \gamma _k R_n(x) D_n - 4 \ell ^2 \left( \gamma _k R_n(x) D_n \right) ^2\nonumber \\&\quad = \gamma _k R_n(x) D_n \left( \varepsilon - 4 \ell ^2 \gamma _k R_n(x) D_n \right) . \end{aligned}$$
(40)

As long as the entries of A(x) and \(R_n(x) D_n\) are bounded above by a constant for all n, there exists a sufficiently small constant c such that for all \(\gamma _k < c\), both expressions in Eq. (40) are non-negative. Given that for all n and all \(x \in \mathscr {X}\),

$$\begin{aligned}&A(x) = {\text {diag}}\left( 0, \cdots , \frac{(n L_{ii})^{-1}}{1 + (n L_{ii})^{-1}} \frac{1}{L_{ii}^{\star }} , \cdots , 0\right) \\&\preceq \frac{1}{\min _{i \in [d]} L_{ii}^\star } I\\&R_n(x) D_n \preceq I, \end{aligned}$$

we obtain the boundedness of the entries of A(x), \(R_n(x) D_n\). Therefore, we conclude that

$$\begin{aligned} \forall x\in \mathscr {X}, \quad 0 \le 2\gamma _k\varepsilon (x) - 2\gamma _k^2\ell ^2(x). \end{aligned}$$

It remains to show the second inequality of Eq. (39), i.e.

$$\begin{aligned} \sup _{x\in \mathscr {X}: J(x)\le r^2} 2\gamma _k\varepsilon (x) - 2\gamma _k^2\ell ^2(x) \le 1. \end{aligned}$$

This is true if

$$\begin{aligned} \sup _{x\in \mathscr {X}: J(x)\le r^2} \varepsilon (x) \le \gamma _k^{-1}. \end{aligned}$$

Since \(\gamma _k \rightarrow 0\) as \(k \rightarrow \infty \), the above holds if \(\sup _{x\in \mathscr {X}: J(x)\le r^2} \varepsilon (x)\) is bounded above by a constant independent of n. We now derive such an upper bound. Expanding \(\varepsilon (x)\),

$$\begin{aligned} \varepsilon (x)&= J(x)^{-1}(x-x^\star )^TV^TV R_n(x)\left( \nabla \varPhi _n(x)-\nabla \varPhi _n(x^\star )\right) \\&= \frac{v(x)^T(A(x)+B(x))v(x)}{\Vert v(x)\Vert ^2}\\&\le \lambda _{\max }(A(x)+B(x))\\&= \lambda _{\max }\left( R_n(x)^{1/2}\left( \int _0^1 \nabla ^2 \varPhi _n((1-t)x^\star + tx)\mathrm {d}t \right) R_n(x)^{1/2}\right) . \end{aligned}$$

Split \(\varPhi _n\) into the regularization \(I_n(x)\) and the expectation \(F_n(x)\). For the expectation part, since \(\nabla ^2 F_n(x) \preceq \ell D_n\) by Corollary 14 and the entries of \(R_n(x) D_n\) are bounded by 1, we have

$$\begin{aligned} R_n(x)^{1/2} \nabla ^2 F_n(x) R_n(x)^{1/2} \preceq \ell I, \end{aligned}$$

and for the regularization part, note that \(\nabla ^2 I_n\) is a diagonal matrix with entries 0 for the \(\mu \) components and the off-diagonal L components, and \( L_{ii}^{-2}/n\) for the diagonal L components, so

$$\begin{aligned} \begin{aligned}&R_n(x)^{1/2} \left( \int _0^1 \nabla ^2 I_n((1 - t) x^\star +t x) \mathrm {d}t\right) R_n(x)^{1/2}\\&\quad = {\text {diag}}\left( 0, \dots , \frac{(n L_{ii})^{-1}}{1 + (n L_{ii})^{-1}} \frac{1}{L_{ii}^{\star }} , \dots , 0\right) \\&\quad \preceq \frac{1}{\min _{i \in [d]} L_{ii}^\star } I \end{aligned} \end{aligned}$$
(41)

Since \(L_{ii}^\star > 0\) for all \(i\in [d]\), Eq. (41) is bounded above by a constant C. Using Weyl’s inequality to bound the maximal eigenvalue of the sum of two Hermitian matrices, we conclude that

$$\begin{aligned} \sup _{x \in \mathscr {X}: J(x) \le r^2} \varepsilon (x) \le \ell + C. \end{aligned}$$

Therefore, we have completed the proof for statement i., Eq. (39).

Next, we show statement ii. by deriving an upper bound on \(\sigma ^2(r)\). Recall that \(\sigma (r)\) is an upper bound on the fourth root of the fourth moment of

$$\begin{aligned} \left\| V R_n(x)\left( \nabla \phi _n(x, Z) - \nabla \varPhi _n(x) \right) \right\| . \end{aligned}$$

Since the regularizer cancels in this expression, we only need to consider the expectation part. Note that \(VR_n(x)\) is a diagonal matrix with positive diagonal entries,

$$\begin{aligned}&\mathbb {E}\left[ \left\| VR_n(x)\left( \nabla \tilde{f}_n(x, Z) - \nabla F_n(x)\right) \right\| ^4 \right] ^{1/4}\\&\quad \le \max _{i \in [d(d+1)]}(VR_n(x))_{ii} \mathbb {E}\left[ \left\| \nabla \tilde{f}_n(x, Z) - \nabla F_n(x)\right\| ^4\right] ^{1/4}. \end{aligned}$$

Let \(Z_1, Z_2 \) be independent copies of Z; by the tower property of conditional expectation,

$$\begin{aligned}&= \max _{i \in [d(d+1)]}(VR_n(x))_{ii} \\&\quad \times \mathbb {E}\left[ \left\| \mathbb {E}\left[ \nabla \tilde{f}_n(x, Z_1) - \nabla \tilde{f}_n(x, Z_2) | Z_1\right] \right\| ^4 \right] ^{1/4}. \end{aligned}$$

By the convexity of \(\Vert \cdot \Vert ^4\) and Jensen’s inequality,

$$\begin{aligned}&\le \max _{i \in [d(d+1)]}(VR_n(x))_{ii}\\&\quad \times \mathbb {E}\left[ \mathbb {E}\left[ \left\| \nabla \tilde{f}_n(x, Z_1) - \nabla \tilde{f}_n(x, Z_2) \right\| ^4 | Z_1\right] \right] ^{1/4}\\&= \max _{i \in [d(d+1)]}(VR_n(x))_{ii}\mathbb {E}\left[ \left\| \nabla \tilde{f}_n(x, Z_1) - \nabla \tilde{f}_n(x, Z_2) \right\| ^4 \right] ^{1/4}. \end{aligned}$$

By \(\left\| \nabla \tilde{f}_n(x, Z_1) \!- \!\nabla \tilde{f}_n(x, Z_2) \right\| \! \le \! \left\| \nabla \tilde{f}_n(x, Z_1)\right\| \!+\!\left\| \nabla \tilde{f}_n \right. \left. (x, Z_2) \right\| \) and Minkowski’s inequality,

$$\begin{aligned}&\le \max _{i \in [d(d+1)]}(VR_n(x))_{ii} \left\{ \mathbb {E}\left[ \Vert \nabla \tilde{f}_n(x, Z_1)\Vert ^4 \right] ^{1/4} \right. \\&\left. + \mathbb {E}\left[ \Vert \nabla \tilde{f}_n(x, Z_1)\Vert ^4 \right] ^{1/4} \right\} \\&= 2 \max _{i \in [d(d+1)]}(VR_n(x))_{ii} \mathbb {E}\left[ \Vert \nabla \tilde{f}_n(x, Z)\Vert ^4 \right] ^{1/4}. \end{aligned}$$

Now, we focus on bounding \(\mathbb {E}\left[ \Vert \nabla \tilde{f}_n(x, Z)\Vert ^4 \right] ^{1/4}\). We examine \(\Vert \nabla \tilde{f}_n(x, Z)\Vert \),

$$\begin{aligned} \nabla \tilde{f}_n(x, Z) = \left( \begin{array}{c} \nabla f_n \left( \mu + \frac{1}{\sqrt{n}}L Z \right) \\ \frac{Z_1}{\sqrt{n}} \nabla f_n \left( \mu + \frac{1}{\sqrt{n}}L Z \right) \\ \vdots \\ \frac{Z_d}{\sqrt{n}} \nabla f_n \left( \mu + \frac{1}{\sqrt{n}}L Z \right) \end{array} \right) \quad \in \mathbb {R}^{d(d+1)}, \end{aligned}$$

yielding

$$\begin{aligned} \Vert \nabla \tilde{f}_n(x, Z)\Vert ^4 = \left\| \nabla f_n \left( \mu + \frac{1}{\sqrt{n}}L Z \right) \right\| ^4 \left( 1 + \frac{Z^T Z}{n}\right) ^2. \end{aligned}$$

By the Cauchy–Schwarz inequality,

$$\begin{aligned} \mathbb {E}\left[ \Vert \nabla \tilde{f}_n(x, Z)\Vert ^4\right] ^{1/4}&\!\le \mathbb {E}\!\left[ \left\| \nabla f_n \left( \mu + \frac{1}{\sqrt{n}}L Z \right) \right\| ^8 \right] ^{1/8}\\&\!\mathbb {E}\!\left[ \left( 1 + \frac{Z^T Z}{n}\right) ^4\right] ^{1/8}. \end{aligned}$$

We then bound these two terms on the RHS separately. We use the sub-Gaussian property of \(\left\| \nabla f_n \left( \mu + \frac{1}{\sqrt{n}}L Z \right) \right\| \) to bound its \(8{\text {th}}\) moment. First, notice that \(\left\| \nabla f_n \left( \mu + \frac{1}{\sqrt{n}}L Z \right) \right\| \) is a \(\max _{i \in [d]} L_{ii} \frac{\ell }{\sqrt{n}}\)-Lipschitz function of Z:

$$\begin{aligned}&\left| \left\| \nabla f_n \left( \mu + \frac{1}{\sqrt{n}}L Z_1 \right) \right\| - \left\| \nabla f_n \left( \mu + \frac{1}{\sqrt{n}}L Z_2 \right) \right\| \right| \\&\quad \le \left\| \nabla f_n \left( \mu + \frac{1}{\sqrt{n}}L Z_1 \right) - \nabla f_n \left( \mu + \frac{1}{\sqrt{n}}L Z_2 \right) \right\| \\&\quad = \left\| \int _0^1 \nabla ^2 f_n\left( \mu + (1-t)L Z_2/ \sqrt{n} + t Z_1/ \sqrt{n} \right) \mathrm {d}t \frac{L}{\sqrt{n}}\right. \\&\qquad \times \left. (Z_1 - Z_2) \right\| . \end{aligned}$$

Given that \(\nabla ^2 f_n \preceq \ell I\), the above is bounded by

$$\begin{aligned} \frac{\ell }{\sqrt{n}} \sqrt{(Z_1 - Z_2 )^T L^T L (Z_1 - Z_2 )}\le & {} \frac{\ell }{\sqrt{n}} \lambda _{\max }(L^T L) \Vert Z_1- Z_2 \Vert \\= & {} \frac{\ell }{\sqrt{n}} \max _{i \in [d]} L_{ii}^2 \Vert Z_1- Z_2 \Vert . \end{aligned}$$

A Lipschitz function of Gaussian noise is sub-Gaussian (Kontorovich 2014, Theorem 1): if \(Z \sim \mathscr {N}(0, I_d)\) and \(\psi : \mathbb {R}^d \rightarrow \mathbb {R}\) is L-Lipschitz, then

$$\begin{aligned} \mathbb {P}\left( |\psi (Z) - \mathbb {E}[\psi (Z)] |> \varepsilon \right) \le 2\exp \left( -\frac{\varepsilon ^2}{4L^2} \right) . \end{aligned}$$

Thus, \(\left\| \nabla f_n \left( \mu + \frac{1}{\sqrt{n}}L Z \right) \right\| \) is \(\frac{4 \ell ^2}{n} \max _{i \in [d]} L_{ii}^2\)-sub-Gaussian. Then note that for a \(\sigma ^2\)-sub-Gaussian random variable \(X \in \mathbb {R}\), for any positive integer \(k \ge 2\), \(\mathbb {E}\left[ |X|^k\right] ^{1/k} \le \sigma e^{1/e} \sqrt{k}\). Hence we obtain

$$\begin{aligned} \mathbb {E}\left[ \left\| \nabla f_n \left( \mu + \frac{1}{\sqrt{n}}L Z \right) \right\| ^8\right] ^{1/8} \le \frac{2\ell }{\sqrt{n}} e^{1/e} \sqrt{8} \max _{i \in [d]} L_{ii}. \end{aligned}$$

Along with the fact that a Gaussian random variable has finite moments of every order, so that

$$\begin{aligned} \mathbb {E}\left[ \left( 1 +\frac{Z^TZ}{n}\right) ^4 \right] \le C, \end{aligned}$$

for some constant C, we obtain

$$\begin{aligned} \mathbb {E}\left[ \Vert \nabla _x f_n\Vert ^4\right] ^{1/4} \le \frac{2 C^{1/4} \ell }{\sqrt{n}} e^{1/e} \sqrt{8} \max _{i \in [d]} L_{ii}, \end{aligned}$$

and hence

$$\begin{aligned}&\mathbb {E}\left[ \left\| VR_n(x)\left( \nabla _x f_n - \nabla _x F_n\right) \right\| ^4 \right] ^{1/4} \\&\quad \le \max _{i \in [d(d+1)]}(VR_n(x))_{ii} \frac{2 C^{1/4} \ell }{\sqrt{n}} e^{1/e} \sqrt{8} \max _{i \in [d]} L_{ii} . \end{aligned}$$

Taking the supremum over \(\{x : J(x) \le r^2\}\), the RHS is bounded above by a universal constant multiple of \(n^{-1/2}\) (since both \(\max _{i}(VR_n(x))_{ii}\) and \(\max _{i \in [d]} L_{ii}\) are bounded on this set); we therefore conclude that

$$\begin{aligned}&\sigma ^2(r) = \\&\sup _{x\in \mathscr {X}: J(x)\le r^2} \mathbb {E}\left[ \left\| VR_n(x)\left( \nabla \phi _n(x, Z) - \nabla \varPhi _n(x) \right) \right\| ^4 \right] ^{1/4} \\&\qquad \rightarrow 0, \; n \rightarrow \infty . \end{aligned}$$

Therefore, with \(\forall x\in \mathscr {X}: J(x)\le r^2, 0 \le 2\gamma _k\varepsilon (x) - 2\gamma _k^2\ell ^2(x) \le 1\) and \(\sigma ^2(r) \rightarrow 0\) as \(n \rightarrow \infty \), applying Theorem 17 yields the confinement result, i.e.

$$\begin{aligned} \mathbb {P}\left( \sup _{k\in {\mathbb {N}}} J(x_k) \le r^2\right) \rightarrow 1 \end{aligned}$$

in \(P_{ \theta _0}\) as \(n \rightarrow \infty \).
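As an aside, the norm identity for \(\nabla \tilde{f}_n\) used in the derivation above can be checked numerically; the quadratic choice of \(f_n\) in the sketch below is an assumption made only for this illustration.

```python
import numpy as np

# Toy check of ||grad f~_n(x, Z)||^2 = ||grad f_n(mu + L Z / sqrt(n))||^2 * (1 + Z^T Z / n).
rng = np.random.default_rng(1)
d, n = 3, 50
A = np.diag([2.0, 1.0, 0.5])                         # toy f_n(theta) = 0.5 * theta^T A theta
grad_f = lambda th: A @ th
mu = rng.normal(size=d)
L = np.tril(rng.normal(size=(d, d)))
L[np.arange(d), np.arange(d)] = np.abs(np.diag(L))   # non-negative diagonal
Z = rng.normal(size=d)

g = grad_f(mu + L @ Z / np.sqrt(n))
grad_mu = g                                          # gradient block w.r.t. mu
grad_L = np.outer(g, Z) / np.sqrt(n)                 # gradient block w.r.t. L
lhs = np.sum(grad_mu ** 2) + np.sum(grad_L ** 2)
rhs = np.sum(g ** 2) * (1.0 + Z @ Z / n)
print(lhs, rhs)                                      # equal up to floating-point error
```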

Lastly, by statement 3 of Theorem 17, we prove the convergence result by checking

$$\begin{aligned} \inf _{x\in \mathscr {X}, J(x) \le r^2}\varepsilon (x) > 0. \end{aligned}$$

We expand the expression in a similar way,

$$\begin{aligned} \varepsilon (x)&= J(x)^{-1}(x-x^\star )^TV^TV R_n(x)\left( \nabla \varPhi _n(x)-\nabla \varPhi _n(x^\star )\right) \\&= \frac{v(x)^T(A(x)+B(x))v(x)}{\Vert v(x)\Vert ^2}\\&\ge \lambda _{\min }(A(x)+B(x))\\&= \lambda _{\min }\left( R_n(x)^{1/2}\left( \int _0^1 \nabla ^2 \varPhi _n((1-t)x^\star + tx)\mathrm {d}t \right) R_n(x)^{1/2}\right) . \end{aligned}$$

By splitting \(\varPhi _n\) into the regularization and the expectation, we have

$$\begin{aligned} \begin{aligned}&R_n^{1/2}(x) \left( \int _0^1 \nabla ^2 I_n((1 - t) x^\star +t x) \mathrm {d}t\right) R_n(x)^{1/2}\\&\quad = {\text {diag}}\left( 0, \cdots , \frac{(n L_{ii})^{-1}}{1 + (n L_{ii})^{-1}} \frac{1}{L_{ii}^{\star }} , \cdots , 0\right) \\&\quad \succeq 0 I, \end{aligned} \end{aligned}$$
(42)

and

$$\begin{aligned} \begin{aligned}&R_n(x)^{1/2}\left( \int _0^1 \nabla ^2 F_n((1-t)x^\star + tx)\mathrm {d}t \right) R_n(x)^{1/2} \\&\quad \ge R_n(x)^{1/2} \frac{ D_n \varepsilon }{2} R_n(x)^{1/2} \\&\quad \succeq \varepsilon /2n >0 . \end{aligned} \end{aligned}$$
(43)

We then combine Eqs. (42) and (43) and use Weyl’s inequality to bound the minimal eigenvalue of the sum of two Hermitian matrices, yielding

$$\begin{aligned} \inf _{x\in \mathscr {X}, J(x) \le r^2}\varepsilon (x)> \varepsilon /n >0. \end{aligned}$$

This gives the convergence result.

The proof is then completed by applying Theorem 17. We know that \(\xi _k\) is strictly positive. Since \(\inf _{x\in \mathscr {X}: J(x) \le r^2}\varepsilon (x) > 0\) and \(\ell ^2(x) \) is bounded above on this set, there exists a step-size sequence \(\gamma _k = \varTheta (k^{-\rho })\), \(\rho \in (0.5, 1) \), satisfying the conditions of the theorem. Since \(\sigma (r) \rightarrow 0\), statement 3 of Theorem 17 becomes

$$\begin{aligned} \mathbb {P}\left( \Vert V(x_k - x^\star )\Vert ^2 = O_{P_n}(k^{-\rho '}) \right) {\mathop {\rightarrow }\limits ^{P_{\theta _0}}} 1, \quad \rho ' \in (0, \rho -0.5), \quad \text {as } n\rightarrow \infty . \end{aligned}$$

Even though V is a function of n, n is fixed while Algorithm 3 runs. Since V is invertible,

$$\begin{aligned} \mathbb {P}\left( \Vert x_k - x^\star \Vert ^2 = O_{P_n}(k^{-\rho '}) \right) {\mathop {\rightarrow }\limits ^{P_{\theta _0}}} 1, \quad \rho ' \in (0, \rho -0.5), \quad \text {as } n\rightarrow \infty , \end{aligned}$$

which is exactly our desired result: as the number of data points \(n\rightarrow \infty \), the probability that Algorithm 3 finds the optimum at a rate of \(k^{-\rho '}\) (as the number of iterations \(k\rightarrow \infty \)) converges to 1. In other words, the variational optimization problem is solved asymptotically at a rate of \(k^{-\rho '}\). \(\square \)

Theorem 17

Let \(\mathscr {X}\subseteq \mathbb {R}^p\) be closed and convex, \(g : \mathscr {X}\times \mathscr {Z}\rightarrow \mathbb {R}^p\) be a function, \(G(x) :=\mathbb {E}\left[ g(x,Z)\right] \) for a random element \(Z\in \mathscr {Z}\), \(x^\star \in \mathscr {X}\) be a point in \(\mathscr {X}\) such that \(G(x^\star ) = 0\), \(V \in \mathbb {R}^{p\times p}\) be invertible, \(J(x) :=\Vert V(x - x^\star )\Vert ^2\), and \(r \ge 0\). Consider the projected stochastic iteration

$$\begin{aligned} x_0 \in \mathscr {X}, \quad x_{k+1} = \varPi _\mathscr {X}\left( x_k - \gamma _k g(x_k, Z_k)\right) , \quad k \in {\mathbb {N}}\cup \{0\}, \end{aligned}$$

with independent copies \(Z_k\) of Z, \(\gamma _k \ge 0\), and \(\varPi _\mathscr {X}(x) :=\mathop {\mathrm{arg}\,\mathrm{min}}\nolimits _{y\in \mathscr {X}} \Vert V(x - y)\Vert ^2\). If

  1. 1.

    For all \(k\in {\mathbb {N}}\cup \{0\}\), the step sizes satisfy

    $$\begin{aligned} \forall x&\in \mathscr {X}: J(x)\le r^2, \quad 0 \le 2\gamma _k\varepsilon (x) - 2\gamma _k^2\ell ^2(x) \le 1\nonumber \\ \varepsilon (x)&:=\frac{1}{J(x)}(x-x^\star )^TV^TV\left( G(x)- G(x^\star )\right) \nonumber \\ \ell ^2(x)&:=\frac{1}{J(x)}\left\| V\left( G(x)-G(x^\star )\right) \right\| ^2, \end{aligned}$$
    (44)
  2. 2.

    For all \(x\in \mathscr {X}\), \(\left( \mathbb {E}\Vert V(g(x,Z) - G(x))\Vert ^4\right) ^{1/4} \le \tilde{\sigma }(x)\) for \(\tilde{\sigma }: \mathscr {X}\rightarrow \mathbb {R}_{\ge 0}\), and

    $$\begin{aligned} \sigma (r)&:=\sup _{x\in \mathscr {X}\,:\, J(x) \le r^2} \tilde{\sigma }(x), \end{aligned}$$
    (45)

then

  1. 1.

    The iterate \(x_k\) is locally confined with high probability:

    $$\begin{aligned} \mathbb {P}\left( J(x_k) \le r^2\right)&\,\ge \frac{\xi ^2_k}{\xi _k^2 + 8\sigma (r)^2\zeta _k}\\ \xi _{k}(r)&:=\max \{0, r^2 - J(x_0) - 2\sigma ^2(r)\sum _{j<k}\gamma ^2_j\}\\ \zeta _{k}(r)&:=r^2\sum _{j<k}\gamma _j^2 + \sigma ^2(r)\sum _{j<k}\gamma _j^4. \end{aligned}$$
  2. 2.

    The iterate \(x_k\) stays locally confined for all \(k\in {\mathbb {N}}\) with high probability:

    $$\begin{aligned} \mathbb {P}\left( \sup _{k\in {\mathbb {N}}} J(x_k) \le r^2\right)&\,\ge \frac{\xi ^2}{\xi ^2 + 8\sigma ^2(r)\zeta }\\ \xi (r)&:=\lim _{k\rightarrow \infty }\xi _{k}(r) \quad \zeta (r) :=\lim _{k\rightarrow \infty } \zeta _{k}(r). \end{aligned}$$
  3. 3.

    If additionally

    $$\begin{aligned} \inf _{x\in \mathscr {X}: J(x) \le r^2} \varepsilon (x) > 0 \quad \text {and}\quad \gamma _k = \varTheta (k^{-\rho }), \,\, \rho \in (0.5, 1], \end{aligned}$$

    the iterate \(x_k\) converges to \(x^\star \) with high probability:

    $$\begin{aligned} \lim _{k \rightarrow \infty } \mathbb {P}\left( J(x_k) \le k^{-\rho '} \right)&\ge \mathbb {P}\left( \sup _{k\in {\mathbb {N}}} J(x_k) \le r^2\right) , \\&\quad \forall \rho ' \in (0, \rho - 0.5). \end{aligned}$$

Proof

To begin, we show \(\varPi _\mathscr {X}\) is non-expansive,

$$\begin{aligned} \Vert V(\varPi _\mathscr {X}(x) - \varPi _\mathscr {X}(y))\Vert ^2&\le \Vert V(x - y)\Vert ^2. \end{aligned}$$

For all \(x, y \in \mathbb {R}^p\), define \(\langle x, y\rangle _V = x^T V^TV y\). Since V is invertible, \(V^TV\) is symmetric and positive definite, and hence \((\mathbb {R}^p, \langle \cdot , \cdot \rangle _V)\) forms a Hilbert space. The projection onto a nonempty closed convex subset of a Hilbert space is non-expansive (Bauschke and Combettes 2011, Proposition 4.4).

Since \(x^\star = \varPi _{\mathscr {X}}(x^\star )\) and the projection is non-expansive, expanding the squared norm yields

$$\begin{aligned} \Vert V(x_{k+1}-x^\star )\Vert ^2&\le \Vert V(x_k - x^\star )\Vert ^2 \\&\quad - 2\gamma _k (x_k-x^\star )^TV^TVg(x_k, Z_k) \\&\quad + \gamma _k^2\left\| Vg(x_k, Z_k)\right\| ^2. \end{aligned}$$

Adding and subtracting \(G(x_k)\) in the second and third terms, using the elementary bound \(\left\| a+b\right\| ^2 \le 2\Vert a\Vert ^2 + 2\Vert b\Vert ^2\), and defining

$$\begin{aligned} \beta _k(x)&:=-2\gamma _k (x-x^\star )^TV^TV(g(x, Z_k) - G(x)) \\&\quad + 2\gamma _k^2\left\| V(g(x, Z_k) - G(x))\right\| ^2 \\&\quad - 2\gamma _k^2\mathbb {E}\left[ \left\| V(g(x, Z_k) - G(x))\right\| ^2\right] \\ \varepsilon (x)&:=\frac{1}{J(x)}(x-x^\star )^TV^TV\left( G(x)-G(x^\star )\right) \\ \ell ^2(x)&:=\frac{1}{J(x)}\left\| V\left( G(x)-G(x^\star )\right) \right\| _2^2, \end{aligned}$$

we have that

$$\begin{aligned} J(x_{k+1})&\le J(x_k)\left( 1 - 2\gamma _k \varepsilon (x_k) + 2\gamma _k^2\ell ^2(x_k)\right) \\&\quad +\beta _k(x_k) + 2\gamma _k^2 \tilde{\sigma }^2(x_k). \end{aligned}$$

We now define the filtration of \(\sigma \)-algebras

$$\begin{aligned} \mathscr {F}_k = \sigma (x_1, \dots , x_k, Z_1, \dots , Z_{k-1}), \end{aligned}$$

and the stopped process for \(r > 0\),

$$\begin{aligned} Y_{0}&= J(x_0)\\ Y_{k+1}&= \left\{ \begin{array}{ll} Y_{k} &{}\quad Y_{k} > r^2\\ J(x_{k+1}) &{}\quad \text {o.w.} \end{array}\right. \end{aligned}$$

Note that \(Y_k\) is \(\mathscr {F}_k\)-measurable, and that \(Y_{k}\) “freezes in place” if \(J(x_k)\) ever jumps larger than \(r^2\); so for all \(t^2 \le r^2\),

$$\begin{aligned}&\mathbb {P}\left( J(x_k)> t^2\right) \\&\quad = \mathbb {P}\left( J(x_k)> t^2 , Y_{k-1}> r^2 \right) + \mathbb {P}\left( J(x_k)> t^2 , Y_{k-1} \le r^2 \right) \\&\quad = \mathbb {P}\left( J(x_k)> t^2 , Y_k> r^2, Y_{k-1}> r^2 \right) \! +\! \mathbb {P}\left( Y_k> t^2, Y_{k-1} \le r^2 \right) \\&\quad \le \mathbb {P}\left( Y_k> r^2, Y_{k-1}> r^2 \right) + \mathbb {P}\left( Y_k> t^2, Y_{k-1} \le r^2 \right) \\&\quad \le \mathbb {P}\left( Y_k> t^2, Y_{k-1}> r^2 \right) + \mathbb {P}\left( Y_k> t^2, Y_{k-1} \le r^2 \right) \\&\quad = \mathbb {P}\left( Y_k > t^2 \right) . \end{aligned}$$

Therefore, if we obtain a tail bound on \(Y_k\), it provides the same bound on \(J(x_k)\). Now substituting the stopped process into the original recursion and collecting terms,

$$\begin{aligned}&Y_{k+1} \\&\quad \le Y_{k}\left( 1\!+ \!\mathbb {1}\!\left[ Y_k \!\le \! r^2\right] \!(- 2\gamma _k\varepsilon (x_k) \!+\! \gamma _k^2\ell ^2(x_k))\right) \\&\qquad + \!\mathbb {1}\!\left[ Y_k \!\le \! r^2\right] \!\left( \beta _k(x_k) \!+ \!2\gamma _k^2 \tilde{\sigma }^2(x_k)\right) \\&\quad \le Y_{k}\left( 1\!+ \!\mathbb {1}\!\left[ Y_k \!\le \! r^2\right] \!(- 2\gamma _k\varepsilon (x_k) \!+\! \gamma _k^2\ell ^2(x_k))\right) \\&\qquad + \!\mathbb {1}\!\left[ Y_k \!\le \! r^2\right] \!\beta _k(x_k) \!+\! 2\gamma _k^2 \sigma ^2(r). \end{aligned}$$

Using the notation of Lemma 18, set

$$\begin{aligned} \alpha _{k}&= 1 + \mathbb {1}\left[ Y_k \le r^2\right] (-2\gamma _k\varepsilon (x_k) + 2\gamma _k^2\ell ^2(x_k))\\ \bar{\varvec{\alpha }}_{k}&= 1\\ \beta _{k}&= \mathbb {1}\left[ Y_k \le r^2\right] \beta _k(x_k)\\ c_{k}&= 2\gamma _k^2 \sigma ^2(r). \end{aligned}$$

By the fourth moment assumption, \(\beta _k\) has variance bounded above by \(\tau _k^2\) conditioned on \(\mathscr {F}_k\), where

$$\begin{aligned} \tau _{k}^2&= 8\gamma _k^2\mathbb {1}\left[ Y_k\le r^2\right] \Vert V(x_k-x^\star )\Vert ^2\tilde{\sigma }(x_k)^2 \\&\quad + 8\gamma _k^4\mathbb {1}\left[ Y_k\le r^2\right] \tilde{\sigma }^4(x_k) \\&\le 8\gamma _k^2r^2\sigma ^2(r) + 8\gamma _k^4\sigma ^4(r). \end{aligned}$$

Therefore, using the descent Lemma 18,

$$\begin{aligned} \mathbb {P}\left( Y_{k} > r^2\right)&\le \frac{\zeta _k}{\max \{r^2-\xi _k,0\}^2 + \zeta _k}\\ \xi _{k}&= J(x_0) + 2\sigma ^2(r)\sum _{j<k}\gamma ^2_j \\ \zeta _{k}&= 8\sigma ^2(r)\left( r^2\sum _{j<k}\gamma _j^2 + \sigma ^2(r)\sum _{j<k}\gamma _j^4\right) . \end{aligned}$$

yielding the first result. Now, since \(Y_{k+1} \le r^2 \implies Y_{k} \le r^2\) for all \(k\ge 0\), the sequence of events \(\left\{ Y_k \le r^2\right\} \) is decreasing. Therefore, the second result follows from

$$\begin{aligned} \mathbb {P}\left( \bigcap _{k=0}^\infty \left\{ Y_k \le r^2\right\} \right)&= \lim _{k\rightarrow \infty }\mathbb {P}\left( Y_k \le r^2\right) \\&\ge \lim _{k\rightarrow \infty } 1-\frac{\zeta _k}{\max \{r^2-\xi _k,0\}^2 + \zeta _k}\\&= \frac{\max \{r^2-\xi ,0\}^2}{\max \{r^2-\xi ,0\}^2 + \zeta }, \end{aligned}$$

where \(\xi :=\lim _{k\rightarrow \infty }\xi _{k}\) and \(\zeta :=\lim _{k\rightarrow \infty }\zeta _{k}\). Finally, we analyse the conditional tail distribution of \(Y_k\) given that it stays confined, i.e. \(\forall k\ge 0\), \(Y_k\le r^2\). In the notation of Lemma 18, redefine

$$\begin{aligned} 0 \le \bar{\varvec{\alpha }}_{k} :=\sup _{x \in \mathscr {X}: J(x)\le r^2} 1 - 2\gamma _k\varepsilon (x) + 2\gamma _k^2\ell ^2(x) \le 1, \end{aligned}$$

i.e. \(\bar{\varvec{\alpha }}_{k}\) is the largest possible value of \(\alpha _{k}\) when \(Y_k \le r^2\). So again applying Lemma 18,

$$\begin{aligned}&\mathbb {P}\left( Y_{k}> t_k\,|\,\forall k \,\,Y_k \le r^2 \right) \nonumber \\&\quad = \frac{\mathbb {P}\left( Y_{k}>t_k, \forall k\,\, Y_k \le r^2 \right) }{\mathbb {P}\left( \forall k\,\, Y_k \le r^2 \right) }\nonumber \\&\quad \le \frac{\mathbb {P}\left( Y_{k} >t_k, \forall k\,\, Y_k \le r^2 \right) }{ \frac{\max \{r^2-\xi ,0\}^2}{\max \{r^2-\xi ,0\}^2 + \zeta }}\nonumber \\&\quad \le \frac{\left( \frac{\zeta '_k}{\max \{t_k-\xi '_k,0\}^2 + \zeta '_k}\right) }{\left( \frac{\max \{r^2-\xi ,0\}^2}{\max \{r^2-\xi ,0\}^2 + \zeta }\right) }\nonumber \\&\,\, \xi '_{k} = J(x_0)\prod _{i=0}^{k-1}\bar{\varvec{\alpha }}_i + 2\sigma ^2(r)\sum _{i=0}^{k-1}\gamma ^2_i\prod _{j=i+1}^{k-1}\bar{\varvec{\alpha }}_j\nonumber \\&\,\,\zeta '_{k} = 8\sigma ^2(r)\sum _{i=0}^{k-1}(r^2\gamma _i^2 + \sigma ^2(r)\gamma _i^4)\prod _{j=i+1}^{k-1}\bar{\varvec{\alpha }}_j^2. \end{aligned}$$
(46)

Here, \(t_k = \varTheta (k^{-\rho '}), \rho ' \in (0, \rho - 0.5)\) is a decreasing sequence, whose shrinking rate—such that Eq. (46) still converges to 0—will determine the convergence rate of \(Y_k\).

To understand the rate of Eq. (46), the key is to characterize the order of \(\prod _{j< k}\bar{\varvec{\alpha }}_{j}\) and \(\sum _{i = 0}^{k}\gamma _i^2 \prod _{j = i+1}^k \bar{\varvec{\alpha }}_j^2\). Since \(\gamma _k = \varTheta (k^{-\rho })\), \(\rho \in (0.5, 1]\), and \(\varepsilon ' :=\inf _{x\in \mathscr {X}: J(x)\le r^2} \varepsilon (x) > 0\), we know that

$$\begin{aligned} \prod _{j<k}\bar{\varvec{\alpha }}_{j}&= \prod _{j<k} \left\{ \sup _{x\in \mathscr {X}:J(x) \le r^2} (1 -2\gamma _j \varepsilon (x) + 2\gamma _j^2\ell ^2(x))\right\} \\&= \prod _{j<k} \left( 1 -2\gamma _j \inf _{x\in \mathscr {X}:J(x) \le r^2} \{ \varepsilon (x) - \gamma _j\ell ^2(x) \} \right) \\&\le \prod _{j< k} ( 1- 2 c \gamma _j)\\&= \exp \left( \sum _{j< k}\log (1- 2c \gamma _j )\right) \\&\le \exp \left( -2c\sum _{j<k}\gamma _j \right) \end{aligned}$$

for some \(c > 0\), yielding \(\prod _{j<k}\bar{\varvec{\alpha }}_{j}= \varTheta \left( \exp \left( -Ck^{1-\rho } \right) \right) \) for some \(C > 0\); an integral comparison supporting this order is sketched after Eq. (47). For the second term, since \(\bar{\varvec{\alpha }}_j\in (0,1)\),

$$\begin{aligned} \sum _{i = 0}^{k}\gamma _i^2 \prod _{j = i+1}^k \bar{\varvec{\alpha }}_j^2 \le \sum _{i = 0 }^{k} \gamma _i^2 = \varTheta (k^{1- 2\rho }). \end{aligned}$$
(47)
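
For reference, the order of \(\sum _{j<k}\gamma _j\) used in the bound on \(\prod _{j<k}\bar{\varvec{\alpha }}_{j}\) follows from a standard integral comparison; a brief sketch, under the simplifying assumption \(\gamma _j = \gamma _0 j^{-\rho }\) with \(\rho \in (0.5, 1)\):

$$\begin{aligned} \sum _{j=1}^{k-1} j^{-\rho } \asymp \int _1^{k} t^{-\rho }\, \mathrm {d}t = \frac{k^{1-\rho }-1}{1-\rho } = \varTheta \left( k^{1-\rho }\right) , \end{aligned}$$

so that \(\exp \left( -2c\sum _{j<k}\gamma _j\right) = \exp \left( -\varTheta \left( k^{1-\rho }\right) \right) \).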

By a similar argument, \(\sum _{i = 0}^{k}\gamma _i^2 \prod _{j = i+1}^k \bar{\varvec{\alpha }}_j = \varTheta (k^{1 - 2\rho })\) (see footnote 5). Therefore,

$$\begin{aligned} \xi '_k = \varTheta \left( k^{1- 2\rho } \right) , \quad \zeta '_k = \varTheta \left( k^{1- 2\rho } \right) . \end{aligned}$$

Combined with the fact that \(t_k = \varTheta (k^{- \rho '})\) with \(\rho ' \in (0, \rho - 0.5)\), so that \(\xi '_k = o(t_k)\) and \(\zeta '_k / \max \{t_k - \xi '_k, 0\}^2 = \varTheta (k^{1 - 2\rho + 2\rho '}) \rightarrow 0\), this implies that Eq. (46) is \(o(1)\), and hence for all \(\varepsilon > 0\) and \(\rho ' \in (0, \rho - 0.5)\),

$$\begin{aligned} \lim _{k \rightarrow \infty }\mathbb {P}\left( k^{\rho '} Y_k >\varepsilon \,|\,\forall k \,\,Y_k \le r^2 \right) = 0. \end{aligned}$$

Therefore, \(\forall \rho ' \in (0, \rho - 0.5)\),

$$\begin{aligned} \lim _{k \rightarrow \infty } \mathbb {P}\left( Y_k \le k^{-\rho '} \right)&\ge \lim _{k \rightarrow \infty } \mathbb {P}\left( Y_k \le k^{-\rho '} \,|\,\forall k \,\,Y_k\le r^2\right) \mathbb {P}\left( \forall k \,\,Y_k\le r^2\right) \\&= \mathbb {P}\left( \forall k \,\,Y_k\le r^2\right) , \end{aligned}$$

and the result follows. \(\square \)

Lemma 18

(Descent) Suppose we are given a filtration \(\mathscr {F}_k \subseteq \mathscr {F}_{k+1}\), \(k\ge 0\). Let

$$\begin{aligned} Y_{k+1} \le \alpha _k Y_k + \beta _k + c_k, \quad k\ge 0, \end{aligned}$$

where \(Y_k \ge 0\) and \(0 \le \alpha _k\le \bar{\varvec{\alpha }}_k\) are \(\mathscr {F}_k\)-measurable, \(\beta _k\) is \(\mathscr {F}_{k+1}\)-measurable with zero mean and variance bounded above by \(\tau _k^2\), both conditioned on \(\mathscr {F}_k\), and \(Y_0, \tau ^2_k, c_k, \bar{\varvec{\alpha }}_k \ge 0\) are \(\mathscr {F}_0\)-measurable. Then,

$$\begin{aligned} \mathbb {P}\left( Y_k \ge t\right)&\le \frac{\zeta _k}{\max \{t-\xi _k, 0\}^2 + \zeta _k}\\ \xi _k&= Y_0\prod _{i=0}^{k-1}\bar{\varvec{\alpha }}_i + \sum _{i=0}^{k-1}c_i\left( \prod _{j=i+1}^{k-1}\bar{\varvec{\alpha }}_j\right) \\ \zeta _k&= \sum _{i=0}^{k-1}\tau _i^2\left( \prod _{j=i+1}^{k-1}\bar{\varvec{\alpha }}^2_j\right) . \end{aligned}$$

Proof

Solving the recursion,

$$\begin{aligned} Y_k&\le \alpha _{k-1}Y_{k-1} + \beta _{k-1} + c_{k-1}\\&\le \alpha _{k-1}\left( \alpha _{k-2}Y_{k-2} +\beta _{k-2} + c_{k-2}\right) + \beta _{k-1} + c_{k-1}\\&\le \dots \\&\le Y_0 \prod _{i=0}^{k-1}\alpha _i + \sum _{i=0}^{k-1}\beta _i\prod _{j=i+1}^{k-1}\alpha _j + \sum _{i=0}^{k-1}c_i\prod _{j=i+1}^{k-1}\alpha _j\\&\le Y_0 \prod _{i=0}^{k-1}\bar{\varvec{\alpha }}_i + \sum _{i=0}^{k-1}\beta _i\prod _{j=i+1}^{k-1}\bar{\varvec{\alpha }}_j + \sum _{i=0}^{k-1}c_i\prod _{j=i+1}^{k-1}\bar{\varvec{\alpha }}_j. \end{aligned}$$

So

$$\begin{aligned} \mathbb {P}\left( Y_k \ge t\right) \le \mathbb {P}\left( \sum \limits _{i=0}^{k-1}\beta _i\prod \limits _{j=i+1}^{k-1}\bar{\varvec{\alpha }}_j \ge t - \xi _k\right) . \end{aligned}$$
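
Recall the form of Cantelli's inequality used in the next step: if \(\mathbb {E}[S] = 0\) and \(\mathrm {Var}(S) \le v\), then for any \(s \in \mathbb {R}\),

$$\begin{aligned} \mathbb {P}\left( S \ge s\right) \le \frac{v}{\max \{s, 0\}^2 + v}. \end{aligned}$$

Here \(S = \sum _{i=0}^{k-1}\beta _i\prod _{j=i+1}^{k-1}\bar{\varvec{\alpha }}_j\); since each \(\beta _i\) has zero conditional mean given \(\mathscr {F}_i\), the cross terms in \(\mathrm {Var}(S)\) vanish (conditionally on \(\mathscr {F}_0\)), so \(\mathrm {Var}(S) \le \sum _{i=0}^{k-1}\tau _i^2\prod _{j=i+1}^{k-1}\bar{\varvec{\alpha }}_j^2 = \zeta _k\).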

By Cantelli’s inequality and the fact that the \(i\text {th}\) term in the sum is \(\mathscr {F}_{i+1}\)-measurable,

$$\begin{aligned} \mathbb {P}\left( Y_k \ge t\right)&\le \frac{\sum _{i=0}^{k-1}\mathbb {E}\left[ \beta _i^2 \prod _{j=i+1}^{k-1} \bar{\varvec{\alpha }}_j^2\right] }{\max \{t-\xi _k,0\}^2 + \sum _{i=0}^{k-1}\mathbb {E}\left[ \beta _i^2 \prod _{j=i+1}^{k-1} \bar{\varvec{\alpha }}_j^2\right] }\\&\le \frac{\sum _{i=0}^{k-1}\tau _i^2 \prod _{j=i+1}^{k-1} \bar{\varvec{\alpha }}_j^2}{\max \{t-\xi _k,0\}^2 + \sum _{i=0}^{k-1}\tau _i^2 \prod _{j=i+1}^{k-1} \bar{\varvec{\alpha }}_j^2}\\&= \frac{\zeta _k}{\max \{t-\xi _k,0\}^2 + \zeta _k}. \end{aligned}$$

\(\square \)
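
As an illustrative numerical sanity check of Lemma 18 (not part of the proof; the step sizes, noise distribution, horizon, and threshold below are arbitrary choices), the following Python sketch simulates the recursion with \(\alpha _k = \bar{\varvec{\alpha }}_k\) and mean-zero uniform noise satisfying the lemma's assumptions, and compares the empirical tail probability with the stated bound.

```python
import numpy as np

# Illustrative sanity check of Lemma 18 (all constants are arbitrary choices).
# Simulate Y_{k+1} = alpha_bar_k * Y_k + beta_k + c_k with beta_k ~ Uniform(-c_k, c_k),
# which has zero conditional mean, conditional variance tau_k^2 = c_k^2 / 3, and
# keeps Y_k nonnegative. Then compare the empirical tail P(Y_K >= t) with the
# bound zeta_K / (max(t - xi_K, 0)^2 + zeta_K).

rng = np.random.default_rng(0)

K = 200              # number of iterations
n_trials = 100_000   # Monte Carlo replications
Y0 = 1.0
gamma = 0.5 * np.arange(1, K + 1, dtype=float) ** -0.75  # step sizes, rho = 0.75

alpha_bar = 1.0 - gamma     # contraction factors, all in (0, 1)
c = gamma ** 2              # deterministic additive terms c_k
tau2 = c ** 2 / 3.0         # conditional variance bounds tau_k^2

Y = np.full(n_trials, Y0)
for k in range(K):
    beta = rng.uniform(-c[k], c[k], size=n_trials)
    Y = alpha_bar[k] * Y + beta + c[k]

# Lemma 18 quantities: suffix_prod[i] = prod_{j=i+1}^{K-1} alpha_bar_j.
rev_cumprod = np.cumprod(alpha_bar[::-1])[::-1]   # index i -> prod_{j=i}^{K-1} alpha_bar_j
suffix_prod = np.append(rev_cumprod[1:], 1.0)
xi = Y0 * rev_cumprod[0] + np.sum(c * suffix_prod)
zeta = np.sum(tau2 * suffix_prod ** 2)

t = 0.05
empirical = np.mean(Y >= t)
bound = zeta / (max(t - xi, 0.0) ** 2 + zeta)
print(f"empirical P(Y_K >= t) = {empirical:.5f}  <=  Lemma 18 bound = {bound:.5f}")
```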
