Abstract
For big data analysis, the high computational cost of Bayesian methods often limits their application in practice. In recent years, there have been many attempts to improve the computational efficiency of Bayesian inference. Here we propose an efficient and scalable computational technique for a state-of-the-art Markov chain Monte Carlo method, namely, Hamiltonian Monte Carlo. The key idea is to explore and exploit the structure and regularity in the parameter space of the underlying probabilistic model to construct an effective approximation of its geometric properties. To this end, we build a surrogate function to approximate the target distribution using properly chosen random bases and an efficient optimization process. The resulting method provides a flexible, scalable, and efficient sampling algorithm that converges to the correct target distribution. We show that by choosing the basis functions and optimization process differently, our method can be related to other approaches for the construction of surrogate functions, such as generalized additive models or Gaussian process models. Experiments based on simulated and real data show that our approach leads to substantially more efficient sampling algorithms compared to existing state-of-the-art methods.
References
Ahn, S., Chen, Y., Welling, M.: Distributed and adaptive darting Monte Carlo through regenerations. In: Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) (2013)
Amari, S., Nagaoka, H.: Methods of Information Geometry. Translations of Mathematical Monographs, vol. 191. Oxford University Press, Oxford (2000)
Andrieu, C., Thoms, J.: A tutorial on adaptive MCMC. Stat. Comput. 18(4), 343–373 (2008)
Bache, K., Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml. Accessed Apr 2015
Betancourt, M.J.: The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In: Proceedings of the 32nd International Conference on Machine Learning (ICML'15) (2015)
Chen, T., Fox, E.B., Guestrin, C.: Stochastic gradient Hamiltonian Monte Carlo. In: Proceedings of the 31st International Conference on Machine Learning (ICML) (2014)
Conrad, P.R., Marzouk, Y.M., Pillai, N.S., Smith, A.: Asymptotically exact MCMC algorithms via local approximations of computationally intensive models. arXiv:1402.1694v3 (2014, preprint). Accessed Nov 2014
Dashti, M., Stuart, A.: Uncertainty quantification and weak approximation of an elliptic inverse problem. SIAM J. Numer. Anal. 49(6), 2524–2542 (2011)
Duane, S., Kennedy, A.D., Pendleton, B.J., Roweth, D.: Hybrid Monte Carlo. Phys. Lett. B 195, 216–222 (1987)
Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984)
Geyer, C.J.: Practical Markov chain Monte Carlo. Stat. Sci. 7, 473–483 (1992)
Girolami, M., Calderhead, B.: Riemann manifold Langevin and Hamiltonian Monte Carlo methods (with discussion). J. R. Stat. Soc. B 73, 123–214 (2011)
Greville, T.: Some applications of the pseudoinverse of a matrix. SIAM Rev. 2(1), 15–22 (1960)
Hoffman, M.D., Blei, D.M., Bach, F.R.: Online learning for latent Dirichlet allocation. In: Neural Information Processing Systems (NIPS), pp. 856–864 (2010)
Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and applications. Neurocomputing 70(1–3), 489–501 (2006a)
Huang, G.B., Chen, L., Siew, C.K.: Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Trans. Neural Netw. 17(4), 879–892 (2006b)
Hyvärinen, A.: Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 6, 695–709 (2005)
Kohonen, T.: Self-Organization and Associative Memory. Springer, Berlin (1988)
Kovanic, P.: On the pseudo inverse of a sum of symmetric matrices with applications to estimation. Kybernetika 15(5), 341–348 (1979)
Lan, S., Streets, J., Shahbaba, B.: Wormhole Hamiltonian Monte Carlo. In: Twenty-Eighth AAAI Conference on Artificial Intelligence (2014a)
Lan, S., Zhou, B., Shahbaba, B.: Spherical Hamiltonian Monte Carlo for constrained target distributions. In: Proceedings of the 31st International Conference on Machine Learning (ICML, 2014b)
Lin, C.J., Weng, R.C., Keerthi, S.S.: Trust region Newton method for large-scale logistic regression. J. Mach. Learn. Res. 9, 627–650 (2008)
Liu, J.S.: Monte Carlo Strategies in Scientific Computing. Springer, New York (2001)
Meeds, E., Welling, M.: GPS-ABC: Gaussian process surrogate approximate Bayesian computation. In: Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI) (2014)
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092 (1953)
Moro, S., Cortez, P., Rita, P.: A data-driven approach to predict the success of bank telemarketing. Decis. Support Syst. 62, 22–31 (2014)
Neal, R.M.: MCMC using Hamiltonian dynamics. In: Brooks, S., Gelman, A., Jones, G., Meng, X.L. (eds.) Handbook of Markov Chain Monte Carlo, pp. 113–162. Chapman and Hall/CRC, Boca Raton (2011)
Neal, R.: Regression and classification using Gaussian process priors. Bayesian Stat. 6, 475–501 (1999)
Neal, R.: Bayesian Learning in Neural Networks. Springer, New York (1996)
Quinonero-Candela, J., Rasmussen, C.E.: A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res. 6, 1939–1959 (2005)
Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Proceedings of the 21st Conference on Advances in Neural Information Processing Systems, NIPS (2007)
Rahimi, A., Recht, B.: Uniform approximation of functions with random bases. In: Proceedings of the 46th Annual Allerton Conference on Communication, Control, and Computing (2008)
Rasmussen, C.E.: Evaluation of Gaussian processes and other methods for non-linear regression. PhD Thesis, University of Toronto (1996)
Rasmussen, C.E.: Gaussian processes to speed up hybrid Monte Carlo for expensive Bayesian integrals. Bayesian Stat. 7, 651–659 (2003)
Rasmussen, C.E., Nickisch, H.: Gaussian processes for machine learning (GPML) toolbox. J. Mach. Learn. Res. 11, 3011–3015 (2010)
Roberts, G.O., Rosenthal, J.S.: Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. J. Appl. Probab. 44(2), 458–475 (2007)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. I, pp. 318–362. Bradford Books, Cambridge (1986)
Shahbaba, B., Lan, S., Johnson, W.O., Neal, R.M.: Split Hamiltonian Monte Carlo. Stat. Comput. 24, 339–349 (2014)
Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. In: Advances in Neural Information Processing Systems (NIPS), vol. 18. MIT Press, Cambridge (2006)
Strathmann, H., Sejdinovic, D., Livingstone, S., Szabo, Z., Gretton, A.: Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. In: Advances in Neural Information Processing Systems (2015)
van Schaik, A., Tapson, J.: Online and adaptive pseudoinverse solutions for ELM weights. Neurocomputing 149, 233–238 (2015)
Verlet, L.: Computer ‘experiments’ on classical fluids, I: thermodynamical properties of Lennard–Jones molecules. Phys. Rev. 159, 98–103 (1967)
Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 681–688 (2011)
Zhang, C., Shahbaba, B., Zhao, H.: Precomputing strategy for Hamiltonian Monte Carlo method based on regularity in parameter space. arXiv:1504.01418v1 (2015, preprint). Accessed Apr 2015
Acknowledgments
This work is supported by NSF grants DMS-1418422 and DMS-1622490.
Appendices
Appendix 1: Convergence to the correct target distribution
In order to prove that the equilibrium distribution remains the same, it suffices to show that the detailed balance condition still holds. Denote \(\theta = (p,\,q),\; \theta ^{\prime }={\tilde{\phi }}_t(\theta ).\) In the Metropolis–Hastings step, we use the original Hamiltonian to compute the acceptance probability
$$\alpha (\theta ,\,\theta ^{\prime }) = \min \left\{ 1,\, \exp \left( -H(\theta ^{\prime }) + H(\theta )\right) \right\} ;$$
therefore,
$$\exp (-H(\theta ))\, \alpha (\theta ,\,\theta ^{\prime })\, \mathrm{d}\theta = \min \left\{ \exp (-H(\theta )),\, \exp (-H(\theta ^{\prime }))\right\} \mathrm{d}\theta = \exp (-H(\theta ^{\prime }))\, \alpha (\theta ^{\prime },\,\theta )\, \mathrm{d}\theta ^{\prime },$$
since \( \left| \frac{\mathrm{d}\theta }{\mathrm{d}\theta ^{\prime }}\right| = 1\) due to the volume conservation property of the surrogate-induced Hamiltonian flow \({\tilde{\phi }}_t.\) Since the detailed balance condition is satisfied and the surrogate-induced Hamiltonian flow is reversible, the modified Markov chain converges to the correct target distribution.
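As a concrete illustration of this argument, the following sketch shows one such transition: the leapfrog trajectory is driven by the surrogate gradient, while the accept/reject test uses the original Hamiltonian. This is a minimal sketch, assuming user-supplied callables U (exact potential) and grad_V (surrogate gradient); all names are ours, for illustration only.

```python
import numpy as np

def surrogate_hmc_step(q, U, grad_V, eps=0.1, n_leapfrog=20, rng=None):
    """One surrogate-induced HMC transition.

    The trajectory follows the *surrogate* gradient grad_V, but the
    accept/reject test uses the *original* Hamiltonian
    H(q, p) = U(q) + p'p / 2, which is exactly what makes the detailed
    balance argument above go through.
    """
    rng = rng or np.random.default_rng()
    p0 = rng.standard_normal(q.shape)            # resample momentum
    q_new, p_new = q.copy(), p0.copy()
    # Leapfrog under the surrogate potential (volume preserving, reversible)
    p_new = p_new - 0.5 * eps * grad_V(q_new)
    for _ in range(n_leapfrog - 1):
        q_new = q_new + eps * p_new
        p_new = p_new - eps * grad_V(q_new)
    q_new = q_new + eps * p_new
    p_new = p_new - 0.5 * eps * grad_V(q_new)
    # Accept with probability min{1, exp(H - H')} under the ORIGINAL H
    H0 = U(q) + 0.5 * p0 @ p0
    H1 = U(q_new) + 0.5 * p_new @ p_new
    return q_new if np.log(rng.uniform()) < H0 - H1 else q
```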
Appendix 2: Potential matching
In the paper, training data collected from the history of the Markov chain are used without a detailed explanation. Here, we analyze the asymptotic behavior of the surrogate-induced distribution and explain why the history of the Markov chain is a reasonable choice of training data. Recall that we find our neural network surrogate function by minimizing the mean square error (11). Similarly to Hyvärinen (2005), it turns out that minimizing (11) is asymptotically equivalent to minimizing a new distance between the surrogate-induced distribution and the underlying target distribution, independent of their corresponding normalizing constants.
Suppose we know the density of the underlying intractable target distribution up to a constant,
$$P(q\,|\,Y) = \frac{1}{Z} \exp \left( -U(q)\right) ,$$
where Z is the unknown normalizing constant. Our goal is to approximate this distribution using a parametrized density model, also known up to a constant,
$$Q(q,\,\theta ) = \frac{1}{Z(\theta )} \exp \left( -V(q,\,\theta )\right) .$$
Ignoring the multiplicative constants, the corresponding potential energy functions are \(U(q)\) and \(V(q,\,\theta ),\) respectively. The straightforward squared distance between the two potentials is not a well-defined measure between the two distributions because of the unknown normalizing constants. Therefore, we use the following distance instead:
$$K(\theta ) = \frac{1}{2}\, \mathbb {E}_{P}\left[ \left( V(q,\,\theta ) - U(q) - \mathbb {E}_{P}\left[ V(q,\,\theta ) - U(q)\right] \right) ^2\right] . \qquad (14)$$
Because of its similarity to score matching (Hyvärinen 2005), we refer to the approximation method based on this new distance as potential matching; the corresponding potential matching estimator of \(\theta \) is given by
$$\hat{\theta }_{PM} = \mathop {\mathrm{arg\,min}}\limits _{\theta }\, K(\theta ).$$
It is easy to verify that \(K(\theta ) = 0 \Rightarrow V(\theta ) = U + \mathrm{constant} \Rightarrow Q(q,\,\theta ) = P(q\,|\,Y),\) so \(K(\theta )\) is a well-defined squared distance. Exact evaluation of (14) is analytically intractable. In practice, given N samples \(q_1,\,q_2,\ldots ,q_N\) from the target distribution, we minimize the empirical version of (14),
$$\tilde{K}(\theta ) = \frac{1}{2}\left[ \frac{1}{N}\sum _{i=1}^{N} \left( V(q_i,\,\theta ) - U(q_i)\right) ^2 - \left( \frac{1}{N}\sum _{i=1}^{N} \left( V(q_i,\,\theta ) - U(q_i)\right) \right) ^2\right] , \qquad (15)$$
which is asymptotically equivalent to K by the law of large numbers. (15) becomes more concise if we allow a shift term in the parametrized model (\(V(q,\,\theta ) = V(q,\,\theta _{-d}) + \theta _d\)): minimizing the mean squared error over \(\theta _d\) yields exactly the empirical variance in (15), so in that case the empirical potential matching estimator is
$$\hat{\theta } = \mathop {\mathrm{arg\,min}}\limits _{\theta }\, \frac{1}{2N}\sum _{i=1}^{N} \left( V(q_i,\,\theta ) - U(q_i)\right) ^2.$$
In particular, suppose we adopt an additive model for \(V(q,\,\theta )\) with a shift term,
$$V(q,\,\theta ) = \sum _{i=1}^{d-1} \theta _i\, \sigma \left( \varvec{w_i}^{\mathrm T} q + d_i\right) + \theta _d,$$
where \(\varvec{w_i},\,d_i\), and the activation function \(\sigma \) are chosen according to Algorithm 2, and collect early evaluations from the history of the Markov chain,
$$\left\{ \left( q_i,\, U(q_i)\right) \right\} _{i=1}^{N},$$
as training data. Then the estimated parameters (i.e., the weights of the output neuron) are asymptotically the potential matching estimates, since the Markov chain will eventually converge to the target distribution. When truncated at a finite N, the estimated parameters are almost the empirical potential matching estimates, except that the samples from the history of the Markov chain are not exactly (though quite close to) draws from the target distribution.
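For illustration, here is a minimal sketch of this construction in code: random input weights and biases stand in for the choices made by Algorithm 2 (which we do not reproduce), and the output weights, i.e., the empirical potential matching estimate with a shift term, are obtained by ordinary least squares on states recorded from the chain's history. All names are ours.

```python
import numpy as np

def fit_random_basis_surrogate(Q, U_vals, s=100, seed=0):
    """Fit V(q, theta) = sum_i theta_i * sigma(w_i'q + d_i) + theta_d
    to recorded potential values by least squares.

    Q      : (N, dim) array of early states from the chain's history
    U_vals : (N,) exact potential U evaluated at those states
    """
    rng = np.random.default_rng(seed)
    N, dim = Q.shape
    W = rng.standard_normal((dim, s))      # random input weights (kept fixed)
    d = rng.standard_normal(s)             # random biases (kept fixed)
    sigma = np.tanh                        # a bounded activation, for example
    H = np.hstack([sigma(Q @ W + d), np.ones((N, 1))])  # basis matrix + shift
    theta, *_ = np.linalg.lstsq(H, U_vals, rcond=None)  # output weights
    V = lambda q: np.hstack([sigma(q @ W + d), 1.0]) @ theta
    return theta, V
```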
Appendix 3: Adaptive learning
Theorem 1
(Greville) If a matrix A with k columns is denoted by \(A_k\) and partitioned as \(A_k=[A_{k-1}\;\, a_k],\) with \(A_{k-1}\) a matrix having \(k-1\) columns, then the Moore–Penrose generalized inverse of \(A_k\) is
$$A_k^{\dagger } = \begin{bmatrix} A_{k-1}^{\dagger } - d_k b_k^{\mathrm T}\\ b_k^{\mathrm T} \end{bmatrix},$$
where
$$d_k = A_{k-1}^{\dagger } a_k, \quad c_k = a_k - A_{k-1} d_k, \quad b_k = {\left\{ \begin{array}{ll} c_k/\Vert c_k\Vert ^2, &{} c_k \ne 0,\\ \left( 1 + d_k^{\mathrm T} d_k\right) ^{-1} \left( A_{k-1}^{\dagger }\right) ^{\mathrm T} d_k, &{} c_k = 0. \end{array}\right. }$$
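The recursion is easy to check numerically. Below is an illustrative sketch (ours, not from the paper) that implements the column update above and compares it against a directly computed pseudoinverse.

```python
import numpy as np

def greville_append_column(A_pinv, A, a):
    """Pseudoinverse of [A a] from the pseudoinverse of A (Theorem 1)."""
    d = A_pinv @ a
    c = a - A @ d
    if np.linalg.norm(c) > 1e-12:       # a adds a new direction
        b = c / (c @ c)
    else:                               # a lies in the column space of A
        b = (A_pinv.T @ d) / (1.0 + d @ d)
    return np.vstack([A_pinv - np.outer(d, b), b])

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))
a = rng.standard_normal(6)
A_k = np.hstack([A, a[:, None]])
assert np.allclose(greville_append_column(np.linalg.pinv(A), A, a),
                   np.linalg.pinv(A_k))
```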
Proof of Proposition 1
To save computational cost, we only update the estimator. Suppose the current output matrix is \(H_{k+1}=\begin{bmatrix}H_k\\h_{k+1}^\mathrm{T}\end{bmatrix}\) and the target vector is \(T_{k+1}=\begin{bmatrix}T_k\\t_{k+1}\end{bmatrix}\); then, applying Theorem 1 to \(H_{k+1}^{\mathrm T} = [H_k^{\mathrm T}\;\, h_{k+1}],\)
$$\beta _{k+1} = H_{k+1}^{\dagger } T_{k+1} = \beta _k + b_{k+1}\left( t_{k+1} - h_{k+1}^{\mathrm T}\beta _k\right) ,$$
where \(b_{k+1}\) is obtained according to Theorem 1. Note that the computation of \(b_{k+1}\) and \(c_{k+1}\) still involves \(H_k\) and \(H_k^{\dagger },\) whose sizes increase as the data size grows. Following Kohonen (1988), Kovanic (1979), and van Schaik and Tapson (2015), we introduce two auxiliary matrices,
$$\varPhi _k = I - H_k^{\dagger } H_k, \quad \varTheta _k = H_k^{\dagger }\left( H_k^{\dagger }\right) ^{\mathrm T},$$
and rewrite \(b_{k+1}\) and \(c_{k+1}\) as
$$c_{k+1} = \varPhi _k h_{k+1}, \quad b_{k+1} = {\left\{ \begin{array}{ll} c_{k+1}/\Vert c_{k+1}\Vert ^2, &{} c_{k+1} \ne 0,\\ \varTheta _k h_{k+1}\big /\left( 1 + h_{k+1}^{\mathrm T}\varTheta _k h_{k+1}\right) , &{} c_{k+1} = 0. \end{array}\right. }$$
The two auxiliary matrices can also be updated adaptively,
$$\varPhi _{k+1} = \varPhi _k - b_{k+1} c_{k+1}^{\mathrm T}, \quad \varTheta _{k+1} = \varTheta _k - b_{k+1} h_{k+1}^{\mathrm T}\varTheta _k - \varTheta _k h_{k+1} b_{k+1}^{\mathrm T} + \left( 1 + h_{k+1}^{\mathrm T}\varTheta _k h_{k+1}\right) b_{k+1} b_{k+1}^{\mathrm T},$$
and if \(c_{k+1} = 0,\) these formulas simplify to
$$\varPhi _{k+1} = \varPhi _k, \quad \varTheta _{k+1} = \varTheta _k - \frac{\varTheta _k h_{k+1} h_{k+1}^{\mathrm T}\varTheta _k}{1 + h_{k+1}^{\mathrm T}\varTheta _k h_{k+1}}.$$
\(\square \)
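To make the updates above concrete, here is a minimal online estimator in code, a sketch under the \(c_{k+1}=0\) regime (which holds once k exceeds the number of hidden nodes, so every new feature vector lies in the row space of \(H_k\)); only the small \(s\times s\) matrix \(\varTheta _k\) is kept, never \(H_k\) itself. The \(\delta I\) initialization is a standard recursive-least-squares convention, not part of the paper.

```python
import numpy as np

class OnlineSurrogateWeights:
    """Rank-one updates of the output weights beta_k = H_k^+ T_k."""

    def __init__(self, s, delta=1e3):
        self.beta = np.zeros(s)
        self.Theta = delta * np.eye(s)   # large initial proxy for Theta_0

    def update(self, h, t):
        """Absorb one new (feature vector h, target t) pair."""
        g = self.Theta @ h
        b = g / (1.0 + h @ g)                 # b_{k+1} in the c_{k+1} = 0 case
        self.beta += b * (t - h @ self.beta)  # recursive estimator update
        self.Theta -= np.outer(b, g)          # simplified Theta update
        return self.beta
```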