Article

Conditional Deep Gaussian Processes: Multi-Fidelity Kernel Learning

1 Mathematics and Computer Science, Rutgers University, Newark, NJ 07102, USA
2 School of Mathematics, Institute for Advanced Study, Princeton, NJ 08540, USA
* Author to whom correspondence should be addressed.
Entropy 2021, 23(11), 1545; https://doi.org/10.3390/e23111545
Submission received: 1 October 2021 / Revised: 10 November 2021 / Accepted: 18 November 2021 / Published: 20 November 2021
(This article belongs to the Special Issue Probabilistic Methods for Deep Learning)

Abstract

Deep Gaussian Processes (DGPs) were proposed as an expressive Bayesian model capable of a mathematically grounded estimation of uncertainty. The expressivity of DGPs results not only from the compositional character but also from the distribution propagation within the hierarchy. Recently, it was pointed out that the hierarchical structure of DGP is well suited to modeling multi-fidelity regression, in which one is provided with sparse, high-precision observations together with plentiful low fidelity observations. We propose the conditional DGP model in which the latent GPs are directly supported by the fixed lower fidelity data. The moment matching method is then applied to approximate the marginal prior of the conditional DGP with a GP. The obtained effective kernels are implicit functions of the lower-fidelity data, manifesting the expressivity contributed by distribution propagation within the hierarchy. The hyperparameters are learned via optimizing the approximate marginal likelihood. Experiments with synthetic and high dimensional data show performance comparable to other multi-fidelity regression methods, variational inference, and multi-output GPs. We conclude that, with the low fidelity data and the hierarchical DGP structure, the effective kernel encodes the inductive bias for the true function while allowing compositional freedom.

1. Introduction

Multi-fidelity regression refers to a category of learning tasks in which a set of sparse data is given to infer the underlying function, but a larger amount of less precise or noisy observations is also provided. Multi-fidelity tasks frequently occur in various fields of science because precise measurement is often costly while approximate measurements are more affordable (see [1,2] for example). The assumption that the more precise function is a function of the less precise one [1,3] is shared by some hierarchical learning algorithms (e.g., one-shot learning in [4], meta learning [5], and continual learning [6]). Thus, one can view the plentiful low fidelity data as a source of prior knowledge so that the target function can be efficiently learned from sparse data.
In Gaussian Process (GP) regression [7], domain experts can encode their knowledge into combinations of covariance functions [8,9], building an expressive learning model. However, the construction of an appropriate kernel becomes less clear when building a prior for the precise function in the context of multi-fidelity regression, because the uncertainty, both epistemic and aleatoric, in the low fidelity function prior learned from the plentiful data should be taken into account. It is desirable to fuse the low fidelity data into an effective kernel as a prior, taking advantage of the marginal likelihood's ability to avoid overfitting, and then perform the GP regression as if only the sparse precise observations were given.
Deep Gaussian Process (DGP) [10] and similar models [11,12] are expressive models built from a hierarchical composition of GPs. As pointed out in [3], a hierarchical structure is particularly suitable for fusing data of different fidelity into one learning model. Although full Bayesian inference is promising in obtaining expressiveness while avoiding overfitting, exact inference is not tractable and approximate solutions such as the variational approach [13,14,15] are employed. Ironically, the major difficulty in inference comes from the marginalization of the latent GPs in Bayesian learning, which, on the flip side, is also why overfitting can be prevented.
We propose a conditional DGP model in which the intermediate GPs are supported by the lower fidelity data. We also define the corresponding marginal prior distribution, which is obtained by marginalizing all GPs except the exposed one. For some families of kernel compositions, we previously developed an analytical method for calculating the exact covariance in the marginal prior [16]. That method is applied here so that the marginal prior is approximated as a GP prior with an effective kernel. The high fidelity data are then connected to the exposed GP, and the hyperparameters throughout the hierarchy are optimized via the marginal likelihood. Our model, therefore, captures the expressiveness embedded in hierarchical composition and retains the Bayesian character carried in the marginal prior, but loses the non-Gaussian aspect of DGP. From the analytical expressions, one can partially understand the propagation of uncertainty in latent GPs, as it is responsible for the non-stationary aspect of the effective kernels. Moreover, the compositional freedom, i.e., the fact that different compositions may result in the same target function in a DGP model [17,18], can be shown to be intact in our approach.
The paper is organized as follows. In Section 2, we review the literature on multi-fidelity regression models and deep kernels. Background on GP, DGP, and the moment matching method is introduced in Section 3. The conditional DGP model defined as a marginal prior and the exact covariance associated with two families of kernel compositions are discussed in Section 4. The method of hyperparameter learning is given in Section 5, and simulations of synthetic and high dimensional multi-fidelity regression in a variety of situations are presented in Section 6. A brief discussion and the conclusion appear in Section 7 and Section 8, respectively.

2. Related Work

Assuming autoregressive relations between data of different fidelity, Kennedy and O’Hagan [1] proposed the AR1 model for multi-fidelity regression tasks. Le Gratiet and Garnier [19] improved computational efficiency with a recursive multi-fidelity model. Deep-MF [20] mapped the input space to a latent space and followed the work of Kennedy and O’Hagan [1]. NARGP [21] stacked a sequence of GPs in which the posterior mean of the low-fidelity function is passed to the input of the next GP while the associated uncertainty is not. GPAR [22] used a similar conditional structure between the functions of interest. MF-DGP in [3] exploited the DGP structure for multi-fidelity regression tasks and used the approximate variational inference in [13]. Multi-output GPs [23,24] regard the observations from different data sets as realizations of a vector-valued function; [25] modeled the multi-output GP using a general relation between multiple target functions and multiple hidden functions. Alignment learning [26,27] is an application of warped GPs [11,12] to time series data. We model multi-fidelity regression as kernel learning, effectively taking into account the space of functions representing the low fidelity data.
As for general studies of deep and non-stationary kernels, Williams [28] and Cho and Saul [29] used the basis of error functions and Heaviside polynomial functions to obtain the arc-sine and arc-cosine kernel functions, respectively, of neural networks. Duvenaud et al. [30] employed the analogy between neural networks and GPs and constructed the deep kernel for DGP. Dunlop et al. [31] analyzed a variety of non-stationary kernel compositions in DGP, and Shen et al. [32] provided an insight from the Wigner transformation of general two-input functions. Wilson et al. [33] proposed a general recipe for constructing deep kernels with neural networks. Daniely et al. [34] computed the deep kernel from the perspective of two correlated random variables. Mairal et al. [35] and Van der Wilk et al. [36] studied deep kernels in convolutional models. The moment matching method [16] allows obtaining the effective kernel encoding the uncertainty in learning the lower fidelity function.

3. Background

3.1. Gaussian Process and Deep Gaussian Process

Gaussian Process (GP) [7] is a popular Bayesian learning model in which the prior over a continuous function is modeled as a Gaussian. Denoting $f \sim \mathcal{GP}(\mu, k)$, the collection of any finite set of function values $f(x_{1:N})$ with $x \in \mathbb{R}^d$ has mean $\mathbb{E}[f(x_i)] = \mu(x_i)$ and covariance $\mathbb{E}\{[f(x_i) - \mu(x_i)][f(x_j) - \mu(x_j)]\} = k(x_i, x_j)$. Thus, a continuous and deterministic mean function $\mu(\cdot)$ and a positive definite kernel function $k(\cdot,\cdot)$ fully specify the stochastic process. It is common to consider the zero-mean case and write down the prior distribution, $p(\mathbf{f}|\mathbf{X}) = \mathcal{N}(\mathbf{0}, K)$, with covariance matrix $K$. In the setting of a regression task with input and output data $\{\mathbf{X}, \mathbf{y}\}$, the hyperparameters in the mean and kernel functions can be learned by optimizing the marginal likelihood, $p(\mathbf{y}|\mathbf{X}) = \int d\mathbf{f}\, p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|\mathbf{X})$.
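For reference, this standard GP regression workflow can be written in a few lines with the GPy framework used later in the paper; the toy data below are illustrative and not taken from the paper.

```python
import numpy as np
import GPy

# Toy data: noisy observations of a smooth function
X = np.random.uniform(0, 1, (20, 1))
y = np.sin(2 * np.pi * X) + 0.05 * np.random.randn(20, 1)

kernel = GPy.kern.RBF(input_dim=1)               # squared exponential (SE) kernel
model = GPy.models.GPRegression(X, y, kernel)    # zero-mean GP prior p(f|X) = N(0, K)
model.optimize()                                 # maximize the marginal likelihood p(y|X)
mu, var = model.predict(np.linspace(0, 1, 100)[:, None])
```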
Deep Gaussian Process (DGP) was proposed in [10] as a hierarchical composition of GPs for superior expressivity. From a generative view, the distribution over the composite function $f(x) = f_L \circ f_{L-1} \circ \cdots \circ f_1(x)$ is a serial product of Gaussian conditional distributions,
$$p(\mathbf{F}_L, \mathbf{F}_{L-1}, \ldots, \mathbf{F}_1 | \mathbf{X}) = p(\mathbf{F}_L|\mathbf{F}_{L-1})\, p(\mathbf{F}_{L-1}|\mathbf{F}_{L-2}) \cdots p(\mathbf{F}_1|\mathbf{X}),$$
in which the bold capital symbol $\mathbf{F}_i$ stands for the multi-output GP in the $i$th layer and the independent components have $p(\mathbf{f}_i|\mathbf{F}_{i-1}) = \mathcal{N}(\mathbf{0}, K(\mathbf{F}_{i-1}, \mathbf{F}_{i-1}))$. The above is the general DGP, and the width of each layer is denoted by $H_i := |\mathbf{F}_i|$. In this notation, the zeroth layer represents the collection of inputs $\mathbf{X}$. Here, we shall consider the DGP with $L = 2$ and $H_2 = H_1 = 1$, as well as its three-layer counterpart.
The intractability of exact inference results from the fact that the random variables $\mathbf{F}_i$ for $L > i > 0$ appear in the covariance matrix $K$. In a full Bayesian inference, these random variables are marginalized in order to estimate the evidence $p(\mathbf{y}|\mathbf{X})$ associated with the data.

3.2. Multi-Fidelity Deep Gaussian Process

The multi-fidelity model in [1] considered the regression task for a data set consisting of observations measured with both high and low precision. For simplicity, the more precise observations are denoted by $\{\mathbf{X}, \mathbf{y}\}$ and those with low precision by $\{\mathbf{X}_1, \mathbf{y}_1\}$. The main assumption made in [1] is that the less precise observations shall come from a function $f_1(x)$ modeled by a GP with zero mean and kernel $k$, while the more precise ones come from the combination $f(x) = \alpha f_1(x) + h(x)$. With the residual function $h$ being a GP with kernel $k_h$, one can jointly model the two subsets with the covariance within the precise observations $\mathbb{E}[f(x_i) f(x_j)] = \alpha^2 k_{ij} + k_{h,ij}$, within the less precise ones $\mathbb{E}[f_1(x_i) f_1(x_j)] = k_{ij}$, and the mutual covariance $\mathbb{E}[f(x_i) f_1(x_j)] = \alpha k_{ij}$. Here, $k_{ij}$ refers to the covariance between the two inputs at $x_i$ and $x_j$.
The work in [3] generalized the above linear relationship between the more and less precise functions to a nonlinear one, i.e., $f(x) = f_2(f_1(x)) + \mathrm{noise}$. The hierarchical structure in DGP is suitable for such nonlinear modeling. The variational inference scheme [13] was employed to evaluate the evidence lower bounds (ELBOs) for the data at all levels of precision, and the sum over all ELBOs is the objective for learning the hyperparameters and inducing points.

3.3. Covariance in Marginal Prior of DGP

The variational inference, e.g., [13], starts with connecting the joint distribution $p(\mathbf{f}_1, \mathbf{f}_2|\mathbf{X})$ with the data $\mathbf{y}$, followed by applying Jensen's inequality along with an approximate posterior in evaluating the ELBO. In contrast, we proposed in [16] that the marginal prior for the DGP,
$$p(\mathbf{f}|\mathbf{X}) = \int d\mathbf{f}_1\, p(\mathbf{f}_2|\mathbf{f}_1)\, p(\mathbf{f}_1|\mathbf{X}),$$
with the bold symbols representing the sets of function values $\mathbf{f}(\cdot) = f_2(f_1(\cdot))$, $\mathbf{f}_2(\cdot)$, and $\mathbf{f}_1(\cdot)$, can be approximated as a GP, i.e., $q(\mathbf{f}|\mathbf{X}) = \mathcal{N}(\mathbf{0}, K_{\mathrm{eff}})$ in the zero-mean case. Matching the covariance of $p$ and $q$ leads to a closed form of the effective covariance function for certain families of kernel compositions. The SE[SC] composition, i.e., the squared exponential and squared cosine kernels being used in the GPs for $f_2|f_1$ and $f_1$, respectively, is an example. With the intractable marginalization over the intermediate $f_1$ being taken care of in the moment matching approximation, one can evaluate the approximate marginal likelihood for the data set $\{\mathbf{X}, \mathbf{y}\}$,
$$p(\mathbf{y}|\mathbf{X}) \approx \int d\mathbf{f}\, p(\mathbf{y}|\mathbf{f})\, q(\mathbf{f}|\mathbf{X}).$$
In the following, we shall develop, along the lines of [16], the approximate inference for the multi-fidelity data consisting of precise observations $\{\mathbf{X}, \mathbf{y}\}$ and less precise observations $\{\mathbf{X}_{1:L-1}, \mathbf{y}_{1:L-1}\}$ with the $L$-layer, width-1 DGP models. The effective kernels $k_{\mathrm{eff}}$ shall encode the knowledge built on these less precise data, which allows modeling the precise function even with a sparse data set.

4. Conditional DGP and Multi-Fidelity Kernel Learning

In the simplest case, we are given two subsets of data, $\{\mathbf{X}, \mathbf{y}\}$ with high precision and $\{\mathbf{X}_1, \mathbf{y}_1\}$ with low precision. We can start with defining the conditional DGP in terms of the marginal prior,
$$p(\mathbf{f}|\mathbf{X}, \mathbf{X}_1, \mathbf{y}_1) = \int d\mathbf{f}_1\, p(\mathbf{f}_2|\mathbf{f}_1)\, p(\mathbf{f}_1|\mathbf{X}, \mathbf{X}_1, \mathbf{y}_1),$$
where the Gaussian distribution $p(\mathbf{f}_1|\mathbf{X}, \mathbf{X}_1, \mathbf{y}_1) = \mathcal{N}(f_1(x_{1:N})\,|\,\mathbf{m}, \Sigma)$ has the conditional mean in vector form,
$$\mathbf{m} = K_{X,X_1} K_{X_1,X_1}^{-1}\, \mathbf{y}_1,$$
and the conditional covariance in matrix form,
$$\Sigma = K_{X,X} - K_{X,X_1} K_{X_1,X_1}^{-1} K_{X_1,X}.$$
The matrix $K_{X,X_1}$ registers the covariance among the inputs in $\mathbf{X}$ and $\mathbf{X}_1$, and likewise for $K_{X,X}$ and $K_{X_1,X_1}$. Thus, the set of function values $f_1(x_{1:N})$ associated with the $N$ inputs in $\mathbf{X}$ is supported by the low fidelity data.
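For concreteness, Equations (5) and (6) can be computed directly with NumPy. The sketch below is generic; the SE kernel choice and the noise term added to $K_{X_1,X_1}$ for noisy low fidelity observations are our illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def se_kernel(A, B, var=1.0, ell=1.0):
    """Squared exponential kernel matrix between the input sets A and B."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return var * np.exp(-0.5 * d2 / ell**2)

def conditional_moments(X, X1, y1, noise_var=1e-6):
    """Conditional mean (Equation (5)) and covariance (Equation (6)) of f1 at X given (X1, y1)."""
    K11 = se_kernel(X1, X1) + noise_var * np.eye(len(X1))   # jitter / observation noise on the low fidelity data
    Kx1 = se_kernel(X, X1)
    Kxx = se_kernel(X, X)
    m = Kx1 @ np.linalg.solve(K11, y1)                      # Equation (5)
    Sigma = Kxx - Kx1 @ np.linalg.solve(K11, Kx1.T)         # Equation (6)
    return m, Sigma
```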
The data $\{\mathbf{X}, \mathbf{y}\}$ with high precision are then associated with the function $f(x) = f_2(f_1(x))$. Following the previous discussion, we may write down the true evidence for the precise data conditioned on the less precise ones,
$$p(\mathbf{y}|\mathbf{X}, \mathbf{X}_1, \mathbf{y}_1) = \int d\mathbf{f}\, p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|\mathbf{X}, \mathbf{X}_1, \mathbf{y}_1) = \int d\mathbf{f}_1\, d\mathbf{f}_2\, p(\mathbf{y}|\mathbf{f}_2)\, p(\mathbf{f}_2|\mathbf{f}_1)\, p(\mathbf{f}_1|\mathbf{X}, \mathbf{X}_1, \mathbf{y}_1).$$
To proceed with the moment matching approximation of the true evidence in Equation (7), one needs to find the effective kernel in the approximate distribution $q(\mathbf{f}|\mathbf{X}, \mathbf{X}_1, \mathbf{y}_1) = \mathcal{N}(\mathbf{0}, K_{\mathrm{eff}})$ and replace the true distribution in Equation (4) with the approximate one,
$$p(\mathbf{y}|\mathbf{X}, \mathbf{X}_1, \mathbf{y}_1) \approx \int d\mathbf{f}\, p(\mathbf{y}|\mathbf{f})\, q(\mathbf{f}|\mathbf{X}, \mathbf{X}_1, \mathbf{y}_1) = \mathcal{N}(\mathbf{y}\,|\,\mathbf{0}, K_{\mathrm{eff}} + \sigma_n^2 I_N).$$
Therefore, the learning in the conditional DGP includes the hyperparameters in the exposed GP, $f_2|f_1$, and those in the intermediate GP, $f_1$. Standard gradient descent is applied to the above approximate marginal likelihood. One can customize the kernel $K_{\mathrm{eff}}$ in the GPy [37] framework and implement the gradient components $\partial K_{\mathrm{eff}}/\partial\theta$ with $\theta \in \{\sigma_{1,2}, \ell_{1,2}\}$ in the optimization.
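As an illustration of Equation (8), the negative log of the approximate marginal likelihood can be evaluated directly once $K_{\mathrm{eff}}$ is available. The helper below is a plain NumPy sketch (the names are ours, not from the released code), which an optimizer can then minimize over the hyperparameters.

```python
import numpy as np

def neg_log_marginal_likelihood(y, K_eff, noise_var):
    """-log N(y | 0, K_eff + noise_var * I), cf. Equation (8)."""
    y = np.asarray(y).ravel()
    N = len(y)
    L = np.linalg.cholesky(K_eff + noise_var * np.eye(N))      # K_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))        # K_y^{-1} y
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * N * np.log(2 * np.pi)
```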

4.1. Analytic Effective Kernels

Here, we consider the conditional DGP with a two-layer, width-1 hierarchy, focusing on the cases where the exposed GP for $f_2|f_1$ in Equation (4) uses the squared exponential (SE) kernel or the squared cosine (SC) kernel. We also follow the notation in [16] so that the composition denoted by $k_2[k_1]$ represents that $k_2$ is the kernel used in the exposed GP and $k_1$ is used in the intermediate GP. For example, SE[SC] means $k_2$ is SE while $k_1$ is SC. Following [16], the exact covariance in the marginal prior Equation (4) is calculated,
$$\mathbb{E}_{\mathbf{f}}[f(x_i) f(x_j)] := \mathbb{E}_{f_1}\big[\mathbb{E}_{f_2|f_1}[f_2(f_1(x_i))\, f_2(f_1(x_j))]\big] = \int d\mathbf{f}_1\, k_2(f_1(x_i), f_1(x_j))\, p(\mathbf{f}_1|\mathbf{X}, \mathbf{X}_1, \mathbf{y}_1).$$
Thus, when the exposed GP has a kernel $k_2$ in the exponential family, the above integral is tractable and the analytic $k_{\mathrm{eff}}$ can be implemented as a customized kernel. The following two lemmas from [16] are useful for the present case with a nonzero conditional mean and a conditional covariance in $f_1$.
Lemma 1.
(Lemma 2 in [16]) For a vector of Gaussian random variables $\mathbf{g}_{1:n} \sim \mathcal{N}(\mathbf{m}, C)$, the expectation of the exponential quadratic form $\exp[-\frac{1}{2} Q(g_1, g_2, \ldots, g_n)]$ with $Q(\mathbf{g}_{1:n}) = \sum_{i,j} A_{ij} g_i g_j \geq 0$ has the following closed form,
$$\mathbb{E}\Big[e^{-\frac{1}{2} Q(\mathbf{g}_{1:n})}\Big] = \frac{\exp\big[-\frac{1}{2}\, \mathbf{m}^T (I_n + A C)^{-1} A\, \mathbf{m}\big]}{\sqrt{|I_n + C A|}}.$$
The $n$-dimensional matrix $A$ appearing in the quadratic form $Q$ is symmetric.
Lemma 2.
(Lemma 3 in [16]) With the same Gaussian vector $\mathbf{g}$ as in Lemma 1, the expectation value of the exponential inner product $\exp[\mathbf{a}^T \mathbf{g}]$ between $\mathbf{g}$ and a constant vector $\mathbf{a}$ reads,
$$\mathbb{E}\big[e^{\mathbf{a}^T \mathbf{g}}\big] = \exp\Big\{\mathbf{a}^T \mathbf{m} + \tfrac{1}{2}\,\mathrm{Tr}\big[C\, \mathbf{a}\mathbf{a}^T\big]\Big\},$$
where the transpose operation on a column vector is denoted by the superscript $T$.
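As a quick numerical sanity check (not part of the original derivation), Lemma 2 is the moment generating function of a multivariate Gaussian and can be verified by Monte Carlo with arbitrary example parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
m = np.array([0.5, -1.0, 2.0])
L = rng.normal(size=(n, n))
C = L @ L.T + 0.1 * np.eye(n)                     # a valid covariance matrix
a = np.array([0.3, -0.2, 0.1])

closed_form = np.exp(a @ m + 0.5 * np.trace(C @ np.outer(a, a)))   # Lemma 2
mc_estimate = np.mean(np.exp(rng.multivariate_normal(m, C, size=200_000) @ a))
print(closed_form, mc_estimate)                   # the two agree up to Monte Carlo error
```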
We shall emphasize that our previous work [16] did not discuss the cases when the intermediate GP for $f_1$ is conditioned on the low precision data $\{\mathbf{X}_1, \mathbf{y}_1\}$. Thus, the conditional mean and the non-stationary conditional covariance were not considered in [16].
Lemma 3.
The covariance in the marginal prior with an SE kernel $k_2(x, y) = \sigma_2^2 \exp[-(x - y)^2/2\ell_2^2]$ in the exposed GP can be calculated analytically. With the Gaussian conditional distribution, $p(\mathbf{f}_1|\mathbf{X}, \mathbf{X}_1, \mathbf{y}_1)$, supported by the low fidelity data, the effective kernel reads,
$$k_{\mathrm{eff}}(x_i, x_j) = \frac{\sigma_2^2}{\sqrt{1 + \delta_{ij}^2/\ell_2^2}}\, \exp\left[-\frac{(m_i - m_j)^2}{2(\ell_2^2 + \delta_{ij}^2)}\right],$$
where $m_{i,j} := \mathbb{E}[f_1(x_{i,j})|\mathbf{X}_1, \mathbf{y}_1]$ is the conditional mean of $f_1$. The positive parameter $\delta_{ij}^2 := c_{ii} + c_{jj} - 2 c_{ij}$ is defined with the conditional covariance $c_{ij} := \mathrm{cov}[f_1(x_i), f_1(x_j)|\mathbf{X}_1, \mathbf{y}_1]$. Together, $\delta_{ij}^2$ and the length scale $\ell_2$ in $k_2$ dictate how the uncertainty in $f_1(x)$ affects the function composition.
Proof. 
For the SE[ ] composition, one can represent the kernel $k_2 = \exp\{-[f_1(x_i) - f_1(x_j)]^2/2\}$ as an exponential quadratic form $\exp[-Q/2]$ with $Q = \mathbf{f}_1^T A \mathbf{f}_1$ and $A = \left(\begin{smallmatrix} 1 & -1 \\ -1 & 1 \end{smallmatrix}\right)$; $\ell_2 = 1$ is set for ease of notation. Now $\mathbf{f}_1$ is a bivariate Gaussian variable with mean $\mathbf{m}$ and covariance matrix $C$. To calculate the expectation value in Equation (12), we need to compute the following 2-by-2 matrix, and one can show that $(I_2 + AC)^{-1}$ reduces to
$$\frac{1}{1 + c_{ii} + c_{jj} - 2 c_{ij}} \begin{pmatrix} 1 + c_{jj} - c_{ij} & c_{jj} - c_{ij} \\ c_{ii} - c_{ij} & 1 + c_{ii} - c_{ij} \end{pmatrix}.$$
The seemingly complicated matrix is in fact reducible, as one can show that $(I_2 + AC)^{-1} A = A/(1 + \delta_{ij}^2)$, which leads to the exponential term in the kernel. With the determinant $|I_2 + CA| = 1 + \delta_{ij}^2$ and restoring the length scale $\ell_2$, the kernel in Equation (12) is reproduced. □
A few observations are in order. First, we can rewrite $\delta_{ij}^2 = \begin{pmatrix} 1 & -1 \end{pmatrix}\begin{pmatrix} c_{ii} & c_{ij} \\ c_{ji} & c_{jj} \end{pmatrix}\begin{pmatrix} 1 \\ -1 \end{pmatrix}$, which guarantees the positiveness of $\delta_{ij}^2$ because the two-by-two sub-block of the covariance matrix is positive-definite. Secondly, there are deterministic and probabilistic aspects of the kernel in Equation (12). When the length scale $\ell_2$ is very large, the term $\delta^2$ encoding the uncertainty in $f_1$ becomes irrelevant and the kernel is approximately an SE kernel with the input warped by the conditional mean in Equation (5), which is reminiscent of the deep kernel proposed in [33] where a GP is stacked on the output of a DNN. The kernel used in [21] similarly treats the conditional mean of $f_1$ as a deterministic transformation while the uncertainty is ignored. On the other hand, when $\delta^2$ and $\ell_2^2$ are comparable, the (epistemic) uncertainty in $f_1$ shaped by the supports $\mathbf{y}_1$ is relevant. The effective kernel then represents the covariance in the ensemble of GPs, each of which receives inputs transformed by one $f_1$ sampled from the intermediate GP. Thirdly, we shall stress that the appearance of $\delta^2$ is a signature of marginalization over the latent function in deep probabilistic models. A similar squared distance also appeared in [30], where an effectively deep kernel was derived from a recursive inner product in the space defined by neural network feature functions.
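For concreteness, the effective kernel of Equation (12) can be written directly in NumPy. The sketch below is illustrative (the function and argument names are ours, not from the released code); it only assumes the conditional mean vector $\mathbf{m}$ and conditional covariance matrix $C$ of $f_1$ at the training inputs, e.g., as returned by the `conditional_moments` helper above.

```python
import numpy as np

def se_effective_kernel(m, C, sigma2_sq=1.0, ell2_sq=1.0):
    """Effective SE[.] kernel of Equation (12); sigma2_sq and ell2_sq denote sigma_2^2 and ell_2^2."""
    m = np.asarray(m).ravel()
    c_diag = np.diag(C)
    delta2 = c_diag[:, None] + c_diag[None, :] - 2.0 * C      # delta_ij^2 = c_ii + c_jj - 2 c_ij
    dm2 = (m[:, None] - m[None, :]) ** 2                      # (m_i - m_j)^2
    return sigma2_sq / np.sqrt(1.0 + delta2 / ell2_sq) * np.exp(-dm2 / (2.0 * (ell2_sq + delta2)))
```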
In the following lemma, we consider the composition where the kernel in the outer layer is the squared cosine, $k_h(x, y) = (\sigma_2^2/2)\{1 + \cos[(x - y)/\ell_2]\}$, which is a special case of the spectral mixture kernel [38].
Lemma 4.
The covariance in $\mathbf{f}$ of the marginal prior with the SC kernel used in the exposed GP is given below,
$$k_{\mathrm{eff}}(x_i, x_j) = \mathbb{E}_p[f_i f_j] = \frac{\sigma_2^2}{2}\left\{1 + \cos\left(\frac{m_i - m_j}{\ell_2}\right) \exp\left(-\frac{\delta_{ij}^2}{2\ell_2^2}\right)\right\},$$
where $\delta_{ij}^2$ has been defined in the previous lemma.
The form of a product of cosine and exponential kernels is similar to the deep spectral mixture kernel [33]. In our case, the cosine function has the warped input $m(x_i) - m(x_j)$, whereas the exponential function has the input $c(x_i, x_i) + c(x_j, x_j) - 2c(x_i, x_j)$ due to the conditional covariance in the intermediate GP.
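A corresponding sketch of Equation (14), under the same assumptions and naming caveats as above (here `ell2` denotes $\ell_2$ itself rather than its square):

```python
import numpy as np

def sc_effective_kernel(m, C, sigma2_sq=1.0, ell2=1.0):
    """Effective SC[.] kernel of Equation (14)."""
    m = np.asarray(m).ravel()
    c_diag = np.diag(C)
    delta2 = c_diag[:, None] + c_diag[None, :] - 2.0 * C      # delta_ij^2
    dm = m[:, None] - m[None, :]                              # m_i - m_j
    return 0.5 * sigma2_sq * (1.0 + np.cos(dm / ell2) * np.exp(-delta2 / (2.0 * ell2**2)))
```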

4.2. Samples from the Marginal Prior

Now we study samples from the approximate marginal prior with the effective kernel in Equation (12). We vary the low fidelity data $\mathbf{X}_1, \mathbf{y}_1$ to see how they affect the inductive bias for the target function; see the illustration in Figure 1. The top row displays the low-fidelity functions $f_1|\mathbf{X}_1, \mathbf{y}_1$, which are obtained by a standard GP regression. Here, the low-fidelity data are noiseless observations of three simple functions (linear on the left, hyperbolic tangent in the middle, and sine on the right). The conditional covariance and conditional mean are then fed into the effective kernel in Equation (12), so we can sample functions from the prior carrying the effective kernel. The samples are displayed in the second row.
In such cases, it can be seen that $f_1|\mathbf{X}_1, \mathbf{y}_1$ is nearly a deterministic function (top row) given the sufficient amount of noiseless observations in $\{\mathbf{X}_1, \mathbf{y}_1\}$. In fact, the left panel in the second row is equivalent to samples from an SE kernel, as $f_1$ is the identity function. Moving to the second column, one can see that the nonlinear warping generates additional kinks in the target functions. The case in the third column with periodic warping imparts periodic patterns to the sampled functions.
Next, we examine the effect of uncertainty in $f_1|\mathbf{X}_1, \mathbf{y}_1$ (third row) on the sampled functions (bottom row). The increased uncertainty (shown by the shaded region) in $f_1$ generates weak, high-frequency signals in the target function due to the non-stationary $\delta^2$ in Equation (12). We stress that these weak signals are not white noise. The noise in the low fidelity data can even reverse the sign of the sampled functions, as seen by comparing the second and bottom rows in the third column. Consequently, the expressivity of the effective kernel gets a contribution from the uncertainty in learning the low fidelity functions.
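The sampling procedure behind Figure 1 can be sketched as follows, assuming GPy for the first-layer regression and the hypothetical `se_effective_kernel` helper above; the data and parameter values are illustrative choices.

```python
import numpy as np
import GPy

rng = np.random.default_rng(0)
X1 = np.linspace(0, 1, 30)[:, None]
y1 = np.sin(8 * np.pi * X1)                          # noiseless low fidelity observations

model1 = GPy.models.GPRegression(X1, y1, GPy.kern.RBF(input_dim=1))
model1.optimize()

Xs = np.linspace(0, 1, 200)[:, None]                 # grid on which we sample the target f
m, C = model1.predict_noiseless(Xs, full_cov=True)   # conditional mean/covariance of f1

K_eff = se_effective_kernel(m, C, sigma2_sq=1.0, ell2_sq=1.0)
samples = rng.multivariate_normal(np.zeros(len(Xs)), K_eff + 1e-8 * np.eye(len(Xs)), size=5)
```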

5. Method

Since we approximate the marginal prior for the conditional DGP with a GP, the corresponding approximate marginal likelihood should be the objective for jointly learning the hyperparameters, including those in the exposed GP and the intermediate GPs. From the analytical expression for the effective kernel, e.g., Equation (12), the gradient components include the explicit derivatives $\partial K_{\mathrm{eff}}/\partial\sigma_2$ and $\partial K_{\mathrm{eff}}/\partial\ell_2$ as well as the implicit derivatives $\partial K_{\mathrm{eff}}/\partial\sigma_1$ and $\partial K_{\mathrm{eff}}/\partial\ell_1$, which can be computed via the chain rule.
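Since the layer-1 hyperparameters enter $K_{\mathrm{eff}}$ only through the conditional mean $\mathbf{m}$ and covariance $\Sigma$ of $f_1$ in Equations (5) and (6), the implicit derivatives take the schematic chain-rule form (written here for one hyperparameter $\theta_1 \in \{\sigma_1, \ell_1\}$):
$$\frac{\partial K_{\mathrm{eff}}}{\partial \theta_1} = \sum_{a} \frac{\partial K_{\mathrm{eff}}}{\partial m_a}\,\frac{\partial m_a}{\partial \theta_1} + \sum_{a,b} \frac{\partial K_{\mathrm{eff}}}{\partial \Sigma_{ab}}\,\frac{\partial \Sigma_{ab}}{\partial \theta_1}.$$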
With data consisting of observations of different fidelity, an alternative method can learn the hyperparameters associated with each layer of the hierarchy sequentially; see Algorithm 1 for details. The low fidelity data are fed into the first GP regression model for inferring $f_1$ and the hyperparameters $\ell_1$ and $\sigma_1$. The conditional mean and conditional covariance of $f_1|\mathbf{X}_1, \mathbf{y}_1$ are then sent to the effective kernel. The second GP, which uses the effective kernel, infers the high fidelity function $f$ with the marginal likelihood for the high fidelity data as the objective. Optimization of the second model yields the hyperparameters $\ell_2$ and $\sigma_2$ in the second layer. Learning in a three-layer hierarchy can be generalized from the two-layer one. In Appendix A, a comparison of regression using the two methods is shown.
Algorithm 1 A learning algorithm for conditional DGP multi-fidelity regression
  Input: two sources of data, low-fidelity data ($\mathbf{X}_1$, $\mathbf{y}_1$) and high-fidelity data ($\mathbf{X}$, $\mathbf{y}$), kernel $k_1$ for the function $f_1$, and the test input $x_*$.
  1. k1 = Kernel(var = $\sigma_1^2$, lengthscale = $\ell_1$) {Initialize the kernel for inferring $f_1$}
  2. model1 = Regression(kernel = k1, data = $\mathbf{X}_1$, $\mathbf{y}_1$) {Initialize the regression model for $f_1$}
  3. model1.optimize()
  4. m, C = model1.predict(input = $\mathbf{X}$, $x_*$, full-cov = true) {Output pred. mean and post. cov. of $f_1$}
  5. keff = EffectiveKernel(var = $\sigma_2^2$, lengthscale = $\ell_2$, m, C) {Initialize the effective kernel, Equation (12) for SE[ ] and Equation (14) for SC[ ]}
  6. model2 = Regression(kernel = keff, data = $\mathbf{X}$, $\mathbf{y}$) {Initialize the regression model for $f$}
  7. model2.optimize()
  8. $\mu_*$, $\sigma_*^2$ = model2.predict(input = $x_*$)
  Output: Optimal hyperparameters $\{\sigma_{1,2}^2, \ell_{1,2}\}$ and predictive mean $\mu_*$ and variance $\sigma_*^2$ at $x_*$.
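The snippet below is a minimal end-to-end sketch of Algorithm 1 in Python. It assumes GPy for the first-layer regression and SciPy for the second-layer hyperparameters, and it reuses the hypothetical helpers `se_effective_kernel` and `neg_log_marginal_likelihood` sketched earlier; the names are ours and the released implementation may differ.

```python
import numpy as np
import GPy
from scipy.optimize import minimize

# X1, y1 (low fidelity) and X, y (high fidelity) are assumed to be (N, 1)-shaped arrays.

# Steps 1-4: fit the low fidelity GP and extract the conditional mean/covariance of f1 at X.
model1 = GPy.models.GPRegression(X1, y1, GPy.kern.RBF(input_dim=1))
model1.optimize()
m, C = model1.predict_noiseless(X, full_cov=True)

# Steps 5-7: optimize the layer-2 hyperparameters and the noise variance by minimizing
# the approximate negative log marginal likelihood of Equation (8).
def objective(log_params):
    sigma2_sq, ell2_sq, noise_var = np.exp(log_params)     # log-parametrization keeps them positive
    K_eff = se_effective_kernel(m, C, sigma2_sq, ell2_sq)
    return neg_log_marginal_likelihood(y, K_eff, noise_var)

result = minimize(objective, x0=np.log([1.0, 1.0, 0.1]), method="L-BFGS-B")
sigma2_sq, ell2_sq, noise_var = np.exp(result.x)
```

Prediction at a test input $x_*$ (Step 8) then follows the standard GP formulas once the conditional mean and covariance of $f_1$ at the stacked inputs $[\mathbf{X}; x_*]$ are used to build the full effective covariance.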

6. Results

In this section, we present the results of multi-fidelity regression given low fidelity data $\mathbf{X}_1, \mathbf{y}_1$ and high fidelity data $\mathbf{X}, \mathbf{y}$, using the 2-layer conditional DGP model. The cases with three levels of fidelity can be handled with the 3-layer counterpart. The toy demonstrations in Section 6.1 focus on data sets in which the target function is a composite, $f(x) = f_2(f_1(x))$; the low fidelity data are observations of $f_1(x)$ while the high fidelity data are observations of $f(x)$. The aspect of compositional freedom is discussed in Section 6.2, where the same target function is inferred with the same high fidelity data but with low fidelity data resulting from a variety of functions. In Section 6.3, we switch to the case where the low fidelity data are also observations of the target function $f$ but with large noise. In Section 6.4, we compare our model with the work in [3] on data sets with high dimensional inputs.

6.1. Synthetic Two-Fidelity Function Regression

The first example in Figure 2 consists of 10 random observations of the target function $f(x) = (x - \sqrt{2}) f_1^2(x)$ (red dashed line) along with 30 observations of the low fidelity function $f_1(x) = \sin 8\pi x$ (not shown). The 30 observations of $f_1$, which has a period of 0.25 in the range $[0, 1]$, are more than sufficient to reconstruct $f_1$ with high confidence. In contrast, it is difficult to reconstruct $f$ from the 10 observations of $f$ alone (shown as red dots) if a GP with an SE kernel is used. The figure demonstrates the results from a set of multi-source nonparametric regression methods which incorporate the learning of $f_1$ into the target regression of $f$. Our model with the SE[SE] kernel [panel (f)] and NARGP [panel (c)] successfully capture the periodic pattern inherited from the low fidelity function $f_1$, but only in our prediction is the target function fully covered by the confidence region. On the other hand, in the input space away from the target observations, AR1 [panel (a)] and MF-DGP [panel (e)] manage to capture only part of the oscillation. Predictions from LCM [panel (b)] and DEEP-MF [panel (d)] are reasonable near the target observations but fail to capture the oscillation away from these observations.
Figure 3 demonstrates another example of multi-fidelity regression on a nonlinear composite function. The low fidelity function is also periodic, $f_1 = \cos 15x$, and the target is an exponential function, $f = x \exp[f_1(2x - 0.2)] - 1$. The 15 observations of $f$ (red dashed line) are marked by the red dots. The exponential nature of the mapping $f_1 \to f$ may make the reconstruction more challenging than in the previous case, which may explain the less satisfying results from LCM [panel (b)]. NARGP [panel (c)] and MF-DGP [panel (e)] have similar predictions which mismatch some of the observations, but the target function is mostly covered by the uncertainty estimation. Our model with the SE[SE] kernel [panel (f)], on the other hand, has predictions consistent with all the target observations, and the target function is fully covered by the uncertainty region. Qualitatively similar results are also obtained from AR1 [panel (a)] and DEEP-MF [panel (d)].

6.2. Compositional Freedom and Varying Low-Fidelity Data

Given the good learning results in the previous subsection, one may wonder about the effects of a different low fidelity data set on inferring the high fidelity function. Here, we consider the same high fidelity data from the target function in Figure 2, but the low fidelity data are observations of $f_1(x) = x$, $f_1(x) = \tanh x$, $f_1(x) = \sin 4\pi x$, and $f_1(x) = \sin 8\pi x$. Figure 4 displays the results. Plots in the top rows represent $f_1|\mathbf{X}_1, \mathbf{y}_1$, while the bottom rows show the inferred target function given the high fidelity data (red dots). It can be seen in the leftmost column of panel (a) that the linear $f_1$ is not a probable low fidelity function, as the true target function (red dashed line) in the bottom is outside the predictive confidence. Similarly, in the second plot in (a), $f_1$ being a hyperbolic tangent function is unlikely to account for the true target function. Finally, $f_1$ being a periodic function is more likely to account for the true target function than the first two cases, and the rightmost plot with $f_1(x) = \sin 8\pi x$ leads to a predictive mean very close to the true target function.
Next, the low fidelity data become noisy observations of the same four functions. As shown in panel (b), the increased variance in $f_1|\mathbf{X}_1, \mathbf{y}_1$ also raises the variance in $f$, especially when comparing the first two cases in (a) against those in (b). A dramatic difference can be found by comparing the third plot in (a) against that in (b). In (b), the presence of noise in the low fidelity data slightly raises the uncertainty in $f_1$, but the ensuing inference of $f$ generates a prediction which fails to contain most of the true target function within its model confidence. Thus, the likelihood that $f_1(x) = \sin 4\pi x$ is the probable low fidelity function is greatly reduced by the noise in the observations. Lastly, noise in observing $f_1(x) = \sin 8\pi x$ as the low fidelity data does not significantly change the inferred target function.
Therefore, the inductive bias associated with the target function is indeed controllable by the intermediate function distribution conditioned on the lower fidelity data. This observation motivates DGP learning from common single-fidelity regression data with the intermediate GPs conditioned on some optimizable hyperdata [39]. These hyperdata constrain the space of intermediate functions, and the uncertainty therein contributes to the expressiveness of the model.

6.3. Denoising Regression

Here we continue with the inference of the same target function, $f(x) = (x - \sqrt{2}) \sin^2 8\pi x$, but now the low fidelity data set consists of noisy observations of the target function itself; see Figure 5 for an illustration. We have 15 observations of $f$ with a noise level of 0.001 (red dots) as high fidelity data and 30 observations of the same function with a noise level of 0.1 (dark cross symbols) as the low fidelity data. We then follow the same procedure: infer $f_1$ with the low fidelity data, and use the conditional mean and covariance to construct the effective kernel for inferring the target function $f$ with the high fidelity data. Unlike the previous cases, the relation between $f$ and $f_1$ here is not clear. However, the structure of DGP can be viewed as the intermediate GP emitting infinitely many samples of $f_1$ into the exposed GP. Qualitatively, one can expect that the actual prediction for $f$ is the average over the GP models with different warpings $f_1$. Consequently, we may expect the variance in predicting $f$ to be reduced.
Indeed, as shown in Figure 5, the predictive variance obtained using a GP with the low fidelity (high noise) observations only is marked by the light-blue region around the corresponding predictive mean (light-blue solid line). When the statistical information in $f_1|\mathbf{X}_1, \mathbf{y}_1$ is transferred to the effective kernel, the new prediction possesses much tighter uncertainty (marked by the light-green shaded region) around the improved predictive mean (dark solid line), even in the region away from the low-noise observations.

6.4. Multi-Fidelity Data with High Dimensional Input

The work in [3], along with their public code in emukit [40], assembles a set of multi-fidelity regression data sets in which the input $\mathbf{x}$ is of high dimension. Here we demonstrate simulation results on these data (see [3] for details). The simulation is performed using effective kernels with the compositions SE[SE] and SC[SE] for the Borehole (two-fidelity) regression data set, and SE[SE[SE]] and SC[SC[SE]] for the Branin (three-fidelity) regression data set. The data are obtained by deploying the modules in [40]. Algorithm 1 is followed to obtain the results. The generalization performance is measured in terms of the mean negative log likelihood (MNLL). Table 1 displays the results using the same random seed for MF-DGP and our methods. We also include the simulation of standard GP regression with the high fidelity data only. It is seen that the knowledge about the low fidelity function is significant for predicting the high-level simulation (compare with the vanilla GP) and that the informative kernels have better performance in these cases.
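For reference, the MNLL reported in Table 1 is the mean negative log likelihood of held-out high fidelity targets under the Gaussian predictive distribution; a generic way to compute it is sketched below (the function name is ours).

```python
import numpy as np

def mean_negative_log_likelihood(y_test, pred_mean, pred_var):
    """MNLL of test targets under independent Gaussian predictions N(pred_mean, pred_var)."""
    y_test, pred_mean, pred_var = map(np.ravel, (y_test, pred_mean, pred_var))
    nll = 0.5 * np.log(2 * np.pi * pred_var) + 0.5 * (y_test - pred_mean) ** 2 / pred_var
    return float(np.mean(nll))
```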

7. Discussion

In this paper, we propose a novel kernel learning method which is able to fuse data of low fidelity into a prior for the high fidelity function. Our approach addresses two limitations of prior research on GPs: the need to choose or design kernels [8,9] and the lack of explicit dependence of the prediction on the observations (in the Student-t process [41] the latter is possible). We resolve the limitations associated with reliance on designing kernels by introducing new data-dependent kernels together with effective approximate inference. Our results show that the method is effective, and we show that our moment-matching approximation retains some of the multi-scale, multi-frequency, and non-stationary correlations that are characteristic of deep kernels, e.g., [33]. The compositional freedom [18] pertaining to hierarchical learning is also manifested in our approach.

8. Conclusions

Central to the allure of Bayesian methods, including Gaussian Processes, is the ability to calibrate model uncertainty through marginalization over hidden variables. The power and promise of DGP is in allowing rich composition of functions while maintaining the Bayesian character of inference over unobserved functions. Modeling the multi-fidelity data with the hierarchical DGP is able to exploit its expressive power and to consider the effects of uncertainty propagation. Whereas most approaches are based on variational approximations for inference and Monte Carlo sampling in the prediction stage, our approach uses a moment-based approximation in which the marginal prior of DGP is analytically approximated with a GP. For both, the full implications of these approximations are unknown. Continued research is required to understand the full strengths and limitations of each approach.

Author Contributions

Conceptualization, C.-K.L. and P.S.; methodology, C.-K.L.; software, C.-K.L.; validation, C.-K.L. and P.S.; formal analysis, C.-K.L.; investigation, C.-K.L.; resources, C.-K.L.; data curation, C.-K.L.; writing—original draft preparation, C.-K.L.; writing—review and editing, C.-K.L. and P.S.; visualization, C.-K.L.; supervision, P.S.; project administration, C.-K.L.; funding acquisition, P.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Air Force Research Laboratory and DARPA under agreement number FA8750-17-2-0146.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We would like to thank the authors of [22] for helpful correspondence.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
DGP   Deep Gaussian Process
SE    Squared Exponential
SC    Squared Cosine

Appendix A

Figure A1 shows two results of multi-fidelity regression with the same data. The left panel is obtained by jointly learning the hyperparameters in all layers with standard gradient descent on the approximate marginal likelihood, while the right panel is from learning the hyperparameters sequentially with Algorithm 1. It is noted that the right panel yields a higher log marginal likelihood.
Figure A1. Comparison between the joint learning (left) and the sequential learning with Algorithm 1 (right). The same 10 training data are shown by the red dots. The joint learning algorithm results in a log marginal likelihood of 1.65, while the alternative one yields 2.64. The hyperparameters are $\{\sigma_{1,2} = (3.3, 1.24), \ell_{1,2} = (0.12, 1.40)\}$ (left) and $\{\sigma_{1,2} = (1.22, 1.65), \ell_{1,2} = (0.08, 0.98)\}$ (right).

References

  1. Kennedy, M.C.; O’Hagan, A. Predicting the output from a complex computer code when fast approximations are available. Biometrika 2000, 87, 1–13. [Google Scholar] [CrossRef] [Green Version]
  2. Wang, Z.; Xing, W.; Kirby, R.; Zhe, S. Multi-Fidelity High-Order Gaussian Processes for Physical Simulation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 13 April 2021; pp. 847–855. [Google Scholar]
  3. Cutajar, K.; Pullin, M.; Damianou, A.; Lawrence, N.; González, J. Deep Gaussian processes for multi-fidelity modeling. arXiv 2019, arXiv:1903.07320. [Google Scholar]
  4. Salakhutdinov, R.; Tenenbaum, J.; Torralba, A. One-shot learning with a hierarchical nonparametric bayesian model. In Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, Bellevue, WA, USA, 2 July 2012; pp. 195–206. [Google Scholar]
  5. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6 August 2017; pp. 1126–1135. [Google Scholar]
  6. Titsias, M.K.; Schwarz, J.; Matthews, A.G.d.G.; Pascanu, R.; Teh, Y.W. Functional Regularisation for Continual Learning with Gaussian Processes. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  7. Rasmussen, C.E.; Williams, C.K.I. Gaussian Process for Machine Learning; MIT Press: Cambridge, MA, USA, 2006. [Google Scholar]
  8. Duvenaud, D.; Lloyd, J.; Grosse, R.; Tenenbaum, J.; Zoubin, G. Structure discovery in nonparametric regression through compositional kernel search. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 17–19 June 2013; pp. 1166–1174. [Google Scholar]
  9. Sun, S.; Zhang, G.; Wang, C.; Zeng, W.; Li, J.; Grosse, R. Differentiable compositional kernel learning for Gaussian processes. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4828–4837. [Google Scholar]
  10. Damianou, A.; Lawrence, N. Deep Gaussian processes. In Proceedings of the Artificial Intelligence and Statistics, Scottsdale, AZ, USA, 29 April–1 May 2013; pp. 207–215. [Google Scholar]
  11. Snelson, E.; Ghahramani, Z.; Rasmussen, C.E. Warped Gaussian processes. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 13–18 December 2004; pp. 337–344. [Google Scholar]
  12. Lázaro-Gredilla, M. Bayesian warped Gaussian processes. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1619–1627. [Google Scholar]
  13. Salimbeni, H.; Deisenroth, M. Doubly stochastic variational inference for deep Gaussian processes. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  14. Salimbeni, H.; Dutordoir, V.; Hensman, J.; Deisenroth, M.P. Deep Gaussian Processes with Importance-Weighted Variational Inference. arXiv 2019, arXiv:1905.05435. [Google Scholar]
  15. Yu, H.; Chen, Y.; Low, B.K.H.; Jaillet, P.; Dai, Z. Implicit Posterior Variational Inference for Deep Gaussian Processes. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 14502–14513. [Google Scholar]
  16. Lu, C.K.; Yang, S.C.H.; Hao, X.; Shafto, P. Interpretable deep Gaussian processes with moments. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 26–28 August 2020; pp. 613–623. [Google Scholar]
  17. Havasi, M.; Hernández-Lobato, J.M.; Murillo-Fuentes, J.J. Inference in deep Gaussian processes using stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 7506–7516. [Google Scholar]
  18. Ustyuzhaninov, I.; Kazlauskaite, I.; Kaiser, M.; Bodin, E.; Campbell, N.; Ek, C.H. Compositional uncertainty in deep Gaussian processes. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, Toronto, ON, Canada, 3 August 2020; pp. 480–489. [Google Scholar]
  19. Le Gratiet, L.; Garnier, J. Recursive co-kriging model for design of computer experiments with multiple levels of fidelity. Int. J. Uncertain. Quantif. 2014, 4, 365–386. [Google Scholar] [CrossRef]
  20. Raissi, M.; Karniadakis, G. Deep multi-fidelity Gaussian processes. arXiv 2016, arXiv:1604.07484. [Google Scholar]
  21. Perdikaris, P.; Raissi, M.; Damianou, A.; Lawrence, N.; Karniadakis, G.E. Nonlinear information fusion algorithms for data-efficient multi-fidelity modelling. Proc. R. Soc. Math. Phys. Eng. Sci. 2017, 473, 20160751. [Google Scholar] [CrossRef] [PubMed]
  22. Requeima, J.; Tebbutt, W.; Bruinsma, W.; Turner, R.E. The Gaussian process autoregressive regression model (gpar). In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Naha, Okinawa, Japan, 16–18 April 2019; pp. 1860–1869. [Google Scholar]
  23. Alvarez, M.A.; Rosasco, L.; Lawrence, N.D. Kernels for vector-valued functions: A review. arXiv 2011, arXiv:1106.6251. [Google Scholar]
  24. Parra, G.; Tobar, F. Spectral mixture kernels for multi-output Gaussian processes. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6684–6693. [Google Scholar]
  25. Bruinsma, W.; Perim, E.; Tebbutt, W.; Hosking, S.; Solin, A.; Turner, R. Scalable Exact Inference in Multi-Output Gaussian Processes. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1190–1201. [Google Scholar]
  26. Kaiser, M.; Otte, C.; Runkler, T.; Ek, C.H. Bayesian alignments of warped multi-output Gaussian processes. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, Canada, 3–8 December 2018; pp. 6995–7004. [Google Scholar]
  27. Kazlauskaite, I.; Ek, C.H.; Campbell, N. Gaussian Process Latent Variable Alignment Learning. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Naha, Okinawa, Japan, 16–18 April 2019; pp. 748–757. [Google Scholar]
  28. Williams, C.K. Computing with infinite networks. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 2–4 December 1997; pp. 295–301. [Google Scholar]
  29. Cho, Y.; Saul, L.K. Kernel methods for deep learning. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–8 December 2009; pp. 342–350. [Google Scholar]
  30. Duvenaud, D.; Rippel, O.; Adams, R.; Ghahramani, Z. Avoiding pathologies in very deep networks. In Proceedings of the Artificial Intelligence and Statistics, Reykjavik, Iceland, 22–25 April 2014; pp. 202–210. [Google Scholar]
  31. Dunlop, M.M.; Girolami, M.A.; Stuart, A.M.; Teckentrup, A.L. How deep are deep Gaussian processes? J. Mach. Learn. Res. 2018, 19, 2100–2145. [Google Scholar]
  32. Shen, Z.; Heinonen, M.; Kaski, S. Learning spectrograms with convolutional spectral kernels. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 26–28 August 2020; pp. 3826–3836. [Google Scholar]
  33. Wilson, A.G.; Hu, Z.; Salakhutdinov, R.; Xing, E.P. Deep kernel learning. In Proceedings of the Artificial Intelligence and Statistics, Cadiz, Spain, 9–11 May 2016; pp. 370–378. [Google Scholar]
  34. Daniely, A.; Frostig, R.; Singer, Y. Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity. In Proceedings of the NIPS: Centre Convencions Internacional Barcelona, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  35. Mairal, J.; Koniusz, P.; Harchaoui, Z.; Schmid, C. Convolutional kernel networks. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 8–13 December 2014; pp. 2627–2635. [Google Scholar]
  36. Van der Wilk, M.; Rasmussen, C.E.; Hensman, J. Convolutional Gaussian processes. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 2849–2858. [Google Scholar]
  37. GPy. GPy: A Gaussian Process Framework in Python. 2012. Available online: http://github.com/SheffieldML/GPy (accessed on 1 October 2021).
  38. Wilson, A.; Adams, R. Gaussian process kernels for pattern discovery and extrapolation. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1067–1075. [Google Scholar]
  39. Lu, C.K.; Shafto, P. Conditional Deep Gaussian Processes: Empirical Bayes Hyperdata Learning. Entropy 2021, 23, 1387. [Google Scholar] [CrossRef]
  40. Paleyes, A.; Pullin, M.; Mahsereci, M.; Lawrence, N.; González, J. Emulation of physical processes with Emukit. In Proceedings of the Second Workshop on Machine Learning and the Physical Sciences, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  41. Shah, A.; Wilson, A.; Ghahramani, Z. Student-t processes as alternatives to Gaussian processes. In Proceedings of the Artificial Intelligence and Statistics, Reykjavik, Iceland, 22–25 April 2014; pp. 877–885. [Google Scholar]
Figure 1. Sampling random functions from the approximate marginal prior q ( f ) which carries the effective kernel in Equation (12). The low fidelity data X 1 , y 1 , marked by the cross symbols, and the low fidelity function f 1 | X 1 , y 1 and the uncertainty are shown in the top (noiseless) and the third (noisy) rows. Top row: the uncertainty in X 1 , y 1 is negligible so f 1 is nearly a deterministic function, so the effective kernels are basically kernels with warped input. The corresponding samples from q are shown in the second row. Third row: the noise in X 1 , y 1 generates the samples in bottom row which carry additional high-frequency signals due to the non-stationary δ 2 in Equation (12).
Figure 2. Multi-fidelity regression with 30 observations (not shown) of low fidelity f 1 ( x ) = sin 8 π x and 10 observations (red dots) from the target function, f ( x ) = ( x 2 ) f 1 2 ( x ) (shown in red dashed line). Only the target prediction (solid dark) and associated uncertainty (shaded) are shown. Top row: (a) AR1, (b) LCM, (c) NARGP. Bottom row: (d) DEEP-MF, (e) MF-DGP, (f) our model with SE[SE] kernel.
Figure 3. Multi-fidelity regression on the low-level true function, h ( x ) = cos 15 x , with 30 observations and high-level one, f ( x ) = x exp [ h ( 2 x 0.2 ) ] 1 , with 15 observations. Top row: (a) AR1, (b) LCM, and (c) NARGP. Bottom row: (d) DEEP-MF, (e) MF-DGP, and (f) Our method with SE[SE] kernel.
Figure 4. Demonstration of compositional freedom and effects of uncertainty in low fidelity function f 1 on the target function inference. Given the same high fidelity observations of target function, four different sets of observations of f 1 ( x ) = x , f 1 ( x ) = tanh x , f 1 ( x ) = sin 4 π x , and f 1 ( x ) = sin 8 π x are employed as low fidelity data in inferring the target function. In panel (a), the low fidelity data are noiseless observations of the four functions. The true target function is partially outside the model confidence for the first two cases. In panel (b), the low fidelity data are noisy observations of the same four functions. Now the first three cases result in the inferred function outside the model confidence. The effect of uncertainty in low fidelity is most dramatic when comparing the third subplots in (a,b).
Figure 5. Denoising regression with 30 high-noise and 15 low-noise observations from the target function y = ( x 2 ) sin 2 8 π x (red dashed line). The uncertainty is reduced in the GP learning with the SE[SE] kernels.
Table 1. MNLL results of multi-fidelity regression.

            MF-DGP    SE[ ]    SC[ ]    GP + D_f
Borehole    −1.87     −2.08    −2.08    0.56
Branin      −2.7      −2.52    −2.93    5180
