
Paper | Open access

Metric learning for kernel ridge regression: assessment of molecular similarity


Published 22 September 2022 © 2022 The Author(s). Published by IOP Publishing Ltd
Citation: Raimon Fabregat et al 2022 Mach. Learn.: Sci. Technol. 3 035015. DOI: 10.1088/2632-2153/ac8e4f


Abstract

Supervised and unsupervised kernel-based algorithms widely used in the physical sciences depend upon the notion of similarity. Their reliance on pre-defined distance metrics, e.g. the Euclidean or Manhattan distance, is problematic, especially when used in combination with high-dimensional feature vectors for which the similarity measure does not reflect well the differences in the target property. Metric learning is an elegant approach to surmount this shortcoming and find a property-informed transformation of the feature space. We propose a new algorithm for metric learning specifically adapted for kernel ridge regression (KRR): metric learning for kernel ridge regression (MLKRR). It is based on the Metric Learning for Kernel Regression framework using the Nadaraya-Watson estimator, which we show to be inferior to the KRR estimator for typical physics-based machine learning tasks. The MLKRR algorithm allows for superior predictive performance on the benchmark regression task of atomisation energies of QM9 molecules, as well as generating more meaningful low-dimensional projections of the modified feature space.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Over the last decade, supervised and unsupervised machine learning methods have established themselves as reliable tools in the physical sciences [1–12]. Supervised models aim to find a mathematical relationship between input data and target properties. Kernel-based regression methods, e.g. kernel ridge regression (KRR) and Gaussian process regression (GPR), are popular in the field [3, 4, 13–15]. While equivariant neural networks have recently also become state-of-the-art models to predict molecular properties [16–19], KRR models remain important, particularly for small datasets. Unsupervised machine learning algorithms instead find underlying structure in unlabelled data. Dimensionality reduction methods like t-distributed stochastic neighbor embedding (t-SNE) [20], principal component analysis (PCA), multidimensional scaling or Isomap [21] enable the visualisation of an otherwise uninterpretable high-dimensional feature space [10–12]. A concept central to many supervised and unsupervised approaches is similarity: the underlying assumption is that points close in the feature space should also be close in property space. Similarity between two elements in a feature space may be constructed using a kernel, a function that outputs a scalar value between 1 (identical) and 0 (entirely dissimilar). Many kernel functions contain a decreasing exponential of a metric, such as the Euclidean distance (Gaussian kernel) or the Manhattan distance (Laplacian kernel). Alternative measures such as the linear kernel (the dot product between feature vectors) or the cosine similarity also exist.

Such pre-defined distance metrics pose problems, especially when a feature space is high-dimensional and possibly polluted with irrelevant or redundant features. Ab initio representations widely used in chemistry, materials science and condensed matter physics [3, 4] illustrate these limitations. These representations are designed to encode the relevant information to describe any molecular structure, typically using non-linear functions of atom types and positions. Popular examples are SLATM [22], SOAP [23] and FCHL [24, 25], among many others [4, 26, 27]. They are used to develop predictive models for a variety of properties [25, 26, 28–38] or to facilitate the visualisation and interpretation of the chemical space [10–12]. Yet, these representations do not provide a universally meaningful measure of molecular similarity, resulting in sub-optimal performance when more challenging properties are targeted [39]. This deficiency is intrinsically connected to the use of pre-defined metrics that treat all the features on an equal footing, regardless of their redundancy or relevance to a particular task. For instance, the Euclidean distance used in the Gaussian kernel ($d_E(\textbf{a},\textbf{b}) = \sqrt{\sum_i(a_i - b_i)^2} = ||\textbf{a} - \textbf{b}||_2$) is dominated by features with high variance, which we found [39] to not necessarily correlate with predictive capabilities. While redundancy in a feature space is easily eliminated by using unsupervised linear dimensionality reduction algorithms such as PCA and CUR decomposition [40], adapting the relevance of descriptive features in a metric for a particular target requires a supervised approach [41].

Metric learning (or distance metric learning, or similarity learning) [42, 43] provides an elegant solution to adapt the notion of similarity (and distance) according to the target property. Thus, similarity can be transformed from a global concept (two molecules are similar based on their structure) to an application-specific concept (two molecules are similar, in the context of their properties). This conceptual shift enables a more specific definition of similarity, which naturally circumvents many of the pitfalls of pre-defined similarity notions, as well as enabling more accurate machine learning models. Despite its promise, metric learning has remained largely absent in the chemical domain, except for a few examples on graph-structured data [44]. This is likely because most frameworks are designed for clustering or classification [45–47], while chemistry is dominated by regression tasks. We note that alternatives to metric learning exist with the same goal of optimising representations for a specific task: for example in the GPR framework [48], in (supervised) contrastive learning [49–52] or using deep belief networks [53, 54].

One of the few examples of metric learning algorithms focused specifically on regression tasks is metric learning for kernel regression (MLKR) developed by Weinberger and Tesauro [45]. MLKR learns a linear transformation matrix that transforms the distance metric between points in order to optimise the prediction of a particular target in a kernel regression. However, the kernel regression used in MLKR employs the non-parametric Nadaraya-Watson (NW) estimator, which, as demonstrated in this work, is less suited than the KRR estimator for regression tasks. The NW estimator, which is essentially a Nearest-Neighbours estimator, evaluates similarity between points in a relative manner using a notion that is dependent on the distribution of points within a particular dataset. In KRR, the notion of similarity is instead absolute, i.e. independent of the distribution of data. This is an important distinction if the aim is to accurately predict molecular properties. Within this context, identifying the trial molecule that is close (in the absolute sense) to the molecular system of interest is more relevant than identifying the closest points that potentially correspond to dissimilar molecules.

In order to enable metric learning in the KRR framework, we propose a new algorithm, metric learning for kernel ridge regression (MLKRR), which modifies the MLKR formalism for the KRR estimator. While this is not the first metric learning algorithm for KRR [42, 55], to our knowledge it is the first that is not an adaptation of an algorithm originally designed for clustering or classification. We choose to start from the MLKR algorithm [45] because of its focus on the regression task (though not on KRR). Finally, we demonstrate the improved performance of MLKRR on the prototypical task of regressing atomisation energies of small molecules in the QM9 dataset [56]. We expect that MLKRR could enable the widespread adoption of metric learning for kernel-based machine learning tasks in the physical sciences, thereby re-defining the fundamental notion of similarity upon which they depend.

2. Metric learning

Metric learning algorithms transform a feature space in order to construct a distance function that minimises the prediction error of a specific property. They learn a linear transformation matrix A that generates a Mahalanobis distance [57] ($d_M$) on the original space

Equation (1): $d_M(x_i, x_j) = ||A x_i - A x_j||_2 = \sqrt{(x_i - x_j)^T A^T A\, (x_i - x_j)}.$

Since $A^TA$ is positive semidefinite, there exist an orthogonal matrix Q and a non-negative diagonal matrix D such that $A^TA = Q^T D Q$, see [45]. Thus, the transformation of the feature space with A corresponds to a rotation and a scaling of the components. In this way, distances between points can be increased or decreased, depending on their relevance for the target property.
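For illustration, a minimal numpy sketch of the distance induced by an arbitrary matrix A, checking its equivalence to the quadratic form in $A^TA$ (all arrays below are random placeholders):

```python
import numpy as np

def mahalanobis_distance(a, b, A):
    """Distance induced by the linear map A: d_M(a, b) = ||A a - A b||_2."""
    diff = A @ (a - b)
    return np.sqrt(diff @ diff)

rng = np.random.default_rng(0)
a, b = rng.normal(size=4), rng.normal(size=4)   # random placeholder vectors
A = rng.normal(size=(4, 4))                     # random placeholder transformation

# Equivalent quadratic form with the positive semidefinite matrix A^T A.
M = A.T @ A
d_quadratic = np.sqrt((a - b) @ M @ (a - b))
assert np.isclose(mahalanobis_distance(a, b, A), d_quadratic)
```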

2.1. Metric learning for kernel regression

Weinberger and Tesauro [45] propose the following method to apply the transformation matrix A in the context of prediction. Suppose that the data points are $x_i \in \mathbb{R}^d$ and the corresponding target values are $y_i \in \mathbb{R}$, $i = 1,\dots,n$. The transformation is used in the Gaussian kernel as follows

Equation (2): $k_{ij} = k(x_i, x_j) = \exp\left(-\frac{||A x_i - A x_j||_2^2}{2\sigma^2}\right).$

The authors rely on the NW estimator

Equation (3): $\widehat{y}_i = \frac{\sum_{j \neq i} k_{ij}\, y_j}{\sum_{j \neq i} k_{ij}}.$

The prediction of a new data point x is then

Equation (4): $\widehat{y}(x) = \frac{\sum_{j} k(x, x_j)\, y_j}{\sum_{j} k(x, x_j)}.$

The learning of the matrix A is expressed as an optimisation problem. The goal is to minimise the residual sum of squares $\mathcal{L} : = \sum_{i}(y_i - \widehat{y_i})^2$. The optimisation then relies on the gradient of the loss with respect to the transformation matrix A

Equation (5): $\nabla_A \mathcal{L} = \frac{2}{\sigma^2}\, A \sum_i \frac{\widehat{y}_i - y_i}{Z_i} \sum_{j \neq i} k_{ij}\, (\widehat{y}_i - y_j)\, x_{ij}\, x_{ij}^T,$

where $x_{ij} : = x_i - x_j$ and $Z_i : = \sum_{j \neq i} k_{ij}$.
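For concreteness, a minimal numpy sketch of the leave-one-out NW predictions and the resulting MLKR loss, assuming a Gaussian kernel of the form $\exp(-||Ax_i - Ax_j||_2^2 / 2\sigma^2)$ (the width convention is an assumption of this sketch):

```python
import numpy as np

def nw_loo_predictions(X, y, A, sigma=1.0):
    """Leave-one-out Nadaraya-Watson predictions under the metric induced by A.

    Assumed kernel convention: k_ij = exp(-||A x_i - A x_j||^2 / (2 sigma^2)).
    """
    Z = X @ A.T                                           # transformed features
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    K = np.exp(-sq / (2.0 * sigma**2))
    np.fill_diagonal(K, 0.0)                              # exclude j = i, as in eq. (3)
    return (K @ y) / K.sum(axis=1)

def mlkr_loss(X, y, A, sigma=1.0):
    """Residual sum of squares minimised by MLKR with respect to A."""
    return np.sum((y - nw_loo_predictions(X, y, A, sigma)) ** 2)
```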

2.2. Metric learning for kernel ridge regression

We introduce MLKRR, which uses the standard parametric KRR estimator instead:

Equation (6): $\widehat{y}_i = \sum_{j} \alpha_j\, k(x_i, x_j).$

The prediction of a new data point x is then:

Equation (7): $\widehat{y}(x) = \sum_{j} \alpha_j\, k(x, x_j).$

The coefficients αj that minimise the sum of squares $\mathcal{L}_\alpha : = \sum_{i}(y_i - \widehat{y}_i)^2$ are obtained by solving the linear equation [58]:

Equation (8): $(K + \lambda I)\, \alpha = y,$

where $K \in \mathbb{R}^{n \times n}$ is the matrix defined by the Gaussian kernel $k(x_i,x_j)$, λ is a small regularisation parameter and I is the identity matrix. In classical KRR, the α coefficients are optimised for a fixed kernel. In the MLKR approach described above, on the other hand, only the matrix A is optimised. The novelty of our method [42] lies in the specific combination of both principles. We now explain the two-step procedure that first computes the estimator and then the metric.
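A minimal sketch of this estimator, again assuming the Gaussian kernel form used above; it solves equation (8) for α and evaluates predictions via equation (7):

```python
import numpy as np

def gaussian_kernel(X1, X2, A, sigma):
    """k(x, x') = exp(-||A x - A x'||^2 / (2 sigma^2)) for all pairs of rows."""
    Z1, Z2 = X1 @ A.T, X2 @ A.T
    sq = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma**2))

def fit_krr(X, y, A, sigma, lam=1e-9):
    """Solve (K + lambda I) alpha = y for the KRR coefficients, eq. (8)."""
    K = gaussian_kernel(X, X, A, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_krr(X_new, X_train, alpha, A, sigma):
    """y_hat(x) = sum_j alpha_j k(x, x_j), eq. (7)."""
    return gaussian_kernel(X_new, X_train, A, sigma) @ alpha
```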

We first sample nα data points $\{{x_i^{(\alpha)}} \}_{i = 1}^{n_\alpha}$ and their corresponding labels $\{{y_i^{(\alpha)}} \}_{i = 1}^{n_\alpha}$ from our dataset in order to compute the estimator from the coefficients α. It follows from the expression (8) above that α may be written in terms of the matrix A, hidden in the kernel matrix $K_{ij} : = k({x_i^{(\alpha)}}, {x_j^{(\alpha)}})$. Once the estimator is found, we use its predictions (7) in order to optimise the matrix A. For this, a new loss function is considered on new data points. We sample nA new data points $\{{x_i^{(A)}} \}_{i = 1}^{n_A}$ with their corresponding labels $\{{y_i^{(A)}} \}_{i = 1}^{n_A}$. The predictions of these points are hence given by

Equation (9): $\widehat{y}^{(A)}_i = \sum_{j} \alpha_j\, k\big({x_i^{(A)}}, {x_j^{(\alpha)}}\big),$

which in turn yield the second loss function to minimise $\mathcal{L}_A : = \sum_{i}(y^{(A)}_i - \widehat{y}^{(A)}_i)^2$. The computation of the matrix A is done by gradient descent using the gradient $\nabla_A\mathcal{L}_A$. To express the gradient in a closed form, we introduce the following notation.

The predictions (9) require the definition of the additional kernel matrix $Q_{ij}: = k({x_i^{(A)}}, {x_j^{(\alpha)}})$. Further, let ${X^{(\alpha)}}$ and ${X^{(A)}}$ be the matrices whose rows are the data points with the corresponding superscript, and let ${y^{(\alpha)}}$ and ${y^{(A)}}$ be the vectors whose entries are the corresponding property labels. The gradient $\nabla_A\mathcal{L}_A$ is then given by

Equation (10): $\nabla_A \mathcal{L}_A = \frac{2}{\sigma^2}\, A \left[ {X^{(\alpha)}}^T \big(\tilde{R} + \tilde{S} - \tilde{W} - \tilde{W}^T\big)\, X^{(\alpha)} - {X^{(A)}}^T R\, X^{(A)} - {X^{(\alpha)}}^T S\, X^{(\alpha)} + {X^{(A)}}^T W\, X^{(\alpha)} + {X^{(\alpha)}}^T W^T X^{(A)} \right],$

where $W_{ij} : = (\widehat{y}^{(A)}_i - {y_i^{(A)}}) \alpha_j Q_{ij}$, $\tilde{W}_{ab} : = K_{ab} \alpha_b \left[(K+\lambda I)^{-T} Q^T (\widehat{y}^{(A)} - {y^{(A)}}) \right]_{a}$. The diagonal matrices $R, S, \tilde{R}, \tilde{S}$ are the vertical and horizontal sums of W and $\tilde{W}$, that is $R_{ii} : = \sum_j W_{ij}$, and $S_{jj} : = \sum_i W_{ij}$. More on the definitions of these matrices, together with the proof of correctness of the expression is given in the appendix.
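The two-step procedure can be sketched as a single loss in A (a simplified stand-in for the actual implementation at github.com/lcmd-epfl/MLKRR), reusing the KRR helpers from the sketch above:

```python
import numpy as np

def mlkrr_loss(A_flat, X_alpha, y_alpha, X_A, y_A, sigma, lam=1e-9):
    """Two-step MLKRR loss L_A: fit alpha on the alpha-set, score on the A-set.

    Reuses gaussian_kernel / fit_krr / predict_krr from the KRR sketch above.
    """
    d = X_alpha.shape[1]
    A = A_flat.reshape(d, d)
    alpha = fit_krr(X_alpha, y_alpha, A, sigma, lam)    # eq. (8), alpha depends on A
    y_hat = predict_krr(X_A, X_alpha, alpha, A, sigma)  # eq. (9)
    return np.sum((y_A - y_hat) ** 2)                   # loss L_A
```

In practice, the closed-form gradient (10) (or automatic differentiation) is supplied to the optimiser; finite differences over all entries of A would be prohibitively slow for 720-dimensional features.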

Figure 1 illustrates the MLKRR algorithm at work. Initially, there is no correlation between the two principal components (i.e. high-variance components) of a high-dimensional dataset and the target property. The MLKRR algorithm learns a transformation matrix A by minimising the squared error between the predicted target property and the property labels of the training data. The matrix A redefines the distance metric on the original feature space. As illustrated in the right panel of figure 1, the high-variance components of the feature space now correlate with the target property. We note that here the matrix A is square, such that the transformation rotates and scales the components, but it could be made non-square to reduce the dimensionality of the feature space [42].


Figure 1. MLKRR transforms and rotates the components of the original feature space. A training set is divided into two sets, labelled α and A. The colours of the points represent the corresponding function values. The projection of the points with standard PCA does not show a correlation of the value with the first principal components, while the PCA projection of the points transformed with A (PCA') illustrates such a correlation. The transformed point-set features are less noisy and the relevant signals are amplified.


We implemented MLKR and MLKRR in a Python module, available at github.com/lcmd-epfl/MLKRR, based on the Python library metric-learn [59].

2.3. Conceptual comparison of MLKR and MLKRR

The different nature of the similarity measures used by the KRR and NW estimators affects their respective regression performance. For a given data point x, the KRR definition of similarity between x and a neighbour a is independent of any other points in the distribution. The NW definition of similarity, on the other hand, adapts to the distribution of the data, so that a point x is similar to a if a is the closest point to x (see figure 2(a)). Effectively, functions regressed with NW result in piece-wise surfaces resembling Voronoi diagrams, where each region of the feature space is dominated by the nearest data point. In contrast, KRR generates surfaces that transition smoothly, which generally results in superior prediction performance. This is clearly observed in figure 2(b), which shows the result of a simple regression on a 2D feature space.
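The contrast can be reproduced in a few lines on a toy two-dimensional function; the function, sample size and kernel width below are arbitrary choices for illustration, not those of figure 2:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(60, 2))
y_train = np.sin(X_train[:, 0]) * np.cos(X_train[:, 1])     # toy 2D target
X_test = rng.uniform(-3, 3, size=(1000, 2))
y_true = np.sin(X_test[:, 0]) * np.cos(X_test[:, 1])

gamma = 0.5  # shared kernel width for both estimators

# KRR: smooth interpolant, similarity is absolute.
krr = KernelRidge(kernel="rbf", gamma=gamma, alpha=1e-6).fit(X_train, y_train)
y_krr = krr.predict(X_test)

# NW: kernel-weighted average, similarity is relative to the training distribution.
sq = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)
y_nw = K @ y_train / K.sum(axis=1)

print("MAE KRR:", np.abs(y_krr - y_true).mean())
print("MAE NW :", np.abs(y_nw - y_true).mean())
```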


Figure 2. (a) Depiction of the different approaches to define similarity between KRR and NW. (b) Target function depending on two features and the corresponding regressions based on the KRR and the NW estimators. The two models were trained on a limited sample of data points marked as crosses in the first plot.


3. Computational details

3.1. Dataset

All models were built using a random subset of 24 000 molecules from the QM9 dataset [56] of 134 000 three-dimensional structures and corresponding atomisation energies of small drug-like molecules (up to nine heavy atoms). The initial set was then split into 20 000 for training, 2000 for validation (fitting hyperparameters for the KRR, and tracking the accuracy of the models throughout optimisation) and 2000 for testing (used for the out-of-sample error in the learning curves). An equivalent model for the HOMO-LUMO gap (20% gain in performance) is discussed in the supplementary material.
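A sketch of such a split (the seed and index bookkeeping are illustrative, not the exact procedure used here):

```python
import numpy as np

rng = np.random.default_rng(42)              # arbitrary seed for this illustration
idx = rng.permutation(24_000)                # indices into the pre-selected QM9 subset
train_idx = idx[:20_000]
valid_idx = idx[20_000:22_000]
test_idx = idx[22_000:]
```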

3.2. FCHL representation

Molecules are represented using the Faber-Christensen-Huang-Lilienfeld 19 (FCHL19) representation [25] as implemented in the qml Python package [60], but other representations could have been used. An equivalent model trained using the BoB [27] representation shows similar improvement and is given in the supplementary material. The default settings were used for all of the FCHL19 parameters. Local representations were converted to global ones by summing all of the atomic contributions, such that each final representation is a vector of 720 features.
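A sketch of the local-to-global summation; random arrays stand in here for the per-molecule FCHL19 atomic representations of shape (n_atoms, 720):

```python
import numpy as np

def to_global(atomic_rep):
    """Sum the atomic (local) rows of one molecule into a global feature vector."""
    return atomic_rep.sum(axis=0)

# Random placeholders for per-molecule atomic representations of shape (n_atoms, 720).
rng = np.random.default_rng(0)
mols = [rng.random((n_atoms, 720)) for n_atoms in (9, 12, 17)]
X = np.vstack([to_global(m) for m in mols])  # shape (n_molecules, 720)
```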

3.3. Optimisation details

The MLKR and MLKRR algorithms were optimised until convergence, which took around 2000 steps in both cases (see figure 3). The minimize function of the SciPy [61] library, using the L-BFGS-B [62] algorithm, was employed to optimise the matrix A, which was initialised as the identity matrix. For MLKRR, we used nα = nA, although some of our tests suggest that $n_\alpha \lt n_A$ could be superior. In order to reduce overfitting in MLKRR, the training data were reshuffled into the A and α sets every 30 optimisation steps. The starting kernel width for MLKRR was σ = 55, and the regularisation parameter was fixed at λ = 1 × 10−9. These are the optimal values for the standard KRR, obtained with a grid search on the test set.
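A toy version of this optimisation loop (scaled-down dimensions and random data; it reuses the mlkrr_loss sketch from section 2.2 and, lacking the analytic gradient (10), lets minimize fall back on finite differences):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, n = 8, 200                                        # toy sizes, not the 720-D FCHL case
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

A = np.eye(d)                                        # identity initialisation
for epoch in range(5):
    perm = rng.permutation(n)                        # reshuffle the alpha- and A-sets
    X_a, y_a = X[perm[: n // 2]], y[perm[: n // 2]]
    X_b, y_b = X[perm[n // 2:]], y[perm[n // 2:]]
    # mlkrr_loss is the sketch from section 2.2; sigma = 1.0 suits the toy data.
    res = minimize(mlkrr_loss, A.ravel(),
                   args=(X_a, y_a, X_b, y_b, 1.0, 1e-9),
                   method="L-BFGS-B", options={"maxiter": 30})
    A = res.x.reshape(d, d)
```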


Figure 3. Learning curves showing the evolution of the mean absolute error (MAE) throughout the optimisation process for our implementations of MLKR and MLKRR. Both converge after around 2000 steps.


4. Results and discussion

4.1. Learning transformation matrices

Training the MLKR and MLKRR algorithms consists of optimising the relevant transformation matrices A. Figure 3 illustrates the evolution of the optimisation process for the two algorithms. Despite the noisier optimisation of the MLKRR, which is a result of the periodic reshuffling of the datasets to optimise both A and α, both algorithms converge after approximately 2000 steps.

After the optimisation process, we obtain the optimised transformation matrices A, which are visualised in figure 4. Both matrices $A_{\mathrm{MLKR}}$ and $A_{\mathrm{MLKRR}}$ contain many non-zero values, both along the diagonal and in off-diagonal terms. Modifications to diagonal terms are effectively a re-weighting of the original features, akin to the application-based re-weighting of terms by Ceriotti et al [63]. Here, however, the re-weighting is entirely learned by the algorithm. Diagonal terms with $|A_{ii}|\gt1$ indicate that a higher weight is applied to specific features, whereas those with $|A_{ii}|\lt1$ indicate that a lower weight is applied; $A_{ii} = 0$ indicates that the feature is dropped entirely, reducing the dimensionality of the feature space. Off-diagonal terms $A_{ij}$, $i\neq j$, mix the original features into linear combinations. The highly correlated nature of the FCHL features is illustrated by the fact that the transformation matrices are rather smooth (shown in the panels on the right of figure 4). After convergence, the largest terms in the transformation matrices are the diagonal values. This is partially due to the fact that the initial guess for the optimisation process is the identity matrix. It also indicates that the original features are useful to predict the target property with an appropriate re-weighting (i.e. a selection and amplification of relevant features for the target property). The off-diagonal terms act to combine features that are originally correlated, i.e. they generate new features. Thus, the metric learning process naturally applies a feature selection protocol.


Figure 4. Colourmaps representing the learned transformation matrices A for MLKR ($A_{\mathrm{MLKR}}$) and MLKRR ($A_{\mathrm{MLKRR}}$). The left-hand panels illustrate all matrix elements, whereas the right-hand panels illustrate zoomed-in components. Both transformation matrices amplify the diagonal components in order to select relevant features for the target property, as well as choosing off-diagonal components to reduce the number of redundant features. The colour scale is capped to the range [−2, 2] to facilitate the visualisation of the off-diagonal elements (the highest absolute values of the matrices are below 5).


4.2. Improving predictions of atomisation energies of QM9 molecules

With our trained transformation matrices in hand, we now proceed to evaluate the new feature space for the prediction of atomisation energies of QM9 molecules. As we showed in our previous work [39], unless modified by feature selection or metric learning protocols, feature variance is not necessarily correlated with predictive capabilities. Here, since we have modified the feature space and the corresponding distance metric, we expect the variance to again correlate with the relevance of features for a target property. In figure 5, the variances of the original FCHL features and of the modified FCHL$_{\mathrm{MLKR}}$ and FCHL$_{\mathrm{MLKRR}}$ features are shown. The original and transformed features do not represent exactly the same information, as the transformed features are constructed as linear combinations of the original ones. Nevertheless, the original features are still the dominant components, as seen from the diagonals of the A matrices in figure 4. As observed, FCHL$_{\mathrm{MLKRR}}$, and to a lesser extent FCHL$_{\mathrm{MLKR}}$, makes use of almost all of the feature space. This suggests that the metric learning procedure effectively manipulates all of the features to eliminate irrelevance and redundancy.
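The comparison of figure 5 amounts to computing per-feature variances of the transformed features; a sketch, where X_fchl, A_mlkr and A_mlkrr are placeholders for the FCHL matrix and the learned transformations:

```python
import numpy as np

def transformed_variances(X, A):
    """Per-feature variances of the metric-learned features X A^T."""
    return (X @ A.T).var(axis=0)

# Placeholder comparison mirroring figure 5:
# var_fchl  = X_fchl.var(axis=0)
# var_mlkr  = transformed_variances(X_fchl, A_mlkr)
# var_mlkrr = transformed_variances(X_fchl, A_mlkrr)
```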


Figure 5. Feature variances of the original FCHL and the metric-learned alternatives (FCHLMLKR and FCHLMLKRR) on the QM9 dataset (left), with a zoom on the [0, 5] range of the y axis (right).


The goal of the metric learning procedure is to improve predictions of the target property. In figure 6, we compare the ability of the distance metrics learned by MLKR and MLKRR to improve predictions in a KRR model. The learning curves illustrate the evolution of the MAE with increasing amounts of training data. In the left panel of figure 6, we observe that the metric obtained with MLKRR offers a significant improvement over the original: the learning curve decreases faster and the MAE of the final model is ∼38% lower. The metric obtained with MLKR, on the other hand, in fact performs worse than the original one: the learning curve is shifted upwards and the performance of the final model is ∼24% worse. This suggests that the distance metric learned by the MLKR framework does not necessarily transfer to KRR. Since the MLKR optimisation procedure relies on the NW estimator, we suspect that the nature of this estimator, compared to that of KRR, results in a metric that is less suitable for regression tasks.


Figure 6. Learning curves showing the evolution of the MAE on a validation set of 2000 structures obtained using a KRR model (left), and the distribution of errors on the same set obtained with a KRR model constructed using 10 000 data points from the training set (right), generated using the FCHL representation and the altered variants obtained with MLKR (FCHL$_{\mathrm{MLKR}}$) and MLKRR (FCHL$_{\mathrm{MLKRR}}$).


4.3. Dimensionality reduction

We illustrate the effects of our metric-learned similarity measures by constructing 2D projections of the QM9 dataset using the feature spaces spanned by FCHL, FCHL$_{\mathrm{MLKR}}$ (FCHL $\cdot \textbf{A}_{\textrm{MLKR}}^T$) and FCHL$_{\mathrm{MLKRR}}$ (FCHL $\cdot \textbf{A}_{\textrm{MLKRR}}^T$). t-SNE projects data points into an N-dimensional space (typically N = 2) by minimising the difference between the pairwise affinities of points in the original and projected spaces. The affinity between two points in t-SNE is defined as:

Equation (11): $p_{j|i} = \frac{\exp\left(-||x_i - x_j||^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-||x_i - x_k||^2 / 2\sigma_i^2\right)},$

akin to the weights of the kernel average for the predictions in the NW estimator of MLKR (see equation (3)). PCA instead performs the projection such that the new dimensions are those with the highest variance in the original data. It does not rely on an explicit distance as t-SNE does, but still selects the principal components by evaluating $\mathbf{X}^T \mathbf{X}$ (where X is the matrix of feature vectors), which is analogous to a distance.
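A sketch of how such projections can be obtained with scikit-learn from the original or transformed features (default parameters, not necessarily those used for figure 7):

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_2d(X, A=None, method="pca", random_state=0):
    """2D projection of the (optionally metric-learned) features X A^T."""
    Z = X if A is None else X @ A.T
    if method == "pca":
        return PCA(n_components=2).fit_transform(Z)
    return TSNE(n_components=2, random_state=random_state).fit_transform(Z)

# e.g. coords = project_2d(X_fchl, A_mlkrr, method="tsne")
```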

The t-SNE and PCA maps obtained with FCHL, FCHL$_{\mathrm{MLKR}}$ and FCHL$_{\mathrm{MLKRR}}$ are shown in figure 7, where each point represents a molecule and is coloured by its atomisation energy. Distinct differences between the maps are observed. The t-SNE and PCA maps obtained with the FCHL$_{\mathrm{MLKR}}$ features tend to organise into local clusters. Yet, there is no global coherence between the organisation of the projected points and the coloured target property. The FCHL$_{\mathrm{MLKRR}}$ maps instead resemble more closely those of the original FCHL, albeit organised into larger regions offering a better coherence with respect to the target property. Overall, MLKRR improves the organisation of the data to uncover patterns (figure 7) relevant to optimising the prediction of the targeted properties (figure 6).


Figure 7. 2D projections of FCHL, FCHLMLKR and FCHLMLKRR constructed using t-SNE and PCA algorithms (first and second row, respectively). The colourmap shows the atomisation energy for each compound.


The proposed MLKRR algorithm offers additional functionalities beyond maximising prediction accuracy for specific applications. Careful analysis of the learned transformation matrices provides valuable insights into why some representations perform better than others [15]. In addition, the algorithm offers a mathematical route to construct a hierarchy of molecular representations adaptable and tailored to specific targets, as an alternative to building representations that encode the relevant information in absolute terms [4]. Comparisons between the similarities and metrics learned for a wide range of applications might also uncover hidden trends useful for developing improved and more general molecular representations. Finally, MLKRR could also be exploited in transfer-learning or meta-learning approaches: a concept which has so far been limited to neural network applications [64].

5. Conclusions

Similarity-based machine learning methods are widely used in the physical sciences. Their dependence on a pre-defined distance metric is problematic in combination with high-dimensional feature vectors that often contain irrelevant or redundant features. To address this shortcoming, we introduce an algorithm, MLKRR, which re-defines the notion of similarity between points to optimise the prediction of specific target properties in KRR tasks. MLKRR was shown to offer improved performance (a 38% reduction in MAE) on the prototypical regression task of atomisation energies of the QM9 dataset [56], as well as to generate more meaningful low-dimensional projections of the transformed feature vectors. In addition, we illustrate why the MLKRR algorithm is more suited to the prediction of continuous target properties, as is typical in the physical sciences, than the related MLKR algorithm based on the NW estimator.

Acknowledgments

R F and C C acknowledge funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant 817977). P v G and C C acknowledge the National Centre of Competence in Research (NCCR) 'Sustainable chemical process through catalysis (Catalysis)' of the Swiss National Science Foundation (SNSF, Grant No. 180544) for financial support. F E acknowledges support from the Swiss National Science Foundation (SNSF) (Grant 185030). M H, F E and C C acknowledge the FSB Intrafaculty Seed Funding. We thank Mojmir Mutny and Alberto Fabrizio for helpful discussions.

Data availability statement

The data that support the findings of this study are openly available at the following URL/DOI: https://github.com/lcmd-epfl/MLKRR.

Appendix: Derivation of MLKR and MLKRR gradients

Appendix. Derivation of the MLKR gradient in equation (5)

Starting from the top, we find

$\nabla_A \mathcal{L} = \nabla_A \sum_i (y_i - \widehat{y}_i)^2 = -2 \sum_i (y_i - \widehat{y}_i)\, \frac{\partial\, \widehat{y}_i}{\partial A},$

where $\widehat{y}_i = \frac{1}{Z_i} \sum_{j\neq i} k_{ij} y_j$ and $Z_i = \sum_{j \neq i} k_{ij}$.

This naturally leads us to compute $\frac{\partial \, \widehat{y}_i}{\partial A}$, and hence $\frac{\partial k_{ij}}{\partial A}$. Firstly, the kernel derivative is

Equation (12): $\frac{\partial k_{ij}}{\partial A} = -\frac{k_{ij}}{\sigma^2}\, A\, x_{ij}\, x_{ij}^T,$

where $x_{ij} : = x_i - x_j$.

Consequently, one has

$\frac{\partial\, \widehat{y}_i}{\partial A} = \frac{1}{Z_i} \sum_{j \neq i} (y_j - \widehat{y}_i)\, \frac{\partial k_{ij}}{\partial A} = -\frac{1}{\sigma^2 Z_i}\, A \sum_{j \neq i} k_{ij}\, (y_j - \widehat{y}_i)\, x_{ij}\, x_{ij}^T,$

from which the conclusion stems naturally.

Appendix. Derivation of the MLKRR gradient of equation (10)

In the following, the derivation of the gradient (10) and its precise definition are given. The notations defined in section 2.1 will be reused, along with the new ones introduced below.

We consider two different kernel matrices: one between ${X^{(\alpha)}}$ and itself and one between ${X^{(A)}}$ and ${X^{(\alpha)}}$ (denoted K and Q respectively in equation (10)). We also denote by $H := K + \lambda I$ the regularised kernel matrix that defines the coefficients α in KRR.

To simplify notation, we let the index i range over $\{1, \dots, n_A \}$, while the indices $j, a$, and b range over $\{1, \dots, n_\alpha\}$. We further drop the superscripts on x and y when the indices suffice to determine them, and use the same letter K for both kernels.

As above, we start from the following expression

$\nabla_A \mathcal{L}_A = -2 \sum_i (y_i - \widehat{y}_i)\, \frac{\partial\, \widehat{y}_i}{\partial A},$

where $\frac{\partial \, \widehat{y}_i}{\partial A} = \sum_j \frac{\partial}{\partial A}(\alpha_j k_{ij})$. Remembering that both α and K depend on A, we therefore aim to compute $\frac{\partial k_{ij}}{\partial A}$ and $\frac{\partial \alpha_j}{\partial A}$.

Recall that the kernel derivative was already computed in (12),

$\frac{\partial k_{ij}}{\partial A} = -\frac{k_{ij}}{\sigma^2}\, A\, x_{ij}\, x_{ij}^T,$

where $x_{ij} : = x_i - x_j$.

The second term requires treating the derivative of $H^{-1}$, the details of which are spared. In fact, routine computation yields that

$\frac{\partial H^{-1}}{\partial A} = -H^{-1}\, \frac{\partial K}{\partial A}\, H^{-1},$

and hence that

$\frac{\partial \alpha}{\partial A} = \frac{\partial H^{-1}}{\partial A}\, y^{(\alpha)} = -H^{-1}\, \frac{\partial K}{\partial A}\, \alpha.$

Combining both derivatives gives the following expression for the gradient.

Equation (13): $\nabla_A \mathcal{L}_A = \frac{2}{\sigma^2}\, A \left[ \sum_{a,b} K_{ab}\, \alpha_b \left[ H^{-T} K^T \big(\widehat{y}^{(A)} - y^{(A)}\big) \right]_a x_{ab}\, x_{ab}^T \;-\; \sum_{i,j} \big(\widehat{y}_i - y_i\big)\, \alpha_j\, K_{ij}\, x_{ij}\, x_{ij}^T \right]$

We recall that K has two different definitions depending on its indexing. To clarify, we set $K \in \mathbb{R}^{n_\alpha \times n_\alpha}$ the kernel between ${X^{(\alpha)}}$ and itself; and we set $Q \in \mathbb{R}^{n_A \times n_\alpha}$ the kernel between ${X^{(A)}}$ and ${X^{(\alpha)}}$.

The lemma below allows us to turn this expression into matrix form. Indeed, note that both sums are of the form $\sum_{i,j} W_{ij} (x_i - x_j)(x_i-x_j)^T$, where $x_i$ and $x_j$ are rows of ${X^{(A)}}$ or ${X^{(\alpha)}}$ depending on their index.

Lemma 1. Consider two matrices A and B with rows $\{a_i\}$, $\{b_j\}$ of matching dimensions, and let $x_{ij} = a_i - b_j$. Set further the matrix

Equation (14): $\sum_{i,j} W_{ij}\, x_{ij}\, x_{ij}^T,$

with some coefficients $W_{ij}$ making up a matrix W. Then

$\sum_{i,j} W_{ij}\, x_{ij}\, x_{ij}^T = A^T R\, A + B^T S\, B - A^T W\, B - B^T W^T A,$

where R and S are both diagonal matrices with $R_{ii} = \sum_j W_{ij}$, and $S_{jj} = \sum_i W_{ij}$.

Applying the lemma to both expressions in (13) leads us to construct the matrices $W \in \mathbb{R}^{n_A \times n_\alpha}$ and $\tilde{W} \in \mathbb{R}^{n_\alpha \times n_\alpha}$ in the following way.

Finally, the diagonal matrices $R, S, \tilde{R}, \tilde{S}$ given by the lemma conclude the derivation.

Proof of lemma 1.

The right-hand side of (14) splits into four sums, which make up each term of the result.

Note that the last two terms are transposes of each other, so that only one has to be treated. Additionally, for a general matrix M of appropriate dimensions, one can easily verify that

$\sum_{i,j} M_{ij}\, a_i\, b_j^T = A^T M\, B.$

Taking $M = W, R$, and S yields the desired result.
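A quick numerical check of the lemma's identity on random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
nA, na, d = 5, 7, 3
A_rows = rng.normal(size=(nA, d))                 # rows a_i
B_rows = rng.normal(size=(na, d))                 # rows b_j
W = rng.normal(size=(nA, na))

# Left-hand side: sum_ij W_ij (a_i - b_j)(a_i - b_j)^T
lhs = sum(W[i, j] * np.outer(A_rows[i] - B_rows[j], A_rows[i] - B_rows[j])
          for i in range(nA) for j in range(na))

# Matrix form from the lemma, with R and S the row and column sums of W.
R = np.diag(W.sum(axis=1))
S = np.diag(W.sum(axis=0))
rhs = (A_rows.T @ R @ A_rows + B_rows.T @ S @ B_rows
       - A_rows.T @ W @ B_rows - B_rows.T @ W.T @ A_rows)

assert np.allclose(lhs, rhs)
```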


Supplementary data (4.6 MB PDF)
