Abstract
Quantum kernel methods are a promising approach in quantum machine learning, owing to the guarantees associated with them. Their accessibility for analytic considerations also opens up the possibility of prescreening datasets based on their potential for a quantum advantage. To do so, earlier works developed the geometric difference, which can be understood as a closeness measure between two kernel-based machine learning approaches, most importantly between a quantum kernel and a classical kernel. Since this metric links the quantum and classical model complexities and was developed to bound generalization error, the question arises of how it behaves in an empirical setting. In this work, we investigate the effects of hyperparameter choice on the model performance and the generalization gap between classical and quantum kernels. The importance of hyperparameters is well known in classical machine learning as well. Of special interest are hyperparameters associated with the quantum Hamiltonian evolution feature map, as well as the number of qubits to trace out before computing a projected quantum kernel. We conduct a thorough investigation of the hyperparameters across 11 datasets, and we identify certain aspects that can be exploited. Analyzing the effects of certain hyperparameter settings on the empirical performance, as measured by cross validation accuracy, and on the generalization ability, as measured by the geometric difference described above, brings us one step closer to understanding the potential of quantum kernel methods on classical datasets.
1 Introduction
Assessing the performance of quantum models on current quantum computers in tackling machine learning tasks on classical datasets is an enduring ambition of the research community. Many strides have been made addressing this subject, from theory-driven approaches that showed encouraging results for specific models, e.g., investigating the difference in capacity between classical and quantum models (Abbas et al. 2021) and showcasing a model that allows sampling from a circuit that is hard to simulate classically (Coyle et al. 2020), to more empirically driven projects that showed promising results for specific models on specific use cases, e.g., medical classification with hybrid quantum-classical convolutional neural networks (Monnet et al. 2023) and generation of high-resolution MNIST datapoints with hybrid generative neural networks (Rudolph et al. 2022). However, making an overall claim of quantum advantage for a broader class of models on a broader set of use cases remains challenging, especially due to the notorious lack of theoretical understanding of certain sub-fields of machine learning, such as deep learning (DL) (Lin et al. 2017). Nevertheless, in recent years, a body of literature has emerged linking DL models to kernel methods (KMs) (Daniely et al. 2017; Jacot et al. 2020; Lee et al. 2018), followed by literature drawing similar conclusions for the quantum domain (Rad 2023; Incudini et al. 2022; Schuld 2021). This is good news due to the favorable qualities of KMs, such as convex loss landscapes, which make it possible to establish convergence, and hence trainability, guarantees (Schuld 2021). This is a significant advantage over DL models, which infamously suffer from non-convex loss landscapes, and over their quantum counterparts, which suffer from barren plateaus (McClean et al. 2018). Focusing on KMs therefore allows for a more theoretically sound argument, while at the same time the insights gained for KMs can be applied to a broader range of models due to the aforementioned interrelationships.
Huang et al. (2021), arguing from the standpoint of kernel generalization, challenged a common line of reasoning for claiming quantum advantage that is often used in the quantum machine learning (QML) literature. In their work, they showed that the classical intractability of the function represented by a quantum model is not sufficient to claim a quantum advantage: a classical learner can learn to approximate the quantum output if a sufficient amount of data is available. Following this argumentation, they presented a method that allows probing specific quantum and classical kernels on a specific dataset to assess the potential for a quantum kernel to exhibit superior performance. The method consists of two steps: (1) a geometric difference (GD) measure is computed between the kernel matrices, which determines whether the kernels are different enough for a potential quantum advantage to exist; (2) the model complexities of the classical and quantum models are compared to determine whether their discrepancy is sufficient (a high classical and a low quantum complexity are desirable).
In this work, we strive to comprehend the capacity of quantum kernels on classical datasets. Our ambition is to understand the possible gap between classical and quantum models and whether the GD can be a helpful ally for this purpose. However, given the rather theoretical nature of this metric, we first need to deepen our grasp of its empirical relevance. To this end, we study how the choice of hyperparameters influences the GD and the empirical performance of quantum kernel methods (QKMs). We performed an experiment spanning multiple classical datasets, a large hyperparameter search grid, and different kernel setups. Additionally, for some datasets, we artificially saturate the difference between classical and quantum learners by creating an artificial quantum labeling function following the ideas of Huang et al. (2021), to expand our understanding of the GD metric. We empirically establish which hyperparameters QKMs are most sensitive to and which are hence the most crucial to optimize. Our results indicate that the GD depends on the individual hyperparameters as much as the accuracy does, but that their optimal values differ greatly between the two metrics. These insights open new research avenues towards understanding which use cases might be best suited for QKMs.
2 Background
Previous work by Moussa et al. (2022) investigated the importance of hyperparameters of quantum neural networks, evaluating the performance of different hyperparameter settings on multiple datasets. From their experimental results, it follows that the choice of data embedding, the depth of the quantum circuit, and the learning rate of the optimizer are the most important hyperparameters. In this work, we draw inspiration from their experimental design, but we adapt our search grid to hyperparameters that are meaningful in the quantum kernel method setting and conduct additional analysis to determine the versatility of the GD as a potential metric for hyperparameter tuning. This section provides background information on our approach.
2.1 Kernel functions
A kernel function \(k(\textbf{x},\textbf{x}'): \mathbb {R}^D \times \mathbb {R}^D \rightarrow \mathbb {R}\) is a bivariate, real-valued function that can be interpreted as a similarity measure between \({\textbf{x}, \textbf{x}'} \in \mathbb {R}^D\); it is symmetric, \(k(\textbf{x},\textbf{x}') = k(\textbf{x}',\textbf{x})\), and non-negative, \(k(\textbf{x},\textbf{x}')\ge 0\). For any finite set \(\mathcal {X}_N = \{\textbf{x}_i\}_{i=1}^N \subset \mathbb {R}^D\), we can construct a Gram matrix defined by
\(K_{ij} = k(\textbf{x}_i,\textbf{x}_j), \quad \textbf{x}_i, \textbf{x}_j \in \mathcal {X}_N,\)
which is Hermitian and positive semi-definite (PSD). According to Mercer’s theorem, a Gram matrix with this property allows the kernel function to be represented as an inner product of feature vectors \(k(\textbf{x},\textbf{x}') = \phi (\textbf{x})^T \phi (\textbf{x}')\), where \(\phi : \mathbb {R}^D \rightarrow \mathcal {H}\) (Murphy 2012) is a feature map that maps the original input space into a higher-dimensional feature Hilbert space.
Depending on the choice of the feature map \(\phi \), we can generate kernels with different properties. For example, for \(\textbf{x} \in \mathbb {R}^D\), we can consider the simplest (linear) map \(\phi (\textbf{x}) = \textbf{x}\), which yields the linear kernel \(k(\textbf{x},\textbf{x}') = \textbf{x}^T \textbf{x}'\).
One of the most widely used maps \(\phi (\textbf{x})\) generates the radial basis function (RBF) kernel, defined as
\(k(\textbf{x},\textbf{x}') = \exp \left( -\gamma \Vert \textbf{x}-\textbf{x}'\Vert ^2 \right) ,\)    (1)
where \(\gamma = \sigma ^{-2}\) and \(\sigma \) is known as the bandwidth. The feature space of this kernel is infinite dimensional, which becomes apparent if we consider the Taylor expansion of Eq. 1.
Kernels are employed by a variety of machine learning algorithms. One of the most prominent examples is the support vector machine (SVM), a binary linear classifier. The main objective of an SVM is to find a separating hyperplane between two classes of data with the maximal separating margin. However, the data might not be linearly separable in the input space \(\mathbb {R}^D\), in which case mapping the data into a higher-dimensional feature space \(\mathcal {H}\) is advantageous. Moreover, one can avoid expensive explicit computations in a high- or infinite-dimensional space by utilizing the kernel trick: one merely computes the inner product of two feature vectors without ever accessing the high-dimensional space.
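As a minimal illustration of this setup (a sketch with toy data; the dataset, feature count, and \(\gamma \) value are placeholders and not the ones used in our experiments), an SVM can be trained directly on a precomputed Gram matrix, which is exactly the interface that later accepts quantum kernel matrices:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Toy data standing in for a preprocessed dataset (illustrative only)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 5))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(10, 5))

# Explicit kernel trick: the SVM only ever sees Gram matrices, never the feature map itself
K_train = rbf_kernel(X_train, X_train, gamma=1.0)   # k(x, x') = exp(-gamma * ||x - x'||^2)
K_test = rbf_kernel(X_test, X_train, gamma=1.0)     # kernel values between test and training points

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K_train, y_train)
y_pred = clf.predict(K_test)
```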
2.2 Hamiltonian evolution feature map
For a quantum algorithm to operate on classical data, the data has to be embedded into a quantum state \(\phi : \textbf{x}_i \rightarrow | \textbf{x}_i \rangle \), which lives in a complex Hilbert space. The significance of the choice of the embedding operation for the performance of quantum models has been highlighted in the literature (Abbas et al. 2021; Schuld et al. 2021). In this work, we concentrate on the Hamiltonian evolution feature map, as it has proven to be powerful in the QKM setting (Canatar et al. 2023; Huang et al. 2021; Shaydulin and Wild 2022). This feature map was developed for many-body problems; it requires \(n +1\) qubits to embed n features and is defined as follows:
where \(H_j^{XYZ} = (X_j X_{j+1} + Y_j Y_{j+1} + Z_j Z_{j+1})\) with \(X_j, Y_j, Z_j\) being Pauli operators acting on the j-th qubit, \(| \psi _j \rangle \) is a Haar random state, and t and T are hyperparameters denoting the total evolution time and the number of Trotter steps, respectively. The significance of the hyperparameter t for QKMs has been highlighted by Shaydulin and Wild (2022). Our results support and extend their work.
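The sketch below shows how such an embedding circuit could be assembled with PennyLane. It is illustrative only: the exact Trotter layout, the ordering of the interaction terms, and the handling of the Haar random initial state in our implementation may differ, and the gate-angle convention (PennyLane's IsingXX(\(\phi \)) implements \(e^{-i\phi /2\, XX}\)) is made explicit in the comments.

```python
import numpy as np
import pennylane as qml

n_features = 5
n_qubits = n_features + 1                      # the feature map needs n + 1 qubits for n features
dev = qml.device("default.qubit", wires=n_qubits)

def haar_random_state(n_qubits, seed=0):
    """A Haar random pure state, sampled as a normalized complex Gaussian vector."""
    rng = np.random.default_rng(seed)
    vec = rng.normal(size=2**n_qubits) + 1j * rng.normal(size=2**n_qubits)
    return vec / np.linalg.norm(vec)

@qml.qnode(dev)
def embed(x, t=0.5, T=9, seed=0):
    # Initial Haar random state (qml.StatePrep is called qml.QubitStateVector in older releases)
    qml.StatePrep(haar_random_state(n_qubits, seed), wires=list(range(n_qubits)))
    # T Trotter steps; each feature x_j drives an XX + YY + ZZ interaction on qubits (j, j+1)
    for _ in range(T):
        for j, xj in enumerate(x):
            theta = t * xj / T
            # qml.IsingXX(phi) implements exp(-i * phi/2 * XX), hence the factor of 2
            qml.IsingXX(2 * theta, wires=[j, j + 1])
            qml.IsingYY(2 * theta, wires=[j, j + 1])
            qml.IsingZZ(2 * theta, wires=[j, j + 1])
    # Full density matrix of the embedded state, used later to build the projected kernels
    return qml.density_matrix(wires=list(range(n_qubits)))

rho = embed(np.random.default_rng(1).uniform(-1, 1, size=n_features))
```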
2.3 Quantum kernels
An inner product of two quantum states defines the quantum fidelity kernel
\(k(\textbf{x},\textbf{x}') = \vert \langle \textbf{x} \vert \textbf{x}' \rangle \vert ^2 .\)    (3)
Quantum kernels embed data into an exponentially high-dimensional Hilbert feature space, which is intractable for classical computers. The high dimensionality of this space has to be treated with caution, as approximating a given function well then requires \(N \ge \Omega (2^n)\) training datapoints, where n is the number of logical qubits (Huang et al. 2021; Kübler et al. 2021; Nakaji et al. 2022; Thanasilp et al. 2022). This is due to a problem known as exponential concentration (Thanasilp et al. 2022). The same logic holds for the quantum fidelity kernel of mixed quantum states \(k(x, x') = Tr(\rho (x) \rho (x'))\) (Nakaji et al. 2022). Reducing the dimensionality of the space by projecting it onto a smaller subspace, thereby creating so-called projected quantum kernels, increases the generalization capacity of these methods. To achieve this, we can utilize reduced density matrices (RDMs) \(\rho ^{(q)}(x) = Tr_{\ne q} (\Phi (\rho (x)))\), where \(Tr_{\ne q}\) denotes the partial trace over all qubits outside the subsystem q (so that \(\rho ^{(q)}\) is the RDM of subsystem q) and \(\Phi \) is a trace-preserving map. Depending on the choice of the map \(\Phi \), different proposals for projected kernels can be found in the literature (Huang et al. 2021; Kübler et al. 2021). As generalized by Nakaji et al. (2022), these projected (RDM-based) kernels can be categorized into inner product-based kernels defined as
where \(\alpha _k > 0\) are coefficients, and distance-based defined as
where the Frobenius norm is denoted by \(||\cdot ||_F\), \(\gamma >0\) is a hyperparameter, and \(\alpha _k > 0\) are coefficients. Here, \(q_k\) denotes the k-th of the \(N_K\) possible subsystems of size K.
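To make the construction concrete, the sketch below builds both projected kernels from full density matrices by computing all size-K RDMs via partial traces. The exact placement of the coefficients \(\alpha _k\) and of \(\gamma \) follows our reading of the definitions generalized by Nakaji et al. (2022) and should be treated as an assumption rather than a verbatim restatement of Eqs. 4 and 5.

```python
import numpy as np
from itertools import combinations

def rdm(rho, keep, n_qubits):
    """Reduced density matrix of the qubits listed in `keep` (all other qubits traced out)."""
    rho = rho.reshape([2] * (2 * n_qubits))
    for q in sorted(set(range(n_qubits)) - set(keep), reverse=True):
        rho = np.trace(rho, axis1=q, axis2=q + rho.ndim // 2)
    d = 2 ** len(keep)
    return rho.reshape(d, d)

def projected_kernels(dms, n_qubits, K, gamma=1.0):
    """Inner product-based and distance-based projected kernels built from full density matrices.
    Subsystems are all size-K qubit subsets; alpha_k = 1/N_K as in the main text."""
    subsystems = list(combinations(range(n_qubits), K))
    alpha = 1.0 / len(subsystems)
    rdms = [[rdm(rho, q, n_qubits) for q in subsystems] for rho in dms]
    N = len(dms)
    k_inner = np.zeros((N, N))
    k_dist = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            inner = sum(np.real(np.trace(a @ b)) for a, b in zip(rdms[i], rdms[j]))
            dist = sum(np.linalg.norm(a - b, ord="fro") ** 2 for a, b in zip(rdms[i], rdms[j]))
            k_inner[i, j] = alpha * inner                 # inner product-based kernel entry
            k_dist[i, j] = np.exp(-gamma * alpha * dist)  # distance-based kernel entry
    return k_inner, k_dist
```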
2.4 Geometric difference (GD)
Huang et al. (2021) challenged a common argument for claiming quantum advantage that is often used in the QML literature. In their work, they showed that the classical intractability of the function f(x) represented by a quantum model is not sufficient to claim a quantum advantage. Under certain circumstances, a classical learner h(x) can learn to approximate f(x) given a sufficient number N of independent samples from a data distribution \(\mathcal {D}\). They showed that the expected error between f(x) and h(x) can be upper bounded as follows:
\(\mathbb {E}_{\textbf{x} \sim \mathcal {D}}\, \vert h(\textbf{x}) - f(\textbf{x}) \vert \le c \sqrt{s_K(N)/N},\)    (6)
where \(s_K(N)\) is the model complexity of the trained function h(x) and \(c>0\) is a constant. Consequently, with \(N \propto s_K(N) / \epsilon ^2\) datapoints, a classical model can learn to predict the quantum function f(x) up to an error \(\epsilon \). Thus, if the best classical learner h(x) has a low model complexity \(s_K(N)\), it will not require much data N to learn the outputs of the quantum model f(x). It follows that for a quantum advantage to exist, the separation between the complexities of the quantum and classical models should be as large as possible. Huang et al. (2021) considered kernel models in their work and defined an asymmetric GD-based test between kernel matrices that captures this separation on a specific dataset. This test is defined as follows:
\(g(K_C \Vert K_Q) = \sqrt{\left\Vert \sqrt{K_Q}\, K_C^{-1} \sqrt{K_Q} \right\Vert _{\infty }},\)    (7)
where \(K_C\) and \(K_Q\) are the classical and quantum kernel matrices with \(Tr(K_Q)=Tr(K_C)=N\), respectively, and \(\Vert \cdot \Vert _{\infty }\) denotes the spectral norm. The distance-based kernel meets this trace requirement, while the inner product-based projected kernel requires rescaling according to \(\tilde{K}_{Q} = N\cdot K_{Q}/Tr(K_{Q})\), because the diagonal of its kernel matrix contains purity-related values (see Appendix 1). Importantly, this test relies only on the GD between the embedded datapoints, without considering the labeling function. The GD links the classical model complexity \(s_{K_C}(N)\) and the quantum model complexity \(s_{K_Q}(N)\) according to
\(s_{K_C}(N) \le g(K_C \Vert K_Q)^2 \, s_{K_Q}(N),\)    (8)
which also connects the prediction error bounds given in Eq. 6.
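A direct implementation of the GD test is short; the sketch below restates Eq. 7 numerically (with an optional regularization term \(\lambda \), anticipating the relabeling procedure in Section 3.4) and assumes both kernel matrices have already been rescaled to \(Tr(K)=N\).

```python
import numpy as np
from scipy.linalg import sqrtm

def geometric_difference(k_classical, k_quantum, lam=0.0):
    """GD test g(K_C || K_Q): square root of the spectral norm of
    sqrt(K_Q) (K_C + lam*I)^(-1) sqrt(K_Q), following Huang et al. (2021)."""
    n = k_classical.shape[0]
    sqrt_kq = np.real(sqrtm(k_quantum))                     # matrix square root of the PSD quantum kernel
    inv_kc = np.linalg.pinv(k_classical + lam * np.eye(n))  # (optionally regularized) inverse
    m = sqrt_kq @ inv_kc @ sqrt_kq
    return float(np.sqrt(np.linalg.norm(m, ord=2)))         # ord=2 is the spectral norm
```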
3 Experimental setup
In this work, we investigate how different parameter settings affect the empirical performance of quantum kernels, as well as their generalization ability, as determined by the GD. We have designed an experiment to evaluate a variety of different hyperparameter settings. In this section, we describe the parameters of the search grid used in our experiment. We varied the types of models and their hyperparameters and tested their performance on various datasets.
3.1 Models
In this work, we combine classical SVMs (see Section 2) from the sklearn library with quantum kernels implemented with the PennyLane framework on simulators. To obtain a kernel, we first embed the classical datapoints with the Hamiltonian evolution feature map as in Eq. 2 and save the full DM of the resulting quantum state. These DMs are used to compute the RDM kernels described in Eq. 4 (with additional rescaling to \(Tr(K_Q)=N\)), Eq. 5, and Eq. A1. From here on, we refer to these kernels as inner and distance kernels, respectively. In all cases, we set \(\alpha _k = 1/N_K\). In Appendix 6, we discuss the case \(\alpha _k = 1\).
From these models, we extract the relevant hyperparameters. From the SVM, we consider the regularization hyperparameter C. From the feature map, we adopt the number of Trotterization steps T and the evolution time t (as in Eq. 2). The random seed for the initial Haar random state is set to different values, although it cannot be considered a hyperparameter. From the quantum kernels, we take into account the size of the RDM subsystem K (as in Eqs. 4 and 5) and, in the case of the distance kernel, the bandwidth \(\gamma \). All hyperparameters of the search grid and their corresponding value ranges are summarized in Table 1.
For a classical baseline, we train SVMs with the classical kernels listed in Section 2: from basic kernels (linear and polynomial) to more complex ones (RBF and Laplacian) and the (not necessarily PSD) sigmoid kernel. For this comparison, the C and \(\gamma \) parameters are explored over the same ranges as in the quantum case, with the addition of more standard classical \(\gamma \) choices, such as 1/D and \(1 / (D \cdot \sigma (\mathcal {X}^D))\), with D being the dimensionality of the feature space and \(\sigma (\mathcal {X}^D)\) the variance of the dataset.
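A sketch of this baseline is given below, assuming preprocessed arrays X_train, y_train, and X_test (hypothetical names). The built-in sklearn options gamma="auto" and gamma="scale" correspond to the two standard choices 1/D and \(1/(D \cdot \sigma (\mathcal {X}^D))\), and the Laplacian kernel, which is not built into SVC, is supplied as a precomputed Gram matrix.

```python
from sklearn.svm import SVC
from sklearn.metrics.pairwise import laplacian_kernel

def classical_baselines(X_train, y_train, X_test, C=1.0):
    """Fit the classical baseline SVMs and return their test predictions (illustrative sketch)."""
    preds = {}
    for kernel in ["linear", "poly", "rbf", "sigmoid"]:
        # gamma="scale" corresponds to 1/(D * Var(X)), gamma="auto" to 1/D
        clf = SVC(kernel=kernel, C=C, gamma="scale")
        preds[kernel] = clf.fit(X_train, y_train).predict(X_test)
    # The Laplacian kernel is passed as a precomputed Gram matrix
    gamma = 1.0 / (X_train.shape[1] * X_train.var())
    clf = SVC(kernel="precomputed", C=C)
    clf.fit(laplacian_kernel(X_train, X_train, gamma=gamma), y_train)
    preds["laplacian"] = clf.predict(laplacian_kernel(X_test, X_train, gamma=gamma))
    return preds
```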
3.2 Metrics
To rate the importance of the hyperparameters, we need to correlate them with the model’s performance. In this paper, we consider performance metrics, namely the accuracy on the test dataset and the mean accuracy of a fivefold cross validation on the training dataset, as well as the generalization metric, the geometric difference (GD), introduced in Section 2.4. The GD measures the difference between two kernel matrices, so in our studies, we consider the GD from the quantum kernel matrix to each of the classical kernel matrices.
3.3 Datasets
To train and test our models, we selected the 11 different datasets listed in Table 2, which have been previously used in the QML literature (Huang et al. 2021; Moradi et al. 2022; Moussa et al. 2022; Shaydulin and Wild 2022). Due to simulation time restrictions, we compress each dataset's feature space to five dimensions, which translates to 6 qubits according to Eq. 2, and restrict it to 200 datapoints (or fewer features/datapoints if fewer were available). In this study, we restrict ourselves to binary classification tasks, so for the multi-class dataset 8 (Fashion MNIST), only two classes are selected (T-shirt/Top and Dress) and the rest are discarded. Datapoints with incomplete or missing features are removed, as are duplicate datapoints. In addition, the datasets were balanced. Features with a variance of less than 0.001 were eliminated. All features were normalized to zero mean and unit variance. To reduce the dimensionality, all features were sorted according to their analysis of variance (ANOVA) F-value, and the top 5 were selected. Different preprocessing routines are discussed in Appendix 4.
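The preprocessing can be sketched with standard sklearn components as follows. This is our own minimal re-implementation under the stated assumptions; the class selection and balancing steps are omitted for brevity.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

def preprocess(X: pd.DataFrame, y: pd.Series, n_features: int = 5, n_points: int = 200):
    """Sketch of the preprocessing of Section 3.3 (class balancing omitted)."""
    X = X.dropna().drop_duplicates()                  # remove incomplete and duplicate datapoints
    y = y.loc[X.index]
    X = X.loc[:, X.var() > 1e-3]                      # drop features with variance below 0.001
    X_std = StandardScaler().fit_transform(X)         # zero mean, unit variance
    X_top = SelectKBest(f_classif, k=min(n_features, X_std.shape[1])).fit_transform(X_std, y)
    return X_top[:n_points], y.to_numpy()[:n_points]  # keep at most 200 datapoints
```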
3.4 Artificial labels
For a more in-depth study of the GD, we additionally employ a method for generating artificial labels for classical datasets that are more favorable for quantum learners, as proposed in Huang et al. (2021). This is done by first augmenting (7) to a matrix of the form \(\sqrt{K_Q}(K_C + \lambda Id)^{-1}\sqrt{K_Q}\) with the regularization term \(\lambda = 1.1\). We select the eigenvector \(\textbf{v}\) that corresponds to the largest absolute eigenvalue, and the intermediate new labels are calculated as \(y = \sqrt{K_Q}\textbf{v}\). These labels are binarized by setting them to 1 if they are greater than the median of all y and to 0 otherwise. The learning task with these new labels should now saturate (8). However, the kernel matrices in this scenario are formed using the complete dataset, resulting in slight deviations from this saturation upon separation into training and test sets. To improve generalization, 5% of the labels are randomly set to 0. We use different Haar random initial states for relabeling and learning. Most of the analysis is performed on the initial datasets; the relabeled ones are discussed in Section 4.3.
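Putting these steps together yields the following sketch. The function and argument names are ours, and the kernel matrices are assumed to be computed on the full dataset as described above.

```python
import numpy as np
from scipy.linalg import sqrtm

def quantum_favorable_labels(k_classical, k_quantum, lam=1.1, zero_frac=0.05, seed=0):
    """Generate labels intended to saturate the complexity bound, following Section 3.4."""
    rng = np.random.default_rng(seed)
    n = k_classical.shape[0]
    sqrt_kq = np.real(sqrtm(k_quantum))
    m = sqrt_kq @ np.linalg.inv(k_classical + lam * np.eye(n)) @ sqrt_kq
    eigvals, eigvecs = np.linalg.eigh(m)               # m is symmetric, so eigh is appropriate
    v = eigvecs[:, np.argmax(np.abs(eigvals))]         # eigenvector of the largest absolute eigenvalue
    y_cont = sqrt_kq @ v                               # intermediate continuous labels
    labels = (y_cont > np.median(y_cont)).astype(int)  # binarize at the median
    idx = rng.choice(n, size=int(zero_frac * n), replace=False)
    labels[idx] = 0                                    # set 5% of the labels to 0 to improve generalization
    return labels
```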
3.5 Analysis
After all results from the experiments are aggregated, the data is prepared for further analysis. The choice of the kernel function is treated as a hyperparameter; the two kernel functions are encoded as the binary variable basis. We also exclude the hyperparameter C (as it has no influence on the GD) from the main analysis and discuss it separately in Section 4.2.6. For datasets 1 and 8, we removed the highest 3% of each GD due to the presence of significant outliers. These outliers can occur when the classical kernel matrix in (7) is nearly singular, making the GD exceedingly large.
We start the analysis by investigating the importance of each hyperparameter. Based on the insights from our prior experiments, we chose the Gradient Boosted Decision Trees (GBDT) from sklearn, which is fitted to the experimental data. The features correspond to the hyperparameters and the labels to the accuracies of the earlier classification tasks. Such a GBDT consists of hundreds of decision trees. Depending on how often a certain feature (hyperparameter) is used for a split point, a feature importance can be derived. More precisely, from this model, an impurity-based feature importance, also known as the Gini importance, can be extracted.
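In code, this importance extraction reduces to a few lines; the sketch below assumes the grid results are collected in a DataFrame named results with one row per setting (the variable and column names are hypothetical).

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

HYPERPARAMS = ["t", "T", "K", "gamma", "basis", "seed"]   # "basis" encoded as a binary variable

def gini_importance(results: pd.DataFrame, metric: str) -> pd.Series:
    """Fit a GBDT that predicts `metric` (e.g. CV accuracy or a GD) from the hyperparameter
    columns of `results` and return its impurity-based (Gini) feature importances."""
    gbdt = GradientBoostingRegressor(random_state=0)
    gbdt.fit(results[HYPERPARAMS], results[metric])
    return pd.Series(gbdt.feature_importances_, index=HYPERPARAMS)
```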
Additionally, we analyze the marginal distribution of each hyperparameter, i.e., an average over all other hyperparameters, to capture the effect of that particular hyperparameter on the accuracy and on the GD to the classical RBF kernel. Apart from the mean, we also track the standard deviation, as it provides the necessary information regarding the reliability of the marginal distribution: a large standard deviation indicates a small importance of that particular hyperparameter for the metric, and vice versa. The impact of each hyperparameter on performance and GD is evaluated using the GBDT and these marginal plots.
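Under the same assumption of a results DataFrame, a marginal is simply a grouped mean and standard deviation:

```python
import pandas as pd

def marginal(results: pd.DataFrame, hyperparam: str, metric: str) -> pd.DataFrame:
    """Marginal of `metric` with respect to one hyperparameter: for each of its values,
    the mean and standard deviation over all other hyperparameters (and, optionally, datasets)."""
    return results.groupby(hyperparam)[metric].agg(["mean", "std"])
```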
4 Results
Figure 1 visualizes the range of accuracies across the different search grid settings as well as the best-achieved performances of the classical and quantum learners (for more details on the best learners, see Appendix 2). Since the quantum models achieve results comparable to their classical counterparts, we can assume that the explored hyperparameter ranges allowed for sufficient expressivity. Nevertheless, the best classical model can barely be surpassed. The selected datasets range from easy to difficult to learn, and the spread of accuracies across the different parameter settings varies from dataset to dataset.
4.1 Hyperparameter importance
We analyze whether accuracy and the GD metrics assign the same importance to the hyperparameters. For that, we extract Gini importance from the GBDT model fitted to the data extracted from the search grid experiment as described in Section 3.5. Figure 2 displays the average importance found for the different hyperparameters across all the datasets for performance as well as for the GD to the classical RBF kernel. See Fig. 16 for a breakdown of hyperparameter importance by all metrics.
Contrary to our initial expectations, both the performance and GD exhibit minimal sensitivity to variations in the value of K. Also, the number of Trotterization steps T has no significant effect on any of the metrics studied. These results are consistent with Shaydulin and Wild (2022), where t and T hyperparameters were investigated as well. This analysis also found that the random seed for the Haar initial state had the lowest importance for every metric, but as it cannot be interpreted as a hyperparameter, it will not be discussed further.
Judging from the results in Fig. 2, the hyperparameter t is assigned the most importance by both the accuracy and the GD metrics. The choice of the basis and \(\gamma \) also has a strong effect on these metrics, as both assign these two parameters a similar importance that ranks them jointly in second place. When examining the standard deviations, it is apparent that the importance of each hyperparameter for the GD is more consistent across the datasets than its importance for the accuracy.
4.2 Metric dependence on individual hyperparameters
In this section, the effects of individual hyperparameters are discussed. For the most part, we base our assessments on the marginals of the hyperparameters (see Section 3.5 for further details). The hyperparameters are analyzed in the order of their importance. As expected, the cross validation (CV) aligns closely with the test accuracy in all marginals and is therefore well suited for tuning all hyperparameters.
4.2.1 Evolution time (t)
Figure 3 visualizes the marginals of the t parameter for the accuracy and the GD. The standard deviations are illustrated as shaded regions and indicate the importance of t for each of those metrics: if a standard deviation is small when averaging over all other parameters and datasets, the metric does not depend on those other parameters as strongly as on t. The figure highlights that the t values for which a good performance was achieved correspond to a rather small GD, and as the predictions get worse, the potential for quantum advantage increases. This behavior is consistent, and sometimes even more pronounced, when looking at individual datasets (Figs. 20 and 21) or at two-dimensional marginals such as Figs. 4 and 5. Furthermore, of all hyperparameters, t causes the largest differences between the choices of basis. The distance kernel seems to be more resilient for smaller values of t, as demonstrated in Fig. 19. As t increases, the GD begins to rise earlier when the distance kernel is used.
4.2.2 (Basis) of the kernel function
In order to have a fair comparison between the distance kernel and the inner kernel, the \(\gamma \) parameter is not averaged over; rather, the maximal value over this parameter is chosen throughout Section 4 whenever the choice between the two kernels is treated as a parameter (Section 3.5). This is further motivated by the fact that \(\gamma \), along with C, is the computationally cheapest hyperparameter to optimize, since the kernel can be re-evaluated without having to recompute the distances between DMs. Figure 6 shows the accuracies achieved for the specific choices of kernel function basis over all other parameter settings (except for \(\gamma \)) and datasets. If \(\gamma \) is optimized, the distance basis outperforms the other two. In Fig. 18, we observe a different situation when \(\gamma \) is averaged like all the other parameters, but this is mostly due to the broad range (and thus the many suboptimal values) we explored.
4.2.3 Gamma (\(\gamma \))
The second hyperparameter controlling the bandwidth is \(\gamma \). This parameter is only relevant for the distance kernel (5), which is why Figs. 7, 23, and 24 are restricted to the distance kernel. In contrast to the behavior seen in Section 4.2.1, the GD here does not necessarily run opposite to the accuracy, as can be seen from Figs. 7, 23, and 24. One behavior observed for all datasets is that all the GDs increase with \(\gamma \). For most datasets, the accuracy has a maximum in the range of 0.1–10, with varying curvature around that point across the different datasets. \(\gamma \) is correlated with other parameters, as seen in Figs. 4 and 5. For example, the optimal value for t shifts to smaller values with larger \(\gamma \), which makes sense as the kernel values generally decrease when either of these parameters increases.
4.2.4 Size of the reduced density matrices (K)
Here, we investigate the RDMs, which provide a means to tackle the exponential concentration problem (Thanasilp et al. 2022). Because datasets 3 and 4 have only 4 features, they are excluded from all RDM marginals in Section 4, as they would bias the \(K=6\) results. The trend in our results indicates a slight increase in accuracy with this hyperparameter, but a minimal effect overall, while the GD increases notably with K, as displayed in Figs. 5 and 8. Both observations are consistent across all datasets (Figs. 25 and 26). Furthermore, Fig. 2 also indicates the difference in importance.
In Appendix 6, we explore an alternative option for \(\alpha _k\) in Eqs. 4 and 5, which primarily impacts the influence of K on the metrics. Furthermore, artificial labels for various K lead to different observations depending on \(\alpha _k\).
4.2.5 Number of trotterization steps (T)
We found that varying T has minimal impact on performance or GD for most datasets, which can be deduced from the large standard deviations in Fig. 9 and the almost absent variation of any metric with increasing T (see also Figs. 27 and 28). In Fig. 27, only dataset 9 shows higher accuracy with larger T.
4.2.6 Regularization strength (C)
Because C does not affect the kernel, the GD does not depend on it. The accuracy, however, shows a notable dependence, especially for smaller values (Fig. 10). The pattern is as expected for this parameter: a strong increase in performance up to a global maximum, after which the quality of the predictions slightly decreases as C increases further. There are large discrepancies between the datasets as to where that maximum occurs and how steep the slopes before and after it are (compare Fig. 29).
4.3 Relabeled datasets
With the method described in Section 3.4, we relabeled datasets for different quantum kernels. The motivation behind this is to detect differences between the original datasets and datasets that are more favorable for the quantum methods. Since the GD does not depend on the labels, this solely affects the accuracy. As the initial state has a negligible influence on the results (Section 4.1), we limit ourselves to a single random state in this section.
We found that relabeling shifts the optimal hyperparameter values toward the setting for which the relabeling was performed. As an example, Fig. 11 shows the marginal accuracies for the original dataset 6 compared to 2 relabeled versions for the settings \(basis = distance\), \(T=9\), \(K=6\), \(\gamma = 1\), and either \(t = 0.5\) or \(t = 32\). Furthermore, the GD is plotted for exactly these choices of basis, T, K, and \(\gamma \). The marginal dependency on t illustrates that the relabeling has shifted the optimal values of this hyperparameter to the settings for which the relabeling was performed. This effect was also observed for other parameters.
When the relabeling is done for \(t=0.5\), the separation between the best classical and the best quantum accuracy is smaller compared to relabeling for \(t=32\), as displayed in Table 3. This is expected because the GD increases with t, as shown in Fig. 11. Similarly, higher values of K also result in a larger GD (see Section 4.2.4) in the case of \(t=32\), which results in a larger separation. However, when \(t=0.5\), the GD is too small to significantly improve the quantum model’s ability to learn the artificial labels in comparison to the classical models. Small deviations in the separation can be explained by the process described in Section 3.4: it includes applying a regularization term (\(\lambda Id\)), utilizing a train-test split, and incorporating additional randomness, all of which could explain some variance in the separation. Similar to Table 3, we discuss Table 5 in Appendix 6, where we set \(\alpha _k = 1\) instead of \(\alpha _k = 1/N_K\) in Eqs. 4 and 5.
5 Pipeline for investigating new datasets
In this section, we consolidate the lessons learned into a coherent pipeline for investigating new datasets with QKMs in combination with the Hamiltonian evolution feature map. These insights are specific to our experimental setup (6-qubit systems in a simulation environment, datasets preprocessed as outlined in Section 3.3); however, we believe that they might hold true for a broader set of use cases.
Our experiments have shown the importance of the hyperparameter t, which is responsible for the data scaling and whose importance has been emphasized by Canatar et al. (2023) and Shaydulin and Wild (2022). This raises the interesting question of whether there is a good universal scaling factor for the Hamiltonian evolution feature map. To investigate this, we determined the values of t that provide the best accuracy on average (over all other hyperparameter settings). The inner and distance kernels are considered separately, as the distance kernel has an additional parameter \(\gamma \), which exerts control over the scaling as well. As can be seen in Fig. 12, the datasets are scaled down to a similar range for each of the two kernel types. From this, it follows that scaling the data down to a standard deviation of approximately 0.49 for the inner kernel and 0.22 for the distance kernel is a good initial guess.
The findings from the relabeled datasets argue against a universally optimal dataset scaling. As shown in Section 4.3, the optimal scale of the input data there would actually be far from the \([-1,1]\) range, because the best t value is \(t=32\). The suggested range is therefore a good initial guess, based on the fact that it was observed for all the original datasets, but there is no guarantee connected with it, especially for datasets that might have an inductive bias favoring a QML model.
Similar to the scaling discussion above, across many datasets, the best values for t are usually within the narrow range [0.1, 1] for the inner kernel. The distance kernel also works well for even smaller t (Fig. 19), since this can be compensated for with \(\gamma \). The hyperparameter \(\gamma \) is correlated with t (Fig. 5) and has an optimal range of [0.1, 10] (Section 4.2.3).
In our study, we observed that T had little influence and could be set, for example, to \(T=9\) or even smaller to avoid deep circuits. The accuracy, on the other hand, was barely affected by K, but projecting appeared to hinder the GD. Based on our findings, \(K=6\) would be the optimal selection for our setup. Moreover, this choice is the least computationally expensive, since the number of possible subsystems of size K scales as the binomial coefficient \(\binom{N}{K}\) (up to \(N!/((N/2)!)^2\) at \(K=N/2\)), where N is the number of qubits, whereas \(K=N\) yields only a single subsystem.
The pipeline described above can be summarized as follows:
- Choose \(K=N\), \(T=9\), and any Haar random initial state.
- Explore t in the interval [0.1, 1], starting with the value that scales the data into ranges similar to those in Fig. 12.
- Choose the inner kernel to avoid hyperparameter optimization and for easier realization; choose the distance kernel for better performance at the cost of optimizing \(\gamma \) (optimize \(\gamma \) and C in a combined fashion). A minimal configuration sketch is given after this list.
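As a minimal illustration, the reduced search grid implied by this pipeline could look as follows; the dictionary keys and the concrete values are ours and purely illustrative.

```python
# Hypothetical reduced search grid implied by the pipeline (values are illustrative)
reduced_grid = {
    "K": [6],                      # K = N, i.e., no projection
    "T": [9],                      # fixed number of Trotter steps
    "seed": [0],                   # a single Haar random initial state
    "t": [0.1, 0.25, 0.5, 1.0],    # evolution time explored in [0.1, 1]
    "basis": ["inner", "distance"],
    "gamma": [0.1, 1.0, 10.0],     # only relevant for the distance kernel
    "C": [0.1, 1.0, 10.0, 100.0],  # cheap to optimize jointly with gamma
}
```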
When this pipeline is applied to the datasets in Table 2 instead of the full grid search of the main experiment (Section 3), there is only a minor decrease in best quantum performance while the runtime is reduced significantly. The accuracy distribution in the search grid (Fig. 1) narrows down to the one displayed in Fig. 13.
6 Discussion
The experiment demonstrated that the distance kernel outperforms the inner kernel when \(\gamma \) is optimized, but when we average \(\gamma \) over the range we investigated, the opposite is true (Figs. 6 and 18). Our results also indicate that each hyperparameter has comparable importance for the accuracy and the GD (Fig. 2). These importances, which are extracted from the GBDT model (Section 3.5), match our intuition based on the standard deviations in the marginals and on the formulas.
The \(\gamma \) hyperparameter is easy to optimize in terms of computational resources and because it has a single extremum. Since it can also make or break the distance model, it should be optimized. \(\gamma \) also shows correlations with other parameters, most importantly with t. Since both t and \(\gamma \) regulate the bandwidth, they are interchangeable to a certain degree; hence, the behavior of the distance kernel when varying t is more robust than that of the inner kernel. This can be seen in Fig. 4, which also highlights the fact that the areas with good accuracy and the ones with high GD are rather distinct, especially regarding the dependence on t. The importance of the size of the RDM K is relatively small for the setting we investigated (Fig. 2), but we expect the importance of this parameter to increase with the number of features (qubit count), as the exponential concentration increases in the same way. Notably, the importance of K also increases when we set \(\alpha _k = 1\) in (4) and (5), which is discussed in Appendix 6. This choice, however, also leads to a stronger dependence of the bandwidth on K. T has a relatively insignificant effect on the performance and the GD, as already stated by Shaydulin and Wild (2022). This suggests that an exact evolution in the Hamiltonian evolution feature map is not necessary, meaning that the approximation error one tries to reduce by Trotterization does not actually result in a worse embedding compared to the exact evolution. From the formula stated in Eq. 2, one cannot expect independence from the permutation of the features. In Appendix 5, we investigate this effect and determine that different permutations lead to different predictions, but that the impact is not significant.
In a supervised learning setup, we can consider, on the one hand, the complexity of the features and, on the other hand, the complexity of the labels. The fact that the complexity of the labels can completely change assumptions based on the features alone is the reason why the GD and the accuracy sometimes show different dependencies. It is also the reason why a relabeled dataset and the original one have little in common in terms of accuracy despite having the same features. Understanding the intricate relationship between the characteristics of the features, the complexity of the labels, the GD metric, and the empirical difference between quantum and classical learners can help us in the search for real-world datasets that favor the inductive bias of quantum KMs. The results from this study provide initial hints. For example, it might in general be better to have a higher t with a lower \(\gamma \) than a low t with a high \(\gamma \): both settings can yield the same accuracy, but based on Fig. 5, the first one is closer to the area with a high GD and therefore might be a better starting point. Choosing a larger value for K notably increases the GD and slightly enhances performance. This, nonetheless, leads to Hilbert spaces with higher dimensions and to the convergence of the kernel to a fixed value (Thanasilp et al. 2022). RDM kernels might remedy this problem by contracting the space, which comes at the expense of a possible expressivity loss (as indicated by the GD). Further research is required to understand the use cases that might particularly benefit from the deployment of quantum models.
7 Conclusion
To summarize our contributions, in this work, we have performed an extensive study of the effects of different hyperparameter settings on the accuracy and the GD score. Our results suggest that even though the hyperparameters have comparable importance for both the accuracy and the GD, the optimal values for the two metrics sometimes differ. This study provides some clues towards understanding the difference between classical and quantum kernel methods. The most important hyperparameter for both metrics is the evolution time t, which acts as a bandwidth; this reinforces the claims made in previous literature. This hyperparameter controls the scaling of the input data, and among the investigated datasets, the optimal t corresponded to scaling the features into a similar range. The GD and the accuracy were observed to depend on t in an opposing manner. The distance kernel is more robust and performs better than the inner kernel thanks to \(\gamma \), which can be easily optimized and has a similar effect as t. The two alternatives for providing the necessary kernel trace to compute the GD performed almost equally.
Overall, we found optimal ranges for the hyperparameters that work well across the datasets. These findings can be used to reduce the computational resources needed when learning a new dataset (Section 5). They are, however, specific to the size of the feature space we investigated (Section 3.3). One of the main findings of our work concerns K, especially since this hyperparameter has no classical equivalent and has not yet been the subject of many experiments. We find that it shows the most significant difference in importance between the accuracy and the GD. How this difference changes with higher dimensions of the feature space is left for future investigations.
Relabeled datasets, whose labels are more favorable to quantum learners, show different dependencies and optimal values than the real-world datasets. Some findings, like the optimal range for feature scaling, can break down for them. Depending on the choice of the prefactors \(\alpha _k\) in Eqs. 4 and 5, the observations for relabeling with different K may change. Specifically, setting \(\alpha _k = 1/N_K\) yields a greater separation with higher GD, whereas \(\alpha _k = 1\) did not produce similar results in our experiments. More studies are necessary to make conclusive statements about this connection.
In order to further increase the understanding of the GD, we also suggest that more experiments with relabeled datasets, different embeddings, and choices of RDM kernels are conducted. Further analyzing the spectral properties of the quantum kernel can shed light on the inductive bias of the model (Kübler et al. 2021; Slattery et al. 2023). Another interesting future work in the context of the quantum Hamiltonian evolution feature map would be to include noise in such a broad hyperparameter analysis, which is ideally done with real hardware runs. Both noisy runs and increasing the number of qubits may provide new insights, particularly regarding the role of K, as exponential concentration (Thanasilp et al. 2022) would become more pronounced. This hyperparameter governs the size of the projection that helps counteract the effects of the exponential concentration, and hence, its importance could rise once we consider scenarios that are more affected by this phenomenon.
Given the relationship of QKMs to other QML approaches (Schuld 2021), an interesting further research direction would be to investigate how our findings translate to other models. For example, hyperparameters responsible for the scaling of the input, such as t and \(\gamma \), have been identified as having a dominating influence, while their optimal values put the data ranges (see Fig. 12) well below the commonly used standardization ranges. This raises the question of whether other QML models, such as quantum neural networks, might benefit from a change in the standard data preprocessing procedure to fit the preprocessed data into the identified ranges.
Data Availability
No datasets were generated or analyzed during the current study.
References
Abbas A, Sutter D, Zoufal C, Lucchi A, Figalli A, Woerner S (2021) The power of quantum neural networks. Nature Comput Sci 1(6):403–409. https://doi.org/10.1038/s43588-021-00084-1
Bischl B, Casalicchio G, Feurer M, Gijsbers P, Hutter F, Lang M, Mantovani RG, Rijn JN, Vanschoren J (2021) OpenML Benchmarking Suites
Canatar A, Peters E, Pehlevan C, Wild SM, Shaydulin R (2023) Bandwidth enables generalization in quantum kernel models
Coyle B, Mills D, Danos V, Kashefi E (2020) The born supremacy: quantum advantage and training of an Ising born machine. npj Quantum Inf 6(1). https://doi.org/10.1038/s41534-020-00288-9
Daniely A, Frostig R, Singer Y (2017) Toward deeper understanding of neural networks: the power of initialization and a dual view on expressivity
Huang H-Y, Broughton M, Mohseni M, Babbush R, Boixo S, Neven H, McClean JR (2021) Power of data in quantum machine learning. Nat Commun 12(1). https://doi.org/10.1038/s41467-021-22539-9
Incudini M, Grossi M, Mandarino A, Vallecorsa S, Pierro AD, Windridge D (2022) The quantum path kernel: a generalized quantum neural tangent kernel for deep quantum machine learning
Jacot A, Gabriel F, Hongler C (2020) Neural tangent kernel: convergence and generalization in neural networks
Kübler J, Buchholz S, Schölkopf B (2021) The inductive bias of quantum kernels. Adv Neural Inf Process Syst 34:12661–12673
Lee J, Bahri Y, Novak R, Schoenholz SS, Pennington J, Sohl-Dickstein J (2018) Deep neural networks as gaussian processes
Lin HW, Tegmark M, Rolnick D (2017) Why does deep and cheap learning work so well? J Stat Phys 168(6):1223–1247. https://doi.org/10.1007/s10955-017-1836-5
Kelly M, Longjohn R, Nottingham K. The UCI Machine Learning Repository. https://archive.ics.uci.edu
McClean JR, Boixo S, Smelyanskiy VN, Babbush R, Neven H (2018) Barren plateaus in quantum neural network training landscapes. Nat Commun 9(1). https://doi.org/10.1038/s41467-018-07090-4
Monnet M, Gebran H, Matic-Flierl A, Kiwit F, Schachtner B, Bentellis A, Lorenz JM (2023) Pooling techniques in hybrid quantum-classical convolutional neural networks
Moradi S, Brandner C, Spielvogel C, Krajnc D, Hillmich S, Wille R, Drexler W, Papp L (2022) Clinical data classification with noisy intermediate scale quantum computers. Sci Rep 12(1):1851
Moussa C, Rijn JN, Bäck T, Dunjko V (2022) Hyperparameter importance of quantum neural networks across small datasets. In: Discovery science, pp 32–46. Springer. https://doi.org/10.1007/978-3-031-18840-4_3
Murphy KP (2012) Machine learning: a probabilistic perspective. MIT Press
Nakaji K, Tezuka H, Yamamoto N (2022) Deterministic and random features for large-scale quantum kernel machine
Rad A (2023) Deep quantum neural networks are Gaussian process
Rudolph MS, Toussaint NB, Katabarwa A, Johri S, Peropadre B, Perdomo-Ortiz A (2022) Generation of high-resolution handwritten digits with an ion-trap quantum computer
Schuld M (2021) Supervised quantum machine learning models are kernel methods
Schuld M, Sweke R, Meyer JJ (2021) Effect of data encoding on the expressive power of variational quantum-machine-learning models. Phys Rev A. 103(3). https://doi.org/10.1103/physreva.103.032430
Shaydulin R, Wild SM (2022) Importance of kernel bandwidth in quantum machine learning. Phys Rev A 106(4). https://doi.org/10.1103/physreva.106.042407
Slattery L, Shaydulin R, Chakrabarti S, Pistoia M, Khairy S, Wild SM (2023) Numerical evidence against advantage with quantum fidelity kernels on classical data. Phys Rev A 107(6). https://doi.org/10.1103/physreva.107.062417
Tensorflow Tutorial on Quantum Data. https://www.tensorflow.org/quantum/tutorials/quantum_data. Accessed 01 Sep 2023
Thanasilp S, Wang S, Cerezo M, Holmes Z (2022) Exponential concentration and untrainability in quantum kernel methods
Funding
Open Access funding enabled and organized by Projekt DEAL. The research is part of the Munich Quantum Valley, which is supported by the Bavarian state government with funds from the Hightech Agenda Bayern Plus.
Author information
Contributions
S.E. implemented the empirical study and prepared all figures, and S.E and A.S. wrote the main manuscript text. All authors contributed to the conception of the work and reviewed the manuscript.
Ethics declarations
Conflict of Interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1. Inner normalized kernel function
In addition to the kernels considered in the main text, we tested an adaptation of the inner product-based kernel. The motivation is that the kernel entries can become relatively small for large dimensions and highly entangled states. On the main diagonal, the values are related to the purity of a quantum state, \(Tr(\rho ^2)\), which can become as small as \(\frac{1}{d}\), where d is the dimension of the Hilbert space. The normalized inner product-based kernel is defined as
and therefore remains a valid kernel, because the same feature vectors as in Nakaji et al. (2022) can be found for the redefined density matrices (DMs). This normalization affects all entries in the Gram matrix; in particular, the main diagonal now consists of ones, as is common for kernels in general. Note that this kernel is only meaningful in a simulation environment.
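A simple way to realize such a normalization (our sketch, not necessarily the exact form of Eq. A1, which may instead normalize each subsystem term individually) is to rescale the aggregated inner kernel by its purity-related diagonal:

```python
import numpy as np

def normalize_inner_kernel(k_inner):
    """Rescale the inner product-based kernel so that its purity-related diagonal becomes 1."""
    d = np.sqrt(np.diag(k_inner))
    return k_inner / np.outer(d, d)
```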
When comparing the inner and inner normalized kernels, we notice almost no difference in performance (as seen in Fig. 14) or in the GD (as seen in later sections). This implies that this adaptation has almost no influence on the classification of the SVM.
Appendix 2. Analyzing different kernel functions
To validate our choice of the classical RBF kernel as a benchmark, and to investigate how this choice affects the GD results, we test other standard kernel functions, such as the linear, polynomial, sigmoid, and Laplacian kernels, in the same setup. Figure 15 portrays the best-achieved performances of all models on all available datasets. The RBF kernel performs best on average by a small margin. The hyperparameter importances change only marginally as we change the classical kernel used as the basis of the GD. The importance values shown in Fig. 16 indicate that the dependence of the GD on the different parameters is relatively universal over the choice of the classical kernel to which this GD is taken.
The standard deviations across the different datasets are smaller for the GDs than for the accuracies. The GD for the classical RBF kernel has an exceptionally small standard deviation, which is one reason why it is selected for the main part of this study. Other reasons are stated in Fig. 15.
To illustrate that the implications of Fig. 6 persist for every dataset individually, with few exceptions, we show in Fig. 17 the individual best performance of these three quantum methods (distance, inner, and normalized inner) for each dataset. This plot illustrates the slight advantage of the distance kernel that was already visible in the original plot. Additionally, in both Figs. 6 and 17, we see that the maxima of the accuracies of the inner and the inner normalized kernels align well.
Appendix 3. Additional details of the empirical study
In this section, additional figures are shown to support the statements from Sect. 4. The marginal plots for each dataset show the mean over all other parameter settings, as described in Sect. 3.5. Figure 19 is meant to highlight the difference between the distance and the inner kernel, with a focus on the stability of the distance kernel for small t values. Figure 18 also compares the different choices of basis, but for an average over \(\gamma \), in contrast to Fig. 6, where this parameter is optimized. The remaining figures in this section display common or differing behavior of the parameter dependencies across the different datasets.
Appendix 4. Effects of changing dataset preprocessing steps
In addition to the preprocessing routine described in Sect. 3.3, we investigated how keeping or dropping correlated features affects the accuracies. For the entire discussion in the main part of this work, highly correlated features were kept. In this section, _A indicates that any feature whose correlation with another feature was greater than 0.75 was eliminated. This step affects only datasets 1, 5, 8, and 9. For dataset 8, we also tested the preprocessing steps from the Tensorflow Tutorial on Quantum Data (2023) (indicated as _B), meaning that the datapoints are scaled to the range [0, 1] and principal component analysis (PCA) is used for dimensionality reduction. We restrict this section to just one random initial state. Figures 30 and 31 demonstrate that the different preprocessing techniques have the greatest impact on dataset 8, where they lead to better results. This dataset is unique, as it is the only one based on image data (and therefore has 784 initial features), compared to 9, 10, and 30 initial features for datasets 1, 5, and 9, respectively. Furthermore, its sample size is substantially larger than that of the other datasets.
Appendix 5. Effects of feature permutations
In the quantum Hamiltonian evolution feature map given in Eq. 2, each feature acts on the qubit with the same index as the feature and on the qubit with this index + 1. Thus, unlike for most classical kernels, different permutations of the input features do not lead to exactly the same model. Moreover, the last feature acts on an additional qubit, which therefore depends on only one feature directly. The first qubit also depends on only one feature, while the other qubits depend on two. The optimal permutation is difficult to determine, so we attempt to gain some insight into the severity of this effect. To this end, we apply the pipeline of Sect. 5 to a selection of datasets for each possible permutation of the features. Because the number of possible permutations \(N_{p}\) of \(N_{f}\) features grows as \(N_{p} = N_{f}!\), we restrict this experiment to just 3 features. For each possible hyperparameter setting, we compute the standard deviation of the test accuracy across permutations. These standard deviations are then averaged across settings and listed in Table 4, along with the mean test accuracy across all settings and permutations. The values are reported for the GD and the test accuracy. For the specific choice of 3 features and the datasets we consider, we did not find a noticeable effect on either the GD or the test accuracy (see also Fig. 22).
Appendix 6. Effects of \(\alpha _K\) parameter
Both \(\alpha _K=1/N_K\) and \(\alpha _K=1\) in Eqs. 4 and 5 result in valid kernels, but we find that the latter choice leads to counter-intuitive observations especially regarding the GD. This choice mainly affects observations based on different values of the hyperparameter K. For example, Table 5 is equivalent to Table 3, but for \(\alpha _K=1\) instead of \(\alpha _K=1/N_K\). When we make this choice, we see that relabeling for a larger GD does not necessarily lead to a larger separation between the best classical and the best quantum model in terms of test accuracy. Going from \(t=0.5\) to \(t=32\) increases the GD and the separation for both \(\alpha _k\). For different K, this is not the case.
For \(\alpha _K=1\), the sum over all possible subsystems yields kernel entries that are roughly \(\propto N_K\). This suggests an added scaling factor of the bandwidth, leading to strong correlations between \(\gamma \) and K, as well as t and K, which do not exist for \(\alpha _K=1/N_K\). Based on this observation, we conclude that \(\alpha _K=1/N_K\) is a better choice for this parameter and use it in the main part of this study.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Egginger, S., Sakhnenko, A. & Lorenz, J.M. A hyperparameter study for quantum kernel methods. Quantum Mach. Intell. 6, 44 (2024). https://doi.org/10.1007/s42484-024-00172-1