1 Introduction

Assessing the performance of quantum models on current quantum computers in tackling machine learning tasks on classical datasets is an enduring ambition of the research community. Many strides have been made on this subject, from theory-driven approaches that showed encouraging results for specific models, e.g., investigating the difference in capacity between classical and quantum models (Abbas et al. 2021) and showcasing a model that allows sampling from a circuit that is hard to simulate classically (Coyle et al. 2020), to more empirically driven projects that showed promising results for specific models on specific use cases, e.g., medical classification with hybrid quantum-classical convolutional neural networks (Monnet et al. 2023) and generation of high-resolution MNIST datapoints with hybrid generative neural networks (Rudolph et al. 2022). However, making an overall claim of quantum advantage for a broader class of models on a broader set of use cases remains challenging, especially due to the notorious lack of theoretical understanding of certain sub-fields of machine learning, such as deep learning (DL) (Lin et al. 2017). Nevertheless, in recent years, a body of literature has emerged linking DL models to kernel methods (KMs) (Daniely et al. 2017; Jacot et al. 2020; Lee et al. 2018), followed by literature drawing similar conclusions for the quantum domain (Rad 2023; Incudini et al. 2022; Schuld 2021). This is great news due to the favorable qualities of KMs, such as convex loss landscapes, which allow convergence, and hence trainability, guarantees to be established (Schuld 2021). This is a significant advantage over DL models, which infamously suffer from non-convex loss landscapes, and over their quantum counterparts, which suffer from barren plateaus (McClean et al. 2018). Focusing on KMs therefore allows for a more theoretically sound argument, while at the same time, the insights gained for KMs can be applied to a broader range of models due to the aforementioned interrelationships.

Huang et al. (2021), arguing from the standpoint of kernel generalization, challenged a common line of reasoning for claiming quantum advantage that is often used in the quantum machine learning (QML) literature. In their work, they showed that classical intractability of the function represented by a quantum model is not sufficient to claim a quantum advantage: a classical learner can learn to approximate the quantum output if a sufficient amount of data is available. Following this argumentation, they presented a method that allows probing specific quantum and classical kernels on a specific dataset to assess whether quantum kernels could exhibit superior performance. The method consists of two steps: (1) a geometric difference (GD) measure is computed between kernel matrices, which determines whether the kernels are different enough for a potential quantum advantage to exist; (2) the model complexities of classical and quantum models are compared to determine whether their discrepancy is sufficient (high classical and low quantum complexities are desirable).

In this work, we strive to comprehend the capacity of quantum kernels on classical datasets. Our ambition is to understand the possible gap between classical and quantum models and whether the GD can be a helpful ally for this purpose. However, given the rather theoretical nature of this metric, we first need to deepen our grasp of its empirical relevance. To this end, we study how the choice of hyperparameters influences the GD and the empirical performance of quantum kernel methods (QKMs). We performed an experiment spanning multiple classical datasets, a large hyperparameter search grid, and different kernel setups. Additionally, for some datasets, we artificially saturate the difference between classical and quantum learners by creating an artificial quantum labeling function following the ideas of Huang et al. (2021), to expand our understanding of the GD metric. We empirically established which hyperparameters QKMs are most sensitive to and hence which are the most crucial to optimize. Our results indicate that the GD depends on the individual hyperparameters as much as the accuracy does, but the optimal values differ greatly between the two metrics. These insights open new research avenues towards understanding which use cases might be best suited for QKMs.

2 Background

Previous work by Moussa et al. (2022) investigated the importance of hyperparameters of quantum neural networks by evaluating the performance of different hyperparameter settings on multiple datasets. From their experimental results, it follows that the choice of data embedding, the depth of the quantum circuit, and the learning rate of the optimizer are the most important hyperparameters. In this work, we draw our inspiration from their experimental design, but we adapt our search grid to hyperparameters that are meaningful in the quantum kernel method setting and conduct additional analysis to determine the versatility of the GD as a potential metric for hyperparameter tuning. This section provides background information on our approach.

2.1 Kernel functions

A kernel function \(k(\textbf{x},\textbf{x}'): \mathbb {R}^D \times \mathbb {R}^D \rightarrow \mathbb {R}\) is a bivariate, real-valued function that can be interpreted as a similarity measure between \(\textbf{x}, \textbf{x}' \in \mathbb {R}^D\); it is symmetric, \(k(\textbf{x},\textbf{x}') = k(\textbf{x}',\textbf{x})\), and non-negative, \(k(\textbf{x},\textbf{x}')\ge 0\). For any finite set \(\mathcal {X}_N \subset \mathbb {R}^D\), we can construct a Gram matrix defined by

$$ K_{\mathcal {X}_N} = \begin{pmatrix} k(\textbf{x}_1,\textbf{x}_1) & \cdots & k(\textbf{x}_1,\textbf{x}_N) \\ & \vdots & \\ k(\textbf{x}_N,\textbf{x}_1) & \cdots & k(\textbf{x}_N,\textbf{x}_N) \end{pmatrix}, $$

which is Hermitian and positive semi-definite (PSD). According to Mercer’s theorem, a Gram matrix with this property allows the kernel function to be represented as an inner product of feature vectors, \(k(\textbf{x},\textbf{x}') = \phi (\textbf{x})^T \phi (\textbf{x}')\), where \(\phi : \mathbb {R}^D \rightarrow \mathcal {H}\) (Murphy 2012) is a feature map that maps the original input space into a higher-dimensional feature Hilbert space.

Depending on the choice of feature map \(\phi \), we can generate kernels with different properties. For example, for \(\textbf{x} \in \mathbb {R}^D\), we can consider the simplest (linear) map \(\phi (\textbf{x}) = \textbf{x}\), which yields the linear kernel \(k(\textbf{x},\textbf{x}') = \textbf{x}^T \textbf{x}'\).

One of the most widely used maps \(\phi (\textbf{x})\) generates the radial basis function (RBF) kernel defined as

$$\begin{aligned} k(\textbf{x},\textbf{x}') = \exp (-\gamma ||\textbf{x} - \textbf{x}'||^2 ), \end{aligned}$$
(1)

where \(\gamma = \sigma ^{-2}\) and \(\sigma \) is known as the bandwidth. This kernel corresponds to an infinite-dimensional feature space, which becomes apparent if we consider the Taylor expansion of Eq. 1.

Kernels are employed by a variety of machine learning algorithms. One of the most prominent examples is the binary linear classifier known as the support vector machine (SVM). The main objective of an SVM is to find a separating hyperplane between two classes of data with the maximal separating margin. However, the data might not be linearly separable in the input space \(\mathbb {R}^D\), in which case mapping the data into a higher-dimensional feature space \(\mathcal {H}\) is advantageous. Moreover, one can avoid expensive explicit computation in high- or infinite-dimensional spaces by utilizing the kernel trick: one merely computes the inner product of two feature vectors without ever accessing the high-dimensional space.
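As a minimal illustration (using scikit-learn; the toy data and parameter values are our own choices), the sketch below trains an SVM with an RBF kernel and shows that the same model can equivalently be fed a precomputed Gram matrix, which is the interface we later use for quantum kernels.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy binary problem that is not linearly separable in the input space
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standard usage: the SVM evaluates the RBF kernel (Eq. 1) internally via the kernel trick
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_train, y_train)
print("accuracy (built-in kernel):   ", clf.score(X_test, y_test))

# Equivalent usage with an explicitly precomputed Gram matrix
gram_train = rbf_kernel(X_train, X_train, gamma=1.0)
gram_test = rbf_kernel(X_test, X_train, gamma=1.0)
clf_pc = SVC(kernel="precomputed", C=1.0).fit(gram_train, y_train)
print("accuracy (precomputed kernel):", clf_pc.score(gram_test, y_test))
```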

2.2 Hamiltonian evolution feature map

For a quantum algorithm to operate on classical data, the data has to be embedded into a quantum state \(\phi : \textbf{x}_i \rightarrow | \textbf{x}_i \rangle \), which lives in a complex Hilbert space. The significance of the choice of embedding operation for the performance of quantum models has been highlighted in the literature (Abbas et al. 2021; Schuld et al. 2021). In this work, we concentrate on the Hamiltonian evolution feature map, as it has proven to be powerful in the QKM setting (Canatar et al. 2023; Huang et al. 2021; Shaydulin and Wild 2022). This feature map was developed for many-body problems; it requires \(n +1\) qubits to embed n features and is defined as follows:

$$\begin{aligned} | \textbf{x}_i \rangle = \bigg (\prod _{j=1}^n \exp \Big (-i \frac{t}{T}x_{ij} H_j^{XYZ} \Big )\bigg )^T \bigotimes _{j=1}^{n+1} | \psi _j \rangle , \end{aligned}$$
(2)

where \(H_j^{XYZ} = (X_j X_{j+1} + Y_j Y_{j+1} + Z_j Z_{j+1})\) with \(X_j, Y_j, Z_j\) being Pauli operators acting on the j-th qubit, \(| \psi _j \rangle \) is a Haar random state, and t and T are hyperparameters that denote the total evolution time and the number of Trotter steps, respectively. The significance of the hyperparameter t for QKMs has been highlighted by Shaydulin and Wild (2022). Our results support and extend their work.
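The following sketch implements this feature map on a simulator, assuming PennyLane's default.qubit device; since the three Pauli terms acting on the same qubit pair commute, each factor in Eq. 2 decomposes exactly into Ising gates. All names and parameter defaults here are our own illustrative choices.

```python
import numpy as np
import pennylane as qml
from scipy.stats import unitary_group

n_features = 5                 # D = 5 features (Section 3.3)
n_qubits = n_features + 1      # Eq. 2 requires n + 1 qubits

dev = qml.device("default.qubit", wires=n_qubits)

# One Haar random single-qubit state |psi_j> per qubit, fixed by a seed
haar_unitaries = [unitary_group.rvs(2, random_state=17 + j) for j in range(n_qubits)]

@qml.qnode(dev)
def embed(x, t=0.5, T=9):
    """Density matrix of |x_i> produced by the Hamiltonian evolution feature map (Eq. 2)."""
    for j, U in enumerate(haar_unitaries):
        qml.QubitUnitary(U, wires=j)
    for _ in range(T):                        # T Trotter steps
        for j in range(n_features):           # exp(-i (t/T) x_j H_j^XYZ) on qubits (j, j+1)
            phi = 2 * (t / T) * x[j]          # qml.IsingXX(phi) = exp(-i phi/2 X⊗X), etc.
            qml.IsingXX(phi, wires=[j, j + 1])
            qml.IsingYY(phi, wires=[j, j + 1])
            qml.IsingZZ(phi, wires=[j, j + 1])
    return qml.density_matrix(wires=list(range(n_qubits)))

rho = embed(np.array([0.3, -1.2, 0.7, 0.0, 0.5]))   # 64 x 64 density matrix for 6 qubits
```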

2.3 Quantum kernels

An inner product of two quantum states defines a quantum fidelity kernel as

$$\begin{aligned} k(x, x') = |\langle \phi (x) | \phi (x')\rangle |^2. \end{aligned}$$
(3)

Quantum kernels embed data into an exponentially high-dimensional Hilbert feature space, which is intractable for classical computers. The high dimensionality of this space has to be treated with caution, as it requires \(N \ge \Omega (2^n)\) training datapoints to approximate a given function well, where n is the number of logical qubits (Huang et al. 2021; Kübler et al. 2021; Nakaji et al. 2022; Thanasilp et al. 2022). This is due to a problem known as exponential concentration (Thanasilp et al. 2022). The same logic holds for the quantum fidelity kernel of mixed quantum states \(k(x, x') = Tr(\rho (x) \rho (x'))\) (Nakaji et al. 2022). Reducing the dimensionality of the space by projecting it onto a smaller subspace, thereby creating so-called projected quantum kernels, increases the generalization capacity of these methods. To achieve this, we can utilize reduced density matrices (RDMs) \(\rho ^{(q)}(x) = Tr_{\ne q} (\Phi (\rho (x)))\), where \(Tr_{\ne q}\) denotes the partial trace over all qubits outside the subsystem q and \(\Phi \) is a trace-preserving map. Depending on the choice of the map \(\Phi \), different proposals for projected kernels can be found in the literature (Huang et al. 2021; Kübler et al. 2021). As generalized by Nakaji et al. (2022), these projected (RDM-based) kernels can be categorized into inner product-based kernels defined as

$$\begin{aligned} k(x, x') = \sum ^{N_K}_k \alpha _k Tr(\rho ^{(q_k)}(x)\rho ^{(q_k)}(x')), \end{aligned}$$
(4)

where \(\alpha _k > 0\) are coefficients, and distance-based kernels defined as

$$\begin{aligned} k(x, x') = \exp (-\gamma \sum ^{N_K}_k \alpha _k ||\rho ^{(q_k)}(x) - \rho ^{(q_k)}(x')||^2_{F}), \end{aligned}$$
(5)

where the Frobenius norm is denoted by \(||.||_F\), \(\gamma >0\) is a hyperparameter, and \(\alpha _k > 0\) are coefficients. The subsystem of size K represented by \(q_k\) is the k-th of the \(N_K\) possibilities.
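As a concrete sketch of Eqs. 4 and 5 (with \(\alpha _k = 1/N_K\), as used in our experiments), the following NumPy code computes the reduced density matrices of all size-K subsystems from a full density matrix and evaluates both projected kernels; the function names and the partial-trace implementation are our own.

```python
import itertools
import numpy as np

def rdm(rho, keep, n_qubits):
    """Reduced density matrix on the qubits in `keep` (partial trace over all other qubits)."""
    remaining = list(range(n_qubits))
    rho = rho.reshape([2] * (2 * n_qubits))       # one ket and one bra axis per qubit
    for q in [q for q in range(n_qubits) if q not in keep]:
        pos = remaining.index(q)
        m = len(remaining)
        rho = np.trace(rho, axis1=pos, axis2=pos + m)
        remaining.remove(q)
    d = 2 ** len(remaining)
    return rho.reshape(d, d)

def inner_kernel(rho_x, rho_y, n_qubits, K):
    """Inner product-based projected kernel (Eq. 4) with alpha_k = 1 / N_K."""
    subsystems = list(itertools.combinations(range(n_qubits), K))
    vals = [np.real(np.trace(rdm(rho_x, q, n_qubits) @ rdm(rho_y, q, n_qubits)))
            for q in subsystems]
    return float(np.mean(vals))

def distance_kernel(rho_x, rho_y, n_qubits, K, gamma=1.0):
    """Distance-based projected kernel (Eq. 5) with alpha_k = 1 / N_K."""
    subsystems = list(itertools.combinations(range(n_qubits), K))
    sq_dists = [np.linalg.norm(rdm(rho_x, q, n_qubits) - rdm(rho_y, q, n_qubits), "fro") ** 2
                for q in subsystems]
    return float(np.exp(-gamma * np.mean(sq_dists)))
```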

2.4 Geometric difference (GD)

Huang et al. (2021) challenged a common argument for claiming quantum advantage that is often used in the QML literature. In their work, they showed that classical intractability of the function f(x) represented by a quantum model is not sufficient to claim a quantum advantage. Under certain circumstances, a classical learner h(x) can learn to approximate f(x) given a sufficient number N of independent samples from a data distribution \(\mathcal {D}\). They showed that the expected error between f(x) and h(x) can be upper bounded as follows:

$$\begin{aligned} \mathbb {E}_{\textbf{x}\sim \mathcal {D}}|h(\textbf{x}) - f(\textbf{x})|\le c\sqrt{\frac{s_K(N)}{N}}, \end{aligned}$$
(6)

where \(s_K(N)\) is the model complexity of the trained function h(x) and \(c>0\) is a constant. Consequently, with \(N \propto s_K(N) / \epsilon ^2\) datapoints, a classical model can learn to predict the quantum function f(x) up to an error \(\epsilon \). Thus, if the best classical learner h(x) has a low model complexity \(s_K(N)\), it will not require much data N to learn the outputs of the quantum model f(x). From this, it follows that for a quantum advantage to exist, the largest possible separation between the complexities of the quantum and classical models is required. Huang et al. (2021) considered kernel models in their work and defined an asymmetric GD-based test between kernel matrices that captures this separation on a specific dataset. The test is defined as follows:

$$\begin{aligned} g_{CQ} = \sqrt{||\sqrt{K_Q}(K_C)^{-1}\sqrt{K_Q}||_{\infty }}, \end{aligned}$$
(7)

where \(K_C\) and \(K_Q\) are the classical and quantum kernel matrices with \(Tr(K_Q)=Tr(K_C)=N\), respectively. The distance-based kernel meets this requirement, while the inner product-based projected kernel requires rescaling according to \(\tilde{K}_{Q} = N\cdot K_{Q}/Tr(K_{Q})\), because the diagonal of its kernel matrix contains purity-related values (see Section 1). Importantly, this test relies only on the GD between the embedded datapoints, without considering the labeling function. The GD links the classical model complexity \(s_{K_C}(N)\) and the quantum model complexity \(s_{K_Q}(N)\) according to

$$\begin{aligned} c\sqrt{\frac{s_{K_C}(N)}{N}}\le c g_{CQ}\sqrt{\frac{s_{K_Q}(N)}{N}}, \end{aligned}$$
(8)

which also connects the prediction error bounds given in Eq. 6.
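A sketch of how this test can be evaluated numerically is given below; we interpret \(||\cdot ||_{\infty }\) as the spectral norm (largest eigenvalue), as in Huang et al. (2021), while the rescaling of both matrices to trace N and the optional regularization are our own implementation choices.

```python
import numpy as np
from scipy.linalg import sqrtm

def geometric_difference(K_q, K_c, lam=0.0):
    """Geometric difference g_CQ between a quantum and a classical kernel matrix (Eq. 7)."""
    N = K_q.shape[0]
    K_q = N * K_q / np.trace(K_q)                  # enforce Tr(K_Q) = N
    K_c = N * K_c / np.trace(K_c)                  # enforce Tr(K_C) = N
    sqrt_Kq = np.real(sqrtm(K_q))
    inv_Kc = np.linalg.inv(K_c + lam * np.eye(N))  # lam > 0 guards against near-singular K_C
    M = sqrt_Kq @ inv_Kc @ sqrt_Kq
    return float(np.sqrt(np.linalg.norm(M, ord=2)))  # ord=2: spectral norm
```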

3 Experimental setup

In this work, we investigate how different hyperparameter settings affect the empirical performance of quantum kernels, as well as their generalization ability as captured by the GD. We designed an experiment to evaluate a variety of hyperparameter settings. In this section, we describe the parameters of the search grid used in our experiment: we varied the types of models and their hyperparameters and tested their performance on various datasets.

3.1 Models

In this work, we combine classical SVMs (see Section 2) from the sklearn library with quantum kernels implemented on simulators with the PennyLane framework. To obtain a kernel, we first embed the classical datapoints with the Hamiltonian evolution feature map as in Eq. 2 and save the full density matrix (DM) of the resulting quantum state. These DMs are used to compute the RDM kernels described in Eq. 4 (with additional rescaling to \(Tr(K_Q)=N\)), Eq. 5, and Eq. A1. From here on, we refer to these kernels as inner and distance kernels, respectively. In all cases, we set \(\alpha _k = 1/N_K\). In Appendix 6, we discuss the case \(\alpha _k = 1\).
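A minimal sketch of this pipeline, reusing the embed and distance_kernel functions sketched in Sections 2.2 and 2.3 (names and default values are ours), looks as follows:

```python
import numpy as np
from sklearn.svm import SVC

def gram_matrix(X_a, X_b, n_qubits=6, K=6, gamma=1.0, t=0.5, T=9):
    """Gram matrix of the distance kernel between two sets of datapoints."""
    rhos_a = [embed(x, t=t, T=T) for x in X_a]
    rhos_b = [embed(x, t=t, T=T) for x in X_b]
    return np.array([[distance_kernel(ra, rb, n_qubits, K, gamma) for rb in rhos_b]
                     for ra in rhos_a])

# gram_train = gram_matrix(X_train, X_train)
# gram_test = gram_matrix(X_test, X_train)
# clf = SVC(kernel="precomputed", C=1.0).fit(gram_train, y_train)
# test_accuracy = clf.score(gram_test, y_test)
```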

From these models, we extract the relevant hyperparameters. From the SVM, we consider the regularization hyperparameter C. From the feature map, we adopt the number of Trotterization steps T and the evolution time t (as in Eq. 2). The random seed for the initial Haar random state is also set to different values, although it cannot be considered a hyperparameter. From the quantum kernels, we take into account the size of the RDM subsystem K (as in Eqs. 4 and 5) and, in the case of the distance kernel, the bandwidth \(\gamma \). All hyperparameters of the search grid and their corresponding value ranges are summarized in Table 1.

For a classical baseline, we train SVMs with the classical kernels listed in Section 2: from basic kernels (linear and polynomial) to more complex ones (RBF and Laplacian) and the (not necessarily PSD) sigmoid kernel. For this comparison, the C and \(\gamma \) parameters are explored over the same range as in the quantum case, with the addition of more standard classical \(\gamma \) choices, such as 1/D and \(1 / (D \cdot \sigma (\mathcal {X}^D))\), with D being the dimensionality of the feature space and \(\sigma (\mathcal {X}^D)\) the variance of the dataset.

Table 1 List of hyperparameters for the embedding (Trotter steps T and evolution time t), the kernel function (subsystem size K and bandwidth \(\gamma \)), and the SVM (regularization C), together with their corresponding value ranges, number of samples, and spacing used in the search grid. D is the number of features

3.2 Metrics

To rate the importance of the hyperparameters, we need to correlate them with the model’s performance. In this paper, we consider performance metrics, namely the accuracy on the test dataset and the mean accuracy of fivefold cross-validation on the training dataset, as well as the generalization metric geometric difference (GD) introduced in Section 2.4. The GD measures the difference between two kernel matrices, so in our studies, we consider the GD from the quantum kernel matrix to each of the classical ones.
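For SVMs with precomputed (quantum) kernels, cross-validation amounts to slicing the Gram matrix by rows and columns; a sketch of the fivefold cross-validation accuracy we report (implementation details are ours) is:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cv_accuracy(gram, y, C=1.0, n_splits=5, seed=0):
    """Mean accuracy of stratified k-fold CV for an SVM with a precomputed kernel matrix."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train, test in skf.split(np.zeros(len(y)), y):
        clf = SVC(kernel="precomputed", C=C)
        clf.fit(gram[np.ix_(train, train)], y[train])          # train block of the Gram matrix
        scores.append(clf.score(gram[np.ix_(test, train)], y[test]))
    return float(np.mean(scores))
```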

Table 2 A comprehensive list of all datasets and their sources that were used in the experiment

3.3 Datasets

To train and test our models, we selected the 11 datasets listed in Table 2, which have previously been used in the QML literature (Huang et al. 2021; Moradi et al. 2022; Moussa et al. 2022; Shaydulin and Wild 2022). Due to simulation time restrictions, we compress each dataset to five features, which translates to 6 qubits according to Eq. 2, and 200 datapoints (or fewer if fewer were available). In this study, we restrict ourselves to binary classification tasks, so for the multi-class dataset 8 (Fashion MNIST), only two classes are selected (T-shirt/Top and Dress) and the rest are discarded. Datapoints with incomplete or missing features are removed, as are duplicates. In addition, the datasets are balanced. Features with a variance of less than 0.001 are eliminated, and all features are normalized (zero mean and unit variance). To reduce the dimensionality, the features are sorted according to their analysis of variance (ANOVA) F-value and the top 5 are selected. Different preprocessing routines are discussed in Appendix 4.
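A sketch of this preprocessing with scikit-learn is shown below (the removal of incomplete and duplicate datapoints is omitted, and the exact order of steps as well as the balancing strategy are our own assumptions):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.preprocessing import StandardScaler

def preprocess(X, y, n_features=5, n_points=200, seed=0):
    """Reduce a dataset to a balanced subset with at most `n_points` points and `n_features` features."""
    X = VarianceThreshold(threshold=1e-3).fit_transform(X)          # drop near-constant features
    X = StandardScaler().fit_transform(X)                           # zero mean, unit variance
    X = SelectKBest(f_classif, k=min(n_features, X.shape[1])).fit_transform(X, y)  # top ANOVA features
    rng = np.random.default_rng(seed)
    per_class = min(n_points // 2, *(int(np.sum(y == c)) for c in np.unique(y)))
    idx = np.concatenate([rng.choice(np.flatnonzero(y == c), size=per_class, replace=False)
                          for c in np.unique(y)])                   # class-balanced subsample
    return X[idx], y[idx]
```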

3.4 Artificial labels

For a more in-depth study of the GD, we additionally employ a method for generating artificial labels for classical datasets that are more favorable for quantum learners, as proposed by Huang et al. (2021). This is done by first augmenting Eq. 7 to a matrix of the form \(\sqrt{K_Q}(K_C + \lambda Id)^{-1}\sqrt{K_Q}\) with the regularization term \(\lambda = 1.1\). We select the eigenvector \(\textbf{v}\) that corresponds to the largest absolute eigenvalue, and the intermediate new labels are calculated as \(y = \sqrt{K_Q}\textbf{v}\). These labels are binarized by setting them to 1 if they are greater than the median of all y and to 0 otherwise. The learning task with these new labels should now saturate Eq. 8. However, the kernel matrices in this scenario are formed using the complete dataset, resulting in slight deviations from this saturation once the data is separated into training and test sets. To improve generalization, 5% of the labels are randomly set to 0. We use different Haar random initial states for relabeling and learning. Most of the analysis is performed on the original datasets; the relabeled ones are discussed in Section 4.3.
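A sketch of this relabeling procedure (following Huang et al. 2021; the function name and numerical details are ours) is:

```python
import numpy as np
from scipy.linalg import sqrtm

def quantum_favorable_labels(K_q, K_c, lam=1.1, flip_fraction=0.05, seed=0):
    """Generate artificial labels that saturate the separation in Eq. 8 for the given kernels."""
    N = K_q.shape[0]
    sqrt_Kq = np.real(sqrtm(K_q))
    M = sqrt_Kq @ np.linalg.inv(K_c + lam * np.eye(N)) @ sqrt_Kq
    eigvals, eigvecs = np.linalg.eigh(M)
    v = eigvecs[:, np.argmax(np.abs(eigvals))]       # eigenvector of the largest |eigenvalue|
    y = sqrt_Kq @ v
    labels = (y > np.median(y)).astype(int)          # binarize at the median
    rng = np.random.default_rng(seed)
    labels[rng.random(N) < flip_fraction] = 0        # randomly set ~5% of labels to 0
    return labels
```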

3.5 Analysis

After all results from the experiments are aggregated, the data is prepared for further analysis. The choice of the kernel function is treated as a hyperparameter: the two kernel functions are encoded as the binary variable basis. We exclude the hyperparameter C from the main analysis (as it has no influence on the GD) and discuss it separately in Section 4.2.6. For datasets 1 and 8, we removed the highest 3% of each GD due to the presence of significant outliers. Such outliers may occur when the classical kernel matrix in Eq. 7 is nearly singular, making the GD exceedingly large.

We start the analysis by investigating the importance of each hyperparameter. Based on the insights from our prior experiments, we chose the Gradient Boosted Decision Trees (GBDT) model from sklearn and fitted it to the experimental data. The features correspond to the hyperparameters and the labels to the accuracies of the earlier classification tasks. Such a GBDT consists of hundreds of decision trees, and from how often a certain feature (hyperparameter) is used as a split point, a feature importance can be derived. More precisely, an impurity-based feature importance, also known as the Gini importance, can be extracted from this model.
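A sketch of this importance extraction is given below; we assume the grid results are collected in a pandas DataFrame and use sklearn's GradientBoostingRegressor (whether a regressor or classifier variant was used is our assumption), whose feature_importances_ attribute yields the impurity-based (Gini) importances.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

HYPERPARAMS = ["t", "T", "K", "gamma", "basis", "seed"]   # seed included for comparison only

def gini_importance(results: pd.DataFrame, target: str = "test_accuracy") -> pd.Series:
    """Fit a GBDT mapping hyperparameters to a metric and return the Gini importances."""
    X = results[HYPERPARAMS].copy()
    X["basis"] = (X["basis"] == "distance").astype(int)   # encode the kernel choice as binary
    gbdt = GradientBoostingRegressor(random_state=0).fit(X, results[target])
    return pd.Series(gbdt.feature_importances_, index=HYPERPARAMS).sort_values(ascending=False)
```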

Additionally, we analyze the marginal distribution of each hyperparameter, i.e., the average over all other hyperparameters, to capture the effect of that particular hyperparameter on the accuracy and on the GD to the classical RBF kernel. Apart from the mean, we also track the standard deviation, as it indicates how reliable the marginal distribution is: a large standard deviation indicates a small importance of that particular hyperparameter for the metric, and vice versa. The impact of each hyperparameter on performance and GD is evaluated using the GBDT and these margin plots.
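In terms of the same hypothetical results DataFrame, such a marginal is simply a group-wise mean and standard deviation:

```python
import pandas as pd

def marginal(results: pd.DataFrame, hyperparam: str, metric: str = "test_accuracy"):
    """Marginal of `metric` over `hyperparam`: mean and std over all other settings (and datasets)."""
    grouped = results.groupby(hyperparam)[metric]
    return grouped.mean(), grouped.std()

# mean_acc, std_acc = marginal(results, "t")                 # accuracy marginal of the evolution time
# mean_gd, std_gd = marginal(results, "t", metric="gd_rbf")  # GD-to-RBF marginal of the evolution time
```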

4 Results

Figure 1 visualizes the range of accuracies across the different search grid settings as well as the best performances achieved by classical and quantum learners (for more details on the best learners, see Appendix 2). Since the quantum models achieve results comparable to their classical counterparts, we can assume that the explored hyperparameter ranges allowed for sufficient expressivity. Nevertheless, the best classical model can barely be surpassed. The selected datasets range from easy to difficult to learn, and the spread of accuracies across the different parameter settings varies from dataset to dataset.

Fig. 1
figure 1

Distribution of accuracies achieved by an SVM with a quantum kernel on the test set for all hyperparameter settings from the search grid (Section 3.1) for each dataset (Table 2). The distributions are visualized as boxplots. Additionally, the diamonds indicate the best accuracies achieved by classical and quantum models. The dashed line indicates the accuracy that would be achieved by random guessing

4.1 Hyperparameter importance

We analyze whether accuracy and the GD metrics assign the same importance to the hyperparameters. For that, we extract Gini importance from the GBDT model fitted to the data extracted from the search grid experiment as described in Section 3.5. Figure 2 displays the average importance found for the different hyperparameters across all the datasets for performance as well as for the GD to the classical RBF kernel. See Fig. 16 for a breakdown of hyperparameter importance by all metrics.

Contrary to our initial expectations, both the performance and the GD exhibit minimal sensitivity to variations in the value of K. The number of Trotterization steps T also has no significant effect on any of the metrics studied. These results are consistent with Shaydulin and Wild (2022), where the t and T hyperparameters were investigated as well. This analysis also found that the random seed for the Haar initial state had the lowest importance for every metric, but as it cannot be interpreted as a hyperparameter, it will not be discussed further.

Fig. 2
figure 2

Mean Gini importance for each hyperparameter listed in Table 1, across all datasets listed in Table 2, for both the accuracy and the GD to the classical RBF kernel. The height of each bar represents the mean value, while the error bars indicate the standard deviation. The importance indicates how much influence varying this hyperparameter has on one of the two metrics. These importances were extracted from the GBDT model fitted on the results from the search grid (Section 3.5)

Judging from the results in Fig. 2, the hyperparameter t is assigned the most importance by both the accuracy and the GD metrics. The choice of basis and \(\gamma \) also has a strong effect on these metrics, as both assign these two parameters a similar importance that jointly ranks them second. When examining the standard deviations, it is apparent that the importance of each hyperparameter for the GD is more consistent across the datasets than its importance for the accuracy.

4.2 Metric dependence on individual hyperparameters

In this section, the effects of the individual hyperparameters are discussed. For the most part, we base our assessments on the marginals of the hyperparameters (see Section 3.5 for further details). The hyperparameters are analyzed in order of their importance. As expected, the cross-validation (CV) accuracy aligns closely with the test accuracy in all marginals and is therefore well suited for tuning all hyperparameters.

Fig. 3
figure 3

Dependence of the GD, the test accuracy, and the cross-validation accuracy on t (evolution time). The shadows illustrate the standard deviations when averaging over the other parameters and over all datasets. The GD was scaled to the range [0, 1] by dividing by its maximum

Fig. 4
figure 4

Dependence of the accuracy and the GD on \(\gamma \) (bandwidth) and t (evolution time) combined. The graphic displays the mean over all datasets and other parameters

4.2.1 Evolution time (t)

Figure 3 visualizes the marginals of the t parameter for the accuracy and the GD. The standard deviations are illustrated as shadows and indicate the importance of t for each of those metrics: if a standard deviation is small when averaging over all other parameters and datasets, the metric does not depend on those other parameters as strongly as on t. The figure highlights that the t values for which a good performance was achieved correspond to a rather small GD, and as the predictions get worse, the potential for quantum advantage increases. This behavior is consistent, and sometimes even more pronounced, when looking at individual datasets (Figs. 20 and 21) or at two-dimensional marginals such as Figs. 4 and 5. Furthermore, of all hyperparameters, t causes the largest differences between the choices of basis. The distance kernel seems to be more resilient for smaller values of t, as demonstrated in Fig. 19. As t increases, the GD begins to rise earlier when the distance kernel is used.

Fig. 5
figure 5

Different marginals of the accuracy depending on two hyperparameters: \(\gamma \) in combination with C (regularization strength), T (number of Trotterization steps), and K (RDM size), as well as t (evolution time) in combination with C (regularization strength). The ranges of the color bars differ since the parameters have different importance for the accuracy

4.2.2 Basis of the kernel function

In order to have a fair comparison between the distance kernel and the inner kernel, the \(\gamma \) parameter is not averaged over; rather, its optimal value is chosen throughout Section 4, where the choice between the two kernels is itself treated as a parameter (Section 3.5). This is further motivated by the fact that \(\gamma \), along with C, is the computationally cheapest hyperparameter to optimize, since the kernel can be reevaluated without having to recompute the distances between DMs. Figure 6 shows the accuracies achieved for the specific choices of kernel function basis over all other parameter settings (except for \(\gamma \)) and datasets. If \(\gamma \) is optimized, the distance basis outperforms the other two. In Fig. 18, we observe a different situation when \(\gamma \) is averaged like all the other parameters, but this is mostly due to the broad range (and thus many suboptimal values) we explored.

4.2.3 Gamma (\(\gamma \))

The second hyperparameter controlling the bandwidth is \(\gamma \). This parameter is only relevant for the distance kernel (Eq. 5), which is why Figs. 7, 23, and 24 are restricted to the distance kernel. In contrast to the behavior seen in Section 4.2.1, the GD here does not necessarily run opposite to the accuracy, as can be seen from Figs. 7, 23, and 24. One behavior observed for all datasets is that the GDs increase with \(\gamma \). For most datasets, the accuracy has a maximum in the range of 0.1–10, with varying curvature around that point across the different datasets. \(\gamma \) is correlated with other parameters, as seen in Figs. 4 and 5. For example, the optimal value for t shifts to smaller values with larger \(\gamma \), which makes sense as the kernel values generally decrease when either of these parameters increases.

Fig. 6
figure 6

Dependence of the test accuracy on the choice of basis for the kernel function. For this comparison, \(\gamma \) is optimized and not averaged. The specific form of a distance- or inner-based kernel function is given in Section 2.3. The dashed line indicates the accuracy that would be achieved by random guessing

Fig. 7
figure 7

Dependence of the GD, the test accuracy, and the cross-validation accuracy on \(\gamma \) (bandwidth). The shadows illustrate the standard deviations when averaging over all datasets and over the other parameters except for basis, which is restricted to the distance kernel. The GD was scaled to the range [0, 1] by dividing by its maximum

4.2.4 Size of the reduced density matrices (K)

Here, we investigate RDMs as a means to tackle the exponential concentration problem (Thanasilp et al. 2022). Because datasets 3 and 4 have only 4 features, they are excluded from all RDM marginals in Section 4, as they would bias the \(K=6\) results. Our results indicate a slight increase in accuracy with this hyperparameter, but a minimal effect overall, while the GD increases notably with K, as displayed in Figs. 5 and 8. Both observations are consistent across all datasets (Figs. 25 and 26). Furthermore, Fig. 2 also indicates the difference in importance.

In Appendix 6, we explore an alternative option for \(\alpha _k\) in Eqs. 4 and 5, which primarily impacts the influence of K on the metrics. Furthermore, artificial labels for various K lead to different observations depending on \(\alpha _k\).

4.2.5 Number of Trotterization steps (T)

We found that varying T has minimal impact on the performance or the GD for most datasets, which can be deduced from the large standard deviations in Fig. 9 and the almost absent variation of any metric with increasing T (see also Figs. 27 and 28). In Fig. 27, only dataset 9 shows improved accuracy with higher T.

Fig. 8
figure 8

Dependence of the GD, the test accuracy, and the cross-validation accuracy on K (RDM size). The shadows illustrate the standard deviations when averaging over the other parameters and over all datasets except for 3 and 4. The GD was scaled to the range [0, 1] by dividing by its maximum

Fig. 9
figure 9

Dependence of the GD, the test accuracy, and the cross-validation accuracy on T (number of Trotterization steps). The shadows illustrate the standard deviations when averaging over the other parameters and over all datasets. The GD was scaled to the range [0, 1] by dividing by its maximum

Fig. 10
figure 10

Dependence of the GD, the test accuracy, and the cross-validation accuracy on C (regularization strength). The shadows illustrate the standard deviations when averaging over the other parameters and over all datasets. The GD was scaled to the range [0, 1] by dividing by its maximum

4.2.6 Regularization strength (C)

Because C does not affect the kernel, the GD does not depend on it. The accuracy, however, shows a notable dependence, especially for smaller values (Fig. 10). The pattern is as expected for this parameter: performance increases strongly up to a global maximum, from where the quality of the predictions slightly decreases with increasing C. There are large discrepancies between the datasets as to where that maximum occurs and how steep the slopes before and after it are (compare Fig. 29).

4.3 Relabeled datasets

With the method described in Section 3.4, we relabeled datasets for different quantum kernels. The motivation behind this is to detect differences between the original datasets and datasets that are more favorable for the quantum methods. Since the GD does not depend on the labels, this solely affects the accuracy. As the initial state has a negligible influence on the results (Section 4.1), we limit ourselves to a single random state in this section.

We found that relabeling shifts the optimal hyperparameter values towards the setting for which the relabeling was performed. As an example, Fig. 11 shows the marginal accuracies for the original dataset 6 compared to two relabeled versions for the settings \(basis = distance\), \(T=9\), \(K=6\), \(\gamma = 1\), and either \(t = 0.5\) or \(t = 32\). Furthermore, the GD is plotted for exactly these choices of basis, T, K, and \(\gamma \). The marginal dependence on t illustrates that the relabeling has shifted the optimal values of this hyperparameter towards the setting for which the relabeling was performed. This effect was also observed for other parameters.

When the relabeling is done for \(t=0.5\), the separation between the best classical and the best quantum accuracy is smaller compared to relabeling for \(t=32\), as displayed in Table 3. This is expected because the GD increases with t, as shown in Fig. 11. Similarly, higher values of K also result in a larger GD (see Section 4.2.4) in the case of \(t=32\), which results in a larger separation. However, when \(t=0.5\), the GD is too small to significantly improve the quantum model’s ability to learn the artificial labels in comparison to the classical models. Small deviations in the separation can be explained by the process described in Section 3.4, which includes applying a regularization term (\(\lambda Id\)), using a train-test split, and incorporating additional randomness. Similar to Table 3, we discuss Table 5 in Appendix 6, where we set \(\alpha _k = 1\) instead of \(\alpha _k = 1/N_K\) in Eqs. 4 and 5.

Fig. 11
figure 11

Dependence of the test accuracy on t (evolution time) for the original dataset 6 and two relabeled versions for the settings \(basis = distance\), \(T=9\), \(K=6\), \(\gamma = 1\), and either \(t = 0.5\) or \(t = 32\). The dependence of the GD on t for these specific choices of the other parameters is also shown. This metric is identical for the three versions, as the GD does not depend on the labels. Furthermore, it was scaled to the range [0, 1] by dividing by its maximum. The shadows of the accuracies illustrate the standard deviations when averaging over the other parameters

Table 3 List connecting the separation between the best classical and the best quantum accuracies with the GD
Fig. 12
figure 12

All datasets from Table 2 on their original feature scales (left). The features are scaled by the t value that is best on average, which is the parameter responsible for the dataset scaling. We consider the inner (middle) and distance (right) kernels separately, since the distance kernel has the additional \(\gamma \) hyperparameter, which also controls the bandwidth and is closely related to the parameter scaling

5 Pipeline for investigating new datasets

In this section, we consolidate the lessons learned into a coherent pipeline for investigating new datasets with QKMs in combination with the Hamiltonian evolution feature map. These insights are specific to our experimental setup (6-qubit systems in a simulation environment, with datasets preprocessed as outlined in Section 3.3); however, we believe they might hold for a broader set of use cases.

Our experiments have shown the importance of the hyperparameter t, which is responsible for the data scaling and whose importance has also been emphasized by Canatar et al. (2023) and Shaydulin and Wild (2022). This raises the interesting question of whether there is a good universal scaling factor for the Hamiltonian evolution feature map. To investigate this, we determined the values of t that provide the best accuracy on average (over all other hyperparameter settings). Inner and distance kernels are considered separately, as the distance kernel has the additional parameter \(\gamma \), which also controls the scaling. As can be seen in Fig. 12, the datasets are scaled down to a similar range for each of the two kernel types. From this, it follows that scaling the data down to a standard deviation of approximately 0.49 for the inner kernel and 0.22 for the distance kernel is a good initial guess.

The findings from the relabeled datasets argue against a universally optimal dataset scaling. As shown in Section 4.3, the optimal scale of the input data can actually be far from the \([-1,1]\) range, because the best t value there is \(t=32\). The suggested range is therefore a good initial guess, based on the fact that it was observed for all the original datasets, but it comes with no guarantee, especially for datasets with an inductive bias that favors a QML model.

Similar to the scaling discussion above, across many datasets, the best values for t are usually within the narrow range [0.1, 1] for the inner kernel. The distance kernel also works well for even smaller t (Fig. 19), since this can be compensated for with \(\gamma \). The hyperparameter \(\gamma \) is correlated with t (Fig. 5) and has an optimal range of [0.1, 10] (Section 4.2.3).

In our study, we observed that T had little influence and could be set, for example, to \(T=9\) or even smaller to avoid deep circuits. On the other hand, the accuracy was not affected by K, but projecting appeared to hinder the GD. Based on our findings, \(K=6\) would be the optimal selection for our setup. Moreover, this choice is the least computationally expensive: the number of possible subsystems of size K scales as a binomial coefficient, up to \(N!/((N/2)!)^2\) for \(K=N/2\), where N is the number of qubits, whereas \(K=N\) corresponds to a single subsystem.

The pipeline described above can be summarized as follows (a sketch of the resulting reduced search grid is given after the list):

  • Choose \(K=N\), \(T=9\) and any Haar random initial state.

  • Explore t in the interval [0.1, 1], starting with the one that would scale the data into similar ranges as in Fig. 12.

  • Choose inner kernel to avoid hyperparameter optimization and easier realization; choose distance kernel for a better performance at the cost of optimizing \(\gamma \) (optimize \(\gamma \) and C in a combined fashion).
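A minimal sketch of such a reduced search grid (the specific grid values are our own illustrative choices for the 6-qubit setup of Section 3) could look as follows:

```python
# Reduced hyperparameter grid following the pipeline above
reduced_grid = {
    "K": [6],                        # K = N: a single, unprojected subsystem
    "T": [9],                        # Trotter steps have little influence
    "t": [0.1, 0.25, 0.5, 1.0],      # explore the evolution time within [0.1, 1]
    "basis": ["inner", "distance"],  # the distance kernel additionally needs gamma
    "gamma": [0.1, 1.0, 10.0],       # only relevant for the distance kernel
    "C": [0.1, 1.0, 10.0, 100.0],    # cheap to tune jointly with gamma
}
```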

When this pipeline is applied to the datasets in Table 2 instead of the full grid search of the main experiment (Section 3), there is only a minor decrease in best quantum performance while the runtime is reduced significantly. The accuracy distribution in the search grid (Fig. 1) narrows down to the one displayed in Fig. 13.

Fig. 13
figure 13

Distribution of accuracies for each dataset (Table 2) achieved by an SVM with a quantum kernel for a reduced search grid of hyperparameters. The diamonds indicate the best accuracies achieved by classical and quantum models. The dashed line indicates the accuracy of guessing randomly

6 Discussion

The experiment demonstrated that the distance kernel outperforms the inner kernel when \(\gamma \) is optimized, but when \(\gamma \) is averaged over the range we investigated, the opposite is true (Figs. 6 and 18). Our results also indicate that each hyperparameter has comparable importance for the accuracy and the GD (Fig. 2). These importances, extracted from the GBDT model (Section 3.5), match our intuition based on the standard deviations in the marginals and on the formulas.

The \(\gamma \) hyperparameter is easy to optimize in terms of computational resources and because of its single extremum; since it can also make or break the distance model, it should be optimized. \(\gamma \) also shows correlations with other parameters, most importantly with t. Since both t and \(\gamma \) regulate the bandwidth, they are interchangeable to a certain degree; hence, the distance kernel behaves more robustly than the inner kernel when varying t. This can be seen in Fig. 4, which also highlights that the areas with good accuracy and those with high GD are rather distinct, especially regarding the dependence on t. The importance of the RDM size K is relatively small for the setting we investigated (Fig. 2), but we expect the importance of this parameter to increase with the number of features (qubit count), as the exponential concentration grows in the same way. Notably, the importance of K also increases when we set \(\alpha _k = 1\) in Eqs. 4 and 5, which is discussed in Appendix 6. This choice, however, also leads to a stronger dependence of the bandwidth on K. T has a relatively insignificant effect on the performance or the GD, as already stated by Shaydulin and Wild (2022). This suggests that an exact evolution in the Hamiltonian evolution feature map is not necessary: the approximation error one tries to reduce by Trotterization does not actually result in a worse embedding compared to the exact evolution. From the formula in Eq. 2, one cannot expect independence from the permutation of the features. In Appendix 5, we investigate this effect and find that different permutations lead to different predictions, but that the impact is not significant.

In a supervised learning setup, we can consider the complexity of the features on the one hand and the complexity of the labels on the other. The fact that the complexity of the labels can completely change assumptions based on the features is the reason why the GD and the accuracy sometimes show different dependencies. It is also the reason why a relabeled dataset and the original one share little similarity in terms of accuracy despite having the same features. Understanding the intricate relationship between the characteristics of the features, the complexity of the labels, the GD metric, and the empirical difference between quantum and classical learners can help us in the search for real-world datasets that favor the inductive bias of quantum KMs. The results from this study provide initial hints. For example, it might in general be better to have a higher t with a lower \(\gamma \) than a low t with a high \(\gamma \): both settings can yield the same accuracy, but the first one is closer to the area with a high GD than the second one (based on Fig. 5) and therefore might be a better starting point. Choosing a larger value for K notably increases the GD and slightly enhances performance; this, nonetheless, leads to higher-dimensional Hilbert spaces and convergence of the kernel to a fixed value (Thanasilp et al. 2022). RDM kernels might remedy this problem by contracting the space, which comes at the expense of a possible loss of expressivity (as indicated by the GD). Further research is required to understand which use cases might particularly benefit from the deployment of quantum models.

7 Conclusion

To summarize our contributions, in this work, we have performed an extensive study of the effects of different hyperparameter settings on the accuracy and on the GD score. Our results suggest that even though the hyperparameters have comparable importance for both the accuracy and the GD, the optimal values for the two metrics sometimes differ. This study provides some clues towards understanding the difference between classical and quantum kernel methods. The most important hyperparameter for both metrics is the bandwidth t, which reinforces claims made in previous literature. This hyperparameter controls the scaling of the input data, and among the investigated datasets, the optimal t corresponded to scaling the features into a similar range. The GD and the accuracy were observed to depend on t in an opposing manner. The distance kernel is more robust and performs better than the inner kernel thanks to \(\gamma \), which can be easily optimized and has a similar effect to t. The two alternatives for providing the necessary kernel trace to compute the GD performed almost equally.

Overall, we found optimal ranges for the hyperparameters that work well across the datasets. These findings can be used to reduce the computational resources needed when learning a new dataset (Section 5). They are, however, specific to the size of the feature space we investigated (Section 3.3). One of the main findings of our work concerns K, especially since this hyperparameter has no classical equivalent and has not yet been subject to many experiments. We find that it shows the most significant difference in importance between the accuracy and the GD. How this difference changes with higher-dimensional feature spaces is left for future investigations.

Relabeled datasets, whose labels are more favorable to quantum learners, show different dependencies and optimal values than the original real-world datasets. Some findings, such as an optimal range for feature scaling, can break down for them. Depending on the choice of the prefactors \(\alpha _k\) in Eqs. 4 and 5, the observations for relabeling with different K may change. Specifically, setting \(\alpha _k = 1/N_K\) yields a greater separation with higher GD, whereas \(\alpha _k = 1\) did not produce similar results in our experiments. More studies are necessary to make conclusive statements about this connection.

In order to further increase the understanding of the GD, we also suggest that more experiments with relabeled datasets, different embeddings, and different choices of RDM kernels be conducted. Further analyzing the spectral properties of the quantum kernel could shed light on the inductive bias of the model (Kübler et al. 2021; Slattery et al. 2023). Another interesting piece of future work in the context of the quantum Hamiltonian evolution feature map would be to include noise in such a broad hyperparameter analysis, ideally with real hardware runs. Both noisy runs and an increased number of qubits may provide new insights, particularly regarding the role of K, as exponential concentration (Thanasilp et al. 2022) would become more pronounced. This hyperparameter governs the size of the projection that helps counteract the effects of exponential concentration, and hence its importance could rise once we consider scenarios that are more affected by this phenomenon.

Given the relationship of QKMs to other QML approaches (Schuld 2021), an interesting further research direction would be to investigate how our findings translate to other models. For example, hyperparameters responsible for the scaling of the input, such as t and \(\gamma \), have been identified as having a dominant influence, while their optimal values put the data ranges (see Fig. 12) well below the standard ranges used in practice. This raises the question of whether other QML models, such as quantum neural networks, might benefit from a change in the standard data preprocessing procedure to fit the preprocessed data into the identified ranges.