Abstract
Most machine learning approaches are classified as either supervised or unsupervised. However, combining generative and discriminative functions in the learning process may allow them to beneficially influence each other. Using the centered kernel alignment similarity, this paper proposes a new hybrid cost function defined as the linear combination of two terms: a discriminative component that accounts for the affinity between the projected data and their labels, and a generative component that measures the similarity between the input and projected distributions. Further, the data projection is assumed to be linear, so that a projection matrix has to be learned by maximizing the proposed cost function. We compare our approach, using a kNN classifier, against the raw features and a multi-layer perceptron machine. Results attained on a handwritten digit recognition database show that a trade-off value other than the trivial ones provides the highest accuracy. Moreover, the proposed approach not only outperforms the baseline machines but also becomes more robust across several noise levels.
1 Introduction
In the last decade, machine learning algorithms have experienced a significant boost due to improvements in both unsupervised and supervised learning. Often, selecting a relevant data representation that encodes the studied phenomenon becomes difficult because the input matrix space is ill-posed, that is, it has a huge number of features and a limited number of sample vectors. Hence, dimensionality reduction approaches based on feature embedding become an important step in dealing with large dimensions. The optimized representation of the inputs then allows visualizing the similarity between input vectors in both unsupervised and supervised scenarios [4]. Generally, the approaches for selecting a suitable data representation comprise linear and nonlinear methodologies.
Some linear strategies extract the relevant information from sample covariances, but regardless of whether their estimation is unsupervised or supervised, they lead to unsatisfactory results [8]. In [7], Centered Kernel Alignment (CKA) is proposed to learn a linear projection that encodes discriminative input features and benefits from the nonlinear notion of similarity of kernel methods. The approach takes advantage of the joint information associating the available labels with the corresponding input samples, achieving competitive results in object recognition [5] and computer-aided diagnosis [6] tasks. Nonetheless, generative and discriminative information encoded into a linear projection may favorably influence each other if they are combined during training, outperforming the conventional alignment [11].
In contrast, non-linear approaches aim to preserve the similarity structure using more elaborate dimensionality reduction methods such as kernel analysis [1] or manifold learning [9]. A particular case is Artificial Neural Networks (ANNs), which map the weighted inputs of each neuron to its output through a nonlinear activation function. When a priori knowledge for designing the architecture is lacking, ANN-based methods split the training into two separate stages: unsupervised pre-training and supervised fine-tuning [10]. This strategy is neither biologically plausible nor optimal from an optimization standpoint, as carefully designed stopping criteria have to be implemented to prevent over- or under-fitting [11]. Moreover, direct interpretability of a nonlinear mapping is not always possible [2].
In this work, we introduce a linear projection learning procedure for data representation using a kernel alignment approach. In particular, we define a hybrid CKA similarity by joining generative and discriminative functions, controlled by a trade-off parameter \(\lambda \). It turns out that by annealing \(\lambda \) when solving this unconstrained non-convex optimization problem, the model copes with the shortcomings described above. The manuscript is organized as follows: Sect. 2.1 describes the mathematical background of learning linear projections using CKA for classification. Section 3 introduces the developed experiments and achieved results. Then, the performance results are discussed in Sect. 4. Finally, Sect. 5 presents the conclusions and future research directions.
2 Materials and Methods
2.1 Linear Projection Learning Through Hybrid Cost Function
Let \({\varvec{X}}=\{{\varvec{x}}_{i}\in \mathcal {X}: \forall i\in \{1,\dots ,N\}\}\) be an input data matrix of N P-sized vector samples drawn from the random process \(\mathcal {X}\subset \mathbb {R}^P\), and let \({\varvec{L}}=\{{l}_{i}\in \mathcal {L}: \forall i\in \{1,\dots ,N\}\}\) contain the N outputs \({l}_i\in \{1,\dots ,C\}\) representing C mutually exclusive classes. We define the classification problem in a similarity-based framework, so that the model aims to maximize the distance between two samples if \({l}_i\ne {l}_j\) and to minimize it if \({l}_i={l}_j\). Given the dataset \(\{{\varvec{X}},{\varvec{L}}\}\), we consider the generalized inner product to measure the similarity between a pair of input vectors, implemented through the kernel function \(\kappa ({\varvec{x}}_i,{\varvec{x}}_j)=\langle \varphi ({\varvec{x}}_i),\varphi ({\varvec{x}}_j)\rangle : \forall i,j\in \{1,\dots ,N\}\), where \(\langle \cdot ,\cdot \rangle \) stands for the inner product and \(\varphi (\cdot ):\mathbb {R}^{P}\rightarrow \mathcal {H}\) maps from the original domain, \(\mathbb {R}^{P}\), into a Reproducing Kernel Hilbert Space (RKHS), \(\mathcal {H}\), with \(|\mathcal {H}|\gg P\). The well-known kernel trick is employed for computing the kernel function \(\kappa (\cdot ,\cdot )\), since there is no need to compute \(\varphi (\cdot )\) explicitly. In particular, a distance function parameterizes a positive definite and infinitely divisible kernel function as \(k_{ij} = \kappa ({\varvec{x}}_i,{\varvec{x}}_j) = \kappa \left( {\text {d}}_{{\varvec{W}}}\left( {\varvec{x}}_i,{\varvec{x}}_j\right) \right) \), being \({\text {d}}_{{\varvec{W}}}(\cdot ,\cdot )\) the Mahalanobis distance operator, so that the pairwise comparison between input vectors \({\varvec{x}}_i\) and \({\varvec{x}}_j\) is carried out as \({\text {d}}_{{\varvec{W}}}({\varvec{x}}_i,{\varvec{x}}_j) =\left( {\varvec{x}}_i-{\varvec{x}}_j\right) {{\varvec{W}}}{{\varvec{W}}^\top }\left( {\varvec{x}}_i-{\varvec{x}}_j\right) ^\top \). Then, the parameter matrix \({\varvec{W}}\in \mathbb {R}^{P\times Q}\) linearly projects the samples as \({\varvec{y}}_i={\varvec{x}}_i{\varvec{W}}\) (\({\varvec{y}}_i\in \mathbb {R}^{Q}\), \(Q\le P\)), and \({{\varvec{W}}}{{\varvec{W}}^\top }\) implements the inverse covariance matrix of the Mahalanobis distance in the input feature space. Hence, a kernel matrix \({\varvec{K}}_{{\varvec{W}}}\in \mathbb {R}^{N\times N}\), resulting from the application of \(\kappa \) over each sample pair in \({\varvec{X}}\), estimates the covariance of the random process \(\mathcal {X}\) over \(\mathcal {H}\).
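For concreteness, the following minimal NumPy sketch computes a kernel matrix \({\varvec{K}}_{{\varvec{W}}}\) under the Mahalanobis parameterization above, anticipating the Gaussian kernel introduced in Sect. 2.2. The function name gaussian_kernel_W and the default bandwidth are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def gaussian_kernel_W(X, W, sigma=1.0):
    """Kernel matrix over the projections y_i = x_i W (illustrative sketch).

    d_W(x_i, x_j) = (x_i - x_j) W W^T (x_i - x_j)^T, i.e. the squared Euclidean
    distance between projected samples, is fed to a Gaussian kernel.
    """
    Y = X @ W                                    # N x Q projected samples
    sq = np.sum(Y ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))      # N x N kernel matrix K_W
```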
To make our proposal suitable for a supervised learning task, we introduce the prior knowledge about the feasible sample membership into the computation of \({\varvec{W}}\). To this end, we first enclose the target similarities in a matrix \({\varvec{K}}_{\mathcal {L}}\in \mathbb {R}^{N\times N}\) with elements \(k_{ij}=\kappa _{l}\left( l_i,l_j\right) =\delta (l_i-l_j)\), where \(\delta (\cdot )\) denotes the Kronecker delta. Thus, the matrix \({\varvec{W}}\) can be learned by maximizing the similarity between \(\{{\varvec{K}}_{{\varvec{W}}},{\varvec{K}}_{\mathcal {L}}\}\) through the following real-valued function, \(\rho (\cdot ,\cdot )\in [0,1]\), termed Centered Kernel Alignment (CKA) [3]:

$$\rho \left( {\varvec{K}}_{{\varvec{W}}},{\varvec{K}}_{\mathcal {L}}\right) =\frac{\left\langle {\varvec{H}}{\varvec{K}}_{{\varvec{W}}}{\varvec{H}},\,{\varvec{H}}{\varvec{K}}_{\mathcal {L}}{\varvec{H}}\right\rangle _F}{\left\Vert {\varvec{H}}{\varvec{K}}_{{\varvec{W}}}{\varvec{H}}\right\Vert _F\left\Vert {\varvec{H}}{\varvec{K}}_{\mathcal {L}}{\varvec{H}}\right\Vert _F}\qquad \qquad (1)$$

where \({\varvec{H}}={\varvec{I}}-N^{-1}{\varvec{1}}{\varvec{1}}^\top \) is a centering matrix (\({\varvec{H}}\in \mathbb {R}^{N\times N}\)), \({\varvec{1}}\in \mathbb {R}^N\) is an all-ones vector, and the notations \(\langle \cdot ,\cdot \rangle _F\) and \(\Vert \cdot \Vert _F\) stand for the Frobenius inner product and norm, respectively.
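A direct implementation of Eq. (1) takes only a few lines. The sketch below, with the hypothetical helper names cka and label_kernel, centers both kernel matrices with \({\varvec{H}}\) and normalizes the Frobenius inner product, and also builds the target kernel \({\varvec{K}}_{\mathcal {L}}\) from a label vector.

```python
import numpy as np

def cka(K_a, K_b):
    """Centered kernel alignment rho(K_a, K_b) of Eq. (1)."""
    n = K_a.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix H
    Ka_c, Kb_c = H @ K_a @ H, H @ K_b @ H
    num = np.sum(Ka_c * Kb_c)                    # Frobenius inner product
    return num / (np.linalg.norm(Ka_c) * np.linalg.norm(Kb_c))

def label_kernel(labels):
    """Target kernel K_L with elements delta(l_i - l_j)."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)
```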
Aiming to favor the system performance, we further encode discriminative and generative information during the tuning of \({\varvec{W}}\) by proposing the following hybrid cost function:

$$\mathcal {F}\left( {\varvec{W}};\lambda \right) =\lambda \,\rho \left( {\varvec{K}}_{{\varvec{W}}},{\varvec{K}}_{{\varvec{X}}}\right) +\left( 1-\lambda \right) \rho \left( {\varvec{K}}_{{\varvec{W}}},{\varvec{K}}_{\mathcal {L}}\right) \qquad \qquad (2)$$

where \({\varvec{K}}_{{\varvec{X}}}\in \mathbb {R}^{N\times N}\) encodes all the pair-wise similarities between input vectors into a kernel matrix with elements computed as \(k_{ij}=\kappa \left( {\text {d}}_{{\varvec{X}}}\left( {\varvec{x}}_i,{\varvec{x}}_j\right) \right) \), \({\text {d}}_{{\varvec{X}}}(\cdot ,\cdot )\) is the Euclidean distance operator, and the real-valued parameter \(\lambda \in [0,1]\) is a trade-off that searches for a compromise between the generative (first term) and discriminative (second term) parts. As a result, the hybrid CKA cost function highlights relevant features by learning the matrix \({\varvec{W}}\) that best matches all relations between the projected vectors and the input samples, as well as the relations between the resulting feature vectors and the provided target classes. Consequently, we state the following optimization problem to compute the projection matrix:

$${\varvec{W}}^\star =\arg \max _{{\varvec{W}}}\;\mathcal {F}\left( {\varvec{W}};\lambda \right) \qquad \qquad (3)$$
where the obtained \({{\varvec{W}}}^\star \) yields the linear projection \({\varvec{Y}}={\varvec{X}}{{\varvec{W}}}^\star \) that optimizes the similarity-based classification by assessing distances with \({\text {d}}_{{\varvec{W}}}(\cdot ,\cdot )\). The optimization details for Eq. (3) are described below.
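As a reference, a minimal sketch of the objective in Eqs. (2)-(3) is given below. It reuses the gaussian_kernel_W, cka and label_kernel helpers introduced above, assumes a Gaussian kernel for \({\varvec{K}}_{{\varvec{X}}}\), and the wrapper name hybrid_objective is illustrative only.

```python
import numpy as np

def hybrid_objective(W, X, labels, lam, sigma):
    """Hybrid CKA cost of Eq. (2): lam * generative + (1 - lam) * discriminative."""
    K_W = gaussian_kernel_W(X, W, sigma)                     # kernel over projected samples
    K_X = gaussian_kernel_W(X, np.eye(X.shape[1]), sigma)    # input-space (Euclidean) kernel
    K_L = label_kernel(labels)                               # target label kernel
    return lam * cka(K_W, K_X) + (1.0 - lam) * cka(K_W, K_L)
```

Maximizing this function over \({\varvec{W}}\) corresponds to the problem stated in Eq. (3).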
2.2 Gradient Descent Optimization of Hybrid CKA
The proposed cost function is restated as a conventional minimization problem by transforming each term in Eq. (2) into an explicit minimization function \(f\left( {\varvec{W}};{\varvec{K}}_\mathcal {Z}\right) \) as in [3], being \({\varvec{K}}_\mathcal {Z}\) either the target label (\({\varvec{K}}_{\mathcal {L}}\)) or the input (\({\varvec{K}}_{{\varvec{X}}}\)) kernel matrix, and \(\rho _0\in \mathbb {R}\) a constant independent of \({\varvec{W}}\). In this case, we consider the widely known Gaussian kernel for computing \({\varvec{K}}_{{\varvec{X}}}\) and \({\varvec{K}}_{{\varvec{W}}}\), defined as \(\kappa \left( {\text {d}}_{{\varvec{W}}}\left( {\varvec{x}}_{i},{\varvec{x}}_{j}\right) \right) =\exp \left( {-{\text {d}}^2_{{\varvec{W}}}\left( {\varvec{x}}_{i},{\varvec{x}}_{j}\right) }/{2\sigma ^2}\right) \), where the kernel bandwidth \(\sigma \in \mathbb {R}^+\) is adjusted by maximizing the variance of the information forces among samples [2].
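The bandwidth criterion of [2] maximizes the variance of the information forces among samples; as a simple stand-in for illustration, the sketch below picks \(\sigma \) with the common median-distance heuristic (an assumption, not the criterion used in the paper).

```python
import numpy as np

def median_bandwidth(X):
    """Median pairwise distance as a simple surrogate for the kernel bandwidth sigma."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    iu = np.triu_indices(d2.shape[0], k=1)       # upper-triangular pairs (i < j)
    return float(np.sqrt(np.median(d2[iu]))) + 1e-12
```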
Here, we solve the optimization problem at hand using an iterative gradient descent algorithm, where the derivative of each term in the hybrid cost function (Eq. (2)) with respect to the projection matrix \({\varvec{W}}\) is given by:
where the notations \({{\text {diag}}}(\cdot )\) and \(\circ \) denote the diagonal operator and the Hadamard product, respectively, and \({\varvec{G}}_{\mathcal {Z}}\in \mathbb {R}^{N\times N}\) is the gradient of each term in the cost function with respect to \({\varvec{K}}_{{\varvec{W}}}\), calculated as follows:
This yields the following updating rule for \({\varvec{W}}\), provided an initial guess \({\varvec{W}}_{0}\):

$${\varvec{W}}_{t+1}={\varvec{W}}_{t}-\mu _{t}\left[ \lambda \,\frac{\partial f\left( {\varvec{W}};{\varvec{K}}_{{\varvec{X}}}\right) }{\partial {\varvec{W}}}+\left( 1-\lambda \right) \frac{\partial f\left( {\varvec{W}};{\varvec{K}}_{\mathcal {L}}\right) }{\partial {\varvec{W}}}\right] _{{\varvec{W}}={\varvec{W}}_{t}}\qquad \qquad (9)$$

where \(\mu _{t}\in \mathbb {R}^+\) is the step size of the learning rule at optimization epoch t.
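The update of Eq. (9) can be prototyped without the closed-form gradients; the sketch below instead maximizes the hybrid objective of Eq. (2) with a central finite-difference gradient, which is slow but illustrates the learning rule. The learn_projection name, the initialization scale, and the fixed step size are assumptions.

```python
import numpy as np

def learn_projection(X, labels, Q, lam, sigma, mu=0.05, epochs=50, seed=0):
    """Gradient ascent on the hybrid CKA cost: W <- W + mu * dF/dW (illustrative)."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((X.shape[1], Q))          # initial guess W_0
    eps = 1e-4
    for _ in range(epochs):
        G = np.zeros_like(W)
        for idx in np.ndindex(*W.shape):                     # finite-difference gradient
            E = np.zeros_like(W)
            E[idx] = eps
            G[idx] = (hybrid_objective(W + E, X, labels, lam, sigma)
                      - hybrid_objective(W - E, X, labels, lam, sigma)) / (2.0 * eps)
        W = W + mu * G                                       # ascend Eq. (2); Eq. (9) descends its negative
    return W
```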
3 Experimental Set-Up and Results
We validate the proposed data representation method, based on learning a linear projection through the hybrid cost function, within a classification framework. The methodological development of this approach comprises the following stages: (i) image preprocessing, including a noise injection procedure; (ii) linear projection learning, carried out using the kernel alignment; and (iii) training of the classifier using the resulting data projection. Later, we assess the trained model performance by means of the classification accuracy in a cross-validation scheme.
3.1 Tested Database
We evaluate our proposed approach for classifying handwritten digits on the well-known MNIST collection, which holds 60,000 training images and 10,000 test images of 28 \(\times \) 28 pixels with 256 gray levels, depicting digits between 0 and 9 (as shown in Fig. 1a). In order to evaluate the compromise between the generative and discriminative terms in Eq. (2), we corrupt each image using salt-and-pepper noise with six density levels corresponding to the probability of occurrence of noisy pixels, \(\eta \in \{2\%,5\%,8\%,10\%,15\%,20\%\}\). Figure 1 illustrates one handwritten digit image affected by the considered noise levels.
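The noise injection can be reproduced with a few lines of NumPy; the salt_and_pepper helper below is an illustrative sketch in which density plays the role of \(\eta \).

```python
import numpy as np

def salt_and_pepper(images, density, seed=0):
    """Corrupt 8-bit gray-scale images with salt-and-pepper noise of a given density."""
    rng = np.random.default_rng(seed)
    noisy = images.copy()
    corrupted = rng.random(images.shape) < density   # pixels hit with probability eta
    salt = rng.random(images.shape) < 0.5            # half of them to white, half to black
    noisy[corrupted & salt] = 255
    noisy[corrupted & ~salt] = 0
    return noisy

# e.g. salt_and_pepper(train_images, density=0.05) for the 5% noise level.
```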
3.2 Performance Assessment
The proposed linear projection learning based on CKA is performed per training batch. Once the projection matrix \({\varvec{W}}\) is estimated following Eq. (9), the linearly projected data \({\varvec{Y}}\) is computed and then used to tune the distance-based classifier. Finally, the classification accuracy is computed as the average performance on the test data batches. We carry out the experiments using the widely known k-nearest neighbors (kNN) algorithm as the distance-based classifier.
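The evaluation stage reduces to projecting both partitions with the learned \({\varvec{W}}\) and scoring a kNN classifier. The sketch below relies on scikit-learn's KNeighborsClassifier and is only meant to outline the pipeline.

```python
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(W, X_train, y_train, X_test, y_test, k=3):
    """Accuracy of a kNN classifier trained on the linear projection Y = X W."""
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train @ W, y_train)
    return knn.score(X_test @ W, y_test)
```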
To assess the model robustness, we compute the linear projection from the corrupted data at different noise levels. Then, the average classification performance is computed along a range of \(\lambda \) values using \(k=3\) neighbors. As seen in Fig. 2, as the noise level increases the classification accuracy decreases, i.e., the problem becomes more complex. However, the results preserve the compromise between both terms of the cost function, yielding an optimal trade-off of \(\lambda =0.4\) for all noise levels. As a result, the model demands a larger contribution from the discriminative part without turning off the generative one.
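Putting the previous sketches together, the following toy sweep illustrates how a trade-off curve like the one in Fig. 2 can be traced. The synthetic data merely stand in for a MNIST batch, and all sizes and hyper-parameters are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.random((60, 16)); y_train = rng.integers(0, 3, size=60)
X_test = rng.random((30, 16));  y_test = rng.integers(0, 3, size=30)
sigma = median_bandwidth(X_train)

lambdas = np.linspace(0.0, 1.0, 11)
accs = [knn_accuracy(learn_projection(X_train, y_train, Q=4, lam=lam,
                                      sigma=sigma, epochs=10),
                     X_train, y_train, X_test, y_test, k=3)
        for lam in lambdas]
best_lam = float(lambdas[int(np.argmax(accs))])   # the paper reports an optimum near 0.4
```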
Lastly, we compare our proposal with the classification performance attained by the kNN classifier employing the non-projected data. Additionally, we evaluate the performance of a conventional Multi-Layer Perceptron (MLP), as it implements linear projection estimations followed by saturating nonlinear activation functions. Table 1 displays the classification performance accomplished by each compared methodology under different noise conditions, where the best result for each noise density level is marked in boldface. All algorithms are evaluated in terms of their classification accuracy \((a_c)\) and its standard deviation. Overall, the proposed approach outperforms the baseline techniques in terms of the studied performance measures. Hence, the linear projection estimated through kernel alignment with the hybrid cost function enhances the performance in similarity-based classification problems, facilitating class discrimination.
4 Discussion
This paper enhances distance-based classifiers by linearly projecting data based on the centered kernel alignment function. In this regard, we introduce a hybrid cost function that encodes generative information, between the input and projected samples, and discriminative information, between projected data and the target label kernel. From the obtained results, upon the MNIST dataset for handwritten digit recognition, the following aspects are considered important:
As seen in Fig. 2, the resulting performance curves illustrate that, despite the images being corrupted with a particular level of salt-and-pepper noise, there exists a trade-off between the generative and discriminative terms of the objective cost function that attains the best classification accuracy. Therefore, the linear projection learned from the hybrid CKA similarity function enhances distance-based classification, achieving the highest accuracy with the additional benefit of a more robust data representation.
Table 1 holds the average and standard deviation of the accuracy attained on the MNIST testing data. The MLP approach performs the worst (76.36 ± 6.06), which indicates that the randomly initialized non-linear projection limits the classification accuracy. Furthermore, the objective function of the network training stage is purely discriminative and lacks information to model the inherent data properties. The kNN approach on the raw data (83.36 ± 1.07) can only account for the original data distribution, so that the noise corruption reduces its performance. Lastly, applying kNN to the data projected through the hybrid cost function outperforms the baseline approaches (86.45 ± 1.02). As a result, estimating the mapping matrix \({\varvec{W}}\) based on a compromise between generative and discriminative terms captures the inherent input information while forcing the samples to be projected into a space with higher class separability.
5 Conclusion
This paper discusses linear projection learning for data representation through a hybrid cost function controlled by a trade-off parameter, within a kNN-based classification framework. To this end, we make use of the centered kernel alignment criterion to encode the generative and discriminative information available in the data into a projection matrix and enhance class discrimination. We implement a similarity-based classification problem to evaluate the linear projection performance. For this purpose, we corrupt the data with different noise levels in order to observe the behavior of the classification accuracy over a range of trade-off values. In addition, we compare our proposal against a conventional MLP. Attained results for handwritten digit recognition show that our proposal outperforms the other approaches in all of the considered performance measures, reduces the data dimension, and adds robustness to the data representation.
As a future research direction, we will evaluate our learning approach on new classification and regression problems. We will also introduce constraints into the cost function, such as linear independence, to account for other data properties, and explore other combination approaches, such as the geometric average, to deal with more complex relations among the inherent distributions and labels.
References
Alvarez-Meza, A., Lee, J., Verleysen, M., Castellanos-Dominguez, G.: Kernel-based dimensionality reduction using Rényi's \(\alpha \)-entropy measures of similarity. Neurocomputing 222, 36–46 (2017). https://doi.org/10.1016/j.neucom.2016.10.004
Álvarez-Meza, A.M., Cárdenas-Peña, D., Castellanos-Dominguez, G.: Unsupervised kernel function building using maximization of information potential variability. In: Bayro-Corrochano, E., Hancock, E. (eds.) CIARP 2014. LNCS, vol. 8827, pp. 335–342. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-12568-8_41
Brockmeier, A.J., Choi, J.S., Kriminger, E.G., Francis, J.T., Principe, J.C.: Neural decoding with kernel-based metric learning. Neural Comput. 26(6), 1080–1107 (2014). https://doi.org/10.1162/NECO_a_00591
Brockmeier, A.J., et al.: Information-theoretic metric learning: 2-D linear projections of neural data for visualization. In: 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 5586–5589. IEEE (2013). https://doi.org/10.1109/EMBC.2013.6610816
Cárdenas-Peña, D., Collazos-Huertas, D., Álvarez-Meza, A., Castellanos-Dominguez, G.: Supervised kernel approach for automated learning using general stochastic networks. Eng. Appl. Artif. Intell. 68, 10–17 (2018). https://doi.org/10.1016/j.engappai.2017.10.003
Cárdenas-Peña, D., Collazos-Huertas, D., Castellanos-Dominguez, G.: Centered kernel alignment enhancing neural network pretraining for MRI-based dementia diagnosis. Comput. Math. Methods Med. 2016, 1–10 (2016). https://doi.org/10.1155/2016/9523849
Cortes, C., Mohri, M., Rostamizadeh, A.: Algorithms for learning kernels based on centered alignment. J. Mach. Learn. Res. 13, 795–828 (2012). http://dl.acm.org/citation.cfm?id=2503308.2188413
Cowley, B.R., et al.: DataHigh: graphical user interface for visualizing and interacting with high-dimensional neural activity. J. Neural Eng. 10(6), 066012 (2013). https://doi.org/10.1088/1741-2560/10/6/066012
Harandi, M., Salzmann, M., Hartley, R.: Dimensionality reduction on SPD manifolds: the emergence of geometry-aware methods. IEEE Trans. Pattern Anal. Mach. Intell. 40(1), 48–62 (2018). https://doi.org/10.1109/TPAMI.2017.2655048
Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
Zöhrer, M., Pernkopf, F.: General stochastic networks for classification. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 2015–2023. Curran Associates, Inc. (2014). http://dl.acm.org/citation.cfm?id=2969033.2969052
Acknowledgment
This work was supported by Doctorados Nacionales 2017 - Conv 785 and the research project 111974454838, both funded by COLCIENCIAS.