Abstract
In recent years, the machine learning community has witnessed a tremendous growth in the development of kernel-based learning algorithms. However, the performance of this class of algorithms greatly depends on the choice of the kernel function. Kernel function implicitly represents the inner product between a pair of points of a dataset in a higher dimensional space. This inner product amounts to the similarity between points and provides a solid foundation for nonlinear analysis in kernel-based learning algorithms. The most important challenge in kernel-based learning is the selection of an appropriate kernel for a given dataset. To remedy this problem, algorithms to learn the kernel have recently been proposed. These methods formulate a learning algorithm that finds an optimal kernel for a given dataset. In this paper, we present an overview of these algorithms and provide a comparison of various approaches to find an optimal kernel. Furthermore, a list of pivotal issues that lead to efficient design of such algorithms will be presented.
Similar content being viewed by others
References
Abbasnejad ME, Ramachandram D, Mandava R (2009) Optimizing kernel functions using transfer learning from unlabeled data. In: Machine Vision, 2009. ICMV ’09. Second international conference on, pp 111–117. doi:10.1109/ICMV.2009.10
Abbasnejad ME, Ramachandram D, Mandava R (2010) An unsupervised approach to learn the kernel functions: from global influence to local similarity. Neural Comput Appl. doi:10.1007/s00521-010-0411-7
Adankon MM, Cheriet M (2007) Optimizing resources in model selection for support vector machine. Pattern Recognit 40(3): 953–963. doi:10.1016/j.patcog.2006.06.012
Amari S, Wu S (1999) Improving support vector machine classifiers by modifying kernal functions. Neural Netw 12: 783–789. doi:10.1016/S0893-6080(99)00032-5
Argyriou A, Micchelli CA, Pontil M (2005) Learning convex combinations of continuously parameterized basic kernels. In: Auer P, Meir R (eds) Learning theory, Lecture Notes in Computer Science, vol 3559.. Springer, Heidelberg, pp 338–352. doi:10.1007/11503415_23
Argyriou A, Hauser R, Micchelli CA, Pontil M (2006) A dc-programming algorithm for kernel selection. In: ICML ’06: Proceedings of the 23rd international conference on machine learning, ACM, New York, NY, USA, pp 41–48. doi:10.1145/1143844.1143850
Argyriou A, Evgeniou T, Pontil M, Argyriou A, Evgeniou T, Pontil M (2007) Multi-task feature learning. In: Schölkopf B, Platt J, Hoffman T (eds) Advances in neural information processing systems, vol 19.. MIT Press, Cambridge, MA, pp 41–48
Aronszajn N (1950) Theory of reproducing kernels. Trans Ame Math Soc 68: 337–404
Ayat N, Cheriet M, Suen C (2005) Automatic model selection for the optimization of svm kernels. Pattern Recognit 38(10): 1733–1745. doi:10.1016/j.patcog.2005.03.011
Bach FR, Lanckriet GRG, Jordan MI (2004) Multiple kernel learning, conic duality, and the smo algorithm. In: ICML ’04: Proceedings of the twenty-first international conference on machine learning, ACM, New York, NY, USA. doi:10.1145/1015330.1015424
Ben-Hur A, Horn D, Siegelmann HT, Vapnik V (2002) Support vector clustering. J Mach Learn Res 2: 125–137
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, New York
Burges CJC (1999) Geometry and invariance in kernel based methods. MIT Press, Cambridge, pp 89–116
Caruana R (1997) Multitask learning. Mach Learn 28(1): 41–75
Chen B, Liu H, Bao Z (2008) A kernel optimization method based on the localized kernel fisher criterion. Pattern Recognit 41(3): 1098–1109. doi:10.1016/j.patcog.2007.08.009
Collins M, Schapire RE, Singer Y (2002) Logistic regression, adaboost and bregman distances. Mach Learn 48(1–3): 253–285
Cortes C, Mohri M, Rostamizadeh A (2009) Learning non-linear combinations of kernels. In: Bengio Y, Schuurmans D, Lafferty J, Williams CKI, Culotta A (eds) NIPS: advances in neural information processing systems, vol 22, pp 396–404
Cristianini N, Shawe-taylor J, Elissee A, Kandola J (2002) On kernel-target alignment. In: Advances in neural information processing systems, vol 14, MIT Press, pp 367–373
Cristianini N, Kandola J, Elisseeff A, Shawe-Taylor J (2003) On optimizing kernel alignment. Tech. rep., UC Davis Department of Statistics
Davis JV, Kulis B, Jain P, Sra S, Dhillon IS (2007) Information-theoretic metric learning. In: ICML ’07: Proceedings of the 24th international conference on machine learning, ACM, New York, NY, USA, pp 209–216. doi:10.1145/1273496.1273523
Domeniconi C, Peng J, Yan B (2010) Composite kernels for semi-supervised clustering. Knowl Inf Syst, pp 1–18
Duan K, Keerthi SS, Poo AN (2003) Evaluation of simple performance measures for tuning svm hyperparameters. Neurocomputing 51: 41–59. doi:10.1016/S0925-2312(02)00601-X
Evgeniou T, Micchelli CA, Pontil M (2005) Learning multiple tasks with kernel methods. JMLR Org 6: 615–637
Florian A, Potra SJW (2000) Interior-point methods. J Comput Appl Math 124(1–2): 281–302. doi:10.1016/S0377-0427(00)00433-7
Freund RM, Mizuno S (1996) Interior point methods: current status and future directions. Working papers 3924-96., Massachusetts Institute of Technology (MIT), Sloan School of Management. http://ideas.repec.org/p/mit/sloanp/2634.html
Fung G, Rosales R, Rao RB (2008) Feature selection and kernel design via linear programming. In: IJCAI’07: Proceedings of the 20th international joint conference on artifical intelligence, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 786–791
Gang Wu EYC (2003) Adaptive feature-space conformal transformation for imbalanced-data learning. In: Proceedings of IEEE international conference on machine learning, pp 816–823
Gehler P, Nowozin S (2008) Infinite kernel learning. Tech. rep., Max Planck Institute For Biological Cybernetics, Tuebingen, Germany
Grant M, Boyd S (2008) Graph implementations for nonsmooth convex programs. Recent Adv Learn Control 371: 95–110. doi:10.1007/978-1-84800-155-8_7
Grant M, Boyd S (2009) Cvx: Matlab software for disciplined convex programming. web page and software. http://stanford.edu/~boyd/cvx
Herbster M, Pontil M, Wainer L (2005) Online learning over graphs. In: ICML ’05: Proceedings of the 22nd international conference on machine learning, ACM, New York, NY, USA, pp 305–312. doi:10.1145/1102351.1102390
Hertz T, Hillel AB, Weinshall D (2006) Learning a kernel function for classification with small training samples. In: ICML ’06: Proceedings of the 23rd international conference on machine learning, ACM, New York, NY, USA, pp 401–408. doi:10.1145/1143844.1143895
Hoi SCH, Jin R (2008) Active kernel learning. In: ICML ’08: Proceedings of the 25th international conference on machine learning, ACM, New York, NY, USA, pp 400–407. doi:10.1145/1390156.1390207
Hoi SCH, Lyu MR, Chang EY (2006) Learning the unified kernel machines for classification. In: KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, New York, NY, USA, pp 187–196. doi:10.1145/1150402.1150426
Hoi SCH, Jin R, Lyu MR (2007) Learning nonparametric kernel matrices from pairwise constraints. In: ICML ’07: Proceedings of the 24th international conference on machine learning, ACM, New York, NY, USA, pp 361–368. doi:10.1145/1273496.1273542
Horst R, Thoai NV (1999) Dc programming: overview. J Optim Theory Appl 103(1): 1–4310.1023/A:1021765131316
Huang K, Ying Y, Campbell C (2010) Generalized sparse metric learning with relative comparisons. Knowl Inf Syst
Jaakkola TS, Haussler D (1999) Exploiting generative models in discriminative classifiers. In: Proceedings of the 1998 conference on advances in neural information processing systems II, MIT Press, Cambridge, MA, USA, pp 487–493
Jebara T (2004) Multi-task feature and kernel selection for svms. In: ICML ’04: Proceedings of the twenty-first international conference on machine learning, ACM, New York, NY, USA. doi:10.1145/1015330.1015426
Jebara T, Kondor R, Howard A (2004) Probability product kernels. J Mach Learn Res 5: 819–844
Jieping Ye JC, Shuiwang Ji (2007) Multi-class discriminant kernel learning via convex programming. J Mach Learn Res 9: 719–758
Joseph KC, Keshet J, Singer Y (2002) Kernel design using boosting. In: Advances in Neural Information Processing Systems, vol 15, MIT Press, pp 537–544
Kandola J, Shawe-Taylor J (2002a) On the extensions of kernel alignment. Tech. Rep. NC-TR-2002-120, University of Southampton (United Kingdom). http://eprints.ecs.soton.ac.uk/9745/
Kandola J , Shawe-Taylor J (2002b) Optimizing kernel alignment over combinations of kernel. Tech. Rep. NC-TR-2002-121. http://eprints.ecs.soton.ac.uk/9746/
Kapoor A, Alan Qi Y, Ahn H, Picard RW (2005) Hyperparameter and kernel learning for graph based semi-supervised classification
Kim SJ, Magnani A, Boyd S (2006) Optimal kernel selection in kernel fisher discriminant analysis. In: ICML ’06: Proceedings of the 23rd international conference on machine learning, ACM, New York, NY, USA, pp 465–472. doi:10.1145/1143844.1143903
Kim SJ, Zymnis A, Magnani A, Koh K, Boyd S (2008) Learning the kernel via convex optimization. In: Acoustics, speech and signal processing, 2008. ICASSP 2008. IEEE international conference on, pp 1997–2000. doi:10.1109/ICASSP.2008.4518030
Kondor RI, Lafferty J (2002) Diffusion kernels on graphs and other discrete structures. In: Proceedings of the ICML ’02: Proceedings of the 23rd international conference on machine learning, pp 315–322
Kristin P, Bennett MJE Michinari M (2002) Mark: a boosting algorithm for heterogeneous kernel models. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, pp 24–31. doi:10.1145/775047.775051
Kulis B, Sustik M, Dhillon I (2006) Learning low-rank kernel matrices. In: ICML ’06: Proceedings of the 23rd international conference on machine learning, ACM, New York, NY, USA, pp 505–512. doi:10.1145/1143844.1143908
Kulis B, Sustik MA, Dhillon IS (2009) Low-rank kernel learning with bregman matrix divergences. J Mach Learn Res 10: 341–376
Lanckriet G, Deng M, Cristianini N, Jordan MI, Noble WS (2004a) Kernel-based data fusion and its application to protein function prediction in yeast. Pac Symp Biocomput, vol 11
Lanckriet GRG, Cristianini N, Bartlett P, Ghaoui LE, Jordan MI (2004) Learning the kernel matrix with semidefinite programming. J Mach Learn Res 5: 27–72
Li F, Fu Y, Dai YH, Sminchisescu C, Jue W (2009) Kernel learning by unconstrained optimization. J Mach Learn Res 5: 328–335
Lu J, Plataniotis KN, Venetsanopoulos AN (2005) Kernel discriminant learning with application to face recognition. In: Wang L (ed) Support vector machines: theory and applications, Springer, Heidelberg, Studies in Fuzziness and Soft Computing, pp 275–296. doi:10.1007/10984697_13
MacKay JCD (1997) Introduction to gaussian processes. NATO ASI Ser F: Comput Syst Sci 168: 33–165
Mercer J (1909) Functions of positive and negative type and their connection with the theory of integral equations. Philos Trans R Soc 209: 415–446. doi:10.1098%2Frsta.1909.0016
Micchelli CA, Pontil M (2005) Learning the kernel function via regularization. J Mach Learn Res 6: 1099–1125
Micchelli CA, Pontil M (2007) Feature space perspectives for learning the kernel. Mach Learn 66(2–3): 297–319. doi:10.1007/s10994-006-0679-0
Mika S, Ratsch G, Weston J, Scholkopf B, Mullers K (1999) Fisher discriminant analysis with kernels pp 41–48. doi:10.1109/NNSP.1999.788121
Moreno PJ, Ho P, Vasconcelos N (2003) A kullback-leibler divergence based kernel for svm classification in multimedia applications. In: Thrun S, Saul LK, Schölkopf B, Thrun S, Saul LK, Schölkopf B (eds) NIPS, MIT Press. http://dblp.uni-trier.de/rec/bibtex/conf/nips/MorenoHV03
Nguyen CH, Ho TB (2008) An efficient kernel matrix evaluation measure. Pattern Recognit 41(11): 3366–3372. doi:10.1016/j.patcog.2008.04.005
Platt JC (1999a) Fast training of support vector machines using sequential minimal optimization. Adv Kernel Methods Support Vector Learn, pp 185–208
Platt JC (1999b) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classif, pp 61–74
Raina R, Battle A, Lee H (2007) Self-taught learning: transfer learning from unlabeled data. 24th International conference on machine learning
Rakotomamonjy A, Bach F, Canu S, Grandvalet Y (2007) More efficiency in multiple kernel learning. In: ICML ’07: Proceedings of the 24th international conference on machine learning, ACM, New York, NY, USA, pp 775–782. doi:10.1145/1273496.1273594
Rakotomamonjy A, Bach FR, Canu S, Grandvalet Y (2008) Simplemkl. J Mach Learn Res 9: 2491–2521
Rasmussen CE (2003) Gaussian processes in machine learning 3176:63–71 doi:10.1007/978-3-540-28650-9_4
Rückert U, Kramer S (2008) Kernel-based inductive transfer. In: Machine learning and knowledge discovery in databases, Springer, Berlin, Heidelberg, pp 220–233. doi:10.1007/978-3-540-87481-2_15
Saunders C, Gammerman A, Vovk V (1998) Ridge regression learning algorithm in dual variables. In: ICML ’98: Proceedings of the fifteenth international conference on machine learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 515–521
Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. J Mach Learn 37(3): 297–336. doi:10.1023/A:1007614523901
Scholkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
Scholkopf B, Smola A, Muller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comp 10(5):1299–1319 http://neco.mitpress.org/cgi/content/abstract/10/5/1299
Shaw B, Jebara T (2009) Structure preserving embedding. In: ICML ’09: Proceedings of the 26th annual international conference on machine learning, ACM, New York, NY, USA, pp 937–944. doi:10.1145/1553374.1553494
Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press, New York
Sindhwani V, Niyogi P, Belkin M (2005) Beyond the point cloud: from transductive to semi-supervised learning. In: ICML ’05: Proceedings of the 22nd international conference on machine learning, ACM, New York, NY, USA, pp 824–831. doi:10.1145/1102351.1102455
Smola A, Hofmann T, Scholkopf B (2007) Kernel methods in machine learning. Annal Stat 36:1171–1220. http://eprints.pascal-network.org/archive/00003984/
Szafranski M, Grandvalet Y, Rakotomamonjy A (2008) Composite kernel learning. In: ICML ’08: Proceedings of the 25th international conference on machine learning, ACM, New York, NY, USA, pp 1040–1047. doi:10.1145/1390156.1390287
Tsuda K, Noble WS (2004) Learning kernels from biological networks by maximizing entropy. Bioinformatics 20(1): 326–333. doi:10.1093/bioinformatics/bth906
Tsuda K, Kin T, Asai K (2002) Marginalized kernels for biological sequences. Bioinformatics (Oxford, England) 19: 2149–2156. doi:10.1093/bioinformatics/18.suppl_1.S268
Vapnik V (1999) An overview of statistical learning theory. IEEE Trans Neural Netw 10(5): 988–999. doi:10.1109/72.788640
Wang G, Yeung DY, Lochovsky FH (2007) A kernel path algorithm for support vector machines. In: ICML ’07: Proceedings of the 24th international conference on machine learning, ACM, New York, NY, USA, pp 951–958. doi:10.1145/1273496.1273616
Wang J, Lu H, Plataniotis K, Lu J (2009) Gaussian kernel optimization for pattern classification. Pattern Recognit 42(7): 1237–1247. doi:10.1016/j.patcog.2008.11.024
Weinberger KQ, Sha F, Saul LK (2004) Learning a kernel matrix for nonlinear dimensionality reduction. In: ICML ’04: Proceedings of the twenty-first international conference on machine learning, ACM, New York, NY, USA
Weinberger KQ, Packer BD, Saul LK (2005) Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. In: Proceedings of the tenth international workshop on artificial intelligence and statistics, pp 381–388
Williams CKI, Barber D (1998) Bayesian classification with gaussian processes. IEEE Trans Pattern Anal Mach Intell 20(12): 1342–1351. doi:10.1109/34.735807
Williams P, Li S, Feng J, Wu S (2007) A geometrical method to improve performance of the support vector machine. IEEE Trans Neural Netw 18(3): 942–947. doi:10.1109/TNN.2007.891625
Wu S, Amari SI (2002) Conformal transformation of kernel functions: a data-dependent way to improve support vector machine classifiers. Neural Process Lett 15(1): 59–67. doi:10.1023/A:1013848912046
Xiong H, Swamy M, Ahmad M (2005) Optimizing the kernel in the empirical feature space. IEEE Trans Neural Netw 16(2): 460–474. doi:10.1109/TNN.2004.841784
Xu Z, Zhu J, Lyu MR, King I (2007) Maximum margin based semi-supervised spectral kernel learning. In: Proceedings of international joint conference on neural networks, Orlando, Florida, USA
Xu Z, Jin R, Yang H, King I, Lyu MR (2010) Simple and efficient multiple kernel learning by group lasso. In: ICML ’10: Proceedings of the 27th international conference on machine learning, ACM
Yang J, Cheung WK, Chen X (2008) Learning element similarity matrix for semi-structured document analysis. Knowl Inf Syst
Yang L, Jin R (May, 2006) Distance metric learning: a comprehensive survey. Department of Computer Science and Engineering, Michigan State University
Yeung DY, Chang H, Dai G (2007) Learning the kernel matrix by maximizing a kfd-based class separability criterion. Pattern Recognit 40(7): 2021–2028. doi:10.1016/j.patcog.2006.12.031
Yeung DY, Chang H, Dai G (2008) A scalable kernel-based semisupervised metric learning algorithm with out-of-sample generalization ability. Neural Comput 20(11): 2839–2861. doi:10.1162/neco.2008.05-07-528
Yuan M, Yuan M, Lin Y, Lin Y (2006) Model selection and estimation in regression with grouped variables. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.79.2062
Zhang Z, Yeung DY, Kwok JT (2004) Bayesian inference for transductive learning of kernel matrix using the tanner-wong data augmentation algorithm. In: Brodley CE (ed) ICML, machine learning, proceedings of the twenty-first international conference (ICML 2004), Banff, Alberta, Canada, July 4–8, 2004, ACM, ACM International Conference Proceeding Series, vol 69
Zhang Z, Kwok JT, Yeung DY (2006) Model-based transductive learning of the kernel matrix. Mach Learn 63(1): 69–101
Zhang Z, Yeung DY, , Kwok JT, Kwok JT (2007) Probabilistic kernel matrix learning with a mixture model of kernels
Zhengdong Lu PJ, Dhillon I (2009) Geometry-aware metric learning. In: ICML ’09: Proceedings of the 26nd international conference on machine learning, ACM
Zhu X, Kandola J, Ghahramani Z, Lafferty J (2005) Nonparametric transforms of graph kernels for semi-supervised learning. In: Nonparametric transforms of graph kernels for semi-supervised learning, no. 17 in advances in neural information processing systems
Zhuang J, Tsang IW, Hoi SCH (2009) Simplenpkl: simple non-parametric kernel learning. In: ICML ’09: Proceedings of the 26th annual international conference on machine learning, ACM, New York, NY, USA, pp 1273–1280. doi:10.1145/1553374.1553537
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Abbasnejad, M.E., Ramachandram, D. & Mandava, R. A survey of the state of the art in learning the kernels. Knowl Inf Syst 31, 193–221 (2012). https://doi.org/10.1007/s10115-011-0404-6
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-011-0404-6