
lp-Norm Multiple Kernel Learning

Published: 01 July 2011

Abstract

Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this l1-norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We develop new insights into the connections between several existing MKL formulations and devise two efficient interleaved optimization strategies for arbitrary norms, that is, lp-norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse, and l∞-norm MKL in various scenarios. Importantly, empirical applications of lp-norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/.
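The optimization the abstract describes can be illustrated with a minimal wrapper-style sketch. Note this is only a sketch of the general idea, not the paper's interleaved algorithm, which folds the kernel-weight update into the SVM solver itself for speed. The function name `lp_mkl` and the setup are hypothetical; the closed-form weight update θ_m ∝ ||w_m||^(2/(p+1)), normalized so that ||θ||_p = 1, is the analytic step used in this family of methods, and scikit-learn is assumed for the SVM subproblem:

```python
import numpy as np
from sklearn.svm import SVC

def lp_mkl(kernels, y, p=2.0, C=1.0, n_iter=20):
    """Wrapper-style alternating optimization sketch for lp-norm MKL.

    kernels: list of (n, n) PSD kernel matrices on the training set
    y:       labels in {-1, +1}
    Returns the kernel weights theta (||theta||_p = 1) and the final SVM.
    """
    M = len(kernels)
    theta = np.full(M, M ** (-1.0 / p))  # uniform start with ||theta||_p = 1
    for _ in range(n_iter):
        # Step 1: solve the SVM for the current weighted kernel combination.
        K = sum(t * Km for t, Km in zip(theta, kernels))
        svm = SVC(C=C, kernel="precomputed").fit(K, y)
        alpha = np.zeros(len(y))
        alpha[svm.support_] = np.abs(svm.dual_coef_[0])
        ay = alpha * y
        # Step 2: squared block norms ||w_m||^2 = theta_m^2 * (ay)' K_m (ay).
        w2 = np.array([t ** 2 * ay @ Km @ ay
                       for t, Km in zip(theta, kernels)])
        # Step 3: closed-form update theta_m ~ ||w_m||^(2/(p+1)),
        # renormalized onto the lp-norm unit sphere.
        theta = w2 ** (1.0 / (p + 1))
        theta /= np.linalg.norm(theta, ord=p)
    return theta, svm
```

As p grows the update drives theta toward a uniform (l∞-style) combination, while p close to 1 recovers the sparse mixtures the abstract argues against; a wrapper loop like this converges but repeatedly retrains the SVM from scratch, which is the inefficiency the paper's interleaved strategies avoid.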




Published In

The Journal of Machine Learning Research, Volume 12 (2011), 3426 pages
ISSN: 1532-4435
EISSN: 1533-7928

Publisher

JMLR.org


Article Metrics

  • Downloads (last 12 months): 158
  • Downloads (last 6 weeks): 10

Reflects downloads up to 19 Nov 2024.

Cited By
  • (2024) Domain Adaptation With Interval-Valued Observations: Theory and Algorithms. IEEE Transactions on Fuzzy Systems 32(5):3107-3120. doi:10.1109/TFUZZ.2024.3367460
  • (2023) An active learning framework for multi-group mean estimation. Proceedings of the 37th International Conference on Neural Information Processing Systems, 32602-32635. doi:10.5555/3666122.3667537
  • (2023) Parallel Software for Million-scale Exact Kernel Regression. Proceedings of the 37th International Conference on Supercomputing, 313-323. doi:10.1145/3577193.3593737
  • (2023) Temporal Dynamic Concept Modeling Network for Explainable Video Event Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 19(6):1-22. doi:10.1145/3568312
  • (2023) One-Class Classification Using ℓp-Norm Multiple Kernel Fisher Null Approach. IEEE Transactions on Image Processing 32:1843-1856. doi:10.1109/TIP.2023.3255102
  • (2022) Personalized online federated learning with multiple kernels. Proceedings of the 36th International Conference on Neural Information Processing Systems, 33316-33329. doi:10.5555/3600270.3602684
  • (2022) Sample Weighted Multiple Kernel K-means via Min-Max optimization. Proceedings of the 30th ACM International Conference on Multimedia, 1679-1687. doi:10.1145/3503161.3547917
  • (2022) Multi-view Intact Discriminant Space Learning for Image Classification. Neural Processing Letters 50(2):1661-1685. doi:10.1007/s11063-018-9951-0
  • (2022) Kernel ensemble support vector machine with integrated loss in shared parameters space. Multimedia Tools and Applications 82(12):18077-18096. doi:10.1007/s11042-022-14226-8
  • (2022) An element-wise kernel learning framework. Applied Intelligence 53(8):9531-9547. doi:10.1007/s10489-022-04020-2
