A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

Published: 01 March 2010

Abstract

We extend the well-known BFGS quasi-Newton method and its memory-limited variant LBFGS to the optimization of nonsmooth convex objectives. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We prove that under some technical conditions, the resulting subBFGS algorithm is globally convergent in objective function value. We apply its memory-limited variant (subLBFGS) to L2-regularized risk minimization with the binary hinge loss. To extend our algorithm to the multiclass and multilabel settings, we develop a new, efficient, exact line search algorithm. We prove its worst-case time complexity bounds, and show that our line search can also be used to extend a recently developed bundle method to the multiclass and multilabel settings. We also apply the direction-finding component of our algorithm to L1-regularized risk minimization with logistic loss. In all these contexts our methods perform comparably to, or better than, specialized state-of-the-art solvers on a number of publicly available data sets. An open source implementation of our algorithms is freely available.
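
To make the line search concrete, here is a minimal sketch for the binary hinge loss case mentioned above. Along any search direction, the L2-regularized hinge objective is a convex piecewise-quadratic function of the step size, so its derivative is piecewise linear and non-decreasing: sorting the hinge breakpoints and scanning for the derivative's sign change locates the exact minimizer. This is an illustrative reconstruction under those assumptions, not the authors' released implementation; all names in it are ours, and the paper's multiclass/multilabel line search handles a more general piecewise structure.

    import numpy as np

    def exact_line_search_hinge(w, p, X, y, lam):
        # Exactly minimize, over eta >= 0,
        #   f(eta) = lam/2 ||w + eta*p||^2
        #            + (1/n) sum_i max(0, 1 - y_i * x_i.(w + eta*p)),
        # assuming lam > 0 and p != 0.  f is convex piecewise-quadratic
        # in eta, so f' is piecewise linear and non-decreasing: scan the
        # sorted hinge breakpoints for its zero crossing.
        n = len(y)
        a = 1.0 - y * (X @ w)      # margin residuals at eta = 0
        b = y * (X @ p)            # rate at which each residual shrinks
        c0 = lam * np.dot(w, p)    # regularizer's contribution to f'(0)
        c1 = lam * np.dot(p, p)    # regularizer's curvature (constant, > 0)

        # Hinges active just to the right of eta = 0.
        active = (a > 0) | ((a == 0) & (b < 0))
        slope = c0 - b[active].sum() / n       # right derivative f'(0+)

        # Hinge i switches on or off at eta = a_i/b_i (if positive);
        # either way f' jumps up by |b_i|/n there.
        safe_b = np.where(b != 0, b, 1.0)
        mask = (b != 0) & (a / safe_b > 0)
        etas, jumps = a[mask] / b[mask], np.abs(b[mask]) / n
        order = np.argsort(etas)

        eta_prev = 0.0
        for eta_k, jump in zip(etas[order], jumps[order]):
            if slope + c1 * (eta_k - eta_prev) >= 0:    # f' crosses 0 here
                return eta_prev + max(0.0, -slope) / c1
            slope += c1 * (eta_k - eta_prev) + jump     # step past breakpoint
            eta_prev = eta_k
        return eta_prev + max(0.0, -slope) / c1         # last, unbounded segment

Sorting the breakpoints dominates the cost, so this search takes O(n log n) time for n training examples, consistent with the kind of worst-case bounds the abstract refers to.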

Information & Contributors

Information

Published In

Publisher

JMLR.org

Publication History

Published: 01 March 2010
Published in JMLR Volume 11

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 135
  • Downloads (last 6 weeks): 13

Reflects downloads up to 24 Sep 2024.

Cited By

View all
  • (2022)A Variable Sample-Size Stochastic Quasi-Newton Method for Smooth and Nonsmooth Stochastic Convex OptimizationMathematics of Operations Research10.1287/moor.2021.114747:1(690-719)Online publication date: 1-Feb-2022
  • (2022)Spectral projected subgradient method for nonsmooth convex optimization problemsNumerical Algorithms10.1007/s11075-022-01419-393:1(347-365)Online publication date: 30-Sep-2022
  • (2021)One novel class of Bézier smooth semi-supervised support vector machines for classificationNeural Computing and Applications10.1007/s00521-021-05765-633:16(9975-9991)Online publication date: 1-Aug-2021
  • (2018)Training L1-regularized models with orthant-wise passive descent algorithmsProceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence10.5555/3504035.3504558(4268-4275)Online publication date: 2-Feb-2018
  • (2018)Function-on-Function Regression with Mode-Sparsity RegularizationACM Transactions on Knowledge Discovery from Data10.1145/317811312:3(1-23)Online publication date: 23-Mar-2018
  • (2017)CoCoAThe Journal of Machine Learning Research10.5555/3122009.329041518:1(8590-8638)Online publication date: 1-Jan-2017
  • (2017)A Stochastic Majorize-Minimize Subspace Algorithm for Online Penalized Least Squares EstimationIEEE Transactions on Signal Processing10.1109/TSP.2017.270926565:18(4770-4783)Online publication date: 15-Sep-2017
  • (2017)An overview on semi-supervised support vector machineNeural Computing and Applications10.1007/s00521-015-2113-728:5(969-978)Online publication date: 1-May-2017
  • (2016)Load Forecasting in a Smart Grid through Customer Behaviour Learning Using L1-Regularized Continuous Conditional Random FieldsProceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems10.5555/2936924.2937044(817-826)Online publication date: 9-May-2016
  • (2016)Quasi-newton method for LP multiple kernel learningNeurocomputing10.1016/j.neucom.2016.01.079194:C(218-226)Online publication date: 19-Jun-2016
  • Show More Cited By
