
A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

Published: 01 March 2010

Abstract

We extend the well-known BFGS quasi-Newton method and its memory-limited variant LBFGS to the optimization of nonsmooth convex objectives. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We prove that under some technical conditions, the resulting subBFGS algorithm is globally convergent in objective function value. We apply its memory-limited variant (subLBFGS) to L2-regularized risk minimization with the binary hinge loss. To extend our algorithm to the multiclass and multilabel settings, we develop a new, efficient, exact line search algorithm. We prove its worst-case time complexity bounds, and show that our line search can also be used to extend a recently developed bundle method to the multiclass and multilabel settings. We also apply the direction-finding component of our algorithm to L1-regularized risk minimization with logistic loss. In all these contexts our methods perform comparably to or better than specialized state-of-the-art solvers on a number of publicly available data sets. An open source implementation of our algorithms is freely available.
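An exact line search for the L2-regularized binary hinge loss is possible because, along any search direction p, the one-dimensional objective is convex and piecewise quadratic: each example contributes a hinge that switches on or off at one breakpoint in the step size. The sketch below (ours, for illustration only; the function name and the candidate-enumeration strategy are assumptions, not the paper's optimized routine) finds the exact minimizer by checking every breakpoint and every per-segment quadratic minimizer:

```python
import numpy as np

def exact_line_search_hinge(w, p, X, y, lam):
    """Exactly minimize eta -> J(w + eta*p) over eta >= 0, where
    J(w) = lam/2 * ||w||^2 + mean(max(0, 1 - y * (X @ w))).
    The 1-D objective is convex piecewise quadratic, so its minimum lies
    either at a hinge breakpoint or at the stationary point of one segment."""
    n = X.shape[0]
    a = y * (X @ w)            # margins at eta = 0
    b = y * (X @ p)            # margin slopes along p
    wp, pp = w @ p, p @ p

    def J(eta):
        v = w + eta * p
        return 0.5 * lam * (v @ v) + np.mean(np.maximum(0.0, 1.0 - (a + eta * b)))

    # Breakpoints where hinge i switches (a_i + eta*b_i = 1), restricted to eta > 0.
    with np.errstate(divide='ignore', invalid='ignore'):
        t = (1.0 - a) / b
    bps = np.sort(np.unique(t[np.isfinite(t) & (t > 0)]))
    edges = np.concatenate(([0.0], bps))

    candidates = list(edges)
    for k in range(len(edges)):
        lo = edges[k]
        hi = edges[k + 1] if k + 1 < len(edges) else np.inf
        mid = lo + 0.5 * min(hi - lo, 1.0)   # a point strictly inside the segment
        active = (a + mid * b) < 1.0         # hinges with nonzero loss here
        # On this segment J'(eta) = lam*(w.p + eta*p.p) - (1/n) * sum(b[active]).
        eta_star = (np.sum(b[active]) / (n * lam) - wp) / pp
        if lo < eta_star < hi:
            candidates.append(eta_star)

    return min(candidates, key=J)
```

Because eta = 0 is always a candidate, the returned step never increases the objective; scanning segments in sorted breakpoint order is what makes an efficient implementation possible.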



Published In

The Journal of Machine Learning Research, Volume 11 (1 March 2010), 3637 pages
ISSN: 1532-4435; EISSN: 1533-7928

Publisher

JMLR.org


Article Metrics

  • Downloads (last 12 months): 152
  • Downloads (last 6 weeks): 12

Reflects downloads up to 23 Nov 2024

Cited By

  • (2022) A Variable Sample-Size Stochastic Quasi-Newton Method for Smooth and Nonsmooth Stochastic Convex Optimization. Mathematics of Operations Research 47(1):690-719. doi:10.1287/moor.2021.1147
  • (2022) Spectral projected subgradient method for nonsmooth convex optimization problems. Numerical Algorithms 93(1):347-365. doi:10.1007/s11075-022-01419-3
  • (2021) One novel class of Bézier smooth semi-supervised support vector machines for classification. Neural Computing and Applications 33(16):9975-9991. doi:10.1007/s00521-021-05765-6
  • (2018) Training L1-regularized models with orthant-wise passive descent algorithms. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 4268-4275. doi:10.5555/3504035.3504558
  • (2018) Function-on-Function Regression with Mode-Sparsity Regularization. ACM Transactions on Knowledge Discovery from Data 12(3):1-23. doi:10.1145/3178113
  • (2017) CoCoA. The Journal of Machine Learning Research 18(1):8590-8638. doi:10.5555/3122009.3290415
  • (2017) A Stochastic Majorize-Minimize Subspace Algorithm for Online Penalized Least Squares Estimation. IEEE Transactions on Signal Processing 65(18):4770-4783. doi:10.1109/TSP.2017.2709265
  • (2017) An overview on semi-supervised support vector machine. Neural Computing and Applications 28(5):969-978. doi:10.1007/s00521-015-2113-7
  • (2016) Load Forecasting in a Smart Grid through Customer Behaviour Learning Using L1-Regularized Continuous Conditional Random Fields. Proceedings of the International Conference on Autonomous Agents & Multiagent Systems, 817-826. doi:10.5555/2936924.2937044
  • (2016) Quasi-newton method for LP multiple kernel learning. Neurocomputing 194:218-226. doi:10.1016/j.neucom.2016.01.079
