
Scalable training of L1-regularized log-linear models

Published: 20 June 2007
DOI: 10.1145/1273496.1273501

Abstract

The L-BFGS limited-memory quasi-Newton method is the algorithm of choice for optimizing the parameters of large-scale log-linear models with L2 regularization, but it cannot be used for an L1-regularized loss, because that objective is non-differentiable wherever some parameter is zero. Efficient algorithms have been proposed for this task, but they are impractical when the number of parameters is very large. We present an algorithm, Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), based on L-BFGS, that can efficiently optimize the L1-regularized log-likelihood of log-linear models with millions of parameters. In our experiments on a parse reranking task, our algorithm was several orders of magnitude faster than an alternative algorithm, and substantially faster than L-BFGS on the analogous L2-regularized problem. We also present a proof that OWL-QN is guaranteed to converge to a globally optimal parameter vector.
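The two ideas the abstract alludes to are a pseudo-gradient that stands in for the gradient at points where the L1 term is non-differentiable, and a projection that confines each step to a single orthant, inside which the objective is differentiable. The Python sketch below is a rough illustration only: it substitutes plain steepest descent for the paper's L-BFGS direction, and every name in it (pseudo_gradient, owlqn_sketch, the demo data) is illustrative rather than taken from the paper or any library.

```python
import numpy as np

def pseudo_gradient(x, grad_loss, C):
    """Pseudo-gradient of f(x) = loss(x) + C * ||x||_1.

    Where x_i != 0 the L1 term is differentiable; at x_i == 0 take the
    one-sided partial derivative that points downhill, or 0 if neither does.
    """
    pg = np.where(x > 0, grad_loss + C,
                  np.where(x < 0, grad_loss - C, 0.0))
    zero = (x == 0)
    right = grad_loss[zero] + C   # right partial derivative at 0
    left = grad_loss[zero] - C    # left partial derivative at 0
    pg_zero = np.zeros(right.shape)
    pg_zero[right < 0] = right[right < 0]
    pg_zero[left > 0] = left[left > 0]
    pg[zero] = pg_zero
    return pg

def project(v, orthant):
    """Zero out coordinates of v that left the chosen orthant."""
    return np.where(np.sign(v) == np.sign(orthant), v, 0.0)

def owlqn_sketch(loss_and_grad, x0, C, step0=1.0, max_iter=500, tol=1e-8):
    """Skeleton of OWL-QN with a steepest-descent direction (no L-BFGS)."""
    x = x0.copy()
    for _ in range(max_iter):
        loss, g = loss_and_grad(x)
        pg = pseudo_gradient(x, g, C)
        if np.linalg.norm(pg, np.inf) < tol:
            break
        fx = loss + C * np.abs(x).sum()
        # Orthant for this step: sign of x, or sign of -pg where x is 0.
        orthant = np.where(x != 0, np.sign(x), -np.sign(pg))
        # Backtracking line search along -pg, projecting each trial point
        # back onto the orthant: coordinates that cross zero stick at zero.
        step = step0
        while step > 1e-12:
            x_new = project(x - step * pg, orthant)
            if loss_and_grad(x_new)[0] + C * np.abs(x_new).sum() < fx:
                x = x_new
                break
            step *= 0.5
        else:
            break  # no improving step found; stop
    return x

# Tiny L1-regularized least-squares demo; A, b, and the true weights
# are made up for illustration.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
b = A @ np.array([2.0, -3.0] + [0.0] * 8) + 0.01 * rng.normal(size=50)

def loss_and_grad(x):
    r = A @ x - b
    return 0.5 * r @ r, A.T @ r

x_hat = owlqn_sketch(loss_and_grad, np.zeros(10), C=1.0)
print(np.round(x_hat, 3))  # most coordinates come out exactly 0
```

In the published algorithm the search direction instead comes from the L-BFGS two-loop recursion applied to the pseudo-gradient, with the curvature information built from the loss alone; that is what makes the method practical at millions of parameters.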




      Published In

      ICML '07: Proceedings of the 24th international conference on Machine learning
      June 2007
      1233 pages
ISBN: 9781595937933
DOI: 10.1145/1273496

      Sponsors

      • Machine Learning Journal

      Publisher

      Association for Computing Machinery

      New York, NY, United States



Conference

ICML '07 & ILP '07

      Acceptance Rates

      Overall Acceptance Rate 140 of 548 submissions, 26%


