Learning algorithms based on Stochastic Gradient approximations are known for their poor performance on optimization tasks and their extremely good performance on machine learning tasks (Bottou and Bousquet, 2008). Despite these proven capabilities, there were lingering concerns about the difficulty of setting the adaptation gains and achieving robust performance. Stochastic gradient algorithms have been historically associated with back-propagation algorithms in multilayer neural networks, which can be very challenging non-convex problems. Stochastic gradient algorithms are also notoriously hard to debug because they often appear to somehow work despite the bugs. When the results fall short, experimenters are then tempted to blame the algorithm itself rather than look for the bug.
Therefore it is useful to see how Stochastic Gradient Descent performs on simple linear and convex problems such as linear Support Vector Machines (SVMs) or Conditional Random Fields (CRFs). This page provides simple code examples illustrating the good properties of stochastic gradient descent algorithms; the provided source code values clarity over speed.
The source code can be downloaded with:

$ git clone http://leon.bottou.org/git/sgd.git

The files “README.txt” and “data/README.txt” explain how to compile the programs and how to populate the data directories “data/rcv1”, “data/pascal”, and “data/conll2000” with the benchmark datasets. The Pascal datasets must then be converted with the provided Python script:

$ cd data/pascal
$ ./convert.py -o alpha.txt alpha train
$ ./convert.py -o webspam.txt webspam train   # (optional)

Compile by typing “make” in the toplevel directory. The compilation requires the ZLib library. Most Linux distributions provide this by default, and things should be very similar under other Unix operating systems. Windows users should refer to “win/README.txt”.
The SVM programs are located in directory “sgd/svm”. The file “svm/README.txt” provides the full details and describes how to run the various experiments.
Program “svmsgd” implements a straightforward stochastic gradient descent algorithm. Points (1) and (4) follow (Shalev-Shwartz et al., 2007).
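To make this concrete, here is a minimal Python sketch of plain SGD for an L2-regularized linear SVM with hinge loss. The 1/(λ(t+t0)) learning-rate schedule, the t0 value, and the variable names are assumptions inspired by the Pegasos-style decay of (Shalev-Shwartz et al., 2007); this is not svmsgd's actual code.

```python
# Minimal sketch of plain SGD for an L2-regularized linear SVM (hinge loss).
# Hypothetical: the 1/(lambda*(t+t0)) schedule and t0 value are assumptions,
# not taken verbatim from svmsgd.
import numpy as np

def svm_sgd(X, y, lam=1e-4, epochs=5, t0=1e4):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            eta = 1.0 / (lam * (t + t0))        # decreasing learning rate
            margin = y[i] * (X[i] @ w + b)
            w *= 1.0 - eta * lam                # shrink: gradient of (lam/2)*||w||^2
            if margin < 1.0:                    # hinge loss is active at this example
                w += eta * y[i] * X[i]
                b += eta * y[i]
            t += 1
    return w, b
```

Real implementations usually represent w as a scale factor times a vector so that the shrink step costs O(1) rather than O(d); the sketch omits this refinement.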
Program “svmasgd” implements the averaged stochastic gradient descent algorithm. Averaging only starts after a certain number of epochs; use option “-avgstart” to change this. Points (1) and (4) follow (Xu, 2010).
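Below is a correspondingly minimal sketch of the averaging idea, assuming that averaging of the SGD iterates begins after a configurable number of examples (mirroring the role of the -avgstart option) and that a slower learning-rate decay is used, in the spirit of (Xu, 2010). It illustrates the technique, not svmasgd's exact code.

```python
# Sketch of averaged SGD (ASGD): run plain SGD but also keep a running
# average of the weights; the average is the returned predictor.
# Hypothetical: the eta schedule, its 0.75 exponent, eta0, and the default
# avg_start are assumptions, not svmasgd's actual code.
import numpy as np

def svm_asgd(X, y, lam=1e-6, epochs=5, eta0=1.0, avg_start=None):
    n, d = X.shape
    avg_start = n if avg_start is None else avg_start  # e.g. start after one epoch
    w, a = np.zeros(d), np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            eta = eta0 / (1.0 + lam * eta0 * t) ** 0.75  # slower decay suits averaging
            margin = y[i] * (X[i] @ w)
            w *= 1.0 - eta * lam                         # shrink from the L2 regularizer
            if margin < 1.0:                             # hinge loss update
                w += eta * y[i] * X[i]
            if t < avg_start:
                a = w.copy()                             # before averaging starts, track w
            else:
                a += (w - a) / (t - avg_start + 1)       # running average of the iterates
            t += 1
    return a
```

Note that the averaging adds only O(d) work per example and does not change the underlying SGD trajectory, which is consistent with the nearly identical svmsgd and svmasgd timings reported below.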
The benchmark task is the recognition of RCV1 documents belonging to the class CCAT.
Program “prep_rcv1” reads the RCV1-V2 token files from directory “data/rcv1” and computes the TF/IDF features on the basis of our training set. This program produces the two files “rcv1.train.bin.gz” and “rcv1.test.bin.gz” containing our training and testing sets. This short program also generates files in SvmLight format when compiled with option -DPREP_SVMLIGHT.
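The phrase “on the basis of our training set” matters: the document-frequency statistics behind the IDF weights are estimated from the training documents only and then applied to both splits. The Python sketch below illustrates that scheme; the exact weighting and normalization used by prep_rcv1 may differ.

```python
# Hedged sketch of TF-IDF features where the IDF statistics come from the
# training set only; the precise weighting and normalization are assumptions,
# not prep_rcv1's actual code.
import math
from collections import Counter

def idf_from_train(train_docs):
    """train_docs: list of token lists. Returns a term -> idf dictionary."""
    n = len(train_docs)
    df = Counter()
    for doc in train_docs:
        df.update(set(doc))                 # document frequency of each term
    return {t: math.log(n / df_t) for t, df_t in df.items()}

def tfidf(doc, idf):
    """Return an L2-normalized sparse {term: weight} vector for one document."""
    tf = Counter(doc)
    vec = {t: c * idf[t] for t, c in tf.items() if t in idf}  # unseen terms dropped
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}
```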
| Benchmark | Features | Training examples | Testing examples |
|---|---|---|---|
| RCV1 | 47152 | 781265 | 23149 |
| Algorithm (hinge loss, λ=1e-4) | Training Time* | Primal cost | Test Error |
|---|---|---|---|
| SMO (SVMLight) | ≃ 16000 secs ¹ | 0.2275 | 6.02% |
| Cutting Plane (SVMPerf) | ≃ 45 secs ² | 0.2278 | 6.03% |
| Hinge Loss SDCA (LibLinear -s 3 -B 1) | 2.5 secs | - | 6.02% |
| SGD (svmsgd) | < 1 sec | 0.2275 | 6.02% |
| ASGD (svmasgd) | < 1 sec | 0.2275 | 6.02% |

¹ Extrapolated from a 23642 seconds run on a 30% slower machine.
² Extrapolated from a 66 seconds run on a 30% slower machine.
* All timings exclude data loading time.
| Algorithm (log loss, λ=5e-7) | Training Time | Primal cost | Test Error |
|---|---|---|---|
| TRON (LibLinear -s 0 -B 1) | 33 secs | - | 5.14% |
| SDCA (LibLinear -s 7 -B 1) | 15 secs | - | 5.13% |
| SGD (svmsgd) | 4 secs | 0.1283 | 5.14% |
| ASGD (svmasgd) | 5 secs | 0.1281 | 5.13% |
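For reference, the “Primal cost” columns report the value of the objective being minimized. The sketch below assumes the usual regularized empirical risk, λ/2‖w‖² plus the mean loss; the programs' exact bookkeeping (for example, how the bias is handled) may differ slightly.

```python
# Sketch of the reported primal cost, assuming the standard objective
# (lam/2)*||w||^2 + mean(loss); this is an assumption about the columns
# above, not necessarily the programs' exact accounting.
import numpy as np

def primal_cost(w, b, X, y, lam, loss="hinge"):
    margins = y * (X @ w + b)
    if loss == "hinge":
        losses = np.maximum(0.0, 1.0 - margins)     # SVM hinge loss
    else:
        losses = np.logaddexp(0.0, -margins)        # logistic (log) loss
    return 0.5 * lam * (w @ w) + losses.mean()
```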
You must first convert the original Pascal files using the Python script convert.py as explained above. Program “prep_alpha” then reads the converted Pascal file “data/pascal/alpha.txt” and produces the two files “alpha.train.bin.gz” and “alpha.test.bin.gz” containing our training and testing sets. The alpha dataset illustrates how ASGD sometimes vastly outperforms plain SGD. The plot below shows how ASGD reaches near-optimal test performance only one pass after the activation of the averaging code (epoch 2).
| Benchmark | Features | Training examples | Testing examples |
|---|---|---|---|
| Alpha | 500 | 250000 | 250000 |
| Algorithm (log loss, λ=1e-6) | Training Time | Primal cost | Test Error |
|---|---|---|---|
| TRON (LibLinear -s 0 -B 1) | 115 secs | - | 21.86% |
| SDCA (LibLinear -s 7 -B 1) | 19 secs | - | 21.86% |
| SGD (svmsgd) | 51 secs | 0.4717 | 21.95% |
| ASGD (svmasgd) | 4 secs | 0.4713 | 21.86% |
The CRF programs “crfsgd” and “crfasgd” use the SGD and ASGD algorithms, respectively, to train a Conditional Random Field. The algorithms are set up exactly as in the SVM variants, but the implementation accounts for the greater structural complexity of conditional random fields. Both programs accept the same input files as the well-known CRF++ software by Taku Kudo: they use the same template files to specify the features (a small sketch of the template expansion appears below) and accept the same data files. They also produce the same tagging files, which can be analyzed using the CONLL Perl script “conlleval”. For convenience, “crfsgd” and “crfasgd” can also directly read gzipped data files. See the CRF++ documentation for details.
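As a concrete illustration of the template mechanism mentioned above, here is a hedged Python sketch of how a CRF++-style template line such as U02:%x[0,0]/%x[0,1] expands into a feature string at one position of a sentence. It follows the documented CRF++ template semantics; crfsgd's internal code is of course organized differently.

```python
# Sketch of CRF++-style template expansion: "%x[row,col]" picks the token
# attribute 'col' at relative position 'row'. Illustrative only; it mirrors
# the documented CRF++ semantics rather than crfsgd's actual implementation.
import re

def expand(template, sentence, pos):
    """sentence is a list of token rows, e.g. [["He", "PRP"], ["runs", "VBZ"]]."""
    def repl(match):
        row, col = int(match.group(1)), int(match.group(2))
        i = pos + row
        if 0 <= i < len(sentence):
            return sentence[i][col]
        return "_B%+d" % (i if i < 0 else i - len(sentence) + 1)  # boundary padding
    return re.sub(r"%x\[(-?\d+),(-?\d+)\]", repl, template)

sent = [["Confidence", "NN"], ["in", "IN"], ["the", "DT"], ["pound", "NN"]]
print(expand("U02:%x[0,0]/%x[0,1]", sent, 2))   # -> U02:the/DT
print(expand("U00:%x[-2,0]", sent, 1))          # -> U00:_B-1
```

Each expanded string becomes a binary feature tied to the current label (U templates) or to the current pair of adjacent labels (B templates).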
Our CRF benchmark is the CONLL2000 Text Chunking task, which consists of dividing a text into syntactically correlated segments (noun phrases, verb phrases, etc.). We use the feature template that comes with CRF++.
| Benchmark | Parameters | Training sentences | Training chunks | Testing sentences | Testing chunks |
|---|---|---|---|---|---|
| CONLL2000 | 1679700 | 8936 | 106978 | 2012 | 23852 |
| Algorithm (-f 3 -c 1) | Training Time | Primal cost | Test F1 score |
|---|---|---|---|
| LBFGS (CRF++) | ≃ 3000 secs ³ | 9042 | 93.74% |
| SGD (crfsgd) | 303 secs | 9096 | 93.73% |
| ASGD (crfasgd) | 134 secs | 9204 | 93.79% |

³ Extrapolated from a 4335 seconds run on a 30% slower machine.
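The F1 scores above are chunk-level scores as computed by conlleval: a predicted chunk counts as correct only when both its span and its type match a gold chunk exactly. The sketch below shows that metric for a single BIO-encoded tag sequence; it is an approximation of conlleval's behavior, which also handles a few additional tagging conventions.

```python
# Hedged sketch of chunk-level F1 in the conlleval style (BIO tags only).
def chunks(tags):
    """Extract (start, end, type) chunks from a BIO tag sequence."""
    spans, start, ctype = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes the last chunk
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != ctype):
            if start is not None:
                spans.append((start, i, ctype))     # close the open chunk
                start, ctype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, ctype = i, tag[2:]               # open a new chunk
    return set(spans)

def chunk_f1(gold, pred):
    g, p = chunks(gold), chunks(pred)
    correct = len(g & p)                            # exact span and type matches
    prec = correct / len(p) if p else 0.0
    rec = correct / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["B-NP", "I-NP", "O", "B-VP", "I-VP"]
pred = ["B-NP", "I-NP", "O", "B-VP", "O"]
print(round(chunk_f1(gold, pred), 3))               # 0.5: one of two gold chunks recovered
```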
| Version | Date | Comments |
|---|---|---|
| sgd-1.0.tar.gz | 2007-08-22 | Initial release. |
| sgd-1.1.tar.gz | 2007-10-04 | Relicensed under the LGPL. |
| sgd-1.2.tar.gz | 2008-04-07 | Bug fix in svm/reprocess.cpp. Thanks to Wei Xu. |
| sgd-1.3.tar.gz | 2009-01-31 | Added SSE2 code for dense vectors. |
| sgd-2.0.tar.gz | 2011-10-12 | Major release featuring sgd and asgd. |
| sgd-2.1.tar.gz | 2012-09-14 | Bug fix: istream::unget() does not clear the eof bit. |
| Latest | | git clone http://leon.bottou.org/git/sgd.git |
Rather than emphasizing performance or generality, this code favors readability. I am therefore glad to see that many authors of machine learning projects have found it useful, sometimes directly, sometimes as a source of inspiration. Here are a few links: