Learning Hierarchical Invariant Spatio-Temporal Features For Action Recognition With Independent Subspace Analysis
Abstract
1. Introduction
Common approaches in visual recognition rely on hand-designed features such as SIFT [24, 25] and HOG [4]. A weakness of such approaches is that it is difficult and time-consuming to extend these features to other sensor modalities, such as laser scans, text or even videos. There is a
growing interest in unsupervised feature learning methods
such as Sparse Coding [31, 21, 34], Deep Belief Nets [7]
and Stacked Autoencoders [2] because they learn features
directly from data and consequently are more generalizable.
In this paper, we provide further evidence that unsupervised learning not only generalizes to different domains
but also achieves impressive performance on many realistic
video datasets. At the heart of our algorithm is the use of Independent Subspace Analysis (ISA), an extension of Independent Component Analysis (ICA).
2. Previous work
layer, by solving:

$$
\begin{aligned}
\underset{W}{\text{minimize}} \quad & \sum_{t=1}^{T} \sum_{i=1}^{m} p_i(x^{t}; W, V) \\
\text{subject to} \quad & W W^{T} = I
\end{aligned}
\qquad (1)
$$
In many experiments, we found that this invariant property makes ISA perform much better than other simpler
methods such as ICA and sparse coding.
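The subspace activations $p_i$ used in Eq. 1 are defined earlier in the paper (not included in this excerpt); the standard ISA form is a square-root pooling of squared first-layer filter responses within each subspace. A minimal NumPy sketch assuming that form (all variable names are illustrative):

```python
import numpy as np

def isa_activations(x, W, V):
    """Standard ISA activation: p_i(x; W, V) = sqrt(sum_j V_ij (W x)_j^2).

    W: (k, n) first-layer filters; V: (m, k) fixed 0/1 matrix grouping
    filters into m subspaces. Returns a length-m vector of activations.
    """
    linear = W @ x                      # first-layer filter responses, shape (k,)
    return np.sqrt(V @ (linear ** 2))   # pool squared responses within each subspace

# Toy example: 2 subspaces of 2 filters each over a 4-dimensional input.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
V = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
x = rng.standard_normal(4)
p = isa_activations(x, W, V)
```

The square-root pooling is what produces the invariance discussed above: flipping the sign of the input (or of any filter response within a subspace) leaves the activations unchanged.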
Our method is trained by batch projected gradient descent. Compared to other feature learning methods (e.g.,
RBMs [7]), the gradient of the objective function in Eq. 1 is
tractable.
The orthonormal constraint is ensured by projection with symmetric orthogonalization [10]. In detail, during optimization, projected gradient descent requires us to project $W$ back onto the constraint set after each gradient step, by setting $W \leftarrow (W W^{T})^{-1/2} W$.
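The projection step above can be sketched as follows; this is a minimal NumPy implementation of symmetric orthogonalization, $W \leftarrow (W W^{T})^{-1/2} W$, with illustrative shapes (not the paper's actual configuration):

```python
import numpy as np

def symmetric_orthogonalize(W):
    """Project W onto the constraint set {W : W W^T = I}.

    Computes (W W^T)^(-1/2) W via an eigendecomposition of the
    symmetric positive-definite matrix W W^T.
    """
    M = W @ W.T
    eigvals, eigvecs = np.linalg.eigh(M)
    inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return inv_sqrt @ W

# Example: project a random 3x8 filter matrix after a gradient step.
rng = np.random.default_rng(1)
W = rng.standard_normal((3, 8))
W_proj = symmetric_orthogonalize(W)
```

After the projection, `W_proj @ W_proj.T` is the identity, so the constraint of Eq. 1 holds exactly at every iterate.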
Figure 6. Examples of three ISA features learned from Hollywood2 data (16×16 spatial size). In this figure, each row consists of two sets of filters. Each set of filters is a filter in 3D (i.e., a row in matrix W), and the two sets are grouped together to form an ISA feature.
5. Experiments
In this section we numerically compare our algorithm against current state-of-the-art action recognition algorithms. We emphasize that our method uses a pipeline identical to the one described in [42]: local features are extracted, vector-quantized by K-means, and classified with a χ²-kernel SVM. With our method, the only change is the feature extraction stage: we replace hand-designed features with the learned features. Results of control experiments, such as speed, benefits of the second layer, and training features on unrelated data [33], are also reported. Further results, detailed comparisons and parameter settings can be found in the Appendix (http://ai.stanford.edu/wzou/).
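The classification stage of that pipeline compares bag-of-features histograms with a χ² kernel. A minimal NumPy sketch of one common exponentiated convention, $K(h, g) = \exp(-\gamma \sum_k (h_k - g_k)^2 / (h_k + g_k))$; the exact normalization in [42] may differ:

```python
import numpy as np

def chi2_kernel(H, G, gamma=1.0, eps=1e-10):
    """Exponentiated chi-square kernel between rows of H and rows of G.

    H: (n, d), G: (m, d) nonnegative histograms (e.g., K-means codeword
    frequencies). Returns the (n, m) matrix K[i, j] = exp(-gamma * chi2).
    """
    diff = H[:, None, :] - G[None, :, :]
    summ = H[:, None, :] + G[None, :, :] + eps   # eps avoids division by zero
    d2 = np.sum(diff ** 2 / summ, axis=2)        # chi-square distance
    return np.exp(-gamma * d2)

# Example: two normalized 3-bin histograms.
h1 = np.array([[0.2, 0.5, 0.3]])
h2 = np.array([[0.4, 0.4, 0.2]])
K = chi2_kernel(np.vstack([h1, h2]), np.vstack([h1, h2]))
```

Identical histograms give a kernel value of 1, and the matrix is symmetric, so it can be passed directly to a precomputed-kernel SVM.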
5.1. Datasets
We evaluate our algorithm on four well-known benchmark action recognition datasets: KTH [37], UCF sport actions [35], Hollywood2 [26] and YouTube action [23]. These datasets were obtained from the original authors' websites. The processing steps, dataset splits and metrics are identical to those described in [42] or [23]. The main purpose of using identical protocols is to isolate the contribution of the learned features.
5.3. Results
We report the performance of our method on the KTH dataset in Table 2. In this table, we compare our test set accuracy against the best results reported in the literature. More detailed results can be seen in [42] or [12]. We note that for this dataset, an interest point detector can be very useful because the background does not convey any meaningful information [42]. Therefore, we apply our norm-thresholding interest point detector to this dataset (see Section 3.5). Using this technique, our method achieves performance superior to all published results in the literature.
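The norm-thresholding detector itself is defined in Section 3.5, which is not included in this excerpt; the underlying idea of discarding low-activation locations can be sketched as follows, with the percentile threshold and function names being illustrative assumptions rather than the paper's exact procedure:

```python
import numpy as np

def norm_threshold_filter(features, percentile=50.0):
    """Keep only feature vectors whose L2 norm exceeds a percentile threshold.

    Low-norm responses tend to come from uninformative regions (e.g.,
    static background), so dropping them acts as a cheap interest point
    detector. features: (n, d) array of local descriptors.
    """
    norms = np.linalg.norm(features, axis=1)
    threshold = np.percentile(norms, percentile)
    return features[norms > threshold]

# Example: filter 100 random 16-dimensional descriptors.
rng = np.random.default_rng(2)
feats = rng.standard_normal((100, 16))
kept = norm_threshold_filter(feats)
```

With a 50th-percentile threshold, roughly half of the locations survive; the remaining feature extraction pipeline is then run only on those.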
There is an increase in performance between our method
A comparison of our method against the best published results for the Hollywood2 and UCF sport actions datasets is reported in Tables 3 and 4. In these experiments, we only consider dense sampling for our algorithm. As can be seen from the tables, our approach outperforms a wide range of methods.
5 Our model achieves 94.5% if we use the interest point detector to filter out the background, then run feature extraction more densely than described in [42].
6. Conclusion
In this paper, we presented a method that learns features from spatio-temporal data using independent subspace analysis. We scaled the algorithm up to large receptive fields by convolution and stacking, and learned hierarchical representations.
Experiments were carried out on the KTH, Hollywood2, UCF sports action and YouTube datasets using a very standard processing pipeline [42]. Using this pipeline, we observed that our simple method outperforms many state-of-the-art methods.
This result is interesting, given that our single method, using the same parameters across four datasets, is consistently better than a wide variety of combinations of methods. It also suggests that learning features directly from data is a very important research direction: not only is this approach more generalizable to many domains, it is also very powerful in recognition tasks.
References