
1 Introduction

The automatic recognition of human actions from video is an important topic within Computer Vision due to its potential to provide personalized support in several real-world applications, such as medicine, human-computer interaction and robotics, among others. Action recognition deals with the problem of assigning a predefined label to an input video and constitutes a challenging task for both Computer Vision and Machine Learning. This research topic has developed steadily over the last two decades [13], with a considerable improvement when using non-handcrafted features [18].

The best performance so far has been achieved by multi-stream approaches [3], specifically by two-stream CNNs [21], rendering obsolete the approaches based on handcrafted features [27]. During the last two years, several works have focused on 3D CNNs and transfer learning [2, 5]. 3D CNNs are able to capture spatio-temporal relations and constitute the most powerful tool for action recognition so far. In 2017, [2] set a new mark in the state of the art by transferring the knowledge from a 2D pre-trained model to 3D models, creating the most powerful two-stream CNN for action recognition (I3D).

Two-stream architectures are based on the late fusion of two CNNs that are trained independently over two different domains: appearance and motion. The appearance stream learns features from the RGB frames, while the motion stream learns features from the optical flow [30]. For the final fusion, a linear combination of the last layer of each stream is computed and transformed into a probability distribution by a softmax layer [2] (see Fig. 1).

Fig. 1.

General overview of two-stream CNNs. “Appearance” is a CNN that receives multiple RGB frames at once and produces the appearance outcome. “Motion” is another CNN that receives multiple optical-flow fields at once and produces the motion outcome. Both outcomes, appearance and motion, usually contribute equally to a linear combination. Finally, this combination is transformed into a probability distribution through a softmax layer.
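As an illustration of this late-fusion step, the minimal sketch below combines the last-layer scores of both streams with equal weights and applies a softmax. The arrays `rgb_scores` and `flow_scores` are hypothetical per-class outputs, not part of any released implementation.

```python
import numpy as np

def late_fusion(logits_rgb, logits_flow, w_rgb=0.5, w_flow=0.5):
    """Late fusion: linear combination of the last-layer scores of the two
    streams, followed by a softmax over the classes."""
    combined = w_rgb * np.asarray(logits_rgb) + w_flow * np.asarray(logits_flow)
    exp = np.exp(combined - combined.max())   # numerically stable softmax
    return exp / exp.sum()

# Hypothetical per-class scores for a 101-class problem
rgb_scores = np.random.randn(101)
flow_scores = np.random.randn(101)
probs = late_fusion(rgb_scores, flow_scores)
print(probs.argmax(), probs.max())
```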

The main drawback of two-stream CNNs lies in the fusion layer. When all streams perform reasonably well, a sophisticated combination is not needed since its contribution is minimal. However, when one of the streams performs poorly, the final result can be severely affected, yielding an outcome worse than that produced by the best of the streams. Moreover, the quality of the video data, the sampling rate and the interpolation methods used while sampling also affect the performance.

In this paper, we present a novel strategy that outperforms the state-of-the-art results obtained with two-stream CNNs [2] on the UCF101-1, UCF101-2 and UCF101-3 datasets [23]. In addition, we introduce a new motion descriptor that is easy to understand and fast to compute. This descriptor (CurlDiv) is composed of the curl and divergence of the optical flow and is used to replace the motion stream. Although the descriptor by itself does not perform as well as the optical flow, when it is combined with the appearance channel under our training scheme it outperforms the classical fusion. Given that the computation of the CurlDiv descriptor is faster than that of the optical flow, we also report a modification of [2] for early action recognition, based on a Bayesian approach.

The rest of the paper is organized as follows. Section 2 reviews related work on action recognition using deep CNNs. Section 3 explains how our CurlDiv motion representation is computed. The proposed scheme for training two-stream CNNs is presented in Sect. 4, while Sect. 5 describes the approach proposed for early action recognition. In Sect. 6 a set of experiments is presented to evaluate the performance of our method and to compare it with state-of-the-art methods. Finally, Sect. 7 draws conclusions and outlines future work.

2 Related Work

The accuracy of human action recognition has improved significantly during the last decade through the application of CNNs. The most prominent approaches adapt classical architectures from image recognition [12, 22] to represent spatio-temporal relations [6, 11, 14, 15]. These approaches differ in the way they process data: spatio-temporal convolutions (3D CNNs) [15, 26], Recurrent Neural Networks [7, 29] or multi-stream CNNs [2, 28]. The latter approach has also been used for action localization and video segmentation [19, 20].

Spatio-temporal convolutions, or 3D CNNs, model the video as a three-dimensional volume and apply a set of 3D convolutions at different levels of depth [10, 15]. Recurrent Neural Network approaches, on the other hand, model video data as a sequence of frames [1, 16]. Within this approach there are two common ways to represent the data: (1) using a classical image-classification CNN for feature extraction [17], where frames are preprocessed and transformed into a sequence of non-handcrafted features, and (2) extracting skeleton information [8, 16], where videos are transformed into sequences of human poses. The main drawback of (1) comes from action localization and motion in the background, while (2) is strongly affected by occlusions, video resolution and the target-camera distance; overall, (2) results in a more accurate approach than (1).

Multi-stream approaches constitute the most revolutionary approach to action recognition. As mentioned before, these approaches rely on multiple CNNs trained independently and merged in the testing phase. Among them, the best results have been achieved by those that transfer knowledge from image-processing domains [2, 3].

Two-stream architectures transform the video into appearance (purely RGB frames) and motion (optical flow) through a set of offline algorithms [30]. The RGB channel is usually sampled at 25 fps using a bilinear interpolation method. To obtain the best performance, it is recommended to generate the optical flow after video sampling [2]. Recently, the I3D architecture [2] emerged as a two-stream architecture with spatio-temporal convolutions (3D CNNs) and pooling operators, inflated from an image-classification CNN with spatial convolutions and pooling layers [25]. The contribution of [2] relies on transferring knowledge from 2D convolutions (image classification) to 3D convolutions (video classification). In 2018, the authors of [3] encoded the skeleton information into a set of RGB images and used a shallow CNN for classification. Although this architecture does not achieve the best performance, the authors report an improvement when it is fused with I3D.

All these architectures based on deep CNNs share the large amount of time required for training multiple channels as well as for preprocessing. Although our approach is based on the I3D architecture, it considerably reduces the training time for one of the streams without sacrificing accuracy. In addition, our motion representation is about six times faster to compute than the optical flow [30] and is more accurate than the skeleton-based approach proposed in [3].

3 CurlDiv Motion Representation

The motion channel is the most important channel of two-stream architectures. Usually, the optical flow [30] is used to estimate this motion. The OpenCV library provides implementations of these algorithms for both GPU and CPU. The problem with both implementations is the time they take to process the data. To reduce preprocessing times, we introduce a new representation for the motion channel based on the curl and divergence of the optical flow [24]. The curl is a scalar property of vector fields that describes how the field rotates around a point. The divergence, in turn, measures the density of the outward flux of a vector field from an infinitesimal boundary around a given point.

Let \(\overrightarrow{V}\) be a 2-dimensional vector field computed with the Gunnar Farnebäck algorithm [9], with scalar components P(x, y) and Q(x, y).

$$\begin{aligned} \overrightarrow{V}(x, y) = \begin{bmatrix} P(x,y) \\ Q(x,y) \end{bmatrix} \end{aligned}$$
(1)

The Curl of \(\overrightarrow{V}(x, y)\) can be defined as:

$$\begin{aligned} Curl(x, y) = \left| \left| \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right| \right| \end{aligned}$$
(2)

and the divergence as follows:

$$\begin{aligned} Div(x, y) = \frac{\partial P}{\partial x} + \frac{\partial Q}{\partial y} \end{aligned}$$
(3)

The partial derivatives of \(\overrightarrow{V}\) in Eqs. 2 and 3 are computed using a finite-difference scheme to reduce computation time. In this way, our CurlDiv motion descriptor takes the following form:

$$\begin{aligned} CurlDiv(x, y) = \begin{bmatrix} Curl(x,y) \\ Div(x,y) \end{bmatrix} \end{aligned}$$
(4)

Following the scheme proposed in [2] for motion data, we also rescale the CurlDiv descriptor to the range \([-1,1]\).
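As a concrete illustration of Eqs. 1–4, the sketch below computes the CurlDiv descriptor for a pair of consecutive grayscale frames using OpenCV's Farnebäck optical flow and NumPy finite differences. The Farnebäck parameters and the min-max rescaling to \([-1,1]\) are assumptions, since the exact settings are not listed here.

```python
import cv2
import numpy as np

def curl_div(prev_gray, next_gray):
    """CurlDiv descriptor (Eqs. 1-4) for two consecutive grayscale frames.
    The Farneback parameters below are illustrative defaults."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    P, Q = flow[..., 0], flow[..., 1]          # horizontal / vertical components

    # First-order finite differences of the flow field
    dQ_dx = np.gradient(Q, axis=1)
    dP_dy = np.gradient(P, axis=0)
    dP_dx = np.gradient(P, axis=1)
    dQ_dy = np.gradient(Q, axis=0)

    curl = np.abs(dQ_dx - dP_dy)               # Eq. 2
    div = dP_dx + dQ_dy                        # Eq. 3

    def rescale(channel):
        # Min-max rescaling to [-1, 1]; the exact scheme of [2] is assumed here
        lo, hi = channel.min(), channel.max()
        return 2.0 * (channel - lo) / (hi - lo + 1e-8) - 1.0

    return np.stack([rescale(curl), rescale(div)], axis=-1)   # Eq. 4
```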

4 Training Strategy and Fusion

Our training strategy follows the assumption that one of the streams performs poorly. Based on this, we first train the strongest stream (appearance) and detect the set of samples on which this stream does not perform well. The motion CNN is then trained on those samples. To train the motion channel on the CurlDiv representation we reuse the motion CNN from I3D. This training is conducted in two ways: (1) using all training data and (2) using the proposed strategy.

With the former strategy (1), the CurlDiv channel is trained over the entire dataset, without taking into account the quality of the other channel, the appearance. For this reason, its training time is the same as that consumed by the appearance channel. On the other hand, our motion representation does not improve on the optical flow, even though we used the same CNN. This is because we start training from a model pretrained on the optical flow in order to accelerate convergence. During experimentation we observed that the \(top_2\) accuracy of the appearance channel is above 99% on all the UCF101 datasets (see Table 1).

Table 1. \(top_2\) accuracy of the appearance channel of the I3D two-stream architecture, as well as the accuracy achieved by our motion channel and by the fusion of the two streams (RGB+CurlDiv).

Our second strategy (2) for training two-stream architectures is based on the assumption that one of the channels is better than the other. Taking this into account, we use a simple but effective heuristic proposed in [4] that helps to improve the training phase (Eq. 5).

$$\begin{aligned} max(softmax) - 2nd(softmax) \ge 2 * 2nd(softmax) \end{aligned}$$
(5)

where max(softmax) is the top value of the softmax layer and 2nd(softmax) is the second most probable element of the softmax layer. Following this criterion, the training data and training times for the second channel can be considerably reduced. In other words, the motion channel is only fine-tuned on those samples that do not fulfill the condition of Eq. 5.
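The condition of Eq. 5 can be evaluated directly on the softmax output of the appearance stream. The sketch below reflects our reading of the heuristic (variable names are hypothetical): it keeps for motion-channel fine-tuning only those training samples on which the appearance stream is not confident.

```python
import numpy as np

def is_confident(softmax_probs):
    """Eq. 5: the top probability must exceed the runner-up by at least
    twice the runner-up's value."""
    second, first = np.sort(np.asarray(softmax_probs))[-2:]
    return (first - second) >= 2.0 * second

# Hypothetical usage: keep only the samples where the RGB stream is unsure
appearance_probs = [np.random.dirichlet(np.ones(101)) for _ in range(4)]
hard_samples = [i for i, p in enumerate(appearance_probs) if not is_confident(p)]
```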

5 Bayesian Strategy for Early Action Recognition

An early response about the action that is taking place can be useful in real-world applications (e.g., video surveillance). In this section we show how the I3D (RGB + CurlDiv), retrained over a few frames, can be used as input to a Bayesian classifier in order to determine the action of a video at an early stage. The core idea of this method is to accumulate evidence about the observed activity while sliding windows are processed, and to make a decision as soon as the highest probability fulfills the condition in Eq. 5.

Taking the vector of probabilities P obtained from the softmax layer, we can apply a Bayesian approach to make predictions after observing only a few segments of the video.

$$\begin{aligned} P(C_{j} | f) = \frac{P(C_{j}) * P(f | C_{j})}{P(f)} \end{aligned}$$
(6)

In Eq. 6 the idea is to estimate the maximum a posteriori probability (MAP) of the class \(C_{j}\) given the video features. Notice that, since the denominator does not depend on the class, we can drop it and rewrite the equation as follows.

$$\begin{aligned} P(C_{j} | f) \propto P(C_{j}) * P(f | C_{j}) \end{aligned}$$
(7)

where f is the feature set extracted by the CNN, \(P(C_{j})\) is the class probability obtained from the previous segment and \(P(f | C_{j})\) is the probability estimated for the current segment. Since video data can be seen as a continuous sequence of sliding windows, we can use Eq. 7 to predict the action label. Following the condition proposed in Eq. 5, we can compute the most probable action for a given video and stop the processing. Figure 2 shows how the highest probability varies only between two classes across time; a sketch of the full procedure is given after the figure.

Fig. 2.

Sliding-window classification over two different videos of class 101. The y axis enumerates the sliding windows (32 frames) and the x axis enumerates the 101 classes of the dataset. The outcome of the softmax layer is visualized horizontally, according to the y index. Note how the highest probability tends to vary only between two classes. Black represents the highest value, while light blue represents the smallest value. (Best seen in color.) (Color figure online)
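A minimal sketch of the early-recognition procedure follows, combining the accumulation rule of Eq. 7 with the stopping condition of Eq. 5. The iterable `window_softmax_iter`, which yields one softmax vector per 32-frame sliding window from the retrained I3D (RGB + CurlDiv), is an assumed interface.

```python
import numpy as np

def early_recognition(window_softmax_iter, num_classes=101):
    """Accumulate evidence across sliding windows (Eq. 7) and stop as soon
    as the posterior satisfies the confidence condition of Eq. 5."""
    posterior = np.full(num_classes, 1.0 / num_classes)   # uniform prior
    for likelihood in window_softmax_iter:                # P(f | C) for one window
        posterior = posterior * likelihood                # Eq. 7 (unnormalized)
        posterior /= posterior.sum()                      # renormalize
        second, first = np.sort(posterior)[-2:]
        if (first - second) >= 2.0 * second:              # Eq. 5 stopping rule
            break                                         # early decision
    return int(posterior.argmax()), posterior
```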

6 Experiments and Results

The experiments in this paper were conducted on three datasets on which two-stream architectures have been tested: UCF101-1, UCF101-2 and UCF101-3 [23]. Each of these datasets contains 101 classes and 13,320 videos collected from the internet, divided into training and testing sets. The experiments presented in this section focus on measuring how much the training time can be reduced following our scheme, to what extent RGB+CurlDiv outperforms previous results, and how well a Bayesian strategy can perform for early action recognition.

Table 2. Hours (h) used by our architecture to train the I3D two-stream model following both schemes: the classical one and ours (CurlDiv). Hardware: Core i7, 32 GB RAM and an NVidia GeForce GTX 1080 with 11 GB of VRAM.

6.1 Preprocessing and Training Times

This experiment reports the times of the preprocessing and training phases. These two phases are crucial factors when deep-learning architectures are used. Our hardware took almost 1 h to extract all the frames of the 13,320 videos and 18 h to compute the optical flow. The CurlDiv took less than 3 h to preprocess the same amount of data, which represents a considerable reduction in time. Training the I3D architecture following the classical scheme took about 6 days for both channels, appearance and motion (3 days for each channel, on each dataset). Using our approach, we only spend 4 days training the two channels (3 days for RGB and 1 day for CurlDiv) (see Table 2). The considerable reduction in training time comes from applying the condition of Eq. 5 while the appearance channel is being trained.

Table 3. The RGB, FLOW and RGB+FLOW columns show the accuracy achieved by I3D without applying image smoothing. The RGB_s, FLOW_s and RGB_s+FLOW_s columns show the accuracy achieved by I3D after applying image smoothing.

6.2 Results Achieved by Our RGB+CurlDiv Scheme

We visually detected that noise and distortion are widely present in the datasets. For this reason we apply a smoothing (blurring) algorithm with a \(3 \times 3\) kernel to all frames in order to reduce borders and distortion. Table 3 shows how applying this smoothing step before preprocessing the data outperforms the state-of-the-art results.
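The exact smoothing filter is not named above, so the snippet below assumes a \(3 \times 3\) Gaussian blur from OpenCV applied to every frame before optical-flow and CurlDiv extraction; any \(3 \times 3\) smoothing kernel would fit the description.

```python
import cv2

def smooth_frame(frame_bgr):
    """Reduce noise and distortion with a 3x3 kernel before preprocessing.
    A Gaussian blur is assumed; only the kernel size is specified above."""
    return cv2.GaussianBlur(frame_bgr, (3, 3), 0)
```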

Taking into consideration the improvement produced by image smoothing, we compute the CurlDiv motion representation and retrain the model following the strategy proposed in Sect. 4. For testing we follow the same strategy used for training the appearance stream (guided by the condition in Eq. 5). First, we let the appearance channel decide. If this channel fulfills the condition, we take the selected class. If the condition is not satisfied, we apply the condition to the motion channel. If the motion channel satisfies it, we take its predicted class. If neither the appearance channel nor the motion channel fulfills the condition, the classical fusion of both channels is carried out (a sketch of this cascade is given after Table 4). The results achieved by our framework outperform the state-of-the-art accuracy and are shown in Table 4.

Table 4. Comparison with the state-of-the-art methods I3D [2] and PoTion [3]. UCF101* denotes the mean accuracy over the three datasets. The metric used is accuracy.
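The test-time cascade of Sect. 6.2 can be summarized as in the sketch below, which reflects our reading of the procedure: trust the appearance channel when Eq. 5 holds, otherwise the motion channel, and fall back to the classical late fusion when neither is confident. Equal fusion weights are assumed.

```python
import numpy as np

def cascade_predict(rgb_probs, motion_probs):
    """Test-time cascade: appearance decides first, then motion,
    otherwise the classical late fusion of both channels."""
    def confident(p):
        second, first = np.sort(p)[-2:]
        return (first - second) >= 2.0 * second           # Eq. 5

    if confident(rgb_probs):
        return int(np.argmax(rgb_probs))                  # appearance decides
    if confident(motion_probs):
        return int(np.argmax(motion_probs))               # motion decides
    fused = 0.5 * np.asarray(rgb_probs) + 0.5 * np.asarray(motion_probs)
    return int(np.argmax(fused))                          # classical fusion
```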

6.3 Impact of I3D in Early Action Recognition

Each dataset used in these experiments has a total of 7,352,018 frames distributed over 13,320 videos. The aim of these experiments is to quantify how many frames can be saved using the strategy proposed in Sect. 5 without sacrificing the performance of the overall method (see Table 5). To this end, we retrain the I3D with clips of 32 frames (the usual clip size is 64), randomly selected within the input videos, for 40 epochs. In this experiment we consider three different stopping conditions: [First], taking only the first 32 frames; [Bayes], applying our early-recognition framework; and [Total], spanning entire videos.

Table 5 shows that spanning entire videos [Total] is the worst case. This happens because activities occur in a small number of frames while videos are extremely long; as a consequence, the accumulated evidence for the true class becomes smaller than that of the remaining classes. The [First] strategy, on the other hand, behaves decently for short videos or videos where the action starts from the beginning. Finally, a more accurate and safer way is to let a high-level classifier [Bayes] decide the true class while the video is spanned. In general, it can be seen that our strategy increases the number of non-analyzed frames without sacrificing performance.

Table 5. [First]: this condition uses only the first 32 frames of each video. [Total]: with this condition the video is spanned entirely. [Bayes]: employs the proposed condition to stop spanning the video. f-saved is the number of non-analyzed frames for each strategy.

7 Conclusions and Future Work

In this paper we presented a novel strategy for training two-stream CNNs that considerably reduces training times. We showed how, by reducing noise and distortion in the input frames, we can improve the performance of classical two-stream CNNs, especially the I3D architecture. Our new representation for the motion channel is a weak classifier by itself, but when combined with the appearance channel it outperforms the state-of-the-art methods. The presented results show that the CurlDiv representation is easier and faster to compute than the commonly used optical flow [30]. Moreover, despite the reduction in training and preprocessing times, we reported the best results achieved so far on the UCF101 dataset. We also presented an approach based on the I3D architecture for early action recognition that avoids spanning entire videos without sacrificing accuracy. In future work we will introduce other scalar properties derived from the optical flow to encode the motion channel, and we will extend the architecture to multiple CNN streams (streams \(> 2\)).