
1 Introduction

The automatic recognition of human actions from video is an important topic within Computer Vision due to its potential to provide personalized support in several real-world applications, such as medicine, human-computer interaction and robotics, among others. Action recognition deals with the problem of assigning a predefined label to an input video and constitutes a challenging task for both Computer Vision and Machine Learning. This research topic has developed steadily over the last two decades [13], with a considerable improvement when using non-handcrafted features [18].

The best performance so far has been achieved by multi-stream approaches [3], specifically by two-stream CNNs [21], rendering obsolete the approaches based on handcrafted features [27]. During the last two years, several works have focused on 3D CNNs and transfer learning [2, 5]. 3D CNNs are able to capture spatio-temporal relations and constitute the most powerful tool for action recognition so far. In 2017, [2] set a new mark in the state of the art by transferring the knowledge from a 2D pre-trained model to 3D models, creating the most powerful two-stream CNN for action recognition (I3D).

Two-stream architectures are based on the late fusion of two CNNs that are trained independently over two different domains: appearance and motion. The appearance stream learns features from the RGB frames, while the motion stream learns features from the optical flow [30]. For the final fusion, a linear combination of the last layer of each stream is computed and transformed into a probability distribution by a softmax layer [2] (see Fig. 1).

Fig. 1.

General overview of two-stream CNNs. “Appearance” is a CNN that receives multiple RGB frames at once and produces the appearance outcome. “Motion” is another CNN that receives multiple optical-flow fields at once and produces the motion outcome. Both outcomes, appearance and motion, usually contribute equally to a linear combination. Finally, this combination is transformed into a probability distribution through a softmax layer.
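As an illustration of this late-fusion step, the minimal sketch below combines the last-layer scores of both streams with equal weights and applies a softmax. The arrays `rgb_scores` and `flow_scores` are hypothetical per-class outputs, not part of any released implementation.

```python
import numpy as np

def late_fusion(logits_rgb, logits_flow, w_rgb=0.5, w_flow=0.5):
    """Late fusion: linear combination of the last-layer scores of the two
    streams, followed by a softmax over the classes."""
    combined = w_rgb * np.asarray(logits_rgb) + w_flow * np.asarray(logits_flow)
    exp = np.exp(combined - combined.max())   # numerically stable softmax
    return exp / exp.sum()

# Hypothetical per-class scores for a 101-class problem
rgb_scores = np.random.randn(101)
flow_scores = np.random.randn(101)
probs = late_fusion(rgb_scores, flow_scores)
print(probs.argmax(), probs.max())
```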

The main drawback of two-stream CNNs lies in the fusion layer. When all streams perform reasonably well, a sophisticated combination is not needed since its contribution is minimal. However, when one of the streams performs poorly, the final result can be severely affected, yielding an outcome worse than that produced by the best of the streams. Moreover, the quality of the video data, the sampling rate and the interpolation methods used while sampling also affect the performance.

In this paper, we present a novel strategy that outperforms the state-of-the-art results obtained with two-stream CNNs [2] on the UCF101-1, UCF101-2 and UCF101-3 datasets [23]. In addition, we introduce a new motion descriptor that is easy to understand and fast to compute. This descriptor (CurlDiv) is composed of the curl and divergence of the optical flow and is used to replace the motion stream. Although the descriptor by itself does not perform as well as the optical flow, when it is combined with the appearance channel under our training scheme it outperforms the classical fusion. Given that the computation of the CurlDiv descriptor is faster than that of the optical flow, we also report a modification of [2] for early action recognition, based on a Bayesian approach.

The rest of the paper is organized as follows. Section 2 reviews related work on action recognition using deep CNNs. Section 3 explains how our CurlDiv motion representation is computed. The proposed scheme for training two-stream CNNs is presented in Sect. 4, while Sect. 5 describes the approach proposed for early action recognition. In Sect. 6 a set of experiments is presented to evaluate the performance of our method and to compare it with state-of-the-art methods. Finally, Sect. 7 draws conclusions and outlines future work.

2 Related Work

The accuracy of human action recognition has improved significantly during the last decade through the application of CNNs. The most prominent approaches adapt classical architectures from image recognition [12, 22] to represent spatio-temporal relations [6, 11, 14, 15]. These approaches differ in the way they process data: spatio-temporal convolutions (3D CNNs) [15, 26], Recurrent Neural Networks [7, 29] or multi-stream CNNs [2, 28]. The latter approach has also been used for action localization and video segmentation [19, 20].

Spatio-temporal convolutions, or 3D CNNs, model the video as a three-dimensional volume and apply a set of 3D convolutions at different levels of depth [10, 15]. Recurrent Neural Network approaches, on the other hand, model video data as a sequence of frames [1, 16]. Within this approach there are two common ways to represent the data: (1) using a classical image-classification CNN for feature extraction [17], where frames are preprocessed and transformed into a sequence of non-handcrafted features, and (2) extracting skeleton information [8, 16], where videos are transformed into sequences of human poses. The main drawback of (1) comes from action localization and motion in the background, while (2) is strongly affected by occlusions, video resolution and the target-camera distance; overall, (2) results in a more accurate approach than (1).

Multi-stream approaches constitute the most revolutionary approach to action recognition. As mentioned before, these approaches rely on multiple CNNs trained independently and merged in the testing phase. Among them, the best results have been achieved by those that transfer knowledge from image-processing domains [2, 3].

Two-stream architectures transform the video into appearance (purely RGB frames) and motion (optical flow) through a set of offline algorithms [30]. The RGB channel is usually sampled at 25 fps using a bilinear interpolation method. To obtain the best performance, it is recommended to generate the optical flow after video sampling [2]. Recently, the I3D architecture [2] emerged as a two-stream architecture with spatio-temporal convolutions (3D CNNs) and pooling operators, inflated from an image-classification CNN with spatial convolutions and pooling layers [25]. The contribution of [2] relies on transferring knowledge from 2D convolutions (image classification) to 3D convolutions (video classification). In 2018, the authors of [3] encoded the skeleton information into a set of RGB images and used a shallow CNN for classification. Although this architecture does not achieve the best performance, the authors report an improvement when it is fused with I3D.

All these architectures based on deep CNNs share the large amount of time required for training multiple channels as well as for preprocessing. Although our approach is based on the I3D architecture, it considerably reduces the training time for one of the streams without sacrificing accuracy. In addition, our motion representation is about six times faster to compute than the optical flow [30] and is more accurate than the skeleton-based approach proposed in [3].

3 CurlDiv Motion Representation

The motion channel is the most important channel of two-stream architectures. Usually, the optical flow [30] is used to estimate this motion. The OpenCV library provides implementations of these algorithms for both GPU and CPU. The problem with both implementations is the time they take to process the data. To reduce preprocessing times, we introduce a new representation for the motion channel based on the curl and divergence of the optical flow [24]. The curl is a scalar property of vector fields that describes how the field rotates around a point. The divergence, in turn, measures the density of the outward flux of a vector field from an infinitesimal boundary around a given point.

Let \(\overrightarrow{V}\) be a 2-dimensional vector field computed with the Gunnar Farnebäck algorithm [9], with scalar components P(x, y) and Q(x, y).

$$\begin{aligned} \overrightarrow{V}(x, y) = \begin{bmatrix} P(x,y) \\ Q(x,y) \end{bmatrix} \end{aligned}$$
(1)

The Curl of \(\overrightarrow{V}(x, y)\) can be defined as:

$$\begin{aligned} Curl(x, y) = \left| \left| \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right| \right| \end{aligned}$$
(2)

and the divergence as follows:

$$\begin{aligned} Div(x, y) = \frac{\partial P}{\partial x} + \frac{\partial Q}{\partial y} \end{aligned}$$
(3)

The partial derivatives of \(\overrightarrow{V}\) in Eqs. 2 and 3 are computed using a finite-difference scheme to reduce computation time. In this way, our CurlDiv motion descriptor takes the following form:

$$\begin{aligned} CurlDiv(x, y) = \begin{bmatrix} Curl(x,y) \\ Div(x,y) \end{bmatrix} \end{aligned}$$
(4)

Following the scheme proposed in [2] for motion data, we also rescale the CurlDiv descriptor to the range \([-1,1]\).
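As a concrete illustration of Eqs. 1–4, the sketch below computes the CurlDiv descriptor for a pair of consecutive grayscale frames using OpenCV's Farnebäck optical flow and NumPy finite differences. The Farnebäck parameters and the min-max rescaling to \([-1,1]\) are assumptions, since the exact settings are not listed here.

```python
import cv2
import numpy as np

def curl_div(prev_gray, next_gray):
    """CurlDiv descriptor (Eqs. 1-4) for two consecutive grayscale frames.
    The Farneback parameters below are illustrative defaults."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    P, Q = flow[..., 0], flow[..., 1]          # horizontal / vertical components

    # First-order finite differences of the flow field
    dQ_dx = np.gradient(Q, axis=1)
    dP_dy = np.gradient(P, axis=0)
    dP_dx = np.gradient(P, axis=1)
    dQ_dy = np.gradient(Q, axis=0)

    curl = np.abs(dQ_dx - dP_dy)               # Eq. 2
    div = dP_dx + dQ_dy                        # Eq. 3

    def rescale(channel):
        # Min-max rescaling to [-1, 1]; the exact scheme of [2] is assumed here
        lo, hi = channel.min(), channel.max()
        return 2.0 * (channel - lo) / (hi - lo + 1e-8) - 1.0

    return np.stack([rescale(curl), rescale(div)], axis=-1)   # Eq. 4
```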

4 Training Strategy and Fusion

Our training strategy follows the assumption that one of the streams performs poorly. Based on this, we first train the strongest stream (appearance) and detect the set of samples on which this stream does not perform well. The motion CNN is then trained on those samples. To train the motion channel on the CurlDiv representation we reuse the motion CNN from I3D. This training is conducted in two ways: (1) using all training data and (2) using the proposed strategy.

With the former strategy (1), the CurlDiv channel is trained over the entire dataset, without taking into account the quality of the other channel, the appearance. For this reason, its training time is the same as that consumed by the appearance channel. On the other hand, our motion representation does not improve on the optical flow, even though we used the same CNN. This is because we start training from a model pretrained on the optical flow in order to accelerate convergence. During experimentation we observed that the \(top_2\) accuracy of the appearance channel is above 99% on all the UCF101 datasets (see Table 1).

Table 1. \(top_2\) accuracy of the appearance channel of the I3D two-stream architecture, as well as the accuracy achieved by our motion channel and by the fusion of the two streams (RGB+CurlDiv).

Our second strategy (2) for training two-stream architectures is based on the assumption that one of the channels is better than the other. Taking this into account, we use a simple but effective heuristic proposed in [4] that helps to improve the training phase (Eq. 5).

$$\begin{aligned} max(softmax) - 2nd(softmax) \ge 2 * 2nd(softmax) \end{aligned}$$
(5)

where max(softmax) is the top value of the softmax layer and 2nd(softmax) is the second most probable element of the softmax layer. Following this criterion, the training data and training times for the second channel can be considerably reduced. In other words, the motion channel is only fine-tuned on those samples that do not fulfill the condition of Eq. 5.
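The condition of Eq. 5 can be evaluated directly on the softmax output of the appearance stream. The sketch below reflects our reading of the heuristic (variable names are hypothetical): it keeps for motion-channel fine-tuning only those training samples on which the appearance stream is not confident.

```python
import numpy as np

def is_confident(softmax_probs):
    """Eq. 5: the top probability must exceed the runner-up by at least
    twice the runner-up's value."""
    second, first = np.sort(np.asarray(softmax_probs))[-2:]
    return (first - second) >= 2.0 * second

# Hypothetical usage: keep only the samples where the RGB stream is unsure
appearance_probs = [np.random.dirichlet(np.ones(101)) for _ in range(4)]
hard_samples = [i for i, p in enumerate(appearance_probs) if not is_confident(p)]
```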

5 Bayesian Strategy for Early Action Recognition

An early response about the action that is taking place can be useful in real-world applications (e.g., video surveillance). In this section we show how the I3D (RGB + CurlDiv), retrained over a few frames, can be used as input to a Bayesian classifier in order to determine the action of a video at an early stage. The core idea of this method is to accumulate evidence about the observed activity while sliding windows are processed, and to make a decision as soon as the highest probability fulfills the condition in Eq. 5.

Taking the vector of probabilities P obtained from the softmax layer, we can apply a Bayesian approach to make predictions after observing only a few segments of the video.

$$\begin{aligned} P(C_{j} | f) = \frac{P(C_{j}) * P(f | C_{j})}{P(f)} \end{aligned}$$
(6)

In Eq. 6 the idea is to estimate the maximum a posteriori probability (MAP) of the class \(C_{j}\) given the video features. Notice that, since the denominator does not depend on the class, we can drop it and rewrite the equation as follows.

$$\begin{aligned} P(C_{j} | f) \propto P(C_{j}) * P(f | C_{j}) \end{aligned}$$
(7)

where f is the feature set extracted by the CNN, \(P(C_{j})\) is the class probability obtained from the previous segment and \(P(f | C_{j})\) is the probability estimated for the current segment. Since video data can be seen as a continuous sequence of sliding windows, we can use Eq. 7 to predict the action label. Following the condition proposed in Eq. 5, we can compute the most probable action for a given video and stop the processing. Figure 2 shows how the highest probability varies only between two classes across time; a sketch of the full procedure is given after the figure.

Fig. 2.

Sliding-window classification over two different videos of class 101. The y axis enumerates the sliding windows (32 frames) and the x axis enumerates the 101 classes of the dataset. The outcome of the softmax layer is visualized horizontally, according to the y index. Note how the highest probability tends to vary only between two classes. Black represents the highest value, while light blue represents the smallest value. (Best seen in color.) (Color figure online)
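A minimal sketch of the early-recognition procedure follows, combining the accumulation rule of Eq. 7 with the stopping condition of Eq. 5. The iterable `window_softmax_iter`, which yields one softmax vector per 32-frame sliding window from the retrained I3D (RGB + CurlDiv), is an assumed interface.

```python
import numpy as np

def early_recognition(window_softmax_iter, num_classes=101):
    """Accumulate evidence across sliding windows (Eq. 7) and stop as soon
    as the posterior satisfies the confidence condition of Eq. 5."""
    posterior = np.full(num_classes, 1.0 / num_classes)   # uniform prior
    for likelihood in window_softmax_iter:                # P(f | C) for one window
        posterior = posterior * likelihood                # Eq. 7 (unnormalized)
        posterior /= posterior.sum()                      # renormalize
        second, first = np.sort(posterior)[-2:]
        if (first - second) >= 2.0 * second:              # Eq. 5 stopping rule
            break                                         # early decision
    return int(posterior.argmax()), posterior
```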

6 Experiments and Results

The experiments in this paper were conducted on three datasets on which two-stream architectures have been tested: UCF101-1, UCF101-2 and UCF101-3 [23]. Each of these datasets contains 101 classes and 13,320 videos collected from the internet, divided into training and testing sets. The experiments presented in this section focus on measuring how much the training time can be reduced following our scheme, to what extent RGB+CurlDiv outperforms previous results, and how well a Bayesian strategy can perform for early action recognition.

Table 2. Hours (h) used by our architecture to train the I3D two-stream model following both schemes: the classical one and ours (CurlDiv). Hardware: Core i7, 32 GB RAM and an NVidia GeForce GTX 1080 with 11 GB of VRAM.

6.1 Preprocessing and Training Times

This experiment reports the times of the preprocessing and training phases. These two phases are crucial factors when deep-learning architectures are used. Our hardware took almost 1 h to extract all the frames of the 13,320 videos and 18 h to compute the optical flow. The CurlDiv took less than 3 h to preprocess the same amount of data, which represents a considerable reduction in time. Training the I3D architecture following the classical scheme took about 6 days for both channels, appearance and motion (3 days for each channel, on each dataset). Using our approach, we only spend 4 days training the two channels (3 days for RGB and 1 day for CurlDiv) (see Table 2). The considerable reduction in training time comes from applying the condition of Eq. 5 while the appearance channel is being trained.

Table 3. The RGB, FLOW and RGB+FLOW columns show the accuracy achieved by I3D without applying image smoothing. The RGB_s, FLOW_s and RGB_s+FLOW_s columns show the accuracy achieved by I3D after applying image smoothing.

6.2 Results Achieved by Our RGB+CurlDiv Scheme

We visually detected that noise and distortion are widely present in the datasets. For this reason we apply a smoothing (blurring) algorithm with a \(3 \times 3\) kernel to all frames in order to reduce borders and distortion. Table 3 shows how applying this smoothing step before preprocessing the data outperforms the state-of-the-art results.
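The exact smoothing filter is not named above, so the snippet below assumes a \(3 \times 3\) Gaussian blur from OpenCV applied to every frame before optical-flow and CurlDiv extraction; any \(3 \times 3\) smoothing kernel would fit the description.

```python
import cv2

def smooth_frame(frame_bgr):
    """Reduce noise and distortion with a 3x3 kernel before preprocessing.
    A Gaussian blur is assumed; only the kernel size is specified above."""
    return cv2.GaussianBlur(frame_bgr, (3, 3), 0)
```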

Taking into consideration the improvement produced by image smoothing, we compute the CurlDiv motion representation and retrain the model following the strategy proposed in Sect. 4. For testing we follow the same strategy used for training the appearance stream (guided by the condition in Eq. 5). First, we let the appearance channel decide. If this channel fulfills the condition, we take the selected class. If the condition is not satisfied, we apply the condition to the motion channel. If the motion channel satisfies it, we take its predicted class. If neither the appearance channel nor the motion channel fulfills the condition, the classical fusion of both channels is carried out (a sketch of this cascade is given after Table 4). The results achieved by our framework outperform the state-of-the-art accuracy and are shown in Table 4.

Table 4. Comparison with the state-of-the-art methods I3D [2] and PoTion [3]. UCF101* denotes the mean accuracy over the three datasets. The metric used is accuracy.
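The test-time cascade of Sect. 6.2 can be summarized as in the sketch below, which reflects our reading of the procedure: trust the appearance channel when Eq. 5 holds, otherwise the motion channel, and fall back to the classical late fusion when neither is confident. Equal fusion weights are assumed.

```python
import numpy as np

def cascade_predict(rgb_probs, motion_probs):
    """Test-time cascade: appearance decides first, then motion,
    otherwise the classical late fusion of both channels."""
    def confident(p):
        second, first = np.sort(p)[-2:]
        return (first - second) >= 2.0 * second           # Eq. 5

    if confident(rgb_probs):
        return int(np.argmax(rgb_probs))                  # appearance decides
    if confident(motion_probs):
        return int(np.argmax(motion_probs))               # motion decides
    fused = 0.5 * np.asarray(rgb_probs) + 0.5 * np.asarray(motion_probs)
    return int(np.argmax(fused))                          # classical fusion
```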

6.3 Impact of I3D in Early Action Recognition

Each dataset used in these experiments has a total of 7,352,018 frames distributed over 13,320 videos. The aim of these experiments is to quantify how many frames can be saved using the strategy proposed in Sect. 5 without sacrificing the performance of the overall method (see Table 5). To this end, we retrain the I3D with clips of 32 frames (the usual clip size is 64), randomly selected within the input videos, for 40 epochs. In this experiment we consider three different stopping conditions: [First], taking only the first 32 frames; [Bayes], applying our early-recognition framework; and [Total], spanning entire videos.

Table 5 shows that spanning entire videos [Total] is the worst case. This happens because activities occur in a small number of frames while videos are extremely long; as a consequence, the accumulated evidence for the true class becomes smaller than that of the remaining classes. The [First] strategy, on the other hand, behaves decently for short videos or videos where the action starts from the beginning. Finally, a more accurate and safer way is to let a high-level classifier [Bayes] decide the true class while the video is spanned. In general, it can be seen that our strategy increases the number of non-analyzed frames without sacrificing performance.

Table 5. [First]: this condition uses only the first 32 frames of each video. [Total]: with this condition the video is spanned entirely. [Bayes]: employs the proposed condition to stop spanning the video. f-saved is the number of non-analyzed frames for each strategy.

7 Conclusions and Future Work

In this paper we presented a novel strategy for training two-stream CNNs that considerably reduces training times. We showed how, by reducing noise and distortion in the input frames, we can improve the performance of classical two-stream CNNs, especially the I3D architecture. Our new representation for the motion channel is a weak classifier by itself, but when combined with the appearance channel it outperforms the state-of-the-art methods. The presented results show that the CurlDiv representation is easier and faster to compute than the commonly used optical flow [30]. Moreover, despite the reduction in training and preprocessing times, we reported the best results achieved so far on the UCF101 dataset. We also presented an approach based on the I3D architecture for early action recognition that avoids spanning entire videos without sacrificing accuracy. In future work we will introduce other scalar properties derived from the optical flow to encode the motion channel, and we will extend the architecture to multiple CNN streams (streams \(> 2\)).