Abstract
Recent methods for video action recognition have reached outstanding performance on existing benchmarks. However, they tend to leverage context such as scenes or objects instead of focusing on understanding the human action itself. For instance, a tennis court leads to the prediction playing tennis irrespective of the actions performed in the video. In contrast, humans have a more complete understanding of actions and can recognize them without context. The best examples of out-of-context actions are mimes, which people can typically recognize despite missing relevant objects and scenes. In this paper, we propose to benchmark action recognition methods in such an absence of context and introduce a novel dataset, Mimetics, consisting of mimed actions for a subset of 50 classes from the Kinetics benchmark. Our experiments show that (a) state-of-the-art 3D convolutional neural networks obtain disappointing results on such videos, highlighting the lack of true understanding of the human actions, and (b) models leveraging body language via human pose are less prone to context biases. In particular, we show that applying a shallow neural network with a single temporal convolution over body pose features transferred to the action recognition problem performs surprisingly well compared to 3D action recognition methods.
References
Angelini, F., Fu, Z., Long, Y., Shao, L., & Naqvi, S. M. (2018). ActionXPose: A novel 2D multi-view pose-based algorithm for real-time human action recognition. arXiv preprint arXiv:1810.12126.
Bahng, H., Chun, S., Yun, S., Choo, J., & Oh, S. J. (2019). Learning de-biased representations with biased representations. arXiv.
Cao, C., Zhang, Y., Zhang, C., & Lu, H. (2016). Action recognition with joints-pooled 3D deep convolutional descriptors. In IJCAI.
Cao, Z., Hidalgo, G., Simon, T., Wei, S. E., & Sheikh, Y. (2018). OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields. arXiv preprint arXiv:1812.08008.
Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR.
Chéron, G., Laptev, I., & Schmid, C. (2015). P-CNN: Pose-based CNN features for action recognition. In ICCV.
Choutas, V., Weinzaepfel, P., Revaud, J., & Schmid, C. (2018). PoTion: Pose motion representation for action recognition. In CVPR.
Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In CVPR.
Du, W., Wang, Y., & Qiao, Y. (2017). RPAN: An end-to-end recurrent pose-attention network for action recognition in videos. In ICCV.
Du, Y., Fu, Y., & Wang, L. (2015a). Skeleton based action recognition with convolutional neural network. In ACPR.
Du, Y., Wang, W., & Wang, L. (2015b). Hierarchical recurrent neural network for skeleton based action recognition. In CVPR.
Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). SlowFast networks for video recognition. In ICCV.
Feichtenhofer, C., Pinz, A., & Zisserman, A. (2016). Convolutional two-stream network fusion for video action recognition. In CVPR.
Ghadiyaram, D., Tran, D., & Mahajan, D. (2019). Large-scale weakly-supervised pre-training for video action recognition. In CVPR.
Girdhar, R., & Ramanan, D. (2017). Attentional pooling for action recognition. In NIPS.
Gkioxari, G., & Malik, J. (2015). Finding action tubes. In CVPR.
Hara, K., Kataoka, H., & Satoh, Y. (2018). Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In CVPR.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.
Iqbal, U., Garbade, M., & Gall, J. (2017). Pose for action-action for pose. In International conference on automatic face & gesture recognition (FG).
Jacquot, V., Ying, Z., & Kreiman, G. (2020). Can deep learning recognize subtle human activities? In CVPR.
Jhuang, H., Gall, J., Zuffi, S., Schmid, C., & Black, M. J. (2013). Towards understanding action recognition. In ICCV.
Kalogeiton, V., Weinzaepfel, P., Ferrari, V., & Schmid, C. (2017). Action tubelet detector for spatio-temporal action localization. In ICCV.
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al. (2017). The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
Khosla, A., Zhou, T., Malisiewicz, T., Efros, A. A., & Torralba, A. (2012). Undoing the damage of dataset bias. In ECCV.
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: A large video database for human motion recognition. In ICCV.
Li, Y., Li, Y., & Vasconcelos, N. (2018). Resound: Towards action recognition without representation bias. In ECCV.
Li, Y., & Vasconcelos, N. (2019). Repair: Removing representation bias by dataset resampling. In CVPR.
Liu, J., Shahroudy, A., Xu, D., & Wang, G. (2016). Spatio-temporal LSTM with trust gates for 3D human action recognition. In ECCV.
Liu, M., & Yuan, J. (2018). Recognizing human actions as the evolution of pose estimation maps. In CVPR.
Luvizon, D. C., Picard, D., & Tabia, H. (2018). 2D/3D pose estimation and action recognition using multitask deep learning. In CVPR.
McNally, W., Wong, A., & McPhee, J. (2019). STAR-Net: Action recognition using spatio-temporal activation reprojection. In CRV.
Mehta, D., Sridhar, S., Sotnychenko, O., Rhodin, H., Shafiei, M., Seidel, H. P., et al. (2017). VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics. https://doi.org/10.1145/3072959.3073596.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS.
Rogez, G., Weinzaepfel, P., & Schmid, C. (2020). LCR-Net++: Multi-person 2D and 3D pose detection in natural images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(5), 1146–1161.
Saha, S., Singh, G., Sapienza, M., Torr, P. H., & Cuzzolin, F. (2016). Deep learning for detecting multiple space-time action tubes in videos. In BMVC.
Sevilla-Lara, L., Liao, Y., Güney, F., Jampani, V., Geiger, A., & Black, M. J. (2018). On the integration of optical flow and action recognition. In GCPR.
Shahroudy, A., Liu, J., Ng, T. T., & Wang, G. (2016). NTU RGB+D: A large scale dataset for 3D human activity analysis. In CVPR.
Si, C., Jing, Y., Wang, W., Wang, L., & Tan, T. (2018). Skeleton-based action recognition with spatial reasoning and temporal stack learning. In ECCV.
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In NIPS.
Singh, G., Saha, S., Sapienza, M., Torr, P. H., & Cuzzolin, F. (2017). Online real-time multiple spatiotemporal action localisation and prediction. In ICCV.
Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. In CRCV-TR-12-01.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In CVPR.
Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. In CVPR.
Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In ICCV.
Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., & Paluri, M. (2018). A closer look at spatiotemporal convolutions for action recognition. In CVPR.
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2016). Temporal segment networks: Towards good practices for deep action recognition. In ECCV.
Wang, W., Zhang, J., Si, C., & Wang, L. (2018). Pose-based two-stream relational networks for action recognition in videos. arXiv preprint arXiv:1805.08484.
Weinzaepfel, P., Harchaoui, Z., & Schmid, C. (2015). Learning to track for spatio-temporal action localization. In ICCV.
Weng, J., Liu, M., Jiang, X., & Yuan, J. (2018). Deformable pose traversal convolution for 3d action and gesture recognition. In ECCV.
Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. In CVPR.
Xie, S., Sun, C., Huang, J., Tu, Z., & Murphy, K. (2018). Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV.
Yan, S., Xiong, Y., & Lin, D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI.
Yao, A., Gall, J., & Van Gool, L. (2012). Coupled action recognition and pose estimation from multiple views. IJCV.
Zach, C., Pock, T., & Bischof, H. (2007). A duality based approach for realtime TV-L1 optical flow. In Joint pattern recognition symposium.
Zhang, W., Zhu, M., & Derpanis, K. G. (2013). From actemes to action: A strongly-supervised representation for detailed action understanding. In ICCV.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In CVPR.
Zhu, J., Zou, W., Xu, L., Hu, Y., Zhu, Z., Chang, M., Huang, J., Huang, G., & Du, D. (2018). Action machine: Rethinking action recognition in trimmed videos. arXiv preprint arXiv:1812.05770.
Zhu, W., Lan, C., Xing, J., Li, Y., Shen, L., Zeng, W., & Xie, X. (2016). Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In AAAI.
Zolfaghari, M., Oliveira, G. L., Sedaghat, N., & Brox, T. (2017). Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection. In ICCV.
Communicated by Angela Dai.
Appendices
Extended Experiments on Existing Datasets
In this section, we provide further analysis of the performance of the three pose-based baselines on existing action recognition datasets. We first perform a parametric study of SIP-Net in "SIP-Net Baseline" of the appendix. We then use the various levels of ground truth (see Table 3 of the main paper) to study the impact of using ground-truth versus extracted tubes and poses ("Comparison Between Baselines" of the appendix).
Tubes For datasets with ground-truth 2D poses, we compare the performance when using ground-truth tubes (GT Tubes) obtained from GT 2D poses with that of estimated tubes (LCR Tubes) built from estimated 2D poses, see Sect. 4.1 of the main paper. In the latter case, tubes are labeled positive if their spatio-temporal IoU with a GT tube is over 0.5, and negative otherwise. When there is no tube annotation, we assume that all tubes are labeled with the video class label. Note that in some videos no tube is extracted; these videos are ignored during training and counted as wrongly classified at test time. In particular, this happens when only the head is visible, as well as in many clips with a first-person viewpoint, where only one hand or the main manipulated object is visible. We obtain no tube for 0.1% of the videos on PennAction, 2.5% on JHMDB, 2.7% on HMDB51, 6.7% on UCF101 and 15.3% on Kinetics.
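As an illustration, a minimal Python sketch of this tube-labeling step is given below, assuming tubes are stored as per-frame bounding boxes; the helper names (box_iou, spatiotemporal_iou, label_tubes) and the exact spatio-temporal IoU definition (temporal IoU multiplied by the mean spatial IoU over overlapping frames) are our own assumptions, not the implementation used in the paper.

import numpy as np

def box_iou(box_a, box_b):
    # IoU between two boxes given as [x1, y1, x2, y2]
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def spatiotemporal_iou(tube_a, tube_b):
    # Tubes are dicts {frame_index: box}; combine the temporal IoU with the
    # mean spatial IoU over the frames where both tubes are present.
    inter_frames = set(tube_a) & set(tube_b)
    if not inter_frames:
        return 0.0
    temporal_iou = len(inter_frames) / len(set(tube_a) | set(tube_b))
    spatial_iou = np.mean([box_iou(tube_a[f], tube_b[f]) for f in inter_frames])
    return temporal_iou * spatial_iou

def label_tubes(estimated_tubes, gt_tubes, video_label, iou_thr=0.5):
    # Assign the video label to an estimated tube if it overlaps a GT tube
    # with spatio-temporal IoU above iou_thr, otherwise mark it as negative.
    labels = []
    for tube in estimated_tubes:
        best = max((spatiotemporal_iou(tube, gt) for gt in gt_tubes), default=0.0)
        labels.append(video_label if best > iou_thr else None)
    return labels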
1.1 SIP-Net Baseline
We first present the results for the SIP-Net baseline with GT tubes (blue curve 'GT Tubes, Pose Feats') and LCR tubes (green curve 'LCR Tubes, Pose Feats') on all datasets for varying clip length T, see Fig. 7. Overall, a larger clip size T leads to a higher classification accuracy. This is particularly the case for datasets with longer videos such as NTU and Kinetics, and it holds both when using GT tubes (blue curve) and LCR tubes (green curve). We keep \(T=32\) in the remainder of this paper.
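For concreteness, the following PyTorch sketch illustrates a SIP-Net-like classifier, i.e., a single temporal convolution applied over the per-frame pose features of a tube, followed by temporal pooling and a linear classifier. The feature dimension, number of channels and kernel size are illustrative assumptions, not necessarily the exact values of SIP-Net.

import torch
import torch.nn as nn

class SIPNetSketch(nn.Module):
    # Shallow classifier: one temporal convolution over per-frame pose
    # features, temporal average pooling, then a linear classifier.
    def __init__(self, feat_dim=2048, num_classes=400, kernel_size=3):
        super().__init__()
        self.temporal_conv = nn.Conv1d(feat_dim, 512, kernel_size,
                                       padding=kernel_size // 2)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, pose_feats):
        # pose_feats: (batch, T, feat_dim), one feature vector per frame
        x = pose_feats.transpose(1, 2)          # (batch, feat_dim, T)
        x = torch.relu(self.temporal_conv(x))   # (batch, 512, T)
        x = x.mean(dim=2)                       # temporal average pooling
        return self.classifier(x)               # (batch, num_classes)

# Example: a clip of T=32 frames with 2048-D pose features per frame
logits = SIPNetSketch()(torch.randn(2, 32, 2048))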
Next, we measure the impact of transferring features from the pose domain to action recognition. To this end, we compare the temporal convolution on LCR pose features (blue curve, 'Pose Feats') to the same temporal convolution on features extracted from a Faster R-CNN model with a ResNet50 backbone trained to classify actions (red curve, 'Action Feats'). The latter method is not meant to be state-of-the-art in action recognition, but it allows a fair comparison between pose features and action features: the network architecture is kept exactly the same and only the learned weights change. Note that such a frame-level action detector was used in the spatio-temporal action detection literature (Saha et al. 2016; Weinzaepfel et al. 2015) before the rise of 3D CNNs. Results in Fig. 7 show a clear drop of accuracy when using action features instead of pose features: about 20% on JHMDB-1 and PennAction, and around 5% on NTU for \(T=32\). Interestingly, this holds for \(T=1\) on HMDB-1 and PennAction, i.e., without temporal integration, showing that 'Pose Feats' are more powerful. To better understand why using pose features considerably increases performance compared to action features, we visualize the distances between features inside tubes in Fig. 8. When training a per-frame detector specifically for actions, most features of a given tube are highly correlated, which makes it hard to leverage temporal information. In contrast, LCR-Net++ pose features change considerably over time, as the pose does, and thus benefit more from temporal integration. Figure 9 shows confusion matrices on PennAction when using 'Pose Feats' (left) versus 'Action Feats' (right). With 'Action Feats', confusions occur between the two tennis actions and between the two baseball actions, while these are disambiguated with 'Pose Feats'.
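The kind of analysis behind Fig. 8 can be reproduced with a simple pairwise distance computation between the per-frame features of a tube; the sketch below is an illustrative assumption of how such a visualization could be computed, not the exact code used for the figure.

import torch

def pairwise_feature_distances(tube_feats, metric="cosine"):
    # tube_feats: (T, D) per-frame features of one tube. Returns a (T, T)
    # distance matrix: a flat matrix indicates highly correlated features,
    # while large off-diagonal values indicate features that evolve over time.
    if metric == "cosine":
        normed = torch.nn.functional.normalize(tube_feats, dim=1)
        return 1.0 - normed @ normed.t()
    return torch.cdist(tube_feats, tube_feats)  # Euclidean distances

# Example: a tube of 32 frames with 2048-D features
dists = pairwise_feature_distances(torch.randn(32, 2048))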
1.2 Comparison Between Baselines
We compare the performance of the baselines using GT and LCR tubes on the JHMDB-1, PennAction and NTU datasets in Table 10. On JHMDB-1 and PennAction, despite its much simpler architecture, the SIP-Net baseline outperforms the methods based on explicit 2D–3D pose representations, both with GT and LCR tubes. Estimated 3D pose sequences are usually noisy and may lack temporal consistency. We also observe that the STGCN3D approach significantly outperforms its 2D counterpart (STGCN2D), confirming that 2D poses contain less discriminative and more ambiguous information.
On the NTU dataset, the 3D pose baseline obtains 74.8% accuracy when using GT tubes and estimated poses (STGCN3D on GT Tubes), compared to 81.5% reported in (Yan et al. 2018) when using ground-truth 3D poses. This gap of 7% in a constrained environment is likely to increase for videos captured in the wild. The performance of the feature-based baseline (SIP-Net) is lower, 66.4% on GT tubes, suggesting that SIP-Net performs better only in unconstrained scenarios.
Per-Class Results on Mimetics
In Table 11, we present the per-class top-1 accuracy and AP of the different methods. For the top-1 accuracy metric, SIP-Net obtains the best performance for 19 out of 50 classes, with a mean accuracy of 14.2%. The RGB 3D CNN baseline obtains the highest AP for 8 classes, which often correspond to classes in which the manipulated objects are small, making the network less biased towards context (e.g., the ball for the action catching or throwing baseball). Table 11 also highlights that the recognition of mimed actions is a very challenging and open task: for 5 out of the 50 classes, none of the videos is correctly classified (i.e., 0% top-1 accuracy) by any of the 5 baselines.
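For reference, per-class top-1 accuracy and AP as reported in Table 11 can be computed along the following lines; this sketch assumes per-video class probabilities and ground-truth labels as NumPy arrays and uses scikit-learn's average_precision_score, which may differ in detail from the evaluation code actually used.

import numpy as np
from sklearn.metrics import average_precision_score

def per_class_metrics(scores, labels, num_classes):
    # scores: (num_videos, num_classes) class probabilities
    # labels: (num_videos,) ground-truth class indices
    preds = scores.argmax(axis=1)
    top1, ap = [], []
    for c in range(num_classes):
        mask = labels == c
        top1.append((preds[mask] == c).mean() if mask.any() else np.nan)
        ap.append(average_precision_score((labels == c).astype(int), scores[:, c]))
    return np.array(top1), np.array(ap)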
Mimetics Evaluation with Superclasses
One hypothesis for the low performance of the different methods on Mimetics is that the actions are too fine-grained. To evaluate this aspect, we aggregated several actions from the Kinetics dataset into higher-level classes, denoted as superclasses. Figure 10 shows the actions that belong to each larger group. Classes not present in any of these groups are kept as is, i.e., each forms its own superclass.
For evaluation, we still consider models trained on the 400 classes from Kinetics, but at test time, when evaluating on Mimetics, we transform the 400 class probabilities into superclass probabilities. As a side effect, the 50 Mimetics classes are reduced to 45 superclasses. Note that some classes of a superclass do not belong to the subset covered by Mimetics, but their probabilities are still summed when computing the superclass probability from the Kinetics predictions. Results are presented in Table 12 (top part with 50 classes, bottom part with 45 superclasses). We observe that the performance of all methods increases slightly compared to the evaluation with fine-grained classes (Table 6), but the overall conclusion still holds. Note in particular that the performance of pose-based methods increases more, e.g., +10% top-5 accuracy for STGCN3D and +4% for SIP-Net, compared to +2% for 3D-ResNeXt-101 RGB.
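A minimal sketch of this superclass aggregation is given below, assuming a precomputed mapping from each of the 400 Kinetics classes to its superclass index (classes outside any group map to their own superclass); the function name and array layout are illustrative.

import numpy as np

def to_superclass_probs(class_probs, class_to_super, num_superclasses):
    # class_probs: (num_videos, 400) Kinetics class probabilities
    # class_to_super: length-400 array mapping each class to a superclass index
    # Probabilities of all classes in a group are summed, so classes that are
    # not part of the Mimetics subset still contribute to their superclass.
    super_probs = np.zeros((class_probs.shape[0], num_superclasses))
    for c, s in enumerate(class_to_super):
        super_probs[:, s] += class_probs[:, c]
    return super_probs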
Cite this article
Weinzaepfel, P., Rogez, G. Mimetics: Towards Understanding Human Actions Out of Context. Int J Comput Vis 129, 1675–1690 (2021). https://doi.org/10.1007/s11263-021-01446-y