Abstract
This paper addresses an important and challenging task, namely detecting the temporal intervals of actions in untrimmed videos. Specifically, we present a framework called structured segment network (SSN). It is built on temporal proposals of actions. SSN models the temporal structure of each action instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model comprising two classifiers, respectively for classifying actions and determining completeness. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, thus leading to both accurate recognition and precise localization. These components are integrated into a unified network that can be efficiently trained in an end-to-end manner. Additionally, a simple yet effective temporal action proposal scheme, dubbed temporal actionness grouping, is devised to generate high-quality action proposals. We further study the importance of the decomposed discriminative model and discover a way to achieve similar accuracy with a single classifier, which is also complementary to the original SSN design. On two challenging benchmarks, THUMOS’14 and ActivityNet, our method significantly outperforms previous state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling actions with various temporal structures.
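To make the abstract's two key components concrete, below is a minimal, illustrative PyTorch sketch of (a) structured temporal pyramid pooling over the starting, course, and ending stages of a proposal, followed by the decomposed activity and completeness classifiers, and (b) a heavily simplified stand-in for temporal actionness grouping. All names, feature dimensions, the two-level course pyramid, and the single-pass thresholding (the paper uses a watershed-style grouping scheme) are our assumptions for illustration, not the authors' released implementation.

```python
# A minimal sketch, assuming PyTorch and precomputed snippet-level features.
import torch
import torch.nn as nn


class SSNHead(nn.Module):
    """Structured temporal pyramid pooling over the (starting, course,
    ending) stages of a proposal, followed by the decomposed classifiers."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        # Activity head: action classes + background, from the course stage.
        self.activity_fc = nn.Linear(feat_dim, num_classes + 1)
        # Completeness head: complete vs. incomplete per class, from the full
        # pyramid (starting + course at levels 1 and 2 + ending = 5 bins).
        self.completeness_fc = nn.Linear(feat_dim * 5, num_classes)

    @staticmethod
    def _pool(stage: torch.Tensor, num_bins: int) -> torch.Tensor:
        # Average-pool a (T, D) stage into num_bins equal temporal bins.
        return torch.cat([c.mean(dim=0) for c in stage.chunk(num_bins, dim=0)])

    def forward(self, start, course, end):
        # start / course / end: (T, D) snippet features of the augmented
        # proposal, i.e., the proposal flanked by its temporal surroundings.
        course_feat = self._pool(course, 1)
        pyramid = torch.cat([
            self._pool(start, 1),
            course_feat,
            self._pool(course, 2),  # second pyramid level on the course stage
            self._pool(end, 1),
        ])
        return self.activity_fc(course_feat), self.completeness_fc(pyramid)


def tag_proposals(actionness, thresholds=(0.3, 0.5, 0.7)):
    """Simplified stand-in for temporal actionness grouping: for each
    threshold, merge consecutive snippets whose actionness exceeds it.
    The paper's scheme instead floods the actionness signal watershed-style."""
    proposals = set()
    for th in thresholds:
        start = None
        for t, a in enumerate(list(actionness) + [0.0]):  # sentinel flushes open runs
            if a >= th and start is None:
                start = t
            elif a < th and start is not None:
                proposals.add((start, t))  # [start, t) in snippet indices
                start = None
    return sorted(proposals)


head = SSNHead(feat_dim=1024, num_classes=20)  # e.g., the 20 THUMOS'14 classes
act_logits, comp_logits = head(
    torch.randn(2, 1024), torch.randn(8, 1024), torch.randn(2, 1024))
print(tag_proposals([0.1, 0.6, 0.8, 0.7, 0.2, 0.9]))  # [(1, 4), (2, 4), (5, 6)]
```

The decomposition is visible in the shapes: the activity head sees only the course-stage feature, while the completeness head sees the full pyramid, including the flanking stages whose context indicates whether a proposal cuts an action short.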
Notes
To be specific, the 10 classes are: “Clean and Jerk”, “Hammer Throw”, “High Jump”, “Javelin Throw”, “Long Jump”, “Pole Vault”, “Shotput”, “Tennis Swing”, “Throw Discus”, and “Volleyball Spiking”. Note that THUMOS14 has two classes named “Cricket Bowling” and “Cricket Shot”, while ActivityNet v1.2 has a single class called “Cricket”. However, we categorize these two classes into the unseen part, since the single label in ActivityNet cannot distinguish between the two specific actions.
The ActivityNet 2016 challenge summary is provided here: http://activity-net.org/challenges/2016/data/anet_challenge_summary.pdf.
References
Andriluka, M., Roth, S., & Schiele, B. (2009). Pictorial structures revisited: People detection and articulated pose estimation. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1014–1021). IEEE.
Buch, S., Escorcia, V., Ghanem, B., Fei-Fei, L., & Niebles, J. C. (2017a). End-to-end, single-stream temporal action detection in untrimmed videos. In The British machine vision conference (BMVC) (Vol. 2, p. 7).
Buch, S., Escorcia, V., Shen, C., Ghanem, B., & Niebles, J. C. (2017b). SST: Single-stream temporal action proposals. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 6373–6382). IEEE.
Caba Heilbron, F., Escorcia, V., Ghanem, B., & Niebles, J. C. (2015). ActivityNet: A large-scale video benchmark for human activity understanding. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 961–970).
Caba Heilbron, F., Niebles, J. C., & Ghanem, B. (2016). Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1914–1923).
Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4724–4733). IEEE.
Chao, Y. W., Vijayanarasimhan, S., Seybold, B., Ross, D. A., Deng, J., & Sukthankar, R. (2018). Rethinking the faster R-CNN architecture for temporal action localization. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1130–1139).
Dai, X., Singh, B., Zhang, G., Davis, L. S., & Chen, Y. Q. (2017). Temporal context network for activity localization in videos. In The IEEE international conference on computer vision (ICCV) (pp. 5727–5736).
De Geest, R., Gavves, E., Ghodrati, A., Li, Z., Snoek, C., & Tuytelaars, T. (2016). Online action detection. In European conference on computer vision (ECCV) (pp. 269–284). Springer.
Deng, J., Dong, W., Socher, R., Li, L., Li, K., & Li, F. (2009). ImageNet: A large-scale hierarchical image database. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 248–255).
Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2625–2634).
Escorcia, V., Caba Heilbron, F., Niebles, J. C., & Ghanem, B. (2016). DAPs: Deep action proposals for action understanding. In European conference on computer vision (ECCV) (pp. 768–784).
Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 111(1), 98–136.
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
Fernando, B., Gavves, E., Jo, M., Ghodrati, A., & Tuytelaars, T. (2015). Modeling video evolution for action recognition. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5378–5387).
Gaidon, A., Harchaoui, Z., & Schmid, C. (2013). Temporal localization of actions with actoms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11), 2782–2795.
Gao, J., Chen, K., & Nevatia, R. (2018). Ctap: Complementary temporal action proposal generation. In The European conference on computer vision (ECCV) (pp. 68–83).
Gao, J., Yang, Z., & Nevatia, R. (2017). Cascaded boundary regression for temporal action detection. In The British machine vision conference (BMVC).
Girshick, R. (2015). Fast R-CNN. In The IEEE international conference on computer vision (ICCV) (pp. 1440–1448).
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 580–587).
Gkioxari, G., & Malik, J. (2015). Finding action tubes. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 759–768).
Gu, C., Sun, C., Vijayanarasimhan, S., Pantofaru, C., Ross, D. A., Toderici, G., Li, Y., Ricco, S., Sukthankar, R., Schmid, C., et al. (2018). AVA: A video dataset of spatio-temporally localized atomic visual actions. In The IEEE conference on computer vision and pattern recognition (CVPR).
He, K., Zhang, X., Ren, S., & Sun, J. (2014). Spatial pyramid pooling in deep convolutional networks for visual recognition. In European conference on computer vision (ECCV) (pp. 346–361). Springer.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778).
Hoai, M., Lan, Z. Z., & De la Torre, F. (2011). Joint segmentation and classification of human actions in video. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3265–3272). IEEE.
Hoiem, D., Efros, A. A., & Hebert, M. (2008). Putting objects in perspective. International Journal of Computer Vision (IJCV), 80(1), 3–15.
Hosang, J., Benenson, R., Dollár, P., & Schiele, B. (2016). What makes for effective detection proposals? IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4), 814–830.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (ICML) (pp. 448–456).
Jain, M., van Gemert, J. C., Jégou, H., Bouthemy, P., & Snoek, C. G. M. (2014). Action localization by tubelets from motion. In The IEEE conference on computer vision and pattern recognition (CVPR).
Jiang, Y. G., Liu, J., Roshan Zamir, A., Toderici, G., Laptev, I., Shah, M., & Sukthankar, R. (2014). THUMOS challenge: Action recognition with a large number of classes. Retrieved April 7, 2019 from http://crcv.ucf.edu/THUMOS14/.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1725–1732).
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International conference on machine learning (ICML) (pp. 282–289).
Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision (IJCV), 64(2–3), 107–123.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In The IEEE conference on computer vision and pattern recognition (CVPR) (Vol. 2, pp. 2169–2178). IEEE.
Li, X., & Loy, C. C. (2018). Video object segmentation with joint re-identification and attention-aware mask propagation. In The European conference on computer vision (ECCV) (pp. 90–105).
Dai, J., Li, Y., He, K., & Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. In Neural information processing systems (NIPS) (pp. 379–387).
Lin, T., Zhao, X., & Shou, Z. (2017). Single shot temporal action detection. In Proceedings of the 25th ACM international conference on Multimedia (pp. 988–996). ACM.
Lin, T., Zhao, X., Su, H., Wang, C., & Yang, M. (2018). BSN: Boundary sensitive network for temporal action proposal generation. In The European conference on computer vision (ECCV) (pp. 3–19).
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In European conference on computer vision (ECCV) (pp. 21–37). Springer.
Mettes, P., van Gemert, J. C., & Snoek, C. G. (2016). Spot on: Action localization from pointly-supervised proposals. In European conference on computer vision (ECCV) (pp. 437–453). Springer.
Mettes, P., van Gemert, J. C., Cappallo, S., Mensink, T., & Snoek, C. G. (2015). Bag-of-fragments: Selecting and encoding video fragments for event detection and recounting. In ACM international conference on multimedia retrieval (ICMR) (pp. 427–434).
Montes, A., Salvador, A., Pascual, S., & Giro-i Nieto, X. (2016). Temporal activity detection in untrimmed videos with recurrent neural networks. In NIPS workshop on large scale computer vision systems.
Ng, J. Y. H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4694–4702).
Nguyen, P., Liu, T., Prasad, G., & Han, B. (2018). Weakly supervised action localization by sparse temporal pooling network. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 6752–6761).
Niebles, J. C., Chen, C. W., & Fei-Fei, L. (2010). Modeling temporal structure of decomposable motion segments for activity classification. In European conference on computer vision (ECCV) (pp. 392–405). Springer.
Oneata, D., Verbeek, J., & Schmid, C. (2013). Action and event recognition with fisher vectors on a compact feature set. In The IEEE international conference on computer vision (ICCV) (pp. 1817–1824).
Oneata, D., Verbeek, J., & Schmid, C. (2014). The LEAR submission at THUMOS 2014. In THUMOS action recognition challenge.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Peng, X., & Schmid, C. (2016). Multi-region two-stream R-CNN for action detection. In European conference on computer vision (ECCV). Springer.
Pirsiavash, H., & Ramanan, D. (2014). Parsing videos of actions with segmental grammars. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 612–619).
Pont-Tuset, J., Arbelaez, P., Barron, J. T., Marques, F., & Malik, J. (2017). Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1), 128–140.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural information processing systems (NIPS) (pp. 91–99).
Richard, A., & Gall, J. (2016). Temporal action detection using a statistical language model. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3131–3140).
Roerdink, J. B., & Meijster, A. (2000). The watershed transform: Definitions, algorithms and parallelization strategies. Fundamenta Informaticae, 41(1–2), 187–228.
Schindler, K., & Van Gool, L. (2008). Action snippets: How many frames does human action recognition require? In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1–8). IEEE.
Shou, Z., Chan, J., Zareian, A., Miyazawa, K., & Chang, S. F. (2017). CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1417–1426).
Shou, Z., Gao, H., Zhang, L., Miyazawa, K., & Chang, S. F. (2018). AutoLoc: Weakly-supervised temporal action localization in untrimmed videos. In European conference on computer vision (ECCV) (pp. 154–171).
Shou, Z., Wang, D., & Chang, S. F. (2016). Temporal action localization in untrimmed videos via multi-stage CNNs. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1049–1058).
Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training region-based object detectors with online hard example mining. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 761–769).
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Neural information processing systems (NIPS) (pp. 568–576).
Singh, G., & Cuzzolin, F. (2016). Untrimmed video classification for activity detection: Submission to ActivityNet challenge. arXiv:1607.01979
Singh, B., Marks, T. K., Jones, M., Tuzel, O., & Shao, M. (2016). A multi-stream bi-directional recurrent neural network for fine-grained action detection. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1961–1970).
Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2818–2826).
Tang, K., Yao, B., Fei-Fei, L., & Koller, D. (2013). Combining the right features for complex event recognition. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2696–2703).
Tran, D., Bourdev, L. D., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In The IEEE international conference on computer vision (ICCV) (pp. 4489–4497).
Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011). Segmentation as selective search for object recognition. In The IEEE international conference on computer vision (ICCV) (pp. 1879–1886).
Van Gemert, J. C., Jain, M., Gati, E., Snoek, C. G., et al. (2015). APT: Action localization proposals from dense trajectories. In The British machine vision conference (BMVC) (Vol. 2, p. 4).
Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In The IEEE international conference on computer vision (ICCV) (pp. 3551–3558).
Wang, R., & Tao, D. (2016). UTS at ActivityNet 2016. In ActivityNet large scale activity recognition challenge 2016.
Wang, L., Qiao, Y., & Tang, X. (2014a). Action recognition and detection by combining motion and appearance features. In THUMOS action recognition challenge.
Wang, L., Qiao, Y., & Tang, X. (2014b). Latent hierarchical model of temporal structure for complex activity classification. IEEE Transactions on Image Processing, 23(2), 810–822.
Wang, L., Qiao, Y., & Tang, X. (2015). Action recognition with trajectory-pooled deep-convolutional descriptors. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4305–4314).
Wang, L., Qiao, Y., Tang, X., & Van Gool, L. (2016a). Actionness estimation using hybrid fully convolutional networks. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2708–2717).
Wang, L., Xiong, Y., Lin, D., & Van Gool, L. (2017). Untrimmednets for weakly supervised action recognition and detection. In The IEEE conference on computer vision and pattern recognition (CVPR).
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2016b). Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision (ECCV) (pp. 20–36).
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., et al. (2018). Temporal segment networks for action recognition in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Wang, P., Cao, Y., Shen, C., Liu, L., & Shen, H. T. (2016c). Temporal pyramid pooling based convolutional neural network for action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 27, 2613–2622.
Weinzaepfel, P., Harchaoui, Z., & Schmid, C. (2015). Learning to track for spatio-temporal action localization. In The IEEE international conference on computer vision (ICCV) (pp. 3164–3172).
Xu, H., Das, A., & Saenko, K. (2017). R-C3D: Region convolutional 3D network for temporal activity detection. In The IEEE international conference on computer vision (ICCV) (Vol. 6, p. 8).
Yeung, S., Russakovsky, O., Mori, G., & Fei-Fei, L. (2016). End-to-end learning of action detection from frame glimpses in videos. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2678–2687).
Yuan, J., Ni, B., Yang, X., & Kassim, A. A. (2016). Temporal action localization with pyramid of score distribution features. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3093–3102).
Zach, C., Pock, T., & Bischof, H. (2007). A duality based approach for realtime TV-\(L^1\) optical flow. In 29th DAGM symposium on pattern recognition (pp. 214–223).
Zhang, D., Dai, X., Wang, X., & Wang, Y. F. (2018). \(\rm S^3D\): Single shot multi-span detector via fully 3D convolutional network. In The British machine vision conference (BMVC).
Zhang, B., Wang, L., Wang, Z., Qiao, Y., & Wang, H. (2016). Real-time action recognition with enhanced motion vector CNNs. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2718–2726).
Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., & Lin, D. (2017a). Temporal action detection with structured segment networks. In The IEEE international conference on computer vision (ICCV) (pp. 2914–2923).
Zhao, Y., Zhang, B., Wu, Z., Yang, S., Zhou, L., Yan, S., Wang, L., Xiong, Y., Lin, D., & Qiao, Y. (2017b). CUHK & ETHZ & SIAT submission to ActivityNet Challenge 2017. arXiv:1710.08011
Zitnick, C. L., & Dollár, P. (2014). Edge boxes: Locating object proposals from edges. In European conference on computer vision (ECCV) (pp. 391–405).
Acknowledgements
This work is partially supported by the Big Data Collaboration Research Grant from SenseTime Group (CUHK Agreement No. TS1610626), the General Research Fund (GRF) of Hong Kong (No. 14236516), the Early Career Scheme (ECS) of Hong Kong (No. 24204215), the National Science Foundation of China (Nos. 61921006 and 61321491), and the Collaborative Innovation Center of Novel Software Technology and Industrialization. We thank Tianwei Lin for kindly providing the BSN proposals on THUMOS14.
Additional information
Communicated by Thomas Brox.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Zhao, Y., Xiong, Y., Wang, L. et al. Temporal Action Detection with Structured Segment Networks. Int J Comput Vis 128, 74–95 (2020). https://doi.org/10.1007/s11263-019-01211-2