Abstract
A 3D convolutional neural network (3D CNN) captures spatial and temporal information from 3D data such as video sequences. However, the information loss caused by its convolution and pooling mechanisms seems unavoidable. To improve both visual explanation and classification in 3D CNNs, we propose two approaches: (i) aggregating the layer-wise global-to-local (global–local) discrete gradients of a trained 3DResNext network, and (ii) adding an attention gating network to improve action-recognition accuracy. The proposed approach demonstrates the usefulness of every layer, termed global–local attention, in a 3D CNN through visual attribution, weakly-supervised action localization, and action recognition. First, 3DResNext is trained for action classification, and backpropagation is performed with respect to the maximum predicted class. The gradients and activations of every layer are then up-sampled and aggregated to produce more nuanced attention that highlights the most critical parts of the input video for the predicted class. Contour thresholding of the final attention map yields the localization. We evaluate spatial and temporal action localization in trimmed videos using fine-grained visual explanation via 3DCAM. Experimental results show that the proposed approach produces informative visual explanations and discriminative attention. Furthermore, action recognition via attention gating of each layer yields better classification results than the baseline model.
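The aggregation step can be pictured as a Grad-CAM-style map computed at several depths of the network and combined after up-sampling to the input resolution. Below is a minimal PyTorch sketch under that reading; the model, the choice of `target_layers`, the channel weighting by pooled gradients, and the simple mean over layers are illustrative assumptions, not the authors' exact procedure.

```python
# Minimal sketch: layer-wise gradient/activation aggregation (global–local
# attention) for a 3D CNN. Names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def global_local_attention(model, clip, target_layers):
    """Aggregate up-sampled Grad-CAM-style maps from several layers.

    clip: tensor of shape (1, C, T, H, W); target_layers: list of modules.
    """
    acts, grads = {}, {}

    def save_act(name):
        def hook(module, inputs, output):
            acts[name] = output                # cache layer activation
        return hook

    def save_grad(name):
        def hook(module, grad_in, grad_out):
            grads[name] = grad_out[0]          # cache layer gradient
        return hook

    handles = []
    for i, layer in enumerate(target_layers):
        handles.append(layer.register_forward_hook(save_act(i)))
        handles.append(layer.register_full_backward_hook(save_grad(i)))

    logits = model(clip)                       # forward pass
    pred = logits.argmax(dim=1)                # maximum predicted class
    model.zero_grad()
    logits[0, pred].backward()                 # gradients w.r.t. that class

    maps = []
    for i in range(len(target_layers)):
        w = grads[i].mean(dim=(2, 3, 4), keepdim=True)   # channel weights
        cam = F.relu((w * acts[i]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=clip.shape[2:],    # up-sample to input
                            mode='trilinear', align_corners=False)
        maps.append(cam / (cam.max() + 1e-8))            # per-layer normalize

    for h in handles:
        h.remove()
    return torch.stack(maps).mean(dim=0)       # aggregate layer-wise maps
```

In use, one would call this with the network in eval mode, e.g. `att = global_local_attention(model.eval(), clip, [model.layer1, model.layer2, model.layer3, model.layer4])` (layer names assumed here), then apply contour thresholding to `att` for the final localization.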
Acknowledgements
The authors would like to thank KAKENHI Project No. 16K00239 for funding the research.
Additional information
Communicated by Koichi Kise.
About this article
Cite this article
Yudistira, N., Kavitha, M.S. & Kurita, T. Weakly-Supervised Action Localization, and Action Recognition Using Global–Local Attention of 3D CNN. Int J Comput Vis 130, 2349–2363 (2022). https://doi.org/10.1007/s11263-022-01649-x