DOI: 10.1145/3606038.3616167
Relative Boundary Modeling: A High-Resolution Cricket Bowl Release Detection Framework with I3D Features

Published: 29 October 2023 Publication History

Abstract

Cricket bowl release detection aims to localize the full time window of each bowl release action occurring in a video. Unlike traditional detection tasks that identify an action category at a single moment, this task requires recognizing every instance of the bowl release action, where each event typically spans around 100 frames; strictly speaking, it is a branch of temporal action detection. With the advance of deep neural networks, recent works have proposed deep learning-based approaches for this task. However, because action boundaries in video are often unclear, many existing methods perform poorly on the DeepSportradar Cricket Bowl Release Dataset. To identify the bowl release action more accurately, we adopt a one-stage architecture based on relative boundary modeling. Specifically, our method consists of three stages. In the first stage, we use the Inflated 3D ConvNet (I3D) to extract spatio-temporal features from the input videos. In the second stage, we apply Temporal Action Detection with Relative Boundary Modeling (TriDet), which models the boundaries of the bowl release action from the relative relationships between different time moments and predicts candidate time windows. Lastly, since the target events typically span around 100 frames and the predicted windows may overlap, we apply a confidence-based post-processing step that merges and filters the predictions to produce the final submission. Extensive experiments demonstrate that our proposed method achieves superior performance; we additionally evaluate the training techniques of existing approaches.
Our proposed method achieves a PQ score of 0.519, an SQ score of 0.822, and an RQ score of 0.632 on the challenge set of the DeepSportradar Cricket Bowl Release Dataset. With this approach, our team, USTC_IAT_United, won third place in the first phase of the DeepSportradar Cricket Bowl Release Challenge.
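To make the post-processing stage concrete, here is a minimal sketch of confidence-based merging of overlapping predicted time windows via greedy 1-D non-maximum suppression. This is an illustrative reconstruction, not the paper's exact procedure: the function names, tuple layout `(start, end, confidence)`, and the IoU threshold of 0.5 are our own assumptions.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two 1-D intervals (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def merge_windows(windows, iou_thresh=0.5):
    """Greedy 1-D NMS: visit windows in descending confidence and keep
    each one only if it does not overlap (above iou_thresh) any
    already-kept, higher-confidence window."""
    kept = []
    for s, e, c in sorted(windows, key=lambda w: w[2], reverse=True):
        if all(temporal_iou((s, e), (ks, ke)) < iou_thresh
               for ks, ke, _ in kept):
            kept.append((s, e, c))
    return sorted(kept)

# Two heavily overlapping ~100-frame candidates plus one separate event:
preds = [(0, 100, 0.9), (10, 110, 0.8), (200, 300, 0.7)]
print(merge_windows(preds))  # → [(0, 100, 0.9), (200, 300, 0.7)]
```

In practice, a Soft-NMS-style decay of the confidences of overlapping windows is a common alternative to this hard suppression. As a sanity check on the reported metrics, the panoptic-quality decomposition PQ = SQ × RQ is consistent with the paper's numbers: 0.822 × 0.632 ≈ 0.519.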

Supplementary Material

MP4 File (IDmmsp029-video.mp4)
A video presentation accompanying the paper "Relative Boundary Modeling", introducing the proposed high-resolution cricket bowl release detection framework and its use of I3D features.



Published In
    MMSports '23: Proceedings of the 6th International Workshop on Multimedia Content Analysis in Sports
    October 2023
    174 pages
ISBN: 9798400702693
DOI: 10.1145/3606038
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. challenge paper
    2. feature extraction
    3. temporal action detection
    4. transformer

    Qualifiers

    • Research-article

    Conference

    MM '23

    Acceptance Rates

    Overall Acceptance Rate 29 of 49 submissions, 59%
