DOI:10.1145/3595916.3626362
short-paper

Multi-region CNN-Transformer for Micro-gesture Recognition in Face and Upper Body

Published: 01 January 2024

Abstract

This paper presents a novel task: recognizing unintentional micro-gestures (UMGs) from video. UMGs are movements that people make unconsciously and unintentionally, and recognizing them is crucial because they reveal a person’s underlying psychological state. Since a UMG is composed of subtle sequential movements, a recognition model must capture accurate information in both the spatial and temporal dimensions. We therefore use a convolutional neural network (CNN) to capture spatial information and a Transformer to merge the features extracted by the CNN along the temporal dimension. However, this model often misrecognizes UMGs because it cannot capture slight differences in movements, such as those in the face and mouth regions. To address this issue, we propose the Multi-Region CNN-Transformer, a model for UMG recognition that takes as input videos cropped from multiple upper-body regions simultaneously. The key advance of our method is that it captures subtle changes in regions such as the upper body, face, head, and mouth. We demonstrate the effectiveness of the proposed method through experiments on a UMG dataset that we newly created for this task.
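
The data flow described in the abstract (crop several regions, extract per-frame spatial features, merge them over time, then classify) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not specify the CNN backbone, the Transformer configuration, how regions are located, or how region features are fused, so the sketch assumes fixed crop boxes, a pooled linear layer standing in for the CNN, a single projection-free self-attention step standing in for the Transformer, and concatenation as the fusion step. All function and parameter names are hypothetical.

```python
import numpy as np

# Region names are taken from the abstract; everything else is illustrative.
REGIONS = ("upper_body", "face", "head", "mouth")


def crop_regions(video, boxes):
    """Crop one clip per region from a (T, H, W, C) video.

    boxes maps region name -> (top, left, height, width).
    Fixed boxes are an assumption; a real system would detect them per frame.
    """
    return {name: video[:, top:top + h, left:left + w, :]
            for name, (top, left, h, w) in boxes.items()}


def spatial_features(clip, weights):
    """Stand-in for a CNN: average-pool each frame, then linear layer + ReLU."""
    pooled = clip.mean(axis=(1, 2))            # (T, C)
    return np.maximum(pooled @ weights, 0.0)   # (T, D)


def temporal_self_attention(x):
    """Stand-in for a Transformer layer: scaled dot-product self-attention
    over the time axis, without learned query/key/value projections."""
    scores = x @ x.T / np.sqrt(x.shape[-1])        # (T, T)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # rows sum to 1
    return attn @ x                                # (T, D)


def multi_region_logits(video, boxes, params):
    """Run every region through its own spatial + temporal path, then fuse."""
    clips = crop_regions(video, boxes)
    per_region = []
    for name in REGIONS:
        feats = spatial_features(clips[name], params["cnn"][name])
        merged = temporal_self_attention(feats)
        per_region.append(merged.mean(axis=0))     # temporal average, (D,)
    fused = np.concatenate(per_region)             # assumed fusion: concatenation
    return fused @ params["head"]                  # (num_classes,)


def init_params(rng, channels, dim, num_classes):
    """Random weights, only so that the sketch runs end to end."""
    return {
        "cnn": {name: rng.normal(size=(channels, dim)) for name in REGIONS},
        "head": rng.normal(size=(len(REGIONS) * dim, num_classes)),
    }


rng = np.random.default_rng(0)
video = rng.random((8, 64, 64, 3))                 # 8 frames of 64x64 RGB
boxes = {
    "upper_body": (0, 0, 64, 64),
    "head": (0, 16, 32, 32),
    "face": (4, 20, 24, 24),
    "mouth": (18, 26, 8, 12),
}
params = init_params(rng, channels=3, dim=16, num_classes=5)
logits = multi_region_logits(video, boxes, params)
print(logits.shape)
```

Giving each region its own feature path is one way to keep small regions such as the mouth from being averaged away by the much larger upper-body crop, which is the failure mode the abstract attributes to the single-input model.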


Published In

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
December 2023
745 pages
ISBN:9798400702051
DOI:10.1145/3595916

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Transformer
  2. convolutional neural network
  3. movements
  4. multiple regions
  5. unintentional micro-gesture recognition

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

MMAsia '23
MMAsia '23: ACM Multimedia Asia
December 6 - 8, 2023
Tainan, Taiwan

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 91
  • Downloads (Last 12 months): 91
  • Downloads (Last 6 weeks): 5

Reflects downloads up to 23 Nov 2024