short-paper

Multi-region CNN-Transformer for Micro-gesture Recognition in Face and Upper Body

Authors: Satoshi Suzuki, Naoki Makishima

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia

Article No.: 89, Pages 1 - 5

https://doi.org/10.1145/3595916.3626362

Published: 01 January 2024

Abstract

This paper presents a novel task: recognizing unintentional micro-gestures (UMGs), movements that people make unconsciously and unintentionally, from video. Recognizing UMGs is crucial because they reveal a person’s underlying psychological state. Since a UMG is composed of subtle sequential movements, the recognition model must capture accurate information in both the spatial and temporal directions. Therefore, we utilize a convolutional neural network (CNN) to capture information in the spatial direction and a Transformer to merge the features extracted by the CNN in the temporal direction. However, this model often misrecognizes UMGs because it cannot capture slight differences in movements, such as those in the face and mouth regions. To address this issue, we propose a novel model for UMG recognition, the Multi-Region CNN-Transformer model, which takes cropped videos from multiple upper body regions as simultaneous inputs. The key advance of our method is that it captures subtle changes in regions such as the upper body, face, head, and mouth for recognizing UMGs. We demonstrate the effectiveness of the proposed method through experiments using our newly created UMG dataset for this task.
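The multi-region input described in the abstract can be illustrated with a small data-preparation sketch. It is not the authors' implementation: the region names come from the abstract, but the bounding boxes, the toy 8x8 grayscale frames, and the helper names `crop` and `crop_regions` are assumptions made purely for illustration. In practice the boxes would presumably come from a face or body detector, and each cropped clip would be passed to a per-frame CNN followed by a Transformer over time.

```python
from typing import Dict, List, Tuple

Frame = List[List[int]]            # grayscale frame, H rows x W columns
Box = Tuple[int, int, int, int]    # (top, left, bottom, right), bottom/right exclusive


def crop(frame: Frame, box: Box) -> Frame:
    """Cut one rectangular region out of a single frame."""
    top, left, bottom, right = box
    return [row[left:right] for row in frame[top:bottom]]


def crop_regions(video: List[Frame], boxes: Dict[str, Box]) -> Dict[str, List[Frame]]:
    """Return one cropped clip per region, each with the same number of frames."""
    return {name: [crop(frame, box) for frame in video] for name, box in boxes.items()}


# Toy video: 3 frames of 8x8 pixels with distinct values per position (hypothetical data).
video = [[[t * 100 + r * 8 + c for c in range(8)] for r in range(8)] for t in range(3)]

# Hypothetical fixed boxes for the four regions named in the abstract.
boxes = {
    "upper_body": (0, 0, 8, 8),
    "head": (0, 2, 4, 6),
    "face": (1, 2, 4, 6),
    "mouth": (3, 3, 4, 5),
}

clips = crop_regions(video, boxes)

for name, clip in clips.items():
    # Report frames x height x width for each regional clip.
    print(name, len(clip), len(clip[0]), len(clip[0][0]))
```

Keeping one clip per region, all aligned on the same frame indices, is what would allow a model to process the regions simultaneously, so that small movements in the mouth or face crop are not lost in the full upper-body view.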

Index Terms

1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Redundancy
  2. Embedded and cyber-physical systems
    1. Embedded systems
    2. Robotics
2. Networks
  1. Network properties
    1. Network reliability

Published In

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia

December 2023

745 pages

ISBN:9798400702051

DOI:10.1145/3595916

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper
Research
Refereed limited

Conference

MMAsia '23

Sponsor:

SIGMM

MMAsia '23: ACM Multimedia Asia

December 6 - 8, 2023

Tainan, Taiwan

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%
