NAC-TCN: Temporal Convolutional Networks with Causal Dilated Neighborhood Attention for Emotion Understanding

Published: 05 February 2024
DOI: 10.1145/3639390.3639392

Abstract

In the task of emotion recognition from videos, a key improvement has been to model emotions over time rather than from a single frame. Many architectures address this task, such as GRUs, LSTMs, self-attention, Transformers, and Temporal Convolutional Networks (TCNs). However, these methods suffer from high memory usage, large numbers of operations, or poor gradient flow. We propose Neighborhood Attention with Convolutions TCN (NAC-TCN), which combines the benefits of attention and Temporal Convolutional Networks while ensuring that causal relationships are respected, reducing computation and memory cost. We accomplish this by introducing a causal version of Dilated Neighborhood Attention and combining it with convolutions. Our model achieves comparable, better, or state-of-the-art performance relative to TCNs, TCAN, LSTMs, and GRUs on standard emotion recognition datasets while requiring fewer parameters. We publish our code online for easy reproducibility and use in other projects (GitHub link).
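As a rough illustration of the core idea, the sketch below shows what causal dilated neighborhood attention over a 1-D sequence might look like in PyTorch, based only on the description in the abstract. The function name, shapes, single-head formulation, explicit projection matrices, and zero-padding of the past are illustrative assumptions, not the authors' implementation; their actual code is linked from the paper.

    # Minimal single-head sketch of causal dilated neighborhood attention
    # over a 1-D sequence. Assumptions (not from the paper): zero-padding
    # of the past, no masking of padded keys, and projection matrices
    # supplied by the caller.
    import torch
    import torch.nn.functional as F

    def causal_dilated_neighborhood_attention(x, wq, wk, wv,
                                              kernel_size=3, dilation=1):
        """x: (batch, time, dim). Step t attends only to the kernel_size
        past positions {t, t - d, ..., t - (kernel_size - 1) * d}."""
        b, t, c = x.shape
        q, k, v = x @ wq, x @ wk, x @ wv
        # Left-pad keys/values so every step has a full causal window.
        pad = (kernel_size - 1) * dilation
        k = F.pad(k, (0, 0, pad, 0))
        v = F.pad(v, (0, 0, pad, 0))
        # Indices of the kernel_size dilated past neighbors of each step.
        idx = (torch.arange(t).unsqueeze(1)
               + torch.arange(0, kernel_size * dilation, dilation))
        k_n = k[:, idx]   # (batch, time, kernel_size, dim)
        v_n = v[:, idx]
        # Scaled dot-product attention restricted to the neighborhood.
        attn = torch.einsum('btc,btkc->btk', q, k_n) / c ** 0.5
        attn = attn.softmax(dim=-1)
        return torch.einsum('btk,btkc->btc', attn, v_n)

    # Example: 2 clips, 16 frames, 32-dim features, window 3, dilation 2.
    x = torch.randn(2, 16, 32)
    wq, wk, wv = (torch.randn(32, 32) / 32 ** 0.5 for _ in range(3))
    y = causal_dilated_neighborhood_attention(x, wq, wk, wv,
                                              kernel_size=3, dilation=2)
    assert y.shape == x.shape

In the full architecture, blocks like this would presumably sit inside TCN-style residual layers alongside convolutions, with the dilation growing per layer so that the causal receptive field covers long clips at modest cost.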



Published In

ICVIP '23: Proceedings of the 2023 7th International Conference on Video and Image Processing
December 2023
97 pages
ISBN: 9798400709388
DOI: 10.1145/3639390

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. AFEW-VA
  2. AffWild2
  3. Attention-Based Video Models
  4. EmoReact
  5. Emotion Recognition
  6. Recurrent Neural Networks
  7. Temporal Convolutional Networks
  8. Video Understanding

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICVIP 2023
