NAC-TCN: Temporal Convolutional Networks with Causal Dilated Neighborhood Attention for Emotion Understanding

Published: 05 February 2024
DOI: 10.1145/3639390.3639392

Abstract

In the task of emotion recognition from videos, a key improvement has been to model emotions over time rather than from a single frame. Many architectures address this task, such as GRUs, LSTMs, self-attention, Transformers, and Temporal Convolutional Networks (TCNs). However, these methods suffer from high memory usage, large numbers of operations, or poor gradient flow. We propose Neighborhood Attention with Convolutions TCN (NAC-TCN), which combines the benefits of attention and Temporal Convolutional Networks while ensuring that causal relationships are respected, reducing computation and memory cost. We accomplish this by introducing a causal version of Dilated Neighborhood Attention and combining it with convolutions. Our model achieves comparable, better, or state-of-the-art performance relative to TCNs, TCAN, LSTMs, and GRUs on standard emotion recognition datasets while requiring fewer parameters. We publish our code online for easy reproducibility and use in other projects (GitHub link).
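As a rough illustration of the core idea, the sketch below shows what causal dilated neighborhood attention over a 1-D sequence might look like in PyTorch, based only on the description in the abstract. The function name, shapes, single-head formulation, explicit projection matrices, and zero-padding of the past are illustrative assumptions, not the authors' implementation; their actual code is linked from the paper.

    # Minimal single-head sketch of causal dilated neighborhood attention
    # over a 1-D sequence. Assumptions (not from the paper): zero-padding
    # of the past, no masking of padded keys, and projection matrices
    # supplied by the caller.
    import torch
    import torch.nn.functional as F

    def causal_dilated_neighborhood_attention(x, wq, wk, wv,
                                              kernel_size=3, dilation=1):
        """x: (batch, time, dim). Step t attends only to the kernel_size
        past positions {t, t - d, ..., t - (kernel_size - 1) * d}."""
        b, t, c = x.shape
        q, k, v = x @ wq, x @ wk, x @ wv
        # Left-pad keys/values so every step has a full causal window.
        pad = (kernel_size - 1) * dilation
        k = F.pad(k, (0, 0, pad, 0))
        v = F.pad(v, (0, 0, pad, 0))
        # Indices of the kernel_size dilated past neighbors of each step.
        idx = (torch.arange(t).unsqueeze(1)
               + torch.arange(0, kernel_size * dilation, dilation))
        k_n = k[:, idx]   # (batch, time, kernel_size, dim)
        v_n = v[:, idx]
        # Scaled dot-product attention restricted to the neighborhood.
        attn = torch.einsum('btc,btkc->btk', q, k_n) / c ** 0.5
        attn = attn.softmax(dim=-1)
        return torch.einsum('btk,btkc->btc', attn, v_n)

    # Example: 2 clips, 16 frames, 32-dim features, window 3, dilation 2.
    x = torch.randn(2, 16, 32)
    wq, wk, wv = (torch.randn(32, 32) / 32 ** 0.5 for _ in range(3))
    y = causal_dilated_neighborhood_attention(x, wq, wk, wv,
                                              kernel_size=3, dilation=2)
    assert y.shape == x.shape

In the full architecture, blocks like this would presumably sit inside TCN-style residual layers alongside convolutions, with the dilation growing per layer so that the causal receptive field covers long clips at modest cost.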



Published In

ICVIP '23: Proceedings of the 2023 7th International Conference on Video and Image Processing
December 2023
97 pages
ISBN: 9798400709388
DOI: 10.1145/3639390

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. AFEW-VA
  2. AffWild2
  3. Attention-Based Video Models
  4. EmoReact
  5. Emotion Recognition
  6. Recurrent Neural Networks
  7. Temporal Convolutional Networks
  8. Video Understanding

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICVIP 2023
