
TensorFormer: A Tensor-Based Multimodal Transformer for Multimodal Sentiment Analysis and Depression Detection

Published: 30 December 2022

Abstract

Sentiment analysis is an important research field that aims to extract and fuse sentiment information from human utterances. Because human sentiment is expressed in diverse ways, analysis across multiple modalities is usually more accurate than analysis of a single modality. One effective way to exchange complementary information between related modalities is to perform cross-modality interactions. Transformer-based frameworks have recently shown a strong ability to capture long-range dependencies, which has led to several Transformer-based approaches for multimodal processing. However, because of the Transformer's built-in attention mechanism, only two modalities can interact at a time, so the complementary information flow in these techniques is partial and constrained. To mitigate this, we propose TensorFormer, a tensor-based multimodal Transformer framework that involves all relevant modalities in each interaction. More precisely, we first construct a tensor from the features extracted from each modality; one modality is taken as the target, while the remaining modalities serve as the sources. The corresponding interacted features are then generated by computing source-target attention. This strategy lets all involved modalities interact and yields complementary global information. Experiments on multimodal sentiment analysis benchmark datasets demonstrate the effectiveness of TensorFormer. We also evaluate TensorFormer on a related task, depression detection, where the results reveal significant improvements over other state-of-the-art methods.
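
To make the interaction scheme concrete, below is a minimal sketch of source-target attention over stacked modality features. It is an illustration under assumptions, not the authors' implementation: it assumes PyTorch, a shared feature dimension across modalities, and that the source modalities are concatenated along the sequence axis to form the source tensor; the names SourceTargetAttention, d_model, and n_heads are hypothetical.

```python
# Minimal sketch (assumed PyTorch) of attending from one target modality to a
# source tensor built from all remaining modalities; names are hypothetical.
import torch
import torch.nn as nn


class SourceTargetAttention(nn.Module):
    """One target modality attends to a tensor stacked from all source modalities."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, target: torch.Tensor, sources: list[torch.Tensor]) -> torch.Tensor:
        # target:  (batch, len_target, d_model) features of the target modality
        # sources: list of (batch, len_i, d_model) features of the other modalities
        # Concatenate the source modalities along the sequence axis so the target
        # can attend to every other modality in a single pass.
        source_tensor = torch.cat(sources, dim=1)
        interacted, _ = self.attn(query=target, key=source_tensor, value=source_tensor)
        return interacted  # (batch, len_target, d_model) complementary features


if __name__ == "__main__":
    # Toy usage with three modalities (e.g., text, audio, and video features),
    # taking text as the target and the other two as sources.
    text = torch.randn(2, 20, 64)
    audio = torch.randn(2, 50, 64)
    video = torch.randn(2, 30, 64)
    layer = SourceTargetAttention(d_model=64)
    print(layer(text, [audio, video]).shape)  # torch.Size([2, 20, 64])
```

In this reading of the abstract, each modality takes a turn as the target while the others form the source tensor, so every modality receives complementary information from all of the rest rather than from a single pairwise partner.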


Cited By

  • (2024) A Multimodal Sentiment Analysis Method Based on Fuzzy Attention Fusion. IEEE Transactions on Fuzzy Systems, 32(10), 5886-5898. DOI: 10.1109/TFUZZ.2024.3434614. Online publication date: 1-Oct-2024.



Information

Published In

IEEE Transactions on Affective Computing, Volume 14, Issue 4
Oct.-Dec. 2023
832 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States

Publication History

Published: 30 December 2022

Qualifiers

  • Research-article

