
TensorFormer: A Tensor-Based Multimodal Transformer for Multimodal Sentiment Analysis and Depression Detection

Published: 30 December 2022

Abstract

Sentiment analysis is an important research field that aims to extract and fuse sentiment information from human utterances. Because human sentiment is expressed in diverse ways, analysis across multiple modalities is usually more accurate than analysis of a single modality. One effective way to exchange complementary information between related modalities is to perform cross-modality interactions. Transformer-based frameworks have recently shown a strong ability to capture long-range dependencies, which has led to several Transformer-based approaches for multimodal processing. However, because of the Transformer's built-in attention mechanism, only two modalities can interact at a time, so the complementary information flow in these techniques is partial and constrained. To mitigate this, we propose TensorFormer, a tensor-based multimodal Transformer framework that involves all relevant modalities in each interaction. More precisely, we first construct a tensor from the features extracted from each modality; one modality is taken as the target, while the remaining modalities serve as the sources. The corresponding interacted features are then generated by computing source-target attention. This strategy lets all involved modalities interact and yields complementary global information. Experiments on multimodal sentiment analysis benchmark datasets demonstrate the effectiveness of TensorFormer. We also evaluate TensorFormer on a related task, depression detection, where the results reveal significant improvements over other state-of-the-art methods.
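
To make the interaction scheme concrete, below is a minimal sketch of source-target attention over stacked modality features. It is an illustration under assumptions, not the authors' implementation: it assumes PyTorch, a shared feature dimension across modalities, and that the source modalities are concatenated along the sequence axis to form the source tensor; the names SourceTargetAttention, d_model, and n_heads are hypothetical.

```python
# Minimal sketch (assumed PyTorch) of attending from one target modality to a
# source tensor built from all remaining modalities; names are hypothetical.
import torch
import torch.nn as nn


class SourceTargetAttention(nn.Module):
    """One target modality attends to a tensor stacked from all source modalities."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, target: torch.Tensor, sources: list[torch.Tensor]) -> torch.Tensor:
        # target:  (batch, len_target, d_model) features of the target modality
        # sources: list of (batch, len_i, d_model) features of the other modalities
        # Concatenate the source modalities along the sequence axis so the target
        # can attend to every other modality in a single pass.
        source_tensor = torch.cat(sources, dim=1)
        interacted, _ = self.attn(query=target, key=source_tensor, value=source_tensor)
        return interacted  # (batch, len_target, d_model) complementary features


if __name__ == "__main__":
    # Toy usage with three modalities (e.g., text, audio, and video features),
    # taking text as the target and the other two as sources.
    text = torch.randn(2, 20, 64)
    audio = torch.randn(2, 50, 64)
    video = torch.randn(2, 30, 64)
    layer = SourceTargetAttention(d_model=64)
    print(layer(text, [audio, video]).shape)  # torch.Size([2, 20, 64])
```

In this reading of the abstract, each modality takes a turn as the target while the others form the source tensor, so every modality receives complementary information from all of the rest rather than from a single pairwise partner.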


Cited By

  • (2024) A Multimodal Sentiment Analysis Method Based on Fuzzy Attention Fusion. IEEE Transactions on Fuzzy Systems, 32(10), 5886-5898. DOI: 10.1109/TFUZZ.2024.3434614. Online publication date: 1-Oct-2024.



Information

Published In

IEEE Transactions on Affective Computing, Volume 14, Issue 4
Oct.-Dec. 2023
832 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States

Publication History

Published: 30 December 2022

Qualifiers

  • Research-article

