Dynamic-Static Cross Attentional Feature Fusion Method for Speech Emotion Recognition

  • Conference paper
MultiMedia Modeling (MMM 2023)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13834))

Abstract

Fused dynamic-static features play an important role in speech emotion recognition (SER). However, dynamic and static features are generally fused by simple addition or serial fusion, which may lose some of the underlying emotional information. To address this issue, we propose a dynamic-static cross attentional feature fusion method (SD-CAFF) that uses a cross attentional feature fusion mechanism (Cross AFF) to extract superior deep dynamic-static fusion features. Specifically, Cross AFF fuses, in parallel, the deep features produced by a CNN/LSTM feature extraction module, which extracts deep static features and deep dynamic features from acoustic features (MFCC, Delta, and Delta-delta). In addition to the SD-CAFF framework, we employ multi-task learning during training to further improve emotion recognition accuracy. Experimental results on IEMOCAP show that SD-CAFF achieves a weighted accuracy (WA) of 75.78% and an unweighted accuracy (UA) of 74.89%, outperforming current state-of-the-art methods. Furthermore, SD-CAFF achieves competitive performance (WA: 56.77%; UA: 56.30%) in cross-corpus comparison experiments on MSP-IMPROV.
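
The abstract reports WA and UA without defining them. The sketch below assumes the usual SER convention, which the abstract itself does not confirm: WA is the overall accuracy over all utterances, and UA is the mean of the per-class recalls. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """WA (assumed definition): fraction of all utterances classified correctly."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA (assumed definition): mean per-class recall, each class counted equally."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    total = defaultdict(int)  # utterances per true class
    hits = defaultdict(int)   # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, class-imbalanced toy labels (not data from the paper).
y_true = ["neutral"] * 6 + ["happy"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["happy", "neutral"] + ["sad", "neutral"]

wa = weighted_accuracy(y_true, y_pred)    # 8 of 10 utterances correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 1.0, 0.5, 0.5
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

On imbalanced data the two numbers diverge: in this toy case the majority class lifts WA above UA, which is why SER papers commonly report both.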

Supported by National Natural Science Foundation (NNSF) of China (Grant 61867005).

Corresponding author

Correspondence to Ke Dong.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Dong, K., Peng, H., Che, J. (2023). Dynamic-Static Cross Attentional Feature Fusion Method for Speech Emotion Recognition. In: Dang-Nguyen, DT., et al. MultiMedia Modeling. MMM 2023. Lecture Notes in Computer Science, vol 13834. Springer, Cham. https://doi.org/10.1007/978-3-031-27818-1_29

  • DOI: https://doi.org/10.1007/978-3-031-27818-1_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-27817-4

  • Online ISBN: 978-3-031-27818-1

  • eBook Packages: Computer Science, Computer Science (R0)
