Abstract
Dynamic-static fusion features play an important role in speech emotion recognition (SER). However, dynamic and static features are usually fused by simple addition or serial concatenation, which can discard underlying emotional information. To address this issue, we propose a dynamic-static cross attentional feature fusion method (SD-CAFF) built on a cross attentional feature fusion mechanism (Cross AFF) to extract superior deep dynamic-static fusion features. Specifically, a CNN/LSTM feature extraction module first derives deep static features and deep dynamic features from the acoustic features (MFCC, Delta, and Delta-delta); Cross AFF then fuses the two deep feature streams in parallel. In addition, we employ multi-task learning during training to further improve recognition accuracy. Experimental results on IEMOCAP show that SD-CAFF achieves a weighted accuracy (WA) of 75.78% and an unweighted accuracy (UA) of 74.89%, outperforming current state-of-the-art methods. Furthermore, SD-CAFF achieves competitive performance (WA: 56.77%; UA: 56.30%) in cross-corpus experiments on MSP-IMPROV.
Supported by National Natural Science Foundation (NNSF) of China (Grant 61867005).
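The pipeline described in the abstract maps naturally to code. Below is a minimal sketch assuming a PyTorch implementation: MFCC, Delta, and Delta-delta features are extracted with librosa; a CNN branch encodes the static MFCC stream, an LSTM branch encodes the dynamic Delta streams, and a cross-attention module fuses the two in parallel. All layer sizes, the multi-head attention formulation, and the auxiliary head (e.g., gender classification, cf. Nediyanchath et al. below) are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of an SD-CAFF-style pipeline. Layer sizes, the attention
# formulation, and the auxiliary task are illustrative assumptions, not the
# authors' exact implementation.
import librosa
import torch
import torch.nn as nn

def acoustic_features(path: str, n_mfcc: int = 40):
    """MFCC plus first/second-order deltas, each shaped (frames, n_mfcc)."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return [torch.tensor(f.T, dtype=torch.float32) for f in (mfcc, delta, delta2)]

class CrossAFF(nn.Module):
    """Parallel cross-attentional fusion of two feature sequences."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.s2d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.d2s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, static, dynamic):            # (batch, frames, dim) each
        s, _ = self.s2d(static, dynamic, dynamic)  # static attends to dynamic
        d, _ = self.d2s(dynamic, static, static)   # dynamic attends to static
        return (s + d).mean(dim=1)                 # pooled fused representation

class SDCAFFSketch(nn.Module):
    def __init__(self, n_mfcc: int = 40, dim: int = 128, n_emotions: int = 4):
        super().__init__()
        # CNN branch: deep static features from MFCC
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mfcc, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU())
        # LSTM branch: deep dynamic features from Delta + Delta-delta
        self.lstm = nn.LSTM(2 * n_mfcc, dim, batch_first=True)
        self.fusion = CrossAFF(dim)
        self.emotion_head = nn.Linear(dim, n_emotions)
        self.aux_head = nn.Linear(dim, 2)  # assumed auxiliary task (e.g., gender)

    def forward(self, mfcc, delta, delta2):        # (batch, frames, n_mfcc) each
        static = self.cnn(mfcc.transpose(1, 2)).transpose(1, 2)
        dynamic, _ = self.lstm(torch.cat([delta, delta2], dim=-1))
        fused = self.fusion(static, dynamic)
        return self.emotion_head(fused), self.aux_head(fused)
```

In the multi-task setup, the training loss would be a weighted sum of the emotion loss and the auxiliary loss; cross-attending in parallel lets each feature stream weight the other, rather than summing or concatenating them blindly, which is the information loss the abstract addresses.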
References
Busso, C., et al.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335–359 (2008). https://doi.org/10.1007/s10579-008-9076-6
Busso, C., Parthasarathy, S., Burmania, A., AbdelWahab, M., Sadoughi, N., Provost, E.M.: MSP-IMPROV: an acted corpus of dyadic interactions to study emotion perception. IEEE Trans. Affect. Comput. 8(1), 67–80 (2016). https://doi.org/10.1109/TAFFC.2016.2515617
Cao, Q., Hou, M., Chen, B., Zhang, Z., Lu, G.: Hierarchical network based on the fusion of static and dynamic features for speech emotion recognition. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6334–6338. IEEE (2021). https://doi.org/10.1109/icassp39728.2021.9414540
Chen, Y., Dai, X., Liu, M., Chen, D., Yuan, L., Liu, Z.: Dynamic ReLU. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12364, pp. 351–367. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58529-7_21
Dai, Y., Gieseke, F., Oehmcke, S., Wu, Y., Barnard, K.: Attentional feature fusion. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3560–3569 (2021). https://doi.org/10.1109/WACV48630.2021.00360
Huilian, L., Weiping, H., Yan, W.: Speech emotion recognition based on BLSTM and CNN feature fusion. In: Proceedings of the 2020 4th International Conference on Digital Signal Processing, pp. 169–172 (2020). https://doi.org/10.1145/3408127.3408192
Lambrecht, L., Kreifelts, B., Wildgruber, D.: Gender differences in emotion recognition: impact of sensory modality and emotional category. Cogn. Emot. 28(3), 452–469 (2014). https://doi.org/10.1080/02699931.2013.837378
Latif, S., Rana, R., Khalifa, S., Jurdak, R., Epps, J., Schuller, B.W.: Multi-task semi-supervised adversarial autoencoding for speech emotion recognition. IEEE Trans. Affect. Comput. 13(2), 992–1004 (2020). https://doi.org/10.1109/taffc.2020.2983669
Li, Y., Baidoo, C., Cai, T., Kusi, G.A.: Speech emotion recognition using 1D CNN with no attention. In: 2019 23rd International Computer Science and Engineering Conference (ICSEC), pp. 351–356. IEEE (2019). https://doi.org/10.1109/ICSEC47112.2019.8974716
Liu, J., Liu, Z., Wang, L., Guo, L., Dang, J.: Speech emotion recognition with local-global aware deep representation learning. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7174–7178. IEEE (2020). https://doi.org/10.1109/icassp40776.2020.9053192
Liu, L.Y., Liu, W.Z., Zhou, J., Deng, H.Y., Feng, L.: ATDA: attentional temporal dynamic activation for speech emotion recognition. Knowl.-Based Syst. 243, 108472 (2022). https://doi.org/10.1016/j.knosys.2022.108472
Nediyanchath, A., Paramasivam, P., Yenigalla, P.: Multi-head attention for speech emotion recognition with auxiliary learning of gender recognition. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7179–7183. IEEE (2020). https://doi.org/10.1109/icassp40776.2020.9054073
Shirian, A., Guha, T.: Compact graph architecture for speech emotion recognition. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6284–6288. IEEE (2021). https://doi.org/10.1109/icassp39728.2021.9413876
Su, B.H., Chang, C.M., Lin, Y.S., Lee, C.C.: Improving speech emotion recognition using graph attentive bi-directional gated recurrent unit network. In: INTERSPEECH, pp. 506–510 (2020). https://doi.org/10.21437/interspeech.2020-1733
Sun, B., Wei, Q., Li, L., Xu, Q., He, J., Yu, L.: LSTM for dynamic emotion and group emotion recognition in the wild. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 451–457 (2016). https://doi.org/10.1145/2993148.2997640
Sun, S.: A survey of multi-view machine learning. Neural Comput. Appl. 23(7), 2031–2038 (2013). https://doi.org/10.1007/s00521-013-1362-6
Ullah, A., Muhammad, K., Del Ser, J., Baik, S.W., de Albuquerque, V.H.C.: Activity recognition using temporal optical flow convolutional features and multilayer LSTM. IEEE Trans. Industr. Electron. 66(12), 9692–9702 (2018). https://doi.org/10.1109/TIE.2018.2881943
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Yang, J., Yang, J.Y., Zhang, D., Lu, J.F.: Feature fusion: parallel strategy vs. serial strategy. Pattern Recogn. 36(6), 1369–1381 (2003). https://doi.org/10.1016/S0031-3203(02)00262-5
Yoon, S., Byun, S., Jung, K.: Multimodal speech emotion recognition using audio and text. In: 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 112–118. IEEE (2018). https://doi.org/10.1109/SLT.2018.8639583
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Dong, K., Peng, H., Che, J. (2023). Dynamic-Static Cross Attentional Feature Fusion Method for Speech Emotion Recognition. In: Dang-Nguyen, D.T., et al. (eds.) MultiMedia Modeling. MMM 2023. Lecture Notes in Computer Science, vol. 13834. Springer, Cham. https://doi.org/10.1007/978-3-031-27818-1_29
DOI: https://doi.org/10.1007/978-3-031-27818-1_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-27817-4
Online ISBN: 978-3-031-27818-1
eBook Packages: Computer Science (R0)