Abstract
Cognitive science has established a strong association between faces and voices, since the neuro-cognitive pathways that process the two modalities share a common structure. Face-voice association has recently drawn the attention of the computer vision community with the introduction of large-scale face-voice datasets. Our work leverages this association, together with the availability of large-scale face-voice data, to improve speaker recognition tasks, namely identification and verification. To this end, we propose two novel multimodal networks, one with weight sharing and one without, that learn joint representations of the two modalities and thereby establish the face-voice association. Features are then extracted from the trained multimodal networks and used to perform speaker recognition. We evaluate the proposed networks on speaker recognition and face-voice association tasks on the challenging VoxCeleb1 and MAV-Celeb benchmarks. Our results show that adding facial information improves speaker recognition performance.
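To make the two architectures concrete, the following is a minimal PyTorch sketch of the two variants described above; the layer sizes, depth, and the use of pre-extracted 512-dimensional embeddings are illustrative assumptions, not the paper's exact configuration:

    import torch
    import torch.nn as nn

    class TwoBranchNet(nn.Module):
        """Variant without weight sharing: separate face and voice
        branches mapped into a joint embedding space."""
        def __init__(self, face_dim=512, voice_dim=512, joint_dim=256):
            super().__init__()
            self.face_branch = nn.Sequential(nn.Linear(face_dim, joint_dim), nn.ReLU())
            self.voice_branch = nn.Sequential(nn.Linear(voice_dim, joint_dim), nn.ReLU())

        def forward(self, face_emb, voice_emb):
            return self.face_branch(face_emb), self.voice_branch(voice_emb)

    class SharedBranchNet(nn.Module):
        """Variant with weight sharing: a single encoder applied to both
        modalities, forcing them into a common space."""
        def __init__(self, in_dim=512, joint_dim=256):
            super().__init__()
            self.shared = nn.Sequential(nn.Linear(in_dim, joint_dim), nn.ReLU())

        def forward(self, face_emb, voice_emb):
            return self.shared(face_emb), self.shared(voice_emb)

    # Joint representations from either variant can be fed to a speaker
    # identification or verification head.
    faces = torch.randn(8, 512)    # batch of face embeddings
    voices = torch.randn(8, 512)   # batch of voice embeddings
    z_face, z_voice = SharedBranchNet()(faces, voices)

Either network would then be trained with a face-voice association objective so that embeddings of the same identity are pulled together across modalities.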
Data Availability
In our experiments, we use the VoxCeleb1 dataset [3] to evaluate the proposed method. VoxCeleb1 is approximately gender balanced, with 55% of the speakers male, and the speakers span a wide range of ethnicities, accents, professions, and ages. A data privacy notice is available on the official website of the dataset [69]. Specifically, we extract face embeddings with an Inception-ResNet-V1 network trained with triplet loss, similar to the work of Schroff et al. [62], and audio embeddings (\(\textbf{e}_i\)) with an utterance-level aggregator [34] trained on a speaker recognition task.
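As a rough illustration of this extraction step, the sketch below uses the InceptionResnetV1 implementation from the facenet-pytorch package as a stand-in for the triplet-loss face encoder [62]; UtteranceAggregator is a hypothetical placeholder for the utterance-level voice encoder of [34], for which no standard package is assumed:

    import torch
    from facenet_pytorch import InceptionResnetV1

    # Face encoder: a FaceNet-style Inception-ResNet-V1 trained with
    # triplet loss, following Schroff et al. [62].
    face_encoder = InceptionResnetV1(pretrained='vggface2').eval()

    with torch.no_grad():
        faces = torch.randn(4, 3, 160, 160)    # batch of aligned face crops
        face_embeddings = face_encoder(faces)  # shape: (4, 512)

    # Hypothetical voice encoder producing utterance-level embeddings e_i;
    # in practice this would be a thin-ResNet-style aggregator trained for
    # speaker recognition, as in [34].
    class UtteranceAggregator(torch.nn.Module):
        def __init__(self, n_mels=64, emb_dim=512):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Conv1d(n_mels, 128, 3, padding=1),
                torch.nn.ReLU(),
                torch.nn.AdaptiveAvgPool1d(1),  # aggregate over time
                torch.nn.Flatten(),
                torch.nn.Linear(128, emb_dim),
            )

        def forward(self, spec):  # spec: (batch, n_mels, frames)
            return self.net(spec)

    voice_encoder = UtteranceAggregator().eval()
    with torch.no_grad():
        voice_embeddings = voice_encoder(torch.randn(4, 64, 300))  # (4, 512)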
References
Bai Z, Zhang XL (2021) Speaker recognition based on deep learning: an overview. Neural Netw 140:65–99
Chung JS, Nagrani A, Zisserman A (2018) VoxCeleb2: deep speaker recognition. In: INTERSPEECH
Nagrani A, Chung JS, Zisserman A (2017) VoxCeleb: a large-scale speaker identification dataset. In: INTERSPEECH
Jung JW, Kim YJ, Heo HS, Lee BJ, Kwon Y, Chung JS (2022) Pushing the limits of raw waveform speaker recognition. In: INTERSPEECH
Stoll LL (2011) Finding difficult speakers in automatic speaker recognition. PhD thesis, EECS Department, University of California, Berkeley. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-152.html
Reynolds DA, Quatieri TF, Dunn RB (2000) Speaker verification using adapted Gaussian mixture models. Digit Sig Process 10(1–3):19–41
Kenny P (2005) Joint factor analysis of speaker and session variability: theory and algorithms. Technical report CRIM-06/08-13, CRIM, Montreal
Chatfield K, Simonyan K, Vedaldi A, Zisserman A (2014) Return of the devil in the details: delving deep into convolutional nets. In: British machine vision conference
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 770–778
Nagrani A, Albanie S, Zisserman A (2018) Seeing voices and hearing faces: cross-modal biometric matching. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 8427–8436
Nagrani A, Albanie S, Zisserman A (2018) Learnable pins: cross-modal embeddings for person identity. In: Proceedings of the European conference on computer vision (ECCV). pp 71–88
Saeed MS, Nawaz S, Khan MH, Zaheer MZ, Nandakumar K, Yousaf MH, Mahmood A (2023) Single-branch network for multimodal training. In: ICASSP 2023–2023 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE
Horiguchi S, Kanda N, Nagamatsu K (2018) Face-voice matching using cross-modal embeddings. In: Proceedings of the 26th ACM international conference on multimedia. pp 1011–1019
Nawaz S, Janjua MK, Gallo I, Mahmood A, Calefati A (2019) Deep latent space learning for cross-modal mapping of audio and visual signals. In: 2019 digital image computing: techniques and applications (DICTA). IEEE, pp 1–7
Wen P, Xu Q, Jiang Y, Yang Z, He Y, Huang Q (2021) Seeking the shape of sound: an adaptive framework for learning voice-face association. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 16347–16356
Wen Y, Ismail MA, Liu W, Raj B, Singh R (2019) Disjoint mapping network for cross-modal matching of voices and faces. In: 7th International conference on learning representations, ICLR 2019, New Orleans, LA, USA
Shah SH, Saeed MS, Nawaz S, Yousaf MH (2023) Speaker recognition in realistic scenario using multimodal data. In: 2023 3rd international conference on artificial intelligence (ICAI). IEEE, pp 209–213
Saeed MS, Nawaz S, Khan MH, Javed S, Yousaf MH, Del Bue A (2022) Learning branched fusion and orthogonal projection for face-voice association. arXiv:2208.10238
Nawaz S, Saeed MS, Morerio P, Mahmood A, Gallo I, Yousaf MH, Del Bue A (2021) Cross-modal speaker verification and recognition: a multilingual perspective. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 1682–1691
Albanie S, Nagrani A, Vedaldi A, Zisserman A (2018) Emotion recognition in speech using cross-modal transfer in the wild. In: Proceedings of the 26th ACM international conference on multimedia. pp 292–301
Afouras T, Chung JS, Zisserman A (2018) The conversation: deep audio-visual speech enhancement. In: INTERSPEECH
Koepke AS, Wiles O, Zisserman A (2018) Self-supervised learning of a facial attribute embedding from video. In: BMVC. p 302
Ellis AW (1989) Neuro-cognitive processing of faces and voices. In: Handbook of research on face processing. Elsevier, pp 207–215
Kamachi M, Hill H, Lander K, Vatikiotis-Bateson E (2003) ‘Putting the face to the voice’: matching identity across modality. Curr Biol 13(19):1709–1714
Kim C, Shin HV, Oh TH, Kaspar A, Elgharib M, Matusik W (2018) On learning associations of faces and voices. In: Asian conference on computer vision. Springer, pp 276–292
Pruzansky S (1963) Pattern-matching procedure for automatic talker recognition. J Acoust Soc Am 35(3):354–358
Dehak N, Kenny P, Dehak R, Glembek O, Dumouchel P, Burget L, Hubeika V, Castaldo F (2009) Support vector machines and joint factor analysis for speaker verification. In: 2009 IEEE international conference on acoustics, speech and signal processing. IEEE, pp 4237–4240
Dehak N, Kenny PJ, Dehak R, Dumouchel P, Ouellet P (2011) Front-end factor analysis for speaker verification. IEEE Trans Audio Speech Lang Process 19(4):788–798
Yapanel U, Zhang X, Hansen JH (2002) High performance digit recognition in real car environments. In: Seventh international conference on spoken language processing
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
Lei Y, Scheffer N, Ferrer L, McLaren M (2014) A novel scheme for speaker recognition using a phonetically-aware deep neural network. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 1695–1699
Snyder D, Garcia-Romero D, Sell G, Povey D, Khudanpur S (2018) X-vectors: Robust DNN embeddings for speaker recognition. In: 2018 IEEE International conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 5329–5333
Salman A, Chen K (2011) Exploring speaker-specific characteristics with deep learning. In: The 2011 international joint conference on neural networks. IEEE, pp 103–110
Xie W, Nagrani A, Chung JS, Zisserman A (2019) Utterance-level aggregation for speaker recognition in the wild. In: ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 5791–5795
Arandjelovic R, Gronat P, Torii A, Pajdla T, Sivic J (2016) NetVLAD: CNN architecture for weakly supervised place recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 5297–5307
Zhong Y, Arandjelović R, Zisserman A (2019) GhostVLAD for set-based face recognition. In: Computer vision–ACCV 2018: 14th Asian conference on computer vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part II 14. Springer, pp 35–50
Wang R, Ao J, Zhou L, Liu S, Wei Z, Ko T, Li Q, Zhang Y (2022) Multi-view self-attention based transformer for speaker recognition. In: ICASSP 2022–2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6732–6736
India M, Safari P, Hernando J (2021) Double multi-head attention for speaker verification. In: ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6144–6148
Zhu H, Lee KA, Li H (2021) Serialized multi-layer multi-head attention for neural speaker embedding. In: INTERSPEECH. pp 106–110. https://doi.org/10.21437/Interspeech.2021-2210
Wu CY, Hsu CC, Neumann U (2022) Cross-modal perceptionist: can face geometry be gleaned from voices? In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Wang J, Li C, Zheng A, Tang J, Luo B (2022) Looking and hearing into details: dual-enhanced Siamese adversarial network for audio-visual matching. IEEE Trans Multimedia
Saeed MS, Khan MH, Nawaz S, Yousaf MH, Del Bue A (2022) Fusion and orthogonal projection for improved face-voice association. In: ICASSP 2022–2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 7057–7061
Baltrušaitis T, Ahuja C, Morency LP (2018) Multimodal machine learning: a survey and taxonomy. IEEE Trans Pattern Anal Mach Intell 41(2):423–443
Vielzeuf V, Lechervy A, Pateux S, Jurie F (2018) Centralnet: a multilayer approach for multimodal fusion. In: Proceedings of the European conference on computer vision (ECCV) workshops
Kiela D, Grave E, Joulin A, Mikolov T (2018) Efficient large-scale multi-modal classification. arXiv:1802.02892
Kiela D, Firooz H, Mohan A, Goswami V, Singh A, Ringshia P, Testuggine D (2020) The hateful memes challenge: detecting hate speech in multimodal memes. Adv Neural Inf Process Syst 33:2611–2624
Gallo I, Calefati A, Nawaz S (2017) Multimodal classification fusion in real-world scenarios. In: 2017 14th IAPR international conference on document analysis and recognition (ICDAR), vol 5. IEEE, pp 36–41
Arshad O, Gallo I, Nawaz S, Calefati A (2019) Aiding intra-text representations with visual context for multimodal named entity recognition. In: 2019 international conference on document analysis and recognition (ICDAR). IEEE, pp 337–342
Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 6077–6086
Fukui A, Park DH, Yang D, Rohrbach A, Darrell T, Rohrbach M (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv:1606.01847
Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 3156–3164
Yan L, Han C, Xu Z, Liu D, Wang Q (2023) Prompt learns prompt: exploring knowledge-aware generative prompt collaboration for video captioning. In: Proceedings of international joint conference on artificial intelligence (IJCAI). pp 1622–1630
Yan L, Wang Q, Cui Y, Feng F, Quan X, Zhang X, Liu D (2022) GL-RG: global-local representation granularity for video captioning. arXiv:2205.10706
Popattia M, Rafi M, Qureshi R, Nawaz S (2022) Guiding attention using partial order relationships for image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 4671–4680
Yan L, Liu D, Song Y, Yu C (2020) Multimodal aggregation approach for memory vision-voice indoor navigation with meta-learning. In: 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, pp 5847–5854
Nawaz S, Cavazza J, Del Bue A (2022) Semantically grounded visual embeddings for zero-shot learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 4589–4599
Wang L, Li Y, Lazebnik S (2016) Learning deep structure-preserving image-text embeddings. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 5005–5013
Nagrani A, Chung JS, Albanie S, Zisserman A (2020) Disentangled speech embeddings using cross-modal self-supervision. In: ICASSP 2020–2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6829–6833
Hajavi A, Etemad A (2023) Audio representation learning by distilling video as privileged information. IEEE Trans Artif Intell
Nawaz S (2019) Multimodal representation and learning. PhD thesis, Università degli Studi dell’Insubria
Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In: Thirty-first AAAI conference on artificial intelligence
Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 815–823
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning. pp 448–456. PMLR
Calefati A, Janjua MK, Nawaz S, Gallo I (2018) Git loss for deep face recognition. In: Proceedings of the British machine vision conference (BMVC)
Sarı L, Singh K, Zhou J, Torresani L, Singhal N, Saraf Y (2021) A multi-view approach to audio-visual speaker verification. In: ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6194–6198
Zheng A, Hu M, Jiang B, Huang Y, Yan Y, Luo B (2021) Adversarial-metric learning for audio-visual cross-modal matching. IEEE Trans Multimedia 24:338–351
Ning H, Zheng X, Lu X, Yuan Y (2021) Disentangled representation learning for cross-modal biometric matching. IEEE Trans Multimedia 24:1763–1774
Deng J, Guo J, Xue N, Zafeiriou S (2019) ArcFace: additive angular margin loss for deep face recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 4690–4699
VGG dataset privacy notice. https://www.robots.ox.ac.uk/~vgg/terms/url-lists-privacy-notice.html. Accessed 01 Jan 2024
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Jabeen, S., Amin, M.S. & Li, X. Multimodal pre-train then transfer learning approach for speaker recognition. Multimed Tools Appl 83, 78563–78576 (2024). https://doi.org/10.1007/s11042-024-18575-4