FaVoA: Face-Voice Association Favours Ambiguous Speaker Detection

  • Conference paper
  • First Online:
Artificial Neural Networks and Machine Learning – ICANN 2021 (ICANN 2021)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12891)

Included in the following conference series: International Conference on Artificial Neural Networks (ICANN)

Abstract

The strong relation between face and voice can aid active speaker detection systems when faces are visible, even in difficult settings in which the face of a speaker is not clearly visible or several people share the same scene. Being able to estimate a person's frontal facial representation from their speech makes it easier to determine whether that person is a candidate for being classified as an active speaker, even in challenging cases in which no mouth movement is detected from anyone in the scene. By incorporating a face-voice association neural network into an existing state-of-the-art active speaker detection model, we introduce FaVoA (Face-Voice Association Ambiguous Speaker Detector), a neural network model that can correctly classify particularly ambiguous scenarios. FaVoA not only finds positive face-voice associations, but also helps to rule out pairs in which a face does not match a voice. Its use of a gated-bimodal-unit architecture to fuse those models offers a way to quantitatively determine how much each modality contributes to the classification.
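To make the fusion step concrete, below is a minimal sketch of a gated bimodal unit: each modality is projected to a hidden representation, and a sigmoid gate computed from both inputs weights their contributions, so the gate activations can be read off to see how much each modality influenced the decision. The framework (PyTorch), layer sizes, and variable names are illustrative assumptions, not details taken from the FaVoA implementation.

```python
import torch
import torch.nn as nn


class GatedBimodalUnit(nn.Module):
    """Sketch of a gated bimodal unit fusing a visual and an audio feature vector.

    Dimensions and names are hypothetical, chosen only to illustrate the fusion idea.
    """

    def __init__(self, visual_dim: int = 512, audio_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)       # h_v = tanh(W_v x_v)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)         # h_a = tanh(W_a x_a)
        self.gate = nn.Linear(visual_dim + audio_dim, hidden_dim)  # z = sigmoid(W_z [x_v; x_a])

    def forward(self, x_visual: torch.Tensor, x_audio: torch.Tensor):
        h_v = torch.tanh(self.visual_proj(x_visual))
        h_a = torch.tanh(self.audio_proj(x_audio))
        z = torch.sigmoid(self.gate(torch.cat([x_visual, x_audio], dim=-1)))
        # z lies in (0, 1): values near 1 favour the visual branch, values near 0
        # favour the audio branch, so z itself quantifies each modality's
        # contribution to the fused representation.
        fused = z * h_v + (1.0 - z) * h_a
        return fused, z


# Example: fuse one visual and one audio embedding and inspect the gate.
gmu = GatedBimodalUnit()
fused, gate = gmu(torch.randn(1, 512), torch.randn(1, 512))
print(fused.shape, gate.mean().item())  # torch.Size([1, 256]) and the mean gate value
```

In a FaVoA-style pipeline, the fused vector would then feed the final active-speaker classifier; that head is omitted from this sketch.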

The authors thank Leyuan Qu for the constructive feedback and suggestions, and acknowledge partial support from the German Research Foundation DFG under project CML (TRR 169).



Author information

Corresponding author

Correspondence to Hugo Carneiro.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Carneiro, H., Weber, C., Wermter, S. (2021). FaVoA: Face-Voice Association Favours Ambiguous Speaker Detection. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2021. ICANN 2021. Lecture Notes in Computer Science (LNTCS), vol. 12891. Springer, Cham. https://doi.org/10.1007/978-3-030-86362-3_36

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-86362-3_36

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86361-6

  • Online ISBN: 978-3-030-86362-3

  • eBook Packages: Computer Science, Computer Science (R0)
