DeepMSRF: A Novel Deep Multimodal Speaker Recognition Framework with Feature Selection

  • Conference paper
Advances in Computer Vision and Computational Biology

Abstract

Considerable research has gone into recognizing speakers in video streams by building rich machine learning models from high-level speaker features such as facial expression, emotion, and gender. However, such a model cannot be built with single-modality feature extractors that exploit only the audio signal or only the image frames extracted from a video stream. In this paper, we address the problem from a different perspective and propose an unprecedented multimodal data fusion framework called DeepMSRF (Deep Multimodal Speaker Recognition with Feature selection). DeepMSRF takes features from two modalities as input: the speaker's audio and face images. It uses a two-stream VGGNET trained on both modalities to obtain a comprehensive model capable of accurately recognizing the speaker's identity. We apply DeepMSRF to a subset of the VoxCeleb2 dataset whose metadata is merged with the VGGFace2 dataset. For any given video stream, DeepMSRF first identifies the gender of the speaker and then recognizes his or her name. The experimental results show that DeepMSRF outperforms single-modality speaker recognition methods by at least 3% in accuracy.
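
The fusion idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how the two streams are combined, which feature selection method is used, or which classifier makes the final decision. The sketch assumes feature-level fusion by concatenation, a variance-based selector, and a nearest-centroid classifier, and all function names (`fuse_features`, `select_top_variance`, `fit_centroids`, `predict`) are hypothetical.

```python
import numpy as np


def fuse_features(audio_feats, face_feats):
    """Concatenate per-sample feature vectors from the two streams.

    Assumption: fusion is a plain concatenation along the feature axis.
    """
    return np.concatenate([audio_feats, face_feats], axis=1)


def select_top_variance(features, k):
    """Return the sorted indices of the k highest-variance columns.

    Assumption: variance ranking stands in for the paper's unspecified
    feature selection step.
    """
    variances = features.var(axis=0)
    return np.sort(np.argsort(variances)[-k:])


def fit_centroids(features, labels):
    """Compute one mean vector per class (stand-in classifier)."""
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(features, classes, centroids):
    """Assign each sample to the class with the closest centroid."""
    diffs = features[:, None, :] - centroids[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return classes[dists.argmin(axis=1)]


if __name__ == "__main__":
    # Toy stand-ins for the outputs of an audio stream and a face stream.
    audio = np.array([[0.0, 1.0], [0.0, 1.0], [10.0, 1.0], [10.0, 1.0]])
    face = np.array([[5.0], [5.0], [-5.0], [-5.0]])
    speakers = np.array([0, 0, 1, 1])

    fused = fuse_features(audio, face)
    keep = select_top_variance(fused, k=2)
    classes, centroids = fit_centroids(fused[:, keep], speakers)
    print(predict(fused[:, keep], classes, centroids))
```

The two-stage decision described in the abstract (gender first, then name) could reuse the same pieces by fitting one classifier on gender labels and a separate identity classifier per gender; that arrangement is likewise an assumption of this sketch.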



Author information

Corresponding author

Correspondence to Ehsan Asali.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Asali, E., Shenavarmasouleh, F., Mohammadi, F.G., Suresh, P.S., Arabnia, H.R. (2021). DeepMSRF: A Novel Deep Multimodal Speaker Recognition Framework with Feature Selection. In: Arabnia, H.R., Deligiannidis, L., Shouno, H., Tinetti, F.G., Tran, QN. (eds) Advances in Computer Vision and Computational Biology. Transactions on Computational Science and Computational Intelligence. Springer, Cham. https://doi.org/10.1007/978-3-030-71051-4_3

  • DOI: https://doi.org/10.1007/978-3-030-71051-4_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-71050-7

  • Online ISBN: 978-3-030-71051-4

  • eBook Packages: Computer Science, Computer Science (R0)
