Abstract
Significant research effort has gone into recognizing speakers in video streams by building rich machine learning models from high-level speaker features such as facial expression, emotion, and gender. However, such a model cannot be built with single-modality feature extractors alone, which exploit either the audio signal or the image frames extracted from a video stream. In this paper, we address this problem from a different perspective and propose a novel multimodal data fusion framework called DeepMSRF (Deep Multimodal Speaker Recognition with Feature selection). DeepMSRF takes features from two modalities as input, namely the speaker’s audio and face images, and uses a two-stream VGGNet trained on both modalities to obtain a comprehensive model capable of accurately recognizing the speaker’s identity. We apply DeepMSRF to a subset of the VoxCeleb2 dataset whose metadata is merged with the VGGFace2 dataset. For any given video stream, DeepMSRF first identifies the speaker’s gender and then recognizes the speaker’s name. The experimental results show that DeepMSRF outperforms single-modality speaker recognition methods by at least 3% in accuracy.
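The fusion idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the feature selection method or the final classifier, so the code below assumes variance ranking as a stand-in selector and a nearest-centroid rule as a stand-in classifier. The short hand-written vectors stand in for the outputs of the two VGGNet streams, and all function names and values are hypothetical.

```python
from statistics import pvariance

def fuse(audio_feats, face_feats):
    """Feature-level fusion: concatenate the two per-sample feature vectors."""
    return list(audio_feats) + list(face_feats)

def select_features(samples, keep):
    """Return the indices of the `keep` highest-variance columns.

    Assumption: variance ranking is only a placeholder for the
    feature selection step named in the abstract.
    """
    n_cols = len(samples[0])
    variances = [pvariance([row[j] for row in samples]) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:keep])

def project(sample, indices):
    """Keep only the selected columns of one fused vector."""
    return [sample[j] for j in indices]

def fit_centroids(samples, labels):
    """Average the training vectors of each speaker (placeholder classifier)."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Return the speaker whose centroid is closest in squared Euclidean distance."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda y: dist2(centroids[y], x))

# Toy data: hypothetical 2-D audio and 2-D face features for two speakers.
audio = [[0.9, 0.5], [1.0, 0.5], [0.1, 0.5], [0.0, 0.5]]
face = [[0.5, 0.8], [0.5, 0.9], [0.5, 0.2], [0.5, 0.1]]
labels = ["speaker_a", "speaker_a", "speaker_b", "speaker_b"]

fused = [fuse(a, f) for a, f in zip(audio, face)]
selected = select_features(fused, keep=2)
train = [project(x, selected) for x in fused]
centroids = fit_centroids(train, labels)

query = project(fuse([0.8, 0.5], [0.5, 0.7]), selected)
print(selected, predict(centroids, query))
```

In this toy setup the two constant columns carry no information, so the selector keeps one audio column and one face column, and the classifier then uses evidence from both modalities at once.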
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Asali, E., Shenavarmasouleh, F., Mohammadi, F.G., Suresh, P.S., Arabnia, H.R. (2021). DeepMSRF: A Novel Deep Multimodal Speaker Recognition Framework with Feature Selection. In: Arabnia, H.R., Deligiannidis, L., Shouno, H., Tinetti, F.G., Tran, QN. (eds) Advances in Computer Vision and Computational Biology. Transactions on Computational Science and Computational Intelligence. Springer, Cham. https://doi.org/10.1007/978-3-030-71051-4_3
DOI: https://doi.org/10.1007/978-3-030-71051-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71050-7
Online ISBN: 978-3-030-71051-4
eBook Packages: Computer Science (R0)