Abstract
Detecting human speech is foundational for a wide range of emerging intelligent applications. However, accurate detection is challenging, especially in the presence of unknown noise patterns. Deep learning-based methods are generally more robust and accurate than statistical and other existing approaches, but building a noise-robust, well-generalized deep learning-based voice activity detection system typically requires collecting an enormous amount of annotated audio data. In this work, we develop a generalized model that is trained on a limited set of noisy human speech recordings, yet can detect human speech in the presence of various unseen noise types that were not present in the training set. To achieve this, we propose a one-class residual-connection-based variational autoencoder (ORVAE), which requires only a limited amount of human speech with noisy backgrounds for training, thereby eliminating the need to collect data with diverse noise patterns. Evaluated on three datasets (synthesized TIMIT and NOISEX-92, synthesized LibriSpeech and NOISEX-92, and a Publicly Recorded dataset), ORVAE outperforms other one-class baseline methods, achieving \(F_1\)-scores of over \(90\%\) across multiple signal-to-noise ratio levels.
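To give a concrete picture of the model family described above, the following is a minimal sketch of a one-class residual variational autoencoder in PyTorch. The 1-D convolutional residual blocks, layer sizes, input features (e.g., 40 spectral bands over 32 frames), and the reconstruction-error decision rule are illustrative assumptions for this sketch, not the exact ORVAE design from the paper.

```python
# Minimal sketch of a one-class residual VAE for voice activity detection.
# Layer sizes, the 1-D residual blocks, and the decision rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResBlock1d(nn.Module):
    """Two 1-D convolutions wrapped by a residual (skip) connection."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return F.relu(x + self.conv2(F.relu(self.conv1(x))))


class ORVAESketch(nn.Module):
    def __init__(self, n_features=40, n_frames=32, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=3, padding=1),
            ResBlock1d(64),
            nn.Flatten(),
        )
        enc_out = 64 * n_frames
        self.fc_mu = nn.Linear(enc_out, latent_dim)
        self.fc_logvar = nn.Linear(enc_out, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, enc_out),
            nn.Unflatten(1, (64, n_frames)),
            ResBlock1d(64),
            nn.Conv1d(64, n_features, kernel_size=3, padding=1),
        )

    def forward(self, x):  # x: (batch, n_features, n_frames)
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: sample z from N(mu, sigma^2).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar


def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction error plus KL divergence to a standard normal prior.
    recon = F.mse_loss(x_hat, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl


# Example: reconstruct a batch of 8 feature patches and compute the loss.
x = torch.randn(8, 40, 32)
model = ORVAESketch()
x_hat, mu, logvar = model(x)
loss = vae_loss(x, x_hat, mu, logvar)
```

Under a one-class setup of this kind, the model is trained only on noisy speech; a plausible decision rule at test time (again an assumption of this sketch) labels a segment as speech when its reconstruction error falls below a threshold tuned on validation data, since segments resembling the training distribution are reconstructed well.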
Acknowledgements
This work was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) Grant funded by the Korea government (MSIT) (No. 2019-0-00421, AI Graduate School Support Program (Sungkyunkwan University)), (No. 2019-0-01343, Regional strategic industry convergence security core talent training business) and the Basic Science Research Program through the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MSIT) (No. 2020R1C1C1006004). This research was also partly supported by the IITP Grant funded by the Korea government (MSIT) (No. 2021-0-00066 and 2021-0-00017, Core Technology Development of Artificial Intelligence Industry), and was partly supported by the MSIT (Ministry of Science and ICT), Korea, under the High-Potential Individuals Global Training Program (2020-0-01550) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Khalid, H., Tariq, S., Kim, T. et al. ORVAE: One-Class Residual Variational Autoencoder for Voice Activity Detection in Noisy Environment. Neural Process Lett 54, 1565–1586 (2022). https://doi.org/10.1007/s11063-021-10695-4