Abstract
Detecting human speech is foundational for a wide range of emerging intelligent applications. However, accurate detection is challenging, especially in the presence of unknown noise patterns. Deep learning-based methods are generally more robust and accurate than statistical and other existing approaches, but building a noise-robust, well-generalized deep learning-based voice activity detection system typically requires collecting an enormous amount of annotated audio data. In this work, we develop a generalized model that is trained on a limited set of noisy human speech recordings, yet can detect human speech in the presence of various unseen noise types that were not present in the training set. To achieve this, we propose a one-class residual-connection-based variational autoencoder (ORVAE), which requires only a limited amount of human speech with noisy backgrounds for training, thereby eliminating the need to collect data with diverse noise patterns. Evaluated on three datasets (synthesized TIMIT and NOISEX-92, synthesized LibriSpeech and NOISEX-92, and a Publicly Recorded dataset), ORVAE outperforms other one-class baseline methods, achieving \(F_1\)-scores of over \(90\%\) across multiple signal-to-noise ratio levels.
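To give a concrete picture of the model family described above, the following is a minimal sketch of a one-class residual variational autoencoder in PyTorch. The 1-D convolutional residual blocks, layer sizes, input features (e.g., 40 spectral bands over 32 frames), and the reconstruction-error decision rule are illustrative assumptions for this sketch, not the exact ORVAE design from the paper.

```python
# Minimal sketch of a one-class residual VAE for voice activity detection.
# Layer sizes, the 1-D residual blocks, and the decision rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResBlock1d(nn.Module):
    """Two 1-D convolutions wrapped by a residual (skip) connection."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return F.relu(x + self.conv2(F.relu(self.conv1(x))))


class ORVAESketch(nn.Module):
    def __init__(self, n_features=40, n_frames=32, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=3, padding=1),
            ResBlock1d(64),
            nn.Flatten(),
        )
        enc_out = 64 * n_frames
        self.fc_mu = nn.Linear(enc_out, latent_dim)
        self.fc_logvar = nn.Linear(enc_out, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, enc_out),
            nn.Unflatten(1, (64, n_frames)),
            ResBlock1d(64),
            nn.Conv1d(64, n_features, kernel_size=3, padding=1),
        )

    def forward(self, x):  # x: (batch, n_features, n_frames)
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: sample z from N(mu, sigma^2).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar


def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction error plus KL divergence to a standard normal prior.
    recon = F.mse_loss(x_hat, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl


# Example: reconstruct a batch of 8 feature patches and compute the loss.
x = torch.randn(8, 40, 32)
model = ORVAESketch()
x_hat, mu, logvar = model(x)
loss = vae_loss(x, x_hat, mu, logvar)
```

Under a one-class setup of this kind, the model is trained only on noisy speech; a plausible decision rule at test time (again an assumption of this sketch) labels a segment as speech when its reconstruction error falls below a threshold tuned on validation data, since segments resembling the training distribution are reconstructed well.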
Acknowledgements
This work was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) Grant funded by the Korea government (MSIT) (No. 2019-0-00421, AI Graduate School Support Program (Sungkyunkwan University)), (No. 2019-0-01343, Regional strategic industry convergence security core talent training business) and the Basic Science Research Program through the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MSIT) (No. 2020R1C1C1006004). This research was also partly supported by the IITP Grant funded by the Korea government (MSIT) (No. 2021-0-00066 and 2021-0-00017, Core Technology Development of Artificial Intelligence Industry), and was partly supported by the MSIT (Ministry of Science and ICT), Korea, under the High-Potential Individuals Global Training Program (2020-0-01550) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Khalid, H., Tariq, S., Kim, T. et al. ORVAE: One-Class Residual Variational Autoencoder for Voice Activity Detection in Noisy Environment. Neural Process Lett 54, 1565–1586 (2022). https://doi.org/10.1007/s11063-021-10695-4