Generative Modeling of Pseudo-Whisper for Robust Whispered Speech Recognition

Published: 01 October 2016

Abstract

Whisper is a common means of communication, used to avoid disturbing individuals or to exchange private information. As a vocal style, whisper would be an ideal candidate for human-computer interaction with handheld devices in open-office or public settings. Unfortunately, current speech technology focuses predominantly on modal (neutral) speech and breaks down completely when exposed to whisper. One of the major barriers to successful whisper recognition engines is the lack of large transcribed whispered speech corpora. This study introduces two strategies that require only a small amount of untranscribed whisper samples to produce large amounts of whisper-like "pseudo-whisper" utterances from easily accessible modal speech recordings. Once generated, the pseudo-whisper samples are used to adapt the modal acoustic models of a speech recognizer toward whisper. The first strategy is based on a Vector Taylor Series (VTS) approach, in which a whisper "background" model is first trained on a small amount of actual whisper data to capture a rough estimate of global whisper characteristics. That background model is then used within the VTS framework to establish transformations, specific to broad phone classes (unvoiced/voiced phones), from each input modal utterance to its pseudo-whispered version. The second strategy generates pseudo-whisper samples by means of denoising autoencoders (DAEs). Two generative models are investigated: one produces pseudo-whisper cepstral features on a frame-by-frame basis, while the other generates pseudo-whisper statistics for whole phone segments. It is shown that the word error rate of a TIMIT-trained speech recognizer on a whisper recognition task with a constrained lexicon is considerably reduced after adapting the acoustic model toward the VTS or DAE pseudo-whisper samples, compared with adapting the model on the available small whisper set.
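The frame-level DAE strategy in the abstract can be pictured as a regression from modal cepstral frames to whispered cepstral frames. The sketch below illustrates that idea only in miniature: a single-hidden-layer network trained with batch gradient descent on synthetic data. The array sizes, the toy random data, and the NumPy-only implementation are all illustrative assumptions, not the paper's actual architecture or corpus.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for 13-dimensional cepstral (MFCC) frames: X plays the role of
# modal-speech frames, Y the role of time-aligned whispered frames. In a real
# system these would come from aligned recordings; here they are synthetic,
# linked by a random linear map plus noise.
n_frames, n_ceps, n_hidden = 512, 13, 32
X = rng.standard_normal((n_frames, n_ceps))
Y = (X @ rng.standard_normal((n_ceps, n_ceps))) * 0.5 \
    + 0.1 * rng.standard_normal((n_frames, n_ceps))

# Single-hidden-layer network (tanh hidden units, linear output) trained
# with batch gradient descent on mean-squared error.
W1 = 0.1 * rng.standard_normal((n_ceps, n_hidden))
b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((n_hidden, n_ceps))
b2 = np.zeros(n_ceps)

def forward(frames):
    hidden = np.tanh(frames @ W1 + b1)
    return hidden, hidden @ W2 + b2

_, Y_init = forward(X)
mse0 = float(np.mean((Y_init - Y) ** 2))  # error before training

lr = 0.01
for _ in range(200):
    H, Y_hat = forward(X)
    err = Y_hat - Y                    # gradient of 0.5*MSE w.r.t. output
    gW2 = H.T @ err / n_frames
    gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1.0 - H**2)   # backpropagate through tanh
    gW1 = X.T @ dH / n_frames
    gb1 = dH.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# The trained network now maps modal frames to pseudo-whisper frames.
_, pseudo = forward(X)
mse = float(np.mean((pseudo - Y) ** 2))
```

In the actual system, such mapped features would stand in for scarce real whisper data when adapting the recognizer's acoustic models; the paper additionally investigates a segment-level variant that generates statistics for whole phone segments rather than individual frames.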




Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 24, Issue 10
October 2016
195 pages
ISSN: 2329-9290
EISSN: 2329-9304

Publisher

IEEE Press


Qualifiers

  • Research-article


Cited By

  • DualVoice: Speech Interaction that Discriminates between Normal and Whispered Voice Input. Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pp. 1-10, Oct. 2022. DOI: 10.1145/3526113.3545685
  • DualVoice: A Speech Interaction Method Using Whisper-Voice as Commands. Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems, pp. 1-6, Apr. 2022. DOI: 10.1145/3491101.3519700
  • Acceptability of Speech and Silent Speech Input Methods in Private and Public. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1-13, May 2021. DOI: 10.1145/3411764.3445430
  • ProxiTalk. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 3, no. 3, pp. 1-25, Sep. 2019. DOI: 10.1145/3351276
  • Whispered Speech Recognition Using Deep Denoising Autoencoder and Inverse Filtering. IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 25, no. 12, pp. 2313-2322, Dec. 2017. DOI: 10.1109/TASLP.2017.2738559
  • Deep neural network training for whispered speech recognition using small databases and generative model sampling. International Journal of Speech Technology, vol. 20, no. 4, pp. 1063-1075, Dec. 2017. DOI: 10.1007/s10772-017-9461-x
