Generative Modeling of Pseudo-Whisper for Robust Whispered Speech Recognition

Published: 01 October 2016

Abstract

Whisper is a common means of communication, used to avoid disturbing individuals or to exchange private information. As a vocal style, whisper would be an ideal candidate for human-computer interaction with handheld devices in open-office or public settings. Unfortunately, current speech technology focuses predominantly on modal (neutral) speech and breaks down completely when exposed to whisper. One of the major barriers to successful whisper recognition engines is the lack of large transcribed whispered speech corpora. This study introduces two strategies that require only a small amount of untranscribed whisper samples to produce large amounts of whisper-like "pseudo-whisper" utterances from easily accessible modal speech recordings. Once generated, the pseudo-whisper samples are used to adapt the modal acoustic models of a speech recognizer toward whisper. The first strategy is based on a Vector Taylor Series (VTS) approach, in which a whisper "background" model is first trained on a small amount of actual whisper data to capture a rough estimate of global whisper characteristics. That background model is then used within the VTS framework to establish transformations, specific to broad phone classes (unvoiced/voiced phones), from each input modal utterance to its pseudo-whispered version. The second strategy generates pseudo-whisper samples by means of denoising autoencoders (DAEs). Two generative models are investigated: one produces pseudo-whisper cepstral features on a frame-by-frame basis, while the other generates pseudo-whisper statistics for whole phone segments. It is shown that the word error rate of a TIMIT-trained speech recognizer on a whisper recognition task with a constrained lexicon is considerably reduced after adapting the acoustic model toward the VTS or DAE pseudo-whisper samples, compared with adapting the model on the available small whisper set.
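The frame-level DAE strategy in the abstract can be pictured as a regression from modal cepstral frames to whispered cepstral frames. The sketch below illustrates that idea only in miniature: a single-hidden-layer network trained with batch gradient descent on synthetic data. The array sizes, the toy random data, and the NumPy-only implementation are all illustrative assumptions, not the paper's actual architecture or corpus.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for 13-dimensional cepstral (MFCC) frames: X plays the role of
# modal-speech frames, Y the role of time-aligned whispered frames. In a real
# system these would come from aligned recordings; here they are synthetic,
# linked by a random linear map plus noise.
n_frames, n_ceps, n_hidden = 512, 13, 32
X = rng.standard_normal((n_frames, n_ceps))
Y = (X @ rng.standard_normal((n_ceps, n_ceps))) * 0.5 \
    + 0.1 * rng.standard_normal((n_frames, n_ceps))

# Single-hidden-layer network (tanh hidden units, linear output) trained
# with batch gradient descent on mean-squared error.
W1 = 0.1 * rng.standard_normal((n_ceps, n_hidden))
b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((n_hidden, n_ceps))
b2 = np.zeros(n_ceps)

def forward(frames):
    hidden = np.tanh(frames @ W1 + b1)
    return hidden, hidden @ W2 + b2

_, Y_init = forward(X)
mse0 = float(np.mean((Y_init - Y) ** 2))  # error before training

lr = 0.01
for _ in range(200):
    H, Y_hat = forward(X)
    err = Y_hat - Y                    # gradient of 0.5*MSE w.r.t. output
    gW2 = H.T @ err / n_frames
    gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1.0 - H**2)   # backpropagate through tanh
    gW1 = X.T @ dH / n_frames
    gb1 = dH.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# The trained network now maps modal frames to pseudo-whisper frames.
_, pseudo = forward(X)
mse = float(np.mean((pseudo - Y) ** 2))
```

In the actual system, such mapped features would stand in for scarce real whisper data when adapting the recognizer's acoustic models; the paper additionally investigates a segment-level variant that generates statistics for whole phone segments rather than individual frames.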




Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 24, Issue 10
October 2016
195 pages
ISSN: 2329-9290
EISSN: 2329-9304

Publisher

IEEE Press


Qualifiers

  • Research-article


Cited By

  • DualVoice: Speech Interaction that Discriminates between Normal and Whispered Voice Input. Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pp. 1-10, Oct. 2022. DOI: 10.1145/3526113.3545685
  • DualVoice: A Speech Interaction Method Using Whisper-Voice as Commands. Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems, pp. 1-6, Apr. 2022. DOI: 10.1145/3491101.3519700
  • Acceptability of Speech and Silent Speech Input Methods in Private and Public. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1-13, May 2021. DOI: 10.1145/3411764.3445430
  • ProxiTalk. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 3, no. 3, pp. 1-25, Sep. 2019. DOI: 10.1145/3351276
  • Whispered Speech Recognition Using Deep Denoising Autoencoder and Inverse Filtering. IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 25, no. 12, pp. 2313-2322, Dec. 2017. DOI: 10.1109/TASLP.2017.2738559
  • Deep neural network training for whispered speech recognition using small databases and generative model sampling. International Journal of Speech Technology, vol. 20, no. 4, pp. 1063-1075, Dec. 2017. DOI: 10.1007/s10772-017-9461-x
