Abstract
Multi-stream automatic speech recognition (MS-ASR) has been shown to improve recognition performance in noisy conditions. In such a system, the generation and fusion of the streams are essential components and must be designed to reduce the effect of noise on the final decision. This paper shows how to improve MS-ASR performance by addressing two questions: (1) how many streams should be combined, and (2) how should they be combined. First, we propose a novel approach based on stream reliability for selecting the number of streams to fuse. Second, we introduce a fusion method based on parallel hidden Markov models. Applying these methods to two datasets (TIMIT and RATS) under different noise conditions, we show an improvement in MS-ASR performance.
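To make the selection idea concrete: one common reliability proxy in multi-stream ASR is the average entropy of each stream's per-frame phoneme posteriors (a confident, low-entropy stream is treated as more reliable). The sketch below is illustrative only and is not the authors' exact reliability measure or fusion method; the function names and the averaging fusion are assumptions.

```python
import numpy as np

def posterior_entropy(posteriors):
    """Average frame-wise entropy of a (frames x classes) posterior matrix.

    Lower average entropy is taken here as a proxy for higher stream
    reliability (the classifier is more confident on this stream).
    """
    eps = 1e-12
    p = np.clip(posteriors, eps, 1.0)
    return float(np.mean(-np.sum(p * np.log(p), axis=1)))

def select_and_fuse(streams, n_select):
    """Rank streams by entropy (ascending) and average the n_select most
    reliable ones into a single fused posterior matrix."""
    ranked = sorted(streams, key=posterior_entropy)
    return np.mean(ranked[:n_select], axis=0)

# Toy example: two streams over 3 frames and 4 phoneme classes.
clean = np.array([[0.9, 0.05, 0.03, 0.02]] * 3)   # confident -> low entropy
noisy = np.array([[0.25, 0.25, 0.25, 0.25]] * 3)  # uncertain -> high entropy
fused = select_and_fuse([noisy, clean], n_select=1)
```

With `n_select=1` the fused output equals the clean stream's posteriors, since it has the lower entropy; varying `n_select` is the "how many streams" question the paper addresses.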
Notes
We used the Quicknet toolbox developed at the International Computer Science Institute (http://www1.icsi.berkeley.edu/Speech/qn.html).
Acknowledgments
The authors would like to thank Professor Hynek Hermansky for his valuable comments.
Cite this article
Sagha, H., Li, F., Variani, E. et al. Stream fusion for multi-stream automatic speech recognition. Int J Speech Technol 19, 669–675 (2016). https://doi.org/10.1007/s10772-016-9357-1