HMM/SVM segmentation and labelling of Arabic speech for speech recognition applications

Author: Halima Bahi

International Journal of Speech Technology, Volume 20, Issue 3

Pages 563 - 573

Published: 01 September 2017

Abstract

Building a large vocabulary continuous speech recognition (LVCSR) system requires many hours of segmented and labelled speech data. Arabic, like many other low-resourced languages, lacks such data, but automatic segmentation has proved to be a good alternative for making these resources available. In this paper, we suggest combining hidden Markov models (HMMs) and support vector machines (SVMs) to segment the speech waveform into phoneme units and to label them. The HMMs generate the sequence of phonemes and their boundaries; the SVM refines the boundaries and corrects the labels. The resulting segmented and labelled units may serve as a training set for speech recognition applications. The HMM/SVM segmentation algorithm is assessed using both the hit rate and the word error rate (WER); the resulting scores were compared with those obtained from manual segmentation and from the well-known embedded learning algorithm. The results show that the speech recognizer built upon the HMM/SVM segmentation outperforms the one built upon the embedded learning segmentation by about 0.05% in terms of WER, even with background noise.
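The two evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the 20 ms tolerance, the greedy one-to-one matching of phoneme boundaries (the abstract's "frontiers"), and the function names `hit_rate` and `word_error_rate` are assumptions made here for illustration only.

```python
from typing import List, Sequence


def hit_rate(reference: Sequence[float], predicted: Sequence[float],
             tolerance: float = 0.02) -> float:
    """Fraction of reference boundaries (in seconds) that have a predicted
    boundary within `tolerance` seconds. The 0.02 s default is an assumed
    value, not one taken from the paper."""
    if not reference:
        raise ValueError("reference must contain at least one boundary")
    unused = sorted(predicted)
    hits = 0
    for ref in sorted(reference):
        # Pick the closest still-unused prediction inside the tolerance window.
        best = None
        for i, p in enumerate(unused):
            if abs(p - ref) <= tolerance and (
                best is None or abs(p - ref) < abs(unused[best] - ref)
            ):
                best = i
        if best is not None:
            hits += 1
            unused.pop(best)  # each prediction may match only one reference
    return hits / len(reference)


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """(substitutions + deletions + insertions) / number of reference words,
    computed with a word-level Levenshtein distance."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference)


if __name__ == "__main__":
    # Hypothetical boundaries in seconds and hypothetical word sequences.
    print(hit_rate([0.10, 0.25, 0.40], [0.11, 0.30, 0.41]))
    print(word_error_rate("a b c d".split(), "a x c".split()))
```

In this sketch the hit rate scores the segmentation directly against manually placed boundaries, while the WER scores it indirectly, through the output of a recognizer trained on the segmented data.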

Cited By

Teimoori, F., & Razzazi, F. (2019). Unsupervised help-trained LS-SVR-based segmentation in speaker diarization system. Multimedia Tools and Applications, 78(9), 11743-11777. doi:10.1007/s11042-018-6621-1. Online publication date: 25-May-2019.
https://dl.acm.org/doi/10.1007/s11042-018-6621-1

Teimoori, F., & Razzazi, F. (2019). Incomplete-Data-Driven Speaker Segmentation for Diarization Application; A Help-Training Approach. Circuits, Systems, and Signal Processing, 38(6), 2489-2522. doi:10.1007/s00034-018-0974-6. Online publication date: 1-Jun-2019.
https://dl.acm.org/doi/10.1007/s00034-018-0974-6

Index Terms

HMM/SVM segmentation and labelling of Arabic speech for speech recognition applications
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Speech recognition
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Hardware
  1. Communication hardware, interfaces and storage
    1. Signal processing systems
  2. Power and energy
    1. Power estimation and optimization
      1. Platform power issues

Index terms have been assigned to the content through auto-classification.

Published In

International Journal of Speech Technology, Volume 20, Issue 3

September 2017

311 pages

ISSN:1381-2416

Copyright © 2017 Springer Science+Business Media, LLC.

Publisher

Springer-Verlag

Berlin, Heidelberg
