
Interruption Point Detection of Spontaneous Speech Using Inter-Syllable Boundary-Based Prosodic Features

Published: 01 March 2011

Abstract

This article presents a probabilistic scheme for detecting the interruption point (IP) in spontaneous speech using prosodic features extracted at inter-syllable boundaries. Because spontaneous speech recognition has a high error rate, a combined acoustic model that considers both syllable and subsyllable recognition units is first used to determine the inter-syllable boundaries and to output the recognition confidence of the input speech. Based on the finding that IPs always occur at inter-syllable boundaries, a probability distribution of the prosodic features at the current potential IP is estimated. A Conditional Random Field (CRF) model, which uses the clustered prosodic features of the current potential IP and of its preceding and succeeding inter-syllable boundaries, then outputs the IP likelihood measure. Finally, the confidence of the recognized speech, the probability distribution of the prosodic features, and the CRF-based IP likelihood measure are integrated to determine the optimal IP sequence of the input spontaneous speech. Pitch reset and lengthening are also applied to improve IP detection performance. The Mandarin Conversational Dialogue Corpus is adopted for evaluation. Experimental results show that the proposed IP detection approach improves on the hidden Markov model and the Maximum Entropy model by 10.56% and 6.5%, respectively, under the same experimental conditions. In addition, the IP detection error rate is further reduced by 9.15% when pitch reset and lengthening information is used. These results confirm that the proposed model, built on inter-syllable boundary-based prosodic features, can effectively detect the interruption point in spontaneous Mandarin speech.
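The integration step described in the abstract can be pictured with a small sketch. The abstract does not say how the three scores are combined, so the code below assumes a weighted geometric mean (a log-linear combination) followed by a per-boundary threshold; the function names, the weights, the threshold, and the toy numbers are all hypothetical and are not taken from the article.

```python
import math

def ip_scores(confidence, prosody_prob, crf_likelihood, weights=(1.0, 1.0, 1.0)):
    """Combine three per-boundary scores into one IP score in [0, 1].

    Assumption: a weighted geometric mean of the recognition confidence,
    the prosodic-feature probability, and the CRF-based IP likelihood.
    The article's actual integration rule is not given in the abstract.
    """
    eps = 1e-12  # floor to avoid log(0)
    total = sum(weights)
    scores = []
    for c, p, l in zip(confidence, prosody_prob, crf_likelihood):
        log_score = (
            weights[0] * math.log(max(c, eps))
            + weights[1] * math.log(max(p, eps))
            + weights[2] * math.log(max(l, eps))
        ) / total
        scores.append(math.exp(log_score))
    return scores

def detect_ips(scores, threshold=0.5):
    """Return indices of inter-syllable boundaries whose score passes the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]

# Toy values for three inter-syllable boundaries (made up for illustration).
confidence = [0.9, 0.8, 0.9]
prosody_prob = [0.1, 0.9, 0.2]
crf_likelihood = [0.2, 0.8, 0.1]

scores = ip_scores(confidence, prosody_prob, crf_likelihood)
ips = detect_ips(scores)
```

With these toy values only the middle boundary scores above the threshold. A sequence-level search would replace the independent threshold if the scores of neighbouring boundaries were coupled, which this sketch does not model.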


Cited By

  • (2023) Towards Dialogue Modeling Beyond Text. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. DOI: 10.1109/ICASSP49357.2023.10095598. Online publication date: 4-Jun-2023
  • (2017) Miscommunication handling in spoken dialog systems based on error-aware dialog state detection. EURASIP Journal on Audio, Speech, and Music Processing 2017:1, 1-17. DOI: 10.1186/s13636-017-0107-3. Online publication date: 1-Dec-2017
  • (2016) Speech Act Identification Using Semantic Dependency Graphs with Probabilistic Context-Free Grammars. ACM Transactions on Asian and Low-Resource Language Information Processing 15:1, 1-28. DOI: 10.1145/2786978. Online publication date: 7-Jan-2016


      Information & Contributors

      Information

      Published In

      ACM Transactions on Asian Language Information Processing  Volume 10, Issue 1
      March 2011
      88 pages
      ISSN:1530-0226
      EISSN:1558-3430
      DOI:10.1145/1929908

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 01 March 2011
      Accepted: 01 August 2010
      Revised: 01 July 2010
      Received: 01 May 2010
      Published in TALIP Volume 10, Issue 1

      Author Tags

      1. Interruption point detection
      2. conditional random field
      3. disfluency
      4. feature clustering
      5. prosodic feature

      Qualifiers

      • Research-article
      • Research
      • Refereed

