Interruption Point Detection of Spontaneous Speech Using Inter-Syllable Boundary-Based Prosodic Features

Authors: Chung-Hsien Wu, Jui-Feng Yeh

ACM Transactions on Asian Language Information Processing (TALIP), Volume 10, Issue 1

Article No.: 6, Pages 1 - 21

https://doi.org/10.1145/1929908.1929914

Published: 01 March 2011

Abstract

This article presents a probabilistic scheme for detecting the interruption point (IP) in spontaneous speech based on inter-syllable boundary-based prosodic features. Because of the high error rate in spontaneous speech recognition, a combined acoustic model that considers both syllable and subsyllable recognition units is first used to determine the inter-syllable boundaries and output the recognition confidence of the input speech. Based on the finding that IPs always occur at inter-syllable boundaries, a probability distribution of the prosodic features at the current potential IP is estimated. A Conditional Random Field (CRF) model, which uses the clustered prosodic features of the current potential IP and of its preceding and succeeding inter-syllable boundaries, then outputs the IP likelihood measure. Finally, the confidence of the recognized speech, the probability distribution of the prosodic features, and the CRF-based IP likelihood measure are integrated to determine the optimal IP sequence of the input spontaneous speech. In addition, pitch reset and lengthening are applied to improve IP detection performance. The Mandarin Conversational Dialogue Corpus is adopted for evaluation. Experimental results show that the proposed IP detection approach outperforms the hidden Markov model and the Maximum Entropy model by 10.56% and 6.5%, respectively, under the same experimental conditions. Moreover, the IP detection error rate can be further reduced by 9.15% using pitch reset and lengthening information. The experimental results confirm that the proposed model based on inter-syllable boundary-based prosodic features can effectively detect the interruption point in spontaneous Mandarin speech.
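The integration step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the combination formula, so the weighted log-linear form, the weights, the threshold, and all names below (`BoundaryScores`, `combined_log_score`, `detect_ips`) are assumptions made for illustration only. The sketch also decides each boundary independently, which is a simplification of the sequence-level search the abstract mentions.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class BoundaryScores:
    """Hypothetical scores for one inter-syllable boundary, each in (0, 1]."""
    recognition_confidence: float  # confidence of the recognized speech
    prosody_probability: float     # probability of the prosodic features at a potential IP
    crf_likelihood: float          # CRF-based IP likelihood measure


def combined_log_score(b: BoundaryScores,
                       weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of log scores (an assumed log-linear combination)."""
    w_conf, w_pros, w_crf = weights
    return (w_conf * math.log(b.recognition_confidence)
            + w_pros * math.log(b.prosody_probability)
            + w_crf * math.log(b.crf_likelihood))


def detect_ips(boundaries: List[BoundaryScores],
               weights: Sequence[float] = (1.0, 1.0, 1.0),
               threshold: float = math.log(0.5)) -> List[int]:
    """Return indices of boundaries whose combined score reaches the threshold.

    The threshold is an arbitrary illustrative value, not one from the article.
    """
    return [i for i, b in enumerate(boundaries)
            if combined_log_score(b, weights) >= threshold]


boundaries = [
    BoundaryScores(0.90, 0.90, 0.90),  # all three sources agree: likely IP
    BoundaryScores(0.90, 0.20, 0.30),  # weak prosodic and CRF evidence
    BoundaryScores(0.95, 0.80, 0.85),  # likely IP
]
detected = detect_ips(boundaries)
print(detected)
```

Working in the log domain keeps the product of three small probabilities numerically stable, and the weights give a simple way to trust one evidence source more than another.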

Cited By

Wu T, Zhou Y, Ling W, Yang H, Veloso J, Sun L, Huang R, Guimaraes N, Sanner S. (2023). Towards Dialogue Modeling Beyond Text. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. Online publication date: 4-Jun-2023. https://doi.org/10.1109/ICASSP49357.2023.10095598

Wu C, Su M, Liang W. (2017). Miscommunication handling in spoken dialog systems based on error-aware dialog state detection. EURASIP Journal on Audio, Speech, and Music Processing 2017:1, 1-17. Online publication date: 1-Dec-2017. https://dl.acm.org/doi/10.1186/s13636-017-0107-3

Yeh J. (2016). Speech Act Identification Using Semantic Dependency Graphs with Probabilistic Context-Free Grammars. ACM Transactions on Asian and Low-Resource Language Information Processing 15:1, 1-28. Online publication date: 7-Jan-2016. https://dl.acm.org/doi/10.1145/2786978

Index Terms

Interruption Point Detection of Spontaneous Speech Using Inter-Syllable Boundary-Based Prosodic Features
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Speech recognition
2. Hardware
  1. Communication hardware, interfaces and storage
    1. Signal processing systems

Information & Contributors

Information

Published In

ACM Transactions on Asian Language Information Processing Volume 10, Issue 1

March 2011

88 pages

ISSN:1530-0226

EISSN:1558-3430

DOI:10.1145/1929908

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 March 2011

Accepted: 01 August 2010

Revised: 01 July 2010

Received: 01 May 2010

Published in TALIP Volume 10, Issue 1

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Council Taiwan


Bibliometrics

Total Citations: 3
Total Downloads: 253
Downloads (Last 12 months): 4
Downloads (Last 6 weeks): 0

Reflects downloads up to 14 Feb 2025
