research-article

A prosody-based vector-space model of dialog activity for information retrieval

Published: 01 April 2015

Abstract

  • Prosodic information can support search in dialog archives.
  • We represent prosodic-context information with a vector space model.
  • Proximity in this space reflects dialog-activity similarity and topic similarity.
  • Weighted-distance measures outperform city-block distance and Euclidean distance.
  • Prosodic information provides less value for search than lexical information, but can usefully complement it.

Search in audio archives is a challenging problem. Using prosodic information to help find relevant content has been proposed as a complement to word-based retrieval, but its utility has been an open question. We propose a new way to use prosodic information in search, based on a vector-space model, where each point in time maps to a point in a vector space whose dimensions are derived from numerous prosodic features of the local context. Point pairs that are close in this vector space are frequently similar, not only in terms of the dialog activities, but also in topic. Using proximity in this space as an indicator of similarity, we built support for a query-by-example function. Searchers were happy to use this function, and it provided value on a large test set. Prosody-based retrieval did not perform as well as word-based retrieval, but the two sources of information were often non-redundant, and in combination they sometimes performed better than either separately.
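The query-by-example idea in the abstract can be illustrated with a minimal sketch. It assumes each timepoint in the archive has already been mapped to a vector, and it ranks timepoints by a weighted Euclidean distance to the query vector. The function names, the three-dimensional toy vectors, and the weight values are all hypothetical and chosen only for illustration. The abstract does not specify the features, the dimensionality, or the form of the weighting, so this is not the authors' implementation.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def query_by_example(query, archive, weights, k=3):
    """Return the k (timepoint, distance) pairs closest to the query vector.

    archive is a list of (timepoint_in_seconds, vector) pairs.
    """
    scored = [(t, weighted_distance(query, v, weights)) for t, v in archive]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Hypothetical toy archive: timepoints (seconds) with made-up 3-d vectors.
archive = [
    (10.0, [0.9, -0.2, 0.1]),
    (25.5, [-1.1, 0.4, 0.0]),
    (40.0, [0.8, -0.1, 0.5]),
    (62.3, [0.0, 1.5, -0.7]),
]
weights = [1.0, 0.5, 0.25]  # assumed per-dimension weights, not from the paper
query = [1.0, 0.0, 0.0]     # vector at the timepoint the searcher selected

results = query_by_example(query, archive, weights, k=2)
for timepoint, distance in results:
    print(f"{timepoint:6.1f} s  distance={distance:.3f}")
```

Setting every weight to 1.0 reduces this to plain Euclidean distance, which makes it easy to compare the two rankings on the same data.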

Cited By

  • (2018) Inferring stance in news broadcasts from prosodic-feature configurations. Computer Speech and Language 50:C (85-104). doi:10.1016/j.csl.2017.12.007. Online publication date: 1-Jul-2018.
  • (2018) Prediction of a hotspot pattern in keyword search results. Computer Speech and Language 48:C (80-102). doi:10.1016/j.csl.2017.10.005. Online publication date: 1-Mar-2018.
Information

Published In

Speech Communication, Volume 68, Issue C
April 2015
106 pages

Publisher

Elsevier Science Publishers B. V.
Netherlands

Author Tags

1. Audio
2. Principal components analysis
3. Search
4. Similarity judgments
5. Similarity metrics
6. Speech

Qualifiers

• Research-article
