DOI: 10.1145/1963192.1963364
research-article

Two-stream indexing for spoken web search

Published: 28 March 2011

Abstract

This paper presents a two-stream approach to processing audio in order to index audio content for Spoken Web search. The first stream indexes the meta-data associated with each audio document. This meta-data is usually very sparse but accurate, so it yields a high-precision, low-recall index. The second stream uses a novel language-independent speech recognition technique to generate text to be indexed. Owing to the multiple languages and the noise in user-generated content on the Spoken Web, the recognition accuracy of such systems is not high, so they yield a low-precision, high-recall index. The paper attempts to use these two complementary streams to generate a combined index that improves precision-recall performance in audio content search.
The problem of audio content search is motivated by the real-world implications of the Web in developing regions, where, owing to literacy and affordability issues, people use the Spoken Web, which consists of interconnected VoiceSites whose content is in audio. The experiments are based on more than 20,000 audio documents spanning seven live VoiceSites and four languages. The results suggest a significant improvement over a meta-data-only or a speech-recognition-only system, thus justifying the two-stream processing approach. Audio content search is a growing problem area, and this paper aims to be a first step toward solving it at large scale, across languages, in a Web context.
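
The abstract does not say how the two indexes are merged, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a simple weighted score fusion: a query term found in the meta-data index counts for more than one found in the recognizer output. The function names, the weights, and the toy documents are all hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_combined(query, meta_index, asr_index, meta_weight=0.7, asr_weight=0.3):
    """Rank documents by a weighted sum of hits in the two streams.

    The weights are illustrative assumptions, not values from the paper.
    """
    scores = defaultdict(float)
    for token in query.lower().split():
        for doc_id in meta_index.get(token, ()):
            scores[doc_id] += meta_weight   # sparse but accurate stream
        for doc_id in asr_index.get(token, ()):
            scores[doc_id] += asr_weight    # noisy but broad stream
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def precision_recall(retrieved, relevant):
    """Return (precision, recall) of a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy collection: meta-data exists for only two documents,
# while (error-prone) recognizer output exists for all four.
metadata = {"a1": "crop prices", "a2": "weather report"}
transcripts = {
    "a1": "crop prizes today",
    "a2": "whether report rain",
    "a3": "crop advice rain",
    "a4": "report on crop",
}
relevant = {"a1", "a3"}  # assumed ground truth for the query "crop"

meta_index = build_index(metadata)
asr_index = build_index(transcripts)

meta_only = sorted(meta_index.get("crop", ()))
asr_only = sorted(asr_index.get("crop", ()))
combined = search_combined("crop", meta_index, asr_index)

print("meta-data only:", precision_recall(meta_only, relevant))
print("recognition only:", precision_recall(asr_only, relevant))
print("combined, top 2:", precision_recall(combined[:2], relevant))
```

In this toy setup the meta-data stream alone misses a relevant document, the recognition stream alone returns an irrelevant one, and the fused ranking places the document confirmed by both streams first. Real systems would need a tuned fusion rule and a far larger evaluation set.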

Cited By

  • (2013) Supporting Voice Content Sharing among Underprivileged People in Urban India. Human-Computer Interaction – INTERACT 2013, pp. 489-506. DOI: 10.1007/978-3-642-40498-6_38. Online publication date: 2013
  • (2012) Query by babbling. Proceedings of the first workshop on Information and knowledge management for developing region, pp. 17-22. DOI: 10.1145/2389776.2389781. Online publication date: 2-Nov-2012
  • (2011) Social ranking for spoken web search. Proceedings of the 20th ACM international conference on Information and knowledge management, pp. 1835-1840. DOI: 10.1145/2063576.2063840. Online publication date: 24-Oct-2011

Reviews

Gerald Friedland

The Spoken Web affects many people in rural India and other developing areas that have just recently adopted the widespread use of voice-only cell phones (as opposed to smartphones). The Spoken Web, like the World Wide Web (WWW), enables users to browse and surf for information. The presentation, however, is spoken language instead of text and graphics, so users can access content using a cell phone with no display. Since there is no textual information, though, how does one search this Web? This is exactly the topic of this paper. This paper presents a technique that combines metadata and speech recognition into an indexing approach for Spoken Web search. The paper describes different variants of algorithms, and then presents an evaluation of precision and recall based on more than 20,000 voice documents. It provides evidence that the combination of the metadata and the speech recognition stream performs better than comparable algorithms on a single modality. This paper is for speech and natural language processing researchers. I definitely recommend it to those working in such fields, even if they are not working on the Spoken Web. Though the authors performed the experiments within this domain, as a speech and multimedia researcher, I see no reason why the techniques could not also be applied to "found data" from the WWW, such as consumer-produced videos on social networking sites. The characteristics of the data match reasonably.

Online Computing Reviews Service

Published In

WWW '11: Proceedings of the 20th international conference companion on World wide web
March 2011
552 pages
ISBN:9781450306379
DOI:10.1145/1963192

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. audio search
  2. developing regions
  3. literacy
  4. mobile phone
  5. spoken web
  6. world wide telecom web

Qualifiers

  • Research-article

Conference

WWW '11
WWW '11: 20th International World Wide Web Conference
March 28 - April 1, 2011
Hyderabad, India

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
