On the Applicability of Speaker Diarization to Audio Indexing of Non-Speech and Mixed Non-Speech/Speech Video Soundtracks

Published: 01 July 2012

Abstract

A video's soundtrack is usually highly correlated with its content. Hence, audio-based techniques have recently emerged as a means for video concept detection complementary to visual analysis. Most state-of-the-art approaches rely on manual definition of predefined sound concepts such as "engine sounds" or "outdoor/indoor sounds." These approaches come with three major drawbacks: manual definitions do not scale because they are highly domain-dependent, manual definitions are highly subjective with respect to annotators, and a large part of the audio content is omitted since the predefined concepts are usually found only in a fraction of the soundtrack. This paper explores how unsupervised audio segmentation systems like speaker diarization can be adapted to automatically identify low-level sound concepts similar to annotator-defined concepts and how these concepts can be used for audio indexing. Speaker diarization systems are designed to answer the question "who spoke when?" by finding segments in an audio stream that exhibit similar properties in feature space, i.e., that sound similar. Using a diarization system, all the content of an audio file is analyzed and similar sounds are clustered. This article provides an in-depth analysis of the statistical properties of similar acoustic segments identified by the diarization system in a predefined document set and the theoretical fitness of this approach to discern one document class from another. It also discusses how diarization can be tuned to better reflect the acoustic properties of general sounds as opposed to speech and introduces a proof-of-concept system for multimedia event classification working with diarization-based indexing.
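The idea of diarization-based indexing can be illustrated with a minimal sketch. It is not the authors' system: it assumes a diarization step has already produced segments labeled with cluster IDs, that those IDs are comparable across documents, and that a duration-weighted histogram with cosine similarity is an acceptable stand-in for the indexing and comparison steps. The function names `index_document` and `cosine_similarity` and the cluster labels are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def index_document(segments):
    """Build a duration-weighted histogram of sound clusters.

    segments: list of (start_sec, end_sec, cluster_id) tuples, assumed to be
    the output of a diarization system (hypothetical format).
    Returns a dict mapping cluster_id to its share of the labeled duration.
    """
    totals = defaultdict(float)
    for start, end, cluster_id in segments:
        totals[cluster_id] += end - start
    duration = sum(totals.values())
    if duration == 0:
        return {}
    return {cluster: t / duration for cluster, t in totals.items()}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse cluster histograms."""
    dot = sum(weight * b.get(cluster, 0.0) for cluster, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative data only: cluster IDs "c1".."c3" are made up.
doc_a = [(0.0, 4.0, "c1"), (4.0, 6.0, "c2"), (6.0, 10.0, "c1")]
doc_b = [(0.0, 5.0, "c3")]

index_a = index_document(doc_a)
index_b = index_document(doc_b)
similarity_ab = cosine_similarity(index_a, index_b)
```

Documents that share no clusters get a similarity of zero, while documents dominated by the same clusters score close to one. A classifier for event detection could take such histograms as input features, though the weighting and classifier actually used in the article are not specified in the abstract.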

Cited By

  • (2019) A new architecture based VAD for speaker diarization/detection systems. International Journal of Speech Technology 22:3 (827-840). DOI: 10.1007/s10772-019-09625-6. Online publication date: 1-Sep-2019
  • (2012) There is no data like less data. Proceedings of the 2012 ACM international workshop on Audio and multimedia methods for large-scale video analysis (27-32). DOI: 10.1145/2390214.2390223. Online publication date: 2-Nov-2012

Published In

International Journal of Multimedia Data Engineering & Management, Volume 3, Issue 3
July 2012
82 pages
ISSN: 1947-8534
EISSN: 1947-8542

Publisher

IGI Global

United States

Author Tags

  1. Audio Indexing
  2. Computer Science
  3. Information Retrieval
  4. Multimedia
  5. Speaker Diarization

Qualifiers

  • Article
