DOI: 10.1145/1076034.1076097
Article

Automatic music video summarization based on audio-visual-text analysis and alignment

Published: 15 August 2005

Abstract

In this paper, we propose a novel approach to automatic music video summarization based on audio-visual-text analysis and alignment. The music video is first separated into its music and video tracks. For the music track, the chorus is detected through music structure analysis. For the video track, we segment the shots, classify them into close-up face shots and non-face shots, and then extract the lyrics from the shots and detect the most repeated lyrics. The summary is generated by aligning the boundaries of the detected chorus, the shot classes, and the most repeated lyrics. We describe experiments on chorus detection, shot classification, and lyrics detection using 20 English music videos. Subjective user studies were conducted to evaluate the quality and effectiveness of the summaries. Comparisons with summaries produced by our previous method and by a manual method indicate that the proposed method better meets users' expectations.
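
The abstract does not spell out how the alignment is carried out, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that the chorus is available as a (start, end) pair in seconds, that shot boundaries are a list of timestamps, and that lyrics are plain text lines; the function names (most_repeated_line, snap_to_boundary, align_summary) and the nearest-boundary snapping rule are hypothetical choices made for illustration.

```python
from bisect import bisect_left
from collections import Counter


def most_repeated_line(lyric_lines):
    """Return the most frequent lyric line after simple normalization.

    Normalization (lowercasing, collapsing whitespace) is an assumption;
    lyrics recognized from video frames would likely need fuzzier matching.
    """
    normalized = [" ".join(line.lower().split())
                  for line in lyric_lines if line.strip()]
    if not normalized:
        return None
    return Counter(normalized).most_common(1)[0][0]


def snap_to_boundary(t, boundaries):
    """Return the entry of the sorted, non-empty list closest to time t."""
    i = bisect_left(boundaries, t)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = boundaries[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda b: abs(b - t))


def align_summary(chorus, shot_boundaries):
    """Snap a chorus interval to the nearest shot boundaries.

    Falls back to the raw chorus interval when there are no shot
    boundaries or when snapping would collapse the interval.
    """
    start, end = chorus
    bounds = sorted(shot_boundaries)
    if not bounds:
        return (start, end)
    s = snap_to_boundary(start, bounds)
    e = snap_to_boundary(end, bounds)
    if e <= s:
        return (start, end)
    return (s, e)
```

For example, a chorus detected at 12.0 to 41.0 seconds with shot boundaries every 10 seconds would be snapped to the interval from 10 to 40 seconds, so that the summary starts and ends on a cut rather than in the middle of a shot.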

Published In

SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
August 2005
708 pages
ISBN:1595930345
DOI:10.1145/1076034

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. alignment
  2. chorus
  3. lyrics
  4. music video
  5. shot
  6. summarization

Qualifiers

  • Article

Conference

SIGIR05

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Cited By

  • (2023) A review on video summarization techniques. Engineering Applications of Artificial Intelligence 118:C. DOI: 10.1016/j.engappai.2022.105667. Online publication date: 1-Feb-2023
  • (2020) Multi-Modal Deep Analysis for Multimedia. IEEE Transactions on Circuits and Systems for Video Technology 30:10 (3740-3764). DOI: 10.1109/TCSVT.2019.2940647. Online publication date: Oct-2020
  • (2019) 3D Shape Retrieval through Multilayer RBF Neural Network. 2019 IEEE International Conference on Image Processing (ICIP) (2394-2398). DOI: 10.1109/ICIP.2019.8803384. Online publication date: Sep-2019
  • (2018) Getting Rid of Night: Thermal Image Classification Based on Feature Fusion. 2018 24th International Conference on Pattern Recognition (ICPR) (2827-2832). DOI: 10.1109/ICPR.2018.8545321. Online publication date: Aug-2018
  • (2018) Video Summarization. Encyclopedia of Database Systems (4439-4443). DOI: 10.1007/978-1-4614-8265-9_1026. Online publication date: 7-Dec-2018
  • (2016) Multimedia maximal marginal relevance for multi-video summarization. Multimedia Tools and Applications 75:1 (199-220). DOI: 10.1007/s11042-014-2287-5. Online publication date: 1-Jan-2016
  • (2016) Video Summarization. Encyclopedia of Database Systems (1-6). DOI: 10.1007/978-1-4899-7993-3_1026-2. Online publication date: 8-Dec-2016
  • (2014) Robot Actors and Authoring Tools for Live Performance System. 2014 International Conference on Information Science & Applications (ICISA) (1-3). DOI: 10.1109/ICISA.2014.6847457. Online publication date: May-2014
  • (2014) Music Emotion Classification Based on Music Highlight Detection. 2014 International Conference on Information Science & Applications (ICISA) (1-2). DOI: 10.1109/ICISA.2014.6847435. Online publication date: May-2014
  • (2014) FBDtoVerilog 2.0: An Automatic Translation of FBD into Verilog to Develop FPGA. 2014 International Conference on Information Science & Applications (ICISA) (1-4). DOI: 10.1109/ICISA.2014.6847402. Online publication date: May-2014