poster

Multi-modal Language Models for Lecture Video Retrieval

Authors:

Huizhong Chen,

Matthew Cooper,

Dhiraj Joshi,

Bernd Girod

MM '14: Proceedings of the 22nd ACM international conference on Multimedia

Pages 1081 - 1084

https://doi.org/10.1145/2647868.2654964

Published: 03 November 2014

Abstract

We propose Multi-modal Language Models (MLMs), which adapt latent variable techniques for document analysis to exploring co-occurrence relationships in multi-modal data. In this paper, we focus on the application of MLMs to indexing text from slides and speech in lecture videos, and subsequently employ a multi-modal probabilistic ranking function for lecture video retrieval. The MLM achieves highly competitive results against well-established retrieval methods such as the Vector Space Model and Probabilistic Latent Semantic Analysis. When noise is present in the data, retrieval performance with MLMs is shown to improve with the quality of the spoken text extracted from the video.
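
To make the idea of a multi-modal probabilistic ranking function concrete, here is a minimal sketch. It is not the MLM described in the paper, whose details do not appear in this abstract. It assumes a simple stand-in: each video gets one smoothed unigram model for slide text and one for speech text, and a query is scored by its log-likelihood under a linear mixture of the two. The names `unigram_lm`, `score`, `rank`, the `weight` and `alpha` parameters, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def unigram_lm(tokens, vocab, alpha=0.1):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def score(query, slide_lm, speech_lm, weight=0.5):
    """Query log-likelihood under a mixture of the two modality models.

    `weight` is an assumed interpolation parameter, not taken from the paper.
    """
    return sum(
        math.log(weight * slide_lm[w] + (1.0 - weight) * speech_lm[w])
        for w in query
        if w in slide_lm  # skip out-of-vocabulary query terms
    )


def rank(query, videos, weight=0.5):
    """Return video ids sorted from best to worst match for the query."""
    vocab = {w for v in videos.values() for m in v.values() for w in m}
    scores = {}
    for vid, v in videos.items():
        slide_lm = unigram_lm(v["slides"], vocab)
        speech_lm = unigram_lm(v["speech"], vocab)
        scores[vid] = score(query, slide_lm, speech_lm, weight)
    return sorted(scores, key=scores.get, reverse=True)


# Toy collection: slide text and (possibly noisy) speech transcript per video.
videos = {
    "lec1": {"slides": "latent topic model".split(),
             "speech": "we fit a topic model".split()},
    "lec2": {"slides": "image compression".split(),
             "speech": "we compress an image".split()},
}
ranking = rank(["topic", "model"], videos)
```

Lowering `weight` shifts trust from slide text toward the speech transcript, which is one simple way to explore how transcript quality affects a combined ranking.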

Index Terms

Multi-modal Language Models for Lecture Video Retrieval
1. Information systems
  1. Information retrieval
    1. Document representation

Information & Contributors

Information

Published In

MM '14: Proceedings of the 22nd ACM international conference on Multimedia

November 2014

1310 pages

ISBN:9781450330633

DOI:10.1145/2647868

General Chairs:
Kien A. Hua, University of Central Florida, USA
Yong Rui, Microsoft Research, China
Ralf Steinmetz, Technische Universität Darmstadt, Germany

Program Chairs:
Alan Hanjalic, Delft University of Technology, Netherlands
Apostol (Paul) Natsev, Google, USA
Wenwu Zhu, Tsinghua University, China

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Poster

Conference

MM '14

Sponsor:

SIGMM

MM '14: 2014 ACM Multimedia Conference

November 3 - 7, 2014

Orlando, Florida, USA

Acceptance Rates

MM '14 Paper Acceptance Rate 55 of 286 submissions, 19%;

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 11
Total Downloads: 226
Downloads (Last 12 months): 22
Downloads (Last 6 weeks): 3

Reflects downloads up to 18 Nov 2024

Citations

Cited By

Jobin K, Mishra A, Jawahar C (2024). Semantic Labels-Aware Transformer Model for Searching over a Large Collection of Lecture-Slides. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6004-6013. Online publication date: 3-Jan-2024.
https://doi.org/10.1109/WACV57701.2024.00591

Li M, Zhou S, Chen Y, Huang C, Jiang Y (2024). EduCross: Dual adversarial bipartite hypergraph learning for cross-modal retrieval in multimodal educational slides. Information Fusion, 109 (102428). Online publication date: Sep-2024.
https://doi.org/10.1016/j.inffus.2024.102428

Lee D, Ahuja C, Liang P, Natu S, Morency L (2023). Lecture Presentations Multimodal Dataset: Towards Understanding Multimodality in Educational Videos. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 20030-20041. Online publication date: 1-Oct-2023.
https://doi.org/10.1109/ICCV51070.2023.01838

R. P, Siri Agarwal A, Singh U (2022). Online Educational Video Recommendation System Analysis. Encyclopedia of Data Science and Machine Learning, pages 1559-1577. Online publication date: 14-Oct-2022.
https://doi.org/10.4018/978-1-7998-9220-5.ch093

Madasu A, Oliva J, Bertasius G, Magalhães J, del Bimbo A, Satoh S, Sebe N, Alameda-Pineda X, Jin Q, Oria V, Toni L (2022). Learning to Retrieve Videos by Asking Questions. Proceedings of the 30th ACM International Conference on Multimedia, pages 356-365. Online publication date: 10-Oct-2022.
https://dl.acm.org/doi/10.1145/3503161.3548361

Turcu G, Mihaescu M, Heras S, Palanca J, Julian V (2019). Video Transcript Indexing and Retrieval Procedure. 2019 International Conference on Software, Telecommunications and Computer Networks (SoftCOM), pages 1-6. Online publication date: Sep-2019.
https://doi.org/10.23919/SOFTCOM.2019.8903790

Turcu G, Heras S, Palanca J, Julian V, Mihaescu M (2019). Towards a Custom Designed Mechanism for Indexing and Retrieving Video Transcripts. Hybrid Artificial Intelligent Systems, pages 299-309. Online publication date: 26-Aug-2019.
https://doi.org/10.1007/978-3-030-29859-3_26

Galanopoulos D, Mezaris V (2018). Temporal Lecture Video Fragmentation Using Word Embeddings. MultiMedia Modeling, pages 254-265. Online publication date: 11-Dec-2018.
https://doi.org/10.1007/978-3-030-05716-9_21

Wang X, Yang Y, Liu H, Qian Y (2017). Improving speech transcription by exploiting user feedback and word repetition. Multimedia Tools and Applications, 76:19, pages 20359-20376. Online publication date: 1-Oct-2017.
https://dl.acm.org/doi/10.1007/s11042-017-4714-x

Basu S, Yu Y, Singh V, Zimmermann R (2016). Videopedia. Proceedings, Part I, of the 22nd International Conference on MultiMedia Modeling - Volume 9516, pages 238-250. Online publication date: 4-Jan-2016.
https://dl.acm.org/doi/10.1007/978-3-319-27671-7_20
