Audiovisual diarization of people in video content

Elie El Khoury^1,2,
Christine Sénac³ &
Philippe Joly³

499 Accesses
3 Altmetric
Explore all metrics

Abstract

Audio-Visual People Diarization (AVPD) is an original framework that simultaneously improves audio, video, and audiovisual diarization results. Following a literature review of people diarization for both audio and video content and their limitations, which includes our own contributions, we describe a proposed method for associating both audio and video information by using co-occurrence matrices and present experiments which were conducted on a corpus containing TV news, TV debates, and movies. Results show the effectiveness of the overall diarization system and confirm the gains audio information can bring to video indexing and vice versa.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Character-Aware Audio-Visual Subtitling in Context

Multimodal speaker clustering in full length movies

Article 11 January 2016

Voice Data-Mining on Audio from Audio and Video Clips

Notes

References

Anguera X, Wooters C, Hernando J (2006) Robust speaker diarization for meetings: ICSI RT06 evaluation system. In: International conference on spoken language processing
Andriluka M, Roth S, Schiele B (2008) People-tracking-by-detection and people-detection-by-tracking. In: IEEE conference on computer vision and pattern recognition
Arandjelovic O, Zisserman A (2005) Automatic face recognition for film character retrieval in feature-length films. In: IEEE conference on computer vision and pattern recognition
Azarbayejani A, Starner T, Horowitz B, Pentland A (1993) Visually controlled graphics. IEEE Trans Pattern Anal Mach Intell 15:602–605
Article Google Scholar
Bicego M, Lagorio A, Grosso E, Tistarelli M (2006) On the use of sift features for face authentication. In: Computer vision and pattern recognition workshop
Bigot B, Ferrané I, Pinquier J (2010) Exploiting speaker segmentations for automatic role detection. An application to broadcast news documents. In: International workshop on content-based multimedia indexing
Bozonnet S, Evans N, Fredouille C (2010) The LIA-EURECOM RT09 Speaker diarization system: anhancements in speaker modelling and cluster purification. In: IEEE international conference on acoustics, speech, and signal processing
Cettolo M, Vescovi M (2003) Efficient audio segmentation algorithms based on the bic. In: IEEE international conference on acoustics, speech, and signal processing
Chang SF, He J, Jiang YG, El Khoury E, Ngo CW, Yanagawa A, Zavesky E (2008) Columbia University/VIREO-CityU/IRIT TRECVID2008 high-level feature extraction and interactive video search. In: TREC video retrieval workshop, NIST
Chaudhari UV, Ramaswamy GN, Potamianos G, Neti C (2003) Audio-visual speaker recognition using time-varying stream. In: IEEE international conference on acoustics, speech and signal processing
Chaudhari UV, Ramaswamy GN, Potamianos G, Neti C (2003) Information fusion and decision cascading for audio-visual speaker recognition based on time-varying stream reliability prediction. In: IEEE international conference on multimedia and expo
Chen SS, Gopalakrishnan PS (1998) Clustering via the bayesian information criterion with applications in speech recognition. In: IEEE international conference on acoustics, speech and signal processing
Chu WT, Lee YL, Yu JY (2009) Visual language model for face clustering in consumer photos. In: ACM international conference on multimedia
Cinbis G, Verbeek J, Schmid C (2011) Unsupervised metric learning for face identification in TV video. In: IEEE international conference on computer vision
Czirjek C, Marlow S, Murphy N (2003) Face detection and clustering for video indexing applications. In: Advanced concepts for intelligent vision systems
Dielmann A (2010) Unsupervised detection of multimodal clusters in edited recordings. In: IEEE international workshop on Multimedia Signal Processing (MMSP)
Doretto G, Sebastian T, Tu P, Rittscher J (2011) Appearance-based person re-identification in camera networks: Problem overview and current approaches. Journal of Ambient Intelligence and Humanized Computing 2(2):127–151
Article Google Scholar
Everingham M, Sivic J, Zisserman A (2006) Hello! my name is... buffy—automatic naming of characters in TV video. In: British Machine Vision Conference, BMVC06
Everingham M, Sivic J, Zisserman A (2009) Taking the bite out of automated naming of characters in TV video. Image Vision Comput 27(5):545–559
Article Google Scholar
Fitzgibbon AW, Zisserman A (2002) On affine invariant clustering and automatic cast listing in movies. In: ECCV ’02: European Conference on Computer Vision
Fredouille C, Bozonnet S, Evans N (2009) The LIA-EURECOM RT09 speaker diarization system. In: NIST Rich transcription workshop
Friedland G, Hung H, Chuohao Yeo (2009) Multi-modal speaker diarization of real-world meetings using compressed-domain video features. In: IEEE international conference on acoustics, speech and signal processing
Friedland G, Yeo C, Hung H (2010) Dialocalisation: acoustic speaker diarization and visual localization as joint optimization problem. ACM Trans Multimedia Comput Commun Appl, TOMCCAP 6(4):27
Google Scholar
Galliano S, Geofrois E, Mosterfa D, Bonastre JF, Gravier G (2005) The ESTER phase II evaluation campaign for the rich transcription of the French broadcast news. In: European conference on speech communication and technology
Galliano S, Gravier G, Chaubard L (2009) The ester 2 evaluation campaign for the rich transcription of French radio broadcasts. INTERSPEECH
Gish H, Siu MH, Rohlicek R (1991) Segregation of speakers for speech recognition and speaker identification. In: International conference on acoustics, speech, and signal processing
Guillaumin M, Verbeek J, Schmid C (2009) Is that you? Metric learning approaches for face identification. ICCV
Hilsmann A, Eisert P (2009) Tracking and retexturing cloth for real-time virtual clothing applications. In: International conference on computer vision/computer graphics collaboration techniques
Hung H, Friedland G (2008) Towards audio-visual on-line diarization of participants In group meetings. In: Workshop on multi-camera and multi-modal sensor fusion
Ioffe S, Forsyth DA (2001) Human tracking with mixtures of trees. ICCV01
Jaffré G, Joly P (2004) Costume: a new feature for automatic video content indexing. RIAO
El Khoury E, Senac C, André-Obrecht R (2007) Speaker Diarization: Towards a more robust and portable system. In: IEEE international conference on acoustics, speech, and signal processing
El-Khoury E, Senac C, Pinquier J (2009) Improved speaker diarization system for meetings. In: IEEE international conference on acoustics, speech, and signal processing
El Khoury E, Senac C, Joly P (2010) Unsupervised segmentation methods of TV contents. Int J Digital Multimedia Broadcast. doi:10.1155/2010/539796
Google Scholar
El Khoury E, Senac C, Joly P (2010) Face-and-clothing based people clustering in video content. In: ACM International conference on multimedia information retrieval
Leeuwen DAV, Konecný M (2008) Progress in the AMIDA speaker diarization system for meeting data. In: Multimodal technologies for perception of humans: international evaluation workshops CLEAR 2007 and RT 2007
Lerdsudwichai C, Abdel-MottalebM, Ansari AN (2005) Tracking multiple people with recovery from partial and total occlusion. Pattern Recogn 38(7):1059–1070
Article Google Scholar
Liu Z, Gibbon D, Zavesky E, Shahraray B, Haffner P (2007) A fast, comprehensive shot boundary determination system. In: IEEE international conference on multimedia and expo
Liu Z, Wang Y (2001) Major cast detection in video using both audio and visual information. In: IEEE international conference on acoustics, speech, and signal processing
Liu Z, Wang Y (2007) Major cast detection in video using both speaker and face information. IEEE Transactions on Multimedia 9(1):89–101
Article Google Scholar
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vision 60(2):91–110
Article Google Scholar
Manjunath BS, Ma WY (1996) Texture features for browsing and retrieval of image data. IEEE Trans Pattern Anal Mach Intell 18(8):837–842
Article Google Scholar
Nguyen TH, Sun H, Zhao S, Khine SZ, Tran HD, Ma TL, Ma B, Chng ES, Li H (2009) The IIR-NTU speaker diarization systems for RT 2009. In: NIST rich transcription workshop
Nockc HJ, Iyengar G, Neti C (2003) Speaker localisation using audio-visual synchrony: an ampirical study. In: CIVR: ACM international conference on image and video retrieval
Peng J, Lin QX (2008) Automatic classification video for person indexing. In: Proceedings of the 2008 congress on image and signal processing, CISP ’08, vol 2. IEEE Computer Society, Washington, DC, USA, pp 475–479. ISBN 978-0-7695-3119-9
Chapter Google Scholar
Philippeau J, Pinquier J, Joly P (2006) Intervenant classification in an audiovisual document. In: International conference on signal processing and multimedia applications
Pinquier J, Rouas JL, André-Obrecht R (2003) A fusion study in speech/music classification. In: IEEE international conference on acoustics, speech and signal processing
Plackett RL (1983) Karl Pearson and the chi-squared test. Int Stat Rev 51(1):59–72
Article MATH MathSciNet Google Scholar
Ramirez J, Girriz JM, Segura JC (2007) Voice activity detection. In: Grimm M, Kroschel K (eds) Fundamentals and speech recognition system robustness. Robust Speech Recognition and Understanding
Rosenhahn B, Kersting U, Powell K, Brox T, Seidel HP (2007) Tracking clothed people. In: Human motion—understanding, modeling, capture, and animation. Springer
Scheirer E, Slaney M (1997) Construction and evaluation of a robust multifeature speech/music discriminator. In: IEEE international conference on acoustics, speech, and signal processing
Schmalenstroeer J, Haeb-Umbach R (2010) Online Diarization of Streaming Audio-Visual Data for Smart Environments. J Sel Topics Signal Processing 4(5):845–856
Article Google Scholar
Siegler MA, Jain U, Raj B, Stern RM (1997) Automatic segmentation, classification and clustering of broadcast news audio. In: DARPA Speech Recognition Workshop
Sivakumaran P, Fortuna J, Ariyaeeinia AM (2001) On the use of the bayesian information criterion in multiple speaker detection. In: The 7th European conference on speech communication and technology (Eurospeech’01)
Smeaton AF, Over P, Doherty AR (2010) Video shot boundary detection: seven years of trecvid activity. Comput Vis Image Und 114(4):411–418
Article Google Scholar
Stiefelhagen R, Bowers R, Fiscus J (2008) Multimodal technologies for perception of humans: international evaluation workshops CLEAR 2007 and RT 2007. ser. Lecture Notes in Computer Science. Springer
Sung JW, Kanade T, Kim DJ (2008) Pose robust face tracking by combining active appearance models and cylinder head models. Int J Comput Vis 80(2):260–274
Article Google Scholar
Tamura S, Iwano K, Furui S (2004) Multi-modal speech recognition using optical-flow analysis for lip images. J VLSI Signal Process Syst 36(2/3):117–124
Google Scholar
Terzopoulos D, Waters K (1993) Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Trans Pattern Anal Mach Intell 15:569–579
Article Google Scholar
Truong BT, Dorai C, Venkatesh S (2000) New enhancements to cut, fade, and dissolve detection processes in video segmentation. In: ACM international conference on Multimedia
Tsai WH, Cheng SS, Chao YH, Wang HM (2005) Clustering speech utterances by speaker using eigenvoice-motivated vector space model. In: IEEE international conference on acoustics, speech, and signal processing
Vajaria H, Islam T, Sarkar S, Sankar R, Kasturi R (2006) Audio segmentation and speaker localization in meeting videos. In: ICPR’06: international conference on pattern recognition
Vezhnevets V, Sazonov V, Andreeva A (2003) A survey on pixel-based skin color detection techniques. In: Proc. Graphicon
Viola P, Jones MJ, Snow D (2003) Detecting pedestrians using patterns of motion and appearance. In: ICCV ’03: IEEE international conference on computer vision
Viola P, Jones MJ (2004) Robust real-time face detection. Int J Comput Vis 57(2):137–154
Article Google Scholar
Yang MH (2009) Face detection. In: Encyclopedia of biometrics. Springer
Zhou B, Hansen JHL (2005) Efficient audio stream segmentation via the combined T2 statistic and the bayesian information criterion. IEEE Trans Speech Audio Processing 13(4):467–474
Article Google Scholar
Zhu X, Barras C, Lamel L, Gauvain JL (2008) Multi-stage speaker diarization for conference and lecture meetings. In: Multimodal technologies for perception of humans. Springer

Download references

Acknowledgements

This work was supported by a 3-year individual fellowship from the French Ministry of High Education and Research, and by the SODA project funded by the National French Research Agency (ANR).

Author information

Authors and Affiliations

Idiap Research Institute, Martigny, Switzerland
Elie El Khoury
Laboratoire d’Informatique de l’Université du Maine, Le Mans, France
Elie El Khoury
Institut de Recherche en Informatique de Toulouse, Toulouse, France
Christine Sénac & Philippe Joly

Authors

Elie El Khoury
View author publications
You can also search for this author in PubMed Google Scholar
Christine Sénac
View author publications
You can also search for this author in PubMed Google Scholar
Philippe Joly
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Elie El Khoury.

Rights and permissions

Reprints and permissions

About this article

Cite this article

El Khoury, E., Sénac, C. & Joly, P. Audiovisual diarization of people in video content. Multimed Tools Appl 68, 747–775 (2014). https://doi.org/10.1007/s11042-012-1080-6

Download citation

Published: 29 April 2012
Issue Date: February 2014
DOI: https://doi.org/10.1007/s11042-012-1080-6

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Character-Aware Audio-Visual Subtitling in Context

Multimodal speaker clustering in full length movies

Voice Data-Mining on Audio from Audio and Video Clips

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Audiovisual diarization of people in video content

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Character-Aware Audio-Visual Subtitling in Context

Multimodal speaker clustering in full length movies

Voice Data-Mining on Audio from Audio and Video Clips

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now