Abstract
Audio-Visual People Diarization (AVPD) is an original framework that simultaneously improves audio, video, and audiovisual diarization results. Following a literature review of people diarization for both audio and video content and their limitations, which includes our own contributions, we describe a proposed method for associating both audio and video information by using co-occurrence matrices and present experiments which were conducted on a corpus containing TV news, TV debates, and movies. Results show the effectiveness of the overall diarization system and confirm the gains audio information can bring to video indexing and vice versa.
Similar content being viewed by others
References
Anguera X, Wooters C, Hernando J (2006) Robust speaker diarization for meetings: ICSI RT06 evaluation system. In: International conference on spoken language processing
Andriluka M, Roth S, Schiele B (2008) People-tracking-by-detection and people-detection-by-tracking. In: IEEE conference on computer vision and pattern recognition
Arandjelovic O, Zisserman A (2005) Automatic face recognition for film character retrieval in feature-length films. In: IEEE conference on computer vision and pattern recognition
Azarbayejani A, Starner T, Horowitz B, Pentland A (1993) Visually controlled graphics. IEEE Trans Pattern Anal Mach Intell 15:602–605
Bicego M, Lagorio A, Grosso E, Tistarelli M (2006) On the use of sift features for face authentication. In: Computer vision and pattern recognition workshop
Bigot B, Ferrané I, Pinquier J (2010) Exploiting speaker segmentations for automatic role detection. An application to broadcast news documents. In: International workshop on content-based multimedia indexing
Bozonnet S, Evans N, Fredouille C (2010) The LIA-EURECOM RT09 Speaker diarization system: anhancements in speaker modelling and cluster purification. In: IEEE international conference on acoustics, speech, and signal processing
Cettolo M, Vescovi M (2003) Efficient audio segmentation algorithms based on the bic. In: IEEE international conference on acoustics, speech, and signal processing
Chang SF, He J, Jiang YG, El Khoury E, Ngo CW, Yanagawa A, Zavesky E (2008) Columbia University/VIREO-CityU/IRIT TRECVID2008 high-level feature extraction and interactive video search. In: TREC video retrieval workshop, NIST
Chaudhari UV, Ramaswamy GN, Potamianos G, Neti C (2003) Audio-visual speaker recognition using time-varying stream. In: IEEE international conference on acoustics, speech and signal processing
Chaudhari UV, Ramaswamy GN, Potamianos G, Neti C (2003) Information fusion and decision cascading for audio-visual speaker recognition based on time-varying stream reliability prediction. In: IEEE international conference on multimedia and expo
Chen SS, Gopalakrishnan PS (1998) Clustering via the bayesian information criterion with applications in speech recognition. In: IEEE international conference on acoustics, speech and signal processing
Chu WT, Lee YL, Yu JY (2009) Visual language model for face clustering in consumer photos. In: ACM international conference on multimedia
Cinbis G, Verbeek J, Schmid C (2011) Unsupervised metric learning for face identification in TV video. In: IEEE international conference on computer vision
Czirjek C, Marlow S, Murphy N (2003) Face detection and clustering for video indexing applications. In: Advanced concepts for intelligent vision systems
Dielmann A (2010) Unsupervised detection of multimodal clusters in edited recordings. In: IEEE international workshop on Multimedia Signal Processing (MMSP)
Doretto G, Sebastian T, Tu P, Rittscher J (2011) Appearance-based person re-identification in camera networks: Problem overview and current approaches. Journal of Ambient Intelligence and Humanized Computing 2(2):127–151
Everingham M, Sivic J, Zisserman A (2006) Hello! my name is... buffy—automatic naming of characters in TV video. In: British Machine Vision Conference, BMVC06
Everingham M, Sivic J, Zisserman A (2009) Taking the bite out of automated naming of characters in TV video. Image Vision Comput 27(5):545–559
Fitzgibbon AW, Zisserman A (2002) On affine invariant clustering and automatic cast listing in movies. In: ECCV ’02: European Conference on Computer Vision
Fredouille C, Bozonnet S, Evans N (2009) The LIA-EURECOM RT09 speaker diarization system. In: NIST Rich transcription workshop
Friedland G, Hung H, Chuohao Yeo (2009) Multi-modal speaker diarization of real-world meetings using compressed-domain video features. In: IEEE international conference on acoustics, speech and signal processing
Friedland G, Yeo C, Hung H (2010) Dialocalisation: acoustic speaker diarization and visual localization as joint optimization problem. ACM Trans Multimedia Comput Commun Appl, TOMCCAP 6(4):27
Galliano S, Geofrois E, Mosterfa D, Bonastre JF, Gravier G (2005) The ESTER phase II evaluation campaign for the rich transcription of the French broadcast news. In: European conference on speech communication and technology
Galliano S, Gravier G, Chaubard L (2009) The ester 2 evaluation campaign for the rich transcription of French radio broadcasts. INTERSPEECH
Gish H, Siu MH, Rohlicek R (1991) Segregation of speakers for speech recognition and speaker identification. In: International conference on acoustics, speech, and signal processing
Guillaumin M, Verbeek J, Schmid C (2009) Is that you? Metric learning approaches for face identification. ICCV
Hilsmann A, Eisert P (2009) Tracking and retexturing cloth for real-time virtual clothing applications. In: International conference on computer vision/computer graphics collaboration techniques
Hung H, Friedland G (2008) Towards audio-visual on-line diarization of participants In group meetings. In: Workshop on multi-camera and multi-modal sensor fusion
Ioffe S, Forsyth DA (2001) Human tracking with mixtures of trees. ICCV01
Jaffré G, Joly P (2004) Costume: a new feature for automatic video content indexing. RIAO
El Khoury E, Senac C, André-Obrecht R (2007) Speaker Diarization: Towards a more robust and portable system. In: IEEE international conference on acoustics, speech, and signal processing
El-Khoury E, Senac C, Pinquier J (2009) Improved speaker diarization system for meetings. In: IEEE international conference on acoustics, speech, and signal processing
El Khoury E, Senac C, Joly P (2010) Unsupervised segmentation methods of TV contents. Int J Digital Multimedia Broadcast. doi:10.1155/2010/539796
El Khoury E, Senac C, Joly P (2010) Face-and-clothing based people clustering in video content. In: ACM International conference on multimedia information retrieval
Leeuwen DAV, Konecný M (2008) Progress in the AMIDA speaker diarization system for meeting data. In: Multimodal technologies for perception of humans: international evaluation workshops CLEAR 2007 and RT 2007
Lerdsudwichai C, Abdel-MottalebM, Ansari AN (2005) Tracking multiple people with recovery from partial and total occlusion. Pattern Recogn 38(7):1059–1070
Liu Z, Gibbon D, Zavesky E, Shahraray B, Haffner P (2007) A fast, comprehensive shot boundary determination system. In: IEEE international conference on multimedia and expo
Liu Z, Wang Y (2001) Major cast detection in video using both audio and visual information. In: IEEE international conference on acoustics, speech, and signal processing
Liu Z, Wang Y (2007) Major cast detection in video using both speaker and face information. IEEE Transactions on Multimedia 9(1):89–101
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vision 60(2):91–110
Manjunath BS, Ma WY (1996) Texture features for browsing and retrieval of image data. IEEE Trans Pattern Anal Mach Intell 18(8):837–842
Nguyen TH, Sun H, Zhao S, Khine SZ, Tran HD, Ma TL, Ma B, Chng ES, Li H (2009) The IIR-NTU speaker diarization systems for RT 2009. In: NIST rich transcription workshop
Nockc HJ, Iyengar G, Neti C (2003) Speaker localisation using audio-visual synchrony: an ampirical study. In: CIVR: ACM international conference on image and video retrieval
Peng J, Lin QX (2008) Automatic classification video for person indexing. In: Proceedings of the 2008 congress on image and signal processing, CISP ’08, vol 2. IEEE Computer Society, Washington, DC, USA, pp 475–479. ISBN 978-0-7695-3119-9
Philippeau J, Pinquier J, Joly P (2006) Intervenant classification in an audiovisual document. In: International conference on signal processing and multimedia applications
Pinquier J, Rouas JL, André-Obrecht R (2003) A fusion study in speech/music classification. In: IEEE international conference on acoustics, speech and signal processing
Plackett RL (1983) Karl Pearson and the chi-squared test. Int Stat Rev 51(1):59–72
Ramirez J, Girriz JM, Segura JC (2007) Voice activity detection. In: Grimm M, Kroschel K (eds) Fundamentals and speech recognition system robustness. Robust Speech Recognition and Understanding
Rosenhahn B, Kersting U, Powell K, Brox T, Seidel HP (2007) Tracking clothed people. In: Human motion—understanding, modeling, capture, and animation. Springer
Scheirer E, Slaney M (1997) Construction and evaluation of a robust multifeature speech/music discriminator. In: IEEE international conference on acoustics, speech, and signal processing
Schmalenstroeer J, Haeb-Umbach R (2010) Online Diarization of Streaming Audio-Visual Data for Smart Environments. J Sel Topics Signal Processing 4(5):845–856
Siegler MA, Jain U, Raj B, Stern RM (1997) Automatic segmentation, classification and clustering of broadcast news audio. In: DARPA Speech Recognition Workshop
Sivakumaran P, Fortuna J, Ariyaeeinia AM (2001) On the use of the bayesian information criterion in multiple speaker detection. In: The 7th European conference on speech communication and technology (Eurospeech’01)
Smeaton AF, Over P, Doherty AR (2010) Video shot boundary detection: seven years of trecvid activity. Comput Vis Image Und 114(4):411–418
Stiefelhagen R, Bowers R, Fiscus J (2008) Multimodal technologies for perception of humans: international evaluation workshops CLEAR 2007 and RT 2007. ser. Lecture Notes in Computer Science. Springer
Sung JW, Kanade T, Kim DJ (2008) Pose robust face tracking by combining active appearance models and cylinder head models. Int J Comput Vis 80(2):260–274
Tamura S, Iwano K, Furui S (2004) Multi-modal speech recognition using optical-flow analysis for lip images. J VLSI Signal Process Syst 36(2/3):117–124
Terzopoulos D, Waters K (1993) Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Trans Pattern Anal Mach Intell 15:569–579
Truong BT, Dorai C, Venkatesh S (2000) New enhancements to cut, fade, and dissolve detection processes in video segmentation. In: ACM international conference on Multimedia
Tsai WH, Cheng SS, Chao YH, Wang HM (2005) Clustering speech utterances by speaker using eigenvoice-motivated vector space model. In: IEEE international conference on acoustics, speech, and signal processing
Vajaria H, Islam T, Sarkar S, Sankar R, Kasturi R (2006) Audio segmentation and speaker localization in meeting videos. In: ICPR’06: international conference on pattern recognition
Vezhnevets V, Sazonov V, Andreeva A (2003) A survey on pixel-based skin color detection techniques. In: Proc. Graphicon
Viola P, Jones MJ, Snow D (2003) Detecting pedestrians using patterns of motion and appearance. In: ICCV ’03: IEEE international conference on computer vision
Viola P, Jones MJ (2004) Robust real-time face detection. Int J Comput Vis 57(2):137–154
Yang MH (2009) Face detection. In: Encyclopedia of biometrics. Springer
Zhou B, Hansen JHL (2005) Efficient audio stream segmentation via the combined T2 statistic and the bayesian information criterion. IEEE Trans Speech Audio Processing 13(4):467–474
Zhu X, Barras C, Lamel L, Gauvain JL (2008) Multi-stage speaker diarization for conference and lecture meetings. In: Multimodal technologies for perception of humans. Springer
Acknowledgements
This work was supported by a 3-year individual fellowship from the French Ministry of High Education and Research, and by the SODA project funded by the National French Research Agency (ANR).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
El Khoury, E., Sénac, C. & Joly, P. Audiovisual diarization of people in video content. Multimed Tools Appl 68, 747–775 (2014). https://doi.org/10.1007/s11042-012-1080-6
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-012-1080-6