Audiovisual diarization of people in video content

Multimedia Tools and Applications

Abstract

Audio-Visual People Diarization (AVPD) is an original framework that simultaneously improves audio, video, and audiovisual diarization results. We first review the literature on people diarization in audio and in video content, including our own contributions, and discuss the limitations of existing approaches. We then describe a method that associates audio and video information using co-occurrence matrices, and we present experiments conducted on a corpus of TV news, TV debates, and movies. The results show the effectiveness of the overall diarization system and confirm the gains that audio information can bring to video indexing, and vice versa.
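As an illustration only, the idea of associating audio and video information through a co-occurrence matrix can be sketched as follows. This is a minimal sketch under stated assumptions, not the method evaluated in the article: the segment format `(label, start, end)`, the function names `overlap`, `cooccurrence`, and `associate`, the use of overlap duration as the co-occurrence score, and the greedy per-speaker choice are all hypothetical, and the segments are made-up toy data.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def cooccurrence(speaker_segments, face_segments):
    """Accumulate shared time for every (speaker, face) label pair.

    Each segment is a (label, start, end) tuple, times in seconds.
    Assumption: co-occurrence is measured as temporal overlap.
    """
    matrix = defaultdict(float)
    for spk, s_start, s_end in speaker_segments:
        for face, f_start, f_end in face_segments:
            shared = overlap(s_start, s_end, f_start, f_end)
            if shared > 0.0:
                matrix[(spk, face)] += shared
    return dict(matrix)


def associate(matrix):
    """Map each speaker to the face it overlaps with longest (greedy choice)."""
    best = {}
    for (spk, face), duration in matrix.items():
        if spk not in best or duration > best[spk][1]:
            best[spk] = (face, duration)
    return {spk: face for spk, (face, _) in best.items()}


# Toy, made-up outputs of an audio diarizer and a face clusterer.
speakers = [("S1", 0.0, 10.0), ("S2", 10.0, 18.0), ("S1", 18.0, 25.0)]
faces = [("F1", 0.0, 9.0), ("F2", 9.0, 19.0), ("F1", 19.0, 25.0)]

matrix = cooccurrence(speakers, faces)
mapping = associate(matrix)
print(matrix)
print(mapping)
```

In this toy case the speaker label `S1` shares most of its time with the face label `F1`, and `S2` with `F2`. A greedy choice of this kind ignores cases where the person speaking is off screen, which is one reason a real system would need a more careful association rule.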

Acknowledgements

This work was supported by a 3-year individual fellowship from the French Ministry of Higher Education and Research, and by the SODA project funded by the French National Research Agency (ANR).

Author information

Corresponding author

Correspondence to Elie El Khoury.

About this article

Cite this article

El Khoury, E., Sénac, C. & Joly, P. Audiovisual diarization of people in video content. Multimed Tools Appl 68, 747–775 (2014). https://doi.org/10.1007/s11042-012-1080-6

  • DOI: https://doi.org/10.1007/s11042-012-1080-6
