Article

Multimodal multispeaker probabilistic tracking in meetings

Authors:

Daniel Gatica-Perez,

Guillaume Lathoud,

Jean-Marc Odobez,

Iain McCowan

ICMI '05: Proceedings of the 7th international conference on Multimodal interfaces

Pages 183 - 190

https://doi.org/10.1145/1088463.1088496

Published: 04 October 2005

Abstract

Tracking speakers in multiparty conversations constitutes a fundamental task for automatic meeting analysis. In this paper, we present a probabilistic approach to jointly track the location and speaking activity of multiple speakers in a multisensor meeting room equipped with a small microphone array and multiple uncalibrated cameras. Our framework is based on a mixed-state dynamic graphical model defined on a multiperson state-space, which includes the explicit definition of a proximity-based interaction model. The model integrates audio-visual (AV) data through a novel observation model. Audio observations are derived from a source localization algorithm. Visual observations are based on models of the shape and spatial structure of human heads. Approximate inference in our model, needed given its complexity, is performed with a Markov Chain Monte Carlo particle filter (MCMC-PF), which results in high sampling efficiency. We present results, based on an objective evaluation procedure, that show that our framework (1) is capable of locating and tracking the position and speaking activity of multiple meeting participants engaged in real conversations with good accuracy; (2) can deal with cases of visual clutter and partial occlusion; and (3) significantly outperforms a traditional sampling-based approach.
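The abstract names the inference technique without detailing it. As a rough illustration only, and not the authors' model, the following sketch shows one time step of Metropolis-Hastings sampling over a joint multi-target state with a pairwise proximity penalty. Everything here is an assumption made for the example: the 1-D positions, the Gaussian observation and dynamics terms, the penalty form, and all function names and parameter values. A full MCMC particle filter would also build the prediction from the previous step's samples, whereas this sketch uses a single predicted point.

```python
import math
import random


def log_gauss(x, mean, sigma):
    """Log of an unnormalised Gaussian density."""
    return -0.5 * ((x - mean) / sigma) ** 2


def log_interaction(state, min_dist=1.0, strength=5.0):
    """Hypothetical proximity penalty: discourages targets closer than min_dist."""
    total = 0.0
    for i in range(len(state)):
        for j in range(i + 1, len(state)):
            gap = abs(state[i] - state[j])
            if gap < min_dist:
                total -= strength * (min_dist - gap)
    return total


def log_target(state, observations, predicted, obs_sigma=0.5, dyn_sigma=1.0):
    """Unnormalised log posterior over the joint multi-target state."""
    score = log_interaction(state)
    for x, z, p in zip(state, observations, predicted):
        score += log_gauss(z, x, obs_sigma)  # observation term (assumed Gaussian)
        score += log_gauss(x, p, dyn_sigma)  # dynamics term (assumed Gaussian)
    return score


def mcmc_step_samples(predicted, observations, n_samples=500,
                      burn_in=200, step=0.3, seed=0):
    """Draw joint-state samples for one time step with Metropolis-Hastings."""
    rng = random.Random(seed)
    state = list(predicted)
    current = log_target(state, observations, predicted)
    samples = []
    for it in range(burn_in + n_samples):
        # Perturb one randomly chosen target at a time.
        i = rng.randrange(len(state))
        proposal = list(state)
        proposal[i] += rng.gauss(0.0, step)
        candidate = log_target(proposal, observations, predicted)
        # The proposal is symmetric, so the acceptance ratio is the ratio of
        # target densities. 1 - random() lies in (0, 1], which keeps log() finite.
        if math.log(1.0 - rng.random()) < candidate - current:
            state, current = proposal, candidate
        if it >= burn_in:
            samples.append(list(state))
    return samples
```

For example, `mcmc_step_samples([0.0, 3.0], [0.2, 3.1])` returns a list of two-element states whose per-target averages can serve as position estimates for that step. Moving one target per iteration keeps each proposal local, which is one common way such samplers stay efficient as the number of targets grows.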

Index Terms

Multimodal multispeaker probabilistic tracking in meetings
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Tracking
      2. Computer vision tasks
        Scene understanding
        Vision for robotics

Information & Contributors

Information

Published In

ICMI '05: Proceedings of the 7th international conference on Multimodal interfaces

October 2005

344 pages

ISBN:1595930280

DOI:10.1145/1088463

General Chairs: Gianni Lazzari, ITC-irst, Trento (Italy); Fabio Pianesi, ITC-irst, Trento (Italy)

Program Chairs: James Crowley, I.N.P. Grenoble (France); Kenji Mase, Nagoya University (Japan); Sharon Oviatt, Oregon Health & Sciences University

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

ICMI05

Sponsor:

ICMI05: Seventh International Conference on Multimodal Interfaces 2005

October 4 - 6, 2005

Trento, Italy

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%
