Multimodal multispeaker probabilistic tracking in meetings

Published: 04 October 2005

Abstract

Tracking speakers in multiparty conversations is a fundamental task for automatic meeting analysis. In this paper, we present a probabilistic approach to jointly track the location and speaking activity of multiple speakers in a multisensor meeting room equipped with a small microphone array and multiple uncalibrated cameras. Our framework is based on a mixed-state dynamic graphical model defined on a multiperson state-space, which includes the explicit definition of a proximity-based interaction model. The model integrates audio-visual (AV) data through a novel observation model. Audio observations are derived from a source localization algorithm. Visual observations are based on models of the shape and spatial structure of human heads. Approximate inference in our model, needed given its complexity, is performed with a Markov chain Monte Carlo particle filter (MCMC-PF), which results in high sampling efficiency. We present results, based on an objective evaluation procedure, that show that our framework (1) is capable of locating and tracking the position and speaking activity of multiple meeting participants engaged in real conversations with good accuracy; (2) can deal with cases of visual clutter and partial occlusion; and (3) significantly outperforms a traditional sampling-based approach.
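
To make the inference idea in the abstract more concrete, the following is a minimal sketch of one time step of a generic MCMC particle filter over a joint multi-target state with a pairwise proximity penalty. It is not the authors' model: the 2-D position-only state (no speaking-activity variable), the Gaussian observation and dynamics terms, the linear overlap penalty, and every function and parameter name below are assumptions chosen purely for illustration.

```python
import math
import random


def log_interaction(positions, min_dist=1.0, strength=5.0):
    """Pairwise proximity penalty (hypothetical form): 0 when targets are
    at least min_dist apart, increasingly negative as they get closer."""
    total = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            d = math.hypot(positions[i][0] - positions[j][0],
                           positions[i][1] - positions[j][1])
            if d < min_dist:
                total -= strength * (min_dist - d)
    return total


def log_likelihood(positions, observations, sigma=0.5):
    """Toy observation model (assumption): one noisy 2-D detection per target."""
    return -sum((p[0] - z[0]) ** 2 + (p[1] - z[1]) ** 2
                for p, z in zip(positions, observations)) / (2 * sigma ** 2)


def log_dynamics_mixture(positions, prev_particles, q=0.5):
    """Log of the average Gaussian random-walk density from each previous
    particle to the candidate joint state (normalizing constants dropped)."""
    terms = [-sum((p[0] - r[0]) ** 2 + (p[1] - r[1]) ** 2
                  for p, r in zip(positions, prev)) / (2 * q ** 2)
             for prev in prev_particles]
    m = max(terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms) / len(terms))


def mcmc_pf_step(prev_particles, observations, n_samples=200, burn_in=100,
                 step=0.3, rng=None):
    """Return n_samples unweighted joint states approximating the filtering
    distribution, using Metropolis-Hastings moves on one target at a time."""
    rng = rng or random.Random()

    def log_target(x):
        return (log_likelihood(x, observations) + log_interaction(x)
                + log_dynamics_mixture(x, prev_particles))

    current = list(rng.choice(prev_particles))
    current_lp = log_target(current)
    samples = []
    for it in range(burn_in + n_samples):
        k = rng.randrange(len(current))          # pick one target to move
        proposal = list(current)
        proposal[k] = (current[k][0] + rng.gauss(0.0, step),
                       current[k][1] + rng.gauss(0.0, step))
        lp = log_target(proposal)
        # Symmetric proposal, so the acceptance ratio is target(new)/target(old).
        if rng.random() < math.exp(min(0.0, lp - current_lp)):
            current, current_lp = proposal, lp
        if it >= burn_in:
            samples.append(list(current))
    return samples


if __name__ == "__main__":
    prev = [[(0.2, 0.0), (2.8, 0.0)] for _ in range(20)]
    obs = [(0.0, 0.0), (3.0, 0.0)]
    particles = mcmc_pf_step(prev, obs, rng=random.Random(0))
    print(len(particles), particles[0])
```

Moving a single target per proposal is the design choice that lets a sampler of this kind scale better with the number of people than importance sampling in the full joint space, since each move only has to be plausible in a low-dimensional slice of the state.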

Published In

ICMI '05: Proceedings of the 7th international conference on Multimodal interfaces
October 2005
344 pages
ISBN:1595930280
DOI:10.1145/1088463
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. MCMC
  2. audio-visual speaker tracking
  3. particle filters

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%
