Joint face and head tracking inside multi-camera smart rooms

  • Original paper
  • Published in: Signal, Image and Video Processing

Abstract

The paper introduces a novel detection and tracking system that provides both frame-view and world-coordinate human location information, based on video from multiple synchronized and calibrated cameras with overlapping fields of view. The system is developed and evaluated for the specific scenario of a seminar lecturer presenting in front of an audience inside a “smart room”, its aim being to track the lecturer’s head centroid in the three-dimensional (3D) space and also yield two-dimensional (2D) face information in the available camera views. The proposed approach is primarily based on a statistical appearance model of human faces by means of well-known AdaBoost-like face detectors, extended to address the head pose variation observed in the smart room scenario of interest. The appearance module is complemented by two novel components and assisted by a simple tracking drift detection mechanism. The first component of interest is the initialization module, which employs a spatio-temporal dynamic programming approach with appropriate penalty functions to obtain optimal 3D location hypotheses. The second is an adaptive subspace learning based 2D tracking scheme with a novel forgetting mechanism, introduced to reduce tracking drift and increase robustness. System performance is benchmarked on an extensive database of realistic human interaction in the lecture smart room scenario, collected as part of the European integrated project “CHIL”. The system consistently achieves excellent tracking precision, with a 3D mean tracking error of less than 16 cm, and is demonstrated to outperform four alternative tracking schemes. Furthermore, the proposed system performs relatively well in detecting frontal and near-frontal faces in the available frame views.
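The adaptive subspace tracker and its forgetting mechanism are described only at a high level in the abstract. As a minimal illustrative sketch of the general idea, not the authors' actual formulation, an appearance subspace can be maintained incrementally by exponentially down-weighting past statistics before folding in new observations (the function name, the scalar forgetting factor `forget`, and the per-batch covariance update are all assumptions made for this example):

```python
import numpy as np

def update_subspace(mean, cov, new_samples, forget=0.95, n_components=8):
    """Illustrative adaptive appearance-subspace update with forgetting.

    Old statistics are down-weighted by `forget` before new appearance
    samples are incorporated; the leading eigenvectors of the updated
    covariance span the current tracking subspace.
    """
    new_samples = np.atleast_2d(new_samples)        # (k, d) appearance vectors
    k = new_samples.shape[0]
    batch_mean = new_samples.mean(axis=0)
    # Exponentially forget the old mean and covariance.
    mean = forget * mean + (1.0 - forget) * batch_mean
    centered = new_samples - mean
    batch_cov = centered.T @ centered / max(k, 1)
    cov = forget * cov + (1.0 - forget) * batch_cov
    # Leading eigenvectors of the (symmetric) covariance form the subspace.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, cov, eigvecs[:, order]
```

Bounding the influence of old appearance samples in this way is the general rationale behind forgetting mechanisms for reducing tracking drift; the paper's specific scheme should be consulted for the actual update rule.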


References

  1. CHIL: Computers in the human interaction loop [Online]. Available: http://chil.server.de

  2. Mostefa, D., Moreau, N., Choukri, K., Potamianos, G., Chu, S.M., Tyagi, A., Casas, J.R., Turmo, J., Christoforetti, L., Tobia, F., Pnevmatikakis, A., Mylonakis, V., Talantzis, F., Burger, S., Stiefelhagen, R., Bernardin, K., Rochet, C.: The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms. J. Lang. Resour. Eval. (submitted) (2007)

  3. Stiefelhagen, R., Garofolo, J. (eds.): Multimodal Technologies for Perception of Humans: First International Evaluation Workshop on Classification of Events, Activities, and Relationships, CLEAR 2006. LNCS, vol. 4122. Springer (2007)

  4. Fiscus, J.G., Ajot, J., Michel, M., Garofolo, J.S.: The rich transcription 2006 spring meeting recognition evaluation. In: Renals, S., Bengio, S., Fiscus, J.G. (eds.) Machine Learning for Multimodal Interaction, LNCS vol. 4299, pp. 309–322 (2006)

  5. Stiefelhagen, R., Bernardin, K., Bowers, R., Garofolo, J., Mostefa, D., Soundararajan, P.: The CLEAR 2006 evaluation. In: Stiefelhagen, R., Garofolo, J. (eds.) Multimodal Technologies for Perception of Humans: First International Evaluation Workshop on Classification of Events, Activities, and Relationships, CLEAR 2006. LNCS, vol. 4122, pp. 1–44. Springer (2007)

  6. Stergiou, A., Pnevmatikakis, A., Polymenakos, L.: A decision fusion system across time and classifiers for audio-visual person identification. In: Stiefelhagen, R., Garofolo, J. (eds.) Multimodal Technologies for Perception of Humans: First International Evaluation Workshop on Classification of Events, Activities, and Relationships, CLEAR 2006. LNCS, vol. 4122, pp. 223–232. Springer (2007)

  7. Wölfel, M., Nickel, K., McDonough, J.: Microphone array driven speech recognition: influence of localization on the word error rate. In: Proceedings joint workshop on multimodal interaction and related machine learning algorithms (MLMI), LNCS vol. 3869, pp. 320–331 (2005)

  8. Pinhanez, C., Bobick, A.: Intelligent studios: using computer vision to control TV cameras. In: Proceedings Workshop on Entertainment and AI/Alife, pp. 69–76 (1995)

  9. Wallick, M.N., Rui, Y., He, L.: A portable solution for automatic lecture room camera management. In: Proceedings International Conference Multimedia Expo (ICME) (2004)

  10. Hampapur, A., Pankanti, S., Senior, A.W., Tian, Y.-L., Brown, L., Bolle, R.: Face cataloger: multi-scale imaging for relating identity to location. In: Proceedings IEEE Conference Advanced Video and Signal Based Surveillance, pp. 13–20 (2003)

  11. Potamianos, G., Lucey, P.: Audio-visual ASR from multiple views inside smart rooms. In: Proceedings International Conference Multisensor Fusion and Integration for Intelligent Systems (MFI), pp. 35–40 (2006)

  12. Bouguet, J.-Y.: Camera Calibration Toolbox [Online]. Available: http://www.vision.caltech.edu/bouguetj/calib_doc/

  13. Pnevmatikakis, A., Polymenakos, L.: 2D person tracking using Kalman filtering and adaptive background learning in a feedback loop. In: Stiefelhagen, R., Garofolo, J. (eds.) Multimodal Technologies for Perception of Humans: First International Evaluation Workshop on Classification of Events, Activities, and Relationships, CLEAR 2006. LNCS, vol. 4122, pp. 151–160. Springer (2007)

  14. Nechyba, M.C., Schneiderman, H.: PittPatt face detection and tracking for the CLEAR 2006 evaluation. In: Stiefelhagen, R., Garofolo, J. (eds.) Multimodal Technologies for Perception of Humans: First International Evaluation Workshop on Classification of Events, Activities, and Relationships, CLEAR 2006. LNCS, vol. 4122, pp. 161–170. Springer (2007)

  15. Bernardin, K., Gehrig, T., Stiefelhagen, R.: Multi- and single view multiperson tracking for smart room environments. In: Stiefelhagen, R., Garofolo, J. (eds.) Multimodal Technologies for Perception of Humans: First International Evaluation Workshop on Classification of Events, Activities, and Relationships, CLEAR 2006. LNCS, vol. 4122, pp. 81–92. Springer (2007)

  16. Nickel, K., Gehrig, T., Stiefelhagen, R., McDonough, J.: A joint particle filter for audio-visual speaker tracking. In: Proceedings International Conference Multimodal Interfaces (ICMI) (2005)

  17. Abad, A., Canton-Ferrer, C., Segura, C., Landabaso, J.L., Macho, D., Casas, J.R., Hernando, J., Pardàs, M., Nadeu, C.: UPC audio, video and multimodal person tracking systems in the CLEAR evaluation campaign. In: Stiefelhagen, R., Garofolo, J. (eds.) Multimodal Technologies for Perception of Humans: First International Evaluation Workshop on Classification of Events, Activities, and Relationships, CLEAR 2006. LNCS, vol. 4122, pp. 93–104. Springer (2007)

  18. Brunelli, R., Brutti, A., Chippendale, P., Lanz, O., Omologo, M., Svaizer, P., Tobia, F.: A generative approach to audio-visual person tracking. In: Stiefelhagen, R., Garofolo, J. (eds.) Multimodal Technologies for Perception of Humans: First International Evaluation Workshop on Classification of Events, Activities, and Relationships, CLEAR 2006. LNCS, vol. 4122, pp. 55–68. Springer (2007)

  19. Wu, B., Singh, V.K., Nevatia, R., Chu, C.-W.: Speaker tracking in seminars by human body detection. In: Stiefelhagen, R., Garofolo, J. (eds.) Multimodal Technologies for Perception of Humans: First International Evaluation Workshop on Classification of Events, Activities, and Relationships, CLEAR 2006. LNCS, vol. 4122, pp. 119–126. Springer (2007)

  20. Zhang, Z., Potamianos, G., Senior, A., Chu, S., Huang, T.: A joint system for person tracking and face detection. In: Proceedings International Workshop Human-Computer Interaction (ICCV 2005 Work. on HCI), pp. 47–59 (2005)

  21. Lim, J., Ross, D., Lin, R.-S., Yang, M.-H.: Incremental learning for visual tracking. In: Proceedings NIPS (2004)

  22. Hampapur A., Brown L., Connell J., Ekin A., Haas N., Lu M., Merkl H., Pankanti S., Senior A., Shu C.-F. and Tian Y.-L. (2005). Smart Video Surveillance. IEEE Signal Process. Mag. 22(2): 38–51


  23. Isard, M., MacCormick, J.: BraMBLe: A Bayesian multiple blob tracker. In: Proceedings International Conference Computer Vision, vol. 2, pp. 34–41 (2001)

  24. Senior, A.: Real-time articulated human body tracking using silhouette information. In: Proceedings Workshop Visual Surveillance/PETS (2003)

  25. Rowley H.A., Baluja S. and Kanade T. (1998). Neural network-based face detection. IEEE Trans. Pattern Anal. Mach. Intell. 20(1): 23–28


  26. Osuna, E., Freund, R., Girosi, F.: Training support vector machines: an application to face detection. In: Proceedings Conference Computer Vision Pattern Recognition, pp. 130–136 (1997)

  27. Roth, D., Yang, M.-H., Ahuja, N.: A SNoW-based face detector. In: Proceedings of NIPS (2000)

  28. Viola, P., Jones, M.: Robust real time object detection. In: Proceedings IEEE ICCV Work. Statistical and Computational Theories of Vision (2001)

  29. Graf, H.P., Cosatto, E., Potamianos, G.: Robust recognition of faces and facial features with a multi-modal system. In: Proceedings International Conference Systems Man Cybernetics pp. 2034–2039 (1997)

  30. Cootes T.F., Edwards G.J. and Taylor C.J. (2001). Active appearance models. IEEE Trans. Pattern Anal. Mach. Intell. 23(6): 681–685


  31. Pentland, A.P., Moghaddam, B., Starner, T.: View-based and modular eigenspaces for face recognition. In: Proceedings Conference Computer Vision Pattern Recognition, pp. 84–91 (1994)

  32. Li S.Z. and Zhang Z. (2004). FloatBoost learning and statistical face detection. IEEE Trans. Pattern Anal. Mach. Intell. 26(9): 1112–1123


  33. Isard, M., Blake, A.: Contour tracking by stochastic propagation of conditional density. In: Proceedings European Conference Computer Vision, pp. 343–356 (1996)

  34. Comaniciu, D., Ramesh, V., Meer, P.: Real-time tracking of non-rigid objects using mean shift. In: Proceedings International Conference Computer Vision Pattern Recognition, vol. 2, pp. 142–149 (2000)

  35. Tao, H., Sawhney, H.S., Kumar, R.: Dynamic layer representation with applications to tracking. In: Proceedings International Conference Computer Vision Pattern Recognition, vol. 2, pp. 134–141 (2000)

  36. Black M.J. and Jepson A. (1998). Eigentracking: robust matching and tracking of articulated objects using a view-based representation. Int. J. Comput. Vis. 26(1): 63–84


  37. Jepson A.D., Fleet D.J. and El-Maraghi T.F. (2003). Robust online appearance models for visual tracking. IEEE Trans. Pattern Anal. Mach. Intell. 25(10): 1296–1311


  38. Collins R.T., Liu Y. and Leordeanu M. (2005). Online selection of discriminative tracking features. IEEE Trans. Pattern Anal. Mach. Intell. 27(10): 1631–1643


  39. Han, B., Davis, L.: On-line density-based appearance modeling for object tracking. In: Proceedings International Conference Computer Vision (2005)

  40. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press (2004). ISBN 0521540518

  41. Lanz O. (2006). Approximate Bayesian multibody tracking. IEEE Trans. Pattern Anal. Mach. Intell. 28(9): 1436–1449


  42. Zotkin D.N., Duraiswami R. and Davis L.S. (2002). Joint audio-visual tracking using particle filters. EURASIP J. Appl. Signal Process. 2002(11): 1154–1164


  43. Mittal, A., Davis, L.: M2Tracker: a multi-view approach to segmenting and tracking people in a cluttered scene using region-based stereo. In: Proceedings European Conference Computer Vision, pp. 18–36 (2002)

  44. Kalman R.E. (1960). A new approach to linear filtering and prediction problems. Trans. ASME J. Basic Engin. (Ser. D) 82: 35–45


  45. Arulampalam M.S., Maskell S., Gordon N. and Clapp T. (2002). A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process. 50(2): 174–188


  46. Stauffer C. and Grimson W.E.L. (2000). Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Anal. Mach. Intell. 22(8): 747–757


  47. Tyagi, A., Potamianos, G., Davis, J.W., Chu, S.M.: Fusion of multiple camera views for kernel-based 3D tracking. In: Proceedings IEEE Workshop Motion and Video Computing (2007)

  48. Ho, J., Lee, K.-C., Yang, M.-H., Kriegman, D.: Visual tracking using learned linear subspaces. In: Proceedings International Conference Computer Vision Pattern Recognition, vol. 1, pp. 782–789 (2004)

  49. Hall P., Marshall D. and Martin R. (2000). Merging and splitting eigenspace models. IEEE Trans. Pattern Anal. Mach. Intell. 22(9): 1042–1049


  50. Freund Y. and Schapire R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1): 119–139


  51. Tieu, K., Viola, P.: Boosting image retrieval. In: Proceedings Conference Computer Vision Pattern Recognition, vol. 1, pp. 228–235 (2000)

  52. Pudil P., Novovicova J. and Kittler J. (1994). Floating search methods in feature selection. Pattern Recog. Lett. 15: 1119–1125


  53. Senior, A.W., Potamianos, G., Chu, S., Zhang, Z., Hampapur, A.: A comparison of multicamera person-tracking algorithms. In: Proceedings IEEE International Workshop Visual Surveillance (VS/ECCV) (2006)

  54. Bobick A. and Davis J. (2001). The representation and recognition of action using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell. 23(3): 257–267


  55. Senior, A.: Tracking with probabilistic appearance models. In: Proceedings International Workshop on Performance Evaluation of Tracking and Surveillance Systems (2002)

  56. Bernardin, K., Elbs, A., Stiefelhagen, R.: Multiple object tracking performance metrics and evaluation in a smart room environment. In: Proceedings IEEE International Workshop Visual Surveillance (VS/ECCV) (2006)


Author information

Correspondence to Gerasimos Potamianos.

Additional information

This work was performed while Zhenqiu Zhang was on a summer internship with the Human Language Technology Department at the IBM T.J. Watson Research Center.


About this article

Cite this article

Zhang, Z., Potamianos, G., Senior, A.W. et al. Joint face and head tracking inside multi-camera smart rooms. SIViP 1, 163–178 (2007). https://doi.org/10.1007/s11760-007-0018-3

