Robust facial landmark detection and tracking across poses and expressions for in-the-wild monocular video

Shuang Liu¹,
Yongqiang Zhang²,
Xiaosong Yang¹,
Daming Shi² &
Jian J. Zhang¹

Abstract

We present a novel approach for automatically detecting and tracking facial landmarks across poses and expressions in in-the-wild monocular video data, e.g., YouTube videos and smartphone recordings. Our method does not require any calibration or manual adjustment for new input videos or actors. First, we propose a method for robust 2D facial landmark detection across poses that combines shape-face canonical-correlation analysis with a global supervised descent method. Because 2D regression-based methods are sensitive to unstable initialization and ignore the temporal and spatial coherence of videos, we use a coarse-to-dense 3D facial expression reconstruction method to refine the 2D landmarks. For the coarse stage, we employ an in-the-wild method to extract the coarse reconstruction result and its corresponding texture from the detected sparse facial landmarks, followed by robust pose, expression, and identity estimation. For the dense stage, we present a face tracking flow method that corrects the coarse reconstruction results and tracks weakly textured areas; this is used to iteratively update the coarse face model. A dense reconstruction result is obtained once the iteration converges. Extensive experiments on a variety of video sequences, recorded by ourselves or downloaded from YouTube, show the results of facial landmark detection and tracking under various lighting conditions, head poses, and facial expressions. The overall performance and a comparison with state-of-the-art methods demonstrate the robustness and effectiveness of our method.
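
To make the regression idea behind a supervised descent method concrete, the following is a minimal sketch on synthetic 1D data. It is not the authors' implementation: it omits the shape-face canonical-correlation analysis, the global (partitioned) variant, and all image features. The feature function `h`, the helper names `train_sdm` and `apply_sdm`, and every parameter value are assumptions made purely for illustration. Each stage fits a linear map from the current feature residual to the remaining shape error, and the stages are then applied in sequence.

```python
import numpy as np


def h(x):
    # Hypothetical nonlinear "feature" function; a stand-in for image
    # descriptors sampled around the current landmark estimate.
    return x ** 3


def train_sdm(x_true, x_init, n_stages=5, ridge=1e-6):
    """Learn a cascade of linear regressors (an SDM-style sketch).

    Each stage maps the feature residual y - h(x_k) to the remaining
    error x_true - x_k using ridge-regularised least squares.
    """
    y = h(x_true)                      # observed features of the target
    x = x_init.copy()
    stages = []
    for _ in range(n_stages):
        # Design matrix: feature residual plus a bias column.
        phi = np.column_stack([y - h(x), np.ones_like(x)])
        target = x_true - x            # update that would reach the truth
        gram = phi.T @ phi + ridge * np.eye(phi.shape[1])
        w = np.linalg.solve(gram, phi.T @ target)
        stages.append(w)
        x = x + phi @ w                # apply the learned update
    return stages


def apply_sdm(stages, y, x_init):
    """Run the learned cascade from x_init given observed features y."""
    x = x_init.copy()
    for w in stages:
        phi = np.column_stack([y - h(x), np.ones_like(x)])
        x = x + phi @ w
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_train = rng.uniform(1.0, 3.0, 1000)
    stages = train_sdm(x_train, np.full_like(x_train, 2.0))

    x_test = rng.uniform(1.0, 3.0, 200)
    x0 = np.full_like(x_test, 2.0)     # "mean shape" initialisation
    x_hat = apply_sdm(stages, h(x_test), x0)
    print("initial MSE:", np.mean((x0 - x_test) ** 2))
    print("final MSE:  ", np.mean((x_hat - x_test) ** 2))
```

In this toy setting every sample starts from the same initial value, which plays the role of a mean-shape initialization. The sensitivity to initialization noted in the abstract is what the 3D refinement stage is meant to address, and that stage is not modelled here.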

Acknowledgements

This work was supported by the Harbin Institute of Technology Scholarship Fund 2016 and the National Centre for Computer Animation, Bournemouth University.

Author information

Authors and Affiliations

Bournemouth University, Poole, BH12 5BB, UK
Shuang Liu, Xiaosong Yang & Jian J. Zhang
Harbin Institute of Technology, Harbin, 150001, China
Yongqiang Zhang & Daming Shi

Corresponding author

Correspondence to Yongqiang Zhang.

Additional information

This article is published with open access at Springerlink.com

Shuang Liu received his B.S. degree in computer science from the Hebei University of Technology, China, in 2014. He is currently a Ph.D. student in the National Centre for Computer Animation, Bournemouth University, UK. His research interests include computer vision and computer animation.

Yongqiang Zhang received his B.S. and M.S. degrees from Harbin Institute of Technology, China, in 2012 and 2014, respectively. He is currently a Ph.D. student in the School of Computer Science and Technology, Harbin Institute of Technology, China. His research interests include machine learning, computer vision, object tracking, and facial animation.

Xiaosong Yang is currently a senior lecturer in the National Centre for Computer Animation (NCCA), Bournemouth University, UK. His research interests include interactive graphics and animation, rendering and modeling, virtual reality, virtual surgery simulation, and CAD. He received his bachelor's (1993) and master's (1996) degrees in computer science from Zhejiang University, China, and his Ph.D. degree (2000) in computing mechanics from Dalian University of Technology, China. He spent two years as a postdoc (2000–2002) at Tsinghua University working on scientific visualization, and one year (2001–2002) as a research assistant at the Virtual Reality, Visualization and Imaging Research Centre of the Chinese University of Hong Kong. In 2003, he came to NCCA to continue his work on computer animation.

Daming Shi received a Ph.D. degree in mechanical control from Harbin Institute of Technology, China, and a Ph.D. degree in computer science from the University of Southampton, UK. He served as an assistant professor at Nanyang Technological University, Singapore, from 2002. Dr. Shi is currently a chair professor at Harbin Institute of Technology, China. His current research interests include machine learning, medical image processing, pattern recognition, and neural networks.

Jian J. Zhang is a professor of computer graphics in the National Centre for Computer Animation, Bournemouth University, UK, and leads the Computer Animation Research Centre. His research focuses on a number of topics relating to 3D computer animation, including virtual human modelling and simulation, geometric modelling, motion synthesis, deformation, and physics-based animation. He is also interested in virtual reality and medical visualisation and simulation. Prof. Zhang has published over 200 peer-reviewed journal and conference publications. He has chaired over 30 international conferences and symposia, and served on a number of editorial boards. Prof. Zhang is also one of the two co-founders of the EPSRC-funded multi-million-pound Centre for Digital Entertainment (CDE), with Prof. Phil Willis at the University of Bath.

Electronic supplementary material

Supplementary material, approximately 30.4 MB.

Rights and permissions

Open Access The articles published in this journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.

About this article

Cite this article

Liu, S., Zhang, Y., Yang, X. et al. Robust facial landmark detection and tracking across poses and expressions for in-the-wild monocular video. Comp. Visual Media 3, 33–47 (2017). https://doi.org/10.1007/s41095-016-0068-y

Received: 04 September 2016
Accepted: 20 December 2016
Published: 17 March 2017
Issue Date: March 2017
DOI: https://doi.org/10.1007/s41095-016-0068-y
