Abstract
We present a novel approach for automatically detecting and tracking facial landmarks across poses and expressions in in-the-wild monocular video, e.g., YouTube videos and smartphone recordings. Our method requires no calibration or manual adjustment for new input videos or actors. First, we propose a robust method for 2D facial landmark detection across poses that combines shape-face canonical-correlation analysis with a global supervised descent method. Because 2D regression-based methods are sensitive to unstable initialization and ignore the temporal and spatial coherence of video, we use a coarse-to-dense 3D facial expression reconstruction method to refine the 2D landmarks. On the one hand, we employ an in-the-wild method to extract a coarse reconstruction and its corresponding texture from the detected sparse facial landmarks, followed by robust pose, expression, and identity estimation. On the other hand, to obtain a dense reconstruction, we introduce a face tracking flow method that corrects the coarse reconstruction and tracks weakly textured areas; it is used to iteratively update the coarse face model. Finally, a dense reconstruction is estimated once the iteration converges. Extensive experiments on a variety of video sequences, recorded by ourselves or downloaded from YouTube, show facial landmark detection and tracking results under various lighting conditions, head poses, and facial expressions. The overall performance and a comparison with state-of-the-art methods demonstrate the robustness and effectiveness of our method.
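As a rough illustration of the regression idea behind the 2D detection stage, the sketch below trains a plain supervised descent cascade on a one-dimensional toy problem. It is not the authors' implementation and it omits both the global (domain-partitioned) variant and the canonical-correlation step. The feature function `h`, the sampling range, the fixed initial guess, and all function names are assumptions made for this example only.

```python
import numpy as np

def h(x):
    # Hypothetical nonlinear feature function; in face alignment this role
    # is played by image features extracted around the current landmarks.
    return x ** 3

def features(x, y):
    # Feature residual plus a constant column for the bias term.
    return np.column_stack([h(x) - y, np.ones_like(x)])

def train_sdm(x_true, x_init, n_stages=4):
    """Learn a cascade of linear maps from feature residuals to updates."""
    y = h(x_true)
    x = x_init.copy()
    stages = []
    for _ in range(n_stages):
        phi = features(x, y)
        # Least-squares fit of the remaining error (x_true - x) on phi.
        w, *_ = np.linalg.lstsq(phi, x_true - x, rcond=None)
        stages.append(w)
        x = x + phi @ w  # move the training estimates before the next stage
    return stages

def apply_sdm(stages, y, x_init):
    """Run the learned cascade from an initial guess, given observed y."""
    x = x_init.copy()
    for w in stages:
        x = x + features(x, y) @ w
    return x

rng = np.random.default_rng(0)
x_train = rng.uniform(1.0, 3.0, size=500)
stages = train_sdm(x_train, np.full_like(x_train, 2.0))

# Held-out samples, all started from the same fixed initial guess.
x_test = rng.uniform(1.0, 3.0, size=200)
x0 = np.full_like(x_test, 2.0)
x_hat = apply_sdm(stages, h(x_test), x0)

init_err = np.mean(np.abs(x_test - x0))
final_err = np.mean(np.abs(x_test - x_hat))
print(f"mean abs error: initial {init_err:.3f}, after cascade {final_err:.3f}")
```

Each stage is an ordinary least-squares problem, so training needs no Jacobian or Hessian of `h`. The cascade still depends on the initial guess, which is the sensitivity the abstract cites as the reason for refining the 2D landmarks with 3D reconstruction.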
Acknowledgements
This work was supported by the Harbin Institute of Technology Scholarship Fund 2016 and the National Centre for Computer Animation, Bournemouth University.
Author information
Additional information
This article is published with open access at Springerlink.com
Shuang Liu received his B.S. degree in computer science from the Hebei University of Technology, China, in 2014. He is currently a Ph.D. student in the National Centre for Computer Animation, Bournemouth University, UK. His research interests include computer vision and computer animation.
Yongqiang Zhang received his B.S. and M.S. degrees from Harbin Institute of Technology, China, in 2012 and 2014, respectively. He is currently a Ph.D. student in the School of Computer Science and Technology, Harbin Institute of Technology, China. His research interests include machine learning, computer vision, object tracking, and facial animation.
Xiaosong Yang is currently a senior lecturer in the National Centre for Computer Animation (NCCA), Bournemouth University, UK. His research interests include interactive graphics and animation, rendering and modeling, virtual reality, virtual surgery simulation, and CAD. He received his bachelor's (1993) and master's (1996) degrees in computer science from Zhejiang University, China, and his Ph.D. degree (2000) in computing mechanics from Dalian University of Technology, China. He spent two years as a postdoc (2000–2002) at Tsinghua University working on scientific visualization, and one year (2001–2002) as a research assistant in the Virtual Reality, Visualization and Imaging Research Centre of the Chinese University of Hong Kong. In 2003, he came to NCCA to continue his work on computer animation.
Daming Shi received a Ph.D. degree in mechanical control from Harbin Institute of Technology, China, and a Ph.D. degree in computer science from the University of Southampton, UK. He served as an assistant professor at Nanyang Technological University, Singapore, from 2002. Dr. Shi is currently a chair professor at Harbin Institute of Technology, China. His current research interests include machine learning, medical image processing, pattern recognition, and neural networks.
Jian J. Zhang is a professor of computer graphics in the National Centre for Computer Animation, Bournemouth University, UK, and leads the Computer Animation Research Centre. His research focuses on a number of topics relating to 3D computer animation, including virtual human modelling and simulation, geometric modelling, motion synthesis, deformation, and physics-based animation. He is also interested in virtual reality and medical visualisation and simulation. Prof. Zhang has published over 200 peer-reviewed journal and conference publications. He has chaired over 30 international conferences and symposia, and served on a number of editorial boards. Prof. Zhang is also one of the two co-founders of the EPSRC-funded multi-million-pound Centre for Digital Entertainment (CDE), with Prof. Phil Willis at the University of Bath.
Rights and permissions
Open Access The articles published in this journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
About this article
Cite this article
Liu, S., Zhang, Y., Yang, X. et al. Robust facial landmark detection and tracking across poses and expressions for in-the-wild monocular video. Comp. Visual Media 3, 33–47 (2017). https://doi.org/10.1007/s41095-016-0068-y
DOI: https://doi.org/10.1007/s41095-016-0068-y