DOI: 10.1007/978-3-030-01264-9_17
Article

PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model

Published: 08 September 2018

Abstract

We present a box-free bottom-up approach for the tasks of pose estimation and instance segmentation of people in multi-person images using an efficient single-shot model. The proposed PersonLab model tackles both semantic-level reasoning and object-part associations using part-based modeling. Our model employs a convolutional network which learns to detect individual keypoints and predict their relative displacements, allowing us to group keypoints into person pose instances. Further, we propose a part-induced geometric embedding descriptor which allows us to associate semantic person pixels with their corresponding person instance, delivering instance-level person segmentations. Our system is based on a fully-convolutional architecture and allows for efficient inference, with runtime essentially independent of the number of people present in the scene. Trained on COCO data alone, our system achieves COCO test-dev keypoint average precision of 0.665 using single-scale inference and 0.687 using multi-scale inference, significantly outperforming all previous bottom-up pose estimation systems. We are also the first bottom-up method to report competitive results for the person class in the COCO instance segmentation task, achieving a person category average precision of 0.417.
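
The grouping idea in the abstract can be illustrated with a small toy example. This is only a minimal sketch under stated assumptions, not the authors' implementation: the three-keypoint skeleton `EDGES`, the `group_keypoints` and `assign_pixel` helpers, the `radius` threshold, and the constant `toy_displacement` function are all hypothetical stand-ins for quantities that the real model predicts with a convolutional network. The first function follows predicted displacements from one keypoint to the next to assemble poses; the second assigns a person pixel to the pose whose keypoints best agree with the positions that pixel points to, in the spirit of the part-induced geometric embedding.

```python
import math

# Hypothetical toy skeleton (assumption): a chain nose -> shoulder -> hip.
EDGES = [("nose", "shoulder"), ("shoulder", "hip")]


def group_keypoints(candidates, displacement, radius=2.0):
    """Greedily group keypoint candidates into person poses.

    candidates: dict mapping keypoint type -> list of (x, y, score).
    displacement: callable (edge, (x, y)) -> (dx, dy); a stand-in for the
        network's predicted offset from a keypoint to its skeleton neighbor.
    radius: maximum distance for snapping to a detected candidate.
    """
    used = set()
    root = EDGES[0][0]
    # Start from root-type candidates, strongest first.
    seeds = sorted(candidates[root], key=lambda c: -c[2])
    people = []
    for sx, sy, _ in seeds:
        pose = {root: (sx, sy)}
        for src, dst in EDGES:
            if src not in pose:
                continue  # the chain broke earlier; skip dependent edges
            px, py = pose[src]
            dx, dy = displacement((src, dst), (px, py))
            tx, ty = px + dx, py + dy
            # Snap to the nearest unused candidate of the destination type.
            best, best_d = None, radius
            for i, (cx, cy, _) in enumerate(candidates[dst]):
                d = math.hypot(cx - tx, cy - ty)
                if (dst, i) not in used and d <= best_d:
                    best, best_d = i, d
            if best is not None:
                used.add((dst, best))
                pose[dst] = candidates[dst][best][:2]
        people.append(pose)
    return people


def assign_pixel(pixel, predicted_offsets, people):
    """Return the index of the pose that best matches a person pixel.

    predicted_offsets: dict keypoint type -> (dx, dy) predicted at `pixel`,
        pointing from the pixel to that keypoint of its own person.
    The cost is the mean distance between the pointed-to positions and the
    pose's keypoints; returns None if no pose shares a keypoint type.
    """
    best, best_cost = None, float("inf")
    for idx, pose in enumerate(people):
        shared = [k for k in pose if k in predicted_offsets]
        if not shared:
            continue
        cost = sum(
            math.hypot(pixel[0] + predicted_offsets[k][0] - pose[k][0],
                       pixel[1] + predicted_offsets[k][1] - pose[k][1])
            for k in shared
        ) / len(shared)
        if cost < best_cost:
            best, best_cost = idx, cost
    return best


def toy_displacement(edge, position):
    # Constant offsets (assumption); a real model predicts these per pixel.
    return {("nose", "shoulder"): (0, 10), ("shoulder", "hip"): (0, 20)}[edge]


candidates = {
    "nose": [(10, 10, 0.9), (50, 12, 0.8)],
    "shoulder": [(50, 22, 0.7), (10, 20, 0.6)],
    "hip": [(10, 40, 0.5), (50, 42, 0.5)],
}
people = group_keypoints(candidates, toy_displacement)
# A pixel near the second person whose offsets point at that person's joints.
owner = assign_pixel(
    (48, 30), {"nose": (2, -18), "shoulder": (2, -8), "hip": (2, 12)}, people
)
```

In this sketch the grouping loop runs once per seed keypoint and the pixel assignment compares each pixel against the detected poses only, so no bounding boxes are involved at any step. The toy decoder is not a model of the system's actual runtime behavior.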

Published In

Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part XIV
Sep 2018
844 pages
ISBN: 978-3-030-01263-2
DOI: 10.1007/978-3-030-01264-9

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

1. Person detection and pose estimation
2. Segmentation and grouping
