Mask-Guided Deformation Adaptive Network for Human Parsing

Authors:

Shengfeng He

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 18, Issue 1

Article No.: 11, Pages 1 - 20

https://doi.org/10.1145/3467889

Published: 14 March 2022

Abstract

Due to the challenges of densely compacted body parts, nonrigid clothing items, and severe overlap in crowd scenes, human parsing needs to focus more on multilevel feature representations than general scene parsing tasks do. Based on this observation, we propose to introduce the auxiliary tasks of human mask and edge detection to facilitate human parsing. Unlike human parsing, which exploits the discriminative features of each category, human mask and edge detection emphasize the boundaries of semantic parsing regions and the difference between foreground humans and background clutter, which benefits the parsing predictions for crowd scenes and small human parts. Specifically, we extract human mask and edge labels from the human parsing annotations and train a shared encoder with three independent decoders for the three mutually beneficial tasks. Furthermore, the decoder feature maps of the human mask prediction branch are exploited as attention maps that indicate human regions, facilitating the decoding process of human parsing and human edge detection. In addition to these auxiliary tasks, we alleviate the problem of deformed clothing items under various human poses by tracking the deformation patterns with deformable convolution. Extensive experiments show that the proposed method achieves superior performance against state-of-the-art methods on both single and multiple human parsing datasets. Code and trained models are available at https://github.com/ViktorLiang/MGDAN.
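
The label-extraction step mentioned in the abstract (deriving human mask and edge labels from parsing annotations) can be illustrated with a minimal sketch. This is not the authors' implementation; their actual procedure is in the linked repository. The sketch assumes that label 0 denotes background and that an edge pixel is any pixel with a 4-connected neighbour carrying a different label. The names `BACKGROUND`, `human_mask`, and `edge_map` are hypothetical.

```python
from typing import List

Grid = List[List[int]]

BACKGROUND = 0  # assumption: label 0 marks background pixels


def human_mask(parsing: Grid) -> Grid:
    """Binary foreground mask: 1 wherever the parsing label is not background."""
    return [[1 if label != BACKGROUND else 0 for label in row] for row in parsing]


def edge_map(parsing: Grid) -> Grid:
    """Binary edge map: 1 for any pixel with a 4-connected neighbour of a different label."""
    height, width = len(parsing), len(parsing[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Checking only the right and lower neighbours, and marking both
            # sides of each label change, covers all 4-connected pairs once.
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < height and nx < width and parsing[ny][nx] != parsing[y][x]:
                    edges[y][x] = 1
                    edges[ny][nx] = 1
    return edges


if __name__ == "__main__":
    # Toy 3x3 annotation: 0 = background, 1 and 2 = two hypothetical part labels.
    toy = [
        [0, 0, 0],
        [0, 1, 1],
        [0, 1, 2],
    ]
    print(human_mask(toy))
    print(edge_map(toy))
```

Under these assumptions, one parsing annotation yields supervision for all three tasks (parsing, mask, edge) without extra labelling effort, which is what makes the shared-encoder, three-decoder setup described above practical to train.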

Cited By

Wu T, Zhu R, Wan S. (2024). Semantic Map Guided Identity Transfer GAN for Person Re-identification. ACM Transactions on Multimedia Computing, Communications, and Applications 20:11 (1-20). Online publication date: 12-Sep-2024. https://dl.acm.org/doi/10.1145/3631355
Chen D, Kong D, Li J, Wang S, Yin B. (2023). ADOSMNet: a novel visual affordance detection network with object shape mask guided feature encoders. Multimedia Tools and Applications 83:11 (31629-31653). Online publication date: 18-Sep-2023. https://doi.org/10.1007/s11042-023-16898-2

Index Terms

Mask-Guided Deformation Adaptive Network for Human Parsing
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Image segmentation
      2. Computer vision representations
        Appearance and texture representations

Information & Contributors

Information

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 18, Issue 1

January 2022

517 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3505205

Editor:
Alberto Del Bimbo
University of Firenze, Italy

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 March 2022

Accepted: 01 May 2021

Revised: 01 March 2021

Received: 01 August 2020

Published in TOMM Volume 18, Issue 1

Qualifiers

Research-article
Refereed

Funding Sources

National Natural Science Foundation of China
Guangdong International Science and Technology Cooperation Project
Guangdong Natural Science Foundation
Guangzhou Basic and Applied Research Project
Fundamental Research Funds for the Central Universities
Social Science Research Base of Guangdong Province-Research Center of Network Civilization in New Era of SCUT
CCF-Tencent Open Research fund
