Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3474085.3475369acmconferencesArticle/Chapter ViewAbstractPublication PagesmmConference Proceedingsconference-collections
research-article

DSSL: Deep Surroundings-person Separation Learning for Text-based Person Retrieval

Published: 17 October 2021 Publication History

Abstract

Many previous methods on text-based person retrieval tasks are devoted to learning a latent common space mapping, with the purpose of extracting modality-invariant features from both visual and textual modality. Nevertheless, due to the complexity of high-dimensional data, the unconstrained mapping paradigms are not able to properly catch discriminative clues about the corresponding person while drop the misaligned information. Intuitively, the information contained in visual data can be divided into person information (PI) and surroundings information (SI), which are mutually exclusive from each other. To this end, we propose a novel Deep Surroundings-person Separation Learning (DSSL) model in this paper to effectively extract and match person information, and hence achieve a superior retrieval accuracy. A surroundings-person separation and fusion mechanism plays the key role to realize an accurate and effective surroundings-person separation under a mutually exclusion constraint. In order to adequately utilize multi-modal and multi-granular information for a higher retrieval accuracy, five diverse alignment paradigms are adopted. Extensive experiments are carried out to evaluate the proposed DSSL on CUHK-PEDES, which is currently the only accessible dataset for text-base person retrieval task. DSSL achieves the state-of-the-art performance on CUHK-PEDES. To properly evaluate our proposed DSSL in the real scenarios, a Real Scenarios Text-based Person Reidentification (RSTPReid) dataset is constructed to benefit future research on text-based person retrieval, which will be publicly available.

Supplementary Material

MP4 File (MM_mfp1201_Final.mp4)
Presentation video.

References

[1]
Surbhi Aggarwal, Venkatesh Babu Radhakrishnan, and Anirban Chakraborty. 2020. Text-based person search via attribute-aided matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2617--2625.
[2]
Dapeng Chen, Hongsheng Li, Xihui Liu, Yantao Shen, Jing Shao, Zejian Yuan, and Xiaogang Wang. 2018a. Improving deep visual representation for person re-identification by global and local image-language association. In Proceedings of the European Conference on Computer Vision (ECCV). 54--70.
[3]
T. Chen, C. Xu, and J. Luo. 2018b. Improving Text-Based Person Search by Spatial Matching and Adaptive Threshold. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). 1879--1887.
[4]
Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2017. Vse+: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612 (2017).
[5]
Chenyang Gao, Guanyu Cai, Xinyang Jiang, Feng Zheng, Jun Zhang, Yifei Gong, Pai Peng, Xiaowei Guo, and Xing Sun. 2021. Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search. arXiv preprint arXiv:2101.03036 (2021).
[6]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770--778.
[7]
Ruibing Hou, Bingpeng Ma, Hong Chang, Xinqian Gu, Shiguang Shan, and Xilin Chen. 2019. Interaction-and-aggregation network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9317--9326.
[8]
Ya Jing, Chenyang Si, Junbo Wang, Wei Wang, Liang Wang, and Tieniu Tan. 2018. Pose-Guided Multi-Granularity Attention Network for Text-Based Person Search. arXiv preprint arXiv:1809.08440 (2018).
[9]
Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3128--3137.
[10]
Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV). 201--216.
[11]
Shuang Li, Tong Xiao, Hongsheng Li, Wei Yang, and Xiaogang Wang. 2017a. Identity-aware textual-visual matching with latent co-attention. In Proceedings of the IEEE International Conference on Computer Vision. 1890--1899.
[12]
Shuang Li, Tong Xiao, Hongsheng Li, Bolei Zhou, Dayu Yue, and Xiaogang Wang. 2017b. Person search with natural language description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1970--1979.
[13]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, 740--755.
[14]
Jiawei Liu, Zheng-Jun Zha, Richang Hong, Meng Wang, and Yongdong Zhang. 2019. Deep adversarial graph attention convolution network for text-based person search. In Proceedings of the 27th ACM International Conference on Multimedia. 665--673.
[15]
Yu Liu, Yanming Guo, Erwin M Bakker, and Michael S Lew. 2017. Learning a recurrent residual fusion network for multimodal matching. In Proceedings of the IEEE International Conference on Computer Vision. 4107--4116.
[16]
Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. 2017. Dual attention networks for multimodal reasoning and matching. In Proceedings of the IEEE conference on computer vision and pattern recognition. 299--307.
[17]
Ioannis A. Kakadiaris Nikolaos Sarafianos, Xiang Xu. 2019. Adversarial Representation Learning for Text-to-Image Matching, In ICCV. ICCV, 5813--5823.
[18]
Kai Niu, Yan Huang, Wanli Ouyang, and Liang Wang. 2020. Improving description-based person re-identification by multi-granularity image-text alignments. IEEE Transactions on Image Processing, Vol. 29 (2020), 5542--5556.
[19]
Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision. 2641--2649.
[20]
Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. 2016. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 49--58.
[21]
Changchang Sun, Xuemeng Song, Fuli Feng, Wayne Xin Zhao, Hao Zhang, and Liqiang Nie. 2019. Supervised hierarchical cross-modal hashing. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 725--734.
[22]
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning. 1096--1103.
[23]
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3156--3164.
[24]
Tan Wang, Xing Xu, Yang Yang, Alan Hanjalic, Heng Tao Shen, and Jingkuan Song. 2019. Matching images and text with multi-modal tensor fusion and re-ranking. In Proceedings of the 27th ACM international conference on multimedia. 12--20.
[25]
Zijie Wang, Aichun Zhu, Zhe Zheng, Jing Jin, Zhouxin Xue, and Gang Hua. 2020. IMG-Net: inner-cross-modal attentional multigranular network for description-based person re-identification. Journal of Electronic Imaging, Vol. 29, 4 (2020), 043028.
[26]
Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. 2018. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition. 79--88.
[27]
Bryan Ning Xia, Yuan Gong, Yizhe Zhang, and Christian Poellabauer. 2019. Second-order non-local attention networks for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision. 3760--3769.
[28]
Fei Yan and Krystian Mikolajczyk. 2015. Deep correlation for matching images and text. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3441--3450.
[29]
Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. 2014. Deep metric learning for person re-identification. In 2014 22nd International Conference on Pattern Recognition. IEEE, 34--39.
[30]
Kecheng Zheng, Wu Liu, Jiawei Liu, Zheng-Jun Zha, and Tao Mei. 2020 a. Hierarchical Gumbel Attention Network for Text-based Person Search. In Proceedings of the 28th ACM International Conference on Multimedia. 3441--3449.
[31]
Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, Mingliang Xu, and Yi-Dong Shen. 2020 b. Dual-Path Convolutional Image-Text Embeddings with Instance Loss. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Vol. 16, 2 (2020), 1--23.

Cited By

View all
  • (2025)Graph-based Consistent Reconstruction and Alignment for imbalanced text–image person re-identificationExpert Systems with Applications10.1016/j.eswa.2024.125429260(125429)Online publication date: Jan-2025
  • (2024)Person Re-Identification in Special Scenes Based on Deep Learning: A Comprehensive SurveyMathematics10.3390/math1216249512:16(2495)Online publication date: 13-Aug-2024
  • (2024)TVPR: Text-to-Video Person Retrieval and a New BenchmarkProceedings of the 32nd ACM International Conference on Multimedia10.1145/3664647.3681715(10105-10113)Online publication date: 28-Oct-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2021

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. cross-modal retrieval
  2. person retrieval
  3. surroundings-person separation
  4. text-based person re-identification

Qualifiers

  • Research-article

Funding Sources

Conference

MM '21
Sponsor:
MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)114
  • Downloads (Last 6 weeks)14
Reflects downloads up to 22 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2025)Graph-based Consistent Reconstruction and Alignment for imbalanced text–image person re-identificationExpert Systems with Applications10.1016/j.eswa.2024.125429260(125429)Online publication date: Jan-2025
  • (2024)Person Re-Identification in Special Scenes Based on Deep Learning: A Comprehensive SurveyMathematics10.3390/math1216249512:16(2495)Online publication date: 13-Aug-2024
  • (2024)TVPR: Text-to-Video Person Retrieval and a New BenchmarkProceedings of the 32nd ACM International Conference on Multimedia10.1145/3664647.3681715(10105-10113)Online publication date: 28-Oct-2024
  • (2024)Fine-grained Semantic Alignment with Transferred Person-SAM for Text-based Person RetrievalProceedings of the 32nd ACM International Conference on Multimedia10.1145/3664647.3681553(5432-5441)Online publication date: 28-Oct-2024
  • (2024)Selection and Reconstruction of Key Locals: A Novel Specific Domain Image-Text Retrieval MethodProceedings of the 32nd ACM International Conference on Multimedia10.1145/3664647.3681421(5653-5662)Online publication date: 28-Oct-2024
  • (2024)Accurate and Lightweight Learning for Specific Domain Image-Text RetrievalProceedings of the 32nd ACM International Conference on Multimedia10.1145/3664647.3681280(9719-9728)Online publication date: 28-Oct-2024
  • (2024)Fine-grained Semantics-aware Representation Learning for Text-based Person RetrievalProceedings of the 2024 International Conference on Multimedia Retrieval10.1145/3652583.3658054(92-100)Online publication date: 30-May-2024
  • (2024)Cross-Modal Attention Preservation with Self-Contrastive Learning for Composed Query-Based Image RetrievalACM Transactions on Multimedia Computing, Communications, and Applications10.1145/363946920:6(1-22)Online publication date: 8-Mar-2024
  • (2024)TextAug: Test Time Text Augmentation for Multimodal Person Re-Identification2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)10.1109/WACVW60836.2024.00040(320-329)Online publication date: 1-Jan-2024
  • (2024)Cross-Modal Adaptive Dual Association for Text-to-Image Person RetrievalIEEE Transactions on Multimedia10.1109/TMM.2024.335564426(6609-6620)Online publication date: 18-Jan-2024
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media