VGSG: Vision-Guided Semantic-Group Network for Text-Based Person Search

Published: 05 December 2023

Abstract

Text-based Person Search (TBPS) aims to retrieve images of a target pedestrian indicated by a textual description. It is essential for TBPS to extract fine-grained local features and align them across modalities. Existing methods rely on external tools or heavy cross-modal interaction to achieve explicit alignment of cross-modal fine-grained features, which is inefficient and time-consuming. In this work, we propose a Vision-Guided Semantic-Group Network (VGSG) for text-based person search that extracts well-aligned fine-grained visual and textual features. VGSG comprises a Semantic-Group Textual Learning (SGTL) module and a Vision-Guided Knowledge Transfer (VGKT) module, which extract textual local features under the guidance of local visual cues. In SGTL, to obtain local textual representations, we group textual features along the channel dimension based on the semantic cues of the language expression, which encourages similar semantic patterns to be grouped implicitly without external tools. In VGKT, vision-guided attention is employed to extract vision-related textual features, which are inherently aligned with visual cues and are termed vision-guided textual features. Furthermore, we design a relational knowledge transfer, comprising a vision-language similarity transfer and a class-probability transfer, to adaptively propagate information from the vision-guided textual features to the semantic-group textual features. With the help of relational knowledge transfer, VGKT aligns semantic-group textual features with the corresponding visual features without external tools or complex pairwise interaction. Experimental results on two challenging benchmarks demonstrate the superiority of VGSG over state-of-the-art methods.
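To make the pipeline in the abstract concrete, the PyTorch sketch below illustrates the three pieces it names: channel-wise semantic grouping (SGTL), vision-guided attention (VGKT), and the relational knowledge transfer losses. Every class name, tensor shape, and detail beyond what the abstract states (the group count, the pooling, the temperature, the linear projections) is an assumption for illustration, not the authors' implementation.

```python
# Minimal sketch of the ideas in the abstract (assumed shapes and names;
# not the authors' released code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticGroupTextualLearning(nn.Module):
    """SGTL: group textual features along the channel dimension so that
    similar semantic patterns cluster implicitly, with no external parser."""

    def __init__(self, dim: int = 512, num_groups: int = 4):
        super().__init__()
        assert dim % num_groups == 0
        self.num_groups = num_groups
        self.proj = nn.Linear(dim, dim)  # mix channels before grouping
        self.heads = nn.ModuleList(
            [nn.Linear(dim // num_groups, dim) for _ in range(num_groups)]
        )  # lift each group back to the common embedding dim

    def forward(self, text: torch.Tensor) -> torch.Tensor:
        # text: (B, L, C) word-level features from a text encoder
        B, L, C = text.shape
        g = self.proj(text).view(B, L, self.num_groups, C // self.num_groups)
        g = g.max(dim=1).values  # (B, K, C/K): pool each group over words
        return torch.stack(
            [head(g[:, k]) for k, head in enumerate(self.heads)], dim=1
        )  # (B, K, C) semantic-group textual features


class VisionGuidedAttention(nn.Module):
    """VGKT front end: local visual features query the word sequence,
    yielding textual features inherently aligned with visual cues."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.scale = dim ** -0.5

    def forward(self, vis_local: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # vis_local: (B, K, C) local visual features; text: (B, L, C)
        attn = (self.q(vis_local) @ self.k(text).transpose(-2, -1)) * self.scale
        return attn.softmax(dim=-1) @ self.v(text)  # (B, K, C) vision-guided text


def relational_transfer_loss(student, teacher, vis, logits_s, logits_t, T=4.0):
    """Relational knowledge transfer: the semantic-group (student) features
    mimic the vision-guided (teacher) features' similarity distribution over
    images (vision-language similarity transfer) and their identity
    posteriors (class-probability transfer).
    student/teacher/vis: (N, C); logits_s/logits_t: (N, num_ids)."""
    sim_s = F.log_softmax(student @ vis.t() / T, dim=-1)
    sim_t = F.softmax(teacher.detach() @ vis.t() / T, dim=-1)  # teacher fixed
    l_sim = F.kl_div(sim_s, sim_t, reduction="batchmean")
    l_cls = F.kl_div(F.log_softmax(logits_s / T, dim=-1),
                     F.softmax(logits_t.detach() / T, dim=-1),
                     reduction="batchmean")
    return l_sim + l_cls
```

In a training loop these losses would be added to the usual retrieval objectives so that the semantic-group branch inherits the alignment of the vision-guided branch; the abstract suggests the vision-guided branch can then be bypassed at retrieval time, avoiding pairwise cross-modal interaction, though that reading is our inference.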

Cited By

• (2024) Fine-grained Semantic Alignment with Transferred Person-SAM for Text-based Person Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 5432–5441. DOI: 10.1145/3664647.3681553. Online publication date: 28 Oct 2024.
• (2024) Prototypical Prompting for Text-to-image Person Re-identification. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 2331–2340. DOI: 10.1145/3664647.3681165. Online publication date: 28 Oct 2024.
• (2024) An Overview of Text-Based Person Search: Recent Advances and Future Directions. IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 9, pp. 7803–7819. DOI: 10.1109/TCSVT.2024.3376373. Online publication date: 1 Sep 2024.

Information

Published In
IEEE Transactions on Image Processing, Volume 33, 2024, 5933 pages

Publisher
IEEE Press

Publication History
Published: 05 December 2023

Qualifiers
• Research-article
