VGSG: Vision-Guided Semantic-Group Network for Text-Based Person Search

Published: 05 December 2023

Abstract

Text-based Person Search (TBPS) aims to retrieve images of a target pedestrian indicated by a textual description. It is essential for TBPS to extract fine-grained local features and align them across modalities. Existing methods rely on external tools or heavy cross-modal interaction to achieve explicit alignment of cross-modal fine-grained features, which is inefficient and time-consuming. In this work, we propose a Vision-Guided Semantic-Group Network (VGSG) for text-based person search that extracts well-aligned fine-grained visual and textual features. VGSG comprises a Semantic-Group Textual Learning (SGTL) module and a Vision-Guided Knowledge Transfer (VGKT) module, which extract textual local features under the guidance of local visual cues. In SGTL, to obtain local textual representations, we group textual features along the channel dimension based on the semantic cues of the language expression, which encourages similar semantic patterns to be grouped implicitly without external tools. In VGKT, vision-guided attention is employed to extract vision-related textual features, which are inherently aligned with visual cues and are termed vision-guided textual features. Furthermore, we design a relational knowledge transfer, comprising a vision-language similarity transfer and a class-probability transfer, to adaptively propagate information from the vision-guided textual features to the semantic-group textual features. With the help of relational knowledge transfer, VGKT aligns semantic-group textual features with the corresponding visual features without external tools or complex pairwise interaction. Experimental results on two challenging benchmarks demonstrate the superiority of VGSG over state-of-the-art methods.
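To make the pipeline in the abstract concrete, the PyTorch sketch below illustrates the three pieces it names: channel-wise semantic grouping (SGTL), vision-guided attention (VGKT), and the relational knowledge transfer losses. Every class name, tensor shape, and detail beyond what the abstract states (the group count, the pooling, the temperature, the linear projections) is an assumption for illustration, not the authors' implementation.

```python
# Minimal sketch of the ideas in the abstract (assumed shapes and names;
# not the authors' released code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticGroupTextualLearning(nn.Module):
    """SGTL: group textual features along the channel dimension so that
    similar semantic patterns cluster implicitly, with no external parser."""

    def __init__(self, dim: int = 512, num_groups: int = 4):
        super().__init__()
        assert dim % num_groups == 0
        self.num_groups = num_groups
        self.proj = nn.Linear(dim, dim)  # mix channels before grouping
        self.heads = nn.ModuleList(
            [nn.Linear(dim // num_groups, dim) for _ in range(num_groups)]
        )  # lift each group back to the common embedding dim

    def forward(self, text: torch.Tensor) -> torch.Tensor:
        # text: (B, L, C) word-level features from a text encoder
        B, L, C = text.shape
        g = self.proj(text).view(B, L, self.num_groups, C // self.num_groups)
        g = g.max(dim=1).values  # (B, K, C/K): pool each group over words
        return torch.stack(
            [head(g[:, k]) for k, head in enumerate(self.heads)], dim=1
        )  # (B, K, C) semantic-group textual features


class VisionGuidedAttention(nn.Module):
    """VGKT front end: local visual features query the word sequence,
    yielding textual features inherently aligned with visual cues."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.scale = dim ** -0.5

    def forward(self, vis_local: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # vis_local: (B, K, C) local visual features; text: (B, L, C)
        attn = (self.q(vis_local) @ self.k(text).transpose(-2, -1)) * self.scale
        return attn.softmax(dim=-1) @ self.v(text)  # (B, K, C) vision-guided text


def relational_transfer_loss(student, teacher, vis, logits_s, logits_t, T=4.0):
    """Relational knowledge transfer: the semantic-group (student) features
    mimic the vision-guided (teacher) features' similarity distribution over
    images (vision-language similarity transfer) and their identity
    posteriors (class-probability transfer).
    student/teacher/vis: (N, C); logits_s/logits_t: (N, num_ids)."""
    sim_s = F.log_softmax(student @ vis.t() / T, dim=-1)
    sim_t = F.softmax(teacher.detach() @ vis.t() / T, dim=-1)  # teacher fixed
    l_sim = F.kl_div(sim_s, sim_t, reduction="batchmean")
    l_cls = F.kl_div(F.log_softmax(logits_s / T, dim=-1),
                     F.softmax(logits_t.detach() / T, dim=-1),
                     reduction="batchmean")
    return l_sim + l_cls
```

In a training loop these losses would be added to the usual retrieval objectives so that the semantic-group branch inherits the alignment of the vision-guided branch; the abstract suggests the vision-guided branch can then be bypassed at retrieval time, avoiding pairwise cross-modal interaction, though that reading is our inference.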

Cited By

• (2024) Fine-grained Semantic Alignment with Transferred Person-SAM for Text-based Person Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 5432–5441. DOI: 10.1145/3664647.3681553. Online publication date: 28 Oct 2024.
• (2024) Prototypical Prompting for Text-to-image Person Re-identification. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 2331–2340. DOI: 10.1145/3664647.3681165. Online publication date: 28 Oct 2024.
• (2024) An Overview of Text-Based Person Search: Recent Advances and Future Directions. IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 9, pp. 7803–7819. DOI: 10.1109/TCSVT.2024.3376373. Online publication date: 1 Sep 2024.

Information

Published In
IEEE Transactions on Image Processing, Volume 33, 2024, 5933 pages

Publisher
IEEE Press

Publication History
Published: 05 December 2023

Qualifiers
• Research-article
