Abstract
Weakly supervised text-based person re-identification aims to identify a target person from textual descriptions when identity annotations are not available during training. Previous methods cluster images and texts simultaneously to generate pseudo identity labels. However, we observe that the number of text clusters is significantly smaller than the number of true identities, whereas the number of image clusters is closer to it. This mismatch leads to uncertain pseudo identity labels. To address this issue, we propose Image-Centered Pseudo Label Generation (ICPG), which generates pseudo labels for both images and texts directly from the image clustering results. First, we introduce a cross-modal distribution matching loss that minimizes the KL divergence between the image-text similarity distribution and the normalized pseudo-label matching distribution. Second, to enhance cross-modal associations, we propose a cross-modal hard sample mining method that explores challenging cross-modal examples. Experimental results demonstrate the effectiveness of the proposed methods: compared with the state-of-the-art method, our approach improves rank-1 accuracy by 3.6%, 2.4% and 3.0% on three datasets, respectively.
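The distribution matching idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that each text inherits the cluster label of its paired image, that the similarity distribution is a temperature-scaled softmax over cosine similarities, and that the KL divergence is taken from the similarity distribution to the label distribution in both retrieval directions. The function name, the temperature value and the epsilon constant are hypothetical choices made only for illustration.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def distribution_matching_loss(img_feats, txt_feats, pseudo_labels,
                               temperature=0.02, eps=1e-8):
    """KL(similarity distribution || normalized pseudo-label matching).

    img_feats, txt_feats: (B, D) arrays; row i of each is one image-text pair.
    pseudo_labels: (B,) integer cluster ids from image clustering; text i is
    assumed to inherit the label of image i (an assumption of this sketch).
    """
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) scaled cosine similarities

    # match[i, j] = 1 when samples i and j share a pseudo label.
    match = (pseudo_labels[:, None] == pseudo_labels[None, :]).astype(np.float64)
    q = match / match.sum(axis=1, keepdims=True)  # each row sums to 1

    def kl(scores):
        log_p = log_softmax(scores, axis=1)
        p = np.exp(log_p)
        return (p * (log_p - np.log(q + eps))).sum(axis=1).mean()

    # Image-to-text plus text-to-image; match is symmetric, so q is reused.
    return kl(logits) + kl(logits.T)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    texts = rng.normal(size=(8, 16))
    labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
    print(distribution_matching_loss(images, texts, labels))
```

Under these assumptions the loss is close to zero when texts are most similar to exactly the images that share their pseudo label, and it grows when similarity mass falls on pairs with different labels. Samples that a clustering algorithm marks as outliers would need separate handling, which this sketch does not cover.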
Acknowledgement
This work was partially supported by the China Postdoctoral Science Foundation (2023M741305) and the Fundamental Research Funds for the Central Universities (CCNU23XJ001).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Nie, W., Wang, C., Sun, H., Xie, W. (2025). Image-Centered Pseudo Label Generation for Weakly Supervised Text-Based Person Re-Identification. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15042. Springer, Singapore. https://doi.org/10.1007/978-981-97-8858-3_33
DOI: https://doi.org/10.1007/978-981-97-8858-3_33
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8857-6
Online ISBN: 978-981-97-8858-3
eBook Packages: Computer Science, Computer Science (R0)