
Continual learning for cross-modal image-text retrieval based on domain-selective attention

Published: 25 June 2024

Abstract

Cross-modal image-text retrieval (CMITR) has been a high-value research topic for more than a decade. Most previous studies train the data for all tasks as a single set. In reality, however, a more likely scenario is that a dataset comprises multiple tasks that arrive and are trained in sequence. The consequence is a limited ability to remember old tasks once a new task arrives; in other words, catastrophic forgetting. To address this issue, this paper proposes a novel continual learning for cross-modal image-text retrieval (CLCMR) method that alleviates catastrophic forgetting. We construct a multilayer domain-selective attention (MDSA) based network that acquires knowledge at task-relevant and domain-specific attention levels. Moreover, a memory factor is designed to achieve weight regularization, and a novel memory loss function is used to constrain MDSA. Extensive experimental results on multiple datasets (the Wikipedia, Pascal Sentence, and PKU XMediaNet datasets) demonstrate that CLCMR effectively alleviates catastrophic forgetting and achieves superior continual learning ability compared with state-of-the-art methods.

Highlights

A novel continual learning for cross-modal image-text retrieval (CLCMR) method is proposed.
A multilayer domain-selective attention (MDSA) module selects parameters specific to particular previous tasks.
A memory factor is designed to achieve weight regularization.
A loss function called memory loss is devised to help optimize MDSA.
Extensive experimental results on multiple datasets demonstrate that CLCMR effectively alleviates catastrophic forgetting.
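
For context, the abstract describes two mechanisms: an attention module that selects task-relevant, domain-specific parameters at each layer, and a per-parameter memory factor that weights a regularization ("memory") loss so weights important to old tasks resist drift. The paper's exact formulations are not reproduced on this page, so the PyTorch sketch below is only a minimal illustration under assumptions: the sigmoid task gate, the `memory_factor` importance dictionary, and the EWC-style quadratic penalty are hypothetical stand-ins, not the authors' definitions.

```python
# Minimal sketch of the two mechanisms named in the abstract (assumptions:
# module design, shapes, and the quadratic penalty are illustrative, not
# the paper's exact MDSA or memory-loss formulation).
import torch
import torch.nn as nn

class DomainSelectiveAttention(nn.Module):
    """One attention level: a learned per-task gate that selects
    domain-specific units from a shared feature vector."""
    def __init__(self, num_tasks: int, dim: int):
        super().__init__()
        # One gate vector per task; sigmoid squashes it into [0, 1].
        self.task_embed = nn.Embedding(num_tasks, dim)

    def forward(self, features: torch.Tensor, task_id: int) -> torch.Tensor:
        idx = torch.tensor([task_id], device=features.device)
        gate = torch.sigmoid(self.task_embed(idx))  # (1, dim) soft mask
        return features * gate                      # keep task-relevant units

def memory_loss(model: nn.Module,
                old_params: dict,
                memory_factor: dict,
                base_loss: torch.Tensor,
                lam: float = 1.0) -> torch.Tensor:
    """Weight regularization: penalize drift from the old-task weights,
    scaled per parameter by a precomputed importance ("memory factor").
    The EWC-style quadratic form here is an assumption."""
    penalty = torch.zeros((), device=base_loss.device)
    for name, p in model.named_parameters():
        if name in old_params:
            penalty = penalty + (memory_factor[name]
                                 * (p - old_params[name]) ** 2).sum()
    return base_loss + lam * penalty

# Usage sketch: gate 512-d image features for task 0, then train with the
# retrieval loss plus the memory penalty computed from the previous task.
# att = DomainSelectiveAttention(num_tasks=3, dim=512)
# gated = att(image_features, task_id=0)
# loss = memory_loss(model, old_params, memory_factor, retrieval_loss)
```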


Cited By

  • (2024) Selection and Reconstruction of Key Locals: A Novel Specific Domain Image-Text Retrieval Method. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 5653–5662. https://doi.org/10.1145/3664647.3681421. Online publication date: 28-Oct-2024.
  • (2024) Accurate and Lightweight Learning for Specific Domain Image-Text Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 9719–9728. https://doi.org/10.1145/3664647.3681280. Online publication date: 28-Oct-2024.
  • (2024) Generalized Source-Free Domain-adaptive Segmentation via Reliable Knowledge Propagation. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 5967–5976. https://doi.org/10.1145/3664647.3680567. Online publication date: 28-Oct-2024.


Information

Published In

Pattern Recognition, Volume 149, Issue C
May 2024
904 pages

Publisher

Elsevier Science Inc.

United States

Publication History

Published: 25 June 2024

Author Tags

  1. Cross-modal retrieval
  2. Continual learning
  3. Attention
  4. Weight regularization

Qualifiers

  • Research-article


