A comprehensive survey on deep-learning-based visual captioning

Published: 21 September 2023

Abstract

Generating a description for an image or video is termed the visual captioning task. It requires a model to capture the semantic information of the visual content and translate it into syntactically and semantically correct human language. Connecting the research communities of computer vision (CV) and natural language processing (NLP), visual captioning poses the major challenge of bridging the gap between low-level visual features and high-level language information. Thanks to recent advances in deep learning, which have been widely applied to visual and language modeling, visual captioning methods based on deep neural networks have demonstrated state-of-the-art performance. In this paper, we present a comprehensive survey of existing deep-learning-based visual captioning methods. Based on the mechanisms and techniques adopted to narrow the semantic gap, we divide visual captioning methods into several groups. Representative categories in each group are summarized, and their strengths and limitations are discussed. Quantitative evaluations of state-of-the-art approaches on popular benchmark datasets are also presented and analyzed. Finally, we discuss future research directions.
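
To make the encoder-decoder paradigm described above concrete, the following is a minimal, self-contained Python/PyTorch sketch of a toy captioner: a small convolutional encoder produces a global visual feature, and an LSTM decoder generates the caption with teacher forcing. It is illustrative only; all module names and hyperparameters are hypothetical, and the methods surveyed here rely on pretrained CNN or Transformer encoders, attention mechanisms, and beam-search decoding rather than this simplified setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCaptioner(nn.Module):
    """Hypothetical toy model: small CNN encoder + LSTM language decoder."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Visual encoder: maps an image to a single global feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
        # Language decoder: an LSTM conditioned on the image feature.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Teacher forcing: the image feature is fed as the first decoder input,
        # followed by the ground-truth words shifted by one position.
        feat = self.encoder(images).unsqueeze(1)      # (B, 1, E)
        words = self.embed(captions[:, :-1])          # (B, T-1, E)
        hidden, _ = self.lstm(torch.cat([feat, words], dim=1))
        return self.out(hidden)                       # (B, T, vocab_size)

# Usage: cross-entropy between the predicted logits and the reference tokens.
model = ToyCaptioner(vocab_size=1000)
images = torch.randn(2, 3, 64, 64)                   # dummy image batch
captions = torch.randint(0, 1000, (2, 12))           # dummy token ids (<bos> ... <eos>)
logits = model(images, captions)
loss = F.cross_entropy(logits.reshape(-1, 1000), captions.reshape(-1))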

          Published In

          Multimedia Systems, Volume 29, Issue 6
          Dec 2023
          800 pages

          Publisher

          Springer-Verlag

          Berlin, Heidelberg

          Publication History

          Published: 21 September 2023
          Accepted: 24 August 2023
          Received: 15 April 2023

          Author Tags

          1. Visual captioning
          2. Deep learning
          3. Survey

          Qualifiers

          • Research-article
