
DOI: 10.1145/3343031.3350972
Research article · Open access

Generating Captions for Images of Ancient Artworks

Published: 15 October 2019

Abstract

The neural encoder-decoder framework is widely adopted for captioning natural images, but few works have applied this scheme to generating captions for cultural images. In this paper, we propose an artwork-type-enriched image captioning model in which the encoder represents an input artwork image as a 512-dimensional vector and the decoder generates a corresponding caption from that vector. The artwork type is first predicted by a convolutional neural network classifier and then merged into the decoder. We investigate multiple approaches to integrating the artwork type into the captioning model, among them one that applies a step-wise weighted sum of the artwork type vector and the hidden representation vector of the decoder. This model outperforms three baseline image captioning models on a Chinese art image captioning dataset across all evaluation metrics; one of the baselines is a state-of-the-art approach that fuses textual image attributes into the captioning model for natural images. The proposed model also obtains promising results on a second dataset of Egyptian art images.
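
To picture the mechanism the abstract describes, the sketch below blends a projected artwork-type vector into the decoder hidden state with a learned gate at every decoding step. This is a minimal PyTorch sketch under stated assumptions: the class name, the sigmoid gate, the tanh projection, and all dimensions other than the 512-dimensional image vector are illustrative choices, not the authors' exact formulation.

    import torch
    import torch.nn as nn

    class TypeEnrichedDecoder(nn.Module):
        """Caption decoder that merges a predicted artwork-type vector into
        the LSTM hidden state via a step-wise weighted sum (hypothetical
        reconstruction; names and gating are illustrative assumptions)."""

        def __init__(self, vocab_size, num_types, embed_dim=512, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
            # Project the classifier's type distribution to the hidden size.
            self.type_proj = nn.Linear(num_types, hidden_dim)
            # Scalar gate: how much type information to blend in at each step.
            self.gate = nn.Linear(hidden_dim * 2, 1)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, image_vec, type_probs, captions):
            # image_vec: (B, 512) encoder output; assumes hidden_dim == 512
            # so the image vector can initialize the hidden state directly.
            h, c = image_vec, torch.zeros_like(image_vec)
            t_vec = torch.tanh(self.type_proj(type_probs))  # (B, hidden_dim)
            logits = []
            for step in range(captions.size(1)):  # teacher forcing
                w = self.embed(captions[:, step])
                h, c = self.lstm(w, (h, c))
                # Step-wise weighted sum of hidden state and type vector.
                alpha = torch.sigmoid(self.gate(torch.cat([h, t_vec], dim=-1)))
                h = alpha * h + (1 - alpha) * t_vec
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)  # (B, T, vocab_size)

At inference time the same blend would be applied while feeding back the previously generated token in place of the ground-truth caption word.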






Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. artwork type
  2. image captioning
  3. neural encoder-decoder


Conference

MM '19

Acceptance Rates

MM '19 paper acceptance rate: 252 of 936 submissions (27%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)




Article Metrics

  • Downloads (last 12 months): 237
  • Downloads (last 6 weeks): 28
Reflects downloads up to 21 Nov 2024


Cited By

  • (2024) GalleryGPT: Analyzing Paintings with Large Multimodal Models. Proceedings of the 32nd ACM International Conference on Multimedia, 7734-7743. DOI: 10.1145/3664647.3681656. Online publication date: 28-Oct-2024.
  • (2024) Knowledge-Aware Artifact Image Synthesis with LLM-Enhanced Prompting and Multi-Source Supervision. Proceedings of the 32nd ACM International Conference on Multimedia, 2719-2728. DOI: 10.1145/3664647.3681533. Online publication date: 28-Oct-2024.
  • (2024) CrePoster. Expert Systems with Applications: An International Journal 245:C. DOI: 10.1016/j.eswa.2024.123136. Online publication date: 2-Jul-2024.
  • (2024) Exploring the Synergy Between Vision-Language Pretraining and ChatGPT for Artwork Captioning: A Preliminary Study. Image Analysis and Processing - ICIAP 2023 Workshops, 309-321. DOI: 10.1007/978-3-031-51026-7_27. Online publication date: 21-Jan-2024.
  • (2023) Feature fusion via multi-target learning for ancient artwork captioning. Information Fusion 97, article 101811. DOI: 10.1016/j.inffus.2023.101811. Online publication date: Sep-2023.
  • (2023) A comprehensive survey on object detection in Visual Art: taxonomy and challenge. Multimedia Tools and Applications 83:5, 14637-14670. DOI: 10.1007/s11042-023-15968-9. Online publication date: 3-Jul-2023.
  • (2023) Connecting national flags – a deep learning approach. Multimedia Tools and Applications 82:25, 39435-39457. DOI: 10.1007/s11042-023-15056-y. Online publication date: 29-Mar-2023.
  • (2023) Image captioning for cultural artworks: a case study on ceramics. Multimedia Systems 29:6, 3223-3243. DOI: 10.1007/s00530-023-01178-8. Online publication date: 23-Sep-2023.
  • (2023) Automatic Analysis of Human Body Representations in Western Art. Computer Vision – ECCV 2022 Workshops, 282-297. DOI: 10.1007/978-3-031-25056-9_19. Online publication date: 15-Feb-2023.
  • (2022) Semantic interdisciplinary evaluation of image captioning models. Cogent Engineering 9:1. DOI: 10.1080/23311916.2022.2104333. Online publication date: 21-Aug-2022.
