
DOI: 10.1145/3343031.3350972
Research article · Open access

Generating Captions for Images of Ancient Artworks

Published: 15 October 2019

Abstract

The neural encoder-decoder framework is widely adopted for captioning natural images, but few works have applied this scheme to generating captions for cultural images. In this paper, we propose an artwork-type-enriched image captioning model in which the encoder represents an input artwork image as a 512-dimensional vector and the decoder generates a corresponding caption from that vector. The artwork type is first predicted by a convolutional neural network classifier and then merged into the decoder. We investigate multiple approaches to integrating the artwork type into the captioning model, among them one that applies a step-wise weighted sum of the artwork type vector and the hidden representation vector of the decoder. This model outperforms three baseline image captioning models on a Chinese art image captioning dataset across all evaluation metrics; one of the baselines is a state-of-the-art approach that fuses textual image attributes into the captioning model for natural images. The proposed model also obtains promising results on a second dataset of Egyptian art images.
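
To picture the mechanism the abstract describes, the sketch below blends a projected artwork-type vector into the decoder hidden state with a learned gate at every decoding step. This is a minimal PyTorch sketch under stated assumptions: the class name, the sigmoid gate, the tanh projection, and all dimensions other than the 512-dimensional image vector are illustrative choices, not the authors' exact formulation.

    import torch
    import torch.nn as nn

    class TypeEnrichedDecoder(nn.Module):
        """Caption decoder that merges a predicted artwork-type vector into
        the LSTM hidden state via a step-wise weighted sum (hypothetical
        reconstruction; names and gating are illustrative assumptions)."""

        def __init__(self, vocab_size, num_types, embed_dim=512, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
            # Project the classifier's type distribution to the hidden size.
            self.type_proj = nn.Linear(num_types, hidden_dim)
            # Scalar gate: how much type information to blend in at each step.
            self.gate = nn.Linear(hidden_dim * 2, 1)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, image_vec, type_probs, captions):
            # image_vec: (B, 512) encoder output; assumes hidden_dim == 512
            # so the image vector can initialize the hidden state directly.
            h, c = image_vec, torch.zeros_like(image_vec)
            t_vec = torch.tanh(self.type_proj(type_probs))  # (B, hidden_dim)
            logits = []
            for step in range(captions.size(1)):  # teacher forcing
                w = self.embed(captions[:, step])
                h, c = self.lstm(w, (h, c))
                # Step-wise weighted sum of hidden state and type vector.
                alpha = torch.sigmoid(self.gate(torch.cat([h, t_vec], dim=-1)))
                h = alpha * h + (1 - alpha) * t_vec
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)  # (B, T, vocab_size)

At inference time the same blend would be applied while feeding back the previously generated token in place of the ground-truth caption word.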






Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. artwork type
  2. image captioning
  3. neural encoder-decoder


Conference

MM '19

Acceptance Rates

MM '19 paper acceptance rate: 252 of 936 submissions (27%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)




Article Metrics

  • Downloads (last 12 months): 237
  • Downloads (last 6 weeks): 28
Reflects downloads up to 21 Nov 2024


Cited By

  • (2024) GalleryGPT: Analyzing Paintings with Large Multimodal Models. Proceedings of the 32nd ACM International Conference on Multimedia, 7734-7743. DOI: 10.1145/3664647.3681656. Online publication date: 28-Oct-2024.
  • (2024) Knowledge-Aware Artifact Image Synthesis with LLM-Enhanced Prompting and Multi-Source Supervision. Proceedings of the 32nd ACM International Conference on Multimedia, 2719-2728. DOI: 10.1145/3664647.3681533. Online publication date: 28-Oct-2024.
  • (2024) CrePoster. Expert Systems with Applications: An International Journal 245:C. DOI: 10.1016/j.eswa.2024.123136. Online publication date: 2-Jul-2024.
  • (2024) Exploring the Synergy Between Vision-Language Pretraining and ChatGPT for Artwork Captioning: A Preliminary Study. Image Analysis and Processing - ICIAP 2023 Workshops, 309-321. DOI: 10.1007/978-3-031-51026-7_27. Online publication date: 21-Jan-2024.
  • (2023) Feature fusion via multi-target learning for ancient artwork captioning. Information Fusion 97, article 101811. DOI: 10.1016/j.inffus.2023.101811. Online publication date: Sep-2023.
  • (2023) A comprehensive survey on object detection in Visual Art: taxonomy and challenge. Multimedia Tools and Applications 83:5, 14637-14670. DOI: 10.1007/s11042-023-15968-9. Online publication date: 3-Jul-2023.
  • (2023) Connecting national flags – a deep learning approach. Multimedia Tools and Applications 82:25, 39435-39457. DOI: 10.1007/s11042-023-15056-y. Online publication date: 29-Mar-2023.
  • (2023) Image captioning for cultural artworks: a case study on ceramics. Multimedia Systems 29:6, 3223-3243. DOI: 10.1007/s00530-023-01178-8. Online publication date: 23-Sep-2023.
  • (2023) Automatic Analysis of Human Body Representations in Western Art. Computer Vision – ECCV 2022 Workshops, 282-297. DOI: 10.1007/978-3-031-25056-9_19. Online publication date: 15-Feb-2023.
  • (2022) Semantic interdisciplinary evaluation of image captioning models. Cogent Engineering 9:1. DOI: 10.1080/23311916.2022.2104333. Online publication date: 21-Aug-2022.
