A Hindi Image Caption Generation Framework Using Deep Learning

Published: 15 March 2021

Abstract

Image captioning is the task of generating a textual description of an image that captures its salient content. It is an important problem at the intersection of computer vision and natural language processing: computer vision is used to understand the image, and natural language processing is used for language modeling. While a large body of work exists on image captioning for English, in this article we develop a model for image captioning in Hindi. Hindi is the official language of India and the fourth most spoken language in the world, used across India and South Asia. To the best of our knowledge, this is the first attempt to generate image captions in Hindi. We manually create a dataset by translating the well-known MSCOCO dataset from English to Hindi. We then develop several attention-based architectures for Hindi image captioning; these attention mechanisms have not previously been applied to Hindi. The proposed model is compared with several baselines in terms of BLEU scores, and the results show that it outperforms them. Manual evaluation of the generated captions in terms of adequacy and fluency further confirms the effectiveness of the proposed approach.
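To make the attention mechanism concrete, here is a minimal PyTorch sketch of additive (Bahdanau-style) attention over CNN feature maps, a common building block in attention-based captioning models. The class name, dimensions, and wiring are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention: scores each spatial CNN
    feature against the decoder's hidden state and returns a
    weighted context vector. Dimensions are illustrative."""
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, features, hidden):
        # features: (batch, num_regions, feat_dim), e.g. 7x7 = 49 regions
        # hidden:   (batch, hidden_dim), decoder state at step t
        energy = torch.tanh(self.feat_proj(features)
                            + self.hidden_proj(hidden).unsqueeze(1))
        alpha = torch.softmax(self.score(energy).squeeze(-1), dim=1)  # (batch, num_regions)
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)         # (batch, feat_dim)
        return context, alpha
```

At each decoding step, the context vector is typically concatenated with the current word embedding and fed to the recurrent decoder; the weights alpha can also be rendered as a heatmap showing which image regions drove each generated Hindi word.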
Availability of resources: The code for this article is available at https://github.com/santosh1821cs03/Image_Captioning_Hindi_Language; the dataset will be made available at http://www.iitp.ac.in/~ai-nlp-ml/resources.html.
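As a hedged illustration of the BLEU-based evaluation mentioned in the abstract, the sketch below computes corpus-level BLEU-1 through BLEU-4 with NLTK. The tokenized captions are placeholder tokens standing in for tokenized Hindi words, not examples from the dataset.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One image: a list of tokenized reference captions plus one hypothesis.
# Tokens are placeholders standing in for tokenized Hindi words.
references = [[["a", "dog", "runs", "on", "the", "grass"],
               ["the", "dog", "is", "running", "outside"]]]
hypotheses = [["a", "dog", "is", "running", "on", "the", "grass"]]

smooth = SmoothingFunction().method1  # avoids zero n-gram precisions
for n in range(1, 5):
    # BLEU-n: uniform weights over the 1..n-gram precisions.
    weights = tuple(1.0 / n for _ in range(n))
    score = corpus_bleu(references, hypotheses,
                        weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```

Corpus-level BLEU aggregates n-gram counts over all test images before combining them, which is the standard way captioning results are reported.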




    Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 2
March 2021
313 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3454116

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 March 2021
    Accepted: 01 October 2020
    Revised: 01 July 2020
    Received: 01 September 2019
    Published in TALLIP Volume 20, Issue 2

    Author Tags

    1. Image captioning
    2. Hindi
    3. deep learning
    4. attention

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • SERB WOMEN IN EXCELLENCE AWARD 2018


    Cited By

    • (2024) A Deep Learning-Based Efficient Image Captioning Approach for Hindi Language. Developments Towards Next Generation Intelligent Systems for Sustainable Development, 225–246. DOI: 10.4018/979-8-3693-5643-2.ch009. Online: 5 Apr 2024.
    • (2024) Impact of Language-Specific Training on Image Caption Synthesis: A Case Study on Low-Resource Assamese Language. International Journal of Asian Language Processing. DOI: 10.1142/S2717554524500048. Online: 29 Jul 2024.
    • (2024) Domain-specific image captioning: a comprehensive review. International Journal of Multimedia Information Retrieval 13(2). DOI: 10.1007/s13735-024-00328-6. Online: 18 Apr 2024.
    • (2024) Feature Fusion and Multi-head Attention Based Hindi Captioner. Computer Vision and Image Processing, 479–487. DOI: 10.1007/978-3-031-58181-6_40. Online: 3 Jul 2024.
    • (2023) Red Deer Optimization with Artificial Intelligence Enabled Image Captioning System for Visually Impaired People. Computer Systems Science and Engineering 46(2), 1929–1945. DOI: 10.32604/csse.2023.035529.
    • (2023) Design of English pronunciation quality evaluation system based on the deep learning model. Applied Mathematics and Nonlinear Sciences 8(2), 2805–2816. DOI: 10.2478/amns.2023.1.00460. Online: 26 Jun 2023.
    • (2023) GAGPT-2: A Geometric Attention-based GPT-2 Framework for Image Captioning in Hindi. ACM Transactions on Asian and Low-Resource Language Information Processing 22(10), 1–16. DOI: 10.1145/3622936. Online: 13 Oct 2023.
    • (2023) Dynamic Convolution-based Encoder-Decoder Framework for Image Captioning in Hindi. ACM Transactions on Asian and Low-Resource Language Information Processing 22(4), 1–18. DOI: 10.1145/3573891. Online: 24 Mar 2023.
    • (2023) Scene Graph Semantic Inference for Image and Text Matching. ACM Transactions on Asian and Low-Resource Language Information Processing 22(5), 1–23. DOI: 10.1145/3563390. Online: 9 May 2023.
    • (2023) Image Caption Generation in Kannada using Deep Learning Frameworks. 2023 International Conference on Advances in Electronics, Communication, Computing and Intelligent Information Systems (ICAECIS), 486–491. DOI: 10.1109/ICAECIS58353.2023.10170312. Online: 19 Apr 2023.
