A Hindi Image Caption Generation Framework Using Deep Learning

Published: 15 March 2021

Abstract

Image captioning is the task of generating a textual description of an image that captures its salient content. It is an important problem at the intersection of computer vision and natural language processing: computer vision is used to understand the image, and natural language processing is used for language modeling. While a large body of work exists on image captioning for English, in this article we develop a model for image captioning in Hindi. Hindi is the official language of India and the fourth most spoken language in the world, used across India and South Asia. To the best of our knowledge, this is the first attempt to generate image captions in Hindi. We manually create a dataset by translating the well-known MSCOCO dataset from English to Hindi. We then develop several attention-based architectures for Hindi image captioning; these attention mechanisms have not previously been applied to Hindi. The proposed model is compared with several baselines in terms of BLEU scores, and the results show that it outperforms them. Manual evaluation of the generated captions in terms of adequacy and fluency further confirms the effectiveness of the proposed approach.
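To make the attention mechanism concrete, here is a minimal PyTorch sketch of additive (Bahdanau-style) attention over CNN feature maps, a common building block in attention-based captioning models. The class name, dimensions, and wiring are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention: scores each spatial CNN
    feature against the decoder's hidden state and returns a
    weighted context vector. Dimensions are illustrative."""
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, features, hidden):
        # features: (batch, num_regions, feat_dim), e.g. 7x7 = 49 regions
        # hidden:   (batch, hidden_dim), decoder state at step t
        energy = torch.tanh(self.feat_proj(features)
                            + self.hidden_proj(hidden).unsqueeze(1))
        alpha = torch.softmax(self.score(energy).squeeze(-1), dim=1)  # (batch, num_regions)
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)         # (batch, feat_dim)
        return context, alpha
```

At each decoding step, the context vector is typically concatenated with the current word embedding and fed to the recurrent decoder; the weights alpha can also be rendered as a heatmap showing which image regions drove each generated Hindi word.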
Availability of resources: The code for this article is available at https://github.com/santosh1821cs03/Image_Captioning_Hindi_Language; the dataset will be made available at http://www.iitp.ac.in/~ai-nlp-ml/resources.html.
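As a hedged illustration of the BLEU-based evaluation mentioned in the abstract, the sketch below computes corpus-level BLEU-1 through BLEU-4 with NLTK. The tokenized captions are placeholder tokens standing in for tokenized Hindi words, not examples from the dataset.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One image: a list of tokenized reference captions plus one hypothesis.
# Tokens are placeholders standing in for tokenized Hindi words.
references = [[["a", "dog", "runs", "on", "the", "grass"],
               ["the", "dog", "is", "running", "outside"]]]
hypotheses = [["a", "dog", "is", "running", "on", "the", "grass"]]

smooth = SmoothingFunction().method1  # avoids zero n-gram precisions
for n in range(1, 5):
    # BLEU-n: uniform weights over the 1..n-gram precisions.
    weights = tuple(1.0 / n for _ in range(n))
    score = corpus_bleu(references, hypotheses,
                        weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```

Corpus-level BLEU aggregates n-gram counts over all test images before combining them, which is the standard way captioning results are reported.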




    Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 2
March 2021
313 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3454116

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 March 2021
    Accepted: 01 October 2020
    Revised: 01 July 2020
    Received: 01 September 2019
    Published in TALLIP Volume 20, Issue 2

    Author Tags

    1. Image captioning
    2. Hindi
    3. deep learning
    4. attention

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • SERB WOMEN IN EXCELLENCE AWARD 2018


    Cited By

    • (2024) A Deep Learning-Based Efficient Image Captioning Approach for Hindi Language. Developments Towards Next Generation Intelligent Systems for Sustainable Development, 225–246. DOI: 10.4018/979-8-3693-5643-2.ch009. Online: 5 Apr 2024.
    • (2024) Impact of Language-Specific Training on Image Caption Synthesis: A Case Study on Low-Resource Assamese Language. International Journal of Asian Language Processing. DOI: 10.1142/S2717554524500048. Online: 29 Jul 2024.
    • (2024) Domain-specific image captioning: a comprehensive review. International Journal of Multimedia Information Retrieval 13(2). DOI: 10.1007/s13735-024-00328-6. Online: 18 Apr 2024.
    • (2024) Feature Fusion and Multi-head Attention Based Hindi Captioner. Computer Vision and Image Processing, 479–487. DOI: 10.1007/978-3-031-58181-6_40. Online: 3 Jul 2024.
    • (2023) Red Deer Optimization with Artificial Intelligence Enabled Image Captioning System for Visually Impaired People. Computer Systems Science and Engineering 46(2), 1929–1945. DOI: 10.32604/csse.2023.035529.
    • (2023) Design of English pronunciation quality evaluation system based on the deep learning model. Applied Mathematics and Nonlinear Sciences 8(2), 2805–2816. DOI: 10.2478/amns.2023.1.00460. Online: 26 Jun 2023.
    • (2023) GAGPT-2: A Geometric Attention-based GPT-2 Framework for Image Captioning in Hindi. ACM Transactions on Asian and Low-Resource Language Information Processing 22(10), 1–16. DOI: 10.1145/3622936. Online: 13 Oct 2023.
    • (2023) Dynamic Convolution-based Encoder-Decoder Framework for Image Captioning in Hindi. ACM Transactions on Asian and Low-Resource Language Information Processing 22(4), 1–18. DOI: 10.1145/3573891. Online: 24 Mar 2023.
    • (2023) Scene Graph Semantic Inference for Image and Text Matching. ACM Transactions on Asian and Low-Resource Language Information Processing 22(5), 1–23. DOI: 10.1145/3563390. Online: 9 May 2023.
    • (2023) Image Caption Generation in Kannada using Deep Learning Frameworks. 2023 International Conference on Advances in Electronics, Communication, Computing and Intelligent Information Systems (ICAECIS), 486–491. DOI: 10.1109/ICAECIS58353.2023.10170312. Online: 19 Apr 2023.
