short-paper

Utilizing Resource-Rich Language Datasets for End-to-End Scene Text Recognition in Resource-Poor Languages

Authors:

Shota Orihashi,

Yoshihiro Yamazaki,

Naoki Makishima,

Akihiko Takashima,

Tomohiro Tanaka,

Ryo Masumura

MMAsia '21: Proceedings of the 3rd ACM International Conference on Multimedia in Asia

Article No.: 41, Pages 1 - 5

https://doi.org/10.1145/3469877.3490571

Published: 10 January 2022

Abstract

This paper presents a novel training method for end-to-end scene text recognition. End-to-end scene text recognition offers high recognition accuracy, especially when using the encoder-decoder model based on Transformer. To train a highly accurate end-to-end model, we need to prepare a large image-to-text paired dataset for the target language. However, it is difficult to collect this data, especially for resource-poor languages. To overcome this difficulty, our proposed method utilizes well-prepared large datasets in resource-rich languages such as English, to train the resource-poor encoder-decoder model. Our key idea is to build a model in which the encoder reflects knowledge of multiple languages while the decoder specializes in knowledge of just the resource-poor language. To this end, the proposed method pre-trains the encoder by using a multilingual dataset that combines the resource-poor language’s dataset and the resource-rich language’s dataset to learn language-invariant knowledge for scene text recognition. The proposed method also pre-trains the decoder by using the resource-poor language’s dataset to make the decoder better suited to the resource-poor language. Experiments on Japanese scene text recognition using a small, publicly available dataset demonstrate the effectiveness of the proposed method.
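
The data assignment described in the abstract can be sketched as a staged training schedule. This is a minimal illustration, not the authors' implementation: the names `Sample`, `Stage`, and `build_schedule`, the file names, and the toy transcripts are all hypothetical, and the final fine-tuning stage is an assumption, since the abstract only states how the encoder and decoder are pre-trained.

```python
from dataclasses import dataclass
from typing import List, Tuple

# One training pair: (image identifier, transcript, language tag).
Sample = Tuple[str, str, str]


@dataclass(frozen=True)
class Stage:
    name: str
    updates: Tuple[str, ...]  # model components trained in this stage
    data: Tuple[Sample, ...]  # pairs the stage draws from


def build_schedule(poor: List[Sample], rich: List[Sample]) -> List[Stage]:
    """Return which data each training stage uses (hypothetical helper)."""
    # Multilingual set: resource-poor and resource-rich pairs combined.
    multilingual = tuple(poor) + tuple(rich)
    return [
        # Encoder pre-training sees both languages, so it can learn
        # language-invariant knowledge for scene text recognition.
        Stage("encoder_pretrain", ("encoder",), multilingual),
        # Decoder pre-training sees only the resource-poor language.
        Stage("decoder_pretrain", ("decoder",), tuple(poor)),
        # Assumption (not stated in the abstract): a final joint pass
        # on the resource-poor dataset.
        Stage("finetune", ("encoder", "decoder"), tuple(poor)),
    ]


# Toy data standing in for a small Japanese set and a larger English set.
japanese = [("ja_0001.png", "出口", "ja"), ("ja_0002.png", "営業中", "ja")]
english = [
    ("en_0001.png", "EXIT", "en"),
    ("en_0002.png", "OPEN", "en"),
    ("en_0003.png", "SALE", "en"),
]

schedule = build_schedule(japanese, english)
for stage in schedule:
    languages = sorted({lang for _, _, lang in stage.data})
    print(stage.name, stage.updates, len(stage.data), languages)
```

The point of the split is that only the encoder ever sees resource-rich samples, so the decoder's output vocabulary and language knowledge stay specific to the target language.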

Cited By

H. Zhang, E. Whittaker, and I. Kitagishi. 2023. Extending TrOCR for Text Localization-Free OCR of Full-Page Scanned Receipt Images. In 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). 1471–1477. Online publication date: 2-Oct-2023.
https://doi.org/10.1109/ICCVW60793.2023.00160

Index Terms

Utilizing Resource-Rich Language Datasets for End-to-End Scene Text Recognition in Resource-Poor Languages
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
    2. Natural language processing
  2. Machine learning

Index terms have been assigned to the content through auto-classification.

Information & Contributors

Information

Published In

MMAsia '21: Proceedings of the 3rd ACM International Conference on Multimedia in Asia

December 2021

508 pages

ISBN: 9781450386074

DOI: 10.1145/3469877

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper
Research
Refereed limited

Conference

MMAsia '21

Sponsor:

SIGMM

MMAsia '21: ACM Multimedia Asia

December 1 - 3, 2021

Gold Coast, Australia

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%

Bibliometrics

Article Metrics

Total Citations: 1
Total Downloads: 55
Downloads (Last 12 months): 6
Downloads (Last 6 weeks): 0

Reflects downloads up to 13 Nov 2024
