DOI: 10.1145/3469877.3490571
short-paper

Utilizing Resource-Rich Language Datasets for End-to-End Scene Text Recognition in Resource-Poor Languages

Published: 10 January 2022

Abstract

This paper presents a novel training method for end-to-end scene text recognition. End-to-end scene text recognition offers high recognition accuracy, especially when using a Transformer-based encoder-decoder model. Training a highly accurate end-to-end model requires a large image-to-text paired dataset for the target language. However, such data is difficult to collect, especially for resource-poor languages. To overcome this difficulty, the proposed method uses large, well-prepared datasets in resource-rich languages, such as English, to train the encoder-decoder model for the resource-poor language. The key idea is to build a model in which the encoder reflects knowledge of multiple languages while the decoder specializes in knowledge of just the resource-poor language. To this end, the proposed method pre-trains the encoder on a multilingual dataset that combines the resource-poor language’s dataset and the resource-rich language’s dataset, so that it learns language-invariant knowledge for scene text recognition. The proposed method also pre-trains the decoder on the resource-poor language’s dataset alone, to make the decoder better suited to that language. Experiments on Japanese scene text recognition using a small, publicly available dataset demonstrate the effectiveness of the proposed method.
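The data split described in the abstract can be illustrated with a minimal sketch. It only shows which dataset feeds which training stage and is not the authors' implementation. The `Stage` class, the `build_schedule` function, and the sample file names are hypothetical. The final fine-tuning stage on the resource-poor data is an assumption, since the abstract names only the two pre-training steps.

```python
from dataclasses import dataclass
from typing import List, Tuple

# One training sample: (image path, transcription).
# The file names used below are hypothetical placeholders.
Sample = Tuple[str, str]


@dataclass
class Stage:
    name: str           # label for the training stage
    component: str      # part of the model the stage targets
    data: List[Sample]  # samples used in the stage


def build_schedule(rich: List[Sample], poor: List[Sample]) -> List[Stage]:
    """Assign datasets to stages as the abstract describes."""
    # Encoder pre-training uses the combined multilingual dataset.
    multilingual = rich + poor
    return [
        Stage("encoder pre-training", "encoder", multilingual),
        # Decoder pre-training uses only the resource-poor language.
        Stage("decoder pre-training", "decoder", list(poor)),
        # Assumption: a final joint stage on the resource-poor data.
        # The abstract does not spell this step out.
        Stage("fine-tuning", "encoder+decoder", list(poor)),
    ]


rich_en = [("en_0001.png", "EXIT"), ("en_0002.png", "OPEN")]
poor_ja = [("ja_0001.png", "出口")]

schedule = build_schedule(rich_en, poor_ja)
for stage in schedule:
    print(f"{stage.name}: targets {stage.component}, {len(stage.data)} samples")
```

In this sketch only the encoder stage ever sees resource-rich samples, which mirrors the stated goal of a language-invariant encoder paired with a decoder specialized for the resource-poor language.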

Cited By

  • (2023) Extending TrOCR for Text Localization-Free OCR of Full-Page Scanned Receipt Images. 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 1471–1477. DOI: 10.1109/ICCVW60793.2023.00160. Online publication date: 2 October 2023.

        Published In

        MMAsia '21: Proceedings of the 3rd ACM International Conference on Multimedia in Asia
        December 2021
        508 pages
        ISBN:9781450386074
        DOI:10.1145/3469877
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. Transformer
        2. pre-training
        3. resource-poor language
        4. scene text recognition

        Qualifiers

        • Short-paper
        • Research
        • Refereed limited

        Conference

        MMAsia '21
        MMAsia '21: ACM Multimedia Asia
        December 1 - 3, 2021
        Gold Coast, Australia

        Acceptance Rates

        Overall Acceptance Rate 59 of 204 submissions, 29%

        Article Metrics

        • Downloads (Last 12 months): 6
        • Downloads (Last 6 weeks): 0
        Reflects downloads up to 13 Nov 2024
