An Algorithm Based on Text Position Correction and Encoder-Decoder Network for Text Recognition in the Scene Image of Visual Sensors
Figure 1. Overall structure of our text recognition algorithm.
Figure 2. TPC structure diagram.
Figure 3. Schematic diagram of coordinate normalization.
Figure 4. Structure diagram of the two-layer BLSTM.
Figure 5. Transformation of the feature map into a feature vector sequence.
Figure 6. Detailed structure of the Decoder Network.
Figure 7. Samples of the scene texts in the data sets used in this paper.
Figure 8. Experimental results of text recognition in a visual sensor scene image.
Abstract
1. Introduction
2. Overall Network Structure
2.1. Text Position Correction Module
2.2. Encoder Network
2.3. Decoder Network
3. Implementation Details
4. Experiments
4.1. Experimental Data Set
4.2. Experimental Results and Analysis
4.2.1. TPC and Its Influence on Text Recognition Results
4.2.2. Dense Connection Network and Its Impact on Text Recognition Results
4.2.3. Depth of BLSTM in EN and Its Influence on the Text Recognition Results
4.2.4. Attention Mechanism in DN and Its Influence on Text Recognition Results
4.3. Results Compared with Other Text Recognition Algorithms
4.4. Experimental Results of Text Recognition in the Scene Image of Visual Sensors
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
Type | Configuration | Size |
---|---|---|
Input | - | 1 × 64 × 200 |
Conv | k3, num64, s1, p1 | 64 × 64 × 200 |
AVGPool | k2, s2 | 64 × 32 × 100 |
Conv | k3, num128, s1, p1 | 128 × 32 × 100 |
AVGPool | k2, s2 | 128 × 16 × 50 |
Conv | k3, num256, s1, p1 | 256 × 16 × 50 |
AVGPool | k2, s2 | 256 × 8 × 25 |
Conv | k3, num128, s1, p1 | 128 × 8 × 25 |
AVGPool | k2, s1 | 128 × 7 × 24 |
Conv | k3, num64, s1, p1 | 64 × 3 × 12 |
Conv | k3, num32, s1, p1 | 32 × 3 × 12 |
Conv | k3, num8, s1, p1 | 8 × 3 × 12 |
Conv | k3, num2, s1, p1 | 2 × 3 × 12 |
AVGPool | k2, s1 | 2 × 2 × 11 |
Tanh | - | 2 × 2 × 11 |
Resize | - | 2 × 64 × 200 |
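As a sanity check on the Size column, the early rows of the TPC table follow the standard convolution/pooling output-size arithmetic. The helper below is an illustrative sketch, not the authors' code; it assumes the usual formula floor((n + 2p − k)/s) + 1 for each spatial dimension.

```python
def out_size(n, k, s, p=0):
    # Standard conv/pool output size: floor((n + 2p - k) / s) + 1,
    # where n = input size, k = kernel, s = stride, p = padding.
    return (n + 2 * p - k) // s + 1

# Reproduce the first two rows of the TPC table (input 64 x 200):
h, w = 64, 200
h, w = out_size(h, 3, 1, 1), out_size(w, 3, 1, 1)  # Conv k3, s1, p1 -> 64 x 200
h, w = out_size(h, 2, 2), out_size(w, 2, 2)        # AVGPool k2, s2 -> 32 x 100
```

The same formula explains the k2/s1 pooling rows, e.g. 8 × 25 shrinking to 7 × 24.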
Type | Configuration |
---|---|
Convolution | [k3, num32, s1, p1] |
Max Pooling | [k2, s1] |
Activation Function | Swish |
Dense Block | [k3, num32, s1, p1] × 4 |
Dense Block | [k3, num64, s1, p1] × 4 |
Dense Block | [k3, num128, s1, p1] × 4 |
Dense Block | [k3, num256, s1, p1] × 4 |
Convolution | [k3, num128, s1, p1] |
Max Pooling | [k2, s1] |
Activation Function | Swish |
Convolution | [k3, num128, s1, p1] |
BLSTM | Hidden units: 256 |
BLSTM | Hidden units: 256 |
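Two pieces of the encoder table can be made concrete without deep-learning dependencies: the Swish activation (the standard definition x · sigmoid(x)), and the map-to-sequence step that feeds the BLSTM, where a C × H × W feature map is sliced column-wise into W vectors of length C · H. Both functions below are illustrative sketches under those assumptions, not the authors' implementation.

```python
import math

def swish(x):
    # Swish activation: x * sigmoid(x).
    return x / (1.0 + math.exp(-x))

def map_to_sequence(feature_map):
    """Slice a C x H x W feature map (nested lists) into a sequence of
    W column vectors, each of length C * H, for the BLSTM encoder.
    Hypothetical helper illustrating the map-to-sequence step."""
    C = len(feature_map)
    H = len(feature_map[0])
    W = len(feature_map[0][0])
    return [[feature_map[c][h][w] for c in range(C) for h in range(H)]
            for w in range(W)]

# A 1 x 2 x 2 map becomes a sequence of two 2-dim vectors:
seq = map_to_sequence([[[1, 2], [3, 4]]])
```

With this convention, the encoder's final 128-channel map yields a sequence whose per-step vector length is 128 × H, one step per feature-map column.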
Type | Configuration |
---|---|
Input size | 64 × 200 |
Iterations | 100,000 |
Batch size | 16 |
Learning rate | 10⁻³ |
Learning rate decay | ×0.9 every 10,000 iterations |
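Reading the training table as a step-decay schedule (an assumption: the rate is multiplied by 0.9 once per 10,000 iterations), the learning rate at any iteration can be sketched as:

```python
def learning_rate(iteration, base_lr=1e-3, decay=0.9, step=10_000):
    # Step decay: multiply the base rate by `decay` once every `step`
    # iterations. Parameter names are illustrative, not the authors'.
    return base_lr * decay ** (iteration // step)
```

Over the full 100,000 iterations this takes the rate from 10⁻³ down to roughly 0.9⁹ × 10⁻³ ≈ 3.9 × 10⁻⁴.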
Model | SVT (50) | SVT (None) | Time | IIIT5K (50) | IIIT5K (1k) | IIIT5K (None) | Time |
---|---|---|---|---|---|---|---|
Without TPC | 89.6 | 76.5 | 3.2 h | 93.2 | 92.5 | 81.6 | 6.1 h |
With TPC | 96.5 | 83.7 | 3.6 h | 99.4 | 98.1 | 88.3 | 6.7 h |
Model | ICDAR 2013 (50) | ICDAR 2013 (FULL) | ICDAR 2013 (None) | Time | IIIT5K (50) | IIIT5K (1k) | IIIT5K (None) | Time |
---|---|---|---|---|---|---|---|---|
Without DCN | 89.6 | 86.5 | 79.5 | 5.4 h | 91.2 | 89.5 | 80.5 | 6.5 h |
With DCN | 98.6 | 97.5 | 92.3 | 5.5 h | 99.4 | 98.1 | 88.3 | 6.7 h |
Depth | ICDAR 2013 (50) | ICDAR 2013 (FULL) | ICDAR 2013 (None) | Time | IIIT5K (50) | IIIT5K (1k) | IIIT5K (None) | Time |
---|---|---|---|---|---|---|---|---|
0 | 89.6 | 89.5 | 83.5 | 3.4 h | 91.2 | 90.5 | 82.6 | 4.5 h |
1 layer | 97.2 | 94.6 | 89.9 | 4.7 h | 93.6 | 93.3 | 85.1 | 5.6 h |
2 layers | 98.6 | 97.5 | 92.3 | 5.5 h | 99.4 | 98.1 | 88.3 | 6.7 h |
3 layers | 98.3 | 97.1 | 92.2 | 6.8 h | 99.3 | 97.8 | 88.1 | 7.9 h |
Model | SVT (50) | SVT (None) | Time | IIIT5K (50) | IIIT5K (1k) | IIIT5K (None) | Time |
---|---|---|---|---|---|---|---|
Without Attention | 92.7 | 80.5 | 3.4 h | 95.5 | 93.5 | 84.6 | 6.5 h |
With Attention | 96.5 | 83.7 | 3.6 h | 99.4 | 98.1 | 88.3 | 6.7 h |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Huang, Z.; Lin, J.; Yang, H.; Wang, H.; Bai, T.; Liu, Q.; Pang, Y. An Algorithm Based on Text Position Correction and Encoder-Decoder Network for Text Recognition in the Scene Image of Visual Sensors. Sensors 2020, 20, 2942. https://doi.org/10.3390/s20102942