A Three-Stage Uyghur Recognition Model Combining the Attention Mechanism and Different Convolutional Recurrent Networks
Figures

Figure 1. Three-stage Uyghur recognition structure.
Figure 2. VGG feature extraction network.
Figure 3. GRCNN feature extraction network.
Figure 4. (a) ResNet and (b) ConvNeXt feature extraction network blocks.
Figure 5. Deep bidirectional LSTM network.
Figure 6. Connectionist temporal classification (CTC).
Figure 7. Attention mechanism (Attn).
Figure 8. Data augmentation of Uyghur images.
Figure 9. Example of computer-cut Uyghur words. (a) Sample printed template data. (b) Sample hand-drawn template data. (c) Sample electronic template data.
Figure 10. Example of manually adjusted data.
Figure 11. Example of computer-cut Uyghur words.
Figure 12. Sample data after trimming the edges.
Figure 13. Data pre-processing steps. (a) Original picture. (b) Padding. (c) Uniform scale. (d) Grayscale.
Figure 14. Effect of data augmentation via different methods. (a) Original picture. (b) Stochastic affine transformation. (c) Gaussian noise. (d) Elastic transformation.
Figure 15. Saliency map showing the influence of the whole word on the prediction.
Figure 16. Saliency maps showing the influence of each character on the prediction.
Figure 17. Local Interpretable Model-Agnostic Explanations (LIME).
Figure 18. Shapley additive explanations (SHAP).
Figure 19. Feature maps for selected networks.
Figure 20. Accuracy versus number of parameters.
Figure 21. Accuracy versus testing time.
Figure 22. CTC versus Attn (number of parameters).
Figure 23. CTC versus Attn (model testing time).
Figure 24. Comparison of feature extraction networks (number of parameters).
Figure 25. Comparison of feature extraction networks (model testing time).
Figure A1. Different writing styles of handwritten Uyghur words.
Abstract
1. Introduction
2. Related Work
3. Structure of Recognition
3.1. Feature Extraction Stage
3.2. Sequence Modeling Stage
3.3. Model Prediction Stage
4. Datasets
4.1. Printed Dataset Generation
4.2. Handwritten Dataset Generation
1. A 20 × 5 blank grid paper template was designed and printed, so that each sheet can hold 100 written Uyghur words. The grid paper template and the sample Uyghur words were distributed to the students, who were asked to write the words in order according to the sample. A handwritten sample is shown in Figure 9a. Owing to practical constraints, some students completed the writing on a hand-drawn grid paper template, as in Figure 9b, while others used an electronic template and a capacitive pen, as shown in Figure 9c.
2. The completed sheets were scanned or photographed, and the resulting electronic images were filed, one writer per folder. Each folder contains three pictures, named a.jpg, b.jpg, and c.jpg. A total of 140 folders were organized and stored in the same directory and named E1–E140.
3. Because of differences in shooting or scanning angles, some pictures were distorted, so the horizontal (vertical) lines of the original template were no longer horizontal (vertical). To allow the computer to crop each Uyghur word accurately in later stages, the pictures were rotated so that the original horizontal (vertical) lines were as horizontal (vertical) as possible, and the edge regions unrelated to the data were cropped away. The electronic samples did not need to be cropped. A manually adjusted and cropped data sample is shown in Figure 10.
4. Because of errors or omissions in manual writing, some students' samples contained misordered or missing words. To enable batch cropping by the computer and correct label ordering, the archived data were manually screened and corrected before computer cropping was performed.
5. For computer cropping, morphological operations were used to identify the horizontal and vertical lines in each picture, and these were matched to obtain intersection coordinates. The cell boundaries were then determined from the distances between these coordinates, and each Uyghur word was cropped from left to right and from top to bottom according to the boundary coordinates (a minimal code sketch of this step is given after this list). Automatic cropping was performed in the order a.jpg, b.jpg, c.jpg within each folder, traversing E1–E140, with 270 Uyghur word pictures cropped from each folder. Cropping all the data in E1–E140 produced 140 folders named S1–S140. The images in each folder were named x.jpg, with x from 1 to 270 (e.g., x = 112 corresponds to the second word from left to right in the third line of the b.jpg image). The corrected cropped image is shown in Figure 11.
6. For computer categorization, the words with the same serial number across the S1–S140 folders were collected and categorized. The result was 270 folders, each containing 140 samples of the same word written in different handwriting styles. The folder names and the names of the pictures in each folder were labeled automatically: the 270 folders were named T1–T270, and the pictures in each folder were named M(N).jpg (e.g., 243(7).jpg is the seventh sample of the word in folder T243). The detailed semantics and correspondences of the words are given in Appendix A.
7. To reduce the influence of the background on the Uyghur words in the images, the blank margin around each Uyghur word was cropped. The categorized and cropped data are shown in Figure 12.
8. Because of the random nature of the handwritten data, some words were misspelled; two Uyghur students were invited to screen out and remove these samples. A total of 20,250 handwritten Uyghur pictures covering 250 words were finally used to construct the dataset.
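The grid-line detection and cropping described in step 5 can be illustrated with a short OpenCV sketch. This is a minimal, assumed reconstruction rather than the authors' actual script: the file path, kernel lengths, and the coarse clustering of intersection coordinates are illustrative choices.

```python
# Sketch of grid-based word cropping (assumed parameters, not the authors' script).
import cv2
import numpy as np

img = cv2.imread("E1/a.jpg", cv2.IMREAD_GRAYSCALE)          # one scanned template sheet
binary = cv2.adaptiveThreshold(255 - img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                               cv2.THRESH_BINARY, 15, -2)   # bright strokes on black

# Keep only long horizontal/vertical strokes using directional structuring elements.
h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
h_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
v_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)

# Grid intersections are where both line masks overlap.
joints = cv2.bitwise_and(h_lines, v_lines)
ys, xs = np.nonzero(joints)
rows = sorted(set((np.round(ys / 10) * 10).astype(int)))    # coarse clustering of y coords
cols = sorted(set((np.round(xs / 10) * 10).astype(int)))    # coarse clustering of x coords

# Crop each cell from left to right and top to bottom using consecutive boundaries.
cells = [img[r0:r1, c0:c1]
         for r0, r1 in zip(rows, rows[1:])
         for c0, c1 in zip(cols, cols[1:])]
```

In practice, each cropped cell would then be written into the S1–S140 folder structure described above.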
4.3. Handwritten Dataset Pre-Processing
- Filling the images to squares: Padding the input data to a square ensures that the handwritten Uyghur words are trained with their normal glyph shapes, without distortion from resizing at network input. Therefore, in this paper, all images were padded to squares with the background color, with equal padding added on both sides of the smaller dimension (length or width). The effect after padding is shown in Figure 13b.
- Uniform scale: The sizes of the dataset images established in this paper vary within a limited range; to keep all data uniform and to improve the convergence speed and accuracy of the model, all images were rescaled to a standard size, as shown in Figure 13c.
- Grayscale image: The handwritten Uyghur data collected in this paper mostly have a gray background with considerable background noise. Binarization would amplify the effect of this noise on the data and harm recognition; therefore, grayscale conversion was used for normalization, as shown in Figure 13d. (A minimal code sketch of these pre-processing steps follows this list.)
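The three pre-processing steps above can be summarized in a short sketch. This is a minimal illustration assuming OpenCV; the target size and background value below are placeholders, since the exact values are not restated here.

```python
# Sketch of the pre-processing pipeline: pad to square, rescale, convert to grayscale.
import cv2

def preprocess(path, target=100, bg=255):        # target size and bg value are assumptions
    img = cv2.imread(path)                       # original picture (Figure 13a)
    h, w = img.shape[:2]
    diff = abs(h - w)
    pad1, pad2 = diff // 2, diff - diff // 2
    if h < w:                                    # pad top/bottom with the background colour
        img = cv2.copyMakeBorder(img, pad1, pad2, 0, 0,
                                 cv2.BORDER_CONSTANT, value=(bg, bg, bg))
    elif w < h:                                  # pad left/right with the background colour
        img = cv2.copyMakeBorder(img, 0, 0, pad1, pad2,
                                 cv2.BORDER_CONSTANT, value=(bg, bg, bg))
    img = cv2.resize(img, (target, target))      # uniform scale (Figure 13c)
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) # grayscale (Figure 13d)
```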
4.4. Handwritten Dataset Augmentation
- Stochastic affine transformation: A stochastic affine transformation applies translation, rotation, scaling, and shearing to an image, and is more random and realistic than any single augmentation applied alone. Applying random affine transformations to the handwritten Uyghur data simulates the text deformation that occurs in real writing, as shown in Figure 14b.
- Gaussian noise: In image processing, Gaussian filtering is used to suppress high-frequency noise and improve image quality; conversely, the backgrounds of the handwritten Uyghur text images are cluttered with noise and interference points, and augmenting the training data with Gaussian noise helps the model cope with such effects, as shown in Figure 14c.
- Elastic transformation: Random perturbation of pixel positions is achieved by applying a small, smooth random displacement to each pixel in the image. In text recognition, such character distortion simulates the images produced in handwriting due to muscle tremor, lighting changes, and similar factors. In earlier handwritten digit recognition experiments, recognition improved markedly after the original images were augmented in this way. Therefore, this method was also chosen to augment the handwritten Uyghur dataset in this paper; the effect is shown in Figure 14d. (A minimal augmentation sketch follows this list.)
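The three augmentation methods can be sketched as follows. This is a minimal NumPy/SciPy/OpenCV illustration under assumed parameter values (rotation range, noise sigma, elastic alpha and sigma), not the authors' exact pipeline; shearing is omitted from the affine sketch for brevity.

```python
# Sketch of the three augmentations on a grayscale uint8 image (assumed parameters).
import cv2
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def random_affine(img, max_angle=10, max_shift=0.05, scales=(0.9, 1.1), bg=255):
    """Random rotation, scaling, and translation via one affine warp (shear omitted)."""
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2),
                                np.random.uniform(-max_angle, max_angle),
                                np.random.uniform(*scales))
    M[0, 2] += np.random.uniform(-max_shift, max_shift) * w   # horizontal shift
    M[1, 2] += np.random.uniform(-max_shift, max_shift) * h   # vertical shift
    return cv2.warpAffine(img, M, (w, h), borderValue=bg)

def add_gaussian_noise(img, sigma=10.0):
    """Add zero-mean Gaussian noise."""
    noisy = img.astype(np.float32) + np.random.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def elastic_transform(img, alpha=34.0, sigma=4.0):
    """Elastic deformation: smooth random displacement of pixel coordinates."""
    dx = gaussian_filter(np.random.uniform(-1, 1, img.shape), sigma) * alpha
    dy = gaussian_filter(np.random.uniform(-1, 1, img.shape), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(img.shape[0]), np.arange(img.shape[1]), indexing="ij")
    return map_coordinates(img, [ys + dy, xs + dx], order=1, mode="reflect")
```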
5. Results and Discussion
5.1. Evaluation Standards
5.2. Description of Experimental Environment and Parameters
5.3. Model Interpretability
5.4. Classification Experiments
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
References
# | Feat. | Seq. | Pred. | ACC (%) | Time (ms/image) | Norm_ED | Params (M)
---|---|---|---|---|---|---|---
P1 | VGG | None | CTC | 29.62 | 0.996 | 0.55 | 5.57
P2 | VGG | None | Attn | 80.36 | 6.977 | 0.93 | 6.58
P3 | VGG | BiLSTM | CTC | 69.03 | 1.993 | 0.89 | 8.45
P4 | VGG | BiLSTM | Attn | 84.70 | 7.973 | 0.95 | 9.14
P5 | RCNN | None | CTC | 28.55 | 4.983 | 0.55 | 1.88
P6 | RCNN | None | Attn | 83.82 | 10.964 | 0.94 | 2.89
P7 | RCNN | BiLSTM | CTC | 69.15 | 5.980 | 0.90 | 4.76
P8 | RCNN | BiLSTM | Attn | 86.19 | 12.956 | 0.95 | 5.46
P9 | ResNet | None | CTC | 53.41 | 2.990 | 0.78 | 44.28
P10 | ResNet | None | Attn | 85.80 | 9.968 | 0.95 | 45.29
P11 | ResNet | BiLSTM | CTC | 78.30 | 3.986 | 0.93 | 47.16
P12 | ResNet | BiLSTM | Attn | 89.32 | 11.960 | 0.96 | 47.86
P13 | ConvNeXt | None | CTC | 56.81 | 5.979 | 0.83 | 67.57
P14 | ConvNeXt | None | Attn | 86.27 | 11.960 | 0.95 | 68.58
P15 | ConvNeXt | BiLSTM | CTC | 79.24 | 6.976 | 0.93 | 70.45
P16 | ConvNeXt | BiLSTM | Attn | 90.21 | 14.951 | 0.97 | 71.14
Stage | Model | ACC (%) | Time (ms/image) | Norm_ED | Params (M)
---|---|---|---|---|---
Feat. | VGG | 65.93 | 4.48 | 0.830 | 5.55
Feat. | RCNN | 66.93 (+1.00) | 8.72 | 0.835 (+0.005) | 1.86
Feat. | ResNet | 76.71 (+10.78) | 7.23 | 0.905 (+0.075) | 44.26
Feat. | ConvNeXt | 78.13 (+12.20) | 9.97 | 0.920 (+0.090) | 44.26
Seq. | None | 63.08 | 6.85 | 0.810 | N/A
Seq. | BiLSTM | 80.77 (+17.69) | 7.47 | 0.932 (+0.122) | 2.89
Pred. | CTC | 58.01 | 4.24 | 0.795 | 0.01
Pred. | Attn | 85.83 (+27.82) | 10.96 | 0.950 (+0.155) | 1.03