Scene Text Detection Based on Two-Branch Feature Extraction
Figure 1. Overall framework of the proposed method. First, the ResNet backbone is modified and a residual correction branch (RCB) is added to improve the feature extraction ability of the network. Second, a more efficient feature fusion module, the Two-Branch Attention Feature Fusion (TB-AFF) module, is adopted.
Figure 2. Structure of the residual correction branch (RCB).
Figure 3. Structure of the two-branch attention feature fusion (TB-AFF) module.
Figure 4. Structure of differentiable binarization.
Figure 5. Visualization results for different types of text examples: (a,b) multi-directional text, (c,d) multi-lingual text, (e,f) curved text.
Figure 6. Visualization results of the baseline and our method.
Figure 7. Visualization results on different types of text examples: (a,b) multi-directional text, (c,d) multi-lingual text, (e,f) curved text.
Abstract
1. Introduction
2. Related Work
3. Method
3.1. Residual Correction Branch (RCB)
3.2. Two-Branch Attention Feature Fusion (TB-AFF) Module
- (1) The TB-AFF module sets the scale of attention with point-wise (1 × 1) convolution rather than with convolution kernels of different sizes. Point-wise convolution also keeps TB-AFF as lightweight as possible.
- (2) The TB-AFF module is not placed in the backbone network but is built on the feature pyramid network (FPN). It aggregates global and local feature information, strengthens the connection with contextual feature information, and refines the text regions.
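The two properties above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a common attentional-fusion layout in which a local branch (point-wise convolutions applied at every pixel) and a global branch (global average pooling followed by point-wise convolutions) are summed, passed through a sigmoid, and used as a soft weight between two feature maps. The function names, the bottleneck with ReLU, the omission of batch normalization, the random weights, and the reading of `r` as a channel reduction ratio are all assumptions made for illustration.

```python
import numpy as np


def pointwise_conv(x, weight, bias):
    """1 x 1 convolution: mixes channels independently at every pixel.

    x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)
    """
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_branch(rng, channels, r):
    """Random weights for a two-layer point-wise bottleneck (illustrative only)."""
    hidden = max(channels // r, 1)  # r is assumed to be a channel reduction ratio
    return (
        rng.normal(scale=0.1, size=(hidden, channels)), np.zeros(hidden),
        rng.normal(scale=0.1, size=(channels, hidden)), np.zeros(channels),
    )


def branch(x, w1, b1, w2, b2):
    """Two point-wise convolutions with a ReLU in between."""
    hidden = np.maximum(pointwise_conv(x, w1, b1), 0.0)
    return pointwise_conv(hidden, w2, b2)


def two_branch_attention_fuse(x, y, params):
    """Fuse two (C, H, W) feature maps with a local and a global attention branch."""
    s = x + y
    local = branch(s, *params["local"])                # per-pixel context, (C, H, W)
    pooled = s.mean(axis=(1, 2), keepdims=True)        # global average pooling, (C, 1, 1)
    global_ctx = branch(pooled, *params["global"])     # per-channel context, (C, 1, 1)
    weight = sigmoid(local + global_ctx)               # broadcasts to (C, H, W), values in (0, 1)
    return weight * x + (1.0 - weight) * y             # soft selection between x and y


rng = np.random.default_rng(0)
channels, r = 8, 4
params = {
    "local": init_branch(rng, channels, r),
    "global": init_branch(rng, channels, r),
}
x = rng.normal(size=(channels, 5, 6))  # e.g., an upsampled deeper FPN level
y = rng.normal(size=(channels, 5, 6))  # e.g., a lateral feature of the same size
fused = two_branch_attention_fuse(x, y, params)
print(fused.shape)
```

Because only point-wise convolutions are used, the number of attention parameters depends on the channel count and on `r`, not on any spatial kernel size, which is one way to read the "lightweight" claim above.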
3.3. Differentiable Binary Module
3.4. Loss Function
4. Experimental Results and Analysis
4.1. Training Configuration
4.2. Experiment and Discussion
5. Discussion and Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Jiang, Y.; Zhu, X.; Wang, X.; Yang, S.; Li, W.; Wang, H.; Luo, Z. R2CNN: Rotational region CNN for orientation robust scene text detection. arXiv 2017, arXiv:1706.09579.
- He, T.; Tian, Z.; Huang, W.; Shen, C.; Qiao, Y.; Sun, C. An end-to-end textspotter with explicit alignment and attention. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5020–5029.
- Ch’ng, C.K.; Chan, C.S. Total-Text: A comprehensive dataset for scene text detection and recognition. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; pp. 935–942.
- Liao, M.; Wan, Z.; Yao, C.; Chen, K.; Bai, X. Real-time scene text detection with differentiable binarization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11474–11481.
- Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S.; et al. ICDAR 2015 competition on robust reading. In Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR), Nancy, France, 23–26 August 2015; pp. 1156–1160.
- Yao, C.; Bai, X.; Liu, W.; Ma, Y.; Tu, Z. Detecting texts of arbitrary orientations in natural images. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1083–1090.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Li, Y.; Qi, H.; Dai, J.; Ji, X.; Wei, Y. Fully convolutional instance-aware semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2359–2367.
- Zhong, Z.; Jin, L.; Huang, S. DeepText: A new approach for text proposal generation and text detection in natural images. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 1208–1212.
- Tian, Z.; Huang, W.; He, T.; He, P.; Qiao, Y. Detecting text in natural image with connectionist text proposal network. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 56–72.
- Shi, B.; Bai, X.; Belongie, S. Detecting oriented text in natural images by linking segments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2550–2558.
- Liao, M.; Shi, B.; Bai, X.; Wang, X.; Liu, W. TextBoxes: A fast text detector with a single deep neural network. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4161–4167.
- Liao, M.; Shi, B.; Bai, X. TextBoxes++: A single-shot oriented scene text detector. IEEE Trans. Image Processing 2018, 27, 3676–3690.
- Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122.
- Jiang, Y.; Zhu, X.; Wang, X.; Yang, S.; Li, W.; Wang, H.; Fu, P.; Luo, Z. R2CNN: Rotational region CNN for arbitrarily-oriented scene text detection. In Proceedings of the 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 3610–3615.
- Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. EAST: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5551–5560.
- Deng, D.; Liu, H.F.; Li, X.L.; Cai, D. PixelLink: Detecting scene text via instance segmentation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 6773–6780.
- Long, S.; Ruan, J.; Zhang, W.; He, X.; Wu, W.; Yao, C. TextSnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 20–36.
- Wang, W.; Xie, E.; Li, X.; Hou, W.; Lu, T.; Yu, G.; Shao, S. Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9336–9345.
- Wang, W.; Xie, E.; Song, X.; Zang, Y.; Wang, W.; Lu, T.; Yu, G.; Shen, C. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Long Beach, CA, USA, 16–20 June 2019; pp. 8440–8449.
- He, W.; Zhang, X.Y.; Yin, F.; Liu, C.L. Deep direct regression for multi-oriented scene text detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 745–753.
- Tian, Z.; Shu, M.; Lyu, P.; Li, R.; Zhou, C.; Shen, X.; Jia, J. Learning shape-aware embedding for scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4234–4243.
- Xu, Y.; Wang, Y.; Zhou, W.; Wang, Y.; Yang, Z.; Bai, X. TextField: Learning a deep direction field for irregular scene text detection. IEEE Trans. Image Processing 2019, 28, 5566–5579.
- Liu, J.; Liu, X.; Sheng, J.; Liang, D.; Li, X.; Liu, Q. Pyramid mask text detector. arXiv 2019, arXiv:1903.11800.
- Baek, Y.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character region awareness for text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9365–9374.
- Huang, Z.; Zhong, Z.; Sun, L.; Huo, Q. Mask R-CNN with pyramid attention network for scene text detection. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 764–772.
- Xie, E.; Zang, Y.; Shao, S.; Yu, G.; Yao, C.; Li, G. Scene text detection with supervised pyramid context network. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 9038–9045.
- Fan, D.P.; Wang, W.; Cheng, M.M.; Shen, J. Shifting more attention to video salient object detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 8554–8564.
- Fu, K.; Fan, D.P.; Ji, G.P.; Zhao, Q. JL-DCF: Joint learning and densely-cooperative fusion framework for RGB-D salient object detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3049–3059.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
- Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic data for text localisation in natural images. In Proceedings of the 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26–30 June 2016; pp. 2315–2324.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Lyu, P.; Yao, C.; Wu, W.; Yan, S.; Bai, S. Multi-oriented scene text detection via corner localization and region segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7553–7563.
- He, W.; Zhang, X.Y.; Yin, F.; Luo, Z.; Ogier, J.M.; Liu, C.L. Realtime multi-scale scene text detection with scale-based region proposal network. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 107026–107035.
Y2 | C2/P2 | ||
Y3 | C3/P3 | ||
Y4 | C4/P4 | ||
Y5 | C5/P5 |
Method | RCB | TB-AFF | P (%) | R (%) | F (%) |
---|---|---|---|---|---|
ResNet-18 | | | 89.3 | 73.80 | 80.80 |
ResNet-18 | √ | | 86.66 | 78.48 | 82.36 |
ResNet-18 | | √ | 87.51 | 78.62 | 82.83 |
ResNet-18 | √ | √ | 87.26 | 79.48 | 83.19 |
Method | RCB | TB-AFF | P (%) | R (%) | F (%) |
---|---|---|---|---|---|
ResNet-18 | | | 86.70 | 75.4 | 80.70 |
ResNet-18 | √ | | 88.05 | 78.36 | 82.92 |
ResNet-18 | | √ | 87.04 | 79.13 | 82.90 |
ResNet-18 | √ | √ | 87.37 | 78.95 | 82.95 |
Method | RCB | TB-AFF | P (%) | R (%) | F (%) |
---|---|---|---|---|---|
ResNet-18 | | | 85.70 | 73.20 | 79.0 |
ResNet-18 | √ | | 86.84 | 81.62 | 84.15 |
ResNet-18 | | √ | 87.17 | 80.58 | 83.75 |
ResNet-18 | √ | √ | 88.02 | 83.33 | 85.61 |
Method | FLOPs (G) | P (%) | R (%) | F (%) |
---|---|---|---|---|
Ours-ResNet-18 (r = 1) | 45.44 | 86.45 | 79.87 | 83.03 |
Ours-ResNet-18 (r = 2) | 39.77 | 86.79 | 79.73 | 83.11 |
Ours-ResNet-18 (r = 3) | 38.65 | 87.63 | 79.15 | 83.17 |
Ours-ResNet-18 (r = 4) | 38.35 | 87.26 | 79.48 | 83.19 |
Ours-ResNet-18 (r = 5) | 38.18 | 87.97 | 78.86 | 83.17 |
Method | Params (M) | FLOPs (G) | P (%) | R (%) | F (%) |
---|---|---|---|---|---|
TextSnake (Long et al., 2018) | 19.1 | 136.01 | 82.7 | 74.5 | 78.4 |
ATRR (Wang et al., 2019b) | - | - | 80.9 | 76.2 | 78.5 |
MTS (Lyu et al., 2018a) | - | - | 82.5 | 75.6 | 78.6 |
TextField (Xu et al., 2019) | - | - | 81.2 | 79.9 | 80.6 |
LOMO (Zhang et al., 2019) * | - | - | 87.6 | 79.3 | 83.3 |
CRAFT (Baek et al., 2019) | 20.8 | 146.29 | 87.6 | 79.9 | 83.6 |
CSE (Liu et al., 2019b) | - | - | 81.4 | 79.1 | 80.2 |
PSE-1s (Wang et al., 2019a) | 28.6 | 117.1 | 84.0 | 78.0 | 80.9 |
DB-ResNet-18 (800 × 800) | 12.2 | 24.46 | 86.7 | 75.4 | 80.7 |
Ours-ResNet-18 (800 × 800) | 18.4 | 38.35 | 87.37 | 78.95 | 82.95 |
DB-ResNet-50 (800 × 800) | 25.5 | 49.36 | 84.3 | 78.4 | 81.3 |
Ours-ResNet-50 (800 × 800) | 69.9 | 60.99 | 88.06 | 82.19 | 85.03 |
Method | P (%) | R (%) | F (%) |
---|---|---|---|
EAST (Zhou et al., 2017) | 83.6 | 73.5 | 78.2 |
Corner (Lyu et al., 2018b) | 94.1 | 70.7 | 80.7 |
RRD (Liao et al., 2018) | 85.6 | 79.0 | 82.2 |
PAN (Wang et al., 2019) | 84.0 | 81.9 | 82.9 |
PSE-1s (Wang et al., 2019a) | 86.9 | 84.5 | 85.7 |
SPCNet (Xie et al., 2019a) | 88.7 | 85.8 | 87.2 |
LOMO (Zhang et al., 2019) | 91.3 | 83.5 | 87.2 |
CRAFT (Baek et al., 2019) | 89.8 | 84.3 | 86.9 |
SAE (Tian et al., 2019) | 88.3 | 85.0 | 86.6 |
SRPN (He et al., 2020) | 92.0 | 79.7 | 85.4 |
DB-ResNet-18 (1280 × 736) | 89.3 | 73.8 | 80.8 |
Ours-ResNet-18 (1280 × 736) | 87.26 | 79.48 | 83.19 |
DB-ResNet-50 (1280 × 736) | 88.6 | 77.8 | 82.9 |
Ours-ResNet-50 (1280 × 736) | 87.82 | 79.83 | 83.63 |
DB-ResNet-50 (2048 × 1152) | 89.8 | 79.3 | 84.2 |
Ours-ResNet-50 (2048 × 1152) | 88.21 | 84.26 | 86.19 |
Method | P (%) | R (%) | F (%) |
---|---|---|---|
(He et al., 2016b) | 71.0 | 61.0 | 69.0 |
EAST (Zhou et al., 2017) | 87.28 | 67.43 | 76.08 |
DeepReg (He et al., 2017b) | 77.0 | 70.0 | 74.0 |
SegLink (Shi et al., 2017) | 86 | 70 | 77 |
RRPN (Ma et al., 2018) | 82.0 | 68.0 | 74.0 |
RRD (Liao et al., 2018) | 87.0 | 73.0 | 79.0 |
MCN (Liu et al., 2018) | 88.0 | 79.0 | 83.0 |
PixelLink (Deng et al., 2018) | 83.0 | 73.2 | 77.8 |
Corner (Lyu et al., 2018b) | 87.6 | 76.2 | 81.5 |
TextSnake (Long et al., 2018) | 83.2 | 73.9 | 78.3 |
(Xue, Lu, and Zhan 2018) | 83.0 | 77.4 | 80.1 |
(Xue, Lu, and Zhang 2019) | 87.4 | 76.7 | 81.7 |
CRAFT (Baek et al., 2019) | 88.2 | 78.2 | 82.9 |
SAE (Tian et al., 2019) | 84.2 | 81.7 | 82.9 |
PAN (Wang et al., 2019) | 84.4 | 83.8 | 84.1 |
SRPN (He et al., 2020) | 84.9 | 77.0 | 80.7 |
DB-ResNet-18 (512 × 512) | 85.7 | 73.2 | 79.0 |
Ours-ResNet-18 (512 × 512) | 90.16 | 77.15 | 83.15 |
DB-ResNet-18 (736 × 736) | 90.4 | 76.3 | 82.8 |
Ours-ResNet-18 (736 × 736) | 88.02 | 83.33 | 85.61 |
DB-ResNet-50 (736 × 736) | 91.5 | 79.2 | 84.9 |
Ours-ResNet-50 (736 × 736) | 89.80 | 84.71 | 87.18 |
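
The F column in the tables above is consistent with the standard F-measure, the harmonic mean of precision P and recall R, F = 2PR / (P + R). The short sketch below recomputes F for a few rows copied from the last table as a sanity check. The function name `f_measure` and the 0.05 tolerance are illustrative choices, not part of the paper.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (method, P, R, reported F) copied from the last comparison table above.
reported = [
    ("Ours-ResNet-18 (512 x 512)", 90.16, 77.15, 83.15),
    ("Ours-ResNet-18 (736 x 736)", 88.02, 83.33, 85.61),
    ("Ours-ResNet-50 (736 x 736)", 89.80, 84.71, 87.18),
]

for method, p, r, f_reported in reported:
    f_computed = f_measure(p, r)
    # A small tolerance absorbs rounding of the published P and R values.
    status = "ok" if abs(f_computed - f_reported) < 0.05 else "check"
    print(f"{method}: computed F = {f_computed:.2f}, reported F = {f_reported:.2f} ({status})")
```

The same check can be run on any other row; a larger gap between the computed and reported F usually points to rounding of P and R or to a transcription error in the table.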
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ibrayim, M.; Li, Y.; Hamdulla, A. Scene Text Detection Based on Two-Branch Feature Extraction. Sensors 2022, 22, 6262. https://doi.org/10.3390/s22166262