
Street View Text Recognition With Deep Learning for Urban Scene Understanding in Intelligent Transportation Systems

Published: 01 July 2021

Abstract

Understanding the surrounding scene is one of the fundamental tasks in intelligent transportation systems (ITS), especially in unpredictable driving scenes or in developing regions and cities without digital maps. The street view is the most common scene during driving. Since streets are often lined with shops bearing signboards, scene text recognition over shop-sign images in street views is of great significance and utility for urban scene understanding in ITS. To advance research in this field, (1) we build ShopSign, a large-scale scene text dataset of Chinese shop signs in street views. It contains 25,770 natural scene images and 267,049 text instances. The images in ShopSign were captured in diverse scenes, from downtown areas to developing regions, across 8 provinces and 20 cities in China, using more than 50 different mobile phones. The dataset is sparse and imbalanced in nature. (2) We carry out a comprehensive empirical study of the performance of state-of-the-art deep-learning-based scene text reading algorithms on ShopSign and three other Chinese scene text datasets, which has not been addressed in the literature before. Through comparative analysis, we demonstrate that language has a critical influence on scene text detection. Moreover, by comparing the accuracy of four scene text recognition algorithms, we show that there is still substantial room for improvement in street view text recognition before it can fit real-world ITS applications.
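The abstract describes ShopSign's character distribution as sparse and imbalanced. A minimal sketch of how such imbalance could be quantified from transcription annotations is shown below; the annotation format (a plain list of transcribed strings) and the imbalance measure (max/min character frequency) are illustrative assumptions, not the paper's methodology.

```python
from collections import Counter

def character_distribution(text_instances):
    """Count character frequencies across annotated text instances."""
    counts = Counter()
    for text in text_instances:
        counts.update(text)  # each string contributes one count per character
    return counts

def imbalance_ratio(counts):
    """Ratio of the most to the least frequent character (a crude imbalance measure)."""
    freqs = sorted(counts.values())
    return freqs[-1] / freqs[0]

# Toy shop-sign transcriptions standing in for real ShopSign annotations.
instances = ["书店", "小吃店", "小小超市", "理发店"]
counts = character_distribution(instances)
print(counts.most_common(3))   # [('店', 3), ('小', 3), ('书', 1)]
print(imbalance_ratio(counts)) # 3.0
```

On a corpus with tens of thousands of distinct Chinese characters, most characters appear only a handful of times, which is what makes such a distribution both sparse and imbalanced for recognition models.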




Published In

IEEE Transactions on Intelligent Transportation Systems, Volume 22, Issue 7
July 2021
867 pages

          Publisher

          IEEE Press


          Qualifiers

          • Research-article


Cited By

• (2024) "TTS: Hilbert Transform-Based Generative Adversarial Network for Tattoo and Scene Text Spotting," IEEE Transactions on Multimedia, vol. 26, pp. 8226–8241, Mar. 2024. DOI: 10.1109/TMM.2024.3378458.
• (2024) "Hastening Stream Offloading of Inference via Multi-Exit DNNs in Mobile Edge Computing," IEEE Transactions on Mobile Computing, vol. 23, no. 1, pp. 535–548, Jan. 2024. DOI: 10.1109/TMC.2022.3218724.
• (2024) "Adversarial Deep Learning based Dampster–Shafer data fusion model for intelligent transportation system," Information Fusion, vol. 102, Feb. 2024. DOI: 10.1016/j.inffus.2023.102050.
• (2023) "Data Recognition for Multi-Source Heterogeneous Experimental Detection in Cloud Edge Collaboratives," International Journal of Information Technologies and Systems Approach, vol. 16, no. 3, pp. 1–19, Sep. 2023. DOI: 10.4018/IJITSA.330986.
• (2023) "Locate then generate," in Proc. 37th AAAI Conf. Artif. Intell., pp. 11479–11487, Feb. 2023. DOI: 10.1609/aaai.v37i9.26357.
• (2023) "Pixel Adapter: A Graph-Based Post-Processing Approach for Scene Text Image Super-Resolution," in Proc. 31st ACM Int. Conf. Multimedia, pp. 2168–2179, Oct. 2023. DOI: 10.1145/3581783.3611913.
• (2023) "Monocular 3D Object Detection Utilizing Auxiliary Learning With Deformable Convolution," IEEE Transactions on Intelligent Transportation Systems, vol. 25, no. 3, pp. 2424–2436, Oct. 2023. DOI: 10.1109/TITS.2023.3319556.
• (2023) "HFENet: Hybrid Feature Enhancement Network for Detecting Texts in Scenes and Traffic Panels," IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 12, pp. 14200–14212, Dec. 2023. DOI: 10.1109/TITS.2023.3305686.
• (2023) "Multi-lane detection by combining line anchor and feature shift for urban traffic management," Engineering Applications of Artificial Intelligence, vol. 123, Aug. 2023. DOI: 10.1016/j.engappai.2023.106238.
• (2023) "PESTD: a large-scale Persian-English scene text dataset," Multimedia Tools and Applications, vol. 82, no. 22, pp. 34793–34808, Mar. 2023. DOI: 10.1007/s11042-023-15062-0.
