research-article

Open access

A Single-Shot Arbitrarily-Shaped Text Detector based on Context Attended Multi-Task Learning

Authors:

Chengquan Zhang,

Guangming Shi

MM '19: Proceedings of the 27th ACM International Conference on Multimedia

Pages 1277 - 1285

https://doi.org/10.1145/3343031.3350988

Published: 15 October 2019

Abstract

Detecting scene text of arbitrary shapes has remained a challenging task in recent years. In this paper, we propose a novel segmentation-based text detector, SAST, which employs a context-attended multi-task learning framework based on a Fully Convolutional Network (FCN) to learn the geometric properties needed to reconstruct polygonal representations of text regions. To account for the sequential characteristics of text, a Context Attention Block is introduced to capture long-range dependencies among pixels and obtain a more reliable segmentation. In post-processing, a Point-to-Quad assignment method is proposed to cluster pixels into text instances by integrating high-level object knowledge and low-level pixel information in a single shot. Moreover, the proposed geometric properties allow the polygonal representation of arbitrarily-shaped text to be extracted much more effectively. Experiments on several benchmarks, including ICDAR2015, ICDAR2017-MLT, SCUT-CTW1500, and Total-Text, demonstrate that SAST achieves better or comparable accuracy. Furthermore, the proposed algorithm runs at 27.63 FPS on SCUT-CTW1500 with an Hmean of 81.0% on a single NVIDIA Titan Xp graphics card, surpassing most existing segmentation-based methods.
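The abstract reports accuracy as Hmean, which in text detection benchmarks is the harmonic mean of precision and recall. The following is a minimal sketch of that calculation, not the authors' evaluation code. The function names (`hmean`, `detection_scores`) and the counts in the example are hypothetical and are not taken from the paper.

```python
def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (equivalent to the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def detection_scores(num_matched: int, num_detected: int, num_ground_truth: int):
    """Return (precision, recall, hmean) from raw match counts.

    num_matched: detections matched to a ground-truth text instance
    num_detected: all detections produced by the detector
    num_ground_truth: all annotated text instances
    """
    precision = num_matched / num_detected if num_detected else 0.0
    recall = num_matched / num_ground_truth if num_ground_truth else 0.0
    return precision, recall, hmean(precision, recall)


# Hypothetical counts for illustration only (not results from the paper).
p, r, h = detection_scores(num_matched=90, num_detected=100, num_ground_truth=120)
print(f"precision={p:.3f} recall={r:.3f} hmean={h:.3f}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, a detector cannot reach a high Hmean by trading away recall for precision or the reverse.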

Index Terms

A Single-Shot Arbitrarily-Shaped Text Detector based on Context Attended Multi-Task Learning
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Image segmentation
      2. Computer vision tasks
        Scene understanding

Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia

October 2019

2794 pages

ISBN:9781450368896

DOI:10.1145/3343031

General Chairs:
Laurent Amsaleg (CNRS-IRISA, France)
Benoit Huet (EURECOM, France)
Martha Larson (Radboud University and TU Delft, Netherlands)

Program Chairs:
Guillaume Gravier (CNRS-IRISA, France)
Hayley Hung (Delft University of Technology, Netherlands)
Chong-Wah Ngo (City University of Hong Kong, Hong Kong)
Wei Tsang Ooi (National University of Singapore, Singapore)

Copyright © 2019 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MM '19: The 27th ACM International Conference on Multimedia

October 21 - 25, 2019

Nice, France

Acceptance Rates

MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%

Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%

Bibliometrics

Article Metrics

Total Citations: 56
Total Downloads: 1,102
Downloads (Last 12 months): 209
Downloads (Last 6 weeks): 15

Reflects downloads up to 13 Feb 2025
