Abstract
Visual data captured in driving scenarios usually contain large amounts of text that conveys semantic information necessary for analysing the urban environment and is integral to the traffic control plan. Yet research on autonomous driving and driver assistance systems typically ignores this information. To advance research in this direction, we present RoadText-3K, a large driving-video dataset with fully annotated text. RoadText-3K is three times bigger than its predecessor, RoadText-1K, and contains data from varied geographical locations, unconstrained driving conditions, and multiple languages and scripts. We offer a comprehensive analysis of tracking-by-detection and detection-by-tracking methods, exploring the limits of state-of-the-art text detection. Finally, we propose a new end-to-end trainable tracking model that yields state-of-the-art results on this challenging dataset. Our experiments demonstrate the complexity and variability of RoadText-3K and establish a new, realistic benchmark for scene text tracking in the wild.
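As context for the comparison above, tracking by detection links independent per-frame text detections into tracks by matching boxes across consecutive frames. The sketch below illustrates the generic idea using IoU overlap and Hungarian assignment (as popularized by SORT, Bewley et al.); the box format and threshold are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def match(tracks, detections, iou_thr=0.3):
    """Assign current-frame detections to existing tracks by maximizing total IoU."""
    if not tracks or not detections:
        return [], list(range(len(tracks))), list(range(len(detections)))
    # Cost matrix: 1 - IoU between every track box and detection box.
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm (Kuhn, 1955)
    matches = [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_thr]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(tracks)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(detections)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```

Unmatched detections typically start new tracks, while tracks left unmatched for several frames are terminated; detection by tracking instead propagates existing tracks forward and uses them to recover detections the detector missed.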
Notes
1. For CTPN, EAST, and FOTS we used unofficial implementations of the original methods; for CRAFT we used the authors' released implementation:
2. We used the implementation available at https://github.com/cheind/py-motmetrics.
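For reference, below is a minimal sketch of how CLEAR MOT scores (MOTA/MOTP) can be computed with this library. The frame data and track ids are illustrative placeholders, not the paper's evaluation code.

```python
import numpy as np
import motmetrics as mm

acc = mm.MOTAccumulator(auto_id=True)  # frame ids auto-increment on update

# One update per video frame: ground-truth ids, hypothesis (tracker) ids,
# and a gt-by-hypothesis distance matrix (np.nan forbids a match).
gt_boxes = np.array([[10, 20, 50, 15]])                      # (x, y, w, h)
hyp_boxes = np.array([[12, 22, 48, 14], [200, 40, 30, 12]])  # (x, y, w, h)
dists = mm.distances.iou_matrix(gt_boxes, hyp_boxes, max_iou=0.5)
acc.update(['gt_0'], ['hyp_0', 'hyp_1'], dists)

# Aggregate CLEAR MOT metrics over all accumulated frames.
mh = mm.metrics.create()
summary = mh.compute(acc, metrics=['mota', 'motp', 'num_switches'], name='seq')
print(mm.io.render_summary(summary, formatters=mh.formatters,
                           namemap=mm.io.motchallenge_metric_names))
```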
References
Lukežič, A., Vojíř, T., Čehovin, L., Matas, J., Kristan, M.: Discriminative correlation filter tracker with channel and spatial reliability. Int. J. Comput. Vis. 126, 671–688 (2018)
Baek, Y., Lee, B., Han, D., Yun, S., Lee, H.: Character region awareness for text detection. In: CVPR (2019)
Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP J. Image Video Process. 2008, 1–10 (2008)
Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: ICIP (2016)
Bochkovskiy, A., Wang, C.-Y., Liao, H.-Y.M.: YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)
Cerf, M., Frady, E.P., Koch, C.: Faces and text attract gaze independent of the task: experimental data and computer model. J. Vision 9, 10 (2009)
Cheng, Z., et al.: FREE: a fast and robust end-to-end video text spotter. IEEE Trans. Image Process. 30, 822–837 (2020)
Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
Donoser, M., Bischof, H.: Efficient maximally stable extremal region (MSER) tracking. In: CVPR (2006)
Gomez, L., Karatzas, D.: MSER-based real-time text detection and tracking. In: ICPR (2014)
Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: CVPR (2016)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: Exploiting the circulant structure of tracking-by-detection with kernels. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 702–715. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33765-9_50
Kalal, Z., Mikolajczyk, K., Matas, J.: Forward-backward error: automatic detection of tracking failures. In: ICPR (2010)
Karatzas, D., et al.: ICDAR 2015 competition on robust reading. In: ICDAR (2015)
Karatzas, D., et al.: ICDAR 2013 robust reading competition. In: ICDAR (2013)
Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logistics Q. 2, 83–97 (1955)
Liao, M., Shi, B., Bai, X.: TextBoxes++: a single-shot oriented scene text detector. IEEE Trans. Image Process. 27, 3676–3690 (2018)
Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV (2017)
Liu, X., Liang, D., Yan, S., Chen, D., Qiao, Y., Yan, J.: FOTS: fast oriented text spotting with a unified network. In: CVPR (2018)
Milan, A., Leal-Taixé, L., Reid, I., Roth, S., Schindler, K.: MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831 (2016)
Minetto, R., Thome, N., Cord, M., Leite, N.J., Stolfi, J.: SnooperTrack: text detection and tracking for outdoor videos. In: ICIP (2011)
Misra, D.: Mish: a self regularized non-monotonic neural activation function. arXiv preprint arXiv:1908.08681 (2019)
Nguyen, P.X., Wang, K., Belongie, S.: Video text detection and recognition: dataset and benchmark. In: WACV (2014)
Petter, M., Fragoso, V., Turk, M., Baur, C.: Automatic text detection for mobile augmented reality translation. In: ICCV Workshops (2011)
Reddy, S., Mathew, M., Gomez, L., Rusinol, M., Karatzas, D., Jawahar, C.V.: RoadText-1K: text detection & recognition dataset for driving videos. In: ICRA (2020)
Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.K., Woo, W.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: NeurIPS (2015)
Tian, S., Pei, W.-Y., Zuo, Z.-Y., Yin, X.-C.: Scene text detection in video by learning locally and globally. In: IJCAI (2016)
Tian, S., Yin, X.-C., Su, Y., Hao, H.-W.: A unified framework for tracking based text detection and recognition from web videos. IEEE Trans. Pattern Anal. Mach. Intell. 40(3), 542–554 (2017)
Tian, Z., Huang, W., He, T., He, P., Qiao, Yu.: Detecting text in natural image with connectionist text proposal network. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 56–72. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_4
Topolšek, D., Areh, I., Cvahte, T.: Examination of driver detection of roadside traffic signs and advertisements using eye tracking. Transp. Res. Part F: Traffic Psychol. Behav. 43, 212–224 (2016)
Veit, A., Matera, T., Neumann, L., Matas, J., Belongie, S.: COCO-text: dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140 (2016)
Wang, X., et al.: End-to-end scene text recognition in videos based on multi frame tracking. In: ICDAR (2017)
Williams, D.: The Arbitron National In-Car Study. Arbitron Inc., Columbia (2009)
Wu, W., et al.: A bilingual, OpenWorld video text dataset and end-to-end video text spotter with transformer. In: NeurIPS 2021 Track on Datasets and Benchmarks (2021)
Yu, H., Huang, Y., Pi, L., Zhang, C., Li, X., Wang, L.: End-to-end video text detection with online tracking. Pattern Recogn. 113, 107791 (2021)
Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)
Zhou, X., et al.: EAST: an efficient and accurate scene text detector. In: CVPR (2017)
Acknowledgements
This work has been supported by the Pla de Doctorats Industrials de la Secretaria d’Universitats i Recerca del Departament d’Empresa i Coneixement de la Generalitat de Catalunya; Grant PDC2021-121512-I00 funded by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR; Project PID2020-116298GB-I00 funded by MCIN/AEI/10.13039/501100011033; Grant PLEC2021-007850 funded by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR; Spanish Project NEOTEC SNEO-20211172 from CDTI; CREATEC-CV IMCBTA/2020/46 from IVACE; and IHub-Data at IIIT-Hyderabad.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Garcia-Bordils, S. et al. (2022). Read While You Drive - Multilingual Text Tracking on the Road. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. Lecture Notes in Computer Science, vol 13237. Springer, Cham. https://doi.org/10.1007/978-3-031-06555-2_51
DOI: https://doi.org/10.1007/978-3-031-06555-2_51
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06554-5
Online ISBN: 978-3-031-06555-2
eBook Packages: Computer Science, Computer Science (R0)