TTS: Hilbert Transform-Based Generative Adversarial Network for Tattoo and Scene Text Spotting

Published: 19 March 2024

Abstract

Text spotting in natural scenes is of increasing interest and significance due to its critical role in several applications, such as visual question answering, named entity recognition and event rumor detection on social media. One of the newly emerging challenging problems is Tattoo Text Spotting (TTS) in images for assisting forensic teams and for person identification. Unlike the generally simpler scene text addressed by current state-of-the-art methods, tattoo text is typically characterized by the presence of decorative backgrounds, calligraphic handwriting and several distortions due to the deformable nature of the skin. This paper describes the first approach to address TTS in a real-world application context by designing an end-to-end text spotting method employing a Hilbert transform-based Generative Adversarial Network (GAN). To reduce the complexity of the TTS task, the proposed approach first detects fine details in the image using the Hilbert transform and the Optimum Phase Congruency (OPC). To overcome the challenges of only having a relatively small number of training samples, a GAN is then used for generating suitable text samples and descriptors for text spotting (i.e., both detection and recognition). The superior performance of the proposed TTS approach, for both tattoo and general scene text, over the state-of-the-art methods is demonstrated on a new TTS-specific dataset (publicly available) as well as on the existing benchmark natural scene text datasets: Total-Text, CTW1500 and ICDAR 2015.
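The fine-detail detection step named in the abstract can be made concrete with a small sketch. The paper's Optimum Phase Congruency (OPC) formulation is not reproduced here; the following is only a minimal, single-orientation phase-congruency measure built from the Hilbert transform in Python, where the difference-of-Gaussians band-pass and the `sigmas` scale set are illustrative assumptions rather than the authors' design.

```python
import numpy as np
from scipy.signal import hilbert
from scipy.ndimage import gaussian_filter1d

def phase_congruency_1d(signal, sigmas=(1, 2, 4), eps=1e-8):
    """Simplified single-orientation phase congruency via the Hilbert transform.

    For each scale, the band-pass response of the signal (difference of
    Gaussians) is paired with its Hilbert transform as the quadrature
    response. Phase congruency is the ratio of summed local energy to
    summed response amplitude; values near 1 mark step/line features.
    """
    even_sum = np.zeros_like(signal, dtype=float)
    odd_sum = np.zeros_like(signal, dtype=float)
    amp_sum = np.zeros_like(signal, dtype=float)
    for s in sigmas:
        # Band-pass filter: difference of two Gaussian smoothings.
        even = gaussian_filter1d(signal, s) - gaussian_filter1d(signal, 2 * s)
        odd = np.imag(hilbert(even))          # quadrature (Hilbert) response
        even_sum += even
        odd_sum += odd
        amp_sum += np.hypot(even, odd)        # per-scale local amplitude
    energy = np.hypot(even_sum, odd_sum)      # local energy across scales
    return energy / (amp_sum + eps)

# A 1D intensity profile with a single step edge at index 64:
row = np.r_[np.zeros(64), np.ones(64)]
pc = phase_congruency_1d(row)
print("phase congruency peaks near index", int(np.argmax(pc)))  # ~64
```

Because the measure normalizes local energy by summed amplitude, it responds to phase alignment rather than contrast, which is the property that lets phase-congruency features survive the low-contrast, decorated backgrounds typical of tattoo text.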
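The abstract's second component, a GAN that synthesizes additional training samples to offset the small TTS training set, is likewise only sketched generically below as a minimal adversarial training step in PyTorch. The network shapes, the latent dimension `Z_DIM`, the patch size `IMG`, and the flat fully-connected layers are hypothetical placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 64x64 grayscale text patches, 100-d latent code.
Z_DIM, IMG = 100, 64

generator = nn.Sequential(
    nn.Linear(Z_DIM, 256), nn.ReLU(),
    nn.Linear(256, IMG * IMG), nn.Tanh(),   # pixel values in [-1, 1]
)
discriminator = nn.Sequential(
    nn.Linear(IMG * IMG, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),                      # real/fake logit
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    """One adversarial update: discriminator on real vs. fake, then generator."""
    n = real_batch.size(0)
    fake = generator(torch.randn(n, Z_DIM))

    # Discriminator: label real patches 1, generated patches 0.
    d_loss = bce(discriminator(real_batch), torch.ones(n, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(n, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool the updated discriminator into labelling fakes as real.
    g_loss = bce(discriminator(fake), torch.ones(n, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Stand-in "real" patches (random here purely for illustration):
print(train_step(torch.rand(8, IMG * IMG) * 2 - 1))
```

In practice the real batch would be cropped tattoo-text patches from the TTS dataset; the alternating discriminator/generator updates shown here are the standard minimax recipe that any GAN-based sample generator, including the one described in the abstract, builds on.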



Published In

IEEE Transactions on Multimedia, Volume 26, 2024 (10405 pages)

Publisher

IEEE Press

Qualifiers

• Research-article