Abstract
Tampered text detection in images, the task of locating manipulated or forged text in document images or signage, has attracted increasing attention with the widespread use of image editing software and CNN-based synthesis techniques. The difficulty of perceiving the subtle differences in tampered text images lies in the gap between a model's ability to capture global yet fine-grained information and what the task actually demands. In this work, we propose a robust detection method, Tampered Text Detection with Mamba (TTDMamba). It achieves linear complexity without sacrificing global spatial context, sidestepping a key limitation of the Transformer architecture. Specifically, we adopt the VMamba architecture as the encoder and introduce a High-frequency Feature Aggregation module that enriches the visual features with complementary high-frequency signals, guiding Mamba's attention toward fine-grained forgery traces. In addition, we integrate Disentangled Semantic Axial Attention into the stacked Visual State Space blocks of VMamba, injecting the high-level semantic attributes inherent to the tampered image into the pretrained hierarchical encoder. As a result, we obtain a more reliable and accurate tamper map. Extensive experiments on the T-SROIE, T-IC13, and DocTamper datasets demonstrate that TTDMamba not only surpasses existing state-of-the-art methods in detection accuracy but also exhibits superior robustness in pixel-level forgery localization.
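To make the pipeline concrete, below is a minimal PyTorch sketch of the design the abstract outlines: high-frequency cues extracted from the input, fused into encoder features, refined by axial attention, and decoded into a per-pixel tamper map. All module names, the fixed Laplacian high-pass filter, and the plain-convolution stem standing in for the VMamba encoder are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed design), not the authors' code: a conv stem stands
# in for the VMamba encoder, a fixed Laplacian filter for the high-frequency
# branch, and row/column attention for the semantic axial attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighFreqAggregation(nn.Module):
    """Extract high-frequency residuals with a fixed Laplacian kernel and
    project them to match the stem's feature map (hypothetical module)."""
    def __init__(self, out_ch=96):
        super().__init__()
        k = torch.tensor([[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]])
        self.register_buffer("kernel", k.view(1, 1, 3, 3).repeat(3, 1, 1, 1))
        self.proj = nn.Conv2d(3, out_ch, 4, stride=4)  # match the 4x stem

    def forward(self, x):                       # x: (B, 3, H, W)
        hf = F.conv2d(x, self.kernel, padding=1, groups=3)  # depthwise high-pass
        return self.proj(hf)                    # (B, out_ch, H/4, W/4)

class AxialAttention(nn.Module):
    """Toy stand-in for axial attention: attend along rows, then columns,
    keeping cost linear in one spatial axis at a time."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.row = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        r = x.permute(0, 2, 3, 1).reshape(B * H, W, C)   # rows as sequences
        r, _ = self.row(r, r, r)
        x = r.reshape(B, H, W, C)
        c = x.permute(0, 2, 1, 3).reshape(B * W, H, C)   # columns as sequences
        c, _ = self.col(c, c, c)
        return c.reshape(B, W, H, C).permute(0, 3, 2, 1)  # back to (B, C, H, W)

class TTDMambaSketch(nn.Module):
    def __init__(self, dim=96):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, 4, stride=4)  # placeholder for VMamba stem
        self.hfa = HighFreqAggregation(dim)
        self.attn = AxialAttention(dim)
        self.head = nn.Conv2d(dim, 1, 1)            # per-pixel tamper logits

    def forward(self, x):
        feat = self.stem(x) + self.hfa(x)   # inject high-frequency cues
        feat = feat + self.attn(feat)       # semantic axial refinement
        logits = self.head(feat)
        return F.interpolate(logits, size=x.shape[-2:], mode="bilinear",
                             align_corners=False)  # full-resolution tamper map

img = torch.randn(1, 3, 512, 512)
print(TTDMambaSketch()(img).shape)  # torch.Size([1, 1, 512, 512])
```

The additive fusion of high-frequency and RGB features and the residual axial-attention step are one plausible reading of the abstract; the paper's actual modules likely differ in placement and internals.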
Acknowledgement.
This work is funded by the Beijing Municipal Science and Technology Project (No. Z231100010323005), the Beijing Nova Program (20230484276), the Youth Innovation Promotion Association CAS (Grant No. 2022132), and the National Natural Science Foundation of China (Grant No. 62206277).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Sun, H., Cao, J., Zhang, Z., Wu, T., Zhou, K., Huang, H. (2025). Learning Fine-Grained and Semantically Aware Mamba Representations for Tampered Text Detection in Images. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15037. Springer, Singapore. https://doi.org/10.1007/978-981-97-8511-7_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8510-0
Online ISBN: 978-981-97-8511-7
eBook Packages: Computer Science, Computer Science (R0)