DOI: 10.1145/3581783.3612174

Self-Supervised Cross-Language Scene Text Editing

Published: 27 October 2023

Abstract

We propose and formulate the task of cross-language scene text editing: modifying the text content of a scene image into new text in another language while preserving the scene text style and background texture. The key challenges of this task are the difficulty of distinguishing text from background, the large distribution differences among languages, and the lack of finely labeled real-world data. To tackle these problems, we propose a novel network named Cross-LAnguage Scene Text Editing (CLASTE), which separates the foreground text from the background and further decomposes the foreground text into content and style. Our model can be trained in a self-supervised manner on unlabeled, multi-language data from real-world scenarios, where the source images serve as both input and ground truth. Experimental results on a Chinese-English cross-language dataset show that our model generates realistic text images, specifically when modifying English to Chinese and vice versa. Furthermore, our method is universal and can be extended to other languages such as Arabic, Korean, Japanese, Hindi, and Bengali.
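To make the disentanglement idea concrete, the sketch below shows how an encoder might split a scene-text crop into background, text-content, and text-style features, with a decoder recombining them, and how the self-supervised objective can use the source image as its own ground truth. This is a minimal illustrative sketch in PyTorch; all module names, layer choices, and dimensions here are our own assumptions, not the paper's actual CLASTE architecture.

```python
# Minimal sketch of the disentanglement + self-supervised reconstruction
# idea from the abstract. All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class DisentangleEncoder(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        # Three heads: background texture, text content, text style.
        self.bg_head = nn.Conv2d(feat_dim, feat_dim, 1)
        self.content_head = nn.Conv2d(feat_dim, feat_dim, 1)
        self.style_head = nn.Conv2d(feat_dim, feat_dim, 1)

    def forward(self, x):
        h = self.backbone(x)
        return self.bg_head(h), self.content_head(h), self.style_head(h)

class Decoder(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim * 3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, 3, 3, padding=1),
        )

    def forward(self, bg, content, style):
        return self.net(torch.cat([bg, content, style], dim=1))

# Self-supervised reconstruction step: the source image serves as both
# input and ground truth, so no labeled target image is required.
enc, dec = DisentangleEncoder(), Decoder()
source = torch.randn(1, 3, 64, 256)  # dummy scene-text crop
recon = dec(*enc(source))
loss = nn.functional.l1_loss(recon, source)
loss.backward()
print(loss.item())
```

At editing time, one would swap the content features for those of the new-language text while keeping the style and background features, which is what makes the explicit factorization useful.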


Cited By

  • (2024) TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering. Computer Vision – ECCV 2024, pp. 386-402. DOI: 10.1007/978-3-031-72652-1_23. Online publication date: 30-Oct-2024




Published In

cover image ACM Conferences
MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cross-language
  2. gan
  3. image generation
  4. scene text editing
  5. self-supervised
  6. style transfer

Qualifiers

  • Research-article

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
