DOI: 10.1145/3581783.3612174

Self-Supervised Cross-Language Scene Text Editing

Published: 27 October 2023

Abstract

We propose and formulate the task of cross-language scene text editing: modifying the text content of a scene image into new text in another language while preserving the scene text style and background texture. The key challenges of this task are the difficulty of distinguishing text from background, the large distribution differences among languages, and the lack of finely labeled real-world data. To tackle these problems, we propose a novel network named Cross-LAnguage Scene Text Editing (CLASTE), which separates the foreground text from the background and further decomposes the foreground text into content and style. Our model can be trained in a self-supervised manner on unlabeled, multi-language data from real-world scenarios, where the source images serve as both input and ground truth. Experimental results on a Chinese-English cross-language dataset show that our model generates realistic text images, specifically when modifying English to Chinese and vice versa. Furthermore, our method is universal and can be extended to other languages such as Arabic, Korean, Japanese, Hindi, and Bengali.
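To make the disentanglement idea concrete, the sketch below shows how an encoder might split a scene-text crop into background, text-content, and text-style features, with a decoder recombining them, and how the self-supervised objective can use the source image as its own ground truth. This is a minimal illustrative sketch in PyTorch; all module names, layer choices, and dimensions here are our own assumptions, not the paper's actual CLASTE architecture.

```python
# Minimal sketch of the disentanglement + self-supervised reconstruction
# idea from the abstract. All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class DisentangleEncoder(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        # Three heads: background texture, text content, text style.
        self.bg_head = nn.Conv2d(feat_dim, feat_dim, 1)
        self.content_head = nn.Conv2d(feat_dim, feat_dim, 1)
        self.style_head = nn.Conv2d(feat_dim, feat_dim, 1)

    def forward(self, x):
        h = self.backbone(x)
        return self.bg_head(h), self.content_head(h), self.style_head(h)

class Decoder(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim * 3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, 3, 3, padding=1),
        )

    def forward(self, bg, content, style):
        return self.net(torch.cat([bg, content, style], dim=1))

# Self-supervised reconstruction step: the source image serves as both
# input and ground truth, so no labeled target image is required.
enc, dec = DisentangleEncoder(), Decoder()
source = torch.randn(1, 3, 64, 256)  # dummy scene-text crop
recon = dec(*enc(source))
loss = nn.functional.l1_loss(recon, source)
loss.backward()
print(loss.item())
```

At editing time, one would swap the content features for those of the new-language text while keeping the style and background features, which is what makes the explicit factorization useful.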


Cited By

  • (2024) TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering. Computer Vision – ECCV 2024, pp. 386-402. DOI: 10.1007/978-3-031-72652-1_23. Online publication date: 30-Oct-2024




Published In

cover image ACM Conferences
MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cross-language
  2. gan
  3. image generation
  4. scene text editing
  5. self-supervised
  6. style transfer

Qualifiers

  • Research-article

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
