
Rethinking Supervision in Document Unwarping: A Self-Consistent Flow-Free Approach

Published: 23 November 2023 Publication History

Abstract

In recent years, the proliferation of smartphones has led to an upsurge in document digitization via these portable devices. However, images captured by smartphones often suffer from geometric distortions, which degrade digital preservation and downstream applications. To address this issue, we introduce DRNet, a novel deep network for document image rectification. Our approach rests on three key designs. First, we exploit the geometric consistency intrinsic to document images to guide the learning of distortion rectification. Second, we design a coarse-to-fine rectification network that leverages representations derived from the distorted document image to progressively refine the rectification result. Third, we propose a new perspective on supervising rectification networks: undistorted document images serve as the supervision signal, eliminating the warping-mesh ground truth required by existing methods. Technically, low-level pixel alignment and high-level semantic alignment jointly drive the learning of the mapping between deformed document images and their distortion-free counterparts. We evaluate our method on the challenging DocUNet Benchmark, where it sets a series of state-of-the-art records, demonstrating its superiority over existing learning-based solutions. A comprehensive series of ablation experiments further validates the effectiveness and merits of each component.
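The flow-free supervision described above can be illustrated with a minimal sketch. This is not the authors' implementation: the loss weights (`w_pix`, `w_feat`) and the patch-pooling "feature extractor" are hypothetical stand-ins (the paper's high-level semantic alignment would use deep features, e.g. from a pretrained network), used here only to show how pixel-level and feature-level terms combine so that the flat ground-truth image alone supervises training, with no warping mesh involved.

```python
import numpy as np

def pixel_alignment_loss(pred, target):
    # Low-level supervision: mean absolute difference between the
    # rectified output and the distortion-free ground-truth image.
    return np.mean(np.abs(pred - target))

def feature_alignment_loss(pred, target, patch=4):
    # High-level supervision stand-in: compare average-pooled patch
    # statistics. A real implementation would compare deep semantic
    # features; pooling keeps this sketch dependency-free.
    def pool(img):
        h, w = img.shape[0] - img.shape[0] % patch, img.shape[1] - img.shape[1] % patch
        img = img[:h, :w]
        return img.reshape(h // patch, patch, w // patch, patch, -1).mean(axis=(1, 3))
    return np.mean((pool(pred) - pool(target)) ** 2)

def rectification_loss(pred, target, w_pix=1.0, w_feat=0.5):
    # Joint objective: only the flat document image is needed as ground
    # truth, so no warping mesh has to be annotated or synthesized.
    return (w_pix * pixel_alignment_loss(pred, target)
            + w_feat * feature_alignment_loss(pred, target))
```

The loss is zero exactly when the rectified image matches the flat target, which is the self-consistency the supervision relies on; the relative weighting of the two terms is a tunable assumption here.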


Published In

IEEE Transactions on Circuits and Systems for Video Technology  Volume 34, Issue 6
June 2024
1070 pages

Publisher

IEEE Press


Qualifiers

  • Research-article
