
DOI: 10.1609/aaai.v37i2.25282

DocEdit: Language-Guided Document Editing

Published: 07 February 2023

Abstract

Professional document editing tools require a certain level of expertise to perform complex edit operations. To make editing tools accessible to increasingly novice users, we investigate intelligent document assistant systems that can make or suggest edits based on a user's natural language request. Such a system should be able to understand the user's ambiguous requests and contextualize them to the visual cues and textual content found in a document image to edit localized unstructured text and structured layouts. To this end, we propose a new task of language-guided localized document editing, where the user provides a document and an open vocabulary editing request, and the intelligent system produces a command that can be used to automate edits in real-world document editing software. In support of this task, we curate the DocEdit dataset, a collection of approximately 28K instances of user edit requests over PDF and design templates along with their corresponding ground truth software executable commands. To our knowledge, this is the first dataset that provides a diverse mix of edit operations with direct and indirect references to the embedded text and visual objects such as paragraphs, lists, tables, etc. We also propose DocEditor, a Transformer-based localization-aware multimodal (textual, spatial, and visual) model that performs the new task. The model attends to both document objects and related text contents which may be referred to in a user edit request, generating a multimodal embedding that is used to predict an edit command and associated bounding box localizing it. Our proposed model empirically outperforms other baseline deep learning approaches by 15-18%, providing a strong starting point for future work.
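
To make the task definition above concrete, the following is a minimal Python sketch of one plausible shape for the task's output: a structured, software-executable edit command paired with a bounding box that localizes the edit on the page. Every field name and value below is an illustrative assumption; it is not the DocEdit dataset's actual command grammar, and the parsing stub merely stands in for the DocEditor model described in the abstract.

# Hypothetical illustration of the task's input/output contract. Field
# names and command vocabulary are assumptions made for this example,
# not the DocEdit dataset's real schema.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class EditCommand:
    """A structured, software-executable edit plus its localization."""
    action: str        # e.g. "modify", "delete", "move" (illustrative values)
    component: str     # document object being edited, e.g. "paragraph", "table"
    attribute: Optional[str]    # property to change, e.g. "font_style"
    final_state: Optional[str]  # desired value of that property
    bbox: Tuple[float, float, float, float]  # (x, y, w, h), normalized page coords


def parse_request_stub(request: str) -> EditCommand:
    """Stand-in for the model: maps a natural-language request to a command.

    A real system would ground the request in the document image's visual
    and textual cues; this stub just returns a fixed example.
    """
    # Example request: "Make the heading at the top of the page bold"
    return EditCommand(
        action="modify",
        component="heading",
        attribute="font_style",
        final_state="bold",
        bbox=(0.08, 0.05, 0.84, 0.07),
    )


if __name__ == "__main__":
    cmd = parse_request_stub("Make the heading at the top of the page bold")
    print(cmd)

An executable command of this kind is what would let downstream document editing software apply the edit automatically, with the bounding box resolving ambiguous references ("the heading at the top") to a concrete region of the page.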


Information

Published In

AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence
February 2023
16496 pages
ISBN: 978-1-57735-880-0

Sponsors

  • Association for the Advancement of Artificial Intelligence

Publisher

AAAI Press


Qualifiers

  • Research-article
  • Research
  • Refereed limited
