DOI: 10.1145/3689094.3689471
Research article | Open access

Historical Postcards Retrieval through Vision Foundation Models

Published: 28 October 2024

Abstract

The analysis of historical documents poses challenges for automated processing, owing to degradation over time and the need to extract information from large data sources. This paper proposes a two-stage methodology: text extraction using Optical Character Recognition (OCR), followed by retrieval with Vision Foundation Models (VFMs) over a challenging collection of 4,294 historical postcards from the East of France. This approach allows users to effortlessly find postcards that match their interests and preferences. The VFMs, together with the textual information extracted from the postcards, play a key role in this process by providing a robust and efficient way to match user queries to relevant postcards in the dataset. Because VFMs are trained on large datasets, they reduce dependence on annotated data and improve model versatility. For the retrieval stage, we select two VFMs, CLIP and DINOv2, and evaluate their performance with quantitative metrics to identify the model yielding the best results.
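The retrieval stage described above boils down to comparing a query embedding against precomputed postcard embeddings and returning the closest matches. The sketch below is not the authors' code: the function name is hypothetical and random vectors stand in for real CLIP or DINOv2 features, which would be computed offline for the whole collection. It only illustrates the cosine-similarity top-k lookup such a pipeline relies on.

```python
import numpy as np

def top_k_postcards(query_emb, postcard_embs, k=5):
    """Return indices of the k gallery postcards whose embeddings are
    most similar (cosine similarity) to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    p = postcard_embs / np.linalg.norm(postcard_embs, axis=1, keepdims=True)
    sims = p @ q                    # cosine similarity of each postcard to the query
    return np.argsort(-sims)[:k]   # indices of the best matches, best first

# Toy demo: random vectors standing in for VFM features.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(10, 512))              # 10 postcards, 512-d embeddings
query = gallery[3] + 0.01 * rng.normal(size=512)  # a query very close to postcard 3
print(top_k_postcards(query, gallery, k=3))       # postcard 3 should rank first
```

At the scale of a few thousand postcards this brute-force matrix product is already fast; for much larger collections, an approximate nearest-neighbour index would replace the exhaustive scan.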


Published In

SUMAC '24: Proceedings of the 6th Workshop on the analySis, Understanding and proMotion of heritAge Contents
October 2024, 67 pages
ISBN: 9798400712050
DOI: 10.1145/3689094
Program Chairs: Valérie Gouet-Brunet, Ronak Kosti, Li Weng

This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. historical postcards
  2. image retrieval
  3. information extraction
  4. natural language processing
  5. OCR
  6. vision foundation models


Funding Sources

  • Région Grand Est, France

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne, VIC, Australia

Acceptance Rates

Overall acceptance rate: 5 of 6 submissions (83%)

Article Metrics

  • Total citations: 0
  • Total downloads: 123
  • Downloads (last 12 months): 123
  • Downloads (last 6 weeks): 50

Reflects downloads up to 16 Feb 2025
