A Library Perspective on Nearly-Unsupervised Information Extraction Workflows in Digital Libraries

Research article
DOI: 10.1145/3529372.3530924
Published: 20 June 2022

Abstract

Information extraction can support novel and effective access paths for digital libraries. Nevertheless, designing reliable extraction workflows can be cost-intensive in practice. On the one hand, suitable extraction methods rely on domain-specific training data. On the other hand, unsupervised and open extraction methods usually produce non-canonicalized extraction results. This paper tackles the question of how digital libraries can handle such extractions and whether their quality is sufficient in practice. We focus on unsupervised extraction workflows by analyzing them in case studies in the domains of encyclopedias (Wikipedia), pharmacy, and political sciences. We report on opportunities and limitations. Finally, we discuss best practices for unsupervised extraction workflows.

Cited By

  • (2024) A detailed library perspective on nearly unsupervised information extraction workflows in digital libraries. International Journal on Digital Libraries 25:2 (401-425). DOI: 10.1007/s00799-023-00368-z. Online publication date: 1-Jun-2024
  • (2023) A discovery system for narrative query graphs: entity-interaction-aware document retrieval. International Journal on Digital Libraries 25:1 (3-24). DOI: 10.1007/s00799-023-00356-3. Online publication date: 24-Apr-2023

        Published In

        JCDL '22: Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries
        June 2022
        392 pages
        ISBN: 9781450393454
        DOI: 10.1145/3529372
        General Chairs: Akiko Aizawa, Thomas Mandl, Zeljko Carevic
        Program Chairs: Annika Hinze, Philipp Mayr, Philipp Schaer
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        In-Cooperation

        • IEEE Technical Committee on Digital Libraries (TC DL)

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. digital libraries
        2. open information extraction
        3. workflows

        Qualifiers

        • Research-article

        Conference

        JCDL '22

        Acceptance Rates

        JCDL '22 Paper Acceptance Rate: 35 of 132 submissions, 27%
        Overall Acceptance Rate: 415 of 1,482 submissions, 28%
