1 Introduction

The Semantic Web community is constantly pushing the boundaries of processing and producing knowledge that is understandable by both humans and machines. Nonetheless, when it comes to scientific content, much of the information is distributed by editorial publishers locked behind proprietary formatting - e.g., in PDF files - which is not directly machine readable. In recent years many scientific publishers have been showcasing the benefits of augmenting scholarly content with semantic information, while academic initiatives have been encouraging the idea of "semantic publishing" [5], where the authors themselves augment their scientific papers with semantic annotations.

Despite many such initiatives, any information system that relies on scientific knowledge still needs to extract information from various proprietary document formats, very often PDFs. Many approaches to information extraction start from the assumption that raw text is available, neglecting the task of obtaining such text from diverse sources (Web pages, text documents, PDF documents, etc.). Nonetheless, in many business applications it is extremely important to (i) ensure ease and accuracy of extraction whatever the input format and (ii) perform the annotation and extraction tasks directly on the input documents, without introducing costly (and often disruptive for the end user) format transformations.

We propose a strategy to perform such information extraction tasks directly on the input documents, so as to be completely transparent to the end user. As our focus is on PDF documents, we add small task-specific semantic annotators directly into the PDF, whose results are thus viewable with a standard PDF reader. These new semantic annotators are trained on demand, with a human-in-the-loop methodology, and modularly added to the document.
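To make the modularity concrete, the following minimal sketch shows what such a pluggable annotator interface could look like in Python; the `Annotator` and `Span` names are hypothetical and not part of the actual system.

```python
# A minimal sketch of a modular annotator interface; names are
# illustrative, not the authors' actual API.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Span:
    start: int  # character offset where the annotation begins
    end: int    # character offset where the annotation ends
    label: str  # annotation type, e.g. "ADR" or "Drug"


class Annotator(ABC):
    """One task-specific annotator; instances can be added to a
    document's annotation pipeline independently of each other."""

    @abstractmethod
    def annotate(self, text: str) -> list[Span]:
        """Return all spans this annotator recognizes in `text`."""

    def retrain(self, corrections: list[Span]) -> None:
        """Optionally update the underlying model from SME feedback."""
```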

In the demo we showcase an application that extracts semantic information from medical documents, specifically from publicly available patient package inserts distributed in PDF format. We train and apply (i) an ontology-based Named Entity Recognition (NER) tool that recognizes adverse drug events, (ii) a sentence annotator that identifies whole sentences expressing a relation between a drug and a potential Adverse Drug Reaction, and (iii) a knowledge-based lookup annotator that identifies drugs in the text.
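Of the three, the knowledge-based lookup annotator (iii) is the simplest to illustrate. The sketch below, reusing the hypothetical `Annotator` and `Span` types from above, matches a dictionary of drug names against the text; the system's actual lookup logic is not published, so this is only an assumed implementation.

```python
# A minimal sketch of a dictionary-lookup drug annotator with a
# hypothetical drug list; not the system's actual matching logic.
import re


class DrugLookupAnnotator(Annotator):
    def __init__(self, drug_names: set[str]):
        # Compile one alternation over all known drug names,
        # longest first so "warfarin sodium" wins over "warfarin".
        pattern = "|".join(
            re.escape(n) for n in sorted(drug_names, key=len, reverse=True)
        )
        self._regex = re.compile(rf"\b({pattern})\b", re.IGNORECASE)

    def annotate(self, text: str) -> list[Span]:
        return [
            Span(m.start(), m.end(), "Drug")
            for m in self._regex.finditer(text)
        ]


# Usage on a package-insert sentence:
annotator = DrugLookupAnnotator({"warfarin", "aspirin"})
print(annotator.annotate("Concomitant use of aspirin may increase bleeding."))
```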

The advantage of our proposed solution is its ability to unlock semantic information from proprietary documents - especially PDFs - seamlessly, while allowing new annotators to be added modularly on demand, after a training interaction with the end user.

2 State of the Art

Much of today’s scientific and technical content is locked behind proprietary document formats, making it difficult for analytic systems to consume.

In recent years many scientific publishers have been showcasing the benefits of augmenting scholarly content with semantic information; examples include the SciGraph project by Springer Nature and the Dynamic Knowledge Platforms (DKP) by Elsevier, among others. Academic projects such as Biotea [3] pursue the same goal of creating machine-readable and sharable knowledge extracted from scientific content in proprietary formats (specifically XML files). Academic initiatives have been encouraging the idea of “semantic publishing” [5], where the authors themselves augment their scientific papers with semantic annotations instead of relying on post-processing information extraction performed by third parties. Other initiatives aim at maintaining sharable knowledge about the metadata of scientific publications [4].

While significant effort has been put into extracting and maintaining semantic information from scientific publications, much of the content is still locked inside PDF files. This is even more true for technical documents that are not necessarily scientific papers but still contain extremely valuable information, e.g., the Medication Package Inserts use case that we present in this demo. Several efforts in the literature explore extracting information from PDFs directly. Early examples [7, 8] focus on parsing textual content and extracting structured information, but do so without preserving the user’s interaction with the original files. This is undesirable, especially in cases where the layout of the text (e.g., tables) or ancillary information (e.g., chemical structures or other illustrations) provides critical context for understanding the text. More recent examples exploit the specific structure of certain PDF files, thus also using specific visual clues of the documents to train the extraction models [1, 2, 6].

On the other hand, we propose a solution that is agnostic to any specific structure of the input file and is fully integrated within the PDF, and can thus be viewed with the PDF reader the subject matter expert is already using. Our solution allows the user to visually identify the information that is semantically relevant for their business case. This information is used by the system to train semantic annotators, whose results are then integrated directly into the PDF itself.

3 Use Case: Extracting Semantics from Medication Package Inserts

During the demo we will showcase how we use the system to ingest, annotate and semantically enrich Medication Package Inserts.

A package insert (PI) is a document included in the package of a medication that provides information about that drug and its use. In the U.S.A., all pharmaceutical companies must annually provide updated information about all their drugs, including the package inserts, to the U.S. Food and Drug Administration (FDA). All this information is then made publicly available on the FDA Web site. DAILYMED provides access to 106,938 drug listings, including daily updates. Such daily updates can be very useful for monitoring changes in the package inserts: for example, new adverse drug reactions may be added, the dosage of a drug may change, new drug interactions may be discovered, etc. Such information is highly valuable to patients and medical practitioners. However, manually monitoring for such changes is not viable, and automation is needed. With our tool, we can speed up the process of identifying the most relevant information in the updated documents and present it to the subject matter experts, e.g., drug name mentions, adverse drug reactions, dosage terms, and important textual changes.
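As a rough illustration of how such change monitoring could work, the sketch below diffs the extracted text of two versions of the same insert using Python's standard difflib; it assumes plain text has already been extracted from the PDFs and is not part of the described system.

```python
# A minimal sketch of change monitoring between two versions of the
# same package insert, flagging added/removed lines such as new
# adverse reactions or dosage changes.
import difflib


def insert_changes(old_text: str, new_text: str) -> list[str]:
    """Return the lines that differ between two insert versions."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="previous", tofile="current", lineterm="",
    )
    # Keep only real additions/removals, not the diff headers.
    return [
        line for line in diff
        if line.startswith(("+", "-"))
        and not line.startswith(("+++", "---"))
    ]


old = "Common reactions: nausea, headache."
new = "Common reactions: nausea, headache, dizziness."
for line in insert_changes(old, new):
    print(line)
```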

Fig. 1. Proprietary document annotation. System workflow.

3.1 Annotating Patient Package Inserts

The system takes as input a collection of documents \(D = \{d_1, d_2, \dots, d_n\}\) (PIs in this case) and enriches each document \(d_i\) with semantic annotations. In the instance of the system that we will demonstrate, the implemented semantic annotators produce a set of entities \(E = \{e_1, e_2, \dots, e_k\}\) and textual annotations \(A = \{a_1, a_2, \dots, a_m\}\). The entities \(e_i\) are either resolved to a specific Knowledge Base - as in the case of Adverse Drug Reactions (ADRs), which we resolve using the MedDRA ontology - or simply typed - as in the case of Drugs, where we use a dictionary lookup mechanism. The textual annotations \(a_i\) in this demo are sentences that identify salient information as defined by a subject matter expert; specifically, these are sentences that express a potential causal relation between a drug and an ADR. Figure 1 shows the overall workflow of the system. In essence, the annotation process runs in two phases: (i) initialization and (ii) adjudication.
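The distinction between resolved and merely typed entities can be made concrete with a small data model; the sketch below is illustrative only, and the MedDRA code shown is a hypothetical placeholder.

```python
# A minimal sketch of the two annotation kinds formalized above:
# entities e_i (optionally resolved to a knowledge base such as
# MedDRA) and textual annotations a_i (salient sentences). Field
# names are illustrative.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entity:
    text: str                    # surface form found in the document
    etype: str                   # e.g. "ADR" or "Drug"
    kb_id: Optional[str] = None  # KB identifier; None if only typed


@dataclass
class TextualAnnotation:
    sentence: str                # the salient sentence itself
    label: str                   # e.g. "drug-ADR relation"


resolved = Entity("dizziness", "ADR", kb_id="MedDRA:0000000")  # placeholder code
typed_only = Entity("warfarin", "Drug")
```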

Initialization. The subject matter expert uploads the desired collection of documents to the system, provides some seed annotations of the semantic information they want to extract, and specifies external Knowledge Resources to be used, if any. Once a small number of annotations is provided, the system builds a learning model for each annotation type.
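As an assumed illustration of this bootstrapping step, the sketch below trains a first sentence classifier from a few seed examples using scikit-learn; the actual learning method is not specified here, so both the model choice and the seed data are placeholders.

```python
# A minimal sketch of the initialization step for the sentence
# annotator (ii): a handful of SME-labeled seed sentences train a
# first model. scikit-learn is used purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

seed_sentences = [
    "Warfarin may cause serious bleeding.",         # drug-ADR relation
    "Dizziness has been reported after overdose.",  # drug-ADR relation
    "Store at room temperature.",                   # no relation
    "Tablets are supplied in bottles of 30.",       # no relation
]
seed_labels = [1, 1, 0, 0]  # 1 = sentence expresses a drug-ADR relation

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(seed_sentences, seed_labels)

# Apply the initial model to an unseen sentence:
print(model.predict(["Aspirin may increase the risk of gastric bleeding."]))
```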

Adjudication. The initial models are applied to the whole document collection. The SME adjudicates the produced semantic annotations and can (i) correct mistakes and (ii) identify and add missing annotations. After each batch of corrections (where the batch size can be adjusted), the models are retrained and reapplied to the rest of the documents. New semantic annotation types can be added at any time, i.e., once a new item is added, the models are retrained and become able to identify the new item, whether it is an entity or a textual annotation.
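The batched retraining loop could be sketched as follows; `model`, `sme_review`, and `batch_size` are hypothetical stand-ins for the system's actual components.

```python
# A minimal sketch of the adjudication loop: apply the current model,
# collect SME corrections in batches, retrain, and continue on the
# remaining documents.
def adjudicate(model, documents, sme_review, batch_size=10):
    corrections = []
    for doc in documents:
        predicted = model.annotate(doc)         # current model's annotations
        corrected = sme_review(doc, predicted)  # SME fixes/adds annotations
        corrections.extend(corrected)
        if len(corrections) >= batch_size:      # retrain after each batch
            model.retrain(corrections)
            corrections.clear()
    if corrections:                             # flush the final partial batch
        model.retrain(corrections)
    return model
```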

With such a system, the SME has full control over which types of semantic annotations are going to be identified, and they can enforce that the accuracy of the system is always above a certain threshold (even 100% if they are willing, or required, to manually review the whole collection). The system simply assists the user, improving their efficiency in identifying the semantic annotations of interest and reducing human error. The produced semantic annotations enrich the initial document without altering its layout (and thus without obscuring the context needed to understand the text). To achieve this, we add semantic layers on top of the original document, where each layer contains the information for a specific semantic annotator. Figure 2 shows an example of an annotated PI as displayed in Adobe Acrobat Reader. The results of the semantic annotators can be toggled on and off at will. For annotators that implement entity resolution, the recognized entities are links that refer to the external sources.
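A standard way to realize such toggleable layers in a PDF is through optional content groups (OCGs), which Acrobat Reader displays as layers. The sketch below uses the PyMuPDF library to attach a highlight annotation to an OCG; it is a minimal illustration assuming a recent PyMuPDF version, not the authors' actual implementation, and the file names are placeholders.

```python
# A minimal sketch of per-annotator layers using PyMuPDF (fitz):
# each annotator's results go into an optional content group (OCG),
# which the PDF viewer exposes as a toggleable layer.
import fitz  # PyMuPDF

doc = fitz.open("package_insert.pdf")

# One OCG (layer) per semantic annotator, toggleable in the viewer.
adr_layer = doc.add_ocg("Adverse Drug Reactions", on=True)

page = doc[0]
for rect in page.search_for("dizziness"):  # spans found by the ADR annotator
    annot = page.add_highlight_annot(rect)
    annot.set_oc(adr_layer)                # attach annotation to the layer
    annot.update()

doc.save("package_insert_annotated.pdf")
```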

Fig. 2. Example of an annotated Medication Package Insert.

4 Conclusions and Future Work

In this demo we showcase our solution to perform on-demand information extraction directly from PDF files. The key strength of our solution is that the subject matter expert can train new semantic annotators that are directly integrated into their PDF viewer tool. Being able to semantically annotate PDF files is an extremely important capability in many business scenarios. In this demo we show an application with Medication Package Inserts, but one can envision many other scenarios, such as scientific publishing, legal contracts, and the recruiting business - where many applicants’ resumes are PDF files - to mention just a few. In all these scenarios it is crucial to be able to quickly train the extraction models and to deliver the results in a way that is familiar to the subject matter expert, without disrupting any existing workflow.