1 Introduction

The Semantic Web community is constantly pushing the boundaries of processing and producing knowledge that is understandable by both humans and machines. Nonetheless, when it comes to scientific content, much of the information is distributed by editorial publishers locked behind proprietary formatting - e.g., in PDF files - which is not directly machine readable. In recent years many scientific publishers have been showcasing the benefits of augmenting scholarly content with semantic information, while academic initiatives have been encouraging the idea of "semantic publishing" [5], where the authors themselves augment their scientific papers with semantic annotations.

Despite many such initiatives, any information system that relies on scientific knowledge still needs to extract information from various proprietary document formats, very often PDFs. Many approaches to information extraction start from the assumption that raw text is available, neglecting the task of obtaining such text from diverse sources (Web pages, text documents, PDF documents, etc.). Nonetheless, in many business applications it is extremely important to (i) ensure ease and accuracy of extraction whatever the input format and (ii) perform the annotation and extraction tasks directly on the input documents, without introducing costly (and often disruptive for the end user) format transformations.

We propose a strategy to perform such information extraction tasks directly on the input documents, so as to be completely transparent to the end user. As our focus is on PDF documents, we add small task-specific semantic annotators directly into the PDF, whose results are thus viewable with a standard PDF reader. These new semantic annotators are trained on demand, with a human-in-the-loop methodology, and modularly added to the document.
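To make the modularity concrete, the following minimal sketch shows what such a pluggable annotator interface could look like in Python; the `Annotator` and `Span` names are hypothetical and not part of the actual system.

```python
# A minimal sketch of a modular annotator interface; names are
# illustrative, not the authors' actual API.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Span:
    start: int  # character offset where the annotation begins
    end: int    # character offset where the annotation ends
    label: str  # annotation type, e.g. "ADR" or "Drug"


class Annotator(ABC):
    """One task-specific annotator; instances can be added to a
    document's annotation pipeline independently of each other."""

    @abstractmethod
    def annotate(self, text: str) -> list[Span]:
        """Return all spans this annotator recognizes in `text`."""

    def retrain(self, corrections: list[Span]) -> None:
        """Optionally update the underlying model from SME feedback."""
```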

In the demo we showcase an application that extracts semantic information from medical documents, specifically from publicly available patient package inserts distributed in PDF format. We train and apply (i) an ontology-based Named Entity Recognition (NER) tool that recognizes adverse drug events, (ii) a sentence annotator that identifies whole sentences expressing a relation between a drug and a potential Adverse Drug Reaction, and (iii) a knowledge-based lookup annotator that identifies drugs in the text.
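Of the three, the knowledge-based lookup annotator (iii) is the simplest to illustrate. The sketch below, reusing the hypothetical `Annotator` and `Span` types from above, matches a dictionary of drug names against the text; the system's actual lookup logic is not published, so this is only an assumed implementation.

```python
# A minimal sketch of a dictionary-lookup drug annotator with a
# hypothetical drug list; not the system's actual matching logic.
import re


class DrugLookupAnnotator(Annotator):
    def __init__(self, drug_names: set[str]):
        # Compile one alternation over all known drug names,
        # longest first so "warfarin sodium" wins over "warfarin".
        pattern = "|".join(
            re.escape(n) for n in sorted(drug_names, key=len, reverse=True)
        )
        self._regex = re.compile(rf"\b({pattern})\b", re.IGNORECASE)

    def annotate(self, text: str) -> list[Span]:
        return [
            Span(m.start(), m.end(), "Drug")
            for m in self._regex.finditer(text)
        ]


# Usage on a package-insert sentence:
annotator = DrugLookupAnnotator({"warfarin", "aspirin"})
print(annotator.annotate("Concomitant use of aspirin may increase bleeding."))
```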

The advantage of our proposed solution is its ability to unlock semantic information from proprietary documents - especially PDFs - seamlessly, while allowing new annotators to be added modularly on demand, after a training interaction with the end user.

2 State of the Art

Much of today’s scientific and technical content is locked behind proprietary document formats, making it difficult for analytic systems to consume.

In recent years many scientific publishers have been showcasing the benefits of augmenting scholarly content with semantic information; examples include the SciGraph project by Springer Nature and the Dynamic Knowledge Platforms (DKP) by Elsevier, among others. Academic projects such as Biotea [3] pursue the same goal of creating machine-readable and sharable knowledge extracted from scientific content in proprietary formats (specifically XML files). Academic initiatives have been encouraging the idea of “semantic publishing” [5], where the authors themselves augment their scientific papers with semantic annotations instead of relying on post-processing information extraction performed by third parties. Other initiatives aim at maintaining sharable knowledge about the metadata of scientific publications [4].

While significant effort has been put into extracting and maintaining semantic information from scientific publications, much of the content is still locked inside PDF files. This is even more true for technical documents that are not necessarily scientific papers but still contain extremely valuable information, e.g., the Medication Package Inserts use case that we present in this demo. Several efforts in the literature explore extracting information from PDFs directly. Early examples [7, 8] focus on parsing textual content and extracting structured information, but do so without preserving the user’s interaction with the original files. This is undesirable, especially in cases where the layout of the text (e.g., tables) or ancillary information (e.g., chemical structures or other illustrations) provides critical context for understanding the text. More recent examples exploit the specific structure of certain PDF files, thus also using specific visual clues of the documents to train the extraction models [1, 2, 6].

On the other hand, we propose a solution that is agnostic to any specific structure of the input file and is fully integrated within the PDF, and can thus be viewed with the PDF reader the subject matter expert is already using. Our solution allows the user to visually identify the information that is semantically relevant for their business case. This information is used by the system to train semantic annotators, whose results are then integrated directly into the PDF itself.

3 Use Case: Extracting Semantics from Medication Package Inserts

During the demo we will showcase how we use the system to ingest, annotate and semantically enrich Medication Package Inserts.

A package insert (PI) is a document included in the package of a medication that provides information about that drug and its use. In the U.S.A., all pharmaceutical companies must annually provide updated information about all their drugs, including the package inserts, to the U.S. Food and Drug Administration (FDA). All this information is then made publicly available on the FDA Web site. DAILYMED provides access to 106,938 drug listings, including daily updates. Such daily updates can be very useful for monitoring changes in the package inserts: for example, new adverse drug reactions may be added, the dosage of a drug may change, new drug interactions may be discovered, etc. Such information is highly valuable to patients and medical practitioners. However, manually monitoring for such changes is not viable, and automation is needed. With our tool, we can speed up the process of identifying the most relevant information in the updated documents and present it to the subject matter experts, e.g., drug name mentions, adverse drug reactions, dosage terms, and important textual changes.
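As a rough illustration of how such change monitoring could work, the sketch below diffs the extracted text of two versions of the same insert using Python's standard difflib; it assumes plain text has already been extracted from the PDFs and is not part of the described system.

```python
# A minimal sketch of change monitoring between two versions of the
# same package insert, flagging added/removed lines such as new
# adverse reactions or dosage changes.
import difflib


def insert_changes(old_text: str, new_text: str) -> list[str]:
    """Return the lines that differ between two insert versions."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="previous", tofile="current", lineterm="",
    )
    # Keep only real additions/removals, not the diff headers.
    return [
        line for line in diff
        if line.startswith(("+", "-"))
        and not line.startswith(("+++", "---"))
    ]


old = "Common reactions: nausea, headache."
new = "Common reactions: nausea, headache, dizziness."
for line in insert_changes(old, new):
    print(line)
```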

Fig. 1. Proprietary document annotation. System workflow.

3.1 Annotating Patient Package Inserts

The system takes as input a collection of documents \(D = \{d_1, d_2, \dots, d_n\}\) (PIs in this case) and enriches each document \(d_i\) with semantic annotations. In the instance of the system that we will demonstrate, the implemented semantic annotators produce a set of entities \(E = \{e_1, e_2, \dots, e_k\}\) and textual annotations \(A = \{a_1, a_2, \dots, a_m\}\). The entities \(e_i\) are either resolved to a specific Knowledge Base - as in the case of Adverse Drug Reactions (ADRs), which we resolve using the MedDRA ontology - or simply typed - as in the case of Drugs, where we use a dictionary lookup mechanism. The textual annotations \(a_i\) in this demo are sentences that identify salient information as defined by a subject matter expert; specifically, these are sentences that express a potential causal relation between a drug and an ADR. Figure 1 shows the overall workflow of the system. In essence, the annotation process runs in two phases: (i) initialization and (ii) adjudication.
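The distinction between resolved and merely typed entities can be made concrete with a small data model; the sketch below is illustrative only, and the MedDRA code shown is a hypothetical placeholder.

```python
# A minimal sketch of the two annotation kinds formalized above:
# entities e_i (optionally resolved to a knowledge base such as
# MedDRA) and textual annotations a_i (salient sentences). Field
# names are illustrative.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entity:
    text: str                    # surface form found in the document
    etype: str                   # e.g. "ADR" or "Drug"
    kb_id: Optional[str] = None  # KB identifier; None if only typed


@dataclass
class TextualAnnotation:
    sentence: str                # the salient sentence itself
    label: str                   # e.g. "drug-ADR relation"


resolved = Entity("dizziness", "ADR", kb_id="MedDRA:0000000")  # placeholder code
typed_only = Entity("warfarin", "Drug")
```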

Initialization. The subject matter expert uploads the desired collection of documents to the system, provides some seed annotations of the semantic information they want to extract, and specifies external Knowledge Resources to be used, if any. Once a small number of annotations is provided, the system builds a learning model for each annotation type.
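As an assumed illustration of this bootstrapping step, the sketch below trains a first sentence classifier from a few seed examples using scikit-learn; the actual learning method is not specified here, so both the model choice and the seed data are placeholders.

```python
# A minimal sketch of the initialization step for the sentence
# annotator (ii): a handful of SME-labeled seed sentences train a
# first model. scikit-learn is used purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

seed_sentences = [
    "Warfarin may cause serious bleeding.",         # drug-ADR relation
    "Dizziness has been reported after overdose.",  # drug-ADR relation
    "Store at room temperature.",                   # no relation
    "Tablets are supplied in bottles of 30.",       # no relation
]
seed_labels = [1, 1, 0, 0]  # 1 = sentence expresses a drug-ADR relation

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(seed_sentences, seed_labels)

# Apply the initial model to an unseen sentence:
print(model.predict(["Aspirin may increase the risk of gastric bleeding."]))
```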

Adjudication. The initial models are applied to the whole document collection. The SME adjudicates the produced semantic annotations and can (i) correct mistakes and (ii) identify and add missing annotations. After each batch of corrections (where the batch size can be adjusted), the models are retrained and reapplied to the rest of the documents. New semantic annotation types can be added at any time, i.e., once a new item is added, the models are retrained and become able to identify the new item, whether it is an entity or a textual annotation.
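The batched retraining loop could be sketched as follows; `model`, `sme_review`, and `batch_size` are hypothetical stand-ins for the system's actual components.

```python
# A minimal sketch of the adjudication loop: apply the current model,
# collect SME corrections in batches, retrain, and continue on the
# remaining documents.
def adjudicate(model, documents, sme_review, batch_size=10):
    corrections = []
    for doc in documents:
        predicted = model.annotate(doc)         # current model's annotations
        corrected = sme_review(doc, predicted)  # SME fixes/adds annotations
        corrections.extend(corrected)
        if len(corrections) >= batch_size:      # retrain after each batch
            model.retrain(corrections)
            corrections.clear()
    if corrections:                             # flush the final partial batch
        model.retrain(corrections)
    return model
```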

With such a system, the SME has full control over which types of semantic annotations are going to be identified, and they can enforce that the accuracy of the system is always above a certain threshold (even 100% if they are willing, or required, to manually review the whole collection). The system simply assists the user, improving their efficiency in identifying the semantic annotations of interest and reducing human error. The produced semantic annotations enrich the initial document without altering its layout (and thus without obscuring the context needed to understand the text). To achieve this, we add semantic layers on top of the original document, where each layer contains the information for a specific semantic annotator. Figure 2 shows an example of an annotated PI as displayed in Adobe Acrobat Reader. The results of the semantic annotators can be toggled on and off at will. For annotators that implement entity resolution, the recognized entities are links that refer to the external sources.
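A standard way to realize such toggleable layers in a PDF is through optional content groups (OCGs), which Acrobat Reader displays as layers. The sketch below uses the PyMuPDF library to attach a highlight annotation to an OCG; it is a minimal illustration assuming a recent PyMuPDF version, not the authors' actual implementation, and the file names are placeholders.

```python
# A minimal sketch of per-annotator layers using PyMuPDF (fitz):
# each annotator's results go into an optional content group (OCG),
# which the PDF viewer exposes as a toggleable layer.
import fitz  # PyMuPDF

doc = fitz.open("package_insert.pdf")

# One OCG (layer) per semantic annotator, toggleable in the viewer.
adr_layer = doc.add_ocg("Adverse Drug Reactions", on=True)

page = doc[0]
for rect in page.search_for("dizziness"):  # spans found by the ADR annotator
    annot = page.add_highlight_annot(rect)
    annot.set_oc(adr_layer)                # attach annotation to the layer
    annot.update()

doc.save("package_insert_annotated.pdf")
```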

Fig. 2. Example of an annotated Medication Package Insert.

4 Conclusions and Future Work

In this demo we showcase our solution to perform on-demand information extraction directly from PDF files. The key strength of our solution is that the subject matter expert can train new semantic annotators that are directly integrated into their PDF viewer tool. Being able to semantically annotate PDF files is an extremely important capability in many business scenarios. In this demo we show an application with Medication Package Inserts, but one can envision many other scenarios, such as scientific publishing, legal contracts, and the recruiting business - where many applicants’ resumes are PDF files - to mention just a few. In all these scenarios it is crucial to be able to quickly train the extraction models and to deliver the results in a way that is familiar to the subject matter expert, without disrupting any existing workflow.