D-Lib Magazine
The Magazine of Digital Library Research

January/February 2015
Volume 21, Number 1/2

 

Semantic Enrichment and Search: A Case Study on Environmental Science Literature

Kalina Bontcheva, University of Sheffield, UK
k.bontcheva@sheffield.ac.uk

Johanna Kieniewicz and Stephen Andrews, British Library, UK
johanna.kieniewicz@iop.org; stephen.andrews@bl.uk

Michael Wallis, HR Wallingford, UK
M.Wallis@hrwallingford.com

DOI: 10.1045/january2015-bontcheva

 


 

Abstract

As information discovery needs become more and more challenging, traditional keyword-based information retrieval methods are increasingly falling short in providing adequate support. The problem is often compounded by the poor quality of article metadata in some digital collections. This paper investigates automatic semantic enrichment and search methods, as ways to meet these challenges. In particular, the benefits of enriching articles with knowledge from Linked Open Data resources are investigated, with focus on the domain of environmental science. In order to facilitate environmental science researchers in carrying out better semantic searches, a form-based semantic search interface is proposed. It helps researchers to benefit from the semantically enriched content, e.g. to carry out sophisticated location-based searches. The usability and ease of learning of this web interface were evaluated in a user-based study, the results of which are also reported.

 

1 Introduction

Environmental Science is a broad, interdisciplinary subject area that spans biology, chemistry, earth sciences, physics, and engineering. Because of this breadth, information discovery and sharing in environmental science is often a challenge: traditional keyword-based, full-text search cannot address the more complex information seeking requirements, which include sense-making and exploratory search (Pirolli, 2009). In the latter cases, traditional precision-oriented approaches from the field of Information Retrieval (IR) are not sufficient. For exploratory search, in particular, recall is paramount, as is the ability to carry out interactive retrieval (Pirolli, 2009).

Linked Open Data (LOD), when coupled with semantic enrichment and search methods, offers an opportunity to improve the process of information discovery through enriching and contextualizing scientific publications with respect to unique, machine-readable, interlinked open vocabularies. In particular, semantic search over documents aims to address these challenges by finding information that is not based just on the presence of words, but also on their meaning (Kiryakov, et al., 2004).

Relevant LOD vocabularies for environmental science are already becoming available (e.g. the GEMET thesaurus), as are other key resources relevant for the domain (e.g. GeoNames, DBpedia). Manual enrichment of article metadata and textual content with knowledge from LOD resources, however, is prohibitively expensive and unsustainable, since LOD vocabularies typically have millions of entries. Therefore, automatic LOD-based semantic annotation methods were used, in order to enrich the full text content with disambiguated domain terms and entities (e.g. locations, organisations, persons), described through Uniform Resource Identifiers (URIs). In addition, the original articles are enriched with relevant knowledge from the respective LOD resources (e.g. that Oxford is part of England). This is needed in order to answer queries that require common-sense knowledge, which is often not present in the original article content. For example, following semantic enrichment, a semantic search for documents on flooding in England will now be able to retrieve a relevant document about floods in Oxford, even though the original text does not explicitly mention England.

Designing semantic search interfaces that are easy to use and learn, however, is both a key requirement and a major challenge (Bast, et al., 2013). The interface needs to be not only more powerful than traditional full-text search over publications, but also simple enough for non-expert users.

The novel contributions of this paper are threefold:

  1. To investigate semantic enrichment and search of environmental science literature (Section 3) through the development of a web-based prototype (Section 5), designed in close collaboration with environmental science researchers.
  2. To demonstrate how semantic enrichment, based on knowledge from Linked Open Data resources, can help information discovery (Section 4).
  3. To evaluate the usability and ease of learning of the proposed semantic search interface (Section 5), in a user-based experiment (Section 6).
 

2 Background

Within the sphere of environmental science, the area with the greatest legacy of semantic enrichment is that of geospatial information (Janowicz, et al., 2013), with applications including GIS environments/Spatial Data infrastructures (SDI), environmental sensor networks and geotagging (Pilman, et al., 2011). These approaches all identify interdisciplinary datasets, as are commonly found in environmental science, as a particularly fruitful area for LOD exploration. In these contexts, dataset metadata is semantically enriched in order to improve search and enable correct use of data (Schentz, et al., 2011). The LOD GEMET thesaurus underpins the EU INSPIRE directive, which aims to establish a digital infrastructure for spatial information in Europe in order to support environmental research, policy and decision-making. This ties into the Open Data movement and data.gov.uk which is being used as a vehicle through which the UK might comply with INSPIRE requirements for making environmental data available and discoverable (Shaon, et al., 2011).

Although progress is being made in environmental informatics with respect to enabling the discovery and better use of datasets and geographic information within the GIS/SDI context, LOD vocabularies have not as yet been applied in the context of semantic enrichment of environmental science literature. This contrasts with the biomedical sciences where text mining has been enabled by the Unified Medical Language System, a meta-thesaurus provided by the US National Library of Medicine, which acts as a comprehensive thesaurus and ontology of biomedical concepts (Hettne, et al., 2010).

In more detail, we experimented with existing environmental Linked Data vocabularies, namely GEMET and the Ordnance Survey Hydrology ontologies (Devaraju & Kuhn, 2010), as well as two general purpose LOD resources (DBpedia (Bizer et al., 2009) and GeoNames). These were used as knowledge sources for automated semantic enrichment of environmental science literature, coupled with a semantic search user interface.

 

3 LOD-based Semantic Enrichment

Semantic annotation is the process of tying semantic models, such as ontologies, and scientific articles together. It may be characterized as the dynamic semantic enrichment of unstructured and semi-structured documents with new knowledge and linking these to relevant domain ontologies/knowledge bases. It typically requires annotating a potentially ambiguous entity mention (e.g. Cambridge) with the canonical identifier of the correct unique entity (e.g. depending on the document content, http://dbpedia.org/resource/Cambridge or http://dbpedia.org/resource/Cambridge,_Massachusetts).

In our experiments, domain-specific LOD resources, such as the GEMET thesaurus and the Ordnance Survey Hydrology ontology (Devaraju & Kuhn, 2010) were used as a source of relevant terms, with which to enrich the article metadata and also to aid semantic search by providing synonyms. Occurrences of such terms were annotated automatically, using a combination of the GATE open-source English morphological analyser (Cunningham, et al., 2011) to detect the root word forms, and the ontology-based OntoRoot gazetteer, which matches terms using their labels in the thesauri (Cunningham, et al., 2011).

Since some of the most frequently used searches are for persons, locations, organisations, and other named entities (Pound, et al., 2010), we also used YODIE (Damljanovic & Bontcheva, 2012) to identify such entities mentioned in the article full-text and disambiguate them to DBpedia URIs. YODIE uses a combination of four classes of similarity metrics: string similarity, semantic similarity between nearby entities, contextual similarity between the document and the textual abstract of the candidate URI in DBpedia, and URI commonness as anchor text in Wikipedia articles. This system uses the GATE tokeniser, POS tagger, and the ANNIE NER system (Cunningham, et al., 2013) for linguistic pre-processing and entity recognition respectively. YODIE also uses the open-source GATE Large Knowledge Gazetteer (LKB) (Cunningham, et al., 2011), which assigns candidate DBpedia URIs to entities mentioned in the text.

The results of the semantic annotation and entity disambiguation algorithm are full-text articles enriched with URIs, one URI per term or named entity mentioned. Once the URIs are added as annotations, the article texts are enriched with additional semantic knowledge from the respective LOD resource. For instance, if a document mentions Cambridge, once it is disambiguated to http://dbpedia.org/resource/Cambridge (i.e., the English university city), a new annotation will be added to the text, containing the latitude and longitude, country, county, and population information, as given in DBpedia. This additional knowledge enables, inter alia, better location-based searches. For example, a user searching for publications on flooding in East Anglia will now be able to find a report about flooding in Cambridge.
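The enrichment step described above can be illustrated with a minimal Python sketch. It is not the GATE/YODIE implementation: the dictionary `TOY_KB`, the function `enrich`, and the annotation layout are hypothetical names introduced here, and the knowledge base holds only a few hand-written facts standing in for what would be retrieved from DBpedia.

```python
# Hypothetical stand-in for an LOD resource. Only a few well-known facts are
# listed; numeric properties such as population are left out rather than guessed.
TOY_KB = {
    "http://dbpedia.org/resource/Cambridge": {
        "type": "City",
        "country": "United Kingdom",
        "county": "Cambridgeshire",
    },
    "http://dbpedia.org/resource/Cambridge,_Massachusetts": {
        "type": "City",
        "country": "United States",
    },
}


def enrich(annotation, kb=TOY_KB):
    """Return a copy of a disambiguated annotation with KB facts attached.

    `annotation` is assumed to be a dict that already carries a "uri" key,
    i.e. disambiguation has happened before this step.
    """
    enriched = dict(annotation)
    # Unknown URIs simply receive an empty feature map.
    enriched["features"] = dict(kb.get(annotation["uri"], {}))
    return enriched


mention = {
    "text": "Cambridge",
    "uri": "http://dbpedia.org/resource/Cambridge",
}
enriched_mention = enrich(mention)
```

Keeping the added facts on the annotation itself, rather than in a separate lookup performed at query time, mirrors the idea in the text that the knowledge becomes part of the enriched article and can therefore be indexed and searched alongside it.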

In total, 10,000 environmental science documents and associated metadata were enriched automatically with term and entity URIs from DBpedia, GeoNames, GEMET, and the Ordnance Survey ontology, as well as with linguistic information, such as part of speech.

 

4 Impact of Semantic Enrichment and Search on Information Discovery

In order to scope requirements for the semantic search tool, it was important to understand the needs and search behaviour of its potential users. Users were contacted via personal contacts and environmental science networks. A total of 34 respondents answered, who could be split into Local Authority, Consultancy, Academia, NGO/Charity, Government Agency and SME (business). There is a slight emphasis on responses from local authorities, because the survey was posted on the FlowNet website, which provides resources and a point of interaction for that group. The results of the user requirements scoping are detailed in (Kieniewicz & Wallis, 2013).

Based on the survey results, environmental science researchers from within The British Library and HR Wallingford carried out information discovery searches on the semantically enriched metadata records and full-text documents.

The purpose of this small scale user assessment was to gain insight into how semantic enrichment and semantic search can improve information discovery. In particular, we examined:

  1. How semantic enrichment helps enhance article metadata, by populating the Dublin Core Subject field with automatically discovered terms, and
  2. How the more complex search queries can be answered by combining full-text search with semantic knowledge added automatically from LOD resources.
 

4.1 Impact of Semantic Enrichment on Article Metadata

The automatically added LOD-based semantic annotations were manually checked in each of the documents, to assess their accuracy and relevance to the types of searches requested by the environmental science researchers in our survey (Kieniewicz & Wallis, 2013). The focus was on enhancing the article metadata by populating the Dublin Core (Weibel, et al., 1998) Subject field.

The benefit of semantic enrichment in this case is that, by surfacing annotated terms derived from the full-text content, concepts buried within the body of the paper/report can be highlighted. The addition of terms affects the relevance ranking in full-text searches. Moreover, searches can be made more specific by limiting the search criteria to the Subject field (e.g. through faceted search). This is similar in principle to the use of Medical Subject Headings (MeSH) (NLM, 1960) within the Medline and PubMed databases, where the content of the original document is described through the use of key terms added to the bibliographic record.

For each semantically annotated full-text document, the metadata enrichment algorithm retained the top five locations and organisations with DBpedia entity URIs and the corresponding location-related knowledge. Domain-specific terms were also added to the metadata, on the basis of the environmental science ontologies. This automatically acquired metadata was incorporated into the Subject fields of the document (see the highlighted terms at the bottom of Figure 1).


Figure 1: Automatically Enriched Subject Metadata
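The selection of the top five locations and organisations for the Subject field can be sketched as follows. The text does not say how entities were ranked, so this sketch assumes ranking by mention frequency; the function name `top_entities` and the annotation layout are hypothetical.

```python
from collections import Counter


def top_entities(annotations, entity_type, n=5):
    """Return the n most frequently mentioned URIs of one entity type.

    Ranking by mention count is an assumption made for this sketch; the
    criterion used in the original system is not stated.
    """
    counts = Counter(
        a["uri"] for a in annotations if a["type"] == entity_type
    )
    return [uri for uri, _ in counts.most_common(n)]


# Hypothetical annotations for a single document.
doc_annotations = [
    {"type": "Location", "uri": "http://dbpedia.org/resource/Cambridge"},
    {"type": "Location", "uri": "http://dbpedia.org/resource/Cambridge"},
    {"type": "Location", "uri": "http://dbpedia.org/resource/East_Anglia"},
    {"type": "Organisation", "uri": "http://dbpedia.org/resource/Environment_Agency"},
]
subject_locations = top_entities(doc_annotations, "Location")
```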

Once the semantic enrichment process was complete, the enhanced metadata was loaded and indexed in a separate full-text search repository. Differences in retrieval were measured by comparing the results across the annotated and the non-annotated versions of the 10,000 test documents, using structured search queries. Examples of ontology-derived domain-specific terms that populated the Subject field of one particular article were 'Environment Agency', 'East Anglia', 'Cambridge', 'flooding'. In this, and a number of other cases, these automatically generated terms provided additional contextual information to the user, particularly useful in those instances where the original metadata is sparse and there is no abstract present.

 

4.2 Impact of Semantic Search on User Query Results

In addition to populating the metadata Subject fields, the semantic annotations derived from both the metadata and article full-text were indexed into a GATE Mimir semantic search repository (Tablan, et al., In Press). GATE Mimir is a semantic search tool which can be used to index and search over text, annotations, semantic schemas (ontologies), and Linked Open Data knowledge. It supports queries that arbitrarily mix full-text, structural, linguistic and semantic constraints and scales up to terabytes of text through federated indexing. The rationale for choosing GATE Mimir is its transparent support for semantic search constraints. In particular, our aim was to determine whether the more complex search needs of environmental researchers could be met better through semantic queries that make use of the additional knowledge added to article texts from DBpedia.

4.2.1 Removing False Positives through Semantic Restrictions

The first benefit observed by the users was that the semantic annotations made the search results more precise, i.e. they removed false positives.

In particular, a frequent literature search query involves environmental science terms (e.g. flooding) coupled with a geo-location (e.g. Oxford). When using a traditional full-text search engine, such queries often return false positives, due to location names being ambiguous. For instance, on our test collection the query "flooding Oxford" returns 8 documents, 4 of which mention irrelevant locations (e.g. Oxford Road Mill, an industrial site) and organisations (e.g. University of Oxford, Oxford University Press).

The corresponding semantic search query could be made much more precise by the users, by specifying explicitly in the query that Oxford is a location or even a city. In our example, the corresponding Mimir query is flooding AND ({Location} OVER "Oxford"), which filters out the 4 false positives, where Oxford is part of an organisation name.
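The difference between the two kinds of query can be shown with a small Python sketch. It does not use GATE Mimir; the two toy documents, the annotation layout, and the functions `keyword_search` and `semantic_search` are hypothetical, and substring matching is used as a simplification of both full-text search and the OVER operator.

```python
# Two hypothetical documents with pre-computed entity annotations.
DOCS = [
    {
        "id": "d1",
        "text": "Flooding in Oxford closed several roads.",
        "annotations": [{"type": "Location", "text": "Oxford"}],
    },
    {
        "id": "d2",
        "text": "A University of Oxford study of flooding risk.",
        "annotations": [{"type": "Organisation", "text": "University of Oxford"}],
    },
]


def keyword_search(docs, words):
    """Return ids of documents whose text contains every word (case-insensitive)."""
    return [
        d["id"]
        for d in docs
        if all(w.lower() in d["text"].lower() for w in words)
    ]


def semantic_search(docs, word, ann_type, ann_text):
    """Return ids of documents containing `word` and an annotation of the
    given type whose covered text includes `ann_text`."""
    return [
        d["id"]
        for d in docs
        if word.lower() in d["text"].lower()
        and any(
            a["type"] == ann_type and ann_text in a["text"]
            for a in d["annotations"]
        )
    ]


keyword_hits = keyword_search(DOCS, ["flooding", "Oxford"])
semantic_hits = semantic_search(DOCS, "flooding", "Location", "Oxford")
```

In this toy collection the keyword query matches both documents, whereas the type-restricted query keeps only the one in which Oxford was annotated as a location.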

Another similar query we tested was "flood management Northern Ireland", where full-text search returns irrelevant hits due to documents mentioning the Northern Ireland Rivers Agency. Using semantic search again allows users to constrain the search specifically to locations within Northern Ireland.

4.2.2 Improving Search Results through LOD Knowledge

The second observed benefit was improved search coverage (i.e. recall), by exploiting the knowledge added during semantic enrichment of the article text. In particular, our survey indicated that users would typically carry out location-based searches at the county (e.g. Oxfordshire) or regional (e.g. South East England) level, whereas the original article would typically mention explicitly only the city names (e.g. Banbury). Therefore, full-text searches over the original article texts (e.g. "flooding Oxfordshire") would tend to have poor recall (i.e. would not return some relevant documents), due to their lack of common knowledge that Banbury is in Oxfordshire, which is in turn in South East England. Consequently, a keyword search over the original articles for the query "climate change Oxfordshire" would return only one hit.

In contrast, the corresponding semantic search query over the enriched articles returns three relevant documents: one mentioning Oxfordshire as before, but also one about Banbury and one about Wytham Woods (see Figure 4). The knowledge that these locations are in Oxfordshire was added by the semantic enrichment process, based on knowledge encoded in DBpedia.
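The recall gain comes from following containment relations between places. A minimal Python sketch, assuming a hand-written `PART_OF` table in place of the relations obtained from DBpedia (the table and both function names are hypothetical):

```python
# Hypothetical containment relations, taken from the examples in the text.
PART_OF = {
    "Banbury": "Oxfordshire",
    "Wytham Woods": "Oxfordshire",
    "Oxfordshire": "South East England",
}


def ancestors(place, part_of=PART_OF):
    """Return all regions that contain `place`, nearest first."""
    chain = []
    seen = {place}
    while place in part_of:
        place = part_of[place]
        if place in seen:  # guard against cyclic data
            break
        seen.add(place)
        chain.append(place)
    return chain


def mentions_region(doc_locations, region, part_of=PART_OF):
    """True if any location in the document is the region or lies within it."""
    return any(
        loc == region or region in ancestors(loc, part_of)
        for loc in doc_locations
    )
```

With this expansion, a document that mentions only Banbury satisfies a constraint on Oxfordshire, and also one on South East England, which is the behaviour the section attributes to the enriched index.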

4.2.3 Adding Semantic Search Constraints

The third major benefit of semantic enrichment is that it gives users the ability to formulate sophisticated semantic search constraints, which go beyond keyword-based queries. Examples from our user survey included: "flooding in the last 10 years", "flooding since 2007", and "where is the floodplain near Aylesbury".

The first two kinds of queries are answered based on the automatically recognised and normalised dates in the full-text content. Relative dates in the query, such as "the last 10 years", are also normalised and converted into constraints over dates mentioned in the articles. Semantic constraints based on knowledge added to the articles from DBpedia are essential for answering the last query, which is more about facts, rather than documents.
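The normalisation of relative dates can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, only the two phrasings quoted above are handled, and the reference year is passed in explicitly because the text does not say how the original system chooses it.

```python
import re


def normalise_date_constraint(phrase, current_year):
    """Convert a date phrase into an inclusive (start_year, end_year) range.

    Handles only "the last N years" and "since YYYY"; anything else raises
    ValueError.
    """
    phrase = phrase.lower().strip()
    match = re.fullmatch(r"(?:in )?the last (\d+) years?", phrase)
    if match:
        return (current_year - int(match.group(1)), current_year)
    match = re.fullmatch(r"since (\d{4})", phrase)
    if match:
        return (int(match.group(1)), current_year)
    raise ValueError("unsupported date phrase: " + phrase)


def years_in_range(mentioned_years, bounds):
    """Keep the normalised years from a document that fall inside the range."""
    start, end = bounds
    return [year for year in mentioned_years if start <= year <= end]
```

For example, with 2015 as the reference year, "the last 10 years" becomes the range 2005 to 2015, which can then be compared against the normalised dates found in each article.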

Let us take as an example another user query for documents on flooding in countries with population density greater than 500 people per square kilometre. Since this is not a keyword-based query and none of the original articles contain any information on population density, this query cannot be answered by the standard full-text search engine.

The corresponding semantic search query is:

root:flood AND {Location sparql="select distinct ?inst
where {?inst rdf:type :Country.
?inst :populationDensity ?popDensity.
FILTER(?popDensity > 500)}"}

In this case documents containing the stemmed word 'flood' ('flood', 'flooding', 'flooded', etc.) are retrieved along with any words in the document that have been annotated as a location, by the semantic enrichment algorithm. An additional constraint on these matching location URIs is that they need to be of type Country and the value of the populationDensity property needs to be more than 500. This additional knowledge has been added to the articles automatically, based on knowledge in DBpedia about countries and their population.

Our last example query is the full-text search query 'river flooding'. Using semantic search over the annotated articles, this can be formulated as a query for documents containing the stemmed word 'flood' and a location, which is of class 'River'. This semantic search query retrieves articles mentioning the Thames, which cannot be found using traditional full-text search, as the keyword query would need to enumerate all river names explicitly, as an OR statement. Given the large number of river names in the collection, this is not feasible. In the automatically enriched articles, though, the semantic annotation algorithm has tagged the Thames with the corresponding DBpedia URI and its DBpedia type, which is River.

root:flood AND {Location sparql="select distinct ?inst
where {?inst rdf:type :River}"}
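Both example queries in this section share one shape: a stemmed keyword combined with a Location annotation restricted by SPARQL graph patterns. A minimal Python sketch of a helper that assembles such query strings is given below. It only reproduces the textual pattern shown above; `location_query` is a hypothetical name, and nothing here calls or validates against GATE Mimir.

```python
def location_query(root_word, where_clauses):
    """Build a query string in the pattern of the two examples in this section.

    `where_clauses` is a list of SPARQL graph-pattern lines over the variable
    ?inst; they are joined with newlines and placed inside the where block.
    """
    body = "\n".join(where_clauses)
    return (
        "root:" + root_word
        + ' AND {Location sparql="select distinct ?inst\n'
        + "where {" + body + '}"}'
    )


river_query = location_query("flood", ["?inst rdf:type :River"])

density_query = location_query(
    "flood",
    [
        "?inst rdf:type :Country.",
        "?inst :populationDensity ?popDensity.",
        "FILTER(?popDensity > 500)",
    ],
)
```

A form-based interface could plausibly generate its queries in a similar way, one clause per selected constraint, although the text does not describe how the prototype does this internally.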

 

5 User Interface for Semantic Search

Figure 2 shows the form-based semantic search UI, which was designed to help answer the kinds of semantic queries discussed in the previous section. It was designed in close collaboration with the environmental science researchers from the British Library and HR Wallingford, in order to ensure it meets user needs.

This UI has a keyword search field, complemented with optional semantic search constraints, through a set of inter-dependent drop-down lists. Users can search for specific entity types: Locations, Organisations, Dates, Rivers, or provide restrictions over the Document metadata fields, e.g. publication date before 2010. More than one semantic constraint can be added, through the plus button.


Figure 2: The Form-Based Semantic Search Interface

A specific requirement was support for location-based queries, so users can narrow down results by name, geographic coordinates, population, population density and country code. Users also wished to search for locations that a river flows through, as well as documents mentioning locations near a given location.

Figure 3 shows how location-based semantic searches are formulated. First, a Location is chosen as a constraint, then, if required, further constraints can be specified by choosing an appropriate property (e.g. country). Population allows users to pose restrictions on population size, where this knowledge was imported from DBpedia during the semantic enrichment process. Similar numeric constraints can be imposed on the latitude, longitude, and population density.


Figure 3: Formulating Location-based Searches

Restrictions can also be imposed in terms of location name or the country code, i.e. which country it belongs to. When "is" is chosen, it means that the location name must be exactly as specified (e.g. Oxford), whereas "contains" provides sub-string matching, (e.g. Oxfordshire will be matched as a location containing the string "Oxford" in its name). In more detail, if a user searches for documents mentioning locations with name containing Oxford, then this would return documents mentioning Oxford explicitly, but also documents mentioning Oxfordshire and other locations in Oxfordshire (e.g. Wytham Woods and Banbury). See Figure 4 for some of the results returned by this query.


Figure 4: Sample Semantic Search Results

 
 

6 User-based Evaluation

In order to evaluate the usability and learning overhead of the form-based semantic-search UI, a user-based evaluation was conducted. There were 17 participants, who could be broadly classified into environmental scientists (43%), users interested in applications of semantic search to other domains (22%), and semantic technology practitioners (35%).

Participants were asked to compare the results of keyword-based searches against those produced by semantic searches over the enriched articles, as part of four tasks:

  • Task 1. Find documents about flooding on rivers flowing through Gloucester
  • Task 2. Find documents about flooding in places near Sheffield
  • Task 3. Find documents about flood risk management in locations with population less than 15000 inhabitants
  • Task 4. Find areas at risk of surface water flooding in London

Participants were asked first to complete the task using keyword search only and then to formulate the query including also semantic search constraints from the form-based interface. For each task, participants were asked to write down the queries they used, as well as any other notes they wished to make on query formulation.

More formally, the evaluation experiment had a repeated measures, task-based design (also called within subjects design), i.e., the same participants interacted with two versions of the system, in order to complete a given set of tasks. Prior to the experiment, the participants were given a 15 minute live demonstration of the semantic search interface, in order to familiarise them with the way semantic constraints are formulated. Afterwards, participants were given 30 minutes to complete the tasks, using the two search methods.

In total, 17 task sheets were completed, including query formulations and comments on the 4 search tasks. Here we analyse the feedback and task success on a task by task basis.

                                                     Task 1   Task 2   Task 3   Task 4
Task completion rate                                 100%     88.24%   88.24%   76.47%
Found answers with keyword search only               47.96%   70.59%   35.71%   69.23%
Semantic search results better than keyword search   82.35%   70.59%   96.43%   73.08%

Firstly, task success rates vary by task. This is partly due to the higher complexity of tasks 2, 3, and 4, but also some users simply did not attempt the later tasks, because they ran out of time. Nevertheless, each task was completed by at least 13 participants.

The percentage of participants who found relevant documents using only keyword search varies by task. The low success rates occur in tasks 1 and 3, which are exactly the tasks where additional knowledge, not present in the original article, is needed in order to find the relevant results. Namely, task 1 requires knowledge of which rivers flow through Gloucester, and task 3 requires knowledge of which places in the UK have a population of less than 15,000 inhabitants. Task 4 is about searching for risk areas in London, where again some relevant documents do not explicitly mention the keyword London.

Overall, participants did find that the results obtained by using semantic search were better than those from keyword search alone. However, as task success rate indicates, not all users were able to learn how to use the semantic search constraints, even though all required interface functionality was demonstrated in advance.

Lastly, we also evaluated the usability of the semantic search UI, through a post-task questionnaire. It contained 8 questions only, due to the limited time available. The first seven questions are based on the SUS usability questionnaire. The first question focused on frequency of use. The second, third, and seventh questions examined the complexity of use of the semantic search UI. Questions four, five, and six probed how hard it is to learn the semantic search UI. The parallel to the SUS questionnaire allowed us to use the same scoring mechanism, thus making the results broadly comparable. The last question asked whether the search results made sense to the user.

 

6.1 Overall Questionnaire Scores

In total, we received 16 filled in questionnaires. The 5 point Likert scale was mapped to numerical scores between 1 (Strongly Disagree) and 5 (Strongly Agree).

Following the SUS scoring methodology for such questionnaires, we subtracted one from the numeric scores for answers to our questions 1, 3, 5, and 8. For the other four questions (which were negative) we subtracted the user responses from five. This scales all response values between 0 and 4 (with four being the most positive response). Then all converted responses were added up and scaled.
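The scoring procedure described above can be written out as a short Python sketch. The question groupings follow the text (questions 1, 3, 5, and 8 positive; the other four negative). The final scaling factor is an assumption: the text says only that the summed responses were "scaled", so the sketch maps the maximum possible sum of 32 (8 questions, 4 points each) onto 100, by analogy with the factor of 2.5 used for the standard ten-item SUS. The function name is hypothetical.

```python
POSITIVE_QUESTIONS = {1, 3, 5, 8}
NEGATIVE_QUESTIONS = {2, 4, 6, 7}


def questionnaire_score(responses):
    """Compute a 0-100 score from eight Likert responses.

    `responses` maps question number (1-8) to a value from 1 (Strongly
    Disagree) to 5 (Strongly Agree).
    """
    if set(responses) != POSITIVE_QUESTIONS | NEGATIVE_QUESTIONS:
        raise ValueError("expected answers to questions 1-8")
    total = 0
    for question, value in responses.items():
        if not 1 <= value <= 5:
            raise ValueError("Likert values must be between 1 and 5")
        if question in POSITIVE_QUESTIONS:
            total += value - 1  # positive item: 1..5 becomes 0..4
        else:
            total += 5 - value  # negative item: 1..5 becomes 4..0
    # Assumed scaling: the maximum sum of 32 maps to 100.
    return total * 100 / 32
```

Under this assumed scaling, a participant who answers every question neutrally (3) scores 50, and one who gives the most favourable answer to every question scores 100.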

The mean questionnaire score is 72.3, which indicates that the system has good overall usability (this score is scaled to match the SUS scores, where a good SUS score needs to be over 68). Standard deviation is 10.2. 69% of participants scored the system above 68 overall. The mean and standard deviation remain very similar, even when the newly added question is excluded from the scores (73 with SD of 11).

 

6.2 Frequency of Use

Nine of the 16 (56.3%) participants agreed or strongly agreed that they would use such a system frequently. Another 6 participants were neutral and only 1 participant strongly disagreed.

 

6.3 System Use

Questions 2, 3, and 7 in our questionnaire examine the ease of use of the semantic search UI. In particular, 14 of the participants (87.5%) disagreed or strongly disagreed with the statement that the UI is unnecessarily complex and 2 were neutral. 13 of the participants (81.25%) agreed or strongly agreed that the semantic search UI is easy to use (question 3). Question 7 is the opposite of question 3, since it stated that the system is very cumbersome to use. Here, 12 participants (75%) disagreed or strongly disagreed, which validates the positive answers to question 3. Thus overall, we can conclude that there are no major issues with the semantic search UI that make it complex or hard to use for the majority of users.

 

6.4 Ease of Learning

Questions 4, 5, and 6 focused on learning the user interface. Fifteen of the 16 participants (93.75%) disagreed or strongly disagreed that they would need help from a technical person to use the system (question 4). The same participants also felt they could use the system without needing to learn more about it first (question 6). Participants were more divided on the question of how quickly others would learn the semantic search UI (question 5). Here, only 11 of the 16 (68.75%) agreed or strongly agreed with this statement.

From this we can conclude that the participants were able to learn to use the semantic search UI successfully and confidently after only a short demonstration. However, open questions remain as to how easy it would be for users to benefit from semantic search without any prior training.

 

6.5 Result Quality

The last question focused on whether the results returned by the semantic search UI made sense to the users. There 12 of the 16 participants (75%) agreed or strongly agreed with this statement. A more in-depth follow-up user study is needed, in order to understand how much of this is due to mistakes made by the system versus the fact that some participants were not specialists in the domain.

Overall, coupled with task success rates and comments made as part of task completion, we can conclude tentatively that the results produced by semantic search were perceived as meaningful and useful.

 

6.6 Qualitative Feedback Received

The study participants also worked in groups to provide feedback and discuss the semantic search interface. We led three such groups, with between 6 and 10 participants each, in order to stimulate discussions and allow sufficient time to gather feedback. In addition, we asked participants to give short written suggestions after the structured usability questionnaire, focused on three topics:

  • Barriers in adopting the semantic search interface
  • Suggestions for interface improvements
  • Any new features that they would like to see covered by a semantic search UI

In terms of possible adoption barriers for non-specialist users, and ways to lower them, the most important points identified were:

  • Removing the need to first see a demonstration of how semantic search works
  • Making the query syntax more similar to Google keyword search
  • Helping non-specialist users to formulate semantic searches successfully through query by example and results explanation
  • Supporting user feedback on the relevance of each returned result, to help train the text mining and information retrieval engines underneath
  • Providing map-based visualisations to support result visualisation and browsing
  • Showing an indicator of system confidence in the results returned

Some participants were also interested in having an advanced option where they could see the semantically enriched document content, e.g. through highlighting. Another suggestion was to allow advanced users to edit the formal semantic search query, e.g. to add or remove elements. Another idea was to show more feedback on why a certain document was matched, especially when this is a result of using semantic knowledge added from the LOD resource (e.g. for a population constraint to show the population sizes of the locations in the matched documents), in order to reassure users that indeed the correct results were returned.

 

7 Conclusions

This paper demonstrated that it is possible to use existing LOD vocabularies for automatic semantic enrichment of publications. Specifically, we tested the usefulness of knowledge from DBpedia, GEMET, and the Hydrology Ordnance Survey Ontology in enhancing information discovery and management of environmental science literature. The conclusion is that semantic enrichment of articles with LOD knowledge allows for generalizations and thus supports more complex information needs, such as sense-making. Although our experiments focused on the domain of environmental science, the methods and results are relevant to LOD-based semantic enrichment of scientific publications in general.

Next, we demonstrated that semantic search interfaces can help end-users meet their advanced search needs. The study participants found the semantic search UI easy to learn and use. Importantly, this included environmental scientists, who particularly appreciated the location-based search. A key conclusion is that semantic search UIs need to be made as intuitive as possible, to reduce the initial learning curve. Particular attention needs to be paid to usability and design details, especially consistency with keyword-based general-purpose search engines, e.g. Google, Yahoo.

The group discussions and the written feedback questions on the survey forms allowed us to elicit a number of small, easy-to-implement changes to the user interface, which we hope will improve usability in the future. We are planning to implement these in follow-up research and then carry out a second user-based evaluation, this time with users recruited online, who will not be shown a demonstration of the semantic search in advance.

In addition, we elicited a number of more challenging ideas for future improvements, which cannot be easily addressed within the scope of short, informal follow-up work. The most substantial of these are the implementation of a natural language interface, map-based visualisations, support for user feedback on search results, and search query refinement by example. These are all valuable future extensions to this work; the natural language interface, in particular, we plan to base on (Damljanovic, et al., 2013).

 

Acknowledgements

This research was partially supported by the JISC-funded EnviLOD project. The authors wish to thank the anonymous LCPD reviewers for their feedback and comments, as well as Niraj Aswani for his help with implementation aspects of the project.

 

References

[1] Agatonovic, M., Aswani, N., Bontcheva, K., Cunningham, H., Heitz, T., et al. (2008) Large-scale, parallel automatic patent annotation. In: Proceedings of 1st International CIKM Workshop on Patent Information Retrieval — PaIR'08, Napa Valley, California, USA.

[2] Bast, H., Baurle, F., Buchhold, B., Haussmann, E. (2012) A case for semantic full-text search. In: Proceedings of the 1st Joint International Workshop on Entity-Oriented and Semantic Search, JIWES '12, ACM, pp. 4:1–4:3.

[3] Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., et al. (2009) DBpedia - a crystallization point for the web of data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web 7: 154–165. http://doi.org/10.1016/j.websem.2009.07.002

[4] Bontcheva, K., Aswani, N. (2013) EnviLOD Workpackage 5 — Quantitative Evaluation Report. February 2013.

[5] Cunningham, H., Tablan, V., Roberts, A., Bontcheva, K. (2013) Getting More Out of Biomedical Documents with GATE's Full Lifecycle Open Source Text Analytics. PLoS Comput Biol 9(2): e1002854. http://doi.org/10.1371/journal.pcbi.1002854

[6] Cunningham, H., et al. (2011) Text Processing with GATE (Version 6). University of Sheffield Department of Computer Science, 15 April 2011, ISBN 0956599311.

[7] Damljanovic, D., Agatonovic, M., Cunningham, H., Bontcheva, K. (2013) Improving habitability of natural language interfaces for querying ontologies with feedback and clarification dialogues. Web Semantics: Science, Services and Agents on the World Wide Web, Volume 19, pp. 1–21. http://dx.doi.org/10.1016/j.websem.2013.02.002

[8] Damljanovic, D., Bontcheva, K. (2012) Named Entity Disambiguation using Linked Data. In: Proceedings of the 9th Extended Semantic Web Conference (ESWC 2012), Heraklion, Greece, May 2012. Poster session.

[9] Devaraju, A., Kuhn, W. (2010) A process-centric ontological approach for integrating geo-sensor data. In: Proceedings of the Sixth International Conference on Formal Ontology in Information Systems (FOIS), pp. 199–212.

[10] Gruhl, D., Nagarajan, M., Pieper, J., Robson, C., Sheth, A. (2009) Context and Domain Knowledge Enhanced Entity Spotting in Informal Text. In: Proceedings of the 8th International Semantic Web Conference (ISWC'2009), ISBN 978-3-642-04930-9.

[11] Hettne, K., van Mulligen, E., Schuemie, M., Schijvenaars, B., Kors, J. (2010) Rewriting and suppressing UMLS terms for improved biomedical term identification. Journal of Biomedical Semantics 1:1–14. http://doi.org/10.1186/2041-1480-1-5

[12] Janowicz, K., Scheider, S., Pehle, T., Hart, G. (2012) Geospatial semantics and linked spatio-temporal data — past, present, and future. Semantic Web Interoperability, Usability, Applicability, Vol. 3, Number 4. http://doi.org/10.3233/SW-2012-0077.

[13] Ji, H., Grishman, R. (2011) Knowledge base population: Successful approaches and challenges. In: Proceedings of 49th Annual Meeting of the Association for Computational Linguistics (ACL'2011), pp. 1148–1158, ISBN 978-1-932432-87-9.

[14] Kieniewicz, J., Wallis, M. (2013) EnviLOD Work Package 2—User Engagement and Case Studies. EnviLOD project report, July 2013.

[15] Kiryakov, A., Popov, B., Ognyanov, D., Manov, D., Kirilov, A., Goranov, M. (2004) Semantic annotation, indexing and retrieval. Journal of Web Semantics 1 (2), pp. 671–680. http://doi.org/10.1016/j.websem.2004.07.005

[16] Maynard, D., Greenwood, M. (2012) Large Scale Semantic Annotation, Indexing and Search at The National Archives. In: Proceedings of LREC 2012, May 2012, Istanbul, Turkey.

[17] National Library of Medicine. (1960) Medical subject headings: main headings, subheadings, and cross references used in the Index Medicus and the National Library of Medicine Catalog. 1st ed., Washington, DC, U.S. Department of Health, Education, and Welfare.

[18] Pillman, W., Schade, S., Smits, P. (2011) Innovations in sharing environmental observations and information. In: Proceedings of the 25th EnviroInfo Conference, Shaker-Verlag.

[19] Pound, J., Mika, P., Zaragoza, H. (2010) Ad-hoc object retrieval in the web of data. In: Proceedings of the 19th International Conference on World Wide Web, ACM, pp. 771–780, ISBN 978-1-60558-799-8.

[20] Pirolli, P. (2009) Powers of 10: Modeling complex information-seeking systems at multiple scales. IEEE Computer 42 (3), pp. 33–40. http://doi.org/10.1109/MC.2009.94

[21] Rao, D., McNamee, P., Dredze, M. (2013) Entity linking: Finding extracted entities in a knowledge base. In: Multi-source, Multi-lingual Information Extraction and Summarization. Springer Verlag, ISBN 978-3-642-43090-9.

[22] Shaon, A., Woolf, A., Crompton, S., Boczek, R., Rogers, W., et al. (2011) An open source linked data framework for publishing environmental data under the UK location strategy. In: Terra Cognita 2011: Foundations, Technologies and Applications of the Geospatial Web, pp. 62–74.

[23] Schentz, H., Peterseil, J., Magagna, B., Mirtil, M. (2011) Semantics in ecosystems research and monitoring. In: Pillman, W., Schade, S., Smits, P., editors. Proceedings of the 25th International EnviroInfo Conference.

[24] Tablan, V., Bontcheva, K., Roberts, I., Cunningham, H. (2014) Mimir: an Open-Source Semantic Search Framework for Interactive Information Seeking and Discovery. Journal of Web Semantics. http://doi.org/10.1016/j.websem.2014.10.002

[25] Weibel, S., Kunze, J., Lagoze, C., Wolf, M. (1998) Dublin core metadata for resource discovery. Internet Engineering Task Force, RFC 2413.

 

About the Authors


Kalina Bontcheva is a senior research scientist at the University of Sheffield and the holder of an EPSRC career acceleration fellowship, working on text mining and summarisation of social media. Dr. Bontcheva's main interests are semantic annotation and search, information extraction, opinion mining, text summarisation, and the open-source GATE software infrastructure for NLP. She is currently coordinating the PHEME European project on analysing rumours in social media, and leads the Sheffield teams in the TrendMiner and DecarboNet European projects and in the JISC-funded EnviLOD project. She also co-organises and lectures at the week-long, annual GATE NLP summer school in Sheffield.

 

Johanna Kieniewicz (now Head of Outreach and Engagement at the Institute of Physics) was Environmental Science Research and Engagement Manager at the British Library at the time of this research, and led the engagement with the environmental science research community. Experienced with a variety of consultation methodologies, Dr. Kieniewicz researched information needs of the UK environmental science research community and captured content and user interface requirements for the Library's Envia project. She has also been trained by the University of Sheffield on the semantic annotation of content using GATE Developer.

 

Stephen Andrews is STM Products and Services Development Leader at the British Library, managing the development of products and services within the STM team. To date this has encompassed the implementation of UK PubMed Central, the Names project (in collaboration with the University of Manchester), Envia (in particular the exploration of the use of semantic technologies), and a joint project with Microsoft Research prototyping a Virtual Research Environment within the researcher's desktop.

 

Michael Wallis is a flood and coastal research scientist and experienced project manager within the Coasts and Estuaries Group at HR Wallingford. A Chartered Water and Environmental Manager, Michael has worked on many flood and coastal erosion risk management projects and has assisted regulators and responsible authorities in the development of new research, tools and guidance for the assessment of flood and coastal risks and for the management of flood and coastal defence assets.

 