Hybrid Search: Effectively Combining Keywords and Semantic Searches
Abstract. This paper describes hybrid search, a search method supporting both
document and knowledge retrieval via the flexible combination of ontology-
based search and keyword-based matching. Hybrid search smoothly copes with
lack of semantic coverage of document content, which is one of the main
limitations of current semantic search methods. In this paper we define hybrid
search formally, discuss its compatibility with the current semantic trends and
present a reference implementation: K-Search. We then show how the method
outperforms both keyword-based search and pure semantic search in terms of
precision and recall in a set of experiments performed on a collection of about
18,000 technical documents. Experiments carried out with professional users
show that users understand the paradigm and consider it very powerful and
reliable. K-Search has been ported to two applications released at Rolls-Royce
plc for searching technical documentation about jet engines.
1 Introduction
The Semantic Web (SW) is a creative mix of metadata designed according to multiple
ontologies and unstructured documents (e.g. classic Web documents). The assumption
that the SW is not a Web of documents, but a Web of relations between resources
denoting real world objects [4] is too restrictive of the true nature of the SW. There
are a number of applications and situations where coexistence of documents and
metadata is actually required. One example is the legal scenario, where access to
documents is the main focus and the available metadata is the means to reach a
specific set of documents [5]. However it may well happen that the available metadata
does not cover parts of the document that are of interest to some users because: (i) the
ontology used for annotation has a different focus and does not model that part of the
content or (ii) annotations can be incomplete, whether user or system provided. A
human annotator may miss some annotations or provide spurious ones; in the same way,
automated means such as Information Extraction from texts (IE) may be unable to
reliably extract the information required. This is because IE is a technology that
performs very well on simple tasks (such as named entity recognition), but poorly on
more complex tasks such as event capture [8]. Therefore, some metadata modelled by
an ontology may be impossible to capture with IE thus preventing any future
operation (e.g. retrieval) via that metadata.
In this paper, we focus on searching the SW as a collection of both documents and
metadata, with the aim of accommodating different user tasks: document retrieval
and/or knowledge retrieval. A document retrieval task implies searching for
documents using concepts or keywords of interest; a knowledge retrieval task
concerns retrieving facts from a knowledge base (i.e. triples). Unlike
previous literature [1, 2, 3, 4, 9], we consider the issue of working in a complex
environment where metadata only partially covers the user information needs. We
therefore propose a strategy, called Hybrid Search (HS), in which a mix of
keyword-based and metadata-based strategies is used. We formally define the
approach and describe how to organise a HS architecture. We then describe K-Search,
a reference implementation of HS. In implementing an approach, a number of
decisions are made: methodological (e.g. we selected a form-based approach [1]), and
technical (e.g. on the expressivity of covered language and architecture design). We
discuss how these choices impact the HS mechanism. Then we present two
experiments performed using a K-Search application:
• in vitro: K-Search was applied to a large corpus of legacy documents; an evaluation
of the resulting application shows HS outperforming both keyword-based searching
and semantic searching;
• in vivo: the application was evaluated with real users; the results show that users
appreciate the full power of the HS concept.
Finally, we compare our work to the state of the art, discuss how it is possible to
extend the currently available semantic search paradigms to cope with HS, draw
conclusions and highlight future work.
2 Hybrid Search
The most commonly used method for document retrieval is keyword-based search
(KS). KS effectiveness is often affected by two main issues: ambiguity and
synonymity. Ambiguity arises in traditional keyword search systems because
keywords can be polysemous, i.e. they can have multiple meanings. A search
containing ambiguous terms will return spurious documents (low precision).
Synonymity is found when an object can be identified by multiple equivalent terms.
When searching documents using just one of the terms, the documents containing
only the other synonyms are not retrieved (low recall). Semantic search, i.e.
metadata-based search defined according to an ontology, overcomes both issues because
annotations are unambiguous and do not suffer from synonymity.
Nonetheless, when pure Semantic Search is applied to a document retrieval task, it can
fail to cover the user information needs (either because of limitations in the
ontology or because the metadata is unavailable for a specific document), as it
restricts the types of queries users can perform (low recall).
HS combines the disambiguation capabilities of semantic search (when metadata is
available) with the generality and extensibility of keyword-based search (for the other
cases). The expected result is that:
• precision and recall are increased with respect to the standard keyword-based
search because ambiguity and synonymity are dealt with by semantic search
when available;
• the use of keywords where metadata is missing makes it possible to answer otherwise
impossible queries (increased recall with respect to semantic search). As
keywords are combined with metadata in the same query, the context given by
the available metadata helps in disambiguating keywords as well (higher
precision than keyword-based search).
This section discusses a generic architecture for HS, while the next one presents an
actual implementation.
At indexing time, documents are indexed using a standard keyword-based engine
such as SolR¹. Annotations (e.g. generated by an IE system) are stored in a
Knowledge Base (e.g. a triple store like Sesame²) in the form of RDF triples.
Provenance of facts must be recorded, for example in the form of triples connecting
the facts’ URIs and those of the document of origin, as well as the original strings
used in the documents.
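The indexing step can be illustrated with a minimal in-memory sketch. It is not the SolR/Sesame setup described above: the class HybridIndex and its method names are hypothetical stand-ins, and the whitespace tokenisation is an assumption made only to keep the example short.

```python
from collections import defaultdict


class HybridIndex:
    """Toy stand-in for a keyword index plus a triple store (hypothetical names)."""

    def __init__(self):
        self.postings = defaultdict(set)    # term -> URIs of documents containing it
        self.triples = set()                # (subject, relation, object) facts
        self.provenance = defaultdict(set)  # fact -> {(document URI, original string)}

    def index_document(self, doc_uri, text):
        # Naive tokenisation: lowercase and split on whitespace (an assumption).
        for term in text.lower().split():
            self.postings[term].add(doc_uri)

    def add_annotation(self, triple, doc_uri, source_string):
        # Store the fact and record which document and string it came from.
        self.triples.add(triple)
        self.provenance[triple].add((doc_uri, source_string))


index = HybridIndex()
index.index_document("doc:1", "Fuel meter unit removed after delay")
index.add_annotation(
    ("event:1", "removed-component", "fuel meter unit"), "doc:1", "Fuel meter unit"
)
```

Keeping the original string alongside the document URI is what later allows keywords-in-context queries to be matched against the provenance of an annotation.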
At retrieval time, HS performs the following steps:
• the query is parsed and the different components (keywords, keywords-in-context
and metadata-based) identified;
• keyword matches are sent to the traditional information retrieval system;
• metadata searches are translated into a query language like SPARQL³ and sent
to a triple store;
• keywords-in-context queries are matched with the provenance of annotations in
documents (again using SPARQL and a triple store);
• finally, the results of the different queries are merged, ranked and displayed.
¹ http://lucene.apache.org/solr/
² http://www.openrdf.org/
³ http://www.w3.org/TR/rdf-sparql-query/
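The first retrieval steps (identifying the query components and routing each to its backend) might be sketched as follows. The HybridQuery structure, the item kinds and the backend labels are all hypothetical; the paper does not prescribe a concrete query representation.

```python
from dataclasses import dataclass, field


@dataclass
class HybridQuery:
    keywords: list = field(default_factory=list)             # free keywords
    keywords_in_context: list = field(default_factory=list)  # (concept, keyword) pairs
    metadata: list = field(default_factory=list)             # (concept, value) pairs


def parse_form_items(items):
    """Split (kind, payload) items into the three hybrid query components."""
    query = HybridQuery()
    for kind, payload in items:
        if kind == "keyword":
            query.keywords.append(payload)
        elif kind == "keyword_in_context":
            query.keywords_in_context.append(payload)
        elif kind == "metadata":
            query.metadata.append(payload)
        else:
            raise ValueError(f"unknown query component: {kind!r}")
    return query


def dispatch(query):
    """Return, for each backend, the non-empty components it should receive."""
    plan = {}
    if query.keywords:
        plan["keyword_index"] = query.keywords           # traditional IR system
    if query.metadata:
        plan["triple_store"] = query.metadata            # e.g. translated to SPARQL
    if query.keywords_in_context:
        plan["provenance"] = query.keywords_in_context   # matched on annotation provenance
    return plan
```

The merging, ranking and display steps are deliberately left out of this sketch.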
Merging of results. A direct matching between keyword and semantic results is not
straightforward as their results are incompatible. Keyword matching returns a set of
URIs of documents (KSDocUriSet) of size n:
KSDocUriSet ⊂ URIs, where KSDocUriSet = {uri_1, uri_2, …, uri_n}
while a semantic search performed on a knowledge base returns an unordered set rSet
(size m) of individual assertions <subj, rel, obj>⁴.
In order to provide the answer for users interested in document retrieval, the
assertions in rSet are mapped, using provenance information, to the URIs of the
documents they originate from (OSDocUriSet). This list of document URIs is directly
compatible with the output of keyword matching, and the result of the query is given by
the intersection of the two sets of document URIs:
HybridSearchUriSet = KSDocUriSet ∩ OSDocUriSet
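The merge can be sketched in a few lines, assuming that provenance is available as a mapping from each assertion to the URIs of its documents of origin. The function names and the dictionary layout are illustrative assumptions; only the set names come from the text.

```python
def documents_for_triples(r_set, provenance):
    """Map assertions to the URIs of the documents they were extracted from."""
    return {doc for triple in r_set for doc in provenance.get(triple, ())}


def hybrid_merge(ks_doc_uri_set, r_set, provenance):
    """HybridSearchUriSet = KSDocUriSet ∩ OSDocUriSet."""
    os_doc_uri_set = documents_for_triples(r_set, provenance)
    return set(ks_doc_uri_set) & os_doc_uri_set


# Hypothetical provenance: assertion -> documents of origin.
provenance = {
    ("event:1", "removed-component", "fuel meter unit"): {"doc:1"},
    ("event:2", "removed-component", "fuel pump"): {"doc:2", "doc:3"},
}
hybrid = hybrid_merge(["doc:2", "doc:4"], set(provenance), provenance)
```

In this toy case only doc:2 is returned by both the keyword search and the semantic search, so it is the only member of the hybrid result.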
Ranking. Effective ranking (i.e. the ability to return relevant documents first) is
extremely important for a positive user experience. The results returned by the
different modalities provide material for orthogonal ranking methods:
• keyword-based systems like Lucene enable ranking of documents according to (1)
their ability to match the keyword-based query; (2) the keywords used in anchor
links (i.e. the text associated to hyperlinks pointing to a specific document) and (3)
the document popularity measured as function of the weight of the links referring
to the document itself;
• semantic search ranks according to the presence and quality of metadata.
⁴ Both ontology-based and keyword-in-context queries are covered here.
Different ranking solutions can be adopted according to the use case. The most
natural one is to adopt the ranking provided by the keyword-based search, as it is
based on solidly proven methods, especially the use of anchor texts and hyperlinking.
However, more sophisticated strategies can be designed, especially for organisational
repositories where such interlinking is generally not present [14].
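The simplest option, reusing the keyword engine's ranking, amounts to filtering its ordered result list by the hybrid set. The sketch below assumes the keyword engine returns URIs best-first; the function name is hypothetical.

```python
def rank_by_keyword_order(ks_ranked_uris, hybrid_uri_set):
    """Keep the keyword engine's order, dropping documents outside the hybrid set."""
    return [uri for uri in ks_ranked_uris if uri in hybrid_uri_set]


ranked = rank_by_keyword_order(["doc:3", "doc:1", "doc:2"], {"doc:2", "doc:3"})
```

Here doc:3 stays ahead of doc:2 because that is the order the keyword engine gave them.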
Presentation of results. Depending on the task (i.e. document retrieval vs.
knowledge retrieval), results can be presented in different ways: as a list of ranked
documents, as aggregated metadata (e.g. via graphs or charts) with associated
provenance, etc.
Figure 1 - Interface detail: the query form. Clicking a concept on the ontology creates a form
item enabling inserting restrictions on metadata. Disjunctions are easily introduced by clicking
[or].
We have chosen to model our search interface on a form data entry paradigm. The
interface (Figure 1) works in a standard browser and enables the definition of
complex hybrid queries in an intuitive way. Keywords can be inserted into a default
form field in a way similar to that required by search engines; Boolean operators OR
and AND can be used in their combination. Conditions on metadata can be added to
the query by clicking on the ontology tree (left side of interface in Figure 2). This
creates a form item to insert conditions on the specific concept. As multiple
constraints can be added to the query, the logical language is restricted to provide a
simple and intuitive interface: only common Boolean combinations are supported.
This decision was supported by the observation that in carrying out their tasks, users
adopted strategies that do not require the full logical language; furthermore research
done in human-computer interaction shows that graphical representation of the whole
Boolean logic is not understood by most users [10].
AND constructs are allowed among conditions checking different concepts in the
ontology. So for example, contains(removed-component, “fuel”) AND
contains(jet-engine-name, “engineA”) is acceptable, but contains(removed-component, “fuel”)
AND contains(removed-component, “meter”) is not. The latter is acceptable if
formulated as contains(removed-component, “fuel meter”). Conditions in AND are
displayed on different lines in the interface (Figure 3 shows an example of a
combination of removed-component AND operational-effect). The expressivity
restrictions are motivated by the results of our user studies, which showed which
types of queries the users wanted to make.
OR constructs are acceptable only between conditions on the same concept. So
contains(removed-component, “fuel”) OR contains(removed-component, “meter”) is
accepted, but contains(removed-component, “fuel”) OR contains(jet-engine-name,
“engineA”) is not. The latter must be split into two different queries. Again, these
restrictions are motivated by results of our user studies.
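The two expressivity restrictions can be stated compactly as checks over (concept, value) conditions. This is only an illustration of the rules as described above, with hypothetical function names; it is not taken from the K-Search code.

```python
from collections import Counter


def valid_and(conditions):
    """AND is allowed only among conditions on different concepts."""
    counts = Counter(concept for concept, _ in conditions)
    return all(n == 1 for n in counts.values())


def valid_or(conditions):
    """OR is allowed only among conditions on the same concept.

    An empty list of conditions is rejected.
    """
    return len({concept for concept, _ in conditions}) == 1


# Examples from the text.
ok_and = valid_and([("removed-component", "fuel"), ("jet-engine-name", "engineA")])
bad_and = valid_and([("removed-component", "fuel"), ("removed-component", "meter")])
ok_or = valid_or([("removed-component", "fuel"), ("removed-component", "meter")])
bad_or = valid_or([("removed-component", "fuel"), ("jet-engine-name", "engineA")])
```

With these rules a query is always a conjunction over distinct concepts, each carrying a disjunction of alternative values, which is the shape the form interface displays.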
Figure 1 shows how the query “retrieve all events where removal of a fuel meter unit
caused delay or cancellation” - logically translated into contains(removed-component,
“fuel meter unit”) AND equal(operational-effect, (delay OR cancellation)) - appears
at the interface level: two concepts (removed-component and operational-effect) have
been selected; removed-component has been specified with a single option (fuel meter
unit) while operational-effect covers two alternatives (delay or cancellation).
Figure 2 - The interface showing the list of documents returned (centre top), an annotated
document and a graph produced from the results (image modified to protect confidential data).
In order to make document metadata and indexes available, K-Search uses: (i) SolR
for indexing documents and (ii) a generic semantic annotation plugin. Plugins
currently exist for AktiveMedia (manual and semi-automatic annotation [6]) and
some information extraction tools (T-Rex, an ontology-based IE tool [15], and Saxon,
a rule-based extraction system⁵). Extracted information (ontology-based annotations)
is stored in the form of RDF triples according to OWL or RDF ontologies into a triple
store. K-Search provides plugins for Sesame and 3store; query languages supported
are SPARQL and SeRQL.
⁵ http://nlp.shef.ac.uk/wig/tools/saxon/
5 Evaluation
Tests were carried out to evaluate the effectiveness and the user acceptance of the HS
paradigm. The evaluation was performed using the K-Search Event Reports
application (developed for Rolls-Royce plc) in two separate steps:
• in vitro: first of all the precision and recall of the IE system used in the specific
case were evaluated; then 21 user-defined topics were translated into queries using
three options: keyword-based searching, ontology-based searching and hybrid
searching and the performances were recorded; these tests enabled us to evaluate
the effectiveness of the method in principle;
• in vivo: 32 Rolls-Royce plc employees were involved in a usability test and
commented on efficiency, effectiveness, and satisfaction; this evaluation enabled
measuring the extent to which users understand the HS paradigm and feel that it
returns appropriate results.
The in vitro evaluation is composed of two parts, one to evaluate the effectiveness of
the IE, the second to compare HS to keyword-based and semantic search.
IE evaluation
We analyzed a corpus of 18,097 reports on operational conditions of jet engines
provided by Rolls-Royce plc. They are semi-structured Word documents containing
tables and free text. As these documents are generated as part of the same
management process, they all contain broadly the same relevant information but
tables are user defined, so in principle each document can contain different types of
table. However, some regularity occurs in tables across documents as users tend to re-
use previously generated documents as templates. The documents were converted into
XML and HTML, then indexed using SolR, and metadata were generated using T-Rex.
The ontology included concepts like the location where the event occurred, installed
component(s), removed component(s), event details, what was the operational effect
on the flight (delay, cancellation etc.), location, author, etc. The evaluation of the IE
system was performed in order to understand which metadata were recognisable with
an acceptable accuracy. Information in tables tends to be captured reliably by the IE
system. This is because, although tables are irregular (e.g. sometimes the semantics is
on the rows, sometimes on the columns, sometimes the information is spread over
multiple cells, sometimes multiple information is compressed in one single cell), they
roughly contain the same information and derive from evolution of common tables. T-
Rex’s learning curve assumed an asymptotic shape after learning from about 200
manually annotated documents. The combined evaluation results on all fields
obtained in a two-fold cross-validation test using 400 documents were Precision=98%,
Recall=99%, (harmonic) F-Measure=98%. Information in tables contained most of
the metadata required in the ontology with the exception of the event cause.
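As a quick consistency check on the reported figures, the harmonic F-Measure can be recomputed from precision and recall. The helper below is a generic sketch; the beta parameter gives the weighted variant and is not a value taken from the paper.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 is the plain F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


print(round(f_measure(0.98, 0.99), 3))  # 0.985
```

A value of about 0.985 is in line with the F-Measure of 98% reported for the IE system.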
As for the information contained in the free text (which mainly described the
event cause), accuracy was not at a level adequate to the user expectations
(which, according to our studies, were very close to 100% for recall and >90% for
precision); therefore it was not made available to semantic search. It was, however, still
available for searching via keywords.
This is because the metadata did not completely cover 6 of the topics. Keyword-based
search has the lowest precision and fair recall in the same task. Hybrid Search reports
very high precision (same as OS, +51% with respect to KS), and the highest recall
(+46% with respect to keywords and +109% with respect to ontology-based search).
The (weighted harmonic) F-Measure is +49% with respect to keywords and +55% with
respect to ontology-based search. In conclusion, in our experiment HS outperforms the other
methods in ranking relevant documents within the first 20 results. Experimental
results for the first 50 returned documents are largely equivalent.
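Measuring precision and recall within the first k returned documents can be sketched as below. This is a generic formulation under stated assumptions (a ranked list of URIs and a known set of relevant documents per topic); the exact procedure used in the experiments is not detailed here, and the function name is hypothetical.

```python
def precision_recall_at_k(ranked_uris, relevant_uris, k=20):
    """Precision and recall computed over the first k returned documents."""
    top_k = ranked_uris[:k]
    hits = sum(1 for uri in top_k if uri in relevant_uris)
    # Precision is taken over the documents actually returned in the top k.
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_uris) if relevant_uris else 0.0
    return precision, recall


p, r = precision_recall_at_k(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"}, k=2)
```

Repeating the computation with k=50 gives the second cutoff mentioned above.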
The data collected allows assessing the validity of the HS paradigm as well as the
usability of K-Search (Figure 4):
• Use of HS: all users appeared to have grasped the concept of HS. Users adopted
different strategies: some started querying using keywords and added conditions on
metadata in a second iteration; others instead composed conditions on metadata
and keywords in a single search; others used metadata search initially and added
keywords later to refine the task. This means that different users’ search
strategies can be accommodated within the framework.
• Learning: 75% of users found it easy or very easy to learn to use the system, 25%
found it average.
• System accuracy (system reliability in retrieving relevant documents): 82% of the
users judged K-Search reliable or highly reliable; although this could seem a
feature of the system rather than of HS, in our view the comment refers to the fact
that with HS the searches were effective.
• Searching experience: 82% of users found K-Search easy or very easy to use; the
ease of use was often commented on in the interviews.
• System Speed: the system was judged fast or very fast in executing the queries,
allowing a quick task completion, by 98% of users.
Acknowledgments. This work was supported by IPAS, a project jointly funded by the UK
DTI (Ref. TP/2/IC/6/I/10292) and Rolls-Royce plc and by X-Media
(www.x-media-project.org), an Integrated Project on large scale knowledge management
across media, funded by the European Commission under the IST programme (IST-FP6-026978). Thanks to Colin
Cadas (Rolls-Royce) for the constant support in the past two years. Thanks to all the users for
their very positive attitude and the helpful feedback.
References
1. Uren, V., Lei, Y., Lopez, V., Liu, H., Motta, E. and Giordanino, M.: The usability of
semantic search tools: a review, Knowledge Engineering Review, in press.
2. Kaufmann, E. and Bernstein, A.: How Useful are Natural Language Interfaces to the
Semantic Web for Casual End-users? Proceedings of the 6th International Semantic Web
Conference and the 2nd Asian Semantic Web Conference, Busan, Korea, November 2007
3. Lei, Y., Uren, V. and Motta, E. SemSearch: A Search Engine for the Semantic Web. in 15th
International Conference on Knowledge Engineering and Knowledge Management
Managing Knowledge in a World of Networks (EKAW 2006). 2006. Podebrady.
4. Guha, R., McCool, R. Miller, E. Semantic Search. in 12th International Conference on
World Wide Web. 2003
5. Gilardoni, L., Biasuzzi, C., Ferraro, M., Fonti, R., Slavazza, P.: LKMS – A Legal
Knowledge Management System exploiting Semantic Web technologies, Proceedings of the
4th International Conference on the Semantic Web (ISWC), Galway, November 2005.
6. Chakravarthy, A., Lanfranchi, V., Ciravegna, F.: Cross-media Document Annotation and
Enrichment, Proceedings of the 1st Semantic Authoring and Annotation Workshop, 5th
International Semantic Web Conference (ISWC2006), Athens, GA, USA, 2006
8. Ireson, N., Ciravegna, F., Califf, M.E., Freitag, D., Kushmerick, N., Lavelli, A.: Evaluating
Machine Learning for Information Extraction, Proceedings of the 22nd International
Conference on Machine Learning (ICML 2005), Bonn, Germany, 2005
9. Kiryakov, A., Popov, P., Terziev, I., Manov, D., Ognyanoff, D.: Semantic annotation,
indexing, and retrieval, Journal of Web Semantics, Vol 2 (1), 49-79
10. Shneiderman, B.: Designing the User Interface (3rd edition). Addison-Wesley, 1997.
12. Dzbor, M., Domingue, J. B., Motta, E.: Magpie - towards a semantic web browser. 2nd
International Semantic Web Conference (ISWC), Sanibel Island, Florida, USA, 2003.
13. Lanfranchi, V., Ciravegna, F., Petrelli, D.: Semantic Web-based Document: Editing and
Browsing in AktiveDoc, Proceedings of the 2nd European Semantic Web Conference,
Heraklion, Greece, 2005.
14. Rocha, R., Schwabe, D. and Poggi de Aragão, M.: A Hybrid Approach for Searching in the
Semantic Web, in the 2004 International World Wide Web Conference, May 17-22, 2004,
New York, New York.
15. Iria, J. and Ciravegna, F.: A Methodology and Tool for Representing Language Resources
for Information Extraction. In Proc. of LREC 2006, Genoa, Italy, May 2006.
16. Tran, T., Cimiano, P., Rudolph, R. and Studer, R.: Ontology-based Interpretation of
Keywords for Semantic Search. Proceedings of the 6th International Semantic Web
Conference and the 2nd Asian Semantic Web Conference, Busan, Korea, November 2007
17. Catarci, T., Di Mascio, T., Franconi, E., Santucci, G., Tessaris, S. An Ontology Based
Visual Tool for Query Formulation Support. in 16th European Conference on Artificial
Intelligence (ECAI-04). 2004. Valencia, Spain.
18. Kaufmann, E., Bernstein, A. and Zumstein, R. Querix: A natural language interface to
query ontologies based on clarification dialogs. In 5th ISWC, pages 980–981, Athens, GA,
2006.
19. Corby, O., Dieng-Kuntz, R., Faron-Zucker, C., and Gandon, F., Searching the Semantic
Web: Approximate Query Processing Based on Ontologies. IEEE Intelligent Systems, 2006.
21(1)