Knowledge Harvesting in the Big-Data Era

Fabian Suchanek
Max Planck Institute for Informatics
D-66123 Saarbruecken, Germany
Gerhard Weikum
Max Planck Institute for Informatics
D-66123 Saarbruecken, Germany
The proliferation of knowledge-sharing communities such as Wiki-
pedia and the progress in scalable information extraction from Web
and text sources have enabled the automatic construction of very
large knowledge bases. Endeavors of this kind include projects
such as DBpedia, Freebase, KnowItAll, ReadTheWeb, and YAGO.
These projects provide automatically constructed knowledge bases
of facts about named entities, their semantic classes, and their mu-
tual relationships. They contain millions of entities and hundreds of
millions of facts about them. Such world knowledge in turn enables
cognitive applications and knowledge-centric services like disam-
biguating natural-language text, semantic search for entities and
relations in Web and enterprise data, and entity-oriented analytics
over unstructured contents. Prominent examples of how knowledge
bases can be harnessed include the Google Knowledge Graph and
the IBM Watson question answering system. This tutorial presents
state-of-the-art methods, recent advances, research opportunities,
and open challenges along this avenue of knowledge harvesting and
its applications. Particular emphasis will be on the twofold role of
knowledge bases for big-data analytics: using scalable distributed
algorithms for harvesting knowledge from Web and text sources,
and leveraging entity-centric knowledge for deeper interpretation
of and better intelligence with Big Data.
Categories and Subject Descriptors
H.1 [Information Systems]: Models and Principles
Big Data, Information Extraction, Knowledge Base, Ontology, En-
tity Recognition, Web Contents
1.1 Knowledge Bases
Knowledge harvesting from Web and text sources has become a
major research avenue in the last five years. It is the core methodology
for the automatic construction of large knowledge bases [2, 3,
51], going beyond manually compiled knowledge collections like
Cyc [63], WordNet [33], and a variety of ontologies [101]. Salient
projects with publicly available resources include KnowItAll [30,
6, 31], ConceptNet [99], DBpedia [5], Freebase [11], NELL [16],
WikiTaxonomy [88], and YAGO [102, 46, 8]. Commercial interest
has been growing strongly, as evidenced by projects like the
Google Knowledge Graph, the EntityCube/Renlifang project at Microsoft
Research [84], and the use of public knowledge bases for
type coercion in IBM's Watson project [52].
These knowledge bases contain many millions of entities, organized
in hundreds to hundreds of thousands of semantic classes, and
hundreds of millions of relational facts between entities. All this is
typically represented in the form of RDF-style subject-predicate-
object (SPO) triples. Moreover, knowledge resources can be se-
mantically interlinked via owl:sameAs triples at the entity level,
contributing to the Web of Linked Open Data (LOD) [45].
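As a minimal sketch of the SPO representation described above, the following Python fragment stores triples in memory and looks up owl:sameAs links for an entity. The class names and the `kb1:`/`kb2:` prefixes are hypothetical and chosen only for illustration; real knowledge bases use RDF stores and full URIs.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class TripleStore:
    """Tiny in-memory SPO store indexed by subject (illustrative only)."""

    def __init__(self) -> None:
        self._by_subject = defaultdict(set)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._by_subject[subject].add(Triple(subject, predicate, obj))

    def facts_about(self, subject: str) -> list:
        # Sorted so that results are deterministic.
        return sorted(self._by_subject.get(subject, set()))

    def same_as(self, subject: str) -> list:
        # Entities in other resources that are declared identical.
        return [t.obj for t in self.facts_about(subject)
                if t.predicate == "owl:sameAs"]


kb = TripleStore()
# "kb1:" and "kb2:" are made-up prefixes standing in for two resources.
kb.add("kb1:Steve_Jobs", "founded", "kb1:Apple_Inc.")
kb.add("kb1:Steve_Jobs", "owl:sameAs", "kb2:Steve_Jobs")
print(kb.same_as("kb1:Steve_Jobs"))
```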
Large knowledge bases are typically built by mining and dis-
tilling information from sources like Wikipedia which offer high-
quality semi-structured elements (infoboxes, categories, tables, lists),
but many projects also tap into extracting knowledge from arbitrary
Web pages and natural-language texts. Despite great advances in
these regards, there are still many challenges regarding the scale of
the methodology and the scope and depth of the harvested knowledge:
covering more entities beyond Wikipedia and discovering newly emerging entities,
increasing the number of facts about entities and extracting more interesting relationship types in an open manner,
capturing the temporal scope of relational facts,
tapping into multilingual inputs such as Wikipedia editions in many different languages,
extending fact-oriented knowledge bases with commonsense knowledge and (soft) rules,
detecting and disambiguating entity mentions in natural-language text and other unstructured contents, and
large-scale sameAs linkage across many knowledge and data
1.2 Enabling Intelligent Applications
Knowledge bases are a key asset that enables and contributes to
intelligent computer behavior. Application areas along these lines
include the following:
Semantic search and question answering: Machine-readable encyclopedias
are a rich source for answering expert-level questions
in a precise and concise manner. Moreover, interpreting
users' information needs in terms of entities and relationships
yields strong features for informative ranking of search
results and entity-level recommendations over Web and enter-
prise data.
Deep interpretation of natural language: Both written and spoken
language are full of ambiguities. Knowledge is the key
to mapping surface phrases to their proper meanings, so that
machines interpret language as fluently as humans. As user-generated
social-media content is abundant and human-computer
interaction is more and more based on smartphones, coping
with text, speech, and gestures will become crucial.
Machine reading at scale: The deluge of online contents over-
whelms users. Users wish to obtain overviews of the salient
entities and relationships for a week of news, a month of scientific
articles, a year of political speeches, or a century of essays
on a specific topic.
Reasoning and smart assistants: Rich sets of facts and rules
from a knowledge base enable computers to perform logical
inferences in application contexts.
Big-Data analytics over uncertain contents: Daily news, social
media, scholarly publications, and other Web contents are the
raw inputs for analytics to obtain insights on business, poli-
tics, health, and more. Knowledge bases are key to discovering
and tracking entities and relationships and thus making sense
of noisy contents.
1.3 Scope and Structure of the Tutorial
This tutorial gives an overview of knowledge harvesting and discusses
hot topics in this field, pointing out research opportunities
and open challenges. As the relevant literature is widely dispersed
across different communities, we also venture into the neighboring
fields of Web Mining, Artificial Intelligence, Natural Language
Processing, Semantic Web, and Data Management. The presentation
is structured according to the following sections and subsections.
2.1 Knowledge Bases in the Big-Data Era
Many Big-Data applications need to tap unstructured data. News,
social media, web sites, and enterprise sources produce huge amounts
of valuable contents in the form of text and speech. The key to making
sense of these contents is to identify the entities that are referred
to and the relationships between entities. This allows linking un-
structured contents with structured data, for value-added analytics.
Knowledge bases are a key asset for lifting unstructured contents
into entity-relationship form and making the connection to struc-
tured data. We give an overview of several large and publicly avail-
able knowledge bases, and outline how they can support Big-Data
2.2 Harvesting of Entities and Classes
Every entity in a knowledge base (such as Steve_Jobs) be-
longs to one or more classes (such as computer_pioneer). These
classes are organized into a taxonomy, where more special classes
are subsumed by more general classes (such as person). We dis-
cuss two groups of approaches to harvest information on classes
and their instances: i) Wikipedia-based approaches and ii) Web-based
approaches using set expansion and other techniques. Relevant
work in the first group includes [88, 89, 102, 123]. Relevant
work in the second group includes [4, 21, 44, 57, 87, 105, 114,
3.1 Harvesting Relational Facts
Relational facts express relationships between two entities, for
example, the following facts about Steve Jobs:
Steve_Jobs founded Apple_Inc.,
Steve_Jobs was_Board_Member_of Walt_Disney_Company,
Steve_Jobs died_on 5-Oct-2011,
Steve_Jobs died_of Pancreas_Cancer,
Steve_Jobs has_Friend Joan_Baez, and more.
There is a large spectrum of methods to extract such facts from
Web data, tapping both semistructured sources like Wikipedia in-
foboxes, lists, and tables, and natural-language text sources like
Wikipedia full-text articles, news, and social media. We give an
overview of methods from pattern matching (e.g., regular expressions),
computational linguistics (e.g., dependency parsing), statistical
learning (e.g., factor graphs and MLNs), and logical consistency
reasoning (e.g., weighted MaxSat or ILP solvers). We also
discuss to what extent these approaches scale to handle big data.
Overviews of information extraction methods for knowledge base
population are given in [26, 95, 122]. For specific state-of-the-art
methods, see the following original papers and references given
there: [1, 10, 13, 15, 16, 17, 18, 30, 32, 36, 40, 46, 49, 58, 60, 69,
70, 76, 85, 86, 93, 103, 111, 127]. For foundations of statistical
learning methods used in this context, see [27, 39, 55].
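The simplest of the method families named in this section, pattern matching with regular expressions, can be sketched as follows. The pattern `FOUNDED` and the function `extract_founded` are hypothetical and hand-written for one relation only; they are not taken from any of the cited systems, which learn or generalize patterns rather than hard-coding them.

```python
import re

# Hand-written surface pattern for a single relation (an assumption for
# illustration): "<Capitalized Name> founded <Capitalized Name>".
FOUNDED = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) "
    r"(?:co-)?founded "
    r"(?P<org>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z.]+)*)"
)


def extract_founded(text: str) -> list:
    """Return (subject, predicate, object) triples matched in the text."""
    triples = []
    for match in FOUNDED.finditer(text):
        # Canonicalize surface names the way the facts above are written.
        person = match["person"].replace(" ", "_")
        org = match["org"].replace(" ", "_")
        triples.append((person, "founded", org))
    return triples


print(extract_founded("Steve Jobs founded Apple Inc. in 1976."))
```

Such a pattern is precise but brittle: it misses paraphrases ("was the founder of") and does not resolve names to entities, which is where the linguistic, statistical, and reasoning-based methods come in.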
3.2 Open Information Extraction
In contrast to approaches that operate on a predefined list of relations
and a huge but fixed set of entities, open IE harvests arbitrary
subject-predicate-object triples from natural-language documents.
It aggressively taps into noun phrases as entity candidates and verbal
phrases as prototypic patterns for relations. For example, in addition
to capturing the pre-specified hasWonPrize relation, we aim
to automatically learn that nominatedForPrize is also an interesting
relation expressed by natural-language patterns such as "candidate
for . . . prize" or "expected to win . . . prize". We discuss recent
methods that follow this Open IE direction [6, 12, 24, 31, 41, 56,
72, 75, 77, 83, 115, 125]. Some methods along these lines make
clever use of Big-Data techniques like frequent sequence mining
and map-reduce computation.
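A very rough sketch of the open IE idea is to collect the phrase between two entity mentions as a candidate relation pattern and to count how often each pattern recurs. The snippet below assumes that mentions are already marked with square brackets, which is a simplification made for illustration; the function name `candidate_patterns` and the toy sentences are made up, and plain counting here merely stands in for the frequent sequence mining used by actual systems.

```python
import re
from collections import Counter

# Assumption: entity mentions are pre-marked as [ ... ] in the input.
MENTION = re.compile(r"\[([^\]]+)\]")


def candidate_patterns(sentences) -> Counter:
    """Count the phrases that occur between adjacent entity mentions."""
    counts = Counter()
    for sentence in sentences:
        mentions = list(MENTION.finditer(sentence))
        for left, right in zip(mentions, mentions[1:]):
            between = sentence[left.end():right.start()].strip().lower()
            if between:
                counts[between] += 1
    return counts


toy_sentences = [  # invented sentences, not real facts
    "[Person A] was a candidate for the [Prize X]",
    "[Person B] was a candidate for the [Prize Y]",
    "[Person C] won the [Prize X]",
]
for pattern, freq in candidate_patterns(toy_sentences).most_common():
    print(freq, pattern)
```

Patterns that recur with many different entity pairs are candidates for new relation types; rare ones are more likely noise.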
3.3 Temporal, Multilingual, Commonsense,
and Visual Knowledge
In this part, we venture beyond entity-relationship facts and de-
scribe approaches that attach meta-information to facts. This con-
cerns the temporal or spatial context of a fact [38, 61, 68, 106, 107,
112, 116, 117, 118], or describes entities in multiple languages [22,
23, 78, 81]. Along the temporal dimension, we would like to cap-
ture the timepoints of events and the timespans during which cer-
tain relationships hold, for example:
Steve_Jobs Chairman_of Apple_Inc. @[1976,1985],
Steve_Jobs CEO_of Apple_Inc. @[Sep-1997,Aug-2011],
Pixar acquired_by Walt_Disney_Company @5-May-2006.
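The temporally scoped facts above can be represented, in a minimal sketch, as triples with a validity interval. The class `TemporalFact` and the helper `facts_holding_on` are hypothetical names. As an assumption for illustration, year-level and month-level bounds are widened to the first and last day of the stated period, and a timepoint is stored as an interval of one day.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TemporalFact:
    subject: str
    predicate: str
    obj: str
    begin: date  # inclusive
    end: date    # inclusive

    def holds_on(self, day: date) -> bool:
        return self.begin <= day <= self.end


facts = [
    # Coarse bounds are expanded to full years or months (assumption).
    TemporalFact("Steve_Jobs", "Chairman_of", "Apple_Inc.",
                 date(1976, 1, 1), date(1985, 12, 31)),
    TemporalFact("Steve_Jobs", "CEO_of", "Apple_Inc.",
                 date(1997, 9, 1), date(2011, 8, 31)),
    TemporalFact("Pixar", "acquired_by", "Walt_Disney_Company",
                 date(2006, 5, 5), date(2006, 5, 5)),
]


def facts_holding_on(all_facts, day: date) -> list:
    """Return the facts whose validity interval contains the given day."""
    return [f for f in all_facts if f.holds_on(day)]


for fact in facts_holding_on(facts, date(2000, 1, 1)):
    print(fact.subject, fact.predicate, fact.obj)
```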
We also discuss a dimension that complements factual knowledge
with commonsense knowledge: properties and rules that every
child knows but that are hard for a computer to acquire (see, e.g., [37,
62, 71, 98, 108, 113]). For example, snakes can crawl and hiss, but
they cannot fly or sing. An example of a (soft) commonsense rule
is that the husband of a mother is the father of her child (husband at
the time of the child's birth). Here again, state-of-the-art methods
use techniques that scale out to handle Big-Data inputs.
Finally, another dimension of knowledge is to associate entities
and classes with visual data: images and videos [25, 94, 109, 110].
When analytic tasks tap into text or Web data, it is crucial to iden-
tify entities (people, places, products, etc.) in the input for proper
grouping and aggregation. An example application could aim to
track and compare two entities in social media over an extended
timespan (e.g., the Apple iPhone vs. Samsung Galaxy families).
Knowledge about entities is an invaluable asset here.
4.1 Named-Entity Disambiguation
When extracting knowledge from text or tables, entities are first
seen only in surface form: by names (e.g., "Jobs") or phrases (e.g.,
"the Apple founder"). Entity mentions can be discovered by named-entity
recognition (NER) methods, usually based on CRFs [35] or
other probabilistic graphical models and/or using dictionaries of surface
forms [100]. Some methods infer semantic types for mentions,
e.g., telling that "the Apple founder" is a person, or in a fine-grained
manner, an entrepreneur (see, e.g., [66, 67, 126] and references
Nevertheless, entity mentions are just noun phrases and still am-
biguous. Mapping mentions to canonicalized entities registered in a
knowledge base is the task of named-entity disambiguation (NED).
State-of-the-art NED methods combine context similarity between
the surroundings of a mention and salient phrases associated with
an entity, with coherence measures for two or more entities co-
occurring together [14, 19, 20, 28, 34, 43, 47, 48, 59, 74, 92].
Although these principles are well understood, NED remains an
active research area towards improving robustness, scalability, and
The NED problem also arises in structured but schema-less data
like HTML tables in Web pages [65]. NED is a special case of the
general word-sense disambiguation problem [80], which considers
also general nouns (concepts that are not entities, e.g., rugby or
peace), verbal phrases, adjectives, etc. Finally, note that NED is
not the same as co-reference resolution [90, 96]. The latter aims to
find equivalence classes of surface forms (e.g., "Michelle" and "the
First Lady of America" are the same entity), but without mapping
to an entity catalog.
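The context-similarity ingredient of NED described in this section can be sketched as a word-overlap score between the text around a mention and the keyphrases of each candidate entity. The tables `CANDIDATES` and `KEYPHRASES` and their entries are hypothetical stand-ins for a dictionary of surface forms and per-entity salient phrases. The coherence measure over co-occurring entities is deliberately left out, so this illustrates only one half of the approach.

```python
# Hypothetical dictionary of surface forms to candidate entities.
CANDIDATES = {
    "Jobs": ["Steve_Jobs", "Jobs_(film)"],
}

# Hypothetical salient words associated with each candidate entity.
KEYPHRASES = {
    "Steve_Jobs": {"apple", "founder", "ceo", "iphone"},
    "Jobs_(film)": {"film", "movie", "actor", "director"},
}


def disambiguate(mention: str, context: str) -> str:
    """Pick the candidate whose keyphrases overlap most with the context."""
    context_words = set(context.lower().split())

    def overlap(entity: str) -> int:
        return len(context_words & KEYPHRASES[entity])

    # Ties go to the first-listed candidate, since max() keeps the first maximum.
    return max(CANDIDATES[mention], key=overlap)


print(disambiguate("Jobs", "Jobs unveiled the iPhone as Apple CEO"))
```

Real systems replace the raw overlap with weighted similarity measures and combine it with joint inference over all mentions in a document.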
4.2 Entity Linkage
We see more and more structured data on the Web, in the form of
(HTML) tables, microdata embedded in Web pages (using, e.g., the vocabulary), and Linked Open Data. Even when
entities are explicitly marked in these kinds of data, the problem
arises of telling whether two entities are the same or not. This is a variant
of the classical record-linkage problem (a.k.a. entity matching,
entity resolution, entity de-duplication) [29, 54, 79]. For knowledge
bases and Linked Open Data, it is of particular interest because
of the need for generating and maintaining owl:sameAs linkage
across knowledge resources. We give an overview of approaches
to this end, covering statistical learning approaches (e.g., [7, 42, 91,
97]) and graph algorithms (see, e.g., [9, 50, 53, 64, 73, 82, 104, 119,
120, 121] and further references given there).
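As a baseline illustration of record linkage, the sketch below compares entity names from two resources by string similarity and emits owl:sameAs triples for pairs above a threshold. The functions `normalize`, `similarity`, and `same_as_links` and the threshold of 0.9 are assumptions made for this example. It is a naive all-pairs comparison and does not represent the statistical learning or graph algorithms cited above.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a name and treat underscores as spaces."""
    return name.lower().replace("_", " ").strip()


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def same_as_links(left, right, threshold: float = 0.9) -> list:
    """Emit owl:sameAs triples for name pairs at or above the threshold."""
    links = []
    for a in left:          # quadratic in the input sizes; real systems
        for b in right:     # use blocking or indexing to avoid this
            if similarity(a, b) >= threshold:
                links.append((a, "owl:sameAs", b))
    return links


print(same_as_links(["Steve_Jobs", "Pixar"],
                    ["steve jobs", "Walt Disney Company"]))
```

Name similarity alone confuses distinct entities that share a name, which is why practical approaches also compare attributes and neighborhoods in the entity graph.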
Fabian M. Suchanek is the leader of the Otto Hahn Research
Group Ontologies at the Max Planck Institute for Informatics
in Germany. He obtained his PhD from Saarland University in
2008, and was a postdoc at Microsoft Research Search Labs in
Silicon Valley (in the group of Rakesh Agrawal) and in the Web-
Dam team at INRIA Saclay in France (in the group of Serge Abite-
boul). Fabian is the main architect of the YAGO ontology, one of
the largest public knowledge bases.
Gerhard Weikum is a Scientific Director at the Max Planck Institute
for Informatics in Saarbruecken, Germany, where he is leading
the department on databases and information systems. He co-authored
a comprehensive textbook on transactional systems, received
the VLDB 10-Year Award for his work on automatic DB
tuning, and is one of the creators of the YAGO knowledge base.
Gerhard is an ACM Fellow, a member of the German Academy
of Science and Engineering, and a recipient of a Google Focused
Research Award and an ACM SIGMOD Contributions Award.
