
SIGMOD 2013 Tutorial


Knowledge Harvesting in the Big-Data Era

Fabian Suchanek
Max Planck Institute for Informatics
D-66123 Saarbruecken, Germany
suchanek@mpi-inf.mpg.de
Gerhard Weikum
Max Planck Institute for Informatics
D-66123 Saarbruecken, Germany
weikum@mpi-inf.mpg.de
ABSTRACT
The proliferation of knowledge-sharing communities such as Wiki-
pedia and the progress in scalable information extraction from Web
and text sources have enabled the automatic construction of very
large knowledge bases. Endeavors of this kind include projects
such as DBpedia, Freebase, KnowItAll, ReadTheWeb, and YAGO.
These projects provide automatically constructed knowledge bases
of facts about named entities, their semantic classes, and their mu-
tual relationships. They contain millions of entities and hundreds of
millions of facts about them. Such world knowledge in turn enables
cognitive applications and knowledge-centric services like disam-
biguating natural-language text, semantic search for entities and
relations in Web and enterprise data, and entity-oriented analytics
over unstructured contents. Prominent examples of how knowledge
bases can be harnessed include the Google Knowledge Graph and
the IBM Watson question answering system. This tutorial presents
state-of-the-art methods, recent advances, research opportunities,
and open challenges along this avenue of knowledge harvesting and
its applications. Particular emphasis will be on the twofold role of
knowledge bases for big-data analytics: using scalable distributed
algorithms for harvesting knowledge from Web and text sources,
and leveraging entity-centric knowledge for deeper interpretation
of and better intelligence with Big Data.
Categories and Subject Descriptors
H.1 [Information Systems]: Models and Principles
Keywords
Big Data, Information Extraction, Knowledge Base, Ontology,
Entity Recognition, Web Contents
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMOD'13, June 22–27, 2013, New York, New York, USA.
Copyright 2013 ACM 978-1-4503-2037-5/13/06 ...$15.00.

1. MOTIVATION AND OVERVIEW
1.1 Knowledge Bases
Knowledge harvesting from Web and text sources has become a
major research avenue in the last five years. It is the core methodology
for the automatic construction of large knowledge bases [2, 3, 51],
going beyond manually compiled knowledge collections like
Cyc [63], WordNet [33], and a variety of ontologies [101]. Salient
projects with publicly available resources include KnowItAll [30,
6, 31], ConceptNet [99], DBpedia [5], Freebase [11], NELL [16],
WikiTaxonomy [88], and YAGO [102, 46, 8]. Commercial inter-
est has been strongly growing, with evidence by projects like the
Google Knowledge Graph, the EntityCube/Renlifang project at Mi-
crosoft Research [84], and the use of public knowledge bases for
type coercion in IBM's Watson project [52].
These knowledge bases contain many millions of entities, organized
in hundreds to hundreds of thousands of semantic classes, and
hundreds of millions of relational facts between entities. All this is
typically represented in the form of RDF-style subject-predicate-
object (SPO) triples. Moreover, knowledge resources can be se-
mantically interlinked via owl:sameAs triples at the entity level,
contributing to the Web of Linked Open Data (LOD) [45].
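The SPO representation sketched above can be made concrete in a few lines. The following toy knowledge base is purely illustrative (the entity identifiers and prefixes are our own, not taken from any particular KB); it shows triples plus an entity-level owl:sameAs link, and a simple lookup over them:

```python
# Toy RDF-style knowledge base as a set of subject-predicate-object triples.
# Entity and predicate names are illustrative, not from any real KB.
kb = {
    ("Steve_Jobs", "type", "computer_pioneer"),
    ("Steve_Jobs", "founded", "Apple_Inc."),
    ("computer_pioneer", "subclassOf", "person"),
    ("yago:Steve_Jobs", "owl:sameAs", "dbpedia:Steve_Jobs"),  # entity-level link
}

def objects(kb, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the KB."""
    return {o for (s, p, o) in kb if s == subject and p == predicate}

print(objects(kb, "Steve_Jobs", "founded"))  # {'Apple_Inc.'}
```

Real triple stores index all three components and scale to billions of triples; the set-comprehension lookup here only illustrates the data model.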
Large knowledge bases are typically built by mining and dis-
tilling information from sources like Wikipedia which offer high-
quality semi-structured elements (infoboxes, categories, tables, lists),
but many projects also tap into extracting knowledge from arbitrary
Web pages and natural-language texts. Despite great advances in
these regards, there are still many challenges regarding the scale of
the methodology and the scope and depth of the harvested knowl-
edge:
- covering more entities beyond Wikipedia and discovering newly
  emerging entities,
- increasing the number of facts about entities and extracting more
  interesting relationship types in an open manner,
- capturing the temporal scope of relational facts,
- tapping into multilingual inputs such as Wikipedia editions in
  many different languages,
- extending fact-oriented knowledge bases with commonsense
  knowledge and (soft) rules,
- detecting and disambiguating entity mentions in natural-language
  text and other unstructured contents, and
- large-scale sameAs linkage across many knowledge and data
  sources.
1.2 Enabling Intelligent Applications
Knowledge bases are a key asset that enables and contributes to
intelligent computer behavior. Application areas along these lines
include the following:
- Semantic search and question answering: Machine-readable
  encyclopediae are a rich source for answering expert-level
  questions in a precise and concise manner. Moreover, interpreting
  users' information needs in terms of entities and relationships
  yields strong features for informative ranking of search results
  and entity-level recommendations over Web and enterprise data.
- Deep interpretation of natural language: Both written and spoken
  language are full of ambiguities. Knowledge is the key to mapping
  surface phrases to their proper meanings, so that machines can
  interpret language as fluently as humans. As user-generated
  social-media content is abundant and human-computer interaction
  is more and more based on smartphones, coping with text, speech,
  and gestures will become crucial.
- Machine reading at scale: The deluge of online contents
  overwhelms users. Users wish to obtain overviews of the salient
  entities and relationships for a week of news, a month of
  scientific articles, a year of political speeches, or a century of
  essays on a specific topic.
- Reasoning and smart assistants: Rich sets of facts and rules from
  a knowledge base enable computers to perform logical inferences
  in application contexts.
- Big-Data analytics over uncertain contents: Daily news, social
  media, scholarly publications, and other Web contents are the
  raw inputs for analytics to obtain insights on business, politics,
  health, and more. Knowledge bases are key to discovering and
  tracking entities and relationships, and thus making sense of
  noisy contents.
1.3 Scope and Structure of the Tutorial
This tutorial gives an overview of knowledge harvesting and
discusses hot topics in this field, pointing out research opportunities
and open challenges. As the relevant literature is widely dispersed
across different communities, we also venture into the neighboring
fields of Web Mining, Artificial Intelligence, Natural Language
Processing, Semantic Web, and Data Management. The presentation
is structured according to the following sections and subsections.
2. KNOWLEDGE BASE CONSTRUCTION
2.1 Knowledge Bases in the Big-Data Era
Many Big-Data applications need to tap unstructured data. News,
social media, web sites, and enterprise sources produce huge
amounts of valuable content in the form of text and speech. Key to
making sense of this content is to identify the entities that are
referred to and the relationships between them. This allows linking
unstructured content with structured data for value-added analytics.
Knowledge bases are a key asset for lifting unstructured content
into entity-relationship form and making the connection to
structured data. We give an overview of several large and publicly
available knowledge bases, and outline how they can support
Big-Data applications.
2.2 Harvesting of Entities and Classes
Every entity in a knowledge base (such as Steve_Jobs) be-
longs to one or more classes (such as computer_pioneer). These
classes are organized into a taxonomy, where more specific classes
are subsumed by more general classes (such as person). We discuss
two groups of approaches to harvest information on classes
and their instances: i) Wikipedia-based approaches and ii) Web-
based approaches using set expansion and other techniques.
Relevant work in the first group includes [88, 89, 102, 123]. Relevant
work in the second group includes [4, 21, 44, 57, 87, 105, 114,
124].
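The Web-based group often bootstraps from lexico-syntactic patterns in the style of Hearst [44], e.g., "<class> such as <instance>, <instance>". As a minimal sketch (the regex and the example sentence are our own illustration; real systems use many patterns plus statistical filtering of the candidates):

```python
import re

# Toy Hearst-style pattern: "<class> such as <Instance>, <Instance>".
# Capitalized tokens serve as instance candidates; purely illustrative.
PATTERN = re.compile(r"(\w[\w ]*?) such as ((?:[A-Z]\w+(?:, )?)+)")

def harvest(text):
    """Return (instance, class) pairs matched by the pattern."""
    pairs = []
    for cls, insts in PATTERN.findall(text):
        for inst in insts.split(", "):
            pairs.append((inst, cls.strip()))
    return pairs

print(harvest("computer pioneers such as Turing, Hopper shaped the field"))
# [('Turing', 'computer pioneers'), ('Hopper', 'computer pioneers')]
```

A single pattern like this over-generates badly at Web scale; the cited systems therefore score candidates by pattern redundancy and co-occurrence statistics.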
3. HARVESTING FACTS AT WEB SCALE
3.1 Harvesting Relational Facts
Relational facts express relationships between two entities, for
example, the following facts about Steve Jobs:
Steve_Jobs founded Apple_Inc.,
Steve_Jobs was_Board_Member_of Walt_Disney_Company,
Steve_Jobs died_on 5-Oct-2011,
Steve_Jobs died_of Pancreas_Cancer,
Steve_Jobs has_Friend Joan_Baez, and more.
There is a large spectrum of methods to extract such facts from
Web data, tapping both semistructured sources like Wikipedia in-
foboxes, lists, and tables, and natural-language text sources like
Wikipedia full-text articles, news and social media. We give an
overview on methods from pattern matching (e.g., regular expres-
sions), computational linguistics (e.g., dependency parsing), statis-
tical learning (e.g., factor graphs and MLNs), and logical consis-
tency reasoning (e.g., weighted MaxSat or ILP solvers). We also
discuss to what extent these approaches scale to handle big data.
Overviews of information extraction methods for knowledge base
population are given in [26, 95, 122]. For specic state-of-the-art
methods, see the following original papers and references given
there: [1, 10, 13, 15, 16, 17, 18, 30, 32, 36, 40, 46, 49, 58, 60, 69,
70, 76, 85, 86, 93, 103, 111, 127]. For foundations of statistical
learning methods used in this context, see [27, 39, 55].
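To give a flavor of the simplest of these method families, pattern matching with regular expressions, here is a toy extractor for a single relation. The pattern and sentence are our own illustration; real extractors learn thousands of patterns per relation and validate candidate facts statistically and logically:

```python
import re

# Illustrative single pattern for the "founded" relation: two capitalized
# phrases around the literal word "founded". Not from any real system.
FOUNDED = re.compile(r"([A-Z]\w+(?: [A-Z]\w+)*) founded ([A-Z]\w+(?: [A-Z]\w+)*)")

def extract_founded(sentence):
    """Return (subject, 'founded', object) triples found in the sentence."""
    return [(s.replace(" ", "_"), "founded", o.replace(" ", "_"))
            for s, o in FOUNDED.findall(sentence)]

print(extract_founded("Steve Jobs founded Apple in 1976."))
# [('Steve_Jobs', 'founded', 'Apple')]
```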
3.2 Open Information Extraction
In contrast to approaches that operate on a predefined list of rela-
tions and a huge but fixed set of entities, open IE harvests arbitrary
subject-predicate-object triples from natural-language documents.
It aggressively taps into noun phrases as entity candidates and ver-
bal phrases as prototypic patterns for relations. For example, in ad-
dition to capturing the pre-specified hasWonPrize relation, we aim
to automatically learn that nominatedForPrize is also an interesting
relation expressed by natural-language patterns such as "candidate
for ... prize" or "expected to win ... prize". We discuss recent
methods that follow this Open IE direction [6, 12, 24, 31, 41, 56,
72, 75, 77, 83, 115, 125]. Some methods along these lines make
clever use of Big-Data techniques like frequent sequence mining
and map-reduce computation.
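A deliberately crude caricature of the Open IE idea can be sketched in a few lines: treat noun phrases as entity candidates and the verbal phrase between them as the relation pattern. The cited systems use POS tagging and learned syntactic constraints; this toy version merely treats capitalized phrases as noun phrases:

```python
import re

# Caricature of Open IE: (capitalized phrase, lowercase phrase, capitalized
# phrase) is read as a (subject, relation pattern, object) triple.
NP = r"[A-Z]\w+(?: [A-Z]\w+)*"
TRIPLE = re.compile(rf"({NP}) ((?:[a-z]+ ?)+?) ({NP})")

def open_ie(sentence):
    """Extract arbitrary SPO triples without a predefined relation list."""
    return [(s, rel.strip(), o) for s, rel, o in TRIPLE.findall(sentence)]

print(open_ie("Steve Jobs was nominated for Grammy Trustees Award"))
# [('Steve Jobs', 'was nominated for', 'Grammy Trustees Award')]
```

The verbal phrase "was nominated for" then becomes a candidate surface pattern for a nominatedForPrize-style relation, which is exactly the kind of pattern the Open IE systems above generalize and clean up at scale.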
3.3 Temporal, Multilingual, Commonsense,
and Visual Knowledge
In this part, we venture beyond entity-relationship facts and de-
scribe approaches that attach meta-information to facts. This con-
cerns the temporal or spatial context of a fact [38, 61, 68, 106, 107,
112, 116, 117, 118], or describes entities in multiple languages [22,
23, 78, 81]. Along the temporal dimension, we would like to cap-
ture the timepoints of events and the timespans during which cer-
tain relationships hold, for example:
Steve_Jobs Chairman_of Apple_Inc. @[1976,1985],
Steve_Jobs CEO_of Apple_Inc. @[Sep-1997,Aug-2011],
Pixar acquired_by Walt_Disney_Company @5-May-2006.
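Such time-annotated facts can be represented, for instance, by attaching a (begin, end) validity interval to each triple. A sketch with the examples above simplified to year granularity (the representation is illustrative, not that of any particular system):

```python
# Facts annotated with validity timespans as (begin_year, end_year) pairs.
facts = [
    ("Steve_Jobs", "Chairman_of", "Apple_Inc.", (1976, 1985)),
    ("Steve_Jobs", "CEO_of", "Apple_Inc.", (1997, 2011)),
]

def holds_at(facts, subject, predicate, year):
    """Objects o such that (subject, predicate, o) holds in the given year."""
    return [o for (s, p, o, (begin, end)) in facts
            if s == subject and p == predicate and begin <= year <= end]

print(holds_at(facts, "Steve_Jobs", "CEO_of", 2000))  # ['Apple_Inc.']
```

The hard part addressed by the cited work is not this representation but extracting the intervals themselves from vague and contradictory textual evidence.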
We also discuss a dimension that complements factual knowledge
with commonsense knowledge: properties and rules that every
child knows but that are hard to acquire by a computer (see, e.g., [37,
62, 71, 98, 108, 113]). For example, snakes can crawl and hiss, but
they cannot fly or sing. An example of a (soft) commonsense rule
is that the husband of a mother is the father of her child (the husband
at the time of the child's birth). Here again, state-of-the-art methods
use techniques that scale out to handle Big-Data inputs.
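A hard (non-soft) version of the husband/mother rule can be sketched as a simple join over the fact set; the entity names are made up, and real systems additionally attach weights or confidences to such rules and to the facts they derive:

```python
# Toy rule application: husband(x, y) & mother(y, z) => father(x, z).
# Entity names are hypothetical; a soft-rule system would weight the result.
kb = {("husband", "Bob", "Alice"), ("mother", "Alice", "Carol")}

def apply_rule(kb):
    """Derive father facts by joining husband and mother facts on the wife."""
    return {("father", x, z)
            for (p1, x, y1) in kb if p1 == "husband"
            for (p2, y2, z) in kb if p2 == "mother" and y1 == y2}

print(apply_rule(kb))  # {('father', 'Bob', 'Carol')}
```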
Finally, another dimension of knowledge is to associate entities
and classes with visual data: images and videos [25, 94, 109, 110].
4. KNOWLEDGE FOR BIG DATA
When analytic tasks tap into text or Web data, it is crucial to iden-
tify entities (people, places, products, etc.) in the input for proper
grouping and aggregation. An example application could aim to
track and compare two entities in social media over an extended
timespan (e.g., the Apple iPhone vs. Samsung Galaxy families).
Knowledge about entities is an invaluable asset here.
4.1 Named-Entity Disambiguation
When extracting knowledge from text or tables, entities are first
seen only in surface form: by names (e.g., "Jobs") or phrases (e.g.,
"the Apple founder"). Entity mentions can be discovered by named-
entity recognition (NER) methods, usually based on CRFs [35] or
other probabilistic graphical models, and/or using dictionaries of
surface forms [100]. Some methods infer semantic types for mentions,
e.g., telling that "the Apple founder" is a person or, in a fine-grained
manner, an entrepreneur (see, e.g., [66, 67, 126] and references
there).
Nevertheless, entity mentions are just noun phrases and still am-
biguous. Mapping mentions to canonicalized entities registered in a
knowledge base is the task of named-entity disambiguation (NED).
State-of-the-art NED methods combine context similarity between
the surroundings of a mention and salient phrases associated with
an entity, with coherence measures for two or more entities co-
occurring together [14, 19, 20, 28, 34, 43, 47, 48, 59, 74, 92].
Although these principles are well understood, NED remains an
active research area towards improving robustness, scalability, and
coverage.
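The context-similarity ingredient of NED can be caricatured as follows. The candidate dictionary and entity descriptions are invented for illustration, and the entity priors and mention-coherence components of the state-of-the-art systems above are omitted:

```python
# Toy NED: score each candidate entity for a mention by word overlap between
# the mention's context and a short entity description. All data is invented;
# real systems add entity priors and coherence across co-occurring mentions.
candidates = {
    "Jobs": {
        "Steve_Jobs": "apple founder ceo pixar computer",
        "Jobs_(film)": "film biopic 2013 kutcher movie",
    }
}

def disambiguate(mention, context):
    """Pick the candidate entity whose description best overlaps the context."""
    ctx = set(context.lower().split())
    def overlap(entity):
        return len(ctx & set(candidates[mention][entity].split()))
    return max(candidates[mention], key=overlap)

print(disambiguate("Jobs", "the Apple founder Jobs introduced the iPhone"))
# Steve_Jobs
```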
The NED problem also arises in structured but schema-less data
like HTML tables in Web pages [65]. NED is a special case of the
general word-sense disambiguation problem [80], which considers
also general nouns (concepts that are not entities, e.g., "rugby" or
"peace"), verbal phrases, adjectives, etc. Finally, note that NED is
not the same as co-reference resolution [90, 96]. The latter aims to
find equivalence classes of surface forms (e.g., "Michelle" and "the
First Lady of America" refer to the same entity), but without mapping
them to an entity catalog.
4.2 Entity Linkage
We see more and more structured data on the Web, in the form of
(HTML) tables, microdata embedded in Web pages (using, e.g., the
schema.org vocabulary), and Linked Open Data. Even when
entities are explicitly marked in these kinds of data, the problem
arises of telling whether two entities are the same or not. This is a
variant of the classical record-linkage problem (aka entity matching,
entity resolution, entity de-duplication) [29, 54, 79]. For knowledge
bases and Linked Open Data, it is of particular interest because of
the need for generating and maintaining owl:sameAs linkage across
knowledge resources. We give an overview of approaches
to this end, covering statistical learning approaches (e.g., [7, 42, 91,
97]) and graph algorithms (see, e.g., [9, 50, 53, 64, 73, 82, 104, 119,
120, 121] and further references given there).
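The similarity-based core of such matchers can be sketched minimally: propose an owl:sameAs link between two entity names when the Jaccard similarity of their token sets passes a threshold. The threshold and examples are illustrative; the approaches cited above combine many attributes with learned weights and graph-based evidence:

```python
# Toy record linkage: Jaccard similarity over lowercase name tokens.
def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def same_as(name1, name2, threshold=0.5):
    """True if the names are similar enough to propose an owl:sameAs link."""
    return jaccard(name1, name2) >= threshold

print(same_as("Walt Disney Company", "The Walt Disney Company"))  # True
print(same_as("Apple Inc", "Apple Records"))                      # False
```

Name similarity alone is obviously too weak at Web scale (both false matches and missed aliases), which is why the graph-based methods propagate sameAs evidence through shared relations.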
5. PRESENTERS' BIOGRAPHIES
Fabian M. Suchanek is the leader of the Otto Hahn Research
Group "Ontologies" at the Max Planck Institute for Informatics
in Germany. He obtained his PhD from Saarland University in
2008, and was a postdoc at Microsoft Research Search Labs in
Silicon Valley (in the group of Rakesh Agrawal) and in the Web-
Dam team at INRIA Saclay in France (in the group of Serge Abite-
boul). Fabian is the main architect of the YAGO ontology, one of
the largest public knowledge bases.
Gerhard Weikum is a Scientific Director at the Max Planck In-
stitute for Informatics in Saarbruecken, Germany, where he is lead-
ing the department on databases and information systems. He co-
authored a comprehensive textbook on transactional systems, re-
ceived the VLDB 10-Year Award for his work on automatic DB
tuning, and is one of the creators of the YAGO knowledge base.
Gerhard is an ACM Fellow, a member of the German Academy
of Science and Engineering, and a recipient of a Google Focused
Research Award and an ACM SIGMOD Contributions Award.
6. REFERENCES
[1] E. Agichtein, L. Gravano: Snowball: Extracting Relations
from Large Plain-Text Collections. ACM DL 2000
[2] AKBC 2010: First Int. Workshop on Automated Knowledge
Base Construction, Grenoble, 2010,
http://akbc.xrce.xerox.com/
[3] AKBC-WEKEX 2012: The Knowledge Extraction
Workshop at NAACL-HLT, 2012,
http://akbcwekex2012.wordpress.com/
[4] E. Alfonseca, M. Pasca, E. Robledo-Arnuncio: Acquisition
of Instance Attributes via Labeled and Related Instances.
SIGIR 2010
[5] S. Auer, C. Bizer, et al.: DBpedia: A Nucleus for a Web of
Open Data. ISWC 2007
[6] M. Banko, M.J. Cafarella, S. Soderland, M. Broadhead, O.
Etzioni: Open Information Extraction from the Web. IJCAI
2007
[7] I.Bhattacharya, L. Getoor: Collective Entity Resolution in
Relational Data. TKDD 1(1), 2007
[8] J. Biega et al.: Inside YAGO2s: a Transparent Information
Extraction Architecture. WWW 2013
[9] C. Böhm et al.: LINDA: Distributed Web-of-Data-Scale
Entity Matching. CIKM 2012
[10] P. Bohannon et al.: Automatic Web-Scale Information
Extraction. SIGMOD 2012
[11] K.D. Bollacker et al.: Freebase: a Collaboratively Created
Graph Database for Structuring Human Knowledge.
SIGMOD 2008
[12] D. Bollegala, Y. Matsuo, M. Ishizuka: Relational Duality:
Unsupervised Extraction of Semantic Relations between
Entities on the Web. WWW 2010
[13] S. Brin: Extracting Patterns and Relations from the World
Wide Web. WebDB 1998
[14] R.C. Bunescu, M. Pasca: Using Encyclopedic Knowledge
for Named Entity Disambiguation. EACL 2006
[15] M.J. Cafarella: Extracting and Querying a Comprehensive
Web Database. CIDR 2009
[16] A. Carlson et al.: Toward an Architecture for Never-Ending
Language Learning. AAAI 2010
[17] L. Chiticariu et al.: SystemT: An Algebraic Approach to
Declarative Information Extraction. ACL 2010
[18] P. Cimiano, J. Völker: Text2Onto. NLDB 2005
[19] M. Cornolti, P. Ferragina, M. Ciaramita: A Framework for
Benchmarking Entity-Annotation Systems. WWW 2013
[20] S. Cucerzan: Large-Scale Named Entity Disambiguation
Based on Wikipedia Data. EMNLP 2007
[21] B.B. Dalvi, W.W. Cohen, J. Callan: WebSets: Extracting
Sets of Entities from the Web using Unsupervised
Information Extraction. WSDM 2012
[22] G. de Melo, G. Weikum: Towards a Universal Wordnet by
Learning from Combined Evidence. CIKM 2009
[23] G. de Melo, G. Weikum: MENTA: Inducing Multilingual
Taxonomies from Wikipedia. CIKM 2010
[24] L. Del Corro, R. Gemulla: ClausIE: Clause-Based Open
Information Extraction. WWW 2013
[25] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei,
ImageNet: A Large-Scale Hierarchical Image Database.
CVPR 2009
[26] A. Doan et al. (Eds.): Special Issue on Managing
Information Extraction. SIGMOD Record 37(4), 2008
[27] P. Domingos, D. Lowd: Markov Logic: An Interface Layer
for Artificial Intelligence. Morgan & Claypool 2009
[28] M. Dredze et al.: Entity Disambiguation for Knowledge
Base Population. COLING 2010
[29] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios: Duplicate
Record Detection: A Survey. IEEE TKDE 19(1), 2007
[30] O. Etzioni et al.: Unsupervised Named-Entity Extraction
from the Web: An Experimental Study. Artif. Intell. 165(1),
2005
[31] A. Fader, S. Soderland, O. Etzioni: Identifying Relations for
Open Information Extraction. EMNLP 2011
[32] Y. Fang, K. Chang: Searching Patterns for Relation
Extraction over the Web: Rediscovering the Pattern-Relation
Duality. WSDM 2011
[33] C. Fellbaum, G. Miller (Eds.): WordNet: An Electronic
Lexical Database, MIT Press, 1998
[34] P. Ferragina, U. Scaiella: TAGME: On-the-Fly Annotation of
Short Text Fragments. CIKM 2010
[35] J.R. Finkel, T. Grenager, C. Manning. Incorporating
Non-local Information into Information Extraction Systems
by Gibbs Sampling. ACL 2005
[36] T. Furche et al.: DIADEM: Domain-centric, Intelligent,
Automated Data Extraction Methodology. WWW 2012
[37] L. Galárraga, C. Teflioudi, K. Hose, F. Suchanek: AMIE:
Association Rule Mining under Incomplete Evidence in
Ontological Knowledge Bases. WWW 2013
[38] G. Garrido et al.: Temporally Anchored Relation Extraction.
ACL 2012
[39] L. Getoor, B. Taskar (Eds.): Introduction to Statistical
Relational Learning. MIT Press 2007
[40] G. Gottlob et al.: The Lixto Data Extraction Project - Back
and Forth between Theory and Practice. PODS 2004
[41] R. Gupta, S. Sarawagi: Joint Training for Open-Domain
Extraction on the Web: Exploiting Overlap when
Supervision is Limited. WSDM 2011
[42] R. Hall, C.A. Sutton, A. McCallum: Unsupervised
Deduplication using Cross-Field Dependencies. KDD 2008
[43] X. Han, L. Sun, J. Zhao: Collective Entity Linking in Web
Text: a Graph-based Method. SIGIR 2011
[44] M.A. Hearst: Automatic Acquisition of Hyponyms from
Large Text Corpora. COLING 1992
[45] T. Heath, C. Bizer: Linked Data: Evolving the Web into a
Global Data Space. Morgan & Claypool, 2011
[46] J. Hoffart, F.M. Suchanek, K. Berberich, G. Weikum:
YAGO2: a Spatially and Temporally Enhanced Knowledge
Base from Wikipedia, Artif. Intell. 194, 2013
[47] J. Hoffart, M. A. Yosef, I. Bordino, et al.: Robust
Disambiguation of Named Entities in Text. EMNLP 2011
[48] J. Hoffart et al.: KORE: Keyphrase Overlap Relatedness for
Entity Disambiguation. CIKM 2012
[49] R. Hoffmann, C. Zhang, D.S. Weld: Learning 5000
Relational Extractors. ACL 2010
[50] A. Hogan et al.: Scalable and Distributed Methods for Entity
Matching. J. Web Sem. 10, 2012
[51] E. Hovy, R. Navigli, S.P. Ponzetto: Collaboratively Built
Semi-Structured Content and Artificial Intelligence: the
Story So Far. Artif. Intell. 194, 2013
[52] IBM Journal of Research and Development 56(3/4), Special
Issue on "This is Watson", 2012
[53] H. Köpcke et al.: Evaluation of Entity Resolution
Approaches on Real-World Match Problems. PVLDB 2010
[54] H. Köpcke, E. Rahm: Frameworks for entity matching: A
comparison. Data Knowl. Eng. 69(2), 2010
[55] D. Koller, N. Friedman: Probabilistic Graphical Models:
Principles and Techniques. MIT Press, 2009
[56] S.K. Kondreddi, P. Triantafillou, G. Weikum: HIGGINS:
Knowledge Acquisition meets the Crowds. WWW 2013
[57] Z. Kozareva, E.H. Hovy: A Semi-Supervised Method to
Learn and Construct Taxonomies Using the Web. EMNLP
2010
[58] S. Krause, H. Li, H. Uszkoreit, F. Xu: Large-Scale Learning
of Relation-Extraction Rules with Distant Supervision from
the Web. ISWC 2012
[59] S. Kulkarni et al.: Collective Annotation of Wikipedia
Entities in Web Text. KDD 2009
[60] N. Kushmerick, D.S. Weld, R.B. Doorenbos: Wrapper
Induction for Information Extraction. IJCAI 1997
[61] E. Kuzey, G. Weikum: Extraction of temporal facts and
events from Wikipedia. TempWeb 2012
[62] N. Lao, T.M. Mitchell, W.W. Cohen: Random Walk
Inference and Learning in A Large Scale Knowledge Base.
EMNLP 2011
[63] D.B. Lenat: CYC: A Large-Scale Investment in Knowledge
Infrastructure. CACM 38(11), 1995
[64] J. Li, J. Tang, Y. Li, Q. Luo: RiMOM: A Dynamic
Multistrategy Ontology Alignment Framework. TKDE
21(8), 2009
[65] G. Limaye et al.: Annotating and Searching Web Tables Using
Entities, Types and Relationships. PVLDB 2010
[66] T. Lin et al.: No Noun Phrase Left Behind: Detecting and
Typing Unlinkable Entities. EMNLP 2012
[67] X. Ling, D.S. Weld: Fine-Grained Entity Recognition. AAAI
2012
[68] X. Ling, D.S. Weld: Temporal Information Extraction.
AAAI 2010
[69] A. Machanavajjhala et al.: Collective extraction from
heterogeneous web lists. WSDM 2011
[70] B. Marthi, B. Milch, S. Russell: First-Order Probabilistic
Models for Information Extraction. IJCAI 2003
[71] C. Matuszek et al.: Searching for Common Sense:
Populating Cyc from the Web. AAAI 2005
[72] Mausam, M. Schmitz, S. Soderland, et al.: Open Language
Learning for Information Extraction. EMNLP 2012
[73] S. Melnik, H. Garcia-Molina, E. Rahm: Similarity Flooding:
A Versatile Graph Matching Algorithm and its Application
to Schema Matching. ICDE 2002
[74] D.N. Milne, I.H. Witten: Learning to link with wikipedia.
CIKM 2008
[75] T. Mohamed, E.R. Hruschka, T.M. Mitchell: Discovering
Relations between Noun Categories. EMNLP 2011
[76] N. Nakashole, M. Theobald, G. Weikum: Scalable
Knowledge Harvesting with High Precision and High Recall.
WSDM 2011
[77] N. Nakashole, G. Weikum, F. Suchanek: PATTY: A
Taxonomy of Relational Patterns with Semantic Types.
EMNLP 2012
[78] V. Nastase et al.: WikiNet: A Very Large Scale
Multi-Lingual Concept Network. LREC 2010
[79] F. Naumann, M. Herschel: An Introduction to Duplicate
Detection. Morgan & Claypool, 2010
[80] R. Navigli: Word Sense Disambiguation: a Survey. ACM
Comput. Surv. 41(2), 2009
[81] R. Navigli, S. Ponzetto: BabelNet: Building a Very Large
Multilingual Semantic Network. ACL 2010
[82] T. Nguyen et al.: Multilingual Schema Matching for
Wikipedia Infoboxes. PVLDB 2012
[83] M. Nickel, V. Tresp, H.-P. Kriegel: Factorizing YAGO:
Scalable Machine Learning for Linked Data. WWW 2012
[84] Z. Nie, Y. Ma, S. Shi, J.-R. Wen, W.Y. Ma: Web Object
Retrieval. WWW 2007
[85] F. Niu et al.: DeepDive: Web-scale Knowledge-base
Construction using Statistical Learning and Inference, VLDS
Workshop 2012
[86] M. Palmer, D. Gildea, N. Xue: Semantic Role Labeling.
Morgan & Claypool 2010
[87] M. Pasca: Ranking Class Labels Using Query Sessions. ACL
2011
[88] S.P. Ponzetto, M. Strube: Deriving a Large-Scale Taxonomy
from Wikipedia. AAAI 2007
[89] S.P. Ponzetto, M. Strube: Taxonomy induction based on a
collaboratively built knowledge repository. Artif. Intell.
175(9-10), 2011
[90] A. Rahman, V. Ng: Coreference Resolution with World
Knowledge. ACL 2011
[91] V. Rastogi, N. Dalvi, M. Garofalakis: Large-Scale Collective
Entity Matching. PVLDB 2011
[92] L. Ratinov et al.: Local and Global Algorithms for
Disambiguation to Wikipedia. ACL 2011
[93] S. Riedel, L. Yao, A. McCallum: Modeling Relations and
their Mentions without Labeled Text. ECML 2010
[94] M. Rohrbach et al.: What Helps Where - and Why?
Semantic Relatedness for Knowledge Transfer. CVPR 2010
[95] S. Sarawagi: Information Extraction. Foundations & Trends
in Databases 1(3), 2008.
[96] S. Singh, A. Subramanya, F.C.N. Pereira, A. McCallum:
Large-Scale Cross-Document Coreference Using Distributed
Inference and Hierarchical Models. ACL 2011
[97] P. Singla, P. Domingos: Entity Resolution with Markov
Logic. ICDM 2006
[98] R. Speer, C. Havasi, H. Surana: Using Verbosity: Common
Sense Data from Games with a Purpose. FLAIRS 2010
[99] R. Speer, C. Havasi: Representing General Relational
Knowledge in ConceptNet 5, LREC 2012
[100] V.I. Spitkovsky, A.X. Chang: A Cross-Lingual Dictionary
for English Wikipedia Concepts. LREC 2012
[101] S. Staab, R. Studer: Handbook on Ontologies, Springer,
2009
[102] F.M. Suchanek, G. Kasneci, G. Weikum: YAGO: a Core of
Semantic Knowledge. WWW 2007
[103] F.M. Suchanek, M. Sozio, G. Weikum: SOFIE: a
Self-Organizing Framework for Information Extraction.
WWW 2009
[104] F. Suchanek et al.: PARIS: Probabilistic Alignment of
Relations, Instances, and Schema. PVLDB 2012
[105] P.P. Talukdar, F. Pereira: Experiments in Graph-Based
Semi-Supervised Learning Methods for Class-Instance
Acquisition. ACL 2010
[106] P.P. Talukdar, D.T. Wijaya, T. Mitchell: Coupled temporal
scoping of relational facts. WSDM 2012
[107] P.P. Talukdar, D. Wijaya, T. Mitchell: Acquiring Temporal
Constraints between Relations. CIKM 2012
[108] N. Tandon, G. de Melo, G. Weikum: Deriving a Web-Scale
Common Sense Fact Database. AAAI 2011
[109] B. Taneva et al.: Gathering and Ranking Photos of Named
Entities with High Precision, High Recall, and Diversity.
WSDM 2010
[110] B. Taneva et al.: Finding Images of Difficult Entities in the
Long Tail. CIKM 2011
[111] P. Venetis, A. Halevy, J. Madhavan, et al.: Recovering
Semantics of Tables on the Web. PVLDB 2011
[112] M. Verhagen et al.: Automating Temporal Annotation with
TARSQI. ACL 2005
[113] J. Völker, P. Hitzler, P. Cimiano: Acquisition of OWL DL
Axioms from Lexical Resources. ESWC 2007
[114] R. Wang, W.W. Cohen: Language-independent Set
Expansion of Named Entities using the Web. ICDM 2007
[115] C. Wang, J. Fan, A. Kalyanpur, D. Gondek: Relation
Extraction with Relation Topics. EMNLP 2011
[116] Y. Wang et al.: Timely YAGO: Harvesting, Querying, and
Visualizing Temporal Knowledge from Wikipedia. EDBT
2010
[117] Y. Wang et al.: Harvesting Facts from Textual Web Sources
by Constrained Label Propagation. CIKM 2011
[118] Y. Wang, M. Dylla, M. Spaniol, G. Weikum: Coupling
Label Propagation and Constraints for Temporal Fact
Extraction. ACL 2012
[119] J. Wang, T. Kraska, M. Franklin, J. Feng: CrowdER:
Crowdsourcing Entity Resolution. PVLDB 2012
[120] Z. Wang, J. Li, Z. Wang, J. Tang: Cross-lingual knowledge
linking across wiki knowledge bases. WWW 2012
[121] S.E. Whang, H. Garcia-Molina: Joint Entity Resolution.
ICDE 2012
[122] G. Weikum, M. Theobald: From Information to
Knowledge: Harvesting Entities and Relationships from Web
Sources. PODS 2010
[123] F. Wu, D.S. Weld: Automatically Refining the Wikipedia
Infobox Ontology. WWW 2008
[124] W. Wu, H. Li, H. Wang, K.Q. Zhu: Probase: a Probabilistic
Taxonomy for Text Understanding. SIGMOD 2012
[125] L. Yao, S. Riedel, A. McCallum: Unsupervised Relation
Discovery with Sense Disambiguation. ACL 2012
[126] M.A. Yosef et al.: HYENA: Hierarchical Type
Classification for Entity Names. COLING 2012
[127] J. Zhu et al.: StatSnowball: a Statistical Approach to
Extracting Entity Relationships. WWW 2009
