research-article

Querying probabilistic information extraction

Published: 01 September 2010

Abstract

Recently, there has been increasing interest in extending relational query processing to include data obtained from unstructured sources. A common approach is to use stand-alone Information Extraction (IE) techniques to identify and label entities within blocks of text; the resulting entities are then imported into a standard database and processed using relational queries. This two-part approach, however, suffers from two main drawbacks. First, IE is inherently probabilistic, but traditional query processing does not properly handle probabilistic data, resulting in reduced answer quality. Second, performance inefficiencies arise due to the separation of IE from query processing. In this paper, we address these two problems by building on an in-database implementation of a leading IE model: Conditional Random Fields using the Viterbi inference algorithm. We develop two different query approaches on top of this implementation. The first uses deterministic queries over maximum-likelihood extractions, with optimizations to push the relational operators into the Viterbi algorithm. The second extends the Viterbi algorithm to produce a set of possible extraction "worlds", from which we compute top-k probabilistic query answers. We describe these approaches and explore the trade-offs of efficiency and effectiveness between them using two datasets.
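
To make the two query approaches concrete, the following is a minimal Python sketch, not the paper's in-database implementation. It assumes a linear-chain model given as log-scores: `emit[t][y]` for label `y` at token position `t`, and `trans[a][b]` for moving from label `a` to label `b`. The function names `topk_viterbi` and `query_answers` are hypothetical. With `k=1` the first function reduces to ordinary Viterbi (the maximum-likelihood extraction); with `k>1` it returns the k highest-scoring labelings, which play the role of extraction "worlds". Renormalizing over only those k worlds is an approximation assumed here for illustration.

```python
import heapq
import math


def topk_viterbi(emit, trans, k=1):
    """Return the k best (log_score, label_path) pairs, best first.

    emit[t][y]  : log-score of label y at position t (assumed input format)
    trans[a][b] : log-score of the transition from label a to label b
    """
    n_labels = len(emit[0])
    # beams[y] holds up to k best (score, path) prefixes ending in label y.
    beams = [[(emit[0][y], (y,))] for y in range(n_labels)]
    for t in range(1, len(emit)):
        new_beams = []
        for y in range(n_labels):
            # Extend every surviving prefix with label y, then keep the k best.
            candidates = [
                (score + trans[prev][y] + emit[t][y], path + (y,))
                for prev in range(n_labels)
                for score, path in beams[prev]
            ]
            new_beams.append(heapq.nlargest(k, candidates))
        beams = new_beams
    finals = [c for beam in beams for c in beam]
    return heapq.nlargest(k, finals)


def query_answers(worlds, tokens, label):
    """Rank the text extracted for `label` by renormalized world weight.

    Assumption: weights are renormalized over the given worlds only,
    so the probabilities are approximate when k is small.
    """
    weights = [math.exp(score) for score, _ in worlds]
    total = sum(weights)
    answers = {}
    for weight, (_, path) in zip(weights, worlds):
        value = " ".join(tok for tok, y in zip(tokens, path) if y == label)
        answers[value] = answers.get(value, 0.0) + weight / total
    return sorted(answers.items(), key=lambda kv: -kv[1])
```

Calling `query_answers(topk_viterbi(emit, trans, k=3), tokens, label)` returns the candidate extractions for one label, ranked by the combined weight of the worlds that produce them. Worlds that agree on the extracted value pool their weight, which is why a probabilistic answer can differ from the single maximum-likelihood extraction.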

Published In

Proceedings of the VLDB Endowment  Volume 3, Issue 1-2
September 2010
1658 pages

Publisher

VLDB Endowment
