Building query optimizers for information extraction: the SQoUT project

Published: 20 March 2009

Abstract

Text documents often embed data that is structured in nature. This structured data is increasingly exposed using information extraction systems, which generate structured relations from documents, introducing an opportunity to process expressive, structured queries over text databases. This paper discusses our SQoUT project, which focuses on processing structured queries over relations extracted from text databases. We show how, in our extraction-based scenario, query processing can be decomposed into a sequence of basic steps: retrieving relevant text documents, extracting relations from the documents, and joining extracted relations for queries involving multiple relations. Each of these steps presents different alternatives, and together they form a rich space of possible query execution strategies. We identify execution efficiency and output quality as the two critical properties of a query execution, and argue that an optimization approach needs to consider both properties. To this end, we take into account the user-specified requirements for execution efficiency and output quality, and choose an execution strategy for each query based on a principled, cost-based comparison of the alternative execution strategies.
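
As a rough illustration of the cost-based comparison described in the abstract, the sketch below enumerates a tiny space of execution strategies and picks the fastest one that meets a user's quality and time requirements. It is not the SQoUT optimizer: the names `Strategy`, `enumerate_strategies`, and `choose_strategy`, the strategy labels, and every number are hypothetical. The assumption that time adds and quality multiplies across steps is made only to keep the example small, and the join step is left out.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional


@dataclass(frozen=True)
class Strategy:
    retrieval: str      # how documents are retrieved (hypothetical label)
    extractor: str      # which extractor configuration runs (hypothetical label)
    est_time: float     # estimated execution time, arbitrary units
    est_quality: float  # estimated output quality in [0, 1]


# Made-up (time, quality) estimates per alternative, for illustration only.
RETRIEVAL = {"scan": (100.0, 0.95), "keyword_query": (20.0, 0.60)}
EXTRACTOR = {"strict": (50.0, 0.90), "lenient": (30.0, 0.70)}


def enumerate_strategies() -> List[Strategy]:
    """Pair every retrieval alternative with every extraction alternative."""
    strategies = []
    for (r, (r_time, r_q)), (e, (e_time, e_q)) in product(
        RETRIEVAL.items(), EXTRACTOR.items()
    ):
        # Assumption: step times add up and step qualities multiply.
        strategies.append(Strategy(r, e, r_time + e_time, r_q * e_q))
    return strategies


def choose_strategy(min_quality: float, max_time: float) -> Optional[Strategy]:
    """Return the fastest strategy meeting both requirements, or None."""
    feasible = [
        s
        for s in enumerate_strategies()
        if s.est_quality >= min_quality and s.est_time <= max_time
    ]
    return min(feasible, key=lambda s: s.est_time, default=None)


if __name__ == "__main__":
    print(choose_strategy(min_quality=0.6, max_time=200.0))
```

With these made-up estimates, loosening the quality requirement lets the chooser move to a cheaper extractor, while a requirement that no strategy can satisfy yields `None`, so the caller has to relax one of the two constraints.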

Published In

ACM SIGMOD Record, Volume 37, Issue 4
December 2008
116 pages
ISSN: 0163-5808
DOI: 10.1145/1519103

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 March 2009
Published in SIGMOD Volume 37, Issue 4

Qualifiers

  • Research-article

Cited By

  • (2023) Autonomously Computable Information Extraction. Proceedings of the VLDB Endowment 16:10 (2431-2443). DOI: 10.14778/3603581.3603585. Online publication date: 8-Aug-2023
  • (2018) Managing Textual Data Semantically In Relational Databases. 2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE) (1-5). DOI: 10.1109/ICSCEE.2018.8538426. Online publication date: Jul-2018
  • (2015) Cost-Effective Conceptual Design for Information Extraction. ACM Transactions on Database Systems 40:2 (1-39). DOI: 10.1145/2716321. Online publication date: 30-Jun-2015
  • (2012) MADden. Proceedings of the 21st ACM international conference on Information and knowledge management (2740-2742). DOI: 10.1145/2396761.2398746. Online publication date: 29-Oct-2012
  • (2012) Just-in-time information extraction using extraction views. Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (613-616). DOI: 10.1145/2213836.2213913. Online publication date: 20-May-2012
  • (2011) SystemT. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations (109-114). DOI: 10.5555/2002440.2002459. Online publication date: 21-Jun-2011
  • (2011) Towards automatic column-based data object clustering for multilingual databases. 2011 IEEE International Conference on Control System, Computing and Engineering (415-420). DOI: 10.1109/ICCSCE.2011.6190562. Online publication date: Nov-2011
  • (2010) Enterprise information extraction. Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (1257-1258). DOI: 10.1145/1807167.1807339. Online publication date: 6-Jun-2010
  • (2009) A quality-aware optimizer for information extraction. ACM Transactions on Database Systems 34:1 (1-48). DOI: 10.1145/1508857.1508862. Online publication date: 23-Apr-2009
