research-article

Building query optimizers for information extraction: the SQoUT project

Authors:

Panagiotis Ipeirotis,

Luis Gravano

ACM SIGMOD Record, Volume 37, Issue 4

Pages 28 - 34

https://doi.org/10.1145/1519103.1519108

Published: 20 March 2009

Abstract

Text documents often embed data that is structured in nature. This structured data is increasingly exposed using information extraction systems, which generate structured relations from documents, introducing an opportunity to process expressive, structured queries over text databases. This paper discusses our SQoUT project, which focuses on processing structured queries over relations extracted from text databases. We show how, in our extraction-based scenario, query processing can be decomposed into a sequence of basic steps: retrieving relevant text documents, extracting relations from the documents, and joining extracted relations for queries involving multiple relations. Each of these steps presents different alternatives, and together they form a rich space of possible query execution strategies. We identify execution efficiency and output quality as the two critical properties of a query execution, and argue that an optimization approach needs to consider both properties. To this end, we take into account the user-specified requirements for execution efficiency and output quality, and choose an execution strategy for each query based on a principled, cost-based comparison of the alternative execution strategies.
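To make the idea of a strategy space concrete, the following minimal sketch enumerates every combination of a retrieval, an extraction, and a join alternative, then picks the cheapest combination that satisfies user-specified bounds on quality and cost. It is an illustration only, not the SQoUT optimizer: the option names, the cost and quality numbers, the additive cost model, and the multiplicative quality model are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Option:
    name: str
    cost: float     # hypothetical estimated execution time (arbitrary units)
    quality: float  # hypothetical quality factor in [0, 1]


# All alternatives and numbers below are invented for illustration.
RETRIEVAL = [Option("scan_all_documents", 100.0, 1.0),
             Option("keyword_query_retrieval", 20.0, 0.7)]
EXTRACTION = [Option("fast_extractor", 10.0, 0.6),
              Option("careful_extractor", 50.0, 0.9)]
JOIN = [Option("join_variant_a", 15.0, 0.8),
        Option("join_variant_b", 5.0, 0.9)]


@dataclass(frozen=True)
class Strategy:
    steps: Tuple[str, ...]
    cost: float
    quality: float


def enumerate_strategies() -> List[Strategy]:
    """Build one strategy per (retrieval, extraction, join) combination."""
    strategies = []
    for steps in product(RETRIEVAL, EXTRACTION, JOIN):
        cost = sum(step.cost for step in steps)  # assumption: costs add up
        quality = 1.0
        for step in steps:                       # assumption: quality factors multiply
            quality *= step.quality
        strategies.append(Strategy(tuple(s.name for s in steps), cost, quality))
    return strategies


def choose_strategy(min_quality: float, max_cost: float) -> Optional[Strategy]:
    """Return the cheapest strategy meeting both bounds, or None if none does."""
    feasible = [s for s in enumerate_strategies()
                if s.quality >= min_quality and s.cost <= max_cost]
    if not feasible:
        return None
    return min(feasible, key=lambda s: s.cost)


if __name__ == "__main__":
    best = choose_strategy(min_quality=0.5, max_cost=200.0)
    print(best)
```

Tightening `min_quality` pushes the choice toward slower but more thorough alternatives, and a sufficiently strict pair of bounds leaves no feasible strategy at all, which is why the function can return `None`.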

Index Terms

Building query optimizers for information extraction: the SQoUT project
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
  2. Information retrieval
    1. Evaluation of retrieval results
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Published In

ACM SIGMOD Record, Volume 37, Issue 4
December 2008
116 pages
ISSN: 0163-5808
DOI: 10.1145/1519103

Copyright © 2009 Authors.

Publisher

Association for Computing Machinery
New York, NY, United States
