Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3294052.3319699acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
research-article

Complexity Bounds for Relational Algebra over Document Spanners

Published: 25 June 2019 Publication History

Abstract

We investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex formulas (i.e., regular expressions with capture variables) over text documents. Such queries, also known as the regular document spanners, were shown to have an evaluation with polynomial delay for every positive RA expression (i.e., consisting of only natural joins, projections and unions); here, the RA expression is fixed and the input consists of both the regex formulas and the document. In this work, we explore the implication of two fundamental generalizations. The first is adopting the "schemaless'' semantics for spanners, as proposed and studied by Maturana et al. The second is going beyond the positive RA to allowing the difference operator. We show that each of the two generalizations introduces computational hardness: it is intractable to compute the natural join of two regex formulas under the schemaless semantics, and the difference between two regex formulas under both the ordinary and schemaless semantics. Nevertheless, we propose and analyze syntactic constraints, on the RA expression and the regex formulas at hand, such that the expressive power is fully preserved and, yet, evaluation can be done with polynomial delay. Unlike the previous work on RA over regex formulas, our technique is not (and provably cannot be) based on the static compilation of regex formulas, but rather on an ad-hoc compilation into an automaton that incorporates both the query and the document. This approach also allows us to include black-box extractors in the RA expression.

References

[1]
Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Constant-delay enumeration for nondeterministic document spanners. CoRR, abs/1807.09320, 2018.
[2]
Guillaume Bagan, Arnaud Durand, and Etienne Grandjean. On acyclic conjunctive queries and constant delay enumeration. In CSL, volume 4646 of Lecture Notes in Computer Science, pages 208--222. Springer, 2007.
[3]
Nofar Carmeli and Markus Krö ll. Enumeration complexity of conjunctive queries with functional dependencies. In ICDT, volume 98 of LIPIcs, pages 11:1--11:17. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2018.
[4]
Goutam Chakraborty, Murali Pagolu, and Satish Garla. Text mining and analysis: practical methods, examples, and case studies using SAS. SAS Institute, 2014.
[5]
Laura Chiticariu, Marina Danilevsky, Yunyao Li, Frederick Reiss, and Huaiyu Zhu. Systemt: Declarative text understanding for enterprise. In NAACL-HTL (3), pages 76--83. Association for Computational Linguistics, 2018.
[6]
Rodney G Downey and Michael Ralph Fellows. Parameterized complexity. Springer Science & Business Media, 2012.
[7]
Keith Ellul, Bryan Krawetz, Jeffrey Shallit, and Ming-wei Wang. Regular expressions: New results and open problems. Journal of Automata, Languages and Combinatorics, 10(4):407--437, 2005.
[8]
Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document spanners: A formal approach to information extraction. J. ACM, 62(2):12, 2015.
[9]
Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Declarative cleaning of inconsistencies in information extraction. ACM Trans. Database Syst., 41(1):6:1--6:44, 2016.
[10]
Fernando Florenzano, Cristian Riveros, Mart'i n Ugarte, Stijn Vansummeren, and Domagoj Vrgoc. Constant delay algorithms for regular document spanners. In PODS, pages 165--177, 2018.
[11]
Dominik D. Freydenberger. A logic for document spanners. Theory Comput. Syst., Sep 2018.
[12]
Dominik D. Freydenberger and Mario Holldack. Document spanners: From expressive power to decision problems. Theory Comput. Syst., 62(4):854--898, 2018.
[13]
Dominik D. Freydenberger, Benny Kimelfeld, and Liat Peterfreund. Joining extractions of regular expressions. In PODS, pages 137--149, 2018.
[14]
Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. Database System Implementation. Prentice-Hall, 2000.
[15]
Michael R. Garey and David S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.
[16]
Galina Jirá sková . State complexity of some operations on binary regular languages. Theor. Comput. Sci., 330(2):287--298, 2005.
[17]
David S. Johnson, Christos H. Papadimitriou, and Mihalis Yannakakis. On generating all maximal independent sets. Inf. Process. Lett., 27(3):119--123, 1988.
[18]
Yunyao Li, Frederick Reiss, and Laura Chiticariu. SystemT: A declarative information extraction system. In ACL, pages 109--114. ACL, 2011.
[19]
Francisco Maturana, Cristian Riveros, and Domagoj Vrgoc. Document spanners for extracting incomplete information: Expressiveness and complexity. In PODS, pages 125--136, 2018.
[20]
Matthias Niewerth and Luc Segoufin. Enumeration of MSO queries on strings with constant delay and logarithmic updates. In PODS. ACM, 2018.
[21]
Christian W. Omlin and C. Lee Giles. Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1):41--52, 1996.
[22]
Hao Peng, Roy Schwartz, Sam Thomson, and Noah A. Smith. Rational recurrences. CoRR, abs/1808.09357, 2018.
[23]
Jorge Pé rez, Marcelo Arenas, and Claudio Gutié rrez. Semantics and complexity of SPARQL. ACM Trans. Database Syst., 34(3):16:1--16:45, 2009.
[24]
Liat Peterfreund, Dominik D. Freydenberger, Benny Kimelfeld, and Markus Krö ll. Complexity bounds for relational algebra over document spanners. CoRR, abs/1901.04182, 2019.
[25]
Liat Peterfreund, Balder ten Cate, Ronald Fagin, and Benny Kimelfeld. Recursive programs for document spanners. In ICDT, 2019.
[26]
Christopher De Sa, Alexander Ratner, Christopher Ré, Jaeho Shin, Feiran Wang, Sen Wu, and Ce Zhang. Deepdive: Declarative knowledge base construction. SIGMOD Record, 45(1):60--67, 2016.
[27]
Sunita Sarawagi. Information extraction. Foundations and Trends in Databases, 1(3):261--377, 2008.
[28]
Luc Segoufin. Enumerating with constant delay the answers to a query. In ICDT, pages 10--20. ACM, 2013.
[29]
Warren Shen, AnHai Doan, Jeffrey F. Naughton, and Raghu Ramakrishnan. Declarative information extraction using Datalog with embedded extraction predicates. In VLDB, pages 1033--1044, 2007.
[30]
John Miles Smith and Philip Yen-Tang Chang. Optimizing the performance of a relational algebra database interface. Commun. ACM, 18(10):568--579, 1975.
[31]
Craig A. Tovey. A simplified NP-complete satisfiability problem. Discrete Applied Mathematics, 8(1):85--89, 1984.
[32]
Moshe Y. Vardi. The complexity of relational query languages (extended abstract). In STOC, pages 137--146. ACM, 1982.
[33]
Gail Weiss, Yoav Goldberg, and Eran Yahav. Extracting automata from recurrent neural networks using queries and counterexamples. In ICML, volume 80 of JMLR Workshop and Conference Proceedings, pages 5244--5253. JMLR.org, 2018.

Cited By

View all
  • (2024)The Information Extraction Framework of Document Spanners - A Very Informal SurveySOFSEM 2024: Theory and Practice of Computer Science10.1007/978-3-031-52113-3_1(3-22)Online publication date: 7-Feb-2024
  • (2023)Autonomously Computable Information ExtractionProceedings of the VLDB Endowment10.14778/3603581.360358516:10(2431-2443)Online publication date: 8-Aug-2023
  • (2022)Document Spanners - A Brief Overview of Concepts, Results, and Recent DevelopmentsProceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3517804.3526069(139-150)Online publication date: 12-Jun-2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PODS '19: Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
June 2019
494 pages
ISBN:9781450362276
DOI:10.1145/3294052
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 June 2019

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. document spanners
  2. information extraction
  3. polynomial delay
  4. regular expressions
  5. relational algebra

Qualifiers

  • Research-article

Funding Sources

Conference

SIGMOD/PODS '19
Sponsor:
SIGMOD/PODS '19: International Conference on Management of Data
June 30 - July 5, 2019
Amsterdam, Netherlands

Acceptance Rates

PODS '19 Paper Acceptance Rate 29 of 87 submissions, 33%;
Overall Acceptance Rate 642 of 2,707 submissions, 24%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)18
  • Downloads (Last 6 weeks)3
Reflects downloads up to 30 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2024)The Information Extraction Framework of Document Spanners - A Very Informal SurveySOFSEM 2024: Theory and Practice of Computer Science10.1007/978-3-031-52113-3_1(3-22)Online publication date: 7-Feb-2024
  • (2023)Autonomously Computable Information ExtractionProceedings of the VLDB Endowment10.14778/3603581.360358516:10(2431-2443)Online publication date: 8-Aug-2023
  • (2022)Document Spanners - A Brief Overview of Concepts, Results, and Recent DevelopmentsProceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3517804.3526069(139-150)Online publication date: 12-Jun-2022
  • (2022)Query Evaluation over SLP-Represented Document Databases with Complex Document EditingProceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3517804.3524158(79-89)Online publication date: 12-Jun-2022
  • (2021)Database Principles and Challenges in Text AnalysisACM SIGMOD Record10.1145/3484622.348462450:2(6-17)Online publication date: 31-Aug-2021
  • (2020)Formal Languages in Information Extraction and Graph DatabasesBeyond the Horizon of Computability10.1007/978-3-030-51466-2_28(306-309)Online publication date: 29-Jun-2020

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media