Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Constant-Delay Enumeration for Nondeterministic Document Spanners

Published: 14 April 2021 Publication History

Abstract

We consider the information extraction framework known as document spanners and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm that is tractable in combined complexity, i.e., in the sizes of the input document and the VA, while ensuring the best possible data complexity bounds in the input document size, i.e., constant delay in the document size. Several recent works at PODS’18 proposed such algorithms but with linear delay in the document size or with an exponential dependency in size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, particularly for the restricted case of so-called extended VAs. Finally, we evaluate our algorithm empirically using a prototype implementation.

References

[1]
Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. 1974. The Design and Analysis of Computer Algorithms. Addison-Wesley.
[2]
Antoine Amarilli, Pierre Bourhis, Louis Jachiet, and Stefan Mengel. 2017. A circuit-based approach to efficient enumeration. In Proceedings of ICALP 2017. https://arxiv.org/abs/1702.05589.
[3]
Antoine Amarilli, Pierre Bourhis, and Stefan Mengel. 2018. Enumeration on trees under relabelings. In Proceedings of ICDT 2018. https://arxiv.org/abs/1709.06185.
[4]
Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. 2019. Constant-delay enumeration for nondeterministic document spanners. In Proceedings of ICDT 2019. https://drops.dagstuhl.de/opus/frontdoor.php?source_opus=10324.
[5]
Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. 2019. Enumeration on trees with tractable combined complexity and efficient updates. In Proceedings of PODS 2019. https://arxiv.org/abs/1812.09519.
[6]
Guillaume Bagan. 2006. MSO queries on tree decomposable structures are computable with linear delay. In Proceedings of CSL 2006.
[7]
Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. 2017. Answering conjunctive queries under updates. In Proceedings of PODS 2017. https://arxiv.org/abs/1702.06370.
[8]
Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. 2017. Answering FO+MOD queries under updates on bounded degree databases. In Proceedings of ICDT 2017. https://arxiv.org/abs/1702.08764.
[9]
Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. 2018. Answering UCQs under updates and in the presence of integrity constraints. In Proceedings of ICDT 2017. https://arxiv.org/abs/1709.10039.
[10]
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms (3rd ed.). MIT Press, Cambridge, MA.
[11]
Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. 2015. Document spanners: A formal approach to information extraction. Journal of the ACM 62, 2 (2015), Article 12. https://pdfs.semanticscholar.org/8df0/ad1c6aa0df93e58071b8afe3371a16a3182f.pdf.
[12]
Fernando Florenzano, Cristian Riveros, Martín Ugarte, Stijn Vansummeren, and Domagoj Vrgoc. 2018. Constant delay algorithms for regular document spanners. In Proceedings of PODS 2018. https://arxiv.org/abs/1803.05277.
[13]
Dominik D. Freydenberger. 2017. A logic for document spanners. In Proceedings of ICDT 2017. http://drops.dagstuhl.de/opus/volltexte/2017/7049/.
[14]
Dominik D. Freydenberger. 2019. A logic for document spanners. Theory of Computing Systems 63, 7 (2019), 1679--1754. https://link.springer.com/article/10.1007%2Fs00224-018-9874-1.
[15]
Dominik D. Freydenberger and Mario Holldack. 2018. Document spanners: From expressive power to decision problems. Theory of Computing Systems 62, 4 (2018), 854--898. https://doi.org/10.1007/s00224-017-9770-0.
[16]
Dominik D. Freydenberger, Benny Kimelfeld, and Liat Peterfreund. 2018. Joining extractions of regular expressions. In Proceedings of PODS 2018. https://arxiv.org/abs/1703.10350.
[17]
François Le Gall. 2012. Improved output-sensitive quantum algorithms for Boolean matrix multiplication. In Proceedings of SODA 2012. https://pdfs.semanticscholar.org/91a5/dd90ed43a6e8f55f8ec18ceead7dd0a6e988.pdf.
[18]
François Le Gall. 2014. Powers of tensors and fast matrix multiplication. In Proceedings of ISSAC 2014. https://arxiv.org/abs/1401.7714.
[19]
Étienne Grandjean. 1996. Sorting, linear time and the satisfiability problem. Annals of Mathematics and Artificial Intelligence 16, 1 (1996), 183--236.
[20]
Katja Losemann and Wim Martens. 2014. MSO queries on trees: Enumerating answers under updates. In Proceedings of CSL-LICS 2014. http://www.theoinf.uni-bayreuth.de/download/lics14-preprint.pdf.
[21]
Arnaud Mary and Yann Strozecki. 2016. Efficient enumeration of solutions produced by closure operations. In Proceedings of STACS 2016. http://drops.dagstuhl.de/opus/volltexte/2016/5753/.
[22]
Francisco Maturana, Cristian Riveros, and Domagoj Vrgoc. 2018. Document spanners for extracting incomplete information: Expressiveness and complexity. In Proceedings of PODS 2018. https://arxiv.org/abs/1707.00827.
[23]
Andrea Morciano. 2017. Engineering a Runtime System for AQL. Master’s Thesis. Politecnico di Milano. https://www.politesi.polimi.it/bitstream/10589/135034/1/2017_07_Morciano.pdf.
[24]
Matthias Niewerth. 2018. MSO queries on trees: Enumerating answers under updates using forest algebras. In Proceedings of LICS 2018.
[25]
Matthias Niewerth and Luc Segoufin. 2018. Enumeration of MSO queries on strings with constant delay and logarithmic updates. In Proceedings of PODS 2018. http://www.di.ens.fr/∼segoufin/Papers/Mypapers/enum-update-words.pdf.
[26]
Ronald C. Read and Robert E. Tarjan. 1975. Bounds on backtrack algorithms for listing cycles, paths, and spanning trees. Networks 5, 3 (1975), 237--252.
[27]
IBM Research. 2018. SystemT. Retrieved February 7, 2021 from https://researcher.watson.ibm.com/researcher/view_group.php?id=1264.
[28]
Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W. Pennebaker. 2006. Effects of age and gender on blogging. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, Vol. 6. 199--205. http://u.cs.biu.ac.il/∼koppel/BlogCorpus.htm.
[29]
Luc Segoufin. 2014. A glimpse on constant delay enumeration (Invited talk). In Proceedings of STACS 2014. https://hal.inria.fr/hal-01070893/document.
[30]
Shuji Tsukiyama, Mikio Ide, Hiromu Ariyoshi, and I. Shirakawa. 1977. A new algorithm for generating all the maximal independent sets. SIAM Journal on Computing 6, 3 (1997), 505--517.
[31]
L. G. Valiant. 1979. The complexity of computing the permanent. Theoretical Computer Science 8, 2 (1979), 189--201. https://www.sciencedirect.com/science/article/pii/0304397579900446.
[32]
Kunihiro Wasa. 2016. Enumeration of enumeration algorithms. arXiv:1605.05012. https://arxiv.org/abs/1605.05102.

Cited By

View all
  • (2024)Revisiting Weighted Information Extraction: A Simpler and Faster Algorithm for Ranked EnumerationProceedings of the ACM on Management of Data10.1145/36958402:5(1-19)Online publication date: 7-Nov-2024
  • (2023)REmatch: A Novel Regex Engine for Finding All MatchesProceedings of the VLDB Endowment10.14778/3611479.361148816:11(2792-2804)Online publication date: 1-Jul-2023
  • (2023)Representing Paths in Graph Database Pattern MatchingProceedings of the VLDB Endowment10.14778/3587136.358715116:7(1790-1803)Online publication date: 1-Mar-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems
ACM Transactions on Database Systems  Volume 46, Issue 1
March 2021
143 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/3457891
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 April 2021
Accepted: 01 November 2020
Revised: 01 September 2020
Received: 01 February 2020
Published in TODS Volume 46, Issue 1

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Documents spanners
  2. constant delay enumeration
  3. information extraction

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • ANR
  • DFG

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)21
  • Downloads (Last 6 weeks)1
Reflects downloads up to 14 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Revisiting Weighted Information Extraction: A Simpler and Faster Algorithm for Ranked EnumerationProceedings of the ACM on Management of Data10.1145/36958402:5(1-19)Online publication date: 7-Nov-2024
  • (2023)REmatch: A Novel Regex Engine for Finding All MatchesProceedings of the VLDB Endowment10.14778/3611479.361148816:11(2792-2804)Online publication date: 1-Jul-2023
  • (2023)Representing Paths in Graph Database Pattern MatchingProceedings of the VLDB Endowment10.14778/3587136.358715116:7(1790-1803)Online publication date: 1-Mar-2023
  • (2022)COREProceedings of the VLDB Endowment10.14778/3538598.353861515:9(1951-1964)Online publication date: 1-May-2022
  • (2021)Database Principles and Challenges in Text AnalysisACM SIGMOD Record10.1145/3484622.348462450:2(6-17)Online publication date: 31-Aug-2021

View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media