Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Public Access

LinCQA: Faster Consistent Query Answering with Linear Time Guarantees

Published: 30 May 2023 Publication History

Abstract

Most data analytical pipelines often encounter the problem of querying inconsistent data that violate pre-determined integrity constraints. Data cleaning is an extensively studied paradigm that singles out a consistent repair of the inconsistent data. Consistent query answering (CQA) is an alternative approach to data cleaning that asks for all tuples guaranteed to be returned by a given query on all (in most cases, exponentially many) repairs of the inconsistent data. In this paper, we identify a class of acyclic select-project-join (SPJ) queries for which CQA can be solved via SQL rewriting with a linear time guarantee. Our rewriting method can be viewed as a generalization of Yannakakis' algorithm for acyclic joins to the inconsistent setting. We present LinCQA, a system that takes as input any query in our class and outputs rewritings in both SQL and non-recursive Datalog with negation. We show that LinCQA often outperforms the existing CQA systems on both synthetic and real-world workloads, and in some cases, by orders of magnitude.

Supplemental Material

MP4 File
Presentation video

References

[1]
Miklós Ajtai and Yuri Gurevich. 1994. Datalog vs First-Order Logic. J. Comput. Syst. Sci. 49, 3 (1994), 562--588.
[2]
Lyublena Antova, Thomas Jansen, Christoph Koch, and Dan Olteanu. 2008. Fast and Simple Relational Processing of Uncertain Data. In Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, April 7--12, 2008, Cancún, Mexico, Gustavo Alonso, José A. Blakeley, and Arbee L. P. Chen (Eds.). IEEE Computer Society, 983--992. https://doi.org/10.1109/ICDE.2008.4497507
[3]
Arvind Arasu and Raghav Kaushik. 2009. A grammar-based entity representation framework for data cleaning. In SIGMOD Conference. ACM, 233--244.
[4]
Marcelo Arenas, Leopoldo E. Bertossi, and Jan Chomicki. 1999. Consistent Query Answers in Inconsistent Databases. In PODS. ACM Press, 68--79.
[5]
Marcelo Arenas, Leopoldo E. Bertossi, and Jan Chomicki. 2003. Answer sets for consistent query answering in inconsistent databases. Theory Pract. Log. Program. 3, 4--5 (2003), 393--424.
[6]
Patricia C. Arocena, Boris Glavic, Giansalvatore Mecca, Renée J. Miller, Paolo Papotti, and Donatello Santoro. 2015. Messing Up with BART: Error Generation for Evaluating Data-Cleaning Algorithms. Proc. VLDB Endow. 9, 2 (2015), 36--47. https://doi.org/10.14778/2850578.2850579
[7]
Pablo Barceló and Gaëlle Fontaine. 2015. On the Data Complexity of Consistent Query Answering over Graph Databases. In ICDT (LIPIcs, Vol. 31). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 380--397.
[8]
Pablo Barceló and Gaëlle Fontaine. 2017. On the data complexity of consistent query answering over graph databases. J. Comput. Syst. Sci. 88 (2017), 164--194.
[9]
Catriel Beeri, Ronald Fagin, David Maier, and Mihalis Yannakakis. 1983. On the Desirability of Acyclic Database Schemes. J. ACM 30, 3 (1983), 479--513. https://doi.org/10.1145/2402.322389
[10]
Moria Bergman, Tova Milo, Slava Novgorodov, and Wang Chiew Tan. 2015. Query-Oriented Data Cleaning with Oracles. In SIGMOD Conference. ACM, 1199--1214.
[11]
Leopoldo Bertossi, Solmaz Kolahi, and Laks VS Lakshmanan. 2013. Data cleaning and query answering with matching dependencies and matching functions. Theory of Computing Systems 52, 3 (2013), 441--482.
[12]
Leopoldo E. Bertossi. 2019. Database Repairs and Consistent Query Answering: Origins and Further Developments. In PODS. ACM, 48--58.
[13]
Leopoldo E. Bertossi, Solmaz Kolahi, and Laks V. S. Lakshmanan. 2013. Data Cleaning and Query Answering with Matching Dependencies and Matching Functions. Theory Comput. Syst. 52, 3 (2013), 441--482.
[14]
Song Bian, Xiating Ouyang, Zhiwei Fan, and Paraschos Koutris. 2023. Certifiable Robustness for Naive Bayes Classifiers. CoRR abs/2303.04811 (2023).
[15]
Philip Bohannon, Wenfei Fan, Floris Geerts, Xibei Jia, and Anastasios Kementsietsidis. 2007. Conditional Functional Dependencies for Data Cleaning. In ICDE. IEEE Computer Society, 746--755.
[16]
Marco Calautti, Marco Console, and Andreas Pieris. 2021. Benchmarking Approximate Consistent Query Answering. In PODS. ACM, 233--246.
[17]
Reynold Cheng, Jinchuan Chen, and Xike Xie. 2008. Cleaning uncertain data with quality guarantees. Proc. VLDB Endow. 1, 1 (2008), 722--735.
[18]
Jan Chomicki and Jerzy Marcinkowski. 2005. Minimal-change integrity maintenance using tuple deletions. Inf. Comput. 197, 1--2 (2005), 90--121. https://doi.org/10.1016/j.ic.2004.04.007
[19]
Jan Chomicki, Jerzy Marcinkowski, and Slawomir Staworko. 2004. Hippo: A System for Computing Consistent Answers to a Class of SQL Queries. In EDBT (Lecture Notes in Computer Science, Vol. 2992). Springer, 841--844.
[20]
Xu Chu, Ihab F. Ilyas, Sanjay Krishnan, and Jiannan Wang. 2016. Data Cleaning: Overview and Emerging Challenges. In SIGMOD Conference. ACM, 2201--2206.
[21]
Xu Chu, Ihab F. Ilyas, and Paolo Papotti. 2013. Holistic data cleaning: Putting violations into context. In ICDE. IEEE Computer Society, 458--469.
[22]
Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, and Yin Ye. 2015. KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. In SIGMOD Conference. ACM, 1247--1261.
[23]
CloudLab 2018. https://www.cloudlab.us/.
[24]
Jessica Davies and Fahiem Bacchus. 2011. Solving MAXSAT by Solving a Sequence of Simpler SAT Instances. In CP (Lecture Notes in Computer Science, Vol. 6876). Springer, 225--239.
[25]
Akhil Anand Dixit. 2021. Answering Queries Over Inconsistent Databases Using SAT Solvers. Ph. D. Dissertation. UC Santa Cruz.
[26]
Akhil A. Dixit and Phokion G. Kolaitis. 2019. A SAT-Based System for Consistent Query Answering. In SAT (Lecture Notes in Computer Science, Vol. 11628). Springer, 117--135.
[27]
Akhil A. Dixit and Phokion G. Kolaitis. 2021. Consistent Answers of Aggregation Queries using SAT Solvers. CoRR abs/2103.03314 (2021).
[28]
Amr Ebaid, Ahmed K. Elmagarmid, Ihab F. Ilyas, Mourad Ouzzani, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, and Si Yin. 2013. NADEEF: A Generalized Data Cleaning System. Proc. VLDB Endow. 6, 12 (2013), 1218--1221.
[29]
Austen Z. Fan and Paraschos Koutris. 2022. Certifiable Robustness for Nearest Neighbor Classifiers. In ICDT (LIPIcs, Vol. 220). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 6:1--6:20.
[30]
Zhiwei Fan, Paraschos Koutris, Xiating Ouyang, and Jef Wijsen. 2022. LinCQA: Faster Consistent Query Answering with Linear Time Guarantees. CoRR abs/2208.12339 (2022). https://doi.org/10.48550/arXiv.2208.12339 arXiv:2208.12339
[31]
Ariel Fuxman, Elham Fazli, and Renée J Miller. 2005. Conquer: Efficient management of inconsistent databases. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data. 155--166.
[32]
Ariel Fuxman and Renée J. Miller. 2007. First-order query rewriting for inconsistent databases. J. Comput. Syst. Sci. 73, 4 (2007), 610--635.
[33]
Congcong Ge, Yunjun Gao, Xiaoye Miao, Bin Yao, and Haobo Wang. 2021. A Hybrid Data Cleaning Framework Using Markov Logic Networks (Extended Abstract). In ICDE. IEEE, 2344--2345.
[34]
Floris Geerts, Giansalvatore Mecca, Paolo Papotti, and Donatello Santoro. 2013. The LLUNATIC Data-Cleaning Framework. Proc. VLDB Endow. 6, 9 (2013), 625--636.
[35]
Gianluigi Greco, Sergio Greco, and Ester Zumpano. 2003. A Logical Framework for Querying and Repairing Inconsistent Databases. IEEE Trans. Knowl. Data Eng. 15, 6 (2003), 1389--1408.
[36]
Yeye He, Xu Chu, Kris Ganjam, Yudian Zheng, Vivek R. Narasayya, and Surajit Chaudhuri. 2018. Transform-Data-by-Example (TDE): An Extensible Search Engine for Data Transformations. Proc. VLDB Endow. 11, 10 (2018), 1165--1177.
[37]
Lara A Kahale, Assem M Khamis, Batoul Diab, Yaping Chang, Luciane Cruz Lopes, Arnav Agarwal, Ling Li, Reem A Mustafa, Serge Koujanian, Reem Waziry, et al . 2020. Meta-Analyses Proved Inconsistent in How Missing Data Were Handled Across Their Included Primary Trials: A Methodological Survey. Clinical Epidemiology 12 (2020), 527--535.
[38]
Bojan Karlas, Peng Li, Renzhi Wu, Nezihe Merve Gürel, Xu Chu, Wentao Wu, and Ce Zhang. 2020. Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions. Proc. VLDB Endow. 14, 3 (2020), 255--267.
[39]
Yannis Katsis, Alin Deutsch, Yannis Papakonstantinou, and Vasilis Vassalos. 2010. Inconsistency resolution in online databases. In ICDE. IEEE Computer Society, 1205--1208.
[40]
Aziz Amezian El Khalfioui, Jonathan Joertz, Dorian Labeeuw, Gaëtan Staquet, and Jef Wijsen. 2020. Optimization of Answer Set Programs for Consistent Query Answering by Means of First-Order Rewriting. In CIKM. ACM, 25--34.
[41]
Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, and Si Yin. 2015. BigDansing: A System for Big Data Cleansing. In SIGMOD Conference. ACM, 1215--1230.
[42]
Henning Kohler and Sebastian Link. 2021. Possibilistic data cleaning. IEEE Transactions on Knowledge and Data Engineering (2021).
[43]
Phokion G. Kolaitis and Enela Pema. 2012. A dichotomy in the complexity of consistent query answering for queries with two atoms. Inf. Process. Lett. 112, 3 (2012), 77--85.
[44]
Phokion G. Kolaitis, Enela Pema, and Wang-Chiew Tan. 2013. Efficient Querying of Inconsistent Databases with Binary Integer Programming. Proc. VLDB Endow. 6, 6 (2013), 397--408.
[45]
Phokion G. Kolaitis, Enela Pema, and Wang-Chiew Tan. 2013. Efficient Querying of Inconsistent Databases with Binary Integer Programming. Proc. VLDB Endow. 6, 6 (2013), 397--408.
[46]
Paraschos Koutris, Xiating Ouyang, and Jef Wijsen. 2021. Consistent Query Answering for Primary Keys on Path Queries. In PODS. ACM, 215--232.
[47]
Paraschos Koutris and Dan Suciu. 2014. A Dichotomy on the Complexity of Consistent Query Answering for Atoms with Simple Keys. In ICDT. OpenProceedings.org, 165--176.
[48]
Paraschos Koutris and Jef Wijsen. 2015. The Data Complexity of Consistent Query Answering for Self-Join-Free Conjunctive Queries Under Primary Key Constraints. In PODS. ACM, 17--29.
[49]
Paraschos Koutris and Jef Wijsen. 2017. Consistent Query Answering for Self-Join-Free Conjunctive Queries Under Primary Key Constraints. ACM Trans. Database Syst. 42, 2 (2017), 9:1--9:45.
[50]
Paraschos Koutris and Jef Wijsen. 2018. Consistent Query Answering for Primary Keys and Conjunctive Queries with Negated Atoms. In PODS. ACM, 209--224.
[51]
Paraschos Koutris and Jef Wijsen. 2019. Consistent Query Answering for Primary Keys in Logspace. In ICDT (LIPIcs, Vol. 127). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 23:1--23:19.
[52]
Paraschos Koutris and Jef Wijsen. 2020. First-Order Rewritability in Consistent Query Answering with Respect to Multiple Keys. In PODS. ACM, 113--129.
[53]
Paraschos Koutris and Jef Wijsen. 2021. Consistent Query Answering for Primary Keys in Datalog. Theory Comput. Syst. 65, 1 (2021), 122--178.
[54]
Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg. 2016. ActiveClean: Interactive Data Cleaning For Statistical Modeling. Proc. VLDB Endow. 9, 12 (2016), 948--959.
[55]
Peng Li, Xi Rao, Jennifer Blase, Yue Zhang, Xu Chu, and Ce Zhang. 2021. CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks. In ICDE. IEEE, 13--24.
[56]
Andrei Lopatenko and Leopoldo E. Bertossi. 2007. Complexity of Consistent Query Answering in Databases Under Cardinality-Based and Incremental Repair Semantics. In ICDT (Lecture Notes in Computer Science, Vol. 4353). Springer, 179--193.
[57]
Marco Manna, Francesco Ricca, and Giorgio Terracina. 2015. Taming primary key violations to query large inconsistent data via ASP. Theory Pract. Log. Program. 15, 4--5 (2015), 696--710.
[58]
Mónica Caniupán Marileo and Leopoldo E. Bertossi. 2005. Optimizing repair programs for consistent query answering. In SCCC. IEEE Computer Society, 3--12.
[59]
Patrick E. O'Neil, Elizabeth J. O'Neil, Xuedong Chen, and Stephen Revilak. 2009. The Star Schema Benchmark and Augmented Fact Table Indexing. In Performance Evaluation and Benchmarking, First TPC Technology Conference, TPCTC 2009, Lyon, France, August 24--28, 2009, Revised Selected Papers (Lecture Notes in Computer Science, Vol. 5895), Raghunath Othayoth Nambiar and Meikel Poess (Eds.). Springer, 237--252. https://doi.org/10.1007/978--3--642--10424--4_17
[60]
Meikel Poess and Chris Floyd. 2000. New TPC benchmarks for decision support and web commerce. ACM Sigmod Record 29, 4 (2000), 64--71.
[61]
Nataliya Prokoshyna, Jaroslaw Szlichta, Fei Chiang, Renée J. Miller, and Divesh Srivastava. 2015. Combining Quantitative and Logical Data Cleaning. Proc. VLDB Endow. 9, 4 (2015), 300--311.
[62]
Erhard Rahm and Hong Hai Do. 2000. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 23, 4 (2000), 3--13.
[63]
Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean: Holistic Data Repairs with Probabilistic Inference. Proc. VLDB Endow. 10, 11 (2017), 1190--1201.
[64]
El Kindi Rezig, Mourad Ouzzani, Walid G. Aref, Ahmed K. Elmagarmid, Ahmed R. Mahmood, and Michael Stonebraker. 2021. Horizon: Scalable Dependency-driven Data Cleaning. Proc. VLDB Endow. 14, 11 (2021), 2546--2554.
[65]
M. Andrea Rodríguez, Leopoldo E. Bertossi, and Mónica Caniupán Marileo. 2013. Consistent query answering under spatial semantic constraints. Inf. Syst. 38, 2 (2013), 244--263.
[66]
Yongxin Tong, Caleb Chen Cao, Chen Jason Zhang, Yatao Li, and Lei Chen. 2014. CrowdCleaner: Data cleaning for multi-version data on the web via crowdsourcing. In ICDE. IEEE Computer Society, 1182--1185.
[67]
Jef Wijsen. 2010. On the first-order expressibility of computing certain answers to conjunctive queries over uncertain databases. In PODS. ACM, 179--190.
[68]
Jef Wijsen. 2012. Certain conjunctive query answering in first-order logic. ACM Trans. Database Syst. 37, 2 (2012), 9:1--9:35.
[69]
Jef Wijsen. 2019. Foundations of Query Answering on Inconsistent Databases. SIGMOD Rec. 48, 3 (2019), 6--16.
[70]
Mihalis Yannakakis. 1981. Algorithms for Acyclic Database Schemes. In Very Large Data Bases, 7th International Conference, September 9--11, 1981, Cannes, France, Proceedings. IEEE Computer Society, 82--94.

Cited By

View all
  • (2024)Computing Range Consistent Answers to Aggregation Queries via RewritingProceedings of the ACM on Management of Data10.1145/36958362:5(1-19)Online publication date: 7-Nov-2024
  • (2024)Consistent Query Answering for Primary Keys on Rooted Tree QueriesProceedings of the ACM on Management of Data10.1145/36511392:2(1-26)Online publication date: 14-May-2024
  • (2024)Efficient Mixture of Experts based on Large Language Models for Low-Resource Data PreprocessingProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671873(3690-3701)Online publication date: 25-Aug-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Management of Data
Proceedings of the ACM on Management of Data  Volume 1, Issue 1
PACMMOD
May 2023
2807 pages
EISSN:2836-6573
DOI:10.1145/3603164
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 May 2023
Published in PACMMOD Volume 1, Issue 1

Permissions

Request permissions for this article.

Author Tags

  1. acyclic queries
  2. consistent query answering
  3. uncertain databases

Qualifiers

  • Research-article

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)131
  • Downloads (Last 6 weeks)20
Reflects downloads up to 12 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Computing Range Consistent Answers to Aggregation Queries via RewritingProceedings of the ACM on Management of Data10.1145/36958362:5(1-19)Online publication date: 7-Nov-2024
  • (2024)Consistent Query Answering for Primary Keys on Rooted Tree QueriesProceedings of the ACM on Management of Data10.1145/36511392:2(1-26)Online publication date: 14-May-2024
  • (2024)Efficient Mixture of Experts based on Large Language Models for Low-Resource Data PreprocessingProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671873(3690-3701)Online publication date: 25-Aug-2024
  • (2024)Discovering Denial Constraints Based on Deep Reinforcement LearningProceedings of the 33rd ACM International Conference on Information and Knowledge Management10.1145/3627673.3679714(120-129)Online publication date: 21-Oct-2024

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media