Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Query-Guided Resolution in Uncertain Databases

Published: 20 June 2023 Publication History

Abstract

We present a novel framework for uncertain data management. We start with a database whose tuple correctness is uncertain and an oracle that can resolve the uncertainty, i.e., decide if a tuple is correct or not. Such an oracle may correspond, e.g., to a data expert or to a crowdsourcing platform. We wish to use the oracle to clean the database with the goal of ensuring the correct answer for specific mission-critical queries. To avoid the prohibitive cost of cleaning the entire database and to minimize the expected number of calls to the oracle, we must carefully select tuples whose resolution would suffice to resolve the uncertainty in query results. In other words, we need a query-guided process for the resolution of uncertain data.
We develop an end-to-end solution to this problem, based on the derivation of query answers and on correctness probabilities for the uncertain data. At a high level, we first track Boolean provenance to identify which input tuples contribute to the derivation of each output tuple, and in what ways. We then design an active learning solution for iteratively choosing tuples to resolve, based on the provenance structure and on an evolving estimation of tuple correctness probabilities. We conduct an extensive experimental study to validate our framework in different use cases.

Supplemental Material

MP4 File
We present a novel framework for an end-to-end solution for uncertain mission-critical data management. We start with a database whose tuple correctness is uncertain and an oracle (e.g., to a data expert or to a crowdsourcing platform) that can resolve the uncertainty, i.e., decide if a tuple is correct or not. We wish to use the oracle to clean the database with the goal of ensuring the correct answer for specific mission-critical queries. Our goal is to compute the precise ground truth query result while performing a minimal number of sequential oracles probes.
PDF File
Read me
ZIP File
Source Code

References

[1]
Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of Databases. Addison-Wesley.
[2]
Foto N. Afrati and Phokion G. Kolaitis. 2009. Repair Checking in Inconsistent Databases: Algorithms and Complexity. In ICDT. 31--41.
[3]
Parag Agrawal, Omar Benjelloun, Anish Das Sarma, Chris Hayworth, Shubha U. Nabar, Tomoe Sugihara, and Jennifer Widom. 2006. Trio: A System for Data, Uncertainty, and Lineage. In PVLDB. 1151--1154.
[4]
Sarah R. Allen, Lisa Hellerstein, Devorah Kletenik, and Tonguç Ünlüyurt. 2017. Evaluation of Monotone DNF Formulas. Algorithmica 77, 3 (2017), 661--685.
[5]
Yael Amsterdamer, Susan B. Davidson, Daniel Deutch, Tova Milo, Julia Stoyanovich, and Val Tannen. 2011. Putting Lipstick on Pig: Enabling Database-style Workflow Provenance. PVLDB 5, 4 (2011), 346--357.
[6]
Yael Amsterdamer, Daniel Deutch, and Val Tannen. 2011. Provenance for Aggregate Queries. In PODS. 153--164.
[7]
Marcelo Arenas, Leopoldo E. Bertossi, and Jan Chomicki. 2003. Answer Sets for Consistent Query Answering in Inconsistent Databases. Theory Pract. Log. Program. 3, 4--5 (2003), 393--424.
[8]
Ahmad Assadi, Tova Milo, and Slava Novgorodov. 2018. Cleaning Data with Constraints and Experts. In WebDB. 1:1--1:6.
[9]
Adam Bates, Benjamin Mood, Masoud Valafar, and Kevin R. B. Butler. 2013. Towards Secure Provenance-Based Access Control in Cloud Environments. In CODASPY. 277--284.
[10]
Omar Benjelloun, Anish Das Sarma, Alon Y. Halevy, Martin Theobald, and Jennifer Widom. 2008. Databases With Uncertainty and Lineage. VLDB J. 17, 2 (2008), 243--264.
[11]
Moria Bergman, Tova Milo, Slava Novgorodov, and Wang Chiew Tan. 2015. Query-Oriented Data Cleaning with Oracles. In SIGMOD. 1199--1214.
[12]
Leopoldo E. Bertossi, Solmaz Kolahi, and Laks V. S. Lakshmanan. 2013. Data Cleaning and Query Answering with Matching Dependencies and Matching Functions. Theory Comput. Syst. 52, 3 (2013), 441--482.
[13]
Michael Bloodgood. 2018. Support Vector Machine Active Learning Algorithms with Query-by-Committee Versus Closest-to-Hyperplane Selection. In ICSC. 148--155.
[14]
Piero A. Bonatti, Luca Ioffredo, Iliana M. Petrova, Luigi Sauro, and Ida Sri Rejeki Siahaan. 2020. Real-Time Reasoning in OWL2 for GDPR Compliance. Artif. Intell. 289 (2020), 103389.
[15]
Endre Boros and Tonguç Ünlüyurt. 2000. Sequential Testing of Series-Parallel Systems of Small Depth. In Computing tools for modeling, optimization and simulation. Springer, 39--73.
[16]
Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (2001), 5--32.
[17]
Peter Buneman, Sanjeev Khanna, and Wang Chiew Tan. 2001. Why and Where: A Characterization of Data Provenance. In ICDT. 316--330.
[18]
James Cheney, Laura Chiticariu, and Wang Chiew Tan. 2009. Provenance in Databases: Why, How, and Where. Foundations and Trends in Databases 1, 4 (2009), 379--474.
[19]
Reynold Cheng, Jinchuan Chen, and Xike Xie. 2008. Cleaning Uncertain Data with Quality Guarantees. PVLDB 1, 1 (2008), 722--735.
[20]
Reynold Cheng, Eric Lo, Xuan S. Yang, Ming-Hay Luk, Xiang Li, and Xike Xie. 2010. Explore or Exploit? Effective Strategies for Disambiguating Large Databases. PVLDB 3, 1 (2010), 815--825.
[21]
Xu Chu, Ihab F. Ilyas, and Paolo Papotti. 2013. Holistic Data Cleaning: Putting Violations Into Context. In ICDE. 458--469.
[22]
Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, and Yin Ye. 2015. KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. In SIGMOD. 1247--1261.
[23]
Ferdinando Cicalese, Eduardo Sany Laber, and Aline Medeiros Saettler. 2014. Diagnosis determination: decision trees optimizing simultaneously worst and expected testing cost. In ICML. 414--422.
[24]
Nilesh N. Dalvi and Dan Suciu. 2007. Efficient Query Evaluation on Probabilistic Databases. VLDB J. 16, 4 (2007), 523--544.
[25]
Dave DeBarr and H. Wechsler. 2009. Spam Detection using Clustering, Random Forests, and Active Learning. In CEAS.
[26]
Ofer Dekel, Claudio Gentile, and Karthik Sridharan. 2012. Selective Sampling and Active Learning from Single and Multiple Teachers. J. Mach. Learn. Res. 13 (2012), 2655--2697.
[27]
Guy Van den Broeck and Dan Suciu. 2017. Query Processing on Probabilistic Data: A Survey. Found. Trends DBs 7, 3--4 (2017), 197--341.
[28]
Amol Deshpande, Lisa Hellerstein, and Devorah Kletenik. 2014. Approximation Algorithms for Stochastic Boolean Function Evaluation and Stochastic Submodular Set Cover. In SODA. 1453--1466.
[29]
Daniel Deutch, Nave Frost, and Amir Gilad. 2017. Provenance for Natural Language Queries. PVLDB 10, 5 (2017), 577--588.
[30]
Daniel Deutch, Tova Milo, Sudeepa Roy, and Val Tannen. 2014. Circuits for Datalog Provenance. In ICDT. 201--212.
[31]
Osnat Drien, Antoine Amarilli, and Yael Amsterdamer. 2021. Managing Consent for Data Access in Shared Databases. In ICDE. 2012--2015.
[32]
Anna Fariha, Ashish Tiwari, Alexandra Meliou, Arjun Radhakrishna, and Sumit Gulwani. 2021. CoCo: Interactive Exploration of Conformance Constraints for Data Understanding and Data Cleaning. In SIGMOD. 2706--2710.
[33]
Kaniz Fatema, Ensar Hadziselimovic, Harshvardhan J. Pandit, Christophe Debruyne, Dave Lewis, and Declan O'Sullivan. 2017. Compliance through Informed Consent: Semantic Based Consent Permission and Data Management Model. In PrivOn at ISWC.
[34]
Amos Fiat and Dmitry Pechyony. 2004. Decision Trees: More Theoretical Justification for Practical Algorithms. In ALT. 156--170.
[35]
J. Nathan Foster, Todd J. Green, and Val Tannen. 2008. Annotated XML: Queries and Provenance. In PODS. 271--280.
[36]
Michael J. Franklin, Donald Kossmann, Tim Kraska, Sukriti Ramesh, and Reynold Xin. 2011. CrowdDB: Answering Queries with Crowdsourcing. In SIGMOD. 61--72.
[37]
Floris Geerts, Giansalvatore Mecca, Paolo Papotti, and Donatello Santoro. 2020. Cleaning Data with Llunatic. VLDB J. 29, 4 (2020), 867--892.
[38]
Shima Gerani, Mark James Carman, and F. Crestani. 2010. Proximity-Based Opinion Retrieval. SIGIR (2010).
[39]
Amir Gilad, Daniel Deutch, and Sudeepa Roy. 2020. On Multiple Semantics for Declarative Database Repairs. In SIGMOD. 817--831.
[40]
Boris Glavic and Gustavo Alonso. 2009. Perm: Processing Provenance and Data on the Same Data Model through Query Rewriting. In ICDE. 174--185.
[41]
Boris Glavic and Gustavo Alonso. 2009. Provenance for Nested Subqueries. In EDBT. 982--993.
[42]
Abigail Goldsteen, Shelly Garion, Sima Nadler, Natalia Razinkov, Yosef Moatti, and Paula Ta-Shma. 2017. Brief Announcement: A Consent Management Solution for Enterprises. In CSCML. 189--192.
[43]
Daniel Golovin and Andreas Krause. 2011. Adaptive Submodularity: Theory and Applications in Active Learning and Stochastic Optimization. JAIR 42 (2011), 427--486.
[44]
Daniel Golovin, Andreas Krause, and Debajyoti Ray. 2010. Near-Optimal Bayesian Active Learning with Noisy Observations. In NeurIPS. Curran Associates, Inc., 766--774.
[45]
Sergio Greco, Cristian Molinaro, and Irina Trubitsyna. 2017. Computing Approximate Certain Answers over Incomplete Databases. In AMW, Vol. 1912.
[46]
Todd J. Green, Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen. 2010. Provenance in ORCHESTRA. IEEE Data Eng. Bull. 33, 3 (2010), 9--16.
[47]
Todd J. Green, Gregory Karvounarakis, and Val Tannen. 2007. Provenance Semirings. In PODS. 31--40.
[48]
Husheng Guo and Wenjian Wang. 2015. An Active Learning-Based SVM Multi-Class Classification Model. Pattern Recognit. 48, 5 (2015), 1577--1597.
[49]
Naeemul Hassan, Gensheng Zhang, Fatma Arslan, Josue Caraballo, Damian Jimenez, Siddhant Gawsane, Shohedul Hasan, Minumol Joseph, Aaditya Kulkarni, Anil Kumar Nayak, Vikas Sable, Chengkai Li, and Mark Tremayne. 2017. ClaimBuster: The First-ever End-to-end Fact-checking System. PVLDB 10, 12 (2017), 1945--1948.
[50]
Melanie Herschel, Ralf Diestelkämper, and Houssem Ben Lahmar. 2017. A Survey on Provenance: What for? What Form? What from? VLDB J. 26, 6 (2017), 881--906.
[51]
Ihab F. Ilyas and Xu Chu. 2019. Data Cleaning. ACM.
[52]
Tomasz Imielinski and Witold Lipski Jr. 1984. Incomplete Information in Relational Databases. JACM 31, 4 (1984), 761--791.
[53]
Haim Kaplan, Eyal Kushilevitz, and Yishay Mansour. 2005. Learning with Attribute Costs. In STOC. 356--365.
[54]
Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen. 2010. Querying Data Provenance. In SIGMOD. 951--962.
[55]
O. Keren. 2008. Reduction of Average Path Length in Binary Decision Diagrams by Spectral Methods. IEEE TOCS 57 (2008), 520--531.
[56]
Christoph Koch and Dan Olteanu. 2008. Conditioning Probabilistic Databases. PVLDB 1, 1 (2008), 313--325.
[57]
Daphne Koller. 1999. Probabilistic Relational Models. In ILP. 3--13.
[58]
George Konstantinidis, Jet Holt, and Adriane Chapman. 2021. Enabling Personal Consent in Databases. PVLDB 15, 2 (2021), 375--387.
[59]
Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua. 2017. Learning Active Learning from Data. In NIPS. 4225--4235.
[60]
Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg. 2016. ActiveClean: Interactive Data Cleaning For Statistical Modeling. PVLDB 9, 12 (2016), 948--959.
[61]
Jasper Kuperus, Cor J. Veenman, and Maurice van Keulen. 2013. Increasing NER Recall with Minimal Precision Loss. In EISIC. 106--111.
[62]
Guoliang Li, Chengliang Chai, Ju Fan, Xueping Weng, Jian Li, Yudian Zheng, Yuanbing Li, Xiang Yu, Xiaohang Zhang, and Haitao Yuan. 2018. CDB: A Crowd-Powered Database System. PVLDB 11, 12 (2018), 1926--1929.
[63]
Xiang Lian, Lei Chen, and Shaoxu Song. 2010. Consistent Query Answers in Inconsistent Probabilistic Databases. In SIGMOD. 303--314.
[64]
Xin Lin, Yun Peng, Jianliang Xu, and Byron Choi. 2018. Human-Powered Data Cleaning for Probabilistic Reachability Queries on Uncertain Graphs. In ICDE. 1755--1756.
[65]
Ye-Zheng Liu, Zhe Li, Chong Zhou, Yuanchun Jiang, Jianshan Sun, Meng Wang, and Xiangnan He. 2020. Generative Adversarial Active Learning for Unsupervised Outlier Detection. IEEE TKDE 32, 8 (2020), 1517--1528.
[66]
Ester Livshits, Benny Kimelfeld, and Sudeepa Roy. 2018. Computing Optimal Repairs for Functional Dependencies. In PODS. 225--237.
[67]
A. Marcus, E. Wu, D.R. Karger, S. Madden, and R.C. Miller. 2011. Crowdsourced Databases: Query Processing with People. In CIDR.
[68]
Alexandra Meliou, Wolfgang Gatterbauer, and Dan Suciu. 2011. Bringing Provenance to Its Full Potential Using Causal Reasoning. In TaPP.
[69]
Tom M. Mitchell, William W. Cohen, Estevam R. Hruschka Jr., Partha P. Talukdar, Bo Yang, Justin Betteridge, Andrew Carlson, Bhavana Dalvi Mishra, Matt Gardner, Bryan Kisiel, Jayant Krishnamurthy, Ni Lao, Kathryn Mazaitis, Thahir Mohamed, Ndapandula Nakashole, Emmanouil A. Platanios, Alan Ritter, Mehdi Samadi, Burr Settles, Richard C. Wang, Derry Wijaya, Abhinav Gupta, Xinlei Chen, Abulhair Saparov, Malcolm Greaves, and Joel Welling. 2018. Never-Ending Learning. Commun. ACM 61, 5 (2018), 103--115.
[70]
Luyi Mo, Reynold Cheng, Xiang Li, David W. Cheung, and Xuan S. Yang. 2013. Cleaning Uncertain Data for Top-k Queries. In ICDE. 134--145.
[71]
Felix Neutatz, Binger Chen, Ziawasch Abedjan, and Eugene Wu. 2021. From Cleaning before ML to Cleaning for ML. IEEE Data Eng. Bull. 44, 1 (2021), 24--41.
[72]
Federico Olmedo, Friedrich Gretz, Nils Jansen, Benjamin Lucien Kaminski, Joost-Pieter Katoen, and Annabelle McIver. 2018. Conditioning in Probabilistic Programming. TOPLAS 40, 1 (2018), 1--50.
[73]
Natalia Ostapuk, Jie Yang, and Philippe Cudré-Mauroux. 2019. ActiveLink: Deep Active Learning for Link Prediction in Knowledge Graphs. In WWW. 1398--1408.
[74]
Harshvardhan J. Pandit, Declan O'Sullivan, and Dave Lewis. 2018. Queryable Provenance Metadata For GDPR Compliance. In SEMANTICS. 262--268.
[75]
Katerina Papaioannou, Martin Theobald, and Michael H. Böhlen. 2018. Supporting Set Operations in Temporal- Probabilistic Databases. In ICDE. 1180--1191.
[76]
Aditya G. Parameswaran, Hyunjung Park, Hector Garcia-Molina, Neoklis Polyzotis, and Jennifer Widom. 2012. Deco: Declarative Crowdsourcing. In CIKM.
[77]
Jaehong Park, Dang Nguyen, and Ravi S. Sandhu. 2012. A Provenance-Based Access Control Model. In PST. 137--144.
[78]
Eugenia A. Politou, Efthimios Alepis, and Constantinos Patsakis. 2018. Forgetting Personal Data and Revoking Consent under the GDPR: Challenges and Proposed Solutions. J. Cybersecur. 4, 1 (2018), tyy001.
[79]
Christopher Re, Nilesh Dalvi, and Dan Suciu. 2007. Efficient Top-k Query Evaluation on Probabilistic Data. In ICDE. 886--895.
[80]
Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean: Holistic Data Repairs with Probabilistic Inference. PVLDB 10, 11 (2017), 1190--1201.
[81]
El Kindi Rezig, Mourad Ouzzani, Walid G. Aref, Ahmed K. Elmagarmid, Ahmed R. Mahmood, and Michael Stonebraker. 2021. Horizon: Scalable Dependency-driven Data Cleaning. PVLDB 14, 11 (2021), 2546--2554.
[82]
Sudeepa Roy, Vittorio Perduca, and Val Tannen. 2011. Faster Query Answering in Probabilistic Databases Using Read-Once Functions. In ICDT. 232--243.
[83]
Rodrygo L. T. Santos, Ben He, Craig Macdonald, and Iadh Ounis. 2009. Integrating Proximity to Subjective Sentences for Blog Opinion Retrieval. In ECIR. 325--336.
[84]
Pierre Senellart, Louis Jachiet, Silviu Maniu, and Yann Ramusat. 2018. ProvSQL: Provenance and Probability Management in PostgreSQL. PVLDB 11, 12 (2018), 2034--2037.
[85]
Yanyao Shen, Hyokun Yun, Zachary C. Lipton, Yakov Kronrod, and Animashree Anandkumar. 2018. Deep Active Learning for Named Entity Recognition. In ICLR.
[86]
Yasuhiro Sogawa, Tsuyoshi Ueno, Yoshinobu Kawahara, and Takashi Washio. 2013. Active Learning for Noisy Oracle via Density Power Divergence. Neural Networks 46 (2013), 133--143.
[87]
sourcecode 2022. Source code repository. https://github.com/osnatairy/Active-Probabilistic-Databases.git.
[88]
Dan Suciu, Dan Olteanu, Christopher Ré, and Christoph Koch. 2011. Probabilistic Databases. Morgan & Claypool.
[89]
Ruiming Tang, Reynold Cheng, Huayu Wu, and Stéphane Bressan. 2012. A Framework for Conditioning Uncertain Relational Data. In DEXA, Vol. 7447. 71--87.
[90]
TPC-H 2019. TPC-H Benchmark. http://www.tpc.org/tpch/.
[91]
Tonguç Ünlüyurt. 2004. Sequential Testing of Complex Systems: A Review. Discrete Applied Mathematics 142, 1--3 (2004), 189--205.
[92]
Maurice van Keulen, Benjamin Lucien Kaminski, Christoph Matheja, and Joost-Pieter Katoen. 2018. Rule-Based Conditioning of Probabilistic Data. In SUM, Vol. 11142. 290--305.
[93]
Joannès Vermorel and M. Mohri. 2005. Multi-armed Bandit Algorithms and Empirical Evaluation. In ECML.
[94]
Jiannan Wang, Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, Tim Kraska, and Tova Milo. 2014. A Sample-and- Clean Framework for Fast and Accurate Query Processing on Dirty Data. In SIGMOD. 469--480.
[95]
Geoffrey I Webb, Eamonn Keogh, and Risto Miikkulainen. 2010. Naïve Bayes. Encyclopedia of Machine Learning 15 (2010), 713--714.
[96]
Jef Wijsen. 2019. Foundations of Query Answering on Inconsistent Databases. SIGMOD Rec. 48, 3 (2019), 6--16.
[97]
Jie Xu, Dmitri V. Kalashnikov, and Sharad Mehrotra. 2015. Query Aware Determinization of Uncertain Objects. IEEE TKDE 27, 1 (2015), 207--221.
[98]
Mohamed Yakout, Laure Berti-Équille, and Ahmed K. Elmagarmid. 2013. Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes. In SIGMOD. 553--564.
[99]
Wei Zhang, Clement Yu, and Weiyi Meng. 2007. Opinion Retrieval from Blogs. In CIKM. 831.
[100]
Nan Zheng and Zack Ives. 2020. Compact, Tamper-Resistant Archival of Fine-Grained Provenance. PVLDB 14, 4 (2020), 485--497.
[101]
Hong Zhu, Caicai Zhang, Zhongsheng Cao, and Ruiming Tang. 2016. On Efficient Conditioning of Probabilistic Relational Databases. Knowledge-Based Systems 92 (2016), 112--126.

Cited By

View all
  • (2024)CleanEr: Interactive, Query-Guided Error Mitigation for Data Cleaning Systems2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00418(5421-5424)Online publication date: 13-May-2024

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Management of Data
Proceedings of the ACM on Management of Data  Volume 1, Issue 2
PACMMOD
June 2023
2310 pages
EISSN:2836-6573
DOI:10.1145/3605748
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 June 2023
Published in PACMMOD Volume 1, Issue 2

Permissions

Request permissions for this article.

Badges

Author Tags

  1. active learning
  2. provenance
  3. uncertain databases

Qualifiers

  • Research-article

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)96
  • Downloads (Last 6 weeks)4
Reflects downloads up to 22 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2024)CleanEr: Interactive, Query-Guided Error Mitigation for Data Cleaning Systems2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00418(5421-5424)Online publication date: 13-May-2024

View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media