Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Sampling the repairs of functional dependency violations under hard constraints

Published: 01 September 2010 Publication History

Abstract

Violations of functional dependencies (FDs) are common in practice, often arising in the context of data integration or Web data extraction. Resolving these violations is known to be challenging for a variety of reasons, one of them being the exponential number of possible "repairs". Previous work has tackled this problem either by producing a single repair that is (nearly) optimal with respect to some metric, or by computing consistent answers to selected classes of queries without explicitly generating the repairs. In this paper, we propose a novel data cleaning approach that is not limited to finding a single repair or to a particular class of queries, namely, sampling from the space of possible repairs. We give several motivating scenarios where sampling from the space of FD repairs is desirable, propose a new class of useful repairs, and present an algorithm that randomly samples from this space. We also show how to restrict the space of generated repairs based on user-defined hard constraints that define an immutable trusted subset of the input relation, and we experimentally evaluate our algorithm against previous approaches. While this paper focuses on repairing FDs, we envision the proposed sampling approach to be applicable to other integrity constraints with large repair spaces.

References

[1]
UIS data generator, http://www.cs.utexas.edu/users/ml/riddle/data.html.
[2]
S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[3]
F. N. Afrati and P. G. Kolaitis. Repair checking in inconsistent databases: algorithms and complexity. In ICDT, pages 31--41, 2009.
[4]
M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In PODS, pages 68--79, 1999.
[5]
O. Benjelloun, A. D. Sarma, A. Y. Halevy, and J. Widom. ULDBs: Databases with uncertainty and lineage. In VLDB, pages 953--964, 2006.
[6]
P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, pages 143--154, 2005.
[7]
J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple deletions. Information and Computation, 197(1/2):90--121, 2005.
[8]
J. Chomicki, J. Marcinkowski, and S. Staworko. Computing consistent query answers using conflict hypergraphs. In CIKM, pages 417--426, 2004.
[9]
A. Fuxman and R. J. Miller. First-order query rewriting for inconsistent databases. Journal of Computer and System Sciences, 73(4):610--635, 2007.
[10]
S. Greco and C. Molinaro. Approximate probabilistic query answering over inconsistent databases. In ER, pages 311--325, 2008.
[11]
R. Jampani, F. Xu, M. Wu, L. L. Perez, C. M. Jermaine, and P. J. Haas. MCDB: a monte carlo approach to managing uncertain data. In SIGMOD, pages 687--700, 2008.
[12]
S. Kolahi and L. V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, pages 53--62, 2009.
[13]
A. Lopatenko and L. E. Bertossi. Complexity of consistent query answering in databases under cardinality-based and incremental repair semantics. In ICDT, pages 179--193, 2007.
[14]
P. Sen and A. Deshpande. Representing and querying correlated tuples in probabilistic databases. In ICDE, pages 596--605, 2007.
[15]
J. Wijsen. Condensed representation of database repairs for consistent query answering. In ICDT, pages 375--390, 2003.
[16]
J. Wijsen. Database repairing using updates. ACM Trans. Database Syst., 30(3):722--768, 2005.

Cited By

View all
  • (2024)IterClean: An Iterative Data Cleaning Framework with Large Language ModelsProceedings of the ACM Turing Award Celebration Conference - China 202410.1145/3674399.3674436(100-105)Online publication date: 5-Jul-2024
  • (2023)Time Series Data ValidityProceedings of the ACM on Management of Data10.1145/35889391:1(1-26)Online publication date: 30-May-2023
  • (2022)Finding Label and Model Errors in Perception Data With Learned Observation AssertionsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517907(496-505)Online publication date: 10-Jun-2022
  • Show More Cited By
  1. Sampling the repairs of functional dependency violations under hard constraints

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image Proceedings of the VLDB Endowment
      Proceedings of the VLDB Endowment  Volume 3, Issue 1-2
      September 2010
      1658 pages

      Publisher

      VLDB Endowment

      Publication History

      Published: 01 September 2010
      Published in PVLDB Volume 3, Issue 1-2

      Qualifiers

      • Research-article

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)23
      • Downloads (Last 6 weeks)7
      Reflects downloads up to 09 Nov 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)IterClean: An Iterative Data Cleaning Framework with Large Language ModelsProceedings of the ACM Turing Award Celebration Conference - China 202410.1145/3674399.3674436(100-105)Online publication date: 5-Jul-2024
      • (2023)Time Series Data ValidityProceedings of the ACM on Management of Data10.1145/35889391:1(1-26)Online publication date: 30-May-2023
      • (2022)Finding Label and Model Errors in Perception Data With Learned Observation AssertionsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517907(496-505)Online publication date: 10-Jun-2022
      • (2021)HorizonProceedings of the VLDB Endowment10.14778/3476249.347630114:11(2546-2554)Online publication date: 1-Jul-2021
      • (2021)Stream Data Cleaning under Speed and Acceleration ConstraintsACM Transactions on Database Systems10.1145/346574046:3(1-44)Online publication date: 28-Sep-2021
      • (2021)Benchmarking Approximate Consistent Query AnsweringProceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3452021.3458309(233-246)Online publication date: 20-Jun-2021
      • (2020)Automatic weighted matching rectifying rule discovery for data repairingThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-020-00617-629:6(1433-1447)Online publication date: 9-Jun-2020
      • (2019)AutoRepair: an automatic repairing approach over multi-source dataKnowledge and Information Systems10.1007/s10115-018-1284-961:1(227-257)Online publication date: 1-Oct-2019
      • (2019)Bayesian Classifier Modeling for Dirty DataPRICAI 2019: Trends in Artificial Intelligence10.1007/978-3-030-29894-4_6(66-79)Online publication date: 26-Aug-2019
      • (2018)Discovery of genuine functional dependencies from relational data with missing valuesProceedings of the VLDB Endowment10.14778/3204028.320403211:8(880-892)Online publication date: 1-Apr-2018
      • Show More Cited By

      View Options

      Get Access

      Login options

      Full Access

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media