DOI: 10.1145/2723372.2749431
research-article

KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing

Published: 27 May 2015

Abstract

Classical approaches to data cleaning have relied on integrity constraints, statistics, or machine learning. These approaches are known to be limited in cleaning accuracy, which can usually be improved by consulting master data and involving experts to resolve ambiguity. The advent of knowledge bases (KBs), both general-purpose and within enterprises, and of crowdsourcing marketplaces provides yet more opportunities to achieve higher accuracy at a larger scale. We propose KATARA, a knowledge base and crowd powered data cleaning system that, given a table, a KB, and a crowd, interprets table semantics to align it with the KB, identifies correct and incorrect data, and generates top-k possible repairs for incorrect data. Experiments show that KATARA can be applied to various datasets and KBs, and can efficiently annotate data and suggest possible repairs.
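
The three steps named in the abstract (align the table with the KB, separate correct from incorrect data, propose top-k repairs) can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the toy KB, the sample table, the function names, the majority-vote alignment, and the ranking of repairs by number of changed cells are stand-ins, not the algorithms of the paper. The crowd is omitted; the comments mark where such a step might be consulted.

```python
from collections import Counter

# Hypothetical toy KB of (subject, predicate, object) triples.
KB = {
    ("Italy", "type", "country"), ("Spain", "type", "country"),
    ("Rome", "type", "capital"), ("Madrid", "type", "capital"),
    ("Italy", "hasCapital", "Rome"), ("Spain", "hasCapital", "Madrid"),
}

# Hypothetical two-column table; the last row is deliberately wrong.
TABLE = [("Italy", "Rome"), ("Spain", "Madrid"), ("Italy", "Madrid")]


def types_of(value):
    """Return the KB types recorded for a cell value."""
    return {o for s, p, o in KB if s == value and p == "type"}


def relations_between(a, b):
    """Return the non-type KB predicates linking a to b."""
    return {p for s, p, o in KB if s == a and o == b and p != "type"}


def discover_pattern(table):
    """Align the table with the KB by majority vote (an assumed heuristic).

    Returns the most frequent type per column and the most frequent
    relation between column 0 and column 1.
    """
    col_types = []
    for c in range(len(table[0])):
        votes = Counter(t for row in table for t in types_of(row[c]))
        col_types.append(votes.most_common(1)[0][0])
    rel_votes = Counter(r for a, b in table for r in relations_between(a, b))
    return col_types, rel_votes.most_common(1)[0][0]


def annotate(table, pattern):
    """Label each row 'validated' if the KB supports it, else 'suspect'.

    A suspect row may be a data error or a gap in the KB; a fuller system
    could ask a crowd to decide. That step is left out of this sketch.
    """
    col_types, relation = pattern
    labels = []
    for a, b in table:
        ok = (col_types[0] in types_of(a)
              and col_types[1] in types_of(b)
              and (a, relation, b) in KB)
        labels.append("validated" if ok else "suspect")
    return labels


def top_k_repairs(row, pattern, k=2):
    """Rank KB pairs matching the pattern by how many cells they change."""
    col_types, relation = pattern
    candidates = [(s, o) for s, p, o in KB
                  if p == relation
                  and col_types[0] in types_of(s)
                  and col_types[1] in types_of(o)]
    scored = sorted((sum(x != y for x, y in zip(row, cand)), cand)
                    for cand in candidates)
    return scored[:k]


pattern = discover_pattern(TABLE)
labels = annotate(TABLE, pattern)
repairs = top_k_repairs(TABLE[2], pattern, k=2)
```

On this toy input the third row is flagged as suspect, and both candidate repairs change exactly one cell, so a tie-breaking signal (for example, a crowd answer) would be needed to choose between them.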

    Published In

    SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
    May 2015
    2110 pages
    ISBN:9781450327589
    DOI:10.1145/2723372
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. crowdsourcing
    2. data cleaning
    3. data quality
    4. knowledge base

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'15: International Conference on Management of Data
    May 31 - June 4, 2015
    Melbourne, Victoria, Australia

    Acceptance Rates

    SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Article Metrics

    • Downloads (Last 12 months): 143
    • Downloads (Last 6 weeks): 11
    Reflects downloads up to 09 Jan 2025

    Cited By

    • (2024) Naive Bayes classifiers over missing data. Proceedings of the 41st International Conference on Machine Learning, pp. 3913-3934. DOI: 10.5555/3692070.3692227. Online publication date: 21-Jul-2024
    • (2024) A No-Code Automated Machine Learning Platform for the Energy Sector. Gazi University Journal of Science Part A: Engineering and Innovation, 11:2, pp. 289-303. DOI: 10.54287/gujsa.1473782. Online publication date: 4-Jun-2024
    • (2024) Hardware-Efficient Data Imputation through DBMS Extensibility. Proceedings of the VLDB Endowment, 17:11, pp. 3497-3510. DOI: 10.14778/3681954.3682016. Online publication date: 1-Jul-2024
    • (2024) Automatic Data Repair: Are We Ready to Deploy? Proceedings of the VLDB Endowment, 17:10, pp. 2617-2630. DOI: 10.14778/3675034.3675051. Online publication date: 1-Jun-2024
    • (2024) FusionQuery: On-demand Fusion Queries over Multi-source Heterogeneous Data. Proceedings of the VLDB Endowment, 17:6, pp. 1337-1349. DOI: 10.14778/3648160.3648174. Online publication date: 3-May-2024
    • (2024) GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models. Proceedings of the ACM on Management of Data, 2:6, pp. 1-29. DOI: 10.1145/3698811. Online publication date: 20-Dec-2024
    • (2024) IterClean: An Iterative Data Cleaning Framework with Large Language Models. Proceedings of the ACM Turing Award Celebration Conference - China 2024, pp. 100-105. DOI: 10.1145/3674399.3674436. Online publication date: 5-Jul-2024
    • (2024) Akane: Perplexity-Guided Time Series Data Cleaning. Proceedings of the ACM on Management of Data, 2:3, pp. 1-26. DOI: 10.1145/3654993. Online publication date: 30-May-2024
    • (2024) Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 3690-3701. DOI: 10.1145/3637528.3671873. Online publication date: 25-Aug-2024
    • (2024) Record Fusion via Inference and Data Augmentation. ACM / IMS Journal of Data Science, 1:1, pp. 1-23. DOI: 10.1145/3593579. Online publication date: 16-Jan-2024
