KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing

Authors:

Mourad Ouzzani,

Yin Ye

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Pages 1247 - 1261

https://doi.org/10.1145/2723372.2749431

Published: 27 May 2015

Abstract

Classical approaches to cleaning data have relied on integrity constraints, statistics, or machine learning. These approaches are known to be limited in cleaning accuracy, which can usually be improved by consulting master data and involving experts to resolve ambiguity. The advent of knowledge bases (KBs), both general-purpose and within enterprises, and of crowdsourcing marketplaces provides yet more opportunities to achieve higher accuracy at a larger scale. We propose KATARA, a knowledge base and crowd powered data cleaning system that, given a table, a KB, and a crowd, interprets table semantics to align it with the KB, identifies correct and incorrect data, and generates top-k possible repairs for incorrect data. Experiments show that KATARA can be applied to various datasets and KBs, and can efficiently annotate data and suggest possible repairs.
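The three steps named in the abstract (aligning a table with a KB, marking data as correct or incorrect, and proposing top-k repairs) can be illustrated with a toy sketch. This is not the authors' algorithm: the in-memory KB, the majority-vote alignment, the changed-cell repair cost, and every name below (KB_TYPES, KB_FACTS, infer_column_types, infer_relation, annotate, top_k_repairs) are assumptions made for illustration, and the crowd step is omitted entirely.

```python
from collections import Counter

# Hypothetical toy KB (an assumption, not a real KB or its API):
# entity -> type, and (subject, relation) -> object.
KB_TYPES = {"Italy": "country", "Spain": "country",
            "Rome": "city", "Madrid": "city", "Milan": "city"}
KB_FACTS = {("Italy", "hasCapital"): "Rome",
            ("Spain", "hasCapital"): "Madrid"}

# Hypothetical input table; the last row contains an error.
TABLE = [("Italy", "Rome"), ("Spain", "Madrid"), ("Italy", "Milan")]


def infer_column_types(rows):
    """Align each column with the KB type most of its values carry."""
    types = []
    for col in range(len(rows[0])):
        votes = Counter(KB_TYPES[r[col]] for r in rows if r[col] in KB_TYPES)
        types.append(votes.most_common(1)[0][0] if votes else None)
    return types


def infer_relation(rows, subj_col, obj_col):
    """Pick the KB relation supported by the largest number of rows."""
    votes = Counter()
    for r in rows:
        for (subj, rel), obj in KB_FACTS.items():
            if subj == r[subj_col] and obj == r[obj_col]:
                votes[rel] += 1
    return votes.most_common(1)[0][0] if votes else None


def annotate(rows, subj_col, obj_col, relation):
    """Mark a row 'validated' if the KB confirms it, else 'unvalidated'."""
    return ["validated"
            if KB_FACTS.get((r[subj_col], relation)) == r[obj_col]
            else "unvalidated"
            for r in rows]


def top_k_repairs(row, subj_col, obj_col, relation, k):
    """Rank KB facts for the relation by how many cells they would change."""
    candidates = []
    for (subj, rel), obj in KB_FACTS.items():
        if rel != relation:
            continue
        cost = (subj != row[subj_col]) + (obj != row[obj_col])
        candidates.append((cost, (subj, obj)))
    candidates.sort()  # cheapest repair first
    return [pair for _, pair in candidates[:k]]


if __name__ == "__main__":
    relation = infer_relation(TABLE, 0, 1)
    print(infer_column_types(TABLE), relation)
    print(annotate(TABLE, 0, 1, relation))
    print(top_k_repairs(TABLE[2], 0, 1, relation, k=2))
```

In this sketch the row ("Italy", "Milan") is left unvalidated, and the cheapest suggested repair changes one cell to give ("Italy", "Rome"). In a system of the kind the abstract describes, unvalidated rows would presumably also involve the crowd, which this example does not model.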

Index Terms

KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing
1. Information systems
  1. Information systems applications

Information & Contributors

Information

Published In

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

May 2015

2110 pages

ISBN: 9781450327589

DOI: 10.1145/2723372

General Chair: Timos Sellis (RMIT University, Australia)

Program Chairs: Susan B. Davidson (University of Pennsylvania, USA), Zack Ives (University of Pennsylvania, USA)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS'15

Sponsor:

SIGMOD

SIGMOD/PODS'15: International Conference on Management of Data

May 31 - June 4, 2015

Melbourne, Victoria, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 224
Total Downloads: 1,732
Downloads (last 12 months): 143
Downloads (last 6 weeks): 11

Reflects downloads up to 09 Jan 2025

Citations

Cited By

Bian S, Ouyang X, Fan Z, Koutris P, Salakhutdinov R, Kolter Z, Heller K, Weller A, Oliver N, Scarlett J, Berkenkamp F (2024). Naive Bayes classifiers over missing data. Proceedings of the 41st International Conference on Machine Learning, pp. 3913-3934. DOI: 10.5555/3692070.3692227. Online publication date: 21-Jul-2024.
https://dl.acm.org/doi/10.5555/3692070.3692227
Avcı E (2024). A No-Code Automated Machine Learning Platform for the Energy Sector. Gazi University Journal of Science Part A: Engineering and Innovation, 11:2, pp. 289-303. DOI: 10.54287/gujsa.1473782. Online publication date: 4-Jun-2024.
https://doi.org/10.54287/gujsa.1473782
Mohr-Daurat H, Theodorakis G, Pirk H (2024). Hardware-Efficient Data Imputation through DBMS Extensibility. Proceedings of the VLDB Endowment, 17:11, pp. 3497-3510. DOI: 10.14778/3681954.3682016. Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.14778/3681954.3682016
Ni W, Miao X, Zhao X, Wu Y, Liang S, Yin J (2024). Automatic Data Repair: Are We Ready to Deploy? Proceedings of the VLDB Endowment, 17:10, pp. 2617-2630. DOI: 10.14778/3675034.3675051. Online publication date: 1-Jun-2024.
https://dl.acm.org/doi/10.14778/3675034.3675051
Zhu J, Mao Y, Chen L, Ge C, Wei Z, Gao Y (2024). FusionQuery: On-demand Fusion Queries over Multi-source Heterogeneous Data. Proceedings of the VLDB Endowment, 17:6, pp. 1337-1349. DOI: 10.14778/3648160.3648174. Online publication date: 3-May-2024.
https://doi.org/10.14778/3648160.3648174
Yan M, Wang Y, Wang Y, Miao X, Li J (2024). GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models. Proceedings of the ACM on Management of Data, 2:6, pp. 1-29. DOI: 10.1145/3698811. Online publication date: 20-Dec-2024.
https://dl.acm.org/doi/10.1145/3698811
Ni W, Zhang K, Miao X, Zhao X, Wu Y, Yin J (2024). IterClean: An Iterative Data Cleaning Framework with Large Language Models. Proceedings of the ACM Turing Award Celebration Conference - China 2024, pp. 100-105. DOI: 10.1145/3674399.3674436. Online publication date: 5-Jul-2024.
https://dl.acm.org/doi/10.1145/3674399.3674436
Han X, Xiong H, He Z, Wang P, Wang C, Wang X (2024). Akane: Perplexity-Guided Time Series Data Cleaning. Proceedings of the ACM on Management of Data, 2:3, pp. 1-26. DOI: 10.1145/3654993. Online publication date: 30-May-2024.
https://dl.acm.org/doi/10.1145/3654993
Yan M, Wang Y, Pang K, Xie M, Li J, Baeza-Yates R, Bonchi F (2024). Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 3690-3701. DOI: 10.1145/3637528.3671873. Online publication date: 25-Aug-2024.
https://dl.acm.org/doi/10.1145/3637528.3671873
Heidari A, Michalopoulos G, Ilyas I, Rekatsinas T (2024). Record Fusion via Inference and Data Augmentation. ACM / IMS Journal of Data Science, 1:1, pp. 1-23. DOI: 10.1145/3593579. Online publication date: 16-Jan-2024.
https://dl.acm.org/doi/10.1145/3593579
