Data Quality: From Theory to Practice

Published: 03 December 2015

Abstract

Data quantity and data quality, like two sides of a coin, are equally important to data management. This paper provides an overview of recent advances in the study of data quality, from theory to practice. We also address challenges introduced by big data to data quality management.

Published In

ACM SIGMOD Record, Volume 44, Issue 3
September 2015
57 pages
ISSN: 0163-5808
DOI: 10.1145/2854006

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Column
