Anatomy of Metadata for Data Curation

Published: 13 June 2020

Abstract

Real-world datasets often suffer from various data quality problems. Several data cleaning solutions have been proposed so far. However, data cleaning remains a manual and iterative task that requires domain and technical expertise. Exploiting metadata promises to improve the tedious process of data preparation, because data errors are detectable through metadata. This article investigates the intrinsic connection between metadata and data errors. We establish a mapping that reflects the connection between data quality issues and extractable metadata using qualitative and quantitative techniques. Additionally, we present a taxonomy based on a closed grammar that covers all existing metadata and allows the composition of novel types of metadata. We provide a case study to show the practical application of the grammar for generating new metadata for data quality assessment.

    Published In
    Journal of Data and Information Quality, Volume 12, Issue 3
    On the Horizon and Regular Articles
    September 2020
    104 pages
    ISSN: 1936-1955
    EISSN: 1936-1963
    DOI: 10.1145/3404101
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 June 2020
    Online AM: 07 May 2020
    Accepted: 01 November 2019
    Revised: 01 August 2019
    Received: 01 March 2019
    Published in JDIQ Volume 12, Issue 3

    Author Tags

    1. Data curation
    2. data profiling
    3. data quality
    4. error detection
    5. metadata

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2024) Graph-Based Digital File Curation for Engineering Reuse: Methodology and Case Study. AIAA SCITECH 2024 Forum. DOI: 10.2514/6.2024-1133. Online publication date: 4-Jan-2024
    • (2024) Use of Context in Data Quality Management: A Systematic Literature Review. Journal of Data and Information Quality 16:3 (1-41). DOI: 10.1145/3672082. Online publication date: 17-Jun-2024
    • (2024) On the Impact of Showing Evidence from Peers in Crowdsourced Truthfulness Assessments. ACM Transactions on Information Systems 42:3 (1-26). DOI: 10.1145/3637872. Online publication date: 22-Jan-2024
    • (2024) Data Quality Assessment through a Preference Model. Journal of Data and Information Quality 16:1 (1-21). DOI: 10.1145/3632407. Online publication date: 6-Mar-2024
    • (2023) Multi-perspective approach for curating and exploring the history of climate change in Latin America within digital newspapers. Computer Science and Information Systems 20:3 (1179-1205). DOI: 10.2298/CSIS220110008V. Online publication date: 2023
    • (2023) A concept for providing and utilizing metadata in data analytics applications. at - Automatisierungstechnik 71:1 (44-55). DOI: 10.1515/auto-2022-0107. Online publication date: 13-Jan-2023
    • (2023) On the role of human and machine metadata in relevance judgment tasks. Information Processing and Management: an International Journal 60:2. DOI: 10.1016/j.ipm.2022.103177. Online publication date: 1-Mar-2023
    • (2022) Enabling data-centric AI through data quality management and data literacy. it - Information Technology 64:1-2 (67-70). DOI: 10.1515/itit-2021-0048. Online publication date: 18-Feb-2022
    • (2022) Does Evidence from Peers Help Crowd Workers in Assessing Truthfulness? Companion Proceedings of the Web Conference 2022 (302-306). DOI: 10.1145/3487553.3524236. Online publication date: 25-Apr-2022
    • (2022) Data Profiling. Encyclopedia of Big Data Technologies (1-6). DOI: 10.1007/978-3-319-63962-8_8-2. Online publication date: 17-Feb-2022
