research-article

BUNNI: Learning Repair Actions in Rule-driven Data Cleaning

Published: 24 June 2024

Abstract

In this work, we address the challenging and open problem of involving non-expert users in data repairing as first-class citizens. Although a large number of proposals have been devoted to cleaning data from the point of view of expert users (IT staff and data scientists), there is a lack of studies from the perspective of non-expert users. Given a set of available data quality rules, we exploit machine learning techniques to guide the user in identifying the dirty values for each violation and repairing them. We show that, with low user effort, it is possible to identify the values in tuples that can be trusted and the ones that are most likely errors. We show experimentally that this machine learning approach leads to a unique, high-quality clean solution in scenarios where other approaches fail.
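
To make the notion of a rule violation concrete, the following is a minimal sketch and not the paper's method. It assumes one simple kind of data quality rule, a functional dependency `zip -> city`, over a toy table; the function name `fd_violations`, the column names, and the data are all hypothetical. It shows why a rule alone is not enough: the rule flags a group of conflicting cells but does not say which value is the error, which is the decision the abstract proposes to support with machine learning and user feedback.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Find violations of the functional dependency lhs -> rhs.

    Returns one entry per lhs value whose rows disagree on rhs.
    Illustrative helper only; names and output shape are assumptions.
    """
    # Group row indices by their value on the left-hand side attribute.
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[lhs]].append(i)

    violations = []
    for key, idxs in groups.items():
        # Collect the distinct right-hand side values within the group.
        values = {rows[i][rhs] for i in idxs}
        if len(values) > 1:
            # The rule is violated, but it does not tell us which value is dirty.
            violations.append(
                {"key": key, "rows": idxs, "candidates": sorted(values)}
            )
    return violations


# Toy table (hypothetical data): rows 0 and 1 share a zip but disagree on city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New Yrok"},
    {"zip": "60601", "city": "Chicago"},
]

result = fd_violations(rows, "zip", "city")
print(result)
```

Each reported violation lists the involved rows and the candidate values; choosing the trusted value among the candidates is left open here on purpose.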


Published In

Journal of Data and Information Quality, Volume 16, Issue 2
June 2024
135 pages
EISSN: 1936-1963
DOI: 10.1145/3613602

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 June 2024
Online AM: 25 May 2024
Accepted: 15 May 2024
Revised: 06 February 2024
Received: 04 May 2023
Published in JDIQ Volume 16, Issue 2

Author Tags

  1. Data cleaning
  2. repair discovery
  3. human in the loop
  4. machine learning

Qualifiers

  • Research-article
