Abstract
Existing supervised methods for error detection require clean labels to train their classification models, which is difficult to obtain in practical scenarios. While most error detection algorithms ignore the effect of noisy labels, in this paper we design effective techniques for error detection when both the data and the labels contain noise. Specifically, we present TabMentor, a novel deep-learning model for error detection on tabular data with noisy training labels. TabMentor introduces a deep prediction model, Tabclassifier, which attends to the most salient features at the decision step, enabling efficient learning. For feature extraction, it combines the outputs of existing error detection algorithms with raw features from the datasets. To reduce the negative effect of noisy training labels, TabMentor uses a second deep model, Teachernet, to supervise the training of Tabclassifier. During training, Teachernet and Tabclassifier jointly learn a curriculum from the data, allowing Tabclassifier to focus on samples with clean labels. Evaluation on five datasets shows that TabMentor outperforms the best baseline error detection system by 0.05 to 0.11 in F1 score.
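The abstract's central mechanism, a teacher network that weights training samples so the classifier emphasizes likely-clean labels, can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch illustration assuming a MentorNet-style self-paced objective; the class names, layer sizes, and the threshold `lam` are our assumptions for illustration, not the paper's actual architecture.

```python
# Minimal sketch (not the authors' code) of teacher-supervised curriculum
# learning on noisy labels, in the spirit of MentorNet/self-paced learning.
# All names, sizes, and the self-paced threshold `lam` are assumptions.
import torch
import torch.nn as nn

class TabClassifier(nn.Module):
    """Student: predicts whether a sample is erroneous from its features."""
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)  # one logit per sample

class TeacherNet(nn.Module):
    """Teacher: maps a sample's current loss to a weight in [0, 1]."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, per_sample_loss):
        return torch.sigmoid(self.net(per_sample_loss.unsqueeze(-1))).squeeze(-1)

# In TabMentor the features come from existing error detectors plus raw
# attributes; random data here just makes the sketch runnable.
n_features, lam = 10, 0.7
x = torch.randn(256, n_features)
y = torch.randint(0, 2, (256,)).float()  # noisy error/clean labels

student, teacher = TabClassifier(n_features), TeacherNet()
opt = torch.optim.Adam(list(student.parameters()) + list(teacher.parameters()),
                       lr=1e-3)
bce = nn.BCEWithLogitsLoss(reduction="none")

for step in range(200):
    losses = bce(student(x), y)          # per-sample losses
    weights = teacher(losses.detach())   # dynamic curriculum weights
    # Self-paced objective: sum_i w_i * (l_i - lam). The teacher learns to
    # drive w_i -> 0 for high-loss (likely mislabeled) samples and w_i -> 1
    # for low-loss ones; the student minimizes the weighted loss.
    loss = (weights * (losses - lam)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

This sketch compresses the paper's teacher-student interaction into a single regularized loss; the actual joint curriculum learned by Teachernet and Tabclassifier may differ.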
Acknowledgements
This work was supported by the National Key R&D Program of China (2021YFB3301500), the Guangdong Provincial Natural Science Foundation (2019A1515111047), the Shenzhen Colleges and Universities Continuous Support Grant (20200811104054002), the Guangdong "Pearl River Talent Recruitment Program" (Grant 2019ZT08X603), the 14th "115" Industrial Innovation Group (Project 4) of Anhui Province, NSFC Grants 62072311 and U2001212, Guangdong Project 2020B1515120028, and Shenzhen Project JCYJ20210324094402008.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhang, Y., Qin, J., Wang, Y., Ali, M.A., Ji, Y., Mao, R. (2023). TabMentor: Detect Errors on Tabular Data with Noisy Labels. In: Yang, X., et al. (eds.) Advanced Data Mining and Applications. ADMA 2023. Lecture Notes in Computer Science, vol. 14178. Springer, Cham. https://doi.org/10.1007/978-3-031-46671-7_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46670-0
Online ISBN: 978-3-031-46671-7