TabMentor: Detect Errors on Tabular Data with Noisy Labels

  • Conference paper
Advanced Data Mining and Applications (ADMA 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14178)

Abstract

Existing supervised methods for error detection require clean labels to train their classification models, which is difficult to achieve in practice. While most error detection algorithms ignore the effect of noisy labels, in this paper we design techniques for error detection when both the data and the labels contain noise. We present TabMentor, a novel deep-learning model for error detection on tabular data with noisy training labels. TabMentor introduces a deep prediction model, Tabclassifier, that suggests the most salient features for the decision step, enabling efficient learning. For feature extraction, it uses existing error detection algorithms along with some raw features from the datasets. To reduce the negative effect of noisy training labels, TabMentor uses a second deep model, Teachernet, to supervise the training of Tabclassifier. During training, Teachernet and Tabclassifier dynamically learn a curriculum from the data, allowing Tabclassifier to focus on samples with clean labels. An evaluation on five datasets shows that TabMentor outperforms the best baseline error detection system by 0.05 to 0.11 in F1 score.
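The idea of letting a classifier focus on samples whose labels look clean can be illustrated with a small sketch. The abstract does not say how Teachernet computes its supervision signal, so the code below is not the TabMentor method: it swaps in a fixed loss threshold, in the style of self-paced learning, as a hypothetical stand-in for a learned teacher. The function names, the threshold value, and the toy batch are all assumptions made for illustration.

```python
import math


def bce(p, y):
    """Binary cross-entropy of one predicted probability p against label y."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def self_paced_weights(losses, threshold):
    """Hypothetical stand-in for a teacher: keep low-loss samples, drop the rest.

    A learned teacher would predict these weights; here a fixed threshold
    decides, which is the simplest self-paced curriculum.
    """
    return [1.0 if loss < threshold else 0.0 for loss in losses]


def weighted_mean_loss(losses, weights):
    """Average loss over the samples the curriculum kept."""
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total


# Toy batch: predicted probability that a cell is erroneous, plus noisy labels.
probs = [0.9, 0.2, 0.1, 0.8]
labels = [1, 0, 1, 1]  # the third label strongly disagrees with the prediction

losses = [bce(p, y) for p, y in zip(probs, labels)]
weights = self_paced_weights(losses, threshold=1.0)  # threshold is an assumption
student_loss = weighted_mean_loss(losses, weights)
```

In this toy batch the third sample has a much larger loss than the others, so it receives weight zero and no longer dominates the loss the classifier would be trained on. A threshold rule like this can also discard hard but correctly labeled samples, which is one motivation for learning the weighting from data instead.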

Acknowledgements

National Key R&D program of China 2021YFB3301500, Guangdong Provincial Natural Science Foundation 2019A1515111047, Shenzhen Colleges and Universities Continuous Support Grant 20200811104054002, Guangdong “Pearl River Talent Recruitment Program” under Grant 2019ZT08X603, the 14th “115” Industrial Innovation Group (Project 4) of Anhui Province, NSFC 62072311, U2001212, Guangdong Project 2020B1515120028, and Shenzhen Project JCYJ20210324094402008.

Author information

Corresponding author

Correspondence to Jianbin Qin.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, Y., Qin, J., Wang, Y., Ali, M.A., Ji, Y., Mao, R. (2023). TabMentor: Detect Errors on Tabular Data with Noisy Labels. In: Yang, X., et al. Advanced Data Mining and Applications. ADMA 2023. Lecture Notes in Computer Science, vol 14178. Springer, Cham. https://doi.org/10.1007/978-3-031-46671-7_12

  • DOI: https://doi.org/10.1007/978-3-031-46671-7_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-46670-0

  • Online ISBN: 978-3-031-46671-7

  • eBook Packages: Computer Science, Computer Science (R0)
