ADESIT: visualize the limits of your data in a machine learning process

Published: 01 July 2021

Abstract

Thanks to the many machine learning tools available today, it is easier than ever to derive a model from a dataset in a supervised learning setting. However, when the model performs worse than expected, the underlying question of whether a satisfactory model exists at all is often overlooked, and one might simply be tempted to try different parameters or choose another model architecture. This is why the quality of the learning examples should be considered as early as possible: it acts as a go/no-go signal for the potentially costly learning process that follows. With ADESIT, we provide a way to evaluate, through statistics and visual exploration, how well a dataset can support a given supervised learning problem. Notably, we build on recent studies that propose using functional dependencies, and specifically counterexample analysis, to provide dataset cleanliness statistics as well as a theoretical upper bound on the prediction accuracy that is directly linked to the problem settings (measurement uncertainty, expected generalization...). In brief, ADESIT is intended to be part of an iterative data refinement process, right after data selection and right before the machine learning process itself. With further analysis for a given problem, the user can characterize, clean and export dynamically selected subsets, making it easier to understand which regions of the data could be refined and where data precision must be improved, for example by using new or more precise sensors.
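To make the idea of counterexample analysis concrete, the following is a minimal sketch and not ADESIT's implementation. It assumes the simplest possible setting: a crisp functional dependency features -> target, where two rows form a counterexample when they agree exactly on every feature but disagree on the target. Under that assumption, any deterministic model that sees only the features can be right on at most the majority label of each group of identical feature values, which yields an upper bound on accuracy over the dataset. The function name, column names and sample rows are hypothetical, and the relaxed comparisons mentioned in the abstract (measurement uncertainty, expected generalization) are deliberately left out.

```python
from collections import Counter, defaultdict


def fd_accuracy_upper_bound(rows, features, target):
    """Analyze the crisp dependency features -> target on a list of dict rows.

    Returns (counterexample_pairs, removal_ratio, accuracy_upper_bound).
    Assumption: feature values are compared with exact equality.
    """
    if not rows:
        raise ValueError("rows must not be empty")

    # Group rows by their feature values and count target labels per group.
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[f] for f in features)
        groups[key][row[target]] += 1

    total = len(rows)
    kept = 0    # rows that remain if each group keeps only its majority label
    pairs = 0   # pairs of rows with equal features but different targets
    for counts in groups.values():
        size = sum(counts.values())
        kept += max(counts.values())
        same_label_pairs = sum(c * (c - 1) // 2 for c in counts.values())
        pairs += size * (size - 1) // 2 - same_label_pairs

    removal_ratio = (total - kept) / total  # share of rows to drop so the dependency holds
    return pairs, removal_ratio, kept / total


# Hypothetical sensor readings, for illustration only.
rows = [
    {"temp": 20, "pressure": 1, "label": "ok"},
    {"temp": 20, "pressure": 1, "label": "ok"},
    {"temp": 20, "pressure": 1, "label": "fail"},
    {"temp": 25, "pressure": 2, "label": "fail"},
    {"temp": 30, "pressure": 1, "label": "ok"},
]
pairs, removal_ratio, bound = fd_accuracy_upper_bound(rows, ["temp", "pressure"], "label")
print(pairs, removal_ratio, bound)
```

On these five rows, the three readings sharing (20, 1) carry two different labels, giving 2 counterexample pairs. Removing one row would satisfy the dependency, so no deterministic model using only these two features can exceed 4/5 = 0.8 accuracy on this data. Once equality is replaced by a similarity threshold, groups are no longer disjoint and this simple majority count no longer applies.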


Published In

Proceedings of the VLDB Endowment, Volume 14, Issue 12
July 2021
587 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
