DOI: 10.1145/1557019.1557060
research-article

Issues in evaluation of stream learning algorithms

Published: 28 June 2009

Abstract

Learning from data streams is a research area of increasing importance, and several stream learning algorithms have been developed. Most of them learn decision models that continuously evolve over time, run in resource-aware environments, and detect and react to changes in the environment generating the data. One important issue, not yet adequately addressed, is the design of experimental work to evaluate and compare decision models that evolve over time. There are no gold standards for assessing performance in non-stationary environments. This paper proposes a general framework for assessing predictive stream learning algorithms. We defend the use of predictive sequential methods for error estimation: the prequential error. The prequential error allows us to monitor the evolution of the performance of models that evolve over time. Nevertheless, it is known to be a pessimistic estimator in comparison to holdout estimates. To obtain more reliable estimators we need some forgetting mechanism. Two viable alternatives are sliding windows and fading factors. We observe that the prequential error converges to a holdout estimator when estimated over a sliding window or using fading factors. We present illustrative examples of the use of prequential error estimators, using fading factors, for the tasks of: i) assessing the performance of a learning algorithm; ii) comparing learning algorithms; iii) hypothesis testing using the McNemar test; and iv) change detection using the Page-Hinkley test. In these tasks, the prequential error estimated using fading factors provides reliable estimates. In comparison to sliding windows, fading factors are faster and memoryless, a requirement for streaming applications. This paper is a contribution to the discussion of good practices in performance assessment when learning dynamic models that evolve over time.
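As a rough illustration of the two forgetting mechanisms named in the abstract, the sketch below computes a prequential error over a stream of per-example losses. The abstract does not give the formulas, so the fading-factor recurrence used here (faded loss sum divided by faded example count) is an assumption based on the usual formulation, and the class names, `alpha`, and `size` are hypothetical.

```python
from collections import deque


class FadingPrequentialError:
    """Prequential error with a fading factor (assumed recurrence).

    S_i = loss_i + alpha * S_{i-1}
    N_i = 1      + alpha * N_{i-1}
    E_i = S_i / N_i
    With alpha = 1.0 this reduces to the plain prequential error.
    """

    def __init__(self, alpha=0.995):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.loss_sum = 0.0  # faded sum of losses
        self.weight = 0.0    # faded count of examples

    def update(self, loss):
        # Only two numbers are stored, regardless of stream length.
        self.loss_sum = loss + self.alpha * self.loss_sum
        self.weight = 1.0 + self.alpha * self.weight
        return self.loss_sum / self.weight


class WindowPrequentialError:
    """Prequential error over the most recent `size` losses."""

    def __init__(self, size=1000):
        self.window = deque(maxlen=size)
        self.loss_sum = 0.0

    def update(self, loss):
        # Subtract the loss that is about to be evicted from the window.
        if len(self.window) == self.window.maxlen:
            self.loss_sum -= self.window[0]
        self.window.append(loss)
        self.loss_sum += loss
        return self.loss_sum / len(self.window)


def final_error(estimator, losses):
    """Feed every loss to the estimator and return the last estimate."""
    error = None
    for loss in losses:
        error = estimator.update(loss)
    return error


# Toy 0-1 loss stream: the model is always wrong for 500 examples,
# then always right for the next 500 (for example, after it adapts).
toy_losses = [1.0] * 500 + [0.0] * 500
```

On `toy_losses`, the plain prequential error (`alpha=1.0`) still averages in the early mistakes, while both forgetting estimators track recent performance. The window version has to store `size` losses; the fading version stores two numbers, which is the memory argument the abstract makes.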

    Published In

    KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
    June 2009
    1426 pages
    ISBN:9781605584959
    DOI:10.1145/1557019

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data streams
    2. evaluation design

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
