Open challenges for data stream mining research

Published: 25 September 2014

Abstract

Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which need to be analyzed online as they arrive. Streaming data can be considered one of the main sources of what is called big data. While predictive modeling for data streams and big data has received a lot of attention over the last decade, many research approaches are designed for well-behaved, controlled problem settings and overlook important challenges imposed by real-world applications. This article presents a discussion of eight open challenges for data stream mining. Our goal is to identify gaps between current research and meaningful applications, highlight open problems, and define new application-relevant research directions for data stream mining. The identified challenges cover the full cycle of knowledge discovery and involve problems such as protecting data privacy, dealing with legacy systems, handling incomplete and delayed information, analyzing complex data, and evaluating stream mining algorithms. The resulting analysis is illustrated by practical applications and provides general suggestions concerning lines of future research in data stream mining.

Published In

ACM SIGKDD Explorations Newsletter, Volume 16, Issue 1
Special issue on big data
June 2014
63 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/2674026

Publisher

Association for Computing Machinery
New York, NY, United States