Machine learning for streaming data: state of the art, challenges, and opportunities

Published: 26 November 2019

Abstract

Incremental learning, online learning, and data stream learning are terms commonly associated with learning algorithms that update their models given a continuous influx of data, without performing multiple passes over the data. Several works have been devoted to this area, either directly or indirectly through the Velocity and Volume characteristics of big data processing. Given current industry needs, many challenges must be addressed before existing methods can be efficiently applied to real-world problems. In this work, we focus on elucidating the connections among the current state of the art in related fields and on clarifying open challenges in both academia and industry. We pay special attention to topics that were not thoroughly investigated in past position and survey papers. This work aims to provoke discussion and highlight current research opportunities, pointing out the relationships among different subareas and suggesting courses of action where possible.
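
The single-pass, incremental updating described in the abstract can be made concrete with a prequential (test-then-train) loop, in which each arriving example is first used to evaluate the current model and only then to update it, so the data is processed exactly once. The sketch below is a minimal illustration only, not code from the paper: the synthetic_stream generator and the OnlineGaussianNB learner are hypothetical stand-ins written for this example.

```python
# Minimal prequential (test-then-train) loop: each example is evaluated with the
# current model before being used to update it, so the stream is seen exactly once.
# Both the stream generator and the learner are illustrative stand-ins.
import random
import math

def synthetic_stream(n, drift_at):
    """Binary-labelled 2-D stream whose decision boundary shifts at `drift_at`,
    mimicking concept drift (hypothetical generator, for illustration only)."""
    for t in range(n):
        x = [random.uniform(0.0, 1.0), random.uniform(0.0, 1.0)]
        threshold = 1.0 if t < drift_at else 1.4   # the concept changes here
        yield x, int(x[0] + x[1] > threshold)

class OnlineGaussianNB:
    """Gaussian naive Bayes updated one example at a time with Welford-style
    running means/variances; no pass over past data is ever required."""
    def __init__(self, n_features, classes=(0, 1)):
        self.counts = {c: 0 for c in classes}
        self.mean = {c: [0.0] * n_features for c in classes}
        self.m2 = {c: [1e-3] * n_features for c in classes}  # variance accumulator

    def learn_one(self, x, y):
        self.counts[y] += 1
        n = self.counts[y]
        for i, v in enumerate(x):
            delta = v - self.mean[y][i]
            self.mean[y][i] += delta / n
            self.m2[y][i] += delta * (v - self.mean[y][i])

    def predict_one(self, x):
        total = sum(self.counts.values())
        best, best_lp = None, -math.inf
        for c, n in self.counts.items():
            if n == 0:
                continue
            lp = math.log(n / total)  # class prior
            for i, v in enumerate(x):
                var = self.m2[c][i] / n + 1e-9
                lp += -0.5 * math.log(2 * math.pi * var) - (v - self.mean[c][i]) ** 2 / (2 * var)
            if lp > best_lp:
                best, best_lp = c, lp
        return best if best is not None else 0

model, correct, seen = OnlineGaussianNB(n_features=2), 0, 0
for x, y in synthetic_stream(n=20000, drift_at=10000):
    if seen > 0:                          # test ...
        correct += int(model.predict_one(x) == y)
    model.learn_one(x, y)                 # ... then train, single pass only
    seen += 1
print(f"prequential accuracy: {correct / (seen - 1):.3f}")
```

Because evaluation happens before each update, the running score is the standard prequential accuracy, and no past examples ever need to be stored.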

    Published In

ACM SIGKDD Explorations Newsletter, Volume 21, Issue 2
December 2019, 100 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/3373464

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media