Abstract
Clustering is a powerful and versatile tool for knowledge discovery, able to provide a valuable information for data analysis in various domains. To perform this task based on streaming data is quite challenging: outdated knowledge needs to be disposed while the current knowledge is obtained from fresh data; since data are continuously flowing, strict efficiency constraints have to be met. This paper presents WCDS, an approach to this problem based on the WiSARD artificial neural network model. This model already had useful characteristics as inherent incremental learning capability and patent functioning speed. These were combined with novel features as an adaptive countermeasure to cluster imbalance, a mechanism to discard expired data, and offline clustering based on a pairwise similarity measure for WiSARD discriminators. In an insightful experimental evaluation, the proposed system had an excellent performance according to multiple quality standards. This supports its applicability for the analysis of data streams.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Notes
Iverson bracket: \([L] = 1\) if the logical expression L is true; otherwise, \([L] = 0.\)
\(\lfloor x \rceil \) represents the nearest integer of real number x.
http://cs.joensuu.fi/sipu/datasets/ (accessed 2017/1/11).
https://github.com/deric/clustering-benchmark (accessed 2017/1/11).
http://kdd.ics.uci.edu/databases/kddcup99/ (accessed 2017/02/07).
http://moa.cms.waikato.ac.nz/datasets/ (accessed 2017/02/07).
http://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+under+dynamic+gas+mixtures (accessed 2016/10/17).
References
Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for clustering evolving data streams. In: VLDB, pp. 81–92 (2003)
Aleksander, I., Thomas, W., Bowden, P.: WiSARD, a radical step forward in image recognition. Sens. Rev. 4(3), 120–124 (1984)
Aleksander, I., Gregorio, M.D., França, F.M.G., Lima, P.M.V., Morton, H.: A brief introduction to weightless neural systems. In: ESANN 2009, 17th European Symposium on Artificial Neural Networks, Bruges, Belgium, April 22–24, 2009, Proceedings (2009)
Barddal, J.P., Gomes, H.M., Enembreck, F.: A complex network-based anytime data stream clustering algorithm. In: Arik, S., Huang, T., Lai, W.K., Liu, Q. (eds) Neural Information Processing—22nd International Conference, ICONIP 2015, Istanbul, Turkey, November 9–12, 2015, Proceedings, Part I, Springer. Lecture Notes in Computer Science, vol. 9489, pp. 615–622 (2015)
Barddal, J.P., Gomes, H.M., Enembreck, F., Barthès, J.A.: Sncstream\({}^{+}\): extending a high quality true anytime data stream clustering algorithm. Inf Syst 62, 60–73 (2016)
Bifet, A., Gavaldà, R.: Learning from time-changing data with adaptive windowing. In: Proceedings of the 7th SIAM International Conference on Data Mining, April 26–28, 2007, Minneapolis, Minnesota, USA, SIAM, pp. 443–448 (2007)
Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: Moa: massive online analysis. J. Mach. Learn. Res. 11, 1601–1604 (2010)
Blackard, J.A., Dean, D.J.: Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Comput. Electron. Agric. 24(3), 131–151 (1999)
Campello, R.J.G.B., Moulavi, D., Zimek, A., Sander, J.: Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans. Knowl. Discov. Data 10(1), 5:1–5:51 (2015)
Cao, F., Ester, M., Qian, W., Zhou, A.: Density-based clustering over an evolving data stream with noise. In: SDM, SIAM, pp. 328–339 (2006)
Cardoso, D.O., Lima, P.M.V., Gregorio, M.D., Gama, J., França, F.M.G.: Clustering data streams with weightless neural networks. In: ESANN 2011, 19th European Symposium on Artificial Neural Networks, Bruges, Belgium, April 27–29, 2011, Proceedings (2011)
Cardoso, D.O., Gregorio, M.D., Lima, P.M.V., Gama, J., França, F.M.G.: A weightless neural network-based approach for stream data clustering. In: Intelligent Data Engineering and Automated Learning—IDEAL 2012—13th International Conference, Natal, Brazil, August 29–31, 2012, pp. 328–335. Proceedings, Springer (2012)
Cardoso, D.O., França, F.M.G., Gama, J.: A bounded neural network for open set recognition. In: 2015 International Joint Conference on Neural Networks , IJCNN 2015, Killarney, Ireland, July 12–17, 2015, IEEE, pp. 1–7 (2015)
Cardoso, D.O., Carvalho, D.S., Alves, D.S.F., de Souza, D.F.P., Carneiro, H.C.C., Pedreira, C.E., Lima, P.M.V., França, F.M.G.: Financial credit analysis via a clustering weightless neural classifier. Neurocomputing 183, 70–78 (2016)
Cardoso, D.O., França, F.M.G., Gama, J.: Clustering data streams using a forgetful neural model. In: Ossowski, S. (ed.) Proceedings of the 31st Annual ACM Symposium on Applied Computing, Pisa, Italy, April 4–8, 2016, ACM, pp. 949–951 (2016)
Datar, M., Motwani, R.: Data Stream Management—Processing High-Speed Data Streams, Data-Centric Systems and Applications. In: Garofalakis, M.N., Gehrke, J., Rastogi, R. (eds.) The sliding-window computation model and results. Springer, Berlin, Heidelberg (2016)
Day, W.H., Edelsbrunner, H.: Efficient algorithms for agglomerative hierarchical clustering methods. J. Classif. 1(1), 7–24 (1984)
Fonollosa, J., Sheik, S., Huerta, R., Marco, S.: Reservoir computing compensates slow response of chemosensor arrays exposed to fast varying gas concentrations in continuous monitoring. Sens. Actuators B Chem. 215, 618–629 (2015)
Gama, J.: Knowledge Discovery from Data Streams. CRC data mining and knowledge discovery series. CRC Press, Chapman and Hall, Boca Raton (2010)
Grieco, B.P.A., Lima, P.M.V., Gregorio, M.D., França, F.M.G.: Producing pattern examples from “mental” images. Neurocomputing 73(7–9), 1057–1064 (2010)
Hassani M, Seidl T (2016) Using internal evaluation measures to validate the quality of diverse stream clustering algorithms. Vietnam J. Comput. Sci. 1–13. doi:10.1007/s40595-016-0086-9
Jin, C., Yu, J.X., Zhou, A., Cao, F.: Efficient clustering of uncertain data streams. Knowl. Inf. Syst. 40(3), 509–539 (2014)
Kolcz, A., Allinson, N.: Application of the cmac input encoding scheme in the n-tuple approximation network. IEE Proc. Comput. Digit. Tech. 141(3), 177–183 (1994)
Kranen, P., Assent, I., Baldauf, C., Seidl, T.: The clustree: indexing micro-clusters for anytime stream mining. Knowl. Inf. Syst. 29(2), 249–272 (2011)
Levandowsky, M., Winter, D.: Distance between sets. Nature 234(5323), 34–35 (1971)
Linneberg, C., Jorgensen, T.: Discretization methods for encoding of continuous input variables for boolean neural networks. In: International Joint Conference on Neural Networks, 1999. IJCNN ’99, vol. 2, pp. 1219–1224 (1999)
Lloyd, S.: Least squares quantization in pcm. IEEE Trans. Inf. Theory 28(2), 129–137 (1982). doi:10.1109/TIT.1982.1056489
McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5(4), 115–133 (1943)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., VanderPlas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Rosenberg, A., Hirschberg, J.: V-measure: a conditional entropy-based external cluster evaluation measure. In: Eisner, J. (ed.) EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28–30, 2007, Prague, Czech Republic, ACL, pp. 410–420 (2007)
Silva, J.A., Faria, E.R., Barros, R.C., Hruschka, E.R., Carvalho, A.C.P.L.F., Gama, J.: Data stream clustering: a survey. ACM Comput. Surv. 46(1), 13:1–13:31 (2013)
Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, 2837–2854 (2010)
Wagner, S., Wagner, D.: Comparing clusterings—an overview (2007). http://staff.ustc.edu.cn/~zwp/teach/MVA/cluster_validation.pdf. Accessed 7 Feb 2017
Wan, L., Ng, W.K., Dang, X.H., Yu, P.S., Zhang, K.: Density-based clustering of data streams at multiple resolutions. ACM Trans. Knowl. Discov. Data 3(3), 14:1–14:28 (2009)
Zahn, C.T.: Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. 20(1), 68–86 (1971)
Zhu, Y., Shasha, D.E.: Statstream: statistical monitoring of thousands of data streams in real time. In: VLDB 2002, Proceedings of 28th International Conference on Very Large Data Bases, August 20–23, 2002, Hong Kong, China, Morgan Kaufmann, pp. 358–369 (2002)
Zliobaite, I., Bifet, A., Pfahringer, B., Holmes, G.: Active learning with drifting streaming data. IEEE Trans. Neural Netw. Learn. Syst. 25(1), 27–39 (2014)
Acknowledgements
Douglas O. Cardoso thanks CAPES (process 99999.005992/2014-01) and CNPq for financial support. João Gama thanks the support of the European Commission through the project MAESTRA (Grant Number ICT-750 2013-612944). Felipe M. G. França thanks the support of FAPERJ, FINEP, and INOVAX.
Author information
Authors and Affiliations
Corresponding author
About this article
Cite this article
Cardoso, D.O., França, F.M.G. & Gama, J. WCDS: A Two-Phase Weightless Neural System for Data Stream Clustering. New Gener. Comput. 35, 391–416 (2017). https://doi.org/10.1007/s00354-017-0018-y
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00354-017-0018-y