
WCDS: A Two-Phase Weightless Neural System for Data Stream Clustering

  • Special Feature
  • Published in New Generation Computing

Abstract

Clustering is a powerful and versatile tool for knowledge discovery, able to provide valuable information for data analysis in various domains. Performing this task on streaming data is quite challenging: outdated knowledge has to be discarded while current knowledge is extracted from fresh data, and since data flow continuously, strict efficiency constraints must be met. This paper presents WCDS, an approach to this problem based on the WiSARD artificial neural network model. This model already offers useful characteristics, such as inherent incremental learning capability and notable processing speed. These were combined with novel features: an adaptive countermeasure to cluster imbalance, a mechanism to discard expired data, and offline clustering based on a pairwise similarity measure for WiSARD discriminators. In a detailed experimental evaluation, the proposed system performed excellently according to multiple quality criteria, which supports its applicability to the analysis of data streams.
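
To make the underlying model concrete, the sketch below outlines a classical WiSARD discriminator in Python, together with one possible set-based similarity between two discriminators. The class, its parameters, and the Jaccard-style similarity are illustrative assumptions only: they are not taken from the paper and do not reproduce the WCDS algorithm itself.

import random

class Discriminator:
    """A minimal classical WiSARD discriminator (illustrative sketch only).

    A binary input of input_size bits is split, via a fixed pseudo-random
    mapping, into tuples of tuple_size bits; each tuple addresses one RAM
    node. Training writes the addressed positions; the response to an input
    is the number of RAM nodes whose addressed position has been written.
    """

    def __init__(self, input_size, tuple_size, seed=0):
        assert input_size % tuple_size == 0
        order = list(range(input_size))
        random.Random(seed).shuffle(order)          # fixed pseudo-random input mapping
        self.tuples = [order[i:i + tuple_size]
                       for i in range(0, input_size, tuple_size)]
        self.rams = [set() for _ in self.tuples]    # sparse RAMs: sets of written addresses

    def _addresses(self, bits):
        # Yield each RAM node with the address formed by its assigned input bits.
        for ram, positions in zip(self.rams, self.tuples):
            yield ram, tuple(bits[p] for p in positions)

    def train(self, bits):
        for ram, address in self._addresses(bits):
            ram.add(address)                        # write a 1 at the addressed position

    def response(self, bits):
        return sum(address in ram for ram, address in self._addresses(bits))


def discriminator_similarity(d1, d2):
    """Jaccard similarity over written RAM contents (an assumed choice,
    not necessarily the pairwise measure actually used by WCDS)."""
    inter = sum(len(a & b) for a, b in zip(d1.rams, d2.rams))
    union = sum(len(a | b) for a, b in zip(d1.rams, d2.rams))
    return inter / union if union else 1.0

Such a set-based comparison of the RAM contents of two discriminators is one plausible way to realise a pairwise similarity of the kind mentioned above, e.g. as input to an offline agglomerative step.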


Notes

  1. Iverson bracket: \([L] = 1\) if the logical expression L is true; otherwise, \([L] = 0.\)

  2. \(\lfloor x \rceil \) denotes the integer nearest to the real number x (a brief worked example of this and the previous note follows this list).

  3. http://cs.joensuu.fi/sipu/datasets/ (accessed 2017/1/11).

  4. https://github.com/deric/clustering-benchmark (accessed 2017/1/11).

  5. http://kdd.ics.uci.edu/databases/kddcup99/ (accessed 2017/02/07).

  6. http://moa.cms.waikato.ac.nz/datasets/ (accessed 2017/02/07).

  7. http://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+under+dynamic+gas+mixtures (accessed 2016/10/17).
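
As a quick worked illustration of the notation defined in notes 1 and 2 (the values are chosen arbitrarily here): \([2 < 3] = 1\) and \([2 > 3] = 0\) for the Iverson bracket; likewise, \(\lfloor 2.3 \rceil = 2\) and \(\lfloor 2.7 \rceil = 3\) for nearest-integer rounding.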


Acknowledgements

Douglas O. Cardoso thanks CAPES (process 99999.005992/2014-01) and CNPq for financial support. João Gama thanks the support of the European Commission through the project MAESTRA (Grant Number ICT-2013-612944). Felipe M. G. França thanks the support of FAPERJ, FINEP, and INOVAX.

Author information

Corresponding author

Correspondence to Douglas O. Cardoso.

About this article


Cite this article

Cardoso, D.O., França, F.M.G. & Gama, J. WCDS: A Two-Phase Weightless Neural System for Data Stream Clustering. New Gener. Comput. 35, 391–416 (2017). https://doi.org/10.1007/s00354-017-0018-y

