Abstract
The Internet of Things is a paradigm in which numerous physical devices are connected to the internet to collect, generate, and distribute data for processing. As the number of devices grows, data storage and processing become more challenging. One solution is to reduce the amount of stored data in such a way that processing accuracy does not suffer significantly. Depending on the type of data, the reduction can be lossy or lossless. This article presents a novel lossy algorithm for reducing the amount of data stored in the system. The reduction process aims to shrink the data volume while maintaining classification accuracy at a properly adjusted reduction ratio. A nonlinear cluster distance measure is used to create subgroups, so that samples can be assigned to the correct clusters even when the cluster shape is nonlinear. Each sample is assumed to arrive one at a time during the reduction, which makes the algorithm suitable for streaming data. The user can adjust the degree of reduction, and the algorithm strives to minimize the resulting classification error. The algorithm does not depend on any particular classification technique. Subclusters are formed and readjusted after each sample, and representative points are calculated to summarize the data in each subcluster. The resulting data summary can be saved and used for future processing. The effectiveness of the proposed method is measured by the accuracy difference between the regular and reduced datasets, evaluated with several classifiers. The results show that the nonlinear information-theoretic cluster distance measure achieves higher reduction rates at higher accuracy values than existing studies, while the reduction rate remains adjustable, a feature lacking in current methods.
The characteristics are discussed, and the results are compared to previously published algorithms.
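To make the streaming workflow described above concrete, the following is a minimal, illustrative sketch of a one-pass reducer: samples arrive one at a time, are absorbed into a same-class subcluster or open a new one, and the subcluster representatives form the stored summary. This is not the paper's EbSDR algorithm; in particular, the plain Euclidean distance and the `radius` threshold here are simplifying stand-ins for the paper's nonlinear information-theoretic cluster distance and its adjustable reduction control.

```python
import numpy as np

class StreamingReducer:
    """Illustrative streaming summarizer (a sketch, not the paper's EbSDR).

    Each incoming (sample, label) pair is merged into the nearest subcluster
    of the same class if it lies within `radius`; otherwise a new subcluster
    is opened. Subcluster centroids serve as the representative points, so
    the summary size equals the number of subclusters. Here `radius` plays
    the role of the adjustable reduction knob: a larger radius yields fewer,
    coarser representatives, i.e. a higher reduction rate.
    """

    def __init__(self, radius=1.0):
        self.radius = radius
        self.centroids = []  # running mean of each subcluster
        self.counts = []     # samples absorbed by each subcluster
        self.labels = []     # class label of each subcluster

    def add(self, x, y):
        x = np.asarray(x, dtype=float)
        best, best_d = None, np.inf
        for i, (c, lab) in enumerate(zip(self.centroids, self.labels)):
            if lab != y:
                continue
            # Euclidean stand-in for the paper's nonlinear cluster distance
            d = np.linalg.norm(x - c)
            if d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= self.radius:
            # incremental centroid update: readjustment after each sample
            self.counts[best] += 1
            self.centroids[best] += (x - self.centroids[best]) / self.counts[best]
        else:
            self.centroids.append(x.copy())
            self.counts.append(1)
            self.labels.append(y)

    def summary(self):
        """Return the representative points and their labels."""
        return np.array(self.centroids), np.array(self.labels)
```

The returned summary is classifier-agnostic, matching the abstract's claim: any classifier (KNN, SVM, ...) can be trained on the representative points in place of the full dataset.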
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Abbreviations
- KNN: K-Nearest Neighbor
- SVM: Support Vector Machine
- CEF: Clustering Evaluation Function
- EbSDR: Entropy-based Streaming Data Reduction
- DROP: Decremental Reduction Optimization Procedure
- SNN: Selective Nearest Neighbor Rule
- CNN: Condensed Nearest Neighbor Rule
- RNN: Reduced Nearest Neighbor Rule
- ENN: Edited Nearest Neighbor
- DEL: Decremental Encoding Length
- IBx: Instance-Based Learning
- HMNEI: Hit Miss Network Editing Iterative
- ATISA: Adaptive Threshold-based Instance Selection Algorithm
- CNNIR: Constraint Nearest Neighbor-Based Instance Reduction
- BNNT: Binary Nearest Neighbor Tree
- LDIS: Local Density-based Instance Selection
- CDIS: Central Density-based Instance Selection
- LSSm: Local Set-Based Smoother
- LSBo: Local Set Border Selector
- ICF: Iterative Case Filtering
- SIR: Spectral Instance Reduction
Acknowledgments
I would like to thank my loving wife, Ozlemgul Pekdemir Gokcay, for her unlimited support and patience throughout all aspects of the study.
Ethics declarations
Funding
The authors did not receive support from any organization for the submitted work.
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A
Appendix B
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gokcay, E. Entropy based streaming big-data reduction with adjustable compression ratio. Multimed Tools Appl 83, 2647–2681 (2024). https://doi.org/10.1007/s11042-023-15897-7