Abstract
The Internet of Things is a paradigm in which numerous physical devices are connected to the internet to collect, generate, and distribute data for processing. As the number of devices grows, data storage and processing become more challenging. One solution is to reduce the amount of stored data in such a way that processing accuracy does not suffer significantly. Depending on the type of data, the reduction can be lossy or lossless. This article presents a novel lossy algorithm for reducing the amount of data stored in the system. The reduction process aims to shrink the data volume while maintaining classification accuracy at a properly adjusted reduction ratio. A nonlinear cluster distance measure is used to create subgroups, so that samples can be assigned to the correct clusters even when the cluster shape is nonlinear. Each sample is assumed to arrive one at a time during the reduction, which makes the algorithm suitable for streaming data. The user can adjust the degree of reduction, and the algorithm strives to minimize the resulting classification error. The algorithm does not depend on any particular classification technique. Subclusters are formed and readjusted after each sample, and representative points are calculated to summarize the data in each subcluster. The resulting data summary can be saved and used for future processing. The effectiveness of the proposed method is measured by the accuracy difference between the regular and reduced datasets, evaluated with several classifiers. The results show that the nonlinear information-theoretic cluster distance measure achieves higher reduction rates at higher accuracy values than existing studies, while the reduction rate remains adjustable, a feature lacking in current methods.
The characteristics are discussed, and the results are compared to previously published algorithms.
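To make the streaming workflow described above concrete, the following is a minimal, illustrative sketch of a one-pass reducer: samples arrive one at a time, are absorbed into a same-class subcluster or open a new one, and the subcluster representatives form the stored summary. This is not the paper's EbSDR algorithm; in particular, the plain Euclidean distance and the `radius` threshold here are simplifying stand-ins for the paper's nonlinear information-theoretic cluster distance and its adjustable reduction control.

```python
import numpy as np

class StreamingReducer:
    """Illustrative streaming summarizer (a sketch, not the paper's EbSDR).

    Each incoming (sample, label) pair is merged into the nearest subcluster
    of the same class if it lies within `radius`; otherwise a new subcluster
    is opened. Subcluster centroids serve as the representative points, so
    the summary size equals the number of subclusters. Here `radius` plays
    the role of the adjustable reduction knob: a larger radius yields fewer,
    coarser representatives, i.e. a higher reduction rate.
    """

    def __init__(self, radius=1.0):
        self.radius = radius
        self.centroids = []  # running mean of each subcluster
        self.counts = []     # samples absorbed by each subcluster
        self.labels = []     # class label of each subcluster

    def add(self, x, y):
        x = np.asarray(x, dtype=float)
        best, best_d = None, np.inf
        for i, (c, lab) in enumerate(zip(self.centroids, self.labels)):
            if lab != y:
                continue
            # Euclidean stand-in for the paper's nonlinear cluster distance
            d = np.linalg.norm(x - c)
            if d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= self.radius:
            # incremental centroid update: readjustment after each sample
            self.counts[best] += 1
            self.centroids[best] += (x - self.centroids[best]) / self.counts[best]
        else:
            self.centroids.append(x.copy())
            self.counts.append(1)
            self.labels.append(y)

    def summary(self):
        """Return the representative points and their labels."""
        return np.array(self.centroids), np.array(self.labels)
```

The returned summary is classifier-agnostic, matching the abstract's claim: any classifier (KNN, SVM, ...) can be trained on the representative points in place of the full dataset.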
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Abbreviations
- KNN: K-Nearest Neighbor
- SVM: Support Vector Machine
- CEF: Clustering Evaluation Function
- EbSDR: Entropy-based Streaming Data Reduction
- DROP: Decremental Reduction Optimization Procedure
- SNN: Selective Nearest Neighbor Rule
- CNN: Condensed Nearest Neighbor Rule
- RNN: Reduced Nearest Neighbor Rule
- ENN: Edited Nearest Neighbor
- DEL: Decremental Encoding Length
- IBx: Instance-Based Learning
- HMNEI: Hit Miss Network Editing Iterative
- ATISA: Adaptive Threshold-based Instance Selection Algorithm
- CNNIR: Constraint Nearest Neighbor-Based Instance Reduction
- BNNT: Binary Nearest Neighbor Tree
- LDIS: Local Density-based Instance Selection
- CDIS: Central Density-based Instance Selection
- LSSm: Local Set-Based Smoother
- LSBo: Local Set Border Selector
- ICF: Iterative Case Filtering
- SIR: Spectral Instance Reduction
Acknowledgments
I would like to thank my loving wife, Ozlemgul Pekdemir Gokcay, for her unlimited support and patience throughout all aspects of the study.
Ethics declarations
Funding
The authors did not receive support from any organization for the submitted work.
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A
Appendix B
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gokcay, E. Entropy based streaming big-data reduction with adjustable compression ratio. Multimed Tools Appl 83, 2647–2681 (2024). https://doi.org/10.1007/s11042-023-15897-7