Benchmarking framework for class imbalance problem using novel sampling approach for big data


Abstract

Traditional machine learning techniques need to be strengthened to cope with the vast scale of big data if learning is to remain systematic and methodical. An unbalanced distribution of classes in big data, commonly called imbalanced big data, makes the learning problem considerably harder. Conventional methods are being progressively modified, at both the data level and the algorithmic level, to curtail the problem of learning from imbalanced datasets in the big data context. The current study applies a cluster-head-based, data-level sampling solution that inherits the strengths of the K-Means and Fuzzy C-Means clustering approaches. The proposed approach is evaluated with three classifiers, namely Support Vector Machines, Decision Tree and k-Nearest Neighbor, and compared with the conventional SMOTE algorithm. The experiments show promising results, with increments of 8.09% in accuracy and 35.71% in AUC across all the imbalanced datasets. This work provides a baseline comparison of data-level solutions for imbalanced classification in the big data scenario and proposes an efficient clustering-based solution for it.
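The abstract does not spell out how the cluster heads are used, so the following is only a minimal sketch of one common reading: cluster the majority class with plain K-Means and keep the centroids (the "cluster heads") as its representatives, so that both classes end up the same size. The function names `kmeans_heads` and `balance_by_cluster_heads`, the choice of one head per minority sample, and the omission of the Fuzzy C-Means step are all assumptions made for illustration, not the authors' method.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points: List[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def kmeans_heads(points: List[Point], k: int, iters: int = 20, seed: int = 0) -> List[Point]:
    """Plain Lloyd's K-Means; returns the k centroids (hypothetical "cluster heads")."""
    rng = random.Random(seed)
    heads = rng.sample(points, k)  # requires k <= len(points)
    for _ in range(iters):
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: _sq_dist(p, heads[j]))
            clusters[nearest].append(p)
        # Keep the previous head if a cluster ends up empty.
        heads = [_mean(c) if c else heads[j] for j, c in enumerate(clusters)]
    return heads


def balance_by_cluster_heads(majority: List[Point], minority: List[Point], seed: int = 0):
    """Assumed scheme: shrink the majority class to len(minority) cluster heads."""
    heads = kmeans_heads(majority, k=len(minority), seed=seed)
    return heads, list(minority)


if __name__ == "__main__":
    majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
    minority = [(2.0, 2.0), (2.5, 2.5)]
    heads, kept = balance_by_cluster_heads(majority, minority)
    print(len(heads), len(kept))
```

The balanced output could then be passed to any of the classifiers named in the abstract. A Fuzzy C-Means variant would replace the hard nearest-head assignment with membership weights, which this sketch does not attempt.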

Author information

Authors and Affiliations

University School of Information, Communication and Technology, Guru Gobind Singh Indraprastha University, Sector 16C, Dwarka, Delhi, 110078, India
Khyati Ahlawat, Anuradha Chug & Amit Prakash Singh

Corresponding author

Correspondence to Khyati Ahlawat.

About this article

Cite this article

Ahlawat, K., Chug, A. & Singh, A.P. Benchmarking framework for class imbalance problem using novel sampling approach for big data. Int J Syst Assur Eng Manag 10, 824–835 (2019). https://doi.org/10.1007/s13198-019-00817-6

Received: 09 October 2018
Revised: 16 April 2019
Published: 13 June 2019
Issue Date: August 2019
DOI: https://doi.org/10.1007/s13198-019-00817-6
