Abstract
Traditional machine learning techniques must be strengthened to cope with the vast scale of big data if learning is to remain systematic and methodical. An unbalanced distribution of classes in big data, commonly known as imbalanced big data, raises the difficulty of learning considerably. Conventional methods are being progressively modified, at both the data level and the algorithmic level, to curtail the problem of learning from imbalanced datasets in the context of big data. In the current study, a cluster-heads-based data-level sampling solution that inherits the strengths of the K-Means and Fuzzy C-Means clustering approaches is applied. The proposed approach is evaluated with three different classifiers, namely Support Vector Machines, Decision Tree and k-Nearest Neighbor, and compared with the conventional SMOTE algorithm. The experiments show promising results, with an increase of 8.09% in accuracy and 35.71% in AUC for all imbalanced datasets. This work provides a baseline comparison of data-level solutions for imbalanced classification in the big data scenario and proposes an efficient clustering-based solution for the same.
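The abstract does not spell out how the cluster heads are computed or used, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes a binary problem, uses plain K-Means alone (no Fuzzy C-Means step), and replaces the majority class with as many cluster heads (centroids) as there are minority samples. The function names `kmeans_heads` and `balance_with_cluster_heads`, the iteration count, and the seed are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_heads(points, k, iterations=20, seed=0):
    """Run plain Lloyd's K-Means and return the k cluster heads (centroids).

    Assumption: cluster heads are ordinary centroids; the paper may define
    them differently.
    """
    rng = random.Random(seed)
    # Initialise heads from k distinct data points (requires k <= len(points)).
    heads = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, heads[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous head
                heads[i] = [sum(col) / len(members) for col in zip(*members)]
    return heads


def balance_with_cluster_heads(X, y, seed=0):
    """Replace the majority class by K-Means cluster heads.

    Assumption: the number of heads equals the size of the smallest class,
    so a binary dataset comes out with equal class counts.
    """
    counts = Counter(y)
    majority = max(counts, key=counts.get)
    target = min(counts.values())
    majority_points = [x for x, label in zip(X, y) if label == majority]
    heads = kmeans_heads(majority_points, target, seed=seed)
    X_bal = [list(x) for x, label in zip(X, y) if label != majority] + heads
    y_bal = [label for label in y if label != majority] + [majority] * len(heads)
    return X_bal, y_bal


if __name__ == "__main__":
    # Synthetic toy data: 40 majority points and 5 minority points.
    rng = random.Random(1)
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
    X += [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(5)]
    y = [0] * 40 + [1] * 5
    X_bal, y_bal = balance_with_cluster_heads(X, y)
    print(Counter(y), "->", Counter(y_bal))
```

The balanced output could then be passed to any of the classifiers named in the abstract. Because each cluster head is the mean of nearby majority points, this sketch summarises the majority class rather than discarding samples at random; whether the paper under-samples, over-samples, or combines both is not stated in the abstract.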
Cite this article
Ahlawat, K., Chug, A. & Singh, A.P. Benchmarking framework for class imbalance problem using novel sampling approach for big data. Int J Syst Assur Eng Manag 10, 824–835 (2019). https://doi.org/10.1007/s13198-019-00817-6
DOI: https://doi.org/10.1007/s13198-019-00817-6