
Computer Science, 2019, Vol. 46, Issue (12): 8-12. doi: 10.11896/jsjkx.180901813

• Big Data & Data Science •

AdaBoostRS: Integration of High-dimensional Unbalanced Data Learning

YANG Ping-an, LIN Ya-ping, ZHU Tuan-fei

  1. (College of Information Science and Engineering, Hunan University, Changsha 410000, China)
  • Received: 2018-09-27 Online: 2019-12-15 Published: 2019-12-17
  • Corresponding author: LIN Ya-ping (born 1956), male, Ph.D., professor, doctoral supervisor. His main research interests include computer networks, cloud security and machine learning. E-mail: yplin@hun.edu.cn.
  • About the authors: YANG Ping-an (born 1995), female, master's student. Her main research interests include machine learning and data mining. E-mail: ypingan@hnu.edu.cn. ZHU Tuan-fei (born 1987), male, Ph.D., CCF member. His main research interests include cloud security and machine learning.

Abstract: In machine learning, the class imbalance problem refers to a skewed distribution of data samples among different classes, which biases the learning process toward the majority class. In high-dimensional data, the sparsity of the data makes this classification bias more pronounced. For high-dimensional imbalanced data, the two challenging problems of the curse of dimensionality and class imbalance are superimposed, making high-dimensional imbalanced problems more difficult to solve. To address this, the paper proposes AdaBoostRS (AdaBoost ensemble of Random subspace and SMOTE), an AdaBoost ensemble method that combines random subspaces with the SMOTE oversampling technique to classify high-dimensional imbalanced data. Specifically, AdaBoostRS trains each classifier on a subset of features selected by the random subspace method, which increases the diversity of the classification samples and reduces the dimensionality of the high-dimensional data. It then applies SMOTE to the dimension-reduced data, linearly interpolating the minority class to address the class imbalance. Experiments on 8 high-dimensional imbalanced standard time series datasets show that AdaBoostRS outperforms traditional ensemble learning methods on three performance measures: F-measure, G-mean and AUC.

Key words: AdaBoost, High-dimensional imbalance, Random subspace, SMOTE
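
The two data-preparation steps named in the abstract (random subspace selection, then SMOTE interpolation of the minority class) can be sketched as below. This is only an illustrative sketch, not the authors' implementation: the function names, the subspace ratio of 0.5, k = 5 neighbours, and the Gaussian toy data are all assumptions, and the AdaBoost loop that would train one classifier per round on such data is left out. The F-measure and G-mean helper uses the usual definitions (harmonic mean of precision and recall; geometric mean of recall and specificity), which the paper is assumed to follow. AUC is not shown.

```python
import math
import random

def random_subspace(n_features, ratio, rng):
    """Return a sorted random subset of feature indices (no repeats)."""
    k = max(1, round(ratio * n_features))
    return sorted(rng.sample(range(n_features), k))

def project(X, feats):
    """Keep only the selected feature columns of every sample."""
    return [[row[j] for j in feats] for row in X]

def smote(X_min, n_new, k, rng):
    """Create n_new synthetic minority samples by linear interpolation between
    a minority sample and one of its k nearest minority neighbours.
    Assumes at least two minority samples."""
    k = min(k, len(X_min) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(X_min))
        base = X_min[i]
        # Indices of the other minority samples, nearest first.
        others = sorted((j for j in range(len(X_min)) if j != i),
                        key=lambda j: math.dist(base, X_min[j]))
        neighbour = X_min[rng.choice(others[:k])]
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

def f_measure_and_g_mean(y_true, y_pred, positive=1):
    """F-measure and G-mean with the minority class as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f, math.sqrt(recall * specificity)

# Hypothetical toy data: 90 majority (label 0) and 10 minority (label 1)
# samples with 200 features each.
rng = random.Random(0)
X = ([[rng.gauss(0, 1) for _ in range(200)] for _ in range(90)]
     + [[rng.gauss(1, 1) for _ in range(200)] for _ in range(10)])
y = [0] * 90 + [1] * 10

feats = random_subspace(200, 0.5, rng)        # step 1: pick a feature subset
X_sub = project(X, feats)
X_min = [row for row, label in zip(X_sub, y) if label == 1]
X_new = smote(X_min, n_new=80, k=5, rng=rng)  # step 2: oversample the minority
X_bal, y_bal = X_sub + X_new, y + [1] * len(X_new)
```

In a boosting setting, a fresh feature subset and a fresh set of synthetic samples would presumably be drawn for each base classifier, so that the ensemble members see different views of the data. Because each synthetic point lies on the segment between two real minority samples, it never leaves the per-feature range of the minority class.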

CLC Number: TP301.6