CN105975992A - Unbalanced data classification method based on adaptive upsampling - Google Patents
- Publication number
- CN105975992A (application CN201610331709.9A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2148—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a method for classifying unbalanced data sets based on adaptive upsampling, comprising the following steps: compute the imbalance ratio of the unbalanced data set from n_p and n_n, and from it the total number of positive samples that need to be generated; using Euclidean distance as the metric, compute a probability density distribution over the positive samples; determine the number of new samples to generate from each positive sample; generate the new positive samples and add them to the original unbalanced training set so that the numbers of positive and negative samples are equal, which yields a new balanced training set containing n_n positive and n_n negative samples; train on the new balanced training set with the Adaboost algorithm and obtain the final classification model after T iterations. The invention can improve classification performance on unbalanced data sets.
Description
Technical field
The invention relates to pattern recognition technology, and in particular to a classifier for unbalanced data sets.
Background
With the rapid development of data mining, pattern recognition and machine learning, data classification has come to play an important role in fields such as image retrieval, medical detection and diagnosis, lie detection, text classification and crude oil spill detection. However, classical classification algorithms such as support vector machines, artificial neural networks and linear discriminant analysis are designed on the assumption that every class in the training set contains roughly the same number of samples. In the fields listed above, the number of abnormal samples (positive samples) is in fact often far smaller than the number of normal samples (negative samples). To reach a higher overall accuracy, a classical classifier then pays more attention to the negative class and the decision boundary shifts toward the positive samples, so that many positive samples are misclassified as negative and classification performance on the positive class degrades. Because abnormal samples usually carry more value for decision-making, classification algorithms for unbalanced data sets, which aim to improve accuracy on positive samples, have become an active research topic.
In recent years, researchers have proposed a variety of classification methods for unbalanced data sets. Depending on what they act on, these methods fall into two main categories: data-level methods and algorithm-level methods.
Data-level methods change the data distribution by resampling, so that the numbers of positive and negative samples become roughly equal. This can be achieved either by downsampling the negative samples or by upsampling the positive samples. The patent "Protein-nucleotide binding site prediction method based on supervised upsampling learning" (CN104077499A) uses upsampling: it increases the number of positive samples to obtain a balanced data set, which is then used to train a support vector machine. However, that method only copies positive samples and adds the copies to the original data set. Each positive sample is therefore effectively trained on several times, which easily leads to overfitting and ultimately degrades classifier performance. The patent "Automatic incident detection method based on under-sampling and used for unbalanced data set" (CN103927874A) uses downsampling: it randomly draws a subset of the negative samples and combines it with all positive samples to form the training set. Because a large number of negative samples are discarded, the method cannot guarantee that the drawn subset represents the original negative set well, so the training result is also unsatisfactory.
Algorithm-level methods address unbalanced classification by improving the classification algorithm rather than by changing the data distribution. Adaboost is one of the classical algorithm-level methods. It combines multiple classifiers and keeps increasing the weights of misclassified samples, which raises the cost of misclassifying those samples again and thereby improves classification accuracy. However, the traditional Adaboost algorithm does not itself pay particular attention to positive samples, so its results are still unsatisfactory.
The analysis above shows that although data-level and algorithm-level methods can both reduce the effect of data imbalance on classification, each has its own limitations.
Summary of the invention
The purpose of the invention is to overcome the deficiencies of existing methods by proposing a classification algorithm for unbalanced data sets based on adaptive upsampling, so as to improve classification performance on such data sets. The technical solution of the invention is as follows:
A classification method for unbalanced data sets based on adaptive upsampling. Let the number of positive samples in the original unbalanced data set be n_p and the number of negative samples be n_n. The method comprises the following steps:
(1) Compute the imbalance ratio IR of the unbalanced data set from n_p and n_n, and from IR compute the total number G of positive samples that need to be generated.
(2) Using Euclidean distance as the metric, for each positive sample i find its K nearest neighbours in the unbalanced data set and record the proportion of negative samples among them as p_i. Sum the p_i values of all positive samples and normalize; denote the normalized value by r_i. The r_i values of all positive samples then sum to 1, i.e. the r_i form a probability density distribution, and r_i is called the probability of positive sample i.
(3) For each positive sample i, determine the number g_i of new samples to generate from it, based on the total G and the probability r_i obtained in step (2).
(4) For each positive sample i, randomly select g_i of the K nearest neighbours found in step (2) and pair each of them with sample i. A point chosen at random on the line segment joining each pair is a new positive sample. When this process is complete, G new positive samples have been generated. Add them to the original unbalanced training set so that the numbers of positive and negative samples are equal, which yields a new balanced training set containing n_n positive and n_n negative samples.
(5) Let T be the number of iterations of the Adaboost algorithm. Train on the new balanced training set with Adaboost and obtain the final classification model after T iterations.
For unbalanced data sets, the invention combines a data-level method with an algorithm-level method and improves the upsampling algorithm. Upsampling is applied mainly to positive samples near the boundary between the positive and negative classes, while positive samples far from the boundary are left untouched, in order to obtain better classification results on unbalanced data sets. The invention combines the advantages of the adaptive upsampling algorithm and the Adaboost algorithm: the new positive samples generated by upsampling are concentrated near the boundary, and boosting over an ensemble of classifiers improves overall classifier performance. In experimental comparison, the invention shows clear advantages on several classifier evaluation metrics.
Brief description of the drawings
Figure 1 is a flowchart of the Adaboost boosting algorithm.
Figure 2 is a flowchart of the invention.
Detailed description
The invention is inspired by the adaptive upsampling algorithm and by the Adaboost algorithm shown in Figure 1, and combines the two into an ensemble classifier. The invention is described in further detail below with reference to the drawings.
(1) Obtain the training and test data. The invention uses the vehicle type recognition data set from the KEEL repository, which contains 846 samples in total. The positive samples are the van data, 199 in total, i.e. n_p = 199. The negative samples cover three vehicle types (bus, Opel and Saab), 647 in total, i.e. n_n = 647. The data set has 18 features, such as torque, turning radius and maximum braking distance. The imbalance ratio is computed by equation (1):
IR = n_n / n_p    (1)
which gives an imbalance ratio of 647 / 199 ≈ 3.25 for this experiment.
(2) Compute the number of positive samples to generate by equation (2):
G = (n_n - n_p) × β    (2)
where β is a constant between 0 and 1. When β = 1, the numbers of positive and negative samples are exactly equal after upsampling and the data set is fully balanced. The invention takes β = 1, so the number of new positive samples to generate is 647 - 199 = 448. The positive samples are then adaptively upsampled according to this value so that the two classes become balanced. Specifically, for each positive sample, with Euclidean distance as the metric, compute the proportion p_i of negative samples among its K nearest sample points, where k_i denotes the number of negative samples among those K points:
p_i = k_i / K, i = 1, ..., n_p    (3)
To judge accurately whether a positive sample lies near the boundary between the classes, K should be large, but the computational cost grows noticeably as K increases. To keep the computational complexity low, the invention compromises between these two requirements and takes K = 5. All p_i are then normalized so that they form a probability density distribution, and the number of new positive samples to generate from each positive sample is computed according to equation (4).
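Equation (4) itself is not present in this text. A reconstruction consistent with steps (2) and (3) of the summary, and with the ADASYN paper listed under Non-Patent Citations, would be the following; the exact rounding rule for g_i is not stated in the source and is left open here.

```latex
r_i = \frac{p_i}{\sum_{j=1}^{n_p} p_j}, \qquad
g_i = r_i \times G, \qquad i = 1, \dots, n_p \tag{4}
```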
Equation (4) shows that sample points closer to the boundary, whose neighbours include more negative samples, are used to generate more positive samples, whereas sample points far from the boundary, whose neighbours are all positive, are not used to generate any. Then, for each positive sample, g_i of its K nearest neighbours are selected at random and new positive samples are generated by equation (5):
new_i = x_i + λ(x_ni - x_i)    (5)
where new_i is the newly generated sample point, λ is a random number between 0 and 1, and x_ni is the randomly selected neighbouring sample point. For each positive sample this process is carried out g_i times. When sample generation is complete, the new sample points are added to the original unbalanced training set to obtain the new balanced training set. This adaptive upsampling method ensures that the new training set is no longer unbalanced and that the new samples lie mainly in the boundary region, where positive and negative samples are hardest to tell apart.
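The upsampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `adaptive_upsample` is hypothetical, g_i is obtained by rounding r_i × G (the rounding rule is not given in the source, so the total may differ slightly from G), and neighbours are drawn with replacement so that g_i may exceed K.

```python
import math
import random

def adaptive_upsample(X, y, K=5, beta=1.0, seed=0):
    """Return synthetic positive samples for data labelled +1 / -1."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    n_p, n_n = len(pos), len(y) - len(pos)
    G = int(round((n_n - n_p) * beta))            # equation (2)

    neighbours, p = {}, {}
    for i in pos:
        # K nearest neighbours by Euclidean distance, searched over the
        # whole data set and excluding the sample itself.
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: math.dist(X[i], X[j]))
        neighbours[i] = others[:K]
        k_i = sum(1 for j in neighbours[i] if y[j] == -1)
        p[i] = k_i / K                            # equation (3)

    total = sum(p.values())
    if total == 0:        # no positive sample has a negative neighbour
        return []

    synthetic = []
    for i in pos:
        g_i = int(round(p[i] / total * G))        # assumed form of equation (4)
        for _ in range(g_i):
            j = rng.choice(neighbours[i])         # randomly chosen neighbour x_ni
            lam = rng.random()                    # lambda in [0, 1)
            # equation (5): a point on the segment between x_i and x_ni
            synthetic.append([a + lam * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic
```

The returned points would be appended to the training set with label +1. A positive sample whose K neighbours are all positive gets p_i = 0 and therefore produces no synthetic points, matching the behaviour described in the text.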
As Figures 1 and 2 show, if random upsampling is applied directly, i.e. all positive sample points are simply copied, the new sample points coincide exactly with the original positive samples and are spread over the entire positive sample space. Adaptive upsampling, by contrast, generates positive samples that differ from the original points, and the new positive samples all lie near the boundary.
(3) The invention uses five-fold cross-validation for training and testing on the unbalanced data set. Both training and testing use the Adaboost classification algorithm with a C4.5 decision tree as the base classifier. The C4.5 tree is configured with a minimum of 2 instances per leaf and a confidence factor of 0.25, and is pruned after training. All data are normalized before entering the classifier, so that the minimum value is 0 and the maximum value is 1. Positive samples are labelled +1 and negative samples -1.
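The normalization step can be illustrated with a short min-max scaling sketch. The source only states that the data range from 0 to 1 after normalization; scaling each feature column independently, and mapping a constant column to 0, are assumptions made here, and `min_max_normalize` is a hypothetical helper name.

```python
def min_max_normalize(X):
    """Scale each feature column of X (a list of rows) to the range [0, 1]."""
    columns = list(zip(*X))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in X:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0   # constant column -> 0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled
```

For example, `min_max_normalize([[1, 10], [3, 10], [2, 30]])` maps the first column to 0, 1 and 0.5 and the second column to 0, 0 and 1.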
The balanced positive and negative samples are split into a training set and a test set by five-fold cross-validation. Each training set then contains about 518 positive and 518 negative samples (four fifths of the 647 samples in each class). The number of training samples is written 2n_n below, i.e. 1036, with n_n here counting the negative samples in the training fold. With T = 10 Adaboost iterations, training proceeds as follows:
1. Let D_t(i) denote the weight of sample i in round t, where t is an integer from 1 to T indicating the current iteration round and i is the sample index. Initialize every sample weight to D_1(i) = 1/(2n_n), i = 1, ..., 2n_n.
2. Train the classifier h_t on the weighted training set, then compute its training error rate ε_t.
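The error-rate expression is not present in this text. Assuming the method follows standard Adaboost, a reconstruction consistent with the symbols defined next (D_t(i), y_i and the classifier output for x_i) is the weighted fraction of misclassified training samples; by elimination it would be equation (6).

```latex
\varepsilon_t = \sum_{i=1}^{2n_n} D_t(i)\,\mathbf{1}\!\left[h_t(x_i) \neq y_i\right] \tag{6}
```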
Here t = 1, ..., T is the current iteration round, ε_t is the training error rate of round t, D_t(i) is the weight of sample i in that round, y_i is the class label of sample x_i (either +1 or -1), and h_t(x_i) is the label that the trained classifier assigns to x_i.
3. Let α_t denote the weight, in the final vote, of the classifier obtained in round t. It is computed from that round's training error rate according to equation (7).
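Equation (7) is not present in this text. Assuming the standard Adaboost classifier weight, which agrees with the later remark that classifiers with a lower error rate receive a higher weight, it would read:

```latex
\alpha_t = \frac{1}{2}\,\ln\!\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right) \tag{7}
```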
At the same time, the weight of each sample for the next round is updated according to equation (8).
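Equation (8) is not present in this text. Assuming the standard Adaboost update, which matches the description of Z_t as a normalizer and the later statement that the exponent is positive for a misclassified sample, it would read:

```latex
D_{t+1}(i) = \frac{D_t(i)\,\exp\!\left(-\alpha_t\, y_i\, h_t(x_i)\right)}{Z_t} \tag{8}
```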
Here Z_t is the sum of the sample weights in the current round and is used to normalize them.
4. Repeat steps 2 and 3 a total of T times to complete all iterations and weight updates, which completes classifier training. For a test sample to be classified, the classification result is given by equation (9).
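Equation (9) is not present in this text. Assuming the standard Adaboost weighted vote over the T trained classifiers, with H(x) used here as a hypothetical name for the final decision, it would read:

```latex
H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right) \tag{9}
```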
Equation (7) shows that the weight of each sub-classifier is determined by its classification error rate: a classifier with a lower error rate receives a higher weight in the vote of equation (9). For an individual sample, equation (8) shows that if the sample's true label differs from the classification result, the exponent is greater than 0 and the exponential factor is greater than 1, so the weight of that sample increases in the next round. Conversely, if the sample is classified correctly, its weight decreases in the next round.
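The training loop in steps 1 to 4 can be sketched as follows. This is an illustrative sketch, not the patent's implementation: it swaps the C4.5 base classifier for single-feature threshold stumps to stay self-contained, it uses the standard Adaboost formulas (assumed here, since the equations are missing from this text), and all function names are hypothetical.

```python
import math

def train_stump(X, y, D):
    """Return (weighted error, feature, threshold, sign) of the best stump."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[f] <= thr else -sign for row in X]
                err = sum(d for d, p, t in zip(D, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best

def stump_predict(stump, x):
    _, f, thr, sign = stump
    return sign if x[f] <= thr else -sign

def adaboost_train(X, y, T=10):
    n = len(X)
    D = [1.0 / n] * n                                  # step 1: uniform weights
    ensemble = []
    for _ in range(T):
        stump = train_stump(X, y, D)                   # step 2: weighted training
        eps = min(max(stump[0], 1e-10), 1 - 1e-10)     # clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)        # step 3: classifier weight
        ensemble.append((alpha, stump))
        # Weight update: misclassified samples (y * h = -1) are scaled up.
        D = [d * math.exp(-alpha * t * stump_predict(stump, x))
             for d, x, t in zip(D, X, y)]
        Z = sum(D)                                     # normalizer Z_t
        D = [d / Z for d in D]
    return ensemble

def adaboost_predict(ensemble, x):
    """Weighted vote of all stumps; returns +1 or -1."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1
```

On a one-dimensional toy set such as `X = [[0], [1], [2], [3], [4], [5]]` with labels `[1, 1, -1, -1, 1, 1]`, no single stump separates the classes, but three boosting rounds are enough for the weighted vote to classify every training point correctly.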
The test set samples are fed into the trained classifier to obtain their final classification results, as shown in Figure 2.
Table 1 gives the test results obtained by three approaches: classifying the unbalanced data set directly with a C4.5 decision tree, classifying with C4.5 after random upsampling of the positive samples, and classifying with the method of the invention. The following metrics are used to evaluate classifier performance:
Table 1. Results and comparison of the classification algorithms (the best result for each metric is shown in bold)
The data in Table 1 show that classifying directly with the C4.5 decision tree gives the highest specificity but the lowest sensitivity, which indicates that data imbalance has a marked effect on classification performance in this case: the boundary region of the positive class is eroded and many positive samples are misclassified as negative. Simple random upsampling alleviates the problem, but the gap between sensitivity and specificity remains large. The invention achieves good sensitivity and good specificity at the same time, and the geometric mean of the two is also the highest among the compared methods, which shows that the invention offers the best trade-off between sensitivity and specificity.
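The three quantities discussed above can be computed from predicted and true labels as in the sketch below. It uses the usual definitions (sensitivity as the true positive rate, specificity as the true negative rate, and their geometric mean); the function name is hypothetical and the sample labels in the usage note are made up for illustration, not taken from Table 1.

```python
import math

def sensitivity_specificity_gmean(y_true, y_pred):
    """Return (sensitivity, specificity, geometric mean) for +1 / -1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    sens = tp / (tp + fn) if tp + fn else 0.0   # share of positives found
    spec = tn / (tn + fp) if tn + fp else 0.0   # share of negatives found
    return sens, spec, math.sqrt(sens * spec)
```

With true labels `[1, 1, 1, 1, -1, -1, -1, -1]` and predictions `[1, 1, 1, -1, -1, -1, 1, 1]`, sensitivity is 0.75 and specificity is 0.5. The geometric mean penalizes a classifier that scores well on one class only, which is why it suits unbalanced data.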
In summary, the invention achieves good classification results on unbalanced data sets and effectively removes the negative effect of data imbalance on classification.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610331709.9A CN105975992A (en) | 2016-05-18 | 2016-05-18 | Unbalanced data classification method based on adaptive upsampling |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610331709.9A CN105975992A (en) | 2016-05-18 | 2016-05-18 | Unbalanced data classification method based on adaptive upsampling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105975992A true CN105975992A (en) | 2016-09-28 |
Family
ID=56955297
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610331709.9A Pending CN105975992A (en) | 2016-05-18 | 2016-05-18 | Unbalanced data classification method based on adaptive upsampling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105975992A (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107273916A (en) * | 2017-05-22 | 2017-10-20 | 上海大学 | The unknown Information Hiding & Detecting method of steganographic algorithm |
CN108133223A (en) * | 2016-12-01 | 2018-06-08 | 富士通株式会社 | The device and method for determining convolutional neural networks CNN models |
CN108334455A (en) * | 2018-03-05 | 2018-07-27 | 清华大学 | The Software Defects Predict Methods and system of cost-sensitive hypergraph study based on search |
CN108629413A (en) * | 2017-03-15 | 2018-10-09 | 阿里巴巴集团控股有限公司 | Neural network model training, trading activity Risk Identification Method and device |
CN108733633A (en) * | 2018-05-18 | 2018-11-02 | 北京科技大学 | A kind of the unbalanced data homing method and device of sample distribution adjustment |
CN108776711A (en) * | 2018-03-07 | 2018-11-09 | 中国电力科学研究院有限公司 | A kind of electrical power system transient sample data extracting method and system |
CN109086412A (en) * | 2018-08-03 | 2018-12-25 | 北京邮电大学 | A kind of unbalanced data classification method based on adaptive weighted Bagging-GBDT |
CN109327464A (en) * | 2018-11-15 | 2019-02-12 | 中国人民解放军战略支援部队信息工程大学 | A method and device for processing class imbalance in network intrusion detection |
CN109614967A (en) * | 2018-10-10 | 2019-04-12 | 浙江大学 | A license plate detection method based on negative sample data value resampling |
CN109740750A (en) * | 2018-12-17 | 2019-05-10 | 北京深极智能科技有限公司 | Method of data capture and device |
CN109756494A (en) * | 2018-12-29 | 2019-05-14 | 中国银联股份有限公司 | A kind of negative sample transform method and device |
CN109862392A (en) * | 2019-03-20 | 2019-06-07 | 济南大学 | Identification method, system, device and medium of Internet game video traffic |
CN110163226A (en) * | 2018-02-12 | 2019-08-23 | 北京京东尚科信息技术有限公司 | Equilibrating data set generation method and apparatus and classification method and device |
CN110998648A (en) * | 2018-08-09 | 2020-04-10 | 北京嘀嘀无限科技发展有限公司 | System and method for distributing orders |
CN111062806A (en) * | 2019-12-13 | 2020-04-24 | 合肥工业大学 | Personal financial credit risk assessment method, system and storage medium |
WO2020082734A1 (en) * | 2018-10-24 | 2020-04-30 | 平安科技(深圳)有限公司 | Text emotion recognition method and apparatus, electronic device, and computer non-volatile readable storage medium |
CN111598189A (en) * | 2020-07-20 | 2020-08-28 | 北京瑞莱智慧科技有限公司 | Generative model training method, data generation method, device, medium, and apparatus |
CN111652268A (en) * | 2020-04-22 | 2020-09-11 | 浙江盈狐云数据科技有限公司 | A classification method for unbalanced flow data based on resampling mechanism |
CN113903030A (en) * | 2021-10-12 | 2022-01-07 | 杭州迪英加科技有限公司 | Liquid-based cell pathology image generation method based on weak supervised learning |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103927874A (en) * | 2014-04-29 | 2014-07-16 | 东南大学 | Automatic incident detection method based on under-sampling and used for unbalanced data set |
CN104573708A (en) * | 2014-12-19 | 2015-04-29 | 天津大学 | Ensemble-of-under-sampled extreme learning machine |
CN104951809A (en) * | 2015-07-14 | 2015-09-30 | 西安电子科技大学 | Unbalanced data classification method based on unbalanced classification indexes and integrated learning |
CN105373806A (en) * | 2015-10-19 | 2016-03-02 | 河海大学 | Outlier detection method based on uncertain data set |
-
2016
- 2016-05-18 CN CN201610331709.9A patent/CN105975992A/en active Pending
Non-Patent Citations (3)
Title |
---|
HAIBO HE 等: "ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning", 《2008 IEEE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS》 * |
刘余霞 等: "一种新的过采样算法DB_SMOTE", 《计算机工程与应用》 * |
陶新民 等: "不均衡数据分类算法的综述", 《重庆邮电大学学报(自然科学版)》 * |
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108133223A (en) * | 2016-12-01 | 2018-06-08 | 富士通株式会社 | The device and method for determining convolutional neural networks CNN models |
CN108133223B (en) * | 2016-12-01 | 2020-06-26 | 富士通株式会社 | Apparatus and method for determining convolutional neural network CNN model |
CN108629413A (en) * | 2017-03-15 | 2018-10-09 | 阿里巴巴集团控股有限公司 | Neural network model training, trading activity Risk Identification Method and device |
CN108629413B (en) * | 2017-03-15 | 2020-06-16 | 创新先进技术有限公司 | Neural network model training and transaction behavior risk identification method and device |
CN107273916B (en) * | 2017-05-22 | 2020-10-16 | 上海大学 | Information Hiding Detection Method Unknown to Steganography Algorithm |
CN107273916A (en) * | 2017-05-22 | 2017-10-20 | 上海大学 | The unknown Information Hiding & Detecting method of steganographic algorithm |
CN110163226A (en) * | 2018-02-12 | 2019-08-23 | 北京京东尚科信息技术有限公司 | Equilibrating data set generation method and apparatus and classification method and device |
CN108334455B (en) * | 2018-03-05 | 2020-06-26 | 清华大学 | Software defect prediction method and system based on search cost-sensitive hypergraph learning |
CN108334455A (en) * | 2018-03-05 | 2018-07-27 | 清华大学 | Software defect prediction method and system based on search cost-sensitive hypergraph learning |
CN108776711A (en) * | 2018-03-07 | 2018-11-09 | 中国电力科学研究院有限公司 | Power system transient sample data extraction method and system |
CN108733633A (en) * | 2018-05-18 | 2018-11-02 | 北京科技大学 | Unbalanced data regression method and device based on sample distribution adjustment |
CN109086412A (en) * | 2018-08-03 | 2018-12-25 | 北京邮电大学 | Unbalanced data classification method based on adaptive weighted Bagging-GBDT |
CN110998648A (en) * | 2018-08-09 | 2020-04-10 | 北京嘀嘀无限科技发展有限公司 | System and method for distributing orders |
CN109614967A (en) * | 2018-10-10 | 2019-04-12 | 浙江大学 | A license plate detection method based on negative sample data value resampling |
CN109614967B (en) * | 2018-10-10 | 2020-07-17 | 浙江大学 | License plate detection method based on negative sample data value resampling |
WO2020082734A1 (en) * | 2018-10-24 | 2020-04-30 | 平安科技(深圳)有限公司 | Text emotion recognition method and apparatus, electronic device, and computer non-volatile readable storage medium |
CN109327464A (en) * | 2018-11-15 | 2019-02-12 | 中国人民解放军战略支援部队信息工程大学 | A method and device for processing class imbalance in network intrusion detection |
CN109740750A (en) * | 2018-12-17 | 2019-05-10 | 北京深极智能科技有限公司 | Data acquisition method and device |
CN109756494A (en) * | 2018-12-29 | 2019-05-14 | 中国银联股份有限公司 | Negative sample transformation method and device |
CN109756494B (en) * | 2018-12-29 | 2021-04-16 | 中国银联股份有限公司 | Negative sample transformation method and device |
CN109862392A (en) * | 2019-03-20 | 2019-06-07 | 济南大学 | Identification method, system, device and medium of Internet game video traffic |
CN109862392B (en) * | 2019-03-20 | 2021-04-13 | 济南大学 | Identification method, system, device and medium of Internet game video traffic |
CN111062806A (en) * | 2019-12-13 | 2020-04-24 | 合肥工业大学 | Personal financial credit risk assessment method, system and storage medium |
CN111062806B (en) * | 2019-12-13 | 2022-05-10 | 合肥工业大学 | Personal financial credit risk assessment method, system and storage medium |
CN111652268A (en) * | 2020-04-22 | 2020-09-11 | 浙江盈狐云数据科技有限公司 | A classification method for unbalanced flow data based on resampling mechanism |
CN111598189B (en) * | 2020-07-20 | 2020-10-30 | 北京瑞莱智慧科技有限公司 | Generative model training method, data generation method, device, medium, and apparatus |
CN111598189A (en) * | 2020-07-20 | 2020-08-28 | 北京瑞莱智慧科技有限公司 | Generative model training method, data generation method, device, medium, and apparatus |
CN113903030A (en) * | 2021-10-12 | 2022-01-07 | 杭州迪英加科技有限公司 | Liquid-based cell pathology image generation method based on weakly supervised learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105975992A (en) | Unbalanced data classification method based on adaptive upsampling | |
CN110443281B (en) | Text classification adaptive oversampling method based on HDBSCAN (hierarchical density-based spatial clustering of applications with noise) | |
CN102520341B (en) | Analog circuit fault diagnosis method based on Bayes-KFCM (Kernelized Fuzzy C-Means) algorithm | |
Nguyen et al. | Learning pattern classification tasks with imbalanced data sets | |
CN112147432A (en) | BiLSTM module based on attention mechanism, transformer state diagnosis method and system | |
CN106778853A (en) | Unbalanced data classification method based on weighted clustering and undersampling | |
CN110853756B (en) | Esophageal cancer risk prediction method based on SOM neural network and SVM | |
CN110363230B (en) | Sewage treatment fault diagnosis method using a stacking ensemble based on weighted base classifiers | |
CN107255785A (en) | Analog circuit fault diagnosis method based on improved mRMR | |
CN109086412A (en) | Unbalanced data classification method based on adaptive weighted Bagging-GBDT | |
CN107688831A (en) | Unbalanced data classification method based on cluster down-sampling | |
CN111680726B (en) | Transformer fault diagnosis method and system based on neighbor component analysis and k neighbor learning fusion | |
CN106250442A (en) | Feature selection method and system for network security data | |
CN109214460A (en) | Power transformer fault diagnosis method based on relative transformation and kernel entropy component analysis | |
CN105975993A (en) | Unbalanced data classification method based on boundary upsampling | |
CN108877947B (en) | Deep sample learning method based on iterative mean clustering | |
CN112633337A (en) | Unbalanced data processing method based on clustering and boundary points | |
CN115048988B (en) | Classification fusion method for imbalanced data sets based on Gaussian mixture model | |
CN108764346A (en) | Entropy-based hybrid sampling ensemble classifier | |
CN104809476A (en) | Multi-target evolutionary fuzzy rule classification method based on decomposition | |
CN104463251A (en) | Cancer gene expression profile data identification method based on integration of extreme learning machines | |
Wang et al. | Nearest Neighbor with Double Neighborhoods Algorithm for Imbalanced Classification. | |
CN102750545A (en) | Pattern recognition method capable of achieving cluster, classification and metric learning simultaneously | |
CN110322968A (en) | Feature selection method and device for disease category medical data | |
CN108920477A (en) | Unbalanced data processing method based on binary tree structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160928 |