CN105975992A

CN105975992A - Unbalanced data classification method based on adaptive upsampling

Info

Publication number: CN105975992A
Application number: CN201610331709.9A
Authority: CN
Inventors: 吕卫; 李喆; 褚晶辉
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2016-05-18
Filing date: 2016-05-18
Publication date: 2016-09-28

Abstract

The invention relates to a method for classifying unbalanced data sets based on adaptive upsampling, comprising the following steps: calculating the imbalance rate of the unbalanced data set and, from it, the total number of positive samples that need to be newly generated; taking the Euclidean distance as the measure, calculating a probability density distribution over the positive samples; determining the number of new samples that need to be generated for each positive sample; generating the new positive samples and adding them to the original unbalanced training set so that the numbers of positive and negative samples are the same, that is, obtaining a new balanced training set containing n_n positive samples and n_n negative samples; and training on the newly generated balanced training set with the Adaboost algorithm, obtaining the final classification model after T iterations. The invention can improve classification performance on unbalanced data sets.

Description

A Classification Method for Imbalanced Datasets Based on Adaptive Upsampling

Technical field

The invention relates to pattern recognition technology, and in particular to a classifier for imbalanced data sets.

Background

With the rapid development of data mining, pattern recognition and machine learning, data classification has been applied in, and plays an important role in, many fields such as image retrieval, medical detection and diagnosis, lie detection, text classification and crude oil spill detection. However, classical classification algorithms such as support vector machines, artificial neural networks and linear discriminant analysis are designed on the assumption that each class in the training data set contains roughly the same number of samples. In practice, in the fields listed above, the number of abnormal samples (positive samples) is often far smaller than the number of normal samples (negative samples). In that case, in order to obtain higher overall accuracy, a classical classifier pays more attention to the negative class and the classification boundary moves toward the positive samples, so that a large number of positive samples are misclassified as negative, which ultimately degrades classification performance on the positive class. Since abnormal samples carry higher value in decision-making in most cases, classification algorithms for imbalanced data sets have become a research hotspot as a way to improve the classification accuracy of positive samples.

In recent years, researchers have proposed a variety of classification methods for imbalanced data sets. According to what they act on, these methods fall into two main categories: data-level methods and algorithm-level methods.

Data-level methods change the data distribution by resampling so that the numbers of positive and negative samples are roughly equal, thereby balancing the data. This can be achieved either by downsampling the negative samples or by upsampling the positive samples. The patent "Protein-Nucleotide Binding Site Prediction Method Based on Supervised Upsampling Learning" (CN104077499A) uses upsampling: it increases the number of positive samples to obtain a balanced data set, which is then used to train a support vector machine. However, because that method merely copies positive samples and adds the copies to the original data set, each positive sample is in effect trained on multiple times, which is prone to overfitting and ultimately degrades classifier performance. The patent "Automatic Detection Method of Traffic Incidents Based on Under-sampling for Unbalanced Data Sets" (CN103927874A) uses downsampling: it randomly draws a subset of the negative samples and combines it with all positive samples to form the training set for the classifier. However, because a large number of negative samples are discarded, there is no guarantee that the drawn subset represents the original negative sample set well, so the training result is also unsatisfactory.

Algorithm-level methods address the imbalanced classification problem by improving the classification algorithm rather than changing the data distribution. Adaboost is one of the classic algorithm-level methods. It cascades multiple classifiers and repeatedly increases the weight of misclassified samples, raising the cost of misclassifying them again and thereby improving classification accuracy. However, because the traditional Adaboost algorithm itself pays no particular attention to positive samples, its results are still unsatisfactory.

The above analysis shows that although both data-level and algorithm-level methods can reduce the effect of data imbalance on classification, each has its limitations.

Summary of the invention

The purpose of the present invention is to overcome the shortcomings of existing methods by proposing a classification algorithm for imbalanced data sets based on adaptive upsampling, so as to improve classification performance on imbalanced data sets. The technical solution of the invention is as follows:

A classification method for imbalanced data sets based on adaptive upsampling, where the number of positive samples in the original imbalanced data set is n_p and the number of negative samples is n_n, the method comprising the following steps:

(1) Calculate the imbalance rate IR of the imbalanced data set from n_p and n_n, and from IR calculate the total number G of positive samples that need to be newly generated;

(2) Taking the Euclidean distance as the measure, for each positive sample i, find its K nearest neighbors in the imbalanced data set and compute the proportion of negative samples among those K neighbors, denoted p_i. Sum the p_i values over all positive samples and normalize each p_i by that sum, denoting the result r_i. The r_i values of all positive samples then sum to 1, that is, r_i forms a probability density distribution, and r_i is called the probability of positive sample i;

(3) For each positive sample i, determine the number g_i of new samples to be generated from it, based on the total number G of positive samples to generate and the probability r_i obtained in step (2);

(4) For each positive sample i, randomly select g_i of the K nearest neighbors obtained in step (2), pair each of them with sample i, and randomly choose a point on the line segment connecting each pair as a newly generated positive sample. Once generation is complete, G new positive sample points have been produced; adding them to the original imbalanced training set makes the numbers of positive and negative samples equal, giving a new balanced training set containing n_n positive samples and n_n negative samples;

(5) Denote the number of iterations of the Adaboost algorithm by T, train on the newly generated balanced training set with the Adaboost algorithm, and obtain the final classification model after T iterations.

For imbalanced data sets, the present invention combines a data-level method with an algorithm-level method and improves and optimizes the upsampling algorithm: upsampling is applied mainly to positive sample points near the boundary between positive and negative samples, while positive samples far from the boundary are left unprocessed, so as to obtain better classification results on imbalanced data sets. The invention combines the advantages of the adaptive upsampling algorithm and the Adaboost algorithm: it ensures that the new positive samples generated during upsampling are concentrated mainly near the boundary, and at the same time uses a combination of classifiers for boosting, which improves overall classifier performance. In experimental comparison, the invention shows a clear advantage on several classifier evaluation indicators.

Brief description of the drawings

Figure 1 is a flowchart of the Adaboost boosting algorithm.

Figure 2 is a flowchart of the present invention.

Detailed description

Inspired by the adaptive upsampling algorithm and the Adaboost algorithm shown in Figure 1, the present invention combines the two to form an ensemble classifier. The invention is described in further detail below with reference to the accompanying drawings.

(1) Obtain the test and training data. The invention uses the vehicle type recognition data set from the KEEL repository, which contains 846 samples in total. The positive samples are the van data, 199 in total, i.e. n_p = 199. The negative samples comprise data for three vehicle types (bus, Opel and Saab), 647 in total, i.e. n_n = 647. The data set has 18 features, such as torque, steering radius and maximum braking distance. The imbalance rate is calculated according to formula (1):

$IR = \frac{n_n}{n_p} \qquad (1)$

In this experiment the imbalance rate is therefore 647/199 ≈ 3.25.
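As a quick check of formula (1) on the sample counts quoted above, a minimal Python sketch follows. The function name `imbalance_ratio` is illustrative and does not come from the patent.

```python
def imbalance_ratio(n_neg: int, n_pos: int) -> float:
    """Formula (1): IR = n_n / n_p."""
    if n_pos <= 0:
        raise ValueError("at least one positive sample is required")
    return n_neg / n_pos


# Sample counts from the vehicle data set described in the text.
ir = imbalance_ratio(647, 199)
print(round(ir, 2))  # 3.25
```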

(2) Calculate the number of positive samples that need to be generated according to formula (2):

$G = (n_n - n_p) \times \beta \qquad (2)$

Here β is a constant between 0 and 1. When β = 1, the numbers of positive and negative samples are exactly equal after upsampling and the data set is fully balanced; the present invention takes β = 1. The number of new positive samples to generate is therefore 647 − 199 = 448. The positive samples are then adaptively upsampled according to this value so that the numbers of positive and negative samples are balanced. Specifically, for each positive sample, with the Euclidean distance as the measure, compute the proportion p_i of negative samples among its K nearest sample points, where k_i is the number of negative samples among those K neighbors:

$p_i = \frac{k_i}{K}, \quad i = 1, \ldots, n_p \qquad (3)$
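Formula (3) can be sketched in plain Python as a brute-force nearest-neighbor search. This is only an illustration under stated assumptions: the function name `negative_neighbor_ratio` is hypothetical, labels are +1 and -1, neighbors are searched over the whole data set, and a sample is not counted as its own neighbor (the text does not say either way).

```python
import math


def negative_neighbor_ratio(X, y, k=5):
    """Return {i: p_i} for every positive sample i, with p_i = k_i / K.

    X is a list of feature vectors and y a list of labels (+1 or -1).
    Assumption: the sample itself is excluded from its own neighbor list.
    """
    ratios = {}
    for i, (x_i, y_i) in enumerate(zip(X, y)):
        if y_i != 1:
            continue  # p_i is only defined for positive samples
        # Euclidean distance from sample i to every other sample.
        dists = sorted(
            (math.dist(x_i, x_j), j) for j, x_j in enumerate(X) if j != i
        )
        neighbors = [j for _, j in dists[:k]]
        k_i = sum(1 for j in neighbors if y[j] == -1)
        ratios[i] = k_i / k
    return ratios


# Tiny 1-D example: three positives, then three negatives.
X = [(0.0,), (2.0,), (3.0,), (3.4,), (3.8,), (9.0,)]
y = [1, 1, 1, -1, -1, -1]
print(negative_neighbor_ratio(X, y, k=2))  # {0: 0.0, 1: 0.5, 2: 1.0}
```

The positive sample deep inside its own class gets p_i = 0, while the one sitting next to the negatives gets p_i = 1, which is the behavior the method relies on.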

To judge accurately whether each positive sample lies near the boundary between positive and negative samples, K should be fairly large, but the amount of computation also increases markedly as K grows. To keep computational complexity low, the invention compromises between these two requirements and takes K = 5. All p_i are then normalized so that they form a probability density distribution, and the number of new positive samples that each positive sample should generate is computed:

$g_i = \frac{p_i}{\sum_{j=1}^{n_p} p_j} \times G \qquad (4)$
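Formulas (2) and (4) together can be sketched as follows. Formula (4) generally yields non-integer values, and the text does not say how they are turned into counts, so the largest-remainder rounding used here is an assumption, as is the function name `samples_to_generate`.

```python
def samples_to_generate(p, n_neg, n_pos, beta=1.0):
    """Return (G, g), where g[i] is the number of new samples for positive i.

    p holds the p_i values of formula (3), one per positive sample.
    """
    G = int(round((n_neg - n_pos) * beta))        # formula (2)
    total = sum(p)
    if total == 0:
        raise ValueError("no positive sample has a negative neighbor")
    exact = [p_i / total * G for p_i in p]        # formula (4), real-valued
    g = [int(e) for e in exact]                   # round down first
    # Assumption: give the remaining samples to the largest remainders
    # so that the counts add up to exactly G.
    leftover = G - sum(g)
    order = sorted(range(len(p)), key=lambda i: exact[i] - g[i], reverse=True)
    for i in order[:leftover]:
        g[i] += 1
    return G, g


G, g = samples_to_generate([0.0, 0.5, 1.0], n_neg=647, n_pos=199)
print(G, sum(g), g[0])  # 448 448 0
```

A sample with p_i = 0 receives no new samples, in line with the interpretation of formula (4) given in the text.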

From formula (4), sample points that are closer to the boundary and have more negative samples among their neighbors are used to generate more positive samples, whereas sample points that are far from the boundary and whose neighbors are all positive are not used to generate positive samples at all. Then, for each positive sample, g_i of its K nearest neighbor sample points are selected at random, and new positive samples are generated according to formula (5):

$\mathrm{new}_i = x_i + \lambda (x_{ni} - x_i) \qquad (5)$

Here new_i is a newly generated sample point, λ is a random number between 0 and 1, and x_ni is the randomly selected neighboring sample point. For each positive sample this process is carried out g_i times. After sample generation is complete, the newly generated sample points are added to the original imbalanced training set to obtain the new balanced training set. This adaptive upsampling method ensures that the new training set is free of imbalance and that the newly generated samples lie mainly in the boundary region, where positive and negative samples are hard to tell apart.
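The interpolation of formula (5) can be sketched as below. The function name `synthesize` is hypothetical, and neighbors are drawn with replacement, which is an assumption made here because g_i may exceed K.

```python
import random


def synthesize(x_i, neighbors, g_i, rng=random):
    """Generate g_i new points on segments from x_i to random neighbors.

    Implements new = x_i + lambda * (x_n - x_i), formula (5).
    Assumption: neighbors are sampled with replacement.
    """
    new_points = []
    for _ in range(g_i):
        x_n = rng.choice(neighbors)   # randomly selected neighbor
        lam = rng.random()            # lambda drawn from [0, 1)
        new_points.append([a + lam * (b - a) for a, b in zip(x_i, x_n)])
    return new_points


rng = random.Random(0)  # seeded generator for reproducibility
points = synthesize([0.0, 0.0], [[1.0, 0.0], [0.0, 2.0]], g_i=4, rng=rng)
print(len(points))  # 4
```

Each generated point lies on the segment between the positive sample and one of its neighbors, so it never falls outside the region spanned by existing data.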

As can be seen from Figures 1 and 2, if random upsampling is applied directly, copying all positive sample points, the newly generated sample points coincide exactly with the original positive sample points and are spread over the entire positive sample space. Adaptive upsampling, by contrast, generates positive samples that differ from the original sample points, and the newly generated positive samples all lie near the boundary.

(3) The invention uses five-fold cross-validation to train and test on the imbalanced data set. Both training and testing use the Adaboost classification algorithm with C4.5 decision trees as the base classifier. The minimum number of leaf nodes of the C4.5 decision tree is set to 2 and the confidence level to 0.25, and the tree is pruned after training. All data are normalized before entering the classifier, so that the minimum value is 0 and the maximum value is 1. Positive samples are labeled +1 and negative samples -1.
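The normalization to the range 0 to 1 mentioned above can be sketched as min-max scaling. The text does not state whether scaling is done per feature; the sketch assumes it is, and the function name `min_max_scale` is illustrative.

```python
def min_max_scale(X):
    """Rescale each feature column of X to the range [0, 1].

    Assumption: scaling is applied independently to each feature.
    """
    columns = list(zip(*X))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in X:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled


print(min_max_scale([[1, 10], [2, 20], [3, 40]])[1][0])  # 0.5
```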

The balanced positive and negative samples are split into a training set and a test set by five-fold cross-validation. The training set then contains 518 positive and 518 negative samples, so 1036 samples are used for training (written 2n_n in the formulas below, with n_n there counting the negative samples in the training fold). With the number of Adaboost iterations set to T = 10, training proceeds as follows:

1. Denote the weight of each sample by D_t(i), where t takes integer values from 1 to (T-1) and indicates the current iteration round, and i is the sample index. Initialize every sample weight to D_1(i) = 1/(2n_n), i = 1, ..., 2n_n.

2. Train the classifier h_t on the weighted training set and, after training, compute its training error rate:

$\epsilon_t = \sum_{i=1}^{m} D_t(i) \, [y_i \neq h_t(x_i)] \qquad (6)$

Here t = 1, ..., T is the current iteration round, ε_t is the training error rate of round t, D_t(i) is the weight of sample i in that round, y_i is the class label of sample x_i, taking the value 1 or -1, and h_t(x_i) is the label that the trained classifier assigns to sample x_i. The bracket equals 1 when the condition inside it holds and 0 otherwise, and m is the number of training samples.

3. Let α_t be the weight in the final vote of the classifier obtained in round t. It is calculated from that round's training error rate as:

$\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t} \qquad (7)$

At the same time, for the next round of iteration, the weight of each sample is updated to:

$D_{t+1}(i) = \frac{D_t(i) \exp[-\alpha_t y_i h_t(x_i)]}{Z_t} \qquad (8)$

Here Z_t is the sum of the sample weights in the current iteration round and is used to normalize the sample weights.

4. Perform steps 2 and 3 a total of T times to complete all iterations and weight updates, which completes classifier training. For a test sample x to be classified, the classification result is:

$\operatorname{sign}(H(x)), \quad H(x) = \sum_{t=1}^{T} \alpha_t h_t(x) \qquad (9)$

From formula (7), the weight of each sub-classifier is determined by its classification error rate: a classifier with a lower error rate receives a higher weight in the vote of formula (9). In addition, formula (8) shows that for an individual sample whose true label differs from the classification result, y_i h_t(x_i) = -1, so the exponent -α_t y_i h_t(x_i) = α_t is greater than 0 (given ε_t < 0.5, so that α_t > 0) and the exponential factor is greater than 1, which increases the sample's weight in the next round. Conversely, the weight of a correctly classified sample decreases in the next round.
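The training loop of formulas (6) to (9) can be sketched in plain Python as follows. To stay short and self-contained, the sketch substitutes one-level decision stumps for the C4.5 base classifier described in the text, so it illustrates the weighting scheme rather than the patented configuration. All function names are hypothetical, and the error rate is clipped away from 0 and 1 (an added safeguard) to avoid taking the logarithm of zero.

```python
import math


def train_stump(X, y, w):
    """Weighted decision stump with the lowest weighted error, formula (6)."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                pred = [pol if row[f] <= thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best  # (weighted error, feature index, threshold, polarity)


def stump_predict(stump, x):
    _, f, thr, pol = stump
    return pol if x[f] <= thr else -pol


def adaboost_train(X, y, T=10):
    m = len(X)
    D = [1.0 / m] * m                                # uniform initial weights
    model = []
    for _ in range(T):
        stump = train_stump(X, y, D)
        eps = min(max(stump[0], 1e-10), 1 - 1e-10)   # clipped error rate
        alpha = 0.5 * math.log((1 - eps) / eps)      # formula (7)
        model.append((alpha, stump))
        # Formula (8): raise weights of misclassified samples, then normalize.
        D = [d * math.exp(-alpha * yi * stump_predict(stump, xi))
             for d, xi, yi in zip(D, X, y)]
        Z = sum(D)
        D = [d / Z for d in D]
    return model


def adaboost_predict(model, x):
    """Formula (9): sign of the weighted vote of all weak classifiers."""
    H = sum(alpha * stump_predict(stump, x) for alpha, stump in model)
    return 1 if H >= 0 else -1


# No single stump separates this pattern, but three boosted stumps do.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_train(X, y, T=3)
print([adaboost_predict(model, x) for x in X])  # [1, 1, -1, -1, 1, 1]
```

In the first round the two samples at 5 and 6 are misclassified and their weights double from 1/6 to 1/4 after normalization, which steers the second stump toward them.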

The test set samples are fed into the trained classifier to obtain the final classification result of each test sample, as shown in Figure 2.

Table 1 gives the test results obtained by three approaches: classifying the imbalanced data set directly with a C4.5 decision tree, classifying with C4.5 after random upsampling of the positive samples, and classifying with the method of the present invention. The following indicators are used to evaluate classifier performance:

Table 1. Results and comparison of the classification algorithms (the best result under each indicator is marked in bold)

The data in Table 1 show that classifying directly with a C4.5 decision tree gives the highest specificity but the lowest sensitivity, which demonstrates that data imbalance has a marked effect on classification performance in that case: the boundary region of the positive samples is encroached upon and a large number of positive samples are misclassified as negative. Simple random upsampling alleviates this problem, but the gap between sensitivity and specificity remains large. The present invention achieves good sensitivity and good specificity at the same time, and the geometric mean of the two is also the highest among the methods compared, which shows that the invention offers the best trade-off between sensitivity and specificity.
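For reference, the three indicators discussed above can be computed from true and predicted labels as sketched below, using the standard definitions sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP). The function name and the example labels are illustrative and are not data from Table 1.

```python
import math


def sensitivity_specificity_gmean(y_true, y_pred):
    """Return (sensitivity, specificity, geometric mean) for +1/-1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    sens = tp / (tp + fn)   # fraction of positives correctly detected
    spec = tn / (tn + fp)   # fraction of negatives correctly rejected
    return sens, spec, math.sqrt(sens * spec)


# Illustrative labels: half of the positives are missed, no false alarms.
y_true = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, -1, -1, -1, -1, -1, -1]
sens, spec, gmean = sensitivity_specificity_gmean(y_true, y_pred)
print(sens, spec, round(gmean, 4))  # 0.5 1.0 0.7071
```

The example mirrors the pattern described for the plain C4.5 classifier: perfect specificity with poor sensitivity still yields a mediocre geometric mean.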

In summary, the present invention achieves good classification results on imbalanced data sets and effectively eliminates the negative effect of data imbalance on classification.

Claims

1. A method for classifying imbalanced data sets based on adaptive upsampling, where the number of positive samples in the original imbalanced data set is n_p and the number of negative samples is n_n, the method comprising the following steps:

(1) Calculate the imbalance rate IR of the imbalanced data set from n_p and n_n, and from IR calculate the total number G of positive samples that need to be newly generated;

(2) Taking the Euclidean distance as the measure, for each positive sample i, find its K nearest neighbors in the imbalanced data set and compute the proportion of negative samples among those K neighbors, denoted p_i. Sum the p_i values over all positive samples and normalize each p_i by that sum, denoting the result r_i. The r_i values of all positive samples then sum to 1, that is, r_i forms a probability density distribution, and r_i is called the probability of positive sample i;

(3) For each positive sample i, determine the number g_i of new samples to be generated from it, based on the total number G of positive samples to generate and the probability r_i obtained in step (2);

(4) For each positive sample i, randomly select g_i of the K nearest neighbors obtained in step (2), pair each of them with sample i, and randomly choose a point on the line segment connecting each pair as a newly generated positive sample. Once generation is complete, G new positive sample points have been produced; adding them to the original imbalanced training set makes the numbers of positive and negative samples equal, giving a new balanced training set containing n_n positive samples and n_n negative samples;

(5) Record the number of iterations of the Adaboost algorithm as T, use the Adaboost algorithm to train the newly generated balanced training set, and obtain the final classification model after T iterations.