CN110097116A

CN110097116A - A virtual sample generation method based on independent component analysis and kernel density estimation

Info

Publication number: CN110097116A
Application number: CN201910357339.XA
Authority: CN
Inventors: 董小社; 袁坤; 王龙翔; 张兴军; 王强; 王宇菲
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2019-04-29
Filing date: 2019-04-29
Publication date: 2019-08-06

Abstract

The invention discloses a virtual sample generation method based on independent component analysis and kernel density estimation. In the initial stage of system operation, when the number of training samples is insufficient, the invention uses kernel density estimation to estimate the probability density function of the whole population from a small number of samples. When the attributes of the original samples are correlated, independent component analysis is first used to remove the correlation between the attributes, kernel density estimation is then performed, and virtual samples are generated from the estimated probability density function. The invention can alleviate the shortage of training samples when training a machine learning model and improve the accuracy of the model. Compared with other virtual sample generation methods, the invention introduces independent component analysis to handle correlation among sample attributes, which broadens its range of application.

Description

A Virtual Sample Generation Method Based on Independent Component Analysis and Kernel Density Estimation

Technical field

The invention belongs to the field of computers, and specifically relates to a virtual sample generation method based on independent component analysis and kernel density estimation.

Background art

Machine learning methods are increasingly applied in many fields. For problems that classical statistics cannot solve, people hope to use machine learning instead. The number of samples has a large influence on the accuracy of machine learning methods, but in many cases sampling time and cost are limited, so the number of samples is often insufficient.

Virtual sample generation was first proposed by Niyogi et al. Wang Xu et al. divided virtual sample generation methods into three categories: those based on prior knowledge, those based on perturbation, and those based on the distribution function of the research field. Virtual sample generation has been applied to the construction of energy prediction models, where it significantly improved prediction accuracy. Lee et al. used latent information functions to generate virtual samples and improved the performance of a neural-network-based demand forecasting model. Arora et al. generated virtual samples from empirical formulas and used the data set augmented with virtual samples to build an artificial-neural-network-based computational model that estimates the heat generation rate of a battery.

Existing virtual sample generation methods mainly target samples whose attributes are mutually independent and do not consider correlation between sample attributes.

Summary of the invention

The purpose of the invention is to overcome the above shortcomings and to provide a virtual sample generation method based on independent component analysis and kernel density estimation that is more widely applicable and simpler to operate, thereby improving the accuracy of machine learning models.

To achieve this purpose, the invention comprises the following steps:

Step 1: perform independent component analysis on the original sample data to remove the correlation between attributes, and judge whether the analysis converges;

Step 2: if it converges, estimate the probability density function of the independent samples by multi-kernel density estimation and sample from it; if it does not converge, estimate the probability density function of the original samples by multi-kernel density estimation and sample from it;

Step 3: use the result of the independent component analysis in Step 1 to restore the correlation in the data sampled in the convergent case of Step 2, so that the sampled data are mapped back to the original sample space to obtain virtual samples;

Step 4: mix the virtual samples with the original samples to obtain the final expanded sample set.
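The control flow of the four steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `generate_virtual_samples` and its parameters are hypothetical, the independent component analysis and the bandwidth selection are supplied by the caller as `ica` and `bandwidth`, and a whole sample row is picked and perturbed attribute by attribute, which is one possible reading of the sampling step.

```python
import random


def generate_virtual_samples(X, ica, bandwidth, n_virtual, rng=None):
    """Return X extended with n_virtual virtual samples.

    X         : list of attribute vectors (lists of floats).
    ica       : hypothetical caller-supplied routine; ica(X) returns
                (S, A, converged) with x = A s for every sample.
    bandwidth : hypothetical caller-supplied routine returning the
                smoothing coefficient h for one attribute column.
    """
    rng = rng or random.Random()

    # Step 1: independent component analysis and convergence check.
    S, A, converged = ica(X)

    # Step 2: estimate and sample in the independent space if ICA
    # converged, otherwise directly in the original space.
    base = S if converged else X
    dims = len(base[0])
    h = [bandwidth([row[d] for row in base]) for d in range(dims)]

    virtual = []
    for _ in range(n_virtual):
        row = rng.choice(base)  # pick one existing sample
        v = [row[d] + h[d] * rng.gauss(0.0, 1.0) for d in range(dims)]
        if converged:
            # Step 3: restore the correlation, x_v = A s_v.
            v = [sum(a * c for a, c in zip(a_row, v)) for a_row in A]
        virtual.append(v)

    # Step 4: mix virtual samples with the original samples.
    return [list(x) for x in X] + virtual
```

Keeping `ica` and `bandwidth` as parameters leaves the choice of ICA algorithm and of bandwidth selector open, since the steps above do not fix either.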

When the attributes of the sample data are correlated, the result of the independent component analysis is as follows:

Assume that a small number of samples has been collected,

$$x = (x_1, x_2, \ldots, x_n), \quad x \in \mathbb{R}^n$$

Assume that x is obtained by a linear transformation of n mutually independent random variables s,

$$s = (s_1, s_2, \ldots, s_n), \quad s \in \mathbb{R}^n$$

With A as the mixing matrix,

$$x^{(i)} = A s^{(i)}, \quad i = 1, 2, \ldots, m$$

where A is a constant matrix, x denotes the collected samples, and s denotes the independent random variables obtained by independent component analysis.
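A small numerical illustration of the mixing model x = As may help. In this sketch the mixing matrix `A`, the two source distributions and the helper names `mix` and `correlation` are all assumptions chosen for illustration; it only shows that independent sources become correlated attributes after mixing, which is the situation the independent component analysis is meant to undo.

```python
import random


def mix(A, s):
    """Return x = A s for a mixing matrix A given as a list of rows."""
    return [sum(a * v for a, v in zip(row, s)) for row in A]


def correlation(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


rng = random.Random(0)

# Two independent sources: one uniform, one Gaussian (illustrative choice).
S = [[rng.uniform(-1.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(2000)]

A = [[1.0, 0.5],
     [0.3, 1.0]]  # hypothetical mixing matrix

X = [mix(A, s) for s in S]

corr_sources = correlation([s[0] for s in S], [s[1] for s in S])
corr_mixed = correlation([x[0] for x in X], [x[1] for x in X])
print("correlation between sources:", corr_sources)
print("correlation between mixed attributes:", corr_mixed)
```

The correlation between the sources should be close to zero, while the mixed attributes are clearly correlated.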

The probability density function is estimated by multi-kernel density estimation as follows:

The mathematical expression of the kernel density estimate is, in its standard form,

$$\hat{f}_h(s) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{s - s_i}{h}\right)$$

where s_i (1 ≤ i ≤ n) are the samples, K is the kernel function and h is the smoothing coefficient (bandwidth).

The smoothing coefficient h is solved from the mean integrated squared error (MISE) function,

$$\mathrm{MISE}(h) = E\left[\int \left(\hat{f}_h(s) - f(s)\right)^2 \, ds\right]$$

where f(s) is the true probability density function of s and \hat{f}_h(s) is the estimate of f(s);

Once the smoothing coefficient h has been solved, the estimation of the probability density function of s is complete.

A Gaussian function is used as the kernel function, with the expression

$$K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$$
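For reference, a Gaussian-kernel density estimate can be sketched in a few lines of Python. The function names are hypothetical, and one substitution should be noted plainly: the text solves h from the mean integrated squared error but does not spell out the procedure, so this sketch uses the normal reference rule of thumb, h = 1.06·σ·n^(-1/5), as a stand-in. That rule need not reproduce the h obtained by the method described here.

```python
import math
import statistics


def gaussian_kernel(u):
    """Standard Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(s, samples, h):
    """Kernel density estimate at point s with bandwidth h."""
    n = len(samples)
    return sum(gaussian_kernel((s - s_i) / h) for s_i in samples) / (n * h)


def rule_of_thumb_bandwidth(samples):
    """Normal reference rule h = 1.06 * sigma * n**(-1/5).

    Assumption: used here in place of the MISE-based solution of h,
    whose exact procedure is not given in the text.
    """
    sigma = statistics.stdev(samples)
    return 1.06 * sigma * len(samples) ** (-0.2)


# Illustrative data, not taken from the document.
data = [0.2, 0.5, 0.9, 1.4, 1.5, 2.3, 3.1]
h = rule_of_thumb_bandwidth(data)
print("bandwidth:", h)
print("density at s = 1.0:", kde(1.0, data, h))
```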

In Step 2, sampling is performed as follows:

$$s_v = s_i + h s_r, \quad 1 \le i \le n, \quad s_r \sim N(0, 1)$$

where s_v is the sampled value.
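The sampling rule maps directly onto code: pick one existing sample s_i and add Gaussian noise scaled by h. The sketch below assumes that s_i is chosen uniformly at random, which makes the draw equivalent to sampling from a Gaussian kernel density estimate with bandwidth h; the function name `sample_virtual` is hypothetical.

```python
import random


def sample_virtual(samples, h, count, rng=None):
    """Draw `count` values s_v = s_i + h * s_r.

    s_i is picked uniformly at random from `samples` (assumption) and
    s_r is a standard normal draw, s_r ~ N(0, 1).
    """
    rng = rng or random.Random()
    return [rng.choice(samples) + h * rng.gauss(0.0, 1.0) for _ in range(count)]


# Illustrative usage with a fixed seed for reproducibility.
virtual = sample_virtual([-1.0, 0.0, 2.0], h=0.5, count=5, rng=random.Random(42))
print(virtual)
```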

The original sample data are data for which the number of training samples is insufficient.

Compared with the prior art, in the initial stage of system operation, when the number of training samples is insufficient, the invention uses kernel density estimation to estimate the probability density function of the whole population from a small number of samples. When the attributes of the original samples are correlated, independent component analysis is first used to remove the correlation between the attributes, kernel density estimation is then performed, and virtual samples are generated from the estimated probability density function. The invention can alleviate the shortage of training samples when training a machine learning model and improve the accuracy of the model. Compared with other virtual sample generation methods, the invention introduces independent component analysis to handle correlation among sample attributes, which broadens its range of application.

Description of drawings

Fig. 1 is a flowchart of the invention;

Fig. 2 is a schematic diagram of an embodiment of the invention.

Detailed description

The invention is further described below with reference to the accompanying drawings.

Referring to Fig. 1, the invention comprises the following steps:

Step 1: perform independent component analysis on the original sample data, for which the number of training samples is insufficient, to remove the correlation between attributes, and judge whether the analysis converges;

Step 4: mix the virtual samples with the data sampled in the non-convergent case to obtain the final virtual samples.

$$x = (x_1, x_2, \ldots, x_n), \quad x \in \mathbb{R}^n$$

$$s = (s_1, s_2, \ldots, s_n), \quad s \in \mathbb{R}^n$$

With A as the mixing matrix,

$$x^{(i)} = A s^{(i)}, \quad i = 1, 2, \ldots, m$$

where x denotes the collected samples, s denotes the independent random variables obtained by independent component analysis, and A is a constant matrix.

Sampling is performed as follows:

$$s_v = s_i + h s_r, \quad 1 \le i \le n, \quad s_r \sim N(0, 1)$$

where s_v is the sampled value.

Example:

Let s = [-4, -3.5, -2, -1, -0.75, 1, 3, 3.2, 4, 4.2, 4, 6]. After kernel density estimation, the solid line in Fig. 2 shows the estimated density, and the dashed lines depict the Gaussian kernel placed on each original sample. To generate a virtual sample, first select an original sample; Fig. 2 selects the original sample at s = 3, and the blue curve depicts the Gaussian kernel at s = 3. Then generate a one-dimensional random number s_r that follows the standard normal distribution; here s_r = 0.29. Finally, compute the virtual independent sample s_v using the h obtained from kernel density estimation. In this example s = 3, s_r = 0.29 and h = 1.4614, so the formula gives s_v = 3 + 0.29 × 1.461 ≈ 3.4237 (with h rounded to 1.461).
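The arithmetic of this example can be checked with a few lines of Python. The variable names are illustrative; the numbers are those quoted in the example. The calculation is run once with h rounded to 1.461, as in the text's arithmetic, and once with the unrounded h = 1.4614.

```python
s_i = 3.0      # selected original sample
s_r = 0.29     # standard normal draw quoted in the example
h = 1.4614     # bandwidth quoted in the example

# The text's arithmetic, with h rounded to 1.461.
s_v_rounded_h = s_i + 1.461 * s_r

# The same formula with the unrounded bandwidth.
s_v_full_h = s_i + h * s_r

print(round(s_v_rounded_h, 4), round(s_v_full_h, 4))
```

The two results differ only in the fourth decimal place, so the rounding of h does not materially change the virtual sample.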

Sampling is repeated according to the above steps until a satisfactory number of virtual independent samples s_v has been obtained. Finally, according to formula (3), the virtual independent samples are mapped back to the original sample space to obtain the virtual samples,

$$x_v^{(i)} = A s_v^{(i)}, \quad i = 1, 2, \ldots, m.$$
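Mapping the virtual independent samples back is a single matrix-vector product per sample. In the minimal sketch below, the function name `map_back`, the mixing matrix `A` and the sample values are placeholders chosen for illustration; in practice A would come from the independent component analysis of Step 1.

```python
def map_back(A, S_v):
    """Apply x_v = A s_v to every virtual independent sample in S_v."""
    return [[sum(a * v for a, v in zip(row, s_v)) for row in A] for s_v in S_v]


A = [[1.0, 0.5],
     [0.3, 1.0]]                     # hypothetical mixing matrix
S_v = [[1.0, 0.0], [0.0, 2.0]]       # hypothetical virtual independent samples

X_v = map_back(A, S_v)
print(X_v)
```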

Claims

1. A virtual sample generation method based on independent component analysis and kernel density estimation, characterized in that it comprises the following steps:

Step 1: perform independent component analysis on the original sample data to remove the correlation between attributes, and judge whether the analysis converges;

Step 2: if it converges, estimate the probability density function of the independent samples by multi-kernel density estimation and sample from it; if it does not converge, estimate the probability density function of the original samples by multi-kernel density estimation and sample from it;

Step 3: use the result of the independent component analysis in Step 1 to restore the correlation in the data sampled in the convergent case of Step 2, so that the sampled data are mapped back to the original sample space to obtain virtual samples;

Step 4: mix the virtual samples with the original samples to obtain the final expanded sample set.

2. The virtual sample generation method based on independent component analysis and kernel density estimation according to claim 1, characterized in that, when the attributes of the sample data are correlated, the result of the independent component analysis is as follows:

Assume that a small number of samples has been collected,

$$x = (x_1, x_2, \ldots, x_n), \quad x \in \mathbb{R}^n$$

Assume that x is obtained by a linear transformation of n mutually independent random variables s,

$$s = (s_1, s_2, \ldots, s_n), \quad s \in \mathbb{R}^n$$

With A as the mixing matrix,

$$x^{(i)} = A s^{(i)}, \quad i = 1, 2, \ldots, m$$

where x denotes the collected samples, s denotes the independent random variables obtained by independent component analysis, and A is a constant matrix.

3. The virtual sample generation method based on independent component analysis and kernel density estimation according to claim 1, characterized in that the probability density function is estimated by multi-kernel density estimation as follows:

The mathematical expression of the kernel density estimate is, in its standard form,

$$\hat{f}_h(s) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{s - s_i}{h}\right)$$

where s_i (1 ≤ i ≤ n) are the samples, K is the kernel function and h is the smoothing coefficient (bandwidth).

The smoothing coefficient h is solved from the mean integrated squared error (MISE) function,

$$\mathrm{MISE}(h) = E\left[\int \left(\hat{f}_h(s) - f(s)\right)^2 \, ds\right]$$

where f(s) is the true probability density function of s and \hat{f}_h(s) is the estimate of f(s);

Once the smoothing coefficient h has been solved, the estimation of the probability density function of s is complete.

4. The virtual sample generation method based on independent component analysis and kernel density estimation according to claim 3, characterized in that a Gaussian function is used as the kernel function, with the expression

$$K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$$

5. The virtual sample generation method based on independent component analysis and kernel density estimation according to claim 1, characterized in that, in Step 2, sampling is performed as follows:

$$s_v = s_i + h s_r, \quad 1 \le i \le n, \quad s_r \sim N(0, 1)$$

where s_v is the sampled value.

6. The virtual sample generation method based on independent component analysis and kernel density estimation according to claim 1, characterized in that the original sample data are data for which the number of training samples is insufficient.