CN110097116A - A virtual sample generation method based on independent component analysis and kernel density estimation
- Publication number
- CN110097116A (application number CN201910357339.XA)
- Authority
- CN
- China
- Prior art keywords
- sample
- component analysis
- independent component
- virtual
- samples
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2134—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
The invention discloses a virtual sample generation method based on independent component analysis and kernel density estimation. In the early stage of system operation, when training samples are scarce, the method uses kernel density estimation to estimate the probability density function of the whole population from the probability density function of a small number of samples. When the attributes of the original samples are correlated, independent component analysis is first applied to remove the correlation, kernel density estimation is then performed, and virtual samples are generated from the estimated probability density function. The invention can alleviate the shortage of training samples when training a machine learning model and improve the model's accuracy. Compared with other virtual sample generation methods, it introduces independent component analysis to handle correlation among sample attributes, which broadens its range of application.
Description
Technical field
The invention belongs to the field of computers and specifically relates to a virtual sample generation method based on independent component analysis and kernel density estimation.
Background
Machine learning methods are being applied in more and more fields. For problems that classical statistics cannot solve, people hope that machine learning can. The number of samples strongly affects the accuracy of machine learning methods. In many cases, however, sampling time and cost are limited, so the number of samples is often insufficient.
Virtual sample generation was first proposed by Niyogi et al. Wang Xu et al. divided virtual sample generation methods into three categories: those based on prior knowledge, those based on perturbation, and those based on the distribution function of the research field. Virtual sample generation has been applied to the construction of energy prediction models, where it significantly improved prediction accuracy. Lee et al. used latent information functions to generate virtual samples and improved the performance of a neural-network-based demand forecasting model. Arora et al. generated virtual samples from empirical formulas and used the augmented data set to build an artificial-neural-network-based computational model for estimating the heat generation rate of a battery.
Existing virtual sample generation methods mainly target samples whose attributes are mutually independent and do not consider correlation between sample attributes.
Summary of the invention
The purpose of the invention is to overcome the above shortcomings and provide a virtual sample generation method based on independent component analysis and kernel density estimation that is more widely applicable and simpler to operate, thereby improving the accuracy of machine learning models.
To achieve this purpose, the invention comprises the following steps:
Step 1: perform independent component analysis on the original sample data to remove the correlation between attributes, and determine whether the analysis converges;
Step 2: if it converges, estimate the probability density function of the independent samples with the multi-kernel density estimation method and sample from it; if it does not converge, estimate the probability density function of the original samples with the multi-kernel density estimation method and sample from it;
Step 3: use the result of the independent component analysis in step 1 to restore the correlation of the data sampled in step 2 in the convergent case, mapping the sampled data back to the original sample space to obtain virtual samples;
Step 4: mix the virtual samples with the original samples to obtain the final expanded sample set.
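The control flow of the four steps can be sketched as follows. This is a minimal illustration, not the patent's implementation: `run_ica` and `bandwidths` are hypothetical callables supplied by the caller (the text does not name an ICA algorithm or a convergence test), and picking one base sample per virtual vector is an assumption.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
# Hypothetical ICA result: (independent components, mixing matrix A, converged flag).
IcaResult = Tuple[Optional[List[Vector]], Optional[Matrix], bool]


def matvec(a: Matrix, v: Sequence[float]) -> Vector:
    """Return the matrix-vector product a @ v."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def kde_sample(rows: List[Vector], h: Sequence[float], k: int,
               rng: random.Random) -> List[Vector]:
    """Draw k vectors as s_i + h * s_r, with s_r ~ N(0, 1) per coordinate.

    Assumption: one base sample is picked per virtual vector and every
    coordinate is perturbed with its own bandwidth h[d].
    """
    virtual = []
    for _ in range(k):
        base = rng.choice(rows)
        virtual.append([b + h_d * rng.gauss(0.0, 1.0) for b, h_d in zip(base, h)])
    return virtual


def generate_virtual_samples(
    x: List[Vector],
    run_ica: Callable[[List[Vector]], IcaResult],
    bandwidths: Callable[[List[Vector]], Sequence[float]],
    k: int,
    rng: random.Random,
) -> List[Vector]:
    """Steps 1-4: ICA, KDE sampling, mapping back, and mixing."""
    s, a, converged = run_ica(x)                      # step 1
    if converged:
        s_v = kde_sample(s, bandwidths(s), k, rng)    # step 2, independent components
        x_v = [matvec(a, v) for v in s_v]             # step 3, x_v = A s_v
    else:
        x_v = kde_sample(x, bandwidths(x), k, rng)    # step 2, fallback on raw samples
    return x + x_v                                    # step 4, expanded sample set
```

Passing the ICA routine and the bandwidth rule as parameters keeps the sketch independent of any particular library; in practice an existing ICA implementation would be wrapped to return this triple.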
When the attributes of the sample data are correlated, independent component analysis yields the following result.
Suppose the small number of collected samples is
$$x = (x_1, x_2, \ldots, x_n), \quad x \in \mathbb{R}^n.$$
Suppose x is obtained by a linear transformation of n mutually independent random variables s, so that
$$s = (s_1, s_2, \ldots, s_n), \quad s \in \mathbb{R}^n.$$
Let A be the mixing matrix. Then
$$x^{(i)} = A s^{(i)}, \quad i = 1, 2, \ldots, m,$$
where A is constant, x denotes the collected samples, and s denotes the independent random variables obtained by independent component analysis.
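To make the mixing relation concrete, the sketch below applies a 2x2 mixing matrix to one vector of independent components and then recovers it with the inverse matrix. The matrix values are hypothetical, and in real ICA the mixing matrix is estimated from the data rather than known in advance.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matvec(a: Matrix, v: Sequence[float]) -> List[float]:
    """Return the matrix-vector product a @ v."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def inverse_2x2(a: Matrix) -> Matrix:
    """Invert a 2x2 matrix; raises ValueError if it is singular."""
    (p, q), (r, t) = a
    det = p * t - q * r
    if det == 0:
        raise ValueError("mixing matrix is singular")
    return [[t / det, -q / det], [-r / det, p / det]]


A = [[2.0, 1.0], [1.0, 3.0]]   # hypothetical mixing matrix
s = [1.0, -2.0]                # one vector of independent components
x = matvec(A, s)               # observed sample, x = A s
s_recovered = matvec(inverse_2x2(A), x)  # unmixing, s = A^{-1} x
```

Because A is constant across samples, the same inverse can be reused for every sample index i.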
The probability density function is estimated with the multi-kernel density estimation method as follows.
The kernel density estimate, in its standard form, is
$$\hat{f}_h(s) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{s - s_i}{h}\right),$$
where $s_1, \ldots, s_n$ are the observed values of s and h is the smoothing coefficient.
The smoothing coefficient h is solved for by minimizing the mean integrated squared error,
$$\mathrm{MISE}(h) = E\!\left[\int \left(\hat{f}_h(s) - f(s)\right)^2 ds\right],$$
where $f(s)$ is the true probability density function of s and $\hat{f}_h(s)$ is its estimate.
Once h has been solved for, the estimation of the probability density function of s is complete.
A Gaussian function is used as the kernel function:
$$K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}.$$
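A one-dimensional Gaussian kernel density estimate can be evaluated in a few lines. The sketch assumes the standard estimator (the average of Gaussian kernels centred on the samples, scaled by h); the function names are illustrative, and the sample values and bandwidth are taken from the patent's worked example.

```python
import math
from typing import Sequence


def gaussian_kernel(u: float) -> float:
    """Standard normal density evaluated at u."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(s: float, samples: Sequence[float], h: float) -> float:
    """Kernel density estimate at point s with smoothing coefficient h > 0."""
    if h <= 0:
        raise ValueError("h must be positive")
    n = len(samples)
    return sum(gaussian_kernel((s - s_i) / h) for s_i in samples) / (n * h)


samples = [-4, -3.5, -2, -1, -0.75, 1, 3, 3.2, 4, 4.2, 4, 6]
density_at_3 = kde(3.0, samples, h=1.4614)  # estimated density near the chosen sample
```

A larger h spreads each kernel more widely and gives a smoother estimate; a smaller h follows the individual samples more closely.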
In step 2, sampling is performed as follows:
$$s_v = s_i + h\, s_r, \quad 1 \le i \le n, \quad s_r \sim N(0, 1),$$
where $s_v$ is the sampled value.
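The sampling rule amounts to picking one original sample at random and adding Gaussian noise scaled by h, which is how one draws from a Gaussian kernel density estimate. A minimal sketch, with a hypothetical function name and uniform selection of the base sample assumed:

```python
import random
from typing import List, Sequence


def draw_virtual(samples: Sequence[float], h: float, k: int,
                 rng: random.Random) -> List[float]:
    """Draw k virtual values s_v = s_i + h * s_r with s_r ~ N(0, 1)."""
    return [rng.choice(samples) + h * rng.gauss(0.0, 1.0) for _ in range(k)]


rng = random.Random(0)  # fixed seed so the draw is reproducible
virtual = draw_virtual([-4, -3.5, -2, -1, -0.75, 1, 3, 3.2, 4, 4.2, 4, 6],
                       h=1.4614, k=5, rng=rng)
```

Passing the random generator explicitly makes the draw repeatable, which helps when comparing models trained with and without virtual samples.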
The original sample data are data for which the number of training samples is insufficient.
Compared with the prior art, in the early stage of system operation, when training samples are scarce, the invention uses kernel density estimation to estimate the probability density function of the whole population from the probability density function of a small number of samples. When the attributes of the original samples are correlated, independent component analysis is first applied to remove the correlation, kernel density estimation is then performed, and virtual samples are generated from the estimated probability density function. The invention can alleviate the shortage of training samples when training a machine learning model and improve the model's accuracy. Compared with other virtual sample generation methods, it introduces independent component analysis to handle correlation among sample attributes, which broadens its range of application.
Description of drawings
Fig. 1 is a flowchart of the invention;
Fig. 2 is a schematic diagram of an embodiment of the invention.
Detailed description
The invention is further described below with reference to the accompanying drawings.
Referring to Fig. 1, the invention comprises the following steps:
Step 1: perform independent component analysis on the original sample data, for which the number of training samples is insufficient, to remove the correlation between attributes, and determine whether the analysis converges;
Step 2: if it converges, estimate the probability density function of the independent samples with the multi-kernel density estimation method and sample from it; if it does not converge, estimate the probability density function of the original samples with the multi-kernel density estimation method and sample from it;
Step 3: use the result of the independent component analysis in step 1 to restore the correlation of the data sampled in step 2 in the convergent case, mapping the sampled data back to the original sample space to obtain virtual samples;
Step 4: mix the virtual samples with the data sampled in the non-convergent case to obtain the final virtual samples.
When the attributes of the sample data are correlated, independent component analysis yields the following result.
Suppose the small number of collected samples is
$$x = (x_1, x_2, \ldots, x_n), \quad x \in \mathbb{R}^n.$$
Suppose x is obtained by a linear transformation of n mutually independent random variables s, so that
$$s = (s_1, s_2, \ldots, s_n), \quad s \in \mathbb{R}^n.$$
Let A be the mixing matrix. Then
$$x^{(i)} = A s^{(i)}, \quad i = 1, 2, \ldots, m,$$
where x denotes the collected samples, s denotes the independent random variables obtained by independent component analysis, and A is constant.
The probability density function is estimated with the multi-kernel density estimation method as follows.
The kernel density estimate, in its standard form, is
$$\hat{f}_h(s) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{s - s_i}{h}\right),$$
where $s_1, \ldots, s_n$ are the observed values of s and h is the smoothing coefficient.
The smoothing coefficient h is solved for by minimizing the mean integrated squared error,
$$\mathrm{MISE}(h) = E\!\left[\int \left(\hat{f}_h(s) - f(s)\right)^2 ds\right],$$
where $f(s)$ is the true probability density function of s and $\hat{f}_h(s)$ is its estimate.
Once h has been solved for, the estimation of the probability density function of s is complete.
A Gaussian function is used as the kernel function:
$$K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}.$$
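The text does not say how the mean integrated squared error is minimized, and the true density f(s) is unknown in practice. One common stand-in, shown here purely as an assumption and not as the patent's procedure, is Silverman's rule of thumb, which minimizes the asymptotic MISE for a Gaussian kernel when the data are themselves roughly normal: h is about 1.06 times the sample standard deviation times n to the power -1/5. It should not be expected to reproduce the h value quoted in the embodiment.

```python
import statistics
from typing import Sequence


def silverman_bandwidth(samples: Sequence[float]) -> float:
    """Rule-of-thumb bandwidth h = 1.06 * sigma * n**(-1/5).

    Assumption: this replaces the unspecified MISE minimization;
    it is a stand-in, not the method described in the text.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    sigma = statistics.stdev(samples)  # sample standard deviation
    if sigma == 0:
        raise ValueError("samples are constant; bandwidth is undefined")
    return 1.06 * sigma * n ** (-0.2)


h = silverman_bandwidth([0.0, 1.0, 2.0, 3.0, 4.0])
```

The bandwidth scales linearly with the spread of the data and shrinks slowly as more samples arrive.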
Sampling is performed as follows:
$$s_v = s_i + h\, s_r, \quad 1 \le i \le n, \quad s_r \sim N(0, 1),$$
where $s_v$ is the sampled value.
Example:
Let s = [-4, -3.5, -2, -1, -0.75, 1, 3, 3.2, 4, 4.2, 4, 6]. After kernel density estimation, the solid line in Fig. 2 is the estimated density, and the dashed lines depict the Gaussian kernel function placed on each original sample. To generate a virtual sample, first select an original sample; Fig. 2 selects the original sample at s = 3, and the blue curve depicts the Gaussian kernel function at s = 3. Then generate a one-dimensional random number $s_r$ that follows the standard normal distribution; here $s_r = 0.29$. Finally, compute the virtual independent sample $s_v$ from the h obtained by kernel density estimation. In this example s = 3, $s_r = 0.29$ and h = 1.4614, so the formula gives $s_v = 3 + 0.29 \times 1.4614 \approx 3.4238$.
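The arithmetic of this example is easy to check directly. With the full bandwidth h = 1.4614 the result rounds to 3.4238; if h is first truncated to 1.461 it rounds to 3.4237, so the last digit depends only on how h is rounded.

```python
s_i = 3.0      # selected original sample
s_r = 0.29     # standard normal draw used in the example
h = 1.4614     # smoothing coefficient from the example

s_v = s_i + h * s_r                  # virtual independent sample
s_v_truncated_h = s_i + 1.461 * s_r  # same formula with h truncated to 1.461

print(round(s_v, 4))              # 3.4238
print(round(s_v_truncated_h, 4))  # 3.4237
```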
Sampling is repeated according to the above steps until a satisfactory number of virtual independent samples $s_v$ has been obtained. Finally, according to formula (3), the virtual independent samples are mapped back to the original sample space to obtain the virtual samples:
$$x_v^{(i)} = A s_v^{(i)}, \quad i = 1, 2, \ldots, m.$$
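Mapping back through the mixing matrix is what restores the correlation between attributes. The sketch below uses a hypothetical 2x2 mixing matrix and four hand-picked component vectors whose coordinates are uncorrelated, then shows that the mapped samples have non-zero covariance. The helper names are illustrative.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def map_back(a: Matrix, s_virtual: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply x_v = A s_v to every virtual independent sample."""
    return [[sum(a_ij * s_j for a_ij, s_j in zip(row, s_v)) for row in a]
            for s_v in s_virtual]


def covariance(points: Sequence[Sequence[float]], j: int, k: int) -> float:
    """Population covariance between coordinates j and k."""
    n = len(points)
    mean_j = sum(p[j] for p in points) / n
    mean_k = sum(p[k] for p in points) / n
    return sum((p[j] - mean_j) * (p[k] - mean_k) for p in points) / n


A = [[1.0, 0.0], [1.0, 1.0]]  # hypothetical mixing matrix
s_virtual = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
x_virtual = map_back(A, s_virtual)

cov_s = covariance(s_virtual, 0, 1)  # 0.0: components are uncorrelated
cov_x = covariance(x_virtual, 0, 1)  # 1.0: correlation restored by A
```

The second attribute of each mapped sample is the sum of both components, which is why it moves together with the first attribute.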
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910357339.XA CN110097116A (en) | 2019-04-29 | 2019-04-29 | A kind of virtual sample generation method based on independent component analysis and Density Estimator |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910357339.XA CN110097116A (en) | 2019-04-29 | 2019-04-29 | A kind of virtual sample generation method based on independent component analysis and Density Estimator |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110097116A true CN110097116A (en) | 2019-08-06 |
Family
ID=67446517
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910357339.XA Pending CN110097116A (en) | 2019-04-29 | 2019-04-29 | A kind of virtual sample generation method based on independent component analysis and Density Estimator |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110097116A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111160619A (en) * | 2019-12-06 | 2020-05-15 | 北京国电通网络技术有限公司 | Power load prediction method based on data derivation |
CN112098915A (en) * | 2020-11-05 | 2020-12-18 | 武汉格蓝若智能技术有限公司 | Method for evaluating secondary errors of multiple voltage transformers under double-bus segmented wiring |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111199270B (en) | Regional wave height forecasting method and terminal based on deep learning | |
WO2017157183A1 (en) | Automatic multi-threshold characteristic filtering method and apparatus | |
CN108563837B (en) | Method and system for correcting model parameters of alluvial river water sand model in real time | |
CN105790258B (en) | Latin hypercube probability load flow calculation method based on normal state Copula functions | |
CN110417694A (en) | A communication signal modulation method identification method | |
CN108416382A (en) | One kind is based on iteration sampling and a pair of of modified Web graph of multi-tag as training convolutional neural networks method | |
CN108879732A (en) | Transient stability evaluation in power system method and device | |
CN107181474A (en) | A kind of kernel adaptive algorithm filter based on functional expansion | |
CN107292439A (en) | A kind of method and apparatus for the short-term wind speed forecasting that Copula functions are mixed based on time-varying | |
CN114611631A (en) | Method, system, device and medium for fast training a model from a partial training set | |
CN115905978A (en) | Fault diagnosis method and system based on hierarchical federated learning | |
CN110097116A (en) | A kind of virtual sample generation method based on independent component analysis and Density Estimator | |
CN104636486A (en) | Method and device for extracting features of users on basis of non-negative alternating direction change | |
CN107276561A (en) | Based on the Hammerstein system identifying methods for quantifying core least mean-square error | |
CN106227767A (en) | A kind of based on the adaptive collaborative filtering method of field dependency | |
WO2023050649A1 (en) | Esg index determination method based on data complementing, and related product | |
CN118193507A (en) | Traffic flow data interpolation method and system based on tensor completion and graph annotation meaning network | |
CN103268614B (en) | A kind of for many prospects be divided into cut prospect spectrum drawing generating method | |
CN114021445B (en) | A Non-local Prediction Method of Ocean Eddy Mixture Based on Random Forest Model | |
CN101477686A (en) | Nonsupervision image segmentation process based on clone selection | |
CN104933011A (en) | Relation model determination method and device | |
CN103870596A (en) | Enhanced constraint conditional random field model for Web object information extraction | |
CN110780604B (en) | Space-time signal recovery method based on space-time smoothness and time correlation | |
CN114118381A (en) | Learning method, device, equipment and medium based on adaptive aggregation sparse communication | |
CN110874453A (en) | A Self-service Expansion Method Based on Correlation Coefficient Criterion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190806 |