CN117473717B - A data quality analysis method based on Bernoulli-Gaussian model and EM algorithm - Google Patents


Info

Publication number
CN117473717B
Authority
CN
China
Prior art keywords
bernoulli
data quality
gaussian model
algorithm
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311354617.9A
Other languages
Chinese (zh)
Other versions
CN117473717A (en)
Inventor
杨玲
喻杨康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN202311354617.9A
Publication of CN117473717A
Application granted
Publication of CN117473717B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G06F17/12 Simultaneous equations, e.g. systems of linear equations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00 Details relating to CAD techniques
    • G06F2111/08 Probabilistic or stochastic CAD

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • General Engineering & Computer Science (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Operations Research (AREA)
  • Computing Systems (AREA)
  • Geometry (AREA)
  • Evolutionary Computation (AREA)
  • Computer Hardware Design (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm, relating to the technical field of spatial geography. The method comprises: preprocessing collected geodetic data and constructing a Bernoulli-Gaussian model from the statistics of the gross errors; and calculating the Bernoulli-Gaussian model parameters based on a linear observation equation and analyzing the quality of the geodetic data. By calculating the Bernoulli-Gaussian model parameters in the linear observation equation, the invention obtains the precision of the observations together with the gross error rate and the gross error size in the observations. No threshold needs to be introduced to distinguish abnormal values from normal values, so human intervention in the data quality assessment is avoided and the analysis result is more scientific and reliable.

Description

Data quality analysis method based on Bernoulli-Gaussian model and EM algorithm
Technical Field
The invention relates to the technical field of spatial geography, and in particular to a data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm.
Background
In modern geodetic practice, a large number of observations have been recorded or sampled in recent years. Such data sets are almost never free of outliers, so outlier handling has become part of the geodetic operator's daily work. From a statistical point of view, outliers are considered to occur because the observation model fails to provide an adequate fit or statistical interpretation. Beckman and Cook (1983) classify the inadequacy of observation models into two categories: local model defects and global model defects. Local model defects are causes that concern only the outlying observations rather than the whole model; they may require the outliers to be handled separately, because the alternative model that accommodates the outliers is generally unknown. Global model defects are causes that lead to the existing model being replaced by a new or revised model for the entire sample; they treat outliers as a known property of the data and may lead to the existing model being replaced by a mixture model.
Observation models can be divided into two categories according to the cause of the outliers. In general, a local model defect indicates that outliers should be handled separately from normal observations. In this case, besides the existing normal model that describes the normal observations, an anomaly model is used to explain the outliers. Many different normal-anomaly models exist; the most common are the two models proposed by Dixon (1950). One is the mean shift model, which regards an outlier as the result of a shift in the mean; the other is the variance inflation model, which regards an outlier as the result of an inflated variance (Lehmann et al. 2020). By contrast, a global model defect requires the existing model to be replaced by a new or revised model for the whole sample, in which case a mixture model is usually used to combine normal and abnormal observations. Following Hawkins (1980), two common mixture models can be built: the location contamination model and the scale contamination model (Lehmann 2013). The former consists of a normal distribution and a location-contaminating distribution; the latter consists of a normal distribution and a scale-contaminating distribution.
However, the approaches in common use today have drawbacks. In geodesy, outliers are usually caused by gross errors, so it is natural to explain the origin of outliers in geodetic data processing through the nature of gross errors. A gross error model therefore plays an important role in establishing the distinction and connection between the different observation models.
Disclosure of Invention
The present invention has been made in view of the above-mentioned problems of the prior art: when outliers are processed in a data set obtained in geodetic practice, a threshold has to be introduced to distinguish outliers from normal values, which amounts to human intervention in the data quality assessment.
Therefore, the problem to be solved by the invention is how to distinguish abnormal values from normal values without introducing any threshold, thereby avoiding human intervention in the data quality assessment and making the analysis result more scientific and reliable.
In order to solve the technical problems, the invention provides the following technical scheme:
In a first aspect, an embodiment of the present invention provides a data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm, which includes: preprocessing collected geodetic data and constructing the Bernoulli-Gaussian model from the statistics of the gross errors; constructing a mixture model of the observations in a linear observation equation based on the Expectation-Maximization (EM) algorithm; and calculating the Bernoulli-Gaussian model parameters based on the linear observation equation and analyzing the quality of the geodetic data.
As a preferred scheme of the data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm, the gross error comprises a Bernoulli variable and a Gaussian variable and is decomposed as follows:
$$e_g = Z\nabla$$
where $e_g$ is the gross error vector, $Z$ is the pattern matrix, and $\nabla$ is the size vector.
As a preferred scheme of the data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm, the pattern matrix $Z$ is a diagonal matrix that takes one of $2^m$ possible values:
$$Z = Z_i,\quad i \in \{0, \dots, 2^m - 1\}$$
The $j$-th diagonal element of the pattern matrix $Z$ obeys a Bernoulli distribution with parameter $\varepsilon_j$, so the probability distribution of $Z$ is:
$$P(Z = Z_i) = \prod_{j=1}^{m} \varepsilon_j^{Z_{ij}} (1 - \varepsilon_j)^{1 - Z_{ij}}$$
where $Z_{ij}$ is the $j$-th diagonal element of $Z_i$ and $\varepsilon_j$ is the gross error rate of the $j$-th observation. The size vector $\nabla$ obeys a multidimensional Gaussian distribution:
$$\nabla \sim N(\mu_\nabla, \Sigma_\nabla)$$
where $\mu_\nabla$ is the mean vector of the gross error sizes and $\Sigma_\nabla$ is their variance-covariance matrix.
As a preferred scheme of the data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm, the observations comprise the true value, the random error and the gross error:
$$y = Ax + e + e_g$$
where $y$ is the observation vector, $A$ is the non-random design matrix, $x$ is the parameter vector to be estimated, $e$ is the random error vector with zero mean and variance $\Sigma$, and $e_g$ is the gross error vector.
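As a preferred scheme of the data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm, constructing the mixture model of the observations in the linear observation equation comprises the following steps.
According to the additivity of the Gaussian distribution, the probability distribution of the observation vector $y$ is:
$$f(y \mid x, \Sigma, \varepsilon, \mu_\nabla, \Sigma_\nabla) = \sum_{i=0}^{2^m - 1} \left[ \prod_{j=1}^{m} \varepsilon_j^{Z_{ij}} (1 - \varepsilon_j)^{1 - Z_{ij}} \right] N\!\left(y;\; Ax + Z_i \mu_\nabla,\; \Sigma + Z_i \Sigma_\nabla Z_i^{T}\right)$$
where $f$ is the probability distribution of $y$; $x$, $\Sigma$, $\varepsilon$, $\mu_\nabla$ and $\Sigma_\nabla$ are the parameters of the mixture model to be estimated; $\mu_\nabla$ and $\Sigma_\nabla$ are the mean vector and variance-covariance matrix of the gross error size vector; $Z_{ij}$ is the $j$-th diagonal element of the $i$-th pattern matrix $Z_i$; and $\varepsilon_j$ is the gross error rate of the $j$-th observation.
The parameters to be estimated are calculated as follows. If the gross error rates of the different observations in one type of observation equation are the same, then:
$$\varepsilon_j = \varepsilon,\quad j \in \{1, \dots, m\}$$
$\Sigma$ and the gross error size parameters are each divided into a known cofactor part and an unknown factor; for $\Sigma$:
$$\Sigma = \sigma^2 Q$$
where $Q$ and the corresponding cofactors of the gross error size parameters are known cofactor vectors or matrices. After this conversion, the parameters to be estimated are $x$, $\sigma^2$, $\varepsilon$ and the unknown factors of the gross error size parameters, and the probability distribution of $y$ is expressed in terms of them.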
As a preferred scheme of the data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm, the Bernoulli-Gaussian model parameters are calculated from the linear observation equation with the Expectation-Maximization algorithm: initial values are given for $x$, $\sigma^2$, $\varepsilon$ and the gross error size parameters, and each iteration of the algorithm is divided into an E step and an M step.
As a preferred scheme of the data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm, the E step computes the posterior probability $\gamma(Z_i)$ of each pattern $Z_i$:
$$\gamma(Z_i) = \frac{P(Z = Z_i)\, f(y \mid Z_i)}{\sum_{k=0}^{2^m - 1} P(Z = Z_k)\, f(y \mid Z_k)}$$
where $P(Z = Z_i)$ is the prior probability of pattern $Z_i$ and $f(y \mid Z_i)$ is the Gaussian density of $y$ given pattern $Z_i$, both evaluated at the current parameter estimates. The M step then updates the parameter estimates; its formulas involve two auxiliary quantities, $M_i$ and $N_i$.
In each iteration, the estimates from the previous iteration are substituted for the parameter values on the right-hand side of the M-step formulas, and the values obtained on the left-hand side are taken as the new parameter estimates of the current iteration. This is repeated until the iteration converges.
In a second aspect, to further solve the security problem existing in geodetic measurement, an embodiment provides a system for data quality analysis based on the Bernoulli-Gaussian model and the EM algorithm, which includes: a Bernoulli-Gaussian model module for decomposing and calculating the gross errors and obtaining the probability mass function and probability density function describing whether a gross error occurs; a mixture model construction module for constructing the mixture model of the observations in the linear observation equation; and a parameter evaluation module for calculating the values of the Bernoulli-Gaussian model parameters in the linear observation equation and analyzing the quality of the geodetic data.
In a third aspect, an embodiment of the present invention provides a computer device, comprising a memory and a processor, the memory storing a computer program, wherein the computer program when executed by the processor implements any step of the data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm according to the first aspect of the present invention.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium having a computer program stored thereon, wherein the computer program when executed by a processor implements any of the steps of the data quality analysis method according to the first aspect of the present invention based on Bernoulli-Gaussian model and EM algorithm.
The beneficial effects of the invention are as follows. A Bernoulli-Gaussian (BG) statistical model of gross errors is proposed, together with a method, based on the EM (Expectation-Maximization) algorithm, for estimating the BG model parameters in a linear observation equation. The BG model parameters can be estimated in a single observation equation or in several observation equations. The invention yields not only the precision of the observations but also the gross error rate and the gross error size in the observations, providing a comprehensive means of analyzing geodetic data quality. No threshold needs to be introduced to distinguish abnormal values from normal values, so human intervention in the data quality assessment is avoided and the analysis result is more scientific and reliable.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
Fig. 1 shows the PDFs estimated during the iterations of the EM algorithm in Example 1.
Fig. 2 shows the histogram and the PDFs estimated with the Gaussian model and the mixture model for Case 1 in Example 2.
Fig. 3 shows the histogram and the PDFs estimated with the Gaussian model and the mixture model for Case 2 in Example 2.
Fig. 4 shows the histogram and the PDFs estimated with the Gaussian model and the mixture model for Case 3 in Example 2.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
Example 1
Referring to fig. 1, a first embodiment of the present invention provides a data quality analysis method based on a Bernoulli-Gaussian model and an EM algorithm, comprising the steps of:
S1, preprocessing the collected geodetic data, and constructing a Bernoulli-Gaussian model from the statistics of the gross errors.
Preferably, the geodetic data are collected and preprocessed, and the statistics of the gross errors are compiled.
Further, the gross error comprises a Bernoulli variable and a Gaussian variable and is decomposed as follows:
$$e_g = Z\nabla$$
where $e_g$ is the gross error vector, $Z$ is the pattern matrix, and $\nabla$ is the size vector.
Further, the pattern matrix $Z$ is a diagonal matrix that takes one of $2^m$ possible values:
$$Z = Z_i,\quad i \in \{0, \dots, 2^m - 1\}$$
The $j$-th diagonal element of the pattern matrix $Z$ obeys a Bernoulli distribution with parameter $\varepsilon_j$, so the probability distribution of $Z$ is:
$$P(Z = Z_i) = \prod_{j=1}^{m} \varepsilon_j^{Z_{ij}} (1 - \varepsilon_j)^{1 - Z_{ij}}$$
where $Z_{ij}$ is the $j$-th diagonal element of $Z_i$ and $\varepsilon_j$ is the gross error rate of the $j$-th observation.
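As an illustration of the pattern matrix and its Bernoulli probabilities, the following Python sketch enumerates all 2^m diagonal patterns for a small m and evaluates the product of Bernoulli terms for each one. The function name `pattern_probabilities` and the rates in `eps` are assumptions made for this example and are not taken from the patent.

```python
from itertools import product
import math

def pattern_probabilities(eps):
    """List every diagonal pattern of Z together with its probability.

    eps[j] is the gross error rate of observation j (hypothetical values).
    A pattern is a tuple of 0/1 flags: 1 means observation j has a gross error.
    """
    m = len(eps)
    table = []
    for diag in product((0, 1), repeat=m):  # 2**m patterns
        # Product of Bernoulli terms: eps_j if flagged, (1 - eps_j) otherwise.
        p = math.prod(e if z else 1.0 - e for z, e in zip(diag, eps))
        table.append((diag, p))
    return table

eps = [0.05, 0.10, 0.20]  # hypothetical gross error rates, m = 3
table = pattern_probabilities(eps)
for diag, p in table:
    print(diag, round(p, 6))
print("sum of probabilities:", sum(p for _, p in table))
```

The number of patterns doubles with every additional observation, so explicit enumeration like this is only practical for small m.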
Preferably, the size vector $\nabla$ obeys a multidimensional Gaussian distribution:
$$\nabla \sim N(\mu_\nabla, \Sigma_\nabla)$$
where $\mu_\nabla$ is the mean vector of the gross error sizes and $\Sigma_\nabla$ is their variance-covariance matrix.
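The decomposition of the gross error into a Bernoulli pattern and a Gaussian size can be sketched as a sampling routine. The code below is a minimal illustration using NumPy; the function name `sample_gross_errors` and all numerical values are hypothetical, and the pattern and the size vector are assumed to be drawn independently of each other.

```python
import numpy as np

def sample_gross_errors(eps, mean, cov, rng):
    """Draw one gross error vector as (diagonal 0/1 pattern) times (Gaussian size)."""
    eps = np.asarray(eps, dtype=float)
    # Diagonal of the pattern matrix: element j equals 1 with probability eps[j].
    z = (rng.random(eps.size) < eps).astype(float)
    # Size vector drawn from a multivariate Gaussian distribution.
    size = rng.multivariate_normal(mean, cov)
    return np.diag(z) @ size, z

rng = np.random.default_rng(0)
eps = [0.1, 0.1, 0.1, 0.1]           # hypothetical gross error rates
mean = np.full(4, 5.0)               # hypothetical mean of the gross error sizes
cov = 4.0 * np.eye(4)                # hypothetical covariance of the sizes
e_g, z = sample_gross_errors(eps, mean, cov, rng)
print("pattern:", z)
print("gross errors:", e_g)
```

Entries of the result are exactly zero wherever the pattern flag is zero, which is what distinguishes this construction from a plain Gaussian perturbation of every observation.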
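S2, constructing a mixture model of the observations in the linear observation equation based on the Expectation-Maximization algorithm.
Preferably, the observations comprise the true value, the random error and the gross error:
$$y = Ax + e + e_g$$
where $y$ is the observation vector, $A$ is the non-random design matrix, $x$ is the parameter vector to be estimated, $e$ is the random error vector with zero mean and variance $\Sigma$, and $e_g$ is the gross error vector.
Further, constructing the mixture model of the observations in the linear observation equation comprises the following steps. According to the additivity of the Gaussian distribution, the probability distribution of the observation vector $y$ is:
$$f(y \mid x, \Sigma, \varepsilon, \mu_\nabla, \Sigma_\nabla) = \sum_{i=0}^{2^m - 1} \left[ \prod_{j=1}^{m} \varepsilon_j^{Z_{ij}} (1 - \varepsilon_j)^{1 - Z_{ij}} \right] N\!\left(y;\; Ax + Z_i \mu_\nabla,\; \Sigma + Z_i \Sigma_\nabla Z_i^{T}\right)$$
where $f$ is the probability distribution of $y$; $x$, $\Sigma$, $\varepsilon$, $\mu_\nabla$ and $\Sigma_\nabla$ are the parameters of the mixture model to be estimated; $\mu_\nabla$ and $\Sigma_\nabla$ are the mean vector and variance-covariance matrix of the gross error size vector; $Z_{ij}$ is the $j$-th diagonal element of the $i$-th pattern matrix $Z_i$; and $\varepsilon_j$ is the gross error rate of the $j$-th observation.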
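For a single scalar observation the mixture has only two components, one without and one with a gross error. The sketch below evaluates that two-component density in plain Python. It assumes the random error and the gross error size are independent Gaussian variables, so that their means and variances add; the names `normal_pdf`, `bg_mixture_pdf`, `mu_g` and `sigma2_g` are chosen for this example only.

```python
import math

def normal_pdf(y, mean, var):
    """Density of a Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bg_mixture_pdf(y, mu, sigma2, eps, mu_g, sigma2_g):
    """Scalar mixture density: clean component plus contaminated component.

    mu, sigma2     : mean and variance of a clean observation
    eps            : gross error rate
    mu_g, sigma2_g : mean and variance of the gross error size (assumed names)
    """
    clean = (1.0 - eps) * normal_pdf(y, mu, sigma2)
    # Additivity of independent Gaussians: means add and variances add.
    gross = eps * normal_pdf(y, mu + mu_g, sigma2 + sigma2_g)
    return clean + gross

# Hypothetical parameter values, evaluated on a coarse grid.
for y in (-2.0, 0.0, 2.0, 5.0, 8.0):
    print(y, bg_mixture_pdf(y, mu=0.0, sigma2=1.0, eps=0.1, mu_g=5.0, sigma2_g=9.0))
```

Setting `eps` to zero reduces the density to the ordinary Gaussian model, which is the model that least squares implicitly assumes.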
The parameters of the mixture model to be estimated are calculated as follows. If the gross error rates of the different observations in one type of observation equation are the same, then:
$$\varepsilon_j = \varepsilon,\quad j \in \{1, \dots, m\}$$
$\Sigma$ and the gross error size parameters are each divided into a known cofactor part and an unknown factor; for $\Sigma$:
$$\Sigma = \sigma^2 Q$$
where $Q$ and the corresponding cofactors of the gross error size parameters are known cofactor vectors or matrices. After this conversion, the parameters to be estimated are $x$, $\sigma^2$, $\varepsilon$ and the unknown factors of the gross error size parameters, and the probability distribution of $y$ is expressed in terms of them.
Specifically, consider the linear observation equation $y = Ax + e + e_g$ with $A = [1, \dots, 1]^T$ and $x = [\mu]$. Numerical calculations are performed by Monte Carlo simulation (MCS). First, the true values of the observations are simulated from $Ax$; random errors and gross errors are then added to the true values. The random errors in $e$ are independent and identically distributed Gaussian variables, each following $N(0, \sigma^2)$; the gross errors in $e_g$ are simulated according to the BG model. The observations in $y$ are therefore samples drawn independently from the mixture distribution, and parameter estimation by the EM algorithm amounts to fitting a PDF to these samples.
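The Monte Carlo setup described above can be sketched with NumPy as follows. This is a minimal illustration, not the simulation code used for the patent's tables: the function name `simulate_observations` and every numerical value are assumptions, and the gross error sizes are assumed to be independent scalar Gaussians.

```python
import numpy as np

def simulate_observations(n, mu, sigma, eps, mu_g, sigma_g, seed=0):
    """Simulate n observations of the form (true value) + (random error) + (gross error)."""
    rng = np.random.default_rng(seed)
    truth = np.full(n, mu)                    # A x with A = [1, ..., 1]^T and x = [mu]
    e = rng.normal(0.0, sigma, size=n)        # i.i.d. random errors with variance sigma**2
    z = rng.random(n) < eps                   # Bernoulli flags: gross error present or not
    size = rng.normal(mu_g, sigma_g, size=n)  # Gaussian gross error sizes
    e_g = z * size                            # zero wherever no gross error occurs
    return truth + e + e_g, z

# Hypothetical true parameter values.
y, z = simulate_observations(n=20000, mu=10.0, sigma=1.0, eps=0.1, mu_g=5.0, sigma_g=2.0)
print("share of contaminated observations:", z.mean())
print("sample mean of y:", y.mean())
```

Returning the flags `z` alongside `y` is a convenience for checking results afterwards; an estimation procedure would only ever see `y`.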
S3, calculating the Bernoulli-Gaussian model parameters based on the linear observation equation, and analyzing the quality of the geodetic data.
Preferably, the Bernoulli-Gaussian model parameters are calculated from the linear observation equation with the Expectation-Maximization algorithm: initial values are given for $x$, $\sigma^2$, $\varepsilon$ and the gross error size parameters, and each iteration of the algorithm is divided into an E step and an M step.
Further, the E step computes the posterior probability $\gamma(Z_i)$ of each pattern $Z_i$:
$$\gamma(Z_i) = \frac{P(Z = Z_i)\, f(y \mid Z_i)}{\sum_{k=0}^{2^m - 1} P(Z = Z_k)\, f(y \mid Z_k)}$$
where $P(Z = Z_i)$ is the prior probability of pattern $Z_i$ and $f(y \mid Z_i)$ is the Gaussian density of $y$ given pattern $Z_i$, both evaluated at the current parameter estimates. The M step then updates the parameter estimates; its formulas involve two auxiliary quantities, $M_i$ and $N_i$.
In each iteration, the estimates from the previous iteration are substituted for the parameter values on the right-hand side of the M-step formulas, and the values obtained on the left-hand side are taken as the new parameter estimates of the current iteration. This is repeated until the iteration converges.
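To make the alternation of E and M steps concrete, the sketch below runs EM on scalar observations with a common gross error rate. It is not the patent's M step (the formulas with M_i and N_i are not reproduced in this text); instead it swaps in the standard two-component Gaussian mixture updates and maps the fitted component moments back to gross error parameters. All names and numerical values are assumptions for this example.

```python
import numpy as np

def normal_pdf(y, mean, var):
    """Gaussian density with the given mean and variance."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_bg_scalar(y, mu, sigma2, eps, mu_g, sigma2_g, n_iter=500, tol=1e-10):
    """EM for f(y) = (1-eps) N(mu, sigma2) + eps N(mu + mu_g, sigma2 + sigma2_g)."""
    m1, v1 = mu + mu_g, sigma2 + sigma2_g  # moments of the contaminated component
    for _ in range(n_iter):
        # E step: posterior probability that each sample carries a gross error.
        p0 = (1.0 - eps) * normal_pdf(y, mu, sigma2)
        p1 = eps * normal_pdf(y, m1, v1)
        gamma = p1 / (p0 + p1)
        # M step: posterior-weighted moments (standard Gaussian mixture updates).
        w1 = gamma.sum()
        w0 = y.size - w1
        mu_new = np.dot(1.0 - gamma, y) / w0
        m1_new = np.dot(gamma, y) / w1
        sigma2 = np.dot(1.0 - gamma, (y - mu_new) ** 2) / w0
        v1 = np.dot(gamma, (y - m1_new) ** 2) / w1
        eps_new = w1 / y.size
        change = abs(mu_new - mu) + abs(m1_new - m1) + abs(eps_new - eps)
        mu, m1, eps = mu_new, m1_new, eps_new
        if change < tol:
            break
    # Map back to gross error parameters. sigma2_g may come out negative because
    # this unconstrained sketch does not enforce v1 >= sigma2.
    return {"mu": mu, "sigma2": sigma2, "eps": eps,
            "mu_g": m1 - mu, "sigma2_g": v1 - sigma2}

# Hypothetical simulation: true mu=10, sigma2=1, eps=0.2, mu_g=8, sigma2_g=4.
rng = np.random.default_rng(1)
n = 20000
z = rng.random(n) < 0.2
y = 10.0 + rng.normal(0.0, 1.0, n) + z * rng.normal(8.0, 2.0, n)
est = em_bg_scalar(y, mu=9.0, sigma2=2.0, eps=0.3, mu_g=5.0, sigma2_g=4.0)
print(est)
```

No threshold appears anywhere in the procedure: each sample contributes to both components in proportion to its posterior probability, which mirrors the threshold-free property the method claims.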
Specifically, n samples are simulated from the mixture model and the model parameters are calculated with the EM algorithm. The true values and initial values of the parameters of the mixture model are shown in Table 1:
Table 1. True values and initial values of the parameters in the mixture model
The parameters estimated during the iterations of the EM algorithm are shown in Table 2:
Table 2. Parameters estimated during the iterations of the EM algorithm
The results in Tables 1 and 2 show that the estimated parameters change from iteration to iteration and eventually converge to the true values.
Further, as shown in Fig. 1, the estimated PDF gradually deforms during the EM iterations and finally approaches the true PDF.
The embodiment also provides a system for data quality analysis based on the Bernoulli-Gaussian model and the EM algorithm, which comprises a Bernoulli-Gaussian model module, a mixture model construction module and a parameter evaluation module. The Bernoulli-Gaussian model module decomposes and calculates the gross errors and obtains the probability mass function and probability density function describing whether a gross error occurs. The mixture model construction module constructs the mixture model of the observations in the linear observation equation. The parameter evaluation module calculates the values of the Bernoulli-Gaussian model parameters in the linear observation equation and analyzes the quality of the geodetic data.
The embodiment also provides a computer device, which is suitable for the situation of the data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm, and comprises a memory and a processor, wherein the memory is used for storing computer executable instructions, and the processor is used for executing the computer executable instructions to realize the data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm, which is proposed by the embodiment.
The computer device may be a terminal comprising a processor, a memory, a communication interface, a display screen and input means connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
The present embodiment also provides a storage medium on which a computer program is stored; when executed by a processor, the program implements the data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm proposed in the above embodiments. The storage medium may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disk.
In summary, the invention provides a Bernoulli-Gaussian (BG) statistical model of gross errors, together with a method, based on the EM (Expectation-Maximization) algorithm, for estimating the BG model parameters in a linear observation equation. The BG model parameters can be estimated in a single observation equation or in several observation equations. The invention yields the precision of the observations as well as the gross error rate and the gross error size in the observations, providing a comprehensive means of analyzing geodetic data quality. No threshold needs to be introduced to distinguish abnormal values from normal values, so human intervention in the data quality assessment is avoided and the analysis result is more scientific and reliable.
Example 2
Referring to Figs. 2 to 4, a second embodiment of the present invention differs from the first embodiment in that it provides experimental data comparing the invention with the prior art, in order to verify its beneficial effects.
The invention is applied to mixture models with three different sets of parameter values and compared with least squares (LS) estimation, which uses the conventional Gaussian model and its corresponding parameters. The true values and the LS and EM estimates of the parameters for the different cases are given in Table 3.
Table 3. True values and estimates of the LS and EM parameters for the different cases
As can be seen from Table 3, when the gross error rate and the gross error size are relatively small (Case 1), the LS and EM estimates do not differ significantly from the true values. As the gross error rate and the gross error size increase (Case 2 and Case 3), the accuracy of the LS estimates drops markedly because LS lacks robustness. In contrast, the EM estimates are little affected by outliers and remain stable throughout; the precision of the gross error parameters even improves when the proportion of outliers is large.
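The loss of accuracy of LS can be made explicit for the scalar case in which every observation measures the same quantity, so that the LS estimate is the sample mean. Under a two-component mixture with independent Gaussian errors, the mean and variance that LS converges to follow from the laws of total expectation and total variance. The sketch below evaluates these limits; the three cases are hypothetical and are not the values of Table 3.

```python
def ls_limits(mu, sigma2, eps, mu_g, sigma2_g):
    """Large-sample limits of the LS mean and variance under contamination.

    mu, sigma2     : true mean and random error variance
    eps            : gross error rate
    mu_g, sigma2_g : mean and variance of the gross error size (assumed names)
    """
    mean = mu + eps * mu_g  # the mean is shifted by eps * mu_g
    var = sigma2 + eps * sigma2_g + eps * (1.0 - eps) * mu_g ** 2
    return mean, var

# Hypothetical cases with increasing gross error rate and size.
cases = {
    "small": (10.0, 1.0, 0.01, 3.0, 1.0),
    "medium": (10.0, 1.0, 0.10, 5.0, 4.0),
    "large": (10.0, 1.0, 0.30, 10.0, 9.0),
}
for name, params in cases.items():
    mean, var = ls_limits(*params)
    print(name, "LS mean ->", mean, "LS variance ->", var)
```

The bias of the mean grows linearly in the product of rate and size, while the variance inflation grows with the square of the size, which is consistent with the degradation described for Case 2 and Case 3.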
As shown in Figs. 2 to 4, when the gross error rate and the gross error size are relatively small, the PDFs estimated with the Gaussian model and the mixture model are both close to the histogram and fit the samples satisfactorily. As the gross error rate and the gross error size increase (Case 2 and Case 3), the PDF estimated with the Gaussian model by LS deviates clearly from the histogram: the Gaussian model loses the ability to fit samples contaminated by gross errors of larger proportion and magnitude. Conversely, the PDF of the mixture model always stays close to the histogram and keeps a good fit.
It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.

Claims (7)

1. A data quality analysis method based on a Bernoulli-Gaussian model and the EM algorithm, characterized by comprising:
preprocessing the collected geodetic data, and constructing a Bernoulli-Gaussian model from the statistics of the gross errors;
constructing, based on the Expectation-Maximization (EM) algorithm, a mixture model of the observations in a linear observation equation;
calculating the Bernoulli-Gaussian model parameters based on the linear observation equation, and analyzing the quality of the geodetic data;
wherein constructing the mixture model of the observations in the linear observation equation comprises the following steps:
according to the additivity of the Gaussian distribution, the probability distribution of the observation y is given by:
[formula missing in source]
where [symbol missing] is the probability distribution of the observation y; x, Σ, ε, [symbols missing] are the parameters to be estimated for y in the mixture model; Z_ij is the j-th diagonal element of Z_i; ε_j is the gross error rate of the j-th observation; A is the non-random design matrix;
the parameters to be estimated for y in the mixture model are calculated as follows:
if the gross error rates of the different observations within one class of observation equations are the same, then:
ε_j = ε, j ∈ {1, …, m}
Σ and [symbol missing] are each split into a known cofactor matrix and an unknown factor:
Σ = σ²Q
[formula missing in source]
where Q and [symbol missing] are known cofactor vectors or matrices;
after this transformation, the distribution in terms of the parameters to be estimated becomes:
[formula missing in source]
where [symbol missing] is the probability distribution of y expressed in the parameters x, σ², ε, [symbols missing];
the Bernoulli-Gaussian model parameters are calculated from the linear observation equation with the EM algorithm: initial values of x, σ², ε, [symbols missing] are given, and each iteration of the EM algorithm is divided into an E step and an M step;
the E step is calculated as:
[formula missing in source]
where γ(Z_i) is the posterior probability of Z_i;
the M step is calculated as:
[formula missing in source]
M_i and N_i are calculated as:
[formula missing in source]
the parameter values on the right-hand side of the M step are taken from the estimates of the previous iteration, and the parameter values on the left-hand side of the M step become the new estimates for the current iteration, until the iteration converges.

2. The data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm according to claim 1, characterized in that: the gross error comprises a Bernoulli variable and a Gaussian variable; the gross error is decomposed as:
[formula missing in source]
where e_g is the gross error; Z is the pattern matrix; [symbol missing] is the size vector.

3. The data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm according to claim 2, characterized in that: the pattern matrix Z is a diagonal matrix that takes one of 2^m possible values:
Z = Z_i, i ∈ {0, …, 2^m − 1}
where the j-th diagonal element of the pattern matrix Z follows a Bernoulli distribution with parameter ε_j, and the probability distribution of the pattern matrix Z is:
[formula missing in source]
where Z_ij is the j-th diagonal element of Z_i; ε_j is the gross error rate of the j-th observation;
the size vector [symbol missing] follows a multivariate Gaussian distribution:
[formula missing in source]
where [symbol missing] is the mean vector of the gross error sizes; [symbol missing] is the variance-covariance matrix.

4. The data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm according to claim 3, characterized in that: the observation comprises a true value, a random error and a gross error, and is given by:
y = Ax + e + e_g
where y is the observation vector; A is the non-random design matrix; x is the parameter vector to be estimated; e is a random error vector with zero mean and variance Σ.

5. A system for data quality analysis based on the Bernoulli-Gaussian model and the EM algorithm, based on the data quality analysis method according to any one of claims 1 to 4, characterized by further comprising:
a Bernoulli-Gaussian model module, used to decompose and calculate the gross errors and to obtain the probability mass function of whether a gross error occurs and the probability density function of its size;
a mixture model construction module, used to construct the mixture model of the observations in the linear observation equation;
a parameter evaluation module, used to calculate the values of the Bernoulli-Gaussian model parameters in the linear observation equation and to analyze the quality of the geodetic data.

6. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that: when executing the computer program, the processor implements the steps of the data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm according to any one of claims 1 to 4.

7. A computer-readable storage medium having a computer program stored thereon, characterized in that: when executed by a processor, the computer program implements the steps of the data quality analysis method based on the Bernoulli-Gaussian model and the EM algorithm according to any one of claims 1 to 4.
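The formula images of the claims are not available in this text, so the block below is only a sketch of what the stated definitions imply. It is not the patent's own notation. The symbols s, μ_g and Σ_g for the size vector, its mean and its variance-covariance matrix are placeholders introduced here. The sketch also assumes that the diagonal elements of Z are mutually independent and that Z, s and e are independent of one another.

```latex
% Gross error decomposition and observation model (claims 2 and 4)
e_g = Z s, \qquad s \sim \mathcal{N}(\mu_g, \Sigma_g), \qquad y = A x + e + e_g

% Probability of pattern Z_i under independent Bernoulli diagonal elements (claim 3)
p(Z = Z_i) = \prod_{j=1}^{m} \varepsilon_j^{Z_{ij}} \, (1 - \varepsilon_j)^{1 - Z_{ij}}

% Additivity of Gaussians: conditional on the pattern, y is Gaussian
y \mid Z_i \sim \mathcal{N}\!\left(A x + Z_i \mu_g,\; \Sigma + Z_i \Sigma_g Z_i^{\mathsf{T}}\right)

% Mixture over all 2^m patterns
p(y) = \sum_{i=0}^{2^m - 1} p(Z_i)\,
       \mathcal{N}\!\left(y;\, A x + Z_i \mu_g,\; \Sigma + Z_i \Sigma_g Z_i^{\mathsf{T}}\right)

% Posterior probability of a pattern by Bayes' rule (the quantity an E step evaluates)
\gamma(Z_i) = \frac{p(Z_i)\, p(y \mid Z_i)}{\sum_{k=0}^{2^m - 1} p(Z_k)\, p(y \mid Z_k)}
```

With ε_j = ε for all j, the pattern probability reduces to ε^k (1 − ε)^(m − k), where k is the number of ones on the diagonal of Z_i.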
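The posterior γ(Z_i) named in claim 1 can be illustrated with a small numerical sketch. The Python below enumerates all 2^m pattern matrices and applies Bayes' rule to the Gaussian mixture implied by y = Ax + e + e_g. It is an assumption-laden illustration, not the patented procedure: the names `mu_g` and `Sigma_g` are hypothetical, x is treated as known, all components are assumed independent, and the M step is left out because its formulas are missing from this text.

```python
import itertools
import numpy as np


def pattern_matrices(m):
    """Diagonals of all 2**m pattern matrices Z_i, one 0/1 row per pattern."""
    return np.array(list(itertools.product([0, 1], repeat=m)), dtype=float)


def pattern_prior(patterns, eps):
    """p(Z_i) = prod_j eps_j**Z_ij * (1 - eps_j)**(1 - Z_ij), with 0 < eps_j < 1."""
    eps = np.broadcast_to(np.asarray(eps, dtype=float), (patterns.shape[1],))
    return np.prod(eps ** patterns * (1.0 - eps) ** (1.0 - patterns), axis=1)


def gaussian_logpdf(y, mean, cov):
    """Log density of a multivariate normal N(mean, cov) evaluated at y."""
    d = y - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = d @ np.linalg.solve(cov, d)
    return -0.5 * (len(y) * np.log(2.0 * np.pi) + logdet + quad)


def pattern_posterior(y, A, x, Sigma, eps, mu_g, Sigma_g):
    """Posterior gamma(Z_i) of every pattern given y (hypothetical helper).

    mu_g and Sigma_g are assumed names for the mean and covariance of the
    gross error size vector; they do not come from the source text.
    """
    patterns = pattern_matrices(len(y))
    log_w = np.log(pattern_prior(patterns, eps))
    for i, z in enumerate(patterns):
        Z = np.diag(z)
        mean = A @ x + Z @ mu_g              # shifted mean when gross errors occur
        cov = Sigma + Z @ Sigma_g @ Z.T      # Gaussian additivity of e and Z s
        log_w[i] += gaussian_logpdf(y, mean, cov)
    log_w -= log_w.max()                     # stabilise before exponentiating
    gamma = np.exp(log_w)
    return patterns, gamma / gamma.sum()


# Toy example: four repeated measurements of one quantity, the third one is off.
m = 4
A = np.ones((m, 1))
x = np.array([10.0])
Sigma = 0.01 * np.eye(m)       # random error variance
Sigma_g = 25.0 * np.eye(m)     # assumed gross error size variance
mu_g = np.zeros(m)             # assumed gross error size mean
y = np.array([10.02, 9.97, 14.0, 10.01])

prior = pattern_prior(pattern_matrices(m), 0.1)
patterns, gamma = pattern_posterior(y, A, x, Sigma, 0.1, mu_g, Sigma_g)
marginal = gamma @ patterns    # per-observation probability of a gross error
print(np.round(marginal, 4))
```

The enumeration costs O(2^m) Gaussian evaluations, so this direct form is only practical for small m. A full EM loop would alternate this posterior with parameter updates.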
CN202311354617.9A 2023-10-19 2023-10-19 A data quality analysis method based on Bernoulli-Gaussian model and EM algorithm Active CN117473717B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311354617.9A CN117473717B (en) 2023-10-19 2023-10-19 A data quality analysis method based on Bernoulli-Gaussian model and EM algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311354617.9A CN117473717B (en) 2023-10-19 2023-10-19 A data quality analysis method based on Bernoulli-Gaussian model and EM algorithm

Publications (2)

Publication Number Publication Date
CN117473717A (en) 2024-01-30
CN117473717B (en) 2024-12-06

Family

ID=89632272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311354617.9A Active CN117473717B (en) 2023-10-19 2023-10-19 A data quality analysis method based on Bernoulli-Gaussian model and EM algorithm

Country Status (1)

Country Link
CN (1) CN117473717B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114117802A (en) * 2021-11-29 2022-03-01 同济大学 A method, device and medium for multiple gross error detection based on maximum a posteriori estimation

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
NO337304B1 (en) * 2014-06-03 2016-03-07 Q Free Asa Detection of a charge object in a GNSS system with particle filter
CN104776827B (en) * 2015-04-03 2017-04-05 东南大学 The Detection of Gross Errors method of GPS height anomaly data
GB2555375B (en) * 2016-09-30 2020-01-22 Equinor Energy As Improved methods relating to quality control
CN109270560B (en) * 2018-10-12 2022-04-26 东南大学 Multi-dimensional gross error positioning and value fixing method for area elevation abnormal data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于贝叶斯学习的阵列天线故障诊断方法研究";许煜辉;《中国优秀硕士学位论文全文数据库(信息科技辑)》;20230215;正文24-28 *

Also Published As

Publication number Publication date
CN117473717A (en) 2024-01-30

Similar Documents

Publication Publication Date Title
Hasan et al. Priority ranking of critical uncertainties affecting small-disturbance stability using sensitivity analysis techniques
Bi A review of statistical methods for determination of relative importance of correlated predictors and identification of drivers of consumer liking
Andersen et al. Extensions to the Gaussian copula: Random recovery and random factor loadings
Wan Simulating survival data with predefined censoring rates for proportional hazards models
Polson et al. Bayesian l 0‐regularized least squares
TW202013104A (en) Data processing method, data processing device, and computer-readable recording medium
CN112016826A (en) Method and device for determining corrosion degree of transformer substation equipment and computer equipment
White et al. An evaluation of point and interval estimates in population pharmacokinetics using NONMEM analysis
Tekwa et al. Theory and application of an improved species richness estimator
Struben et al. Parameter estimation through maximum likelihood and bootstrapping methods
CN117473717B (en) A data quality analysis method based on Bernoulli-Gaussian model and EM algorithm
Skinner et al. Weibull regression for lifetimes measured with error
Li et al. Quantile association for bivariate survival data
Kelly A review of software packages for analyzing correlated survival data
Sasaki et al. Estimating sexual size dimorphism in fossil species from posterior probability densities
Duewer A comparison of location estimators for interlaboratory data contaminated with value and uncertainty outliers
Bai et al. Calibrating input parameters via eligibility sets
Elliott et al. Weighted Dirichlet process mixture models to accommodate complex sample designs for linear and quantile regression
CN113095963A (en) Real estate cost data processing method, real estate cost data processing device, computer equipment and storage medium
Atamanyuk et al. Management of an agricultural enterprise on the basis of its economic state forecasting
Stevens et al. Augmented measurement system assessment
CN114492003B (en) Gravity modeling method and device based on inverse distance weighting method and quadric surface method
Guolo Measurement errors in control risk regression: A comparison of correction techniques
Ruiz et al. Generalized Functional Mixed Models for Accelerated Degradation-Based Reliability Analysis
CN110097265A (en) Acquisition methods, device and the storage medium of the ready degree of Project Technical

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant