CN116186547A - Method for rapidly identifying abnormal data of environmental water affair monitoring and sampling - Google Patents
- Publication number
- CN116186547A (application CN202310467167.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A20/00—Water conservation; Efficient water supply; Efficient water use
- Y02A20/152—Water filtration
Abstract
The invention provides a method for rapidly identifying abnormal data in environmental water affairs monitoring and sampling, relating to the technical field of water affairs data processing. The method comprises the following steps: S1, given collected water affairs data, extracting features based on the time-series distribution of the sequences; S2, performing feature aggregation on the features based on the spatial graph distribution and the Kullback-Leibler divergence to obtain feature vectors that fuse the spatial distribution information of different time-series data; S3, constructing a data correlation matrix for the feature vectors based on mechanism equations; S4, introducing picture information in the data, extracting image features, and constructing a cross-modal vision-time series-spatial distribution-mechanism feature characterization model to obtain overall features; S5, constructing a feature decoding prediction and prediction network training module, decoding the overall features, outputting predicted values, and training the prediction network; S6, predicting new water affairs data using the existing training set data, judging whether a distribution shift has occurred, and identifying data anomalies.
Description
Technical Field
The invention relates to the technical field of water affair data processing, in particular to a method for rapidly identifying abnormal data of environmental water affair monitoring and sampling.
Background
In the field of environmental water affairs engineering, a large amount of data, such as real-time flow, temperature, dissolved oxygen, algae, organic carbon, organic phosphorus, organic nitrogen and ammonia nitrogen, is usually collected in a given water area. These data can be used to analyze the hydrodynamic characteristics of the water area as well as its pollution condition, pollution degree and pollution mechanism. Data collection therefore plays an important role in flood control, drainage and pollution control. However, data collected in the field are often abnormal for various reasons, such as sensor damage, operator error or accumulated equipment error. Automatically identifying abnormal data within a large volume of data is therefore critical to ensuring the quality of environmental water analysis.
A search found that publication No. CN109160550A discloses an urban sewage treatment information management system in which a fault-tree-based expert system identifies abnormal parts of the sewage treatment process. However, such methods consider neither the spatio-temporal continuity between data of the same type nor the mechanism relationships, expressed as differential equations, between data of different types. The method considers only whether a data value is too high or too low and judges system abnormalities by referring to the fault tree. Such expert systems are simple and efficient but fail to identify complex, latent system failures. In addition, Chinese invention patent publication No. CN111830871A discloses abnormality identification for water affairs monitoring equipment using frequency-domain analysis of equipment data, but the scheme has the following drawback: data features represented in the frequency domain typically have insufficient representational capability, so that subtle anomalies are difficult to characterize and latent data anomalies are difficult to find.
The schemes disclosed above are unreliable and inaccurate in identifying abnormalities in water affairs data. To improve the abnormality identification capability, the invention provides a new scheme for identifying complex and latent abnormalities, so as to achieve more reliable and more accurate data abnormality identification in environmental water affairs and meet practical needs.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method for rapidly identifying abnormal data of environmental water affairs monitoring and sampling, so as to solve the above technical problems.
The technical solution adopted to solve the technical problems is a method for rapidly identifying abnormal data of environmental water affairs monitoring and sampling, characterized in that the method comprises the following steps: S1, given collected water affairs data, extracting features based on the time-series distribution of the sequences; S2, performing feature aggregation on the features based on the spatial graph distribution and the Kullback-Leibler divergence to obtain feature vectors that fuse the spatial distribution information of different time-series data; S3, constructing a data correlation matrix for the feature vectors based on mechanism equations; S4, introducing picture information in the data, extracting image features, and constructing a cross-modal vision-time series-spatial distribution-mechanism feature characterization model to obtain overall features; S5, constructing a feature decoding prediction and prediction network training module, decoding the overall features, outputting predicted values, and training the prediction network; S6, predicting new water affairs data using the existing training set data, judging whether a distribution shift has occurred, and identifying data anomalies.
In the above method, the step S1 includes the following steps:
S11, given the acquired data, indexed by (m, n, t, k), where m and n denote that the data was acquired at spatial location (m, n), t denotes that the data was acquired at time t, and k denotes the data type;
S12, constructing a Transformer-based time-series feature extraction network for each data type k;
S13, aggregating the data of the same type at the same place into a piece of sequence data, and extracting the features of all the sequence data using the time-series feature extraction network.
In the above method, in step S13 the features of all the sequence data are extracted using a weighted time-series attention mechanism, which yields a time-series-based attention weight matrix.
In the above method, the step S2 includes the following steps:
S21, arranging the features constructed from data of the same type according to the geographical distribution of the data to construct a feature graph, in which each node is the feature vector of one piece of time-series data and the edge weights between nodes are calculated as follows: for the node at each geographic position, a Gaussian distribution with a mean and a covariance is constructed; the weight of the edge between two adjacent nodes in the feature graph is calculated as the Kullback-Leibler divergence between their two Gaussian distributions;
S22, using a graph network to perform weighted feature aggregation on the time-series feature vectors based on their spatial arrangement in the feature graph.
In the above method, the step S3 includes the following steps:
S31, constructing importance scores among the different data types according to the mechanisms of hydrodynamics, water quality, and algae growth and decline;
S32, constructing an adjacency matrix among the different data types from the scores, in which each element is the score assigned between two data types, and constructing the correlation matrix between the data from the adjacency matrix.
In the above method, the step S4 includes the following steps:
S41, introducing picture information in the data, and extracting picture information features using a Vision Transformer;
S42, concatenating the picture information features with the feature vectors;
S43, weighting the concatenated features using the correlation matrix, and constructing a cross-modal vision-time series-spatial distribution-mechanism feature characterization model to obtain overall features.
In the above method, the step S5 includes the following steps:
S51, decoding the overall features and outputting the predicted value of the data at an unknown point (m, n, t, k);
S52, masking random data in the original data, applying position coding, time coding and data type coding to the masked data, predicting the masked data with the prediction network, taking the Euclidean distance between the true and predicted values of the masked data as the prediction objective to construct gradient information, and training the prediction network.
In the above method, in step S51 the overall features are decoded using a multi-layer Multi-Head attention network as the decoder.
In the above method, the step S6 includes the following steps:
S61, using the existing training set data, extracting q batches of data in the neighborhood of the new data, cyclically feeding the q batches to the prediction network to predict the data at the unknown point (m, n, t, k), and outputting a prediction mean and a prediction standard deviation for each batch, so that q prediction means and q prediction standard deviations are obtained;
S62, constructing q Gaussian distributions from the prediction means and prediction variances;
S63, taking the new data as a sample point of the Gaussian distributions, and calculating the probability density of the sample point under the q Gaussian distributions;
S64, when the probability density of the new data under one of the Gaussian distributions is lower than a set threshold, identifying the sample point as abnormal.
The beneficial effects of the invention are as follows. The time-series relationship and the spatial distribution relationship among the data are fused with a mechanism-based correlation module, and features are extracted from the water affairs data with a cross-modal large model, so that latent features in the data are captured more effectively and the reliability and accuracy of data abnormality identification are improved. The feature representation fuses the time-series relationship, the graph-based spatial distribution relationship and the mechanism relationship, which makes the representation more accurate and addresses the inaccurate identification caused by the insufficient representational capability of conventional data abnormality identification. The data are first predicted and then identified: the data to be judged are predicted over multiple rounds using neighborhood data (different neighborhood samples), and if the variance of the prediction results across the rounds is large, the data are identified as abnormal, which makes the abnormality identification more stable.
Drawings
FIG. 1 is a flow chart of a method for rapidly identifying abnormal data of environmental water affair monitoring and sampling in the invention.
Fig. 2 is a diagram comparing a network module used in a Transformer network with the weighted attention module of the invention, in an embodiment of the invention.
Fig. 3 is a feature extraction module based on a sequence timing distribution in an embodiment of the invention.
Fig. 4 is a feature aggregation module based on spatial graph distribution and Kullback-Leibler divergence in an embodiment of the present invention.
FIG. 5 is a cross-modal visual-time series-spatial distribution-mechanism characterization model in an embodiment of the invention.
Fig. 6 is a schematic diagram of position encoding in an embodiment of the invention.
FIG. 7 is a data prediction flow in an embodiment of the invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
The conception, specific structure and technical effects of the invention are described clearly and completely below with reference to the embodiments and the drawings, so that the objects, features and effects of the invention can be fully understood. The described embodiments are only some, not all, of the embodiments of the invention; other embodiments obtained by those skilled in the art without inventive effort on the basis of these embodiments fall within the scope of protection of the invention. In addition, the coupling/connection relationships referred to in this patent do not mean only that members are directly connected; rather, a better coupling structure may be formed by adding or removing coupling aids depending on the specific implementation. The technical features of the invention may be combined with one another provided that they do not contradict or conflict.
The invention discloses a method for rapidly identifying abnormal data of environmental water affair monitoring and sampling, which comprises the following steps:
S1, given collected water affairs data, extracting features based on the time-series distribution of the sequences;
specifically, step S1 includes the following steps:
S11, given the acquired data, indexed by (m, n, t, k), where m and n denote that the data was acquired at spatial location (m, n), t denotes that the data was acquired at time t, and k denotes the data type;
S12, constructing, for each data type k, a time-series feature extraction network based on the Transformer, a network that uses an attention mechanism to improve model training speed;
S13, aggregating the data of the same type at the same place into a piece of sequence data, and extracting the features of all the sequence data using the time-series feature extraction network of step S12.
Further, a weighted time-series attention mechanism can be adopted when extracting the features of all the sequence data, which yields the corresponding time-series-based attention weight matrix.
S2, performing feature aggregation, based on the spatial graph distribution and the Kullback-Leibler divergence, on the features constructed in step S1 from data of the same type, to obtain feature vectors that fuse the spatial distribution information of different time-series data;
specifically, step S2 includes the following steps:
S21, arranging the features constructed in step S13 from data of the same type according to the geographical distribution of the data to construct a feature graph, in which each node is the feature vector of one piece of time-series data and the edge weights between nodes are calculated as follows: for the node at each geographic position, a Gaussian distribution with a mean and a covariance is constructed; the weight of the edge between two adjacent nodes in the feature graph is calculated as the Kullback-Leibler divergence between their two Gaussian distributions;
S22, using a graph network to perform weighted feature aggregation on the time-series feature vectors based on their spatial arrangement in the feature graph.
S3, in order to make full use of the relationships among different types of data, constructing a data correlation matrix for the feature vectors based on mechanism equations;
specifically, step S3 includes the following steps:
S31, having an industry expert construct importance scores among the different data types according to the mechanisms of hydrodynamics, water quality, and algae growth and decline;
S32, constructing an adjacency matrix among the different data types from the scores, in which each element is the score assigned by the expert between two data types, and constructing the correlation matrix between the data from the adjacency matrix.
S4, on the basis of the time-series, spatial-distribution and mechanism feature extraction and the correlation matrix construction, additionally introducing picture information in the data, extracting image features, and constructing a cross-modal vision-time series-spatial distribution-mechanism feature characterization model to obtain overall features;
specifically, step S4 includes the following steps:
S41, on the basis of the above feature extraction and correlation matrix construction, additionally introducing picture information in the data; for example, the purifying effect of alum in the water body is difficult to express numerically and can only be judged from pictures; the picture information features are extracted using a Vision Transformer;
S42, concatenating the picture information features with the feature vectors obtained in step S2;
S43, weighting the concatenated features using the correlation matrix of step S3, thereby constructing the cross-modal vision-time series-spatial distribution-mechanism feature characterization model as a whole and obtaining the overall features.
S5, constructing a feature decoding prediction and prediction network training module, decoding the overall features, outputting predicted values, and training the prediction network;
step S5 includes the following steps:
S51, using a multi-layer Multi-Head attention network as a decoder, decoding the overall features of the data obtained in step S4 and outputting the predicted value of the data at an unknown point (m, n, t, k);
S52, masking random data in the original data, applying position coding, time coding and data type coding to the masked data, predicting the masked data with the prediction network, taking the Euclidean distance between the true and predicted values of the masked data as the prediction objective to construct gradient information, and training the prediction network;
that is, to train the prediction network, the given input data are randomly masked and the network must predict the masked data at its output. The L2 distance between the predicted value and the true value serves as the loss from which the gradient information is constructed.
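The masking objective of step S52 can be illustrated with a minimal sketch. It only shows how random entries are hidden and how an L2 (Euclidean) loss is computed on the hidden entries; the function names, the 25% mask ratio and the mean-filling stand-in for the prediction network are assumptions made for illustration and are not taken from the patent.

```python
import math
import random

def mask_series(values, mask_ratio, rng):
    """Hide a random subset of entries; hidden entries are returned as None."""
    n_mask = max(1, int(len(values) * mask_ratio))
    masked_idx = sorted(rng.sample(range(len(values)), n_mask))
    visible = [None if i in masked_idx else v for i, v in enumerate(values)]
    return visible, masked_idx

def mean_fill_predictor(visible):
    """Hypothetical stand-in for the prediction network: fill gaps with the mean."""
    known = [v for v in visible if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in visible]

def l2_loss(predicted, target, masked_idx):
    """Euclidean distance between prediction and truth on the masked entries only."""
    return math.sqrt(sum((predicted[i] - target[i]) ** 2 for i in masked_idx))

rng = random.Random(0)
series = [0.30, 0.32, 0.31, 0.35, 0.40, 0.38, 0.36, 0.33]  # illustrative readings
visible, masked_idx = mask_series(series, mask_ratio=0.25, rng=rng)
predicted = mean_fill_predictor(visible)
loss = l2_loss(predicted, series, masked_idx)
```

In an actual training loop this loss would be back-propagated through the prediction network; the sketch stops at the loss value.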
S6, predicting new water affairs data using the existing training set data, judging whether a distribution shift has occurred, and identifying data anomalies;
specifically, step S6 includes the following steps:
S61, using the existing training set data, extracting q batches of data in the neighborhood of the new data, cyclically feeding the q batches to the prediction network to predict the data at the unknown point (m, n, t, k); because the prediction network outputs a prediction mean and a prediction standard deviation simultaneously, q prediction means and q prediction standard deviations are obtained for the unknown point;
S62, constructing q Gaussian distributions from the prediction means and prediction variances;
S63, taking the new data (i.e. the data of the new unknown point (m, n, t, k)) as a sample point of the Gaussian distributions, and calculating the probability density of the sample point under the q Gaussian distributions;
S64, when the probability density of the new data under one of the Gaussian distributions is lower than a set threshold, identifying the sample point as abnormal, i.e. the water affairs data at the new unknown point (m, n, t, k) are abnormal.
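The density test at the end of step S6 can be sketched as follows. The sketch assumes scalar readings, reads "one of the Gaussian distributions" as "at least one of the q distributions", and uses illustrative values for the q (mean, standard deviation) pairs and for the threshold; none of these specifics come from the patent.

```python
import math

def gaussian_pdf(x, mean, std):
    """Probability density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def is_abnormal(x, predictions, threshold):
    """Flag x if its density under any of the q predicted Gaussians is below threshold.

    `predictions` is a list of (mean, std) pairs, one per neighborhood batch.
    Using `any` is an assumption; a stricter reading could use `all`.
    """
    densities = [gaussian_pdf(x, mean, std) for mean, std in predictions]
    return any(d < threshold for d in densities), densities

# q = 3 hypothetical predictions for one unknown point (m, n, t, k)
predictions = [(2.0, 0.1), (2.1, 0.1), (1.9, 0.1)]
normal_flag, _ = is_abnormal(2.0, predictions, threshold=0.5)
abnormal_flag, _ = is_abnormal(3.0, predictions, threshold=0.5)
```

The threshold trades false alarms against missed anomalies and would need to be tuned per data type.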
Taking the Yantian reservoir basin in Shenzhen as an example, the data indexes in the basin include flow, water temperature, dissolved oxygen, ammonia nitrogen, total nitrogen, total phosphorus, algae density, etc. First, the various water-affairs-related data collected at various positions and in various time periods in the basin are used, including indexes such as flow, water temperature, dissolved oxygen, ammonia nitrogen, total phosphorus and algae density. Here, each data item is identified by the location (m, n) at which it was collected, the time t at which it was collected, and its data type k.
Step S1, feature extraction based on sequence time-series distribution. The data are arranged in time order to obtain a number of time-series data sequences. For example, all ammonia nitrogen data of a given cross-section over the last half year can be constructed into one time-series data sequence. In this way, ammonia nitrogen time-series data sequences are obtained for a number of different cross-sections in the basin. A Transformer is then used to extract features from the ammonia nitrogen time-series data. For a piece of time-series data, so that data separated by shorter time intervals have larger correlation, the invention constructs a weighting matrix for the time-series data and, when extracting the data features with the Transformer, additionally adds a weighted attention layer in an intermediate layer, giving the resulting features stronger representational capability. Specifically, referring to Fig. 2, the encoder of a conventional Transformer contains a stack of network modules as shown in the left diagram of Fig. 2. To make full use of the time-series relationship among the data, however, the invention replaces the Multi-Head Attention layer in one of the network modules with the weighted attention layer proposed by the invention.
In this way, the time-series features of the ammonia nitrogen data at different places are obtained. These features are constructed into a graph according to the geographic locations at which the sequence data were acquired. The overall flow is shown in Fig. 3. The same feature extraction scheme is also applied to the other types of data (dissolved oxygen, total phosphorus, etc.).
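The idea of a time-weighted attention layer can be sketched as below. The patent's own weighting formula is not reproduced in this text, so the sketch assumes an exponential decay exp(-|t_i - t_j| / tau) multiplied into ordinary scaled dot-product attention and renormalised per row; the function names and the parameter `tau` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def time_weight_matrix(times, tau=1.0):
    """Assumed weighting: samples closer in time receive a larger weight."""
    return [[math.exp(-abs(ti - tj) / tau) for tj in times] for ti in times]

def weighted_attention(queries, keys, values, times, tau=1.0):
    """Scaled dot-product attention reweighted by the time-decay matrix."""
    dim = len(keys[0])
    decay = time_weight_matrix(times, tau)
    outputs, attention = [], []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        probs = softmax(scores)
        weighted = [p * w for p, w in zip(probs, decay[i])]
        norm = sum(weighted)
        row = [w / norm for w in weighted]  # each row sums to 1 again
        attention.append(row)
        outputs.append([sum(r * v[c] for r, v in zip(row, values))
                        for c in range(len(values[0]))])
    return outputs, attention

# Three samples with identical features, taken at times 0, 1 and 5
x = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
out, attn = weighted_attention(x, x, x, times=[0, 1, 5])
```

With identical features the plain attention weights are uniform, so any difference between the entries of a row of `attn` comes from the time weighting alone.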
Step S2, feature aggregation based on spatial graph distribution and Kullback-Leibler divergence. In the previous step, the features of the organic nitrogen data were constructed into a graph according to their spatial distribution. Each node in the graph is the organic nitrogen feature vector of one cross-section. If the Kullback-Leibler divergence between two nodes is greater than a certain threshold, an edge exists between the two nodes, and the weight of the edge is the Kullback-Leibler divergence between them. Under this definition of the graph, the aggregation process for one feature node is as follows. As shown in Fig. 4, node 5 is connected to node 2, node 3 and node 4, and each node carries the organic nitrogen feature vector of its site. The aggregated feature of node 5 is computed from the feature vectors of node 5 and its neighbors using a weight to be learned and the Kullback-Leibler divergences on the connecting edges, for example the Kullback-Leibler divergence between node 2 and node 5.
One round of aggregation applies this update to every feature node in the graph. The aggregation process is carried out several times to obtain the final aggregated data features. Not only the ammonia nitrogen features but also the time-series features of the other data types are aggregated in this way, which yields feature vectors that fuse the spatial distribution information of different time-series data.
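A minimal sketch of one aggregation step for node 5 of Fig. 4 follows. The closed-form Kullback-Leibler divergence between diagonal-covariance Gaussians is standard; the aggregation rule itself (neighbor features mixed in proportion to their edge weights, added to the node's own feature and multiplied by a learned matrix) is an assumption, since the patent's formula is not reproduced in this text. All names and numbers are illustrative.

```python
import math

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two Gaussians with diagonal covariance."""
    return 0.5 * sum(
        vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp)
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )

def aggregate_node(node, neighbours, features, gaussians, weight):
    """One assumed aggregation step: KL-weighted neighbor mix plus self feature."""
    edge = {j: kl_diag_gaussians(*gaussians[j], *gaussians[node]) for j in neighbours}
    total = sum(edge.values())
    dim = len(features[node])
    if total > 0:
        mixed = [sum(edge[j] / total * features[j][c] for j in neighbours)
                 for c in range(dim)]
    else:  # all divergences are zero: fall back to a plain average
        mixed = [sum(features[j][c] for j in neighbours) / len(neighbours)
                 for c in range(dim)]
    combined = [features[node][c] + mixed[c] for c in range(dim)]
    return [sum(w * v for w, v in zip(row, combined)) for row in weight]

# Hypothetical (mean, variance) per node and 2-dimensional node features
gaussians = {2: ([0.0], [1.0]), 3: ([1.0], [1.0]), 4: ([2.0], [1.0]), 5: ([0.0], [1.0])}
features = {2: [1.0, 0.0], 3: [0.0, 1.0], 4: [1.0, 1.0], 5: [0.0, 0.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]  # stands in for the weight to be learned
h5 = aggregate_node(5, [2, 3, 4], features, gaussians, identity)
```

Because the edge weight is the divergence itself, neighbors whose distributions differ more from node 5 contribute more in this sketch; whether the patent normalises the weights this way is not stated in the text.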
Step S3, construction of a data correlation matrix based on mechanism equations. The previous step yields feature vectors for data such as flow, water temperature, dissolved oxygen, ammonia nitrogen, total phosphorus and algae density at different places (different nodes in the graph). Each vector already contains time-series information as well as spatial distribution information. This step extracts further features from these vectors. Considering the mechanisms of hydrodynamics, water quality, and algae growth and decline in the basin, an industry expert scores the correlations among indexes such as water temperature, dissolved oxygen, ammonia nitrogen, total phosphorus and algae density. Each score is a triplet representing the correlation between two different data types, such as (ammonia nitrogen, dissolved oxygen, 0.1). For each data type, the correlation scores with the other data types are set to sum to 1. A correlation matrix P among these data types is thus obtained, and the n-th power of P is then calculated to give the final correlation matrix.
As in the feature extraction step based on sequence time-series distribution, a weighted attention layer is constructed from the correlation matrix P, and the weighted attention network module further fuses the feature vectors so that the feature representation contains the mechanism relationships among the different data.
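The construction of P and its n-th power can be sketched as follows. The score values, the choice n = 2, and the treatment of each triplet as a directed entry (row type, column type, score) are assumptions for illustration; only the triplet form and the rows-sum-to-1 rule come from the text above.

```python
def build_correlation_matrix(types, scores):
    """Fill P from (type_a, type_b, score) triplets and normalise each row to sum to 1."""
    index = {name: i for i, name in enumerate(types)}
    p = [[0.0] * len(types) for _ in types]
    for a, b, score in scores:
        p[index[a]][index[b]] = score
    for row in p:
        total = sum(row)
        if total > 0:
            row[:] = [v / total for v in row]
    return p

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matrix_power(p, n):
    """n-th power of a square matrix by repeated multiplication (n >= 0)."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, p)
    return result

types = ["water temperature", "dissolved oxygen", "ammonia nitrogen"]
scores = [  # hypothetical expert scores
    ("water temperature", "dissolved oxygen", 0.7),
    ("water temperature", "ammonia nitrogen", 0.3),
    ("dissolved oxygen", "water temperature", 0.9),
    ("dissolved oxygen", "ammonia nitrogen", 0.1),
    ("ammonia nitrogen", "water temperature", 0.5),
    ("ammonia nitrogen", "dissolved oxygen", 0.5),
]
P = build_correlation_matrix(types, scores)
P2 = matrix_power(P, 2)
```

Because every row of P sums to 1, every row of any power of P also sums to 1, so the powered matrix can still be read as a set of normalised correlation weights that now include indirect (two-hop) relations.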
Step S4, cross-modal vision-time series-spatial distribution-mechanism feature characterization model. On top of the feature extraction step based on sequence time-series distribution, the feature aggregation step based on spatial graph distribution and Kullback-Leibler divergence, and the data correlation matrix construction step based on mechanism equations, an image feature extraction step based on the Vision Transformer is added for evaluating the degree of eutrophication in the water body. The overall feature encoding model is shown in Fig. 5.
And S5, feature decoding prediction and training network training. The acquired data are subjected to characteristic representation in the steps, and the time sequence relationship, the topological relationship of spatial distribution and the mechanism relationship among the data are considered in the characteristic representation process. These features are then used to decode the features so that the decoder can predict the organic nitrogen, dissolved oxygen, algae, etc. indicators for unknown time in the unknown area. In the present invention, the decoder section matches the decoder in the transformer, but additionally adds time code and index type code, and predicts the index size of a specific time, a specific position and a specific type as the condition information.
The position code is still represented in the form of a graph network, as shown in Fig. 7: to represent a certain position, it suffices to set the value of the corresponding node to 1 and the values of all other nodes to 0. To convert this graph-form representation into a dense vector representation, the graph is again aggregated and spliced in the manner shown in the right-hand graph of Fig. 6.
The time code is similar to one-hot coding. For example, considering the period from 5 time steps in the past to the present, the code for the current time may be written as "000001" and the code for the previous time step as "000010".
The data type code is also based on one-hot coding, wherein the flow code is "00000001", the water temperature code is "00000010", the dissolved oxygen code is "00000100", the ammonia nitrogen code is "00001000", the total nitrogen code is "00010000", the total phosphorus code is "00100000", and the algae density code is "01000000".
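The three codes can be sketched as follows. The bit strings are those given in the text; the function names, the English keys of the lookup table and the simple concatenation into one condition vector are assumptions made for illustration. In particular, the text converts the position graph into a dense vector by graph aggregation, whereas this sketch keeps the raw node indicator.

```python
# Data type codes exactly as listed in the text (dictionary keys are hypothetical).
DATA_TYPE_CODES = {
    "flow": "00000001",
    "water_temperature": "00000010",
    "dissolved_oxygen": "00000100",
    "ammonia_nitrogen": "00001000",
    "total_nitrogen": "00010000",
    "total_phosphorus": "00100000",
    "algae_density": "01000000",
}


def time_code(steps_back, history=5):
    """One-hot string over `history` past steps plus the present.

    steps_back=0 is the current time and sets the rightmost bit.
    """
    if not 0 <= steps_back <= history:
        raise ValueError("steps_back must lie within the considered period")
    width = history + 1
    bits = ["0"] * width
    bits[width - 1 - steps_back] = "1"
    return "".join(bits)


def position_code(node, num_nodes):
    """Graph indicator: 1 at the queried node, 0 at every other node."""
    return [1 if i == node else 0 for i in range(num_nodes)]


def condition_vector(node, num_nodes, steps_back, data_type):
    """Concatenate position, time and data type codes (a simplification)."""
    bits = time_code(steps_back) + DATA_TYPE_CODES[data_type]
    return position_code(node, num_nodes) + [int(b) for b in bits]
```

For example, `time_code(0)` gives "000001" and `time_code(1)` gives "000010", matching the two codes quoted above.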
With the above encoding scheme, the decoder can decode the features and predict the unknown value at the encoded location. The overall flow is shown in Fig. 7.
Thus, the prediction network can be trained with a large amount of data. The training samples are constructed as follows: a data item in the original data is masked, position coding, time coding and data type coding are performed for it, and the prediction network is used to predict the masked data item. The training objective is the Euclidean distance between the true value of the data item and the predicted value.
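A minimal sketch of how one masked training sample and its loss might be formed is given below. It assumes the original data is held as a dictionary keyed by (m, n, t, k); the function names are hypothetical, and the prediction network itself is outside the scope of the sketch.

```python
import math
import random


def make_masked_sample(data, rng):
    """Mask one entry of `data`, a dict mapping (m, n, t, k) to a value.

    Returns the remaining visible data, the masked key (which would then be
    position-, time- and type-encoded) and the true value used as target.
    """
    query = rng.choice(sorted(data))
    target = data[query]
    visible = {key: value for key, value in data.items() if key != query}
    return visible, query, target


def euclidean_loss(predicted, target):
    """Euclidean distance between predicted and true values (equal-length sequences)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)))
```

Passing a seeded `random.Random` instance as `rng` keeps the choice of masked entry reproducible.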
S6, identifying data abnormality. As previously described, the existing training set data is first used to extract q batches of data in the neighborhood of the new data. The q batches are used in turn as input to the prediction network to predict the data at (m, n, t, k). Since the prediction network outputs both a prediction mean and a prediction standard deviation, q prediction means and q prediction standard deviations are obtained. q Gaussian distributions can be constructed from the predicted means and predicted variances. The new data is then regarded as a sample point of these Gaussian distributions, and the probability density of the sample point under each of the q distributions is calculated. If, under any one of the Gaussian distributions, the probability density of the new data is lower than a manually set threshold, the sample point is considered abnormal.
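The density test can be sketched as below, assuming scalar measurements and univariate Gaussians, with the q (mean, standard deviation) pairs already produced by the prediction network. The function names and the example values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Probability density of a univariate Gaussian at x."""
    if std <= 0:
        raise ValueError("std must be positive")
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def is_abnormal(x_new, predictions, threshold):
    """Flag x_new if its density under any of the q Gaussians is below threshold.

    predictions: list of (mean, std) pairs, one per neighborhood batch.
    Returns the flag together with the q densities.
    """
    densities = [gaussian_pdf(x_new, mean, std) for mean, std in predictions]
    return any(d < threshold for d in densities), densities
```

With predictions `[(5.0, 0.5), (5.2, 0.4), (4.9, 0.6)]` and a threshold of 0.05, a reading of 5.1 is not flagged while a reading of 9.0 is. The threshold is set manually, as the text states, and its scale depends on the units of the index.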
The invention fuses the time sequence relationship and the spatial distribution relationship among the data with a mechanism-based correlation module, and extracts features from the water service data with a cross-modal large model, so that latent features in the data are captured more effectively and the reliability and accuracy of anomaly identification in water service data are improved. The data feature representation fuses the time sequence relationship, the graph-based spatial distribution relationship and the mechanism relationship, which makes the feature representation of the data more accurate and addresses the inaccurate identification caused by insufficient feature representation capacity in conventional data anomaly identification. The data is predicted first and identified afterwards: the data to be judged is predicted over multiple rounds using neighborhood data (different neighborhood samples), and if the variance of the prediction results across the rounds is large, the data is identified as abnormal, which makes the anomaly identification more stable.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and these equivalent modifications or substitutions are included in the scope of the present invention as defined in the appended claims.
Claims (10)
1. A method for rapidly identifying abnormal data of environmental water affair monitoring and sampling, characterized by comprising the following steps:
S1, given collected water service data, extracting features based on sequence time sequence distribution;
S2, performing feature aggregation based on space diagram distribution and Kullback-Leibler divergence on the features to obtain feature vectors that fuse the spatial distribution information of different time sequence data;
S3, constructing a data correlation matrix based on a mechanism equation for the feature vectors;
S4, introducing picture information in the data, extracting image features, and constructing a cross-modal vision-time sequence-spatial distribution-mechanism feature characterization model to obtain overall features;
S5, constructing a feature decoding prediction and prediction network training module, decoding the overall features, outputting a predicted value and training a prediction network;
S6, predicting new water service data by using the existing training set data, judging whether a distribution deviation occurs, and identifying data abnormality.
2. The method for quickly identifying abnormal data of environmental water affair monitoring and sampling according to claim 1, which is characterized in that: the step S1 comprises the following steps:
S11, given acquired data indexed by (m, n, t, k), where m, n denote that the data is acquired at spatial location (m, n), t denotes that the data is acquired at time t, and k denotes the type of the data;
S12, constructing a time sequence feature extraction network based on a Transformer for each data category k;
3. The method for quickly identifying abnormal data of environmental water affair monitoring and sampling according to claim 2, which is characterized in that: in the step S13, features of all the sequence data are extracted, and a weighted time sequence attention mechanism is adopted, and the formula is as follows
5. The method for quickly identifying abnormal data of environmental water affair monitoring and sampling according to claim 2, which is characterized in that: the step S2 comprises the following steps:
S21, arranging the features constructed from the acquired data of the same category according to the geographical position distribution of the data, and constructing a feature map, wherein each node in the feature map is the feature vector of one piece of time sequence data, and the weight of the edge between two nodes is calculated as follows: for each node, a Gaussian distribution with a mean and a covariance is constructed from its geographic position; the weight of the edge between two adjacent nodes in the feature map is calculated from the Kullback-Leibler divergence between their two Gaussian distributions;
S22, adopting a graph network to perform weighted feature aggregation on the data time sequence feature vectors based on the spatial arrangement in the feature graph.
6. The method for quickly identifying abnormal data of environmental water affair monitoring and sampling according to claim 5, which is characterized in that: the step S3 comprises the following steps:
S31, constructing importance scores among different data categories according to hydrodynamic, water quality and algae growth and decline mechanisms;
S32, constructing an adjacency matrix among the different data categories according to the scores, wherein each element of the adjacency matrix represents the labeled score between two data categories, and constructing a correlation matrix between the data from the adjacency matrix.
7. The method for quickly identifying abnormal data of environmental water affair monitoring and sampling according to claim 6, which is characterized in that: the step S4 comprises the following steps:
S41, introducing picture information in the data, and extracting picture information features by adopting Vision Transformer;
S42, splicing the picture information features with the feature vectors;
8. The method for quickly identifying abnormal data of environmental water affair monitoring and sampling according to claim 7, which is characterized in that: the step S5 comprises the following steps:
S51, decoding the overall features, and outputting a predicted value of the data at an unknown point (m, n, t, k);
S52, masking random data in the original data, performing position coding, time coding and data type coding on the masked data, predicting the masked data with the prediction network, taking the Euclidean distance between the true value and the predicted value of the masked data as the prediction target, constructing gradient information from it, and training the prediction network.
9. The method for quickly identifying abnormal data of environmental water affair monitoring and sampling according to claim 8, which is characterized in that: in the step S51, the overall feature is decoded, and a Multi-layer Multi-Head attention network is used as a decoder.
10. The method for quickly identifying abnormal data of environmental water affair monitoring and sampling according to claim 9, which is characterized in that: the step S6 comprises the following steps:
S61, using the existing training set data, extracting q batches of data in the neighborhood of the new data, cyclically taking the q batches of data as input of the prediction network, predicting the data at the unknown point (m, n, t, k), and outputting a prediction mean and a prediction standard deviation for each batch, to obtain q prediction means and q prediction standard deviations;
S62, constructing q Gaussian distributions according to the predicted means and predicted variances;
S63, taking the new data as a sample point of the Gaussian distributions, and calculating the probability density of the sample point in the q Gaussian distributions;
S64, when the probability density of the sample point in any one of the Gaussian distributions is lower than a set threshold, identifying the sample point as abnormal.
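As an illustration of the edge weighting recited in step S21 of claim 5, the Kullback-Leibler divergence between two Gaussian distributions has a closed form. The sketch below assumes 2-D Gaussians over geographic coordinates with positive definite covariances; the function name is hypothetical. How the divergence is mapped to an edge weight is not recoverable from the text, so the sketch stops at the divergence.

```python
import math


def kl_gaussian_2d(mu0, cov0, mu1, cov1):
    """KL(N0 || N1) for 2-D Gaussians; covariances given as ((a, b), (c, d))."""
    (a0, b0), (c0, d0) = cov0
    (a1, b1), (c1, d1) = cov1
    det0 = a0 * d0 - b0 * c0
    det1 = a1 * d1 - b1 * c1
    if det0 <= 0 or det1 <= 0:
        raise ValueError("covariances must be positive definite")
    # Inverse of cov1 via the 2x2 adjugate formula.
    inv = ((d1 / det1, -b1 / det1), (-c1 / det1, a1 / det1))
    # trace(inv(cov1) @ cov0)
    trace = inv[0][0] * a0 + inv[0][1] * c0 + inv[1][0] * b0 + inv[1][1] * d0
    # Mahalanobis term (mu1 - mu0)^T inv(cov1) (mu1 - mu0)
    dx, dy = mu1[0] - mu0[0], mu1[1] - mu0[1]
    maha = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return 0.5 * (trace + maha - 2.0 + math.log(det1 / det0))
```

The divergence is zero for identical distributions and is not symmetric in its two arguments, so an implementation would need to fix a direction or symmetrize it before using it as an undirected edge weight.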
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310467167.8A CN116186547B (en) | 2023-04-27 | 2023-04-27 | Method for rapidly identifying abnormal data of environmental water affair monitoring and sampling |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116186547A (en) | 2023-05-30 |
CN116186547B (en) | 2023-07-07 |
Family
ID=86450939
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310467167.8A Active CN116186547B (en) | 2023-04-27 | 2023-04-27 | Method for rapidly identifying abnormal data of environmental water affair monitoring and sampling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116186547B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080168339A1 (en) * | 2006-12-21 | 2008-07-10 | Aquatic Informatics (139811) | System and method for automatic environmental data validation |
CN109523510A (en) * | 2018-10-11 | 2019-03-26 | 浙江大学 | River water quality free air anomaly method for detecting area based on multi-spectrum remote sensing image |
US20200050182A1 (en) * | 2018-08-07 | 2020-02-13 | Nec Laboratories America, Inc. | Automated anomaly precursor detection |
CN113158391A (en) * | 2021-04-30 | 2021-07-23 | 中国人民解放军国防科技大学 | Method, system, device and storage medium for visualizing multi-dimensional network node classification |
CN113553577A (en) * | 2021-06-01 | 2021-10-26 | 中国人民解放军战略支援部队信息工程大学 | Unknown user malicious behavior detection method and system based on hypersphere variational automatic encoder |
CN113988268A (en) * | 2021-11-03 | 2022-01-28 | 西安交通大学 | Heterogeneous multi-source time sequence anomaly detection method based on unsupervised full-attribute graph |
CN114444914A (en) * | 2022-01-20 | 2022-05-06 | 中国电建集团华东勘测设计研究院有限公司 | Method for analyzing change trend of key water quality factor for watershed comprehensive treatment performance evaluation |
CN114510467A (en) * | 2022-01-07 | 2022-05-17 | 深圳市广汇源环境水务有限公司 | Intelligent water affair data abnormity identification method |
CN115018021A (en) * | 2022-08-08 | 2022-09-06 | 广东电网有限责任公司肇庆供电局 | Machine room abnormity detection method and device based on graph structure and abnormity attention mechanism |
US20220342988A1 (en) * | 2021-04-21 | 2022-10-27 | Sonalysts, Inc. | System and method of situation awareness in industrial control systems |
CN115470850A (en) * | 2022-09-09 | 2022-12-13 | 大连理工大学 | Water quality abnormal event recognition early warning method based on pipe network water quality time-space data |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117592571A (en) * | 2023-12-05 | 2024-02-23 | 武汉华康世纪医疗股份有限公司 | Air conditioning unit fault type diagnosis method and system based on big data |
CN117592571B (en) * | 2023-12-05 | 2024-05-17 | 武汉华康世纪医疗股份有限公司 | Air conditioning unit fault type diagnosis method and system based on big data |
Also Published As
Publication number | Publication date |
---|---|
CN116186547B (en) | 2023-07-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||