CN116186547A - Method for rapidly identifying abnormal data of environmental water affair monitoring and sampling - Google Patents

Info

Publication number
CN116186547A
Authority
CN
China
Prior art keywords
data
feature
constructing
time sequence
distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310467167.8A
Other languages
Chinese (zh)
Other versions
CN116186547B (en)
Inventor
赵鑫
梁彬锐
阳秀春
黄文稻
邓超联
张毅
王晖文
彭玉萍
江锦燕
龚艳光
杨茂勇
谢艳玲
刘瑶瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ghy Environment Water Conservancy Co ltd
Original Assignee
Shenzhen Ghy Environment Water Conservancy Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Ghy Environment Water Conservancy Co ltd filed Critical Shenzhen Ghy Environment Water Conservancy Co ltd
Priority to CN202310467167.8A priority Critical patent/CN116186547B/en
Publication of CN116186547A publication Critical patent/CN116186547A/en
Application granted granted Critical
Publication of CN116186547B publication Critical patent/CN116186547B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A20/00 Water conservation; Efficient water supply; Efficient water use
    • Y02A20/152 Water filtration

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method for rapidly identifying abnormal data in environmental water affairs monitoring and sampling, relating to the technical field of water affairs data processing. The method comprises the following steps: S1, given collected water affairs data, extracting features based on the time sequence distribution of the sequences; S2, performing feature aggregation on the features based on spatial graph distribution and Kullback-Leibler divergence to obtain feature vectors that fuse the spatial distribution information of different time sequence data; S3, constructing, for the feature vectors, a data correlation matrix based on mechanism equations; S4, introducing picture information in the data, extracting image features, and constructing a cross-modal vision-time sequence-spatial distribution-mechanism feature characterization model to obtain overall features; S5, constructing a feature decoding prediction and prediction network training module, decoding the overall features, outputting predicted values and training the prediction network; S6, predicting new water affairs data using the existing training set data, judging whether a distribution shift has occurred, and identifying data anomalies.

Description

Method for rapidly identifying abnormal data of environmental water affair monitoring and sampling
Technical Field
The invention relates to the technical field of water affair data processing, in particular to a method for rapidly identifying abnormal data of environmental water affair monitoring and sampling.
Background
In the field of environmental water engineering, a large amount of data, such as real-time flow, temperature, dissolved oxygen, algae, organic carbon, organic phosphorus, organic nitrogen and ammonia nitrogen, is usually collected in a given water area. These data can be used to analyze the hydrodynamic characteristics of the water area as well as its pollution condition, pollution degree and pollution mechanism. Data collection therefore plays an important role in flood control, drainage and pollution control for the water area. However, data collected in the field often contains anomalies for various reasons, such as sensor damage, mishandling by collection personnel, or accumulated equipment error. Automatically identifying abnormal data within a large volume of data is therefore critical to ensuring reliable environmental water affairs analysis.
A search of the prior art shows that publication No. CN109160550A discloses an urban sewage treatment information management system in which a fault-tree-based expert system identifies abnormal parts of the sewage treatment process. However, such methods adequately consider neither the spatio-temporal continuity between data of the same type nor the mechanism relationships, expressed as differential equations, between different data types. The method considers only whether a data value is too high or too low and judges abnormality in the system by reference to the fault tree. Such expert systems are simple and efficient but fail to identify complex, latent system failures. In addition, Chinese invention patent publication No. CN111830871A discloses abnormality identification for water affairs monitoring equipment using frequency domain analysis of equipment data, but that scheme has the following drawbacks: frequency domain features usually have insufficient representational capability, so it is difficult to characterize subtle anomalies and to find latent data anomalies.
The schemes disclosed above are unreliable and inaccurate in identifying anomalies in water affairs data. To improve anomaly identification capability, the invention provides a new scheme that identifies complex and latent anomalies, so as to achieve more reliable and more accurate data anomaly identification in environmental water affairs and to meet practical needs.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method for rapidly identifying abnormal data in environmental water affairs monitoring and sampling, so as to solve the technical problems described above.
The technical scheme adopted to solve the technical problems is as follows: a method for rapidly identifying abnormal data in environmental water affairs monitoring and sampling, characterized in that the method comprises the following steps: S1, given collected water affairs data, extracting features based on the time sequence distribution of the sequences; S2, performing feature aggregation on the features based on spatial graph distribution and Kullback-Leibler divergence to obtain feature vectors that fuse the spatial distribution information of different time sequence data; S3, constructing, for the feature vectors, a data correlation matrix based on mechanism equations; S4, introducing picture information in the data, extracting image features, and constructing a cross-modal vision-time sequence-spatial distribution-mechanism feature characterization model to obtain overall features; S5, constructing a feature decoding prediction and prediction network training module, decoding the overall features, outputting predicted values and training the prediction network; S6, predicting new water affairs data using the existing training set data, judging whether a distribution shift has occurred, and identifying data anomalies.
In the above method, step S1 comprises the following steps:
S11, given the acquired data
Figure SMS_1
where m and n indicate that the data is acquired at spatial location (m, n), t indicates that the data is acquired at time t, and k denotes the type of the data;
S12, for each data category k, constructing a time sequence feature extraction network based on a Transformer;
S13, aggregating data of the same type acquired at the same place to form sequence data
Figure SMS_2
and extracting the features of all the sequence data using the time sequence feature extraction network.
In the above method, in step S13, the features of all the sequence data are extracted using a weighted time sequence attention mechanism, with the formula
Figure SMS_3
where
Figure SMS_4
In the above method, the attention weight matrix based on the time sequence is obtained as
Figure SMS_5
In the above method, step S2 comprises the following steps:
S21, arranging the features constructed from the obtained data of the same category according to the geographical distribution of the data to construct a feature graph, wherein each node in the feature graph is the feature vector of one piece of time sequence data, and the weight of the edge between two nodes is calculated as follows: for a geographic position
Figure SMS_6
a Gaussian distribution is constructed with mean
Figure SMS_7
and covariance
Figure SMS_8
and for two adjacent nodes in the feature graph
Figure SMS_9
the weight of the edge is calculated from the Kullback-Leibler divergence between the two Gaussian distributions
Figure SMS_10
S22, adopting a graph network to perform weighted feature aggregation on the time sequence feature vectors of the data based on their spatial arrangement in the feature graph.
In the above method, step S3 comprises the following steps:
S31, constructing importance scores among different data categories according to the mechanisms of hydrodynamics, water quality, and algae growth and decline;
S32, constructing an adjacency matrix among the different data types according to the scores
Figure SMS_11
wherein the element of the adjacency matrix
Figure SMS_12
represents the annotated score between data categories
Figure SMS_13
and the scores are used to construct a correlation matrix between the data
Figure SMS_14
constructed as
Figure SMS_15
In the above method, step S4 comprises the following steps:
S41, introducing picture information in the data, and extracting picture information features using a Vision Transformer;
S42, concatenating the picture information features with the feature vectors;
S43, weighting the concatenated features using the correlation matrix
Figure SMS_16
and constructing a cross-modal vision-time sequence-spatial distribution-mechanism feature characterization model to obtain the overall features.
In the above method, step S5 comprises the following steps:
S51, decoding the overall features, and outputting the predicted value of the data at an unknown point (m, n, t, k);
S52, masking random data in the original data, applying position coding, time coding and data type coding to the masked data, predicting the masked data with the prediction network, taking the Euclidean distance between the true value and the predicted value of the masked data as the training objective to construct gradient information, and training the prediction network.
In the above method, in step S51, the overall features are decoded using a multi-layer Multi-Head Attention network as the decoder.
In the above method, step S6 comprises the following steps:
S61, using the existing training set data, extracting, in the neighborhood of the new data
Figure SMS_17
q batches of data
Figure SMS_18
cyclically taking the q batches of data as input to the prediction network to predict the data at the unknown point (m, n, t, k), the prediction network outputting a prediction mean
Figure SMS_19
and a prediction standard deviation
Figure SMS_20
thereby obtaining q predicted values
Figure SMS_21
and q prediction standard deviations
Figure SMS_22
S62, constructing q Gaussian distributions from the predicted means and predicted variances
Figure SMS_23
S63, taking the new data
Figure SMS_24
as a sample point of the Gaussian distributions
Figure SMS_25
and calculating the probability density of the sample point under the q Gaussian distributions;
S64, when the probability density of the sample point under one of the Gaussian distributions is lower than a set threshold, identifying the sample point as abnormal.
The beneficial effects of the invention are as follows. The method fuses the time sequence relationship and the spatial distribution relationship among the data with a mechanism-based correlation module, and extracts features from water affairs data using a cross-modal large model, so that latent features in the data are captured more effectively and the reliability and accuracy of anomaly identification in water affairs data are improved. The feature representation fuses the time sequence relationship, the graph-based spatial distribution relationship and the mechanism relationship, which makes the feature representation of the data more accurate and addresses the inaccurate identification caused by insufficient feature representation capability in conventional data anomaly identification. The data is first predicted and then identified: the data to be judged is predicted over multiple rounds using neighborhood data (different neighborhood samples), and if the variance of the results across the rounds is large, the data is identified as abnormal, which makes anomaly identification more stable.
Drawings
FIG. 1 is a flow chart of a method for rapidly identifying abnormal data of environmental water affair monitoring and sampling in the invention.
Fig. 2 is a diagram comparing the network module used in a Transformer network with the weighted attention module of the invention, in an embodiment of the invention.
Fig. 3 is a feature extraction module based on a sequence timing distribution in an embodiment of the invention.
Fig. 4 is a feature aggregation module based on feature space diagram distribution and Kullback-Leibler divergence in an embodiment of the present invention.
FIG. 5 is a cross-modal visual-time series-spatial distribution-mechanism characterization model in an embodiment of the invention.
Fig. 6 is a schematic diagram of position encoding in an embodiment of the invention.
FIG. 7 is a data prediction flow in an embodiment of the invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
The conception, specific structure and technical effects of the invention are described clearly and completely below with reference to the embodiments and the drawings, so that the objects, features and effects of the invention can be fully understood. The described embodiments are only some, not all, embodiments of the invention; other embodiments obtained by those skilled in the art without inventive effort on the basis of these embodiments fall within the scope of protection of the invention. In addition, the coupling/connection relationships referred to in this patent do not refer solely to direct connection of components; rather, a better coupling structure may be formed by adding or removing coupling aids depending on the specific implementation. The technical features of the invention may be combined with one another provided they do not contradict or conflict.
The invention discloses a method for rapidly identifying abnormal data of environmental water affair monitoring and sampling, which comprises the following steps:
S1, given collected water affairs data, extracting features based on the time sequence distribution of the sequences;
specifically, step S1 comprises the following steps:
S11, given the acquired data
Figure SMS_26
where m and n indicate that the data is acquired at spatial location (m, n), t indicates that the data is acquired at time t, and k denotes the type of the data;
S12, for each data category k, constructing a time sequence feature extraction network based on a Transformer, a network that uses an attention mechanism to improve model training speed;
S13, aggregating data of the same type acquired at the same place to form sequence data
Figure SMS_27
and extracting the features of all the sequence data using the time sequence feature extraction network of step S12.
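The grouping described in step S13 can be sketched as follows. This is a minimal illustration only: the record layout `(m, n, t, k, value)`, the function name `build_sequences` and the sample values are assumptions made for the example and do not appear in the patent.

```python
from collections import defaultdict


def build_sequences(records):
    """Group (m, n, t, k, value) records into one time series per site and data type."""
    grouped = defaultdict(list)
    for m, n, t, k, value in records:
        grouped[(m, n, k)].append((t, value))
    # Sort each series by acquisition time t and keep only the measured values.
    return {key: [v for _, v in sorted(series)] for key, series in grouped.items()}


# Hypothetical measurements: (m, n, t, k, value)
records = [
    (0, 0, 2, "ammonia_nitrogen", 0.31),
    (0, 0, 1, "ammonia_nitrogen", 0.28),
    (0, 1, 1, "ammonia_nitrogen", 0.40),
    (0, 0, 1, "dissolved_oxygen", 7.9),
]
sequences = build_sequences(records)
```

Each resulting series would then be passed to the time sequence feature extraction network of step S12.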
Further, the features of all the sequence data may be extracted using a weighted time sequence attention mechanism, with the formula
Figure SMS_28
where
Figure SMS_29
The corresponding attention weight matrix based on the time sequence is obtained as
Figure SMS_30
S2, performing feature aggregation based on spatial graph distribution and Kullback-Leibler divergence on the features constructed from the data of the same category obtained in step S1, to obtain feature vectors that fuse the spatial distribution information of different time sequence data;
specifically, step S2 comprises the following steps:
S21, arranging the features constructed from the data of the same category obtained in step S13 according to the geographical distribution of the data to construct a feature graph, wherein each node in the feature graph is the feature vector of one piece of time sequence data, and the weight of the edge between two nodes is calculated as follows: for a geographic position
Figure SMS_31
a Gaussian distribution is constructed with mean
Figure SMS_32
and covariance
Figure SMS_33
and for two adjacent nodes in the feature graph
Figure SMS_34
the weight of the edge is calculated from the Kullback-Leibler divergence between the two Gaussian distributions
Figure SMS_35
S22, adopting a graph network to perform weighted feature aggregation on the time sequence feature vectors of the data based on their spatial arrangement in the feature graph.
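For reference, the Kullback-Leibler divergence between two d-dimensional Gaussian distributions has a standard closed form, given below. The symbols \mu_i, \Sigma_i, \mu_j, \Sigma_j for the means and covariances of two adjacent nodes i and j are notation assumed here; the patent's own formula image is not reproduced in this text, so this is the textbook expression rather than a confirmed transcription of it.

```latex
D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu_i,\Sigma_i)\,\middle\|\,\mathcal{N}(\mu_j,\Sigma_j)\right)
= \frac{1}{2}\left[
\operatorname{tr}\!\left(\Sigma_j^{-1}\Sigma_i\right)
+ (\mu_j-\mu_i)^{\top}\Sigma_j^{-1}(\mu_j-\mu_i)
- d
+ \ln\frac{\det\Sigma_j}{\det\Sigma_i}
\right]
```

Note that this divergence is not symmetric in i and j, so an implementation has to fix a direction for each edge.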
S3, in order to make full use of the relationships among different kinds of data, constructing, for the feature vectors, a data correlation matrix based on mechanism equations;
specifically, step S3 comprises the following steps:
S31, having industry experts construct importance scores among different data categories according to the mechanisms of hydrodynamics, water quality, and algae growth and decline;
S32, constructing an adjacency matrix among the different data types according to the scores
Figure SMS_36
wherein the element of the adjacency matrix
Figure SMS_37
represents the expert-assigned score between data categories
Figure SMS_38
and the scores are used to construct a correlation matrix between the data
Figure SMS_39
constructed as
Figure SMS_40
S4, on the basis of the feature extraction and correlation matrix construction for time sequence, spatial distribution and mechanism, additionally introducing picture information in the data, extracting image features, and constructing a cross-modal vision-time sequence-spatial distribution-mechanism feature characterization model to obtain overall features;
specifically, step S4 comprises the following steps:
S41, on the basis of the feature extraction and correlation matrix construction for time sequence, spatial distribution and mechanism, additionally introducing picture information in the data, for example the purification effect of alum in the water body, which is difficult to express numerically and can only be judged from picture information, and extracting the picture information features using a Vision Transformer;
S42, concatenating the picture information features with the feature vectors obtained in step S2;
S43, weighting the concatenated features using the correlation matrix of step S3
Figure SMS_41
so that the cross-modal vision-time sequence-spatial distribution-mechanism feature characterization model is constructed as a whole and the overall features are obtained.
S5, constructing a feature decoding prediction and prediction network training module, decoding the overall features, outputting predicted values and training the prediction network;
step S5 comprises the following steps:
S51, based on the overall features of the data obtained in step S4, using a multi-layer Multi-Head Attention network as the decoder to decode the overall features and output the predicted value of the data at an unknown point (m, n, t, k);
S52, masking random data in the original data, applying position coding, time coding and data type coding to the masked data, predicting the masked data with the prediction network, taking the Euclidean distance between the true value and the predicted value of the masked data as the training objective to construct gradient information, and training the prediction network;
to train the prediction network, the given input data is masked at random and the network must predict the masked data. The L2 distance between the predicted value and the true value is used as the loss from which gradient information is constructed to train the prediction network.
S6, predicting new water affairs data using the existing training set data, judging whether a distribution shift has occurred, and identifying data anomalies;
specifically, step S6 comprises the following steps:
S61, using the existing training set data, extracting, in the neighborhood of the new data
Figure SMS_42
q batches of data
Figure SMS_43
cyclically taking the q batches of data as input to the prediction network to predict the data at the unknown point (m, n, t, k), where the prediction network can simultaneously output a prediction mean
Figure SMS_44
and a prediction standard deviation
Figure SMS_45
so that q predicted values for the unknown point are obtained
Figure SMS_46
together with q prediction standard deviations
Figure SMS_47
S62, constructing q Gaussian distributions from the predicted means and predicted variances
Figure SMS_48
S63, taking the new data
Figure SMS_49
(i.e. the data of the new unknown point (m, n, t, k)) as a sample point of the Gaussian distributions
Figure SMS_50
and calculating the probability density of the sample point under the q Gaussian distributions;
S64, when the probability density of the sample point under one of the Gaussian distributions is lower than a set threshold, identifying the sample point as abnormal, i.e. the water affairs data of the new unknown point (m, n, t, k) is abnormal.
Taking the Yantian Reservoir basin in Shenzhen as an example, consider that the data indexes in the basin include flow, water temperature, dissolved oxygen, ammonia nitrogen, total nitrogen, total phosphorus, algae density, etc. First, various water affairs data from different positions and time periods in the basin are used, including indexes such as flow, water temperature, dissolved oxygen, ammonia nitrogen, total phosphorus and algae density. Here we denote by
Figure SMS_51
the data collected at place
Figure SMS_52
at time
Figure SMS_53
whose data type is
Figure SMS_54
where the subscript of the data indicates
Figure SMS_55
data.
Step S1, feature extraction based on sequence time sequence distribution. The data are arranged in time order to obtain a number of time series. For example, for a given cross-section
Figure SMS_56
all ammonia nitrogen data from the last half year can be constructed into a time series
Figure SMS_57
In this way, ammonia nitrogen time series for multiple different cross-sections in the basin are obtained. Feature extraction is then performed on the ammonia nitrogen time series using a Transformer. Within a time series, so that data separated by shorter time intervals have larger correlation, the invention constructs a weighting matrix for the time series data and, when extracting data features with the Transformer, adds a weighted attention layer in an intermediate layer so that the resulting features have stronger representational capability. Specifically, referring to Fig. 2, the encoder of a conventional Transformer contains a stack of network modules as shown in the left diagram of Fig. 2. To make full use of the time sequence relationship among the data, however, the invention replaces the Multi-Head Attention layer in one of the network modules with the weighted attention layer proposed by the invention.
In this way, the time sequence features of ammonia nitrogen data at different places are obtained. These features are constructed into a graph according to the geographical locations at which the sequence data were acquired. The overall flow is shown in Fig. 3. The same feature extraction scheme is also applied to the other data types (dissolved oxygen, total phosphorus, etc.).
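The idea of giving larger attention weight to data that are closer in time can be illustrated with a small sketch. The patent's actual weighting formula is contained in formula images that are not reproduced in this text, so the version below is only one plausible form and is an assumption: a linear penalty `decay * |i - j|` is subtracted from the scaled dot-product scores before the softmax. The function names and the `decay` parameter are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def time_weighted_attention(queries, keys, values, decay=0.5):
    """Scaled dot-product attention with an assumed time-distance penalty."""
    dim = len(keys[0])
    output, weights = [], []
    for i, q in enumerate(queries):
        # Score for step j: similarity minus a penalty that grows with |i - j|.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) - decay * abs(i - j)
            for j, k in enumerate(keys)
        ]
        row = softmax(scores)
        weights.append(row)
        output.append(
            [sum(w * v[c] for w, v in zip(row, values)) for c in range(len(values[0]))]
        )
    return output, weights


# Three time steps with identical embeddings, so only the time penalty differs.
x = [[1.0, 0.0]] * 3
out, attn = time_weighted_attention(x, x, x)
```

With identical embeddings, the attention row for the first time step decreases monotonically with time distance, which is the qualitative behaviour the description asks for.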
Step S2, feature aggregation based on spatial graph distribution and Kullback-Leibler divergence. In the previous step, the features of the ammonia nitrogen data were constructed into a graph according to their spatial distribution. Each node in the graph is the ammonia nitrogen feature vector of the corresponding cross-section. If the Kullback-Leibler divergence between two nodes is greater than a certain threshold, an edge exists between them, and the weight of the edge is the Kullback-Leibler divergence between the two nodes. Under this definition of the graph, the aggregation process for one feature node is as follows. As shown in Fig. 4, node 5 is connected to node 2, node 3 and node 4, and each node holds the ammonia nitrogen feature vector of its site. For the feature vector
Figure SMS_58
the aggregated feature is denoted
Figure SMS_59
and the feature aggregation is computed as
Figure SMS_60
where
Figure SMS_61
represents a weight to be learned, and
Figure SMS_62
represents the Kullback-Leibler divergence between node 2 and node 5.
Aggregating once means performing one aggregation on every feature node in the graph. The aggregation process is repeated several times to obtain the final aggregated data features. Not only the ammonia nitrogen features but also the time sequence features of the other data types must be aggregated in this way, yielding feature vectors that fuse the spatial distribution information of different time sequence data.
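A single aggregation step for node 5 and its neighbours 2, 3 and 4 might look like the sketch below. Several things here are assumptions, since the aggregation formula itself is not reproduced in this text: the Gaussians are taken to be isotropic, the learned weight is reduced to a single scalar `weight`, the node's own feature is kept and the weighted neighbour features are added to it, and all positions and feature values are invented for illustration.

```python
import math


def kl_isotropic(mu0, var0, mu1, var1):
    """KL divergence from N(mu0, var0*I) to N(mu1, var1*I)."""
    d = len(mu0)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu0, mu1))
    return 0.5 * (d * var0 / var1 + sq_dist / var1 - d + d * math.log(var1 / var0))


def aggregate(node, neighbours, features, positions, variances, weight=1.0):
    """Return the node feature plus the KL-weighted sum of its neighbours' features."""
    result = list(features[node])
    for j in neighbours:
        kl = kl_isotropic(positions[j], variances[j], positions[node], variances[node])
        for c, value in enumerate(features[j]):
            result[c] += weight * kl * value
    return result


# Hypothetical site coordinates, unit variances and 2-dimensional features.
positions = {2: (1.0, 0.0), 3: (0.0, 2.0), 4: (1.0, 1.0), 5: (0.0, 0.0)}
variances = {2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0}
features = {2: [0.0, 2.0], 3: [1.0, 1.0], 4: [2.0, 0.0], 5: [1.0, 0.0]}
aggregated_5 = aggregate(5, [2, 3, 4], features, positions, variances)
```

Repeating this update for every node, several times over, corresponds to the multi-round aggregation described above.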
Step S3, construction of a data correlation matrix based on mechanism equations. The previous step yields feature vectors for data such as flow, water temperature, dissolved oxygen, ammonia nitrogen, total phosphorus and algae density at different places (different nodes in the graph). Each vector already contains time sequence information and spatial distribution information. This step performs further feature extraction on these vectors. Taking into account the mechanisms of hydrodynamics, water quality, and algae growth and decline in the basin, industry experts score the correlation among indexes such as water temperature, dissolved oxygen, ammonia nitrogen, total phosphorus and algae density. Each score is a triplet expressing the correlation between two different data types, for example (ammonia nitrogen, dissolved oxygen, 0.1). For each data type, the correlation scores with all other data types are set to sum to 1. This gives a correlation matrix P among the data types; the n-th power of P is then computed to obtain the final correlation matrix.
As in the feature extraction step based on sequence time sequence distribution, a weighted attention layer is constructed from the correlation matrix P, and the weighted attention network module is used to further fuse the feature vectors, so that the feature representation captures the mechanism relationships among different data.
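The construction of P from expert triplets and its n-th power can be sketched as follows. Only the triplet (ammonia nitrogen, dissolved oxygen, 0.1) comes from the text; all other scores, the three-type subset, the choice n = 2 and the helper names are assumptions made for illustration.

```python
TYPES = ["ammonia_nitrogen", "dissolved_oxygen", "algae_density"]

# (source type, target type, score); scores other than the first are hypothetical.
TRIPLETS = [
    ("ammonia_nitrogen", "dissolved_oxygen", 0.1),
    ("ammonia_nitrogen", "algae_density", 0.9),
    ("dissolved_oxygen", "ammonia_nitrogen", 0.4),
    ("dissolved_oxygen", "algae_density", 0.6),
    ("algae_density", "ammonia_nitrogen", 0.7),
    ("algae_density", "dissolved_oxygen", 0.3),
]


def build_matrix(types, triplets):
    """Fill a square score matrix from (source, target, score) triplets."""
    index = {name: i for i, name in enumerate(types)}
    matrix = [[0.0] * len(types) for _ in types]
    for source, target, score in triplets:
        matrix[index[source]][index[target]] = score
    return matrix


def row_normalise(matrix):
    """Scale each row so that its scores sum to 1."""
    return [[v / sum(row) for v in row] for row in matrix]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matrix_power(p, n):
    """Multiply the identity by p a total of n times."""
    result = [[1.0 if i == j else 0.0 for j in range(len(p))] for i in range(len(p))]
    for _ in range(n):
        result = matmul(result, p)
    return result


P = row_normalise(build_matrix(TYPES, TRIPLETS))
P_final = matrix_power(P, 2)
```

Because every row of P sums to 1, every row of its powers also sums to 1, so the final matrix can still be read as a set of normalised weights, here covering indirect relationships that pass through one intermediate data type.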
Step S4, cross-modal vision-time sequence-spatial distribution-mechanism feature characterization model. On top of the feature extraction step based on sequence time sequence distribution, the feature aggregation step based on spatial graph distribution and Kullback-Leibler divergence, and the data correlation matrix construction step based on mechanism equations, an image feature extraction step based on a Vision Transformer is added in order to evaluate the degree of eutrophication in the water body. The overall feature encoding model is shown in Fig. 5.
Step S5, feature decoding prediction and prediction network training. The preceding steps produce a feature representation of the acquired data that takes into account the time sequence relationship, the topological relationship of the spatial distribution and the mechanism relationship among the data. These features are then decoded so that the decoder can predict indexes such as organic nitrogen, dissolved oxygen and algae at an unknown time in an unknown area. In the invention, the decoder matches the decoder of the Transformer but additionally adds a time code and an index type code as condition information, and predicts the value of the index for a specific time, a specific position and a specific type.
The position code is still represented in the form of a graph network, as shown in fig. 7, if a node is to be represented
Figure SMS_63
The representation of a certain position is completed by only setting the value of the corresponding node to 1 and the values of other nodes to 0. In order to convert the representation of the graph form into a dense vector representation, the graphs shown in the right graph in fig. 6 are still assembled and spliced here in an assembled manner.
The time code is similar to a one-hot code. For example, for a period running from 5 time steps in the past up to the present, the code for the current time may be "000001" and the code for the previous time step may be "000010".
The data type code is also a one-hot code, in which the flow code is "00000001", the water temperature code is "00000010", the dissolved oxygen code is "00000100", the ammonia nitrogen code is "00001000", the total nitrogen code is "00010000", the total phosphorus code is "00100000", and the algae density code is "01000000".
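The two one-hot schemes above can be illustrated with a minimal sketch. The function names `time_code` and `type_code` and the English type labels in `DATA_TYPES` are hypothetical, chosen only for this illustration; the bit layouts are taken from the examples in the text (the rightmost bit is the current time, or the first listed data type). The position code is not included, because its dense form depends on the graph aggregation shown in the figures.

```python
# Illustrative sketch only: names are hypothetical, bit layouts follow the text.
DATA_TYPES = [
    "flow",
    "water_temperature",
    "dissolved_oxygen",
    "ammonia_nitrogen",
    "total_nitrogen",
    "total_phosphorus",
    "algae_density",
]


def time_code(steps_back: int, history: int = 5) -> str:
    """One-hot time code over history + 1 slots; steps_back=0 is the current time."""
    if not 0 <= steps_back <= history:
        raise ValueError("steps_back must lie in [0, history]")
    width = history + 1
    bits = ["0"] * width
    bits[width - 1 - steps_back] = "1"  # rightmost bit marks the current time
    return "".join(bits)


def type_code(name: str, width: int = 8) -> str:
    """One-hot data type code; the text uses 8 bits for the 7 listed types."""
    index = DATA_TYPES.index(name)  # raises ValueError for an unknown type
    bits = ["0"] * width
    bits[width - 1 - index] = "1"
    return "".join(bits)


if __name__ == "__main__":
    print(time_code(0), time_code(1))   # current time, previous time step
    print(type_code("dissolved_oxygen"))
```

Concatenating the two bit strings (for example `time_code(1) + type_code("flow")`) gives one simple way to form the conditioning input; how the decoder actually consumes the codes is not specified beyond the description above.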
With the above-described encoding scheme, a decoder can decode the feature and predict the unknown value at the encoding site. The overall flow is shown in fig. 7 below.
Thus, the prediction network can be trained with a large amount of data. The training samples are constructed as follows: one data item in the original data is masked, position coding, time coding and data type coding are performed for that item, and the prediction network is used to predict the masked data. The training objective is the Euclidean distance between the true value of the masked data and the predicted value.
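The sample construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the data is assumed to be stored as an array indexed by (m, n, t, k), the mask is represented by NaN, and the helper names `make_masked_sample` and `euclidean_loss` are hypothetical. The prediction network itself is omitted.

```python
import numpy as np


def make_masked_sample(data: np.ndarray, rng: np.random.Generator):
    """Mask one random entry of a (M, N, T, K) array.

    Returns the masked copy, the masked index (m, n, t, k) and the true value.
    NaN is used as the mask marker here purely for illustration.
    """
    index = tuple(int(rng.integers(0, size)) for size in data.shape)
    masked = data.astype(float)  # astype returns a copy, the input stays intact
    target = float(masked[index])
    masked[index] = np.nan
    return masked, index, target


def euclidean_loss(predicted, target) -> float:
    """Euclidean distance between predicted and true values."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(4, 4, 6, 7))  # synthetic stand-in for monitoring data
    masked, index, target = make_masked_sample(data, rng)
    prediction = 0.0  # placeholder for the network output at the masked index
    print(index, euclidean_loss(prediction, target))
```

For a single scalar prediction the Euclidean distance reduces to the absolute error; the masked index (m, n, t, k) is what the position, time and data type codes would encode.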
S6: Data anomaly identification. As described above, the existing training set data is first used to extract, in the neighborhood of the new data
Figure SMS_64
q batches of data
Figure SMS_67
. The q batches of data are used in turn as input to the prediction network to predict the data at (m, n, t, k). Since the prediction network outputs both a prediction mean
Figure SMS_69
and a prediction standard deviation
Figure SMS_66
this yields q predicted values
Figure SMS_68
and q prediction standard deviations
Figure SMS_70
. From the predicted means and predicted variances, q Gaussian distributions can be constructed:
Figure SMS_71
The new data
Figure SMS_65
is then treated as a sample point of these Gaussian distributions, and the probability density of the sample point under each of the q distributions is calculated. If, under one of the Gaussian distributions, the probability density of the new data is lower than the manually set threshold, the sample point is considered abnormal.
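The density test described above can be sketched in a few lines. This is a hedged illustration: the function names `gaussian_pdf` and `is_abnormal` and all numeric values are hypothetical, the q predicted means and standard deviations are assumed to be already available from the prediction network, and the rule is read as "abnormal if the density falls below the threshold under at least one of the q distributions", which is one interpretation of the text.

```python
import math


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Probability density of a univariate Gaussian N(mean, std**2) at x."""
    if std <= 0:
        raise ValueError("std must be positive")
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def is_abnormal(x, means, stds, threshold):
    """Evaluate x under q Gaussians built from q predicted means and stds.

    Returns (flag, densities); flag is True when any density is below threshold.
    """
    densities = [gaussian_pdf(x, m, s) for m, s in zip(means, stds)]
    return any(d < threshold for d in densities), densities


if __name__ == "__main__":
    # Hypothetical predictions from q = 3 neighborhood batches.
    means = [10.0, 10.2, 9.9]
    stds = [0.5, 0.4, 0.6]
    for value in (10.1, 14.0):
        flag, densities = is_abnormal(value, means, stds, threshold=0.05)
        print(value, flag, [round(d, 4) for d in densities])
```

The threshold is set manually, as stated in the text; because a probability density scales with 1/std, a practical threshold would need to be chosen with the typical predicted standard deviations in mind.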
The invention fuses the time sequence relationship and the spatial distribution relationship among the data with the mechanism-based correlation module, and extracts features of the water service data using a cross-modal large model, so that latent features in the data are captured more effectively and the reliability and accuracy of data anomaly identification in water service data are improved. In the data feature representation, the time sequence relationship, the graph-based spatial distribution relationship and the mechanism relationship are fused, so that the feature representation of the data is more accurate, which addresses the inaccurate identification caused by insufficient feature representation capacity in conventional data anomaly identification. The data is first predicted and then identified: the data to be judged is first predicted over multiple rounds using neighborhood data (different neighborhood samples), and if the variance of the prediction results across the rounds is large, the data is identified as abnormal, which makes the identification of data anomalies more stable.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and these equivalent modifications or substitutions are included in the scope of the present invention as defined in the appended claims.

Claims (10)

1. A method for rapidly identifying abnormal data of environmental water affair monitoring and sampling, characterized by comprising the following steps:
S1, given collected water service data, extracting features based on sequence time sequence distribution;
S2, carrying out feature aggregation based on space diagram distribution and Kullback-Leibler divergence on the features to obtain feature vectors that fuse the spatial distribution information of different time sequence data;
S3, constructing a data correlation matrix based on a mechanism equation for the feature vectors;
S4, introducing picture information in the data, extracting image features, and constructing a cross-modal vision-time sequence-spatial distribution-mechanism feature characterization model to obtain integral features;
S5, constructing a feature decoding prediction and prediction network training module, decoding the integral features, outputting a predicted value and training a prediction network;
S6, predicting new water service data by using the existing training set data, judging whether a distribution deviation occurs, and identifying data abnormality.
2. The method for quickly identifying abnormal data of environmental water affair monitoring and sampling according to claim 1, which is characterized in that: the step S1 comprises the following steps:
S11, giving acquired data
Figure QLYQS_1
where m, n denote that the data is acquired at the spatial location (m, n), t denotes that the data is acquired at time t, and k denotes the type of the data;
S12, constructing a Transformer-based time sequence feature extraction network for each data category k;
S13, aggregating the data of the same category at the same location to form sequence data
Figure QLYQS_2
and extracting the features of all the sequence data by using the time sequence feature extraction network.
3. The method for quickly identifying abnormal data of environmental water affair monitoring and sampling according to claim 2, which is characterized in that: in the step S13, features of all the sequence data are extracted, and a weighted time sequence attention mechanism is adopted, and the formula is as follows
Figure QLYQS_3
wherein
Figure QLYQS_4
4. The method for quickly identifying abnormal data of environmental water affair monitoring and sampling according to claim 3, which is characterized in that: obtaining the corresponding attention weight matrix based on the time sequence as
Figure QLYQS_5
5. The method for quickly identifying abnormal data of environmental water affair monitoring and sampling according to claim 2, which is characterized in that: the step S2 comprises the following steps:
S21, arranging the features constructed from the obtained data of the same category according to the geographical position distribution of the data to construct a feature map, wherein each node in the feature map is the feature vector of one piece of time sequence data, and the weight of the edge between two nodes is calculated as follows: for the node whose geographic position is
Figure QLYQS_6
a Gaussian distribution with mean
Figure QLYQS_7
and covariance
Figure QLYQS_8
is constructed; the weight of the edge between two adjacent nodes
Figure QLYQS_9
in the feature map is calculated from the Kullback-Leibler divergence between the two Gaussian distributions:
Figure QLYQS_10
S22, adopting a graph network to perform weighted feature aggregation on the data time sequence feature vectors based on the spatial arrangement in the feature graph.
6. The method for quickly identifying abnormal data of environmental water affair monitoring and sampling according to claim 5, which is characterized in that: the step S3 comprises the following steps:
S31, constructing importance scores among different data categories according to the hydrodynamic, water quality and algae growth and decline mechanisms;
S32, constructing, according to the scores, an adjacency matrix among the different data categories
Figure QLYQS_11
wherein the element of the adjacency matrix
Figure QLYQS_12
represents the score between the labeled data categories
Figure QLYQS_13
and constructing from the scores a correlation matrix between the data
Figure QLYQS_14
which is constructed as follows:
Figure QLYQS_15
7. The method for quickly identifying abnormal data of environmental water affair monitoring and sampling according to claim 6, which is characterized in that: the step S4 comprises the following steps:
S41, introducing picture information in the data, and extracting picture information features by adopting Vision Transformer;
S42, concatenating the picture information features with the feature vectors;
S43, utilizing the correlation matrix
Figure QLYQS_16
to weight the concatenated features, and constructing a cross-modal vision-time sequence-spatial distribution-mechanism feature characterization model to obtain integral features.
8. The method for quickly identifying abnormal data of environmental water affair monitoring and sampling according to claim 7, which is characterized in that: the step S5 comprises the following steps:
S51, decoding the integral features, and outputting predicted values of the data at the unknown point (m, n, t, k);
S52, masking random data in the original data, carrying out position coding, time coding and data type coding on the masked data, predicting the masked data by using the prediction network, constructing gradient information with the Euclidean distance between the true value and the predicted value of the masked data as the prediction target, and training the prediction network.
9. The method for quickly identifying abnormal data of environmental water affair monitoring and sampling according to claim 8, which is characterized in that: in the step S51, the overall feature is decoded, and a Multi-layer Multi-Head attention network is used as a decoder.
10. The method for quickly identifying abnormal data of environmental water affair monitoring and sampling according to claim 9, which is characterized in that: the step S6 comprises the following steps:
S61, using the existing training set data, extracting, in the neighborhood of the new data
Figure QLYQS_17
q batches of data
Figure QLYQS_18
cyclically taking the q batches of data as input of the prediction network, predicting the data at the unknown point (m, n, t, k), and outputting a prediction mean
Figure QLYQS_19
and a prediction standard deviation
Figure QLYQS_20
thereby obtaining q predicted values
Figure QLYQS_21
and q prediction standard deviations
Figure QLYQS_22
S62, constructing q Gaussian distributions according to the predicted mean and the predicted variance
Figure QLYQS_23
S63, taking the new data
Figure QLYQS_24
as a sample point of the Gaussian distributions
Figure QLYQS_25
and calculating the probability density of the sample point in the q Gaussian distributions;
S64, when, in one of the Gaussian distributions, the probability density of the new data is lower than a set threshold, identifying the sample point as abnormal.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310467167.8A CN116186547B (en) 2023-04-27 2023-04-27 Method for rapidly identifying abnormal data of environmental water affair monitoring and sampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310467167.8A CN116186547B (en) 2023-04-27 2023-04-27 Method for rapidly identifying abnormal data of environmental water affair monitoring and sampling

Publications (2)

Publication Number Publication Date
CN116186547A true CN116186547A (en) 2023-05-30
CN116186547B CN116186547B (en) 2023-07-07

Family

ID=86450939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310467167.8A Active CN116186547B (en) 2023-04-27 2023-04-27 Method for rapidly identifying abnormal data of environmental water affair monitoring and sampling

Country Status (1)

Country Link
CN (1) CN116186547B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117592571A (en) * 2023-12-05 2024-02-23 武汉华康世纪医疗股份有限公司 Air conditioning unit fault type diagnosis method and system based on big data

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080168339A1 (en) * 2006-12-21 2008-07-10 Aquatic Informatics (139811) System and method for automatic environmental data validation
CN109523510A (en) * 2018-10-11 2019-03-26 浙江大学 River water quality free air anomaly method for detecting area based on multi-spectrum remote sensing image
US20200050182A1 (en) * 2018-08-07 2020-02-13 Nec Laboratories America, Inc. Automated anomaly precursor detection
CN113158391A (en) * 2021-04-30 2021-07-23 中国人民解放军国防科技大学 Method, system, device and storage medium for visualizing multi-dimensional network node classification
CN113553577A (en) * 2021-06-01 2021-10-26 中国人民解放军战略支援部队信息工程大学 Unknown user malicious behavior detection method and system based on hypersphere variational automatic encoder
CN113988268A (en) * 2021-11-03 2022-01-28 西安交通大学 Heterogeneous multi-source time sequence anomaly detection method based on unsupervised full-attribute graph
CN114444914A (en) * 2022-01-20 2022-05-06 中国电建集团华东勘测设计研究院有限公司 Method for analyzing change trend of key water quality factor for watershed comprehensive treatment performance evaluation
CN114510467A (en) * 2022-01-07 2022-05-17 深圳市广汇源环境水务有限公司 Intelligent water affair data abnormity identification method
CN115018021A (en) * 2022-08-08 2022-09-06 广东电网有限责任公司肇庆供电局 Machine room abnormity detection method and device based on graph structure and abnormity attention mechanism
US20220342988A1 (en) * 2021-04-21 2022-10-27 Sonalysts, Inc. System and method of situation awareness in industrial control systems
CN115470850A (en) * 2022-09-09 2022-12-13 大连理工大学 Water quality abnormal event recognition early warning method based on pipe network water quality time-space data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117592571A (en) * 2023-12-05 2024-02-23 武汉华康世纪医疗股份有限公司 Air conditioning unit fault type diagnosis method and system based on big data
CN117592571B (en) * 2023-12-05 2024-05-17 武汉华康世纪医疗股份有限公司 Air conditioning unit fault type diagnosis method and system based on big data

Also Published As

Publication number Publication date
CN116186547B (en) 2023-07-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant