
CN116821832A - Abnormal data identification and correction method for high-voltage industrial and commercial user power load - Google Patents

Info

Publication number
CN116821832A
Authority
CN
China
Prior art keywords
data
matrix
load
user
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202310910100.7A
Other languages
Chinese (zh)
Inventor
周州
李婧娇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Institute of Technology
Original Assignee
Nanjing Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Institute of Technology filed Critical Nanjing Institute of Technology
Priority to CN202310910100.7A priority Critical patent/CN116821832A/en
Publication of CN116821832A publication Critical patent/CN116821832A/en
Withdrawn legal-status Critical Current

Landscapes

  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The application discloses a method for identifying and correcting abnormal data in the power load of high-voltage industrial and commercial users. First, the nonlinear relations in the original data are handled by dimension reduction with the kernel PCA method, clustering is performed with the K-means algorithm, and the power consumption patterns and behavior characteristics of the various user types are mined. Then, LOF is used to identify outlier data points that differ significantly from their neighbors. Finally, the abnormal data points are completed by a TCN built from dilated non-causal convolution layers, and an error analysis against the actual data verifies the correction effect. Through the combined application of these techniques, the method can effectively identify and correct abnormal data points in the power load of high-voltage industrial and commercial users, improves the accuracy and reliability of the data, and provides a valuable reference for assuring the quality of power information for such users.

Description

Abnormal data identification and correction method for high-voltage industrial and commercial user power load
Technical Field
The application relates to the technical field of power consumption load abnormal data identification, in particular to an abnormal data identification and correction method aiming at power consumption load of high-voltage industrial and commercial users.
Background
The power system is important infrastructure supporting national economic development and social life. With the continuous improvement of social productivity, China's modern power systems have developed rapidly: power production capacity has greatly improved, and the scale of grid construction has expanded to meet rapidly growing demand. However, the development and deployment of electric energy acquisition devices has not kept pace with the development of the power system; in particular, accurate, real-time and comprehensive acquisition of electric energy information cannot yet be achieved for high-voltage industrial and commercial users with very large electricity demand. This user group, which includes large industrial enterprises, commercial complexes and office building groups, has very large and stable power consumption, which has an important influence on the stability and load balance of the power system. Identifying and correcting abnormal data in the power load of high-voltage industrial and commercial users is therefore of great significance for improving data quality, improving load forecasting and scheduling, ensuring safe and stable grid operation, and improving energy management efficiency. It helps optimize power system operation, improve energy utilization efficiency, reduce energy waste, and contribute to the sustainable development of the energy sector.
Current methods for identifying abnormal user data fall into three main categories: (1) Statistical methods are simple, easy to implement and computationally cheap; however, their ability to recognize complex abnormal patterns, such as those of office building groups, is limited, and they handle nonlinear relations poorly. (2) Rule-based detection methods are highly interpretable and suit specific scenarios with strong regularity; however, because rules and thresholds must be formulated manually, they generalize poorly and are hard to use at scale. (3) Machine learning methods can automatically learn the electricity consumption patterns and characteristics of the target users and can handle complex abnormal patterns and nonlinear relations; however, they place high demands on the quality of historical data and consume more computing power.
Traditional methods for correcting abnormal user data have certain advantages in simple scenarios but are limited when handling complex, dynamic anomalies: they depend strongly on a subjective definition of abnormal values, cannot handle complex abnormal patterns, and are prone to misjudging normal data or missing real anomalies. As data volumes and the complexity of abnormal patterns grow, conventional methods may not provide efficient and accurate correction. On this basis, a method for identifying and correcting abnormal data in the power load of high-voltage industrial and commercial users is proposed, based on kernel principal component analysis (Kernel Principal Component Analysis, Kernel PCA), the local outlier factor algorithm (Local Outlier Factor, LOF) and the temporal convolutional network (Temporal Convolutional Network, TCN). The model can process large-scale data rapidly in parallel and can also effectively learn complex abnormal patterns.
Disclosure of Invention
1. The technical problems to be solved are as follows:
In view of the above technical problems, the application provides a method for identifying and correcting abnormal data in the power load of high-voltage industrial and commercial users. The user power consumption data are mapped into a high-dimensional feature space by Kernel PCA to extract the nonlinear relations and data patterns, the principal components in the feature space are calculated, and the data are projected onto them to achieve dimension reduction. Standard K-means clustering is then used to determine users of the same type. Next, the LOF algorithm is used to identify abnormal data; it adapts to data of different types and distributions without prior assumptions. Finally, a TCN data correction model is used, in which different dilation coefficients are set in the convolution layers to expand the receptive field, and both future and past load information is used to predict and complete the abnormal data.
2. The technical scheme is as follows:
The method for identifying and correcting abnormal data in the power load of high-voltage industrial and commercial users comprises the following steps:
Step one: acquiring the electricity load sequence of a high-voltage industrial and commercial user and preprocessing it to generate a sample set; the preprocessing comprises obtaining the user name, peak load, average load and load fluctuation data;
Step two: performing dimension reduction on the sample set based on kernel PCA: calculating the kernel matrix of the sample set; centering the kernel matrix by subtracting the row mean and the column mean from each element so that the row and column means become zero, ensuring that the features are centered at the origin; performing eigenvalue decomposition on the centered kernel matrix and selecting the k most important eigenvectors as principal components according to the magnitude of the eigenvalues; and linearly transforming the sample set with the k selected eigenvectors to obtain the dimension-reduced data set;
Step three: clustering the dimension-reduced data set using the K-means algorithm, and classifying the high-voltage industrial and commercial users according to the clustering result to obtain a plurality of dimension-reduced sub-data sets;
Step four: calculating the local reachability density LRD and the local outlier factor LOF for each dimension-reduced sub-data set; sorting the data points in the sub-data set by the calculated LOF, and identifying data points whose outlier factor exceeds a preset value as abnormal data;
Step five: constructing the TCN data correction model; the model is trained on normal load data of high-voltage industrial and commercial users, with the loss function serving as the training termination condition: training stops when the loss converges to a preset threshold or the maximum number of iterations is reached, and the final model is applied to correct data;
Step six: inputting the abnormal data identified in step four into the TCN data correction model, which outputs the completion value corresponding to the abnormal data.
Further, the first step specifically includes:
S11: acquiring the electricity load sequence of a high-voltage industrial and commercial user and preprocessing the data; from the user's electricity data sampled at a preset frequency over a statistical period, calculating the peak load $L_P$, average load $L_A$ and load fluctuation $L_V$ of the load curve, as shown in formula (1):
$$L_P = \max_{i \in T} \{L_i\} \tag{1}$$
In formula (1), $L_P$ represents the peak load of the load curve over the statistical period; $L_A$ represents the average load of the load curve over the statistical period; $L_V$ represents the load fluctuation of the load curve over the statistical period; $T$ is the number of sampling time points of the sample, i.e. the number of samples taken at the preset frequency within one statistical period; $L_i$ represents the load value at sampling point $i$, from which the peak load $L_P$, average load $L_A$ and load fluctuation $L_V$ are computed.
S12: the sample set $X$ of the power load of the industrial and commercial users is expressed as:
$$X = (X_1, X_2, \ldots, X_N) \tag{2}$$
In formula (2), sample $X_i$ is the power consumption data matrix of user $i$; the data set $X$ is a three-dimensional data tensor of size $N \times T \times d$, where $N$ is the number of samples, i.e. the number of users, and $d$ is the feature dimension, here 3; each sample $X_i$ is a $T \times d$ matrix holding the feature values of the $i$-th sample; within each $X_i$, each column represents one feature dimension and each row holds the feature values at one time point; the mathematical expression is as follows:
$$X_i = \begin{pmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{T1} & \cdots & x_{Td} \end{pmatrix} \tag{3}$$
Further, the second step includes:
S21: based on a Gaussian kernel, each element of the kernel matrix $K$ of the sample set $X$ is obtained according to formula (4):
$$K(i,j) = \exp\left(-\gamma \lVert X_i - X_j \rVert^2\right) \tag{4}$$
In formula (4), $K(i,j)$ represents an element of the kernel matrix $K$, $\lVert X_i - X_j \rVert$ represents the Euclidean distance between samples $X_i$ and $X_j$ in the feature space, and $\gamma$ is the bandwidth parameter of the Gaussian kernel function;
that is, the kernel calculation formula for samples $X_i$ and $X_j$ can be expressed as:
S22: the kernel matrix $K$ is centered according to formula (6) to obtain the centered kernel matrix $\tilde{K}$; the row mean and the column mean are subtracted from each element so that the row and column means of the centered kernel matrix are zero, ensuring that the features are centered at the origin; the specific calculation is:
$$\tilde{K}(i,j) = K(i,j) - \bar{K}_{i\cdot} - \bar{K}_{\cdot j} + \bar{K} \tag{6}$$
In formula (6), $\bar{K}_{i\cdot}$ represents the mean of the $i$-th row of the kernel matrix $K$, $\bar{K}_{\cdot j}$ represents the mean of the $j$-th column of $K$, $\bar{K}$ represents the overall mean of $K$, and $\tilde{K}(i,j)$ is the element in the $i$-th row and $j$-th column of the kernel matrix after centering;
S23: eigenvalue decomposition is performed on the centered kernel matrix $\tilde{K}$ according to formula (7) to obtain the eigenvalues and corresponding eigenvectors; taking the magnitude of the eigenvalue as the measure of importance, the k most important eigenvectors are selected as principal components, where k is the target dimension after dimension reduction:
$$\tilde{K} = Q \Lambda Q^{T} \tag{7}$$
In formula (7), $Q$ is an orthogonal matrix composed of the eigenvectors $v$, and $\Lambda$ is a diagonal matrix composed of the eigenvalues $\lambda$; each eigenvalue $\lambda$ corresponds to one eigenvector $v$ satisfying the following relationship:
$$\tilde{K} v = \lambda v \tag{8}$$
that is, the eigenvector $v$ corresponding to the eigenvalue $\lambda$ is a non-zero vector in the null space of $\tilde{K} - \lambda I$;
S24: the centered kernel matrix $\tilde{K}$ is linearly transformed by the k selected eigenvectors, and the dimension-reduced data set is obtained according to formula (9):
$$Y = \tilde{K} V \tag{9}$$
In formula (9), $Y$ is the dimension-reduced data matrix of size $N \times k$, i.e. the dimension-reduced data set; $\tilde{K}$ is the centered kernel matrix, of size $N \times N$; $V$ is the matrix composed of the k selected eigenvectors, of size $N \times k$.
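The kernel PCA sub-steps can be sketched in NumPy as follows. This is a minimal illustration, not the applicant's implementation: the function name `kernel_pca`, the flattening of each T × d sample into a vector (so that the Euclidean distance is the Frobenius norm of the difference), and the exact Gaussian form `exp(-gamma * dist^2)` are assumptions. The projection multiplies the centered kernel matrix by the selected eigenvectors, as the text describes, without the additional eigenvalue scaling that some kernel PCA formulations apply.

```python
import numpy as np

def kernel_pca(X, gamma, k):
    """Reduce a sample tensor X of shape (N, T, d) to an (N, k) matrix.

    Returns the projected data Y and the k largest eigenvalues.
    """
    N = X.shape[0]
    flat = X.reshape(N, -1)  # assumption: one flattened vector per user

    # Pairwise squared Euclidean distances between users.
    sq = np.sum(flat ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * flat @ flat.T, 0.0)

    # Gaussian kernel matrix (assumed form of the kernel).
    K = np.exp(-gamma * d2)

    # Centering: subtract row and column means, add back the overall mean.
    row_mean = K.mean(axis=1, keepdims=True)
    col_mean = K.mean(axis=0, keepdims=True)
    K_c = K - row_mean - col_mean + K.mean()

    # Eigen-decomposition of the symmetric centered kernel matrix.
    eigvals, eigvecs = np.linalg.eigh(K_c)     # ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    V = eigvecs[:, order]                      # (N, k)

    # Linear transformation of the centered kernel matrix by V.
    Y = K_c @ V
    return Y, eigvals[order]
```

Because the centered kernel matrix has zero column sums, each column of `Y` sums to zero, which is a convenient sanity check when experimenting with `gamma` and `k`.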
Further, the third step specifically includes:
S31: taking the rows of the dimension-reduced data matrix $Y$ as the users' feature vectors, randomly selecting $k_1$ samples from the data set $Y$ as initial cluster centers, calculating the Euclidean distance $d(Y_i, Y_j)$ between each sample and the feature vector of each cluster center, and assigning the sample to the nearest cluster center; then calculating the mean of all samples in each cluster and taking it as the new cluster center;
$$Y_i = (y_{i1}, y_{i2}, \ldots, y_{ik})^{T}$$
$$Y_j = (y_{j1}, y_{j2}, \ldots, y_{jk})^{T}$$
$$d(Y_i, Y_j) = \sqrt{\sum_{m=1}^{k} (y_{im} - y_{jm})^2} \tag{10}$$
In formula (10), $d(Y_i, Y_j)$ is the Euclidean distance between the feature vectors of user $i$ and user $j$; $d_j$ is the minimum Euclidean distance from the remaining feature vectors to the selected cluster center $Y_i$; $C$ is the target user set; $y_{i1}$ is the first element of the feature vector of user $i$;
S32: determining the number of clusters $k_2$: the Dunn index $D$, i.e. the ratio of the minimum distance between samples of different clusters to the maximum distance between samples within a cluster, is calculated, a D-k graph is drawn, and the number of clusters $k_2$ is determined by the elbow rule:
$$D = \frac{\min_{1 \le i \le k} d_{\min}(C_i)}{\max_{1 \le i \le k} d_{\max}(C_i)}$$
where $\min_{1 \le i \le k} d_{\min}(C_i)$ represents the minimum distance between clusters and $\max_{1 \le i \le k} d_{\max}(C_i)$ represents the maximum distance within a cluster; a larger $D$ indicates a better clustering result; $D$ represents the Dunn index; $C_i$ represents the set of users in the $i$-th cluster;
S33: the users in one cluster of the clustering result are treated as users of the same electricity consumption type.
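The clustering sub-steps can be illustrated with a short NumPy sketch. The helper names `kmeans` and `dunn_index`, the fixed random seed and the iteration cap are assumptions made for the example; the Dunn index uses the common definition (smallest distance between samples of different clusters divided by the largest distance between samples of the same cluster), and it assumes at least two clusters and at least one cluster with two distinct points.

```python
import numpy as np

def kmeans(Y, k, n_iter=100, seed=0):
    """Plain K-means on the rows of Y (N, dims) with Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Randomly pick k distinct samples as the initial cluster centers.
    centers = Y[rng.choice(len(Y), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every sample to its nearest center.
        dists = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            Y[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def dunn_index(Y, labels):
    """Min inter-cluster sample distance / max intra-cluster sample distance."""
    labels = np.asarray(labels)
    ids = np.unique(labels)
    pair = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    max_diam = max(pair[np.ix_(labels == c, labels == c)].max() for c in ids)
    min_sep = min(
        pair[np.ix_(labels == a, labels == b)].min()
        for i, a in enumerate(ids) for b in ids[i + 1:]
    )
    return min_sep / max_diam
```

To choose the number of clusters, one would run `kmeans` for a range of candidate `k` values, compute `dunn_index` for each result, and inspect the resulting D-k curve.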
Further, the fourth step specifically includes:
S41: acquiring the user load data set $B$ from step three, where $B$ is a $k_i \times T$ matrix and $k_i$ is the number of users in the $i$-th cluster of the clustering result;
S42: initializing the LOF parameters, and calculating the reachability distance between the power consumption data of users in the same cluster using the Manhattan distance:
$$Y_i = (y_{i1}, y_{i2}, \ldots, y_{iT})^{T}$$
$$Y_j = (y_{j1}, y_{j2}, \ldots, y_{jT})^{T}$$
$$\mathrm{dist}(o) = |y_{i1} - y_{j1}| + |y_{i2} - y_{j2}| + \cdots + |y_{iT} - y_{jT}|$$
$$RD(p, o) = \max\left(\mathrm{dist}(o),\ k\text{-distance}(p)\right) \tag{11}$$
where $p$ is the target data point, $o$ is a nearest neighbor of $p$, $\mathrm{dist}(o)$ is the Manhattan distance between $p$ and $o$, and $k\text{-distance}(p)$ is the distance between $p$ and its $k$-th nearest neighbor;
S43: calculating the local outlier factor LOF: for each data point $p$, its local outlier factor is calculated, measuring the degree to which $p$ is an outlier relative to its neighboring data points:
$$LOF(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{LRD(o)}{LRD(p)}$$
where $LOF(p)$ represents the local outlier factor of data point $p$, $LRD(o)$ represents the local reachability density of neighbor $o$, $LRD(p)$ represents the local reachability density of data point $p$, and $N_k(p)$ denotes the set of the $k$ nearest neighbors of $p$;
S44: plotting the relationship between the value of k and the corresponding LOF value, and determining the number of neighbors $k_3$ according to the elbow rule;
S45: determining abnormal data: the data points are sorted by the calculated local outlier factor LOF, and data points whose outlier factor OF exceeds a preset threshold are taken as abnormal data;
$$OF_k(p) = 1 - LOF_{k\text{-norm}}(p)$$
where $LOF_{k\text{-norm}}(p)$ represents the normalized local outlier factor of the current data point, $\max\{LOF(p)\}$ represents the maximum of all local outlier factors, and $OF_k(p)$ represents the outlier factor of the current data point, i.e. the OF value is the difference between 1 and the normalized LOF value; the closer the OF value is to 1, the more outlying the data point; the closer it is to 0, the closer the data point is to a normal data point; the OF value ranges between 0 and 1;
S46: repeating S42 to S45 until all clustered user data sets have been processed;
S47: deleting the abnormal data from the corresponding power consumption data matrix $X_i$, marking the corresponding positions of the matrix as empty (NaN), and constructing the data set F from all NaN positions.
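The reachability distance, local reachability density and LOF computation for one cluster can be sketched as below. This is an illustrative sketch only: the function name `lof_scores` is hypothetical, ties among neighbors are ignored (exactly k neighbors are kept), and duplicate rows would produce a zero reachability distance and must be handled separately. The reachability distance follows the text's formula (11), which uses the k-distance of the target point p; the widely used LOF definition uses the k-distance of the neighbor o instead.

```python
import numpy as np

def lof_scores(B, k):
    """Local outlier factor for each row of B (users x T), Manhattan distance.

    Requires k < number of rows.
    """
    B = np.asarray(B, dtype=float)
    m = len(B)

    # Pairwise Manhattan distances; a point is not its own neighbor.
    dist = np.abs(B[:, None, :] - B[None, :, :]).sum(axis=2)
    np.fill_diagonal(dist, np.inf)

    # Indices of the k nearest neighbors and the k-distance of each point.
    nn = np.argsort(dist, axis=1)[:, :k]
    k_dist = dist[np.arange(m), nn[:, -1]]

    # Reachability distance as written in formula (11):
    # RD(p, o) = max(dist(p, o), k-distance(p)).
    reach = np.maximum(dist[np.arange(m)[:, None], nn], k_dist[:, None])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = 1.0 / reach.mean(axis=1)

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return lrd[nn].mean(axis=1) / lrd
```

Values close to 1 indicate a point whose local density matches that of its neighbors; values well above 1 indicate a candidate anomaly that would then be replaced by NaN before correction.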
Further, the fifth step specifically includes:
s51: defining an input layer, setting the first N data and the last N data of the abnormal data as inputs into a sliding window, and forming tensors with the shape of (N, 1, 2n+1); where N is the sample power data quantity; 1 is the number of channels, which is determined by one dimension of the power data; 2n+1 represents a time series length.
S52: defining a convolution layer, and constructing an expansion non-causal convolution layer of kernel size=3 and condition= [1,2,4,8 ]; at this time, the receptive field of each convolution layer is (3,7,15,31), the output feature pattern shape of the first convolution layer (condition=1) is (N, 1, 31), the output feature pattern shape of the second convolution layer (condition=2) is (N, 1, 30), the output feature pattern shape of the third convolution layer (condition=4) is (N, 1, 28), and the output feature pattern shape of the fourth convolution layer (condition=8) is (N, 1, 24); after each convolution layer, a batch normalization and activation function ReLU is added to introduce nonlinearity.
S53: adding residual error connection; adding residual connection among all convolution layers, adding input characteristics and intermediate layer characteristics, setting two paths, and introducing a 1X 1 causal convolution layer on a first path to achieve the effect of dimension reduction and characteristic transformation; introducing a 1 multiplied by 3 expansion non-causal convolution layer on a second path to utilize future and past negative information, setting a batch normalization layer, and carrying out batch normalization on the output of the convolution layer so as to accelerate the training process and improve the robustness of the model; finally, the results of the two paths are summed and output through a correction linear unit;
s54: defining an output; extracting a complement value corresponding to the abnormal data location from the output of the TCN data correction model;
s55: model training, training a TCN data correction model for the data portion of the same class of users where no NaN data null points exist, using a mean square error (Mean Squared Error, MSE) as a loss function.
Wherein y represents a sequence of true values,representing a sequence of predicted values, N representing the sequence length.
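Three pieces of the TCN setup lend themselves to a quick numerical check: the receptive field of stacked dilated convolutions, the construction of training windows from NaN-free data, and the MSE loss. The sketch below uses NumPy only and does not build the network itself. The helper names `receptive_fields`, `make_windows` and `mse` are hypothetical, and masking the window center with 0.0 so the target is hidden from the input is an assumption of this example rather than something the text specifies.

```python
import numpy as np

def receptive_fields(kernel_size, dilations):
    """Cumulative receptive field after each stacked dilated convolution."""
    rf, out = 1, []
    for d in dilations:
        rf += (kernel_size - 1) * d   # each layer widens the field by (k-1)*d
        out.append(rf)
    return out

def make_windows(series, n):
    """Build training windows holding n values before and n after each point.

    Windows containing NaN are skipped, so training only uses clean data.
    Returns inputs of shape (num_windows, 1, 2n+1) and the center targets.
    """
    series = np.asarray(series, dtype=float)
    windows, targets = [], []
    for t in range(n, len(series) - n):
        w = series[t - n:t + n + 1].copy()
        if np.isnan(w).any():
            continue
        targets.append(w[n])
        w[n] = 0.0                    # assumption: hide the value to predict
        windows.append(w)
    X = np.array(windows).reshape(-1, 1, 2 * n + 1)
    return X, np.array(targets)

def mse(y_true, y_pred):
    """Mean squared error between true and predicted sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))
```

With kernel size 3 and dilations [1, 2, 4, 8], `receptive_fields` reproduces the sequence (3, 7, 15, 31) quoted in S52, which suggests a window length 2n+1 of at least 31 if the last layer is to see the whole window.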
Further, in step six, the trained TCN data correction model traverses the data set F until all missing data in the data set have been completed.
3. The beneficial effects are that:
(1) Kernel PCA mapping is used for dimension reduction in the user clustering process because high-voltage industrial and commercial users mainly comprise large industrial enterprises, commercial complexes, office building groups and the like; complex abnormal patterns exist among users of the same type, while their power consumption patterns are strongly regular. Kernel PCA handles nonlinear relations better, preserves the local structure of the original data more accurately, and captures and highlights nonlinear features; K-means clustering then groups similar users so that the power consumption patterns and behavior characteristics of each user type can be mined. If cluster analysis were applied directly, the differences in consumption patterns between similar industrial enterprises and office building groups would be hard to handle, giving poor clustering and making the subsequent rejection of abnormal data difficult.
(2) The local outlier factor (LOF) method is used in the abnormal data identification process. LOF is a density-based anomaly detection method that adapts well to complex data distributions and changes in local density. The data of high-voltage industrial and commercial users contain different consumption patterns and behaviors; LOF can identify abnormal data points that differ markedly from their neighbors and adapts well across scenarios.
(3) The temporal convolutional network is used in the abnormal data correction process. TCN is in essence a deep learning model suited to time series data; multiple convolution layers and residual connections can be stacked, the receptive field can be expanded by setting different dilation coefficients, and abnormal data points can be completed more accurately by using both future and past consumption information in the sequence. Unlike general time series forecasting, where mainly historical data are available, abnormal data correction usually has more information around the abnormal value, on both sides of it.
In summary, the method can effectively identify and correct abnormal data points in the power load of the high-voltage industrial and commercial user through the combined application of a plurality of technical means, improves the accuracy and the reliability of data, and provides valuable references for the quality assurance of the power information of the high-voltage industrial and commercial user.
Drawings
FIG. 1 is an overall flow chart of the present application;
FIG. 2 is a diagram of a TCN data correction model employed in a particular embodiment;
FIG. 3 is a diagram of the residual block structure of the TCN data correction model in an embodiment.
Detailed Description
The present application will be described in detail with reference to the accompanying drawings.
As shown in FIG. 1, the method for identifying and correcting abnormal data in the power load of high-voltage industrial and commercial users comprises the following steps:
step one: acquiring an electricity load sequence of a high-voltage industrial and commercial user, and preprocessing the electricity load sequence to generate a sample set; the preprocessing comprises the steps of obtaining user name, peak load, average load and load fluctuation data;
The first step specifically comprises:
S11: acquiring the electricity load sequence of a high-voltage industrial and commercial user and preprocessing the data; from the user's electricity data sampled at a preset frequency over a statistical period, calculating the peak load $L_P$, average load $L_A$ and load fluctuation $L_V$ of the load curve, as shown in formula (1):
$$L_P = \max_{i \in T} \{L_i\} \tag{1}$$
In formula (1), $L_P$ represents the peak load of the load curve over the statistical period; $L_A$ represents the average load of the load curve over the statistical period; $L_V$ represents the load fluctuation of the load curve over the statistical period; $T$ is the number of sampling time points of the sample, i.e. the number of samples taken at the preset frequency within one statistical period; $L_i$ represents the load value at sampling point $i$, from which the peak load $L_P$, average load $L_A$ and load fluctuation $L_V$ are computed.
S12: the sample set $X$ of the power load of the industrial and commercial users is expressed as:
$$X = (X_1, X_2, \ldots, X_N) \tag{2}$$
In formula (2), sample $X_i$ is the power consumption data matrix of user $i$; the data set $X$ is a three-dimensional data tensor of size $N \times T \times d$, where $N$ is the number of samples, i.e. the number of users, and $d$ is the feature dimension, here 3; each sample $X_i$ is a $T \times d$ matrix holding the feature values of the $i$-th sample; within each $X_i$, each column represents one feature dimension and each row holds the feature values at one time point; the mathematical expression is as follows:
$$X_i = \begin{pmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{T1} & \cdots & x_{Td} \end{pmatrix} \tag{3}$$
This step constructs the data set: electricity consumption data of high-voltage industrial and commercial users are collected, then cleaned and normalized to ensure accuracy and consistency. Statistical features such as the peak load, average load and load fluctuation of the load curve are calculated on the normalized load data; to simplify calculation, each feature can be expanded into a column vector with the same dimension as the original load data and combined with the original power consumption data to generate the user's power consumption data matrix. The original user electricity data are standardized so that the column vectors of the data matrix have zero mean and unit variance, eliminating dimensional differences among the features.
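A minimal NumPy sketch of this data-set construction is given below. Several choices are assumptions made for illustration: the function names `build_sample_set` and `standardize` are hypothetical, load fluctuation is taken to be the standard deviation of the load curve (the text does not give its formula), only the three statistical features are stacked so that d = 3, and standardization is applied per feature across all users and time points.

```python
import numpy as np

def build_sample_set(loads):
    """Turn a load matrix (N users x T samples) into an (N, T, 3) tensor.

    Each feature is expanded to a length-T column, as the text describes.
    """
    loads = np.asarray(loads, dtype=float)
    N, T = loads.shape
    peak = loads.max(axis=1)     # peak load per user
    avg = loads.mean(axis=1)     # average load per user
    fluct = loads.std(axis=1)    # ASSUMPTION: fluctuation = standard deviation
    feats = np.stack([peak, avg, fluct], axis=1)          # (N, 3)
    # Repeat every feature along the time axis to get T x d matrices.
    return np.repeat(feats[:, None, :], T, axis=1)        # (N, T, 3)

def standardize(X):
    """Scale each feature to zero mean and unit variance over users and time."""
    mean = X.mean(axis=(0, 1), keepdims=True)
    std = X.std(axis=(0, 1), keepdims=True)
    std = np.where(std == 0.0, 1.0, std)   # leave constant features unscaled
    return (X - mean) / std
```

Standardizing across users rather than within one user's matrix matters here: a feature column that is constant within a single user has zero variance and could not be rescaled on its own.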
Step two: performing dimension reduction on the sample set based on kernel PCA: calculating the kernel matrix of the sample set; centering the kernel matrix by subtracting the row mean and the column mean from each element so that the row and column means become zero, ensuring that the features are centered at the origin; performing eigenvalue decomposition on the centered kernel matrix and selecting the k most important eigenvectors as principal components according to the magnitude of the eigenvalues; and linearly transforming the sample set with the k selected eigenvectors to obtain the dimension-reduced data set; the step specifically comprises:
S21: based on a Gaussian kernel, each element of the kernel matrix $K$ of the sample set $X$ is obtained according to formula (4):
$$K(i,j) = \exp\left(-\gamma \lVert X_i - X_j \rVert^2\right) \tag{4}$$
In formula (4), $K(i,j)$ represents an element of the kernel matrix $K$, $\lVert X_i - X_j \rVert$ represents the Euclidean distance between samples $X_i$ and $X_j$ in the feature space, and $\gamma$ is the bandwidth parameter of the Gaussian kernel function;
that is, the kernel calculation formula for samples $X_i$ and $X_j$ can be expressed as:
S22: the kernel matrix $K$ is centered according to formula (6) to obtain the centered kernel matrix $\tilde{K}$; the row mean and the column mean are subtracted from each element so that the row and column means of the centered kernel matrix are zero, ensuring that the features are centered at the origin; the specific calculation is:
$$\tilde{K}(i,j) = K(i,j) - \bar{K}_{i\cdot} - \bar{K}_{\cdot j} + \bar{K} \tag{6}$$
In formula (6), $\bar{K}_{i\cdot}$ represents the mean of the $i$-th row of the kernel matrix $K$, $\bar{K}_{\cdot j}$ represents the mean of the $j$-th column of $K$, $\bar{K}$ represents the overall mean of $K$, and $\tilde{K}(i,j)$ is the element in the $i$-th row and $j$-th column of the kernel matrix after centering;
S23: eigenvalue decomposition is performed on the centered kernel matrix $\tilde{K}$ according to formula (7) to obtain the eigenvalues and corresponding eigenvectors; taking the magnitude of the eigenvalue as the measure of importance, the k most important eigenvectors are selected as principal components, where k is the target dimension after dimension reduction:
$$\tilde{K} = Q \Lambda Q^{T} \tag{7}$$
In formula (7), $Q$ is an orthogonal matrix composed of the eigenvectors $v$, and $\Lambda$ is a diagonal matrix composed of the eigenvalues $\lambda$; each eigenvalue $\lambda$ corresponds to one eigenvector $v$ satisfying the following relationship:
$$\tilde{K} v = \lambda v \tag{8}$$
that is, the eigenvector $v$ corresponding to the eigenvalue $\lambda$ is a non-zero vector in the null space of $\tilde{K} - \lambda I$;
S24: the centered kernel matrix $\tilde{K}$ is linearly transformed by the k selected eigenvectors, and the dimension-reduced data set is obtained according to formula (9):
$$Y = \tilde{K} V \tag{9}$$
In formula (9), $Y$ is the dimension-reduced data matrix of size $N \times k$, i.e. the dimension-reduced data set; $\tilde{K}$ is the centered kernel matrix, of size $N \times N$; $V$ is the matrix composed of the k selected eigenvectors, of size $N \times k$.
The projection transformation in step two maps the original data set from the high-dimensional feature space to a low-dimensional space, thereby achieving dimension reduction.
Step three: clustering the dimensionality reduced data set by using a K-means algorithm, and classifying the high-voltage industrial and commercial users according to the clustering result to obtain a plurality of dimensionality reduced sub-data sets; the method specifically comprises the following steps:
S31: taking the rows of the dimension-reduced data matrix $Y$ as the users' feature vectors, randomly selecting $k_1$ samples from the data set $Y$ as initial cluster centers, calculating the Euclidean distance $d(Y_i, Y_j)$ between each sample and the feature vector of each cluster center, and assigning the sample to the nearest cluster center; then calculating the mean of all samples in each cluster and taking it as the new cluster center;
$$Y_i = (y_{i1}, y_{i2}, \ldots, y_{ik})^{T}$$
$$Y_j = (y_{j1}, y_{j2}, \ldots, y_{jk})^{T}$$
$$d(Y_i, Y_j) = \sqrt{\sum_{m=1}^{k} (y_{im} - y_{jm})^2} \tag{10}$$
In formula (10), $d(Y_i, Y_j)$ is the Euclidean distance between the feature vectors of user $i$ and user $j$; $d_j$ is the minimum Euclidean distance from the remaining feature vectors to the selected cluster center $Y_i$; $C$ is the target user set; $y_{i1}$ is the first element of the feature vector of user $i$;
S32: determining the number of clusters $k_2$: the Dunn index $D$, i.e. the ratio of the minimum distance between samples of different clusters to the maximum distance between samples within a cluster, is calculated, a D-k graph is drawn, and the number of clusters $k_2$ is determined by the elbow rule:
$$D = \frac{\min_{1 \le i \le k} d_{\min}(C_i)}{\max_{1 \le i \le k} d_{\max}(C_i)}$$
where $\min_{1 \le i \le k} d_{\min}(C_i)$ represents the minimum distance between clusters and $\max_{1 \le i \le k} d_{\max}(C_i)$ represents the maximum distance within a cluster; a larger $D$ indicates a better clustering result; $D$ represents the Dunn index; $C_i$ represents the set of users in the $i$-th cluster;
s33: the users are categorized based on the clustering results.
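Steps S31 to S33 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the method of the application: the Dunn index is computed in its commonly used form (smallest distance between points of different clusters divided by the largest distance between points of the same cluster), the random initialisation is seeded for reproducibility, and the function names and toy data are hypothetical.

```python
import math
import random

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)       # k randomly chosen initial centers
    labels = None
    for _ in range(iters):
        # Assign every point to its nearest center
        new_labels = [min(range(k), key=lambda c: euclid(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break                         # assignments stable: converged
        labels = new_labels
        # Move every center to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                   # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def dunn_index(points, labels):
    # Common form of the Dunn index; needs at least two non-empty clusters
    pairs = [(euclid(p, q), lp == lq)
             for i, (p, lp) in enumerate(zip(points, labels))
             for q, lq in list(zip(points, labels))[i + 1:]]
    inter = min(d for d, same in pairs if not same)
    intra = max((d for d, same in pairs if same), default=0.0)
    return float("inf") if intra == 0 else inter / intra

points = [[0, 0], [0, 1], [1, 0], [1, 1],
          [10, 10], [10, 11], [11, 10], [11, 11]]
labels, centers = kmeans(points, k=2)
dunn = dunn_index(points, labels)
```

Repeating the last two lines for several values of `k` and plotting the resulting index against `k` gives the D-k graph from which the number of clusters is read off.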
Step four: calculating local reachable density LRD and local outlier factor LOF for each dimension-reduced sub-data set; sorting the data points in the sub-data set according to the calculated local outlier factor LOF, and recognizing the data points with the outlier factor LOF larger than a preset value as abnormal data;
Step four specifically comprises:
S41: acquire the user load data set $B$ from step three, where $B$ is a $k_i \times T$ matrix and $k_i$ is the number of users in the $i$-th cluster of the clustering result;
S42: initialize the LOF parameters and calculate the reachable distance of the power consumption data of users in the same cluster using the Manhattan distance metric;
$$Y_i = (y_{i1}, y_{i2}, \ldots, y_{iT})^T$$
$$Y_j = (y_{j1}, y_{j2}, \ldots, y_{jT})^T$$
$$\mathrm{dist}(o) = |y_{i1} - y_{j1}| + |y_{i2} - y_{j2}| + \cdots + |y_{iT} - y_{jT}|$$
$$RD(p, o) = \max(\mathrm{dist}(o),\ k\text{-distance}(p)) \quad (11)$$
where $p$ is the target data point, $o$ is its nearest neighbor, $\mathrm{dist}(o)$ is the Manhattan distance between $p$ and $o$, and $k\text{-distance}(p)$ is the distance between $p$ and its $k$-th nearest neighbor;
S43: calculate the local outlier factor LOF; for each data point $p$, calculate its local outlier factor, which measures the degree to which $p$ is an outlier relative to its neighboring data points;
where $LOF(p)$ is the local outlier factor of data point $p$, $LRD(o)$ is the local reachable density of neighbor $o$, and $LRD(p)$ is the local reachable density of data point $p$;
S44: plot the relationship between the value of the number of clusters $k_2$ and the corresponding LOF value, and determine the number of neighbors $k_3$ according to the elbow rule;
S45: determine the abnormal data; sort the data points by the calculated local outlier factor LOF, and take the data points whose outlier factor OF is larger than a preset threshold as abnormal data;
$$OF_k(p) = 1 - LOF_{k\text{-norm}}(p)$$
where $LOF_{k\text{-norm}}(p)$ is the normalized local outlier factor of the current data point, $\max\{LOF(p)\}$ is the maximum of all local outlier factors, and $OF_k(p)$ is the outlier factor of the current data point, i.e. the OF value is the difference between 1 and the normalized LOF value; the closer the OF value is to 1, the more outlying the data point; the closer it is to 0, the closer the data point is to normal data; the OF value ranges between 0 and 1;
S46: repeat S42 to S45 until all classified user data sets have been processed;
S47: delete the abnormal data from the corresponding electricity consumption data matrix $X_i$, mark the corresponding positions of the matrix as empty (NaN), and construct the data set $F$ from all NaN positions.
This step identifies abnormal data by processing the dimension-reduced sub-data sets. Calculating the reachable distance of the power consumption data of users in the same cluster with the Manhattan distance metric reduces the amount of calculation, and LOF is then used to identify the abnormal data points that differ significantly from their neighbors.
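The scoring in step four can be illustrated with a short Python sketch. It rests on assumptions that should be read carefully: because the LRD and LOF formulas are not reproduced in the text, the sketch uses the widely cited LOF definition, in which the reachability distance of p with respect to a neighbor o is max(k-distance(o), d(p, o)); this differs from formula (11) as printed, which uses k-distance(p). The function names, toy data and threshold value are hypothetical.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def lof_scores(rows, k):
    """LOF score per row using Manhattan distance; requires k < len(rows)."""
    n = len(rows)
    dist = [[manhattan(rows[i], rows[j]) for j in range(n)] for i in range(n)]

    # k nearest neighbours (excluding the point itself) and k-distance
    neighbours, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance
    lrd = []
    for i in range(n):
        reach = [max(k_distance[o], dist[i][o]) for o in neighbours[i]]
        lrd.append(len(reach) / max(sum(reach), 1e-12))  # guard against duplicates

    # LOF: mean ratio of the neighbours' density to the point's own density
    return [sum(lrd[o] for o in neighbours[i]) / (len(neighbours[i]) * lrd[i])
            for i in range(n)]

# Five similar load vectors and one that is far away from them
rows = [[1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [0.9, 1.0], [1.0, 0.9], [5.0, 5.0]]
scores = lof_scores(rows, k=2)
threshold = 2.0  # hypothetical preset value
flagged = [i for i, s in enumerate(scores) if s > threshold]
```

Scores close to 1 indicate points whose local density matches that of their neighbors; the isolated last row receives a much larger score and is the only one flagged here.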
Step five: constructing a TCN data correction model; the TCN data correction model inputs normal high-voltage business user load data for training, and a loss function is used as a training termination condition; when the loss function converges to a preset threshold or reaches the maximum iteration number, stopping training, and applying a final model to correct data; the method specifically comprises the following steps:
S51: define the input layer: take the $n$ data points before and the $n$ data points after the abnormal data point as input, and set a sliding window to form a tensor of shape (N, 1, 2n+1), where N is the number of power data samples; 1 is the number of channels, since the power data is one-dimensional; and 2n+1 is the time series length.
S52: define the convolution layers: construct dilated non-causal convolution layers with kernel size = 3 and dilation = [1, 2, 4, 8]; the receptive fields of the convolution layers are then (3, 7, 15, 31); the output feature map shape of the first convolution layer (dilation = 1) is (N, 1, 31), that of the second convolution layer (dilation = 2) is (N, 1, 30), that of the third convolution layer (dilation = 4) is (N, 1, 28), and that of the fourth convolution layer (dilation = 8) is (N, 1, 24); after each convolution layer, batch normalization and a ReLU activation function are added to introduce nonlinearity.
S53: add residual connections: residual connections are added between the convolution layers, adding the input features to the intermediate-layer features; two paths are set up: a 1×1 causal convolution layer is introduced on the first path for dimension reduction and feature transformation; a 1×3 dilated non-causal convolution layer is introduced on the second path to make use of future and past load information, followed by a batch normalization layer that normalizes the output of the convolution layer in batches to accelerate training and improve the robustness of the model; finally, the results of the two paths are summed and output through a rectified linear unit (ReLU);
s54: defining an output; extracting a complement value corresponding to the abnormal data location from the output of the TCN data correction model;
S55: model training: train the TCN data correction model on the portion of the data of users in the same class that contains no NaN gaps, using the mean square error (Mean Squared Error, MSE) as the loss function:
$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
where $y$ is the sequence of true values, $\hat{y}$ is the sequence of predicted values, and $N$ is the sequence length.
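The loss function and the termination rule of step five (stop once the loss reaches a preset threshold or the maximum number of iterations is reached) can be sketched as follows. The one-parameter toy model fitted by gradient descent merely stands in for the TCN, and the names `train`, `step`, `loss_threshold` and `max_iters` as well as their values are assumptions for illustration.

```python
def mse(y_true, y_pred):
    # Mean squared error between two sequences of equal length
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def train(step, loss_threshold=1e-4, max_iters=1000):
    """Run step() until the loss falls to the threshold or max_iters is hit."""
    iters, loss = 0, float("inf")
    while iters < max_iters and loss > loss_threshold:
        loss = step()          # one parameter update, returns the current loss
        iters += 1
    return iters, loss

# Toy stand-in for the model: fit y = w * x by gradient descent on the MSE
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x for x in xs]
state = {"w": 0.0}

def step(lr=0.05):
    preds = [state["w"] * x for x in xs]
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    state["w"] -= lr * grad
    return mse(ys, [state["w"] * x for x in xs])

iters, final_loss = train(step)
```

Here the loop ends through the loss threshold long before the iteration cap; with a harder problem or a stricter threshold the cap would be the condition that stops training.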
Step six: input the abnormal data identified in step four into the TCN data correction model, which outputs the complement value corresponding to the abnormal data. In step six, the trained TCN data correction model traverses the data set F until all data in the data set have been completed.
Specific examples:
Figs. 2 and 3 are schematic diagrams of the structure of the TCN data correction model in this embodiment. First, the data is divided: the 15 data points before and the 15 data points after the target value are taken as input, and a sliding window is set to form a tensor of shape (N, 1, 31), where N is the number of power data samples; 1 is the number of channels, since the power data is one-dimensional; and 31 is the time series length;
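The window construction described above can be sketched in Python as follows. The helper names are hypothetical, and zero-filling the centre position is an assumption, since the text does not say how the target position itself is represented in the input.

```python
def window_around(series, idx, n=15, fill=0.0):
    """Return the 2n+1 values centred on series[idx], with the centre masked."""
    if idx - n < 0 or idx + n >= len(series):
        raise ValueError("not enough context on one side of the target point")
    window = list(series[idx - n: idx + n + 1])
    window[n] = fill   # assumption: the value to be corrected is zero-filled
    return window

def build_batch(series, indices, n=15):
    # Nested lists of shape (N, 1, 2n+1): N windows, one channel, 2n+1 steps
    return [[window_around(series, i, n)] for i in indices]

series = [float(t) for t in range(100)]
batch = build_batch(series, [20, 50], n=15)
```

With `n=15` each window has 31 time steps, matching the (N, 1, 31) shape given in this embodiment.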
The convolution layers are then defined: dilated non-causal convolution layers with kernel size = 3 and dilation = [1, 2, 4, 8] are constructed. The receptive fields of the convolution layers are then (3, 7, 15, 31); the output feature map shape of the first convolution layer (dilation = 1) is (N, 1, 31), that of the second convolution layer (dilation = 2) is (N, 1, 30), that of the third convolution layer (dilation = 4) is (N, 1, 28), and that of the fourth convolution layer (dilation = 8) is (N, 1, 24). After each convolution layer, batch normalization and a ReLU activation function are added to introduce nonlinearity.
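The receptive fields quoted above follow from the usual rule that each stacked layer widens the receptive field by (kernel size - 1) × dilation. A minimal check in Python (the function name is illustrative):

```python
def receptive_fields(kernel_size, dilations):
    """Cumulative receptive field after each stacked dilated convolution."""
    field, fields = 1, []
    for d in dilations:
        field += (kernel_size - 1) * d   # each layer adds (k - 1) * dilation
        fields.append(field)
    return fields

fields = receptive_fields(3, [1, 2, 4, 8])   # [3, 7, 15, 31]
```

The final value of 31 equals the window length of this embodiment, so the last layer can see the whole input window.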
To address the vanishing gradients and model degradation caused by a deeper network structure, residual connections are added between the convolution layers, adding the input features to the intermediate-layer features. To avoid a mismatch between the channel numbers of the input features and the intermediate-layer features, two paths are set up. A 1×1 causal convolution layer is introduced on the first path for dimension reduction and feature transformation. A 1×3 dilated non-causal convolution layer is introduced on the second path to make use of future and past load information, followed by a batch normalization layer that normalizes the output of the convolution layer in batches to accelerate training and improve the robustness of the model. Finally, the results of the two paths are added and output through a rectified linear unit (ReLU). The model is trained using the mean square error (Mean Squared Error, MSE) as the loss function.
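The two-path block described above can be sketched for a single channel in plain Python. This is a simplified illustration, not the model of the application: batch normalization is omitted, the weights are set by hand rather than learned, zero padding keeps the output as long as the input, and with one channel the 1×1 convolution reduces to a scalar multiplication. All names are hypothetical.

```python
def dilated_conv1d(x, weights, dilation):
    """Non-causal dilated 1-D convolution, single channel, odd kernel size."""
    half = len(weights) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            s = t + (j - half) * dilation   # taps reach backwards and forwards
            if 0 <= s < len(x):             # zero padding outside the sequence
                acc += w * x[s]
        out.append(acc)
    return out

def relu(x):
    return [max(0.0, v) for v in x]

def residual_block(x, conv_weights, shortcut_weight, dilation):
    path1 = [shortcut_weight * v for v in x]           # 1x1 convolution path
    path2 = dilated_conv1d(x, conv_weights, dilation)  # 1x3 dilated path
    return relu([a + b for a, b in zip(path1, path2)]) # sum, then ReLU

out = residual_block([1, 2, 3, 4, 5], conv_weights=[1, 0, 1],
                     shortcut_weight=1, dilation=2)
# out == [4.0, 6.0, 9.0, 6.0, 8.0]
```

Stacking four such blocks with dilations 1, 2, 4 and 8 gives the layer arrangement described in this embodiment.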
The complement value corresponding to the position of the abnormal data is then extracted from the output of the trained TCN data correction model and replaces the abnormal data in the original input sequence, which completes the correction of the abnormal data.
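The replacement step can be sketched as a loop over the positions marked NaN. In this illustration `predict` is a placeholder for the trained TCN data correction model; the mean of the surrounding valid values is used only so that the example runs, and both names are assumptions.

```python
import math

def fill_missing(series, predict, n=15):
    """Replace every NaN in series with predict(context of up to n points per side)."""
    completed = list(series)
    for i, value in enumerate(series):
        if isinstance(value, float) and math.isnan(value):
            lo, hi = max(0, i - n), min(len(series), i + n + 1)
            context = [u for u in series[lo:hi] if not math.isnan(u)]
            completed[i] = predict(context)   # complement value from the model
    return completed

def mean_of_context(context):
    # Placeholder predictor; a trained model would be called here instead
    return sum(context) / len(context)

load = [1.0, 2.0, float("nan"), 4.0, 5.0]
corrected = fill_missing(load, mean_of_context, n=2)
# corrected == [1.0, 2.0, 3.0, 4.0, 5.0]
```

Only the NaN positions are overwritten, so the normal readings in the original sequence are left unchanged.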
While the application has been described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the application, and it is intended that the scope of the application shall be defined by the appended claims.

Claims (7)

1. A method for identifying and correcting abnormal data of the power load of high-voltage industrial and commercial users, characterized by comprising the following steps:
step one: acquiring an electricity load sequence of a high-voltage industrial and commercial user, and preprocessing the electricity load sequence to generate a sample set; the preprocessing comprises the steps of obtaining user name, peak load, average load and load fluctuation data;
step two: perform dimension reduction and eigenvalue decomposition on the sample set based on kernel PCA: calculate the kernel matrix of the sample set; center the kernel matrix by subtracting the column mean and row mean from each element so that the means become zero, ensuring that the center of the features is located at the origin; perform eigenvalue decomposition on the centered kernel matrix and select the k most important eigenvectors as principal components according to the magnitude of the eigenvalues; then apply a linear transformation to the sample set using the k selected eigenvectors to obtain the dimension-reduced data set;
step three: clustering the dimensionality reduced data set by using a K-means algorithm, and classifying the high-voltage industrial and commercial users according to the clustering result to obtain a plurality of dimensionality reduced sub-data sets;
step four: calculating local reachable density LRD and local outlier factor LOF for each dimension-reduced sub-data set; sorting the data points in the sub-data set according to the calculated local outlier factor LOF, and recognizing the data points with the outlier factor LOF larger than a preset value as abnormal data;
step five: constructing a TCN data correction model; the TCN data correction model inputs normal high-voltage business user load data for training, and a loss function is used as a training termination condition; when the loss function converges to a preset threshold or reaches the maximum iteration number, stopping training, and applying a final model to correct data;
step six: input the abnormal data identified in step four into the TCN data correction model, which outputs the complement value corresponding to the abnormal data.
2. The method for identifying and correcting abnormal data of the power load of high-voltage industrial and commercial users according to claim 1, characterized in that step one specifically comprises:
S11: acquire the electricity load sequence of the high-voltage industrial and commercial users and preprocess the data; from each user's electricity data over the statistical period at a preset frequency, calculate the peak load $L_P$, average load $L_A$ and load fluctuation $L_V$ of the corresponding load curve as shown in formula (1):
$$L_P = \max_{i \in T} \{L_i\}$$
In formula (1), $L_P$ is the peak load of the load curve over the statistical period; $L_A$ is the average load of the load curve over the statistical period; $L_V$ is the load fluctuation of the load curve over the statistical period; $T$ is the number of sampling time points of the sample, i.e. the number of samples taken at the preset sampling frequency within one statistical period; $L_i$ is the corresponding load value from which the peak load $L_P$, average load $L_A$ and load fluctuation $L_V$ are obtained;
S12: the sample set $X$ of the power load of the industrial and commercial users is expressed as:
$$X = (X_1, X_2, \ldots, X_n) \quad (2)$$
In formula (2), sample $X_i$ is the power consumption data matrix of user $i$; the data set $X$ is an $N \times T \times d$ three-dimensional data tensor, where $N$ is the number of samples, i.e. the number of users, and $d$ is the feature dimension, here 3; each sample $X_i$ is a $T \times d$ matrix holding the feature values of the $i$-th sample; within each sample $X_i$, each column represents one feature dimension and each row holds the feature values at one time point; the mathematical expression is as follows:
3. The method for identifying and correcting abnormal data of the power load of high-voltage industrial and commercial users according to claim 2, characterized in that step two specifically comprises:
S21: based on a Gaussian kernel, obtain each element of the kernel matrix $K$ of the sample set $X$ according to formula (4);
In formula (4), $K(i, j)$ is an element of the kernel matrix $K$, $\|X_i - X_j\|$ is the Euclidean distance between samples $X_i$ and $X_j$ in the feature space, and $\gamma$ is the bandwidth parameter of the Gaussian kernel function;
that is, the kernel calculation for samples $X_i$ and $X_j$ can be expressed as:
S22: center the kernel matrix $K$ by formula (6) to obtain the centered kernel matrix $\tilde{K}$; the column mean and row mean are subtracted from each element so that the means of the centered kernel matrix are zero, ensuring that the center of the features is located at the origin; the specific calculation is as follows:
$$\tilde{K}_{ij} = K_{ij} - \bar{K}_{i\cdot} - \bar{K}_{\cdot j} + \bar{K} \quad (6)$$
In formula (6), $\bar{K}_{i\cdot}$ is the mean of the $i$-th row of the kernel matrix $K$, $\bar{K}_{\cdot j}$ is the mean of the $j$-th column of the kernel matrix $K$, $\bar{K}$ is the overall mean of the kernel matrix $K$, and $\tilde{K}_{ij}$ is the element in row $i$ and column $j$ of the kernel matrix obtained after the centering operation;
S23: perform eigenvalue decomposition of the centered kernel matrix $\tilde{K}$ using formula (7) to obtain the eigenvalues and the corresponding eigenvectors; taking the magnitude of the eigenvalues as the measure of importance, select the $k$ most important eigenvectors as principal components, where $k$ is the target dimension after dimension reduction;
$$\tilde{K} = Q \Lambda Q^T \quad (7)$$
In formula (7), $Q$ is the orthogonal matrix composed of the eigenvectors $v$, and $\Lambda$ is the diagonal matrix composed of the eigenvalues $\lambda$; each eigenvalue $\lambda$ corresponds to one eigenvector $v$, satisfying the following relationship:
$$\tilde{K} v = \lambda v \quad (8)$$
that is, the eigenvector $v$ is a non-zero vector in the null space of $\tilde{K} - \lambda I$, corresponding to the eigenvalue $\lambda$;
S23: apply a linear transformation to the centered kernel matrix $\tilde{K}$ using the $k$ selected eigenvectors, obtaining the dimension-reduced data set according to formula (9);
$$Y = \tilde{K} V \quad (9)$$
In formula (9), $Y$ is the $N \times k$ dimension-reduced data matrix, i.e. the dimension-reduced data set; $\tilde{K}$ is the centered kernel matrix, of size $N \times N$; and $V$ is the $N \times k$ matrix formed by the $k$ selected eigenvectors.
4. The method for identifying and correcting abnormal data of the power load of high-voltage industrial and commercial users according to claim 3, characterized in that step three specifically comprises:
S31: taking the columns of the dimension-reduced data matrix $Y$ as feature vectors, randomly select $k_1$ samples from the data set $Y$ as initial cluster centers; for each sample, calculate the Euclidean distance $d(Y_i, Y_j)$ between its feature vector and that of each cluster center, and assign the sample to the nearest cluster center; then calculate the mean of all samples in each cluster and take it as the new cluster center;
$$Y_i = (y_{i1}, y_{i2}, \ldots, y_{ik})^T$$
$$Y_j = (y_{j1}, y_{j2}, \ldots, y_{jk})^T$$
In formula (10), $d(Y_i, Y_j) = \sqrt{\sum_{l=1}^{k} (y_{il} - y_{jl})^2}$ is the Euclidean distance between the feature vectors of user $i$ and user $j$; $d_j$ is, once the cluster center $Y_i$ has been selected, the minimum Euclidean distance from the remaining feature vectors to that cluster center; $C$ is the target user set; $y_{i1}$ is the first element of the feature vector of user $i$;
S32: determine the number of clusters $k_2$: calculate the Dunn index $D$, i.e. the ratio of the minimum distance between samples in nearest-neighbor clusters to the maximum distance between different clusters, draw a D-k graph, and determine the number of clusters $k_2$ by the elbow rule;
$$D = \frac{\min_{1 \le i \le k} d_{\min}(C_i)}{\max_{1 \le i \le k} d_{\max}(C_i)}$$
where $\min_{1 \le i \le k} d_{\min}(C_i)$ represents the minimum distance between clusters and $\max_{1 \le i \le k} d_{\max}(C_i)$ represents the maximum distance between clusters; a larger $D$ indicates a better clustering result; $D$ is the Dunn index; $C_i$ is the set of users in class $i$ of the current clustering;
S33: treat the users in each cluster of the clustering result as users of the same electricity consumption type.
5. The method for identifying and correcting abnormal data of the power load of high-voltage industrial and commercial users according to claim 4, characterized in that step four specifically comprises:
S41: acquire the user load data set $B$ from step three, where $B$ is a $k_i \times T$ matrix and $k_i$ is the number of users in the $i$-th cluster of the clustering result;
S42: initialize the LOF parameters and calculate the reachable distance of the power consumption data of users in the same cluster using the Manhattan distance metric;
$$Y_i = (y_{i1}, y_{i2}, \ldots, y_{iT})^T$$
$$Y_j = (y_{j1}, y_{j2}, \ldots, y_{jT})^T$$
$$\mathrm{dist}(o) = |y_{i1} - y_{j1}| + |y_{i2} - y_{j2}| + \cdots + |y_{iT} - y_{jT}|$$
$$RD(p, o) = \max(\mathrm{dist}(o),\ k\text{-distance}(p)) \quad (11)$$
where $p$ is the target data point, $o$ is its nearest neighbor, $\mathrm{dist}(o)$ is the Manhattan distance between $p$ and $o$, and $k\text{-distance}(p)$ is the distance between $p$ and its $k$-th nearest neighbor;
S43: calculate the local outlier factor LOF; for each data point $p$, calculate its local outlier factor, which measures the degree to which $p$ is an outlier relative to its neighboring data points;
where $LOF(p)$ is the local outlier factor of data point $p$, $LRD(o)$ is the local reachable density of neighbor $o$, and $LRD(p)$ is the local reachable density of data point $p$;
S44: plot the relationship between the value of the number of clusters $k_2$ and the corresponding LOF value, and determine the number of neighbors $k_3$ according to the elbow rule;
S45: determine the abnormal data; sort the data points by the calculated local outlier factor LOF, and take the data points whose outlier factor OF is larger than a preset threshold as abnormal data;
$$OF_k(p) = 1 - LOF_{k\text{-norm}}(p)$$
where $LOF_{k\text{-norm}}(p)$ is the normalized local outlier factor of the current data point, $\max\{LOF(p)\}$ is the maximum of all local outlier factors, and $OF_k(p)$ is the outlier factor of the current data point, i.e. the OF value is the difference between 1 and the normalized LOF value; the closer the OF value is to 1, the more outlying the data point; the closer it is to 0, the closer the data point is to normal data; the OF value ranges between 0 and 1;
S46: repeat S42 to S45 until all classified user data sets have been processed;
S47: delete the abnormal data from the corresponding electricity consumption data matrix $X_i$, mark the corresponding positions of the matrix as empty (NaN), and construct the data set $F$ from all NaN positions.
6. The method for identifying and correcting abnormal data of the power load of high-voltage industrial and commercial users according to claim 5, characterized in that step five specifically comprises:
S51: define the input layer: take the $n$ data points before and the $n$ data points after the abnormal data point as input, and set a sliding window to form a tensor of shape (N, 1, 2n+1), where N is the number of power data samples; 1 is the number of channels, since the power data is one-dimensional; and 2n+1 is the time series length;
S52: define the convolution layers: construct dilated non-causal convolution layers with kernel size = 3 and dilation = [1, 2, 4, 8]; the receptive fields of the convolution layers are then (3, 7, 15, 31); the output feature map shape of the first convolution layer (dilation = 1) is (N, 1, 31), that of the second convolution layer (dilation = 2) is (N, 1, 30), that of the third convolution layer (dilation = 4) is (N, 1, 28), and that of the fourth convolution layer (dilation = 8) is (N, 1, 24); after each convolution layer, batch normalization and a ReLU activation function are added to introduce nonlinearity;
S53: add residual connections: residual connections are added between the convolution layers, adding the input features to the intermediate-layer features; two paths are set up: a 1×1 causal convolution layer is introduced on the first path for dimension reduction and feature transformation; a 1×3 dilated non-causal convolution layer is introduced on the second path to make use of future and past load information, followed by a batch normalization layer that normalizes the output of the convolution layer in batches to accelerate training and improve the robustness of the model; finally, the results of the two paths are summed and output through a rectified linear unit (ReLU);
s54: defining an output; extracting a complement value corresponding to the abnormal data location from the output of the TCN data correction model;
S55: model training: train the TCN data correction model on the portion of the data of users in the same class that contains no NaN gaps, using the mean square error (Mean Squared Error, MSE) as the loss function:
$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
where $y$ is the sequence of true values, $\hat{y}$ is the sequence of predicted values, and $N$ is the sequence length.
7. The method for identifying and correcting abnormal data of the power load of high-voltage industrial and commercial users according to claim 6, characterized in that, in step six, the trained TCN data correction model traverses the data set F until all data in the data set have been completed.
CN202310910100.7A 2023-07-24 2023-07-24 Abnormal data identification and correction method for high-voltage industrial and commercial user power load Withdrawn CN116821832A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310910100.7A CN116821832A (en) 2023-07-24 2023-07-24 Abnormal data identification and correction method for high-voltage industrial and commercial user power load

Publications (1)

Publication Number Publication Date
CN116821832A true CN116821832A (en) 2023-09-29

Family

ID=88139253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310910100.7A Withdrawn CN116821832A (en) 2023-07-24 2023-07-24 Abnormal data identification and correction method for high-voltage industrial and commercial user power load

Country Status (1)

Country Link
CN (1) CN116821832A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117034177A (en) * 2023-10-08 2023-11-10 湖北华中电力科技开发有限责任公司 Intelligent monitoring method for abnormal data of power load
CN117034177B (en) * 2023-10-08 2023-12-19 湖北华中电力科技开发有限责任公司 Intelligent monitoring method for abnormal data of power load
CN117093879A (en) * 2023-10-19 2023-11-21 无锡尚航数据有限公司 Intelligent operation management method and system for data center
CN117093879B (en) * 2023-10-19 2024-01-30 无锡尚航数据有限公司 Intelligent operation management method and system for data center
CN117150233A (en) * 2023-10-30 2023-12-01 广东电网有限责任公司湛江供电局 Power grid abnormal data management method, system, equipment and medium
CN117150233B (en) * 2023-10-30 2024-02-13 广东电网有限责任公司湛江供电局 Power grid abnormal data management method, system, equipment and medium

Similar Documents

Publication Publication Date Title
CN111199016B (en) Daily load curve clustering method for improving K-means based on DTW
CN116821832A (en) Abnormal data identification and correction method for high-voltage industrial and commercial user power load
CN104809658B (en) A kind of rapid analysis method of low-voltage distribution network taiwan area line loss
CN110163429B (en) Short-term load prediction method based on similarity day optimization screening
CN106055918A (en) Power system load data identification and recovery method
CN111900731B (en) PMU-based power system state estimation performance evaluation method
CN108805213B (en) Power load curve double-layer spectral clustering method considering wavelet entropy dimensionality reduction
CN114330583B (en) Abnormal electricity utilization identification method and abnormal electricity utilization identification system
CN110020680B (en) PMU data classification method based on random matrix theory and fuzzy C-means clustering algorithm
CN110991737A (en) Ultra-short-term wind power prediction method based on deep belief network
CN111539657A (en) Typical electricity consumption industry load characteristic classification and synthesis method combined with user daily electricity consumption curve
CN112001441A (en) Power distribution network line loss anomaly detection method based on Kmeans-AHC hybrid clustering algorithm
CN111008726A (en) Class image conversion method in power load prediction
CN111026741A (en) Data cleaning method and device based on time series similarity
CN117131449A (en) Data management-oriented anomaly identification method and system with propagation learning capability
CN115358437A (en) Power supply load prediction method based on convolutional neural network
CN116148753A (en) Intelligent electric energy meter operation error monitoring system
CN112596016A (en) Transformer fault diagnosis method based on integration of multiple one-dimensional convolutional neural networks
CN117151770A (en) Attention mechanism-based LSTM carbon price prediction method and system
CN111861785A (en) Special transformer industry fault identification method based on power utilization characteristics and outlier detection
CN117131022B (en) Heterogeneous data migration method of electric power information system
CN107274025B (en) System and method for realizing intelligent identification and management of power consumption mode
CN116365519B (en) Power load prediction method, system, storage medium and equipment
CN117435937A (en) Smart electric meter abnormal data identification method, device, equipment and storage medium
CN113033898A (en) Electrical load prediction method and system based on K-means clustering and BI-LSTM neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20230929