Disclosure of Invention
The invention aims to provide a photovoltaic fault arc feature selection method combining a filtering type evaluation strategy with an encapsulation type evaluation strategy, which reduces the dimension of a fault arc feature set by solving the selection problem of distinguishing the fault arc from arc-like features in a grid-connected photovoltaic system, reduces the training time of a classifier and is beneficial to improving the rapidity and the reliability of a fault arc detection algorithm.
In order to achieve the purpose, the invention adopts the following technical scheme:
the feature selection method comprises the following steps:
1) Sample the current waveform of the grid-connected photovoltaic system, from before to after the fault arc occurs, point by point at frequency f, and acquire the signal x_n using a time window of length T_s; go to step 2);
2) Calculate the time-domain features of the signal x_n in each statistical period, remove the outlier data from the calculation results, and then normalize them to obtain the time-domain feature set S1 = {f1, f2, ..., fm1} of the signal x_n, where the element fi in S1 denotes the feature quantity data set of the i-th time-domain feature, i = 1, 2, ..., m1; go to step 3);
3) Perform time-frequency-domain analysis on the time-domain features in the time-domain feature set S1, remove the outlier data from the analysis results, and normalize them to obtain the time-frequency-domain feature set S2 = {tf1, tf2, ..., tfm2} of the signal x_n, where the element tfi in S2 denotes the feature quantity data set of the i-th time-frequency-domain feature; go to step 4);
4) Merge the time-domain feature set S1 and the time-frequency-domain feature set S2 to obtain the multi-feature set to be selected, S3 = {f1, f2, ..., fm1, tf1, tf2, ..., tfm2}; go to step 5);
5) Calculate the feature weight of each feature in the multi-feature set S3 using the ReliefF algorithm, screen out the irrelevant features whose weight is below a set threshold, and then sort the remaining features from high to low weight to obtain the relevant feature set S4 = {F1, F2, ..., Fm3}; go to step 6);
6) If the number of features required by the fault arc detection algorithm has been determined to be q, apply the maximum correlation minimum redundancy algorithm with the number of selected features set to m = q, and output the resulting optimal feature set for that requirement (i.e., a feature number of q) as the result of the feature selection method; otherwise, go to step 7);
7) When the number of features required by the fault arc detection algorithm has not been determined, apply the maximum correlation minimum redundancy algorithm with the number of selected features m set to 1, 2, 3, ... in turn to obtain a series of non-redundant feature sets; go to step 8);
8) Use each non-redundant feature set in the series as the sample set of a classifier and test the fault arc detection accuracy (the fault arc detection result being the classifier output), i.e., carry out non-redundant feature set analysis; construct a multi-objective optimization model based on the classifier output accuracy and the number of features; then solve this multi-objective optimization to select the optimal feature set from the series of non-redundant feature sets, and output it.
Preferably, the sampling frequency f is at least twice the maximum frequency in the effective fault arc characteristic frequency band; where the sampling hardware allows, a higher sampling frequency lets the selected characteristic frequency band better reflect the fundamental distinguishing characteristics of the fault arc, so f = 200 kHz to 2 MHz is selected. The time window length and the sampling frequency are related by T_s = N/f, where N is the number of sampling points of the detection signal in the time window; N is chosen so that the detection signal within a time window of the given length reflects effective fault arc time-frequency characteristics, and its value range is 1000000 to 4000000; f is an integer multiple of 2N in consideration of the extraction of specific components along the frequency dimension of the time-frequency plane. For the time-frequency-domain features, as a compromise between feature reliability and amount of calculation, the statistical period is 1000-10000 sampling points.
Preferably, for the feature sets to be selected (comprising the time-domain feature set S1 = {f1, f2, ..., fm1} obtained by statistical analysis and the time-frequency-domain feature set S2 = {tf1, tf2, ..., tfm2} obtained by time-frequency-domain analysis), in order to capture as far as possible the significant time-domain features that distinguish a fault arc from arc-like conditions, the time-domain features in the time-domain feature set S1 are one or more of the mean, variance, skewness and kurtosis of the signal x_n, computed per statistical period. The time-frequency-domain analysis uses a method that improves the frequency resolution and the time resolution as far as possible; wavelet transform and/or short-time Fourier transform may be used. The window function length of the short-time Fourier transform is 1000-10000; the wavelet function of the wavelet transform is Rbio3.1, and a 3- to 9-level (for example, 6-level) wavelet packet decomposition is performed, with nodes from (6, 0) to (6, 11).
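The per-period statistics named above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name `period_features` is hypothetical, population (biased) moments are assumed, and a constant segment is assumed to have skewness and kurtosis of zero.

```python
def period_features(x, period):
    """Return (mean, variance, skewness, kurtosis) for each full statistical period of x."""
    feats = []
    for start in range(0, len(x) - period + 1, period):
        seg = x[start:start + period]
        n = len(seg)
        mean = sum(seg) / n
        # Central moments (population form; an assumption, the text does not specify)
        m2 = sum((v - mean) ** 2 for v in seg) / n
        m3 = sum((v - mean) ** 3 for v in seg) / n
        m4 = sum((v - mean) ** 4 for v in seg) / n
        if m2 == 0:
            skew, kurt = 0.0, 0.0  # assumption: constant segment
        else:
            skew = m3 / m2 ** 1.5
            kurt = m4 / m2 ** 2
        feats.append((mean, m2, skew, kurt))
    return feats

# Toy signal with a statistical period of 4 points (the embodiment uses 8000)
features = period_features([1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 2.0, 2.0], period=4)
```

Each tuple in `features` corresponds to one statistical period, so a feature such as the variance becomes a series with one value per period.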
Preferably, the outlier data are the elements that differ from the mean of all elements in the corresponding feature quantity data set by more than three times the standard deviation (which ensures that data showing fault arc characteristics are not removed); outlier data are removed by replacing each outlier with the last non-outlier data.
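A minimal sketch of the three-sigma replacement rule, followed by a normalization step, is given here. The function names are hypothetical. Two points are assumptions rather than statements from the text: a series that begins with an outlier falls back to the first non-outlier value, and min-max scaling is used for normalization (the text only says "normalization").

```python
def replace_outliers(values, k=3.0):
    """Replace elements more than k standard deviations from the mean
    with the last preceding non-outlier value."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    is_outlier = [abs(v - mean) > k * std for v in values]
    # Assumption: if the series starts with outliers, use the first non-outlier value
    last_good = next((v for v, bad in zip(values, is_outlier) if not bad), mean)
    cleaned = []
    for v, bad in zip(values, is_outlier):
        if bad:
            cleaned.append(last_good)
        else:
            cleaned.append(v)
            last_good = v
    return cleaned

def min_max_normalize(values):
    """Scale values to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

raw = [1.0] * 10 + [100.0] + [1.0] * 10   # one spike in an otherwise flat series
cleaned = replace_outliers(raw)
normalized = min_max_normalize([1.0, 2.0, 3.0])
```

In the toy series the spike lies more than three standard deviations from the mean, so it is replaced by the preceding value.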
Preferably, each parameter of the ReliefF algorithm is chosen so as to separate the significant time-frequency features of the fault arc from those of arc-like conditions as far as possible and to reduce random calculation error (so that the significant time-frequency features of the fault arc receive relatively the largest feature weights): the number of sampled instances is 5%-15% (for example, 10%) of the total samples, the number of runs is 10-10000 (for example, 200), and the K-nearest-neighbor value is 2 to 20 (for example, 8).
Preferably, the ReliefF algorithm assigns a real-valued feature weight to each feature in the multi-feature set S3; the features are divided by weight range into a high-weight class and a low-weight class, and the threshold used to filter out the relevant feature set S4 is set to the midpoint of the interval between the lower weight limit of the high-weight features and the upper weight limit of the low-weight features (for example, a threshold of 0.09).
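One way to read the midpoint rule is sketched here. How the features are split into a high-weight and a low-weight group is not specified in the text; this sketch assumes the split is made at the largest gap between adjacent sorted weights. The function name and the weight values are hypothetical and are not taken from the figures.

```python
def midpoint_threshold(weights):
    """Threshold = midpoint between the lowest high-weight and the highest low-weight feature."""
    ordered = sorted(weights, reverse=True)
    # Assumption: the two groups are separated by the largest gap between adjacent weights
    gaps = [ordered[i] - ordered[i + 1] for i in range(len(ordered) - 1)]
    split = gaps.index(max(gaps))
    low_of_high = ordered[split]       # lower weight limit of the high-weight group
    high_of_low = ordered[split + 1]   # upper weight limit of the low-weight group
    return (low_of_high + high_of_low) / 2.0

hypothetical_weights = [0.21, 0.18, 0.15, 0.03, 0.02, 0.01]  # illustrative values only
threshold = midpoint_threshold(hypothetical_weights)
relevant = [i for i, w in enumerate(hypothetical_weights) if w > threshold]
```

With these illustrative weights the largest gap lies between 0.15 and 0.03, so the threshold lands at about 0.09 and the first three features are kept.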
Preferably, in the step 6), the maximum correlation minimum redundancy algorithm uses an incremental search method to find approximately optimal features, that is, one of the features is selected from the related feature set S4 to maximize the constraint target value Φ, then the next feature is selected from the remaining features of the related feature set S4 except the selected feature to also maximize the constraint target value Φ, and the optimal feature set is obtained by analogy in this order.
Preferably, when a series of non-redundant feature sets are used as the sample set of the classifier, considering the requirements of the subsequent arc fault detection algorithm on stability and accuracy, a strong classifier with higher stability and accuracy is selected, for example, a random forest classifier is used as the classifier for non-redundant feature set analysis, and the number of subtrees of the random forest classifier is 10 to 1000 (for example, 1000).
Preferably, in step 8), when solving the multi-objective optimization of the classifier fault arc detection accuracy and the number of features, a linear weighting method is adopted as the optimization method, as a compromise between computational complexity (amount of calculation) and optimization effect. The optimization variables are f_a (the classifier output accuracy) and f_b (1 minus the ratio of the number of features in the candidate feature set to the total number of candidate features, i.e., 1 - m/m3); the weight of f_a is w_a = 0.01 to 0.99 (for example, 0.8) and the weight of f_b is w_b = 0.01 to 0.99 (for example, 0.2). The objective function of the resulting feature selection model (multi-objective optimization model) under multi-factor influence is f_c = w_a f_a + w_b f_b, and the non-redundant feature set with the highest value of the output variable f_c (the evaluation index of feature set quality) is the optimal feature set.
The invention has the following beneficial effects:
the feature space adopted in the feature selection method of the invention not only contains time domain features which are simple in calculation and easy to realize, but also contains time-frequency domain features which are obtained after time-frequency domain analysis (such as short-time Fourier transform and wavelet transform) and can analyze signal local features in frequency bands, and the features of fault data are fully mined by constructing multiple feature sets to be selected which are combined in time-frequency domain, and the feature space has better robustness compared with a single-class feature criterion.
In the feature selection method, the ReliefF algorithm, which can quickly filter out irrelevant features, is applied to the multi-feature set to be selected: each feature is assigned a weight according to its relevance to the class, and the low-weight features are screened out, removing part of the irrelevant features. This reduces the dimension of the multi-feature set to be selected, reduces the amount of calculation, and helps improve the rapidity of the fault arc detection algorithm.
In the feature selection method, when the number of features required for fault arc detection is given, the maximum correlation minimum redundancy algorithm calculates the distributions of, and the mutual information among, the features and between the features and the response variable, so that an optimal feature set is obtained. When the required number of features is not determined, the series of non-redundant feature sets obtained by the maximum correlation minimum redundancy algorithm is used as the sample sets of the classifier, the learning performance of the classifier on the different sample sets is obtained, and the multi-objective optimization based on classifier output accuracy and number of features is solved, so that a compact feature set that accurately distinguishes arc-like conditions from fault arcs is obtained as the optimal feature set; the learner does not need to be trained over multiple rounds, and the learning time is significantly reduced.
The feature set of the invention comprises time domain features obtained by a statistical method, and also comprises time-frequency domain features which can be obtained after time-frequency domain analysis (such as short-time Fourier transform and wavelet transform) and can analyze local features of signals in different frequency bands, and the time-frequency domain features have high precision and good noise immunity. The characteristics of fault data can be fully mined by constructing the multi-feature set to be selected combined by time domain and frequency domain, and compared with a single-class characteristic criterion, the method has better robustness, thereby improving the reliability of the detection algorithm.
Through multi-objective optimization of the classifier output accuracy and the number of features, the invention solves the problem that the optimal number of features of a feature set is difficult to determine by filtering alone. Compared with the traditional encapsulation-type feature selection method, the method does not need to use a learner for multiple rounds of training; the series of non-redundant feature sets obtained by the maximum correlation minimum redundancy algorithm is used as the sample sets of the classifier, which reduces the dimensionality of the data set, shortens the training time of the classifier, and improves the speed and reliability of the fault arc detection algorithm.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
As shown in FIG. 1, the invention first samples the signal over the period in which a photovoltaic system fault arc occurs (for example, a current waveform that exhibits the fault arc characteristics, together with the current waveforms before and after it, including the normal state of the system), for example acquiring the signal x_n with a time window of length T_s. Multiple feature values are then extracted from the sampled signal by multi-time-frequency transformation (time-domain features are obtained by statistical analysis, and time-frequency-domain features are then obtained from them by time-frequency-domain analysis). The ReliefF algorithm is applied to the multi-feature set to be selected, assigning weights to the features according to their relevance to the fault arc, and the lower-weight features are screened out to obtain the relevant feature set. When the required number of features is given, the maximum correlation minimum redundancy algorithm is applied to the relevant feature set to obtain the optimal feature set. When the required number of features is not given, the maximum correlation minimum redundancy algorithm is applied to the relevant feature set to obtain a series of non-redundant feature sets, which are used as the sample sets of the classifier, and the multi-objective optimization of classifier output accuracy and number of features is solved to obtain the optimal feature set. By combining several feature selection algorithms, the invention constructs optimal data sets that meet different requirements, fully mines the best features for fault arc data analysis, reduces the dimension of the fault arc feature set, reduces the training time of the classifier, and helps improve the rapidity and reliability of the fault arc detection algorithm.
With reference to fig. 2, the steps of the photovoltaic fault arc characteristic selection method combining the filtering type and the encapsulation type evaluation strategy are specifically described:
Step one: the detection signal device sets the sampling frequency f of the current signal and samples the fault arc current signal of the grid-connected photovoltaic system point by point at frequency f. On the one hand, too few data points in the time window cannot accurately reflect the fundamental differences between a fault arc and arc-like conditions; on the other hand, too many data points in the time window increase the amount of calculation. The number of sampling points in the time window is therefore chosen as 2500000. To reduce the hardware requirements of the signal sampling device while still covering the characteristic frequency band that reflects the difference between a fault arc and arc-like conditions, the sampling frequency f of the output current signal of the grid-connected photovoltaic system is 1 MHz.
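The quantities implied by these settings follow from T_s = N/f. The short sketch below works them out for N = 2500000 and f = 1 MHz, together with the number of full 8000-point statistical periods used in this embodiment; the function names are hypothetical.

```python
def window_length_seconds(n_points, fs_hz):
    """Time window length T_s = N / f."""
    return n_points / fs_hz

def full_periods(n_points, period_points):
    """Number of complete statistical periods that fit in the window."""
    return n_points // period_points

fs_hz = 1_000_000        # sampling frequency f = 1 MHz
n_points = 2_500_000     # sampling points N in the time window
t_s = window_length_seconds(n_points, fs_hz)   # window length in seconds
nyquist_hz = fs_hz / 2                         # highest frequency that can be represented
periods = full_periods(n_points, 8000)         # 8000-point statistical period
```

With these values the window is 2.5 s long, the representable band extends to 500 kHz, and the window holds 312 complete statistical periods.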
Step two: perform multi-time-frequency analysis on the collected current signal x_n to calculate the time-domain and time-frequency-domain feature quantities of the signal x_n.
With 8000 points as the statistical period, the feature quantities corresponding to four time-domain features (mean, variance, skewness and kurtosis) are calculated. A short-time Fourier transform is applied to the four time-domain features, with the number of FFT points and the window function length of each segment both equal to 8000 and a Hamming window as the window function; to reduce the amount of calculation, the number of overlapping samples per segment (noverlap) is 0. This yields 4001 signals at frequencies from 0 Hz to 500 kHz, which serve as the feature quantities of the first part of the time-frequency-domain features. In parallel, a 6-level wavelet packet decomposition with the Rbio3.1 function is applied to the four time-domain features, the decomposition coefficients of wavelet packet nodes (6, 0) to (6, 11) are reconstructed to obtain reconstructed current signals, and the mean, variance, skewness and kurtosis of the reconstructed current signals in each period are calculated as the second part of the time-frequency-domain features. The time-domain features and the two parts of time-frequency-domain features are combined into the multi-feature set to be selected, outlier data are removed, and the features are normalized (normalization removes the units of the features and converts them into dimensionless pure numbers, so that features with different units or orders of magnitude can be compared).
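The figure of 4001 frequency components follows from a one-sided spectrum with 8000 FFT points at a 1 MHz sampling rate. The sketch below checks this and also computes the nominal bandwidth of a level-6 wavelet packet node. The function names are hypothetical, and the mapping from node index to frequency band is not asserted here, because it depends on whether natural or frequency ordering is used.

```python
def stft_bin_frequencies(fs_hz, nfft):
    """Centre frequencies of the one-sided FFT bins for a real-valued signal."""
    return [k * fs_hz / nfft for k in range(nfft // 2 + 1)]

def wavelet_packet_band_hz(fs_hz, level):
    """Nominal bandwidth of one wavelet packet node at the given level."""
    return (fs_hz / 2) / (2 ** level)

freqs = stft_bin_frequencies(1_000_000, 8000)   # f = 1 MHz, 8000 FFT points
band = wavelet_packet_band_hz(1_000_000, 6)     # 6-level decomposition, 64 nodes
```

The bins are spaced 125 Hz apart from 0 Hz to 500 kHz, and each of the 64 level-6 nodes nominally covers 7812.5 Hz.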
Step three: the multi-feature set to be selected obtained in step two is used as the input of the ReliefF algorithm. The ReliefF algorithm assigns different weights to the features according to the relevance of each feature to the class (normal state or fault arc): features highly relevant to the classification receive high weights and features with low relevance receive low weights. A weight threshold is set and the features whose weight is below the threshold are removed, which achieves the purpose of removing irrelevant features. The specific steps of selecting fault arc features with the ReliefF algorithm are as follows:
(1) Selecting an arbitrary sample X (feature quantity) and its corresponding classification;
(2) Find the K nearest-neighbor samples H_j in the sample set of the same class as sample X and the K nearest-neighbor samples M_j in the sample set of the different class (j = 1, 2, ..., K);
(3) Update the weight W(F_p) of each feature F_p (p = 1, 2, ..., P) according to equation (6):
In the formula, m is the number of repetitions of the algorithm; p(c) is the proportion of fault arc samples in the total samples;
d(F_p, X, H_j) denotes the distance between sample X and sample H_j on feature F_p;
d(F_p, X, M_j) denotes the distance between sample X and sample M_j on feature F_p;
in the formula, V(F_p, A) denotes the value of feature F_p for sample A, where A is X, H_j or M_j; max(F_p) denotes the maximum value of feature F_p and min(F_p) denotes the minimum value of feature F_p;
(4) Repeat the above process m times to obtain the weights W(F_p) of all P features;
(5) Set a weight threshold alpha and select the features whose weight is greater than alpha to form the feature subset. On the one hand, the number of features above the threshold must be at least the required number of features; on the other hand, the lower the threshold, the more features are retained, the greater the correlation among them, and the greater the subsequent amount of calculation, so the threshold should be set to select the features whose weight is clearly higher than that of the other features.
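Steps (1) to (5) can be sketched as follows. This is a simplified two-class illustration, not necessarily the exact form of equation (6). It assumes the commonly used update W(F_p) <- W(F_p) - sum_j d(F_p, X, H_j)/(mK) + sum_j d(F_p, X, M_j)/(mK), with d(F_p, A, B) = |V(F_p, A) - V(F_p, B)| / (max(F_p) - min(F_p)), and it omits the class-prior factor p(c), which reduces to 1 when there are only two classes. The function name, the toy data and the threshold value are hypothetical.

```python
import random

def relieff_binary(samples, labels, k, m, seed=0):
    """ReliefF feature weights for a two-class problem (simplified sketch)."""
    n, p = len(samples), len(samples[0])
    lo = [min(s[j] for s in samples) for j in range(p)]
    hi = [max(s[j] for s in samples) for j in range(p)]

    def d(j, a, b):
        # Range-normalised distance between samples a and b on feature j
        span = hi[j] - lo[j]
        return abs(a[j] - b[j]) / span if span > 0 else 0.0

    def nearest(idx, same_class):
        # K nearest neighbours of sample idx, from the same or the other class
        pool = [i for i in range(n)
                if i != idx and (labels[i] == labels[idx]) == same_class]
        pool.sort(key=lambda i: sum(d(j, samples[idx], samples[i]) for j in range(p)))
        return pool[:k]

    rng = random.Random(seed)
    w = [0.0] * p
    for _ in range(m):
        idx = rng.randrange(n)            # step (1): pick a random sample X
        hits = nearest(idx, True)         # step (2): neighbours H_j
        misses = nearest(idx, False)      # step (2): neighbours M_j
        for j in range(p):                # step (3): update each weight
            w[j] -= sum(d(j, samples[idx], samples[h]) for h in hits) / (m * k)
            w[j] += sum(d(j, samples[idx], samples[q]) for q in misses) / (m * k)
    return w

# Toy data: feature 0 separates the classes, feature 1 does not
samples = [[0.0, 0.1], [0.1, 0.9], [0.05, 0.5], [0.02, 0.3],
           [1.0, 0.2], [0.9, 0.8], [0.95, 0.4], [0.98, 0.6]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
weights = relieff_binary(samples, labels, k=2, m=20)
alpha = 0.5  # hypothetical threshold for this toy data
selected = [j for j, wj in enumerate(weights) if wj > alpha]   # step (5)
```

On the toy data the class-separating feature receives a clearly larger weight than the uninformative one, so only feature 0 passes the threshold.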
Step four: when the number of features required by the fault arc detection algorithm is given, the maximum correlation minimum redundancy algorithm is applied to the relevant feature set to obtain the optimal feature set. When the number of features required by the fault arc detection algorithm is not given, the maximum correlation minimum redundancy algorithm is applied to the relevant feature set to obtain a series of non-redundant feature sets.
The maximum correlation minimum redundancy algorithm is a feature selection method for maximizing the correlation between feature variables and targets and minimizing the correlation between features, and the correlation between features and the correlation between features and category variables are measured by using the size of mutual information quantity.
In the formula, x and y are two given random variables (in the invention, the features), p(x, y) is the joint probability distribution function of x and y, and p(x) and p(y) are the probability distribution functions of x and y, respectively. To find a feature subset S containing m features, the maximum correlation criterion uses I(x_i; c) to search for the m features x_i most relevant to the target class c, and the correlation is calculated by the following formula:
A subset of m features selected in this way is not necessarily the optimal feature subset: when two features are highly interdependent, deleting one of them does not change the class discrimination much. Therefore, the minimum redundancy R(S) is introduced to eliminate redundancy between features and to select mutually exclusive features (equation (11)); then φ is obtained by combining the maximum correlation D and the minimum redundancy R (equation (12)).
maxφ(D,R),φ=D-R (12)
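For reference, the quantities entering equation (12) are usually written as follows in the maximum correlation minimum redundancy (mRMR) literature. These are the standard forms, given here on the assumption that they correspond to the mutual information, maximum correlation and minimum redundancy expressions referred to above; the symbol D(S, c) and the use of |S| for the number of features in S are notational choices of this sketch.

```latex
I(x; y) = \iint p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}\,dx\,dy

D(S, c) = \frac{1}{|S|}\sum_{x_i \in S} I(x_i; c)

R(S) = \frac{1}{|S|^{2}}\sum_{x_i,\,x_j \in S} I(x_i; x_j)
```

With these definitions, maximizing φ = D - R favors features that carry much information about the class c while sharing little information with each other.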
When the required number of features (for example, q) is given in advance, the number of features selected by the maximum correlation minimum redundancy algorithm is m = q, and redundant features are removed by calculating the distributions of, and the mutual information among, the features and between the features and the label variable, so that an optimal feature set with minimum correlation among the features and maximum correlation between the features and the class variable is obtained. When the required number of features is not determined in advance, the maximum correlation minimum redundancy algorithm is applied with the number of selected features m set to 1, 2, ..., m3 in turn (m3 being the number of relevant features obtained with the ReliefF algorithm), forming a series of non-redundant feature sets; the method then proceeds to step five, where this series of non-redundant feature sets is used as the sample sets of the classifier.
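The incremental search can be sketched as below. This is an illustration under stated assumptions: features are assumed to be discretized already, so that mutual information can be estimated from counts; each step picks the remaining feature that maximizes relevance minus mean redundancy with the features already selected; and the function names and toy data (four class labels, chosen only to make the result easy to verify by hand) are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        p = c / n
        mi += p * math.log2(p / ((px[x] / n) * (py[y] / n)))
    return mi

def mrmr_select(columns, target, m):
    """Greedy selection of m feature indices maximising relevance minus redundancy."""
    selected = []
    remaining = list(range(len(columns)))
    while remaining and len(selected) < m:
        def score(j):
            relevance = mutual_information(columns[j], target)
            if not selected:
                return relevance
            redundancy = sum(mutual_information(columns[j], columns[i])
                             for i in selected) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: feature 1 duplicates feature 0, feature 2 is complementary
target = [0, 0, 1, 1, 2, 2, 3, 3]
columns = [[0, 0, 0, 0, 1, 1, 1, 1],
           [0, 0, 0, 0, 1, 1, 1, 1],
           [0, 0, 1, 1, 0, 0, 1, 1]]
chosen = mrmr_select(columns, target, 2)
# Series of non-redundant feature sets for m = 1, 2, ..., m3
series = [mrmr_select(columns, target, m) for m in range(1, len(columns) + 1)]
```

In the toy data the duplicate feature is skipped in favor of the complementary one, which is the behavior the redundancy term is meant to produce.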
Step five: a strong classifier gives stable and accurate predictions, but requires a large amount of calculation and a long training time; a weak classifier is less stable and less accurate, but trains quickly. To ensure that the subsequent detection algorithm is stable and accurate, a strong classifier is preferred; specifically, a random forest classifier may be used, with the non-redundant feature sets corresponding to the different values of the feature number m as its sample sets. From the relation between classifier output accuracy and number of features, it can be seen that once the classifier output accuracy exceeds 90%, adding further features does not noticeably increase the accuracy, while the amount of calculation and the training time increase significantly. The multi-objective optimization of classifier output accuracy and number of features is therefore solved with the linear weighting method. The optimization variables are f_a (the classifier output accuracy) and f_b (1 minus the ratio of the number of features in the candidate feature set to the total number of candidate features, i.e., 1 - m/m3); the weight of f_a is 0.8 and the weight of f_b is 0.2, which gives the objective function:
f_c = 0.8 f_a + 0.2 f_b
The non-redundant feature set with the highest f_c value is then the optimal feature set.
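The linear weighting step can be illustrated with a short sketch. The accuracy values below are hypothetical placeholders, not measurements from the figures; they are chosen only to show the shape of the trade-off. The function name is also hypothetical.

```python
def best_feature_count(accuracies, w_a=0.8, w_b=0.2):
    """Return the feature number m with the highest f_c, and all f_c values.

    accuracies[m - 1] is the classifier output accuracy f_a obtained with
    the non-redundant feature set of size m; m3 = len(accuracies).
    """
    m3 = len(accuracies)
    scores = []
    for m, f_a in enumerate(accuracies, start=1):
        f_b = 1.0 - m / m3                  # fewer features give a larger f_b
        scores.append(w_a * f_a + w_b * f_b)
    best_m = scores.index(max(scores)) + 1
    return best_m, scores

# Hypothetical accuracies for m = 1..7 (illustrative only)
hypothetical_accuracy = [0.70, 0.85, 0.93, 0.94, 0.945, 0.95, 0.95]
best_m, scores = best_feature_count(hypothetical_accuracy)
```

For these placeholder values f_c peaks at m = 3: beyond that point the small gain in accuracy no longer offsets the penalty for the additional features.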
As shown in FIG. 3a, the fault arc current is sampled at 1 MHz. Taking 8000 points as a period, the feature quantities of four time-domain features (mean, variance, skewness and kurtosis) of the sampled signal x_n are calculated, giving S1 = {f1, f2, f3, f4}. A 6-level wavelet packet decomposition with the Rbio3.1 function is applied to the time-domain feature quantities of the signal x_n, the decomposition coefficients of wavelet packet nodes (6, 0) to (6, 11) are reconstructed to obtain reconstructed current signals, and the mean, variance, skewness and kurtosis of the reconstructed current signals in each period (8000 points per period) are calculated as time-frequency-domain feature quantities. At the same time, a short-time Fourier transform is applied to the time-domain feature quantities of the signal x_n to obtain the signals in the different frequency bands as frequency-domain feature quantities (4001 in total). Combining the feature quantities obtained by the wavelet transform and the short-time Fourier transform gives the time-frequency-domain feature set to be selected, S2 = {tf1, tf2, ..., tfm2}. The feature matrix composed of the time-domain and time-frequency-domain features, i.e., the multi-feature set S3 = {f1, f2, f3, f4, tf1, tf2, ..., tfm2}, is used as the input of the ReliefF algorithm, with the threshold set to 0.09. Because features more relevant to the classification receive larger weights, the group of features with higher weights, S4 = {F1, F2, ..., F7} (as shown in FIG. 3b), is used as the input of the subsequent maximum correlation minimum redundancy algorithm.
The results show that using the multi-feature set to be selected as the input of the ReliefF algorithm achieves dimension reduction of the high-dimensional feature set.
As shown in FIG. 4a, the grid-connected photovoltaic system fault arc current signal is acquired at a sampling frequency f = 1 MHz. Before 1.05 s the current signal is in the normal state, and after 1.05 s it is in the fault state. The feature F1 is the feature with the largest weight selected by the ReliefF algorithm; it is the variance of the current obtained by applying a 6-level wavelet packet decomposition with the Rbio3.1 function to the signal x_n and reconstructing the decomposition coefficients of wavelet packet node (6, 7) (FIG. 4b).
The above results show that a large-amplitude pulse appears at the moment the fault arc occurs and that the feature value after the fault arc occurs is larger overall than in the normal state; effective features can thus be selected with the ReliefF algorithm, enabling identification of photovoltaic system fault arcs.
As can be seen from FIG. 3b, the number of features m3 in the relevant feature set is 7. The number of features m selected by the maximum correlation minimum redundancy algorithm is set to 1, 2, ..., 7 in turn, the resulting series of non-redundant feature sets is used as the sample sets of the random forest classifier, and the accuracy of the classifier is tested after training. As can be seen from FIG. 5a, when the number of feature quantities in the non-redundant feature set is less than 3, the accuracy increases substantially as the number of feature quantities increases; when the number of feature quantities is 3, the accuracy is higher than 90%; when the number of feature quantities is more than 4, the accuracy increases only slightly as the number of feature quantities increases.
As shown in FIG. 5b, when the number of features in the non-redundant feature set increases from 1 to 3, the value of f_c increases; when the number of features exceeds 3, the value of f_c decreases as the number of features increases.
When the number of features in the non-redundant feature set is 3, f_c is largest; at this point the number of features is small and the classifier output accuracy is high. The feature set {F1, F2, F3} is therefore selected as the optimal feature set.
The results show that the classifier is adopted to output multiple optimization targets of the accuracy and the number of the features to solve, so that a simplified and efficient optimal feature set can be obtained, and the problems of overlarge calculation amount and overlong training time of a subsequent detection algorithm due to the fact that the number of the features is too large are avoided.
The invention combines the advantages of the filtering type and the encapsulation type, using a hybrid method to process large-scale data sets. In the best case, its time complexity is close to that of the filtering type and its performance is close to that of the encapsulation-type algorithm. The hybrid method first uses filtering to quickly select features based on the inherent characteristics of the data set, retaining a small number of features and reducing the scale of the subsequent search, and then uses the encapsulation method for further optimization to obtain the feature subset with the best classification performance. The invention uses multi-feature sets that combine the time domain and the time-frequency domain to fully mine the features of the fault data, and uses the ReliefF algorithm to quickly filter out irrelevant features, so that the features retained after filtering have the opportunity to enter the optimal feature subset, which improves the classification accuracy and gives fast convergence. Using the maximum correlation minimum redundancy algorithm, the method calculates the distributions of, and the mutual information among, the features and between the features and the response variable, obtaining a compact feature set with the highest correlation with the fault arc and removing redundant features, with a clear dimension reduction effect. The feature selection method provided by the invention can reduce the training time of the classifier and improve the rapidity and reliability of the fault arc detection algorithm.
The invention has the following characteristics:
1) The feature set of the invention comprises time-domain features obtained by a statistical method as well as time-frequency-domain features obtained by short-time Fourier transform and wavelet transform; this multi-feature set combining the time domain and the time-frequency domain fully mines the features of the fault data, supports comprehensive detection of fault arcs, and is suitable for a variety of occasions. The ReliefF algorithm, which can quickly filter out irrelevant features, is applied to the multi-feature set to be selected; the resulting feature weights give researchers a more intuitive understanding of the fault arc features, and removing the feature quantities whose weight is below the threshold removes part of the irrelevant features, achieving dimension reduction of the high-dimensional feature set.
2) The invention classifies whether the number of features required by the fault arc detection algorithm is given. When the required number of the features is given, the distribution and mutual information among the features, the features and the response variables are calculated by adopting a maximum correlation minimum redundancy algorithm, so that the optimal feature set is obtained. When the required feature number is uncertain, a maximum correlation minimum redundancy algorithm is adopted to obtain a series of non-redundant feature sets, the non-redundant feature sets are used as sample sets of the classifier, and the classifier is adopted to output multiple optimization targets of the accuracy and the feature number to solve to obtain the optimal feature set.
3) Compared with the traditional encapsulation-type feature selection method, the feature selection method combining the filtering-type and encapsulation-type evaluation strategies does not need to use a learner for multiple rounds of training: the series of non-redundant feature sets obtained by the maximum correlation minimum redundancy algorithm is used as the sample sets of the classifier, and the multi-objective optimization of classifier output accuracy and number of features is solved, so that a compact feature set with a good classification effect is obtained as the optimal feature set. This reduces the dimensionality of the data set, shortens the training time of the classifier, and improves the speed and reliability of the fault arc detection algorithm.