CN113076986B

CN113076986B - Photovoltaic fault arc characteristic selection method combining filtering type and packaging type evaluation strategies

Info

Publication number: CN113076986B
Application number: CN202110333850.3A
Authority: CN
Inventors: 陈思磊; 李兴文; 翟心楠; 孟羽; 吴子豪; 王辰曦; 唐露甜; 王若谷
Original assignee: Electric Power Research Institute of State Grid Shanxi Electric Power Co Ltd; Xian Jiaotong University
Current assignee: Xian Jiaotong University; Electric Power Research Institute of State Grid Shaanxi Electric Power Co Ltd
Priority date: 2021-03-29
Filing date: 2021-03-29
Publication date: 2022-12-09
Anticipated expiration: 2041-03-29
Also published as: CN113076986A

Abstract

The invention discloses a photovoltaic fault arc feature selection method combining a filtering type evaluation strategy with a packaging type evaluation strategy. A multi-feature set to be selected is processed by the ReliefF algorithm, which assigns each feature a weight according to its relevance to the fault arc, and the features with lower weights are screened out. If the required number of features is given, the required optimal feature set is obtained with the maximum correlation minimum redundancy algorithm; otherwise, a series of non-redundant feature sets is obtained with the maximum correlation minimum redundancy algorithm, and the optimal feature set is obtained by multi-objective optimization of classifier accuracy and feature number. By combining several feature selection methods, the invention constructs optimal data sets that meet different requirements, fully mines the optimal features for fault arc data analysis, reduces the dimension of the fault arc feature set, shortens classifier training time, and helps improve the speed and reliability of the fault arc detection algorithm.

Description

Photovoltaic fault arc characteristic selection method combining filtering type and packaging type evaluation strategies

Technical Field

The invention belongs to the technical field of photovoltaic fault arc feature selection, and particularly relates to a feature selection method that, based on the time-domain and time-frequency domain characteristics of DC fault arcs in photovoltaic systems, constructs a multiple time-frequency feature set, fully mines the characteristics of the fault data, and combines filtering type and packaging type evaluation strategies.

Background

In recent years, photovoltaic power generation has developed rapidly because it is clean and renewable. However, DC arc faults in photovoltaic systems are becoming more serious as equipment ages and under the influence of external factors. At present, large photovoltaic power stations mainly rely on the time-domain and frequency-domain characteristics of DC fault arcs for analysis. In the time domain, the voltage and current waveforms on the line can be detected when an arc occurs, and the rate of change of the arc current within a specific time window can serve as an effective feature quantity. However, because the photovoltaic system itself is affected by environmental factors such as illumination, its current varies widely, and the current changes caused by different arcs also differ; time-domain feature quantities are therefore prone to misjudgment, and the resulting fault arc detection algorithm has poor immunity to interference. Time-domain feature quantities thus allow only a simple judgment of the arc condition and are not suitable for accurate fault arc identification. In the frequency domain, the difference in high-frequency harmonic content of the signal before and after the arc occurs can be used as the basis for identifying a fault arc.

Frequency-domain detection based on the Fourier transform is accurate and is not affected by the arc type, but it lacks time-domain information, is relatively cumbersome to calculate, and places certain requirements on waveform stability. In addition, as voltage and current rise, the low-frequency part of the current decreases when an arc is in series with the bus, so the low-frequency part of the current is strongly affected by where the arc occurs, and the approach is not well suited to large photovoltaic systems. Time-frequency domain methods such as the wavelet transform and ensemble empirical mode decomposition combine the advantages of the time domain and the frequency domain: they can highlight arc fault characteristics in different frequency bands and locate the band in which the arc characteristics are most pronounced, making detection more convenient and accurate. However, choosing the wavelet basis and the number of decomposition levels is complex.

The feature selection step of a fault arc detection algorithm is mainly used to eliminate irrelevant and redundant features and to select the features that are truly relevant to classifying fault arcs and arc-like conditions, so that researchers can understand the characteristics of fault arcs more intuitively. Irrelevant or redundant features reduce the similarity between samples of the same class and thereby degrade classifier performance, so feature selection is of great significance to the fault arc detection algorithm. Feature selection methods are generally divided into three categories. The first is the filtering method: each feature is scored by its divergence or correlation index, and suitable features are selected by setting a score threshold or the number of features to be selected. The second is the encapsulation (wrapper) method: feature subsets are scored by an objective function (the prediction performance), and some features are selected or excluded each time. The third is the embedding method: a machine learning algorithm and model are first trained, and the features are then selected in descending order of the weight coefficients obtained for them. Although this resembles the filtering method, the embedding method judges the merit of features through machine learning training rather than directly from their statistical indexes.

In fault arc feature selection, filtering type feature selection has low computational complexity and high speed, but it mainly considers how features differ across sample classes and does not involve a classifier in the selection process, so the optimal number of features in the feature set is difficult to determine. Packaging type feature selection is tied to a learner and uses the learner's performance as the evaluation criterion, so its generality is poor. Embedded type feature selection performs feature selection automatically during the training of the learner. In application, hybrid feature selection methods still have four problems: classification accuracy that needs to be improved, high data dimension, a single candidate feature subset, and relevance and redundancy being treated alike. A feature selection algorithm better suited to fault arc detection therefore needs to be developed, which is of great significance for reducing the feature set dimension, removing redundant and irrelevant features, and understanding fault arc features more intuitively.

Disclosure of Invention

The invention aims to provide a photovoltaic fault arc feature selection method combining a filtering type evaluation strategy with an encapsulation type evaluation strategy, which reduces the dimension of a fault arc feature set by solving the selection problem of distinguishing the fault arc from arc-like features in a grid-connected photovoltaic system, reduces the training time of a classifier and is beneficial to improving the rapidity and the reliability of a fault arc detection algorithm.

In order to achieve the purpose, the invention adopts the following technical scheme:

the feature selection method comprises the following steps:

1) Sampling the current waveform detected from before to after the fault arc in the grid-connected photovoltaic system point by point at frequency f, and acquiring a signal x_n with T_s as the time window length; go to step 2);

2) Calculating the time-domain features of signal x_n in each statistical period, removing outlier data from the calculation results, and then normalizing to obtain the time-domain feature set S1 = {f1, f2, ..., fm1} of signal x_n, where element f_i in S1 represents the feature quantity data set of the i-th time-domain feature, i = 1, 2, ..., m1; go to step 3);

3) Performing time-frequency domain analysis on the time-domain features in the time-domain feature set S1, removing outlier data from the time-frequency domain analysis results, and then normalizing to obtain the time-frequency domain feature set S2 = {tf1, tf2, ..., tfm2} of signal x_n, where element tf_i in S2 represents the feature quantity data set of the i-th time-frequency domain feature; go to step 4);

4) Merging the time-domain feature set S1 and the time-frequency domain feature set S2 to obtain the multi-feature set to be selected S3 = {f1, f2, ..., fm1, tf1, tf2, ..., tfm2}; go to step 5);

5) Calculating the feature weight of each feature in the multi-feature set S3 with the ReliefF algorithm, screening out the irrelevant features in S3 whose weight is lower than a set threshold, and then sorting the remaining features by weight from high to low to obtain the relevant feature set S4 = {F1, F2, ..., Fm3}; go to step 6);

6) If the number of the features required in the fault arc detection algorithm is determined to be q, using a maximum correlation minimum redundancy algorithm to enable the number of the features selected by the algorithm to be m = q, and obtaining an optimal feature set under the requirement condition (namely determining the number of the features to be q) as the output of the feature selection method; otherwise, go to step 7);

7) When the number of features required by the fault arc detection algorithm is not determined, using the maximum correlation minimum redundancy algorithm with the number of selected features m set to 1, 2, ..., m3 in turn to obtain a series of non-redundant feature sets; go to step 8);

8) Using each of the series of non-redundant feature sets as the sample set of a classifier and testing the accuracy of fault arc detection (the fault arc detection result being the classifier output), i.e., analyzing the non-redundant feature sets; constructing a multi-objective optimization model based on classifier output accuracy and feature number; then solving the multi-objective optimization of classifier output accuracy and feature number to determine the optimal feature set (selected from the series of non-redundant feature sets), and outputting it.

Preferably, the sampling frequency f is at least twice the maximum frequency of the effective fault arc characteristic frequency band. Where the sampling hardware allows, a higher sampling frequency lets the selected characteristic frequency band better reflect the fundamental distinguishing characteristics of the fault arc, so f = 200 kHz to 2 MHz is selected. The time window length and the sampling frequency are related by T_s = N/f, where N is the number of sampling points of the detection signal in the time window; N is chosen so that the detection signal within a time window of that length reflects the effective fault arc time-frequency characteristics, and its value range is 1000000 to 4000000. In consideration of extracting specific frequency components in the time-frequency plane, f is an integer multiple of 2N. For the time-frequency domain features, as a compromise between feature reliability and amount of calculation, the statistical period is 1000 to 10000 sampling points.
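As a quick check of the relation T_s = N/f and of the "at least twice the maximum frequency" condition, the following minimal Python sketch evaluates both for the values used in the embodiment later in the description (N = 2500000, f = 1 MHz). The function names and the 500 kHz band edge used in the check are illustrative assumptions, not taken from the patent.

```python
def time_window_seconds(n_points: int, fs_hz: float) -> float:
    """Time window length T_s = N / f, in seconds."""
    return n_points / fs_hz


def satisfies_sampling_condition(fs_hz: float, max_band_hz: float) -> bool:
    """True if the sampling frequency is at least twice the highest
    frequency of the characteristic band (the band edge is an assumed input)."""
    return fs_hz >= 2.0 * max_band_hz


# Embodiment values: N = 2,500,000 points sampled at f = 1 MHz.
t_s = time_window_seconds(2_500_000, 1_000_000)          # 2.5 s window
ok = satisfies_sampling_condition(1_000_000, 500_000)    # hypothetical 500 kHz band edge
```

With these values the time window is 2.5 s, which lies inside the window implied by the stated ranges of N and f.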

Preferably, the feature set to be selected (comprising the time-domain feature set S1 = {f1, f2, ..., fm1} obtained by statistical analysis and the time-frequency domain feature set S2 = {tf1, tf2, ..., tfm2} obtained by time-frequency domain analysis) should capture, as far as possible, the salient features that distinguish a fault arc from arc-like conditions. The time-domain features in S1 are one or more of the mean, variance, skewness and kurtosis of signal x_n, computed per statistical period. The time-frequency domain analysis uses methods that improve frequency resolution and time resolution as far as possible; wavelet transform and/or short-time Fourier transform may be used. The window function length of the short-time Fourier transform is 1000 to 10000; the wavelet function is Rbio3.1, with a 3- to 9-level (e.g., 6-level) wavelet packet decomposition, the nodes being (6, 0) to (6, 11).
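The per-period statistics named above can be sketched in a few lines of standard-library Python. This is only an illustration: the patent does not state which estimator variants are used, so the population (biased) variance and the non-excess kurtosis below are assumptions, as are the function names.

```python
import math


def period_statistics(window):
    """Return (mean, variance, skewness, kurtosis) of one statistical period.
    Assumptions: population variance; kurtosis is the raw fourth
    standardized moment (a normal distribution would give 3)."""
    n = len(window)
    mean = sum(window) / n
    var = sum((v - mean) ** 2 for v in window) / n
    if var == 0.0:
        # Constant window: higher moments are undefined, report zeros.
        return mean, 0.0, 0.0, 0.0
    std = math.sqrt(var)
    skew = sum(((v - mean) / std) ** 3 for v in window) / n
    kurt = sum(((v - mean) / std) ** 4 for v in window) / n
    return mean, var, skew, kurt


def time_domain_features(signal, period):
    """Split the signal into consecutive full periods (e.g. 8000 points each)
    and compute the four statistics for every period; a trailing partial
    period is dropped."""
    return [
        period_statistics(signal[start:start + period])
        for start in range(0, len(signal) - period + 1, period)
    ]
```

Each returned tuple corresponds to one statistical period, so the i-th components across all periods form the feature quantity data set f_i.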

Preferably, outlier data are elements that differ from the mean of all elements in the corresponding feature quantity data set by more than three times the standard deviation (which ensures that data showing fault arc characteristics are not removed); outlier data are removed by replacing each outlier with the last preceding non-outlier value.
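A minimal sketch of this three-sigma replacement rule, followed by a normalization step, is given below. Two points are assumptions rather than statements from the patent: an outlier that appears before any valid value falls back to the mean, and min-max scaling is used because the text does not name a normalization method.

```python
import math


def replace_outliers(values, n_sigma=3.0):
    """Replace every element farther than n_sigma standard deviations from
    the mean with the last preceding non-outlier value."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    cleaned, last_good = [], None
    for v in values:
        if std > 0 and abs(v - mean) > n_sigma * std:
            # Assumption: with no earlier valid value, fall back to the mean.
            cleaned.append(last_good if last_good is not None else mean)
        else:
            cleaned.append(v)
            last_good = v
    return cleaned


def min_max_normalize(values):
    """Scale values to [0, 1] (assumed normalization; makes features with
    different units or magnitudes comparable)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Note that the mean and standard deviation are computed once over the whole feature quantity data set, including the outliers themselves, which matches the wording "the mean of all elements".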

Preferably, each parameter of the ReliefF algorithm is chosen so that the salient time-frequency features separating fault arcs from arc-like conditions are distinguished as far as possible and random calculation error is reduced (so that the salient fault arc time-frequency features receive relatively the largest feature weights): the number of sampled instances is 5% to 15% (e.g., 10%) of the total samples, the number of runs is 10 to 10000 (e.g., 200), and the number of nearest neighbors K is 2 to 20 (e.g., 8).

Preferably, the ReliefF algorithm assigns each feature in the multi-feature set S3 a real-valued feature weight. According to their value ranges, the features are divided into a high-weight group and a low-weight group, and the threshold used to filter out the relevant feature set S4 is set to the midpoint between the lowest weight of the high-weight features and the highest weight of the low-weight features (for example, a threshold of 0.09).
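The midpoint rule can be illustrated with a short Python sketch. The patent does not say how the high-weight and low-weight groups are separated; the sketch assumes they are split at the largest gap between consecutive sorted weights. The weights in the usage example are made up for illustration only.

```python
def midpoint_threshold(weights):
    """Split the sorted weights at the largest gap (an assumed grouping rule)
    and return the midpoint between the top of the low-weight group and the
    bottom of the high-weight group."""
    ordered = sorted(weights)
    if len(ordered) < 2:
        raise ValueError("need at least two weights")
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    _, idx = max(gaps)
    return (ordered[idx] + ordered[idx + 1]) / 2.0


def select_relevant(weights, threshold):
    """Indices of features whose weight exceeds the threshold,
    sorted from highest to lowest weight."""
    kept = [(w, i) for i, w in enumerate(weights) if w > threshold]
    kept.sort(key=lambda item: -item[0])
    return [i for _, i in kept]


# Hypothetical ReliefF weights for six features.
example_weights = [0.01, 0.02, 0.16, 0.03, 0.15, 0.20]
alpha = midpoint_threshold(example_weights)       # midpoint of 0.03 and 0.15
relevant = select_relevant(example_weights, alpha)
```

For these illustrative weights the midpoint lies between 0.03 and 0.15, and three features are kept in descending weight order.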

Preferably, in the step 6), the maximum correlation minimum redundancy algorithm uses an incremental search method to find approximately optimal features, that is, one of the features is selected from the related feature set S4 to maximize the constraint target value Φ, then the next feature is selected from the remaining features of the related feature set S4 except the selected feature to also maximize the constraint target value Φ, and the optimal feature set is obtained by analogy in this order.
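The incremental search can be sketched as a greedy loop that, at each step, adds the remaining feature with the largest value of Φ = relevance minus mean redundancy with the already selected features. This is a simplified illustration, not the patent's implementation: it assumes the features have already been discretized so that mutual information can be estimated from counts, and all function and variable names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) of two equally long discrete sequences,
    estimated from empirical frequencies."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, labels, m):
    """Greedy mRMR. features: dict name -> list of discrete values.
    Returns up to m feature names in the order they were selected."""
    relevance = {k: mutual_information(v, labels) for k, v in features.items()}
    selected, remaining = [], list(features)
    while remaining and len(selected) < m:
        best_name, best_phi = None, -math.inf
        for name in remaining:
            # Mean mutual information with the features chosen so far.
            redundancy = (
                sum(mutual_information(features[name], features[s])
                    for s in selected) / len(selected)
                if selected else 0.0
            )
            phi = relevance[name] - redundancy
            if phi > best_phi:
                best_name, best_phi = name, phi
        selected.append(best_name)
        remaining.remove(best_name)
    return selected


# Toy data: "b" duplicates "a", so the second pick should be "c".
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = {
    "a": [0, 0, 0, 1, 1, 1, 1, 1],
    "b": [0, 0, 0, 1, 1, 1, 1, 1],
    "c": [0, 0, 1, 0, 1, 1, 0, 1],
}
toy_order = mrmr_select(toy_features, toy_labels, 3)
```

Because the selection order is nested, running the loop once up to the largest m also yields the feature sets for every smaller m as prefixes of the returned list.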

Preferably, when a series of non-redundant feature sets are used as the sample set of the classifier, considering the requirements of the subsequent arc fault detection algorithm on stability and accuracy, a strong classifier with higher stability and accuracy is selected, for example, a random forest classifier is used as the classifier for non-redundant feature set analysis, and the number of subtrees of the random forest classifier is 10 to 1000 (for example, 1000).

Preferably, in step 8), when solving the multi-objective optimization of classifier fault arc detection accuracy and feature number, a linear weighting method is adopted as a compromise between feature computational complexity (amount of calculation) and optimization effect. The optimization variables are f_a (the classifier output accuracy) and f_b (1 minus the ratio of the number of features in the candidate feature set to the total number of candidate features, i.e., 1 - m/m3). The weight of f_a is w_a = 0.01 to 0.99 (e.g., 0.8) and the weight of f_b is w_b = 0.01 to 0.99 (e.g., 0.2). The objective function of the resulting feature selection model (multi-objective optimization model) under multi-factor influence is f_c = w_a f_a + w_b f_b, and the non-redundant feature set with the highest value of the output variable f_c (the feature set evaluation index) is the optimal feature set.
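The linear weighting step reduces to evaluating f_c = w_a f_a + w_b f_b for every candidate feature count and keeping the maximum. The sketch below shows this with the example weights 0.8 and 0.2; the accuracy values are invented purely to illustrate the trade-off and are not results from the patent.

```python
def best_feature_count(accuracies, m3, w_a=0.8, w_b=0.2):
    """accuracies[k] is the classifier accuracy f_a obtained with m = k + 1
    features; m3 is the total number of candidate features.
    Returns (m, f_c) for the feature count with the highest f_c."""
    best_m, best_score = None, float("-inf")
    for k, f_a in enumerate(accuracies):
        m = k + 1
        f_b = 1.0 - m / m3                 # rewards smaller feature sets
        f_c = w_a * f_a + w_b * f_b        # linear weighting objective
        if f_c > best_score:
            best_m, best_score = m, f_c
    return best_m, best_score


# Hypothetical accuracies for m = 1..7 features (illustrative only).
demo_accuracies = [0.70, 0.85, 0.93, 0.94, 0.945, 0.95, 0.95]
demo_m, demo_score = best_feature_count(demo_accuracies, m3=7)
```

In this made-up example the accuracy gain beyond three features is too small to offset the f_b penalty, so the three-feature set wins even though larger sets are marginally more accurate.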

The invention has the following beneficial effects:

the feature space adopted in the feature selection method of the invention not only contains time domain features which are simple in calculation and easy to realize, but also contains time-frequency domain features which are obtained after time-frequency domain analysis (such as short-time Fourier transform and wavelet transform) and can analyze signal local features in frequency bands, and the features of fault data are fully mined by constructing multiple feature sets to be selected which are combined in time-frequency domain, and the feature space has better robustness compared with a single-class feature criterion.

In the feature selection method, the ReliefF algorithm, which can filter features rapidly, is applied to the multi-feature set to be selected. It assigns different weights to the features according to the relevance of each feature to the class, and the feature quantities with low weights are screened out, removing part of the irrelevant features. This reduces the dimension of the multi-feature set to be selected and the amount of calculation, and helps improve the speed of the fault arc detection algorithm.

In the feature selection method, when the number of features required for fault arc detection is given, the maximum correlation minimum redundancy algorithm calculates the distributions of, and mutual information between, the features and between the features and the response variable, so that the optimal feature set is obtained. When the required feature number is not determined, the series of non-redundant feature sets obtained by the maximum correlation minimum redundancy algorithm is used as the sample sets of the classifier, the learning performance of the classifier on the different sample sets is obtained, and a multi-objective optimization based on classifier output accuracy and feature number is solved, yielding a reduced feature set that accurately distinguishes arc-like conditions from fault arcs as the optimal feature set. The learner does not need to be trained for multiple rounds, and the learning time is significantly reduced.

The feature set of the invention comprises time domain features obtained by a statistical method, and also comprises time-frequency domain features which can be obtained after time-frequency domain analysis (such as short-time Fourier transform and wavelet transform) and can analyze local features of signals in different frequency bands, and the time-frequency domain features have high precision and good noise immunity. The characteristics of fault data can be fully mined by constructing the multi-feature set to be selected combined by time domain and frequency domain, and compared with a single-class characteristic criterion, the method has better robustness, thereby improving the reliability of the detection algorithm.

The invention solves the problem that the optimal feature number of the feature set is difficult to determine by filtering through multi-objective optimization of the classifier output accuracy and the feature number. Compared with the traditional packaging type feature selection method, the method does not need to use a learner to carry out multi-round training, and a series of non-redundant feature sets obtained through the maximum correlation minimum redundancy algorithm are used as the sample set of the trainer, so that the dimensionality of the data set is reduced, the training time of the classifier is shortened, and the speed and the reliability of the fault arc detection algorithm are improved.

Drawings

Fig. 1 is a schematic frame diagram of a photovoltaic fault arc characteristic selection method combining a filtering type and an encapsulation type evaluation strategy.

Fig. 2 is a flowchart of a photovoltaic fault arc signature selection method combining filtering and encapsulation evaluation strategies.

Fig. 3a is a feature weight distribution diagram obtained by applying a ReliefF algorithm to a multi-feature set to be selected.

Fig. 3b is a correlation feature weight distribution graph obtained after removing features below a threshold.

Fig. 4a is a waveform diagram of a sampled dc fault arc current signal of a photovoltaic system.

Fig. 4b is a waveform diagram of the feature with the highest weight obtained by the ReliefF algorithm.

FIG. 5a shows the test output accuracy of the random forest classifier after training with different sample sets.

FIG. 5b is a diagram showing the multi-objective optimization result of the random forest classifier output accuracy and the feature number.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples.

As shown in FIG. 1, the invention first samples the signal over the period in which a photovoltaic system fault arc occurs (for example, a current waveform that shows the fault arc characteristics together with the normal-state current waveform before and after it), for example acquiring a signal x_n with a time window of length T_s. Multiple feature values are then extracted from the sampled signal by multiple time-frequency transforms (time-domain features are obtained by statistical analysis, and time-frequency domain features are obtained from them by further time-frequency domain analysis). The ReliefF algorithm is applied to the resulting multi-feature set to be selected, assigning each feature a weight according to its relevance to the fault arc, and the features with lower weights are screened out to obtain the relevant feature set. When the required number of features is given, the maximum correlation minimum redundancy algorithm is applied to the relevant feature set to obtain the optimal feature set. When the required number of features is not given, the maximum correlation minimum redundancy algorithm is applied to the relevant feature set to obtain a series of non-redundant feature sets, which are used as the sample sets of the classifier, and the optimal feature set is obtained by solving a multi-objective optimization of classifier output accuracy and feature number. 
The invention constructs the optimal data set meeting different requirements by combining various feature selection algorithms, fully excavates the optimal features for fault arc data analysis, realizes the reduction of the dimension of the fault arc feature set, reduces the training time of the classifier, and is beneficial to improving the rapidity and the reliability of the fault arc detection algorithm.

With reference to fig. 2, the steps of the photovoltaic fault arc characteristic selection method combining the filtering type and the encapsulation type evaluation strategy are specifically described:

Step one: the signal detection device sets the sampling frequency f of the current signal, and the fault arc current signal of the grid-connected photovoltaic system is sampled point by point at frequency f. On one hand, too few data points in the time window cannot accurately reflect the fundamental differences between fault arcs and arc-like conditions; on the other hand, too many data points increase the amount of calculation. The number of sampling points in the time window is therefore chosen as 2500000. To reduce the hardware requirements of the signal sampling device while still covering the characteristic frequency band in which fault arcs differ from arc-like conditions, the sampling frequency f of the output current signal of the grid-connected photovoltaic system is 1 MHz.

Step two: multi-time-frequency analysis is performed on the collected current signal x_n to calculate its time-domain and time-frequency domain feature quantities.

With 8000 points as the statistical period, the feature quantities of four time-domain features (mean, variance, skewness and kurtosis) are calculated. A short-time Fourier transform is applied to the four time-domain features, with the FFT point number and the window function length of each segment both set to 8000, a Hamming window as the window function, and, to reduce the amount of calculation, the number of overlapping samples of each segment (noverlap) set to 0; 4001 signals with frequencies from 0 Hz to 500 kHz are obtained as the feature quantities of the first part of the time-frequency domain features. Meanwhile, a 6-level wavelet packet decomposition with the Rbio3.1 function is applied to the four time-domain features, the decomposition coefficients of wavelet packet nodes (6, 0) to (6, 11) are reconstructed to obtain reconstructed current signals, and the mean, variance, skewness and kurtosis of the reconstructed current signals in each period are calculated as the second part of the time-frequency domain features. The time-domain features and the two parts of time-frequency domain features are combined to form the multi-feature set to be selected, from which outlier data are removed and which is then normalized (normalization removes the unit of each feature and converts it into a dimensionless pure number, so that features with different units or orders of magnitude can be compared).
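The figure of 4001 frequency components follows from simple arithmetic: a one-sided spectrum of a real-valued signal with an 8000-point FFT has 8000/2 + 1 bins, spaced f/8000 apart. The short Python sketch below reproduces these numbers; it assumes a one-sided spectrum and no padding of the final partial segment, and its function names are illustrative.

```python
def stft_frequency_axis(fs_hz, nfft):
    """Bin frequencies of a one-sided spectrum: nfft // 2 + 1 bins
    from 0 Hz to fs / 2, spaced fs / nfft apart."""
    n_bins = nfft // 2 + 1
    return [k * fs_hz / nfft for k in range(n_bins)]


def full_segments(n_samples, segment_length, noverlap=0):
    """Number of complete segments when consecutive segments overlap by
    noverlap samples and no padding is applied."""
    step = segment_length - noverlap
    if n_samples < segment_length:
        return 0
    return (n_samples - segment_length) // step + 1


freqs = stft_frequency_axis(1_000_000, 8000)   # 1 MHz sampling, 8000-point FFT
segments = full_segments(2_500_000, 8000)      # 2,500,000-point time window
```

For the embodiment's values this gives 4001 bins at 125 Hz spacing up to 500 kHz, and 312 complete 8000-point periods in the 2500000-point window.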

Step three: the multi-feature set to be selected obtained in step two is used as the input of the ReliefF algorithm. The ReliefF algorithm assigns different weights to the features according to the relevance of each feature to the class (normal state or fault arc): features highly relevant to the classification receive high weights and features with low relevance receive low weights. A weight threshold is set and the features whose weight is below the threshold are removed, which achieves the purpose of removing irrelevant features. The specific steps of selecting fault arc features with the ReliefF algorithm are as follows:

(1) Selecting an arbitrary sample X (feature quantity) and its corresponding classification;

(2) Finding the K nearest neighbor samples H_j and M_j (j = 1, 2, ..., K) of sample X in the sample sets of the same class and of the different class, respectively;

(3) Updating the weight W(F_p) of each feature F_p (p = 1, 2, ..., P) according to equation (6):

In the formula, m is the number of repetitions of the algorithm; p(c) is the probability of fault arc samples in the total samples;

d(F_p, X, H_j) denotes the distance between sample X and sample H_j on feature F_p;

d(F_p, X, M_j) denotes the distance between sample X and sample M_j on feature F_p;

In the formula, V(F_p, A) represents the value of sample A on feature F_p, where A is X, H_j or M_j; max(F_p) represents the maximum value of feature F_p, and min(F_p) represents the minimum value of feature F_p;

(4) Repeating the above process m times to obtain the weights W(F_p) of all P features;

(5) Setting a weight threshold α, and selecting the features whose weight is greater than α to form a feature subset. On one hand, the number of features above the threshold must not be less than the required number of features; on the other hand, the lower the threshold, the more features are retained, the greater the correlation among them, and the greater the subsequent amount of calculation. The threshold should therefore be set so that it selects the features whose weights are clearly higher than those of the other features.
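Steps (1) to (4) can be sketched in standard-library Python as follows. Because equation (6) is not reproduced in this text, the sketch uses the commonly published ReliefF update (subtract the averaged hit distances, add the prior-weighted averaged miss distances) and the range-normalized distance |V(F_p, A) - V(F_p, B)| / (max(F_p) - min(F_p)); both are assumptions that are consistent with the symbols defined above but are not confirmed as the patent's exact formula. The default parameters follow the example values K = 8 and m = 200.

```python
import random


def relieff_weights(samples, labels, k=8, m=200, seed=0):
    """samples: list of equal-length feature vectors; labels: class per sample.
    Returns one weight per feature (assumed standard ReliefF update)."""
    rng = random.Random(seed)
    n, p = len(samples), len(samples[0])
    lo = [min(s[f] for s in samples) for f in range(p)]
    hi = [max(s[f] for s in samples) for f in range(p)]
    span = [(hi[f] - lo[f]) or 1.0 for f in range(p)]   # avoid division by zero
    prior = {c: labels.count(c) / n for c in set(labels)}

    def diff(f, a, b):
        # Range-normalized distance between samples a and b on feature f.
        return abs(samples[a][f] - samples[b][f]) / span[f]

    def nearest(i, cls):
        # Up to k nearest neighbours of sample i within class cls.
        cands = [j for j in range(n) if j != i and labels[j] == cls]
        cands.sort(key=lambda j: sum(diff(f, i, j) for f in range(p)))
        return cands[:k]

    weights = [0.0] * p
    for _ in range(m):
        i = rng.randrange(n)                              # step (1): random sample X
        hits = nearest(i, labels[i])                      # step (2): same-class H_j
        misses = {c: nearest(i, c) for c in prior if c != labels[i]}  # M_j
        for f in range(p):                                # step (3): update W(F_p)
            if hits:
                weights[f] -= sum(diff(f, i, j) for j in hits) / (m * len(hits))
            for c, near in misses.items():
                if near:
                    scale = prior[c] / (1.0 - prior[labels[i]])
                    weights[f] += scale * sum(diff(f, i, j) for j in near) / (m * len(near))
    return weights                                        # step (4): after m repetitions
```

A feature whose values differ little between a sample and its same-class neighbours but strongly between the sample and its other-class neighbours ends up with a large positive weight, which is the behaviour the thresholding in step (5) relies on.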

Step four: when the number of features required in the fault arc detection algorithm is given, the maximum correlation minimum redundancy algorithm is applied to the relevant feature set to obtain the optimal feature set. When the number of features required in the fault arc detection algorithm is not given, the maximum correlation minimum redundancy algorithm is applied to the relevant feature set to obtain a series of non-redundant feature sets.

The maximum correlation minimum redundancy algorithm is a feature selection method for maximizing the correlation between feature variables and targets and minimizing the correlation between features, and the correlation between features and the correlation between features and category variables are measured by using the size of mutual information quantity.

In the formula, x and y are two given random variables (specifically, features in the invention), and p(x, y) is the joint probability distribution function of x and y; p(x) and p(y) are the probability distribution functions of x and y, respectively. To find a feature subset S containing m features, the maximum correlation criterion uses I(x_i; c) to search for the m features x_i most relevant to the target class c; the correlation magnitude can be calculated by the following formula:

A subset of the m most relevant features is not necessarily the optimal feature subset: when two features are highly dependent on each other, deleting one of them does not change the class-discriminative power much. A minimum redundancy R(S) is therefore introduced to eliminate redundancy between features and select mutually exclusive features (equation (11)); φ is then obtained by combining the maximum correlation D and the minimum redundancy R (equation (12)).

maxφ(D,R),φ＝D-R (12)
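The formulas for mutual information, maximum correlation and minimum redundancy referred to above are not reproduced in this text. For orientation, the standard maximum correlation minimum redundancy formulation, which uses the same symbols (I, D, R, φ, S, c) and agrees with equation (12), is usually written as follows; this is the textbook form and is assumed, not confirmed, to match the patent's equations.

```latex
% Mutual information between two random variables x and y
I(x; y) = \iint p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}\,dx\,dy

% Maximum correlation: mean mutual information between the features in S and the class c
\max D(S, c), \qquad D = \frac{1}{|S|}\sum_{x_i \in S} I(x_i; c)

% Minimum redundancy: mean mutual information between pairs of features in S
\min R(S), \qquad R = \frac{1}{|S|^{2}}\sum_{x_i,\,x_j \in S} I(x_i; x_j)

% Combined criterion
\max \phi(D, R), \qquad \phi = D - R
```

Here |S| = m is the number of features in the subset, so D rewards features that carry information about the class while R penalizes features that carry the same information as one another.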

When the required number of features (for example, q) is given in advance, the number of features selected by the maximum correlation minimum redundancy algorithm is set to m = q. Redundant features are removed by calculating the distributions of, and mutual information between, the features and between the features and the label variable, yielding an optimal feature set with minimum correlation among the features and maximum correlation between the features and the class variable. When the required number of features is not determined in advance, the maximum correlation minimum redundancy algorithm is run with the number of selected features m set to 1, 2, ..., m3 in turn (m3 is the number of relevant features obtained from the ReliefF algorithm), forming a series of non-redundant feature sets; the method then proceeds to step five, where this series of non-redundant feature sets is used as the sample sets of the classifier.

Step five: the prediction result of a strong classifier is stable and accurate, but its computational cost is large and its training time is long; a weak classifier is less stable and less accurate, but its training time is short. To ensure that the subsequent detection algorithm is stable and accurate, a strong classifier is preferred; specifically, a random forest classifier can be selected, and the non-redundant feature sets corresponding to the different values of the feature number m are used as the sample sets of the classifier. From the relation between the classifier output accuracy and the number of features, it can be seen that once the output accuracy exceeds 90%, increasing the number of features no longer raises the accuracy noticeably, while the computational cost and the training time increase markedly. Therefore, the number of features is determined by multi-objective optimization of the classifier output accuracy and the number of features. The optimization method is the linear weighting method, and the optimization variables are f_a (the classifier output accuracy) and f_b (1 minus the ratio of the number of features in the candidate feature set to the total number of candidate features, i.e. 1 - m/m3). The weight of f_a is 0.8 and the weight of f_b is 0.2, which gives the objective function:

f_c = 0.8 f_a + 0.2 f_b

Then, the non-redundant feature set with the highest f_c value is the best feature set.
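The linear weighting above can be sketched in a few lines. The function name `best_feature_count` and the accuracy values are hypothetical and chosen only for illustration; they are not measurements from the patent.

```python
def best_feature_count(accuracies, w_a=0.8, w_b=0.2):
    """Score each feature count m with f_c = w_a * f_a + w_b * (1 - m / m3).

    accuracies[m - 1] is the classifier accuracy obtained with m features,
    and m3 is taken as the number of entries (assumption).
    """
    m3 = len(accuracies)
    scores = []
    for m, f_a in enumerate(accuracies, start=1):
        f_b = 1 - m / m3
        scores.append(w_a * f_a + w_b * f_b)
    best_m = max(range(1, m3 + 1), key=lambda m: scores[m - 1])
    return best_m, scores

# Hypothetical accuracies for m = 1..7 (illustrative values only).
accuracies = [0.70, 0.82, 0.93, 0.94, 0.945, 0.95, 0.95]
best_m, scores = best_feature_count(accuracies)
```

With these made-up values the score peaks at three features: beyond that point the small accuracy gains no longer offset the penalty on the feature count.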

As shown in FIG. 3a, the fault arc current is sampled at 1 MHz. Taking 8000 points as a period, the feature quantities of four time-domain features (average value, variance, skewness and kurtosis) are calculated for the sampled signal x_n, giving S1 = {f1, f2, f3, f4}. The time-domain characteristic quantity of the signal x_n is decomposed by a 6-layer wavelet packet decomposition using the Rbio3.1 function, the decomposition coefficients of wavelet packet nodes (6, 0) to (6, 11) are reconstructed to obtain reconstructed current signals, and the average value, variance, skewness and kurtosis of each reconstructed current signal in each period (8000 points) are calculated as time-frequency-domain feature quantities. At the same time, the time-domain characteristic quantity of the signal x_n is subjected to a short-time Fourier transform to obtain the signals in different frequency bands as frequency-domain feature quantities (4001 in total). Combining the time-frequency-domain feature quantities obtained by the wavelet transform and the short-time Fourier transform gives the candidate time-frequency-domain feature set S2 = {tf1, tf2, ..., tfm2}. The feature matrix composed of the time-domain and time-frequency-domain features, i.e. the candidate multi-feature set S3 = {f1, f2, ..., f4, tf1, tf2, ..., tfm2}, is used as the input of the ReliefF algorithm, and the threshold is set to 0.09. Because features more relevant to the classification receive larger weights, the group of features with higher weights, i.e. S4 = {F1, F2, ..., F7} (as shown in FIG. 3b), is used as the input of the subsequent maximum correlation minimum redundancy algorithm.
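The four time-domain features computed per period can be sketched as below. This covers only the statistical step, not the wavelet packet or short-time Fourier analysis. The function name `period_statistics`, the use of population (rather than sample) moments, and the zero-variance convention are assumptions made for illustration.

```python
import math

def period_statistics(signal, period=8000):
    """Return (mean, variance, skewness, kurtosis) for each full period of samples."""
    rows = []
    for start in range(0, len(signal) - period + 1, period):
        window = signal[start:start + period]
        n = len(window)
        mean = sum(window) / n
        centered = [v - mean for v in window]
        variance = sum(d * d for d in centered) / n  # population variance (assumption)
        if variance == 0:
            # Convention chosen here for a constant window: report zero for both
            skewness, kurtosis = 0.0, 0.0
        else:
            std = math.sqrt(variance)
            skewness = sum(d ** 3 for d in centered) / (n * std ** 3)
            kurtosis = sum(d ** 4 for d in centered) / (n * variance ** 2)
        rows.append((mean, variance, skewness, kurtosis))
    return rows

# Tiny demonstration with a short period so the numbers are easy to check by hand.
demo = period_statistics([1.0, -1.0] * 4, period=4)
```

Each tuple in the result corresponds to one statistical period, so a one-second record sampled at 1 MHz with the default period would yield 125 rows.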

The result shows that using the candidate multi-feature set as the input of the ReliefF algorithm achieves dimension reduction of the high-dimensional feature set.
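The weighting and thresholding step can be sketched as follows. This is a simplified, two-class ReliefF-style update written for illustration only: it assumes features already normalized to [0, 1], uses Manhattan distance, averages the k nearest hits and misses without distance weighting, and uses hypothetical names (`relieff_weights`, `select_relevant`) and toy data.

```python
import random

def relieff_weights(X, y, n_samples, k, seed=0):
    """Simplified two-class ReliefF-style weights. X: rows of features scaled to [0, 1]."""
    rng = random.Random(seed)
    n_features = len(X[0])
    weights = [0.0] * n_features
    for _ in range(n_samples):
        i = rng.randrange(len(X))

        def distance(j):
            return sum(abs(a - b) for a, b in zip(X[i], X[j]))

        others = sorted((j for j in range(len(X)) if j != i), key=distance)
        hits = [j for j in others if y[j] == y[i]][:k]     # nearest same-class samples
        misses = [j for j in others if y[j] != y[i]][:k]   # nearest other-class samples
        for f in range(n_features):
            hit_diff = sum(abs(X[i][f] - X[j][f]) for j in hits) / max(len(hits), 1)
            miss_diff = sum(abs(X[i][f] - X[j][f]) for j in misses) / max(len(misses), 1)
            # A useful feature differs across classes and agrees within a class
            weights[f] += (miss_diff - hit_diff) / n_samples
    return weights

def select_relevant(weights, threshold):
    """Indices of features whose weight is not below the threshold, highest weight first."""
    kept = [f for f, w in enumerate(weights) if w >= threshold]
    return sorted(kept, key=lambda f: weights[f], reverse=True)

# Toy data (illustrative only): feature 0 separates the classes, feature 1 does not.
X_demo = [[0.0, 0.2], [0.1, 0.9], [0.05, 0.5], [0.9, 0.3], [1.0, 0.8], [0.95, 0.5]]
y_demo = [0, 0, 0, 1, 1, 1]
demo_weights = relieff_weights(X_demo, y_demo, n_samples=6, k=2)
relevant = select_relevant(demo_weights, threshold=0.09)
```

On this toy data only the class-separating feature survives the 0.09 threshold, which mirrors how the relevant feature set S4 is formed from S3.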

As shown in FIG. 4a, the fault arc current signal of the grid-connected photovoltaic system is acquired at a sampling frequency f = 1 MHz. Before 1.05 s the current signal is in the normal state, and after 1.05 s it is in the fault state. Feature F1 is the feature with the largest weight selected by the ReliefF algorithm; it is the variance of the current obtained by applying a 6-layer wavelet packet decomposition to the signal x_n using the Rbio3.1 function and reconstructing the decomposition coefficients of wavelet packet node (6, 7) (FIG. 4b).

The above results show that a large-amplitude pulse appears at the moment the fault arc occurs and that the feature value after the fault arc occurs is, overall, larger than in the normal state. Effective features can therefore be selected with the ReliefF algorithm, enabling identification of fault arcs in the photovoltaic system.

As can be seen from FIG. 3b, the number of features m3 in the relevant feature set is 7. The number m of features selected by the maximum correlation minimum redundancy algorithm is set to 1, 2, ..., 7 in turn, the resulting series of non-redundant feature sets is used as the sample sets of the random forest classifier, and the accuracy of the classifier is tested after training. As can be seen from FIG. 5a, when the number of feature quantities in the non-redundant feature set is less than 3, the accuracy rises sharply as the number of feature quantities increases; when the number is 3, the accuracy is higher than 90%; once the number exceeds 4, the accuracy rises only slightly as the number of feature quantities increases.

As shown in FIG. 5b, when the number of features in the non-redundant feature set increases from 1 to 3, the value of f_c increases; when the number of features in the non-redundant feature set exceeds 3, the value of f_c decreases as the number of features increases.

When the number of features in the non-redundant feature set is 3, f_c is the largest; at this point the number of features is small and the output accuracy of the classifier is high. The feature set {F1, F2, F3} is selected as the best feature set.

The results show that solving the multi-objective optimization of classifier output accuracy and number of features yields a compact and efficient optimal feature set, and avoids the excessive computation and overlong training time that too many features would cause in the subsequent detection algorithm.

The invention combines the advantages of the filtering type (filter) and packaging type (wrapper) methods, using a hybrid method to process large-scale data sets. In the best case, its time complexity is close to that of the filtering type method while its performance is close to that of the packaging type method. The hybrid method first uses filtering to quickly select features based on the inherent characteristics of the data set, retaining a small number of features and reducing the scale of the subsequent search, and then uses the packaging type method for further optimization to obtain the feature subset with the best classification performance. The invention adopts a multi-feature set combining the time domain and the time-frequency domain to fully mine the features of the fault data, and adopts the ReliefF algorithm to quickly filter out irrelevant features, so that the features remaining after filtering have the opportunity to enter the optimal feature subset; this improves the classification accuracy and gives fast convergence. The method uses the maximum correlation minimum redundancy algorithm to calculate the distributions and mutual information among the features and between the features and the response variable, so that a compact feature set with the highest correlation to the fault arc is obtained, redundant features are removed, and a clear dimension reduction effect is achieved. The feature selection method provided by the invention can reduce the training time of the classifier and improve the rapidity and reliability of the fault arc detection algorithm.

The invention has the following characteristics:

1) The feature set of the invention comprises time-domain features obtained by statistical methods as well as time-frequency-domain features obtained by short-time Fourier transform and wavelet transform. This multi-feature set combining the time domain and the frequency domain fully mines the features of the fault data, supports comprehensive detection of fault arcs, and is suitable for a variety of applications. For the candidate multi-feature set, the method adopts the ReliefF algorithm, which can rapidly filter out irrelevant features; the resulting feature weights give researchers a more intuitive understanding of the fault arc features, and removing the feature quantities whose weights are lower than a threshold eliminates part of the irrelevant features, achieving dimension reduction of the high-dimensional feature set.

2) The invention distinguishes two cases according to whether the number of features required by the fault arc detection algorithm is given. When the required number of features is given, the distributions and mutual information among the features and between the features and the response variable are calculated by the maximum correlation minimum redundancy algorithm to obtain the optimal feature set. When the required number of features is not determined, the maximum correlation minimum redundancy algorithm is used to obtain a series of non-redundant feature sets, these are used as the sample sets of the classifier, and the multi-objective optimization of classifier output accuracy and number of features is solved to obtain the optimal feature set.

3) Compared with the traditional packaging type feature selection method, the feature selection method combining the filtering type and packaging type evaluation strategies does not need to use a learner for many rounds of training. The series of non-redundant feature sets obtained by the maximum correlation minimum redundancy algorithm is used as the sample sets of the classifier, and the multi-objective optimization of classifier output accuracy and number of features is solved, so that a compact feature set with good classification performance is obtained as the optimal feature set. This reduces the dimensionality of the data set, shortens the training time of the classifier, and improves the speed and reliability of the fault arc detection algorithm.

Claims

1. A photovoltaic fault arc characteristic selection method combining a filtering type evaluation strategy with a packaging type evaluation strategy, characterized in that the method comprises the following steps:

1) Sampling, point by point at a frequency f, the current waveform of the grid-connected photovoltaic system from before to after the occurrence of the fault arc, and acquiring a signal x_n with T_s as the time window length;

2) Calculating the time-domain features of the signal x_n in each statistical period, removing outlier data from the calculation result and then performing normalization, to obtain the time-domain feature set S1 = {f1, f2, ..., fm1} of the signal x_n, wherein the element f_i in S1 represents the feature quantity data set of the i-th time-domain feature, i = 1, 2, ..., m1;

3) Performing time-frequency-domain analysis on the time-domain features, removing outlier data from the time-frequency-domain analysis result and then performing normalization, to obtain the time-frequency-domain feature set S2 = {tf1, tf2, ..., tfm2} of the signal x_n, wherein the element tf_i in S2 represents the feature quantity data set of the i-th time-frequency-domain feature, i = 1, 2, ..., m2;

4) Merging the time-domain feature set S1 and the time-frequency-domain feature set S2 to obtain a candidate multi-feature set S3 = {f1, f2, ..., fm1, tf1, tf2, ..., tfm2};

5) Calculating the feature weight of each feature in the multi-feature set S3 by using the ReliefF algorithm, then screening out the features in S3 whose weights are lower than a threshold, namely the irrelevant features, and sorting the features remaining in S3 from high to low by weight to obtain a relevant feature set S4 = {F1, F2, ..., Fm3};

6) If the number of features required for fault arc detection is determined to be q, selecting m features from the relevant feature set S4 by using the maximum correlation minimum redundancy algorithm, with m equal to q, to obtain the optimal feature set; otherwise, going to step 7);

7) If the number of features required for fault arc detection is not determined, using the maximum correlation minimum redundancy algorithm to select from the relevant feature set S4 feature sets whose number of features m is 1, 2, ..., m3 respectively, to obtain a series of non-redundant feature sets;

8) Using each of the series of non-redundant feature sets as the sample set of a classifier to test the fault arc detection accuracy, and then solving a multi-objective optimization model based on the fault arc detection accuracy of the classifier and the number of features, to obtain the non-redundant feature set that serves as the optimal feature set.

2. The photovoltaic fault arc characteristic selection method combining a filtering type evaluation strategy with a packaging type evaluation strategy according to claim 1, characterized in that: f is 200 kHz to 2 MHz, the number of sampling points in the time window length is 1000000 to 4000000, and the statistical period is 1000 to 10000 sampling points.

3. The photovoltaic fault arc characteristic selection method combining a filtering type evaluation strategy with a packaging type evaluation strategy according to claim 1, characterized in that: the time-domain features are one or more of average value, variance, skewness and kurtosis; the time-frequency-domain analysis uses wavelet transform and/or short-time Fourier transform; the window function length of the short-time Fourier transform is 1000 to 10000; the mother wavelet of the wavelet transform is Rbio3.1; and the nodes obtained by wavelet packet decomposition are (6, 0) to (6, 11).

4. The photovoltaic fault arc characteristic selection method combining a filtering type evaluation strategy with a packaging type evaluation strategy according to claim 1, characterized in that: the outlier data is an element that differs from the mean value of all elements in the corresponding feature quantity data set by more than three times the standard deviation, and the outlier data is removed by replacing it with the previous non-outlier data.

5. The photovoltaic fault arc characteristic selection method combining a filtering type evaluation strategy with a packaging type evaluation strategy according to claim 1, characterized in that: in the ReliefF algorithm, the number of sampling iterations is 5% to 15% of the total number of samples, the number of runs is 10 to 10000, and the number of nearest neighbors K is 2 to 20.

6. The photovoltaic fault arc characteristic selection method combining a filtering type evaluation strategy with a packaging type evaluation strategy according to claim 1, characterized in that: the feature weight of each feature in the multi-feature set S3 is a real number, the features are divided into high-weight and low-weight classes according to the range of their weight values, and the threshold is set according to the middle value of the interval formed by the lower weight limit of the high-weight features and the upper weight limit of the low-weight features.

7. The photovoltaic fault arc characteristic selection method combining a filtering type evaluation strategy with a packaging type evaluation strategy according to claim 1, characterized in that: in step 6), the maximum correlation minimum redundancy algorithm uses an incremental search method to select from the relevant feature set S4 the m features that maximize the constraint target value, so as to obtain the optimal feature set.

8. The photovoltaic fault arc characteristic selection method combining a filtering type evaluation strategy with a packaging type evaluation strategy according to claim 1, characterized in that: in step 8), a random forest classifier is used, and the number of subtrees of the random forest classifier is 10 to 1000.

9. The photovoltaic fault arc characteristic selection method combining a filtering type evaluation strategy with a packaging type evaluation strategy according to claim 1, characterized in that: in step 8), the objective function of the multi-objective optimization model is:

f_c = w_a f_a + w_b f_b

wherein f_a is the fault arc detection accuracy of the classifier, f_b is 1 - m/m3, f_c is the feature set effect evaluation index, and w_a and w_b are the optimization variable weights;

the non-redundant feature set corresponding to the largest value of f_c is the best feature set.