
Pre-Filtering SCADA Data for Enhanced Machine Learning-Based Multivariate Power Estimation in Wind Turbines

Bubin Wang, Bin Zhou *, Denghao Zhu, Mingheng Zou and Haoxuan Luo

School of Energy and Environment, Southeast University, Nanjing 210096, China

* Author to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2025, 13(3), 410; https://doi.org/10.3390/jmse13030410

Submission received: 25 January 2025 / Revised: 20 February 2025 / Accepted: 21 February 2025 / Published: 22 February 2025

(This article belongs to the Topic Advances in Wind Energy Technology)

Figure 1. Flowchart of proposed methodology.

Figure 2. The operating regions of a typical variable-speed, variable-pitch wind turbine and the evolution of the power coefficient, rotor speed, and pitch angle with wind speed.

Figure 3. Dataset A: (a) wind speed–power distribution; (b) wind speed–pitch angle distribution; (c) wind speed–rotor speed distribution.

Figure 4. Impact of filtering rules on relationship between wind speed, pitch angle, and active power in Dataset A: (a) after applying rules (1) and (2); and (b) after applying rules (1), (2), and (3).

Figure 5. AD results in two scenarios: (a) binning only with a fixed interval of 1 m/s, without a moving window; and (b) binning with a moving window of 1 m/s in size and a step size of 0.5 m/s.

Figure 6. Autoencoder-based AD results: (a) autoencoder with wind speed and active power; and (b) autoencoder with wind speed, active power, ambient temperature, rotor speed, and pitch angle.

Figure 7. Modeling results. (a) Base-BPNN. (b) AD-BPNN. (c) PF-AD-BPNN.

Figure 8. Comparison of actual and estimated values for wind turbine power output. (a) Base-BPNN. (b) AD-BPNN. (c) PF-AD-BPNN.

Figure 9. Modeling results. (a) Base-SVM. (b) AD-SVM. (c) PF-AD-SVM.

Figure 10. Comparison of actual and estimated values for wind turbine power output. (a) Base-SVM. (b) AD-SVM. (c) PF-AD-SVM.

Figure 11. The PF performance diagram of Dataset B.

Abstract

Data generated during the shutdown or start-up processes of wind turbines, particularly in complex wind conditions such as offshore environments, often accumulate in the low-wind-speed region, leading to reduced multivariate power estimation accuracy. Therefore, developing efficient filtering methods is crucial to improving data quality and model performance. This paper proposes a novel filtering method that integrates the control strategies of variable-speed, variable-pitch wind turbines, such as maximum-power point tracking (MPPT) and pitch angle control, with statistical distribution characteristics derived from supervisory control and data acquisition (SCADA). First, thresholds for pitch angle and rotor speed are determined based on SCADA data distribution, and the filtering effect is visualized. Subsequently, a sliding window technique is employed for the secondary confirmation of potential outliers, enabling further anomaly detection (AD). Finally, the performance of the power estimation model is validated using two wind turbine datasets and two machine learning algorithms, with results compared with and without filtering. The results demonstrate that the proposed filtering method significantly enhances the accuracy of multivariate power estimation, proving its effectiveness in improving data quality for wind turbines operating in diverse and complex environments.

Keywords: pre-filtering; anomaly detection; power estimation; rotor speed; wind turbine

1. Introduction

Global warming refers to the continuous rise in the average temperature of the Earth’s climate system. Over the past 50 years, due to uncontrolled emissions of greenhouse gasses, the average temperature has increased at the fastest rate on record [1]. To address this severe challenge, the global energy structure is rapidly transitioning towards cleaner and low-carbon energy sources. Against this backdrop, wind power, as an important renewable energy source, has seen continuous rapid growth in installed capacity and power generation [2]. By the end of 2022, the total installed capacity of onshore and offshore wind power globally reached 841.9 GW and 64.3 GW, respectively, with 77.6 GW of new wind power capacity added during the year [3]. Wind turbines are typically installed in harsh environments such as oceans and are subjected to adverse factors such as alternating loads. As operational time increases, the health condition of turbines gradually deteriorates, significantly reducing their power generation performance and affecting the overall efficiency of wind farms [4]. Therefore, there is an urgent need to accurately assess the extent of performance anomalies in wind turbines and implement effective maintenance measures or technical upgrades to enhance their power generation capabilities. The performance of wind turbines is typically evaluated using methods such as the capacity factor, availability, power generation, wind turbine power curve (WTPC) analysis, and multivariate power estimation [5].

The capacity factor, availability, and power generation reflect the fault time and fault rate of wind turbines, covering multiple aspects such as power generation efficiency, operational status, and environmental impact. The WTPC and multivariate power estimation provide more detailed performance analysis, supporting fault diagnosis and optimized control [6]. The WTPC is a core tool for describing the relationship between wind speed and power output, used to evaluate the power generation efficiency of wind turbines. By combining measured wind speed data with the WTPC, the power generation of a wind turbine in a specific location and time period can be predicted. Additionally, analyzing the differences between actual power output and the expected values from the WTPC can effectively identify and diagnose potential faults in the wind turbine system. However, due to the variability of wind power and the unpredictability of wind speed, the standard power curves defined by the International Electrotechnical Commission (IEC) often struggle to accurately monitor the actual performance and operational status of wind turbines [7]. The IEC standard requires averaging data over wind speed intervals of 0.5 m/s or 1 m/s. While this method is simple, it may obscure subtle variations in turbine operation. In the field of WTPC modeling, extensive research has been conducted, covering various machine learning algorithms, including backpropagation neural networks (BPNNs) [8], support vector machines (SVMs) [9], extreme learning machines (ELMs) [10], and deep learning models such as convolutional neural networks (CNNs) [11]. Wang et al. [11] proposed an innovative data-driven deep learning method, which integrates an ELM, channel attention mechanisms, a CNN, and Huber loss functions, significantly improving the modeling accuracy of the WTPC. Furthermore, Mehrjoo et al. 
[12] developed a hybrid estimation method based on a weighted balanced loss function, which optimizes both estimation error and goodness-of-fit by shrinking the estimates toward a standardized target model. The WTPC can effectively capture the performance trends of wind turbines over long-term operation, making it suitable for assessing long-term power generation efficiency and health status [13]. However, since WTPC modeling typically requires a long period of data accumulation (e.g., 30 days or more) to build a stable power curve, it responds relatively slowly to short-term anomalies such as yaw angle faults or sensor drift. Early-stage anomalous data make up only a small proportion of a long-term window and therefore barely shift the overall power curve, which limits the use of the WTPC in real-time fault detection and short-term performance evaluation.

To address these limitations, multivariate power estimation methods have been proposed, incorporating additional environmental and operational parameters such as air density and rotor speed to enhance sensitivity to short-term anomalies [14,15]. These methods leverage the complex relationships between multiple variables to characterize the nonlinear dynamics between environmental parameters and wind power, enabling the more accurate and timely detection of performance deviations. This makes them suitable for both real-time fault detection and short-term performance evaluation, as well as the precise assessment of wind turbine performance degradation under varying wind energy scenarios through controlled input environmental variables. Pandit et al. [16] incorporated air density into a Gaussian process model for wind turbine power evaluation to improve fitting accuracy. Astolfi [17] proposed multivariate approaches to the wind turbine power curve, incorporating additional environmental information and working parameters as input variables for data-driven models to improve the accuracy of theoretical power extraction under non-stationary conditions, leveraging SCADA data and advanced methods from artificial intelligence and applied statistics. Manobela et al. [18] proposed a wind turbine power evaluation method based on Gaussian processes, data filtering, and artificial neural network modeling, using wind speed and wind direction as input variables. Schlechtingen et al. [19] established four wind turbine power models based on cluster center fuzzy logic, neural networks, k-nearest neighbor models, and adaptive neuro-fuzzy inference systems, using wind speed, wind direction, and ambient temperature as input variables. Cascianelli et al. 
[7] proposed an ensemble of multivariate polynomial regression models to predict the active power of wind turbines and provide reliable prediction intervals, incorporating environmental conditions, operational and thermal variables, and interactions between turbines, achieving a mean absolute error of approximately 1.0% of the rated power on real SCADA data from an Italian wind farm. Lee et al. [20] conducted multivariate wind turbine power curve regression by combining SCADA data and met mast data. Input variables included wind speed, wind direction, humidity, turbulence intensity, and the wind shear coefficient, with the regression model being an additive multivariate conditional kernel density estimation model. These studies enhanced the models by incorporating additional environmental parameters as inputs, thereby reducing the variance in wind power prediction errors. In summary, in specific case studies, incorporating environmental variables and operational state variables from the SCADA system is beneficial for improving the accuracy of power evaluation models. By integrating additional data such as wind speed, wind direction, air density, temperature, turbulence intensity, and other relevant parameters, the models can better capture the complex relationships and dynamics affecting wind turbine performance. However, factors such as shutdowns, power limiting, and equipment failures often contaminate actual wind turbine operation data with complex anomalies, significantly compromising the accuracy of multivariable power estimation models [21].

Abnormal data adversely affect the monitoring of wind turbine operational status and can distort power estimation models developed based on such data. Therefore, refining wind data before model establishment is crucial. Data cleaning can be divided into two steps: preliminary filtering and anomaly detection (AD) [22]. The former is used to quickly remove obvious anomalies, while the latter is employed to deeply identify complex anomalies. Filtering aims to eliminate data points that violate physical laws or operational logic, such as data from the non-generation, start-up, or shutdown phases of wind turbines. It is generally a rule-based approach, relying on an understanding of system behavior, such as defining thresholds for wind speed and pitch angle. On the other hand, AD focuses on identifying abnormal data points, such as those caused by sensor failures or extreme weather conditions. AD typically employs statistical or machine learning techniques to detect data points that deviate significantly from normal patterns [23]. This step-by-step approach can more comprehensively improve data quality, laying a solid foundation for subsequent modeling and analysis. However, many studies only employ AD strategies or combine them with minimal preprocessing. This widespread neglect is particularly unusual, as the international standard IEC 61400-12 [24] explicitly mandates a data quality check for power curve measurement, which includes removing unavailable measurements, as well as filtering and excluding data based on power-limited conditions and fault records, with reference to operator logs. Wang et al. [11] proposed a method that combines the 3σ criterion and the quartile algorithm for data cleaning to address the limited performance of a single approach in certain cases, while employing the Mahalanobis distance to measure the distance between data points. 
The effectiveness of the proposed method was validated through comparisons with commonly used techniques such as isolation forest and the local outlier factor. However, this method only applies a single filtering rule, where samples satisfying the conditions of wind speed greater than the cut-in speed and power less than 1 kW are identified as irrational data. Morrison et al. [22] proposed a multi-rule filtering method and explored the impact of such filtering by comparing the performance of four different AD methods with and without filtering. Although this method incorporates information about the pitch angle as a filtering rule, it neglects the role of rotor speed.

To effectively identify and filter abnormal data, thereby improving data quality, a novel pre-filtering (PF) method is proposed in this paper for machine learning-based multivariate power estimation in wind turbines. First, PF is performed by setting filtering rules based on the operational state variables of wind turbines, specifically pitch angle and rotor speed. Subsequently, AD is conducted using a sliding window approach combined with the 3σ criterion and the quartile method. Following this, the performance of two widely used machine learning algorithms, the BPNN and SVM, is compared for multivariate power estimation with and without PF on two distinct datasets. The main contributions and novelties of the present study can be summarized as follows:

A novel PF method is proposed to enhance machine learning-based multivariate power estimation in wind turbines. Samples corresponding to start-up and shutdown phases are filtered by applying thresholds to pitch angle and rotor speed data. The effectiveness of the proposed filtering method is demonstrated through visualization by comparing the results of different filtering rules.
By introducing settings for sliding window size and step size, the AD method is optimized to avoid the incorrect cleaning of data in the four corners of the region. A dual-window validation mechanism is adopted, where a sample is confirmed as a final anomaly only when it is identified as a potential anomaly in two consecutive windows.
By comparing model performance with and without PF, the effectiveness of PF in improving data quality and model accuracy is validated. Experiments conducted on two distinct datasets enhance the reliability and generalizability of the results.

2. Methods

The method of this study is illustrated in Figure 1, primarily consisting of filtering, AD, and power estimation models. This section will introduce the data PF rules combined with wind turbine performance, the AD methods based on statistical principles, and the machine learning models used for power estimation. Additionally, the selection and description of input variables, as well as the performance evaluation metrics for the models, will be discussed.

2.1. Pre-Filtering

PF refers to the preliminary screening or cleaning of raw data before they enter the main processing pipeline. In this paper, the PF rules were determined based on the operational characteristics of wind turbines. Variable-speed, variable-pitch wind turbines are currently the dominant technology for grid-connected wind energy systems. Figure 2 provides a visual representation of the different operating regions of a wind turbine, highlighting the variation in essential operational parameters such as active power, the power coefficient, rotor speed, and pitch angle with respect to wind speed. This illustration is based on findings from previous works [25]. In region I, where the wind speed is below the cut-in speed, the wind turbine does not generate power. Typically, the rotor remains stationary with a rotational speed of 0, and the blades are adjusted to the feather position to reduce wind resistance and mechanical loads. However, some wind turbines may maintain a slow rotation of the rotor (idling) at low wind speeds to optimize start-up performance. In region II, where the wind speed is above the cut-in speed but below the rated speed, the wind turbine begins to generate power. Typically, the pitch angle is set to 0° or close to 0° to maximize wind energy capture. The rotor speed is adjusted through the maximum-power point tracking (MPPT) strategy to maintain the optimal tip–speed ratio, thereby achieving maximum wind energy utilization efficiency. In region III, where the wind speed is above the rated speed but below the cut-out speed, the wind turbine enters the rated power phase. Typically, the rotor speed remains constant, close to the rated speed, while the pitch angle is adjusted to control wind energy capture, thereby maintaining stable power output. In region IV, when the wind speed exceeds the cut-out speed, the wind turbine activates its protection mechanism and ceases power generation to ensure the safety of the equipment. 
At this point, the blade pitch is adjusted to the feathered position to reduce the impact of wind force on the blades, while the rotor speed gradually decreases until it comes to a complete stop (rotor speed reaches 0). This protective measure effectively prevents potential damage to the turbine caused by high wind speeds, ensuring the safe operation of the wind power system under extreme wind conditions.
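As a rough illustration of the four operating regions described above, the following minimal Python sketch maps a hub-height wind speed to its region. The function name `operating_region` and the default cut-in, rated, and cut-out speeds are illustrative assumptions and are not taken from either dataset; the treatment of values exactly on a boundary is likewise an arbitrary choice.

```python
def operating_region(wind_speed, cut_in=3.0, rated=12.0, cut_out=25.0):
    """Return the operating region (1 to 4) for a hub-height wind speed in m/s.

    The default speeds are illustrative assumptions, not values from the paper.
    """
    if wind_speed < cut_in:
        return 1  # region I: no generation (rotor at standstill or idling)
    if wind_speed < rated:
        return 2  # region II: MPPT, pitch near 0 deg, rotor speed follows the wind
    if wind_speed <= cut_out:
        return 3  # region III: rated power, pitch angle regulates captured power
    return 4      # region IV: blades feathered, turbine shut down for protection


regions = [operating_region(v) for v in (2.0, 8.0, 15.0, 30.0)]
print(regions)
```

In practice the three speeds would be read from the turbine specification or estimated from the SCADA data, since they differ between turbine models.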

The PF process incorporated the variation in essential operational parameters such as active power, rotor speed, and pitch angle with respect to wind speed, with clearly defined rules. For the 10 min interval samples in the SCADA system, a sample was filtered out if it met any of the following conditions: the power generation was less than 1 kW while the wind speed exceeded the cut-in speed; the rotor speed was lower than the minimum theoretical value in region II; or the pitch angle exceeded the maximum theoretical value in region III. The specific processing procedures and analysis results are elaborated on in Section 3. Note that Figure 2 is a schematic diagram, and the parameters of wind turbines of the same type but different models may vary. Therefore, the filtering rules need to be determined based on actual operational SCADA data.

2.2. Anomaly Detection

To mitigate the limitations of a single AD method, a combination of the 3σ criterion and the quartile method was employed for data cleaning, which has been shown to be more effective than other common methods, such as isolation forest and the local outlier factor [11]. In this paper, the aforementioned methods were further optimized by incorporating settings for sliding window size and step size, ensuring that each data point underwent two rounds of detection. Additionally, a dual-window validation mechanism was adopted, where a sample was confirmed as a final anomaly only when it was identified as a potential anomaly in two consecutive windows. This approach effectively avoided the incorrect cleaning of data in the four corners of the region.

2.2.1. 3σ Criterion

The 3σ criterion is an outlier detection method based on statistical principles, primarily used to determine whether data points deviate from the normal range. In a normal distribution, data are mainly concentrated around the mean (μ), and the standard deviation (σ) measures the dispersion of the data.

Consider a dataset, X, consisting of n data points:

X = \{x_{1}, x_{2}, \dots, x_{n}\} .

(1)

First, calculate μ and σ. A data point, x_i, is considered normal if it satisfies the following:

x_{i} \in [μ - 3σ, μ + 3σ] .

(2)

If x_i lies outside this range, it is classified as an outlier, as the probability of its occurrence under a normal distribution is extremely low, only about 0.27%.
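The criterion in Equation (2) can be sketched in a few lines of Python using only the standard library. The helper name `three_sigma_mask` and the toy data are hypothetical, and the population standard deviation is used here as an assumption; the sample standard deviation would be an equally reasonable choice.

```python
from statistics import mean, pstdev


def three_sigma_mask(values):
    """Return True for each point inside [mu - 3*sigma, mu + 3*sigma]."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    lower, upper = mu - 3 * sigma, mu + 3 * sigma
    return [lower <= x <= upper for x in values]


# Toy data: 40 tightly grouped readings and one far-away value.
data = [10.0] * 20 + [10.5] * 20 + [50.0]
mask = three_sigma_mask(data)
outliers = [x for x, is_normal in zip(data, mask) if not is_normal]
print(outliers)
```

Note that an extreme value inflates both μ and σ, so in small samples a single outlier can mask itself; this is one motivation for pairing the criterion with a second test.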

2.2.2. Quartile Method

The quartile method is a statistical technique used to describe data distribution and detect outliers. It divides the data into four parts in ascending order, with each part containing approximately 25% of the data. The quartile method calculates quartiles to determine data distribution and uses the interquartile range (IQR) for outlier detection. The IQR is calculated as follows:

IQR = Q3 - Q1,

(3)

where Q1 is the first quartile and Q3 is the third quartile. A data point, x_i, is considered normal if it satisfies the following:

x_{i} \in [Q1 - 1.5 \times IQR, Q3 + 1.5 \times IQR] .

(4)

If it deviates from this range, it is categorized as an outlier.
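A minimal sketch of the rule in Equations (3) and (4) is given below. The helper name `iqr_mask` and the toy data are hypothetical, and the quartiles are computed with the "inclusive" interpolation of Python's `statistics.quantiles`; other quartile conventions give slightly different fences.

```python
from statistics import quantiles


def iqr_mask(values, k=1.5):
    """Return True for each point inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [lower <= x <= upper for x in values]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
mask = iqr_mask(data)
outliers = [x for x, is_normal in zip(data, mask) if not is_normal]
print(outliers)
```

Because the quartiles are rank-based, the fences are far less sensitive to extreme values than μ ± 3σ, which is why the two tests complement each other.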

2.3. Machine Learning

In this study, two machine learning models—the BPNN and SVM—were employed to validate the effectiveness of the proposed filtering strategy. The BPNN, a classic neural network, was chosen for its ability to model complex nonlinear relationships through iterative weight adjustments, making it particularly suitable for capturing the intricate dynamics of wind turbine data. The SVM, known for its robustness in high-dimensional spaces and ability to handle small datasets, was selected to ensure reliable regression performance. By employing these two models, which represent different learning paradigms (neural networks and kernel-based methods), the study ensures a robust and comprehensive evaluation of the filtering strategy’s effectiveness.

2.4. Power Output Calculation and Multivariate Input Selection

Based on the power generation principles of wind turbines, this study selected key parameters that influence wind power output as the input variables for the multivariate power estimation of wind turbines. According to aerodynamic principles, the mechanical power output of a wind turbine can be calculated using

P = \frac{1}{2} C_{p} ρ A V_{H}^{3} \cos^{3} θ .

(5)

Here, C_p represents the wind energy utilization coefficient at the hub-height wind speed, which depends on the tip–speed ratio and the pitch angle; ρ denotes the air density; A is the rotor-swept area; V_H is the hub-height wind speed; and θ is the yaw error angle, defined as the angle between the actual wind direction and the orientation of the rotor main shaft. In the subsequent case study, appropriate input variables were selected by integrating power generation principles and SCADA data.
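To make the sensitivity of Equation (5) to its inputs concrete, the sketch below evaluates it for a hypothetical turbine. The function name `mechanical_power` and all numerical values (power coefficient, air density, rotor diameter, wind speed, yaw error) are illustrative assumptions and do not describe the turbines in Dataset A or Dataset B.

```python
import math


def mechanical_power(cp, rho, rotor_diameter, v_hub, yaw_error_deg=0.0):
    """Evaluate P = 0.5 * Cp * rho * A * V_H^3 * cos^3(theta), in watts."""
    area = math.pi * (rotor_diameter / 2) ** 2  # rotor-swept area A in m^2
    theta = math.radians(yaw_error_deg)         # yaw error angle in radians
    return 0.5 * cp * rho * area * v_hub ** 3 * math.cos(theta) ** 3


# Hypothetical operating point: Cp = 0.45, rho = 1.225 kg/m^3, D = 100 m.
p_aligned = mechanical_power(0.45, 1.225, 100.0, 10.0)
p_yawed = mechanical_power(0.45, 1.225, 100.0, 10.0, yaw_error_deg=10.0)
p_double = mechanical_power(0.45, 1.225, 100.0, 20.0)

print(f"aligned: {p_aligned / 1e6:.3f} MW")
print(f"10 deg yaw error: {p_yawed / p_aligned:.3f} of aligned power")
print(f"doubling wind speed: factor {p_double / p_aligned:.1f}")
```

The cubic dependence on wind speed and the cos³θ term illustrate why wind speed, air density (here represented through ambient temperature), and the control-related variables that determine C_p are natural candidates for multivariate inputs.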

2.5. Evaluation Metrics

Given the complexity and nonlinear characteristics of the wind turbine multivariate power estimation model, this study adopted a multi-indicator evaluation method to ensure a comprehensive and accurate performance assessment. Specifically, the mean absolute error (MAE), root mean square error (RMSE), coefficient of determination (R²), and normalized mean absolute percentage error (NMAPE) were employed as evaluation metrics to evaluate the model performance. Except for R², where higher values indicate better performance, lower values are preferred for the other metrics. The expressions for these assessment metrics are shown below:

MAE = \frac{1}{N} \sum_{i = 1}^{N} |\hat{P}_{i} - P_{i}|,

(6)

RMSE = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} (\hat{P}_{i} - P_{i})^{2}},

(7)

R^{2} = 1 - \frac{\sum_{i = 1}^{N} (\hat{P}_{i} - P_{i})^{2}}{\sum_{i = 1}^{N} (\bar{P} - P_{i})^{2}},

(8)

NMAPE = \frac{1}{N} \sum_{i = 1}^{N} \left| \frac{\hat{P}_{i} - P_{i}}{P_{\max}} \right| \times 100\%,

(9)

where N is the number of samples, \hat{P}_{i} and P_{i} are the estimated and target values, respectively, while \bar{P} and P_{\max} are the mean and maximum target values, respectively.
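The four metrics in Equations (6)–(9) can be computed directly from paired lists of estimated and target power values, as in the following sketch. The function name `evaluation_metrics` and the four-sample toy data are hypothetical and serve only to show the arithmetic.

```python
import math


def evaluation_metrics(estimated, target):
    """Compute MAE, RMSE, R^2, and NMAPE (in percent) as in Equations (6)-(9)."""
    n = len(target)
    errors = [p_hat - p for p_hat, p in zip(estimated, target)]
    abs_sum = sum(abs(e) for e in errors)
    sq_sum = sum(e ** 2 for e in errors)
    mean_target = sum(target) / n
    total_sq = sum((mean_target - p) ** 2 for p in target)
    return {
        "MAE": abs_sum / n,
        "RMSE": math.sqrt(sq_sum / n),
        "R2": 1.0 - sq_sum / total_sq,
        "NMAPE": abs_sum / (n * max(target)) * 100.0,  # normalized by P_max
    }


target = [100.0, 200.0, 300.0, 400.0]      # toy target power values in kW
estimated = [110.0, 190.0, 310.0, 390.0]   # toy estimates, each off by 10 kW
metrics = evaluation_metrics(estimated, target)
print(metrics)
```

Because NMAPE divides by the maximum target value rather than by each individual target, it stays well defined for samples with near-zero power, which a conventional percentage error would not.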

3. Results and Discussion

To validate the effectiveness of the proposed method, two SCADA datasets, namely Dataset A and Dataset B, were utilized in the case studies. Dataset A was downloaded from the Kaggle website and comes from a wind turbine with a rated power of 1.7 MW [11]. Dataset B was obtained from an offshore wind farm in Guangdong, China, and comes from a wind turbine with a rated power of 5.5 MW. The 5.5 MW wind turbine is typically used in large-scale wind farms, especially offshore wind farms, due to its high single-unit capacity, which effectively reduces the unit power generation cost. The inclusion of both a smaller turbine and a larger offshore turbine allowed for the analysis of the method’s performance across different turbine sizes and operational environments, albeit within a limited scope. All datasets were sampled at 10 min intervals, with each variable value averaged over the sampling interval to smooth data fluctuations, reduce noise, and preserve the overall trend. Based on the generation principles of wind turbines and SCADA variables, wind speed, active power, ambient temperature, rotor speed, and pitch angle were selected as input variables. After removing missing values, Dataset A contained 40,640 samples, spanning from May 2019 to March 2020, while Dataset B contained 21,001 samples, spanning from November 2023 to March 2024. Using Dataset A as an example, the processing flow of PF and AD was analyzed. Furthermore, the performance of the multivariate power estimation model on Dataset A and Dataset B after data processing was compared based on the final estimation results.

3.1. Pre-Filtering Results

First, the distribution of SCADA data from the wind turbine in Dataset A was analyzed, as illustrated in Figure 3. Figure 3a shows the wind speed–power distribution, where scattered outliers are observed outside the main region, along with a cluster of outliers at the bottom. Figure 3b illustrates the wind speed and pitch angle distribution. The distribution analysis shows that most normal operating data points had pitch angles concentrated between 0° and 20°. As illustrated in Figure 2, pitch angles in region III exhibited a threshold behavior. Therefore, in Dataset A, data points with pitch angles exceeding 20° were typically associated with abnormal operating conditions, including non-generation states (such as shutdown or start-up phases) and power-limiting operation. Based on the wind speed–rotor speed distribution shown in Figure 3c, a rotor speed threshold of 9.1 rpm was selected as a filtering rule to improve data quality for power estimation. According to Figure 2, at low wind speeds, the rotor speed exhibited significant fluctuations, often corresponding to non-generation states such as shutdown, start-up, or idling. Filtering out data points with rotor speeds below 9.1 rpm effectively removed anomalies associated with these non-generation states.

Based on the above analysis, the PF rules applied in this study were defined as follows: (1) power generation below 1 kW when the wind speed exceeded the cut-in speed; (2) a pitch angle exceeding 20°; and (3) a rotor speed lower than 9.1 rpm. Any samples satisfying one or more of these conditions were filtered out. The thresholds for the pitch angle and rotor speed were determined empirically based on the distribution of operational data collected from the SCADA system, combined with control strategies, and were not optimized through a specific process. These thresholds were derived from the typical operational behavior of wind turbines across different wind speed ranges. Due to differences in turbine models and control strategies, the pitch angle and rotor speed thresholds of different wind turbines need to be determined based on actual operational data.
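The three PF rules for Dataset A can be expressed as a simple predicate over SCADA samples, as sketched below. The 1 kW, 20°, and 9.1 rpm thresholds are those stated above; the cut-in speed of 3.0 m/s, the field names, and the four example samples are hypothetical and would need to be replaced with values from the actual turbine and SCADA export.

```python
def passes_pre_filter(sample, cut_in=3.0, min_power_kw=1.0,
                      max_pitch_deg=20.0, min_rotor_rpm=9.1):
    """Return True if a 10 min SCADA sample survives all three PF rules.

    cut_in is an assumed value; the other defaults are the Dataset A thresholds.
    """
    rule1 = sample["wind_speed"] > cut_in and sample["power_kw"] < min_power_kw
    rule2 = sample["pitch_deg"] > max_pitch_deg   # feathered or power-limited
    rule3 = sample["rotor_rpm"] < min_rotor_rpm   # start-up, shutdown, or idling
    return not (rule1 or rule2 or rule3)


samples = [
    {"wind_speed": 8.0, "power_kw": 900.0, "pitch_deg": 0.5, "rotor_rpm": 14.0},
    {"wind_speed": 6.0, "power_kw": 0.0, "pitch_deg": 85.0, "rotor_rpm": 0.2},
    {"wind_speed": 4.0, "power_kw": 50.0, "pitch_deg": 1.0, "rotor_rpm": 7.0},
    {"wind_speed": 14.0, "power_kw": 1200.0, "pitch_deg": 25.0, "rotor_rpm": 15.0},
]
kept = [s for s in samples if passes_pre_filter(s)]
print(len(kept), "of", len(samples), "samples kept")
```

Keeping the thresholds as function arguments reflects the point made above: they are turbine-specific and should be re-derived from the data distribution of each turbine.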

The visualization results under different filtering strategies are shown in Figure 4. It can be observed that adding the rotor speed threshold as a filtering condition improved the filtering effect. However, some scattered points still exist outside the main region, indicating that AD was needed in addition to filtering strategies to further enhance data quality.

3.2. Anomaly Detection Results

To effectively identify outliers in the wind speed–power data, the following steps were implemented. First, we set a sliding window for wind speed with a size of 1 m/s and a step size of 0.5 m/s, ensuring that each wind speed–power sample was processed twice. Next, we calculated the Mahalanobis distance within each window and marked potential outliers based on the 3σ criterion. For samples not marked as outliers, we further applied the quartile method for secondary detection to identify additional potential outliers. After processing all windows, we only removed samples that were simultaneously marked as potential outliers by two windows to ensure the reliability of the detection results.
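The procedure above can be sketched as follows. This is one possible reading under stated assumptions rather than the exact implementation: the 3σ criterion and the quartile method are both applied to the Mahalanobis distances computed within each window, windows start at −0.5 m/s so that every sample falls into exactly two windows, and windows with fewer than `min_points` samples are skipped. All function names and the synthetic data are hypothetical.

```python
from statistics import mean, pstdev, quantiles


def mahalanobis_distances(points):
    """Mahalanobis distance of each (wind speed, power) point from the window mean."""
    xs, ys = zip(*points)
    mx, my, n = mean(xs), mean(ys), len(points)
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    det = sxx * syy - sxy ** 2
    if det <= 1e-12:
        return [0.0] * n  # degenerate covariance: nothing is flagged
    dists = []
    for x, y in points:
        dx, dy = x - mx, y - my
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        dists.append(max(d2, 0.0) ** 0.5)
    return dists


def flag_window(points):
    """Flag local indices by the 3-sigma rule, then the rest by the quartile rule."""
    d = mahalanobis_distances(points)
    mu, sigma = mean(d), pstdev(d)
    flagged = {i for i, di in enumerate(d) if abs(di - mu) > 3 * sigma}
    rest = [i for i in range(len(d)) if i not in flagged]
    if len(rest) >= 4:
        q1, _, q3 = quantiles([d[i] for i in rest], n=4, method="inclusive")
        lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
        flagged |= {i for i in rest if not lower <= d[i] <= upper}
    return flagged


def sliding_window_outliers(samples, size=1.0, step=0.5, min_points=5):
    """Return indices of samples flagged in both windows that contain them."""
    votes = [0] * len(samples)
    v_max = max(v for v, _ in samples)
    k = 0
    while (k - 1) * step <= v_max:
        start = (k - 1) * step
        idx = [i for i, (v, _) in enumerate(samples) if start <= v < start + size]
        if len(idx) >= min_points:
            for j in flag_window([samples[i] for i in idx]):
                votes[idx[j]] += 1
        k += 1
    return [i for i, count in enumerate(votes) if count >= 2]


# Synthetic data: a roughly linear wind speed-power band plus one zero-power point.
samples = [((80 + i) / 20, 100.0 * (80 + i) / 20 + ((i * 7) % 11 - 5) * 2.0)
           for i in range(60)]
samples.append((5.25, 0.0))  # index 60: injected anomaly
removed = sliding_window_outliers(samples)
print(removed)
```

With a window size of 1 m/s and a step of 0.5 m/s, the two windows containing a sample are always consecutive, so requiring two votes is equivalent to the dual-window confirmation described above.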

Taking Dataset A as an example, Figure 5 presents the AD results in two scenarios: (a) binning only, without a moving window; and (b) binning with a moving window. Both methods exhibited significant cleaning effectiveness through the identification of a large number of outliers. The proposed method, incorporating a sliding window, achieved a smoother data distribution, particularly in the region close to the rated wind speed, thereby addressing the discontinuities caused by the binning method at the edges of the bins.

Autoencoders, as a type of deep learning model, are also applicable to AD [26]. The autoencoder was trained using the dataset, with a compressed representation dimension set to 2. To prevent overfitting, the Tikhonov regularization with a factor of 0.001 was applied, and sparsity regularization with a constant of 4 and a proportion of 0.1 was used. The model was trained for a maximum of 200 epochs. The results of the autoencoder-based AD are shown in Figure 6. Although incorporating more input features improved the AD performance, some discrete points remained in the main distribution region, leading to a less optimal result compared to the proposed method. The AD performance did not show significant improvement even when the compression dimension was set to 3 or 4. This could be due to the fact that autoencoders are deep learning models that are well suited for complex or high-dimensional data. For the low-dimensional data used in this study, the autoencoder might have been too complex, leading to overfitting during the training process or an inability to effectively learn the intrinsic features of the data.

3.3. Multivariate Power Estimation Modeling Results

To comprehensively validate the superiority of the proposed algorithm, comparative experiments were conducted with different data processing methods. In this study, the dataset was divided into a training set (80%) and a test set (20%) to evaluate the effectiveness of the proposed filtering strategy. The hyperparameter settings for the BPNN and SVM are shown in Table 1. For the BPNN, a hidden layer size of 10, a learning rate of 0.001, and 20 training epochs were selected after considering the balance between computational efficiency and model performance. These values were chosen to ensure that the model was capable of converging quickly while avoiding overfitting or underfitting. For the SVM, the parameters such as the radial basis function kernel, a box constraint of 1000, and an epsilon value of 50 were selected based on common recommendations for regression tasks and were found to work well for the datasets used in this study. While hyperparameter tuning could potentially improve performance, the selected values provided a solid starting point for demonstrating the effectiveness of the proposed filtering strategy. During the analysis of the impact of different data processing methods on model performance, the hyperparameters were kept consistent [22]. A brief description of the methods is as follows:

Base-BPNN: the BPNN was applied directly to the raw data without any preprocessing.

AD-BPNN: the BPNN was applied to the dataset after anomaly detection.

PF-AD-BPNN: the BPNN was applied to the dataset after pre-filtering and anomaly detection.

Similarly, Base-SVM, AD-SVM, and PF-AD-SVM represent the application of the SVM in different data processing stages.
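The 80/20 split and the SVM's epsilon-insensitive loss (Table 1) can be illustrated with a minimal sketch. The function names, the random shuffling, and the seed are assumptions made for illustration; the study does not state how the split was drawn. Only the 20% test fraction and the epsilon value of 50 come from the text.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and hold out the first 20% as the test set.

    Random shuffling with a fixed seed is an assumption, not the paper's
    documented procedure.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def epsilon_insensitive_loss(actual, estimated, epsilon=50.0):
    """Zero inside the +/- epsilon tube around the target, linear outside it."""
    return max(0.0, abs(actual - estimated) - epsilon)


train, test = train_test_split(range(1000))
print(len(train), len(test))
print(epsilon_insensitive_loss(1000.0, 1030.0))  # error of 30 lies inside the tube
print(epsilon_insensitive_loss(1000.0, 1080.0))  # error of 80 exceeds the tube by 30
```

With an epsilon of 50, estimation errors smaller than 50 (in the units of active power) contribute nothing to the SVM loss, which is one reason the same setting can behave differently on datasets with different power ranges.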

The results of power estimation by the BPNN, including a comparison of actual and estimated values, are presented in Figure 7 and Figure 8. As can be seen from Figure 7a and Figure 8a, sparse outliers far from the main region could not be estimated effectively. A likely reason is that data points are scarce far from the main region, so the model cannot learn effective patterns from the limited samples; such outliers may also have characteristics that differ significantly from those of the main region’s data. Compared with the direct use of raw data, the AD-BPNN model exhibited a notable decrease in large errors. However, a noticeable deviation remained in the low-wind-speed region, corresponding to the sample points that required filtering. As can be seen from Figure 7c and Figure 8c, the PF-AD-BPNN model performed well in addressing both large errors and local deviations.

The results of power estimation by the SVM, including a comparison of actual and estimated values, are presented in Figure 9 and Figure 10. Although the multivariate estimation performance differed slightly compared to that of the BPNN, the qualitative impact of PF and AD on the model remained consistent. Through comparative experiments using both the BPNN and SVM methods, the effectiveness of PF and AD in improving power estimation performance across different models was validated.

The evaluation metrics of the different power estimation methods for Dataset A and Dataset B are detailed in Table 2. Figure 11 shows the PF results for Dataset B, where some anomalies were filtered out. However, some scattered points remained outside the main region, indicating that AD was required in conjunction with the filtering strategy to further improve data quality. The remaining processing steps after filtering were the same as those for Dataset A.

For both the BPNN and SVM models, the models with PF and AD outperformed the base models and the models with only AD across all evaluation metrics. This indicates that PF and AD significantly improved model performance. Models with only AD improved on the base models, but by less than the models combining PF and AD. For Dataset A, the PF-AD-BPNN reduced the MAE and RMSE by 22.67% and 19.31%, respectively, compared with the AD-BPNN. Similarly, the PF-AD-SVM reduced the MAE and RMSE by 6.41% and 6.13%, respectively, compared with the AD-SVM. For Dataset B, the PF-AD-BPNN reduced the MAE and RMSE by 16.24% and 23.59%, respectively, relative to the AD-BPNN, and the PF-AD-SVM reduced them by 10.33% and 10.94%, respectively, compared with the AD-SVM. This suggests that AD alone was beneficial but was more effective when combined with PF.
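The percentage reductions quoted above can be reproduced from Table 2 as (AD value − PF-AD value) / AD value × 100. The following sketch is only a check of that arithmetic; the dictionary layout and function name are illustrative choices, and the numbers are copied from Table 2.

```python
# (MAE, RMSE) pairs copied from Table 2 for the AD-only and PF-AD variants.
TABLE_2 = {
    ("A", "BPNN"): {"AD": (18.3868, 25.1369), "PF-AD": (14.2184, 20.2822)},
    ("A", "SVM"): {"AD": (19.9569, 25.1017), "PF-AD": (18.6784, 23.5619)},
    ("B", "BPNN"): {"AD": (46.6067, 77.1107), "PF-AD": (39.0366, 58.9191)},
    ("B", "SVM"): {"AD": (42.6191, 75.2729), "PF-AD": (38.2160, 67.0365)},
}


def percent_reduction(baseline, improved):
    """Relative reduction of an error metric, in percent of the baseline."""
    return (baseline - improved) / baseline * 100.0


# Map each (dataset, model) pair to its (MAE reduction, RMSE reduction).
reductions = {
    key: tuple(
        round(percent_reduction(base, new), 2)
        for base, new in zip(row["AD"], row["PF-AD"])
    )
    for key, row in TABLE_2.items()
}

for (dataset, model), (mae, rmse) in reductions.items():
    print(f"Dataset {dataset}, {model}: MAE reduced by {mae}%, RMSE by {rmse}%")
```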

Since wind farm operations require machine learning models that can be applied in real time, the computational costs associated with PF and AD were assessed. The computer used had an Intel Core i5-13400F 2.50 GHz processor and 32 GB of random-access memory. The time required to filter the data was on the order of milliseconds, which was negligible compared with the overall training time. The training times for the different strategies are provided in Table 3. The training time varied with the strategy used: the strategies that included AD took longer than the base models, whereas adding PF to AD changed the training time only slightly. This supports the view that the filtering step, while essential, had a minimal impact on overall processing time compared with the training phase. Overall, a training time on the order of seconds is acceptable for real-time applications in wind farm operations.
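To illustrate why a threshold-based filter is cheap, the sketch below times a rule-based pass over synthetic records. Every threshold, field name, and rule here is hypothetical and chosen only for illustration; they are not the filtering rules or values used in this study, and the measured time will depend on the machine.

```python
import time

# Hypothetical thresholds, for illustration only (not the study's values).
RATED_WIND_SPEED = 11.0       # m/s
MIN_ROTOR_SPEED = 8.0         # rpm
MAX_PITCH_BELOW_RATED = 3.0   # degrees


def is_normal_production(rec):
    """Return True when a record looks like normal power production."""
    if rec["power"] <= 0.0:
        return False  # not generating
    if rec["rotor_speed"] < MIN_ROTOR_SPEED:
        return False  # stopped, starting up, or shutting down
    if rec["wind_speed"] < RATED_WIND_SPEED and rec["pitch"] > MAX_PITCH_BELOW_RATED:
        return False  # blades pitched out below rated wind speed
    return True


def pre_filter(records):
    return [r for r in records if is_normal_production(r)]


def timed_pre_filter(records):
    """Run the filter and return the kept records and elapsed milliseconds."""
    start = time.perf_counter()
    kept = pre_filter(records)
    return kept, (time.perf_counter() - start) * 1000.0


sample = [
    {"wind_speed": 6.0, "pitch": 0.5, "rotor_speed": 11.0, "power": 400.0},    # normal
    {"wind_speed": 6.0, "pitch": 45.0, "rotor_speed": 2.0, "power": 0.0},      # stopped
    {"wind_speed": 4.0, "pitch": 20.0, "rotor_speed": 9.0, "power": 30.0},     # starting up
    {"wind_speed": 14.0, "pitch": 12.0, "rotor_speed": 14.0, "power": 2000.0}, # above rated
]
kept, elapsed_ms = timed_pre_filter(sample * 25_000)
print(f"kept {len(kept)} of {len(sample) * 25_000} records in {elapsed_ms:.1f} ms")
```

Because each record is tested with a few comparisons, the cost grows linearly with the number of samples and needs no model fitting.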

3.4. Discussion

This paper proposes a filtering method that combines pitch angle and rotor speed thresholds based on the operational state parameters of wind turbines. The visualization results demonstrate better cleaning effectiveness than filtering on pitch angle alone. The filtering process primarily removed sample points where the wind turbine was not generating power, was shut down, or was starting up. Samples from the start-up and shutdown processes, where power generation was incomplete, could lead to power estimation errors, as reflected in the low-wind-speed region of the AD-BPNN and AD-SVM results shown in Figure 7b and Figure 9b. Additionally, the use of a sliding window during the AD process avoided the discontinuities at bin boundaries that arise in the binning method, while preserving more normal samples.
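The sliding-window idea can be sketched as follows, using the 1 m/s window and 0.5 m/s step given in the caption of Figure 5. The rest is assumed for illustration: an IQR rule with a multiplier of 1.5 inside each window, a minimum of four samples per window, and a confirmation rule under which a point is removed only if every window covering it flags it. This is not necessarily the exact procedure used in this study.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Lower and upper fences of the IQR rule (k = 1.5 is an assumption)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def sliding_window_outliers(wind_speed, power, width=1.0, step=0.5, k=1.5):
    """Return indices flagged as outliers in every window that covers them."""
    covered = [0] * len(power)
    flagged = [0] * len(power)
    n_windows = int(max(wind_speed) / step) + 1
    for w in range(n_windows):
        start = w * step
        idx = [i for i, v in enumerate(wind_speed) if start <= v < start + width]
        if len(idx) < 4:  # too few samples for meaningful quartiles
            continue
        low, high = iqr_bounds([power[i] for i in idx], k)
        for i in idx:
            covered[i] += 1
            if not low <= power[i] <= high:
                flagged[i] += 1
    # Secondary confirmation: keep points that any covering window accepts.
    return [i for i in range(len(power)) if covered[i] and flagged[i] == covered[i]]


wind_speed = [5.1, 5.2, 5.3, 5.4, 5.6, 5.7, 5.8, 5.9, 5.25, 5.75]
power = [500, 510, 520, 555, 550, 560, 570, 580, 50, 565]
print(sliding_window_outliers(wind_speed, power))
```

In this toy data, the sample at 5.4 m/s sits near the upper edge of the 4.5 to 5.5 m/s window and is flagged there, but the overlapping 5.0 to 6.0 m/s window accepts it, so it is preserved. The sample with a power of 50 is flagged by both windows and is removed.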

The WTPC offers advantages over multivariate methods in terms of simplicity, interpretability, and computational efficiency. By focusing on the relationship between wind speed and power output, the WTPC provides a clear representation of turbine performance. However, it may not fully capture the complexity of turbine behavior under varying environmental conditions, where multivariate methods could provide more comprehensive insights.

Based on the PF-AD-BPNN method, a univariate WTPC model was constructed with wind speed as the input variable and active power as the output target. The evaluation metrics are presented in Table 4. Compared with the WTPC (Table 4), the multivariate PF-AD-BPNN model (Table 2) achieved significant improvements across all evaluation metrics. Specifically, it reduced the MAE by 70.57% and 81.04% and the RMSE by 70.77% and 80.65% for Dataset A and Dataset B, respectively. Additionally, R² improved by 1.24% and 2.62%, while the NMAPE decreased by 70.57% and 81.04% for the two datasets. These improvements can facilitate the timely detection of abnormal conditions in the real-time monitoring of wind turbine power generation performance, such as a reduction in the power coefficient caused by blade icing or damage, thus offering valuable guidance for turbine maintenance and operation.
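These figures follow from Table 4 (univariate WTPC) and the PF-AD-BPNN rows of Table 2 when each is read as a signed relative change, (multivariate − WTPC) / WTPC × 100, so that error metrics come out negative and R² positive. The sketch below is only a check of that reading; the variable and function names are illustrative.

```python
# Metrics copied from Table 4 (WTPC) and the PF-AD-BPNN rows of Table 2.
WTPC = {
    "A": {"MAE": 48.3084, "RMSE": 69.3790, "R2": 0.9866, "NMAPE": 2.7678},
    "B": {"MAE": 205.8494, "RMSE": 304.5174, "R2": 0.9735, "NMAPE": 3.6782},
}
PF_AD_BPNN = {
    "A": {"MAE": 14.2184, "RMSE": 20.2822, "R2": 0.9988, "NMAPE": 0.8146},
    "B": {"MAE": 39.0366, "RMSE": 58.9191, "R2": 0.9990, "NMAPE": 0.6974},
}


def relative_change(reference, value):
    """Signed change of `value` relative to `reference`, in percent."""
    return (value - reference) / reference * 100.0


changes = {
    dataset: {
        metric: round(relative_change(WTPC[dataset][metric], PF_AD_BPNN[dataset][metric]), 2)
        for metric in WTPC[dataset]
    }
    for dataset in WTPC
}

for dataset, row in changes.items():
    print(f"Dataset {dataset}: {row}")
```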

In summary, multivariate power estimation serves as an important complement to the WTPC-based method for evaluating wind turbine power generation performance. It reflects turbine performance under different operating conditions more comprehensively and offers more accurate insights for turbine operation and maintenance. PF and AD are valuable in this context because they improve the reliability and accuracy of power estimation.

4. Conclusions

This study investigated PF methods for SCADA data in the machine learning-based multivariate power estimation of wind turbines to improve model accuracy. Filtering rules were established from the relationship between wind speed and two wind turbine operational parameters, namely pitch angle and rotor speed. Compared with using only pitch angle, this approach filtered sample points from the start-up and shutdown states more effectively, thereby improving the accuracy of power estimation. During the AD process, a sliding window was used for the secondary confirmation of potential outliers, avoiding the discontinuities between bins inherent in the binning method and making the detection results more robust. The benefit of the proposed method for the multivariate power estimation model was validated using two datasets and two machine learning methods. The results indicate that the estimation model trained after filtering and AD achieved the best evaluation metrics among the compared strategies, underscoring the value of the proposed preprocessing techniques. Future work will extend this approach to other turbine types and integrate it into real-time monitoring systems. It will also incorporate datasets from diverse wind turbine manufacturers and geographical areas to further improve the generalizability and robustness of the findings. Given that ensemble models have demonstrated strong performance in wind energy applications [27], we will also explore their use to improve prediction accuracy and model robustness, comparing them with the current methods to assess potential improvements in the proposed filtering strategy.

Author Contributions

Conceptualization, B.W. and B.Z.; methodology, B.W.; software, B.W.; validation, B.W., M.Z. and H.L.; formal analysis, M.Z.; investigation, H.L.; resources, B.Z.; data curation, D.Z.; writing—original draft preparation, B.W.; writing—review and editing, D.Z.; visualization, M.Z.; supervision, B.Z.; project administration, B.Z.; funding acquisition, B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded in part by the Innovative System Construction and Application Demonstration of Carbon Footprint and Labeling for Major Products of Jiangsu Province (BT2024013), the Scientific and Technological Innovation Project of Carbon Emission Peak and Carbon Neutrality of Jiangsu Province under Grant No. BE2023854, and the National Natural Science Foundation of China under Grant Nos. 50976024 and 50906013.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AD	Anomaly detection
BPNN	Backpropagation neural network
CNN	Convolutional neural network
ELM	Extreme learning machine
IEC	International Electrotechnical Commission
IQR	Interquartile range
MAE	Mean absolute error
MPPT	Maximum-power point tracking
NMAPE	Normalized mean absolute percentage error
PF	Pre-filtering
RMSE	Root mean square error
R²	Coefficient of determination
SCADA	Supervisory control and data acquisition
SVM	Support vector machine
WTPC	Wind turbine power curve

References

1. Mathew, M.D. Nuclear energy: A pathway towards mitigation of global warming. Prog. Nucl. Energy 2022, 143, 104080.
2. Hassan, Q.; Viktor, P.; Al-Musawi, T.J.; Ali, B.M.; Algburi, S.; Alzoubi, H.M.; Al-Jiboory, A.K.; Sameen, A.Z.; Salman, H.M.; Jaszczur, M. The renewable energy role in the global energy Transformations. Renew. Energy Focus 2024, 48, 100545.
3. Kassa, B.Y.; Baheta, A.T.; Beyene, A. Current trends and innovations in enhancing the aerodynamic performance of small-scale, horizontal axis wind turbines: A review. ASME Open J. Eng. 2024, 3, 031001.
4. Voigt, C.C.; Bernard, E.; Huang, J.C.-C.; Frick, W.F.; Kerbiriou, C.; MacEwan, K.; Mathews, F.; Rodríguez-Durán, A.; Scholz, C.; Webala, P.W.; et al. Toward solving the global green–green dilemma between wind energy production and bat conservation. BioScience 2024, 74, 240–252.
5. Qiao, Y.; Han, S.; Zhang, Y.; Liu, Y.; Yan, J. A multivariable wind turbine power curve modeling method considering segment control differences and short-time self-dependence. Renew. Energy 2024, 222, 119894.
6. Mushtaq, K.; Zou, R.; Waris, A.; Yang, K.; Wang, J.; Iqbal, J.; Jameel, M. Multivariate wind power curve modeling using multivariate adaptive regression splines and regression trees. PLoS ONE 2023, 18, e0290316.
7. Cascianelli, S.; Astolfi, D.; Castellani, F.; Cucchiara, R.; Fravolini, M.L. Wind turbine power curve monitoring based on environmental and operational data. IEEE Trans. Ind. Inform. 2021, 18, 5209–5218.
8. Yan, J.; Zhang, H.; Liu, Y.; Han, S.; Li, L. Uncertainty estimation for wind energy conversion by probabilistic wind turbine power curve modelling. Appl. Energy 2019, 239, 1356–1370.
9. Ouyang, T.; Kusiak, A.; He, Y. Modeling wind-turbine power curve: A data partitioning and mining approach. Renew. Energy 2017, 102, 1–8.
10. Pei, S.; Li, Y. Wind turbine power curve modeling with a hybrid machine learning technique. Appl. Sci. 2019, 9, 4930.
11. Wang, Y.; Duan, X.; Zou, R.; Zhang, F.; Li, Y.; Hu, Q. A novel data-driven deep learning approach for wind turbine power curve modeling. Energy 2023, 270, 126908.
12. Mehrjoo, M.; Jozani, M.J.; Pawlak, M. Toward hybrid approaches for wind turbine power curve modeling with balanced loss functions and local weighting schemes. Energy 2021, 218, 119478.
13. Lydia, M.; Kumar GE, P. Machine learning applications in wind turbine generating systems. Mater. Today Proc. 2021, 45, 6411–6414.
14. Astolfi, D.; Castellani, F.; Natili, F. Wind turbine multivariate power modeling techniques for control and monitoring purposes. J. Dyn. Syst. Meas. Control 2021, 143, 034501.
15. Astolfi, D.; Castellani, F.; Lombardi, A.; Terzi, L. Multivariate SCADA data analysis methods for real-world wind turbine power curve monitoring. Energies 2021, 14, 1105.
16. Pandit, R.K.; Infield, D.; Carroll, J. Incorporating air density into a Gaussian process wind turbine power curve model for improving fitting accuracy. Wind Energy 2019, 22, 302–315.
17. Astolfi, D. Perspectives on SCADA data analysis methods for multivariate wind turbine power curve modeling. Machines 2021, 9, 100.
18. Manobel, B.; Sehnke, F.; Lazzús, J.A.; Salfate, I.; Felder, M.; Montecinos, S. Wind turbine power curve modeling based on Gaussian processes and artificial neural networks. Renew. Energy 2018, 125, 1015–1020.
19. Schlechtingen, M.; Santos, I.F.; Achiche, S. Using data-mining approaches for wind turbine power curve monitoring: A comparative study. IEEE Trans. Sustain. Energy 2013, 4, 671–679.
20. Lee, G.; Ding, Y.; Genton, M.G.; Xie, L. Power curve estimation with multivariate environmental factors for inland and offshore wind farms. J. Am. Stat. Assoc. 2015, 110, 56–67.
21. Wang, Y.; Hu, Q.; Li, L.; Foley, A.M.; Srinivasan, D. Approaches to wind power curve modeling: A review and discussion. Renew. Sustain. Energy Rev. 2019, 116, 109422.
22. Morrison, R.; Liu, X.; Lin, Z. Anomaly detection in wind turbine SCADA data for power curve cleaning. Renew. Energy 2022, 184, 473–486.
23. Zou, M.; Zhou, B.; Wang, B.; Liu, Q.; Zhao, R.; Dai, M.; Rao, Z.; Wang, Y. Offshore wind turbine wind speed power anomaly data cleaning method based on RANSAC regression and DBSCAN clustering. In Proceedings of the 4th International Conference on Smart Grid and Renewable Energy (SGRE), Doha, Qatar, 8–10 January 2024; pp. 1–8.
24. IEC TR 61400-12-4:2020; IEC, TC 88-Wind Energy Generation. IEC: Geneva, Switzerland, 2020.
25. Saint-Drenan, Y.-M.; Besseau, R.; Jansen, M.; Staffell, I.; Troccoli, A.; Dubus, L.; Schmidt, J.; Gruber, K.; Simões, S.G.; Heier, S. A parametric model for wind turbine power curves incorporating environmental conditions. Renew. Energy 2020, 157, 754–768.
26. Renström, N.; Bangalore, P.; Highcock, E. System-wide anomaly detection in wind turbines using deep autoencoders. Renew. Energy 2020, 157, 647–659.
27. Mansour, R.; Osama, S.; Ahmed, H.; Nasser, M.; Mahmoud, N.; Elkodama, A.; Ismaiel, A. Parametric Analysis Towards the Design of Micro-Scale Wind Turbines: A Machine Learning Approach. Appl. Syst. Innov. 2024, 7, 129.

Figure 1. Flowchart of proposed methodology.

Figure 2. The operating regions of a typical variable-speed, variable-pitch wind turbine and the evolution of the power coefficient, rotor speed, and pitch angle with wind speed.

Figure 3. Dataset A: (a) wind speed–power distribution; (b) wind speed–pitch angle distribution; (c) wind speed–rotor speed distribution.

Figure 4. Impact of filtering rules on relationship between wind speed, pitch angle, and active power in Dataset A: (a) after applying rules (1) and (2); and (b) after applying rules (1), (2), and (3).

Figure 5. AD results in two scenarios: (a) binning only with a fixed interval of 1 m/s, without a moving window; and (b) binning with a moving window of 1 m/s in size and a step size of 0.5 m/s.

Figure 6. Autoencoder-based AD results: (a) autoencoder with wind speed and active power; and (b) autoencoder with wind speed, active power, ambient temperature, rotor speed, and pitch angle.

Figure 7. Modeling results. (a) Base-BPNN. (b) AD-BPNN. (c) PF-AD-BPNN.

Figure 8. Comparison of actual and estimated values for wind turbine power output. (a) Base-BPNN. (b) AD-BPNN. (c) PF-AD-BPNN.

Figure 9. Modeling results. (a) Base-SVM. (b) AD-SVM. (c) PF-AD-SVM.

Figure 10. Comparison of actual and estimated values for wind turbine power output. (a) Base-SVM. (b) AD-SVM. (c) PF-AD-SVM.

Figure 11. The PF performance diagram of Dataset B.

Table 1. Main parameter settings of BPNN and SVM.

BPNN		SVM
Parameter	Setting	Parameter	Setting
Number of neurons	10	Kernel function	Radial basis function
Learning rate	0.001	Standardize	True
Epochs	20	Box constraint	1000
Optimizer	Levenberg–Marquardt	Epsilon	50
Loss function	Mean square error	Loss function	Epsilon-insensitive loss

Table 2. Performance of multivariate power estimation models using different methods.

Model	Dataset A				Dataset B
Model	MAE	RMSE	R²	NMAPE	MAE	RMSE	R²	NMAPE
Base-BPNN	19.0547	32.1235	0.9974	1.0711	69.2410	138.2843	0.9956	0.9514
AD-BPNN	18.3868	25.1369	0.9984	1.0566	46.6067	77.1107	0.9983	0.8329
PF-AD-BPNN	14.2184	20.2822	0.9988	0.8146	39.0366	58.9191	0.9990	0.6974
Base-SVM	21.5519	33.5307	0.9972	1.2369	53.2242	125.4331	0.9956	0.9514
AD-SVM	19.9569	25.1017	0.9984	1.1434	42.6191	75.2729	0.9984	0.7617
PF-AD-SVM	18.6784	23.5619	0.9985	1.0735	38.2160	67.0365	0.9987	0.6830

Table 3. The computational time for training under different strategies.

Model	Time of Dataset A (s)	Time of Dataset B (s)
Base-BPNN	1.6686	1.2748
AD-BPNN	2.1623	1.7533
PF-AD-BPNN	2.1526	1.7947
Base-SVM	2.1460	1.5900
AD-SVM	2.5179	1.8513
PF-AD-SVM	2.3927	1.9288

Table 4. Performance of WTPC models using different datasets.

Dataset	MAE	RMSE	R²	NMAPE
Dataset A	48.3084	69.3790	0.9866	2.7678
Dataset B	205.8494	304.5174	0.9735	3.6782

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
