1. Introduction
Wind energy has gained global momentum as a response to pressing environmental concerns and fuel shortages [1], and its share of the new energy sector has grown significantly [2]. Offshore wind power, a crucial subfield of wind energy development, has emerged as a new trend in the global wind power industry owing to its vast resources, minimal environmental impact, high efficiency, large individual unit capacity, and proximity to load centers [3]. However, the unpredictability of wind speed and other meteorological factors leads to erratic power output from wind turbines [4], causing considerable fluctuations in the power grid when wind energy is integrated in large volumes. Hence, precise and robust short-term wind power forecasting is vital for large-scale grid integration [5]. Although wind speed, direction, temperature, humidity, barometric pressure, and altitude all cause significant variation in wind power, offshore wind remains more stable and less turbulent than onshore wind because it is not influenced by topography, vegetation, or buildings [6]. Nonetheless, the short spacing between units, the long reach of rotor wakes, and the complexity of regional numerical models at sea make accurate wind power prediction challenging. Wind power forecasting methods can be categorized into ultra-short-term, short-term, medium-term, and long-term forecasts according to the forecasting horizon [7]. Ultra-short-term forecasts predict wind power up to 4 h ahead with a resolution of 15 min or less, short-term forecasts extend up to 72 h, medium-term forecasts span three days to a week, and long-term forecasts exceed a week. The latter two are generally used in wind farm site selection and power plant development planning and are not the focus of this study [7].
Forecasting methods can be divided into physical modeling, statistical learning, machine learning, and combined physical–statistical methods [8]. Physical-model-based methods are computationally demanding and challenging to implement, whereas statistical learning methods, including the autoregressive integrated moving average (ARIMA) [9], are easy to use and quick to compute. However, they tend to be less accurate when dealing with highly volatile wind power forecasts. With the evolution of artificial intelligence, machine learning has been applied to wind power forecasting, with methods including support vector machines (SVM) [10], extreme gradient boosting tree models (XGBoost) [11], traditional neural network models [12,13], recurrent neural network (RNN) models [14,15,16], and transformer architecture models [17,18]. However, due to the complex volatility of wind power and the lack of clear time-series characteristics, single-model predictions often fall short in accuracy. Combined prediction methods, which leverage the strengths of various models, have been shown to significantly improve forecast accuracy compared to single models [19]. For instance, Wu et al. [20] first used LSTM neural networks to forecast wind speed and other meteorological data, then applied similar time-series matching methods to filter out the main factors for modeling, training, and prediction in LightGBM. Similarly, Cao et al. [21] used a convolutional neural network (CNN) to extract the spatial and temporal correlation vectors from different stations and used LSTM to extract the temporal relationship between historical time points for multi-step wind power forecasting.
However, conventional methods struggle to handle the high noise, high volatility, and non-stationarity of the original time-series data. To address this, some researchers have applied signal-processing algorithms such as the empirical wavelet transform (EWT) [22], empirical mode decomposition (EMD) [23], and complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) [24] to wind power forecasting. For example, Abedinia et al. [25] proposed an improved empirical mode decomposition method (IEMD) to decompose wind speed and fed the decomposed signals into a hybrid prediction model based on BaNN and K-means clustering, using the intelligent optimization method ChB-SSO for the automatic tuning of BaNN parameters. Similarly, Li et al. [26] used the historical wind speed and key meteorological factors decomposed by variational mode decomposition (VMD) and weighted permutation entropy (WPE) as inputs and produced the forecasts with a CNN-LSTM model.
Recent research further explores advanced techniques for monitoring and diagnosing mechanical health across diverse applications [27,28,29,30]. Sharma et al. [31] use swarm decomposition and permutation entropy for bearing defect detection, while Vashishtha et al. [32] explore a Levy flight-based genetic algorithm for Pelton wheel health assessment. Chauhan et al. [33] propose a corrected conditional entropy measure combined with a multi-parent evolutionary algorithm for bearing diagnostics. Meanwhile, Chauhan [34] leverages an adaptive wavelet mutation strategy within an evolutionary algorithm for bearing defect identification. However, these methods have limitations, such as poor time resolution accuracy or short prediction time steps. To overcome these issues, this study proposes a composite deep-learning multistep forecasting method based on multi-timescale inputs (MSI) for short-term wind power forecasting. This includes VMD signal deconstruction, multi-timescale inputs, and composite prediction models based on gated recurrent units (GRU), CNN, and improved transformers. The key contributions of this study are as follows:
First, a trigonometric function is used to standardize the unique meteorological feature of wind direction, after which it is further standardized to the interval [0,1] using the maximum–minimum normalization method. This aligns with the actual position distribution of wind direction;
The Variational Mode Decomposition (VMD) method is used to decompose the primary time-series information in the original wind speed signal, yielding three components that reflect the overall trend, primary fluctuation trend, and sub-fluctuation trend of the original wind speed signal, respectively;
A multi-timescale input method is proposed to improve the power forecast effect of the model by capturing the long- and short-term time-series relationships of different input data scales;
The GRU neural network captures the time-series relationship of the short-timescale input data (one day and one week), while the improved transformer time-series forecast model processes the long-timescale (one-month) input data. Finally, the CNN's strong ability to process local information is used to extract the features of each time point of each branch output individually, and the multistep forecast results are output through the fully connected neural network (FCN).
2. Data Sources and Processing
2.1. Data Sources
This study analyzed data acquired from wind turbines at an offshore wind farm in Guangdong. The wind farm, situated roughly 55 km off the coast in waters 41–46 m deep, has a total installed capacity of 500 MW. It consists of 37 wind turbines with a capacity of 6.8 MW each and 30 turbines of 8.3 MW each. The facility also includes a 220 kV offshore booster station and 35 kV submarine transmission cables. We randomly selected historical data from four units within the wind farm, two rated at 6.8 MW and two at 8.3 MW. This historical dataset comprises measured output power and related data taken from the SCADA and NWP systems. All the data have a time resolution of 10 min, amounting to 144 samples per day. Each unit's data span one year and one month. We used the first year's data for training and validation, while the subsequent month's data served to test the model's precision.
Table 1 and Figure 1 provide a detailed overview of the raw power data for the four wind turbines under consideration.
As shown in Table 1, there are 67,996 records for each unit, sampled at a 10-min interval. As shown in Figure 1, the datasets for units 3 and 12 run from October 2022 to November 2023, and those for units 14 and 61 run from September 2022 to October 2023. The first 52,560 data points in each dataset constitute the training set, and the remaining 5436 are the test set. The complete dataset, comprising 105,408 10-min samples, was partitioned into training, validation, and test sets as follows:
Training set: 60,000 samples (56.9% of the total data);
Validation set: 10,000 samples (9.5% of the total data);
Test set: 35,408 samples (33.6% of the total data).
The training set was used for model training and parameter estimation. The validation set, held out from the training process, was used for model selection and hyperparameter tuning. Monitoring the model's performance on the validation set during training made it possible to identify the best-performing model configuration and prevent overfitting. Finally, the test set, a separate, unseen subset of the data, was used for the final model evaluation and reporting of performance metrics. This approach ensured an unbiased assessment of the model's generalization capabilities.
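The chronological split described above can be sketched as follows. This is a minimal illustration, assuming the samples are already sorted by time; the function name `chronological_split` and its parameters are hypothetical and not taken from the study.

```python
def chronological_split(samples, samples_per_day=144, train_days=365):
    """Split a time-ordered sequence into a first-year block and the remainder.

    Assumption: `samples` is sorted by timestamp and has a 10-min resolution,
    so one day holds 144 samples.
    """
    cut = samples_per_day * train_days
    if len(samples) <= cut:
        raise ValueError("not enough samples to leave a test set")
    # No shuffling: the test block must lie strictly after the training block.
    return samples[:cut], samples[cut:]


if __name__ == "__main__":
    first_year, remainder = chronological_split(list(range(60000)))
    print(len(first_year), len(remainder))
```

With 144 samples per day, one year corresponds to 365 × 144 = 52,560 samples, which matches the training-set size quoted above.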
2.2. Wind Turbine Operation and Control
Wind turbine generators (WTGs) are influenced by many factors, broadly classified into internal and external factors. Internal factors primarily include the blade’s shape, size, and material, along with the transmission efficiency of the drive train, which were established during the WTG design phase. Over time, as the wind turbine accumulates operational hours and undergoes multiple maintenance procedures and overhauls, these internal factors experience minor variations. However, these changes are not easily measurable with specific indicators. Nonetheless, the benefits of deep learning can be harnessed to adapt to the operational status of WTGs by employing numerous parameters in a deep neural network.
On the other hand, external factors impacting wind turbine power generation encompass wind speed, wind direction, air density, and other meteorological variables. The fundamental operation of WTGs involves the transformation of the kinetic energy of the airflow into the mechanical energy of the wind wheel’s rotation. This mechanical energy is then transmitted to the generator through the wind turbine drive system, which converts it into electrical energy. According to Betz’s theory, the wind energy absorbed by the wind turbine can be expressed as:
$$P = \frac{1}{2}\rho \pi R^{2} v^{3} C_{p}(\lambda, \beta) \tag{1}$$
where $\rho$ represents the air density, $R$ is the radius of the turbine impeller, $v$ is the ambient wind speed, $C_p$ is the power coefficient of the wind turbine, $\lambda$ is the tip speed ratio, and $\beta$ is the pitch angle of the turbine blades. The characteristic $C_p$ curve of a wind turbine depends on its design parameters, which are supplied directly by the manufacturer, and the maximum wind energy coefficient is $C_{p,\max} = 16/27 \approx 0.593$. According to the Betz limit and the wind power formula, wind speed is the main factor affecting the power of WTGs. Figure 2 shows the standard power curves of the 6.8 MW and 8.3 MW WTGs of this offshore wind farm, with a cut-in wind speed of 3 m/s, a cut-out wind speed of 25 m/s, rated powers of 6800 kW and 8300 kW, respectively, and a rated wind speed of 11.1 m/s. When the external wind speed exceeds 3 m/s, the WTG starts up, and the wind power is transmitted through the impeller, main shaft, gearbox, etc., to the generator, which rotates and generates electricity. When the wind speed exceeds 11.1 m/s, the generated power reaches the rated value. At this point, the pitch system must be engaged to limit the amount of wind energy the turbine captures and to prevent the rotor from overspeeding, which could lead to runaway accidents. When wind speeds surpass 25 m/s, the yaw and pitch systems are activated simultaneously and the turbine is shut down.
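The relationship between wind speed and power described above can be illustrated with a short sketch. The aerodynamic formula follows the expression given in this section; the piecewise power curve is only an idealization, and its cubic ramp between cut-in and rated speed is an assumption (the real curve is the manufacturer's, shown in Figure 2). The function names are hypothetical.

```python
import math


def aerodynamic_power(rho, radius, v, cp):
    """Power captured by the rotor in watts: P = 0.5 * rho * pi * R^2 * v^3 * Cp."""
    return 0.5 * rho * math.pi * radius ** 2 * v ** 3 * cp


def idealized_power_curve(v, rated_kw, cut_in=3.0, rated_v=11.1, cut_out=25.0):
    """Idealized WTG output in kW for wind speed v in m/s.

    Assumption: output ramps with v^3 between cut-in and rated speed; this is
    a textbook simplification, not the curve used in the study.
    """
    if v < cut_in or v > cut_out:
        return 0.0  # below cut-in the rotor is idle; above cut-out it is shut down
    if v >= rated_v:
        return rated_kw  # pitch control holds the output at rated power
    frac = (v ** 3 - cut_in ** 3) / (rated_v ** 3 - cut_in ** 3)
    return rated_kw * frac


if __name__ == "__main__":
    for speed in (2.0, 6.0, 11.1, 20.0, 26.0):
        print(speed, round(idealized_power_curve(speed, 6800.0), 1))
```

Because the captured power scales with the cube of the wind speed, doubling the wind speed below rated speed raises the available power eightfold, which is why small wind speed errors dominate the forecast error.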
2.3. Feature Selection
When the wind direction is stable, the rotor of the wind turbine can be held at the optimal angle, allowing the wind energy to be used more fully, so the turbine generates more power and is more efficient. In contrast, when the wind direction varies significantly, the turbine must constantly adjust its orientation to follow the wind, which reduces the efficiency and power output of the generator. Although advanced offshore wind turbines are now equipped with an automatic yaw system, which adjusts the nacelle's direction to track the incoming wind in real time, the system requires a certain reaction time, and yaw misalignment errors are inevitable. Wind direction time-series features help the deep learning model capture the distribution of this yaw misalignment error and compensate for it. Meanwhile, wind direction and wind speed are closely related. Figure 3 shows the historical wind rose diagram of the wind farm, in which the distribution of wind speed values across the wind speed intervals in each direction shows a high degree of similarity. The scale of the wind speed distribution is given in counts over the 13 months, and the color scale of the wind speed contours is in m/s. For a given wind farm, the terrain and geomorphological features in each direction are specific and strongly influence the change in wind speed. Therefore, the wind direction time-series features also help the model capture the pattern of wind speed changes and improve the accuracy of the wind power prediction results.
As can be seen from the wind power formula, the air density is also closely related to the available wind energy. According to the IEC 61400-12-1 standard [35], the actual air density is calculated as:
$$\rho = \frac{1}{T}\left[\frac{B}{R_0} - \varphi P_w\left(\frac{1}{R_0} - \frac{1}{R_w}\right)\right] \tag{2}$$
where $\rho$ is the density of air, kg/m³; $B$ is the atmospheric pressure, Pa; $T$ is the absolute temperature, K; $\varphi$ is the relative humidity, taken as a fraction between 0 and 1; $R_0$ is the gas constant of dry air, 287.05 J/(kg·K); $R_w$ is the gas constant of water vapor, 461.5 J/(kg·K); and $P_w$ is the vapor pressure, Pa.
Therefore, to obtain the actual air density, it is only necessary to obtain the relative humidity, atmospheric pressure, and atmospheric temperature. In summary, the main external factors affecting the power generation of wind turbines are meteorological factors, such as wind speed, wind direction, relative humidity, atmospheric pressure, and atmospheric temperature, which were selected as the initial input parameters of the model.
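A minimal sketch of the air density calculation is given below. The gas constants are those stated in this section. The vapor pressure approximation `P_w = 0.0000205 * exp(0.0631846 * T)` is the one commonly quoted together with this formula; the section itself does not state it, so it should be treated as an assumption, as should the function names.

```python
import math

R0 = 287.05  # gas constant of dry air, J/(kg*K)
RW = 461.5   # gas constant of water vapor, J/(kg*K)


def vapor_pressure(T):
    """Vapor pressure in Pa for absolute temperature T in K.

    Assumption: exponential approximation commonly paired with the
    humidity-corrected air density formula.
    """
    return 0.0000205 * math.exp(0.0631846 * T)


def air_density(B, T, phi):
    """Air density in kg/m^3.

    B   : atmospheric pressure, Pa
    T   : absolute temperature, K
    phi : relative humidity as a fraction in [0, 1]
    """
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must be a fraction between 0 and 1")
    return (1.0 / T) * (B / R0 - phi * vapor_pressure(T) * (1.0 / R0 - 1.0 / RW))


if __name__ == "__main__":
    print(round(air_density(101325.0, 288.15, 0.0), 4))  # dry air
    print(round(air_density(101325.0, 288.15, 0.8), 4))  # humid air is slightly lighter
```

The humidity term is always subtracted, so moist air is less dense than dry air at the same pressure and temperature, which slightly lowers the available wind power.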
Table 2 summarizes the features selected for the analysis, along with their descriptions and sources.
2.4. Data Preprocessing and Gap Handling
Upon closer inspection of the raw data, we identified several missing or interpolated data gaps. These gaps could arise for various reasons, such as scheduled maintenance, unscheduled downtime, or temporary disconnection of the turbines or the entire wind farm from the grid. Specifically, we observed a significant data gap for turbines #3 and #12 just before July 2023. After consulting with the wind farm operators, we learned that this gap was due to a scheduled maintenance period during which these turbines were taken offline for routine inspections and servicing. To handle such data gaps, we explored two approaches:
Gap Interpolation: One approach was to interpolate the missing data points based on the available data before and after the gap. While this approach can provide a continuous data stream, it may introduce biases or inaccuracies, especially for larger gaps or periods with rapidly changing wind conditions;
Gap Removal: Alternatively, we opted to remove the data gaps entirely from the dataset, treating them as missing values. This approach ensures that our analysis and modeling efforts are based solely on actual recorded data, avoiding any potential biases introduced by interpolation.
After careful consideration, we chose the gap removal approach for our analysis. We believe that this conservative approach preserves the integrity of the data and provides a more accurate representation of the turbines’ performance under the observed conditions. It is important to note that the presence of data gaps and the chosen gap-handling strategy may impact the overall dataset size and the distribution of samples across different operating conditions. We have considered these factors during our data partitioning and model training processes to ensure robust and reliable results.
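One way to implement gap removal without letting an input window straddle a gap is to cut the series into contiguous segments. The sketch below is an assumption about how this could be done, not the procedure used in the study; the record layout `(timestamp, value)` and the function name `split_contiguous` are hypothetical.

```python
from datetime import datetime, timedelta


def split_contiguous(records, step=timedelta(minutes=10)):
    """Drop missing records and split the rest into gap-free segments.

    records: list of (timestamp, value) tuples sorted by time, where a
    missing measurement is represented by value None.
    """
    segments, current = [], []
    prev_ts = None
    for ts, value in records:
        if value is None:
            # A missing value ends the current segment and is itself discarded.
            if current:
                segments.append(current)
                current = []
            prev_ts = None
            continue
        if prev_ts is not None and ts - prev_ts != step:
            # A jump in the timestamps also marks a gap.
            segments.append(current)
            current = []
        current.append((ts, value))
        prev_ts = ts
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    t0 = datetime(2023, 6, 30, 0, 0)
    data = [(t0 + timedelta(minutes=10 * i), float(i)) for i in range(4)]
    data[2] = (data[2][0], None)
    print([len(seg) for seg in split_contiguous(data)])
```

Sliding windows would then be drawn from each segment separately, so no training sample mixes data from before and after a maintenance outage.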
2.5. Deconstruction of Wind Speed Signals
The volatility of offshore wind signals is complex because of several natural factors. First, the shape and size of land features and their relative position to the sea can change the direction and strength of the wind. The fluctuation of ocean waves and their interaction with the wind can also affect the wind speed signal. Furthermore, meteorological factors, such as temperature, humidity, and pressure in the atmosphere, and the influence of atmospheric currents can have a complex effect on wind speed signals. These signals are not simply superimposed but are intertwined and interfere with each other, making it challenging to extract adequate timing information from the original wind speed signal. To effectively extract the timing features in the wind speed signal, it is necessary to preprocess the wind speed signal with feature deconstruction to decompose the primary signals and remove the related noise.
2.5.1. Principles of VMD
This study employs the variational mode decomposition (VMD) [36] method, which effectively processes non-stationary and nonlinear mixed time-frequency signals. The VMD method decomposes the original one-dimensional wind speed signal $x(t)$ into $k$ finite-bandwidth intrinsic mode functions (IMFs), allowing us to extract the signal's frequency-domain features. The constrained variational problem for the VMD method is given by:
$$\min_{\{u_i\},\{\omega_i\}} \left\{ \sum_{i=1}^{k} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_i(t) \right] e^{-j\omega_i t} \right\|_2^2 \right\} \quad \text{s.t.} \quad \sum_{i=1}^{k} u_i(t) = x(t) \tag{3}$$
where $k$ is the number of modes to be decomposed, $\{u_i\}$ denotes the $k$ intrinsic mode components, $\{\omega_i\}$ is the center frequency of each component, $\delta(t)$ is the Dirac delta function, $*$ is the convolution operation, $t$ is time, and $\partial_t$ denotes the partial derivative with respect to time $t$. Equation (3) aims to decompose the input signal $x(t)$ into $k$ intrinsic mode functions, subject to the constraint that the sum of these functions equals the original signal. To solve Equation (3), the Lagrange multiplier $\lambda$ is introduced, converting the constrained variational problem into an unconstrained one. This leads to the augmented Lagrangian:
$$L(\{u_i\},\{\omega_i\},\lambda) = \alpha \sum_{i=1}^{k} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_i(t) \right] e^{-j\omega_i t} \right\|_2^2 + \left\| x(t) - \sum_{i=1}^{k} u_i(t) \right\|_2^2 + \left\langle \lambda(t),\, x(t) - \sum_{i=1}^{k} u_i(t) \right\rangle \tag{4}$$
where $\alpha$ is the quadratic penalty factor, which is used to reduce the interference of Gaussian noise. The final solution reduces the noise and volatility of the original signal, yielding IMF components with a higher signal-to-noise ratio within the filtered bandwidths. The VMD defines each component as an amplitude-modulated, frequency-modulated (AM-FM) function, which can be expressed as:
$$u_i(t) = A_i(t)\cos\left(\varphi_i(t)\right) \tag{5}$$
where $A_i(t)$ is the instantaneous amplitude of the component and $\varphi_i(t)$ is the instantaneous phase of the component.
2.5.2. Wind Speed Signal Decomposition
The penalty factor $\alpha$ and the number of decomposition layers $k$ in the VMD algorithm must be selected manually. The penalty factor $\alpha$ is set to 1.5–2.0 times the number of sampling points, and the number of decomposition layers $k$ is determined according to the actual decomposition effect [37]. Define $\hat{x}(t)$ as the wind speed signal reconstructed from the decomposition components, where:
$$\hat{x}(t) = \sum_{i=1}^{k} u_i(t) \tag{6}$$
Define the root mean square error between the original wind speed and reconstructed signals as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(x(t) - \hat{x}(t)\right)^{2}} \tag{7}$$
Define the Pearson correlation coefficient between the original wind speed and reconstructed signals as:
$$r = \frac{\mathrm{cov}(x, \hat{x})}{\sigma_{x}\,\sigma_{\hat{x}}} \tag{8}$$
where $\mathrm{cov}(\cdot,\cdot)$ denotes the covariance, $\sigma_{x}$ and $\sigma_{\hat{x}}$ are the standard deviations of the original and reconstructed signals, $N$ is the number of sampling points, and $r \in [-1, 1]$. For these reconstruction performance indicators, the smaller the RMSE and the closer the correlation coefficient is to 1, the better the signal reconstructed from the decomposed components overlaps with the original signal.
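The two reconstruction indicators can be computed as in the sketch below, which assumes the IMF components are already available as equal-length lists (the decomposition itself is not reproduced here). The function names are illustrative only.

```python
import math


def reconstruct(components):
    """Sum the IMF components sample by sample to rebuild the signal."""
    return [sum(values) for values in zip(*components)]


def rmse(x, x_hat):
    """Root mean square error between the original and reconstructed signals."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x))


def pearson(x, x_hat):
    """Pearson correlation coefficient: cov(x, x_hat) / (sigma_x * sigma_x_hat)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_h = sum(x_hat) / n
    cov = sum((a - mean_x) * (b - mean_h) for a, b in zip(x, x_hat)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_h = math.sqrt(sum((b - mean_h) ** 2 for b in x_hat) / n)
    return cov / (std_x * std_h)


if __name__ == "__main__":
    original = [5.0, 6.5, 7.0, 6.0, 5.5]
    imfs = [[5.9, 6.0, 6.1, 6.0, 5.9], [-0.8, 0.4, 0.8, 0.1, -0.5]]
    rebuilt = reconstruct(imfs)
    print(rmse(original, rebuilt), pearson(original, rebuilt))
```

Evaluating both indicators for several candidate values of the number of decomposition layers gives the kind of comparison that supports choosing the smallest value at which the correlation stops improving.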
Taking the data of Unit 12 as an example, the parameters of the decomposition process are shown in Table 3. The correlation coefficient nearly reaches its maximum value when $k$ is 3. With further increases in $k$, the RMSE decreases, but only slowly, while the correlation coefficient remains relatively stable and the computation time increases greatly. The larger the number of decomposition layers $k$, the better the signal overlap before and after reconstruction; however, as $k$ increases, the decomposition is prone to introducing noise. As shown in Figure 4 and Figure 5, the VMD 4-layer decomposition introduces a small-amplitude noise component on top of the VMD 3-layer decomposition, which tends to obscure the timing features. Therefore, the optimal number of decomposition layers is chosen as $k = 3$, and each component is shown in Figure 4. Among them, IMF1 reflects the overall trend of the original wind speed signal and is the trend component; IMF2 is the main fluctuation component of the wind speed signal; IMF3 is the secondary fluctuation component of the wind speed signal.
2.6. Feature Standardization
The VMD method deconstructs the wind speed signal to obtain three components representing different aspects of the wind speed, such as periodicity, trend volatility, etc. Each component has unique features and contributions, which can provide us with more detailed and comprehensive wind speed information. To further extract the time-period features, this paper extracts four key temporal features, namely month, day, hour, and minute, from the date. These features are crucial for understanding the temporal properties of wind speed. For example, a month may affect the seasonal wind speed variation, while day, hour, and minute provide information on a finer timescale. Together with the power itself, 12-dimensional features were obtained, and the data needed to be normalized in the next step. First, since the angle of wind direction ranges from 0 to 360°, and the due north direction is 0°, when the wind angle tends to be close to 0° and close to 360°, the numerical representation results of the wind position should be similar. However, if the standard normalization or the maximum and minimum normalization are used to deal with it, the difference in the results obtained is enormous. Therefore, the trigonometric normalization method is used first for this particular feature of the wind direction angle, as shown in the following equation.
where
is the standardized value for wind direction,
is the wind angle, ranging from 0 to 360. Afterward, for all 12-dimensional features, the data are normalized using the maximum–minimum normalization method, as shown below:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{10}$$
where $x$ is the original data, $x_{\min}$ is the minimum value, and $x_{\max}$ is the maximum value.
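The two-step standardization can be sketched as follows. The exact trigonometric function used for the wind direction is not recoverable from this section, so the cosine encoding below is an assumption chosen only because it maps 0° and 360° to the same value; the function names are hypothetical.

```python
import math


def encode_wind_direction(theta_deg):
    """Map a wind angle in degrees to [-1, 1] so that 0 and 360 coincide.

    Assumption: a plain cosine is used here for illustration; the study may
    use a different trigonometric function.
    """
    return math.cos(math.radians(theta_deg))


def min_max_normalize(values):
    """Scale a feature column to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    angles = [0.0, 1.0, 180.0, 359.0, 360.0]
    encoded = [encode_wind_direction(a) for a in angles]
    print(min_max_normalize(encoded))
```

A single cosine cannot distinguish an angle from its mirror image about north (for example 90° and 270°); a sine and cosine pair would remove that ambiguity at the cost of one extra feature dimension.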
3. Forecasting Model Structure
The forecasting model was developed using a data-driven machine learning approach. The overall methodology involved several key steps: (1) data preprocessing and feature engineering, (2) variational mode decomposition (VMD) for extracting intrinsic mode functions (IMFs) from the wind speed and power generation time series, (3) feature selection to identify the most relevant IMFs and meteorological variables, and (4) training and evaluation of various machine learning models (e.g., random forest, gradient boosting, neural networks) for wind power forecasting. The following subsections provide detailed descriptions of each step in the proposed approach.
3.1. GRU Network
RNNs can learn the dependencies between earlier and later data when dealing with continuous time series, which gives them certain advantages in prediction tasks. However, owing to structural limitations, RNNs suffer from vanishing gradients during backpropagation, which makes them unsuitable for long time series. The GRU [38] is an improved RNN structure that mitigates the vanishing- and exploding-gradient problems by introducing structures such as a memory unit and a gating mechanism, enabling the network to handle sequence data better.
As shown in Figure 6, the GRU neural network processes sequential information by introducing a gating mechanism. Its mathematical principle is based on the network's activation function and weight matrices, which control the information flow by calculating the values of the gating units:
$$z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right)$$
$$r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right)$$
$$\tilde{h}_t = \tanh\left(W_h \cdot [r_t \odot h_{t-1}, x_t]\right)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
where $z_t$, $r_t$, $\tilde{h}_t$, and $h_t$ correspond to the update gate, reset gate, candidate state at the current moment, and hidden state at the current moment, respectively. $x_t$ is the input variable of the current moment, and $r_t$ is the current state of the reset gate, which controls how much of the output of the previous moment $h_{t-1}$ is passed on. $W_z$, $W_r$, and $W_h$ are the training parameters inside the model, and $\sigma$ is the nonlinear activation function. Here, the ReLU function is used. The GRU neural network achieves layer-by-layer transmission and information extraction of sequence data by controlling the forgetting and retaining of information through two gating units, the reset gate and the update gate, respectively. This mechanism can effectively capture the sequence's long-term dependencies and improve the model's performance.
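A single GRU time step can be written out explicitly as in the sketch below. It follows the widely used formulation with sigmoid gates and a tanh candidate state acting on the concatenation of the previous hidden state and the current input; the text above states that ReLU is used as the activation, so the choice of activations here is an assumption, as are the function and parameter names. Bias terms are omitted for brevity.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU update; each weight matrix has shape (hidden, hidden + input)."""
    concat = h_prev + x_t                      # [h_{t-1}, x_t]
    z = [sigmoid(a) for a in matvec(W_z, concat)]   # update gate
    r = [sigmoid(a) for a in matvec(W_r, concat)]   # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x_t  # [r_t * h_{t-1}, x_t]
    h_cand = [math.tanh(a) for a in matvec(W_h, gated)]   # candidate state
    # Blend the old state and the candidate according to the update gate.
    return [(1.0 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


if __name__ == "__main__":
    hidden, inputs = 2, 1
    W = [[0.1] * (hidden + inputs) for _ in range(hidden)]
    h = [0.0, 0.0]
    for x in ([0.5], [0.7], [0.2]):
        h = gru_step(x, h, W, W, W)
    print(h)
```

When the update gate is close to zero the previous hidden state is carried forward almost unchanged, which is the mechanism that lets the network retain information over many steps.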
3.2. Improved Transformer Timing Forecast Model
The Transformer model was first proposed by the Google machine translation team and applied to natural language processing (NLP) tasks with good results. It uses a self-attention mechanism and position encoding (PE) to capture long-distance dependencies in the input sequence. Despite the similarities between time series prediction and NLP tasks, there are some key differences between them, and some modifications to the model structure are required to apply the Transformer model to the task of WTG power forecast. The input sequences in NLP tasks are mostly words or symbols, which must be mapped into a fixed-size numeric vector representation by word vector coding before the computer can process them. In the WTG power forecast task, each time point of the resulting time series data is a 12×1 numerical feature vector that can be used directly as input to the model. Because Transformer does not use the structure of the recurrent neural network but uses global information, it is not able to utilize the knowledge of temporal features before and after the data, and it needs to embed the positional relationship between the data in the input data, as shown in the following equation:
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
where $pos$ denotes the position of the data, $d$ represents the dimension of the PE, $2i$ indicates the even dimensions, and $2i+1$ denotes the odd dimensions (i.e., $2i \le d$, $2i+1 \le d$). Computing PE in this way allows the model to exploit relative positions: for a fixed offset $k$, $PE(pos+k)$ can be obtained from $PE(pos)$. The encoded data $X$ is then passed into the multi-head attention module (as shown in Figure 7), which combines multiple self-attention mechanisms in parallel. Three linear transformation matrices $W_Q$, $W_K$, and $W_V$ are used in the attention mechanism to calculate $Q$, $K$, and $V$, where $Q$ is the query, $K$ is the key, and $V$ is the value, after which the attention is calculated as shown in the following equation:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $d_k$ is the dimension of the key vectors. The outputs of each attention head are concatenated by columns and passed through a fully connected linear layer to obtain the output $Z$. The matrix $Z$ must have the same dimensions as its input matrix $X$.
Figure 7.
Multi-head Attention.
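The positional encoding and the attention computation described above can be sketched for a single head as follows. This is an illustrative pure-Python version under the standard sinusoidal and scaled dot-product definitions; it omits the learned projection matrices and the concatenation of multiple heads, and the function names are hypothetical.

```python
import math


def positional_encoding(seq_len, d):
    """Sinusoidal PE: sine on even dimensions, cosine on odd dimensions."""
    pe = [[0.0] * d for _ in range(seq_len)]
    for pos in range(seq_len):
        for j in range(d):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (j // 2)) / d))
            pe[pos][j] = math.sin(angle) if j % 2 == 0 else math.cos(angle)
    return pe


def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


if __name__ == "__main__":
    pe = positional_encoding(seq_len=4, d=12)  # 12 matches the feature dimension
    x = [[0.1 * (t + 1)] * 12 for t in range(4)]
    encoded = [[a + b for a, b in zip(row, p)] for row, p in zip(x, pe)]
    print(len(attention(encoded, encoded, encoded)), "time steps attended")
```

Because every output row is a weighted average over all time steps, the attention layer sees the whole input window at once, which is why the positional encoding is needed to restore the order information.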
The matrix Z is added to its input matrix X through a residual connection, followed by layer normalization. Layer normalization gives the inputs of each layer of neurons the same mean and variance, which speeds up convergence. This yields the output matrix of the encoder module. The traditional Transformer decoder uses a large number of matrix operations and attention mechanisms, so it consumes substantial computational resources. It can also only generate the output sequence one element at a time, which makes output inefficient. When dealing with large-scale datasets or long sequences, it may require high-performance GPUs, TPUs, or other computational resources, which increases the cost of model training and inference and makes it difficult to apply in practical industrial scenarios.
Here, the decoding layer uses a CNN and an FCN instead (as shown in Figure 8). The CNN performs local feature extraction on the matrix output by the encoder and captures local attention information through convolution operations. The FCN then captures the global temporal information and integrates the local features into a complete sequence representation. The decoder structure using CNN and FCN can output the prediction results for a long sequence in a single pass, avoiding the limitation of traditional decoders that generate output sequences one element at a time. In addition, this structure reduces the computational complexity of the decoding process and improves decoding efficiency.
3.3. Multi-Timescale Input Forecast Models
Meteorological factors such as wind speed, wind direction, temperature, humidity, and barometric pressure all affect the power generated by wind turbines. However, these factors are highly uncertain and difficult to predict accurately. To capture the time-series characteristics of wind energy more accurately, a wind power forecasting method is proposed that uses a multi-timescale input model with GRU and an improved transformer for time series forecasting (MSI-GTTS). The specific model structure is shown in Figure 9. The core idea of this method is to divide the processed data into different timescales; here, three timescales are used: days, weeks, and months.
Specifically, the data from one day, one week, and one month before the moment t are each taken as inputs to the model for predicting the power generated by the wind turbine after the moment t. This paper uses two different models to deal with these timescales: the GRU neural network and the improved Transformer time series prediction model. The one-day and one-week data are processed with a GRU neural network, which is well suited to time series data and can effectively capture its time-series features. The one-month data, however, may be difficult for the GRU to process effectively because of the long timescale. Therefore, the improved Transformer model is adopted to process the one-month input data and capture its long-term dependencies. Finally, the outputs of these three timescale branches are concatenated, and the final forecasting results are obtained through one-dimensional convolution and an FCN. The MSI-GTTS model thus consists of three branch channels and one aggregated output channel, where each of the first two branches consists of a GRU recurrent neural network and a fully connected layer, and the third branch is the improved Transformer temporal forecasting model. The encoding layer in the third branch consists of three stacked encoders, each built around the multi-head attention mechanism.
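The construction of the three input windows can be sketched as below. The sketch assumes 10-min data (144 samples per day) and a 30-day month, and it takes the raw windows without any downsampling; the section does not state these details, so they are assumptions, as are the function name and the returned dictionary keys.

```python
SAMPLES_PER_DAY = 144  # 10-min resolution


def multi_timescale_windows(series, t, horizon, days_in_month=30):
    """Return the day, week, and month windows ending just before index t.

    series  : time-ordered list of feature vectors (or scalars)
    t       : index of the first sample to be forecast
    horizon : number of future steps to predict
    """
    day = SAMPLES_PER_DAY
    week = 7 * SAMPLES_PER_DAY
    month = days_in_month * SAMPLES_PER_DAY
    if t < month or t + horizon > len(series):
        raise ValueError("not enough history or future samples around t")
    return {
        "day": series[t - day:t],      # short-timescale branch (GRU)
        "week": series[t - week:t],    # short-timescale branch (GRU)
        "month": series[t - month:t],  # long-timescale branch (Transformer)
        "target": series[t:t + horizon],
    }


if __name__ == "__main__":
    data = list(range(5000))
    windows = multi_timescale_windows(data, t=4320, horizon=6)
    print({name: len(values) for name, values in windows.items()})
```

All three windows end at the same instant, so the branches see the same recent history at different depths rather than disjoint periods.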
4. Experimental Analysis and Verification
4.1. Evaluation Indicators
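In this paper, the mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), and coefficient of determination (R2) were used as the evaluation indexes of the model performance:
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}$$
Here, $y_i$ denotes the measured power value, $\bar{y}$ denotes the mean of the measured power values, $\hat{y}_i$ denotes the predicted power value, and $N$ is the number of samples. The smaller the MAE, MSE, and RMSE are, the better the prediction. R2 assesses the degree of conformity between the forecast values and the actual values; its value normally lies in [0,1], and the closer R2 is to 1, the better the fit between the forecast and actual values.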
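For reference, the four indicators can be computed in their standard textbook form as in the following Python sketch. The function name `evaluate` and the sample values are illustrative assumptions, not data from the experiments.

```python
import math
from typing import Dict, Sequence


def evaluate(measured: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, MSE, RMSE, and R2 for paired measured/predicted values."""
    if len(measured) != len(predicted) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(measured)
    errors = [m - p for m, p in zip(measured, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_measured = sum(measured) / n
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    if ss_tot == 0.0:
        raise ValueError("R2 is undefined when all measured values are equal")
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Made-up normalized power values, for illustration only.
scores = evaluate([0.2, 0.4, 0.6, 0.8], [0.25, 0.35, 0.65, 0.75])
print(scores)
```

With these sample values every absolute error is 0.05, so the MAE and RMSE are both about 0.05 and R2 is about 0.95.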
4.2. Analysis of the Model Training Process
The network parameters are updated by backpropagation gradient descent, the optimizer is Adam, the learning rate is set to 0.0001, the loss function is MSE, and the current optimal model is automatically saved when the loss function values of both the training set and the validation set reach the minimum. The number of iterations is set to 1000. Taking the data of Unit 12 as an example, the loss function curves of each model’s training set and validation set are shown in
Figure 10 and
Figure 11.
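The model-saving rule described above can be read in more than one way. The following Python sketch shows one possible interpretation, in which a checkpoint is written only at epochs where both the training loss and the validation loss fall below their previous minima. The function name `checkpoint_epochs` and the loss values are hypothetical, and the optimizer and network are deliberately left out.

```python
from typing import List, Sequence


def checkpoint_epochs(train_losses: Sequence[float], val_losses: Sequence[float]) -> List[int]:
    """Return the epochs at which both losses reach a new minimum together."""
    best_train = float("inf")
    best_val = float("inf")
    saved: List[int] = []
    for epoch, (train_loss, val_loss) in enumerate(zip(train_losses, val_losses)):
        # Save only when both curves improve on everything seen so far.
        if train_loss < best_train and val_loss < best_val:
            saved.append(epoch)
        best_train = min(best_train, train_loss)
        best_val = min(best_val, val_loss)
    return saved


# Made-up loss curves, for illustration only.
train_curve = [0.9, 0.5, 0.4, 0.3, 0.2]
val_curve = [1.0, 0.6, 0.7, 0.5, 0.55]
saved = checkpoint_epochs(train_curve, val_curve)
print(saved)  # → [0, 1, 3]
```

Under this reading, the last saved epoch is the model that would be kept; epochs where the training loss keeps falling while the validation loss rises are skipped, which guards against keeping an overfitted model.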
The loss values in the curves are small in magnitude because the data are normalized. As can be seen from
Figure 10 and
Figure 11, the proposed VMD-MSI-GTTS method has the fastest convergence speed and the smallest training loss, and the difference between the loss values of the training and validation sets is very small, which indicates that there is no overfitting problem. The loss function value of the GRU model alone is much larger than that of the other three methods, which use VMD signal deconstruction. This indicates that the VMD method can effectively extract the primary temporal information from the raw wind speed signals, which in turn significantly improves the accuracy of the WTG power prediction results. Among the three models that used VMD signal deconstruction, the loss function value of the VMD-GRU model with single-timescale input was the largest, and the loss function values of the two models that used multi-timescale input (VMD-MSI-GRU and VMD-MSI-GTTS) were smaller. Moreover, during the training process, the loss function curves of the two models with multi-timescale inputs change more stably and converge more strongly than that of the model with single-timescale input. This indicates that the multi-timescale input models can effectively capture the long- and short-term temporal relationships at different input data scales and improve the models' prediction performance. Of the two models that used the multi-timescale input method, the VMD-MSI-GTTS model has smaller training and validation losses than the VMD-MSI-GRU model. This indicates that the improved Transformer temporal forecasting model can capture the temporal relationships of the long-timescale input data more effectively than the GRU, ultimately allowing the model to achieve better forecast results.
4.3. Comparison of Module Analysis Experiment Results
The module analysis experiment results for the test set data of the four units are shown in
Table 4, and the data visualization results are shown in
Figure 12.
From
Figure 12, it can be seen that the VMD-MSI-GTTS model proposed in this paper achieves the best evaluation indexes in the power forecast experiments of all four units. Compared with the GRU, VMD-GRU, and VMD-MSI-GRU methods, the VMD-MSI-GTTS model has the smallest MAE, MSE, and RMSE for the power forecast results of the four units, and its coefficient of determination R2 is the closest to 1. The following can be derived from
Table 4:
Compared to the GRU model, the VMD-GRU model reduces the MAE, MSE, and RMSE values of the prediction results for the four datasets by an average of 0.0253, 0.0081, and 0.0322, respectively, and improves the R2 value by an average of 0.112.
Compared to the VMD-GRU model, the MAE, MSE, and RMSE values of the prediction results of the VMD-MSI-GRU model for the four datasets were reduced by an average of 0.0069, 0.0024, and 0.0095, respectively, and the R2 value was improved by an average of 0.033.
Compared to the VMD-MSI-GRU model, the MAE, MSE, and RMSE values of the prediction results of the VMD-MSI-GTTS model for the four datasets were reduced by an average of 0.0042, 0.0009, and 0.0095, respectively, and the R2 values were improved by an average of 0.0063. From the experimental results, it can be concluded that the VMD signal deconstruction method, the multi-timescale input structure, and the improved Transformer timing prediction method in the proposed VMD-MSI-GTTS model can effectively improve the accuracy of the power forecast results for WTGs.
In summary, three effects can be identified. First, the VMD method effectively extracts the leading time series information from the raw wind speed signal, which significantly improves the accuracy of the WTG power prediction results. Second, the multi-timescale input method effectively captures the long-term and short-term time series relationships at different input data scales, which improves the power prediction performance of the model. Third, compared with the GRU, the improved Transformer time series forecast model is better at capturing the time series relationships of long-timescale input data, which ultimately leads to better prediction results.
After back-normalizing the forecast results, the distribution of errors in the results of the module analysis experiments for the four datasets is shown in
Figure 13. The VMD-MSI-GTTS model has the highest prediction accuracy in each of the four datasets, with the narrowest distribution of errors between the predicted values and the true values.
4.4. Comparison of Different Decomposition Methods
To verify the effectiveness of the VMD signal decomposition method, the VMD method is compared with the EEMD, EWT, and TVF-EMD methods. The comparative experimental results of different decomposition methods are shown in
Table 5 and
Figure 14. From
Table 5, it can be seen that in the experimental results of the four units, compared with the EEMD-MSI-GTTS, EWT-MSI-GTTS, and TVFEMD-MSI-GTTS models, the prediction results of the VMD-MSI-GTTS model proposed in this paper have the smallest errors and the highest prediction accuracy.
As shown in
Figure 14, the absolute error distribution range of the prediction results of the proposed VMD-MSI-GTTS model is also the smallest in all four datasets. The prediction performance of the TVFEMD-MSI-GTTS model on the datasets of Units 3 and 14 is similar to that of the VMD-MSI-GTTS model, but its performance on the datasets of Units 12 and 61 is poor.
The prediction performance of the EWT-MSI-GTTS model is good on the Unit 61 dataset but poor on the other three datasets. The experimental results show that, compared with the EEMD, EWT, and TVF-EMD decomposition methods, the VMD method improves the short-term multistep power prediction accuracy of offshore wind turbines more effectively and more stably, and it generalizes better across units.
4.5. Model Comparison Experimental Analysis and Validation
To further validate the proposed VMD-MSI-GTTS model on the offshore wind turbine power forecasting problem, comparison experiments were conducted against the LSTM [
39], CNN-LSTM [
40], LSTM-Attention [
41], and Informer [
42,
43] models. The results of the comparison experiments are shown in
Table 6, and the visualization results are shown in
Figure 15. From
Figure 15, it can be seen that the VMD-MSI-GTTS model proposed in this paper exhibits excellent forecast performance on all four WTGs, significantly outperforming the four comparison models. Specifically, the VMD-MSI-GTTS model achieved the best evaluation metrics on all datasets, whereas the performance of the other models varied across the unit datasets. For example, on the Unit 3 dataset, the Informer model achieved the smallest MAE, MSE, and RMSE values among the four comparison models; on the Unit 61 dataset, however, it had the highest MAE, MSE, and RMSE values of the four.
The VMD-MSI-GTTS model proposed in this paper demonstrates excellent power forecast performance in all four datasets compared to the LSTM, CNN-LSTM, LSTM-Attention, and Informer models. The mean values of MAE, MSE, and RMSE for the forecast results of the four datasets are 0.0522, 0.0084, and 0.0907, respectively, and the mean value of R2 is 0.899. Compared with the other four methods, the VMD-MSI-GTTS model significantly improves the accuracy of the power forecast of offshore wind turbines and provides a useful reference for the field of offshore wind turbine power forecasting. After back-normalizing the prediction results of the five models on the four datasets, the distribution of errors in the prediction results of each model on different datasets is shown in
Figure 16. Some of the forecasting result curves for the comparison experiments on the four datasets are shown in
Figure 17. Compared with the LSTM, CNN-LSTM, LSTM-Attention, and Informer models, the VMD-MSI-GTTS model has the smallest error distribution range between the forecast values and the true values in all four datasets, with the highest forecast accuracy and the strongest reliability, which indicates that it is more robust in forecasting the power of different WTGs. This robustness may be attributed to the sensitivity of the VMD-MSI-GTTS model to multi-timescale information and its effective capture of temporal structure. With an MAE of 0.05 on the normalized power values (scaled between 0 and 1), the model can predict the 10-min-ahead wind power with an average percentage error of approximately 5% of the nominal power capacity.
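The link between a normalized error and a physical error can be made concrete with the inverse of min–max normalization. The Python sketch below assumes, for illustration only, that the power was scaled with a lower bound of 0 kW and an upper bound equal to a hypothetical rated capacity of 5500 kW; neither bound is taken from the dataset used in this paper, and the function names are likewise hypothetical.

```python
def denormalize_value(value_norm: float, p_min: float, p_max: float) -> float:
    """Invert min-max scaling for a power value: x = x_norm * (max - min) + min."""
    return value_norm * (p_max - p_min) + p_min


def denormalize_error(error_norm: float, p_min: float, p_max: float) -> float:
    """Invert min-max scaling for an error: the offset cancels, only the range remains."""
    return error_norm * (p_max - p_min)


# Hypothetical scaling bounds in kW (assumptions of this sketch).
P_MIN_KW = 0.0
P_MAX_KW = 5500.0

mae_kw = denormalize_error(0.05, P_MIN_KW, P_MAX_KW)
share_of_capacity = mae_kw / P_MAX_KW
print(round(mae_kw, 1), round(share_of_capacity, 3))
```

Under these assumed bounds, a normalized MAE of 0.05 corresponds to roughly 275 kW, or about 5% of the assumed capacity. The percentage reading holds exactly only when the normalization range spans zero to the rated capacity.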
5. Conclusions
To improve the accuracy of ultra-short-term multistep power forecasts for offshore wind turbines, an ultra-short-term wind power forecast method based on VMD signal deconstruction, multi-timescale inputs, and an improved Transformer time-series forecast model is proposed. Experiments are conducted with the historical data of four WTGs in an offshore wind farm in Guangdong, and the output power of the WTGs over the next four hours is forecasted with a time resolution of 10 min. The experimental results show that the proposed VMD-MSI-GTTS model has the highest accuracy and the strongest stability in the ultra-short-term multistep power forecast of offshore wind turbines. The specific conclusions are as follows:
The distinctive meteorological feature of wind direction is first standardized by a trigonometric function and then scaled by maximum–minimum normalization, which maps the data to the [0,1] interval while staying consistent with the actual angular distribution of the wind direction.
The VMD method can effectively extract the primary timing information from the raw wind speed signals, which in turn significantly improves the accuracy of the wind turbine power prediction results.
The multi-timescale input method can effectively capture the long- and short-term temporal relationships of input data at different scales and improve the power forecast of the model.
Compared to the GRU, the improved Transformer temporal forecast model is more capable of capturing the temporal relationships of input data over long-timescales, ultimately allowing the model to achieve better forecast results.
The proposed VMD-MSI-GTTS model also exhibits superior prediction performance compared to LSTM, CNN-LSTM, LSTM-Attention, and Informer temporal prediction methods, significantly outperforming these comparison models.
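The wind-direction preprocessing named in the first conclusion above can be sketched as follows. This is a minimal Python illustration under two assumptions that the text does not confirm: both the sine and the cosine of the angle are kept, and the min–max step uses the fixed bounds -1 and 1 of those functions. The function name `encode_wind_direction` is hypothetical.

```python
import math
from typing import Tuple


def encode_wind_direction(degrees: float) -> Tuple[float, float]:
    """Map a wind direction in degrees to two features in [0, 1]."""
    radians = math.radians(degrees % 360.0)
    sin_part = math.sin(radians)
    cos_part = math.cos(radians)
    # Min-max scaling with fixed bounds -1 and 1: (x - min) / (max - min).
    return (sin_part + 1.0) / 2.0, (cos_part + 1.0) / 2.0


def feature_distance(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Euclidean distance between two encoded directions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


near_north_a = encode_wind_direction(359.0)
near_north_b = encode_wind_direction(1.0)
south = encode_wind_direction(180.0)

# 359 and 1 degrees are 2 degrees apart physically and stay close after encoding,
# whereas scaling the raw angle directly would place them at opposite ends of [0, 1].
print(feature_distance(near_north_a, near_north_b) < feature_distance(near_north_a, south))
# → True
```

The design point is that a trigonometric encoding preserves the circular nature of wind direction, so that directions close to each other on the compass remain close in the model input.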
In the future, the effect of the timescale size of each input on the model’s forecast accuracy can be explored in detail, and the impact of seasonal periodicity on the forecast model can be considered to build a longer-time wind turbine power forecast model. Furthermore, optimization algorithms can be incorporated to optimize each hybrid deep learning model parameter to improve the model’s forecast performance.