1. Introduction
Waves are a fundamental component of the ocean system and have significant impacts on human activities and the ecological environment. Extreme wave phenomena, such as storms or tsunamis, can cause serious natural disasters [1,2]. Waves also pose significant threats to various maritime and coastal activities. For instance, rough sea conditions endanger maritime navigation, increasing the risk of vessel capsizing, collisions, or delays in shipping routes [3]. In ports, high waves can disrupt loading and unloading operations, compromise structural stability, and lead to economic losses due to downtime [4]. Furthermore, the unpredictability of wave patterns presents a challenge for marine fisheries, affecting the safety of fishing vessels and the sustainability of fish stocks [5]. Accurate prediction of wave height is therefore crucial for ensuring the safety of human activities [6,7,8,9]. In the field of renewable energy, wave energy converters harness the kinetic and potential energy of waves to generate electricity, and accurate wave forecasting helps improve energy production efficiency [10,11,12,13]. Waves also drive nutrient exchange and sediment transport in marine ecosystems, playing a crucial role in the sustainable growth of marine organisms [14].
Historically, such predictions have predominantly relied on numerical models such as the wave model (WAM) [15], simulating waves nearshore (SWAN) [16], and Wavewatch III (WW3) [17]. These numerical models are grounded in energy balance equations, describing physical processes including wind stress, wave–wave nonlinear interactions, and bottom dissipation. However, their reliance on gridded inputs such as wind and wave fields poses a significant impediment to local wave prediction at individual buoy stations [18]. Furthermore, the parameterization process relies on approximate functions, which inevitably introduce errors into the numerical models [19]. Hence, it is imperative to identify a model tailored to enhance the accuracy of significant wave height (SWH) prediction at individual buoy stations.
The rapid development of artificial intelligence (AI) technology offers novel avenues for related research. Distinct from conventional numerical models, AI-based methodologies afford greater flexibility, enabling predictions using only historical series data from buoy stations. AI is a general concept that encompasses machine learning and deep learning. Established machine learning algorithms such as support vector regression (SVR) [20], the extreme learning machine (ELM) [21], nonlinear function-on-function models [22], and random forests [23] have demonstrated utility in prediction tasks. Simultaneously, the accelerated evolution of deep learning has prompted researchers to leverage these techniques for SWH prediction [24]. Deo et al. [25] employed a straightforward three-layered artificial neural network (ANN) to predict wave height at three different locations along the Indian coast and achieved satisfactory results. However, because the inputs of an ANN are treated in parallel, the temporal dependencies within the input data are ignored. In contrast, the recurrent neural network (RNN) incorporates a recursive input structure, which can effectively capture the temporal features of input data [26]. Notably, long short-term memory (LSTM) [27,28,29,30], an advanced variant of the RNN, addresses challenges such as long-term dependencies and gradient vanishing in RNNs. Fan et al. [31] employed an LSTM model to predict wave heights at 1 h and 6 h horizons, using historical wind speed and wave height data as input parameters; tested at 10 different buoy sites, its prediction accuracy and stability were superior to other deep learning models such as the ANN. In addition, building upon LSTM, bidirectional LSTM (Bi-LSTM) was applied to the estimation of tidal level, with a stronger capability in handling long-term dependencies [32]. The AI-based models mentioned above revolve around time series forecasting, ingesting historical data such as SWH, wind speed, and wind direction to extrapolate future wave heights.
LSTM and other recurrent neural networks have temporal dependencies during computation, which creates bottlenecks for parallel computing. Convolutional neural networks (CNNs), with their sliding filtering structure, greatly improve the efficiency of parallel computing, and many researchers have applied CNNs to regional wave prediction [33,34,35]. In addition, CNNs have been integrated with other neural networks or with time–frequency decomposition techniques to construct prediction models [36,37]. Building on CNNs, the temporal convolutional network (TCN) uses dilated causal convolutional layers to perform convolution operations on input sequences, which not only facilitates parallel computing but also enhances the extraction of temporal features. Ji et al. [38] proposed an effective wave height prediction model based on variational mode decomposition and the TCN, established with Bayesian hyperparameter optimization, and obtained improvements in multi-step prediction. Huang et al. [39] tested the predictive performance of ANN, LSTM, and TCN models in China's offshore waters using a multi-station data fusion training strategy and showed that multi-station data can improve the prediction results. Lou et al. [40] combined the TCN with empirical mode decomposition (EMD) and applied this hybrid model to buoy observation data; the effectiveness of EMD-TCN in wave height prediction was verified, and the lag problem in previous wave height prediction research was eliminated, improving prediction accuracy.
In recent years, the attention mechanism has attracted growing interest. As the core module of the Transformer, the attention mechanism can extract global features, and it has been widely applied in prediction fields such as traffic flow and wind power [41,42,43,44]. Zhang et al. [45] applied a TCN-Attention model to ship motion prediction, assigning different weights to the original features through the attention mechanism and effectively improving prediction accuracy. Luo et al. [46] combined Bi-LSTM and attention to predict wave heights in the Atlantic hurricane zone, selecting four data features collected at five buoy stations (wave height, wind speed, wind direction, and wave direction) as model inputs and future wave heights as model outputs. Compared with the benchmark models, their BLA model showed more stable predictive performance.
However, to the best of our knowledge, few researchers have applied the TCN-Attention model to wave prediction. The volatility of waves demands strong global feature extraction capabilities, and the TCN-Attention model achieves efficient feature learning through a global receptive field and attention weight allocation. This paper therefore employs the TCN-Attention model to predict wave heights. Additionally, to address the time-consuming task of determining hyperparameters in traditional deep learning models, we introduce the whale optimization algorithm (WOA) for global hyperparameter optimization, making the process more efficient. Furthermore, the fixed input–output mapping of the plain ReLU activation function limits the model's expressive power. To overcome this limitation, a dynamic ReLU activation function is introduced, which adjusts the activation parameters based on the input sequence, thereby enhancing the model's representation flexibility.
This paper is organized as follows. Section 2 introduces the principles of each model. Section 3 introduces the dataset information. Section 4 presents the predicted results of each model and discusses them. Finally, Section 5 provides a conclusion.
2. Method
2.1. Temporal Convolutional Network
The TCN is a deep learning architecture designed for processing sequential data and is particularly well suited to tasks involving temporal dependencies [47]. It has shown considerable success in a variety of sequential data tasks, including natural language processing, audio processing, and time series analysis.
The TCN is composed of dilated causal convolutions and residual connections, with input and output sequences of the same length. The convolution used by the TCN is causal, meaning that the current output depends only on historical inputs and not on future inputs, which avoids information leakage. In addition, like RNNs and LSTMs, this architecture can map a sequence of any length to an output sequence of the same length.
Figure 1 shows the causal convolutional network adopted by the TCN, which achieves a larger receptive field through dilated convolution. Dilated convolution adds a dilation factor d to an ordinary convolution, which enlarges the span of the time series observed by the network without increasing computational complexity. The dilated causal convolution is calculated as follows:

F(s) = (x *_d f)(s) = \sum_{i=0}^{k-1} f(i) \, x_{s - d \cdot i}

In the formula, x represents the input sequence, f represents the convolution kernel, s represents the sequence index, k is the size of the convolution kernel, and d represents the dilation coefficient. d grows with the network depth, and the receptive field is expanded by controlling the sizes of d and k.
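To make the causal padding and receptive-field arithmetic concrete, a minimal PyTorch sketch of a dilated causal convolution is given below; the class name, channel sizes, and example tensor are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DilatedCausalConv1d(nn.Module):
    """Causal 1-D convolution: the output at time s depends only on inputs <= s."""
    def __init__(self, channels_in, channels_out, kernel_size, dilation):
        super().__init__()
        # Left-pad by (k - 1) * d so that no future samples leak into the output.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels_in, channels_out,
                              kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))        # pad only on the left (past)
        return self.conv(x)

# Stacking layers with d = 1, 2, 4 gives a receptive field of
# 1 + (k - 1) * (1 + 2 + 4) time steps for kernel size k.
x = torch.randn(8, 1, 24)                              # e.g., 24 hourly SWH values
layer = DilatedCausalConv1d(1, 16, kernel_size=3, dilation=2)
print(layer(x).shape)                                  # -> torch.Size([8, 16, 24])
```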
Residual networks are used to solve the performance degradation that affects traditional deep networks when the number of layers becomes very large. In the TCN model, the stacking of causal and dilated convolutions leads to a deep network, so residual connections are introduced to fuse the input into the output of the convolutional layers and to avoid gradient explosion and vanishing during training:

o = \sigma\left(x + \mathcal{F}(x)\right)

In the formula, x represents the input, \mathcal{F}(x) represents the output of the convolutional layer, and \sigma represents the activation function.
The TCN residual block consists of two dilated causal convolutions and nonlinear mapping layers, as shown in Figure 2. The convolutional layers adopt one-dimensional dilated causal convolution, and the receptive field is adjusted through the dilation coefficient. Dynamic ReLU is used as the activation function, which dynamically adjusts its parameters based on the convolutional output. Finally, a dropout layer is added after each dilated convolution to avoid overfitting during training.
The TCN can be efficiently parallelized across time steps, making it computationally efficient, especially compared with RNNs, which process sequences sequentially. Thanks to dilated convolutions, TCNs can capture long-range dependencies without relying on recurrent connections, but the resulting network depth can cause vanishing gradients; the residual connections effectively avoid this problem.
Unlike the traditional TCN, the improved TCN model uses dynamic ReLU [48] as the activation function. The mainstream ReLU activation function, y = max(x, 0), is static and performs the same operation on every input sequence. In dynamic ReLU, by contrast, the activation parameters are generated by a hyperfunction that depends on the input sequence: the core idea is to encode the global input sequence with a hyperfunction and adjust the piecewise linear activation function accordingly. Compared with a static activation function, the additional computational cost of dynamic ReLU is negligible, yet it has greater expressive power. Dynamic ReLU can be represented as a parameterized piecewise function:

y_c = \max_{1 \le k \le K} \left\{ a_c^k(x) \, x_c + b_c^k(x) \right\}

where the coefficients a_c^k(x) and b_c^k(x) are dynamically adjusted based on the input sequence x by the hyperfunction \theta(x), K is the number of linear pieces, c indexes the C channels, and the activation parameters are therefore related to the input x.
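Below is a minimal sketch of a channel-wise dynamic ReLU with K = 2 linear pieces, following the formulation above; the hyperfunction design (global average pooling plus a two-layer MLP) and the coefficient scaling are simplified assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class DynamicReLU(nn.Module):
    """y_c = max_k ( a_c^k(x) * x_c + b_c^k(x) ), coefficients from a hyperfunction."""
    def __init__(self, channels, k=2, reduction=4):
        super().__init__()
        self.channels, self.k = channels, k
        # Hyperfunction theta(x): global pooling + small MLP -> 2*K coefficients per channel
        self.theta = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, 2 * k * channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (batch, channels, time)
        ctx = x.mean(dim=-1)                    # global context per channel
        coeffs = 2.0 * self.theta(ctx) - 1.0    # map to [-1, 1]
        coeffs = coeffs.view(-1, self.channels, 2 * self.k, 1)
        a = 1.0 + coeffs[:, :, :self.k]         # slopes perturbed around the identity
        b = 0.5 * coeffs[:, :, self.k:]         # small input-dependent offsets
        y = a * x.unsqueeze(2) + b              # (batch, channels, K, time)
        return y.max(dim=2).values              # max over the K linear pieces

x = torch.randn(8, 16, 24)
print(DynamicReLU(16)(x).shape)                 # -> torch.Size([8, 16, 24])
```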
2.2. Attention Mechanism
The attention mechanism is inspired by human visual attention and was first applied in the field of computer vision [49]. Its basic idea is to select the information that is most relevant to the current task from the input, assigning weights according to the importance of each piece of input information and thereby capturing more valuable information. With its continuous development and improvement, the attention mechanism has been applied to time series prediction and has achieved good results.
The structure of the attention mechanism is shown in Figure 3. First, a multi-layer perceptron (MLP) is used to calculate the similarity weight of the data at each time step:

e_j = W_2 \tanh\left(W_1 h_j\right)

where W_1 represents the connection weights between the input layer and hidden layer of the MLP, W_2 represents the connection weights between the hidden layer and output layer of the MLP, and h_j is the hidden state at time step j. Next, the softmax function is used to normalize these similarity weights:

\alpha_j = \frac{\exp(e_j)}{\sum_{k=1}^{n} \exp(e_k)}

where exp() is the exponential function with base e. Finally, the normalized similarity weights and the corresponding data are weighted and summed to obtain the self-attention output matrix R:

R = \sum_{j=1}^{n} \alpha_j h_j

where h_j represents the hidden state vector at time j and \alpha_j represents its normalized similarity weight. In wave prediction, the self-attention mechanism can use the MLP to calculate similarity weights for the hourly wave height data, assigning weights to the prediction according to the importance of historical wave data at different times and thereby capturing more valuable information.
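The following PyTorch sketch mirrors the three formulas above: an MLP scores each hidden state, softmax normalizes the scores, and a weighted sum yields the attention output R; the dimensions and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MLPAttention(nn.Module):
    """Score each time step with an MLP, normalize with softmax, and pool."""
    def __init__(self, hidden_dim, attn_dim=32):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, attn_dim)    # input-to-hidden weights (W_1)
        self.w2 = nn.Linear(attn_dim, 1)             # hidden-to-output weights (W_2)

    def forward(self, h):                            # h: (batch, time, hidden_dim)
        e = self.w2(torch.tanh(self.w1(h)))          # similarity weights e_j
        alpha = torch.softmax(e, dim=1)              # normalized weights alpha_j
        r = (alpha * h).sum(dim=1)                   # weighted sum R
        return r, alpha.squeeze(-1)

h = torch.randn(4, 24, 64)                           # e.g., 24 hourly hidden states
r, alpha = MLPAttention(64)(h)
print(r.shape, alpha.shape)                          # (4, 64) and (4, 24)
```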
2.3. Whale Optimization Algorithm
The WOA is a biologically inspired heuristic algorithm proposed by Mirjalili and Lewis [50] that simulates the group hunting behavior of humpback whales. Compared with other optimization algorithms such as particle swarm optimization, the WOA converges faster, supports distributed computing, has stronger global optimization ability, and requires fewer parameter adjustments. The algorithm simulates three behaviors of humpback whales: encircling prey, spiral bubble-net hunting, and searching for prey.
- (1)
Encircling prey
A group of humpback whales first locates its prey and surrounds it. During this process, each whale determines the position of the prey based on the best solution found so far in the optimization space and continuously approaches it, updating its own position at every iteration. The encircling behavior is described as follows:

\vec{D} = \left| \vec{C} \cdot \vec{X}^*(t) - \vec{X}(t) \right|

\vec{X}(t+1) = \vec{X}^*(t) - \vec{A} \cdot \vec{D}

Among them, t represents the current iteration number, \vec{A} and \vec{C} are coefficient vectors, \vec{X}^*(t) is the best solution in the current optimization space (updated whenever a better solution appears after an iteration), \vec{X}(t) is the position vector of a whale, and · denotes elementwise multiplication. The coefficient vectors \vec{A} and \vec{C} are defined as follows:

\vec{A} = 2\vec{a} \cdot \vec{r} - \vec{a}

\vec{C} = 2\vec{r}

In these formulas, \vec{a} linearly decreases from 2 to 0 over the iterations, while \vec{r} is a random vector whose components lie in the interval [0, 1].
- (2)
Spiral bubble-net hunting (exploitation stage)
Spiral bubble-net hunting involves two behavioral mechanisms, shrinking encirclement and spiral position updating, each selected with probability 0.5. A random number p determines which mechanism is used to attack the prey. The shrinking encirclement mechanism follows the encircling formula above, in which the whale's new position lies anywhere between its original position and the position of the current best agent. Unlike the shrinking circle formed by the encircling behavior, the spiral position update lets the whale move along a spiral path between its current position and the prey. The mathematical model of spiral bubble-net hunting is as follows:

\vec{X}(t+1) =
\begin{cases}
\vec{X}^*(t) - \vec{A} \cdot \vec{D}, & p < 0.5 \\
\vec{D}' \cdot e^{bl} \cdot \cos(2\pi l) + \vec{X}^*(t), & p \ge 0.5
\end{cases}

Among them, \vec{D}' = \left| \vec{X}^*(t) - \vec{X}(t) \right| represents the distance between the whale and its prey (i.e., the distance between the whale and the current best agent), the constant b defines the shape of the logarithmic spiral, l is a random number in the interval [−1, 1], and p is a random number in the interval [0, 1].
- (3)
Searching for prey (exploration stage)
Unlike the exploitation phase, the search for prey (exploration phase) is based on the position of a randomly selected search agent rather than that of the best agent. This design helps the algorithm escape local optima and converge toward better solutions. The search process is described by the following model (a minimal sketch of the complete position-update loop is given after this list):

\vec{D} = \left| \vec{C} \cdot \vec{X}_{rand} - \vec{X} \right|

\vec{X}(t+1) = \vec{X}_{rand} - \vec{A} \cdot \vec{D}

Among them, \vec{X}_{rand} is the position of a randomly selected search agent (i.e., a humpback whale) in the current iteration. The value range of \vec{A} differs from the previous stage: to broaden the search and find the global optimum, \vec{A} is chosen with components greater than 1 or less than −1, that is, |\vec{A}| > 1, so the update amplitude of the search agent coordinates is significantly increased.
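Putting the three behaviors together, a minimal NumPy sketch of the WOA update loop is shown below; the fitness function, bounds, population size, and spiral constant are placeholders supplied by the user rather than settings taken from the paper.

```python
import numpy as np

def woa(fitness, bounds, n_agents=10, n_iter=20, b=1.0, seed=0):
    """Minimal whale optimization: encircling, spiral attack, and random search."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T          # bounds: [(low, high), ...]
    X = rng.uniform(lo, hi, size=(n_agents, len(lo)))
    best = min(X, key=fitness).copy()

    for t in range(n_iter):
        a = 2.0 * (1.0 - t / n_iter)                  # linearly decreases from 2 to 0
        for i in range(n_agents):
            r = rng.random(len(lo))
            A = 2 * a * r - a                         # coefficient vector A
            C = 2 * rng.random(len(lo))               # coefficient vector C
            p, l = rng.random(), rng.uniform(-1, 1)
            if p < 0.5:
                if np.all(np.abs(A) < 1):             # exploitation: encircle the best agent
                    D = np.abs(C * best - X[i])
                    X[i] = best - A * D
                else:                                 # exploration: follow a random whale
                    X_rand = X[rng.integers(n_agents)]
                    D = np.abs(C * X_rand - X[i])
                    X[i] = X_rand - A * D
            else:                                     # spiral bubble-net attack
                D = np.abs(best - X[i])
                X[i] = D * np.exp(b * l) * np.cos(2 * np.pi * l) + best
            X[i] = np.clip(X[i], lo, hi)
            if fitness(X[i]) < fitness(best):
                best = X[i].copy()
    return best

# Usage: best = woa(lambda v: float(((v - 3) ** 2).sum()), bounds=[(-10, 10)] * 3)
```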
2.4. Model Structure
The ITCN-A prediction model proposed in this article is shown in Figure 4. The model can be divided into three stages: data preprocessing, WOA hyperparameter optimization, and TCN-Attention model prediction.
Stage 1: Cubic spline interpolation is used to fill in the missing values of the original time series. The not-a-knot boundary condition is selected, and the interpolating polynomials are obtained by solving the system of equations that requires the function and its first and second derivatives to be continuous at the interior knots.
Stage 2: The optimization space is constructed from the hyperparameters of the TCN, including the number of convolution kernels, the batch size, and the learning rate. The WOA first randomly generates the initial positions of the search agents, and these agents then adaptively choose different hunting modes to update their coordinates. When the termination condition is met, the WOA outputs the globally optimal hyperparameter combination.
Stage 3: The prediction model is built with the TCN-Attention network. The TCN uses dilated causal convolution to extract information from the input sequence, with input and output of the same length. The TCN output sequence is then fed into a multi-head attention mechanism to further capture global information. Finally, a fully connected network placed after the attention mechanism produces the predicted wave height. Unlike a traditional TCN, the TCN used in this paper adopts the dynamic ReLU activation function, which dynamically adjusts its parameters based on the input sequence, making the network more flexible.
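As an illustration of Stage 3, the following sketch wires a dilated causal convolution stack, a multi-head attention layer, and a fully connected head in PyTorch; the layer sizes are placeholders, plain ReLU stands in for dynamic ReLU, and residual connections are omitted for brevity, so this is a simplified assumption of the ITCN-A architecture rather than its exact implementation.

```python
import torch
import torch.nn as nn

class ITCNABlock(nn.Module):
    """Sketch of the Stage-3 pipeline: dilated causal TCN -> attention -> FC head."""
    def __init__(self, in_ch=1, hidden=16, kernel=3, dilations=(1, 2, 4), heads=4):
        super().__init__()
        layers, ch = [], in_ch
        for d in dilations:
            # Causal convolution via explicit left padding of (kernel - 1) * d
            layers += [nn.ConstantPad1d(((kernel - 1) * d, 0), 0.0),
                       nn.Conv1d(ch, hidden, kernel, dilation=d),
                       nn.ReLU(),                     # dynamic ReLU in the full model
                       nn.Dropout(0.1)]
            ch = hidden
        self.tcn = nn.Sequential(*layers)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.head = nn.Linear(hidden, 1)              # predicted SWH value

    def forward(self, x):                              # x: (batch, time, features)
        z = self.tcn(x.transpose(1, 2)).transpose(1, 2)   # (batch, time, hidden)
        z, _ = self.attn(z, z, z)                      # global re-weighting over time
        return self.head(z[:, -1])                     # forecast from the last step

model = ITCNABlock()
y_hat = model(torch.randn(8, 24, 1))                   # 24 h of history -> 1 forecast
print(y_hat.shape)                                     # -> torch.Size([8, 1])
```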
3. Study Area
The dataset is from the National Data Buoy Center (NDBC, https://www.ndbc.noaa.gov/, accessed on 1 November 2024). The NDBC is a branch of the National Oceanic and Atmospheric Administration (NOAA) that monitors and collects data from a network of buoys and coastal stations across the United States. These buoys are strategically placed in oceans, coastal areas, and the Great Lakes to gather real-time environmental data, including information on waves, wind, currents, and other marine conditions. The wave-related variables provided by the NDBC include wave height (the average height of the highest one-third of waves in a given period), wave period (the time it takes for successive wave crests to pass a fixed point), and wave direction (the compass direction from which the waves are coming).
The experiment selected three buoy stations, 41008, 42055, and 46083. Table 1 lists the geographic information of these stations, and Figure 5 shows their locations. The dataset covers 2018 to 2022, with a collection frequency of once per hour. The data from 2018 to 2021 are used as the training set, and the data from 2022 are used as the testing set.
In addition, many values in the dataset are 99.0, which represents missing values that need to be filled or removed during the data preprocessing stage. This process requires the use of Python’s Pandas library, which provides rich functions for data reading and preprocessing.
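A minimal Pandas sketch of this preprocessing step is shown below; the file name and column names are hypothetical and should be adapted to the actual NDBC export being used.

```python
import pandas as pd
import numpy as np

# Hypothetical file and column names; adjust to the actual NDBC export being used.
df = pd.read_csv("ndbc_41008_2018_2022.csv")

# The sentinel value 99.0 marks missing observations; convert it to NaN first.
df["WVHT"] = df["WVHT"].replace(99.0, np.nan)

# Fill the gaps with cubic spline interpolation over the hourly record
# ('cubicspline' is the Pandas wrapper around the SciPy method).
df["WVHT"] = df["WVHT"].interpolate(method="cubicspline")

# Chronological split: 2018-2021 for training, 2022 for testing
# (assuming a hypothetical 'year' column extracted from the timestamps).
train = df[df["year"] <= 2021]
test = df[df["year"] == 2022]
```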
The models used in this paper are all built with the PyTorch framework. PyTorch is an open-source machine learning library for Python, developed by Facebook's AI Research lab (FAIR). It provides a flexible, dynamic computational-graph paradigm, which makes it particularly well suited to research and experimentation in deep learning.
To evaluate the prediction performance of the models, we selected scientific evaluation metrics, namely the root mean square error (RMSE), the mean absolute error (MAE), and the coefficient of determination (R²). Their expressions are as follows:

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}

MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|

R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}

In these equations, y_i is the observed value (i.e., the true value), \hat{y}_i is the predicted value of the model, \bar{y} is the average of the observed values, and n represents the total number of samples.
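The three metrics can be computed directly with NumPy, as in the short sketch below.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for observed vs. predicted wave heights."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    mae = np.mean(np.abs(y_true - y_pred))
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Tiny usage example with made-up values
print(evaluate([1.2, 1.5, 1.1], [1.1, 1.6, 1.0]))
```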
4. Results and Analysis
4.1. Hyperparameter Optimization
The hyperparameters of the TCN are determined using the WOA. First, the range of each hyperparameter must be set: the batch size range is [32, 96], the Adam optimizer learning rate range is [0.001, 0.03], and the range for the number of convolution kernels in the TCN is [2, 16]. When initializing the position coordinates, we randomly generate the positions of 10 humpback whales within these ranges as the starting state of the iteration. Each whale then computes its update direction from its current position and the current best position, gradually approaching the position with the best fitness.
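As a sketch of how these ranges map onto WOA search agents, the snippet below initializes ten agents within the stated bounds and decodes an agent into a candidate configuration; the decoding rules and the stand-in fitness value are assumptions, since in practice the fitness is obtained by training the TCN-Attention model and evaluating its validation error.

```python
import numpy as np

# Search-space bounds: (batch size, learning rate, number of convolution kernels)
bounds = np.array([[32, 96], [0.001, 0.03], [2, 16]])
rng = np.random.default_rng(42)

# Ten humpback whales initialized uniformly at random inside the bounds
agents = rng.uniform(bounds[:, 0], bounds[:, 1], size=(10, 3))

def decode(agent):
    """Round the integer-valued dimensions to obtain a usable configuration."""
    batch_size = int(round(agent[0]))
    learning_rate = float(agent[1])
    n_kernels = int(round(agent[2]))
    return batch_size, learning_rate, n_kernels

def fitness(agent):
    """Objective minimized by the WOA (lower is better)."""
    batch_size, lr, n_kernels = decode(agent)
    # In practice: train the TCN-Attention model with this configuration and
    # return its validation RMSE; a constant stand-in keeps the sketch runnable.
    return 0.0
```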
The iterative changes in the best fitness value are shown in Figure 6. The fitness curves in Figure 6 were obtained by running the optimization on data from station 41008 and record the changes in RMSE and MAE for the 1 h prediction. Initially, the RMSE of the fitness function was 0.109, as indicated by the starting point of the blue line; this high value reflects the model's error before optimization. Over the first few iterations there is a significant decrease in RMSE, which drops to around 0.097 within the first five iterations, demonstrating the effectiveness of the initial optimization steps. After this sharp drop, the RMSE curve begins to plateau, showing a more gradual decrease over the subsequent iterations. By the 10th iteration, the RMSE has decreased to 0.091, a 16.5% reduction from the initial value. Further iterations beyond the 10th do not show a significant reduction, indicating that the RMSE has converged and the model has reached its optimal hyperparameter configuration.
Similarly, the MAE, represented by the red line, starts at approximately 0.072. There is a noticeable reduction in the first few iterations, where the MAE drops to around 0.063, highlighting the model's improvement in prediction accuracy as the optimization begins. By the 10th iteration, the MAE has decreased to around 0.059, an 18% reduction from its initial value. After this point, the MAE also converges, with minimal changes observed in subsequent iterations.
The optimized parameters are as follows: a batch size of 32, a learning rate of 0.003, and 6 TCN convolution kernels. The optimization was run on a Tesla T4 graphics card and took approximately 26 h to process 20 epochs of whale optimization. These optimized hyperparameters result in a more accurate and efficient prediction model. The convergence of both RMSE and MAE after 10 iterations suggests that the WOA finds the optimal hyperparameters within a relatively small number of iterations, which is crucial for practical applications where computational resources and time are limited.
4.2. Model Comparison
After the optimal hyperparameters were obtained with the whale optimization algorithm, they were applied to the proposed ITCN-A model, which was tested at station 41008 for 1 h, 3 h, and 6 h predictions and compared with the LSTM, TCN, and TCN-Attention models, as shown in Figure 7. In the 1 h prediction, the RMSE, MAE, and R² of the LSTM wave height predictions are 0.091, 0.07, and 0.954, respectively. The LSTM model adopts a sequential structure, and the continual updating of cell and hidden states can lead to a certain degree of gradient vanishing and a gradual loss of historical information, which limits its accuracy. The RMSE and MAE of the TCN model decrease to 0.084 and 0.064, and the R² increases to 0.958. This improvement is attributed to the dilated causal convolution structure of the TCN: the dilated causal convolution expands the temporal receptive field over the entire input sequence, fully extracting the temporal features and avoiding vanishing gradients, while the residual connections raise the lower bound of prediction accuracy. To further enhance feature extraction, the TCN is coupled with the attention mechanism, which extracts global temporal features and likewise avoids gradient vanishing. In the 1 h prediction, the RMSE and MAE of TCN-Attention are 0.079 and 0.059, which are 13.2% and 15.7% lower than those of LSTM, respectively, and the R² is 0.964, an increase of one percentage point. The traditional TCN uses the ReLU activation function, which is fixed regardless of the input values and is not flexible enough; we therefore switched to dynamic ReLU, whose piecewise-function parameters are determined by a hyperfunction of the input data, making it more flexible and further improving accuracy. The resulting RMSE, MAE, and R² values are 0.075, 0.051, and 0.972, respectively, with RMSE and MAE reduced by 17.6% and 27.1% compared with the LSTM model. Overall, the four models do not differ greatly in the 1 h prediction, but the TCN-DyReLU-Attention model (ITCN-A) achieves the highest accuracy.
When the prediction horizon is increased to 3 h, the RMSE of the TCN model is 5.1% lower than that of LSTM, reaching 0.15, and its R² increases by 0.5 percentage points, reaching 0.871. The error metrics of the TCN and TCN-Attention models are not significantly different, indicating that the TCN has already extracted the temporal features effectively and the gain from the coupled model is limited. Compared with the LSTM model, ITCN-A reduces RMSE and MAE by 10.1% and 10%, respectively. In the 3 h prediction, the differences among the four models remain small.
Extending the prediction horizon to 6 h at station 41008, the proposed integrated model reduces RMSE and MAE by 19.7% and 14.1%, respectively, compared with LSTM and increases R² by 11.8 percentage points. Although the gap between models is small at short lead times, the advantage of the integrated model expands as the prediction horizon grows, and the lag of the traditional LSTM gradually becomes apparent.
To present the model predictions more intuitively, the 1 h predictions of the ITCN-A, TCN-Attention, TCN, and LSTM models on the 41008 test set are plotted as continuous curves in Figure 8. The LSTM predictions deviate noticeably from the true values and are, overall, lower than them, indicating a serious underestimation that would be difficult to accept in operational ocean prediction. With the TCN model, the deviation from the true values decreases, and at many time points the predictions exceed the true values, giving a slight overestimation. After the attention mechanism and the dynamic ReLU activation function are incorporated, the prediction curve of the integrated model is closest to the true value curve and has the lowest error.
At station 41008, the ITCN-A, TCN-Attention, TCN, and LSTM models were also applied to 3 h prediction, and their predictions are compared in Figure 9. The predictions of the LSTM and TCN models differ considerably from the actual values. The attention mechanism enhances the extraction of input features to some extent, and dynamic ReLU further improves the representation ability of the combined model, whose prediction curve is closer to the true value curve. Figure 10 shows the 6 h prediction curves of the different models; all models underestimate to some degree, with LSTM being the most obvious. Although the integrated model proposed in this article also underestimates the actual values, it remains the closest to them.
4.3. Different Lead Time
Long-term forecasting of significant wave height is crucial for preventing marine disasters, so it is necessary to measure the model's ability to make medium- and long-term predictions. At station 41008, the LSTM, TCN, TCN-Attention, and ITCN-A models were built to predict wave heights for the next 12 h, 18 h, and 24 h. The variation of the error metrics with lead time is shown in Figure 11. It can be seen that for the 12 h, 18 h, and 24 h predictions, the RMSE and MAE of the LSTM model are slightly higher than those of the TCN model. Specifically, for a 12 h lead time, the RMSE of the LSTM model is around 0.32, while the TCN model achieves a slightly lower RMSE of approximately 0.31; for MAE, the LSTM model starts at around 0.07 and the TCN model at 0.063. This trend continues as the lead time increases, highlighting the consistently better performance of the TCN over LSTM for medium-term forecasts.
As the prediction horizon extends to 18 h and 24 h, the benefits of integrating the attention mechanism and the dynamic ReLU activation function become more evident. The RMSE and MAE of the TCN-Attention and ITCN-A models show a noticeable reduction compared with the plain TCN model. For instance, at the 24 h mark, the ITCN-A model achieves an RMSE of 0.336 and an MAE of 0.233, significantly lower than the corresponding values for the LSTM model.
Moreover, we examined the R² metric, which measures the proportion of variance explained by the model. For short-term predictions (1 h to 6 h), all models, including LSTM, TCN, and its variants, exhibit high R² values close to 1.0, indicating strong predictive power. However, as the lead time increases to 12 h, 18 h, and 24 h, the R² values decrease, reflecting the increasing difficulty of long-term forecasting. Notably, the ITCN-A model shows a higher R² compared with the LSTM model, especially for the 24 h lead time, where it achieves an R² improvement of 5.3 percentage points over the LSTM model.
This enhancement indicates that the attention mechanism can effectively extract long-term sequence features, and the dilated convolution of TCN, which has a global receptive field, enhances the model’s long-term prediction ability. The integration of dynamic ReLU further refines the activation function, providing better adaptability and improving the model’s performance on longer lead times.
4.4. Multi-Station Analysis
In order to verify the universality of the model, it is necessary to apply each model to multiple different buoy stations. The significant wave height data from stations 42055 and 46083 were selected for model testing, covering 1 h, 3 h, 6 h, 9 h, 12 h, 18 h, and 24 h predictions. Based on the predicted and true values, scatter charts can be drawn, as shown in
Figure 12.
The figure shows the scatter plots of predicted versus observed values for the different models at different lead times at station 42055. As the lead time increases, the prediction accuracy continuously decreases. The LSTM model has the largest error, and its scatter points are more dispersed than those of the other models. This is most evident at the 24 h lead time, where the LSTM scatter points deviate significantly from the y = x line, indicating a higher prediction error (RMSE = 0.566, MAE = 0.344, R² = 0.499). The TCN model improves prediction accuracy, thanks to the global receptive field of the dilated causal convolution; even at longer lead times, its scatter points are concentrated around the y = x line, although there is a noticeable underestimation trend. At the 24 h lead time, the TCN scatter points lie closer to the y = x line than those of LSTM, with improved RMSE and MAE values (RMSE = 0.559, MAE = 0.325, R² = 0.195), demonstrating the model's ability to maintain better accuracy over extended prediction horizons.
Building on the TCN, coupling it with the attention mechanism and using the dynamic ReLU activation further refines the prediction accuracy. The TCN-Attention and ITCN-A models exhibit scatter distributions that are closest to the y = x line across all lead times. For example, in the 24 h prediction, the ITCN-A model achieves RMSE and MAE values of 0.551 and 0.329, respectively, the lowest errors among the models tested. This improvement is attributed to the attention mechanism's ability to extract long-term dependencies in the data and to the adaptability of the dynamic ReLU to different input sequences.
Next, we analyze station 46083, whose scatter plots are shown in Figure 13. In the 1 h prediction, all four models perform well, with R² values around 0.97, indicating high accuracy for short-term forecasts. However, as the prediction horizon extends, the LSTM model exhibits a degree of overestimation, likely due to gradient vanishing and exploding, which destabilize the input features during the chained computations and lead to inaccurate predictions. In contrast, the TCN adopts residual connections, effectively mitigating vanishing or exploding gradients. From Figure 13, it can be observed that the scatter points predicted by the TCN model are more symmetrically distributed around the y = x line. The TCN-Attention and ITCN-A models further enhance prediction accuracy. For example, in the 18 h prediction, the ITCN-A model achieves RMSE and MAE values of 0.379 and 0.281, respectively, which are 16.2% and 6.1% lower than the corresponding metrics of the LSTM model.