1. Introduction
Nearly 100,000 dams are in service in China; many were built in the 1950s–1970s and are now approaching or exceeding their design service life. Dam safety monitoring systems have been widely applied to newly constructed dams and to old dams after reinforcement [1]. Monitoring systems sense environmental quantities and the corresponding physical information related to the structural response, such as dam deformation [2]. Monitoring variables such as deformation often show abnormal signals when a dam structure suffers damage or unconventional loads, playing an important role in identifying potential failure risks.
In recent decades, data-driven dam behavior monitoring, forecasting, and interpretation methods have aroused wide research interest in the dam safety monitoring community [3,4,5]. Dam deformation is often taken as the main research object because it intuitively reflects the response of dam structures under the coupled effect of environmental factors and external loads [6,7]. In particular, statistical predictive algorithms have been widely adopted for dam deformation prediction in practice, benefitting from their mature theory and simple modeling process [8,9,10]. The selection of input variables is an important modeling basis for statistical dam deformation methods, and factor selection is a key issue affecting the performance of data-driven dam behavior prediction models [11]. The hydraulic–seasonal–time (HST) model is one of the most commonly used factor models, in which dam deformation is attributed to three parts: water level, thermal variation, and time-varying effects. However, the HST model struggles to represent the effect of the actual, highly complex temperature field through a combination of simple harmonic functions, and its assumption of independent input variables is hard to satisfy in practice because air temperature and water level changes are correlated.
To address this limitation, improved HST models, such as the hydraulic–temperature–time (HTT) model, have been proposed to represent thermal effects for dam deformation prediction using prototypical thermometer data [12]. In the HTT model, the temperature data measured by thermometers embedded in the dam body and its foundation are used as the input variables [13]. However, challenges remain in optimizing the scale of thermometer placement. A high density of thermometers leads to high-dimensional nonlinearity, degrading model performance, whereas too few thermometers may provide insufficient data, limiting the model's predictive accuracy. In addition, modeling based on measured thermometer data differs significantly between dam types, making it difficult to obtain a universal method. To overcome these limitations, this paper proposes a dam deformation monitoring model using measured air temperature data and its hysteresis factors, called the hydraulic–air temperature–time (HTairT) model. In the HTairT model, long-term prototypical air temperature data are utilized to simulate the thermal effect, which has the advantages of strong universality and wide applicability [14].
Apart from appropriate input variables in the causal model, the fitting capability and generalization ability of the regression model also determine dam deformation prediction performance [15,16]. Multiple linear regression (MLR) and its improved variants are widely used for regression modeling in dam safety monitoring. However, statistical methods perform poorly in dealing with nonlinear relationships between input variables, and accurately simulating and evaluating the nonlinear relationship between numerous input variables and the effect quantities remains a challenging task.
In recent years, with the rapid development of artificial intelligence (AI) technology, the use of the powerful nonlinear fitting capabilities of machine learning (ML) to construct dam safety monitoring models has received extensive research attention [17,18,19]. A series of ML-based algorithms, such as the support vector machine (SVM), random forest (RF), and Gaussian process regression (GPR), have been introduced to simulate the nonlinear mapping between environmental variables and dam effect variables [20,21]. For instance, Kang et al. [22] developed a GPR-based deformation prediction model for concrete gravity dams. Liu et al. [23] proposed a combined prediction model for long-term deformation using the long short-term memory (LSTM) network. Dai et al. [23] developed an RF-based deformation prediction model for concrete dams. These references show that ML-based algorithms offer significant advantages for concrete dam deformation monitoring models. However, the aforementioned studies primarily rely on single-factor models or individual machine-learning regression strategies. Few studies have explored coupling dam deformation causal models with intelligent computing approaches, which could offer a more comprehensive understanding of dam deformation behavior prediction.
To address the aforementioned challenges, this study proposes a method for monitoring dam deformation and predicting behavior based on intelligent optimization, ML, and measured air temperature data. Initially, long-term multi-year temperature data, along with their lagged terms, are utilized as temperature factors to construct the HTairT model for predicting dam deformation. Subsequently, CatBoost, a high-performance, open-source gradient boosting algorithm, is employed to model the nonlinear relationships between environmental factors and dam deformation behavior. The optimal parameters for CatBoost are determined using an enhanced particle swarm optimization (PSO) algorithm. A high dam, in operation for several years, serves as the engineering case study, with multiple horizontal deformation monitoring points used to verify and assess the prediction accuracy and generalization performance of the proposed method.
This study makes several important contributions: it presents an improved HTairT deformation monitoring model that incorporates long-term air temperature data and lagged terms to more effectively simulate thermal effects on dam deformation; it introduces a novel approach that combines the PSO algorithm with the CatBoost regressor to optimize model parameters, achieving high predictive accuracy in dam deformation forecasting; and it identifies critical factors influencing dam deformation, specifically emphasizing the impact of water level and average air temperature over various time windows (1–2 days, 3–7 days, and 30–60 days).
The remainder of this paper is organized as follows. Section 2 introduces the basic methodology of dam deformation monitoring, CatBoost, and the improved PSO algorithm. Section 3 presents the engineering background and a statistical analysis of the monitoring data. Section 4 reports a series of comparative experiments that verify the generalization performance and accuracy of the proposed method. Section 5 summarizes the key contributions of this study and outlines our further research plans.
2. Methodology
In this section, the overall architecture of the developed deformation modeling and prediction framework is first introduced to provide an overview of the workflow. Subsequently, the theoretical foundations of the model’s components are further detailed. The specific content is outlined as follows.
2.1. Dam Deformation Monitoring Model Using Observed Air Temperature Data
Figure 1 illustrates the loads and environmental impacts acting on dams in long-term service. Dam deformation is a typical structural response, which can be further divided into recoverable components driven by water pressure and temperature, and irrecoverable components driven by creep, alkali–aggregate reaction, and material aging. Dam deformation is primarily induced by the interaction of complex loads, particularly hydrodynamic pressures and temperature variations. Hydrodynamic pressures exert lateral forces on the upstream face of the dam, which fluctuate with changes in water levels and flow rates, leading to lateral deformation. Additionally, thermal stresses caused by variations in water and air temperatures, as well as solar radiation, can result in differential expansion and contraction within the dam structure. This combination of mechanical and thermal loads can cause the dam to deform excessively or behave abnormally.
In the developed HTairT model using the observed air temperature data, dam deformation δ can be described using the following three components: the hydraulic component δ_H, the thermal component δ_T, and the time-varying component δ_θ:

δ = δ_H + δ_T + δ_θ

Specifically, δ_H denotes the elastic deformation under hydraulic load, δ_T denotes the recoverable deformation affected by temperatures, and δ_θ denotes the irreversible dam deformation caused by dam material aging. The details are as follows.
In the HTairT model, the deformation of high arch dams affected by temperature effects can usually be characterized by the following formula:

δ_T = b_0·T_0 + b_1·T̄_1 + … + b_m·T̄_m

where m represents the number of lag components, T_0 represents the daily average temperature of the current date, T̄_i represents the segment-average temperature over the corresponding preceding monitoring days, and b_0, b_1, …, b_m denote the regression coefficients.
Specifically, the measured daily temperature data and segment averages of its lagged values over up to one year are used as the temperature influencing factors, giving a total of 12 temperature variable factors. These variables comprise the original temperature monitoring data and averages of its past values; for example, T3_7 represents the average air temperature from the third to the seventh day in the past.
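As a concrete illustration, the segment-average temperature factors can be computed directly from a daily air temperature series. The sketch below is plain Python with illustrative lag windows; the paper names T1_2, T3_7, and T31_60 among its 12 factors but does not enumerate the full set, so the window list here is an assumption.

```python
def segment_averages(temps, t, windows=((1, 2), (3, 7), (8, 15), (16, 30), (31, 60))):
    """Segment-average air-temperature factors for day index t.

    temps   : list of daily mean air temperatures (temps[0] = earliest day)
    windows : (start, end) lag ranges in days; e.g. (3, 7) averages the
              temperatures from 3 to 7 days before day t (illustrative set).
    Returns a dict mapping factor names (T0, T1_2, ...) to values.
    """
    factors = {"T0": temps[t]}  # current-day mean temperature
    for start, end in windows:
        seg = temps[t - end : t - start + 1]  # days t-end .. t-start
        factors[f"T{start}_{end}"] = sum(seg) / len(seg)
    return factors

# toy series: 100 days of synthetic temperatures
temps = [10 + 0.1 * d for d in range(100)]
f = segment_averages(temps, 99)
print(f["T0"], f["T1_2"], f["T3_7"])
```

Each call returns the current-day temperature plus one segment average per lag window; these values form the thermal part of the model's input vector.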
The time component of the deformation of concrete dams can be expressed by the following formula:

δ_θ = c_1·θ + c_2·ln θ,  θ = t/100

where t denotes the cumulative number of days between the current and the initial monitoring dates, and c_1 and c_2 denote the regression coefficients. The time-varying effect represents changes in dam structural performance and deformation due to variations in material properties over time.
Based on the above formulas, dam deformation can be expressed as follows:

δ = a_0 + a_1·H + a_2·H² + a_3·H³ + a_4·H⁴ + b_0·T_0 + b_1·T̄_1 + … + b_m·T̄_m + c_1·θ + c_2·ln θ

where a_0, …, a_4, b_0, …, b_m, c_1, and c_2 denote the regression coefficients, H denotes the upstream water level, and T̄_i denotes the average value of the historical temperature data over the corresponding time window.
2.2. The Improved CatBoost-Based Regressor
CatBoost is a gradient-boosting algorithm that builds decision trees sequentially, with each tree correcting the errors (residuals) from the previous one, thereby reducing the overall loss.
Figure 2 illustrates the process of gradient boosting, which is the foundation of the CatBoost model. In gradient boosting, multiple decision trees are built sequentially, with each new tree aiming to minimize the loss (or error) from the previous trees. The model uses ordered boosting to prevent overfitting and handles categorical features directly without extensive preprocessing, making it efficient and accurate, especially for datasets with mixed feature types. The iterative process of training trees in sequence allows CatBoost to effectively capture complex relationships in the data, leading to high prediction accuracy, especially for datasets with high-dimensional feature data.
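The residual-correcting loop described above can be sketched in a few lines of plain Python using one-split regression trees (stumps) as the weak learners. This is a simplified stand-in for CatBoost's boosting procedure, not its actual implementation.

```python
def fit_stump(x, residuals):
    """Best single-threshold regressor on 1-D inputs (least squares)."""
    best = None
    for thr in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda xi: lmean if xi <= thr else rmean

def boost(x, y, rounds=20, lr=0.3):
    """Each new stump fits the residuals left by the previous ensemble."""
    pred = [0.0] * len(x)
    stumps = []
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(lr * s(xi) for s in stumps)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 1.1, 3.9, 4.1, 4.0]
model = boost(x, y)
print([round(model(xi), 2) for xi in x])
```

With each round, the ensemble's squared error can only shrink, which is the essence of the sequential loss reduction described above; CatBoost adds ordered boosting and categorical handling on top of this basic scheme.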
Assume that there is a dam deformation monitoring dataset D = {(x_i, y_i)}, i = 1, …, n, in which x_i represents the feature vector affecting the dam deformation and y_i denotes the corresponding dam deformation label value. A categorical input feature can be converted to a numerical one using target statistics:

x̂_i^k = Σ_{j=1..n} [x_j^k = x_i^k]·y_j / Σ_{j=1..n} [x_j^k = x_i^k]

where x̂_i^k denotes the encoded value of the k-th categorical feature for the i-th sample, the indicator [x_j^k = x_i^k] equals 1 when the categorical feature of sample j matches that of sample i and 0 otherwise, y_j denotes the target value associated with sample j, and n denotes the total number of samples in the dataset.
To prevent overfitting, CatBoost first randomly permutes the dataset to generate a random sequence σ. The categorical feature value of each sample is then converted into a numerical value using the mean of the labels of the samples that precede it in the permutation, to which a prior value p with weight a > 0 is added:

x̂_{σ_i}^k = ( Σ_{j=1..i−1} [x_{σ_j}^k = x_{σ_i}^k]·y_{σ_j} + a·p ) / ( Σ_{j=1..i−1} [x_{σ_j}^k = x_{σ_i}^k] + a )

where x̂_{σ_i}^k denotes the encoded value of the categorical feature after permutation, p denotes the prior value, often the global mean of the target variable, and a > 0 denotes the smoothing parameter controlling the weight of the prior value in the encoding. The summation runs only over the samples at positions 1 to i − 1, so only past samples are used to encode the current value, thus preventing target leakage.
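A minimal sketch of this ordered target-statistics encoding, assuming a single categorical feature and one random permutation (CatBoost itself uses several permutations and further refinements), might look as follows.

```python
import random

def ordered_target_encoding(categories, targets, prior, a=1.0, seed=0):
    """Ordered target statistics (simplified sketch of CatBoost's idea).

    Each sample's categorical value is replaced by
    (sum of targets of *earlier* samples with the same category + a * prior)
    / (count of earlier samples with the same category + a),
    computed along one random permutation, so a sample never sees its own label.
    """
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # random permutation of the dataset
    sums, counts = {}, {}
    encoded = [0.0] * n
    for idx in order:
        c = categories[idx]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        encoded[idx] = (s + a * prior) / (k + a)  # only past samples used
        sums[c] = s + targets[idx]
        counts[c] = k + 1
    return encoded

cats = ["low", "high", "low", "high", "low"]
y = [1.0, 3.0, 2.0, 4.0, 3.0]
enc = ordered_target_encoding(cats, y, prior=sum(y) / len(y))
print(enc)
```

Note that the first sample visited in the permutation has no history, so its encoding falls back to the prior value, exactly as the smoothing term in the formula dictates.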
Then, the CatBoost algorithm generates a tree, processing no feature combinations at the first split. At each subsequent split, it combines all the categorical features and combinations already used in the tree with all the categorical features in the dataset, thereby generating new combination features.
2.3. Improved Particle Swarm Optimization Algorithm
PSO is a population-based technique, which uses multiple particles that form a swarm [24]. Each particle represents a candidate solution within the swarm, with all candidates co-existing and cooperating simultaneously. Each particle moves through the search space, aiming to find the optimal solution. The search space thus represents the set of all possible solutions, while the swarm of particles symbolizes the evolving candidate solutions. During iterations, each particle tracks both its personal best solution (optimum) and the swarm's best overall solution. Based on this information, each particle adjusts its velocity and position. Specifically, each particle dynamically updates its velocity, influenced by its own experience and that of neighboring particles. Similarly, it adjusts its position using information about its current location, velocity, and the distances to both its personal best and the swarm's best solutions.
Figure 3 illustrates the position update mechanism in the PSO algorithm. In PSO, each particle represents a potential solution, and its position in the search space is influenced by both its own historical best position and the global best position found by the entire swarm. The velocity directs the particle’s movement, determining its trajectory towards these two influential points. At each iteration, the particle updates its position by balancing the exploration of the search space (based on its current velocity) and the exploitation of known good solutions. This process enables the particle to progressively approach the optimal solution.
In an N-dimensional search space, the position of particle i is X_i = (x_{i1}, x_{i2}, …, x_{iN}) and its flight velocity is V_i = (v_{i1}, v_{i2}, …, v_{iN}). Particles update their velocity and position using the following formulas:

v_{id}(t+1) = w·v_{id}(t) + c_1·r_1·(p_{id} − x_{id}(t)) + c_2·r_2·(p_{gd} − x_{id}(t))
x_{id}(t+1) = x_{id}(t) + v_{id}(t+1)

where w is the inertia weight, which reflects the impact of a particle's history on its present motion, c_1 and c_2 are the learning factors, r_1 and r_2 are random numbers uniformly distributed in [0, 1], p_{id} is the individual optimal position at iteration t, and p_{gd} is the global optimal position at iteration t. The process of CatBoost parameter optimization using the improved particle swarm optimization algorithm can be seen in Figure 4. The specific calculation process is as follows:
Step 1: Initialization. Each particle, representing a potential set of CatBoost hyperparameters, is initialized with a random position and velocity within the defined search space. The position of each particle corresponds to a specific combination of hyperparameters, while the velocity controls the direction and speed with which the particle explores the search space.
Step 2: Fitness Calculation. The fitness of each particle is evaluated by training a CatBoost model using the hyperparameters represented by the particle's position. The model's performance is assessed based on a predefined metric, such as mean squared error (MSE) or accuracy, on a validation set. This fitness value determines how well the current set of hyperparameters performs.
Step 3: Update Personal Best and Global Best. For each particle, its best-known position is updated if the current fitness is better than the fitness at its previous best position. Additionally, the global best position is updated based on the particle with the best fitness across the entire swarm.
Step 4: Update Velocity and Position. Each particle's velocity is updated according to the update formulas, enabling the particles to balance exploration and exploitation in the search space. The new position of each particle is then calculated by adding the updated velocity to the particle's current position.
Step 5: Check Optimization Criteria. After each iteration, a check is performed to see if the optimization criteria are met. Common stopping conditions include reaching a maximum number of iterations or achieving a fitness value within a predefined threshold.
Step 6: Termination or Repeat. If the optimization criteria are satisfied, the algorithm terminates, and the optimal hyperparameters found are returned. Otherwise, the process loops back to Step 2 for another iteration, with updated positions and velocities, refining the search for the best hyperparameters.
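Steps 1–6 can be sketched as a compact PSO loop in plain Python. In the paper's setting, the `fitness` argument would train a CatBoost model on the candidate hyperparameters and return a validation error; here it is replaced by a cheap analytic stand-in (minimum at learning rate 0.1, depth 6) so the sketch runs on its own, and the bounds are illustrative.

```python
import random

def pso(fitness, bounds, swarm=20, iters=50, w=0.9, c1=1.5, c2=1.5, seed=1):
    """Minimal PSO following Steps 1-6; `fitness` is minimized."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Step 1: random positions and zero velocities within the search space
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(swarm)]
    vel = [[0.0] * dim for _ in range(swarm)]
    pbest = [p[:] for p in pos]
    pcost = [fitness(p) for p in pos]          # Step 2: initial fitness
    g = min(range(swarm), key=lambda i: pcost[i])
    gbest, gcost = pbest[g][:], pcost[g]       # Step 3: initial global best
    for _ in range(iters):                     # Steps 5-6: iterate until done
        for i in range(swarm):
            for d in range(dim):               # Step 4: velocity/position update
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            cost = fitness(pos[i])
            if cost < pcost[i]:                # Step 3: update bests
                pbest[i], pcost[i] = pos[i][:], cost
                if cost < gcost:
                    gbest, gcost = pos[i][:], cost
    return gbest, gcost

# stand-in "validation loss" with a known optimum
loss = lambda p: (p[0] - 0.1) ** 2 + (p[1] - 6.0) ** 2
best, cost = pso(loss, bounds=[(0.01, 0.3), (4, 10)])
print(best, cost)
```

Swapping the stand-in `loss` for a function that fits CatBoost and scores a validation set turns this sketch into the hyperparameter search described above.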
4. Experimental Results and Analysis
Both the training and validation evaluation of the proposed method were performed on the same workstation. The high-performance computer for data experiments and testing evaluations includes the following configuration: an Intel Core i5-12500 CPU for parallel processing, NVIDIA RTX 4060Ti for ML acceleration, 32 GB DDR5 RAM for handling large datasets, and 1TB NVMe SSD for fast data access. The system runs on Windows 11 Pro, with software environments like Anaconda for Python 3.9 management, Jupyter Lab for coding, and Docker for containerization. Pre-installed libraries include Scikit-learn and CatBoost for efficient ML workflows.
4.1. The Impact of Different Input Factors on Model Prediction Accuracy
Firstly, to evaluate the influence of different input factor variables on the construction of the dam deformation monitoring model, three factor models were compared, including the HST, HTairT, and HTT models, each fitted with multiple linear regression (MLR). Figure 8 shows the predictive results of MLR on the test sets using the different input variables, and Table 1 evaluates the model prediction performance with the different input factors. It can be seen that the HTairT-based dam deformation monitoring model shows the highest accuracy, with R² values of 0.929 for PL13-1 and 0.925 for PL13-2, and follows the true values closely, indicating that the use of past temperature monitoring data and its hysteresis factors can effectively reflect the thermal effect of temperature on the deformation of concrete dams. Here, the hysteresis factors capture the influence of air temperature and its lagged terms on dam deformation. The HST-based model uses simple harmonic factors to simulate the portion of dam deformation driven by the temperature field, so its simulation effect is limited, which restricts the model's prediction accuracy and generalization ability. Although the HTT-based model has a large amount of measured thermometer data, the high-dimensional nonlinear input variables brought by excessive thermometer data can easily degrade the generalization ability and robustness of a statistical regression model. Based on the above analysis, modeling with measured air temperature data can significantly improve the predictive capability of statistical models by accurately simulating the thermal effect of the temperature field on dam deformation without introducing excessive variables, thereby improving the generalization performance and robustness of the predictive model.
4.2. Model Hyperparameter Selection and Optimization Process
Based on the constructed HTairT dam deformation monitoring model, further research on parameter optimization methods for ML algorithms is needed. The selection and tuning of the hyperparameters of the CatBoost regressor are crucial for achieving optimal performance in dam deformation prediction. In this study, four crucial hyperparameters, i.e., the learning rate, depth, L2 leaf regularization, and bagging temperature, are selected for optimization. Below is an analysis of each chosen parameter, its rationale for inclusion in the optimization process, and its impact on the model.
(1) Learning Rate: The learning rate governs the magnitude of updates to the model parameters during each iteration. It directly influences how quickly the CatBoost model converges to an optimal solution. A lower learning rate (e.g., 0.01) ensures more gradual learning, reducing the likelihood of overfitting, but increasing the number of iterations required for convergence. A higher learning rate (e.g., 0.3) speeds up the convergence process but risks overshooting the optimal solution, potentially leading to suboptimal performance. By setting a range between 0.01 and 0.3, the optimization process can balance convergence speed with model stability.
(2) Depth: The depth parameter defines the maximum number of splits in a tree, controlling the model’s complexity. Deeper trees can capture more intricate patterns in the data but are more prone to overfitting. The range of 4 to 10 is selected to provide a balance between model complexity and generalization. Shallow trees (depth < 4) may be too simplistic to capture the underlying relationships in the data, while deeper trees (depth > 10) could lead to overfitting by fitting noise in the training data. This range allows the model to adequately capture non-linear patterns while mitigating the risk of overfitting.
(3) L2 Leaf Regularization: L2 regularization applies a penalty to the model’s leaf values, helping to prevent overfitting by discouraging overly large weights in the trees. Regularization is essential in controlling the complexity of the model. A higher L2 regularization value (closer to 10) imposes a stronger penalty on the leaf values, which can reduce the model’s tendency to overfit the training data. A lower value (closer to 1) allows the model more flexibility to fit the data but may lead to overfitting. The chosen range provides the necessary flexibility to explore various regularization strengths.
(4) Bagging Temperature: Bagging temperature introduces randomness into the sampling process during each iteration, affecting the diversity of the trees. A higher bagging temperature increases the randomness, while a lower value results in more deterministic sampling. By varying the bagging temperature, the model can explore different levels of randomness in the training data, which affects model robustness. A value of 0 ensures deterministic bagging, leading to more stable results, while a higher temperature adds more randomness, which can enhance generalization and reduce overfitting. The range of 0 to 1 allows for experimentation with different levels of randomness, potentially improving model performance.
Based on the above analysis, the model hyperparameters and their value ranges to be optimized in this study are shown in Table 2. The selection of parameters for PSO is crucial for balancing exploration and exploitation during the search process. With reference to previous research results, a well-configured set of PSO parameters is essential for achieving optimal performance in optimization tasks. A swarm size of 50 is a moderate choice, balancing computational cost and solution diversity. An inertia weight of 0.9 helps maintain a balance between exploration and exploitation. The cognitive coefficient and social coefficient, both set at 1.5, ensure that particles learn from their own experience and collaborate with others in the swarm. A maximum velocity of 10 prevents excessive movement, and a stopping criterion of 100 iterations or convergence ensures the algorithm terminates appropriately. These parameters offer stability and efficiency in finding optimal solutions for ML algorithms.
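For reference, the PSO settings quoted above and the hyperparameter ranges discussed earlier can be collected into a small configuration fragment. The search-space keys follow CatBoost's Python parameter names; the exact contents of Table 2 are assumed to match the ranges stated in the text.

```python
# PSO settings as described in the text
pso_config = {
    "swarm_size": 50,
    "inertia_weight": 0.9,
    "cognitive_coefficient": 1.5,
    "social_coefficient": 1.5,
    "max_velocity": 10,
    "max_iterations": 100,
}

# CatBoost hyperparameter search ranges (lower, upper) as discussed above;
# keys follow CatBoost's Python parameter naming
search_space = {
    "learning_rate": (0.01, 0.3),
    "depth": (4, 10),
    "l2_leaf_reg": (1.0, 10.0),
    "bagging_temperature": (0.0, 1.0),
}
print(pso_config["swarm_size"], search_space["depth"])
```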
Figure 9 illustrates the parameter optimization process of the CatBoost regressor using improved PSO algorithms. It can be seen that the loss (negative MSE) initially shows fluctuations due to the exploration phase, with a notable spike around epoch 20. As optimization progresses, the loss gradually stabilizes, particularly after epoch 40, indicating that the improved PSO algorithm has transitioned into the exploitation phase. This consistent reduction in loss values highlights the efficiency of the improved PSO in fine-tuning hyperparameters of the CatBoost algorithm, leading to better convergence and an optimal model performance.
The optimization effect of the particle swarm optimization algorithm can be obtained through multiple accuracy evaluation indicators on the validation set.
Figure 10 demonstrates the calculation results of multiple evaluation indicators of the validation set using the improved PSO algorithm for optimizing CatBoost hyperparameters. It can be inferred that, as the epochs progress, metrics like MSE and MAE show a clear decreasing trend, indicating that PSO effectively explores the hyperparameter space and converges toward optimal solutions. The initial fluctuations highlight the algorithm’s exploration phase, which gradually stabilizes, leading to consistent improvements in predictive performance. The relatively low and stable MAE values also underscore the robustness of the model. Overall, the improved PSO algorithm facilitates faster convergence and enhanced model accuracy of the constructed CatBoost dam deformation monitoring model.
The optimized hyperparameters for CatBoost at monitoring points PL13-1 and PL13-2 are shown in
Table 3. At monitoring point PL13-1, the model achieves a loss of −0.026 with a learning rate of 0.061, a depth of 4, an L2 leaf regularization of 2.386, and a bagging temperature of 0.368. At monitoring point PL13-2, the model is slightly more aggressive, with a learning rate of 0.145 and a higher L2 leaf regularization of 4.145, indicating a stronger regularization effect to control overfitting. Despite the higher regularization, the depth remains 4, and the bagging temperature is reduced to 0.196, suggesting a more conservative approach to data sampling. Overall, these parameter settings demonstrate the developed dam deformation prediction model's adaptability to different conditions at each monitoring point, optimizing predictive accuracy while minimizing errors.
4.3. Comparison of Prediction Accuracy of Different ML Methods
Secondly, a performance comparison of different prediction methods is carried out. The proposed method is compared against several widely used ML algorithms, with R², MAE, and MSE employed to assess the predictive accuracy and generalization capability of each method. The evaluation is performed from two perspectives: comparing factor models and assessing the accuracy of different prediction methods.
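The three scores used throughout this comparison are straightforward to compute; a minimal helper in plain Python, run here on illustrative data, is:

```python
def regression_metrics(y_true, y_pred):
    """R^2, MSE, and MAE as used to score the deformation models."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return {
        "R2": 1 - ss_res / ss_tot,          # fraction of variance explained
        "MSE": ss_res / n,                   # mean squared error
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }

# toy example: four observed vs. predicted deformation values
m = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(m)
```

Equivalent implementations exist in scikit-learn (`r2_score`, `mean_squared_error`, `mean_absolute_error`); the inline version makes the definitions explicit.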
To verify the effectiveness of the proposed PSO-based CatBoost method in predicting dam deformation with respect to both accuracy and generalization performance, a series of comparative experiments was conducted using well-established ML models, including RF [25], artificial neural networks (ANN) [26], the gradient boosting tree (GBT) [27], support vector regression (SVR) [28], and AdaBoost [29]. The principles of the comparison methods are introduced as follows.
RF. This is an ensemble learning technique that builds multiple decision trees during training and merges them to improve the accuracy and robustness of predictions. The method effectively handles both classification and regression problems, particularly excelling in scenarios with complex interactions between variables and noisy data. It reduces overfitting by averaging the results of individual trees, thus increasing generalizability.
ANN. This is a computational model inspired by the human brain, consisting of interconnected neurons organized into layers. ANN is highly adaptable and capable of capturing complex, non-linear relationships between input and output variables. However, it can be prone to overfitting and requires significant computational resources for training.
GBT. Gradient Boosting is an iterative algorithm that builds models sequentially by correcting the errors of previous models. Each new model is optimized to minimize the residuals of the combined model, leading to a highly accurate prediction model. Gradient Boosting is effective in reducing bias and variance, making it suitable for a wide range of complex prediction tasks.
SVR. SVR is a machine learning technique based on Support Vector Machines (SVM) for regression tasks. It maps input data to a high-dimensional space and fits a hyperplane to capture relationships between variables. SVR is effective in handling non-linear patterns, especially with high-dimensional data. By focusing on key training points (support vectors), SVR enhances model robustness and reduces overfitting, making it suitable for noisy data and complex predictions.
AdaBoost. This is an ensemble method that adjusts the weight of incorrectly classified instances in each iteration, focusing more on difficult cases. By iteratively combining weak learners, AdaBoost constructs a strong learner with improved accuracy. It is simple to implement and effective in handling both classification and regression problems, particularly in the presence of noisy data.
Figure 11 presents a quantitative comparison of the results of the different prediction methods, and Figure 12 shows the prediction results and residual distribution process lines of each method. It is evident that the developed PSO-optimized CatBoost model consistently exhibits superior performance. For instance, at monitoring point PL13-1, the PSO-CatBoost model achieves the highest R² value of 0.978, coupled with the lowest MSE (1.144) and MAE (1.933), indicating its ability to accurately predict dam deformation patterns. In contrast, the SVR method shows the poorest performance at PL13-1, with a significantly lower R² of 0.701 and much higher error values, including an MAE of 26.455 and an MSE of 4.719, demonstrating its inability to capture the deformation trends accurately. Similarly, at monitoring point PL13-2, the PSO-CatBoost model continues to outperform the other methods, achieving an R² of 0.973, an MSE of 1.168, and an MAE of 1.376. RF and GBT also show strong results, with R² values of 0.956 and 0.951, respectively.
The comprehensive analysis of R², MSE, and MAE values across different models highlights the superior accuracy and robustness of the PSO-optimized CatBoost model. It can accurately follow both short-term fluctuations and long-term trends in dam deformation, particularly during periods of rapid change. This makes it a more reliable and effective method for dam deformation prediction than traditional methods, which exhibit higher error rates and lower R² values. Thus, the enhanced PSO-optimized CatBoost model demonstrates a clear advantage in handling complex, dynamic environmental data related to dam deformation, confirming its robustness and reliability in predicting structural behavior.
4.4. Assessment of the Importance of Influencing Factors
Figure 13 shows the importance analysis of the factors affecting dam deformation at monitoring points PL13-1 and PL13-2 using the developed CatBoost method. From the feature importance plot, it is evident that the water level factor H1 plays a dominant role in dam deformation, indicating that hydrostatic pressure significantly affects the stability of the dam structure. Next, the air temperature-related factors, including the T31_60, T1_2, and T3_7 temperature variables, also show substantial importance, suggesting that temperature variations, along with their lagged effects, contribute to the dam's deformation through material expansion and contraction over time. Additionally, the time-dependent factors, including the t1 and t2 variables, exhibit moderate importance, implying that the deformation process evolves with time, potentially capturing long-term degradation. In contrast, the higher-order water level terms, including H2, H3, and H4, have a minimal impact, indicating that the deformation is more sensitive to the first-order water level component. This analysis highlights the critical need to monitor water level and temperature variations as key contributors to dam deformation.
5. Conclusions
In this study, we developed an advanced approach for monitoring and predicting dam deformation by leveraging intelligent optimization, machine learning techniques, and long-term air temperature data. By integrating the CatBoost 1.0 algorithm, a high-performance gradient boosting method, with an enhanced PSO algorithm, we effectively modeled the nonlinear relationship between environmental factors and dam deformation behavior. A case study of a large concrete arch dam demonstrated that the model achieves robust prediction accuracy and strong generalization, validated across multiple deformation monitoring points. The experimental results reveal that integrating temperature data with intelligent optimization significantly enhances prediction precision compared with conventional methods. Furthermore, comparative evaluations showed that the model outperforms traditional machine learning algorithms, with higher R2 values and lower MAE and MSE metrics, underscoring its accuracy and reliability for dam deformation prediction.
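To make the optimization component concrete, the following is a bare-bones particle swarm optimizer in plain Python. It implements standard PSO (not the paper's enhanced variant) and minimizes a simple quadratic stand-in objective; in the actual workflow, the objective would be the cross-validation error of a CatBoost model evaluated at candidate hyperparameters such as the learning rate and tree depth. All names and bounds here are assumptions for illustration.

```python
import random

def pso(objective, bounds, n_particles=20, n_iter=100,
        w=0.7, c1=1.5, c2=1.5, seed=42):
    """Standard PSO: minimize `objective` over box-constrained `bounds`."""
    rnd = random.Random(seed)
    dim = len(bounds)
    pos = [[rnd.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # personal best positions
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # global best
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rnd.random(), rnd.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the updated position to the search bounds.
                pos[i][d] = min(max(pos[i][d] + vel[i][d],
                                    bounds[d][0]), bounds[d][1])
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Stand-in objective: distance from a hypothetical optimum
# (learning_rate = 0.1, depth = 6); in practice this would be CV error.
obj = lambda p: (p[0] - 0.1) ** 2 + (p[1] - 6.0) ** 2
best, best_val = pso(obj, [(0.01, 0.3), (2.0, 10.0)])
print(best, best_val)
```

Because each objective evaluation would retrain a model, modest swarm sizes and iteration counts (as above) are the usual trade-off between search quality and training cost.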
The proposed predictive model holds significant promise for practical application in dam safety monitoring. By integrating this model into existing monitoring systems, real-time data analysis can be enhanced, enabling improved early warning capabilities and supporting informed decision-making for dam safety management. This integration would allow for a more proactive approach to monitoring, where potential deformation risks could be identified promptly based on environmental and operational data. Additionally, the model’s adaptability to diverse monitoring environments suggests its applicability across various dam types and regions. However, successful implementation will require addressing certain technical requirements, such as data integration protocols and computational resources, as well as overcoming challenges related to scaling the model for different dam structures. Overall, this model presents a valuable tool for advancing the reliability and responsiveness of dam safety monitoring systems.
However, several limitations of the current model should be addressed in future research to enhance its robustness and adaptability. First, while the model has proven effective for predicting horizontal deformation, expanding the range of input variables to include additional environmental factors, such as humidity, wind load, and fluctuating water levels, could significantly improve its predictive accuracy and applicability in more complex conditions. Incorporating these variables would allow the model to account for a broader spectrum of environmental influences, which could be especially valuable in regions with varying climates and weather patterns. Second, the model’s performance may vary across different dam types and structural designs, highlighting the need for further investigation into its adaptability and effectiveness in diverse engineering scenarios. Future research should focus on testing and fine-tuning the model for various dam structures, including arch dams, gravity dams, rockfill dams, and levees, to establish a more universal predictive framework. In addition, the proposed method will be extended beyond horizontal deformation prediction to include seepage prediction, as well as vertical and complex deformation predictions in hydraulic structures, such as rockfill dams and levees. These enhancements aim to provide a more comprehensive predictive tool that could contribute to safer and more efficient management of hydraulic structures under different operational conditions. Ultimately, this research will guide the development of a predictive system capable of supporting preventive maintenance, risk assessment, and decision-making processes in real-time safety monitoring of critical hydraulic infrastructures.