1. Introduction
Nearly 100,000 dams are in service in China; many were built in the 1950s–1970s and are now approaching or exceeding their design service life. Dam safety monitoring systems have been widely applied to newly constructed dams and to old dams after reinforcement [1]. Monitoring systems sense environmental quantities and the corresponding physical information related to the structural response, such as dam deformation [2]. Monitoring variables such as deformation often show abnormal signals when a dam structure suffers damage or unconventional loads, playing an important role in identifying potential failure risks.
In recent decades, data-driven dam behavior monitoring, forecasting, and interpretation methods have aroused wide research interest in the dam safety monitoring community [3,4,5]. Dam deformation is often taken as the main research object because it intuitively reflects the response of dam structures under the coupled effect of environmental factors and external loads [6,7]. In particular, statistical predictive algorithms have been widely adopted for dam deformation prediction in practice, benefitting from their mature theory and simple modeling process [8,9,10]. The selection of input variables is an important modeling basis for statistical dam deformation methods, and factor selection is a key issue affecting the performance of data-driven dam behavior prediction models [11]. The hydraulic–seasonal–time (HST) model is one of the most commonly used factor models, in which dam deformation is attributed to three parts: water level, thermal variation, and time-varying effects. However, the HST model struggles to represent the effect of the actual, highly complex temperature field through a combination of simple harmonic functions, and its assumption of independent input variables is hard to satisfy in practice because air temperature and water level changes are correlated.
To address this limitation, improved HST models, such as the hydraulic–temperature–time (HTT) model, have been proposed to represent thermal effects for dam deformation prediction using prototypical thermometer data [12]. In the HTT model, the temperature data measured by thermometers embedded in the dam body and its foundation are used as the input variables [13]. However, challenges remain in optimizing the scale of thermometer placement. A high density of thermometers leads to high-dimensional nonlinearity, degrading model performance, whereas too few thermometers may provide insufficient data, limiting the model's predictive accuracy. In addition, modeling based on measured thermometer data differs significantly between dam types, making it difficult to obtain a universal method. To overcome these limitations, this paper proposes a dam deformation monitoring model using measured air temperature data and its hysteresis factors, called the hydraulic–air temperature–time (HTairT) model. In the HTairT model, long-term prototypical air temperature data are utilized to simulate the thermal effect, which has the advantages of strong universality and wide applicability [14].
Apart from appropriate input variables in the causal model, the fitting capability and generalization ability of the regression model also determine dam deformation prediction performance [15,16]. Multiple linear regression (MLR) and its improved variants are widely used for regression modeling in dam safety monitoring. However, statistical methods perform poorly in dealing with nonlinear relationships between input variables, and accurately simulating and evaluating the nonlinear relationship between numerous input variables and the effect quantities remains a challenging task.
In recent years, with the rapid development of artificial intelligence (AI) technology, the use of the powerful nonlinear fitting capabilities of machine learning (ML) to construct dam safety monitoring models has received extensive research attention [17,18,19]. A series of ML-based algorithms, such as the support vector machine (SVM), random forest (RF), and Gaussian process regression (GPR), have been introduced to simulate the nonlinear mapping between environmental variables and dam effect variables [20,21]. For instance, Kang et al. [22] developed a GPR-based deformation prediction model for concrete gravity dams. Liu et al. [23] proposed a combined prediction model for long-term deformation using the long short-term memory (LSTM) network. Dai et al. [23] developed an RF-based deformation prediction model for concrete dams. These references show that ML-based algorithms offer significant advantages for concrete dam deformation monitoring models. However, the aforementioned studies primarily rely on single-factor models or individual machine-learning regression strategies. Few studies have explored coupling dam deformation causal models with intelligent computing approaches, which could offer a more comprehensive understanding of dam deformation behavior prediction.
To address the aforementioned challenges, this study proposes a method for monitoring dam deformation and predicting behavior based on intelligent optimization, ML, and measured air temperature data. Initially, long-term multi-year temperature data, along with their lagged terms, are utilized as temperature factors to construct the HTairT model for predicting dam deformation. Subsequently, CatBoost, a high-performance, open-source gradient boosting algorithm, is employed to model the nonlinear relationships between environmental factors and dam deformation behavior. The optimal parameters for CatBoost are determined using an enhanced particle swarm optimization (PSO) algorithm. A high dam, in operation for several years, serves as the engineering case study, with multiple horizontal deformation monitoring points used to verify and assess the prediction accuracy and generalization performance of the proposed method.
This study makes several important contributions: it presents an improved HTairT deformation monitoring model that incorporates long-term air temperature data and lagged terms to more effectively simulate thermal effects on dam deformation; it introduces a novel approach that combines the PSO algorithm with the CatBoost regressor to optimize model parameters, achieving high predictive accuracy in dam deformation forecasting; and it identifies critical factors influencing dam deformation, specifically emphasizing the impact of water level and average air temperature over various time windows (1–2 days, 3–7 days, and 30–60 days).
The remainder of this paper is organized as follows. Section 2 introduces the basic methodology of dam deformation monitoring, CatBoost, and the improved PSO algorithm. Section 3 presents the engineering background and a statistical analysis of the monitoring data. Section 4 reports a series of comparative experiments that verify the generalization performance and accuracy of the proposed method. Section 5 summarizes the key contributions of this study and outlines our further research plans.
2. Methodology
In this section, the overall architecture of the developed deformation modeling and prediction framework is first introduced to provide an overview of the workflow. Subsequently, the theoretical foundations of the model’s components are further detailed. The specific content is outlined as follows.
2.1. Dam Deformation Monitoring Model Using Observed Air Temperature Data
Figure 1 illustrates the loads and environmental impacts acting on dams in long-term service. Dam deformation is a typical structural response, which can be further divided into recoverable components driven by water pressure and temperature, and irrecoverable components driven by creep, alkali–aggregate reaction, and material aging. Dam deformation is primarily induced by the interaction of complex loads, particularly hydrodynamic pressures and temperature variations. Hydrodynamic pressures exert lateral forces on the upstream face of the dam, which fluctuate with changes in water levels and flow rates, leading to lateral deformation. Additionally, thermal stresses caused by variations in water and air temperatures, as well as solar radiation, can result in differential expansion and contraction within the dam structure. This combination of mechanical and thermal loads can cause the dam to deform excessively or behave abnormally.
In the developed HTairT model using the observed air temperature data, dam deformation δ can be described using the following three components: the hydraulic component δ_H, the thermal component δ_T, and the time-varying component δ_θ:

δ = δ_H + δ_T + δ_θ

Specifically, δ_H denotes the elastic deformation under hydraulic load, δ_T denotes the recoverable deformation affected by temperatures, and δ_θ denotes the irreversible dam deformation caused by dam material aging. The details are as follows.
In the HTairT model, the deformation of high arch dams affected by temperature effects can usually be characterized by the following formula:

δ_T = b_0·T_0 + b_1·T̄_1 + … + b_m·T̄_m

where m represents the number of lag components, T_0 represents the daily average temperature of the current date, T̄_i represents the segment-average temperature over the corresponding preceding monitoring days, and b_0, b_1, …, b_m denote the regression coefficients.
Specifically, the measured daily temperature data and segment averages of its lagged values over up to one year are used as the temperature influencing factors, giving a total of 12 temperature variable factors. These variables comprise the original temperature monitoring data and averages of its past values; for example, T3_7 represents the average air temperature from the third to the seventh day in the past.
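As a concrete illustration, the segment-average temperature factors can be computed directly from a daily air temperature series. The sketch below is plain Python with illustrative lag windows; the paper names T1_2, T3_7, and T31_60 among its 12 factors but does not enumerate the full set, so the window list here is an assumption.

```python
def segment_averages(temps, t, windows=((1, 2), (3, 7), (8, 15), (16, 30), (31, 60))):
    """Segment-average air-temperature factors for day index t.

    temps   : list of daily mean air temperatures (temps[0] = earliest day)
    windows : (start, end) lag ranges in days; e.g. (3, 7) averages the
              temperatures from 3 to 7 days before day t (illustrative set).
    Returns a dict mapping factor names (T0, T1_2, ...) to values.
    """
    factors = {"T0": temps[t]}  # current-day mean temperature
    for start, end in windows:
        seg = temps[t - end : t - start + 1]  # days t-end .. t-start
        factors[f"T{start}_{end}"] = sum(seg) / len(seg)
    return factors

# toy series: 100 days of synthetic temperatures
temps = [10 + 0.1 * d for d in range(100)]
f = segment_averages(temps, 99)
print(f["T0"], f["T1_2"], f["T3_7"])
```

Each call returns the current-day temperature plus one segment average per lag window; these values form the thermal part of the model's input vector.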
The time component of the deformation of concrete dams can be expressed by the following formula:

δ_θ = c_1·θ + c_2·ln θ,  θ = t/100

where t denotes the cumulative number of days between the current and the initial monitoring dates, and c_1 and c_2 denote the regression coefficients. The time-varying effect represents changes in dam structural performance and deformation due to variations in material properties over time.
Based on the above formulas, dam deformation can be expressed as follows:

δ = a_0 + a_1·H + a_2·H² + a_3·H³ + a_4·H⁴ + b_0·T_0 + b_1·T̄_1 + … + b_m·T̄_m + c_1·θ + c_2·ln θ

where a_0, …, a_4, b_0, …, b_m, c_1, and c_2 denote the regression coefficients, H denotes the upstream water level, and T̄_i denotes the average value of the historical temperature data over the corresponding time window.
2.2. The Improved CatBoost-Based Regressor
CatBoost is a gradient-boosting algorithm that builds decision trees sequentially, with each tree correcting the errors (residuals) from the previous one, thereby reducing the overall loss.
Figure 2 illustrates the process of gradient boosting, which is the foundation of the CatBoost model. In gradient boosting, multiple decision trees are built sequentially, with each new tree aiming to minimize the loss (or error) from the previous trees. The model uses ordered boosting to prevent overfitting and handles categorical features directly without extensive preprocessing, making it efficient and accurate, especially for datasets with mixed feature types. The iterative process of training trees in sequence allows CatBoost to effectively capture complex relationships in the data, leading to high prediction accuracy, especially for datasets with high-dimensional feature data.
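The residual-correcting loop described above can be sketched in a few lines of plain Python using one-split regression trees (stumps) as the weak learners. This is a simplified stand-in for CatBoost's boosting procedure, not its actual implementation.

```python
def fit_stump(x, residuals):
    """Best single-threshold regressor on 1-D inputs (least squares)."""
    best = None
    for thr in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda xi: lmean if xi <= thr else rmean

def boost(x, y, rounds=20, lr=0.3):
    """Each new stump fits the residuals left by the previous ensemble."""
    pred = [0.0] * len(x)
    stumps = []
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(lr * s(xi) for s in stumps)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 1.1, 3.9, 4.1, 4.0]
model = boost(x, y)
print([round(model(xi), 2) for xi in x])
```

With each round, the ensemble's squared error can only shrink, which is the essence of the sequential loss reduction described above; CatBoost adds ordered boosting and categorical handling on top of this basic scheme.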
Assume that there is a dam deformation monitoring dataset D = {(x_i, y_i)}, i = 1, …, n, in which x_i represents the feature vector affecting the dam deformation and y_i denotes the corresponding dam deformation label value. A categorical input feature can be converted to a numerical one using target statistics:

x̂_i^k = Σ_{j=1..n} [x_j^k = x_i^k]·y_j / Σ_{j=1..n} [x_j^k = x_i^k]

where x̂_i^k denotes the encoded value of the k-th categorical feature for the i-th sample, the indicator [x_j^k = x_i^k] equals 1 when the categorical feature of sample j matches that of sample i and 0 otherwise, y_j denotes the target value associated with sample j, and n denotes the total number of samples in the dataset.
To prevent overfitting, CatBoost first randomly permutes the dataset to generate a random sequence σ. The categorical feature value of each sample is then converted into a numerical value using the mean of the labels of the samples that precede it in the permutation, to which a prior value p with weight a > 0 is added:

x̂_{σ_i}^k = ( Σ_{j=1..i−1} [x_{σ_j}^k = x_{σ_i}^k]·y_{σ_j} + a·p ) / ( Σ_{j=1..i−1} [x_{σ_j}^k = x_{σ_i}^k] + a )

where x̂_{σ_i}^k denotes the encoded value of the categorical feature after permutation, p denotes the prior value, often the global mean of the target variable, and a > 0 denotes the smoothing parameter controlling the weight of the prior value in the encoding. The summation runs only over the samples at positions 1 to i − 1, so only past samples are used to encode the current value, thus preventing target leakage.
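A minimal sketch of this ordered target-statistics encoding, assuming a single categorical feature and one random permutation (CatBoost itself uses several permutations and further refinements), might look as follows.

```python
import random

def ordered_target_encoding(categories, targets, prior, a=1.0, seed=0):
    """Ordered target statistics (simplified sketch of CatBoost's idea).

    Each sample's categorical value is replaced by
    (sum of targets of *earlier* samples with the same category + a * prior)
    / (count of earlier samples with the same category + a),
    computed along one random permutation, so a sample never sees its own label.
    """
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # random permutation of the dataset
    sums, counts = {}, {}
    encoded = [0.0] * n
    for idx in order:
        c = categories[idx]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        encoded[idx] = (s + a * prior) / (k + a)  # only past samples used
        sums[c] = s + targets[idx]
        counts[c] = k + 1
    return encoded

cats = ["low", "high", "low", "high", "low"]
y = [1.0, 3.0, 2.0, 4.0, 3.0]
enc = ordered_target_encoding(cats, y, prior=sum(y) / len(y))
print(enc)
```

Note that the first sample visited in the permutation has no history, so its encoding falls back to the prior value, exactly as the smoothing term in the formula dictates.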
Then, the CatBoost algorithm generates a tree, processing no feature combinations at the first split. At each subsequent split, it combines all the categorical features and combinations already used in the tree with all the categorical features in the dataset, thereby generating new combination features.
2.3. Improved Particle Swarm Optimization Algorithm
PSO is a population-based technique, which uses multiple particles that form a swarm [24]. Each particle represents a candidate solution within the swarm, with all candidates co-existing and cooperating simultaneously. Each particle moves through the search space, aiming to find the optimal solution. The search space thus represents the set of all possible solutions, while the swarm of particles symbolizes the evolving candidate solutions. During iterations, each particle tracks both its personal best solution (optimum) and the swarm's best overall solution. Based on this information, each particle adjusts its velocity and position. Specifically, each particle dynamically updates its velocity, influenced by its own experience and that of neighboring particles. Similarly, it adjusts its position using information about its current location, velocity, and the distances to both its personal best and the swarm's best solutions.
Figure 3 illustrates the position update mechanism in the PSO algorithm. In PSO, each particle represents a potential solution, and its position in the search space is influenced by both its own historical best position and the global best position found by the entire swarm. The velocity directs the particle’s movement, determining its trajectory towards these two influential points. At each iteration, the particle updates its position by balancing the exploration of the search space (based on its current velocity) and the exploitation of known good solutions. This process enables the particle to progressively approach the optimal solution.
In an N-dimensional search space, the position of particle i is X_i = (x_{i1}, x_{i2}, …, x_{iN}) and its flight velocity is V_i = (v_{i1}, v_{i2}, …, v_{iN}). Particles update their velocity and position using the following formulas:

v_{id}(t+1) = w·v_{id}(t) + c_1·r_1·(p_{id} − x_{id}(t)) + c_2·r_2·(p_{gd} − x_{id}(t))
x_{id}(t+1) = x_{id}(t) + v_{id}(t+1)

where w is the inertia weight, which reflects the impact of a particle's history on its present motion, c_1 and c_2 are the learning factors, r_1 and r_2 are random numbers uniformly distributed in [0, 1], p_{id} is the individual optimal position at iteration t, and p_{gd} is the global optimal position at iteration t. The process of CatBoost parameter optimization using the improved particle swarm optimization algorithm can be seen in Figure 4. The specific calculation process is as follows:
Step 1: Initialization. Each particle, representing a potential set of CatBoost hyperparameters, is initialized with a random position and velocity within the defined search space. The position of each particle corresponds to a specific combination of hyperparameters, while the velocity controls the direction and speed with which the particle explores the search space.
Step 2: Fitness Calculation. The fitness of each particle is evaluated by training a CatBoost model using the hyperparameters represented by the particle's position. The model's performance is assessed based on a predefined metric, such as mean squared error (MSE) or accuracy, on a validation set. This fitness value determines how well the current set of hyperparameters performs.
Step 3: Update Personal Best and Global Best. For each particle, its best-known position is updated if the current fitness is better than the fitness at its previous best position. Additionally, the global best position is updated based on the particle with the best fitness across the entire swarm.
Step 4: Update Velocity and Position. Each particle's velocity is updated according to the update formulas, enabling the particles to balance exploration and exploitation in the search space. The new position of each particle is then calculated by adding the updated velocity to the particle's current position.
Step 5: Check Optimization Criteria. After each iteration, a check is performed to see if the optimization criteria are met. Common stopping conditions include reaching a maximum number of iterations or achieving a fitness value within a predefined threshold.
Step 6: Termination or Repeat. If the optimization criteria are satisfied, the algorithm terminates, and the optimal hyperparameters found are returned. Otherwise, the process loops back to Step 2 for another iteration, with updated positions and velocities, refining the search for the best hyperparameters.
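Steps 1–6 can be sketched as a compact PSO loop in plain Python. In the paper's setting, the `fitness` argument would train a CatBoost model on the candidate hyperparameters and return a validation error; here it is replaced by a cheap analytic stand-in (minimum at learning rate 0.1, depth 6) so the sketch runs on its own, and the bounds are illustrative.

```python
import random

def pso(fitness, bounds, swarm=20, iters=50, w=0.9, c1=1.5, c2=1.5, seed=1):
    """Minimal PSO following Steps 1-6; `fitness` is minimized."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Step 1: random positions and zero velocities within the search space
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(swarm)]
    vel = [[0.0] * dim for _ in range(swarm)]
    pbest = [p[:] for p in pos]
    pcost = [fitness(p) for p in pos]          # Step 2: initial fitness
    g = min(range(swarm), key=lambda i: pcost[i])
    gbest, gcost = pbest[g][:], pcost[g]       # Step 3: initial global best
    for _ in range(iters):                     # Steps 5-6: iterate until done
        for i in range(swarm):
            for d in range(dim):               # Step 4: velocity/position update
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            cost = fitness(pos[i])
            if cost < pcost[i]:                # Step 3: update bests
                pbest[i], pcost[i] = pos[i][:], cost
                if cost < gcost:
                    gbest, gcost = pos[i][:], cost
    return gbest, gcost

# stand-in "validation loss" with a known optimum
loss = lambda p: (p[0] - 0.1) ** 2 + (p[1] - 6.0) ** 2
best, cost = pso(loss, bounds=[(0.01, 0.3), (4, 10)])
print(best, cost)
```

Swapping the stand-in `loss` for a function that fits CatBoost and scores a validation set turns this sketch into the hyperparameter search described above.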
4. Experimental Results and Analysis
Both the training and validation evaluation of the proposed method were performed on the same workstation. The high-performance computer for data experiments and testing evaluations includes the following configuration: an Intel Core i5-12500 CPU for parallel processing, NVIDIA RTX 4060Ti for ML acceleration, 32 GB DDR5 RAM for handling large datasets, and 1TB NVMe SSD for fast data access. The system runs on Windows 11 Pro, with software environments like Anaconda for Python 3.9 management, Jupyter Lab for coding, and Docker for containerization. Pre-installed libraries include Scikit-learn and CatBoost for efficient ML workflows.
4.1. The Impact of Different Input Factors on Model Prediction Accuracy
Firstly, to evaluate the influence of different input factor variables on the construction of the dam deformation monitoring model, three factor models were compared, including the HST, HTairT, and HTT models, each fitted with multiple linear regression (MLR). Figure 8 shows the predictive results of MLR on the test sets using the different input variables, and Table 1 evaluates the model prediction performance with the different input factors. It can be seen that the HTairT-based dam deformation monitoring model shows the highest accuracy, with R² values of 0.929 for PL13-1 and 0.925 for PL13-2, and follows the true values closely, indicating that the use of past temperature monitoring data and its hysteresis factors can effectively reflect the thermal effect of temperature on the deformation of concrete dams. Here, the hysteresis factors capture the influence of air temperature and its lagged terms on dam deformation. The HST-based model uses simple harmonic factors to simulate the portion of dam deformation driven by the temperature field, so its simulation effect is limited, which restricts the model's prediction accuracy and generalization ability. Although the HTT-based model has a large amount of measured thermometer data, the high-dimensional nonlinear input variables brought by excessive thermometer data can easily degrade the generalization ability and robustness of a statistical regression model. Based on the above analysis, modeling with measured air temperature data can significantly improve the predictive capability of statistical models by accurately simulating the thermal effect of the temperature field on dam deformation without introducing excessive variables, thereby improving the generalization performance and robustness of the predictive model.
4.2. Model Hyperparameter Selection and Optimization Process
Based on the constructed HTairT dam deformation monitoring model, further research on parameter optimization methods for ML algorithms is needed. The selection and tuning of the hyperparameters of the CatBoost regressor are crucial for achieving optimal performance in dam deformation prediction. In this study, four crucial hyperparameters, i.e., the learning rate, depth, L2 leaf regularization, and bagging temperature, are selected for optimization. Below is an analysis of each chosen parameter, its rationale for inclusion in the optimization process, and its impact on the model.
(1) Learning Rate: The learning rate governs the magnitude of updates to the model parameters during each iteration. It directly influences how quickly the CatBoost model converges to an optimal solution. A lower learning rate (e.g., 0.01) ensures more gradual learning, reducing the likelihood of overfitting, but increasing the number of iterations required for convergence. A higher learning rate (e.g., 0.3) speeds up the convergence process but risks overshooting the optimal solution, potentially leading to suboptimal performance. By setting a range between 0.01 and 0.3, the optimization process can balance convergence speed with model stability.
(2) Depth: The depth parameter defines the maximum number of splits in a tree, controlling the model’s complexity. Deeper trees can capture more intricate patterns in the data but are more prone to overfitting. The range of 4 to 10 is selected to provide a balance between model complexity and generalization. Shallow trees (depth < 4) may be too simplistic to capture the underlying relationships in the data, while deeper trees (depth > 10) could lead to overfitting by fitting noise in the training data. This range allows the model to adequately capture non-linear patterns while mitigating the risk of overfitting.
(3) L2 Leaf Regularization: L2 regularization applies a penalty to the model’s leaf values, helping to prevent overfitting by discouraging overly large weights in the trees. Regularization is essential in controlling the complexity of the model. A higher L2 regularization value (closer to 10) imposes a stronger penalty on the leaf values, which can reduce the model’s tendency to overfit the training data. A lower value (closer to 1) allows the model more flexibility to fit the data but may lead to overfitting. The chosen range provides the necessary flexibility to explore various regularization strengths.
(4) Bagging Temperature: Bagging temperature introduces randomness into the sampling process during each iteration, affecting the diversity of the trees. A higher bagging temperature increases the randomness, while a lower value results in more deterministic sampling. By varying the bagging temperature, the model can explore different levels of randomness in the training data, which affects model robustness. A value of 0 ensures deterministic bagging, leading to more stable results, while a higher temperature adds more randomness, which can enhance generalization and reduce overfitting. The range of 0 to 1 allows for experimentation with different levels of randomness, potentially improving model performance.
Based on the above analysis, the model hyperparameters and their value ranges to be optimized in this study are shown in Table 2. The selection of parameters for PSO is crucial for balancing exploration and exploitation during the search process. With reference to previous research results, a well-configured set of PSO parameters is essential for achieving optimal performance in optimization tasks. A swarm size of 50 is a moderate choice, balancing computational cost and solution diversity. An inertia weight of 0.9 helps maintain a balance between exploration and exploitation. The cognitive coefficient and social coefficient, both set at 1.5, ensure that particles learn from their own experience and collaborate with others in the swarm. A maximum velocity of 10 prevents excessive movement, and a stopping criterion of 100 iterations or convergence ensures the algorithm terminates appropriately. These parameters offer stability and efficiency in finding optimal solutions for ML algorithms.
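For reference, the PSO settings quoted above and the hyperparameter ranges discussed earlier can be collected into a small configuration fragment. The search-space keys follow CatBoost's Python parameter names; the exact contents of Table 2 are assumed to match the ranges stated in the text.

```python
# PSO settings as described in the text
pso_config = {
    "swarm_size": 50,
    "inertia_weight": 0.9,
    "cognitive_coefficient": 1.5,
    "social_coefficient": 1.5,
    "max_velocity": 10,
    "max_iterations": 100,
}

# CatBoost hyperparameter search ranges (lower, upper) as discussed above;
# keys follow CatBoost's Python parameter naming
search_space = {
    "learning_rate": (0.01, 0.3),
    "depth": (4, 10),
    "l2_leaf_reg": (1.0, 10.0),
    "bagging_temperature": (0.0, 1.0),
}
print(pso_config["swarm_size"], search_space["depth"])
```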
Figure 9 illustrates the parameter optimization process of the CatBoost regressor using improved PSO algorithms. It can be seen that the loss (negative MSE) initially shows fluctuations due to the exploration phase, with a notable spike around epoch 20. As optimization progresses, the loss gradually stabilizes, particularly after epoch 40, indicating that the improved PSO algorithm has transitioned into the exploitation phase. This consistent reduction in loss values highlights the efficiency of the improved PSO in fine-tuning hyperparameters of the CatBoost algorithm, leading to better convergence and an optimal model performance.
The optimization effect of the particle swarm optimization algorithm can be obtained through multiple accuracy evaluation indicators on the validation set.
Figure 10 demonstrates the calculation results of multiple evaluation indicators of the validation set using the improved PSO algorithm for optimizing CatBoost hyperparameters. It can be inferred that, as the epochs progress, metrics like MSE and MAE show a clear decreasing trend, indicating that PSO effectively explores the hyperparameter space and converges toward optimal solutions. The initial fluctuations highlight the algorithm’s exploration phase, which gradually stabilizes, leading to consistent improvements in predictive performance. The relatively low and stable MAE values also underscore the robustness of the model. Overall, the improved PSO algorithm facilitates faster convergence and enhanced model accuracy of the constructed CatBoost dam deformation monitoring model.
The optimized hyperparameters for CatBoost at monitoring points PL13-1 and PL13-2 are shown in
Table 3. At monitoring point PL13-1, the model achieves a loss of −0.026 with a learning rate of 0.061, a depth of 4, an L2 leaf regularization of 2.386, and a bagging temperature of 0.368. At monitoring point PL13-2, the model is slightly more aggressive, with a learning rate of 0.145 and a higher L2 leaf regularization of 4.145, indicating a stronger regularization effect to control overfitting. Despite the higher regularization, the depth remains 4, and the bagging temperature is reduced to 0.196, suggesting a more conservative approach to data sampling. Overall, these parameter settings demonstrate the developed dam deformation prediction model's adaptability to different conditions at each monitoring point, optimizing predictive accuracy while minimizing errors.
4.3. Comparison of Prediction Accuracy of Different ML Methods
Secondly, a performance comparison of different prediction methods is carried out. The proposed method is compared against several widely used ML algorithms, with R², MAE, and MSE employed to assess the predictive accuracy and generalization capability of each method. The evaluation is performed from two perspectives: comparing factor models and assessing the accuracy of different prediction methods.
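The three scores used throughout this comparison are straightforward to compute; a minimal helper in plain Python, run here on illustrative data, is:

```python
def regression_metrics(y_true, y_pred):
    """R^2, MSE, and MAE as used to score the deformation models."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return {
        "R2": 1 - ss_res / ss_tot,          # fraction of variance explained
        "MSE": ss_res / n,                   # mean squared error
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }

# toy example: four observed vs. predicted deformation values
m = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(m)
```

Equivalent implementations exist in scikit-learn (`r2_score`, `mean_squared_error`, `mean_absolute_error`); the inline version makes the definitions explicit.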
To verify the effectiveness of the proposed PSO-based CatBoost method in predicting dam deformation with respect to both accuracy and generalization performance, a series of comparative experiments was conducted using well-established ML models, including RF [25], artificial neural networks (ANN) [26], the gradient boosting tree (GBT) [27], support vector regression (SVR) [28], and AdaBoost [29]. The principles of the comparison methods are introduced as follows.
RF. This is an ensemble learning technique that builds multiple decision trees during training and merges them to improve the accuracy and robustness of predictions. The method effectively handles both classification and regression problems, particularly excelling in scenarios with complex interactions between variables and noisy data. It reduces overfitting by averaging the results of individual trees, thus increasing generalizability.
ANN. This is a computational model inspired by the human brain, consisting of interconnected neurons organized into layers. ANN is highly adaptable and capable of capturing complex, non-linear relationships between input and output variables. However, it can be prone to overfitting and requires significant computational resources for training.
GBT. Gradient Boosting is an iterative algorithm that builds models sequentially by correcting the errors of previous models. Each new model is optimized to minimize the residuals of the combined model, leading to a highly accurate prediction model. Gradient Boosting is effective in reducing bias and variance, making it suitable for a wide range of complex prediction tasks.
SVR. SVR is a machine learning technique based on Support Vector Machines (SVM) for regression tasks. It maps input data to a high-dimensional space and fits a hyperplane to capture relationships between variables. SVR is effective in handling non-linear patterns, especially with high-dimensional data. By focusing on key training points (support vectors), SVR enhances model robustness and reduces overfitting, making it suitable for noisy data and complex predictions.
AdaBoost. This is an ensemble method that adjusts the weight of incorrectly classified instances in each iteration, focusing more on difficult cases. By iteratively combining weak learners, AdaBoost constructs a strong learner with improved accuracy. It is simple to implement and effective in handling both classification and regression problems, particularly in the presence of noisy data.
Figure 11 presents a quantitative comparison of the results of the different prediction methods, and Figure 12 shows the prediction results and residual distribution process lines of each method. It is evident that the developed PSO-optimized CatBoost model consistently exhibits superior performance. For instance, at monitoring point PL13-1, the PSO-CatBoost model achieves the highest R² value of 0.978, coupled with the lowest MSE (1.144) and MAE (1.933), indicating its ability to accurately predict dam deformation patterns. In contrast, the SVR method shows the poorest performance at PL13-1, with a significantly lower R² of 0.701 and much higher error values, including an MAE of 26.455 and an MSE of 4.719, demonstrating its inability to capture the deformation trends accurately. Similarly, at monitoring point PL13-2, the PSO-CatBoost model continues to outperform the other methods, achieving an R² of 0.973, an MSE of 1.168, and an MAE of 1.376. RF and GBT also show strong results, with R² values of 0.956 and 0.951, respectively.
The comprehensive analysis of R², MSE, and MAE values across different models highlights the superior accuracy and robustness of the PSO-optimized CatBoost model. It can accurately follow both short-term fluctuations and long-term trends in dam deformation, particularly during periods of rapid change. This makes it a more reliable and effective method for dam deformation prediction than traditional methods, which exhibit higher error rates and lower R² values. Thus, the enhanced PSO-optimized CatBoost model demonstrates a clear advantage in handling complex, dynamic environmental data related to dam deformation, confirming its robustness and reliability in predicting structural behavior.
4.4. Assessment of the Importance of Influencing Factors
Figure 13 shows the importance analysis of the factors affecting dam deformation at monitoring points PL13-1 and PL13-2 using the developed CatBoost method. From the feature importance plot, it is evident that the water level factor H1 plays a dominant role in dam deformation, indicating that hydrostatic pressure significantly affects the stability of the dam structure. Next, the air temperature-related factors, including the T31_60, T1_2, and T3_7 temperature variables, also show substantial importance, suggesting that temperature variations, along with their lagged effects, contribute to the dam's deformation through material expansion and contraction over time. Additionally, the time-dependent factors, including the t1 and t2 variables, exhibit moderate importance, implying that the deformation process evolves with time, potentially capturing long-term degradation. In contrast, the higher-order water level terms, including H2, H3, and H4, have a minimal impact, indicating that the deformation is more sensitive to the first-order water level component. This analysis highlights the critical need to monitor water level and temperature variations as key contributors to dam deformation.
5. Conclusions
In this study, we developed an advanced approach for monitoring and predicting dam deformation by leveraging intelligent optimization, machine learning techniques, and long-term air temperature data. By integrating the CatBoost 1.0 algorithm, a high-performance gradient boosting method, with an enhanced PSO algorithm, we effectively modeled the nonlinear relationship between environmental factors and dam deformation behavior. A case study of a large concrete arch dam demonstrated that the model achieves robust prediction accuracy and strong generalization, validated across multiple deformation monitoring points. The experimental results reveal that integrating temperature data with intelligent optimization significantly enhances prediction precision compared with conventional methods. Furthermore, comparative evaluations showed that the model outperforms traditional machine learning algorithms, with higher R2 values and lower MAE and MSE metrics, underscoring its accuracy and reliability for dam deformation prediction.
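To make the optimization component concrete, the following is a bare-bones particle swarm optimizer in plain Python. It implements standard PSO (not the paper's enhanced variant) and minimizes a simple quadratic stand-in objective; in the actual workflow, the objective would be the cross-validation error of a CatBoost model evaluated at candidate hyperparameters such as the learning rate and tree depth. All names and bounds here are assumptions for illustration.

```python
import random

def pso(objective, bounds, n_particles=20, n_iter=100,
        w=0.7, c1=1.5, c2=1.5, seed=42):
    """Standard PSO: minimize `objective` over box-constrained `bounds`."""
    rnd = random.Random(seed)
    dim = len(bounds)
    pos = [[rnd.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # personal best positions
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # global best
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rnd.random(), rnd.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the updated position to the search bounds.
                pos[i][d] = min(max(pos[i][d] + vel[i][d],
                                    bounds[d][0]), bounds[d][1])
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Stand-in objective: distance from a hypothetical optimum
# (learning_rate = 0.1, depth = 6); in practice this would be CV error.
obj = lambda p: (p[0] - 0.1) ** 2 + (p[1] - 6.0) ** 2
best, best_val = pso(obj, [(0.01, 0.3), (2.0, 10.0)])
print(best, best_val)
```

Because each objective evaluation would retrain a model, modest swarm sizes and iteration counts (as above) are the usual trade-off between search quality and training cost.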
The proposed predictive model holds significant promise for practical application in dam safety monitoring. By integrating this model into existing monitoring systems, real-time data analysis can be enhanced, enabling improved early warning capabilities and supporting informed decision-making for dam safety management. This integration would allow for a more proactive approach to monitoring, where potential deformation risks could be identified promptly based on environmental and operational data. Additionally, the model’s adaptability to diverse monitoring environments suggests its applicability across various dam types and regions. However, successful implementation will require addressing certain technical requirements, such as data integration protocols and computational resources, as well as overcoming challenges related to scaling the model for different dam structures. Overall, this model presents a valuable tool for advancing the reliability and responsiveness of dam safety monitoring systems.
However, several limitations of the current model should be addressed in future research to enhance its robustness and adaptability. First, while the model has proven effective for predicting horizontal deformation, expanding the range of input variables to include additional environmental factors, such as humidity, wind load, and fluctuating water levels, could significantly improve its predictive accuracy and applicability in more complex conditions. Incorporating these variables would allow the model to account for a broader spectrum of environmental influences, which could be especially valuable in regions with varying climates and weather patterns. Second, the model’s performance may vary across different dam types and structural designs, highlighting the need for further investigation into its adaptability and effectiveness in diverse engineering scenarios. Future research should focus on testing and fine-tuning the model for various dam structures, including arch dams, gravity dams, rockfill dams, and levees, to establish a more universal predictive framework. In addition, the proposed method will be extended beyond horizontal deformation prediction to include seepage prediction, as well as vertical and complex deformation predictions in hydraulic structures, such as rockfill dams and levees. These enhancements aim to provide a more comprehensive predictive tool that could contribute to safer and more efficient management of hydraulic structures under different operational conditions. Ultimately, this research will guide the development of a predictive system capable of supporting preventive maintenance, risk assessment, and decision-making processes in real-time safety monitoring of critical hydraulic infrastructures.