Introduction

The evaluation of slope stability is a critical area of research in geotechnics and slope engineering1,2. However, the processes that govern slope deformation and failure are inherently complex geological phenomena3. The multitude of factors affecting slope stability are characterized by significant uncertainties, which complicate accurate assessments4. These factors include variations in soil and rock properties, groundwater conditions, seismic activity, weathering processes, and human activities5,6. Each of these variables can influence the stability of a slope in different ways, and their interactions can be highly complex and unpredictable7,8. Traditional theoretical analyses relying on simplified assumptions and idealized conditions have long been used to evaluate slope stability. However, these approaches often fail to capture the intricate and dynamic nature of real-world slopes. As a result, there has been a growing reliance on numerical methods to provide more detailed and nuanced analyses9. Numerical methods, such as the finite element method, offer a more sophisticated approach by allowing for the modeling of complex geometries, material behaviors, and boundary conditions. Despite their advantages, these methods still face significant challenges due to the inherent uncertainties in the input parameters and the computational intensity required for large-scale simulations10.

Discontinuous deformation analysis (DDA) is another numerical approach that accounts for the discontinuous nature of geological materials, such as the presence of faults, fractures, and joints. This method can provide insights into the failure mechanisms of slopes that are not easily captured by continuous methods such as the finite element method. However, DDA also requires detailed information about discontinuities, which can be difficult to obtain accurately11,12. The virtual element method (VEM) is a relatively recent development that extends the capabilities of traditional finite element methods. The VEM allows for more flexible mesh generation and can handle complex topologies and heterogeneous materials more efficiently.

Nonetheless, like other numerical methods, VEM's effectiveness is limited by the precision of the input data and the quality of the mesh13,14. Phase field models represent another advanced numerical technique used to simulate the evolution of fractures and other discontinuities within a slope. These models are particularly useful for studying the progressive failure of slopes and the development of failure surfaces over time. However, phase field models are computationally demanding and require detailed calibration against experimental or field data to produce reliable results15,16.

The evaluation of slope stability can be understood as a pattern recognition task, where the goal is to identify and interpret the complex relationships and patterns among various factors that influence slope behavior17. This approach leverages data-driven techniques and advanced computational models to analyze the multitude of variables involved in slope stability18. Machine learning algorithms, statistical methods, and artificial intelligence (AI) techniques are often employed to process and interpret these data. These methods can identify correlations, trends, and anomalies that traditional analytical approaches might overlook19. Recent studies have expanded the scope of slope stability analysis by integrating machine learning (ML) and AI techniques. For instance, slope stability prediction based on long short-term memory (LSTM) neural networks has shown significant improvements compared to traditional models such as convolutional neural networks, support vector machines, and random forests20. A comprehensive dataset consisting of 1400 instances for slope stability analysis was constructed using ensemble machine learning techniques such as decision tree regression (DTR), multiple linear regression (MLR), K-nearest neighbor regression (KNN), random forest regression (RF), extreme gradient boosting regression (XGB), support vector regression (SVR), and subset simulation. The voting ensemble model stood out among these, showing strong performance for slope stability prediction21. The study of uncertainties in landslide susceptibility prediction emphasizes the importance of different study area scales and mapping unit scales, highlighting the complexity of slope stability assessments22. In addition, research on coal-rock identification methods and data augmentation algorithms, such as FL-Segformer, has provided new insights into the geological characterization essential for slope stability analysis23.
The spatial prediction of soil organic carbon in coal mining subsidence areas using RBF neural networks has demonstrated the potential of AI in environmental and geotechnical studies24. Furthermore, fuzzy inference systems using genetic algorithms and pattern searches have been developed for predicting roof fall rates in underground coal mines, indicating the growing reliance on hybrid AI techniques for geotechnical applications25.

Machine learning models, for example, can be trained on large datasets to recognize the conditions that precede slope failures. These models can then be used to predict future instability based on new data, providing early warnings and allowing proactive measures to be taken26,27. Techniques such as neural networks, support vector machines, and decision trees have shown promise in accurately identifying patterns related to slope stability. Using databases of slope information28, researchers have applied ML techniques to evaluate slope stability19,29,30. Sabri et al.31 employed artificial neural networks (ANNs), Bayesian neural networks (BNNs), convolutional neural networks (CNNs), and deep neural networks (DNNs) to evaluate slope stability. Among these models, the ANN achieved the best performance during testing, with TIC = 0.0076 and LMI = 0.9872. Keawsawasvong et al.32 examined rock slope stability using the Hoek–Brown failure criterion and the impact of pseudo-static seismic forces. Their study employed finite element limit analysis (FELA) and introduced a multivariate adaptive regression splines (MARS) model, achieving an R2 value of 99.82%. It also conducted sensitivity analysis on rock mass strength and slope geometry, offering valuable insights for engineering practitioners. The MARS model and analysis provide both theoretical guidance and practical tools for rock slope stability assessments. Youssef et al.33 focused on landslide hazards and mapped landslide susceptibility using logistic regression (LR), ANNs, and an ensemble of LR and ANN models. Using 800 landslide locations and 14 causative parameters, the ROC-AUC, RMSE, and kappa indices of the models were evaluated. The AUC results were 82% for LR, 89% for ANN, and 91% for the ensemble. Zhou et al.34 implemented six supervised learning methods to predict the stability of hard rock pillars and comparatively analyzed their performance.
Hoang and Pham35 utilized least-squares support vector classification (LS-SVC) and the firefly algorithm (FA) to investigate 168 case studies on slopes.

Although these AI models have improved our understanding of slope conditions, each approach has advantages and limitations. The complexity of the problem means that existing methods remain imperfect. Different algorithms exhibit varying sensitivities to the same types of data, leading to different classification accuracies across various categories and algorithms36. Each classifier is subject to intrinsic limitations, such as overfitting, computational cost, or sensitivity to specific features37. As a result, the focus of slope stability assessment research is to continuously search for novel and robust ML algorithms that can develop assessment models with better prediction performance38.

One promising approach for achieving better slope stability prediction results is ensemble learning. Ensemble learning involves training multiple algorithms to capitalize on their complementary advantages39. By combining the strengths of various models, ensemble methods can enhance prediction accuracy and robustness. Ensemble learning creates a more comprehensive 'strong learner' by combining multiple models, leading to more accurate predictions, better generalization, and broader applications in slope stability assessment40. This approach reduces errors and balances the bias-variance trade-off, enhancing robustness and handling complex, nonlinear interactions among diverse data types41. Techniques such as bagging, boosting, and stacking integrate different algorithms to capture a wide range of patterns, preventing overfitting and improving performance on unseen data. As a result, ensemble learning is a powerful tool for developing reliable and effective predictive models in complex slope stability analyses42,43.

Recent applications of ensemble learning in slope stability and landslide assessment have shown promising results. Studies have demonstrated the effectiveness of various ensemble techniques, such as random forests, gradient boosting machines, and adaptive boosted decision trees, in predicting slope stability with higher accuracy than individual models44,45,46. These advancements underscore the potential of ensemble learning to transform slope stability analysis, providing engineers with robust tools for risk mitigation and design optimization. For example, Qi and Tang47 developed a slope stability evaluation algorithm that integrates adaptive boosted decision trees (ABDTs), quadratic discriminant analysis (QDA), SVMs, ANNs, Gaussian process classification (GPC), and K-nearest neighbors (KNNs) as individual classifiers and weighted majority voting as an ensemble classifier. The results showed that the hybrid ensemble approach significantly improved the prediction of slope stability.

Against this backdrop, our study introduces a metaheuristic approach for optimal feature selection integrated with a deep learning framework to enhance slope stability classification. This combination of metaheuristic feature selection and deep learning has considerable potential to improve slope stability analysis. By capturing intricate patterns, leveraging ensemble techniques, and identifying optimal feature subsets, the methodology provides engineers with a robust toolset for mitigating risks and optimizing designs in critical infrastructure projects worldwide.

The remainder of this paper is organized as follows: Section "Methodology" discusses the methodology, including subsections on recurrent neural networks (RNNs), long short-term memory (LSTM) models, generative adversarial networks (GANs), and the proposed greylag goose optimization algorithm. Section "Dataset description" describes the dataset used. Section "Results" presents and discusses the obtained results. Finally, Section "Discussion" concludes the paper.

Methodology

Recurrent neural network (RNN)

Recurrent neural networks (RNNs) are a type of neural architecture known for their ability to model sequential data by capturing dynamic temporal behavior. Unlike feedforward networks, which process inputs independently, RNNs maintain an internal state that allows the current network output to be influenced by past inputs48. This property gives RNNs the ability to capture long-range dependencies, making them particularly suitable for tasks that require sequential data processing, such as speech recognition49, language modeling50, and time series prediction51. The strength of RNNs in preserving an internal state and modeling temporal dynamics is facilitated by their recurrent connections, which allow activation patterns to propagate across successive time steps. The hidden state of an RNN at time \(t\), denoted as \({h}_{t}\), is calculated as a function of the current input, \({x}_{t}\), and the previous hidden state, \({h}_{t-1}\), as defined in Eq. (1):

$${h}_{t}=f\left({W}_{hh}\cdot {h}_{t-1}+{W}_{xh}\cdot {x}_{t}+{b}_{h}\right)$$
(1)

In this equation, \({W}_{hh}\) and \({W}_{xh}\) are weight matrices governing the contributions of the previous hidden state and current input, respectively, while \({b}_{h}\) is the bias vector. The nonlinear activation function \(f\), often ReLU, introduces nonlinearity, which is crucial for learning complex patterns.

This recursive computation enables RNNs to encode information from previous time steps and to capture dependencies across arbitrary intervals. This makes them suitable for tasks such as natural language processing and the prediction of time series, where order and context are crucial.

For classification tasks, the fundamental RNN architecture remains consistent, with modifications in the output layer. The output \({y}_{t}\) at each time step \(t\) is computed as (2):

$${y}_{t}=\text{softmax}\left({W}_{hy}\cdot {h}_{t}+{b}_{y}\right)$$
(2)

Here, \({W}_{hy}\) and \({b}_{y}\) represent the weight matrix and bias vector for the output layer, respectively, and softmax ensures that the outputs sum to 1, representing class probabilities.
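As an illustration of Eqs. (1)–(2), the following NumPy sketch runs one recurrence per time step and classifies from the final hidden state. The dimensions, random weights, and input sequence are illustrative placeholders, not values from this study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper)
n_in, n_hidden, n_classes = 6, 16, 2

# Weight matrices and biases for Eqs. (1)-(2)
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))
b_h = np.zeros(n_hidden)
W_hy = rng.normal(scale=0.1, size=(n_classes, n_hidden))
b_y = np.zeros(n_classes)

def rnn_step(x_t, h_prev):
    """One recurrence of Eq. (1): h_t = f(W_hh.h_{t-1} + W_xh.x_t + b_h), f = ReLU."""
    return np.maximum(0.0, W_hh @ h_prev + W_xh @ x_t + b_h)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Run a short random input sequence, then classify from the final hidden state (Eq. 2)
h = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):  # sequence of 5 time steps
    h = rnn_step(x_t, h)
y = softmax(W_hy @ h + b_y)
print(y)  # class probabilities summing to 1
```

The recursion carries information forward through `h`, which is what lets the final output depend on the whole sequence rather than only the last input.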

Long short-term memory (LSTM)

Long short-term memory (LSTM) networks are a special type of recurrent neural network (RNN) developed to overcome the limitations of traditional RNNs in capturing long-range dependencies and mitigating the problem of vanishing or exploding gradients during training52,53,54. LSTMs are particularly well suited for tasks involving sequential data processing, such as natural language processing, speech recognition, and time series prediction53,55,56.

The key innovation of LSTM networks lies in their ability to selectively store information and update it over time using gated mechanisms. These mechanisms, consisting of input, forget, and output gates, regulate the flow of information within the network and allow it to capture long-term dependencies while effectively avoiding the vanishing gradient problem57,58.

While LSTMs were originally developed for sequence modeling problems such as language modeling and machine translation, they have also proven effective in classification tasks for sequential data such as text or time series data. In text classification, an LSTM can process sequential words or characters as inputs, with the last hidden state serving as a summary representation for classification59. Similarly, in time series classification, the hidden states of the LSTM capture relevant patterns in the sequence, which can then be used for prediction60. When classifying sequences of different lengths, it is common to use an attention mechanism together with the LSTM to better focus on the most relevant parts of the input61. Bidirectional LSTMs, which process sequences in both forward and backward directions, have also shown advantages in classification62.

In the context of classification, LSTMs offer a unique advantage by capturing temporal dependencies within sequential data, enabling more accurate and contextual predictions. This is achieved using specialized gating mechanisms that control the flow of information within the network and allow LSTMs to selectively update, retain, or discard information depending on the input sequence. The functioning of an LSTM unit in a classification task can be illustrated as follows58:

$${i}_{t}=\sigma \left({W}_{ix}{x}_{t}+{W}_{ih}{h}_{t-1}+{W}_{ic}{c}_{t-1}+{b}_{i}\right)$$
(3)
$${f}_{t}=\sigma \left({W}_{fx}{x}_{t}+{W}_{fh}{h}_{t-1}+{W}_{fc}{c}_{t-1}+{b}_{f}\right)$$
(4)
$${o}_{t}=\sigma \left({W}_{ox}{x}_{t}+{W}_{oh}{h}_{t-1}+{W}_{oc}{c}_{t}+{b}_{o}\right)$$
(5)
$${g}_{t}=\text{tanh}\left({W}_{gx}{x}_{t}+{W}_{gh}{h}_{t-1}+{b}_{g}\right)$$
(6)
$${c}_{t}={f}_{t}\odot {c}_{t-1}+{i}_{t}\odot {g}_{t}$$
(7)
$${h}_{t}={o}_{t}\odot \text{tanh}\left({c}_{t}\right)$$
(8)

Here, \({i}_{t},{f}_{t}\) and \({o}_{t}\) represent the input, forget, and output gates, respectively, which control the flow of information within the LSTM cell. \({g}_{t}\) is the candidate cell state, \({c}_{t}\) is the cell state and \({h}_{t}\) is the hidden state/output. The weight matrices \(W\) and bias vectors \(b\) are learnable parameters, \(\sigma\) denotes the sigmoid activation function, and \(\odot\) represents elementwise multiplication.
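Eqs. (3)–(8) can be sketched directly in NumPy. Note that the output gate of Eq. (5) peeks at the current cell state \({c}_{t}\), so \({c}_{t}\) must be computed before \({o}_{t}\); all weights and dimensions below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_h = 4, 8  # illustrative input and hidden sizes

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# One input weight, recurrent weight, and bias per gate; the peephole terms
# W_ic, W_fc, W_oc of Eqs. (3)-(5) are taken as elementwise (diagonal) vectors.
W = {k: rng.normal(scale=0.1, size=(n_h, n_in)) for k in ("ix", "fx", "ox", "gx")}
U = {k: rng.normal(scale=0.1, size=(n_h, n_h)) for k in ("ih", "fh", "oh", "gh")}
P = {k: rng.normal(scale=0.1, size=n_h) for k in ("ic", "fc", "oc")}
b = {k: np.zeros(n_h) for k in ("i", "f", "o", "g")}

def lstm_step(x_t, h_prev, c_prev):
    i_t = sigmoid(W["ix"] @ x_t + U["ih"] @ h_prev + P["ic"] * c_prev + b["i"])  # Eq. (3)
    f_t = sigmoid(W["fx"] @ x_t + U["fh"] @ h_prev + P["fc"] * c_prev + b["f"])  # Eq. (4)
    g_t = np.tanh(W["gx"] @ x_t + U["gh"] @ h_prev + b["g"])                     # Eq. (6)
    c_t = f_t * c_prev + i_t * g_t                                               # Eq. (7)
    o_t = sigmoid(W["ox"] @ x_t + U["oh"] @ h_prev + P["oc"] * c_t + b["o"])     # Eq. (5)
    h_t = o_t * np.tanh(c_t)                                                     # Eq. (8)
    return h_t, c_t

# Process a random 10-step sequence; the final h could feed a classifier head
h = c = np.zeros(n_h)
for x_t in rng.normal(size=(10, n_in)):
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)
```

Because \({h}_{t}\) is a sigmoid-gated tanh, every component of the hidden state stays in (−1, 1), while the cell state `c` can grow unboundedly and act as long-term memory.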

Generative adversarial network (GAN)

Generative adversarial networks (GANs) have proven to be powerful frameworks for generative modeling and enable the generation of realistic synthetic data in various domains63,64,65. GANs were first introduced by Goodfellow et al.63 and have since been extensively studied and applied in various fields, including computer vision, natural language processing, and data augmentation66.

Creswell et al.67 highlighted the ability of GANs to learn deep representations without extensively annotated training data by using backpropagation signals derived during training. Furthermore, the stability and convergence of GAN training have been improved by algorithmic innovations such as unrolled GANs66, which change the target of the generator based on the unrolled optimization of the discriminator. Gui et al.68 provided a comprehensive overview of different algorithms and their applications and demonstrated the great potential of GANs in various fields. The adaptability of GANs was also demonstrated by Wang et al.69, who applied GANs in autonomous driving technology to improve the realism of simulated environments. The basic mechanism of GANs can be mathematically described by the following min–max optimization problem70:

$$\underset{G}{\text{min}} \underset{D}{\text{max}} V(D,G)={\mathbb{E}}_{x\sim {p}_{\text{data }}(x)}[\text{log}D(x)]+{\mathbb{E}}_{z\sim {p}_{z}(z)}[\text{log}(1-D(G(z)))]$$
(9)

where \(G\) represents the generator network, \(D\) represents the discriminator network, \(x\) represents instances of real data, and \(z\) represents the input noise variables used by \(G\) to generate the data. The generator \(G\) learns to map the noise variable \(z\) to the data space \(G(z)\), while \(D\) outputs a scalar representing the probability that \(x\) came from the training data rather than \(G\)64,71.

GANs are traditionally associated with generative modeling tasks, such as image synthesis and data generation. However, recent developments have demonstrated their potential in classification tasks, representing a significant extension of their capabilities. This integration of GANs into classification has led to novel approaches that combine the strengths of generative and discriminative models, offering promising ways to improve classification accuracy and robustness.

The integration of GANs in classification tasks requires a modification of the traditional GAN objective to accommodate the discriminative task. The formulation proposed by Odena et al.72 involves combining the GAN objective with a classification loss:

$$\underset{{\theta }_{g}}{\text{min}} \underset{{\theta }_{d}}{\text{max}} {\mathbb{E}}_{x\sim {p}_{\text{data }}(x)}[\text{log}D(x)]+{\mathbb{E}}_{z\sim {p}_{z}(z)}[\text{log}(1-D(G(z)))]+\lambda {\mathcal{L}}_{\text{class}}$$
(10)

Here, \({\mathcal{L}}_{\text{class}}\) represents the classification loss, which can be derived from a supervised or semisupervised learning setting. The parameter \(\lambda\) controls the trade-off between the adversarial and classification objectives, allowing for flexible integration of GANs into the classification pipeline.
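A minimal numerical sketch of Eq. (10), using hypothetical discriminator outputs and an illustrative trade-off \(\lambda\) (none of these values come from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

def adversarial_value(d_real, d_fake):
    """Value function of Eq. (9): E[log D(x)] + E[log(1 - D(G(z)))]."""
    eps = 1e-12  # numerical guard against log(0)
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

def cross_entropy(probs, labels):
    """Illustrative classification loss L_class of Eq. (10)."""
    eps = 1e-12
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

# Hypothetical discriminator outputs on a batch of real and generated samples
d_real = rng.uniform(0.6, 0.99, size=32)          # D(x) for real data
d_fake = rng.uniform(0.01, 0.4, size=32)          # D(G(z)) for generated data
class_probs = rng.dirichlet(np.ones(2), size=32)  # per-sample class probabilities
labels = rng.integers(0, 2, size=32)              # 2-class labels

lam = 0.5  # illustrative trade-off parameter lambda
objective = adversarial_value(d_real, d_fake) + lam * cross_entropy(class_probs, labels)
print(objective)
```

In practice the generator and discriminator parameters would be updated alternately by gradient descent/ascent on this quantity; the sketch only shows how the two loss terms combine.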

Binary greylag goose optimization algorithm for feature selection

The Binary Greylag Goose Optimization (BGGO) algorithm extends the original Greylag Goose Optimization (GGO) framework to handle discrete variables, making it particularly useful for feature selection. Feature selection is a crucial process in machine learning and data analysis that aims to identify the most relevant features while reducing dimensionality and improving model performance. The binary version of the GGO algorithm adapts the standard model to work with binary search spaces in which each solution component represents the presence (1) or absence (0) of a feature in a dataset73,74,75.

In BGGO, the position of each goose is represented as a binary vector rather than a continuous vector (see Algorithm 1). The position update rule is modified to suit binary operations, utilizing a sigmoid function to map continuous values to binary decisions. The updated position of the \(i\)-th goose at time \(t+1\) is given by76:

$${\overrightarrow{x}}_{i}(t+1)=\text{round}\left(\sigma \left({\overrightarrow{x}}_{i}(t)+f(t)\cdot \left({\overrightarrow{x}}_{g}(t)-{\overrightarrow{x}}_{i}(t)\right)\right)\right)$$
(11)

where \(\sigma\) is the sigmoid function defined as77:

$$\sigma (v)=\frac{1}{1+{e}^{-v}}$$
(12)

This transformation ensures the algorithm maintains a binary search space, with the rounding function converting the sigmoid output to 0 or 1. The lead goose \({\overrightarrow{x}}_{g}(t)\) is chosen based on a fitness function specifically designed to evaluate the efficacy of the selected features, often involving measures of classification accuracy, compactness of the feature set, and robustness to overfitting76. The binary GGO algorithm proceeds as follows78:

  • Initialization: Initialize a population of binary solutions within the specified domain \([\text{0,1}]\).

  • Fitness Evaluation: Evaluate the quality of solutions using the objective function \({F}_{n}\), which incorporates the classification error rate Err and feature selection significance.

  • Binary Transformation: Continuous solutions are converted to binary format using the sigmoid function, ensuring that the values are either 0 or 1 based on a threshold (0.5).

    $${x}_{d}^{(t+1)}=\left\{\begin{array}{ll}1& \text{if }\sigma (m)\ge 0.5\\ 0& \text{otherwise}\end{array}\right.$$
    (13)
    $$\sigma (m)=\frac{1}{1+{e}^{-10(m-0.5)}}\text{, where }m\text{ represents the chosen features}$$
  • Optimization: Iteratively optimize the binary solutions to minimize \({F}_{n}\) and select a subset of features that yield low classification error rates.
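The steps above can be sketched as follows. The binary update mirrors Eqs. (11)–(13), and the illustrative fitness mirrors the α = 0.99 / β = 0.01 weighting of error against subset size described for Table 1; the feature count and step size are placeholder assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))  # Eq. (12)

def binary_update(x_i, x_g, f_t):
    """Eq. (11): move goose i toward the lead goose x_g, then binarize.
    Rounding the sigmoid output is equivalent to thresholding it at 0.5, as in Eq. (13)."""
    v = x_i + f_t * (x_g - x_i)
    return (sigmoid(v) >= 0.5).astype(int)

n_features = 12  # illustrative dimensionality, not the paper's dataset
x_i = rng.integers(0, 2, size=n_features)  # current binary position of goose i
x_g = rng.integers(0, 2, size=n_features)  # lead goose (best solution so far)
x_new = binary_update(x_i, x_g, f_t=1.5)
print(x_new)  # 0/1 mask: 1 = feature selected

def fitness(mask, err, alpha=0.99, beta=0.01):
    """A fitness in the spirit of F_n: weight the classification error (alpha)
    against the relative size of the selected feature subset (beta)."""
    return alpha * err + beta * mask.sum() / len(mask)
```

Iterating this update over a population of geese, while tracking the mask with the lowest fitness, yields the selected feature subset.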

Algorithm 1

BGGO Algorithm

Table 1 details the configuration parameters for the binary GGO algorithm, including the number of agents, number of iterations, dimensionality, domain, inertia factor, and other key settings. The algorithm uses ten agents, balancing computational efficiency with exploration capability. With 100 iterations, the algorithm has ample opportunities to refine its solutions. The dimension is set to match the number of features in the dataset, ensuring comprehensive feature selection. The domain [0,1] indicates binary decision-making for feature inclusion. Conducting 20 runs provides statistical robustness to the results. The inertia factor of SC (0.1) implies a moderate influence of previous positions on current movements. The ranges [0,1] for \({r}_{1}\) to \({r}_{5}\) and [0,2] for \({w}_{1}\) to \({w}_{4}\) offer flexibility in agent behavior and decision-making. The values of α (0.99) and β (0.01) indicate a strong emphasis on exploitation, with a small allowance for exploration, promoting convergence while maintaining solution diversity. Figure 1 presents a comprehensive flowchart of the study, outlining the entire process from data collection and processing to feature selection, model training, and evaluation, with the Binary Greylag Goose Optimization (BGGO) algorithm integrated into feature selection.

Table 1 Configuration settings for the binary GGO algorithm.
Fig. 1

Overview of the study workflow incorporating the BGGO Algorithm.

Evaluation metrics

Several widely used evaluation metrics quantify the classification capabilities of models from different perspectives and have been discussed and applied extensively in previous studies55,79. Four commonly used evaluation metrics are presented in this section: accuracy, sensitivity (true positive rate), specificity (true negative rate), and the F score, which combines precision and recall into a single measure. In addition, the statistical significance of the positive and negative predictions of the models was evaluated using the p-value in conjunction with the positive predictive value (PPV) and the negative predictive value (NPV), respectively.

Accuracy is a fundamental metric that measures the overall correctness of a model's predictions. It is calculated as the ratio of true positive (TP) and true negative (TN) instances to the total number of instances:

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
(14)

Sensitivity, also known as the true positive rate (TPR) or recall, quantifies the ability of a model to identify positive instances (e.g., unstable slopes) correctly. It is defined as the ratio of true positive instances to the sum of true positive and false negative instances:

$$\text{Sensitivity }(\text{TPR})=\frac{TP}{TP+FN}$$
(15)

The specificity or true negative rate (TNR) measures the ability of the model to correctly identify negative instances (e.g., stable slopes). It is calculated as the ratio of true negative instances to the sum of true negative and false positive instances:

$$\text{Specificity }(\text{TNR})=\frac{TN}{TN+FP}$$
(16)

The p-value, in conjunction with the positive predictive value (PPV) and the negative predictive value (NPV), indicates the statistical significance of the positive and negative predictions of the model. These values are usually derived from contingency tables or confusion matrices.

The F score, also known as the harmonic mean of precision and recall, combines both precision and recall into a single metric. It is calculated as:

$$\text{F-score}=\frac{2\times \text{Precision}\times \text{Recall}}{\text{Precision}+\text{Recall}}$$
(17)

where \(\text{Precision}=TP/(TP+FP)\) and Recall = Sensitivity (TPR).
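Eqs. (14)–(17) reduce to simple arithmetic on a 2×2 confusion matrix; the matrix counts below are hypothetical, chosen only to exercise the formulas:

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs. (14)-(17) plus PPV/NPV, computed from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)       # Eq. (14)
    sensitivity = tp / (tp + fn)                     # Eq. (15), TPR / recall
    specificity = tn / (tn + fp)                     # Eq. (16), TNR
    precision = tp / (tp + fp)                       # also the PPV
    f_score = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (17)
    npv = tn / (tn + fn)                             # negative predictive value
    return accuracy, sensitivity, specificity, f_score, precision, npv

# Hypothetical confusion matrix: 40 TP, 45 TN, 5 FP, 10 FN
# -> accuracy = 0.85, sensitivity = 0.8, specificity = 0.9
print(classification_metrics(40, 45, 5, 10))
```

Reporting sensitivity and specificity together is what distinguishes a model that merely predicts the majority class from one that identifies both stable and unstable slopes.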

Dataset description

The dataset used in this study comprises a comprehensive collection of geological and geotechnical parameters of numerous slope sites collected from various sources35,38,80,81,82,83,84,85,86. The dataset underwent preprocessing before model training to ensure data quality and integrity. Missing values were addressed using mean imputation, and outliers were identified and treated to minimize their impact on model performance53,57,87,88. The dataset was then divided into training (80%) and testing (20%) subsets, ensuring that the models were trained on a substantial portion of the data while retaining enough unseen data for rigorous evaluation.
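The preprocessing steps described above (mean imputation followed by an 80/20 split) can be sketched as follows; the four records and their values are invented placeholders, not rows from the actual dataset:

```python
import numpy as np

# Hypothetical feature matrix (columns: UNW, COH, FRI, SLA, HEI, PPR) and labels;
# np.nan marks a missing cohesion value that will be mean-imputed
X = np.array([
    [18.8, 14.0, 25.0, 20.0, 30.5, 0.45],
    [20.4, np.nan, 30.0, 35.0, 100.0, 0.25],
    [25.0, 55.0, 36.0, 45.0, 299.0, 0.0],
    [16.0, 11.9, 26.0, 30.0, 10.0, 0.5],
])
y = np.array([1, 2, 1, 2])  # 1 = stable, 2 = unstable

# Mean imputation: replace each nan with its column mean over observed values
col_means = np.nanmean(X, axis=0)
X = np.where(np.isnan(X), col_means, X)

# 80/20 train/test split on shuffled indices
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
X_train, X_test = X[idx[:cut]], X[idx[cut:]]
y_train, y_test = y[idx[:cut]], y[idx[cut:]]
print(X_train.shape, X_test.shape)  # (3, 6) (1, 6)
```

Shuffling before the split matters: it keeps any ordering in the source compilations (e.g., cases grouped by publication) from leaking into the train/test partition.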

The dataset contains critical factors that influence slope stability, such as geometric attributes (slope height, angle, and curvature), material properties (soil type, cohesion, friction angle, and unit weight), and environmental conditions (pore water pressure ratio and seismic loading). Each slope is labeled either stable (1) or unstable (2), which facilitates supervised learning. The relevant parameters and their descriptions can be found in Table 2. To provide insight into the data distribution and variability, Table 3 contains descriptive statistics for the main geotechnical parameters affecting slope stability. The mean unit weight (UNW) of the soil/rock is 20.185 kN/m³, with a relatively moderate standard deviation of 7.044, indicating reasonable variability within the dataset. The cohesion (COH) values show considerable scatter, with a mean of 25.600 kPa and a substantial standard deviation of 31.036, suggesting a wide range of soil/rock shear strength characteristics. The friction angle (FRI) has a mean value of 25.308° and a standard deviation of 12.331°, reflecting the diversity of frictional resistance characteristics among the slope cases. In addition, the slope height parameter (HEI) has a mean value of 90.289 m and a considerable standard deviation of 120.140 m, indicating a wide range of slope geometries within the dataset. The pore pressure ratio (PPR), an important indicator of drainage conditions, has a relatively low mean of 0.254 and a standard deviation of 0.260, indicating varying degrees of saturation on the slopes. The 95% confidence intervals for the mean values provide a range of plausible values for each parameter, allowing for further statistical analysis and modeling efforts.

Table 2 Geotechnical parameters for slope stability analysis.
Table 3 Descriptive statistics summary.

In addition to the descriptive statistics, Table 4 contains a quantile analysis that provides a more comprehensive view of the variable distributions within the dataset. The unit weight (UNW) variable ranges from 0.492 to 33.16 kN/m³, with an interquartile range (IQR) of 6.23, indicating moderate variability. The median of 20.959 kN/m³ represents a typical unit weight for the slope materials under consideration. In contrast, the cohesion (COH) values show considerable variability, with a wide range from 0 to 300 kPa and a significant IQR of 26.77 kPa. The median value of 19.69 kPa and the relatively high robust coefficient of variation (0.982) indicate a skewed distribution, possibly influenced by outliers or different soil/rock types within the dataset.

Table 4 Quantile analysis of variable distributions.

The friction angle (FRI) shows a more concentrated distribution, with an IQR of 15.11° and a median of 28.8°, which is consistent with typical ground friction angles. However, the range of 49.5° indicates the presence of both granular and cohesive material in the dataset. The slope angles (SLAs) vary considerably, with a range of 64.698° and an IQR of 19.54°, reflecting the different topographic conditions in the dataset. The median slope angle value of 34.98° shows that both gentle and steep slopes are adequately represented.

The slope height parameter (HEI) has a considerable range of 564.982 m, with a median of 45.8 m and a significant IQR of 88 m, indicating a diverse mix of slope geometries, from small slopes to large slopes. In addition, the pore pressure ratio (PPR) has a maximum value of 1, indicating fully saturated conditions for some slopes. The median value of 0.25 and the IQR of 0.35 indicate a wide range of drainage conditions, from dry to nearly saturated slopes.

To further characterize the diversity within the dataset, Fig. 2 illustrates the relative frequency distributions of these influential parameters. The distribution of unit weight (UNW) peaks at 12.85 kN/m³, with 43.93% of cases falling within this range. Cohesion (COH) shows a bimodal pattern with modes at 15 kPa (16.19%) and 135 kPa (1.9%), indicating a wide range of soil/rock types. The distribution of the friction angle (FRI) is approximately normal, with a center at 30.55° (18.69%), which corresponds to typical soil behavior. The slope angle (SLA) distribution is skewed toward steeper values, with 29.91% in the 32.5° range and values extending to 71.43°, reflecting different topographies. The slope height (HEI) varies considerably, peaking at 115.65 m (28.04%), but covers a wide range from small slopes to large slopes. The pore pressure ratio (PPR) is bimodal, with modes at 0.075 (41.84%) and 0.325 (13.27%), indicating a mixture of well-drained and poorly drained conditions, respectively.

Fig. 2

Frequency distributions of slope stability input parameters.

Results

This section presents the main results of the analyses performed to identify the most influential factors affecting slope stability and to evaluate the performance of different deep-learning models in classifying slope stability.

Correlation and feature importance analysis

As depicted in Fig. 3, the correlation analysis provided valuable insights into the relationships between various parameters and the slope stability status (SST). The friction angle (FRI) emerged as the most influential factor among these parameters, exhibiting the strongest negative correlation (− 0.29) with the SST. This finding underscores the critical role of frictional resistance in determining slope stability, where higher friction angles correspond to enhanced stability due to increased resistance against gravitational forces. Additionally, cohesion (COH) demonstrated a moderate negative correlation of − 0.18 with SST. This implies that materials with greater cohesion tend to exhibit better slope stability characteristics. Cohesion contributes to the internal strength of the material, effectively resisting shear forces and reducing the likelihood of slope failure.

Fig. 3

Correlation matrix of study variables.

Moreover, the analysis revealed a negative correlation of − 0.16 between unit weight (UNW) and SST. This suggests that higher unit weights, indicative of denser materials, are associated with increased slope stability. The weight of the overlying materials plays a crucial role in providing stability by exerting pressure and confining the underlying layers. The slope height (HEI) exhibited a negative correlation of − 0.15 with the SST, indicating that taller slopes tend to be less stable. This relationship reflects the increased gravitational forces acting on taller slopes, which may exceed the cohesive and frictional resistance of the materials, leading to instability.

In contrast, the SLA showed a relatively weak negative correlation (− 0.07) with the SST. While slope angle influences stability by affecting the distribution of stresses and the potential for material movement, its impact appears to be less pronounced than that of other factors considered in this analysis. Interestingly, the pore pressure ratio (PPR) demonstrated a positive correlation of 0.11 with the SST. Higher pore water pressures within the soil or rock mass are associated with reduced effective stresses, compromising the shear strength of the material and, consequently, the slope stability. This positive correlation underscores the importance of considering pore water pressures in slope stability assessments, as they can significantly influence the stability of slopes, particularly under saturated or water-rich conditions.

Figure 3 provides valuable insights into the factors influencing slope stability, highlighting the interplay between material properties, slope geometry, and pore water pressures. Understanding these relationships is essential for accurate slope stability assessments and effective risk mitigation strategies in engineering and infrastructure projects.
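A correlation matrix like that in Fig. 3 is typically obtained with `pandas.DataFrame.corr`. The sketch below uses synthetic stand-in data with column names mirroring the paper's abbreviations; the toy stability rule is an assumption for illustration, not the study's labeling.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-ins for the study's variables (not the real dataset).
df = pd.DataFrame({
    "UNW": rng.uniform(12, 28, n),   # unit weight (kN/m3)
    "COH": rng.uniform(0, 150, n),   # cohesion (kPa)
    "FRI": rng.normal(30, 8, n),     # friction angle (deg)
    "SLA": rng.uniform(15, 70, n),   # slope angle (deg)
    "HEI": rng.uniform(5, 200, n),   # slope height (m)
    "PPR": rng.uniform(0, 0.5, n),   # pore pressure ratio
})
# Toy stability label: higher FRI/COH -> more likely stable (illustrative only).
df["SST"] = ((df["FRI"] / 45 + df["COH"] / 150) > 1.0).astype(int)

corr = df.corr(method="pearson")     # full Pearson correlation matrix
sst_corr = corr["SST"].drop("SST")   # correlations of each input with the target
```

With real data, `sst_corr` yields the per-variable coefficients discussed above, and the full `corr` matrix is what a heatmap such as Fig. 3 visualizes.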

Figure 4 illustrates the relative importance of the input variables in determining slope stability, evaluated using a random forest regressor model from the scikit-learn library. Cohesion (COH) emerges as the most critical parameter, with a feature importance score of 0.244. This finding aligns with the correlation analysis, which showed a moderate negative correlation between cohesion and slope stability status, and highlights the pivotal role of cohesion in regulating slope stability by governing the shear strength of the soil or rock mass. Unit weight (UNW) follows closely with an importance of 0.236, underscoring the influence of material density on the equilibrium between driving and resisting forces within the slope.

Fig. 4

Relative importance of input variables.

Moreover, the slope height (HEI) exhibits considerable significance, with a feature importance of 0.210. This is consistent with the negative correlation between slope height and slope stability status found in the correlation analysis and highlights the intrinsic influence of geometric factors: taller slopes pose greater challenges to stability because of the increased gravitational forces acting upon them. The friction angle (FRI) also emerged as a key determinant, with an importance of 0.196, in line with its strong negative correlation with slope stability status and its crucial contribution to frictional resistance. Excluding the pore pressure ratio (PPR) due to its relatively low importance, the analysis indicates that cohesion, unit weight, slope height, and friction angle are the most influential factors for slope stability in this dataset. These influential features are then used to evaluate the performance of the deep learning models in the subsequent analysis.
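Importance scores like those in Fig. 4 come from scikit-learn's `feature_importances_` attribute. The sketch below trains a `RandomForestRegressor` (the estimator the paper names) on synthetic data; the toy target, which depends mainly on the COH and FRI columns, is an assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 400
features = ["UNW", "COH", "FRI", "SLA", "HEI", "PPR"]

# Synthetic inputs scaled to [0, 1]; not the study's dataset.
X = rng.uniform(0.0, 1.0, size=(n, len(features)))
# Toy target driven mainly by COH (col 1) and FRI (col 2) -- illustrative only.
y = (0.6 * X[:, 1] + 0.4 * X[:, 2] + 0.05 * rng.normal(size=n) > 0.5).astype(float)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
importances = dict(zip(features, model.feature_importances_))
ranked = sorted(importances, key=importances.get, reverse=True)  # most important first
```

The scores sum to one by construction, so each value can be read as a relative share of explanatory power, as in Fig. 4.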

Performance evaluation of deep-learning models

Following the classic feature importance analysis, a comprehensive evaluation of various deep learning models for the classification of slope stability was conducted. The metrics for precision, recall, and F score (Fig. 5) shed light on the performance of each model. The recurrent neural network (RNN) exhibited a notably high positive predictive value (PPV) of 0.933, indicating its robust ability to correctly identify unstable slopes. However, its negative predictive value (NPV) of 0.877 suggests room for improvement in accurately classifying stable slopes. The long short-term memory (LSTM) network, in contrast, showed a different performance profile, with a lower PPV of 0.882 but a higher NPV of 0.907, indicating better identification of stable slopes alongside slightly weaker performance on unstable cases. The overall F scores of 0.866 and 0.874 for the RNN and LSTM models, respectively, reflect a reasonable balance between precision and recall, providing a comprehensive overview of their performance.

Fig. 5

Precision, Recall, and F Score analysis for deep learning models in slope stability classification.
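All of the metrics quoted in this section (PPV, NPV, sensitivity, specificity, F score) derive from the confusion matrix. A minimal sketch of those definitions follows; the counts are made up for illustration and are not the models' actual outputs.

```python
def classification_metrics(tp, fp, tn, fn):
    """Confusion-matrix-based metrics; positive class = unstable slope."""
    ppv = tp / (tp + fp)              # precision (positive predictive value)
    npv = tn / (tn + fn)              # negative predictive value
    sensitivity = tp / (tp + fn)      # recall / true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean
    return {"PPV": ppv, "NPV": npv, "TPR": sensitivity,
            "TNR": specificity, "F1": f1}

# Hypothetical counts, for illustration only.
m = classification_metrics(tp=70, fp=5, tn=94, fn=12)
```

Reading the table values through these formulas makes the PPV/NPV trade-off between the RNN and LSTM discussed above explicit: the two quantities are computed from disjoint columns of the confusion matrix.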

While the F scores offer valuable insights, a closer examination of the classification performance indicators in Table 5 provides a more nuanced understanding of the relative strengths and limitations of each model. The generative adversarial network (GAN) model exhibited the highest overall accuracy of 0.913, showing its superior ability to classify stable and unstable slopes accurately.

Table 5 Classification performance indicators across deep learning approaches.

The sensitivity (true positive rate) of the GAN model, at 0.865, matches that of the LSTM network, indicating its effectiveness in detecting unstable slopes without overlooking a significant number of positive instances. The RNN, however, demonstrated the highest specificity of 0.959, suggesting its proficiency in avoiding false-positive errors and accurately detecting truly stable slopes.

Conversely, the lower specificity of the LSTM model, at 0.919, suggests a tendency to misclassify some stable cases as unstable. The GAN model's leading overall accuracy of 0.913 is further validated by the ROC analysis presented in Fig. 6.

Fig. 6

ROC analysis of the RNN, LSTM, and GAN models for slope stability classification.

The receiver operating characteristic (ROC) analysis, as depicted in Fig. 6, validates the GAN's robust classification capabilities across the entire spectrum of decision thresholds. With the highest area under the curve (AUC) value of 0.9285, the GAN consistently maintains a superior true-positive rate across most false-positive rate values compared with the other models. This indicates its ability to strike a better balance between correctly identifying unstable slopes and minimizing the misclassification of stable slopes, as evidenced by the performance metrics in Table 5. Although the RNN initially displays a slightly steeper curve at low false-positive rates, suggesting a potential advantage for applications with an extremely low tolerance for false positives, its performance plateaus at higher rates. In contrast, the ROC curve of the GAN exhibits a favorable profile across a wider range of thresholds. Considering both the AUC values and the overall shape of the ROC curves, the GAN emerges as the most suitable model for accurately classifying both stable and unstable slopes while minimizing overall misclassification errors.
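An ROC curve plots the true positive rate against the false positive rate over all decision thresholds, and the AUC summarizes it in one number. The sketch below uses scikit-learn's `roc_curve` and `auc` on toy scores; the labels and scores are assumptions, not the output of the paper's RNN/LSTM/GAN models.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(1)
# Toy labels (1 = unstable) and classifier scores; illustrative only.
y_true = rng.integers(0, 2, size=200)
scores = 0.4 * y_true + 0.6 * rng.random(200)  # informative but noisy scores

fpr, tpr, thresholds = roc_curve(y_true, scores)
roc_auc = auc(fpr, tpr)  # area under the ROC curve
```

Sweeping the threshold rather than fixing it at 0.5 is what allows the threshold-dependent comparison between the RNN and GAN curves described above.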

The integration of the binary bGGO technique for feature selection with the previously evaluated GAN model resulted in notable advancements in the classification of slope stability. As highlighted in Table 6, the bGGO-GAN model exhibited a remarkable 9.29% increase in sensitivity (true positive rate), from 0.865 to 0.946, compared with the classical GAN model. This enhancement signifies a significant improvement in correctly identifying unstable slopes without excessively overlooking positive instances. Furthermore, the positive predictive value (PPV) increased slightly from 0.918 to 0.928, a modest improvement of 1.05%. The negative predictive value (NPV) showed a more substantial increase, from 0.909 to 0.962, an improvement of 5.86%, indicating an enhanced ability to identify stable slopes while minimizing misclassifications.

Table 6 Enhanced performance metrics for the bGGO-GAN model vs. the classical GAN model.

Moreover, the F1 score, which reflects the harmonic mean of precision and recall, demonstrated a significant improvement from 0.891 to 0.937 in the bGGO-GAN model. This corresponds to a notable 5.13% increase in the overall performance of the model, highlighting its effectiveness in achieving a balanced combination of precision and recall. Overall, the integration of the bGGO technique with the GAN model has yielded substantial improvements in slope stability classification, with enhancements across key performance metrics. These advancements underscore the efficacy of leveraging advanced feature selection techniques to augment the capabilities of machine learning models, ultimately leading to more accurate and reliable predictions in critical engineering applications.
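The percentage gains reported in Table 6 are relative changes of the form (new − old)/old. A small helper makes the convention explicit; the values below are placeholders for illustration, not the table's figures.

```python
def relative_gain(old, new):
    """Percentage improvement of `new` over a baseline metric `old`."""
    return 100.0 * (new - old) / old

# Placeholder before/after values, for illustration only.
gain = relative_gain(0.80, 0.88)  # 10.0% improvement over the baseline
```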

Figure 7 compares the proposed bGGO algorithm against six other optimization algorithms. bGGO achieves the lowest average error of 0.438, surpassing the next-best algorithm (bGWO) by 15.1% and the worst-performing algorithm (bSBO) by 21.5%. This substantial improvement in accuracy underscores bGGO's robust performance in minimizing errors across various optimization scenarios. Furthermore, bGGO exhibits the best fitness score of 0.463, which is 5.9% better than that of the next closest competitor (bWAO at 0.492) and 20.7% superior to the worst performer in this category (bSBO at 0.559).

Fig. 7

Binary optimization techniques for feature selection in slope stability modeling.

bGGO also shows remarkable efficiency, with an average selection size of 0.451, which is significantly smaller than that of all other algorithms. It is 24% more efficient than the next best performer (bWAO at 0.594) and 45.1% more efficient than the least efficient algorithm (bSBO at 0.821). This indicates that bGGO can achieve superior results while utilizing fewer resources or iterations, potentially leading to significant computational savings. Moreover, bGGO demonstrates excellent consistency in its performance, with its worst fitness (0.562) being only 21.4% higher than its best fitness (0.463), which is the smallest difference among all algorithms.
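Binary wrapper methods such as bGGO typically score a candidate feature mask by combining the classification error of a model trained on the selected features with the fraction of features kept, which is why both "error" and "selection size" appear as comparison criteria above. The weighted fitness below is a common formulation and an assumption on our part; the paper does not spell out its exact objective.

```python
import numpy as np

def wrapper_fitness(mask, error_rate, alpha=0.99):
    """Lower is better: weighted sum of model error and selection-size ratio.

    mask       -- binary vector, 1 = feature selected
    error_rate -- classification error of a model trained on those features
    alpha      -- accuracy-vs-compactness trade-off (assumed value)
    """
    mask = np.asarray(mask)
    size_ratio = mask.sum() / mask.size
    return alpha * error_rate + (1.0 - alpha) * size_ratio

# Two hypothetical candidate masks over the six input features.
f_all = wrapper_fitness([1, 1, 1, 1, 1, 1], error_rate=0.12)
f_sub = wrapper_fitness([1, 1, 0, 1, 0, 0], error_rate=0.10)  # smaller and more accurate
```

Under such an objective, a subset that is both smaller and more accurate strictly dominates the full feature set, mirroring how bGGO's low average error and small selection size jointly produce its low fitness values.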

In summary, bGGO has emerged as the most suitable binary optimization technique, combining low classification error, efficient feature selection, strong best fitness values, and robust worst fitness values. These attributes collectively contribute to its effectiveness in achieving accurate and reliable slope stability classification.

Comparative analysis of optimization techniques

According to Table 7, the bGGO-GAN model's accuracy of 95% is competitive with that of models such as the support vector regression (SVR) of Bui et al.89 at 95.09% and slightly lower than that of the ANN-PSO of Gordan et al.90 at 98.6%. The R2 values indicate that many models achieve high predictive power, with some, such as that of Moayedi et al.82, reaching an R2 close to 1. Moreover, when evaluating the effectiveness of various models for slope stability classification, it is crucial to consider both the performance metrics and the size of the datasets used, as models trained on larger datasets tend to offer more reliable and generalizable predictions. For instance, one study91 achieved an R2 of 0.9939 with 630 samples using an MLP, while Moayedi et al.82 reported an R2 of 0.9998 using RT with the same sample size, highlighting their strong predictive power.

Table 7 Some of the classification studies on slope stability.

Similarly, Bui et al.89 and Gordan et al.90 demonstrated high accuracy and R2 values with large datasets, emphasizing their robust performance. In contrast, models with smaller datasets, such as Samui93 (SVM, 46 samples, R2 = 0.875) and Hidayat et al.94 (ANFIS, 53 samples, R2 = 0.96), while showing good performance, may have limitations in their generalizability due to the smaller sample sizes. The present study's bGGO-GAN model, which leverages a substantial dataset of 627 samples, achieved a high accuracy of 95%, demonstrating its robust and reliable performance in slope stability classification. This comparative analysis underscores the importance of dataset size in model training, with larger datasets generally supporting more accurate and reliable predictive models.

Discussion

The results from this study provide valuable insights into the factors influencing slope stability and the efficacy of various deep-learning models in predicting it. The findings underscore the significance of parameters such as friction angle, cohesion, unit weight, and slope height in determining slope stability. Additionally, the performance of different deep learning models, particularly the generative adversarial network (GAN) and the GAN integrated with the binary bGGO feature selection technique, reveals substantial improvements in classification accuracy.

The correlation and feature importance analyses highlight that friction angle (FRI) and cohesion (COH) are critical determinants of slope stability. The negative correlations observed with slope stability status (SST) are consistent with the fundamental principles of geotechnical engineering. For instance, higher friction angles and greater cohesion generally enhance the resistance against failure, as supported by prior research101,102. Similarly, unit weight (UNW) and slope height (HEI) demonstrate significant roles in slope stability, reflecting the impact of gravitational forces and material density on stability.

The deep learning models' performance evaluation indicates that the GAN model excels in accuracy and robustness in slope stability classification. The GAN's superior accuracy (0.913) and the highest area under the curve (AUC) (0.9285) corroborate its effectiveness in balancing true positive and false positive rates, as supported by previous studies on GANs in complex prediction tasks63,103. The Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) networks, while effective, exhibit trade-offs between positive predictive value (PPV) and negative predictive value (NPV). This trade-off is consistent with findings from other applications where RNNs and LSTMs show variable performance depending on the context52,104.

The integration of the bGGO technique with GAN models results in notable performance enhancements, with significant improvements in sensitivity, positive predictive value, and overall F1 score. This advancement is aligned with recent advancements in optimization techniques, which emphasize the benefits of combining feature selection methods with machine learning models for enhanced prediction accuracy105,106. The bGGO algorithm, in particular, demonstrates exceptional efficiency and robustness, surpassing other optimization algorithms in error minimization and fitness scores. This finding supports the growing recognition of advanced optimization techniques in improving model performance107.

The analysis further illustrates the importance of dataset size in model performance. Models trained on larger datasets, such as the bGGO-GAN model with 627 samples, generally exhibit better predictive power and reliability, corroborating the observations from other studies on dataset size and model accuracy108,109. This highlights the need for ample data to ensure the generalizability and robustness of predictive models.

Overall, the study effectively combines traditional geotechnical parameters with advanced machine learning techniques, resulting in enhanced accuracy and reliability in slope stability classification. Future work may explore further enhancements in model performance and validation with diverse datasets to solidify the findings and extend their applicability in practical engineering scenarios.

Conclusions

This study aimed to identify the most influential factors affecting slope stability and evaluate the performance of various machine learning models for classifying slope stability. This research assessed the effectiveness of machine learning techniques combined with modern feature selection algorithms and conventional feature analysis methods.

Correlation analysis revealed that friction angle was the most influential factor affecting slope stability, with a robust negative correlation of − 0.29 with the slope stability status (SST). Cohesion, unit weight, and slope height also showed significant negative correlations, highlighting their importance in determining slope stability. The random forest regressor analysis further confirmed these findings, with cohesion emerging as the most critical parameter, followed closely by unit weight, slope height, and friction angle.

The evaluation of various deep learning models revealed that the generative adversarial network (GAN) outperformed the others, achieving the highest overall accuracy of 0.913. The robustness of the GAN model was further validated through receiver operating characteristic (ROC) analysis, which showed the highest area under the curve (AUC) value of 0.9285, indicating superior classification capabilities.

The integration of the binary bGGO technique for feature selection with the GAN model led to notable advancements in classification performance. The resulting bGGO-GAN model exhibited significant improvements across key performance metrics. Sensitivity increased by 9.29%, from 0.865 to 0.946, demonstrating an enhanced ability to correctly identify unstable slopes. The positive predictive value improved slightly, by 1.05%, while the negative predictive value showed a substantial increase of 5.86%. The F1 score, reflecting the balance between precision and recall, improved by 5.13%, from 0.891 to 0.937.

A comparison of the bGGO-GAN model's performance (95% accuracy) with that of other studies revealed its competitiveness in slope stability classification. While some models achieved slightly higher accuracy, it is important to note that the bGGO-GAN model was trained on a substantial dataset, enhancing its generalizability and reliability.

In summary, the developed bGGO-GAN model shows great potential for improving the accuracy and reliability of slope stability assessments, ultimately contributing to safer and more efficient engineering practices in slope stability management. However, this study has several limitations that warrant further investigation. Future research could benefit from expanding the dataset to include a wider range of geological and environmental conditions. Additionally, exploring the integration of other advanced models and hybrid approaches could further enhance the predictive capabilities of slope stability assessments.