1. Introduction
Because the overburden is relatively fragile and underground mining is shallow, ground surface cracks are easily generated by mining activities [
1]. These cracks can immediately result in ventilation trouble under the mine shaft, runoff disturbance, and vegetation destruction; therefore, they can threaten mining safety, groundwater, and vegetation health [
2]. In the Loess Plateau, the loose structure and joint development characteristics of loess make the overburden susceptible to destruction. If underground mining aggravates this destruction, the overburden may collapse and numerous cracks may appear on the ground [
3]. These coal mining ground surface cracks can act as ventilation channels that deliver oxygen from the surface air into the underground mining area. The oxygen continuously reacts with coal and coalbed methane, and this reaction raises the temperature within the underground mining spaces. As the temperature rises, the coal can ignite spontaneously, which may result in a gas explosion and poses a significant threat to the safety of mining operations [
4]. Additionally, these ground surface cracks directly damage buildings, transportation lines, and other geotechnical facilities and lead to mechanical destruction of plant roots, vegetation, and soil degradation, as well as ecological environment problems, such as soil and water loss [
5]. As a consequence, the quick, accurate, and effective delineation of ground surface cracks is a crucial issue for the prevention and control of underground mining disasters. Furthermore, the quick delineation of ground surface cracks can promote ecological restoration for green mines.
Currently, monitoring ground surface cracks in mining areas primarily relies on in situ observations. Information about these cracks can be obtained using a total station or the Global Positioning System (GPS). These methods are time-consuming and laborious, and the results are often discontinuous and limited in volume. In addition, some surveys have shown that ground surface crack widths are often at the centimeter level [
6,
7], with a small number exceeding the sub-meter scale. Despite the sub-meter resolution capability of current optical satellite images, detecting all cracks remains a challenge, particularly for those with narrower widths. UAV photogrammetry technology, known for its exceptional adaptability to various terrains, swift response capabilities, and cost effectiveness, efficiently captures high-resolution images with centimeter-level precision on a regular and ongoing basis. This capability facilitates the dynamic monitoring of ground cracks within mining areas [
8].
Ground surface crack delineation from UAV images involves imagery processing techniques [
9]. Currently, a variety of technical methods are employed for crack processing in image analysis [
10]. These approaches encompass traditional manual interpretation, where experts visually identify and label cracks. However, this method is susceptible to misclassification and omission due to the labor-intensive nature and inefficiency of operators manually delineating cracks based on their own expertise and experience [
11]. The threshold segmentation method divides an image into different regions or objects by setting one or more threshold values. It is sensitive to noise, image quality, and variations in lighting, potentially affecting the choice of thresholds and the accuracy of segmentation. Additionally, for objects with irregular shapes or complex boundaries, threshold segmentation may not yield precise results [
10]. The Canny algorithm is a widely used edge detection technique in the field of computer vision and image processing. Dual-threshold detection provides superior localization capabilities for weak edges, and hysteresis thresholding effectively detects and connects edges, maintaining high accuracy even in the presence of noise, yet it has poor generalization performance [
12]. Machine learning techniques relying on feature engineering have been widely used in crack delineation. These techniques are more accurate than the traditional imagery processing methods but rely heavily on the characterization of manually extracted features [
13]. Therefore, to accommodate a variety of unique conditions, innovative technologies and methods are necessary for the delineation of crack features in coal mining areas.
Deep learning is widely applied in the field of image processing. As a specific deep learning architecture, convolutional neural networks (CNNs) have outstanding advantages in automatically and effectively extracting image features, especially in crack recognition and delineation, where CNN-based semantic segmentation is particularly efficient. For example, the CrackPix model [14], based on the Fully Convolutional Network (FCN), can detect cracks automatically in concrete infrastructure without manual feature design, which is significantly superior to the traditional methods. Fan et al. [15] proposed U-HDN, a neural network based on an encoder-decoder structure, which integrates crack context information through a multi-dilation module to extract more crack features and improve crack detection performance. However, traditional convolutional neural networks usually rely on feature input at a single scale, while ground cracks are long, narrow, small-scale objects. Therefore, these networks need multi-scale feature information to achieve more accurate crack delineation [
16]. To solve this problem, the Feature Pyramid and Hierarchical Boosting Network (FPHBN) [17] detects cracks accurately by integrating contextual information into low-level features. Chen et al. [18] enhanced the multi-scale features and performance of deep learning networks by using convolutions with different dilation rates together with attention mechanisms, and proposed a method combining U-Net and the Multi-scale Global Inference module (MGRB) to significantly improve the efficiency and accuracy of ground crack detection.
The studies mentioned above are highly inspiring for our research. However, the presence of various types of grass, fallen leaves, and gravel, together with varying lighting conditions and shadows at the coal mine site, makes it challenging to accurately detect and extract cracks from the complex backgrounds of the surface crack images captured by the UAV. In addition, the lack of datasets for coal mine surface cracks is another issue that needs to be emphasized. Existing public datasets, such as Crack500 [19] and DeepCrack [20], mainly cover concrete surfaces, including pavements, rock cracks, architecture cracks, and brick wall cracks. They are very different from coal mining ground surface cracks because cracks in the Loess Plateau are easily affected by topography and vegetation coverage.
To address these issues, an improved U-Net model is proposed in this paper. The RN module, DAM module, and ASPP module are introduced to construct the DRA-UNet model for more precise automatic delineation of surface cracks from UAV high-resolution photogrammetric images. The analysis demonstrates that integrating the RN module, DAM module, and ASPP module significantly enhances the ability to capture multi-scale features. The DRA-UNet model also improves attention to cracks in loess areas, reduces interference from irrelevant information, and enhances the generalization capability of multi-level spatial and spectral features.
3. Results
3.1. Experimental Setup
The experiments were implemented in PyTorch 2.0.0 on a 64-bit Windows operating system. The machine had an Intel(R) Xeon(R) W-2265 CPU @ 3.50 GHz, an NVIDIA RTX 4000 GPU, and 128 GB of RAM. The model was trained with the Adam optimizer for 100 epochs, with an initial learning rate of 5 × 10⁻⁴ and a batch size of 16.
3.2. Data Enhancement
In the experiments, data augmentation is employed to enhance the robustness of the model, mitigate overfitting, and improve its generalization ability. Augmentation is applied only to the training set. Three types of data augmentation were implemented:
(1) Random adjustment of brightness and contrast;
(2) Random horizontal and vertical flipping;
(3) Random rotation by 90, 180, or 270 degrees.
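The three augmentation types listed above can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline used in the experiments: the function names, the contrast and brightness ranges, the 0.5 flip probabilities, and the choice to include "no rotation" as an option are all assumptions. The one point the sketch is meant to show is that geometric transforms must be applied identically to the image and its crack mask, while photometric changes touch only the image.

```python
import random


def adjust_brightness_contrast(image, alpha, beta):
    """Scale contrast by alpha, shift brightness by beta, clip to [0, 255]."""
    return [[min(255, max(0, round(alpha * v + beta))) for v in row] for row in image]


def hflip(image):
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in image]


def vflip(image):
    """Reverse the row order (top-bottom flip)."""
    return [row[:] for row in image[::-1]]


def rot90(image, k):
    """Rotate counterclockwise by k * 90 degrees."""
    out = [row[:] for row in image]
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counterclockwise quarter turn.
        out = [list(r) for r in zip(*out)][::-1]
    return out


def augment(image, mask, rng):
    """Apply the three augmentation types to a grayscale image and its mask."""
    # (1) Photometric change: image only. Ranges are illustrative assumptions.
    image = adjust_brightness_contrast(image, rng.uniform(0.8, 1.2), rng.uniform(-20, 20))
    # (2) Random flips: image and mask together.
    if rng.random() < 0.5:
        image, mask = hflip(image), hflip(mask)
    if rng.random() < 0.5:
        image, mask = vflip(image), vflip(mask)
    # (3) Random rotation by 0, 90, 180, or 270 degrees (0 is an added assumption).
    k = rng.choice([0, 1, 2, 3])
    return rot90(image, k), rot90(mask, k)
```

Passing a seeded `random.Random` instance as `rng` keeps the augmentation reproducible between runs.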
3.3. Loss Function
When segmenting ground surface cracks in mining areas, it is essential to consider both segmentation precision and recall rates. The Dice Loss presented in Equation (6) offers a more precise evaluation of the correspondence between the predicted results and actual values [
26]. This prioritizes coverage of cracked regions during training sessions, leading to improved segmentation recall rates. Nevertheless, this type of loss solely emphasizes similarity without penalizing misclassifications, thereby limiting its effectiveness at enhancing model accuracy.
The binary cross-entropy (BCE) loss in Equation (7) is primarily used for binary classification tasks; it measures the difference between the predicted results and actual values while effectively penalizing false classifications [27]. However, in ground surface crack segmentation, cracked regions make up only a small proportion of each image, which biases the model towards predicting them as background.
To mitigate this class imbalance, class weights are introduced into the loss function. Adjusting the weight of positive samples controls their contribution to the loss and lets the model focus more on crucial crack information.
DRA-UNet combines the Dice Loss with the BCE loss to harness their respective strengths and improve overall model performance, as shown in Equation (8),
where N is the number of pixels, q_i is the actual category, p_i is the predicted category, and ε is the smoothness coefficient.
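As a rough illustration of how a Dice term and a class-weighted BCE term can be combined, the sketch below uses the commonly used textbook forms of both losses over flattened pixel lists, with p as predicted crack probabilities and q as 0/1 labels. The exact forms of Equations (6) to (8) are not reproduced in this section, so the default `eps`, the `pos_weight` parameter, and the `dice_weight` and `bce_weight` mixing coefficients are assumptions rather than the settings used for DRA-UNet.

```python
import math


def dice_loss(p, q, eps=1.0):
    """Dice loss: 1 - (2 * sum(p_i * q_i) + eps) / (sum(p_i) + sum(q_i) + eps)."""
    intersection = sum(pi * qi for pi, qi in zip(p, q))
    return 1.0 - (2.0 * intersection + eps) / (sum(p) + sum(q) + eps)


def weighted_bce_loss(p, q, pos_weight=1.0, tiny=1e-7):
    """Mean binary cross-entropy with an extra weight on positive (crack) pixels."""
    total = 0.0
    for pi, qi in zip(p, q):
        pi = min(max(pi, tiny), 1.0 - tiny)  # keep log() finite
        total -= pos_weight * qi * math.log(pi) + (1.0 - qi) * math.log(1.0 - pi)
    return total / len(p)


def combined_loss(p, q, pos_weight=1.0, dice_weight=1.0, bce_weight=1.0, eps=1.0):
    """Weighted sum of the two terms; the mixing weights are hypothetical."""
    return (dice_weight * dice_loss(p, q, eps)
            + bce_weight * weighted_bce_loss(p, q, pos_weight))
```

With one crack pixel in four, a prediction that misses the crack incurs a larger BCE penalty as `pos_weight` grows, which is the mechanism the class weights rely on to counter the background bias.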
3.4. Comparative Methods
To prove the accuracy and feasibility of the DRA-UNet model, we compared the evaluation indicators of DeepLabV3+, SegNet, PSPNet, Segformer, and FastSCNN with the DRA-UNet model as follows.
(1) DeepLabV3+
DeepLabV3+ is a deep neural network for image semantic segmentation [28]. It combines atrous convolution with atrous spatial pyramid pooling (ASPP), which enhances the model’s ability to capture objects at different scales while maintaining high computational efficiency. DeepLabV3+ is one of the important models in semantic segmentation and can capture features at multiple scales over a wide receptive field.
(2) SegNet
SegNet is an end-to-end deep learning network [29]. Features are extracted from the input imagery by the encoder network, and the spatial resolution of the image is restored by the decoder network. Finally, the category of each pixel is output by a Softmax classifier to realize the semantic segmentation of the imagery.
(3) PSPNet
PSPNet (Pyramid Scene Parsing Network) introduces the Pyramid Pooling Module, which captures global and local contextual information through pooling operations at different scales to improve the performance of semantic segmentation [30]. PSPNet performs well in handling complex scenes and multi-scale objects and is widely used in tasks such as cityscape segmentation.
(4) Segformer
Segformer is a semantic segmentation model that combines the advantages of the transformer architecture and convolutional neural networks (CNNs) [31]. By using an efficient transformer encoder to extract global features, combined with a lightweight decoder for fine-grained segmentation, Segformer achieves a good balance between accuracy and computational efficiency and is suitable for a wide range of semantic segmentation application scenarios.
(5) FastSCNN
FastSCNN is an efficient semantic segmentation model designed for mobile devices and embedded systems [32]. It adopts a lightweight architecture, including a learning-to-downsample module, a global feature extraction module, and a feature fusion module, which significantly reduces the computational overhead and memory usage while maintaining segmentation accuracy, making it well suited for real-time applications.
3.5. Evaluation Indicators
In our experiments assessing the model’s efficacy and accuracy, we utilized key performance metrics, such as the Pr (precision rate), Re (recall rate), F1 (F1 score), and MIoU (mean intersection over union), which are defined below.
In imagery segmentation tasks, TP, FP, FN, and TN denote the true positives, false positives, false negatives, and true negatives in the confusion matrix, respectively, as shown in Table 2. When dealing with highly unbalanced samples, the key is to reduce false negatives (FNs) to ensure that as many cracks as possible are correctly identified. Therefore, for crack segmentation tasks, recall is more significant than precision. Moreover, precision and recall must be traded off against each other. High-sensitivity models usually achieve higher recall but may have lower precision, whereas a lower-sensitivity model may improve precision at the cost of recall. To balance the two, the F1 score is introduced. The F1 score is the harmonic mean of precision and recall; it is high only when both precision and recall are high, which makes it suitable for cases where both need to be optimized. In addition, MIoU (mean intersection over union) measures the similarity between actual cracked pixels and predicted cracked pixels. The higher the MIoU, the more closely the model-predicted cracks match the real cracks, reflecting a better segmentation result.
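The four indicators can be computed directly from the confusion matrix counts. The sketch below uses the standard definitions of precision, recall, and F1, and takes MIoU as the mean of the crack-class and background-class IoU, which is the usual convention for binary segmentation but an assumption here, because the defining equations are not reproduced in this section. The function names are illustrative.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over flattened 0/1 prediction and label lists."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    return tp, fp, fn, tn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a class has no pixels."""
    return numerator / denominator if denominator else 0.0


def segmentation_metrics(tp, fp, fn, tn):
    """Return Pr, Re, F1, and MIoU as fractions in [0, 1]."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2.0 * precision * recall, precision + recall)
    iou_crack = safe_div(tp, tp + fp + fn)
    iou_background = safe_div(tn, tn + fp + fn)
    miou = (iou_crack + iou_background) / 2.0
    return {"precision": precision, "recall": recall, "f1": f1, "miou": miou}
```

For example, 80 true positives, 20 false positives, 20 false negatives, and 880 true negatives give a precision, recall, and F1 of 0.8 each, while the MIoU is about 0.81 because the large background class has a much higher IoU than the crack class.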
3.6. Experiments
On the GCCMA-UAV dataset, we conducted comparison experiments between DRA-UNet and five other semantic segmentation methods: DeepLabV3+, SegNet, PSPNet, Segformer, and FastSCNN.
The crack delineation results for all six methods are shown in Figure 7. Every DL method recognizes cracks that are clearly visible to the naked eye well. However, upon closer examination, the delineation results of DRA-UNet in the red frames of the first three rows are more precise and more closely resemble the ground truth. In addition, DRA-UNet can accurately recognize cracks that are difficult to distinguish with the naked eye or that are not clearly defined, as seen in the red frames of the final three rows. In contrast, the other DL methods cannot depict a clear and complete boundary of the real shape of the ground cracks, which highlights the advantage of DRA-UNet in complex cases.
On the coal mining ground crack dataset, our network exhibits a clear performance advantage (Table 3). DRA-UNet achieves the highest recall, 77.29%, which shows that it captures most ground cracks, even though its precision is slightly lower than that of DeepLabV3+. Its MIoU reaches 69.88% and its F1 score reaches 78.87%, demonstrating superior segmentation quality and overall performance.
3.7. Ablation Experiments
Table 4 presents the results of the ablation experiments, comparing the performance of different models on four metrics.
U-Net + RN. Compared with the basic U-Net, adding the RN module slightly lowered the Pr (from 81.35% to 80.65%) and raised the Re (from 73.46% to 76.86%). F1 remained unchanged, and MIoU slightly decreased (from 67.90% to 66.90%). This might suggest that the RN enhanced the ability to capture details, thereby raising the recall rate, while the potential introduction of some redundant information caused a marginal decrease in precision.
U-Net + DAM. After introducing the DAM module, the Pr rose significantly (from 81.35% to 84.76%), but the Re dropped (from 73.46% to 71.41%), leading to a slight reduction in F1 (from 76.81% to 76.43%); MIoU also decreased (from 67.90% to 65.57%). The attention mechanism of the DAM helped to locate the target area more precisely, thereby enhancing the precision. However, because it focused more on specific areas, it might cause missed detections and thus lower the recall rate.
U-Net + ASPP. After adding the ASPP module, the Pr slightly decreased (from 81.35% to 80.01%), but the Re increased significantly (from 73.46% to 76.63%), raising F1 to 78.21%. However, MIoU decreased to 65.25%. ASPP enhanced the model’s perception of targets of different sizes through multi-scale feature extraction, improving the recall rate, but it might reduce precision by increasing the misclassification of background areas.
U-Net + RN + DAM. After combining the RN module and the DAM module, the Pr decreased to 77.90%, the Re increased to 74.08%, F1 was 75.84%, and MIoU was 66.96%. This combination balanced precision and recall. However, compared with the individual use of the RN or DAM, the effect was not as anticipated. This might be because the functions of the two modules had some redundancies or conflicts, leading to no significant improvement in performance.
U-Net + RN + ASPP. After combining the RN and ASPP, the Pr decreased to 78.50%, the Re increased to 76.08%, F1 was 73.84%, and MIoU increased to 68.66%. This combination was beneficial for improving the overall target detection ability. However, the decrease in Pr might be attributed to the increase in model complexity, which made it more challenging to distinguish boundaries accurately.
U-Net + DAM + ASPP. After combining the DAM module and the ASPP module, the Pr increased to 82.54%, the Re slightly decreased to 73.29%, F1 was 75.87%, and MIoU was 65.68%. This indicated that the two modules could be complementary but might also conflict. For instance, the DAM tended to focus on important areas, while ASPP focused on multi-scale information, which might lead to information conflicts.
DRA-UNet. The final DRA-UNet incorporated all three modules and performed best. The Pr reached 84.92%, the Re reached 77.29%, F1 was 78.87%, and MIoU reached 69.88%. This outcome suggested that, under a reasonable architectural design, the different modules could complement one another and maximize performance. The DAM strengthened the attention mechanism, ASPP provided multi-scale information, and the RN helped to stabilize the gradient and optimize the training process, enabling DRA-UNet to achieve the best results on all four metrics in the ablation experiments.
3.8. Model Generalization Study
In this section, we employ the publicly accessible Crack500 dataset to examine the generalization capabilities of the DRA-UNet model. The Crack500 dataset comprises 500 images of road cracks, each with a resolution of 2000 × 1500 pixels, captured using mobile phones. These 500 images were cropped to generate a total of 3368 smaller images, each measuring 256 × 256 pixels. Of these, 2696 images form the training set, and 336 images each are allocated to the test set and the validation set.
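A split with these set sizes could be reproduced along the following lines. The section does not state how the cropped images were assigned to each set, so the seeded random shuffle and the function name `split_indices` are assumptions made purely for illustration.

```python
import random


def split_indices(n_total, n_train, n_val, n_test, seed=0):
    """Shuffle image indices and cut them into train, validation, and test lists."""
    if n_train + n_val + n_test != n_total:
        raise ValueError("split sizes must add up to the total number of images")
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# Sizes reported for the cropped Crack500 images: 2696 + 336 + 336 = 3368.
train_ids, val_ids, test_ids = split_indices(3368, 2696, 336, 336)
```

Fixing the seed keeps the three sets disjoint and identical across runs, which matters when several models are compared on the same test images.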
Figure 8 compares the crack identification results of the DRA-UNet model with those of other models on the Crack500 dataset. The findings indicate that, in contrast to the alternative models, the DRA-UNet model effectively and accurately detects subtle and ambiguous cracks while demonstrating superior noise suppression capabilities; consequently, its recognition results are more closely aligned with the ground truth.
Table 5 presents a comparative analysis of the performance of the DRA-UNet model against other models on the Crack500 dataset. On this dataset, the DRA-UNet model achieved the highest MIoU, 80.32%. While its precision was lower than that of Segformer and FastSCNN, it outperformed the other models in terms of recall and F1 score. These results indicate that the DRA-UNet model exhibits strong generalization capabilities and robustness when applied to the Crack500 dataset.
4. Discussion
Here, the comparison between DRA-UNet and other methods regarding crack delineation is shown in
Table 6. A deep regression model named Faster R-CNN_YOLO was applied to delineate cracks in high-resolution images with dirty walls, pavements, marbles, etc., of different material types [
33]. DMA-Net, including DeepLabv3+, ASPP, and the multi-attention module, was used for pavement crack detection. DMA-Net was applied to the Crack500 dataset, the DeepCrack dataset, and the Fma (Fitchburg Municipal Airport) dataset, and it showed excellent performance in road pavement crack segmentation, even in pavement crack images not captured with professional mapping equipment [
34]. DDR-UNet (Deformable Dense Residual UNet) included the deformable convolution and deformable dense residual UNet. It was evaluated on three datasets (including an ore dataset) for ore imagery segmentation and measuring ore particle size distribution [
35]. GFSegNet (Ground Fissure Segmentation Network) included a DSDE (deep–shallow decoupled encoder), an MFFD (multi-scale feature fusion decoder), and a loss function. The Mine Ground Fissure Unmanned Aerial Vehicle dataset (MGF-UAV), which has a spatial resolution of 2.6 cm, was used for crack delineation. The model was also applied to the Crack500, DeepCrack, CrackForest, and ISPRS Potsdam datasets [
36]. MFPA-Net mainly included the MFPN (Multi-scale Feature Pyramid Network) module, the ASPP module, the DRN (an improved Dilated Residual Network) module, and the DAM module, and the GFCMA (Ground Fissures of the Coal Mining Area) dataset was built to train the MFPA-Net model [37]. The ground resolution of the crack imagery used for MFPA-Net was 33 cm; that is, each pixel corresponds to a real distance of 33 cm on the ground. MFPA-Net focused on the multi-scale spatial resolution features of the ground cracks and on propagating features across resolutions for the delineation of coal mine ground surface cracks. In contrast, DRA-UNet is inclined to fuse the spatial and spectral features of ground cracks from UAV images. Our GCCMA-UAV dataset has a ground resolution of 3 cm, so it can be used to delineate cracks with centimeter-level widths. Compared to the other five models, DRA-UNet shows a comparable level of crack delineation (
Table 6). The DRA-UNet model performed well in delineating cracks on the coal mine ground surface (
Figure 7). This indicates that DRA-UNet pays more attention to the fusion of the spatial and spectral features of cracks to enhance its anti-noise capability.
This study mainly focuses on semantic-level segmentation of ground cracks. However, the size and shape of these ground cracks are also important because this vector information can be used to track changes in cracks, monitor ground movement and deformation, and provide data support for the prevention and response to potential disasters. Future research could explore ways to comprehensively capture and utilize vector information to improve monitoring accuracy. The DRA-UNet model was used to delineate cracks in the coal mining ground surface of the Huojitu Coal Mine Shaft (
Figure 9a). A total of 4903 cracks appeared in the loose areas of the coal mining ground surface; their average length was 5.82 m and their average width was 5.9 cm. In particular, cracks appeared readily in the mining area (
Figure 9b), concrete pavement (
Figure 9c), and the woods (
Figure 9d). In the mining area, mining-induced subsidence has led to the formation of annular crack groups. Due to the hardness of the concrete pavement, the cracks were difficult to heal. In the woods, the cracks were very short and small.
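Per-crack statistics of this kind can be derived from a binary segmentation mask by grouping connected crack pixels. The sketch below is one possible approach and not the procedure used in this study: it labels 8-connected components with a breadth-first search and converts pixel counts to ground area using the 3 cm ground resolution of the GCCMA-UAV dataset. The function names are hypothetical, and estimating length and width would need an additional step, such as skeletonization, that is not shown.

```python
from collections import deque


def crack_component_sizes(mask):
    """Return the pixel count of every 8-connected crack region in a 0/1 mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search over the region containing pixel (r, c).
            seen[r][c] = True
            queue = deque([(r, c)])
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            sizes.append(size)
    return sizes


def crack_areas_m2(mask, gsd_m=0.03):
    """Convert region pixel counts to ground area, assuming square pixels of side gsd_m."""
    return [n * gsd_m * gsd_m for n in crack_component_sizes(mask)]
```

The number of cracks is then `len(crack_component_sizes(mask))`. Note that one physical crack interrupted by vegetation may be counted as several regions under this simple grouping.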
Ground surface cracking represents the primary manifestation of land degradation resulting from high-intensity coal mining. UAV photogrammetry provides a good solution for the rapid acquisition of ground crack imagery. This study presents a DL network framework, DRA-UNet, which is designed to address the challenges of ground surface crack detection in coal mine areas. The framework integrates the RN, DAM, and ASPP modules and can automatically and accurately extract ground surface cracks from UAV images with complex backgrounds. To obtain ground surface crack information automatically, quickly, and accurately from UAV images and to advance the use of DL technology in ground surface monitoring, the GCCMA-UAV dataset was constructed from UAV images collected in the coal mine area. Experiments were performed on the GCCMA-UAV dataset using various methods, and the results show that DRA-UNet achieves fine crack segmentation and is significantly better than similar DL networks, particularly for fine and fuzzy cracks in complex backgrounds.
The ablation study demonstrates that the combination of the RN, DAM, and ASPP modules greatly enhances the performance of DRA-UNet. DRA-UNet has shown remarkable reliability in capturing global context at a large scale and intricate details at a micro-level. We have identified substantial issues with existing models, which have difficulty effectively representing the spatial structural links of cracks. As a result, these models often fail to maintain the continuity of cracks and tend to overlook smaller cracks. Additionally, other models often yield incorrect classifications due to noise factors, such as vegetation and shadows. The results of the experiments indicate that our network exhibits excellent reliability in ground crack segmentation.
The method proposed in this paper is suitable for the rapid delineation of surface cracks in sparsely vegetated areas of the Loess Plateau. It relies on optical UAV images, which are sensitive to weather conditions and affected by factors such as vegetation cover. For regions with denser vegetation, combining multiple data sources, such as thermal infrared images, should be considered to improve the reliability of monitoring.
In this study, single-date UAV images are used to identify surface cracks. However, surface cracks continue to appear over time around active coal mining areas, so change detection of surface cracks is particularly important. Detecting such changes with deep learning techniques is a worthwhile research direction.
5. Conclusions
Based on the generalization validation on a public dataset, the DRA-UNet model introduced in this paper not only demonstrates exceptional crack recognition performance on a specific dataset (the GCCMA-UAV dataset) but also exhibits commendable robustness and adaptability across various types of crack image data. The model’s generalization capability has been validated, indicating that DRA-UNet can be effectively employed with other similar high-resolution UAV imagery datasets. This further underscores the model’s applicability in practical scenarios.
(1) Efficient deep learning model. DRA-UNet significantly enhances the feature extraction capability and crack identification accuracy by incorporating a residual network (RN), a dual attention mechanism (DAM), and an atrous spatial pyramid pooling (ASPP) module.
(2) Data augmentation strategy. Various data augmentation techniques are employed to enhance sample diversity, mitigate model overfitting, and improve the model’s generalization ability under complex backgrounds.
(3) Multi-scale feature extraction. The ASPP module effectively enhances fine crack extraction by capturing multi-scale features.
(4) Experimental validation. The results from experiments on the GCCMA-UAV dataset demonstrate that DRA-UNet outperforms existing similar models in Re, F1, and MIoU, with a Pr only slightly lower than that of DeepLabV3+, showcasing its superiority and broad applicability for surface crack identification in coal mines.
Future research can further optimize the model structure and explore additional DL techniques to enhance the accuracy and efficiency of surface crack identification. Furthermore, constructing larger and more diverse datasets of surface fractures is also essential for enhancing the model’s performance. In summary, DRA-UNet offers an efficient and reliable solution for automatically identifying surface cracks in coal mines using high-resolution images.