
CN116645592B - A crack detection method and storage medium based on image processing - Google Patents


Info

Publication number
CN116645592B
CN116645592B (application CN202310914403.6A)
Authority
CN
China
Prior art keywords
feature map
window
output
crack
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310914403.6A
Other languages
Chinese (zh)
Other versions
CN116645592A (en)
Inventor
牛伟龙
吴澄
盛洁
叶陆琴
吕景珑
钱曙杰
夏从东
吕志荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Rail Transit Group Co ltd
Suzhou University
Original Assignee
Suzhou Rail Transit Group Co ltd
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Rail Transit Group Co ltd, Suzhou University
Priority to CN202310914403.6A
Publication of CN116645592A
Application granted
Publication of CN116645592B
Legal status: Active
Anticipated expiration legal-status


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to the field of image processing. The deep-learning-based crack detection method proposed by this invention adopts the Swin Mask RCNN algorithm for crack detection, using Swin Transformer as its backbone network. Compared with previous object detection models, Swin Transformer has stronger feature extraction and expression capabilities, which yields richer features and helps improve crack detection accuracy. After the sliding-window operation of the Swin Transformer network, the constructed network adds a multi-layer window fusion module so that windows carrying crack information can be fused, preventing the loss of crack information and allowing cracks to be located more accurately in the image, improving position detection accuracy while reducing computation.

Description

A crack detection method and storage medium based on image processing

Technical Field

The present invention relates to the technical field of image processing, and in particular to a crack detection method and storage medium based on image processing.

Background Art

Mask RCNN, proposed in October 2017, is a deep learning model for object detection and instance segmentation. It inherits the two-stage detection framework of Faster RCNN and introduces an additional mask branch to handle instance segmentation. The mask branch adds a convolution and deconvolution network on top of Faster RCNN's RoI pooling layer to extract a binary mask for each instance in the target image, fusing the ideas of convolutional neural networks, region-based convolutional neural networks (RCNN), and segmentation networks. The network structure consists of four parts:

Backbone: a convolutional neural network for feature extraction, usually ResNet or ResNeXt;

Region Proposal Network (RPN): a network for generating candidate regions, the same as in Faster RCNN;

Bounding Box Head: a network for classifying and regressing candidate regions, the same as in Faster RCNN;

Mask Head: a fully convolutional network for generating masks for the candidate regions.

However, Mask RCNN performs only moderately on small-object detection. First, Mask RCNN uses relatively coarse grid features, so it struggles to capture the fine details of small objects and its detection performance naturally degrades. Moreover, in practical applications images rarely contain a monotonous background with a single small object; collected images generally contain multiple objects. Mask RCNN's classifier adapts well to large objects, but under the same parameter thresholds it has difficulty detecting multiple objects of the same kind but different sizes; and when one image contains a large object composed of many clusters of small objects, Mask RCNN cannot segment it accurately. Such images are precisely the cases that most need to be handled in crack detection. Therefore, although Mask RCNN detects large objects well, it offers little advantage in the field of crack detection.

Swin Transformer was proposed in March 2021 by Ze Liu et al. of Microsoft Research Asia. It uses a hierarchical structure, shifted windows, and linear-complexity computation to address the two major challenges of applying Transformers to images: the scale variation of visual entities and the heavy computation caused by high image resolution, and it achieves excellent performance on multiple vision tasks.

Swin Mask RCNN builds on Swin Transformer and uses the Mask RCNN framework to perform object detection and instance segmentation. By using Swin Transformer as the backbone network to replace the convolutional layers in Mask RCNN, and by using shifted windows of different sizes and a hierarchical structure, it addresses the challenges encountered in transferring from language to vision, such as the scale variation of visual entities and the high resolution of image pixels.

However, the sliding-window operation of Swin Transformer may split a fine object across multiple windows, degrading feature extraction and localization accuracy, and the hierarchical structure may cause fine objects to be lost or blurred in high-level feature maps, degrading classification and segmentation quality. In addition, the default position encoding is not sensitive to small objects such as cracks. As a result, Swin Transformer also performs only moderately on small-object detection tasks such as crack detection.

Summary of the Invention

Therefore, the technical problem to be solved by the present invention is to overcome the shortcomings of the prior art: Mask RCNN performs only moderately on small-object detection; the sliding-window operation of Swin Transformer may split fine objects across multiple windows, degrading feature extraction and localization accuracy; the hierarchical structure may cause fine objects to be lost or blurred in high-level feature maps, degrading classification and segmentation quality; and the default position encoding is not sensitive to small objects such as cracks, so that Swin Transformer also performs only moderately on small-object detection tasks such as crack detection.

To solve the above technical problems, the present invention provides a crack detection method based on image processing, comprising:

S101: acquiring a crack image to be detected;

S102: using the constructed, improved Swin Transformer network as the backbone network to extract features of the crack image to be detected and generate a series of candidate regions, including:

S201: using the constructed multi-head self-attention module to process the crack image to be detected and output a feature map, including:

dividing the crack image to be detected into multiple n*n small panes of the same window size to obtain a first partitioned feature map, and using the constructed Shift Window Attention module to slide the windows of the small panes in the first partitioned feature map to obtain a second partitioned feature map; feeding the first and second partitioned feature maps into the constructed multi-layer window fusion module, which normalizes both feature maps, performs feature mapping on each small pane of the normalized first and second partitioned feature maps, and computes a similarity matrix; the small panes of the first partitioned feature map overlap with those of the second partitioned feature map, and the similarity matrix is used to decide whether overlapping panes need to be fused: if the similarity matrix indicates connected crack information, the two panes are fused into one fused window for output; if the similarity matrix indicates no connected crack information, the two panes are not fused and are output as two independent windows, yielding the multiple fused windows and multiple independent windows output by the multi-layer window fusion module;

using the constructed transformation matrices to compute, for each window output by the multi-layer window fusion module, the query Q, key K, and value V of the self-attention mechanism; within each window, first applying a convolutional layer to linearly transform the pixels and extract pixel data, and then extracting features from the obtained Q, K, and V of each window, specifically:

$$\mathrm{Attention}(Q,K,V)=\mathrm{SoftMax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}+B\right)V$$

where $d_k$ is the number of columns of the Q and K matrices, i.e., the vector dimension, and B is the position bias of each window, whose value is given by the configured relative position bias parameter table;

using the relative position index constructed in the multi-head self-attention module to retrieve the parameters in the relative position encoding table, and fusing the per-window feature computations to output the feature map produced by the multi-head self-attention computation;
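As a concrete illustration of the attention step in S201, the following is a minimal PyTorch sketch of windowed self-attention with an additive relative position bias B. The window size, head count, and the way the bias table is indexed are illustrative assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    def __init__(self, dim: int, window_size: int = 7, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5          # 1 / sqrt(d_k)
        self.qkv = nn.Linear(dim, dim * 3)          # transformation matrices for Q, K, V
        self.proj = nn.Linear(dim, dim)
        # Learnable relative position bias table: one row per relative offset.
        self.bias_table = nn.Parameter(
            torch.zeros((2 * window_size - 1) ** 2, num_heads))

    def forward(self, x: torch.Tensor, rel_index: torch.Tensor) -> torch.Tensor:
        # x: (num_windows, tokens_per_window, dim); rel_index: (tokens, tokens) long
        B_, N, C = x.shape
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)        # each (B_, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        # Look up the bias B for every token pair via the relative position index.
        bias = self.bias_table[rel_index.view(-1)].view(N, N, -1)
        attn = attn + bias.permute(2, 0, 1).unsqueeze(0)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        return self.proj(out)

attn = WindowAttention(dim=96)
x = torch.randn(64, 49, 96)                  # 64 windows of 7x7 tokens
idx = torch.randint(0, 13 ** 2, (49, 49))    # hypothetical relative position index map
out = attn(x, idx)                           # (64, 49, 96)
```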

S202: using the constructed feature pyramid network to process the feature map output by the multi-head self-attention computation, extracting features of different scales from that feature map and fusing them, and outputting the fused feature map;

S203: using the constructed Region Proposal Network to make predictions on the fused feature map and obtain a series of candidate regions, each containing a score and a position offset; sorting and filtering the candidate regions by score and retaining some high-scoring candidate regions;

S103: classifying and regressing each retained high-scoring candidate region, obtaining the final detection boxes from the position offsets of the high-scoring candidate regions, predicting the pixel-level mask within each detection box, and outputting a detection result with object detection boxes and image segmentation.

Further, when the constructed Shift Window Attention module slides the windows of the small panes in the first partitioned feature map to obtain the second partitioned feature map: when a small pane slides horizontally or vertically, the sliding distance is smaller than the side length of the pane; when a small pane slides along its diagonal, the sliding distance is smaller than the diagonal length of the pane.

Further, using the constructed feature pyramid network to process the feature map output by the multi-head self-attention computation includes:

performing convolution operations on the input feature map to obtain multiple feature maps at different levels;

building a feature pyramid with a top-down path and lateral connections, where each pyramid level contains an upsampling operation and a 1x1 convolution used to add the feature map of the previous level to that of the current level and fuse the features;

smoothing each pyramid level with a 3x3 convolution and outputting the smoothed feature map of each level.
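A compact sketch of the pyramid just described, assuming illustrative channel counts: 1x1 lateral convolutions, top-down upsampling with addition, and a 3x3 smoothing convolution per level.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(96, 192, 384, 768), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, feats):
        # feats: backbone maps ordered fine -> coarse (decreasing resolution)
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down path: upsample the coarser map and add it to the finer one.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:],
                mode="bilinear", align_corners=False)
        # The 3x3 conv smooths each level and reduces upsampling aliasing.
        return [s(x) for s, x in zip(self.smooth, laterals)]
```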

Further, the feature maps at the different levels differ from one another in both size and number of channels.

Further, classifying and regressing each candidate region, obtaining the final detection boxes from the position offsets of the candidate regions, predicting the pixel-level mask within each detection box, and outputting a detection result with object detection boxes and image segmentation includes:

pooling the candidate regions to the same size using the constructed RoIAlign layer;

outputting a fixed-size feature vector for each candidate region, representing its features;

classifying and regressing the feature vector of each candidate region, and outputting a class label and a position offset representing the final detection box of each RoI;

performing a mask operation on the feature vector within the final detection box of each RoI to predict the pixel-level mask inside each detection box, outputting a binary matrix representing the foreground and background pixels within each detection box, and combining it with the original image to output a detection result with object detection boxes and image segmentation.

Further, the RoIAlign layer pools the candidate regions using bilinear interpolation.

Further, classifying and regressing the feature vector of each candidate region specifically comprises: using a fully connected layer to integrate and regress the features of each candidate region, and using a softmax layer to classify the features of each candidate region.
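A minimal sketch of such a head, with the feature size and class count as illustrative assumptions: one fully connected layer integrates the RoI features, a softmax produces class probabilities, and a parallel linear layer regresses box offsets.

```python
import torch
import torch.nn as nn

class BoxHead(nn.Module):
    def __init__(self, in_features: int = 1024, num_classes: int = 2):
        super().__init__()
        self.fc = nn.Linear(in_features, in_features)        # integrates RoI features
        self.cls_score = nn.Linear(in_features, num_classes) # class logits
        self.bbox_pred = nn.Linear(in_features, num_classes * 4)  # box offsets

    def forward(self, roi_feats: torch.Tensor):
        x = torch.relu(self.fc(roi_feats))
        probs = torch.softmax(self.cls_score(x), dim=-1)  # per-class probability
        deltas = self.bbox_pred(x)                        # position offsets
        return probs, deltas
```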

Further, performing the mask operation on the feature vector within the final detection box of each candidate region to predict the pixel-level mask inside each detection box specifically comprises: using a convolutional layer to extract features from the feature map within the detection box, and using a sigmoid activation function to binarize the extracted features and output a binary matrix.

Further, the convolutional layer used to extract features from the feature map within the detection box includes a constructed horizontal edge convolution kernel and a constructed vertical edge convolution kernel.
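The patent does not give the kernel weights; the Sobel-style horizontal and vertical kernels below are an illustrative assumption of how fixed edge kernels could be built into the mask head, with the sigmoid binarization applied at inference time.

```python
import torch
import torch.nn as nn

# Sobel-style edge kernels: an assumption, not the patent's actual weights.
horizontal_edge = torch.tensor([[-1., -2., -1.],
                                [ 0.,  0.,  0.],
                                [ 1.,  2.,  1.]])
vertical_edge = horizontal_edge.t()

class EdgeMaskHead(nn.Module):
    def __init__(self, in_channels: int = 256):
        super().__init__()
        edge = nn.Conv2d(in_channels, 2, kernel_size=3, padding=1, bias=False)
        with torch.no_grad():  # replicate each edge kernel across input channels
            edge.weight[0] = horizontal_edge.expand(in_channels, 3, 3)
            edge.weight[1] = vertical_edge.expand(in_channels, 3, 3)
        self.edge = edge
        self.out = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        x = self.edge(roi_feats)                 # horizontal/vertical edge responses
        mask = torch.sigmoid(self.out(x))        # pixel-level probabilities
        return (mask > 0.5).float()              # binary foreground/background (inference)

head = EdgeMaskHead()
binary_mask = head(torch.randn(1, 256, 28, 28))  # (1, 1, 28, 28) of 0s and 1s
```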

The present invention also provides a storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above crack detection method based on image processing.

Compared with the prior art, the above technical solution of the present invention has the following advantages:

The crack detection method and storage medium based on image processing according to the present invention innovatively apply the Swin Mask RCNN model to the field of crack detection and optimize the parameters and structure of the Swin Mask RCNN model. To extract cracks better, we build a multi-layer window fusion module after the sliding-window operation of the Swin Transformer network, so that small panes carrying crack features are fused with one another and complete crack information is retained. In the feature extraction stage of the model, we construct horizontal and vertical edge convolution kernels, which better extract the information of elongated cracks, improving position detection accuracy while reducing computation.

Brief Description of the Drawings

In order to make the content of the present invention easier to understand clearly, the present invention is described in further detail below based on specific embodiments and with reference to the accompanying drawings, in which

Figure 1 is a flow chart of a crack detection method based on image processing according to Embodiment 1 of the present invention;

Figure 2 is an algorithm workflow diagram of the crack detection method based on image processing constructed by the present invention;

Figure 3 is a schematic diagram of the sliding-window operation in the Swin Transformer network;

Figure 4 is a working diagram of the multi-layer window fusion module;

Figure 5 is a schematic of the working principle of the multi-layer window fusion module;

Figure 6 is a structural diagram of the self-attention mechanism (Self-Attention);

Figure 7 is a schematic diagram of the relative position encoding table constructed by the present invention;

Figure 8 is a working diagram of a two-head self-attention mechanism;

Figure 9 is a network schematic of the feature pyramid;

Figure 10 is the network structure diagram of the feature pyramid;

Figure 11 is the structure diagram of the Swin Transformer network;

Figure 12 compares the first original crack image with the crack detection results processed by the Mask RCNN network and by the crack detection method based on image processing provided by the present invention;

Figure 13 compares the second original crack image with the crack detection results processed by the Mask RCNN network and by the crack detection method based on image processing provided by the present invention;

Figure 14 compares the third original crack image with the crack detection results processed by the Mask RCNN network and by the crack detection method based on image processing provided by the present invention;

Figure 15 compares the fourth original crack image with the crack detection results processed by the Mask RCNN network and by the crack detection method based on image processing provided by the present invention.

Detailed Description of the Embodiments

The present invention is further described below with reference to the accompanying drawings and specific embodiments, so that those skilled in the art can better understand and implement the present invention; the given embodiments, however, do not limit the present invention.

Embodiment 1

Referring to Figure 1, the steps of Embodiment 1 of the present invention include:

S101: acquiring a crack image to be detected;

S102: using the constructed, improved Swin Transformer network as the backbone network to extract features of the crack image to be detected and generate a series of candidate regions, including:

S201: using the constructed multi-head self-attention module to process the crack image to be detected and output a feature map, including:

dividing the crack image to be detected into multiple n*n small panes of the same window size to obtain a first partitioned feature map, and using the constructed Shift Window Attention module to slide the windows of the small panes in the first partitioned feature map to obtain a second partitioned feature map; feeding the first and second partitioned feature maps into the constructed multi-layer window fusion module, which normalizes both feature maps, performs feature mapping on each small pane of the normalized first and second partitioned feature maps, and computes a similarity matrix; the small panes of the first partitioned feature map overlap with those of the second partitioned feature map, and the similarity matrix is used to decide whether overlapping panes need to be fused: if the similarity matrix indicates connected crack information, the two panes are fused into one fused window for output; if the similarity matrix indicates no connected crack information, the two panes are not fused and are output as two independent windows, yielding the multiple fused windows and multiple independent windows output by the multi-layer window fusion module;

using the constructed transformation matrices to compute, for each window output by the multi-layer window fusion module, the query Q, key K, and value V of the self-attention mechanism; within each window, first applying a convolutional layer to linearly transform the pixels and extract pixel data, and then extracting features from the obtained Q, K, and V of each window, specifically:

$$\mathrm{Attention}(Q,K,V)=\mathrm{SoftMax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}+B\right)V$$

where $d_k$ is the number of columns of the Q and K matrices, i.e., the vector dimension, and B is the position bias of each window, whose value is given by the configured relative position bias parameter table;

using the relative position index constructed in the multi-head self-attention module to retrieve the parameters in the relative position encoding table, and fusing the per-window feature computations to output the feature map produced by the multi-head self-attention computation;

S202: using the constructed feature pyramid network to process the feature map output by the multi-head self-attention computation, extracting features of different scales from that feature map and fusing them, and outputting the fused feature map;

S203: using the constructed Region Proposal Network to make predictions on the fused feature map and obtain a series of candidate regions, each containing a score and a position offset; sorting and filtering the candidate regions by score and retaining some high-scoring candidate regions;

S103: classifying and regressing each retained high-scoring candidate region, obtaining the final detection boxes from the position offsets of the high-scoring candidate regions, predicting the pixel-level mask within each detection box, and outputting a detection result with object detection boxes and image segmentation.

The crack detection method based on image processing proposed by the present invention adopts the Swin Mask RCNN algorithm for crack detection, using Swin Transformer as its backbone network. Compared with previous object detection models, Swin Transformer has stronger feature extraction and expression capabilities, which yields richer features and helps improve crack detection accuracy. To extract cracks better, we build a multi-layer window fusion module after the sliding-window operation of the Swin Transformer network, so that small panes carrying crack features are fused with one another, information is passed between panes, and complete crack information is retained.

Embodiment 2

The algorithm workflow of Embodiment 2 of the present invention is shown in Figure 2. The collected road crack, tunnel crack, and bridge crack data are used for training based on the Swin Mask RCNN deep learning network structure. The input images are scaled and manually annotated cracks. Prediction is divided into two main stages, as follows:

Stage 1: Swin Transformer is used as the backbone network to extract image features and generate a series of candidate regions. First, the input image is divided into small blocks called patches, and each patch is converted into a feature vector. Second, self-attention is computed for each patch; this step uses sliding windows and a hierarchical structure, and during the attention computation this embodiment improves the position encoding matrix to raise the model's efficiency and accuracy in detecting cracks. Finally, feature maps at multiple scales are output, each containing patches of different sizes and resolutions.

Candidate region generation: on the feature map of the last scale, a Region Proposal Network (RPN) predicts a series of candidate regions called Regions of Interest (RoIs). Each RoI contains a score and a position offset, representing the confidence and location of the region. The RoIs are sorted and filtered by score, and a certain number of high-scoring RoIs are retained as the input of the second stage.

Stage 2: each candidate region is classified and regressed to obtain the final detection boxes, and the pixel-level mask within each detection box is predicted. First, candidate region alignment is performed: each RoI is aligned on feature maps of different scales using an RoI Align layer, which preserves the resolution and positional accuracy of the feature maps; a fixed-size feature vector representing each RoI is then output. Next, detection box prediction is performed: the feature vector of each RoI is classified and regressed using a fully connected layer and a softmax layer to predict the category and position offset of each RoI, and a class label and a position offset are output, representing the final detection box of each RoI. Finally, mask prediction is performed: a mask operation is applied to the feature vector of each RoI using a convolutional layer improved for cracks and a sigmoid layer to predict the pixel-level mask within each detection box, and a binary matrix representing the foreground and background pixels within each detection box is output as the final result.

The specific steps of the crack detection method based on image processing provided in Embodiment 2 of the present invention include:

S31: acquiring the crack images to be detected and the annotated crack images;

S32: using Swin Transformer as the backbone network to extract image features and generate a series of candidate regions, including:

S321: cropping the whole image and embedding it into vectors, with the crop size set to 4*4 pixels; after cropping, the output channels are set to determine the size of the embedding vectors, and finally the H and W dimensions are flattened and moved to the first dimension. The input image is (960, 960, 3) (three RGB channels); after cropping it becomes (240, 240, 96), where 96 is the channel output specified by the model and 240 comes from 960/4 = 240; the window is set to 7*7, giving the partitioned feature map, i.e., the first partitioned feature map. The constructed Shift Window Attention module then slides the small panes of the first partitioned feature map, moving the window position down and to the right by two units each, to obtain the second partitioned feature map;

Shifted Window Attention in step S321: as shown in Figure 3, the standard Transformer architecture and its adaptations for image classification perform global self-attention, computing the relationship (attention map) between every token and all other tokens. Swin Transformer is different: it adopts a shifted-window (Shift Window Attention) design.
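The shapes in S321 can be sketched in PyTorch as follows, using the sizes stated in the text (960x960 input, 4x4 patches, 96 channels, 7x7 windows, shift of 2). Since 240 is not divisible by 7, the padding shown here is an illustrative assumption about how the partition would be made to fit.

```python
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 96, kernel_size=4, stride=4)  # 4x4 patch embedding
img = torch.randn(1, 3, 960, 960)
feat = patch_embed(img)                                   # (1, 96, 240, 240)

def window_partition(x: torch.Tensor, ws: int = 7) -> torch.Tensor:
    b, c, h, w = x.shape
    pad = (ws - h % ws) % ws                   # pad so H and W divide by ws (assumption)
    x = nn.functional.pad(x, (0, pad, 0, pad))
    h2, w2 = x.shape[-2:]
    x = x.view(b, c, h2 // ws, ws, w2 // ws, ws)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)

windows = window_partition(feat)                          # first partition
shifted = torch.roll(feat, shifts=(-2, -2), dims=(-2, -1))  # slide down/right by 2
shifted_windows = window_partition(shifted)               # second partition
print(feat.shape, windows.shape)  # (1, 96, 240, 240), (1225, 49, 96)
```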

S322: as shown in Figure 4, the first and second partitioned feature maps are fed into the multi-layer window fusion module (Multi-window confluence, MWC). The multi-layer window fusion module is a purely mathematical computation module added after the original sliding-window module; it introduces no trainable network parameters and does not increase the network's computational load. Its basic principle is as follows:

As shown in Figure 5, MWC takes the W-MSA output fw (the first partitioned feature map) and the SW-MSA output fsw (the second partitioned feature map) as input. First, the feature maps are normalized along the channel direction via softmax so that each vector has unit length. Specifically, the softmax operation is applied first, $\mathrm{softmax}(z_i)=\frac{e^{z_i}}{\sum_{c=1}^{C}e^{z_c}}$, where e is the base of the natural logarithm (the constant 2.71828...), $z_i$ is the output value of the i-th node, and C is the number of output nodes. The normalized feature vector is then divided by its modulus, ensuring that the vector length equals 1.

Then each small pane of the normalized first and second partitioned feature maps is processed: with each patch as a vector, the two feature maps are dot-multiplied at identical positions. Both inputs are (240, 240, 96); position-wise dot products finally yield a feature map of the same size as the original, and summation gives the similarity matrix of fw and fsw. The similarity matrix is the feature matrix of the small panes. The small panes of W-MSA and SW-MSA have overlapping parts: if the elements in the overlap of two panes contain crack feature parameters, there is possibly connected crack information between the two overlapping panes and they are fused; if the elements in the overlap contain no crack feature parameters, they are not fused.

Finally, when a small pane of the first partitioned feature map is fused with a small pane of the second partitioned feature map, the feature map obtained from fw and fsw through soft pooling and element-wise addition is multiplied by the similarity matrix, yielding the feature window fc with contextual information. In this way multiple windows can be combined, preventing the loss of information about elongated targets such as cracks. The soft pooling operation uses the softmax function to compute a weight for every value inside each pooling region and then takes the weighted average of the input features. Since real images are large, the operation is illustrated with a small example: suppose an input feature map of size 4x4 with one channel, soft-pooled with a 2x2 window. The input is the matrix (1 2 3 4; 5 6 7 8; 9 10 11 12; 13 14 15 16). For each 2x2 pooling region, the softmax weights are computed and the features in the region are averaged with those weights. For the first pooling region (1 2; 5 6), the softmax weights are (0.0048 0.0132; 0.2641 0.7179), and the weighted average of the region is 1*0.0048 + 2*0.0132 + 5*0.2641 + 6*0.7179 ≈ 5.66. The same weights apply to the other pooling regions (softmax weights do not change when a constant is added to every value in a region), giving the pooled feature map (5.66 7.66; 13.66 15.66): the original 4x4 feature map becomes a 2x2 pooled feature map after the 2x2 soft pooling. The element-wise addition adds fw and fsw at corresponding positions.
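A minimal sketch of these MWC computations, assuming the shapes stated above: channel-wise softmax normalization to unit length, a position-wise dot product producing the similarity map, and 2x2 soft pooling. The fusion threshold at the end is hypothetical; the printed soft-pool values reproduce the worked 4x4 example.

```python
import torch
import torch.nn.functional as F

def unit_normalize(f: torch.Tensor) -> torch.Tensor:
    f = F.softmax(f, dim=-1)                 # softmax along the channel direction
    return f / f.norm(dim=-1, keepdim=True)  # ensure vector length equals 1

def soft_pool2x2(x: torch.Tensor) -> torch.Tensor:
    # Weight each 2x2 region by the softmax of its own values, then average.
    b, c, h, w = x.shape
    patches = x.unfold(2, 2, 2).unfold(3, 2, 2).reshape(b, c, h // 2, w // 2, 4)
    weights = F.softmax(patches, dim=-1)
    return (weights * patches).sum(dim=-1)

# Reproduce the worked 4x4 example from the text.
t = torch.arange(1., 17.).view(1, 1, 4, 4)
print(soft_pool2x2(t))   # approx. [[5.66, 7.66], [13.66, 15.66]]

fw = torch.randn(240, 240, 96)   # stand-in for the W-MSA output
fsw = torch.randn(240, 240, 96)  # stand-in for the SW-MSA output
sim = (unit_normalize(fw) * unit_normalize(fsw)).sum(dim=-1)  # (240, 240) similarity
fuse = sim > 0.5                 # hypothetical per-position fusion decision
```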

S323: the constructed transformation matrices are used to compute, for each window output by the multi-layer window fusion module, the query Q, key K, and value V of the self-attention mechanism, and features are extracted from the obtained Q, K, and V of each window; the relative position index constructed in the multi-head self-attention module is used to retrieve the parameters in the relative position encoding table, and the per-window feature computations are fused to output the feature map produced by the multi-head self-attention computation;

The multi-head attention mechanism constructed by the present invention consists of multiple self-attention mechanisms (Self-Attention); Figure 6 is the structural diagram of a single self-attention mechanism:

The matrices Q (query), K (key), and V (value) are needed in the computation. In practice, Self-Attention receives as input either the matrix X composed of the representation vectors x, or the output of the previous encoding module. Q, K, and V are obtained by linear transformations of the Self-Attention input: with the input represented by the matrix X, they are computed with the linear transformation matrices $W^Q$, $W^K$, $W^V$ as $Q = XW^Q$, $K = XW^K$, $V = XW^V$, where $d_k$ is the number of columns of the Q and K matrices, i.e., the vector dimension.

After obtaining the matrices Q, K, and V, the output of Self-Attention can be computed according to the above formula; the specific procedure is as follows:

After obtaining $\frac{QK^{T}}{\sqrt{d_k}}$, Softmax is used to compute the attention coefficients of each vector with respect to the other vectors; the Softmax in the formula is applied to each row of the matrix, so that every row sums to 1. The resulting Softmax matrix is then multiplied by V to obtain the final output.

As shown in Figure 7, the relative position encoding of the present invention consists of two parts: the relative position index and the relative position encoding table. The relative position encoding table is constructed as follows: the table is created and its entries initialized; taking the annotated crack images as sample data and the detection accuracy on the sample data as the criterion, the parameters in the table are trained with machine learning methods to obtain the relative position encoding table. The relative position index is computed from the x and y offsets using the sign function, where:

the function sign(x) returns the sign of x: if x is positive, sign(x) returns 1; if x is negative, sign(x) returns -1; if x equals 0, sign(x) returns 0, and likewise for y; $\Delta x$ and $\Delta y$ denote the x and y offset values.

As shown in Figure 8, taking two-head attention as an example: the first self-attention head computes $b^{i,1}$, and similarly $b^{i,2}$ is obtained; the output is computed as $b^{i} = W^{O}\,[\,b^{i,1};\,b^{i,2}\,]$, where $W^{O}$ is the transformation matrix. In the figure, $e^{i}$ is the position encoding.

S324: the constructed feature pyramid network is used to process the feature map output by the self-attention computation, extracting and fusing features of different scales from the feature map and outputting the fused feature map;

The network schematic of the feature pyramid constructed in this embodiment of the present invention is shown in Figure 9:

The Feature Pyramid Network (FPN) is a general network structure for multi-scale object detection. FPN was first proposed by Facebook AI Research in 2017; its goal is to use feature maps at different scales to detect objects of different sizes.

FPN exploits features at two different levels: shallow low-level features, which have high resolution but little semantics, and deep high-level features, which have low resolution but rich semantics. Combined, they form multi-scale feature maps that can detect objects at different scales. FPN has two core steps: the first is to build a pyramid-like structure bottom-up, generating initial feature maps of different resolutions; the second is to upsample, fuse, and adjust the initial feature maps top-down to obtain better feature representations. Specifically, FPN upsamples the higher-level (coarser) feature map to the same resolution as the lower-level one and then takes their weighted sum, obtaining a richer feature expression.

The specific network structure of the feature pyramid is shown in Figure 10:

A backbone network performs convolutions on the input image. The input has 3 channels and there are multiple convolution kernels. For each kernel, convolution is first performed on the 3 input channels separately, and the 3 channel results are then summed to give the convolution output, yielding multiple feature maps at different levels, each differing in size and number of channels.

Then a top-down path and lateral connections are used to build a feature pyramid. Each pyramid level contains an upsampling operation and a 1x1 convolution, used to add the feature map of the previous level to that of the current level and thereby fuse high-level and low-level information. The upsampling operation here uses bilinear interpolation, a method that computes every pixel value of the high-resolution image from linear interpolation in two directions. Suppose a low-resolution image of size m x n is to be upsampled to a high-resolution image of size M x N. For each pixel position (X, Y) in the high-resolution image, the corresponding value must be computed. First, the corresponding position (x, y) in the low-resolution image is computed as $x = X \cdot \frac{m}{M}$, $y = Y \cdot \frac{n}{N}$.

Then the four pixels of the low-resolution image closest to (x, y) are found: $(x_1, y_1)$, $(x_2, y_1)$, $(x_1, y_2)$, $(x_2, y_2)$, where $x_1 = \lfloor x \rfloor$ and $y_1 = \lfloor y \rfloor$ are the largest integers not greater than x and y, and $x_2 = \lceil x \rceil$ and $y_2 = \lceil y \rceil$ are the smallest integers not less than x and y. Next, the pixel value at position (X, Y) of the high-resolution image is computed by

$$f(X,Y) = f(x_1,y_1)(x_2-x)(y_2-y) + f(x_2,y_1)(x-x_1)(y_2-y) + f(x_1,y_2)(x_2-x)(y-y_1) + f(x_2,y_2)(x-x_1)(y-y_1),$$

where $f(x_i, y_j)$ denotes the pixel value at position $(x_i, y_j)$ of the low-resolution image. With the above computation, a low-resolution image can be upsampled to a high-resolution image by linear interpolation.
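A direct implementation of the bilinear upsampling equations above, written in plain Python for clarity (a real pipeline would typically use F.interpolate instead). Clamping at the image border is an assumption about edge handling.

```python
import math

def bilinear_upsample(img, M, N):
    """Upsample an m x n image (list of lists) to M x N by bilinear interpolation."""
    m, n = len(img), len(img[0])
    out = [[0.0] * N for _ in range(M)]
    for X in range(M):
        for Y in range(N):
            x, y = X * m / M, Y * n / N            # map to low-resolution coordinates
            x1, x2 = math.floor(x), min(math.ceil(x), m - 1)  # clamp at the border
            y1, y2 = math.floor(y), min(math.ceil(y), n - 1)
            dx, dy = x - x1, y - y1
            out[X][Y] = (img[x1][y1] * (1 - dx) * (1 - dy)
                         + img[x2][y1] * dx * (1 - dy)
                         + img[x1][y2] * (1 - dx) * dy
                         + img[x2][y2] * dx * dy)
    return out

small = [[0.0, 1.0], [2.0, 3.0]]
print(bilinear_upsample(small, 4, 4))  # 2x2 image interpolated to 4x4
```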

Finally, a 3x3 convolution smooths each pyramid level, so that every level has the same number of channels and the aliasing effect introduced by upsampling is eliminated. An object detection model using FPN obtains better multi-scale feature maps for detecting objects of different sizes, which effectively improves detection accuracy.

S325: the constructed Region Proposal Network (RPN) layer is used to make predictions on the fused feature map, obtaining a series of candidate regions.

S33: classifying and regressing each candidate region to obtain the final detection boxes and predicting the pixel-level mask within each detection box, including:

S331: the constructed RoIAlign layer pools the candidate regions. The RoIAlign layer uses bilinear interpolation to pool the candidate regions to the same size; average pooling is used here, as follows: a pooling window of fixed size 3*3 slides over the input feature map; for each pooling window, the average of all values inside the window is computed; the average is taken as the value at the corresponding position of the output feature map. The resolution and positional accuracy of the feature map are preserved; the output RoI is the aligned candidate region, and a fixed-size feature vector representing the RoI's features is output.

In step S331, RoIAlign accurately aligns the RoI features on the feature map: the RoIAlign layer divides the RoI region into small cells of fixed size and uses bilinear interpolation to compute the pixel value of each cell on the feature map, obtaining the feature map corresponding to the RoI.
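A minimal sketch of the 3x3 average pooling step described above, applied to an RoI feature map after bilinear alignment; the stride, padding, and RoI size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

roi_feat = torch.randn(1, 256, 14, 14)   # stand-in for an aligned RoI feature map
# 3x3 window slides over the map; each output value is the window's mean.
pooled = F.avg_pool2d(roi_feat, kernel_size=3, stride=1, padding=1)
print(pooled.shape)                      # torch.Size([1, 256, 14, 14])
```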

S332: the feature vector of each RoI is classified and regressed using a fully connected layer and a softmax layer; the fully connected layer integrates and regresses the features of each candidate region, and the softmax layer classifies them to predict the category and position offset of each RoI. A class label and a position offset are then output, representing the final detection box of each RoI;

S333: mask prediction. A mask operation is applied to the feature vector of each RoI using a convolutional layer improved for cracks and a sigmoid layer. The convolutional layer extracts features from the feature map within the detection box and contains the constructed horizontal and vertical edge convolution kernels, which further extract the fine details of the crack image; the sigmoid activation function binarizes the extracted features and outputs a binary matrix of 0s and 1s, where 0 and 1 represent the two states or classes in the matrix, predicting the pixel-level mask within each detection box; the binary matrix represents the foreground and background pixels within each detection box. Finally, it is combined with the original image to output a detection result with object detection boxes and image segmentation.

The Swin Transformer network structure constructed by the present invention is shown in Figure 11:

The Swin Transformer network adopts a window-translation-based approach to model and process visual data. It consists of window-based multi-head self-attention (W-MSA), shifted-window multi-head self-attention (SW-MSA), and multi-layer perceptrons (MLP). Inserting LayerNorm (LN) layers in between makes training more stable, and a residual connection is used after each module. The specific formulas are:

$$\hat{z}^{l} = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}, \qquad z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l},$$
$$\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^{l})) + z^{l}, \qquad z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}.$$
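A sketch of one W-MSA/SW-MSA block pair implementing the four formulas above: LayerNorm before each sub-module, a residual connection after it. The attention modules are passed in from outside (identity stubs in the usage line), and the 4x MLP expansion ratio is an assumption.

```python
import torch
import torch.nn as nn

class SwinBlockPair(nn.Module):
    def __init__(self, dim: int, w_msa: nn.Module, sw_msa: nn.Module):
        super().__init__()
        self.w_msa, self.sw_msa = w_msa, sw_msa
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim)) for _ in range(2))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z_hat = self.w_msa(self.norms[0](z)) + z           # z^ = W-MSA(LN(z)) + z
        z = self.mlps[0](self.norms[1](z_hat)) + z_hat     # z  = MLP(LN(z^)) + z^
        z_hat = self.sw_msa(self.norms[2](z)) + z          # z^ = SW-MSA(LN(z)) + z
        return self.mlps[1](self.norms[3](z_hat)) + z_hat  # z  = MLP(LN(z^)) + z^

blocks = SwinBlockPair(96, nn.Identity(), nn.Identity())  # attention stubs for illustration
out = blocks(torch.randn(4, 49, 96))                      # (windows, tokens, dim)
```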

The core idea of Swin Transformer is to model the image with a hierarchical window mechanism: each window processes part of the image, and the window-level results are then integrated with a Transformer-based method. Swin Transformer divides the image into multiple large windows, each treated as a feature map; images of different sizes, assigned to different numbers of windows, produce a set of multi-resolution feature maps, which reduces time complexity and strengthens the model's feature extraction capability. Meanwhile, Swin Transformer uses cross-window local convolution to extract features from a non-local field of view and introduces absolute and relative position encoding within the windows, enabling the model to establish position-dependent relationships in a coordinate system when processing images with spatial information.

This invention uses the Python-based PyTorch framework to build a network algorithm model based on Swin Mask RCNN, and sets up a virtual environment suitable for training the model with CUDA plus MMCV-compatible versions of mmdetection, the COCO toolkits, and related packages. An appropriate pre-trained weight file is selected according to the number and size of the images, and the 2000 images are divided into a training set, test set, and validation set at a ratio of 6:2:2. Crack locations, specific contours, and label information are annotated manually. After the model parameters are adjusted for the image size, the number of detection categories, and the amount of training data, the training set is fed into the pre-trained model; after 700 rounds of training, the trained weight file is obtained. A Python program then loads the trained weights and runs the test set: given one input image, it outputs one detection result, and the model reaches an mAP of about 41%. The comparison between the crack detection method based on image processing proposed by this invention and Mask RCNN extraction is shown in Figures 12 to 15, which present four different original crack images together with the detection results produced by the Mask RCNN network and by the method of this invention. In each figure, the left image is the original crack image, the middle image is the detection result extracted by the Mask RCNN network, and the right image is the result produced by the crack detection method based on image processing provided by this invention. The comparison shows that the proposed method has a clear advantage in detection accuracy and in capturing image detail.
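
A small script along the following lines could reproduce the 6:2:2 split described above; the directory name, file pattern, and random seed are assumptions, since the patent does not disclose its data-handling code:

```python
import random
from pathlib import Path

def split_dataset(image_dir="crack_images", seed=0):
    """Shuffle the annotated images and split them 6:2:2 into train/test/val."""
    files = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(files)
    n_train, n_test = int(0.6 * len(files)), int(0.2 * len(files))
    return (files[:n_train],                     # e.g. 1200 of 2000 for training
            files[n_train:n_train + n_test],     # 400 for testing
            files[n_train + n_test:])            # 400 for validation

train_files, test_files, val_files = split_dataset()
```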

Those skilled in the art will understand that embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.

The present application is described with reference to flowcharts of methods and computer program products according to embodiments of the present application. It should be understood that each flow in the flowcharts can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowchart.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in that computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart.

Claims (10)

1. A crack detection method based on image processing, characterized by comprising:

S101: obtaining a crack image to be detected;

S102: using a constructed, improved Swin Transformer network as the backbone network to extract features of the crack image to be detected and generate a series of candidate regions, comprising:

S201: using a constructed multi-head self-attention module to compute on the crack image to be detected and output a feature map, comprising:

dividing the crack image to be detected, using a uniform window size, into a plurality of n*n small panes to obtain a first-division feature map, and using a constructed Shift Window Attention module to slide the small panes of the first-division feature map to obtain a second-division feature map; inputting the first-division feature map and the second-division feature map into a constructed multi-layer window fusion module, which normalizes both feature maps, performs feature mapping on every small pane of the normalized first-division feature map and every small pane of the second-division feature map, and computes a similarity matrix, the small panes of the first-division feature map having overlapping parts with those of the second-division feature map; judging from the resulting similarity matrix whether the small panes with overlapping parts need to be fused: if the similarity matrix indicates connected crack information, fusing the two small panes into one fused window for output; if the similarity matrix indicates no connected crack information, leaving the two small panes unfused and outputting them as two independent windows, thereby obtaining the plurality of fused windows and plurality of independent windows output by the multi-layer window fusion module;

using a constructed transformation matrix to compute, for each window output by the multi-layer window fusion module, the query Q, key K, and value V of the self-attention mechanism; within each window, first linearly transforming the pixels through a convolution layer to extract pixel data, and extracting features from the obtained Q, K, and V of each window, specifically:

Attention(Q, K, V) = SoftMax(QK^T / √d + B)V;

where d is the number of columns of the Q and K matrices, i.e., the vector dimension, and B denotes the position bias of each window, whose value is given by a preset relative position bias parameter table;

using the relative position index constructed in the multi-head self-attention module to retrieve the parameters of the relative position encoding table, and fusing the feature computations of all windows to output the feature map produced by the multi-head self-attention computation;

S202: using a constructed feature pyramid network to process the feature map output by the multi-head self-attention computation, extracting features of different scales from that feature map, fusing them, and outputting the fused feature map;

S203: using a constructed Region Proposal Network to predict on the fused feature map and obtain a series of candidate regions, each candidate region containing a score and a position offset; sorting and screening the candidate regions by score and retaining a subset of high-scoring candidate regions;

S103: classifying and regressing each retained high-scoring candidate region, obtaining the final detection box from the position offset of each high-scoring candidate region, predicting the pixel-level mask within each detection box, and outputting a detection result carrying the target detection boxes and the image segmentation.

2. The crack detection method based on image processing according to claim 1, characterized in that: when the constructed Shift Window Attention module slides the small panes of the first-division feature map to obtain the second-division feature map, if a small pane slides horizontally or vertically, the sliding distance is less than the side length of the small pane; if a small pane slides along its diagonal, the sliding distance is less than the diagonal length of the small pane.

3. The crack detection method based on image processing according to claim 1, characterized in that using the constructed feature pyramid network to process the feature map output by the multi-head self-attention computation comprises:

performing convolution on the input feature map to obtain a plurality of feature maps at different levels;

building a top-down path with lateral connections to form a feature pyramid, each pyramid level containing an upsampling operation and a 1x1 convolution that add the feature map of the level above to the feature map of the current level and fuse their features;

smoothing each pyramid level with a 3x3 convolution and outputting the smoothed feature map of each level.

4. The crack detection method based on image processing according to claim 3, characterized in that: each of the plurality of feature maps at different levels differs in size and number of channels.

5. The crack detection method based on image processing according to claim 1, characterized in that classifying and regressing each candidate region, obtaining the final detection box from the position offset of each candidate region, predicting the pixel-level mask within each detection box, and outputting a detection result carrying the target detection boxes and the image segmentation comprises:

pooling the candidate regions to the same size with a constructed RoIAlign layer;

outputting a fixed-size feature vector for each candidate region, representing the features of that region;

classifying and regressing the feature vector of each candidate region, and outputting a category label and a position offset representing the final detection box of each RoI;

performing a masking operation on the feature vector within the final detection box of each RoI to predict the pixel-level mask within each detection box, outputting a binary matrix representing the foreground and background pixels within each detection box, and combining it with the original image to output a detection result carrying the target detection boxes and the image segmentation.

6. The crack detection method based on image processing according to claim 5, characterized in that: the RoIAlign layer pools the candidate regions by bilinear interpolation.

7. The crack detection method based on image processing according to claim 5, characterized in that classifying and regressing the feature vector of each candidate region is specifically: integrating and regressing the features of each candidate region with a fully connected layer, and classifying the features of each candidate region with a softmax layer.

8. The crack detection method based on image processing according to claim 5, characterized in that performing a masking operation on the feature vector within the final detection box of each candidate region to predict the pixel-level mask within each detection box is specifically: extracting features from the feature map within the detection box with a convolution layer, binarizing the extracted features with a sigmoid activation function, and outputting a binary matrix.

9. The crack detection method based on image processing according to claim 8, characterized in that: the convolution layer used to extract features from the feature map within the detection box includes a constructed horizontal edge convolution kernel and a constructed vertical edge convolution kernel.

10. A storage medium, characterized in that: a computer program is stored on the storage medium, and when the computer program is executed by a processor, the steps of the crack detection method based on image processing according to any one of claims 1 to 9 are implemented.
CN202310914403.6A 2023-07-25 2023-07-25 A crack detection method and storage medium based on image processing Active CN116645592B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310914403.6A CN116645592B (en) 2023-07-25 2023-07-25 A crack detection method and storage medium based on image processing

Publications (2)

Publication Number Publication Date
CN116645592A CN116645592A (en) 2023-08-25
CN116645592B 2023-09-29

Family

ID=87640340

Country Status (1)

Country Link
CN (1) CN116645592B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117291902B (en) * 2023-10-17 2024-05-10 南京工业大学 A pixel-level concrete crack detection method based on deep learning
CN117274817B (en) * 2023-11-15 2024-03-12 深圳大学 Automatic crack identification method and device, terminal equipment and storage medium
CN117557881B (en) * 2024-01-12 2024-04-05 城云科技(中国)有限公司 Road crack detection method based on feature map alignment and image-text matching and application thereof

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN114202696B (en) * 2021-12-15 2023-01-24 安徽大学 SAR target detection method and device based on context vision and storage medium

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN109767424A (en) * 2018-12-13 2019-05-17 Xidian University FPGA-based binocular vision train water injection port detection and positioning method
CN114022770A (en) * 2021-11-11 2022-02-08 Sun Yat-sen University Mountain crack detection method based on improved self-attention mechanism and transfer learning
CN114494164A (en) * 2022-01-13 2022-05-13 Dalian Jiaji Automation Electromechanical Technology Co., Ltd. Steel surface defect detection method and device and computer storage medium

Non-Patent Citations (1)

Title
Ze Liu et al.; Swin Transformer: Hierarchical Vision Transformer using Shifted Windows; 2021 IEEE/CVF International Conference on Computer Vision (ICCV); pp. 9992-10002 *

Similar Documents

Publication Publication Date Title
CN111126472B (en) An Improved Target Detection Method Based on SSD
CN111461110B (en) A Small Object Detection Method Based on Multi-Scale Images and Weighted Fusion Loss
CN111210443B (en) A Deformable Convolutional Hybrid Task Cascade Semantic Segmentation Method Based on Embedding Balance
CN110428428B (en) An image semantic segmentation method, electronic device and readable storage medium
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
US10671855B2 (en) Video object segmentation by reference-guided mask propagation
CN108549893B (en) An End-to-End Recognition Method for Scene Texts of Arbitrary Shapes
CN116645592B (en) A crack detection method and storage medium based on image processing
WO2021218786A1 (en) Data processing system, object detection method and apparatus thereof
CN110287826B (en) Video target detection method based on attention mechanism
CN110503097A (en) Training method, device and the storage medium of image processing model
CN112507777A (en) Optical remote sensing image ship detection and segmentation method based on deep learning
CN112396002A (en) Lightweight remote sensing target detection method based on SE-YOLOv3
CN113344932B (en) A Semi-Supervised Single-Object Video Segmentation Method
CN111598030A (en) Method and system for detecting and segmenting vehicle in aerial image
CN111274981B (en) Target detection network construction method and device and target detection method
CN111738055B (en) Multi-category text detection system and bill form detection method based on the system
CN115131797B (en) Scene text detection method based on feature enhancement pyramid network
CN111612017A (en) A target detection method based on information enhancement
CN116758340A (en) Small target detection method based on super-resolution feature pyramid and attention mechanism
CN112800955A (en) Remote sensing image rotating target detection method and system based on weighted bidirectional feature pyramid
US20200111214A1 (en) Multi-level convolutional lstm model for the segmentation of mr images
CN111353544A (en) A Target Detection Method Based on Improved Mixed Pooling-YOLOV3
CN113378812A (en) Digital dial plate identification method based on Mask R-CNN and CRNN
CN115187786A (en) A Rotation-Based Object Detection Method for CenterNet2

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant