CN111242038B - Dynamic tongue fibrillation detection method based on frame prediction network
- Publication number
- CN111242038B (application number CN202010040375.6A)
- Authority
- CN
- China
- Prior art keywords
- size
- input
- output
- convolution
- channels
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
A dynamic tongue fibrillation detection method based on a frame prediction network relates to the fields of computer vision, pattern recognition and medical engineering. The invention proposes a new P-net network for processing data with strong spatio-temporal correlation, such as dynamic video: it considers features at multiple scales and adds a ConvGRU module for temporal modeling, thereby fusing spatial and temporal information. To improve the accuracy of tongue tremor judgment and the generalization ability of the network, an optical flow extraction step is performed before the network input so as to capture inter-frame tremor information. Following a prediction-based approach, the invention trains the network jointly with a generative adversarial model for discrimination, applies a spatio-temporal sliding window operation when calculating the tremor score, and finally realizes a high-accuracy dynamic tongue tremor detection algorithm based on the P-net network.
Description
Technical Field
The invention belongs to the fields of computer vision, pattern recognition and medical engineering, and relates to a dynamic tongue fibrillation detection method based on a frame prediction network.
Background
With the improvement of living standards and the development of technology, big data analysis and artificial intelligence have attracted wide attention, and the combination of medicine and artificial intelligence is an important direction. Medicine can provide powerful assistance to people, while artificial intelligence can take on analysis tasks for medical data; blending the two holds great promise, and many enterprises and institutions have invested heavily in this direction. Among these techniques, machine learning is an important method for realizing artificial intelligence: collected data are analyzed and organized, and corresponding models are built with machine learning algorithms to automate the analysis process. Systems that assist doctors in diagnosing a patient's condition would greatly facilitate diagnosis and, to a certain extent, avoid missed detection of disease.
Tongue diagnosis is one of the main components of the four diagnostic methods in traditional Chinese medicine (TCM); tongue tremor refers to involuntary, uncontrollable trembling of the tongue. In classical TCM theory, observing the degree of abnormality in tongue tremor yields useful clinical signs (it is traditionally associated with disorders of the spleen) and provides additional data references for doctors. However, this field still faces many difficulties: tongue tremor is inherently dynamic information, and how to better model its dynamic characteristics is a problem to be solved urgently. The invention therefore designs an intelligent tongue tremor detection method to help patients and doctors, providing powerful objective data for faster and better diagnosis.
Disclosure of Invention
Related work in this field is nearly blank, and the invention therefore proposes a high-accuracy dynamic tongue tremor detection algorithm based on a frame prediction network (P-net). Context and ConvGRU structures are added to a U-net backbone: the Context structure strengthens multi-scale modeling of the input data, while the ConvGRU module models the input data in time. The proposed P-net network models data with strong spatio-temporal characteristics, such as video, well and can better distinguish whether the tongue trembles.
The general idea of the high-accuracy dynamic tongue tremor detection algorithm based on the P-net network is as follows: after preprocessing, normal dynamic tongue data are input into the designed network for training, with the aim of establishing a model that predicts the next frame state of a normal tongue well. During testing, because a trembling tongue moves differently from a normal tongue, the trained model cannot correctly predict the state of a trembling tongue; through the ST-Pscore scoring mechanism designed by us, tongue tremor can therefore be detected automatically when it occurs. The method mainly comprises the following characteristics:
(1) Optical flow extraction preprocessing based on tongue tremor information
After the originally acquired dynamic tongue data are split frame by frame into separate pictures, each pair of adjacent pictures is input into an optical flow network pre-trained on a large-scale data set to extract optical flow information, which reduces the interference of static information and better captures tongue tremor information.
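As an illustration of this preprocessing step, the sketch below splits a video into frames and runs a pretrained flow model over adjacent pairs. The patent does not name its optical flow network; torchvision's pretrained RAFT model is used here purely as a stand-in, and the 256×256 resize matches the network input size described later.

```python
# Sketch only: RAFT stands in for "an optical flow network pre-trained
# on a large-scale data set"; the patent does not specify which network.
import cv2
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
flow_net = raft_large(weights=weights).eval()
preprocess = weights.transforms()  # normalization expected by RAFT

def video_to_flows(path):
    """Split a tongue video into frames and return per-pair optical flow."""
    cap = cv2.VideoCapture(path)
    frames = []
    ok, frame = cap.read()
    while ok:
        rgb = cv2.cvtColor(cv2.resize(frame, (256, 256)), cv2.COLOR_BGR2RGB)
        frames.append(torch.from_numpy(rgb).permute(2, 0, 1))  # CHW uint8
        ok, frame = cap.read()
    cap.release()

    flows = []
    with torch.no_grad():
        for f1, f2 in zip(frames[:-1], frames[1:]):
            img1, img2 = preprocess(f1.unsqueeze(0), f2.unsqueeze(0))
            flow = flow_net(img1, img2)[-1]  # last refinement, (1, 2, H, W)
            flows.append(flow.squeeze(0))
    return flows
```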
(2) New design of P-net network
The network construction structure specifically proposed is as follows:
P-net consists of three parts: an encoding stage, a multi-scale feature and temporal information fusion stage, and a decoding stage. The encoding stage completes the initial encoding of the input dynamic tongue information through three "2 convolution layers + 1 downsampling layer" structures; the multi-scale feature and temporal information fusion stage applies the jointly designed Context module and ConvGRU module to spatio-temporally fuse the features extracted in the previous stage, providing better feature expression for the next stage; the decoding stage restores the low-level features through three "2 convolution layers + 1 upsampling layer" structures, preparing for detection of the tongue tremor state. In summary, the designed P-net is a network structure that encodes, and then fuses, multi-scale features and temporal features.
The Context module consists of four branches, three of which consist of 3 convolution layers and one of 2 convolution layers; the front and back convolution layers expand and reduce the channel dimension, while the middle convolution layers achieve multi-scale feature extraction by using different atrous (dilation) rates. The ConvGRU module consists of two ConvGRU cells; T consecutive feature maps are input and two hidden states h1 and h2 are updated to extract the temporal characteristics of the input dynamic tongue information, the last output being taken as the extracted feature.
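A minimal PyTorch sketch of such a Context module is given below, under stated assumptions: each three-layer branch is 1×1 → dilated 3×3 → 1×1 with dilation rates 5, 3 and 1 (matching the branch parameters listed in the detailed description), the two-layer branch is two 1×1 convolutions, the branch outputs are summed, and the internal channel width is illustrative.

```python
import torch
import torch.nn as nn

class ContextModule(nn.Module):
    """Four-branch multi-scale context block (sketch, not the exact patent code)."""
    def __init__(self, in_ch=256, out_ch=512, mid_ch=256):
        super().__init__()
        act = nn.LeakyReLU(0.2)

        def branch3(dilation):
            # 1x1 (reduce) -> dilated 3x3 (multi-scale) -> 1x1 (expand)
            return nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, 1), act,
                nn.Conv2d(mid_ch, mid_ch, 3, padding=dilation, dilation=dilation), act,
                nn.Conv2d(mid_ch, out_ch, 1), act,
            )

        self.b1, self.b2, self.b3 = branch3(5), branch3(3), branch3(1)
        self.b4 = nn.Sequential(  # two-layer branch: two 1x1 convolutions
            nn.Conv2d(in_ch, mid_ch, 1), act,
            nn.Conv2d(mid_ch, out_ch, 1), act,
        )

    def forward(self, x):  # x: (B, 256, 32, 32) -> (B, 512, 32, 32)
        return self.b1(x) + self.b2(x) + self.b3(x) + self.b4(x)
```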
(3) Process for predicting tongue movement state and detecting tongue tremors by using P-net network
Characteristics during training:
P-net takes T consecutive optical flow maps as input; each map is encoded independently, and the (T+1)-th optical flow map is obtained by prediction. The actual (T+1)-th optical flow map serves as ground truth; the ground truth and the predicted pictures are input into several loss functions for network optimization. A generative adversarial network model is used: the predicted images and the corresponding ground truth are simultaneously input into a discriminator for joint discrimination, until the discrimination model cannot distinguish whether its input is a predicted image or the original ground truth, which further optimizes the prediction process and improves the detection result.
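As a sketch, the training pairs can be arranged as sliding windows over the extracted flow sequence; T = 4 matches the four-frame windows used in the detailed description, and the Dataset wrapper itself is an illustrative assumption, not part of the patent.

```python
import torch
from torch.utils.data import Dataset

class FlowWindowDataset(Dataset):
    """T consecutive flow maps as input, the (T+1)-th as prediction target."""
    def __init__(self, flows, T=4):
        self.flows = flows  # list of (C, H, W) flow tensors from preprocessing
        self.T = T

    def __len__(self):
        return len(self.flows) - self.T

    def __getitem__(self, i):
        window = torch.stack(self.flows[i:i + self.T])  # (T, C, H, W)
        target = self.flows[i + self.T]                 # ground-truth frame
        return window, target
```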
Characteristics at the time of prediction:
Optical flow is likewise extracted from the test set; groups of T consecutive frames are input into the trained network, finally yielding a predicted picture for each frame.
Characteristics of tongue tremor detection:
The predicted picture and the original picture are input into the ST-Pscore scoring framework designed by us. This evaluation method combines sliding window operations in time and in space to further fuse spatio-temporal content and obtain a final tremor score; whether the tongue in the input dynamic data is in a tremor state is judged by comparing the tremor score with a set threshold (taken as 0.432).
Advantageous effects
1. The invention realizes prediction-based dynamic tongue tremor detection using the P-net network
To our knowledge, this is the first work to combine deep learning techniques with dynamic tongue tremor detection. The invention uses the P-net network to predict dynamic information and, on that basis, to judge whether the tongue trembles.
2. The invention extracts the optical flow of the input dynamic data through the pre-training network
Because tongue tremor is dynamic in nature, temporal information between different frames must be analyzed. Obtaining inter-frame tremor information through optical flow extraction both isolates the key temporal characteristics for processing and reduces the interference of static information on the results; together, these two aspects improve the accuracy of the algorithm.
3. The invention provides a P-net network
A standard U-net fuses, at each upsampling step, the output of the feature extraction part (i.e., the downsampling path) at the corresponding scale, so it models the spatial structure of the input image well. However, it considers only a single feature size and adds no temporal modeling, which is a problem for input data with strong temporal correlation, such as video; good results can only be obtained by adding a temporal module, which the proposed P-net does.
4. The invention provides a method for calculating a tremor score (ST-Pscore) based on space-time sliding window operation
To further improve accuracy, after the tremor score of each frame is obtained, the invention performs a spatio-temporal sliding window operation: temporally, the preceding and following frames are weighted; spatially, the tremor scores of small sub-region images within each frame are averaged. Combining the two further establishes the spatio-temporal connection and yields a better detection effect.
Description of the drawings:
FIG. 1 is the optical flow extraction diagram of the present invention.
Fig. 2 is a P-net network diagram of the present invention.
Fig. 3 is a Context structure diagram in the present invention.
Fig. 4 is a diagram showing the structure of ConvGRU in the present invention.
Fig. 5 is a training flow for implementing tongue fibrillation detection by using P-net according to the present invention.
FIG. 6 is a test flow chart of the present invention.
Detailed Description
The following describes the embodiments of the present invention in detail with reference to the accompanying drawings.
1. Preprocessing
After the originally collected dynamic tongue data are split frame by frame into separate pictures, each pair of adjacent pictures is input into an optical flow network pre-trained on a large-scale data set to extract optical flow information; the specific network structure is shown in FIG. 1.
2. Construction of P-net network
As shown in fig. 2. The specific parameters of each layer of the P-net network constructed by the invention are as follows:
① C1 and C2 convolution layers: the input size is 256×256, the number of input channels is 3, the convolution kernel is 3×3, the step size is 1, the edge filling mode is 'valid', the activation function is LeakyReLU, the output size is 256×256, and the number of output channels is 64.
② P1 downsampling layer: the input size is 256×256, the number of input channels is 64, the pooling kernel is 2×2, the step size is 1, the edge filling mode is 'same', and the output size is 128×128.
③ C3 and C4 convolution layers: the input size is 128×128, the number of input channels is 64, the convolution kernel is 3×3, the step size is 1, the edge filling mode is 'valid', the activation function is LeakyReLU, the output size is 128×128, and the number of output channels is 128.
④ P2 downsampling layer: the input size is 128×128, the number of input channels is 128, the pooling kernel is 2×2, the step size is 1, the edge filling mode is 'same', and the output size is 64×64.
⑤ C5 and C6 convolution layers: the input size is 64×64, the number of input channels is 128, the convolution kernel is 3×3, the step size is 1, the edge filling mode is 'valid', the activation function is LeakyReLU, the output size is 64×64, and the number of output channels is 256.
⑥ P3 downsampling layer: the input size is 64×64, the number of input channels is 256, the pooling kernel is 2×2, the step size is 1, the edge filling mode is 'same', and the output size is 32×32.
⑦ CT1 Context layer: the input size is 32×32, the number of input channels is 256, the convolution kernel is 3×3, the output size is 32×32, and the number of output channels is 512 after the four branches are added together.
⑧ L1 and L2 ConvGRU layers: the input size is 32×32, the number of input channels is 512, the convolution kernel is 3×3, the output size is 32×32, and the number of output channels is 512.
⑨ U1 deconvolution layer: the input size is 32×32, the number of input channels is 512, the convolution kernel is 2×2, the step size is 2, the edge filling mode is 'same', and the output size is 64×64.
⑩ The result of the U1 deconvolution is spliced with the C6 convolution result of the corresponding size; the splicing dimension is 3.
⑪ C7 and C8 convolution layers: the input size is 64×64, the number of input channels is 512, the convolution kernel is 3×3, the step size is 1, the edge filling mode is 'valid', the activation function is LeakyReLU, the output size is 64×64, and the number of output channels is 256.
⑫ U2 deconvolution layer: the input size is 64×64, the number of input channels is 256, the convolution kernel is 2×2, the step size is 2, the edge filling mode is 'same', and the output size is 128×128.
⑬ The result of the U2 deconvolution is spliced with the C4 convolution result of the corresponding size; the splicing dimension is 3.
⑭ C9 and C10 convolution layers: the input size is 128×128, the number of input channels is 256, the convolution kernel is 3×3, the step size is 1, the edge filling mode is 'valid', the activation function is LeakyReLU, the output size is 128×128, and the number of output channels is 128.
⑮ U3 deconvolution layer: the input size is 128×128, the number of input channels is 128, the convolution kernel is 2×2, the step size is 2, the edge filling mode is 'same', and the output size is 256×256.
⑯ The result of the U3 deconvolution is spliced with the C2 convolution result of the corresponding size; the splicing dimension is 3.
⑰ C11 and C12 convolution layers: the input size is 256×256, the number of input channels is 128, the convolution kernel is 3×3, the step size is 1, the edge filling mode is 'valid', the activation function is LeakyReLU, the output size is 256×256, and the number of output channels is 64.
⑱ Output layer: the input size is 256×256, the number of input channels is 64, the convolution kernel is 3×3, the step size is 1, the edge filling mode is 'same', the activation function is tanh, the output size is 256×256, and the number of output channels is 3.
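To make the layer plan concrete, the following condensed PyTorch skeleton follows the 256 → 128 → 64 → 32 → 64 → 128 → 256 size plan with the three skip connections. It is a sketch under assumptions: size-preserving padding is used for the 3×3 convolutions and stride-2 pooling for the downsampling layers (so the listed sizes work out), and the Context + ConvGRU stage is stubbed with one convolution so the skeleton runs standalone; sections 3 and 4 below describe the real modules.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # "2 convolution layers" unit: two 3x3 convs + LeakyReLU, size-preserving
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(cout, cout, 3, padding=1), nn.LeakyReLU(0.2),
    )

class PNetSketch(nn.Module):
    """Condensed P-net skeleton: encoder / (stubbed) fusion / decoder."""
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(3, 64)     # C1-C2
        self.enc2 = conv_block(64, 128)   # C3-C4
        self.enc3 = conv_block(128, 256)  # C5-C6
        self.pool = nn.MaxPool2d(2)       # P1-P3 (stride-2 assumed)
        self.fuse = nn.Conv2d(256, 512, 3, padding=1)  # stand-in for Context+ConvGRU
        self.up1 = nn.ConvTranspose2d(512, 256, 2, stride=2)  # U1
        self.dec1 = conv_block(512, 256)  # C7-C8 (after concat with C6)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)  # U2
        self.dec2 = conv_block(256, 128)  # C9-C10
        self.up3 = nn.ConvTranspose2d(128, 64, 2, stride=2)   # U3
        self.dec3 = conv_block(128, 64)   # C11-C12
        self.out = nn.Conv2d(64, 3, 3, padding=1)              # output layer

    def forward(self, x):                  # x: (B, 3, 256, 256)
        e1 = self.enc1(x)                  # 256x256, 64 ch
        e2 = self.enc2(self.pool(e1))      # 128x128, 128 ch
        e3 = self.enc3(self.pool(e2))      # 64x64, 256 ch
        f = self.fuse(self.pool(e3))       # 32x32, 512 ch
        d1 = self.dec1(torch.cat([self.up1(f), e3], dim=1))   # 64x64, 256 ch
        d2 = self.dec2(torch.cat([self.up2(d1), e2], dim=1))  # 128x128, 128 ch
        d3 = self.dec3(torch.cat([self.up3(d2), e1], dim=1))  # 256x256, 64 ch
        return torch.tanh(self.out(d3))    # 256x256, 3 ch
```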
3. Context build
As shown in fig. 3. The specific parameters of each layer of the Context structure constructed by the invention are as follows:
① L1 branch: the three convolution kernels are 1×1, 3×3 and 1×1 respectively, the step size is 1, the edge filling mode is 'valid', the dilation (atrous) rates are 1 and 5, the activation function is LeakyReLU, the output size is 32×32, and the number of output channels is 512.
② L2 branch: the three convolution kernels are 1×1, 3×3 and 1×1 respectively, the step size is 1, the edge filling mode is 'valid', the dilation (atrous) rates are 1 and 3, the activation function is LeakyReLU, the output size is 32×32, and the number of output channels is 512.
③ L3 branch: the three convolution kernels are 1×1, 3×3 and 1×1 respectively, the step size is 1, the edge filling mode is 'valid', the dilation (atrous) rates are 1 and 1, the activation function is LeakyReLU, the output size is 32×32, and the number of output channels is 512.
④ L4 branch: it consists of two convolution structures; the input size is 32×32, the number of input channels is 256, the two convolution kernels are both 1×1, the step size is 1, the edge filling mode is 'valid', the dilation rate is 1, the activation function is LeakyReLU, the output size is 32×32, and the number of output channels is 512.
4. ConvGRU construction
As shown in fig. 4. The concrete arrangement of ConvGRU structure constructed by the invention is as follows:
The input of the module is the T feature maps output by the Context module. The module consists of two ConvGRU cells; for each layer, the convolution input size is 32×32, the number of input channels is 512, the convolution kernel size is 3×3, the step size is 1, the edge filling mode is 'valid', the activation function is LeakyReLU, the output size is 32×32, and the number of output channels is 512. The last output is selected as the input feature for the next stage.
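A sketch of this two-cell stage is shown below, using the standard ConvGRU gate equations; the patent fixes only the tensor sizes (512 channels at 32×32, 3×3 kernels), so the gate layout and zero-initialized hidden states h1 and h2 are assumptions.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """One ConvGRU cell: GRU gates computed with 3x3 convolutions (sketch)."""
    def __init__(self, ch=512, k=3):
        super().__init__()
        self.gates = nn.Conv2d(2 * ch, 2 * ch, k, padding=k // 2)  # update z, reset r
        self.cand = nn.Conv2d(2 * ch, ch, k, padding=k // 2)       # candidate state

    def forward(self, x, h):
        z, r = torch.chunk(torch.sigmoid(self.gates(torch.cat([x, h], 1))), 2, 1)
        h_new = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_new

def run_convgru(features):
    """Run two stacked cells over the T Context feature maps; keep the last output."""
    cell1, cell2 = ConvGRUCell(), ConvGRUCell()
    h1 = torch.zeros_like(features[0])  # state h1 of the first cell
    h2 = torch.zeros_like(features[0])  # state h2 of the second cell
    for x in features:                  # features: list of (B, 512, 32, 32)
        h1 = cell1(x, h1)
        h2 = cell2(h1, h2)
    return h2                           # fused spatio-temporal feature
```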
5. Training process for tongue tremor detection
As shown in fig. 5, the specific training process of the prediction-based dynamic tongue tremor detection algorithm implemented with the P-net network is as follows: ① Cut the long videos in the training set into single-frame images, and input each pair of adjacent frames into the optical flow network to extract optical flow.
② Four frames of optical flow images are input into the P-net network in sequence. The four images are first encoded; after passing through the Context module, four feature maps are obtained, which then serve as the input of the ConvGRU module, yielding one feature map containing the information of the previous four frames. All decoding operations in the network are then applied to this feature map to obtain the final predicted optical flow image.
③ Network optimization is performed by comparing the difference between the predicted optical flow and the true optical flow. The invention adopts an intensity loss function, a gradient loss function and an msssim structural loss function for this comparison, and adjusts the network parameters during training by minimizing the global loss function with the Adam algorithm. With $I$ the true optical flow, $I^*$ the predicted optical flow and $g_d$ the gradient operator, the gradient and structural losses are:

$$L_{grad}(I, I^*) = \big\| \, |g_d(I)| - |g_d(I^*)| \, \big\|_1$$

$$L_{msssim}(I, I^*) = \big\| \, |msssim(I)| - |msssim(I^*)| \, \big\|_1$$
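A sketch of these loss terms follows. The gradient operator $g_d$ is implemented with finite differences; the intensity loss, whose formula appeared only as an image in the original, is assumed here to be a pixel-wise L2 difference.

```python
import torch
import torch.nn.functional as F

def intensity_loss(pred, gt):
    # Assumed form of L_int: pixel-wise L2 difference (original formula not recoverable)
    return F.mse_loss(pred, gt)

def gradient_loss(pred, gt):
    # L_grad = || |g_d(I)| - |g_d(I*)| ||_1, with finite differences along
    # height and width standing in for the gradient operator g_d
    def grad(x):
        dh = torch.abs(x[..., 1:, :] - x[..., :-1, :])
        dw = torch.abs(x[..., :, 1:] - x[..., :, :-1])
        return dh, dw
    ph, pw = grad(pred)
    gh, gw = grad(gt)
    return F.l1_loss(ph, gh) + F.l1_loss(pw, gw)
```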
④ A generative adversarial network (GAN), consisting of a discriminator and a generator, is introduced. The predicted optical flow and the original optical flow are input into the same discrimination model (the discriminator D), which extracts features and judges whether its input is the true or the predicted optical flow. The discriminator is trained jointly with P-net (which serves as the generator G) until the discriminator cannot distinguish images predicted by P-net from the original ground-truth images, further improving the predictive ability of the network. The discriminator consists of three convolution layers and one output layer. In the corresponding GAN training losses, where D is the discriminator and G is the P-net generator, D is penalized for misclassifying true and predicted flow, while G is rewarded for fooling D.
⑤ In summary, the loss function when training P-net is the weighted combination $L_G = \lambda_a L_{adv} + \lambda_s L_{msssim} + \lambda_g L_{grad} + \lambda_i L_{int}$; in the experiments, $\lambda_a$ was set to 2e-4, $\lambda_s$ to 0.5, $\lambda_g$ to 1, and $\lambda_i$ to 1.
The loss function when training the discriminator is weighted by $\lambda_d$, which was set to 2e-5 in the experiments.
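The combined objective can be sketched as below, reusing intensity_loss and gradient_loss from the previous sketch. Assumptions are labeled: a least-squares GAN form is used for the adversarial terms, and the msssim term comes from the third-party pytorch-msssim package; in training, the generator and discriminator losses would be minimized alternately with Adam, as described above.

```python
import torch
from pytorch_msssim import ms_ssim  # third-party: pip install pytorch-msssim

def generator_loss(pred, gt, d_fake,
                   lam_a=2e-4, lam_s=0.5, lam_g=1.0, lam_i=1.0):
    # d_fake: discriminator output on the predicted flow map
    l_adv = torch.mean((d_fake - 1) ** 2)           # least-squares form (assumed)
    l_ssim = 1 - ms_ssim(pred, gt, data_range=1.0)  # msssim structure term
    return (lam_a * l_adv + lam_s * l_ssim
            + lam_g * gradient_loss(pred, gt)
            + lam_i * intensity_loss(pred, gt))

def discriminator_loss(d_real, d_fake, lam_d=2e-5):
    # D is pushed toward 1 on true optical flow and 0 on predicted flow
    return lam_d * (torch.mean((d_real - 1) ** 2) + torch.mean(d_fake ** 2))
```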
6. Test process based on P-net tongue fibrillation detection algorithm
As shown in fig. 6, the specific test process of the prediction-based dynamic tongue tremor anomaly detection algorithm implemented with the P-net network is as follows:
① Cut the long videos in the test set into single-frame images, and input each pair of adjacent frames into the optical flow network to extract optical flow.
② For a test optical flow frame, taking frame $I_t$ as an example, the four consecutive frames consisting of $I_t$ and $I_{t-1}$, $I_{t-2}$, $I_{t-3}$ are input into the trained P-net prediction network to obtain the predicted future frame $\hat{I}_{t+1}$. $\hat{I}_{t+1}$ is compared with the input original ground truth $I_{t+1}$: the spatio-temporal sliding window evaluation method (ST-Pscore) designed by us computes the predicted peak signal-to-noise ratio (PSNR) between the two images and, from it, the predicted tongue tremor score of the dynamic frame. In the ST-Pscore calculation, k is the number of smallest PSNR values kept among all selected small-region images, M, N is the size of a small-region image, λ is the weight of the current frame, and p is the number of preceding and following frames used. During detection, k is set to 5, M, N is set to 32, λ is set to 5, and p is set to 3 (a code sketch of this scoring follows step ④ below).
③ After the spatio-temporal tremor score (ST-Pscore) of test frame $I_t$ is obtained, the value is compared with a set threshold. If ST-Pscore ≥ threshold, test frame $I_t$ is a tongue tremor frame; if ST-Pscore < threshold, it is a non-tremor frame. The present invention takes threshold = 0.432.
④ Steps ② and ③ are repeated for all extracted optical flow frames, all tongue tremor frames in the test data are detected, and the tremor score of each frame is output, realizing real-time dynamic tongue tremor detection.
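Since the ST-Pscore formula itself appeared only as an image, the sketch below encodes one plausible reading of the described procedure: spatially, the k smallest PSNRs over M×N sub-regions of each frame are averaged; temporally, the current frame is weighted by λ against its p preceding and following frames; finally the scores are min-max normalized over the sequence and inverted so that poorly predicted (low-PSNR) frames receive high tremor scores. The normalization and inversion are assumptions made so the score is comparable with the 0.432 threshold.

```python
import numpy as np

def region_psnrs(pred, gt, M=32, N=32):
    """PSNR of every MxN sub-region between predicted and true frames
    (float arrays of shape (H, W, C) with values in [0, 1])."""
    H, W = pred.shape[:2]
    vals = []
    for i in range(0, H - M + 1, M):
        for j in range(0, W - N + 1, N):
            mse = np.mean((pred[i:i+M, j:j+N] - gt[i:i+M, j:j+N]) ** 2)
            vals.append(10 * np.log10(1.0 / max(mse, 1e-10)))
    return np.array(vals)

def st_pscores(preds, gts, k=5, lam=5.0, p=3):
    """Assumed reading of ST-Pscore; returns one tremor score per frame."""
    # Spatial step: mean of the k smallest region PSNRs of each frame
    s = np.array([np.sort(region_psnrs(pr, gt))[:k].mean()
                  for pr, gt in zip(preds, gts)])
    # Temporal step: weight the current frame by lam against its neighbors
    out = np.empty_like(s)
    for t in range(len(s)):
        lo, hi = max(0, t - p), min(len(s) - 1, t + p)
        nb = [s[i] for i in range(lo, hi + 1) if i != t]
        out[t] = (lam * s[t] + sum(nb)) / (lam + len(nb))
    # Normalize and invert: low PSNR (poor prediction) -> high tremor score
    return 1.0 - (out - out.min()) / (out.max() - out.min() + 1e-10)

# Frames with st_pscores(preds, gts) >= 0.432 would be flagged as tremor frames.
```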
Claims (4)
1. A dynamic tongue fibrillation detection method based on P-net is characterized in that:
(1) Optical flow extraction preprocessing based on tongue tremor information
After the originally collected dynamic tongue data are split frame by frame into separate pictures, each pair of adjacent pictures is input into an optical flow network pre-trained on a large-scale data set to extract optical flow information;
(2) Designing P-net network
The network construction structure specifically proposed is as follows:
P-net consists of three parts: an encoding module, a multi-scale feature and temporal information fusion module, and a decoding module; the encoding module completes the initial encoding of the input dynamic tongue information through three '2 convolution layers + 1 downsampling layer' structures; the multi-scale feature and temporal information fusion module spatio-temporally fuses the features extracted in the previous stage through the jointly designed Context module and ConvGRU module, providing better feature expression for the next stage; the decoding module restores the low-level features through three '2 convolution layers + 1 upsampling layer' structures, preparing for detection of the tongue tremor state; the Context module consists of four branches, three of which consist of 3 convolution layers and one of 2 convolution layers; the front and back convolution layers expand and reduce the channel dimension, while the middle convolution layers achieve multi-scale feature extraction by using different atrous (dilation) rates; the ConvGRU module consists of two ConvGRU cells, and extracts the temporal characteristics of the input dynamic tongue information by inputting T consecutive feature maps and updating the two hidden states h1 and h2;
(3) Training a P-net network
The input to P-net consists of T consecutive optical flow maps; each map is encoded independently, and the (T+1)-th optical flow map is obtained by prediction; the actual (T+1)-th optical flow map is taken as ground truth, and the ground truth and the predicted pictures are input into several loss functions for network optimization; a generative adversarial network model is used, the predicted images and the corresponding ground truth are simultaneously input into a discriminator for joint discrimination, until the discrimination model cannot distinguish whether its input is a predicted image or the original ground truth;
(4) Prediction of tongue movement state using P-net network
For a test optical flow frame $I_t$, the four consecutive frames consisting of $I_t$ and $I_{t-1}$, $I_{t-2}$, $I_{t-3}$ are input into the trained P-net prediction network to obtain the predicted future frame $\hat{I}_{t+1}$; $\hat{I}_{t+1}$ is compared with the input original ground truth $I_{t+1}$, the predicted peak signal-to-noise ratio PSNR between the two images is obtained through the spatio-temporal sliding window evaluation method ST-Pscore, and the predicted tongue tremor score of the dynamic frame is calculated from the predicted PSNR, where:
Wherein k is the number of smallest PSNR values kept after the PSNR values of all selected small-region images are sorted in ascending order; M, N is the size of a small-region image, λ is the weight of the current frame, and p is the number of preceding and following frames used; k is set to 5, M, N is set to 32, λ is set to 5, and p is set to 3;
The tremor score is compared with a set threshold to determine whether the tongue in the input dynamic data is in a tremor state; if the score is greater than or equal to the threshold, the tongue is considered to be in the tremor state; the threshold is taken to be 0.432.
2. The method of claim 1, wherein the specific parameters of each layer of the P-net network are as follows:
① C1 and C2 convolution layers: the input size is 256×256, the number of input channels is 3, the convolution kernel is 3×3, the step size is 1, the edge filling mode is 'valid', the activation function is LeakyReLU, the output size is 256×256, and the number of output channels is 64;
② P1 downsampling layer: the input size is 256×256, the number of input channels is 64, the pooling kernel is 2×2, the step size is 1, the edge filling mode is 'same', and the output size is 128×128;
③ C3 and C4 convolution layers: the input size is 128×128, the number of input channels is 64, the convolution kernel is 3×3, the step size is 1, the edge filling mode is 'valid', the activation function is LeakyReLU, the output size is 128×128, and the number of output channels is 128;
④ P2 downsampling layer: the input size is 128×128, the number of input channels is 128, the pooling kernel is 2×2, the step size is 1, the edge filling mode is 'same', and the output size is 64×64;
⑤ C5 and C6 convolution layers: the input size is 64×64, the number of input channels is 128, the convolution kernel is 3×3, the step size is 1, the edge filling mode is 'valid', the activation function is LeakyReLU, the output size is 64×64, and the number of output channels is 256;
⑥ P3 downsampling layer: the input size is 64×64, the number of input channels is 256, the pooling kernel is 2×2, the step size is 1, the edge filling mode is 'same', and the output size is 32×32;
⑦ CT1 Context layer: the input size is 32×32, the number of input channels is 256, the convolution kernel is 3×3, the output size is 32×32, and the number of output channels is 512 after the four branches are added together;
⑧ L1 and L2 ConvGRU layers: the input size is 32×32, the number of input channels is 512, the convolution kernel is 3×3, the output size is 32×32, and the number of output channels is 512;
⑨ U1 deconvolution layer: the input size is 32×32, the number of input channels is 512, the convolution kernel is 2×2, the step size is 2, the edge filling mode is 'same', and the output size is 64×64;
⑩ the result of the U1 deconvolution is spliced with the C6 convolution result of the corresponding size, the splicing dimension being 3;
⑪ C7 and C8 convolution layers: the input size is 64×64, the number of input channels is 512, the convolution kernel is 3×3, the step size is 1, the edge filling mode is 'valid', the activation function is LeakyReLU, the output size is 64×64, and the number of output channels is 256;
⑫ U2 deconvolution layer: the input size is 64×64, the number of input channels is 256, the convolution kernel is 2×2, the step size is 2, the edge filling mode is 'same', and the output size is 128×128;
⑬ the result of the U2 deconvolution is spliced with the C4 convolution result of the corresponding size, the splicing dimension being 3;
⑭ C9 and C10 convolution layers: the input size is 128×128, the number of input channels is 256, the convolution kernel is 3×3, the step size is 1, the edge filling mode is 'valid', the activation function is LeakyReLU, the output size is 128×128, and the number of output channels is 128;
⑮ U3 deconvolution layer: the input size is 128×128, the number of input channels is 128, the convolution kernel is 2×2, the step size is 2, the edge filling mode is 'same', and the output size is 256×256;
⑯ the result of the U3 deconvolution is spliced with the C2 convolution result of the corresponding size, the splicing dimension being 3;
⑰ C11 and C12 convolution layers: the input size is 256×256, the number of input channels is 128, the convolution kernel is 3×3, the step size is 1, the edge filling mode is 'valid', the activation function is LeakyReLU, the output size is 256×256, and the number of output channels is 64;
⑱ Output layer: the input size is 256×256, the number of input channels is 64, the convolution kernel is 3×3, the step size is 1, the edge filling mode is 'same', the activation function is tanh, the output size is 256×256, and the number of output channels is 3.
3. The method according to claim 1, wherein the specific parameters of each layer of the Context structure are as follows:
① L1 branch: the three convolution kernels are 1×1, 3×3 and 1×1 respectively, the step size is 1, the edge filling mode is 'valid', the dilation (atrous) rates are 1 and 5, the activation function is LeakyReLU, the output size is 32×32, and the number of output channels is 512;
② L2 branch: the three convolution kernels are 1×1, 3×3 and 1×1 respectively, the step size is 1, the edge filling mode is 'valid', the dilation (atrous) rates are 1 and 3, the activation function is LeakyReLU, the output size is 32×32, and the number of output channels is 512;
③ L3 branch: the three convolution kernels are 1×1, 3×3 and 1×1 respectively, the step size is 1, the edge filling mode is 'valid', the dilation (atrous) rates are 1 and 1, the activation function is LeakyReLU, the output size is 32×32, and the number of output channels is 512;
④ L4 branch: it consists of two convolution structures; the input size is 32×32, the number of input channels is 256, the two convolution kernels are both 1×1, the step size is 1, the edge filling mode is 'valid', the dilation rate is 1, the activation function is LeakyReLU, the output size is 32×32, and the number of output channels is 512.
4. The method of claim 1, wherein the concrete arrangement of the ConvGRU structures constructed is as follows:
The input of the module is the T feature maps output by the Context module; the module consists of two ConvGRU cells, and for each layer the convolution input size is 32×32, the number of input channels is 512, the convolution kernel size is 3×3, the step size is 1, the edge filling mode is 'valid', the activation function is LeakyReLU, the output size is 32×32, and the number of output channels is 512; the last output is selected as the input feature for the next stage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010040375.6A CN111242038B (en) | 2020-01-15 | 2020-01-15 | Dynamic tongue fibrillation detection method based on frame prediction network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111242038A CN111242038A (en) | 2020-06-05 |
CN111242038B true CN111242038B (en) | 2024-06-07 |
Family
ID=70877922
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010040375.6A Active CN111242038B (en) | 2020-01-15 | 2020-01-15 | Dynamic tongue fibrillation detection method based on frame prediction network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111242038B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112053375A (en) * | 2020-08-26 | 2020-12-08 | 上海眼控科技股份有限公司 | Method and equipment for predicting prediction based on improved network convolution model |
CN113129226B (en) * | 2021-03-24 | 2023-06-23 | 西安理工大学 | ConvGRU-U-Net-based computing ghost imaging reconstruction algorithm |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018098986A1 (en) * | 2016-11-29 | 2018-06-07 | 深圳市易特科信息技术有限公司 | Automatic detection system and method for tongue images in traditional chinese medicine |
CN109543606A (en) * | 2018-11-22 | 2019-03-29 | 中山大学 | A kind of face identification method that attention mechanism is added |
CN109740654A (en) * | 2018-12-26 | 2019-05-10 | 华东师范大学 | A kind of tongue body automatic testing method based on deep learning |
CN110033002A (en) * | 2019-04-19 | 2019-07-19 | 福州大学 | Detection method of license plate based on multitask concatenated convolutional neural network |
CN110251084A (en) * | 2019-06-21 | 2019-09-20 | 福州数据技术研究院有限公司 | A kind of detection of tongue picture and recognition methods based on artificial intelligence |
CN110599463A (en) * | 2019-08-26 | 2019-12-20 | 依脉人工智能医疗科技(天津)有限公司 | Tongue image detection and positioning algorithm based on lightweight cascade neural network |
CN110619319A (en) * | 2019-09-27 | 2019-12-27 | 北京紫睛科技有限公司 | Improved MTCNN model-based face detection method and system |
Also Published As
Publication number | Publication date |
---|---|
CN111242038A (en) | 2020-06-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant