CN114495170A

CN114495170A - Pedestrian re-identification method and system based on local self-attention inhibition

Info

Publication number: CN114495170A
Application number: CN202210102559.XA
Authority: CN
Inventors: 张森; 尚赵伟; 赵羚志; 周明亮
Original assignee: Chongqing University
Current assignee: Chongqing University
Priority date: 2022-01-27
Filing date: 2022-01-27
Publication date: 2022-05-13

Abstract

The invention provides a pedestrian re-identification method and system based on local self-attention suppression. The method comprises the following steps: collecting original pedestrian picture samples and preprocessing the samples; constructing a network model, wherein the network model comprises a convolutional backbone network optimized on the basis of a residual network, the output of the convolutional backbone network is connected to N self-attention branches for local feature extraction, and the outputs of the N local feature extraction branches are residually connected with the feature map output by the convolutional backbone network; training the network model by back propagation using the preprocessed samples; and performing pedestrian re-identification on a target picture using the trained network model. The method improves the accuracy of pedestrian local feature extraction, so that the pedestrian re-identification method based on local self-attention suppression has a stronger identification capability.

Description

Pedestrian re-identification method and system based on local self-attention inhibition

Technical Field

The invention relates to the field of computers, and in particular to a pedestrian re-identification method and system based on local self-attention suppression.

Background

Pedestrian re-identification, also called cross-camera tracking, aims to identify and retrieve the identity of a pedestrian across scenes and across cameras. With the construction and deployment of smart cities and smart communities, more and more cameras are installed in every corner of residential districts, shopping malls, and streets, and massive amounts of pedestrian video and image data are acquired every day. However, such data is still far from being fully utilized. For example, in intelligent security scenarios, the movement track of a suspect is determined mainly by police officers manually screening and integrating large amounts of surveillance video, which is time-consuming, labor-intensive, and prone to misjudgment; in scenarios such as intelligent person finding, when a child is lost, staff can only issue reminders and notices by broadcast, and the effect is very limited because the environment is noisy and the child is not psychologically mature.

Therefore, using computers to intelligently integrate and analyze pedestrian data from multiple cameras is the preferred approach to cross-camera tracking, now and in the future. A typical complete cross-camera tracking system is divided into three stages: acquisition and uploading of pedestrian data from multiple cameras over different time periods, pedestrian position detection based on video frames, and pedestrian identity recognition based on hand-crafted feature extraction or a deep neural network. Cameras of different manufacturers and models often introduce data heterogeneity, and the accuracy of pedestrian position detection depends on the performance of the detection model. In addition, owing to factors such as weather changes, illumination conditions, pose differences, occlusion by obstacles, and complex and variable backgrounds, accurately retrieving pedestrians with the same identity from a large pedestrian identity library is extremely challenging.

Identity recognition using a pedestrian re-identification model comprises two steps: extracting features from the pedestrian images and calculating feature similarity. The feature similarity calculation generally computes the cosine distance or Euclidean distance between the features of the query image and the features of the images in the image library; the smaller the distance, the greater the similarity, and the ranked retrieval result is obtained from the similarity scores. The accuracy of this step depends on the performance of the model in the feature extraction stage. In a large-scale pedestrian image library, many pedestrian samples with different identities are similar in pose, clothing, viewing angle, and the like, and discovering local fine-grained features is the key to distinguishing them. Therefore, designing a local feature extraction model that resists background interference and occlusion is the core of improving the robustness of a pedestrian re-identification model.
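The similarity step described above can be illustrated with a minimal sketch. The function names (`euclidean`, `cosine_distance`, `rank_gallery`) and the toy two-dimensional feature vectors are hypothetical and chosen only for illustration; real feature vectors would come from the feature extraction model.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity; 0 means the vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rank_gallery(query, gallery, dist=euclidean):
    # Gallery indices ordered from smallest distance (most similar) upward.
    return sorted(range(len(gallery)), key=lambda i: dist(query, gallery[i]))

# Toy data (hypothetical): one query feature and three gallery features.
query = [1.0, 0.0]
gallery = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
ranking = rank_gallery(query, gallery)
```

In this toy case both distance measures place gallery item 1 first, since it is the closest to the query in either sense.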

In early work, pedestrian local features were extracted from the original image using manually designed feature extraction operators. For example, Karanam et al. divided the original image into six horizontal parts and calculated gray-level histograms in different color spaces for each part. Matsukawa et al. proposed a two-level Gaussian model, in which the image is divided into a number of local blocks, Gaussian distributions are established among the local blocks at the same horizontal position, and a second-level Gaussian distribution is then established across different horizontal positions, thereby improving the ability to describe image texture. These methods are strongly influenced by texture features, lack extraction of the overall semantic information of pedestrians, easily overfit the training set, and have very limited accuracy.

Given the excellent performance of deep neural networks in large-scale image classification, many researchers have also applied them to pedestrian re-identification. Sun et al. horizontally segment the feature map output by the convolutional network into parts that respectively represent the head, the upper body, the thighs, the lower legs, and so on, and classify each part independently, which effectively improves the accuracy of pedestrian re-identification. Rahul et al. directly slice the original image horizontally and then feed each slice into a long short-term memory (LSTM) network to obtain fused features. Zhang et al. add a shortest-path-based slice matching algorithm on top of horizontal slicing, which alleviates the misalignment problem of hard slicing to some extent. These methods work well when pedestrian poses are standard and uniform, but in real scenes pedestrian poses vary greatly, and there are non-upright states such as cycling and partially occluded states such as occlusion by umbrellas or by other pedestrians, so hard slicing reduces retrieval accuracy.

Zhao et al. extract a number of pedestrian skeleton key points using a human pose estimation model, obtain the corresponding pixel regions according to the key points, and train on these regions together with the original image to achieve region alignment. Zheng et al. use affine transformation to achieve pixel-level pose alignment on the basis of the skeleton key points and then perform feature extraction. The local alignment effect of these methods depends on an additional pose estimation model, which inevitably increases the number of model parameters and is not conducive to engineering deployment.

Disclosure of Invention

To overcome the above defects in the prior art, the invention aims to provide a pedestrian re-identification method and system based on local self-attention suppression.

In order to achieve the above object, the present invention provides a pedestrian re-identification method based on local self-attention suppression, comprising the steps of:

collecting an original pedestrian picture sample, and preprocessing the sample;

constructing a network model, wherein the network model comprises a convolutional backbone network optimized on the basis of a residual network, the output of the convolutional backbone network is connected to N self-attention branches for local feature extraction, the outputs of the N local feature extraction branches are residually connected with the feature map output by the convolutional backbone network, and N is a positive integer;

training the network model by back propagation using the preprocessed samples;

and performing pedestrian re-identification on a target picture using the trained network model.


According to the pedestrian re-identification method based on local self-attention suppression, N self-attention branches are introduced on the basis of a convolutional backbone network optimized from a residual network and are residually connected with the global branch, so as to extract local semantic features of different body parts of a pedestrian. Owing to its skip connections, the residual network allows the model to be sufficiently deep while avoiding vanishing or exploding gradients; the introduction of the N self-attention branches improves the accuracy of pedestrian local feature extraction, so that the method has a stronger identification capability.

In a preferred scheme of the pedestrian re-identification method based on local self-attention suppression, the network model is trained by the following steps:

sending the preprocessed pictures into the convolution backbone network for global feature extraction to obtain a multi-channel feature map;

respectively sending the obtained multi-channel feature maps into the N self-attention branches for local feature extraction;

and residually connecting the outputs of the self-attention branches with the feature map output by the convolutional backbone network, calculating the loss function through the pooling layer and classifier of the convolutional backbone network, performing back propagation and updating the network parameters until the iterations are completed, and saving the model data.

In a preferred scheme of the pedestrian re-identification method based on local self-attention suppression, each self-attention branch comprises a layer normalization layer and a self-attention block, wherein the layer normalization layer normalizes the data of the different channels of the same sample, and the self-attention block uses the structure of a single head of the multi-head attention in the Vision Transformer to establish the relation between each feature and all other features, thereby obtaining a global receptive field.

By normalizing the data of the different channels of the same sample, the layer normalization layer avoids distribution drift, overcomes the drawback that batch normalization is affected by the batch size, and accelerates model convergence. The self-attention block effectively avoids the problem that the local receptive field of a convolutional network is insufficient to attend to global information. Because the background occupies a considerable proportion of a pedestrian image and the backgrounds captured by different cameras differ, the background strongly interferes with the judgment of pedestrian identity; establishing long-range dependencies through the global receptive field can effectively mask the differences between backgrounds, so that the model focuses on the body regions of the pedestrian. In addition, when the sample size is large, different pedestrians often look similar in certain parts, for example wearing the same shoes or the same sunglasses, and the global receptive field can alleviate, to some extent, the identity misjudgment caused by such similar appearances.
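The batch-size independence of layer normalization mentioned above can be seen in a small sketch. It assumes a plain normalization without the learnable gain and bias that a practical layer would usually carry; the function name `layer_norm` and the sample values are hypothetical.

```python
import math

def layer_norm(channels, eps=1e-5):
    # Normalize the channel values of ONE sample to zero mean and unit
    # variance. No other sample in the batch is involved, so the result
    # does not depend on the batch size.
    mean = sum(channels) / len(channels)
    var = sum((c - mean) ** 2 for c in channels) / len(channels)
    return [(c - mean) / math.sqrt(var + eps) for c in channels]

# Hypothetical channel activations at one spatial position of one sample.
sample = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(sample)
normed_mean = sum(normed) / len(normed)
normed_var = sum((v - normed_mean) ** 2 for v in normed) / len(normed)
```

Because the statistics are computed per sample, the same input yields the same output whether it is processed alone or in a large batch, which is the property the text contrasts with batch normalization.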

In a preferred scheme of the pedestrian re-identification method based on local self-attention suppression, after back propagation, the class activation heat maps corresponding to the output features of the N self-attention branches are respectively calculated, and input suppression is applied to the remaining branches according to the heat map of each self-attention branch.

In this way, the network is forced to mine different local semantic features, which makes it easier to extract multiple sub-salient local features of the pedestrian, avoids redundancy among the N self-attention branches, and avoids the information loss caused by attending only to the most salient region while ignoring other, equally important sub-salient regions.

In a preferred scheme, input suppression is applied to the remaining branches according to the heat map of each self-attention branch as follows: a salient-region mask is obtained from the class activation heat map of each self-attention branch and is then superimposed on the inputs of the remaining branches. This achieves salient-region suppression. A class activation heat map reflects the regions in the feature space that are strongly correlated with pedestrian identity; a larger value in the heat map indicates that the region is more critical for determining the identity of the pedestrian. Through mask fusion, the key region attended to by the current branch can be masked for the other branches. Although the N branches have the same structure, they have different initialization parameters, so their parameter values differ after multiple iterations and different branches can attend to different body regions.

In a preferred scheme of the pedestrian re-identification method based on local self-attention suppression, the convolutional backbone network is obtained by adding a non-local attention block after each of the first two residual blocks of the ResNet50 network, the pooling layer is a parameterized generalized mean pooling layer, and the loss function is a weighted regularized triplet loss function. The resulting convolutional backbone network improves the accuracy of the pedestrian re-identification network model.

In a preferred scheme of the pedestrian re-identification method based on local self-attention suppression, the N self-attention branches have the same structure and different initialization parameters. Their parameter values therefore differ after multiple iterations, so that different self-attention branches can attend to different body regions.

In a preferred scheme of the pedestrian re-identification method based on local self-attention suppression, the global features and the local features of the N self-attention branches obtained for the target picture through the network model are fused to obtain a feature vector, and the similarity between this feature vector and the feature vectors of all images in the image library to be searched is calculated to obtain the recognition result.

The application also provides a pedestrian re-identification system, which comprises a processor and a memory communicatively connected to the processor, wherein the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the pedestrian re-identification method based on local self-attention suppression. The pedestrian re-identification system has the advantages of the pedestrian re-identification method based on local self-attention suppression.

The beneficial effects of the invention are as follows. By introducing multiple self-attention branches into the convolutional backbone network optimized from a residual network, the invention can adaptively focus on local regions of the pedestrian's body, extract local semantic features of different body parts, and reduce background interference. The gradient information in back propagation and the class activation heat maps are used to make the key regions of different branches suppress one another, so that different branches of the model attend to different local body parts of the pedestrian and rich fine-grained detail information is obtained. Without significantly increasing the number of parameters or the amount of computation of the network model, all body regions of the pedestrian in the image can be mined adaptively, and robustness to viewing-angle differences, background changes, and the like is improved.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic diagram of the network model architecture;

FIG. 2 is a visualization of class activation heat maps.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

In the description of the present invention, unless otherwise specified and limited, it is to be noted that the terms "mounted," "coupled," and "connected" are to be interpreted broadly, and may be, for example, a mechanical connection or an electrical connection, a communication between two elements, a direct connection, or an indirect connection via an intermediate medium, and specific meanings of the terms may be understood by those skilled in the art according to specific situations.

The invention provides a pedestrian re-identification method based on local self-attention suppression, comprising the following steps.

Original pedestrian picture samples are collected and preprocessed.

A data sampling strategy is set before the original pedestrian picture samples are collected. In this embodiment, the data sampling strategy is to randomly select a fixed number of pedestrians and combine the same number of pictures of each person into a batch; the specific numbers are determined by the video memory size of the graphics card used, and larger is better.
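One way to read the sampling strategy above is sketched below, assuming that identities with too few pictures are filled by sampling with replacement (the text does not say how that case is handled). The function name `sample_pk_batch` and the toy label list are hypothetical.

```python
import random
from collections import Counter, defaultdict

def sample_pk_batch(labels, num_ids, num_imgs, rng=random):
    # Group picture indices by pedestrian identity.
    by_id = defaultdict(list)
    for index, pid in enumerate(labels):
        by_id[pid].append(index)
    # Randomly pick a fixed number of identities.
    chosen = rng.sample(sorted(by_id), num_ids)
    batch = []
    for pid in chosen:
        pool = by_id[pid]
        if len(pool) >= num_imgs:
            batch.extend(rng.sample(pool, num_imgs))
        else:
            # Assumption: repeat pictures when an identity has too few.
            batch.extend(rng.choices(pool, k=num_imgs))
    return batch

# Toy identity labels for 11 pictures of 4 pedestrians (hypothetical).
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3]
batch = sample_pk_batch(labels, num_ids=2, num_imgs=3, rng=random.Random(0))
counts = Counter(labels[i] for i in batch)
```

Each batch then contains `num_ids * num_imgs` pictures, with every selected pedestrian represented equally, which is what a triplet-style loss needs in order to find positives and negatives inside the batch.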

The preprocessing of the original pedestrian picture samples comprises common data augmentation operations such as scaling to a uniform size, standardizing the three RGB channels, random horizontal flipping, pixel padding, random-size cropping, and random erasing. Preprocessing increases sample diversity to a certain extent, avoids overfitting of the model to a specific data set, and improves generalization performance.
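Two of the augmentation operations listed above, random horizontal flipping and random erasing, can be sketched on a tiny single-channel image stored as nested lists. The function names, the fixed erase size, and the fill value are assumptions made for illustration; practical pipelines usually sample the erased area and aspect ratio at random.

```python
import random

def random_flip(image, rng, prob=0.5):
    # Mirror the image left to right with probability `prob`.
    if rng.random() < prob:
        return [row[::-1] for row in image]
    return [row[:] for row in image]

def random_erase(image, rng, erase_h, erase_w, fill=0):
    # Overwrite a randomly placed erase_h x erase_w rectangle with `fill`.
    height, width = len(image), len(image[0])
    top = rng.randrange(height - erase_h + 1)
    left = rng.randrange(width - erase_w + 1)
    erased = [row[:] for row in image]  # leave the input untouched
    for r in range(top, top + erase_h):
        for c in range(left, left + erase_w):
            erased[r][c] = fill
    return erased

rng = random.Random(0)
image = [[1] * 4 for _ in range(4)]
erased = random_erase(image, rng, erase_h=2, erase_w=2)
flipped = random_flip([[1, 2], [3, 4]], rng, prob=1.0)
```

Random erasing in particular simulates partial occlusion, which is one of the difficulties the background section highlights.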

A network model is constructed. In this embodiment, the network model is built using a mainstream deep learning framework such as PyTorch or TensorFlow. The specific structure of the network model is shown in FIG. 1 and comprises a convolutional backbone network and N local feature extraction branches.

In this embodiment, the convolutional backbone network loads parameters pre-trained on the ImageNet data set, and the remaining parameters use default initialization. The convolutional backbone network adopts the AGW network as the reference model, which improves on the ResNet50 residual network as follows: a non-local attention block is added after each of the first two residual blocks of ResNet50, where non-local attention is a type of spatial attention structure generally used to establish long-range context dependencies in the shallow layers of a neural network; the original global average pooling layer is changed to parameterized generalized mean pooling; and the original triplet loss function with a hyperparameter threshold is changed to a weighted regularized triplet loss function. Each of these three optimizations improves the accuracy of the pedestrian re-identification model to a different degree, and the performance on current public data sets is superior to that of ResNet50.
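Generalized mean pooling can be illustrated with a short sketch of its commonly used form, in which each channel is reduced to the p-th root of the mean of its activations raised to the power p. In the network this exponent is a learnable parameter; here it is a fixed argument for illustration, and the function name `gem_pool` and the input values are hypothetical.

```python
def gem_pool(feature_map, p=3.0, eps=1e-6):
    # feature_map: list of channels, each a flat list of spatial activations.
    pooled = []
    for channel in feature_map:
        # Clamp to a small positive value so fractional powers are defined.
        clamped = [max(v, eps) for v in channel]
        mean_pow = sum(v ** p for v in clamped) / len(clamped)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled

channel = [[1.0, 2.0, 3.0]]
as_average = gem_pool(channel, p=1.0)[0]    # p = 1 gives average pooling
towards_max = gem_pool(channel, p=50.0)[0]  # large p approaches max pooling
```

The exponent therefore interpolates between average pooling and max pooling, which lets training decide how strongly the pooled feature should emphasize the most active locations.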

In this embodiment, the local feature extraction branches are self-attention branches, and N is generally not greater than 3. Considering detection accuracy and the amount of computation, this embodiment preferably uses three self-attention branches, shown as branch-1, branch-2, and branch-3 in FIG. 1. Each self-attention branch comprises a layer normalization layer and a self-attention block, where the layer normalization layer normalizes the data of the different channels of the same sample, and the self-attention block uses the multi-head attention structure of the Vision Transformer to establish the relation between each feature and all other features, thereby obtaining a global receptive field. The three self-attention branches have the same structure and different initialization parameters.

Specifically, the output of the convolutional backbone network is connected to the three self-attention branches, and the outputs of the three local feature extraction branches are residually connected with the feature map output by the convolutional backbone network.
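A single branch of this kind can be sketched as follows, under several assumptions: one attention head, scaled dot-product attention, a layer normalization without learnable gain and bias, and each spatial position of the backbone feature map treated as one token whose vector holds its channel values. The projection matrices `w_q`, `w_k`, and `w_v` are hypothetical placeholders for learned weights.

```python
import math

def layer_norm(vector, eps=1e-5):
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_branch(tokens, w_q, w_k, w_v):
    # Layer normalization, then single-head attention over ALL tokens.
    normed = [layer_norm(t) for t in tokens]
    queries = [matvec(w_q, t) for t in normed]
    keys = [matvec(w_k, t) for t in normed]
    values = [matvec(w_v, t) for t in normed]
    scale = 1.0 / math.sqrt(len(keys[0]))
    output = []
    for token, query in zip(tokens, queries):
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
        weights = softmax(scores)  # one weight per token: global receptive field
        attended = [sum(w * value[c] for w, value in zip(weights, values))
                    for c in range(len(values[0]))]
        # Residual connection with the backbone feature at this position.
        output.append([x + a for x, a in zip(token, attended)])
    return output

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
out = self_attention_branch(tokens, eye, eye, eye)
```

Because of the residual connection, a branch whose value projection is all zeros simply returns the backbone features unchanged, so each branch only has to learn a correction on top of the global feature map.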

The network model is trained using the preprocessed samples. In the training process, one iteration comprises two parts: calculating the gradient values of the parameters by back propagation, and updating the network parameters using the gradient values.

This embodiment preferably, but not exclusively, trains for 120 rounds using the Adam optimizer. The initial learning rate is 0.00035, the learning rate of the first 10 rounds is doubled, and the learning rate is reduced by a factor of 10 at the 40th and 70th rounds; the model parameters are saved after training is finished.
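The step decay in this schedule can be sketched as a small function. The description of the first 10 rounds is ambiguous in the text, so the sketch assumes a linear warm-up to the base rate, which is a common choice for this kind of training but is not confirmed by the source; the 0-based epoch index and the function name `learning_rate` are likewise assumptions.

```python
def learning_rate(epoch, base_lr=0.00035, warmup_epochs=10,
                  milestones=(40, 70), gamma=0.1):
    # `epoch` is 0-based (assumption).
    lr = base_lr
    if epoch < warmup_epochs:
        # Assumed linear warm-up; the source wording for this phase is unclear.
        lr *= (epoch + 1) / warmup_epochs
    for milestone in milestones:
        if epoch >= milestone:
            lr *= gamma  # reduce by a factor of 10 at each milestone
    return lr

schedule = [learning_rate(epoch) for epoch in range(120)]
```

Under these assumptions the rate reaches 0.00035 by the tenth round, drops to 0.000035 from round index 40, and to 0.0000035 from round index 70 until training ends.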

Specifically, during training, the preprocessed pictures are fed into the convolutional backbone network (in this embodiment, the convolutional backbone network optimized from the residual network) for global feature extraction to obtain a multi-channel feature map. Owing to its skip connections, the residual network allows the model to be sufficiently deep while avoiding vanishing or exploding gradients.

The obtained multi-channel feature map is fed into each of the three self-attention branches for local feature extraction. In this embodiment, the feature map output by the fourth residual block of the convolutional layers of the ResNet50 residual network is preferably fed into the three self-attention branches.

The outputs of the three self-attention branches are residually connected with the feature map output by the convolutional backbone network, and the results undergo the same pooling and loss function calculation as the backbone network, where the loss function comprises a cross-entropy loss and a weighted regularized triplet loss. Back propagation is performed and the network parameters are updated until the iterations are completed, and the model data is saved.
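The text does not spell out the loss formulas. The sketch below assumes the form usually given for the weighted regularized triplet loss in the AGW baseline: positive distances are weighted by a softmax over the distances, negative distances by a softmax over the negated distances, and the difference is passed through a softplus. Adding the two losses with equal weight is also an assumption, and all function names are hypothetical.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # Identity classification loss for one sample.
    return -math.log(softmax(logits)[label])

def weighted_regularized_triplet(pos_dists, neg_dists):
    # Distances from one anchor to its positives and negatives in the batch.
    w_pos = softmax(pos_dists)                # far positives weigh more
    w_neg = softmax([-d for d in neg_dists])  # near negatives weigh more
    pos_term = sum(w * d for w, d in zip(w_pos, pos_dists))
    neg_term = sum(w * d for w, d in zip(w_neg, neg_dists))
    return math.log1p(math.exp(pos_term - neg_term))  # softplus, no margin

def total_loss(logits, label, pos_dists, neg_dists):
    # Assumption: the two losses are summed with equal weight.
    return cross_entropy(logits, label) + weighted_regularized_triplet(pos_dists, neg_dists)

easy = weighted_regularized_triplet([0.1, 0.2], [2.0, 3.0])
hard = weighted_regularized_triplet([2.0, 3.0], [0.1, 0.2])
```

Unlike a triplet loss with a fixed margin, this form has no threshold hyperparameter, which matches the motivation stated for replacing the original triplet loss.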

In the training process, after back propagation, the class activation heat maps corresponding to the output features of the three self-attention branches are respectively calculated, and input suppression is applied to the remaining two branches according to the heat map of each self-attention branch; that is, a salient-region mask is obtained from the class activation heat map of each self-attention branch and is superimposed on the inputs of the remaining branches.

Taking branch-1 as an example: heatmap-1 in FIG. 1 is the heat map corresponding to the first self-attention branch. A class activation heat map reflects the degree to which different regions of a pedestrian image influence its identification; key regions tend to have larger feature values and appear as darker parts of the heat map. For example, the heat map of branch-1 in FIG. 1 shows that branch-1 ignores features of parts such as the person's shoes and lower legs. Therefore, this embodiment uses three self-attention branches that do not share parameters to extract multiple sub-salient local features of the pedestrian. At the same time, to avoid redundancy among the three self-attention branches, input suppression is applied to the remaining two self-attention branches according to the heat map of each self-attention branch, which forces the network to mine different local semantic features. Specifically, taking branch-1 as an example, mask-1 is calculated from its heat map heatmap-1 as follows:

where m_{i,j} denotes the value in the i-th row and j-th column of the mask map, h denotes the heat map, α denotes the suppression coefficient, which takes the value 0.1 in the invention, and β denotes the suppression factor, which takes the value 0.75 in the invention. The Hadamard product of the mask map mask-1 and the feature map is calculated at the inputs of branch-2 and branch-3; branch-2 and branch-3 are handled in the same way, so that different branches attend to different human body regions.
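The mask formula itself is not reproduced in this text, so the following is only one plausible reading and is not confirmed by the source: positions whose heat map value exceeds β times the maximum of the heat map are treated as salient and scaled by α, and all other positions keep a weight of 1. The function names `suppression_mask` and `apply_mask` and the toy values are hypothetical.

```python
def suppression_mask(heatmap, alpha=0.1, beta=0.75):
    # ASSUMED rule: m[i][j] = alpha where h[i][j] > beta * max(h), else 1.
    peak = max(max(row) for row in heatmap)
    threshold = beta * peak
    return [[alpha if value > threshold else 1.0 for value in row]
            for row in heatmap]

def apply_mask(feature_map, mask):
    # Hadamard (element-wise) product of one feature channel and the mask.
    return [[f * m for f, m in zip(f_row, m_row)]
            for f_row, m_row in zip(feature_map, mask)]

# Toy heat map of branch-1 and one feature channel fed to branch-2.
heatmap_1 = [[0.1, 0.9], [0.5, 1.0]]
mask_1 = suppression_mask(heatmap_1)
branch_2_input = apply_mask([[2.0, 2.0], [2.0, 2.0]], mask_1)
```

Under this reading, the region that branch-1 relies on most is attenuated rather than removed at the inputs of the other branches, which pushes them toward other body regions while still leaving some signal in place.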

The network parameters are updated after each iteration, and the network model data is saved after iterative training is finished. After training is finished, the heat maps and mask maps are no longer needed. The training pictures pass through the network model to obtain the global features and the local features of the three branches, which are fused and passed through the generalized mean pooling layer to obtain feature vectors; the feature vectors take part in the similarity calculation between samples, and when the similarity meets a preset requirement, the trained network model is considered to meet the requirement and training of the network model is terminated.

When pedestrian re-identification is required for a target picture, the target picture is processed by the trained network model. Specifically, the global features and the local features of the three self-attention branches obtained for the target picture through the network model are fused to obtain a feature vector; the similarity (for example, the Euclidean distance) between this feature vector and the feature vectors of all images in the image library to be searched is calculated, and the results are sorted from smallest to largest distance to obtain the recognition result.

A detailed description is given below using a specific example.

Experiments were performed on three public data sets: Market-1501, DukeMTMC-reID, and MSMT17. Market-1501 was produced at Tsinghua University in 2015; its training and test sets contain 1501 different pedestrians with about 20 pictures per person on average, and its test modes comprise an indoor search mode and a full-scene search mode, where the indoor search mode uses only indoor scenes for testing and background change is relatively small. The DukeMTMC-reID data set was obtained by manually labeling a multi-target pedestrian tracking data set and contains over 30,000 pictures. The MSMT17 data set was collected under changing weather and at different times of day, and is characterized by a large number of pedestrian identities (more than 4,000) and pedestrian images (more than 120,000).

Two test indexes commonly used in pedestrian re-identification are adopted: the cumulative matching characteristic (CMC) [12] and the mean average precision (mAP) [13]. The cumulative matching characteristic is expressed in the form rank-k; for example, rank-5 is the proportion of queries for which a pedestrian with the correct identity appears among the first 5 items of the image sequence sorted by similarity, and rank-1 is generally the value of most interest. The mean average precision is the mean of the average precision over all query pedestrian images, where the average precision of a single image is obtained by averaging the precision at each correct match in the ranked picture sequence.
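The two indexes can be computed from ranked match lists as in the following sketch, where each query is represented by a list of booleans stating whether the gallery item at each rank has the correct identity. The function names and the two toy queries are hypothetical, and the sketch leaves out protocol details such as excluding same-camera gallery images.

```python
def rank_k_hit(ranked_matches, k):
    # 1 if any of the first k ranked items has the correct identity.
    return 1.0 if any(ranked_matches[:k]) else 0.0

def average_precision(ranked_matches):
    # Mean of the precision values measured at each correct match.
    hits = 0
    precisions = []
    for position, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0

def evaluate(all_ranked_matches, k=1):
    count = len(all_ranked_matches)
    cmc = sum(rank_k_hit(r, k) for r in all_ranked_matches) / count
    mean_ap = sum(average_precision(r) for r in all_ranked_matches) / count
    return cmc, mean_ap

# Two toy queries, each with a gallery of three ranked items.
queries = [[True, False, True], [False, True, False]]
rank1, mean_ap = evaluate(queries, k=1)
rank2, _ = evaluate(queries, k=2)
```

For these toy queries the average precisions are 5/6 and 1/2, so the mean average precision is 2/3, while rank-1 is 0.5 and rank-2 is 1.0.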

Table 1 Experimental results on different data sets

Training is performed on an Ubuntu system with an NVIDIA 1080 GPU and 16 GB of memory. In the preprocessing stage the picture size is scaled to 256 × 128, the weight decay coefficient is set to 0.0005, and the other settings are the same as described in the above embodiment. The results on the three data sets are shown in Table 1, where "-" indicates that the index is not reported in the original paper; data sets not used in the original paper are likewise not listed, and the methods based on hand-crafted features [1,2] are not listed in the table because their accuracy is significantly lower than that of deep-learning-based methods.

As can be seen from Table 1, the accuracy of the network model designed by the invention is improved to different degrees on the three data sets compared with the reference model. This is especially so for the MSMT17 data set, in which the numbers of pedestrian identities and samples are much larger than in the other data sets, the weather involved is complex and varied, and the illumination and shadow conditions differ across time periods, so the complexity of the data set is significantly higher. On MSMT17 the rank-1 index of the model of the invention is improved by 2.7% and the mAP index by 4%, which is significantly higher than the improvement on the other data sets and indicates that the robustness and generalization performance of the model of the invention are better than those of existing methods.

To further illustrate the performance of the invention, the features extracted by the model of the invention and by the reference model are visualized using Grad-CAM, a class activation heat map tool commonly used in computer vision, and the comparison result is shown in FIG. 2.

As can be seen in FIG. 2, the reference model usually attends to only part of the pedestrian's body, as in the visualization result for pedestrian b, and is susceptible to background and occlusion; for example, pedestrian a is strongly interfered with by the background, and the wheel occluding pedestrian c causes misjudgment. The model of the invention focuses better on every part of the pedestrian's body while masking the interference of the background and occluded regions.

To measure the change in parameter count and computation introduced by the model of the invention, the open-source PyTorch-OpCounter tool is used for statistics, and the results are shown in Table 2. Compared with the reference model, the parameter count of the model of the invention increases by only 0.18% and the computation by only 0.54%, which is negligible relative to the improvement in identification accuracy.

Table 2. Comparison of model parameters and computation

Model	Parameters	FLOPs
AGW	23.541M	4.076G
Ours	23.584M	4.098G
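The relative increases quoted for Table 2 follow directly from the tabulated values, as the short check below illustrates. The helper name is an assumption for illustration; the four numbers are copied from Table 2.

```python
def relative_increase(baseline, ours):
    """Percentage increase of `ours` over `baseline`."""
    return (ours - baseline) / baseline * 100.0


# Values copied from Table 2 (parameters in millions, FLOPs in giga).
param_increase = relative_increase(23.541, 23.584)
flops_increase = relative_increase(4.076, 4.098)

print(f"parameters: +{param_increase:.2f}%")  # +0.18%
print(f"FLOPs:      +{flops_increase:.2f}%")  # +0.54%
```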

Therefore, the method effectively improves the identification accuracy and the robustness of the pedestrian re-identification model under complex backgrounds, occlusion, and varying viewing angles.

The invention also provides an embodiment of a pedestrian re-identification system, which comprises a processor and a memory connected with the processor in communication, wherein the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the corresponding operation of the pedestrian re-identification method based on local self-attention suppression.

In the description of the specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A pedestrian re-identification method based on local self-attention inhibition is characterized by comprising the following steps:

collecting an original pedestrian picture sample, and preprocessing the sample;

2. The pedestrian re-identification method based on local self-attention suppression according to claim 1, wherein the method for training in the network model comprises the following steps:

3. The method of claim 1, wherein each self-attention branch comprises a layer normalization layer and a self-attention block, wherein the layer normalization layer normalizes different channel data of the same sample, and the self-attention block uses a multi-head attention structure in a visual transformer to obtain a global receptive field by establishing a relationship between each feature and all other features.

4. The pedestrian re-identification method based on local self-attention suppression according to claim 1 or 2, characterized in that after back propagation, class activation heat maps corresponding to the output features of the N self-attention branches are respectively calculated, and input suppression is performed on the remaining branches according to the heat map of each self-attention branch.

5. The pedestrian re-identification method based on local self-attention suppression according to claim 4, wherein the method for performing input suppression on the remaining branches according to the heat map of each self-attention branch comprises the following steps: a salient-region mask is obtained from the class activation heat map of each self-attention branch and is then superimposed on the inputs of the remaining branches.

6. The pedestrian re-identification method based on local self-attention suppression according to claim 1, wherein the convolutional backbone network adds a non-local attention block after each of the first two residual blocks of the ResNet50 network, the pooling layer is a parameterized generalized mean pooling layer, and the loss function is a weighted regularization triplet loss function.

7. The pedestrian re-identification method based on local self-attention suppression according to claim 1, wherein the N self-attention branches have the same structure and different initialization parameters.

8. The pedestrian re-identification method based on local self-attention suppression according to claim 1, wherein the global feature and the N self-attention branch local features of the target picture are obtained through the network model and then pooled, and the similarity between the resulting feature vector and the feature vectors of all images in the image library to be searched is calculated, so that an identification result is obtained.

9. The method for pedestrian re-identification based on local self-attention suppression according to claim 1, wherein N is not greater than 3.

10. A pedestrian re-identification system comprising a processor and a memory communicatively coupled to the processor, the memory storing at least one executable instruction that causes the processor to perform operations corresponding to the pedestrian re-identification method based on local self-attention suppression as claimed in any one of claims 1 to 9.