CN113221787A - Pedestrian multi-target tracking method based on multivariate difference fusion
- Publication number: CN113221787A (application CN202110556574.7A)
- Authority: CN (China)
- Prior art keywords: net, pedestrian, detection, fusion, key point
- Legal status: Granted
Classifications
- G06V20/53 — Recognition of crowd images, e.g. recognition of crowd congestion
- G06F18/2155 — Generating training patterns; bootstrap methods characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques
- G06F18/253 — Fusion techniques of extracted features
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
- G06V10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V20/41 — Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
Abstract
The invention provides a pedestrian multi-target tracking method based on multivariate difference fusion, comprising the following steps: (1) acquiring a training sample set and a test sample set; (2) constructing a detection and re-identification integrated network model based on multivariate difference fusion; (3) iteratively training the detection and re-identification integrated network model; (4) acquiring the pedestrian multi-target tracking result. When constructing the detection and re-identification integrated network model, the method introduces differences in training data, training mode and network structure so that the two key point heat map prediction sub-networks form prediction preferences for targets of different sizes; the prediction results of the two sub-networks are added and fused to obtain a multivariate difference fusion key point heat map. This solves the problem of low detection recall caused by using only a single key point heat map prediction sub-network in the prior art and improves the tracking accuracy of the algorithm.
Description
Technical Field
The invention belongs to the technical field of computer vision and relates to a pedestrian multi-target tracking method based on multivariate difference fusion, which can be used for pedestrian multi-target tracking tasks in fields such as security surveillance, video content understanding and human-computer interaction.
Background
Pedestrian multi-target tracking algorithms are widely applied in security monitoring, video content understanding, human-computer interaction, intelligent nursing and other fields. In recent years, with the rise and popularization of deep learning, pedestrian multi-target tracking has gradually converged on an algorithmic paradigm combining three basic modules: target detection, re-identification feature extraction and data association. The target detection module detects and localizes all pedestrian targets in the scene; the re-identification feature extraction module extracts and encodes pedestrian appearance information; the data association module estimates the similarity between historical trajectories and the pedestrians detected in the current frame from the information provided by the detection and re-identification feature extraction modules, and performs optimal association matching according to that similarity to form trajectories.
In 2020, Yifu Zhang et al., in "FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking", disclosed a multi-target tracking algorithm that integrates the detection and re-identification tasks into one network. It adds a re-identification feature extraction sub-network on top of the CenterNet detection network so that the detection and re-identification tasks share a large number of convolutional-layer parameters and features, thereby reducing the number of network parameters and the amount of computation, improving the execution efficiency of the system, and achieving a good balance between speed and accuracy.
However, the FairMOT algorithm merely integrates the detection and re-identification feature extraction tasks in a simple way: the four prediction-task branch sub-networks share only a single fused feature map, which causes intense competition for features among the tasks and suppresses further learning of each task. In addition, in scenes where target scales differ greatly, FairMOT ignores the feature differences between targets of widely different scales and uses only one target center-point heat map prediction sub-network to recall targets of all scales. Although a convolutional neural network can learn to adapt to changes in scale, texture and so on, when targets differ greatly the network tends to seek a compromise between them, which suppresses the model's detection recall of pedestrian targets and reduces the accuracy of multi-target tracking.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a pedestrian multi-target tracking method based on multivariate difference fusion, in order to solve the technical problem of low detection recall in scenes with large differences in target scale.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
(1) Obtaining a training sample set D_train and a test sample set D_test:
(1a) Preprocessing the selected V RGB image sequences with pedestrian detection frame labels and identity labels to obtain a set of preprocessed RGB image frame sequences S_v = {S_v^m | 1 ≤ m ≤ V}; the RGB image frames contained in I of the preprocessed sequences in S_v are taken as the training sample set D_train and the remaining K preprocessed sequences as the test sample set D_test, where S_v^m = {f^(n) | 1 ≤ n ≤ L_m} denotes the m-th preprocessed RGB image frame sequence containing L_m frames, f^(n) represents the n-th preprocessed RGB image frame, I > K, I + K = V, V > 20, and L_m > 200;
(2) Constructing the detection and re-identification integrated network model O based on multivariate difference fusion:
(2a) Constructing the structure of the detection and re-identification integrated network model O based on multivariate difference fusion:
The model O comprises a backbone network Net_backbone; a first feature fusion sub-network A_s and a second feature fusion sub-network A_l with identical structure, arranged in parallel and cascaded to Net_backbone; and a fusion module. The output of the first feature fusion sub-network A_s is connected to a key point offset prediction sub-network Net_offset, a small-target-preference key point heat map prediction sub-network Net_hm_s and a bounding box prediction sub-network Net_bbox arranged in parallel; the output of the second feature fusion sub-network A_l is connected to a large-target-preference key point heat map prediction sub-network Net_hm_l and a re-identification feature extraction sub-network Net_reid arranged in parallel, where:
The backbone network Net_backbone adopts a tree-structured aggregation iterative network composed of several two-dimensional convolution layers, batch normalization layers, two-dimensional pooling layers, deformable convolution layers and transposed convolution layers;
The first feature fusion sub-network A_s and the second feature fusion sub-network A_l each comprise several spatial attention sub-networks Net_sam and a channel attention sub-network Net_cam. The spatial attention sub-network Net_sam comprises a global average pooling layer and a global maximum pooling layer arranged in parallel, followed by a two-dimensional convolution layer connected to both pooling layers; the channel attention sub-network Net_cam comprises a global average pooling layer and a global maximum pooling layer arranged in parallel, each connected to two cascaded two-dimensional convolution layers. The Net_offset, Net_hm_s, Net_bbox, Net_hm_l and Net_reid sub-networks all adopt a structure consisting of a first convolution layer, a ReLU activation layer and a second convolution layer cascaded in sequence; a fully connected layer is cascaded to the output of Net_reid, and the fusion module is cascaded to the outputs of Net_hm_s and Net_hm_l;
(2b) Defining the loss function L_heatmap of the key point heat map prediction task:
where N represents the number of key points in the predicted key point heat map, alpha and beta represent hyper-parameters, Ŷ_xy and Y_xy respectively represent the label and the response value of the key point at coordinate (x, y) in the predicted key point heat map, Σ represents the summation operation, and log represents the logarithm operation;
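The loss formula itself is not reproduced in the extracted text; a reconstruction consistent with the variables described here, assuming the CenterNet-style penalty-reduced focal loss that these definitions match (with Ŷ_xy as the label and Y_xy as the predicted response), is:

$$L_{heatmap} = -\frac{1}{N}\sum_{xy}\begin{cases}\left(1-Y_{xy}\right)^{\alpha}\log\left(Y_{xy}\right), & \hat{Y}_{xy}=1\\ \left(1-\hat{Y}_{xy}\right)^{\beta}\left(Y_{xy}\right)^{\alpha}\log\left(1-Y_{xy}\right), & \text{otherwise}\end{cases}$$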
(3) Iteratively training the detection and re-identification integrated network model O based on multivariate difference fusion:
(3a) Initializing the weight parameters θ_J of the detection and re-identification integrated network model O, the iteration counter t and the maximum number of iterations T, T ≥ 50000, and letting t = 0;
(3b) Performing random data enhancement on bs ∈ [16, 64] training samples randomly selected from the training sample set D_train, and updating the detection frame information of each training sample according to the enhancement applied, to obtain bs data-enhanced training samples with updated detection frame information. Pedestrian targets whose updated detection frame height divided by the image frame height is greater than a threshold th_ratio are taken as large targets, and those whose ratio is less than th_ratio as small targets. Finally, according to the updated detection frame information, the updated identity information and the large/small target division result, the small-target-preference key point heat map label label_hm_s, the large-target-preference key point heat map label label_hm_l, the difference fusion key point heat map label label_hm, the bounding box label label_bbox, the key point offset label label_offset and the re-identification identity label label_id are determined;
(3c) Taking the bs data-enhanced training samples as the input of the detection and re-identification integrated network model O, the backbone network Net_backbone extracts features from each training sample to obtain three feature maps Feat_1, Feat_2 and Feat_3 of the training sample at different scales;
(3d) The first feature fusion sub-network A_s adaptively fuses Feat_1, Feat_2 and Feat_3 to obtain the feature map Feat_s; the key point offset prediction sub-network Net_offset, the small-target-preference key point heat map prediction sub-network Net_hm_s and the bounding box prediction sub-network Net_bbox each take Feat_s as input and perform forward inference to obtain the key point offset prediction vector Vec_offset corresponding to Net_offset, the small-target-preference key point heat map prediction Hm_S corresponding to Net_hm_s, and the vector Vec_dis_bbox of distances from the key point to the top, bottom, left and right sides of the target frame corresponding to Net_bbox. At the same time, the second feature fusion sub-network A_l adaptively fuses Feat_1, Feat_2 and Feat_3 to obtain the feature map Feat_l; the large-target-preference key point heat map prediction sub-network Net_hm_l and the re-identification feature extraction sub-network Net_reid each take Feat_l as input and perform forward inference to obtain the large-target-preference key point heat map prediction Hm_L corresponding to Net_hm_l and the re-identification feature vector Vec_reid corresponding to Net_reid; the fully connected layer classifies Vec_reid to obtain the pedestrian identity classification result. The fusion module fuses the Hm_S and Hm_L key point heat maps to obtain the fused key point heat map Hm;
(3e) Using the L1 loss function, the loss value L_off of the key point offset prediction is calculated from the key point offset prediction and its label label_offset, and the loss value L_bbox of the bounding box prediction is calculated from the bounding box prediction and its label label_bbox; using the cross-entropy loss function, the loss value L_reid of the re-identification feature extraction result is calculated from the pedestrian identity classification result and its label label_id; using the key point heat map prediction loss function L_heatmap, Hm_S, Hm_L and Hm together with their corresponding labels label_hm_s, label_hm_l and label_hm are input to calculate the loss values L_hm_s, L_hm_l and L_hm respectively; finally, L_off, L_bbox, L_reid, L_hm_s, L_hm_l and L_hm are combined by adaptive weighted summation to obtain the loss value L_total of the detection and re-identification integrated network model O;
(3f) Using the back-propagation method, the gradients of the weight parameters of the detection and re-identification integrated network model O are calculated from the loss value L_total, and the weight parameters θ_J are then updated by a gradient descent algorithm using these gradients;
(3g) Judging whether t > T; if so, the trained detection and re-identification integrated network model O' is obtained; otherwise, let t = t + 1 and return to step (3b);
(4) Acquiring the pedestrian multi-target tracking result:
(4a) Initializing the test sample set D_test: the k-th test sample comprises P RGB image frames, the p-th RGB image frame being f^(p), P > 200; let k = 1 and initialize the historical trajectory set Tra^(k) = {};
(4b) Let p = 1;
(4c) Taking the p-th RGB image frame f^(p) of the k-th test sample as the input of the trained detection and re-identification integrated network model O' and propagating it forward to obtain the key point offset prediction Vec_offset of f^(p), the distances Vec_dis_bbox from the key point to the top, bottom, left and right sides of the target frame, the key point heat map prediction Hm and the re-identification feature vectors Vec_reid; Vec_offset, Vec_dis_bbox and Hm are then decoded to obtain the pedestrian detection frame set Det = {det_i | 0 ≤ i ≤ DN-1} of f^(p), where det_i is the detection frame of the i-th pedestrian and DN denotes the number of pedestrians detected in f^(p);
(4d) Screening out the pedestrian targets Object = {object_i | conf_i > th_conf, 0 ≤ i ≤ DN-1} of f^(p) whose detected key point response value conf_i is greater than the response threshold th_conf, and obtaining the detection frames and re-identification feature vector information corresponding to these pedestrian targets from the Det set and the Vec_reid vectors;
(4e) According to the detection frames and re-identification feature vector information of the screened pedestrian targets, data association between the screened pedestrian target set Object and the historical trajectory set Tra^(k) is performed with an online association method to obtain the pedestrian multi-target tracking result of f^(p);
(4f) Judging whether p ≥ P; if so, the pedestrian multi-target tracking result of the k-th test sample is obtained; otherwise, let p = p + 1, update the historical trajectory set Tra^(k), and return to step (4c);
(4g) Judging whether k ≥ K; if so, the pedestrian multi-target tracking results of the test sample set D_test are obtained; otherwise, let k = k + 1 and return to step (4b).
Compared with the prior art, the invention has the following advantages:
1. When constructing the detection and re-identification integrated network model based on multivariate difference fusion, the two feature fusion sub-networks arranged in parallel are each cascaded with a key point heat map prediction sub-network, and differences in training mode and training data are introduced through the design of the loss function and the training procedure, so that the two key point heat map prediction sub-networks form prediction preferences for targets of different sizes; the differing results of the two sub-networks are added and fused to obtain the multivariate difference fusion key point heat map.
2. In addition, after the several prediction sub-networks are separated and cascaded to the two feature fusion sub-networks arranged in parallel, the degree of feature competition among the prediction tasks is reduced and a network structure difference is added to the multivariate difference fusion key point heat map, further improving the tracking accuracy of the algorithm.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Fig. 2 is a schematic structural diagram of the integrated network for detecting and re-identifying based on multivariate difference fusion according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments.
Referring to fig. 1, the present invention includes the steps of:
Step 1) Obtaining the training sample set D_train and the test sample set D_test:
Step 1a) Preprocessing the selected V RGB image sequences with pedestrian detection frame labels and identity labels to obtain the set of preprocessed RGB image frame sequences S_v = {S_v^m | 1 ≤ m ≤ V}; the RGB image frames contained in I of the preprocessed sequences in S_v are taken as the training sample set D_train and the remaining K preprocessed sequences as the test sample set D_test, where S_v^m = {f^(n) | 1 ≤ n ≤ L_m} denotes the m-th preprocessed sequence containing L_m frames, f^(n) represents the n-th preprocessed RGB image frame, I > K, I + K = V, V > 20, and L_m > 200. In this example, the CrowdHuman, ETH, CityPerson, CalTech, CUHK-SYSU, PRW and MOT17train datasets, which cover rich scenes, are used as the training datasets to improve the generalization ability of the model, and the MOT17test dataset, which contains 7 image sequences in different scenes with an average sequence length of 845 frames, is used for testing so that the tracking accuracy is evaluated reasonably.
The selected V RGB image sequences with pedestrian detection frame labels and identity labels are preprocessed as follows:
(1a1) Resizing each RGB image frame in each RGB image sequence by bilinear interpolation so that all RGB image frames have size 608 × 1088, consistent with the network input size, to obtain the RGB image frame sequence set S_v'.
(1a2) Synchronously updating the pedestrian detection frame labels in the RGB image frame sequence set S_v' with the rescaled images, and uniformly re-encoding the pedestrian identity labels: the identity label of a data sample with missing identity information is set to -1, and each pedestrian with a distinct identity is assigned a code incremented sequentially starting from 1. This gives the RGB image frame sequence set S_v with resized image frames and updated detection frame and identity labels. Preprocessing the V RGB image sequences with pedestrian detection frame labels and identity labels in this way guarantees that pedestrian identities are consistent and that labels and images stay consistent during training and testing.
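A minimal sketch of this preprocessing, assuming OpenCV-style (x1, y1, x2, y2) boxes and per-frame identity lists; the function name and data layout are illustrative, not the patent's implementation:

```python
import cv2

def preprocess_sequence(frames, boxes, ids, dst_w=1088, dst_h=608):
    """Resize frames with bilinear interpolation, rescale detection boxes with
    the same factors, and re-encode identities (missing -> -1, others -> 1, 2, ...)."""
    out_frames, out_boxes = [], []
    for img, bb in zip(frames, boxes):
        h, w = img.shape[:2]
        sx, sy = dst_w / w, dst_h / h
        out_frames.append(cv2.resize(img, (dst_w, dst_h), interpolation=cv2.INTER_LINEAR))
        # boxes are (x1, y1, x2, y2); scale them with the image
        out_boxes.append([(x1 * sx, y1 * sy, x2 * sx, y2 * sy) for x1, y1, x2, y2 in bb])
    mapping, next_id, out_ids = {}, 1, []
    for seq_ids in ids:
        enc = []
        for pid in seq_ids:
            if pid is None:              # missing identity information
                enc.append(-1)
            else:
                if pid not in mapping:   # assign consecutive codes starting from 1
                    mapping[pid] = next_id
                    next_id += 1
                enc.append(mapping[pid])
        out_ids.append(enc)
    return out_frames, out_boxes, out_ids
```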
Step 2), constructing a detection and re-identification integrated network model O based on multivariate difference fusion:
(2a) constructing a structure of a detection and re-identification integrated network model O based on multi-element difference fusion:
structure for constructing detection and re-identification integrated network model O, including backbone network NetbackboneAnd NetbackboneCascaded parallel-arranged and same-structure first feature fusion sub-network AsAnd a second feature fusion sub-network AlAnd a convergence module, wherein the first feature converges the subnetwork AsThe output end of the network is connected with a parallelly arranged key point deviation prediction sub-network NetoffsetSmall target preference keypoint heatmap prediction subnetwork Nethm_sAnd bounding Box prediction subnetwork NetbboxSecond feature fusion subnetwork AlThe output end of the network is connected with a large target preference key point heat map prediction sub-network Net which is arranged in parallelhm_lAnd re-identifying feature extraction sub-network NetreidWherein:
backbone network NetbackboneAdopting a tree-shaped polymerization iterative network consisting of a plurality of two-dimensional convolution layers, a plurality of batch normalization layers, a plurality of two-dimensional pooling layers, a plurality of deformable convolution layers and a plurality of transposition convolution layers;
first feature fusion subnet AsAnd a second feature fusion sub-network AlEach comprising a plurality of spatial attention sub-networks NetsamAnd a channel attention subnetwork Netcam(ii) a Spatial attention subnetwork NetsamIncluding global planes arranged in parallelAn average pooling layer and a global maximum pooling layer, and a two-dimensional convolutional layer connected to the two pooling layers, a channel attention subnetwork NetcamThe system comprises a global average pooling layer and a global maximum pooling layer which are arranged in parallel, wherein the global average pooling layer and the global maximum pooling layer are respectively connected with two-dimensional convolutional layers in a cascade mode; netoffset、Nethm_s、Netbbox、Nethm_lAnd NetreidThe sub-networks all adopt a structure comprising a first convolutional layer, a rule active layer and a second convolutional layer which are sequentially cascaded, and NetreidThe output end is cascaded with a full connection layer, Nethm_sAnd Nethm_lThe output end is cascaded with the fusion module;
The specific configuration of the detection and re-identification integrated network model O based on multivariate difference fusion is as follows:
The backbone network Net_backbone contains 27 two-dimensional convolution layers, 37 batch normalization layers, 6 two-dimensional pooling layers, 4 deformable convolution layers and 2 transposed convolution layers. This network extracts the basic features used as the input of the feature fusion sub-networks; the DLA34 backbone network used in the FairMOT algorithm is adopted, and other backbone networks such as ResNet can be used instead.
The first feature fusion sub-network A_s and the second feature fusion sub-network A_l each contain three structurally identical spatial attention sub-networks Net_sam; the two-dimensional convolution layer contained in Net_sam has a 3x3 convolution kernel, stride 1 and output dimension 1. The channel attention sub-network Net_cam contains 4 two-dimensional convolution layers, each with a 1x1 convolution kernel and stride 1. These fusion sub-networks replace the equal-ratio feature fusion sub-network in FairMOT and help provide more suitable feature maps for the subsequent tasks; because their structural parameters are influenced by the subsequent multiple tasks, they also provide a network structure difference for training the two subsequent key point heat map prediction sub-networks.
The first and second convolution layers contained in the Net_offset, Net_hm_s, Net_bbox, Net_hm_l and Net_reid sub-networks have 3x3 convolution kernels with stride 1. In addition, the output channels of the first convolution layer of each of these sub-networks are all set to 256, and the output channels of their second convolution layers are 2, 1, 4, 1 and 128 respectively. The offset prediction result, bounding box prediction result and heat map prediction result can be decoded to obtain the detection result, and the appearance similarity of pedestrians can be measured by taking the re-identification feature vectors from the network and computing their cosine similarity.
The fully connected layer cascaded to the output of Net_reid is used only to assist classification during training; its output dimension is the number of pedestrian identities in the training dataset, and it is discarded during testing.
(2b) Defining the loss function L_heatmap of the key point heat map prediction task:
where N represents the number of key points in the predicted key point heat map, alpha and beta represent hyper-parameters taken here as 2 and 4 respectively, Ŷ_xy and Y_xy respectively represent the label and the response value of the key point at coordinate (x, y) in the predicted key point heat map, Σ represents the summation operation, and log represents the logarithm operation;
Step 3) Iteratively training the detection and re-identification integrated network model O based on multivariate difference fusion:
(3a) Initializing the weight parameters θ_J of the detection and re-identification integrated network model O, the iteration counter t and the maximum number of iterations T, T ≥ 50000, and letting t = 0;
(3b) Performing random data enhancement on bs ∈ [16, 64] training samples randomly selected from the training sample set D_train, and updating the detection frame information of each training sample according to the enhancement applied, to obtain bs data-enhanced training samples with updated detection frame information. Pedestrian targets whose updated detection frame height divided by the image frame height is greater than a threshold th_ratio are taken as large targets, and those whose ratio is less than th_ratio as small targets. Finally, according to the updated detection frame information, the updated identity information and the large/small target division result, the small-target-preference key point heat map label label_hm_s, the large-target-preference key point heat map label label_hm_l, the difference fusion key point heat map label label_hm, the bounding box label label_bbox, the key point offset label label_offset and the re-identification identity label label_id are determined.
The concrete implementation steps are as follows:
(3b1) Each training sample is rotated by a random angle θ ∈ [-5°, 5°]; each rotated training sample is then randomly rescaled with a scale coefficient s ∈ [0.9, 1.1]; finally a random image brightness change with coefficient r ∈ [-0.2, 0.2] is applied to each rescaled training sample, giving bs training samples after random data enhancement;
(3b2) The detection frame labels are updated synchronously according to the values of θ and s, giving bs data-enhanced training samples with updated detection frame information.
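A sketch of the random enhancement in (3b1)-(3b2), under the assumption that rotation and scaling are applied as a single affine transform about the image center and that boxes are updated by transforming their corners (details the patent text leaves unspecified):

```python
import random
import numpy as np
import cv2

def random_augment(img, boxes):
    """Random rotation, scaling and brightness change in the stated ranges,
    with the detection boxes updated by the same affine transform."""
    theta = random.uniform(-5.0, 5.0)          # rotation angle in degrees
    s = random.uniform(0.9, 1.1)               # scale coefficient
    r = random.uniform(-0.2, 0.2)              # brightness coefficient
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), theta, s)   # 2x3 affine matrix
    aug = cv2.warpAffine(img, M, (w, h))
    aug = np.clip(aug.astype(np.float32) * (1.0 + r), 0, 255).astype(np.uint8)
    new_boxes = []
    for x1, y1, x2, y2 in boxes:
        # transform the four box corners and take their axis-aligned extent
        pts = np.array([[x1, y1, 1], [x2, y1, 1], [x1, y2, 1], [x2, y2, 1]], dtype=np.float32)
        tp = pts @ M.T
        new_boxes.append((tp[:, 0].min(), tp[:, 1].min(), tp[:, 0].max(), tp[:, 1].max()))
    return aug, new_boxes
```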
(3b3) Determining the training labels of the large- and small-target key point heat map prediction sub-networks:
The large and small targets are divided using the ratio of the pedestrian target frame height h_i to the input image height H as the criterion: divide = HmL if h_i / H > th_ratio, and divide = HmS otherwise, where divide denotes the division result. If a target is divided into HmL, it serves as a supervised training sample for the prediction result of the large-target-preference key point heat map prediction sub-network and is ignored otherwise; if it is divided into HmS, it serves as a supervised training sample for the prediction result of the small-target-preference key point heat map prediction sub-network and is ignored otherwise; for the predicted heat map Hm produced by the fusion module, all target samples are supervised training samples. Through this process the target-sample division result for each key point heat map is obtained.
The key point heat map training labels are then generated. For each pedestrian detection frame label (x1_i, y1_i, x2_i, y2_i) in an RGB image, the center point c_i = ((x1_i + x2_i)/2, (y1_i + y2_i)/2) of the detection frame is computed and treated as the target key point; the quantized key point is c̃_i = ⌊c_i / R⌋, where ⌊·⌋ denotes rounding down and R is the down-sampling rate, and the key point training label value is a Gaussian centered at c̃_i. The key point heat map label at coordinate (x, y) is exp(-((x - c̃x_i)² + (y - c̃y_i)²) / (2σ_c²)), where x and y are the coordinate indices on the key point heat map and σ_c is a target-size-adaptive standard deviation. Using the training-sample division results and this key point heat map label function, the key point heat map labels corresponding to the Hm_S, Hm_L and Hm heat maps are computed, giving label_hm_s, label_hm_l and label_hm respectively.
(3b4) Determining the training labels of the re-identification feature extraction sub-network Net_reid: assuming a target has identity label ID_i and the smallest identity label value in the training set is ID_x, the label value of Net_reid for that target is computed as ID_i - ID_x; the set of identity labels of all targets in a predicted image forms label_id.
(3b5) Determining the training labels of the key point offset prediction sub-network Net_offset: assuming the center point coordinates of a target are p = (cx, cy), its quantized coordinates are p̃ = ⌊p / R⌋, where ⌊·⌋ denotes the rounding-down operation and R is the down-sampling step size; the training label value of Net_offset for that target is p / R - p̃, and the set of key point offset prediction labels of all targets in a predicted image forms label_offset.
(3b6) Determining the training labels of the bounding box prediction sub-network Net_bbox: assuming the top-left and bottom-right corner coordinates of a target frame are (x1, y1) and (x2, y2), the training label of Net_bbox for that target is the vector of distances from the key point to the four sides of the frame, computed as (cx - x1, cy - y1, x2 - cx, y2 - cy); the set of bounding box prediction labels of all targets in a predicted image forms label_bbox.
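The label construction in (3b3)-(3b6) can be sketched as follows; the down-sampling rate R, the threshold th_ratio and the rule for the size-adaptive standard deviation are illustrative assumptions rather than values fixed by the text:

```python
import numpy as np

def build_labels(boxes, out_h, out_w, R=4, th_ratio=0.3, img_h=608):
    """Heat map, offset and box labels for one image; boxes are (x1, y1, x2, y2)."""
    hm_s = np.zeros((out_h, out_w), np.float32)   # small-target-preference heat map label
    hm_l = np.zeros((out_h, out_w), np.float32)   # large-target-preference heat map label
    hm   = np.zeros((out_h, out_w), np.float32)   # fused heat map label
    offsets, ltrb = [], []
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        qx, qy = int(cx / R), int(cy / R)                 # quantized key point
        offsets.append((cx / R - qx, cy / R - qy))        # key point offset label
        ltrb.append((cx - x1, cy - y1, x2 - cx, y2 - cy)) # distances to the four sides
        sigma = max(y2 - y1, x2 - x1) / (6.0 * R)         # size-adaptive std (assumption)
        ys, xs = np.ogrid[:out_h, :out_w]
        g = np.exp(-((xs - qx) ** 2 + (ys - qy) ** 2) / (2.0 * sigma ** 2 + 1e-6))
        hm = np.maximum(hm, g)                            # all targets supervise Hm
        if (y2 - y1) / img_h > th_ratio:
            hm_l = np.maximum(hm_l, g)                    # large targets supervise Hm_L
        else:
            hm_s = np.maximum(hm_s, g)                    # small targets supervise Hm_S
    return hm_s, hm_l, hm, np.array(offsets), np.array(ltrb)
```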
(3c) Taking the bs data-enhanced training samples as the input of the detection and re-identification integrated network model O, the backbone network Net_backbone extracts features from each training sample to obtain three feature maps Feat_1, Feat_2 and Feat_3 of the training sample at different scales;
(3d) The first feature fusion sub-network A_s adaptively fuses Feat_1, Feat_2 and Feat_3 to obtain the feature map Feat_s; the key point offset prediction sub-network Net_offset, the small-target-preference key point heat map prediction sub-network Net_hm_s and the bounding box prediction sub-network Net_bbox each take Feat_s as input and perform forward inference to obtain the key point offset prediction vector Vec_offset corresponding to Net_offset, the small-target-preference key point heat map prediction Hm_S corresponding to Net_hm_s, and the vector Vec_dis_bbox of distances from the key point to the top, bottom, left and right sides of the target frame corresponding to Net_bbox. At the same time, the second feature fusion sub-network A_l adaptively fuses Feat_1, Feat_2 and Feat_3 to obtain the feature map Feat_l; the large-target-preference key point heat map prediction sub-network Net_hm_l and the re-identification feature extraction sub-network Net_reid each take Feat_l as input and perform forward inference to obtain the large-target-preference key point heat map prediction Hm_L corresponding to Net_hm_l and the re-identification feature vector Vec_reid corresponding to Net_reid; the fully connected layer classifies Vec_reid to obtain the pedestrian identity classification result. The fusion module fuses the Hm_S and Hm_L key point heat maps to obtain the fused key point heat map Hm;
The first feature fusion sub-network A_s adaptively fuses Feat_1, Feat_2 and Feat_3 as follows:
(3c1) The feature fusion sub-network A_s contains three spatial attention sub-networks. The first spatial attention sub-network takes the backbone output feature map Feat_1 as its input; the processing order is: Feat_1 is fed into the spatial attention sub-network, Feat_1 is multiplied by the output of the spatial attention sub-network to obtain Feat_1', and Feat_1 is added to Feat_1' to obtain the feature map Feat_1''. The other two spatial attention sub-networks take Feat_2 and Feat_3 as input respectively, and the same procedure yields the feature maps Feat_2'' and Feat_3''. These three feature maps are then used as the input of the channel attention sub-network Net_cam;
(3c2) Feat_2'' and Feat_3'' are up-sampled by factors of 2 and 4 respectively with transposed convolutions to obtain the feature maps Feat_2''' and Feat_3''' whose sizes match Feat_1''. Feat_1'', Feat_2''' and Feat_3''' are concatenated to obtain Feat_sam; Feat_sam is fed into the channel attention sub-network Net_cam and multiplied by the output of Net_cam to obtain Feat_cam; finally Feat_sam is added to Feat_cam to obtain the feature map Feat_s.
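A PyTorch-style sketch of the attention-based fusion in (3c1)-(3c2). It assumes the "global" pooling inside Net_sam is taken along the channel axis (CBAM-style) and that channel counts and the reduction ratio are free choices; none of these details are fixed by the text above:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of Net_sam: parallel average/max pooling over channels, then one
    3x3 conv (stride 1, 1 output channel) producing a spatial attention map."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=3, stride=1, padding=1)
    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class ChannelAttention(nn.Module):
    """Sketch of Net_cam: parallel global average/max pooling, each followed by
    two cascaded 1x1 convs (4 conv layers in total)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.avg_fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        self.max_fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
    def forward(self, x):
        a = self.avg_fc(nn.functional.adaptive_avg_pool2d(x, 1))
        m = self.max_fc(nn.functional.adaptive_max_pool2d(x, 1))
        return torch.sigmoid(a + m)

def fuse(feats, sams, upsamples, cam):
    """Adaptive fusion following (3c1)-(3c2): residual spatial attention per scale,
    transposed-convolution upsampling, concatenation, then residual channel attention."""
    refined = [f + f * sam(f) for f, sam in zip(feats, sams)]   # (3c1)
    aligned = [up(r) for r, up in zip(refined, upsamples)]      # (3c2) align to Feat_1'' size
    feat_sam = torch.cat(aligned, dim=1)
    return feat_sam + feat_sam * cam(feat_sam)                  # Feat_s
```

Here `upsamples` would hold an identity module for Feat_1'' and transposed-convolution layers with strides 2 and 4 for Feat_2'' and Feat_3'', and `cam` is built with the channel count of the concatenated map.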
(3e) Using the L1 loss function, the loss value L_off of the key point offset prediction is calculated from the key point offset prediction and its label label_offset, and the loss value L_bbox of the bounding box prediction is calculated from the bounding box prediction and its label label_bbox; using the cross-entropy loss function, the loss value L_reid of the re-identification feature extraction result is calculated from the pedestrian identity classification result and its label label_id; using the key point heat map prediction loss function L_heatmap, Hm_S, Hm_L and Hm together with their corresponding labels label_hm_s, label_hm_l and label_hm are input to calculate the loss values L_hm_s, L_hm_l and L_hm respectively; finally, L_off, L_bbox, L_reid, L_hm_s, L_hm_l and L_hm are combined by adaptive weighted summation to obtain the loss value L_total of the detection and re-identification integrated network model O.
The loss value L_total of the detection and re-identification integrated network model O is computed from the detection loss
L_det = a × (0.6 × L_hm + 0.15 × L_hm_l + 0.25 × L_hm_s) + b × L_off + c × L_bbox,
where the parameters a, b and c are constant coefficients, here a = 1, b = 1 and c = 0.1, and from the re-identification loss L_reid through an adaptive weighting in which w1 and w2 are learnable parameters.
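The formula combining L_det and the re-identification loss into L_total is not reproduced above; a plausible form, assuming the uncertainty-based learnable weighting used by FairMOT (only the learnable parameters w1 and w2 are named in the text), is:

$$L_{total} = \frac{1}{2}\left(e^{-w_1} L_{det} + e^{-w_2} L_{reid} + w_1 + w_2\right)$$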
(3f) Using the back-propagation method, the gradients of the weight parameters of the detection and re-identification integrated network model O are calculated from the loss value L_total, and the weight parameters θ_J are then updated by a gradient descent algorithm using these gradients.
The weight parameters θ_J are updated from the gradients of O according to θ_J* = θ_J - α_J · ∂L_total/∂θ_J, where θ_J* denotes the updated network parameters, θ_J denotes the network parameters before the update, α_J denotes the step size, and ∂L_total/∂θ_J denotes the gradient of the network parameters of O.
(3g) Judging whether t > T; if so, the trained detection and re-identification integrated network model O' is obtained; otherwise, let t = t + 1 and return to step (3b);
Step 4) Acquiring the pedestrian multi-target tracking result:
(4a) Initializing the test sample set D_test: the k-th test sample comprises P RGB image frames, the p-th RGB image frame being f^(p), P > 200; let k = 1 and initialize the historical trajectory set Tra^(k) = {};
(4b) Let p = 1;
(4c) Taking the p-th RGB image frame f^(p) of the k-th test sample as the input of the trained detection and re-identification integrated network model O' and propagating it forward to obtain the key point offset prediction Vec_offset of f^(p), the distances Vec_dis_bbox from the key point to the top, bottom, left and right sides of the target frame, the key point heat map prediction Hm and the re-identification feature vectors Vec_reid; Vec_offset, Vec_dis_bbox and Hm are then decoded to obtain the pedestrian detection frame set Det = {det_i | 0 ≤ i ≤ DN-1} of f^(p), where det_i is the detection frame of the i-th pedestrian and DN denotes the number of pedestrians detected in f^(p);
(4d) Screening out the pedestrian targets Object = {object_i | conf_i > th_conf, 0 ≤ i ≤ DN-1} of f^(p) whose detected key point response value conf_i is greater than the response threshold th_conf, and obtaining the detection frames and re-identification feature vector information corresponding to these pedestrian targets from the Det set and the Vec_reid vectors;
The detection frames and re-identification feature vector information corresponding to the pedestrian targets are obtained from the Det set and the Vec_reid vectors as follows:
(4d1) The detection frame det_i is obtained by indexing the Det set with the subscript of the target object_i, and the re-identification feature vector embed_i is obtained by querying the Vec_reid vectors with the center-point coordinates of the detection frame det_i as the index;
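A minimal decoding sketch for steps (4c)-(4d1); the 3×3 max-pooling peak extraction, the top-k limit and the coordinate scaling are assumptions, since the text only states that Hm, Vec_offset and Vec_dis_bbox are decoded into detection frames:

```python
import torch
import torch.nn.functional as F

def decode_detections(hm, offset, ltrb, R=4, th_conf=0.4, topk=128):
    """hm: (1, H, W) fused key point heat map; offset: (2, H, W); ltrb: (4, H, W)
    distances to the left/top/right/bottom sides. Returns boxes in input-image
    coordinates together with their response values and heat map indices."""
    peaks = (hm == F.max_pool2d(hm[None], 3, stride=1, padding=1)[0]).float() * hm
    scores, inds = peaks.view(-1).topk(topk)
    keep = scores > th_conf
    scores, inds = scores[keep], inds[keep]
    ys, xs = inds // hm.shape[-1], inds % hm.shape[-1]
    cx = (xs.float() + offset[0].view(-1)[inds]) * R     # refined center x
    cy = (ys.float() + offset[1].view(-1)[inds]) * R     # refined center y
    l, t, r, b = (ltrb[i].view(-1)[inds] for i in range(4))
    boxes = torch.stack([cx - l, cy - t, cx + r, cy + b], dim=1)
    return boxes, scores, inds
```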
(4e) According to the detection frames and re-identification feature vector information of the screened pedestrian targets, data association between the screened pedestrian target set Object and the historical trajectory set Tra^(k) is performed with an online association method to obtain the pedestrian multi-target tracking result of f^(p).
The association uses the same online association method as the FairMOT algorithm, specifically:
(4e1) Defining the attributes of a trajectory: the ordered set of detection frames of one pedestrian in the tracking scene is called a trajectory Tra_i, and each trajectory has the following attributes: the current trajectory target frame information, i.e. the top-left and bottom-right corner coordinates of the bounding frame; the trajectory state; the re-identification feature vector of the trajectory; the life-span length of the trajectory; the number of consecutive unmatched frames; and the motion information. The trajectory state takes one of three values: active, lost and inactive. An active trajectory is one that was matched to a detection frame in the previous frame; a lost trajectory is one that was not matched to a detection frame in the previous frame but whose number of consecutive unmatched frames does not exceed the life-span length; a trajectory whose number of consecutive unmatched frames exceeds the life-span length is an inactive trajectory. The re-identification feature vector of the trajectory represents the appearance of the trajectory target; during association matching, the cosine similarity between the trajectory and detection vectors is computed to judge the likelihood that they belong to the same track. The life-span length is the threshold on the maximum number of consecutive unmatched frames, beyond which the trajectory is set to the inactive state. The motion information is acquired and processed with a Kalman filtering algorithm, which estimates, over all positions of the trajectory, the horizontal and vertical coordinates of the target center, the aspect ratio and height of the current target frame, and the velocities of these four states, and updates the Kalman filter parameters according to the final matching result.
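An illustrative container for the trajectory attributes listed in (4e1); the field names and the default life span are assumptions, not the patent's notation:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Track:
    box: tuple                      # (x1, y1, x2, y2) of the latest target frame
    embed: np.ndarray               # re-identification feature vector of the trajectory
    state: str = "active"           # "active", "lost" or "inactive"
    life_span: int = 30             # max consecutive unmatched frames (assumed value)
    misses: int = 0                 # consecutive unmatched frame counter
    kalman_mean: np.ndarray = None  # (cx, cy, aspect, h) and their velocities
    kalman_cov: np.ndarray = None

    def mark_missed(self):
        # called when the track is not matched in the current frame
        self.misses += 1
        self.state = "inactive" if self.misses > self.life_span else "lost"
```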
(4e2) For all active-state and lost-state trajectories, the coordinates of the trajectory frame are first estimated with the Kalman filtering algorithm, and the Mahalanobis distance matrix Matrix_DisMotion between the trajectory frames and all detection target frames of the current frame is then computed; entries of the matrix greater than the threshold th_md are set to an infinite value and the remaining entries are left unchanged, giving the final motion-prediction distance matrix Matrix'_DisMotion. At the same time, the cosine-similarity distance matrix Matrix_DisEmbed between the re-identification feature vectors of the trajectories and the detections is computed, and the two are fused according to
Matrix_Dis = λ · Matrix_DisEmbed + (1 - λ) · Matrix'_DisMotion,
giving the final distance matrix Matrix_Dis; the Hungarian algorithm performs optimal matching on this distance matrix and the trajectory states are updated;
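A sketch of the fused-distance association in (4e2), using scipy's Hungarian solver; lam, th_md and the acceptance threshold are illustrative values, not taken from the patent:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(dist_embed, dist_motion, lam=0.98, th_md=9.4877, th_match=0.4):
    """dist_embed, dist_motion: (num_tracks, num_dets) distance matrices."""
    motion = dist_motion.copy()
    motion[motion > th_md] = 1e18                     # gate out implausible motion
    cost = lam * dist_embed + (1.0 - lam) * motion    # Matrix_Dis
    rows, cols = linear_sum_assignment(cost)          # Hungarian algorithm
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < th_match]
    matched_tracks = {r for r, _ in matches}
    matched_dets = {c for _, c in matches}
    unmatched_tracks = [r for r in range(cost.shape[0]) if r not in matched_tracks]
    unmatched_dets = [c for c in range(cost.shape[1]) if c not in matched_dets]
    return matches, unmatched_tracks, unmatched_dets
```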
(4e3) For the active trajectories and detection frames left unmatched in the previous step, the overlap-ratio matrix between them is computed; each such trajectory is matched to the detection with which it has the maximum overlap ratio, provided that value is greater than the threshold th_iou, and the trajectory state is updated;
(4e4) For trajectories that are still unmatched: if a trajectory is in the active state, its state is changed to lost and its count of lost frames is started; if a trajectory is already in the lost state, its lost count is incremented by one, and if the number of lost frames becomes greater than or equal to the trajectory's life span the trajectory is set to the inactive state. Detections that remain unmatched are initialized as new trajectories;
(4f) Judging whether p ≥ P; if so, the pedestrian multi-target tracking result of the k-th test sample is obtained; otherwise, let p = p + 1, update the historical trajectory set Tra^(k), and return to step (4c);
(4g) Judging whether k ≥ K; if so, the pedestrian multi-target tracking results of the test sample set D_test are obtained; otherwise, let k = k + 1 and return to step (4b).
The effect of the present invention is further explained by combining the simulation experiment as follows:
1. simulation conditions and contents:
the hardware platform of the simulation experiment is as follows: the graphics card is configured as Nvidia RTX2080Ti × 2, the processor is configured as xeon (r) E5-2620 v4@2.10Ghz × 32, and the memory is configured as 64 GB.
The software platform of the simulation experiment is as follows: the operating system is Ubuntu16.04LTS, the Python version is 3.7, the Pythroch version is 1.2.0, and the OpenCV version is 3.4.0.
The integrated network provided by the invention is characterized in that a training image sequence data set used in a simulation experiment is a crowdHuman, ETH, CityPerson, CalTech, CUHK-SYSU, PRW and MOT17train data set, the integrated network is pre-trained for 60 generations on the crowdHuman data set, and then is trained for 30 generations on the other data sets to obtain test model parameters; and the test image sequence data set is an MOT17test data set, wherein the test image sequence data set comprises image sequences under 7 different tracking scenes, and comprises 5919 frame images and 785 pedestrian tracks.
The tracking accuracy of the multi-target tracking method disclosed in the present invention and of the method in the paper "FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking" published by Yifu Zhang et al. in 2020 were compared in simulation; the results are shown in Table 1.
2. Simulation result analysis:
To evaluate tracking accuracy, the following evaluation index, the tracking accuracy MOTA, is used to calculate the accuracy of the tracking results of the present invention and of the prior art respectively, and the results are listed in Table 1:
table 1.
Wherein, FN is the number of false negative targets, FP is the number of false positive targets, IDSW is the number of identity switching times, and GT is the number of targets in the truth label.
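The MOTA formula referenced above is the standard multiple-object-tracking-accuracy definition consistent with these variables:

$$MOTA = 1 - \frac{FN + FP + IDSW}{GT}$$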
As can be seen from Table 1, the tracking accuracy MOTA of the invention is 0.8 higher than that of the prior art.
The above simulation experiment shows that, when constructing the detection and re-identification integrated network model based on multivariate difference fusion, the invention introduces differences in training data, training mode and network structure so that the two key point heat map prediction sub-networks form prediction preferences for targets of different sizes; the prediction results of the two sub-networks are added and fused to obtain the multivariate difference fusion key point heat map, which solves the problem of low detection recall caused by using only a single key point heat map prediction sub-network in the prior art and improves the tracking accuracy of the algorithm.
Claims (6)
1. A pedestrian multi-target tracking method based on multivariate difference fusion is characterized by comprising the following steps:
(1) Obtaining a training sample set D_train and a test sample set D_test:
(1a) Preprocessing the selected V RGB image sequences with pedestrian detection frame labels and identity labels to obtain a set of preprocessed RGB image frame sequences S_v = {S_v^m | 1 ≤ m ≤ V}; the RGB image frames contained in I of the preprocessed sequences in S_v are taken as the training sample set D_train and the remaining K preprocessed sequences as the test sample set D_test, where S_v^m = {f^(n) | 1 ≤ n ≤ L_m} denotes the m-th preprocessed RGB image frame sequence containing L_m frames, f^(n) represents the n-th preprocessed RGB image frame, I > K, I + K = V, V > 20, and L_m > 200;
(2) Constructing the detection and re-identification integrated network model O based on multivariate difference fusion:
(2a) Constructing the structure of the detection and re-identification integrated network model O based on multivariate difference fusion:
The model O comprises a backbone network Net_backbone; a first feature fusion sub-network A_s and a second feature fusion sub-network A_l with identical structure, arranged in parallel and cascaded to Net_backbone; and a fusion module. The output of the first feature fusion sub-network A_s is connected to a key point offset prediction sub-network Net_offset, a small-target-preference key point heat map prediction sub-network Net_hm_s and a bounding box prediction sub-network Net_bbox arranged in parallel; the output of the second feature fusion sub-network A_l is connected to a large-target-preference key point heat map prediction sub-network Net_hm_l and a re-identification feature extraction sub-network Net_reid arranged in parallel, where:
The backbone network Net_backbone adopts a tree-structured aggregation iterative network composed of several two-dimensional convolution layers, batch normalization layers, two-dimensional pooling layers, deformable convolution layers and transposed convolution layers;
The first feature fusion sub-network A_s and the second feature fusion sub-network A_l each comprise several spatial attention sub-networks Net_sam and a channel attention sub-network Net_cam. The spatial attention sub-network Net_sam comprises a global average pooling layer and a global maximum pooling layer arranged in parallel, followed by a two-dimensional convolution layer connected to both pooling layers; the channel attention sub-network Net_cam comprises a global average pooling layer and a global maximum pooling layer arranged in parallel, each connected to two cascaded two-dimensional convolution layers. The Net_offset, Net_hm_s, Net_bbox, Net_hm_l and Net_reid sub-networks all adopt a structure consisting of a first convolution layer, a ReLU activation layer and a second convolution layer cascaded in sequence; a fully connected layer is cascaded to the output of Net_reid, and the fusion module is cascaded to the outputs of Net_hm_s and Net_hm_l;
(2b) Defining the loss function L_heatmap of the key point heat map prediction task:
where N represents the number of key points in the predicted key point heat map, alpha and beta represent hyper-parameters, Ŷ_xy and Y_xy respectively represent the label and the response value of the key point at coordinate (x, y) in the predicted key point heat map, Σ represents the summation operation, and log represents the logarithm operation;
(3) Iteratively training the detection and re-identification integrated network model O based on multivariate difference fusion:
(3a) Initializing the weight parameters θ_J of the detection and re-identification integrated network model O, the iteration counter t and the maximum number of iterations T, T ≥ 50000, and letting t = 0;
(3b) Performing random data enhancement on bs ∈ [16, 64] training samples randomly selected from the training sample set D_train, and updating the detection frame information of each training sample according to the enhancement applied, to obtain bs data-enhanced training samples with updated detection frame information. Pedestrian targets whose updated detection frame height divided by the image frame height is greater than a threshold th_ratio are taken as large targets, and those whose ratio is less than th_ratio as small targets. Finally, according to the updated detection frame information, the updated identity information and the large/small target division result, the small-target-preference key point heat map label label_hm_s, the large-target-preference key point heat map label label_hm_l, the difference fusion key point heat map label label_hm, the bounding box label label_bbox, the key point offset label label_offset and the re-identification identity label label_id are determined;
(3c) Taking the bs data-enhanced training samples as the input of the detection and re-identification integrated network model O, the backbone network Net_backbone extracts features from each training sample to obtain three feature maps Feat_1, Feat_2 and Feat_3 of the training sample at different scales;
(3d) The first feature fusion sub-network A_s adaptively fuses Feat_1, Feat_2 and Feat_3 to obtain the feature map Feat_s; the key point offset prediction sub-network Net_offset, the small-target-preference key point heat map prediction sub-network Net_hm_s and the bounding box prediction sub-network Net_bbox each take Feat_s as input and perform forward inference to obtain the key point offset prediction vector Vec_offset corresponding to Net_offset, the small-target-preference key point heat map prediction Hm_S corresponding to Net_hm_s, and the vector Vec_dis_bbox of distances from the key point to the top, bottom, left and right sides of the target frame corresponding to Net_bbox. At the same time, the second feature fusion sub-network A_l adaptively fuses Feat_1, Feat_2 and Feat_3 to obtain the feature map Feat_l; the large-target-preference key point heat map prediction sub-network Net_hm_l and the re-identification feature extraction sub-network Net_reid each take Feat_l as input and perform forward inference to obtain the large-target-preference key point heat map prediction Hm_L corresponding to Net_hm_l and the re-identification feature vector Vec_reid corresponding to Net_reid; the fully connected layer classifies Vec_reid to obtain the pedestrian identity classification result. The fusion module fuses the Hm_S and Hm_L key point heat maps to obtain the fused key point heat map Hm;
(3e) Using the L1 loss function, the loss value L_off of the key point offset prediction is calculated from the key point offset prediction and its label label_offset, and the loss value L_bbox of the bounding box prediction is calculated from the bounding box prediction and its label label_bbox; using the cross-entropy loss function, the loss value L_reid of the re-identification feature extraction result is calculated from the pedestrian identity classification result and its label label_id; using the key point heat map prediction loss function L_heatmap, Hm_S, Hm_L and Hm together with their corresponding labels label_hm_s, label_hm_l and label_hm are input to calculate the loss values L_hm_s, L_hm_l and L_hm respectively; finally, L_off, L_bbox, L_reid, L_hm_s, L_hm_l and L_hm are combined by adaptive weighted summation to obtain the loss value L_total of the detection and re-identification integrated network model O;
(3f) Using the back-propagation method, the gradients of the weight parameters of the detection and re-identification integrated network model O are calculated from the loss value L_total, and the weight parameters θ_J are then updated by a gradient descent algorithm using these gradients;
(3g) Judging whether t > T; if so, the trained detection and re-identification integrated network model O' is obtained; otherwise, let t = t + 1 and return to step (3b);
(4) acquiring a multi-target tracking result of the pedestrian:
(4a) initializing test sample set DtestThe kth test specimen isComprises P RGB image frames, the P-th RGB image frame is f(p)P is more than 200, k is equal to 1, and the historical track set Tra is initialized(k)={};
(4b) Let p be 1;
(4c) taking the p-th RGB image frame f^(p) of the k-th test sample as the input of the trained detection and re-identification integrated network model O' and carrying out forward propagation to obtain the key point offset prediction value Vec_offset of f^(p), the distance values Vec_dis_bbox from each key point to the upper, lower, left and right sides of the target frame, the key point heat map prediction result Hm and the re-identification feature vector Vec_reid; and decoding Vec_offset, Vec_dis_bbox and Hm to obtain the pedestrian detection frame set Det = {det_i | 0 ≤ i ≤ DN−1} of f^(p), where det_i is the detection frame of the i-th pedestrian and DN denotes the number of pedestrians detected in f^(p);
(4d) screening out from f^(p) the pedestrian targets whose detected key point response values conf_i are greater than a response threshold th_conf, giving Object = {object_i | conf_i > th_conf, 0 ≤ i ≤ DN−1}, and obtaining the detection frame and re-identification feature vector information corresponding to each screened pedestrian target from Det and Vec_reid;
(4e) according to the detection frame and re-identification feature vector information of the screened pedestrian targets, carrying out data association between the screened pedestrian target set Object and the historical track set Tra^(k) using an online association method, to obtain the pedestrian multi-target tracking result of f^(p);
(4f) judging whether p ≥ P; if so, the pedestrian multi-target tracking result of the k-th test sample is obtained; otherwise, setting p = p + 1, updating the historical track set Tra^(k), and performing step (4c);
(4g) judging whether k ≥ K; if so, the pedestrian multi-target tracking results of the test sample set D_test are obtained; otherwise, setting k = k + 1 and performing step (4b).
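Steps (4c)-(4e) describe decoding detections from the network outputs and associating them with the historical track set. The sketch below illustrates one plausible realisation; the 3x3 max-pool peak picking, the threshold value and the greedy cosine-similarity matching are assumptions for illustration, not the claimed online association method.

```python
# Illustrative decoding and association sketch for steps (4c)-(4e).
import torch
import torch.nn.functional as F

def decode_detections(hm, vec_offset, vec_dis_bbox, th_conf=0.4, stride=2):
    """hm: (1,1,h,w) key point scores in [0,1]; vec_offset: (1,2,h,w); vec_dis_bbox: (1,4,h,w)."""
    peaks = (hm == F.max_pool2d(hm, 3, stride=1, padding=1)) & (hm > th_conf)
    ys, xs = torch.nonzero(peaks[0, 0], as_tuple=True)
    dets = []
    for y, x in zip(ys.tolist(), xs.tolist()):
        ox, oy = vec_offset[0, :, y, x].tolist()
        l, t, r, b = vec_dis_bbox[0, :, y, x].tolist()
        cx, cy = (x + ox) * stride, (y + oy) * stride
        dets.append({"conf": hm[0, 0, y, x].item(),
                     "box": (cx - l, cy - t, cx + r, cy + b),
                     "center": (y, x)})
    return dets

def associate(dets, det_feats, tracks, sim_th=0.5):
    """Greedy cosine-similarity matching of detections to historical tracks."""
    assigned = []
    for det, feat in zip(dets, det_feats):
        best_id, best_sim = None, sim_th
        for tid, tfeat in tracks.items():
            sim = F.cosine_similarity(feat, tfeat, dim=0).item()
            if sim > best_sim:
                best_id, best_sim = tid, sim
        if best_id is None:
            best_id = max(tracks, default=-1) + 1   # start a new track
        tracks[best_id] = feat                      # update the track's re-id feature
        assigned.append((best_id, det["box"]))
    return assigned, tracks

# Usage with random stand-in network outputs for one frame f^(p):
h, w = 64, 64
hm = torch.rand(1, 1, h, w)
dets = decode_detections(hm, torch.rand(1, 2, h, w), torch.rand(1, 4, h, w))
reid_map = torch.rand(1, 64, h, w)
feats = [reid_map[0, :, d["center"][0], d["center"][1]] for d in dets]
results, tracks = associate(dets, feats, tracks={})
```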
2. The pedestrian multi-target tracking method based on multivariate difference fusion according to claim 1, wherein the preprocessing of the selected V RGB image sequences with pedestrian detection frame labels and identity labels in step (1a) is implemented by the following steps:
(1a1) resizing each RGB image frame in each RGB image sequence by bilinear interpolation to obtain an RGB image frame sequence set S_v′ in which all RGB image frames have the same size;
(1a2) in the RGB image frame sequence set S_v′, updating the pedestrian detection frame labels synchronously with the rescaled images, and uniformly encoding the pedestrian identity labels, namely setting the identity label to −1 for data samples lacking identity information and incrementing the code sequentially from 1 for each distinct pedestrian identity, thereby obtaining the RGB image frame sequence set with resized frames and updated detection frame and identity labels, and completing the preprocessing of the V RGB image sequences with pedestrian detection frame labels and identity labels.
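A minimal sketch of the preprocessing in steps (1a1)-(1a2), assuming a tensor-based pipeline, an arbitrary target size and a simple per-frame label format (none of which are specified in the claim):

```python
# Sketch of steps (1a1)-(1a2): bilinear resizing, synchronous box-label scaling,
# and uniform identity encoding (-1 for unknown identities, codes counted from 1).
import torch
import torch.nn.functional as F

def preprocess_sequence(frames, boxes, ids, target_hw=(608, 1088)):
    """frames: list of (3,H,W) tensors; boxes: list of [x1,y1,x2,y2] lists per frame;
    ids: list of identity strings per frame ('' where identity is unknown)."""
    th, tw = target_hw
    id_codes, next_code = {}, 1
    out_frames, out_boxes, out_ids = [], [], []
    for img, frame_boxes, frame_ids in zip(frames, boxes, ids):
        _, h, w = img.shape
        sy, sx = th / h, tw / w
        # (1a1) bilinear resize so every frame has the same size
        resized = F.interpolate(img[None], size=(th, tw), mode="bilinear",
                                align_corners=False)[0]
        # (1a2) scale the detection frame labels with the same factors
        scaled = [[x1 * sx, y1 * sy, x2 * sx, y2 * sy]
                  for x1, y1, x2, y2 in frame_boxes]
        # (1a2) identity encoding: -1 when unknown, otherwise sequential codes from 1
        codes = []
        for pid in frame_ids:
            if pid == "":
                codes.append(-1)
            else:
                if pid not in id_codes:
                    id_codes[pid] = next_code
                    next_code += 1
                codes.append(id_codes[pid])
        out_frames.append(resized); out_boxes.append(scaled); out_ids.append(codes)
    return out_frames, out_boxes, out_ids

# Usage with two tiny random frames:
frames = [torch.rand(3, 480, 640), torch.rand(3, 480, 640)]
boxes = [[[10, 20, 60, 180]], [[12, 22, 62, 182]]]
ids = [["person_a"], [""]]
f, b, i = preprocess_sequence(frames, boxes, ids)
```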
3. The pedestrian multi-target tracking method based on multivariate difference fusion according to claim 1, wherein the structure of the detection and re-identification integrated network model O based on multivariate difference fusion in step (2a) is as follows:
the backbone network Net_backbone contains 27 two-dimensional convolution layers, 37 batch normalization layers, 6 two-dimensional pooling layers, 4 deformable convolution layers and 2 transposed convolution layers;
the first feature fusion sub-network A_s and the second feature fusion sub-network A_l each contain three structurally identical spatial attention sub-networks Net_sam; the two-dimensional convolution layer contained in Net_sam has a 3x3 convolution kernel, a stride of 1 and an output dimension of 1; each channel attention sub-network Net_cam contains 4 two-dimensional convolution layers, and the two-dimensional convolution layers contained in Net_cam have 1x1 convolution kernels and a stride of 1;
the first and second convolution layers contained in the Net_offset, Net_hm_s, Net_bbox, Net_hm_l and Net_reid sub-networks have 3x3 convolution kernels with a stride of 1.
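The recited layer hyper-parameters for the attention sub-networks can be sketched as follows; the sigmoid gating, the global pooling in the channel branch and the channel reduction ratio are assumptions, since the claim only fixes kernel sizes, strides, layer counts and the output dimension.

```python
# Sketch matching the recited hyper-parameters: 3x3 / stride-1 / 1-channel output
# for the spatial attention convolution; four 1x1 / stride-1 convolutions for the
# channel attention. How the attention is applied is an assumption.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):          # stand-in for Net_sam
    def __init__(self, in_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 1, kernel_size=3, stride=1, padding=1)
    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))   # (B,1,H,W) spatial weight map

class ChannelAttention(nn.Module):          # stand-in for Net_cam
    def __init__(self, ch, r=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(ch, ch // r, kernel_size=1, stride=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch // r, kernel_size=1, stride=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch, kernel_size=1, stride=1),
            nn.Conv2d(ch, ch, kernel_size=1, stride=1))   # four 1x1 convolutions in all
    def forward(self, x):
        w = torch.sigmoid(self.mlp(x.mean(dim=(2, 3), keepdim=True)))
        return x * w                               # per-channel reweighting

x = torch.rand(2, 16, 64, 64)
y = ChannelAttention(16)(SpatialAttention(16)(x))   # example composition
```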
4. The pedestrian multi-target tracking method based on multivariate difference fusion according to claim 1, wherein the random data enhancement of the bs training samples randomly selected from the training sample set D_train in step (3b), and the updating of the detection frame information of each training sample according to the enhancement applied, are implemented by the following steps:
(3b1) rotating each training sample by a random angle θ, θ ∈ [−5, 5]; carrying out a random scale change with coefficient s, s ∈ [0.9, 1.1], on each training sample after the random rotation; then carrying out a random image brightness change with coefficient r, r ∈ [−0.2, 0.2], on each training sample after the random scale change, to obtain bs training samples after random data enhancement;
(3b2) synchronously updating the detection frame labels according to the values of θ and s, to obtain bs data-enhanced training samples with updated detection frame information.
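A sketch of steps (3b1)-(3b2), assuming rotation about the image centre and axis-aligned re-boxing of the rotated, scaled corners; the claim fixes only the parameter ranges, not how the transform or the label update is implemented.

```python
# Sketch of steps (3b1)-(3b2): draw theta in [-5, 5] degrees, s in [0.9, 1.1] and
# r in [-0.2, 0.2], apply them to the image, then update the box labels from theta and s.
import math
import random
import torch
import torchvision.transforms.functional as TF

def augment(img, boxes):
    """img: (3,H,W) tensor in [0,1]; boxes: list of [x1,y1,x2,y2]."""
    theta = random.uniform(-5.0, 5.0)
    s = random.uniform(0.9, 1.1)
    r = random.uniform(-0.2, 0.2)
    _, h, w = img.shape
    # image: rotate, scale about the centre, then shift brightness
    img = TF.rotate(img, theta)
    img = TF.affine(img, angle=0.0, translate=[0, 0], scale=s, shear=[0.0])
    img = (img + r).clamp(0.0, 1.0)
    # labels: rotate the four corners, scale about the centre, re-box axis-aligned
    cx, cy, rad = w / 2.0, h / 2.0, math.radians(-theta)
    new_boxes = []
    for x1, y1, x2, y2 in boxes:
        corners = [(x1, y1), (x2, y1), (x1, y2), (x2, y2)]
        pts = []
        for x, y in corners:
            rx = cx + (x - cx) * math.cos(rad) - (y - cy) * math.sin(rad)
            ry = cy + (x - cx) * math.sin(rad) + (y - cy) * math.cos(rad)
            pts.append((cx + (rx - cx) * s, cy + (ry - cy) * s))
        xs, ys = [p[0] for p in pts], [p[1] for p in pts]
        new_boxes.append([min(xs), min(ys), max(xs), max(ys)])
    return img, new_boxes

img, boxes = torch.rand(3, 480, 640), [[100, 120, 180, 360]]
aug_img, aug_boxes = augment(img, boxes)
```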
5. The pedestrian multi-target tracking method based on multivariate difference fusion according to claim 1, wherein the loss value L_total of the detection and re-identification integrated network model O in step (3e) is calculated according to the following formula:
L_det = a × (0.6 × L_hm + 0.15 × L_hm_l + 0.25 × L_hm_s) + b × L_off + c × L_bbox
where the parameters a, b and c are constant coefficients, and w1 and w2 are learnable parameters.
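Only the detection-branch term L_det is reproduced above; the expression that combines it with L_reid into L_total through the learnable parameters w1 and w2 is not shown in this text. A common uncertainty-weighting form consistent with the recited symbols, given here purely as an assumed illustration, is:

```latex
L_{det} = a\bigl(0.6\,L_{hm} + 0.15\,L_{hm\_l} + 0.25\,L_{hm\_s}\bigr) + b\,L_{off} + c\,L_{bbox},
\qquad
L_{total} = \tfrac{1}{2}\bigl(e^{-w_1} L_{det} + e^{-w_2} L_{reid} + w_1 + w_2\bigr)
```

Under such a form, w1 and w2 act as learnable task-balancing terms in the adaptive weighted summation of step (3e); the patent's exact combination may differ.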
6. The pedestrian multi-target tracking method based on multivariate difference fusion according to claim 1, wherein in step (3f) the weight parameters θ_J of O are updated using the gradients of the weight parameters of O, according to the following update formula:
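The update formula itself is not reproduced in this text. Since step (3f) specifies back-propagation followed by gradient descent, a standard form, stated here only as an assumption, is:

```latex
\theta_J \leftarrow \theta_J - \eta \,\frac{\partial L_{total}}{\partial \theta_J}
```

where η denotes the learning rate (an assumed symbol not recited in the claim).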
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110556574.7A CN113221787B (en) | 2021-05-18 | 2021-05-18 | Pedestrian multi-target tracking method based on multi-element difference fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113221787A (en) | 2021-08-06 |
CN113221787B CN113221787B (en) | 2023-09-29 |
Family
ID=77093689
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110556574.7A Active CN113221787B (en) | 2021-05-18 | 2021-05-18 | Pedestrian multi-target tracking method based on multi-element difference fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113221787B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019144575A1 (en) * | 2018-01-24 | 2019-08-01 | 中山大学 | Fast pedestrian detection method and device |
US20190278378A1 (en) * | 2018-03-09 | 2019-09-12 | Adobe Inc. | Utilizing a touchpoint attribution attention neural network to identify significant touchpoints and measure touchpoint contribution in multichannel, multi-touch digital content campaigns |
CN111079658A (en) * | 2019-12-19 | 2020-04-28 | 夸氪思维(南京)智能技术有限公司 | Video-based multi-target continuous behavior analysis method, system and device |
CN111767847A (en) * | 2020-06-29 | 2020-10-13 | 佛山市南海区广工大数控装备协同创新研究院 | Pedestrian multi-target tracking method integrating target detection and association |
CN112131959A (en) * | 2020-08-28 | 2020-12-25 | 浙江工业大学 | 2D human body posture estimation method based on multi-scale feature reinforcement |
Non-Patent Citations (2)
Title |
---|
侯建华; 麻建; 王超; 项俊: "Visual multi-target tracking based on a spatial attention mechanism", Journal of South-Central University for Nationalities (Natural Science Edition), no. 04 *
张静; 王文杰: "Research on multi-target tracking methods based on multi-information fusion", Computer Measurement & Control, no. 09 *
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113807187A (en) * | 2021-08-20 | 2021-12-17 | 北京工业大学 | Unmanned aerial vehicle video multi-target tracking method based on attention feature fusion |
CN113807187B (en) * | 2021-08-20 | 2024-04-02 | 北京工业大学 | Unmanned aerial vehicle video multi-target tracking method based on attention feature fusion |
CN113688761B (en) * | 2021-08-31 | 2024-02-20 | 安徽大学 | Pedestrian behavior category detection method based on image sequence |
CN113688761A (en) * | 2021-08-31 | 2021-11-23 | 安徽大学 | Pedestrian behavior category detection method based on image sequence |
CN113723322A (en) * | 2021-09-02 | 2021-11-30 | 南京理工大学 | Pedestrian detection method and system based on single-stage anchor-free frame |
CN114120188A (en) * | 2021-11-19 | 2022-03-01 | 武汉大学 | Multi-pedestrian tracking method based on joint global and local features |
CN114120188B (en) * | 2021-11-19 | 2024-04-05 | 武汉大学 | Multi-row person tracking method based on joint global and local features |
CN114241053A (en) * | 2021-12-31 | 2022-03-25 | 北京工业大学 | FairMOT multi-class tracking method based on improved attention mechanism |
CN114241053B (en) * | 2021-12-31 | 2024-05-28 | 北京工业大学 | Multi-category tracking method based on improved attention mechanism FairMOT |
CN114529581A (en) * | 2022-01-28 | 2022-05-24 | 西安电子科技大学 | Multi-target tracking method based on deep learning and multi-task joint training |
CN114663917A (en) * | 2022-03-14 | 2022-06-24 | 清华大学 | Multi-view-angle-based multi-person three-dimensional human body pose estimation method and device |
CN114708653A (en) * | 2022-03-23 | 2022-07-05 | 南京邮电大学 | Specified pedestrian action retrieval method based on pedestrian re-identification algorithm |
CN114937239A (en) * | 2022-05-25 | 2022-08-23 | 青岛科技大学 | Pedestrian multi-target tracking identification method and tracking identification device |
CN115082748B (en) * | 2022-08-23 | 2022-11-22 | 浙江大华技术股份有限公司 | Classification network training and target re-identification method, device, terminal and storage medium |
CN115082748A (en) * | 2022-08-23 | 2022-09-20 | 浙江大华技术股份有限公司 | Classification network training and target re-identification method, device, terminal and storage medium |
CN116912633B (en) * | 2023-09-12 | 2024-01-05 | 深圳须弥云图空间科技有限公司 | Training method and device for target tracking model |
CN116912633A (en) * | 2023-09-12 | 2023-10-20 | 深圳须弥云图空间科技有限公司 | Training method and device for target tracking model |
CN117831094A (en) * | 2023-11-15 | 2024-04-05 | 北京京能热力股份有限公司 | Distributed boiler room video intelligent analysis method and device |
Also Published As
Publication number | Publication date |
---|---|
CN113221787B (en) | 2023-09-29 |
Similar Documents
Publication | Title
---|---
CN113221787B (en) | Pedestrian multi-target tracking method based on multi-element difference fusion
Adarsh et al. | YOLO v3-Tiny: Object Detection and Recognition using one stage improved model
CN107229904B (en) | Target detection and identification method based on deep learning
CN111027493B (en) | Pedestrian detection method based on deep learning multi-network soft fusion
CN110443805B (en) | Semantic segmentation method based on pixel density
CN112396002A (en) | Lightweight remote sensing target detection method based on SE-YOLOv3
CN109977997B (en) | Image target detection and segmentation method based on convolutional neural network rapid robustness
CN109543606A (en) | A kind of face identification method that attention mechanism is added
WO2020046213A1 (en) | A method and apparatus for training a neural network to identify cracks
CN110826379B (en) | Target detection method based on feature multiplexing and YOLOv3
CN111968150B (en) | Weak surveillance video target segmentation method based on full convolution neural network
CN111626184B (en) | Crowd density estimation method and system
CN114648665B (en) | Weak supervision target detection method and system
Kim et al. | Fast pedestrian detection in surveillance video based on soft target training of shallow random forest
CN114821014B (en) | Multi-mode and countermeasure learning-based multi-task target detection and identification method and device
CN110647802A (en) | Remote sensing image ship target detection method based on deep learning
CN113223614A (en) | Chromosome karyotype analysis method, system, terminal device and storage medium
KR20180123810A (en) | Data enrichment processing technology and method for decoding x-ray medical image
CN114241250A (en) | Cascade regression target detection method and device and computer readable storage medium
CN115018039A (en) | Neural network distillation method, target detection method and device
CN118212572A (en) | Road damage detection method based on improvement YOLOv7
CN116977859A (en) | Weak supervision target detection method based on multi-scale image cutting and instance difficulty
CN117437691A (en) | Real-time multi-person abnormal behavior identification method and system based on lightweight network
CN109583584B (en) | Method and system for enabling CNN with full connection layer to accept indefinite shape input
CN114972434B (en) | Cascade detection and matching end-to-end multi-target tracking system
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant