CN118071826B - Target positioning method and system based on video monitoring - Google Patents
Target positioning method and system based on video monitoring
- Publication number
- CN118071826B CN118071826B CN202410123391.XA CN202410123391A CN118071826B CN 118071826 B CN118071826 B CN 118071826B CN 202410123391 A CN202410123391 A CN 202410123391A CN 118071826 B CN118071826 B CN 118071826B
- Authority
- CN
- China
- Prior art keywords
- target
- image
- video
- ambiguity
- motion blur
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 71
- 238000012544 monitoring process Methods 0.000 title claims abstract description 31
- 230000033001 locomotion Effects 0.000 claims abstract description 119
- 238000001514 detection method Methods 0.000 claims abstract description 95
- 238000004364 calculation method Methods 0.000 claims abstract description 20
- 239000013598 vector Substances 0.000 claims description 19
- 230000015654 memory Effects 0.000 claims description 17
- 125000004122 cyclic group Chemical group 0.000 claims description 14
- 238000010586 diagram Methods 0.000 claims description 14
- 239000011159 matrix material Substances 0.000 claims description 14
- 230000006870 function Effects 0.000 claims description 13
- 238000004590 computer program Methods 0.000 claims description 12
- 238000012806 monitoring device Methods 0.000 claims description 12
- 230000009466 transformation Effects 0.000 claims description 12
- 230000004044 response Effects 0.000 claims description 10
- 238000012216 screening Methods 0.000 claims description 10
- 230000011218 segmentation Effects 0.000 claims description 9
- 238000013507 mapping Methods 0.000 claims description 7
- 238000013528 artificial neural network Methods 0.000 claims description 5
- 238000010276 construction Methods 0.000 claims description 5
- 230000004807 localization Effects 0.000 claims 6
- 238000000605 extraction Methods 0.000 abstract description 21
- 238000005192 partition Methods 0.000 abstract description 4
- 230000008569 process Effects 0.000 description 13
- 238000004422 calculation algorithm Methods 0.000 description 9
- 238000012545 processing Methods 0.000 description 8
- 238000001914 filtration Methods 0.000 description 6
- 230000000694 effects Effects 0.000 description 5
- 230000008859 change Effects 0.000 description 4
- 238000012549 training Methods 0.000 description 4
- 230000008878 coupling Effects 0.000 description 3
- 238000010168 coupling process Methods 0.000 description 3
- 238000005859 coupling reaction Methods 0.000 description 3
- 238000013519 translation Methods 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 2
- 238000004891 communication Methods 0.000 description 2
- 238000013527 convolutional neural network Methods 0.000 description 2
- 238000013461 design Methods 0.000 description 2
- 238000006073 displacement reaction Methods 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 238000012360 testing method Methods 0.000 description 2
- 238000012935 Averaging Methods 0.000 description 1
- 230000002159 abnormal effect Effects 0.000 description 1
- 230000009471 action Effects 0.000 description 1
- 230000003044 adaptive effect Effects 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000000295 complement effect Effects 0.000 description 1
- 238000013500 data storage Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000008034 disappearance Effects 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 230000001815 facial effect Effects 0.000 description 1
- 239000000835 fiber Substances 0.000 description 1
- 230000002401 inhibitory effect Effects 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 238000007726 management method Methods 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000011176 pooling Methods 0.000 description 1
- 238000012887 quadratic function Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/56—Extraction of image or video features relating to colour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10024—Color image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20112—Image segmentation details
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30241—Trajectory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a target positioning method and system based on video monitoring. The method comprises: acquiring video stream data; establishing a plurality of target detection frames, and using each target detection frame to extract multi-channel gradient features and multi-channel color features from different target areas of a video image so as to generate a plurality of feature sub-graphs; splicing the feature sub-graphs according to the intersection ratio between any two target detection frames; calculating a motion blur kernel of the spliced target feature graph to obtain its ambiguity; compensating and tuning the motion blur kernel until the ambiguity reaches a preset value, then performing a deblurring operation on the target feature graph; and calculating the position coordinates of the target in continuous frame images with a correlation filter to complete target tracking and positioning. The method adopts partitioned feature extraction and splicing, and combines ambiguity-based tuning with deblurring, which greatly improves image quality; by calculating the target positions across continuous frame images, accurate positioning and tracking of the dynamic target are realized.
Description
Technical Field
The invention relates to the technical field of video monitoring and target positioning, in particular to a target positioning method and system based on video monitoring.
Background
Video monitoring is one of the most common positioning approaches at present. In shopping malls, underground garages, hospitals, residences, and similar places, surveillance cameras are usually installed, and once an abnormal target appears, the recordings can be retrieved immediately to track the target and assist in monitoring and management work. However, dynamic target positioning based on video monitoring faces many difficulties, such as target occlusion, target interference, and target disappearance, and these problems often make the accuracy of the positioning result hard to guarantee. In recent years, artificial-intelligence algorithms have been combined with target positioning and have achieved certain results, but this approach places high demands on device configuration and computing power, and is not suitable for the video monitoring equipment already deployed on a large scale.
Disclosure of Invention
In order to solve at least one technical problem set forth above, the present invention provides a target positioning method and system based on video monitoring.
In a first aspect, the present invention provides a target positioning method based on video monitoring, the method comprising:
Acquiring video stream data, wherein the video stream data comprises a plurality of continuous frame video images;
establishing a plurality of target detection frames, and extracting multi-channel gradient features and multi-channel color features from different target areas of the video image by using each target detection frame to generate a plurality of feature sub-graphs;
Calculating the intersection ratio between any two target detection frames to be used as the correlation between any two target detection frames; splicing the feature sub-graphs according to the correlation relationship to obtain a target feature graph;
calculating a motion blur kernel of the target feature map, and obtaining the ambiguity of the target feature map according to the motion blur kernel;
Judging whether the ambiguity reaches a preset value; if the ambiguity does not reach the preset value, performing compensation tuning on the motion blur kernel, and outputting the current motion blur kernel when the ambiguity reaches the preset value; determining parameters of a deconvolution filter according to the current motion blur kernel, and performing a deblurring operation on the target feature map through the deconvolution filter;
performing cyclic convolution calculation on the deblurred target feature map by using a correlation filter, and determining the position coordinates of the target in the current frame image; and obtaining the motion trail of the target based on all the position coordinates of the target in the continuous frame video images, and completing the tracking and positioning of the target.
Preferably, the establishing a plurality of target detection frames includes:
performing instance segmentation on the target image by using a fully convolutional neural network (FCN) to obtain a rectangular initial detection frame and a corresponding polygonal contour;
And calculating a first IOU value of the initial detection frame and the polygonal outline, calculating a second IOU value of the polygonal outline and the target actual outline when the first IOU value reaches a first preset threshold value, and screening out the corresponding initial detection frame as the target detection frame when the second IOU value reaches a second preset threshold value.
Preferably, the calculating the motion blur kernel of the target feature map includes:
initializing a blur radius R, and constructing a motion vector diagram by using the blur radius R, wherein the motion vector diagram is a straight line, the length is R, the angle is 0, and the width is 1 pixel;
Based on the blur radius R, an affine transformation matrix K is established, wherein θ is the movement angle of the video monitoring device;
and mapping the motion vector image by using the affine transformation matrix K to obtain a motion blur kernel.
Preferably, the method further comprises calculating a movement angle of the video monitoring device, comprising:
based on n continuous frame video images, randomly selecting m feature points from a first frame video image, and sequentially searching matching points of the m feature points from the continuous frames:
X11,X12,...,X1m
X21,X22,...,X2m
......
Xn1,Xn2,...,Xnm
Wherein X1m is the m-th feature point in the 1st frame of video image, X2m is the matching point of X1m in the 2nd frame of video image, and Xnm is the matching point of X(n-1)m in the n-th frame of video image;
calculating the movement angle of the video monitoring equipment:
Wherein (xij, yij) is the coordinate value of the j-th feature point Xij in the i-th frame video image, and θ is the movement angle of the video monitoring device.
Preferably, the obtaining the ambiguity of the target feature map according to the motion blur kernel includes:
determining pixel values of points in the video image by using the motion blur kernel;
calculating the blurring degree of the video image according to the pixel value:
Where L is the width of the image, H is the height of the image, and Plh is the pixel value of the image at pixel (l, h).
Preferably, the compensation tuning of the motion blur kernel includes:
if the ambiguity does not reach the preset value, multiplying the blur radius by a compensation coefficient and substituting it back into the affine transformation matrix to recalculate the motion blur kernel and the ambiguity, until the ambiguity reaches the preset value; wherein the value range of the compensation coefficient is (1.1, 1.5).
Preferably, the performing a cyclic convolution calculation on the deblurred target feature map by using a correlation filter, to determine a position coordinate of the target in the current frame image, includes:
the relevant filter F acts on the target feature map D to obtain a response map C:
C=D*F
E(F) = ||D*F - Y||² + ||F||²
Wherein * denotes cyclic convolution; E(F) is the objective function of the correlation filter F; Y represents a Gaussian distribution whose highest value is at the spatial center point; and || · || denotes the norm;
The correlation filter F is solved by using the fast Fourier transform, and the coordinate of the position of the maximum value of the response map C is the position coordinate of the target in the current frame image.
In a second aspect, the present invention provides a video monitoring-based object positioning system, the system comprising:
a video stream acquisition unit configured to acquire video stream data including a plurality of continuous frame video images;
the detection frame construction unit is used for establishing a plurality of target detection frames, and extracting multi-channel gradient features and multi-channel color features from different target areas of the video image by using each target detection frame to generate a plurality of feature sub-graphs;
the characteristic diagram splicing unit is used for calculating the intersection ratio between any two target detection frames and taking the intersection ratio as the correlation between any two target detection frames; splicing the feature sub-graphs according to the correlation relationship to obtain a target feature graph;
The ambiguity calculation unit is used for calculating a motion blur kernel of the target feature map and obtaining the ambiguity of the target feature map according to the motion blur kernel;
the deblurring unit is used for judging whether the ambiguity reaches a preset value, if the ambiguity does not reach the preset value, compensating and optimizing the motion blur kernel, outputting a current motion blur kernel when the ambiguity reaches the preset value, determining parameters of a deconvolution filter according to the current motion blur kernel, and performing deblurring operation on the target feature map through the deconvolution filter;
The tracking and positioning unit is used for performing circular convolution calculation on the deblurred target feature map by using the correlation filter and determining the position coordinates of the target in the current frame image; and obtaining the motion trail of the target based on all the position coordinates of the target in the continuous frame video images, and completing the tracking and positioning of the target.
In a third aspect, the present invention also provides an electronic device, including: a processor and a memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform a method as described in the first aspect and any one of its possible implementation manners.
In a fourth aspect, the present invention also provides a computer readable storage medium having stored therein a computer program comprising program instructions which, when executed by a processor of an electronic device, cause the processor to perform a method as in the first aspect and any one of the possible implementations thereof.
Compared with the prior art, the invention has the beneficial effects that:
1) When the method is used for extracting the characteristics, a plurality of target detection frames are established, and each target detection frame is used for extracting the characteristics of the multi-channel gradient characteristics and the multi-channel color characteristics of different target areas of the video image to generate a plurality of characteristic sub-graphs; calculating the intersection ratio between any two target detection frames to be used as the correlation between any two target detection frames; and splicing the feature sub-graphs according to the correlation relationship to obtain a target feature graph. According to the invention, through partitioned feature extraction, finer image features can be extracted; by considering the correlation of different target detection frames, invalid features can be effectively filtered during feature splicing, noise interference is reduced, and feature quality of feature extraction is improved.
2) In video monitoring, motion blur can occur in video images due to unstable factors, and improper processing or over-processing can seriously affect image quality. Therefore, on the premise of accurately and effectively extracting the image features, the invention further calculates the motion blur kernel of the feature map to evaluate its blur degree; when the blur degree is determined not to reach the preset value, the motion blur kernel is compensated and tuned until the blur degree meets the requirement, and a deconvolution-filtering deblurring operation is then performed, so that the image quality is greatly improved.
3) For dynamic positioning of the target in the continuous frame images, the invention adopts the correlation filter to carry out circular convolution calculation on the deblurred target feature image, determines the position coordinate of the target in each frame image, can finally determine the motion track of the target in the continuous frame according to the arrangement sequence of the continuous frame, and finally accurately realizes tracking and positioning of the target; meanwhile, compared with a common neural network, the algorithm provided by the invention has the advantages that the operation amount is greatly reduced, the operation is quickened, and the positioning cost is reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly describe the embodiments of the present invention or the technical solutions in the background art, the following description will describe the drawings that are required to be used in the embodiments of the present invention or the background art.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
Fig. 1 is a schematic flow chart of a target positioning method based on video monitoring according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating the sub-steps of step S20 in FIG. 1 according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a target positioning system based on video monitoring according to an embodiment of the present invention;
fig. 4 is a schematic hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without making any inventive effort are intended to be within the scope of the invention.
The terms first, second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, may mean including any one or more elements selected from the group consisting of A, B and C.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better illustration of the invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, well known methods, procedures, components, and circuits have not been described in detail so as not to obscure the present invention.
The current common video monitoring target positioning methods include the following methods:
Background differencing method: a background image is constructed, the current image is differenced against the background image, and the difference image is then processed to obtain the final foreground image. The key of the method is constructing a proper background model, which is made very difficult by weather, lighting, camera shake, and similar factors.
Time difference method: the adjacent-frame difference method mainly takes the difference between two or more adjacent frames, and the resulting difference image is the detection result. The method is only suitable for situations in which adjacent frames change very little, and it cannot position accurately for video monitoring over a relatively long period of time.
Deep learning-based method: target detection methods such as Faster R-CNN and YOLO place high demands on equipment because of their computing-power requirements, and therefore have high cost and poor universality. Moreover, because training data are never comprehensive enough, problems such as missed detections, occlusion, appearance change, and background interference occur during target identification, and the accuracy of target positioning cannot be guaranteed.
In view of the high positioning difficulty, low positioning accuracy, and high cost of the existing methods, the invention provides a target positioning method based on video monitoring. Through refined partitioned feature extraction and splicing combined with blur-degree tuning and deblurring, the image quality is greatly improved; the coordinates of the target can then be located in continuous frames to track its movement trajectory, realizing accurate positioning of a dynamic target with the advantages of high efficiency, low cost, high precision, and strong universality.
Referring to fig. 1, fig. 1 is a flowchart of a target positioning method based on video monitoring according to an embodiment (a) of the present invention.
A target positioning method based on video monitoring, the method comprising:
S10, acquiring video stream data, wherein the video stream data comprises a plurality of continuous frame video images.
In this embodiment, a network camera may be used to obtain video stream data through network connection. In general, a network camera encapsulates video stream data into data packets in a specific format and sends the data packets to a receiving end, and the receiving end needs to parse and process the data packets, for example, perform a decoding operation and a format processing operation on the data to obtain a required video frame image. In addition, video streaming data may also be obtained by accessing a network video server, which may capture and transmit video signals from a video source to a receiving end over a network, and which typically supports multiple video formats and resolutions.
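As an illustration only (not part of the claimed method), the following Python sketch shows one way consecutive frames might be pulled and decoded from a network camera; it assumes OpenCV is available, and the RTSP address and frame limit are hypothetical.

```python
import cv2

def read_video_stream(url: str, max_frames: int = 500):
    """Pull and decode consecutive frames from a network camera (hypothetical URL)."""
    cap = cv2.VideoCapture(url)
    frames = []
    while cap.isOpened() and len(frames) < max_frames:
        ok, frame = cap.read()      # decode one video frame from the stream
        if not ok:
            break                   # stream ended or a packet was lost
        frames.append(frame)
    cap.release()
    return frames

# Example usage (placeholder address):
# frames = read_video_stream("rtsp://192.168.1.10:554/stream1")
```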
S20, establishing a plurality of target detection frames, and extracting multi-channel gradient features and multi-channel color features from different target areas of the video image by using each target detection frame to generate a plurality of feature sub-graphs.
The target detection frame is a rectangular frame for marking the position of the detection object in one picture or video frame. The target detection boxes are typically automatically generated by a detection algorithm that searches for and locates the position and size of the target in the image, and then generates one or more detection boxes to mark the detected target position. These test frames typically contain information about the location, size, class, etc. of the test frame and can be used for subsequent tasks such as target tracking, target identification and classification. The target detection frame is generally composed of rectangular frames, round frames, polygonal frames and the like, and the specific shape depends on the implementation of the detection algorithm and task requirements. For example, in some tasks requiring detection of multiple targets, multiple detection frames may be used to mark different target locations.
The multi-channel gradient feature HOG and the multi-channel color feature CN are feature information extracted from the image, and are usually represented by vectors or matrices. It is therefore necessary to map these features into the corresponding coordinate system when associating them with the position coordinates of the locating features. A pixel coordinate representation is obtained by mapping the gradient or color feature vector into a pixel coordinate system. Thus, the gradient or color features can be associated with the position coordinates of the positioning features, so that the description and analysis of the information such as the position, shape, texture and the like of the object in the image are realized.
It should be noted that, since the single extracted gradient feature or color feature affects the subsequent recognition process, the multi-channel gradient feature HOG and the multi-channel color feature CN are extracted simultaneously in the embodiment, and the multi-channel gradient feature HOG and the multi-channel color feature CN can achieve the complementary effect in a feature combination manner, so that the effectiveness of the feature extraction link is ensured.
It can be understood that if the extraction of a target overall feature is adopted in the feature extraction, the overall extraction range is large, and granularity is not fine enough, so that the accuracy of finally identifying the feature is easy to be disturbed. Therefore, in this embodiment, a plurality of target detection frames are generated according to a detection algorithm, then, feature extraction is performed on different target areas of a video image by using each target detection frame, a plurality of feature images are obtained, and then, stitching is performed to obtain a feature extraction image of a target, so that compared with the overall extraction of the target, the feature granularity is finer, and the feature recognition result is more accurate.
For example, suppose a segment of surveillance video contains four people, A, B, C, and D, whose appearance characteristics are quite close, including height, build, and standing posture. The target to be positioned is person A. If only one detection frame is used to extract person A, the contour features of the four people are so similar that A is easily confused with the other three; and if four detection frames are used to extract the features of all four people separately and the target information is then obtained by comparison, the comparison process is cumbersome and feature extraction is very time-consuming. For this situation, in this embodiment a plurality of target detection frames can be set for the region of person A, extracting, for example, A's facial features, hand features, and leg features separately, and the final feature map of A is locked through feature stitching, which greatly improves the accuracy of feature extraction.
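The following Python sketch illustrates per-region extraction of gradient and color features under stated assumptions: HOG from scikit-image stands in for the multi-channel gradient feature, and a normalized color histogram stands in for the CN color feature, whose exact construction the text does not spell out. Each returned vector corresponds to one feature sub-graph that is later spliced.

```python
import numpy as np
import cv2
from skimage.feature import hog

def extract_subfeatures(image_bgr, boxes):
    """For each detection box (x, y, w, h), build one feature sub-graph from
    gradient (HOG) and colour statistics. The colour histogram is only a
    stand-in for the CN feature described in the text."""
    sub_maps = []
    for (x, y, w, h) in boxes:
        roi = image_bgr[y:y + h, x:x + w]
        roi = cv2.resize(roi, (64, 128))                     # fixed size for HOG
        gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
        grad_feat = hog(gray, orientations=9,
                        pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        color_feat = cv2.calcHist([roi], [0, 1, 2], None,
                                  [8, 8, 8], [0, 256] * 3).flatten()
        color_feat = color_feat / (color_feat.sum() + 1e-6)  # normalise histogram
        sub_maps.append(np.concatenate([grad_feat, color_feat]))
    return sub_maps
```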
S30, calculating the intersection ratio between any two target detection frames to be used as the correlation between any two target detection frames; and splicing the feature sub-graphs according to the correlation relationship to obtain a target feature graph.
Building on the previous step, the main work in this step is to splice the feature maps extracted from the target sub-areas into a complete target feature map. In the partitioned detection process, correlation relationships exist between different detection frames: for example, if several detection frames all extract features of person A, the correlation between those frames is strong, while a detection frame that extracts features of another person has a low correlation with the frames holding person A's partition features. Therefore, when the feature maps are spliced, the correlation between every pair of target detection frames is determined first; features with high correlation are spliced first, features with low correlation are spliced afterwards, and features without correlation can finally be discarded to prevent feature interference. The overall target feature map is obtained through this splicing.
Therefore, in the embodiment, when the feature extraction is performed, a plurality of target detection frames are established, and each target detection frame is utilized to perform the feature extraction of the multi-channel gradient feature and the multi-channel color feature on different target areas of the video image, so as to generate a plurality of feature sub-graphs; calculating the intersection ratio between any two target detection frames to be used as the correlation between any two target detection frames; and splicing the feature sub-graphs according to the correlation relationship to obtain a target feature graph. Through the partitioned feature extraction, finer image features can be extracted; by considering the correlation of different target detection frames, invalid features can be effectively filtered during feature splicing, noise interference is reduced, and feature quality of feature extraction is improved.
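A minimal sketch of the intersection-over-union (IoU) computation and the correlation-ordered splicing described above; the choice of the first box as the anchor and the drop threshold are illustrative assumptions.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax2, ay2 = box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx2, by2 = box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0, min(ax2, bx2) - max(box_a[0], box_b[0]))
    ih = max(0, min(ay2, by2) - max(box_a[1], box_b[1]))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def stitch_by_correlation(boxes, sub_maps, anchor=0, drop_below=0.0):
    """Order feature sub-graphs by their IoU with an anchor box (high
    correlation first) and drop uncorrelated ones before concatenation."""
    scored = sorted(zip([iou(boxes[anchor], b) for b in boxes], sub_maps),
                    key=lambda t: t[0], reverse=True)
    kept = [m for score, m in scored if score > drop_below]
    return np.concatenate(kept)
```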
And S40, calculating a motion blur kernel of the target feature map, and obtaining the ambiguity of the target feature map according to the motion blur kernel.
The motion blur kernel, also called a motion blur parameter, refers to a characteristic of image blur in an image due to object motion, camera shake, or object defocus, or the like. In image processing, a motion blur kernel is regarded as a process in which a clear image is convolved with a blur kernel to obtain a blurred image. In particular, when an object moves during exposure, or the camera itself shakes, or the object is out of focus, these factors may cause image blurring. This blurring is due to the fact that during exposure, images of different displacements are recorded on the sensor, which is equivalent to a rectangular wave convolved with an image of a stationary object, which rectangular wave can be regarded as a one-dimensional convolution kernel.
In this embodiment, the motion blur kernel is calculated to further determine the degree of blur of the target feature map, where the degree of blur is typically indicative of the degree of error or difference between the pixel values and the true values in the image. The smaller the blur, the higher the sharpness of the image, and conversely, the more blurred the image. Because the existence of the motion blur kernel can change the pixel value of each pixel point in the image, after the motion blur kernel is obtained, the pixel sizes of different pixel points in the image after the motion blur can be determined based on the pixels and the motion blur kernel when the motion blur does not occur, and finally the blur degree is calculated according to the pixel point sizes.
S50, judging whether the ambiguity reaches a preset value.
And S60, if the ambiguity does not reach the preset value, performing compensation and optimization on the motion blur kernel, and outputting the current motion blur kernel until the ambiguity reaches the preset value.
Before deblurring operations are performed, it is often necessary to ensure that the value of the ambiguity reaches a criterion that can be manipulated to prevent under-or over-processing. After the ambiguity is obtained, the ambiguity is compared with a preset value, if the ambiguity does not reach the preset value, the motion blur kernel is compensated and optimized, namely, the finally mapped blur kernel matrix is changed by changing parameters such as the blur radius and the like until the ambiguity reaches the preset value, and the current latest motion blur kernel is output.
Preferably, the compensation tuning of the motion blur kernel includes:
if the ambiguity does not reach the preset value, multiplying the blur radius by a compensation coefficient and substituting it back into the affine transformation matrix to recalculate the motion blur kernel and the ambiguity, until the ambiguity reaches the preset value; wherein the value range of the compensation coefficient is (1.1, 1.5).
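A compensation-tuning loop might look like the sketch below; `build_motion_blur_kernel` and `ambiguity` are hypothetical helpers standing in for the kernel-construction and blur-degree formulas of this embodiment, and treating "reaches the preset value" as a ≥ comparison is an assumption.

```python
def tune_blur_kernel(feature_map, radius, theta, preset, coeff=1.2, max_iter=20):
    """Compensation-tuning sketch: enlarge the blur radius by a coefficient in
    (1.1, 1.5) and rebuild the kernel until the ambiguity reaches the preset
    value. build_motion_blur_kernel and ambiguity are hypothetical helpers."""
    kernel = build_motion_blur_kernel(radius, theta)
    for _ in range(max_iter):
        if ambiguity(feature_map, kernel) >= preset:   # preset reached: stop tuning
            break
        radius *= coeff                                # compensation coefficient
        kernel = build_motion_blur_kernel(radius, theta)
    return kernel
```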
S70, determining parameters of a deconvolution filter according to the current motion blur kernel, and performing deblurring operation on the target feature map through the deconvolution filter.
Deconvolution filters are special filters that are aimed at restoring the original image by deconvolution of the blurred image. The parameters of the deconvolution filter are usually determined by experiments and adjustments, which depend on many factors, such as the size and shape of the blur kernel, so that after the latest motion blur kernel is obtained, the parameters of the deconvolution filter can be matched, so that the deblurring operation can be performed, and a feature map with higher definition can be obtained.
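For illustration, the sketch below uses frequency-domain Wiener deconvolution as one common realization of such a deconvolution filter; the patent does not mandate this particular filter, and the SNR constant is a tuning assumption.

```python
import numpy as np

def wiener_deblur(image, kernel, snr=0.01):
    """Frequency-domain Wiener deconvolution with a known motion-blur kernel.
    Wiener filtering is only one possible realisation of the deconvolution
    filter described above."""
    H = np.fft.fft2(kernel, s=image.shape)        # kernel spectrum, zero-padded
    G = np.fft.fft2(image)                        # blurred-image spectrum
    W = np.conj(H) / (np.abs(H) ** 2 + snr)       # Wiener transfer function
    return np.real(np.fft.ifft2(W * G))           # restored (deblurred) image
```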
According to this embodiment, on the premise that the image features are accurately and effectively extracted, the motion blur kernel of the feature map is further calculated to evaluate its blur degree. When the blur degree is determined not to reach the preset value, the motion blur kernel is compensated and tuned until the blur degree meets the requirement, and a deconvolution-filtering deblurring operation is then performed, so that the image quality is greatly improved.
S80, performing circular convolution calculation on the deblurred target feature map by using a correlation filter, and determining the position coordinates of the target in the current frame image; and obtaining the motion trail of the target based on all the position coordinates of the target in the continuous frame video images, and completing the tracking and positioning of the target.
The objective function is determined from the minimization formulation of the correlation filter, and the correlation filter is solved. Acting the correlation filter on the target feature map yields a response map, from which the position coordinates of the target in the current frame image can be obtained. Finally, the position coordinates of the target in each frame image are solved iteratively, the motion trajectory of the dynamic target is obtained according to the order of the continuous frame images, and tracking and positioning of the target are achieved.
According to the embodiment, for dynamic positioning of the target in the continuous frame images, a relevant filter is adopted to carry out circular convolution calculation on the deblurred target feature images, the position coordinates of the target in each frame image are determined, the motion track of the target in the continuous frame can be finally determined according to the arrangement sequence of the continuous frames, and finally, tracking and positioning of the target are accurately realized; meanwhile, compared with a common neural network, the algorithm provided by the embodiment has the advantages that the operation amount is greatly reduced, the operation is quickened, and meanwhile, the cost of target positioning and tracking is reduced.
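Putting the steps together, a per-frame tracking loop could be as simple as the following sketch; `extract_features` and `locate` stand for the feature-extraction and correlation-filter steps described above and are assumed to be provided.

```python
def track_trajectory(frames, extract_features, locate):
    """Iterate over consecutive frames, locate the target in each one, and
    collect the coordinates into a motion trajectory."""
    trajectory = []
    for frame in frames:
        feature_map = extract_features(frame)     # partitioned + deblurred features
        trajectory.append(locate(feature_map))    # (row, col) of the target
    return trajectory
```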
Referring to fig. 2, in one embodiment, the establishing a plurality of target detection boxes includes:
s201, performing instance segmentation on a target image by using a full convolution neural network FCN to obtain an initial detection frame of a rectangle and a corresponding polygonal contour.
In performing object image instance segmentation, the FCN will first pre-process the image and then extract the high-level features of the image using the convolution and pooling layers. The FCN then upsamples the feature map of the last convolution layer using the deconvolution layer to the same size as the input image, thereby producing a prediction for each pixel while preserving spatial information in the original input image.
Further, the FCN finds the largest rectangular area meeting the predefined threshold in the image according to the preset detection scale, and the rectangular area is the initial detection frame. When a polygon contour is acquired, the FCN generates a series of segmentation masks (Segmentation Mask) within the boundary, each representing an instance, and the final polygon contour is obtained by merging and fusing the segmentation masks.
S202, calculating a first IOU value of the initial detection frame and the polygonal outline, when the first IOU value reaches a first preset threshold value, calculating a second IOU value of the polygonal outline and the target actual outline, and when the second IOU value reaches a second preset threshold value, screening out the corresponding initial detection frame as the target detection frame.
The first preset threshold IOU1 and the second preset threshold IOU2 can be set according to different scenes; moreover, in order to improve the detection accuracy of the rectangular detection frame, the second preset threshold IOU2 is generally greater than the first preset threshold IOU1. The value ranges of the first preset threshold IOU1 and the second preset threshold IOU2 are both between 0.5 and 0.7.
Therefore, in this embodiment, the initial detection frame is first matched with the predicted target, and the first matching results are screened, i.e., only initial detection frames whose IOU value is greater than IOU1 are kept. A second matching of the polygonal contour with the predicted target is then performed, and the second matching results are screened, i.e., only contours whose IOU value is greater than IOU2 are kept. The initial detection frames that pass both screenings are taken as the final target detection frames.
In summary, in this embodiment, the FCN algorithm is used to perform instance segmentation to obtain two branch results of the initial detection frame and the polygon contour, and then two parallel branch results without intersection are used to establish a new judgment relationship; the initial detection frame is utilized to carry out IOU primary screening, and the polygonal outline is utilized to carry out IOU secondary screening, so that the target detection frame with higher detection precision is obtained.
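A sketch of the two-stage IoU screening, assuming the shapely library for polygon geometry; the threshold values and the availability of the target's actual contour (e.g., from annotation) are assumptions made for illustration.

```python
from shapely.geometry import Polygon, box as rect

def screen_detection_boxes(candidates, actual_contour, iou1=0.5, iou2=0.6):
    """Two-stage IoU screening. Each candidate is (rect_box, polygon_contour),
    where rect_box = (x1, y1, x2, y2) and contours are lists of (x, y) points.
    iou1/iou2 are illustrative values in the 0.5-0.7 range."""
    def iou(a, b):
        inter = a.intersection(b).area
        union = a.union(b).area
        return inter / union if union > 0 else 0.0

    target = Polygon(actual_contour)
    kept = []
    for (x1, y1, x2, y2), contour in candidates:
        det = rect(x1, y1, x2, y2)
        poly = Polygon(contour)
        if iou(det, poly) >= iou1 and iou(poly, target) >= iou2:
            kept.append((x1, y1, x2, y2))          # passes both screenings
    return kept
```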
In one embodiment, the calculating the motion blur kernel of the target feature map includes:
1) Initializing a blur radius R, and constructing a motion vector diagram by using the blur radius R, wherein the motion vector diagram is a straight line, the length is R, the angle is 0, and the width is 1 pixel.
The blur radius (Blurring Radius) is a radius at which the blur operation smoothes the image in the image processing. The size of the blur radius determines the degree of influence of the blur operation on the image, and is generally used for reducing noise and details in the image, so that the image is smoother and softer.
Motion vector diagrams are used to describe the displacement relationship between video frames. In general, a motion vector image is an image, each point on the motion vector image represents the position change of a certain frame image relative to another frame image, and a straight line is obtained by constructing the motion vector image through a blur radius R, and the straight line has a length R, an angle of 0 and a width of 1 pixel.
Based on the blur radius R, an affine transformation matrix K is established, wherein θ is the movement angle of the video monitoring device;
and mapping the motion vector image by using the affine transformation matrix K to obtain a motion blur kernel.
Preferably, the motion blur kernel is obtained by mapping the motion vector diagram onto an 80 × 80-pixel map.
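The kernel construction can be sketched as follows with OpenCV; since the exact affine matrix K is not reproduced here, a pure rotation about the canvas center through the camera movement angle θ is used as a plausible stand-in, and the 80 × 80 canvas follows the preferred size above.

```python
import cv2
import numpy as np

def build_motion_blur_kernel(radius, theta_deg, size=80):
    """Draw a 1-pixel-wide line of length R at angle 0 on a size x size canvas,
    then map it by an affine transform through the movement angle theta.
    The rotation here is an assumed stand-in for the matrix K of the patent."""
    canvas = np.zeros((size, size), dtype=np.float32)
    cx, cy = size // 2, size // 2
    r = int(round(radius))
    cv2.line(canvas, (cx - r // 2, cy), (cx + r // 2, cy), 1.0, thickness=1)
    K = cv2.getRotationMatrix2D((cx, cy), theta_deg, 1.0)   # 2x3 affine matrix
    kernel = cv2.warpAffine(canvas, K, (size, size))        # map the vector diagram
    s = kernel.sum()
    return kernel / s if s > 0 else kernel                  # normalise to sum 1
```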
Further, calculating a movement angle of the video monitoring device includes:
based on n continuous frame video images, randomly selecting m feature points from a first frame video image, and sequentially searching matching points of the m feature points from the continuous frames:
X11,X12,...,X1m
X21,X22,...,X2m
......
Xn1,Xn2,...,Xnm
Wherein X1m is the m-th feature point in the 1st frame of video image, X2m is the matching point of X1m in the 2nd frame of video image, and Xnm is the matching point of X(n-1)m in the n-th frame of video image;
calculating the movement angle of the video monitoring equipment:
Wherein (xij, yij) is the coordinate value of the j-th feature point Xij in the i-th frame video image, and θ is the movement angle of the video monitoring device.
In this embodiment, by selecting a plurality of feature points on a frame of image, calculating the offset distance of each feature point in a continuous frame of image, including the offset in the x-coordinate direction and the y-coordinate direction, and finally averaging the offsets of all points, and calculating the final movement angle of the device, the accuracy of the calculation result is improved.
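A sketch of this averaging, under the assumption that the movement angle is taken as the arctangent of the averaged x and y offsets; the exact formula of the embodiment is not reproduced here.

```python
import numpy as np

def movement_angle(points):
    """points has shape (n_frames, m_points, 2): matched coordinates (x_ij, y_ij)
    of m feature points across n consecutive frames. The angle is the arctangent
    of the averaged per-point displacement, following the averaging idea above."""
    pts = np.asarray(points, dtype=np.float64)
    dx = pts[1:, :, 0] - pts[:-1, :, 0]          # x offsets between adjacent frames
    dy = pts[1:, :, 1] - pts[:-1, :, 1]          # y offsets between adjacent frames
    return np.degrees(np.arctan2(dy.mean(), dx.mean()))
```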
In one embodiment, the obtaining the ambiguity of the target feature map according to the motion blur kernel includes:
determining pixel values of points in the video image by using the motion blur kernel;
calculating the blurring degree of the video image according to the pixel value:
Where L is the width of the image, H is the height of the image, and Plh is the pixel value of the image at pixel (l, h).
In one embodiment, the method uses a correlation filter to perform cyclic convolution calculation on the deblurred target feature map, and determines the position coordinates of the target in the current frame image;
the relevant filter F acts on the target feature map D to obtain a response map C:
C=D*F
E(F) = ||D*F - Y||² + ||F||²
Wherein * denotes cyclic convolution; E(F) is the objective function of the correlation filter F; Y represents a Gaussian distribution whose highest value is at the spatial center point; and || · || denotes the norm;
The correlation filter F is solved by using the fast Fourier transform, and the coordinate of the position of the maximum value of the response map C is the position coordinate of the target in the current frame image.
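A single-channel sketch of the Fourier-domain solution of the objective E(F) = ||D*F - Y||² + ||F||² (the closed-form ridge-regression solution, with the regularization weight fixed to 1 to match the ||F||² term); the Gaussian width is an illustrative assumption. Repeating this for each deblurred frame and collecting the peak coordinates yields the motion trajectory.

```python
import numpy as np

def gaussian_response(shape, sigma=2.0):
    """Desired output Y: a Gaussian peaked at the spatial centre."""
    h, w = shape
    y, x = np.mgrid[0:h, 0:w]
    return np.exp(-((x - w // 2) ** 2 + (y - h // 2) ** 2) / (2 * sigma ** 2))

def locate_target(feature_map, lam=1.0):
    """Solve F in the Fourier domain for E(F) = ||D*F - Y||^2 + lam*||F||^2
    and return the (row, col) of the response-map maximum."""
    D = np.fft.fft2(feature_map)
    Y = np.fft.fft2(gaussian_response(feature_map.shape))
    F = np.conj(D) * Y / (np.conj(D) * D + lam)   # closed-form ridge solution
    C = np.real(np.fft.ifft2(D * F))              # response map C = D * F
    return np.unravel_index(np.argmax(C), C.shape)
```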
When correlation filtering is used for target tracking, the idea of cyclic sample translation largely alleviates the lack of training samples, but it also produces a certain boundary effect that degrades the positioning and tracking result. The boundary effect arises because the training samples generated by cyclic translation are synthetic: as the target boundary translates along with the cyclic shift, a large number of unrealistic training samples are generated, which reduces the discriminative capability of the correlation filter.
To mitigate the boundary effect, a spatial regularization term can be added to the correlation-filtering target tracking as a constraint. For the objective function in the above embodiment, a spatial weight map W can be introduced as a constraint term when the correlation filter F is updated in a given frame to obtain a new optimized objective function, specifically as follows:
constraining an objective function of the correlation filter F by using a spatial regularization term, and optimizing the objective function:
In the formula, the added term is the spatial regularization term, ⊙ denotes element-wise matrix multiplication, and W denotes a spatial weight map obeying a quadratic-function distribution; wherein:
β and λ are preset parameters, both greater than 0; (a, b) are the coordinates of discrete points around the target center, with a = 1, 2, ..., A and b = 1, 2, ..., B; and G1 and G2 denote the width and height of the target.
Because the target center is positioned at the center of the sampling image, the regularization weight W is introduced to play a role in inhibiting the numerical value of the non-target area in the filter F, and the problem of boundary effect is relieved.
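For illustration, a quadratic spatial weight map of this kind could be generated as below; the exact expression of this embodiment is not reproduced, so an SRDCF-style form w = β + λ((a/G1)² + (b/G2)²), with (a, b) measured from the target center, is used as an assumed stand-in.

```python
import numpy as np

def quadratic_weight_map(width, height, beta=0.1, lam=3.0):
    """Spatial weight map W with a quadratic-function distribution: small near
    the target centre and large towards the borders, so non-target regions of
    the filter are suppressed. The form used here is an assumption."""
    a = np.arange(width) - width // 2            # horizontal offsets from centre
    b = np.arange(height) - height // 2          # vertical offsets from centre
    A, B = np.meshgrid(a, b)                     # grid of (a, b) coordinates
    return beta + lam * ((A / width) ** 2 + (B / height) ** 2)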
Preferably, in another embodiment, another boundary constraint correlation filtering method can be used to optimize the model, namely, an adaptive spatial regularization algorithm is used for correlation filtering target tracking. The adopted spatial regularization weight map can be subjected to self-adaptive dynamic adjustment according to the characteristics of the target in the tracking process, so that the optimized correlation filter has stronger generalization capability.
Referring to fig. 3, in an embodiment of the present invention, there is further provided a target positioning system based on video monitoring, the system including:
A video stream acquisition unit 100 for acquiring video stream data including a plurality of continuous frame video images;
The detection frame construction unit 200 is configured to establish a plurality of target detection frames, and to extract multi-channel gradient features and multi-channel color features from different target areas of the video image by using each target detection frame to generate a plurality of feature sub-graphs;
The feature map stitching unit 300 is configured to calculate an intersection ratio between any two target detection frames, and use the intersection ratio as a correlation between any two target detection frames; splicing the feature sub-graphs according to the correlation relationship to obtain a target feature graph;
the ambiguity calculating unit 400 is configured to calculate a motion blur kernel of the target feature map, and obtain an ambiguity of the target feature map according to the motion blur kernel;
the deblurring unit 500 is configured to determine whether the ambiguity reaches a preset value; if the ambiguity does not reach the preset value, it compensates and tunes the motion blur kernel until the ambiguity reaches the preset value and then outputs the current motion blur kernel; it then determines parameters of a deconvolution filter according to the current motion blur kernel and performs a deblurring operation on the target feature map through the deconvolution filter;
The tracking and positioning unit 600 is configured to perform a cyclic convolution calculation on the deblurred target feature map by using a correlation filter, and determine a position coordinate of the target in the current frame image; and obtaining the motion trail of the target based on all the position coordinates of the target in the continuous frame video images, and completing the tracking and positioning of the target.
In a preferred embodiment, the detection frame construction unit 200 is further configured to perform instance segmentation on the target image by using the full convolutional neural network FCN, to obtain an initial detection frame of a rectangle and a corresponding polygonal contour;
And calculating a first IOU value of the initial detection frame and the polygonal outline, calculating a second IOU value of the polygonal outline and the target actual outline when the first IOU value reaches a first preset threshold value, and screening out the corresponding initial detection frame as the target detection frame when the second IOU value reaches a second preset threshold value.
In a preferred embodiment, the ambiguity calculation unit 400 is configured to calculate a motion blur kernel of a target feature map, and includes:
initializing a blur radius R, and constructing a motion vector diagram by using the blur radius R, wherein the motion vector diagram is a straight line, the length is R, the angle is 0, and the width is 1 pixel;
Based on the blur radius R, an affine transformation matrix K is established, wherein θ is the movement angle of the video monitoring device;
and mapping the motion vector image by using the affine transformation matrix K to obtain a motion blur kernel.
In a preferred embodiment, the ambiguity calculation unit 400 is further configured to calculate a movement angle of the video monitoring device, including:
based on n continuous frame video images, randomly selecting m feature points from a first frame video image, and sequentially searching matching points of the m feature points from the continuous frames:
X11,X12,...,X1m
X21,X22,...,X2m
......
Xn1,Xn2,...,Xnm
Wherein X1m is the m-th feature point in the 1st frame of video image, X2m is the matching point of X1m in the 2nd frame of video image, and Xnm is the matching point of X(n-1)m in the n-th frame of video image;
calculating the movement angle of the video monitoring equipment:
Wherein (xij, yij) is the coordinate value of the j-th feature point Xij in the i-th frame video image, and θ is the movement angle of the video monitoring device.
In a preferred embodiment, the ambiguity calculating unit 400 is configured to obtain the ambiguity of the target feature map according to the motion blur kernel, and includes:
determining pixel values of points in the video image by using the motion blur kernel;
calculating the blurring degree of the video image according to the pixel value:
Where L is the width of the image, H is the height of the image, and Plh is the pixel value of the image at pixel (l, h).
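The ambiguity formula itself was an image in the original, so the sketch below is only an assumed stand-in: it uses the motion blur kernel to obtain the predicted pixel values and averages the resulting loss of high-frequency detail over the L x H pixel grid.

```python
import numpy as np
import cv2

def ambiguity_score(image, blur_kernel):
    """Assumed ambiguity (blur degree) measure: the kernel predicts the
    blurred pixel values, and the score averages the loss of Laplacian
    detail over all L*H pixels. The per-pixel term is an assumption."""
    gray = image if image.ndim == 2 else cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = gray.astype(np.float32)
    predicted = cv2.filter2D(gray, -1, blur_kernel)     # pixel values under the blur kernel

    grad_orig = cv2.Laplacian(gray, cv2.CV_32F)
    grad_pred = cv2.Laplacian(predicted, cv2.CV_32F)
    H, L = gray.shape                                   # H rows (height), L columns (width)
    return float(np.sum(np.abs(grad_orig - grad_pred)) / (L * H))
```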
In a preferred embodiment, the deblurring unit 500 is configured to perform compensation tuning on the motion blur kernel, and includes:
If the ambiguity does not reach the preset value, the blur radius is multiplied by a compensation coefficient and substituted back into the affine transformation to recalculate the motion blur kernel and the ambiguity, until the ambiguity reaches the preset value; the value range of the compensation coefficient is (1.1, 1.5).
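The compensation tuning loop can then be sketched as follows, reusing the hypothetical motion_blur_kernel and ambiguity_score helpers above; the coefficient 1.2 is one admissible value from the (1.1, 1.5) range, and max_iter is a safeguard not mentioned in the original.

```python
def tune_blur_kernel(image, radius, theta_deg, preset, coef=1.2, max_iter=20):
    """Iteratively enlarge the blur radius by a compensation coefficient in
    (1.1, 1.5) until the ambiguity reaches the preset value, then return the
    current motion blur kernel. The comparison direction follows the text
    ('reaches the preset value'); helpers are the sketches defined above."""
    kernel = motion_blur_kernel(radius, theta_deg)
    for _ in range(max_iter):
        if ambiguity_score(image, kernel) >= preset:
            break
        radius *= coef                                   # compensation tuning
        kernel = motion_blur_kernel(radius, theta_deg)
    return kernel
```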
In a preferred embodiment, the tracking and positioning unit 600 is configured to perform a cyclic convolution calculation on the deblurred target feature map by using a correlation filter, and determine a position coordinate of the target in the current frame image, where the method includes:
the relevant filter F acts on the target feature map D to obtain a response map C:
C=D*F
E(F) = ||D*F − Y||² + ||F||²
wherein * denotes cyclic convolution, E(F) is the objective function of the correlation filter F, Y denotes a Gaussian distribution whose highest value lies at the spatial center point, and ||·|| denotes the norm;
and solving the correlation filter F by fast Fourier transform, and calculating the coordinate of the maximum of the response map C, which is the position coordinate of the target in the current frame image.
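A minimal single-channel sketch of this step: minimizing E(F) per frequency via the FFT gives the closed form conj(D̂)·Ŷ / (|D̂|² + 1), with regularizer weight 1 as printed, and the maximum of the response C = D*F is read off as the position coordinate. The Gaussian label construction and its sigma are assumptions; in practice the filter learned on the previous frame is applied to the current frame's deblurred features.

```python
import numpy as np

def gaussian_label(shape, sigma=2.0):
    """Gaussian response map Y peaking at the spatial center (sigma assumed)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - w // 2) ** 2 + (ys - h // 2) ** 2) / (2 * sigma ** 2))

def locate_target(feature_map):
    """Solve E(F) = ||D*F - Y||^2 + ||F||^2 in the Fourier domain and take
    the maximum of C = D*F as the target position (single-channel sketch;
    the multi-channel case sums over channels)."""
    D = np.fft.fft2(feature_map)
    Y = np.fft.fft2(gaussian_label(feature_map.shape))
    F = np.conj(D) * Y / (np.abs(D) ** 2 + 1.0)     # closed-form filter, weight 1 as printed
    response = np.real(np.fft.ifft2(D * F))         # cyclic convolution C = D*F via FFT
    row, col = np.unravel_index(np.argmax(response), response.shape)
    return int(col), int(row)                       # (x, y) position coordinate
```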
In some embodiments, the functions or modules of the video-monitoring-based target positioning system provided in this embodiment may be used to perform the methods described in the foregoing method embodiments; for specific implementations, refer to the descriptions of those embodiments, which are not repeated here for brevity.
An embodiment of the present invention further provides an electronic device, including: a processor, a transmitting means, an input means, an output means and a memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform a method as any one of the possible implementations described above.
An embodiment of the invention also provides a computer-readable storage medium in which a computer program is stored, the computer program comprising program instructions which, when executed by a processor of an electronic device, cause the processor to perform a method as any one of the possible implementations described above.
Referring to fig. 4, fig. 4 is a schematic hardware structure of an electronic device according to an embodiment of the invention.
The electronic device 2 comprises a processor 21, a memory 24, an input device 22, and an output device 23. The processor 21, memory 24, input device 22, and output device 23 are coupled by connectors, which include various interfaces, transmission lines, buses, and the like; this embodiment is not limited in this respect. It should be appreciated that, in the various embodiments of the invention, "coupled" means interconnected in a particular way, either directly or indirectly through other devices, for example through various interfaces, transmission lines, or buses.
The processor 21 may be one or more graphics processing units (GPUs); where the processor 21 is a GPU, it may be a single-core or a multi-core GPU. Alternatively, the processor 21 may be a processor group formed by a plurality of GPUs coupled to each other through one or more buses, or another type of processor; the embodiment of the invention is not limited in this regard.
Memory 24 may be used to store computer program instructions and various types of computer program code for executing aspects of the present invention. Optionally, the memory includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM), for the associated instructions and data.
The input means 22 are for inputting data and/or signals and the output means 23 are for outputting data and/or signals. The output device 23 and the input device 22 may be separate devices or may be an integral device.
It will be appreciated that, in embodiments of the present invention, the memory 24 may be used to store not only the relevant instructions but also the relevant data; the embodiments of the present invention do not limit the specific data stored in the memory.
It will be appreciated that fig. 4 shows only a simplified design of an electronic device. In practical applications, the electronic device may further include other necessary elements, including but not limited to any number of input/output devices, processors, memories, etc., and all video parsing devices capable of implementing the embodiments of the present invention are within the scope of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein. It will be further apparent to those skilled in the art that the descriptions of the various embodiments of the present invention are provided with emphasis, and that the same or similar parts may not be described in detail in different embodiments for convenience and brevity of description, and thus, parts not described in one embodiment or in detail may be referred to in description of other embodiments.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
In the above embodiments, the implementation may be wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, it may be wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, and may be transmitted from one website, computer, server, or data center to another in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., digital versatile disc (DVD)), or a semiconductor medium (e.g., solid state disk (SSD)), or the like.
Those of ordinary skill in the art will appreciate that all or part of the above-described method embodiments may be implemented by a computer program instructing related hardware; the program may be stored in a computer-readable storage medium and, when executed, carries out the flows of the above-described method embodiments. The aforementioned storage medium includes read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, or the like.
Claims (10)
1. A target positioning method based on video monitoring, the method comprising:
Acquiring video stream data, wherein the video stream data comprises a plurality of continuous frame video images;
establishing a plurality of target detection frames, and performing feature extraction of multi-channel gradient features and multi-channel color features on different target areas of the video image by using each target detection frame to generate a plurality of feature sub-graphs;
Calculating the intersection ratio between any two target detection frames to be used as the correlation between any two target detection frames; splicing the feature sub-graphs according to the correlation relationship to obtain a target feature graph;
calculating a motion blur kernel of the target feature map, and obtaining the ambiguity of the target feature map according to the motion blur kernel;
Judging whether the ambiguity reaches a preset value, if the ambiguity does not reach the preset value, performing compensation tuning on a motion blur kernel, outputting a current motion blur kernel when the ambiguity reaches the preset value, determining parameters of a deconvolution filter according to the current motion blur kernel, and performing deblurring operation on the target feature map through the deconvolution filter;
performing cyclic convolution calculation on the deblurred target feature map by using a correlation filter, and determining the position coordinates of the target in the current frame image; and obtaining the motion trail of the target based on all the position coordinates of the target in the continuous frame video images, and completing the tracking and positioning of the target.
2. The video surveillance-based object localization method of claim 1, wherein the establishing a plurality of object detection boxes comprises:
performing instance segmentation on the target image by using a full convolutional neural network FCN to obtain a rectangular initial detection frame and a corresponding polygonal contour;
and calculating a first IOU value between the initial detection frame and the polygonal contour; when the first IOU value reaches a first preset threshold, calculating a second IOU value between the polygonal contour and the actual target contour; and when the second IOU value reaches a second preset threshold, screening out the corresponding initial detection frame as the target detection frame.
3. The video surveillance-based object localization method of claim 1, wherein the calculating a motion blur kernel of an object feature map comprises:
initializing a blur radius R, and constructing a motion vector diagram by using the blur radius R, wherein the motion vector diagram is a straight line, the length is R, the angle is 0, and the width is 1 pixel;
Based on the blur radius R, an affine transformation matrix K is established:
Wherein θ is a movement angle of the video monitoring device;
and mapping the motion vector image by using the affine transformation matrix K to obtain a motion blur kernel.
4. The video monitoring-based object localization method of claim 3, further comprising calculating a movement angle of the video monitoring device, comprising:
based on n continuous frame video images, randomly selecting m feature points from a first frame video image, and sequentially searching matching points of the m feature points from the continuous frames:
X11,X12,...,X1m
X21,X22,...,X2m
......
Xn1,Xn2,...,Xnm
Wherein X1m is the m-th feature point in the 1st frame video image, X2m is the matching point of X1m in the 2nd frame video image, and Xnm is the matching point of X(n-1)m in the n-th frame video image;
calculating the movement angle of the video monitoring equipment:
Wherein (xij, yij) is the coordinate value of the j-th feature point Xij in the i-th frame video image, and θ is the movement angle of the video monitoring device.
5. The method for positioning a target based on video surveillance according to claim 3, wherein the obtaining the ambiguity of the target feature map according to the motion blur kernel comprises:
determining pixel values of points in the video image by using the motion blur kernel;
calculating the blurring degree of the video image according to the pixel value:
Where L is the width of the image, H is the height of the image, and Plh is the pixel value of the image at pixel (l, h).
6. The video surveillance-based object localization method of claim 3, wherein the performing compensation tuning on the motion blur kernel comprises:
if the ambiguity does not reach the preset value, multiplying the blur radius by a compensation coefficient and substituting it back into the affine transformation to recalculate the motion blur kernel and the ambiguity, until the ambiguity reaches the preset value; wherein the value range of the compensation coefficient is (1.1, 1.5).
7. The method for positioning a target based on video surveillance according to claim 1, wherein the performing a cyclic convolution calculation on the deblurred target feature map by using a correlation filter to determine a position coordinate of the target in the current frame image includes:
the relevant filter F acts on the target feature map D to obtain a response map C:
C=D*F
E(F) = ||D*F − Y||² + ||F||²
wherein * denotes cyclic convolution, E(F) is the objective function of the correlation filter F, Y denotes a Gaussian distribution whose highest value lies at the spatial center point, and ||·|| denotes the norm;
and solving the correlation filter F by fast Fourier transform, and calculating the coordinate of the maximum of the response map C, which is the position coordinate of the target in the current frame image.
8. A video surveillance-based object positioning system, the system comprising:
a video stream acquisition unit configured to acquire video stream data including a plurality of continuous frame video images;
the detection frame construction unit is used for establishing a plurality of target detection frames, and for performing feature extraction of multi-channel gradient features and multi-channel color features on different target areas of the video image by using each target detection frame to generate a plurality of feature sub-graphs;
the characteristic diagram splicing unit is used for calculating the intersection ratio between any two target detection frames and taking the intersection ratio as the correlation between any two target detection frames; splicing the feature sub-graphs according to the correlation relationship to obtain a target feature graph;
The ambiguity calculation unit is used for calculating a motion blur kernel of the target feature map and obtaining the ambiguity of the target feature map according to the motion blur kernel;
the deblurring unit is used for judging whether the ambiguity reaches a preset value, if the ambiguity does not reach the preset value, compensating and optimizing the motion blur kernel, outputting a current motion blur kernel when the ambiguity reaches the preset value, determining parameters of a deconvolution filter according to the current motion blur kernel, and performing deblurring operation on the target feature map through the deconvolution filter;
The tracking and positioning unit is used for performing circular convolution calculation on the deblurred target feature map by using the correlation filter and determining the position coordinates of the target in the current frame image; and obtaining the motion trail of the target based on all the position coordinates of the target in the continuous frame video images, and completing the tracking and positioning of the target.
9. An electronic device, comprising: a processor and a memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the video surveillance based object localization method of any one of claims 1 to 7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program comprising program instructions which, when executed by a processor of an electronic device, cause the processor to perform the video surveillance based object localization method of any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410123391.XA CN118071826B (en) | 2024-01-29 | 2024-01-29 | Target positioning method and system based on video monitoring |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410123391.XA CN118071826B (en) | 2024-01-29 | 2024-01-29 | Target positioning method and system based on video monitoring |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118071826A CN118071826A (en) | 2024-05-24 |
CN118071826B true CN118071826B (en) | 2024-09-24 |
Family
ID=91110070
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410123391.XA Active CN118071826B (en) | 2024-01-29 | 2024-01-29 | Target positioning method and system based on video monitoring |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118071826B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112801900A (en) * | 2021-01-21 | 2021-05-14 | 北京航空航天大学 | Video blur removing method for generating countermeasure network based on bidirectional cyclic convolution |
CN113643217A (en) * | 2021-10-15 | 2021-11-12 | 广州市玄武无线科技股份有限公司 | Video motion blur removing method and device, terminal equipment and readable storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115082326A (en) * | 2021-03-12 | 2022-09-20 | 成都鼎桥通信技术有限公司 | Processing method for deblurring video, edge computing equipment and central processor |
CN114170269B (en) * | 2021-11-18 | 2024-04-12 | 安徽清新互联信息科技有限公司 | Multi-target tracking method, equipment and storage medium based on space-time correlation |
CN114897932B (en) * | 2022-03-31 | 2024-07-19 | 北京航天飞腾装备技术有限责任公司 | Infrared target tracking realization method based on feature and gray level fusion |
Also Published As
Publication number | Publication date |
---|---|
CN118071826A (en) | 2024-05-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107945111B (en) | Image stitching method based on SURF (speeded up robust features) feature extraction and CS-LBP (local binary Pattern) descriptor | |
CN110866480A (en) | Object tracking method and device, storage medium and electronic device | |
CN111582054B (en) | Point cloud data processing method and device and obstacle detection method and device | |
CN114764868A (en) | Image processing method, image processing device, electronic equipment and computer readable storage medium | |
CN101860729A (en) | Target tracking method for omnidirectional vision | |
JP2008518331A (en) | Understanding video content through real-time video motion analysis | |
CN109711241B (en) | Object detection method and device and electronic equipment | |
CN113407027B (en) | Pose acquisition method and device, electronic equipment and storage medium | |
CN113724379B (en) | Three-dimensional reconstruction method and device for fusing image and laser point cloud | |
CN111415300A (en) | Splicing method and system for panoramic image | |
JP2011060282A (en) | Method and system for motion detection using nonlinear smoothing of motion field | |
Li et al. | Coarse-to-fine PatchMatch for dense correspondence | |
WO2022233252A1 (en) | Image processing method and apparatus, and computer device and storage medium | |
CN116012515A (en) | Neural radiation field network training method and related equipment | |
CN114612545A (en) | Image analysis method and training method, device, equipment and medium of related model | |
CN117011137B (en) | Image stitching method, device and equipment based on RGB similarity feature matching | |
CN118071826B (en) | Target positioning method and system based on video monitoring | |
CN115294358A (en) | Feature point extraction method and device, computer equipment and readable storage medium | |
US11227166B2 (en) | Method and device for evaluating images, operating assistance method, and operating device | |
CN116711295A (en) | Image processing method and apparatus | |
CN112529943A (en) | Object detection method, object detection device and intelligent equipment | |
WO2020113419A1 (en) | Image processing method and device | |
Bi | [Retracted] A Motion Image Pose Contour Extraction Method Based on B‐Spline Wavelet | |
Xu et al. | Application of Discrete Mathematical Model in Edge Distortion Correction of Moving Image | |
KR20140122981A (en) | Method or providing visual tracking from video by learning and unlearning with dual modeling, and computer-readable recording medium for the same |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | ||
Effective date of registration: 20240904 Address after: 510000 1310, No. 157, Linhe West Road, Tianhe District, Guangzhou City, Guangdong Province (office use only) Applicant after: GUANGDONG JINHUI COMMUNICATION TECHNOLOGY Co.,Ltd. Country or region after: China Address before: Room 1908, 19/F, 1030 Fenghuang South Road, Zhuhai, Guangdong 519000 Applicant before: Guangdong Yonghao Information Technology Co.,Ltd. Country or region before: China |
GR01 | Patent grant | ||