
CN114882403A - Video space-time action positioning method based on progressive attention hypergraph - Google Patents

Video space-time action positioning method based on progressive attention hypergraph Download PDF

Info

Publication number
CN114882403A
Authority
CN
China
Prior art keywords
target
video
space
time
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210481572.0A
Other languages
Chinese (zh)
Other versions
CN114882403B (en)
Inventor
叶兴超
李平
曹佳晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202210481572.0A priority Critical patent/CN114882403B/en
Publication of CN114882403A publication Critical patent/CN114882403A/en
Application granted granted Critical
Publication of CN114882403B publication Critical patent/CN114882403B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video space-time action positioning method based on a progressive attention hypergraph. First, a given original video is sampled to obtain a frame sequence, and target region features and a video space-time feature map are obtained with convolutional neural networks; target context features and a space-time relation matrix are then obtained through a space-time relation encoder; long-term target first-order features are generated with a progressive variable-length sliding window module; meanwhile, short-term target high-order features are obtained through a hypergraph module with shared attribute constraint and diffusion mechanism; finally, a target action regression module outputs the spatial positions and action categories of all targets at different moments. The method can adaptively adjust the window size according to the original action duration to obtain target first-order features consistent with that duration, and can capture the potential relations between targets through the hypergraph module, thereby making more effective use of target interaction relations and improving the accuracy of video space-time action positioning.

Description

Video space-time action positioning method based on progressive attention hypergraph
Technical Field
The invention belongs to the technical field of computer vision, in particular to the field of action localization in video processing, and relates to a video space-time action positioning method based on a progressive attention hypergraph.
Background
Since the rapid rise of the media industry, massive amounts of multimedia data, mainly video, have been generated. Compared with traditional image and text data, video has gradually become the mainstream media form owing to its rich visual content and intuitive form of expression. However, large volumes of video contain a great deal of complex scene information, for example many targets and complex actions. Therefore, how to quickly and accurately identify and locate the action categories of all targets in a complex scene has become an important research direction, namely the Spatio-temporal Action Localization task. The task takes as input a long, untrimmed video that may contain multiple targets, each of which may perform multiple actions, and outputs the spatial position, start and end times and corresponding action category of each action segment in the video; it can therefore be widely applied in practical scenarios such as security monitoring, video content review and traffic safety detection. For example, applying spatio-temporal action localization to a surveillance security system makes it possible to monitor and judge dangerous actions of all targets within range in real time and to raise an alarm, thereby helping to strengthen public security; applying it to a video content review system makes it possible to effectively mark and screen out illegal video segments, which facilitates manual review and reduces labor costs.
At present, mainstream spatio-temporal action localization methods mainly adopt a two-stage paradigm. In the first stage, Faster R-CNN (Faster Region-based Convolutional Neural Network) and the SlowFast network are used to obtain target region features and a spatio-temporal feature map; the target region features are copied along the spatial dimension so that their size matches the spatio-temporal feature map, and the two are then spliced along the channel dimension to generate initial target first-order relations (the spatial relations between different targets and the temporal relations of the same target). The second stage uses Long-Term Feature Banks (LFB) as a memory module to store historical target first-order features and combines a local attention mechanism to obtain long-term target first-order features. However, when extracting target interaction relations, this approach ignores the influence of potential relations between targets (high-order relations: the spatio-temporal relation established between two targets through a third target) on the judgment result, which causes deviations in the spatio-temporal localization of actions. Subsequent work therefore adopts a Graph Convolutional Network (GCN) and describes the first-order relation between two targets through their common spatial positions, so as to capture the global scene context and characterize the high-order relations between targets as comprehensively as possible.
The shortcomings of these spatio-temporal action localization methods are mainly reflected in three aspects: (1) although a long-term feature bank with a fixed window size can capture long-term target first-order relations well, for actions of short duration an overly large time range in the long-term feature bank causes the model to extract context-irrelevant features, which reduces the accuracy of short-duration action representations; (2) the influence of high-order relations on judging the action category at the current moment decreases as the time interval increases, while the computational cost increases with the time interval, so constructing long-term target high-order relations makes it difficult to meet the high real-time requirements of the model; (3) a traditional graph structure can only represent pairwise relations and can hardly depict the complex and varied high-order relations between targets. Therefore, in view of the reduced confidence of short-term actions caused by an inaccurate capture range of target first-order relations and the high computational overhead caused by an unreasonable description of target high-order relations, there is an urgent need for a spatio-temporal action localization method that can adaptively adjust the window size according to the original action duration and correctly reflect the high-order relations between targets.
Disclosure of Invention
Aiming at the defects of the existing methods, the invention provides a video space-time action positioning method based on a progressive attention hypergraph. To address the differences in action durations, the method adaptively adjusts the window size by constructing a progressive variable-length sliding window module, so as to extract more effective action feature representations; meanwhile, a hypergraph module with shared attribute constraint and diffusion mechanism is designed to improve the ability to describe high-order relations between targets, thereby improving the accuracy of target action recognition.
The method of the invention sequentially performs the following operations on a video data set with given action types and action space-time marks:
Step (1): preprocessing a video to obtain a video frame sequence, and extracting target region features and a video space-time feature map by using two-dimensional and three-dimensional convolutional neural networks;
Step (2): constructing a space-time relation encoder, inputting the target region features and the video space-time feature map, and outputting initial target context features and a space-time relation matrix;
Step (3): constructing a progressive variable-length sliding window module, inputting the video frame sequence, the initial target context features and the space-time relation matrix, and outputting long-term target first-order features;
Step (4): constructing a hypergraph module with shared attribute constraint and diffusion mechanism, inputting the initial target context features and the space-time relation matrix, and outputting short-term target high-order features;
Step (5): constructing a target action regression module, inputting the long-term target first-order features and the short-term target high-order features, and outputting the spatial positions and action categories of all targets at the current moment;
Step (6): optimizing the space-time action positioning model consisting of the space-time relation encoder, the progressive variable-length sliding window module, the hypergraph module and the target action regression module with a stochastic gradient descent algorithm, and sequentially executing steps (1) to (5) on a new video sequence to obtain the spatial positions and action categories of all targets at different moments.
Further, the step (1) is specifically:
(1-1) sampling the original video at a sampling rate of N frames per second to obtain a video frame sequence set 𝒰 = {U_s} ⊂ ℝ^{H′×W′×3}, where ℝ denotes the real number field, U_s denotes the s-th frame, H′ and W′ denote the height and width of a video frame, 3 corresponds to the three RGB channels, and N is taken in the range 5–10;
(1-2) dividing the original video frame sequence into T video segments 𝒱 = {V_t}_{t=1}^{T}, each segment consisting of 2×N frames, where V_t denotes the t-th video segment; then inputting the t-th video segment into a three-dimensional convolutional neural network to generate the t-th video segment space-time feature map X_t ∈ ℝ^{H×W×C}, where H, W and C are the height, width and number of channels of the feature map and t = 1, 2, ..., T; the space-time feature maps of all video segments are obtained in the same way;
(1-3) performing target detection on the intermediate frame of the t-th video segment V_t with a target detection model based on a two-dimensional convolutional neural network to obtain the target bounding box sequence set B_t = {b_t^{i,β}}, i = 1, 2, ..., N_t, where N_t denotes the number of targets present in the intermediate frame of the t-th video segment and β ∈ {0, 1}: β = 0 indicates that the target is the bounding box of a person and β = 1 that the target is the bounding box of an object; b_t^i = (x_t^{i,1}, y_t^{i,1}, x_t^{i,2}, y_t^{i,2}) denotes the bounding box of the i-th target of the intermediate frame of the t-th video segment, where x_t^{i,1} and y_t^{i,1} are the abscissa and ordinate of the upper-left corner of the i-th target bounding box of the t-th video segment, and x_t^{i,2} and y_t^{i,2} are the abscissa and ordinate of its lower-right corner;
(1-4) scaling the target bounding box b_t^i to obtain the corresponding target bounding box on the video segment space-time feature map, obtaining the t-th video segment target feature map F_t^i ∈ ℝ^{H″×W″×C} by bilinear interpolation, and performing a Global Average Pooling (GAP) operation to obtain the target feature f_t^i ∈ ℝ^{1×1×C}, where H″, W″ and C are the height, width and number of channels of the target feature map.
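For illustration only, the following minimal PyTorch-style sketch shows how the frame sampling, segment splitting and segment-level feature extraction of step (1) could be organized; the toy 3D backbone, the tensor sizes and the names (Tiny3DBackbone, split_into_segments) are assumptions rather than the reference implementation of the invention.

```python
import torch
import torch.nn as nn

class Tiny3DBackbone(nn.Module):
    """Toy stand-in for the 3D CNN that produces the segment space-time feature map X_t."""
    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.conv = nn.Conv3d(3, out_channels, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d((1, 16, 16))   # collapse time, keep a 16x16 spatial map

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: [T, 3, 2N, H', W'] -> X_t: [T, C, H, W]
        return self.pool(self.conv(clip)).squeeze(2)

def split_into_segments(frames: torch.Tensor, n_fps: int) -> torch.Tensor:
    """frames: [S, 3, H', W'] sampled at N fps; returns [T, 3, 2N, H', W']."""
    seg_len = 2 * n_fps
    t = frames.shape[0] // seg_len
    frames = frames[: t * seg_len]
    return frames.view(t, seg_len, *frames.shape[1:]).permute(0, 2, 1, 3, 4)

if __name__ == "__main__":
    frames = torch.randn(80, 3, 64, 64)                # e.g. 10 s of video sampled at N = 8 fps
    segments = split_into_segments(frames, n_fps=8)    # [5, 3, 16, 64, 64]
    backbone = Tiny3DBackbone(out_channels=64)
    x = backbone(segments)                             # [5, 64, 16, 16], one X_t per segment
    print(x.shape)
```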
Still further, the step (2) is specifically:
(2-1) constructing a space-time relation encoder consisting of three fully-connected layers; the i-th target feature f_t^i of the t-th video segment is input into the three fully-connected layers to obtain the query feature q_t^i ∈ ℝ^d, the key feature k_t^i ∈ ℝ^d and the value feature v_t^i ∈ ℝ^C, where d denotes the number of channels of the query feature and the key feature and d < C; the key feature k_t^j and value feature v_t^j corresponding to the j-th target feature f_t^j of the t-th video segment are obtained in the same way;
(2-2) calculating the space-time relation weight between target i and target j, m_t^{i,j} = Softmax(⟨q_t^i, k_t^j⟩), and generating the space-time relation matrix M_t ∈ ℝ^{N_t×N_t} of all targets of the t-th video segment, where Softmax(·) denotes the Softmax function and ⟨·,·⟩ denotes the inner product; computing the enhanced target region feature a_t^i = Σ_j m_t^{i,j}·v_t^j, and copying a_t^i along the spatial dimension so that its size is consistent with that of the space-time feature map X_t of video segment t, thereby obtaining the target global spatial feature Ā_t^i ∈ ℝ^{H×W×C};
(2-3) splicing the target global spatial feature Ā_t^i with the video space-time feature map X_t along the channel dimension and passing the result through a two-dimensional convolution layer to obtain the target initial context feature A_t^i = Conv2D_1(Ā_t^i ∥ X_t) ∈ ℝ^{H×W×C}, where Conv2D_1(·) denotes a two-dimensional convolution layer with input channels C′ = 2·C, output channels C and convolution kernel size 1×1×C′, and ∥ denotes channel splicing.
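A minimal sketch of the space-time relation encoder of step (2), assuming a PyTorch implementation; the module name RelationEncoder and the tensor shapes are illustrative, and the attention-weighted sum follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationEncoder(nn.Module):
    """Toy version of the three-FC-layer encoder that yields M_t and the initial context features A_t^i."""
    def __init__(self, c: int = 256, d: int = 64):
        super().__init__()
        self.to_q = nn.Linear(c, d)
        self.to_k = nn.Linear(c, d)
        self.to_v = nn.Linear(c, c)
        self.conv1 = nn.Conv2d(2 * c, c, kernel_size=1)    # Conv2D_1: C' = 2C -> C

    def forward(self, f_t: torch.Tensor, x_t: torch.Tensor):
        # f_t: [N_t, C] target features; x_t: [C, H, W] segment space-time feature map
        q, k, v = self.to_q(f_t), self.to_k(f_t), self.to_v(f_t)
        m_t = F.softmax(q @ k.t(), dim=-1)                 # [N_t, N_t] space-time relation matrix M_t
        enhanced = m_t @ v                                 # [N_t, C] attention-weighted value features
        h, w = x_t.shape[1:]
        broadcast = enhanced[:, :, None, None].expand(-1, -1, h, w)           # copy along spatial dims
        stacked = torch.cat([broadcast, x_t.unsqueeze(0).expand(len(f_t), -1, -1, -1)], dim=1)
        a_t = self.conv1(stacked)                          # [N_t, C, H, W] initial context features A_t^i
        return m_t, a_t

# usage: m, a = RelationEncoder()(torch.randn(4, 256), torch.randn(256, 16, 16))
```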
Further, the step (3) is specifically:
(3-1) the progressive variable-length sliding window module consists of an action auxiliary judgment submodule and a target first-order feature library; the action auxiliary judgment submodule performs coarse-grained action judgment using histogram similarity, and the target first-order feature library stores the target initial context features of all historical video segments, yielding the historical target context feature set 𝒜 = {A_φ^i}, where A_φ^i denotes the i-th target initial context feature of the φ-th video segment;
(3-2) converting the intermediate frame of the current t-th video segment V_t into an RGB histogram matrix Z_t ∈ ℝ^{3×256}, and converting the intermediate frame of the video segment V_{t−1} at time t−1 into an RGB histogram matrix Z_{t−1} ∈ ℝ^{3×256}, where 3 corresponds to the RGB channels and 256 to the number of luminance levels;
(3-3) using the RGB histogram matrices Z_t and Z_{t−1}, calculating the histogram similarity ρ_{t,t−1} between the intermediate frames of the adjacent video segments from the per-channel, per-luminance pixel counts, where n_{t,λ}^R and n_{t−1,λ}^R denote the numbers of pixels whose channel is R and whose luminance is λ in the intermediate frames of the t-th and (t−1)-th video segments, n_{t,λ}^G and n_{t−1,λ}^G denote the corresponding numbers for channel G, n_{t,λ}^B and n_{t−1,λ}^B denote the corresponding numbers for channel B, and 0 ≤ λ ≤ 255; according to the histogram similarity ρ_{t,t−1}, calculating the number τ_t of video segments similar to the t-th video segment, where δ (0 < δ < 1) is a threshold constant;
(3-4) performing steps (3-2) and (3-3) on all video segments to obtain the similar-segment-number vector τ = [τ_1, τ_2, ..., τ_T]; the window size is set to ω = min(τ_t, L_1), and the historical target initial context feature set {A_φ^i | φ ∈ [t−ω, t)} within the time window [t−ω, t) is read from the feature library, where L_1 is a preset maximum window size and min(·,·) takes the minimum value;
(3-5) using the target space-time relation matrices M_t and M_{t−1}, calculating the similarity E_{t,t−1} between the t-th and (t−1)-th video segments; E_{t,t−2}, ..., E_{t,t−ω} are calculated in the same way; the similarity values are sorted in descending order to obtain the first α historical video segments most similar to the t-th video segment, and channel splicing is performed on the target initial context features corresponding to these video segments to obtain the target associated space-time feature Y_t^i ∈ ℝ^{H×W×C″}, where the channel number C″ = α·C; Y_t^i is then input into a two-dimensional convolution layer to obtain the long-term target first-order feature consistent with the original action duration, S_t^i = Conv2D_2(Y_t^i) ∈ ℝ^{H×W×C}, where Conv2D_2(·) denotes a two-dimensional convolution layer with input channels C″ = α·C, output channels C and convolution kernel size 1×1×C″.
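The coarse-grained histogram judgment of steps (3-2) to (3-4) could be sketched as follows; the normalized histogram-intersection similarity and the way τ_t counts consecutive similar segments are assumptions, since the exact expressions are given by the equations of the invention.

```python
import torch

def rgb_histogram(frame: torch.Tensor) -> torch.Tensor:
    """frame: [3, H, W] with integer values 0..255 -> Z: [3, 256] per-channel luminance counts."""
    return torch.stack([torch.bincount(frame[c].flatten().long(), minlength=256) for c in range(3)]).float()

def histogram_similarity(z_a: torch.Tensor, z_b: torch.Tensor) -> float:
    """Normalized histogram intersection; an assumed stand-in for the patent's rho_{t,t-1}."""
    return (torch.minimum(z_a, z_b).sum() / torch.maximum(z_a, z_b).sum().clamp(min=1)).item()

def window_size(similarities: list, delta: float, l_max: int) -> int:
    """tau_t is assumed here to count how many consecutive preceding segments stay above the
    threshold delta; the window is omega = min(tau_t, L_1)."""
    tau = 0
    for rho in reversed(similarities):      # similarities[-1] compares segment t with t-1, etc.
        if rho < delta:
            break
        tau += 1
    return min(tau, l_max)

# usage:
frames = [torch.randint(0, 256, (3, 64, 64)) for _ in range(4)]
hists = [rgb_histogram(f) for f in frames]
sims = [histogram_similarity(hists[i], hists[i + 1]) for i in range(3)]
omega = window_size(sims, delta=0.6, l_max=32)
```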
Still further, the step (4) is specifically:
(4-1) constructing a hypergraph module with shared attribute constraint and diffusion mechanism by using the relative spatial positions of the targets and the target attributes (person or object): first, the Euclidean distance dist(c_t^i, c_t^j) between target i and every other target j of the intermediate frame of the t-th video segment is calculated, where c_t^i and c_t^j are the center-position coordinates of the bounding boxes of target i and target j of the intermediate frame of the t-th video segment and dist(·,·) denotes the Euclidean distance; the distances between target i and all other targets of the current frame are calculated to obtain the constraint set Ω_t^i of target i, i.e., the set of targets whose distance to target i is less than δ′, where δ′ > 0 is a threshold constant;
(4-2) within the constraint set Ω_t^i, constructing high-order relations between target i and the other targets in the set, where the space-time relation established between target i and target j through a target r is denoted R(i, r, j), target i and target j are persons, target r is a person or an object, and i ≠ j; specifically: according to the space-time relation matrix M_t obtained in step (2-2), targets i and j are associated through the same target r, whose spatial position is given by its bounding box b_t^r; the first-order relation between a target and the common target r is expressed as an inter-target first-order feature, which is generated by cropping the i-th target initial context feature A_t^i according to the bounding box b_t^r of target r and applying a bilinear interpolation operation to obtain the first-order feature G_t^{i,r} of targets i and r; the first-order feature G_t^{j,r} of targets j and r is obtained in the same way; the high-order feature of targets i, j relative to target r is then generated as h_t^{i,j,r} = Conv2D_3(G_t^{i,r} ∥ G_t^{j,r}), and the set of high-order features related to target i is written ψ_i, where Conv2D_3(·) is a two-dimensional convolution layer with input channels C‴ = 2·C, output channels C and convolution kernel size 1×1×C‴;
(4-3) using the high-order feature set ψ_i associated with target i, calculating the short-term target high-order feature O_t^i by aggregating the high-order features in ψ_i.
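A sketch of the shared-attribute constraint and of one high-order feature of step (4), assuming a PyTorch implementation; the crop-and-interpolate step is approximated with torchvision's roi_align, and the names constraint_set and HighOrderFeatures are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

def constraint_set(centers: torch.Tensor, i: int, delta_prime: float) -> list:
    """centers: [N_t, 2] bounding-box centers; returns indices j whose distance to target i is < delta'."""
    dists = torch.cdist(centers[i : i + 1], centers)[0]
    return [j for j in range(len(centers)) if j != i and dists[j] < delta_prime]

class HighOrderFeatures(nn.Module):
    def __init__(self, c: int = 256):
        super().__init__()
        self.conv3 = nn.Conv2d(2 * c, c, kernel_size=1)    # Conv2D_3: C''' = 2C -> C

    def first_order(self, a_i: torch.Tensor, box_r: torch.Tensor, out_size: int = 7) -> torch.Tensor:
        """Crop the context feature A_t^i with target r's box (assumed to be in feature-map
        coordinates) and resample bilinearly."""
        rois = torch.cat([torch.zeros(1, 1), box_r.view(1, 4)], dim=1)
        return roi_align(a_i.unsqueeze(0), rois, output_size=out_size, aligned=True)[0]

    def forward(self, a_i, a_j, box_r):
        g_ir = self.first_order(a_i, box_r)                # first-order feature of (i, r)
        g_jr = self.first_order(a_j, box_r)                # first-order feature of (j, r)
        return self.conv3(torch.cat([g_ir, g_jr], dim=0).unsqueeze(0))[0]   # high-order feature wrt r

# short-term high-order feature for target i, assuming a mean over its set psi_i:
# o_i = torch.stack(psi_i, dim=0).mean(dim=0)
```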
Still further, the step (5) is specifically:
(5-1) inputting the long-term target first-order feature S_t^i and the short-term target high-order feature O_t^i into the target action regression module to obtain the target localization and action judgment, specifically: first, S_t^i and O_t^i are summed element by element to obtain the element-wise sum feature D_t^i = S_t^i ⊕ O_t^i, where ⊕ denotes the element-wise sum; D_t^i is then input into a two-dimensional convolution layer and global average pooling is performed along the spatial dimensions to obtain the target classification score (logits) s_t^i = GAP(Conv2D_4(D_t^i)) ∈ ℝ^K, where K denotes the number of action categories, Conv2D_4(·) denotes a two-dimensional convolution layer with input channels C, output channels K and convolution kernel size 1×1×C, and GAP(·) denotes global average pooling over the spatial dimensions;
(5-2) processing the target classification score s_t^i with the Softmax function to obtain the output probability that the action category at time t is u, p_t^{i,u} = e^{s_t^{i,u}} / Σ_{k=1}^{K} e^{s_t^{i,k}}, where e denotes the natural base;
(5-3) passing the element-wise sum feature D_t^i through two two-dimensional convolution layers to obtain the target spatial position feature P_t^i = Conv2D_6(Conv2D_5(D_t^i)), where Conv2D_5(·) denotes a two-dimensional convolution layer with input channels C, output channels 256 and convolution kernel size 3×3×C, and Conv2D_6(·) denotes a two-dimensional convolution layer with input channels 256, output channels 4 and convolution kernel size 1×1×256;
(5-4) passing the target spatial position feature P_t^i through a fully connected layer to obtain the predicted target bounding box b̂_t^i = (x̂_t^{i,1}, ŷ_t^{i,1}, x̂_t^{i,2}, ŷ_t^{i,2}), where x̂_t^{i,1} and ŷ_t^{i,1} are the abscissa and ordinate of the upper-left corner of the predicted target bounding box, and x̂_t^{i,2} and ŷ_t^{i,2} are the abscissa and ordinate of its lower-right corner.
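An illustrative sketch of the target action regression module of step (5); the channel sizes follow the description above (C to K for classification, C to 256 to 4 for the position branch), while the padding choice and the fully connected bounding-box head are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ActionRegressionHead(nn.Module):
    def __init__(self, c: int = 256, num_classes: int = 80, spatial: int = 7):
        super().__init__()
        self.conv4 = nn.Conv2d(c, num_classes, kernel_size=1)       # Conv2D_4
        self.conv5 = nn.Conv2d(c, 256, kernel_size=3, padding=1)    # Conv2D_5
        self.conv6 = nn.Conv2d(256, 4, kernel_size=1)               # Conv2D_6
        self.fc = nn.Linear(4 * spatial * spatial, 4)               # FC layer producing the box

    def forward(self, s_long: torch.Tensor, o_short: torch.Tensor):
        d = s_long + o_short                                        # element-wise sum feature D_t^i
        logits = self.conv4(d).mean(dim=(2, 3))                     # GAP over spatial dims -> [N_t, K]
        probs = logits.softmax(dim=-1)                              # action-class probabilities p_t^{i,u}
        pos = self.conv6(self.conv5(d))                             # [N_t, 4, H, W] position feature
        boxes = self.fc(pos.flatten(1))                             # predicted (x1, y1, x2, y2)
        return probs, boxes

# usage: probs, boxes = ActionRegressionHead()(torch.randn(4, 256, 7, 7), torch.randn(4, 256, 7, 7))
```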
Continuing further, the step (6) is specifically:
(6-1) constructing a space-time action positioning model consisting of the space-time relation encoder, the progressive variable-length sliding window module, the hypergraph module with shared attribute constraint and diffusion mechanism, and the target action regression module;
(6-2) sampling the training video into a frame sequence, inputting it into the space-time action positioning model to obtain the spatial positions and corresponding action categories of all targets at each moment, and calculating the cross-entropy loss function of the model, L_cls = −Σ_t Σ_i Σ_{u=1}^{K} y_t^{i,u}·log p_t^{i,u}, where y_t^{i,u} is the true label and y_t^{i,u} = 1 indicates that the i-th target of the t-th frame contains an action of category u; calculating the distance cross-correlation (distance-IoU) loss function of the model, L_reg = 1 − IoU(b̂_t^i, b_t^i) + dist²(ĉ_t^i, c_t^i) / dist²((x_t^{i,min}, y_t^{i,min}), (x_t^{i,max}, y_t^{i,max})), where IoU(b̂_t^i, b_t^i) denotes the intersection-over-union of the predicted target bounding box and the real target bounding box, b̂_t^i = (x̂_t^{i,1}, ŷ_t^{i,1}, x̂_t^{i,2}, ŷ_t^{i,2}) is the predicted target bounding box with upper-left corner (x̂_t^{i,1}, ŷ_t^{i,1}) and lower-right corner (x̂_t^{i,2}, ŷ_t^{i,2}), b_t^i is the real target bounding box, ĉ_t^i and c_t^i denote the center-position coordinates of the predicted and the real target bounding boxes, (x_t^{i,min}, y_t^{i,min}) denotes the upper-left corner and (x_t^{i,max}, y_t^{i,max}) the lower-right corner of the smallest bounding box that can enclose both the real bounding box and the predicted bounding box, and max(·,·) denotes taking the maximum value;
(6-3) optimizing the space-time action positioning model with a stochastic gradient descent algorithm and iteratively training the model until convergence to obtain the optimized space-time action positioning model;
(6-4) sampling a new video to obtain a video frame sequence, inputting it into the optimized space-time action positioning model, executing steps (1) to (5) in sequence, and outputting the spatial positions and action categories of all targets of the video segment at the current moment.
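The two training objectives of step (6) could be combined as sketched below; using torchvision's distance_box_iou_loss (available in recent torchvision releases) for the distance cross-correlation term, and summing the two losses without weighting, are assumptions.

```python
import torch
from torchvision.ops import distance_box_iou_loss

def spatiotemporal_losses(probs, labels, pred_boxes, gt_boxes):
    """probs: [N, K] class probabilities, labels: [N] class indices,
    pred_boxes / gt_boxes: [N, 4] as (x1, y1, x2, y2)."""
    cls_loss = torch.nn.functional.nll_loss(torch.log(probs.clamp(min=1e-8)), labels)
    reg_loss = distance_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    return cls_loss + reg_loss   # assumed unweighted sum of the two terms

# one SGD step (model, data batches, etc. are assumed to exist elsewhere):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# loss = spatiotemporal_losses(probs, labels, pred_boxes, gt_boxes)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```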
The invention provides a video space-time action positioning method based on a progressive attention hypergraph, which has the following characteristics: (1) using the idea of progressive cognition theory, the progressive variable-length sliding window module treats actions of different durations differently; compared with the previous fixed long-term feature bank, this avoids, to a certain extent, short-term actions learning context-irrelevant features; (2) the hypergraph module with shared attribute constraint and diffusion mechanism reduces the computational overhead of the model through the shared-attribute and spatial-position constraints between targets, while a more accurate action recognition rate can be obtained by constructing the potential relations between targets; (3) running the hypergraph and the progressive variable-length sliding window strategy in parallel effectively guarantees the real-time performance of the model.
The method is suitable for spatio-temporal action localization tasks with real-time requirements, and has the following advantages: (1) the window size is adaptively adjusted by constructing the progressive variable-length sliding window module, which reduces the learning of redundant features for short-term actions and increases the running speed of the model; (2) applying the shared attribute constraint to the target high-order relations reduces the computational overhead of the model; (3) running the hypergraph module and the progressive variable-length sliding window module in parallel improves the operating efficiency of the model. The progressive attention variable-length sliding window module and the hypergraph module with shared attribute constraint and diffusion mechanism better guarantee the computational efficiency of the model, and the method can be applied to fields such as traffic safety detection and illegal content identification.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, in the video space-time action positioning method based on the progressive attention hypergraph, an original video is uniformly sampled, and target region features and a video space-time feature map are extracted with convolutional neural networks; target context features and a space-time relation matrix are obtained with the space-time relation encoder; the target context features and the space-time relation matrix are input into the progressive variable-length sliding window module to obtain target first-order features consistent with the original action duration; meanwhile, the hypergraph module is constructed, and short-term target high-order features are generated under the shared attribute constraint; finally, the target action regression module yields the spatial positions and action categories of all targets at different moments. The method uses a progressive attention mechanism to adaptively adjust the sliding window size according to the original duration of the action, so as to reduce the possibility that short-term actions learn context-irrelevant features, and depicts the potential high-order relations between targets through the hypergraph module with shared attribute constraint and diffusion mechanism, thereby producing accurate space-time action positioning results.
Given a video data set with action categories and spatio-temporal action annotations, the method of the invention performs the following operations in sequence:
Step (1): preprocessing a video to obtain a video frame sequence, and extracting target region features and a video space-time feature map by using two-dimensional and three-dimensional convolutional neural networks; this comprises the following steps:
(1-1) sampling the original video at a sampling rate of N frames per second to obtain a video frame sequence set 𝒰 = {U_s} ⊂ ℝ^{H′×W′×3}, where ℝ denotes the real number field, U_s denotes the s-th frame, H′ and W′ denote the height and width of a video frame, 3 corresponds to the three RGB channels, and N is taken in the range 5–10;
(1-2) dividing the original video frame sequence into T video segments 𝒱 = {V_t}_{t=1}^{T}, each segment consisting of 2×N frames, where V_t denotes the t-th video segment; then inputting the t-th video segment into a three-dimensional convolutional neural network to generate the t-th video segment space-time feature map X_t ∈ ℝ^{H×W×C}, where H, W and C are the height, width and number of channels of the feature map and t = 1, 2, ..., T; the space-time feature maps of all video segments are obtained in the same way;
(1-3) performing target detection on the intermediate frame of the t-th video segment V_t with a target detection model based on a two-dimensional convolutional neural network to obtain the target bounding box sequence set B_t = {b_t^{i,β}}, i = 1, 2, ..., N_t, where N_t denotes the number of targets present in the intermediate frame of the t-th video segment and β ∈ {0, 1}: β = 0 indicates that the target is the bounding box of a person and β = 1 that the target is the bounding box of an object; b_t^i = (x_t^{i,1}, y_t^{i,1}, x_t^{i,2}, y_t^{i,2}) denotes the bounding box of the i-th target of the intermediate frame of the t-th video segment, where x_t^{i,1} and y_t^{i,1} are the abscissa and ordinate of the upper-left corner of the i-th target bounding box of the t-th video segment, and x_t^{i,2} and y_t^{i,2} are the abscissa and ordinate of its lower-right corner;
(1-4) scaling the target bounding box b_t^i to obtain the corresponding target bounding box on the video segment space-time feature map, obtaining the t-th video segment target feature map F_t^i ∈ ℝ^{H″×W″×C} by bilinear interpolation, and performing a Global Average Pooling (GAP) operation to obtain the target feature f_t^i ∈ ℝ^{1×1×C}, where H″, W″ and C are the height, width and number of channels of the target feature map.
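Step (1-4) above (bounding-box scaling, bilinear cropping and global average pooling) could, for instance, be realized with torchvision's roi_align as sketched below; the output size and the single-image batch handling are assumptions.

```python
import torch
from torchvision.ops import roi_align

def target_features(x_t: torch.Tensor, boxes: torch.Tensor, frame_size: int, out_size: int = 7):
    """x_t: [C, H, W] segment feature map; boxes: [N_t, 4] frame-coordinate boxes (x1, y1, x2, y2).
    Returns F_t: [N_t, C, out_size, out_size] target feature maps and f_t: [N_t, C] pooled features."""
    scale = x_t.shape[-1] / frame_size                      # scaling from frame to feature-map coordinates
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)
    f_maps = roi_align(x_t.unsqueeze(0), rois, output_size=out_size,
                       spatial_scale=scale, aligned=True)   # bilinear interpolation inside RoIAlign
    return f_maps, f_maps.mean(dim=(2, 3))                  # global average pooling -> f_t^i

# usage: F_t, f_t = target_features(torch.randn(256, 16, 16), torch.tensor([[10., 20., 120., 200.]]), 224)
```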
Step (2): constructing a space-time relation encoder, inputting the target region features and the video space-time feature map, and outputting the initial target context features and the space-time relation matrix; this comprises the following steps:
(2-1) constructing a space-time relation encoder consisting of three fully-connected layers; the i-th target feature f_t^i of the t-th video segment is input into the three fully-connected layers to obtain the query feature q_t^i ∈ ℝ^d, the key feature k_t^i ∈ ℝ^d and the value feature v_t^i ∈ ℝ^C, where d denotes the number of channels of the query feature and the key feature and d < C; the key feature k_t^j and value feature v_t^j corresponding to the j-th target feature f_t^j of the t-th video segment are obtained in the same way;
(2-2) calculating the space-time relation weight between target i and target j, m_t^{i,j} = Softmax(⟨q_t^i, k_t^j⟩), and generating the space-time relation matrix M_t ∈ ℝ^{N_t×N_t} of all targets of the t-th video segment, where Softmax(·) denotes the Softmax function and ⟨·,·⟩ denotes the inner product; computing the enhanced target region feature a_t^i = Σ_j m_t^{i,j}·v_t^j, and copying a_t^i along the spatial dimension so that its size is consistent with that of the space-time feature map X_t of video segment t, thereby obtaining the target global spatial feature Ā_t^i ∈ ℝ^{H×W×C};
(2-3) splicing the target global spatial feature Ā_t^i with the video space-time feature map X_t along the channel dimension and passing the result through a two-dimensional convolution layer to obtain the target initial context feature A_t^i = Conv2D_1(Ā_t^i ∥ X_t) ∈ ℝ^{H×W×C}, where Conv2D_1(·) denotes a two-dimensional convolution layer with input channels C′ = 2·C, output channels C and convolution kernel size 1×1×C′, and ∥ denotes channel splicing.
Step (3): constructing a progressive variable-length sliding window module, inputting the video frame sequence, the initial target context features and the space-time relation matrix, and outputting the long-term target first-order features; this comprises the following steps:
(3-1) the progressive variable-length sliding window module consists of an action auxiliary judgment submodule and a target first-order feature library; the action auxiliary judgment submodule performs coarse-grained action judgment using histogram similarity, and the target first-order feature library stores the target initial context features of all historical video segments, yielding the historical target context feature set 𝒜 = {A_φ^i}, where A_φ^i denotes the i-th target initial context feature of the φ-th video segment;
(3-2) converting the intermediate frame of the current t-th video segment V_t into an RGB histogram matrix Z_t ∈ ℝ^{3×256}, and converting the intermediate frame of the video segment V_{t−1} at time t−1 into an RGB histogram matrix Z_{t−1} ∈ ℝ^{3×256}, where 3 corresponds to the RGB channels and 256 to the number of luminance levels;
(3-3) using the RGB histogram matrices Z_t and Z_{t−1}, calculating the histogram similarity ρ_{t,t−1} between the intermediate frames of the adjacent video segments from the per-channel, per-luminance pixel counts, where n_{t,λ}^R and n_{t−1,λ}^R denote the numbers of pixels whose channel is R and whose luminance is λ in the intermediate frames of the t-th and (t−1)-th video segments, n_{t,λ}^G and n_{t−1,λ}^G denote the corresponding numbers for channel G, n_{t,λ}^B and n_{t−1,λ}^B denote the corresponding numbers for channel B, and 0 ≤ λ ≤ 255; according to the histogram similarity ρ_{t,t−1}, calculating the number τ_t of video segments similar to the t-th video segment, where δ (0 < δ < 1) is a threshold constant;
(3-4) performing steps (3-2) and (3-3) on all video segments to obtain the similar-segment-number vector τ = [τ_1, τ_2, ..., τ_T]; the window size is set to ω = min(τ_t, L_1), and the historical target initial context feature set {A_φ^i | φ ∈ [t−ω, t)} within the time window [t−ω, t) is read from the feature library, where L_1 is a preset maximum window size and min(·,·) takes the minimum value;
(3-5) using the target space-time relation matrices M_t and M_{t−1}, calculating the similarity E_{t,t−1} between the t-th and (t−1)-th video segments; E_{t,t−2}, ..., E_{t,t−ω} are calculated in the same way; the similarity values are sorted in descending order to obtain the first α historical video segments most similar to the t-th video segment, and channel splicing is performed on the target initial context features corresponding to these video segments to obtain the target associated space-time feature Y_t^i ∈ ℝ^{H×W×C″}, where the channel number C″ = α·C; Y_t^i is then input into a two-dimensional convolution layer to obtain the long-term target first-order feature consistent with the original action duration, S_t^i = Conv2D_2(Y_t^i) ∈ ℝ^{H×W×C}, where Conv2D_2(·) denotes a two-dimensional convolution layer with input channels C″ = α·C, output channels C and convolution kernel size 1×1×C″.
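For step (3-5), one plausible way to score segment similarity from the relation matrices and to fuse the top-α context features is sketched below; the Frobenius-norm-based similarity is an assumption standing in for the exact expression of E_{t,t-1}, and the matrices are assumed to share the same target count.

```python
import torch
import torch.nn as nn

def relation_similarity(m_a: torch.Tensor, m_b: torch.Tensor) -> float:
    """Assumed similarity between two relation matrices, based on the norm of their difference."""
    return (1.0 / (1.0 + torch.linalg.norm(m_a - m_b))).item()

def fuse_top_alpha(context_feats: list, sims: list, alpha: int, conv2: nn.Conv2d) -> torch.Tensor:
    """context_feats: list of A_phi^i tensors [C, H, W] inside the window, sims: matching E_{t,phi} scores.
    Channel-concatenate the alpha most similar ones and fuse with Conv2D_2 (alpha*C -> C)."""
    order = sorted(range(len(sims)), key=lambda k: sims[k], reverse=True)[:alpha]
    stacked = torch.cat([context_feats[k] for k in order], dim=0).unsqueeze(0)   # [1, alpha*C, H, W]
    return conv2(stacked)[0]                                                     # long-term first-order feature S_t^i

# usage:
alpha, c = 3, 256
conv2 = nn.Conv2d(alpha * c, c, kernel_size=1)
feats = [torch.randn(c, 16, 16) for _ in range(5)]
sims = [relation_similarity(torch.randn(4, 4), torch.randn(4, 4)) for _ in range(5)]
s_long = fuse_top_alpha(feats, sims, alpha, conv2)      # [256, 16, 16]
```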
Step (4): constructing a hypergraph module with shared attribute constraint and diffusion mechanism, inputting the initial target context features and the space-time relation matrix, and outputting the short-term target high-order features; this comprises the following steps:
(4-1) constructing a hypergraph module with shared attribute constraint and diffusion mechanism by using the relative spatial positions of the targets and the target attributes (person or object): first, the Euclidean distance dist(c_t^i, c_t^j) between target i and every other target j of the intermediate frame of the t-th video segment is calculated, where c_t^i and c_t^j are the center-position coordinates of the bounding boxes of target i and target j of the intermediate frame of the t-th video segment and dist(·,·) denotes the Euclidean distance; the distances between target i and all other targets of the current frame are calculated to obtain the constraint set Ω_t^i of target i, i.e., the set of targets whose distance to target i is less than δ′, where δ′ > 0 is a threshold constant;
(4-2) within the constraint set Ω_t^i, constructing high-order relations between target i and the other targets in the set, where the space-time relation established between target i and target j through a target r is denoted R(i, r, j), target i and target j are persons, target r is a person or an object, and i ≠ j; specifically: according to the space-time relation matrix M_t obtained in step (2-2), targets i and j are associated through the same target r, whose spatial position is given by its bounding box b_t^r; the first-order relation between a target and the common target r is expressed as an inter-target first-order feature, which is generated by cropping the i-th target initial context feature A_t^i according to the bounding box b_t^r of target r and applying a bilinear interpolation operation to obtain the first-order feature G_t^{i,r} of targets i and r; the first-order feature G_t^{j,r} of targets j and r is obtained in the same way; the high-order feature of targets i, j relative to target r is then generated as h_t^{i,j,r} = Conv2D_3(G_t^{i,r} ∥ G_t^{j,r}), and the set of high-order features related to target i is written ψ_i, where Conv2D_3(·) is a two-dimensional convolution layer with input channels C‴ = 2·C, output channels C and convolution kernel size 1×1×C‴;
(4-3) using the high-order feature set ψ_i associated with target i, calculating the short-term target high-order feature O_t^i by aggregating the high-order features in ψ_i.
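Putting step (4) together, the loop below sketches how the set ψ_i might be assembled over the constraint set and averaged into the short-term high-order feature O_t^i; the iteration over triples, the mean aggregation and the zero fallback are assumptions, and the diffusion mechanism is not modeled here. high_order_module is assumed to behave like the HighOrderFeatures sketch given with step (4) above.

```python
import torch

def short_term_high_order(a_feats, boxes, centers, person_mask, i, delta_prime, high_order_module):
    """a_feats: list of A_t^k context features [C, H, W]; boxes: [N_t, 4] float, feature-map coordinates;
    centers: [N_t, 2]; person_mask: [N_t] booleans (True = person)."""
    dists = torch.cdist(centers[i : i + 1], centers)[0]
    omega_i = [j for j in range(len(centers)) if j != i and dists[j] < delta_prime]
    psi_i = []
    for j in omega_i:
        if not (person_mask[i] and person_mask[j]):        # i and j must both be persons
            continue
        for r in omega_i:
            if r in (i, j):
                continue                                   # r is any third target (person or object)
            psi_i.append(high_order_module(a_feats[i], a_feats[j], boxes[r]))
    if not psi_i:                                          # no valid triple: fall back to a zero feature
        c = a_feats[i].shape[0]
        return torch.zeros(c, 7, 7)                        # 7 matches the assumed crop size of the module
    return torch.stack(psi_i).mean(dim=0)                  # assumed mean aggregation -> O_t^i
```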
Step (5): constructing a target action regression module, inputting the long-term target first-order features and the short-term target high-order features, and outputting the spatial positions and action categories of all targets at the current moment; this comprises the following steps:
(5-1) inputting the long-term target first-order feature S_t^i and the short-term target high-order feature O_t^i into the target action regression module to obtain the target localization and action judgment, specifically: first, S_t^i and O_t^i are summed element by element to obtain the element-wise sum feature D_t^i = S_t^i ⊕ O_t^i, where ⊕ denotes the element-wise sum; D_t^i is then input into a two-dimensional convolution layer and global average pooling is performed along the spatial dimensions to obtain the target classification score (logits) s_t^i = GAP(Conv2D_4(D_t^i)) ∈ ℝ^K, where K denotes the number of action categories, Conv2D_4(·) denotes a two-dimensional convolution layer with input channels C, output channels K and convolution kernel size 1×1×C, and GAP(·) denotes global average pooling over the spatial dimensions;
(5-2) processing the target classification score s_t^i with the Softmax function to obtain the output probability that the action category at time t is u, p_t^{i,u} = e^{s_t^{i,u}} / Σ_{k=1}^{K} e^{s_t^{i,k}}, where e denotes the natural base;
(5-3) passing the element-wise sum feature D_t^i through two two-dimensional convolution layers to obtain the target spatial position feature P_t^i = Conv2D_6(Conv2D_5(D_t^i)), where Conv2D_5(·) denotes a two-dimensional convolution layer with input channels C, output channels 256 and convolution kernel size 3×3×C, and Conv2D_6(·) denotes a two-dimensional convolution layer with input channels 256, output channels 4 and convolution kernel size 1×1×256;
(5-4) passing the target spatial position feature P_t^i through a fully connected layer to obtain the predicted target bounding box b̂_t^i = (x̂_t^{i,1}, ŷ_t^{i,1}, x̂_t^{i,2}, ŷ_t^{i,2}), where x̂_t^{i,1} and ŷ_t^{i,1} are the abscissa and ordinate of the upper-left corner of the predicted target bounding box, and x̂_t^{i,2} and ŷ_t^{i,2} are the abscissa and ordinate of its lower-right corner.
Step (6): optimizing the space-time action positioning model consisting of the space-time relation encoder, the progressive variable-length sliding window module, the hypergraph module and the target action regression module with a stochastic gradient descent algorithm, and sequentially executing steps (1) to (5) on a new video sequence to obtain the spatial positions and action categories of all targets at different moments; this comprises the following steps:
(6-1) constructing a space-time action positioning model consisting of the space-time relation encoder, the progressive variable-length sliding window module, the hypergraph module with shared attribute constraint and diffusion mechanism, and the target action regression module;
(6-2) sampling the training video into a frame sequence, inputting it into the space-time action positioning model to obtain the spatial positions and corresponding action categories of all targets at each moment, and calculating the cross-entropy loss function of the model, L_cls = −Σ_t Σ_i Σ_{u=1}^{K} y_t^{i,u}·log p_t^{i,u}, where y_t^{i,u} is the true label and y_t^{i,u} = 1 indicates that the i-th target of the t-th frame contains an action of category u; calculating the distance cross-correlation (distance-IoU) loss function of the model, L_reg = 1 − IoU(b̂_t^i, b_t^i) + dist²(ĉ_t^i, c_t^i) / dist²((x_t^{i,min}, y_t^{i,min}), (x_t^{i,max}, y_t^{i,max})), where IoU(b̂_t^i, b_t^i) denotes the intersection-over-union of the predicted target bounding box and the real target bounding box, b̂_t^i = (x̂_t^{i,1}, ŷ_t^{i,1}, x̂_t^{i,2}, ŷ_t^{i,2}) is the predicted target bounding box with upper-left corner (x̂_t^{i,1}, ŷ_t^{i,1}) and lower-right corner (x̂_t^{i,2}, ŷ_t^{i,2}), b_t^i is the real target bounding box, ĉ_t^i and c_t^i denote the center-position coordinates of the predicted and the real target bounding boxes, (x_t^{i,min}, y_t^{i,min}) denotes the upper-left corner and (x_t^{i,max}, y_t^{i,max}) the lower-right corner of the smallest bounding box that can enclose both the real bounding box and the predicted bounding box, and max(·,·) denotes taking the maximum value;
(6-3) optimizing the space-time action positioning model with a stochastic gradient descent algorithm and iteratively training the model until convergence to obtain the optimized space-time action positioning model;
(6-4) sampling a new video to obtain a video frame sequence, inputting it into the optimized space-time action positioning model, executing steps (1) to (5) in sequence, and outputting the spatial positions and action categories of all targets of the video segment at the current moment.
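Finally, the training procedure of step (6) could be wrapped as sketched below; model, train_loader and losses_fn follow the earlier sketches and are assumptions rather than the reference code of the invention.

```python
import torch

def train(model, train_loader, losses_fn, epochs: int = 10, lr: float = 0.01):
    """Stochastic-gradient-descent optimization of the space-time action positioning model."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):                                # in practice, iterate until convergence
        for frames, labels, gt_boxes in train_loader:      # assumed (frame sequence, labels, boxes) batches
            probs, pred_boxes = model(frames)              # steps (1)-(5) applied to the sampled frames
            loss = losses_fn(probs, labels, pred_boxes, gt_boxes)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

# inference on a new video: sample frames, run the optimized model, read off boxes and action classes
# probs, boxes = model(new_frames)
```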
The embodiment described above is only an example of an implementation of the inventive concept, and the protection scope of the present invention should not be regarded as limited to the specific form set forth in the embodiment; the protection scope of the present invention also extends to equivalent technical means that can be conceived by those skilled in the art according to the inventive concept.

Claims (7)

1. The video space-time action positioning method based on the progressive attention hypergraph is characterized in that the method sequentially performs the following operations on a given action type and action space-time marked video data set:
step (1) preprocessing a video to obtain a video frame sequence, and extracting target region features and a video space-time feature map by using two-dimensional and three-dimensional convolutional neural networks;
step (2) constructing a space-time relation encoder, inputting the target region features and the video space-time feature map, and outputting initial target context features and a space-time relation matrix;
step (3) constructing a progressive variable-length sliding window module, inputting the video frame sequence, the initial target context features and the space-time relation matrix, and outputting long-term target first-order features;
step (4) constructing a hypergraph module with shared attribute constraint and diffusion mechanism, inputting the initial target context features and the space-time relation matrix, and outputting short-term target high-order features;
step (5) constructing a target action regression module, inputting the long-term target first-order features and the short-term target high-order features, and outputting the spatial positions and action categories of all targets at the current moment;
and step (6) optimizing the space-time action positioning model consisting of the space-time relation encoder, the progressive variable-length sliding window module, the hypergraph module and the target action regression module with a stochastic gradient descent algorithm, and sequentially executing steps (1) to (5) on a new video sequence to obtain the spatial positions and action categories of all targets at different moments.
2. The video space-time action positioning method based on the progressive attention hypergraph as claimed in claim 1, wherein the step (1) is specifically:
(1-1) sampling the original video at a sampling rate of N frames per second to obtain a video frame sequence set 𝒰 = {U_s} ⊂ ℝ^{H′×W′×3}, where ℝ denotes the real number field, U_s denotes the s-th frame, H′ and W′ denote the height and width of a video frame, 3 corresponds to the three RGB channels, and N is taken in the range 5–10;
(1-2) dividing the original video frame sequence into T video segments 𝒱 = {V_t}_{t=1}^{T}, each segment consisting of 2×N frames, where V_t denotes the t-th video segment; then inputting the t-th video segment into a three-dimensional convolutional neural network to generate the t-th video segment space-time feature map X_t ∈ ℝ^{H×W×C}, where H, W and C are the height, width and number of channels of the feature map and t = 1, 2, ..., T; the space-time feature maps of all video segments are obtained in the same way;
(1-3) performing target detection on the intermediate frame of the t-th video segment V_t with a target detection model based on a two-dimensional convolutional neural network to obtain the target bounding box sequence set B_t = {b_t^{i,β}}, i = 1, 2, ..., N_t, where N_t denotes the number of targets present in the intermediate frame of the t-th video segment and β ∈ {0, 1}: β = 0 indicates that the target is the bounding box of a person and β = 1 that the target is the bounding box of an object; b_t^i = (x_t^{i,1}, y_t^{i,1}, x_t^{i,2}, y_t^{i,2}) denotes the bounding box of the i-th target of the intermediate frame of the t-th video segment, where x_t^{i,1} and y_t^{i,1} are the abscissa and ordinate of the upper-left corner of the i-th target bounding box of the t-th video segment, and x_t^{i,2} and y_t^{i,2} are the abscissa and ordinate of its lower-right corner;
(1-4) scaling the target bounding box b_t^i to obtain the corresponding target bounding box on the video segment space-time feature map, obtaining the t-th video segment target feature map F_t^i ∈ ℝ^{H″×W″×C} by bilinear interpolation, and performing a global average pooling operation to obtain the target feature f_t^i ∈ ℝ^{1×1×C}, where H″, W″ and C are the height, width and number of channels of the target feature map.
3. The method for video spatiotemporal motion localization based on progressive hyperopia as claimed in claim 2, wherein the step (2) is specifically:
(2-1) constructing a space-time relation encoder consisting of three fully-connected layers, and enabling the ith target feature of the tth video segment
Figure FDA00036276862500000211
Inputting the data into three full connection layers to obtain the query features
Figure FDA00036276862500000212
Key feature
Figure FDA00036276862500000213
And value characteristics
Figure FDA00036276862500000214
Wherein
Figure FDA00036276862500000215
D represents the number of channels of the query feature and the key feature, and d is less than C; the same method obtains the jth target feature of the tth video segment
Figure FDA00036276862500000216
Corresponding key feature
Figure FDA00036276862500000217
Sum value feature
Figure FDA00036276862500000218
(2-2) calculating the spatio-temporal relation weight between target i and target j from the inner product of the query and key features with a Softmax normalization, and generating the spatio-temporal relation matrix M_t of all targets of the t-th video segment, where Softmax(·) denotes the Softmax function and <·,·> denotes the inner product; computing the enhanced target region feature as the relation-weighted aggregation of the value features, and copying it along the spatial dimension so that its size is consistent with the spatio-temporal feature map X_t of video segment t, obtaining the global spatial feature of the target;
(2-3) the global spatial feature of the target is concatenated with the video spatio-temporal feature map X_t along the channel dimension and passed through a two-dimensional convolution layer to obtain the target initial context feature, where Conv2D_1(·) denotes a two-dimensional convolution layer with C′ = 2·C input channels, C output channels and a convolution kernel of size 1 × 1 × C′, and || denotes channel concatenation.
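The relation encoder of steps (2-1)–(2-3) is essentially a self-attention layer over the per-target features followed by broadcast-and-concatenate fusion with the segment feature map. The PyTorch sketch below follows that reading; the layer names, the channel sizes and the √d scaling inside the Softmax are assumptions added for a stable illustration, not taken from the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalRelationEncoder(nn.Module):
    def __init__(self, c=64, d=32):
        super().__init__()
        self.q = nn.Linear(c, d)                  # three fully connected layers:
        self.k = nn.Linear(c, d)                  # query, key and value projections
        self.v = nn.Linear(c, c)
        self.fuse = nn.Conv2d(2 * c, c, kernel_size=1)   # Conv2D_1: 2C -> C, 1x1 kernel

    def forward(self, target_feats, segment_map):
        # target_feats: (N_t, C) pooled target features; segment_map: (C, H, W)
        q, k, v = self.q(target_feats), self.k(target_feats), self.v(target_feats)
        M = F.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)   # (N_t, N_t) relation matrix
        enhanced = M @ v                                        # relation-weighted value sum
        H, W = segment_map.shape[1:]
        tiled = enhanced[:, :, None, None].expand(-1, -1, H, W)  # copy along spatial dims
        seg = segment_map.unsqueeze(0).expand(len(target_feats), -1, -1, -1)
        fused = torch.cat([tiled, seg], dim=1)                   # channel concatenation (N_t, 2C, H, W)
        return M, self.fuse(fused)                # relation matrix and initial context features

# toy usage
enc = SpatioTemporalRelationEncoder()
M_t, context_t = enc(torch.rand(3, 64), torch.rand(64, 14, 14))
print(M_t.shape, context_t.shape)                 # (3, 3), (3, 64, 14, 14)
```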
4. The video spatio-temporal action localization method based on a progressive attention hypergraph as claimed in claim 3, wherein the step (3) specifically comprises:
(3-1) the progressive variable-length sliding window module consists of an action auxiliary judgment submodule and a target first-order feature library; the action auxiliary judgment submodule performs coarse-grained action judgment using histogram similarity, and the target first-order feature library stores the target initial context features of all historical video segments, yielding a historical target context feature set whose elements are the i-th target initial context features of the φ-th video segment;
(3-2) converting the middle frame of the current t-th video segment V_t into an RGB histogram matrix Z_t, and converting the middle frame of the (t-1)-th video segment V_{t-1} into an RGB histogram matrix Z_{t-1}, where 3 denotes the RGB channels;
(3-3) using the RGB histogram matrices Z_t and Z_{t-1} to calculate the histogram similarity ρ_{t,t-1} between the middle frames of adjacent video segments, where the histogram entries count, for each brightness level λ with 0 ≤ λ ≤ 255, the number of pixels whose R, G and B channel value equals λ in the middle frames of the t-th and (t-1)-th video segments; according to the histogram similarity ρ_{t,t-1}, calculating the number τ_t of video segments similar to the t-th video segment, where δ with 0 < δ < 1 is a threshold constant;
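The exact similarity formula of step (3-3) appears only as an image in the source. The sketch below therefore assumes a normalized histogram-intersection measure over the 3×256 per-channel brightness counts and a consecutive-run rule for τ_t; these match the quantities the claim describes but are not necessarily the patented formulas.

```python
import numpy as np

def rgb_histogram(frame):
    """frame: uint8 array (H', W', 3); returns the (3, 256) count matrix Z."""
    return np.stack([np.bincount(frame[..., c].ravel(), minlength=256)
                     for c in range(3)])

def histogram_similarity(z_t, z_prev):
    """Histogram intersection in [0, 1]; 1 means identical middle frames (assumed form)."""
    return np.minimum(z_t, z_prev).sum() / max(z_t.sum(), 1)

def count_similar_segments(similarities, delta=0.8):
    """similarities: rho values for consecutive segments up to the current one;
    tau_t counts how many consecutive middle frames stay above the threshold delta
    (an assumed interpretation of the formula given only as an image)."""
    tau = 0
    for rho in reversed(similarities):
        if rho < delta:
            break
        tau += 1
    return tau
```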
(3-4) performing steps (3-2) and (3-3) on all video segments to obtain a vector of similar-segment counts; the window size is set to ω = min(τ_t, L_1), and the historical target initial context feature set within the time window [t−ω, t) is read from the feature library, where L_1 is a preset maximum window size and min(·,·) denotes taking the minimum value;
(3-5) using the target spatio-temporal relation matrices M_t and M_{t-1} to calculate the similarity E_{t,t-1} between the t-th and (t−1)-th video segments; E_{t,t-2}, ..., E_{t,t-ω} are obtained in the same way; the similarity values are sorted in descending order to obtain the top-α historical video segments most similar to the t-th video segment, and a channel concatenation operation is performed on the target initial context features of these segments to obtain the target associated spatio-temporal feature with channel number C″ = α·C; this feature is input into a two-dimensional convolution layer to obtain the long-term target first-order feature consistent with the original action duration, where Conv2D_2(·) denotes a two-dimensional convolution layer with C″ = α·C input channels, C output channels and a convolution kernel of size 1 × 1 × C″.
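Steps (3-4)–(3-5) clip the window to ω = min(τ_t, L_1), rank the in-window segments by similarity of their relation matrices, and fuse the top-α context features with a 1×1 convolution. A minimal sketch follows; the Frobenius-distance form of E_{t,φ}, the padding when history is short, and the equal target count N across segments are all assumptions, since the source defines these quantities only in formula images.

```python
import torch
import torch.nn as nn

def long_term_first_order(M_t, context_t, history, conv2, tau_t, L1=8, alpha=3):
    """history: list of (M_phi, context_phi) pairs for past segments, oldest first.
    M_*: (N, N) relation matrices (equal N assumed for brevity);
    context_*: (N, C, H, W) target initial context features;
    conv2: 1x1 Conv2d with alpha*C input channels and C output channels (Conv2D_2)."""
    omega = min(tau_t, L1)                       # progressive variable-length window
    window = history[-omega:] if omega > 0 else []
    # E_{t,phi}: similarity via Frobenius distance of relation matrices (assumed form)
    ranked = sorted(window, key=lambda mc: torch.norm(M_t - mc[0]).item())
    top = [ctx for _, ctx in ranked[:alpha]]
    while len(top) < alpha:                      # pad with the current context if history is short
        top.append(context_t)
    stacked = torch.cat(top, dim=1)              # channel concatenation -> (N, alpha*C, H, W)
    return conv2(stacked)                        # long-term target first-order feature (N, C, H, W)

# toy usage
C, N, H, W, alpha = 64, 3, 14, 14, 3
conv2 = nn.Conv2d(alpha * C, C, kernel_size=1)
hist = [(torch.rand(N, N), torch.rand(N, C, H, W)) for _ in range(5)]
out = long_term_first_order(torch.rand(N, N), torch.rand(N, C, H, W), hist, conv2, tau_t=4)
print(out.shape)                                 # (3, 64, 14, 14)
```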
5. The video spatio-temporal action localization method based on a progressive attention hypergraph as claimed in claim 4, wherein the step (4) specifically comprises:
(4-1) constructing a hypergraph module with shared attribute constraints and a diffusion mechanism using the relative spatial positions and attributes of the targets: first, the Euclidean distance between target i and every other target j in the middle frame of the t-th video segment is calculated from the center coordinates of their bounding boxes, where dist(·,·) denotes the Euclidean distance; the distances between target i and all other targets in the current frame are computed to obtain the constraint set of target i, namely the set of targets whose distance from target i is less than δ′, where δ′ is a threshold constant;
(4-2) within the constraint set, constructing the high-order relations between target i and the other targets in the set, where the spatio-temporal relation established between target i and target j through a target r is denoted R(i, r, j), target i and target j are persons, target r is a person or an object, and i ≠ j; the method comprises the following steps: according to the spatio-temporal relation matrix M_t obtained in step (2-2), targets i and j that are associated through the same target r are linked, where r denotes the spatial position of the common target; the first-order relation between target i and target r is expressed as a first-order feature between the targets, generated by cropping the initial context feature of the i-th target according to the bounding box of target r and applying a bilinear interpolation operation, yielding the first-order feature of target i and target r; the first-order feature of target j and target r is obtained in the same way; the high-order feature of targets i and j with respect to target r is generated by applying Conv2D_3(·) to these first-order features, and the set of high-order features related to target i is recorded as ψ_i, where Conv2D_3(·) is a two-dimensional convolution layer with C″′ = 2·C input channels, C output channels and a convolution kernel of size 1 × 1 × C″′;
(4-3) using the high-order feature set ψ_i associated with target i, calculating the short-term target high-order feature.
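The shared-attribute constraint and the higher-order aggregation of steps (4-1)–(4-3) can be sketched as below. The mean over ψ_i used for the short-term high-order feature, the concatenation order fed into Conv2D_3, and the threshold value are assumptions, since the claim gives those operations only as formula images.

```python
import torch
import torch.nn as nn

def box_center(box):                              # box: tensor [x1, y1, x2, y2]
    return torch.stack([(box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0])

def constraint_set(boxes, i, delta_prime=50.0):
    """Indices of targets whose center distance to target i is below delta_prime."""
    ci = box_center(boxes[i])
    return [j for j, b in enumerate(boxes)
            if j != i and torch.dist(ci, box_center(b)) < delta_prime]

def short_term_high_order(first_order, pairs, conv3):
    """first_order[(i, r)]: (C, H, W) first-order feature of targets i and r;
    pairs: list of (i, r, j) relations R(i, r, j) inside the constraint set;
    conv3: 1x1 Conv2d with 2C input and C output channels (Conv2D_3)."""
    psi_i = []
    for i, r, j in pairs:
        cat = torch.cat([first_order[(i, r)], first_order[(j, r)]], dim=0)
        psi_i.append(conv3(cat.unsqueeze(0)).squeeze(0))   # high-order feature of (i, j) w.r.t. r
    return torch.stack(psi_i).mean(dim=0)                  # short-term target high-order feature

# toy usage
boxes = torch.tensor([[0., 0., 20., 40.], [10., 10., 30., 50.], [200., 200., 240., 260.]])
print(constraint_set(boxes, 0))                            # -> [1]; target 2 is too far away
C, H, W = 64, 7, 7
conv3 = nn.Conv2d(2 * C, C, kernel_size=1)
fo = {(0, 2): torch.rand(C, H, W), (1, 2): torch.rand(C, H, W)}
feat = short_term_high_order(fo, [(0, 2, 1)], conv3)
print(feat.shape)                                          # (64, 7, 7)
```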
6. The video spatio-temporal action localization method based on a progressive attention hypergraph according to claim 5, wherein the step (5) specifically comprises:
(5-1) the long-term target first-order feature and the short-term target high-order feature are input into the target action regression module to obtain the target localization and action judgment, specifically as follows: first, the two features are summed element by element to obtain the element-wise sum feature, where ⊕ denotes the element-wise sum; this feature is then input into a two-dimensional convolution layer and a global average pooling operation is applied along the spatial dimension to obtain the target classification score, where K denotes the number of action categories, Conv2D_4(·) denotes a two-dimensional convolution layer with C input channels, K output channels and a convolution kernel of size 1 × 1 × C, and GAP(·) denotes global average pooling over the spatial dimension;
(5-2) the target classification score is processed with the Softmax function to obtain the output probability that the action category at time t is u, where e denotes the natural base;
(5-3) the element-wise sum feature is passed through two two-dimensional convolution layers to obtain the target spatial position feature, where Conv2D_5(·) denotes a two-dimensional convolution layer with C input channels, 256 output channels and a convolution kernel of size 3 × 3 × C, and Conv2D_6(·) denotes a two-dimensional convolution layer with 256 input channels, 4 output channels and a convolution kernel of size 1 × 1 × 256;
(5-4) the target spatial position feature is passed through a fully connected layer to obtain the predicted target bounding box, described by the abscissa and ordinate of its upper-left corner point and the abscissa and ordinate of its lower-right corner point.
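Step (5) reduces to a classification branch (1×1 convolution, spatial GAP, Softmax) and a localization branch (3×3 then 1×1 convolution followed by a fully connected layer) over the element-wise sum of the long-term and short-term features. The sketch below follows that structure; the padding of the 3×3 convolution, the 7×7 spatial size and the class count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetActionRegressionHead(nn.Module):
    def __init__(self, c=64, num_classes=80, h=7, w=7):
        super().__init__()
        self.cls_conv = nn.Conv2d(c, num_classes, kernel_size=1)      # Conv2D_4
        self.loc_conv1 = nn.Conv2d(c, 256, kernel_size=3, padding=1)  # Conv2D_5
        self.loc_conv2 = nn.Conv2d(256, 4, kernel_size=1)             # Conv2D_6
        self.loc_fc = nn.Linear(4 * h * w, 4)                         # predicted box corners

    def forward(self, long_term, short_term):
        x = long_term + short_term                       # element-wise sum feature
        cls_score = self.cls_conv(x).mean(dim=(2, 3))    # GAP over the spatial dimension
        action_prob = F.softmax(cls_score, dim=-1)       # probability per action category
        loc = self.loc_conv2(torch.relu(self.loc_conv1(x)))  # target spatial position feature
        box = self.loc_fc(loc.flatten(1))                # (x1, y1, x2, y2) of the predicted box
        return action_prob, box

# toy usage
head = TargetActionRegressionHead()
probs, boxes = head(torch.rand(3, 64, 7, 7), torch.rand(3, 64, 7, 7))
print(probs.shape, boxes.shape)                          # (3, 80), (3, 4)
```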
7. The video spatio-temporal action localization method based on a progressive attention hypergraph as claimed in claim 6, wherein the step (6) specifically comprises:
(6-1) constructing a spatio-temporal action localization model consisting of the spatio-temporal relation encoder, the progressive variable-length sliding window module, the hypergraph module with shared attribute constraints and a diffusion mechanism, and the target action regression module;
(6-2) sampling the training video into a frame sequence, inputting it into the spatio-temporal action localization model to obtain the spatial positions and corresponding action categories of all targets at each moment, and calculating the cross-entropy loss function of the model, where the true label indicates whether the i-th target of the t-th frame contains an action of category u; computing the distance cross-correlation loss function of the model, which involves the intersection-over-union of the predicted target bounding box and the ground-truth target bounding box, the upper-left and lower-right corner coordinates of the predicted target bounding box, the ground-truth target bounding box, the center coordinates of the predicted and ground-truth bounding boxes, and the upper-left and lower-right corner coordinates of the smallest bounding box that encloses both the ground-truth and predicted bounding boxes, with max(·,·) denoting taking the maximum value;
(6-3) optimizing the spatio-temporal action localization model using the stochastic gradient descent algorithm and iteratively training the model until convergence to obtain the optimized spatio-temporal action localization model;
(6-4) sampling a new video to obtain a video frame sequence, inputting it into the optimized spatio-temporal action localization model, executing steps (1) to (5) in sequence, and outputting the spatial positions and action categories of all targets of the video segment at the current moment.
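A training step combining the two losses of step (6-2) could look like the following sketch. The distance cross-correlation loss is written here as a DIoU-style penalty (1 − IoU plus a normalized center-distance term), which matches the quantities the claim enumerates but is an assumption rather than the exact patented formula; the loss weighting is likewise illustrative.

```python
import torch
import torch.nn.functional as F

def diou_style_loss(pred, gt):
    """pred, gt: (N, 4) boxes [x1, y1, x2, y2]; returns mean(1 - IoU + d^2 / c^2)."""
    ix1, iy1 = torch.max(pred[:, 0], gt[:, 0]), torch.max(pred[:, 1], gt[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], gt[:, 2]), torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + 1e-7)
    center_p = (pred[:, :2] + pred[:, 2:]) / 2
    center_g = (gt[:, :2] + gt[:, 2:]) / 2
    d2 = ((center_p - center_g) ** 2).sum(dim=1)        # squared center distance
    # smallest box enclosing both the predicted and ground-truth boxes
    ex1, ey1 = torch.min(pred[:, 0], gt[:, 0]), torch.min(pred[:, 1], gt[:, 1])
    ex2, ey2 = torch.max(pred[:, 2], gt[:, 2]), torch.max(pred[:, 3], gt[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-7      # squared enclosing-box diagonal
    return (1 - iou + d2 / c2).mean()

def training_step(cls_scores, labels, pred_boxes, gt_boxes, optimizer, w_loc=1.0):
    """cls_scores: (N, K) raw scores (F.cross_entropy applies log-softmax internally);
    labels: (N,) ground-truth action category indices."""
    loss = F.cross_entropy(cls_scores, labels) + w_loc * diou_style_loss(pred_boxes, gt_boxes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                     # stochastic gradient descent update
    return loss.item()
```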
CN202210481572.0A 2022-05-05 2022-05-05 Video space-time action positioning method based on progressive attention hypergraph Active CN114882403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210481572.0A CN114882403B (en) 2022-05-05 2022-05-05 Video space-time action positioning method based on progressive attention hypergraph

Publications (2)

Publication Number Publication Date
CN114882403A (en) 2022-08-09
CN114882403B CN114882403B (en) 2022-12-02

Family

ID=82674257

Country Status (1)

Country Link
CN (1) CN114882403B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118279786A (en) * 2024-02-29 2024-07-02 北京科技大学 Time sequence action positioning method based on diffusion model

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106611157A (en) * 2016-11-17 2017-05-03 中国石油大学(华东) Multi-people posture recognition method based on optical flow positioning and sliding window detection
CN108399380A (en) * 2018-02-12 2018-08-14 北京工业大学 A kind of video actions detection method based on Three dimensional convolution and Faster RCNN
CN108805083A (en) * 2018-06-13 2018-11-13 中国科学技术大学 The video behavior detection method of single phase
US20190050996A1 (en) * 2017-08-04 2019-02-14 Intel Corporation Methods and apparatus to generate temporal representations for action recognition systems
CN110765854A (en) * 2019-09-12 2020-02-07 昆明理工大学 Video motion recognition method
CN111291647A (en) * 2020-01-21 2020-06-16 陕西师范大学 Single-stage action positioning method based on multi-scale convolution kernel and superevent module
WO2020196985A1 (en) * 2019-03-27 2020-10-01 연세대학교 산학협력단 Apparatus and method for video action recognition and action section detection
CN112926396A (en) * 2021-01-28 2021-06-08 杭州电子科技大学 Action identification method based on double-current convolution attention
CN113239869A (en) * 2021-05-31 2021-08-10 西安电子科技大学 Two-stage behavior identification method and system based on key frame sequence and behavior information
CN113255443A (en) * 2021-04-16 2021-08-13 杭州电子科技大学 Pyramid structure-based method for positioning time sequence actions of graph attention network
CN113822172A (en) * 2021-08-30 2021-12-21 中国科学院上海微系统与信息技术研究所 Video spatiotemporal behavior detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JUNTING PAN et al.: "Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization", arXiv *
XIONG Chengxin et al.: "Temporal action detection with temporal-domain proposal optimization" (in Chinese), Journal of Image and Graphics *

Also Published As

Publication number Publication date
CN114882403B (en) 2022-12-02

Similar Documents

Publication Publication Date Title
CN112926396B (en) Action identification method based on double-current convolution attention
CN111210446B (en) Video target segmentation method, device and equipment
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
CN112508014A (en) Improved YOLOv3 target detection method based on attention mechanism
CN111968150A (en) Weak surveillance video target segmentation method based on full convolution neural network
CN110942471A (en) Long-term target tracking method based on space-time constraint
Xie et al. GhostFormer: Efficiently amalgamated CNN-transformer architecture for object detection
CN111476133A (en) Unmanned driving-oriented foreground and background codec network target extraction method
Mo et al. PVDet: Towards pedestrian and vehicle detection on gigapixel-level images
CN112101344B (en) Video text tracking method and device
CN114882403B (en) Video space-time action positioning method based on progressive attention hypergraph
WO2023036157A1 (en) Self-supervised spatiotemporal representation learning by exploring video continuity
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
Li et al. PFYOLOv4: An improved small object pedestrian detection algorithm
CN111639563B (en) Basketball video event and target online detection method based on multitasking
CN109308458B (en) Method for improving small target detection precision based on characteristic spectrum scale transformation
CN110942463B (en) Video target segmentation method based on generation countermeasure network
Wang et al. Scene uyghur recognition with embedded coordinate attention
CN116797799A (en) Single-target tracking method and tracking system based on channel attention and space-time perception
Mars et al. Combination of DE-GAN with CNN-LSTM for Arabic OCR on Images with Colorful Backgrounds
Varlik et al. Filtering airborne LIDAR data by using fully convolutional networks
CN116843719A (en) Target tracking method based on independent search twin neural network
CN113963021A (en) Single-target tracking method and system based on space-time characteristics and position changes
CN115082778A (en) Multi-branch learning-based homestead identification method and system
CN115018878A (en) Attention mechanism-based target tracking method in complex scene, storage medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant