
CN114882403A - Video space-time action positioning method based on progressive attention hypergraph - Google Patents

Video space-time action positioning method based on progressive attention hypergraph Download PDF

Info

Publication number
CN114882403A
Authority
CN
China
Prior art keywords
target
video
space
time
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210481572.0A
Other languages
Chinese (zh)
Other versions
CN114882403B (en)
Inventor
叶兴超
李平
曹佳晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202210481572.0A priority Critical patent/CN114882403B/en
Publication of CN114882403A publication Critical patent/CN114882403A/en
Application granted granted Critical
Publication of CN114882403B publication Critical patent/CN114882403B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video space-time action positioning method based on a progressive attention hypergraph. First, a given original video is sampled to obtain a frame sequence, and target region features and a video space-time feature map are obtained with convolutional neural networks; target context features and a space-time relation matrix are then obtained through a space-time relation encoder; long-term target first-order features are generated with a progressive variable-length sliding window module; meanwhile, short-term target high-order features are obtained through a hypergraph module with shared attribute constraint and diffusion mechanism; finally, a target action regression module outputs the spatial positions and action categories of all targets at different moments. The method can adaptively adjust the window size according to the original action duration to obtain target first-order features consistent with that duration, and can capture the potential relations between targets through the hypergraph module, thereby making more effective use of target interaction relations and improving the accuracy of video space-time action positioning.

Description

Video space-time action positioning method based on progressive attention hypergraph
Technical Field
The invention belongs to the technical field of computer vision, in particular to the field of action localization in video processing, and relates to a video space-time action positioning method based on a progressive attention hypergraph.
Background
Since the rapid rise of the media industry, massive amounts of multimedia data, mainly video, have been generated. Compared with traditional image and text data, video has gradually become the mainstream media form owing to its rich visual content and intuitive form of expression. However, large volumes of video contain a great deal of complex scene information, for example many targets and complex actions. Therefore, how to quickly and accurately identify and locate the action categories of all targets in a complex scene has become an important research direction, namely the Spatio-temporal Action Localization task. The task takes as input a long, untrimmed video that may contain multiple targets, each of which may perform multiple actions, and outputs the spatial position, start and end times and corresponding action category of each action segment in the video; it can therefore be widely applied in practical scenarios such as security monitoring, video content review and traffic safety detection. For example, applying spatio-temporal action localization to a surveillance security system makes it possible to monitor and judge dangerous actions of all targets within range in real time and to raise an alarm, thereby helping to strengthen public security; applying it to a video content review system makes it possible to effectively mark and screen out illegal video segments, which facilitates manual review and reduces labor costs.
At present, mainstream spatio-temporal action localization methods mainly adopt a two-stage paradigm. In the first stage, Faster R-CNN (Faster Region-based Convolutional Neural Network) and the SlowFast network are used to obtain target region features and a spatio-temporal feature map; the target region features are copied along the spatial dimension so that their size matches the spatio-temporal feature map, and the two are then spliced along the channel dimension to generate initial target first-order relations (the spatial relations between different targets and the temporal relations of the same target). The second stage uses Long-Term Feature Banks (LFB) as a memory module to store historical target first-order features and combines a local attention mechanism to obtain long-term target first-order features. However, when extracting target interaction relations, this approach ignores the influence of potential relations between targets (high-order relations: the spatio-temporal relation established between two targets through a third target) on the judgment result, which causes deviations in the spatio-temporal localization of actions. Subsequent work therefore adopts a Graph Convolutional Network (GCN) and describes the first-order relation between two targets through their common spatial positions, so as to capture the global scene context and characterize the high-order relations between targets as comprehensively as possible.
The shortcomings of these spatio-temporal action localization methods are mainly reflected in three aspects: (1) although a long-term feature bank with a fixed window size can capture long-term target first-order relations well, for actions of short duration an overly large time range in the long-term feature bank causes the model to extract context-irrelevant features, which reduces the accuracy of short-duration action representations; (2) the influence of high-order relations on judging the action category at the current moment decreases as the time interval increases, while the computational cost increases with the time interval, so constructing long-term target high-order relations makes it difficult to meet the high real-time requirements of the model; (3) a traditional graph structure can only represent pairwise relations and can hardly depict the complex and varied high-order relations between targets. Therefore, in view of the reduced confidence of short-term actions caused by an inaccurate capture range of target first-order relations and the high computational overhead caused by an unreasonable description of target high-order relations, there is an urgent need for a spatio-temporal action localization method that can adaptively adjust the window size according to the original action duration and correctly reflect the high-order relations between targets.
Disclosure of Invention
Aiming at the defects of the existing methods, the invention provides a video space-time action positioning method based on a progressive attention hypergraph. To address the differences in action durations, the method adaptively adjusts the window size by constructing a progressive variable-length sliding window module, so as to extract more effective action feature representations; meanwhile, a hypergraph module with shared attribute constraint and diffusion mechanism is designed to improve the ability to describe high-order relations between targets, thereby improving the accuracy of target action recognition.
The method of the invention sequentially performs the following operations on a video data set with given action types and action space-time marks:
Step (1): preprocessing a video to obtain a video frame sequence, and extracting target region features and a video space-time feature map by using two-dimensional and three-dimensional convolutional neural networks;
Step (2): constructing a space-time relation encoder, inputting the target region features and the video space-time feature map, and outputting initial target context features and a space-time relation matrix;
Step (3): constructing a progressive variable-length sliding window module, inputting the video frame sequence, the initial target context features and the space-time relation matrix, and outputting long-term target first-order features;
Step (4): constructing a hypergraph module with shared attribute constraint and diffusion mechanism, inputting the initial target context features and the space-time relation matrix, and outputting short-term target high-order features;
Step (5): constructing a target action regression module, inputting the long-term target first-order features and the short-term target high-order features, and outputting the spatial positions and action categories of all targets at the current moment;
Step (6): optimizing the space-time action positioning model consisting of the space-time relation encoder, the progressive variable-length sliding window module, the hypergraph module and the target action regression module with a stochastic gradient descent algorithm, and sequentially executing steps (1) to (5) on a new video sequence to obtain the spatial positions and action categories of all targets at different moments.
Further, the step (1) is specifically:
(1-1) sampling the original video at a sampling rate of N frames per second to obtain a video frame sequence set 𝒰 = {U_s} ⊂ ℝ^{H′×W′×3}, where ℝ denotes the real number field, U_s denotes the s-th frame, H′ and W′ denote the height and width of a video frame, 3 corresponds to the three RGB channels, and N is taken in the range 5–10;
(1-2) dividing the original video frame sequence into T video segments 𝒱 = {V_t}_{t=1}^{T}, each segment consisting of 2×N frames, where V_t denotes the t-th video segment; then inputting the t-th video segment into a three-dimensional convolutional neural network to generate the t-th video segment space-time feature map X_t ∈ ℝ^{H×W×C}, where H, W and C are the height, width and number of channels of the feature map and t = 1, 2, ..., T; the space-time feature maps of all video segments are obtained in the same way;
(1-3) performing target detection on the intermediate frame of the t-th video segment V_t with a target detection model based on a two-dimensional convolutional neural network to obtain the target bounding box sequence set B_t = {b_t^{i,β}}, i = 1, 2, ..., N_t, where N_t denotes the number of targets present in the intermediate frame of the t-th video segment and β ∈ {0, 1}: β = 0 indicates that the target is the bounding box of a person and β = 1 that the target is the bounding box of an object; b_t^i = (x_t^{i,1}, y_t^{i,1}, x_t^{i,2}, y_t^{i,2}) denotes the bounding box of the i-th target of the intermediate frame of the t-th video segment, where x_t^{i,1} and y_t^{i,1} are the abscissa and ordinate of the upper-left corner of the i-th target bounding box of the t-th video segment, and x_t^{i,2} and y_t^{i,2} are the abscissa and ordinate of its lower-right corner;
(1-4) scaling the target bounding box b_t^i to obtain the corresponding target bounding box on the video segment space-time feature map, obtaining the t-th video segment target feature map F_t^i ∈ ℝ^{H″×W″×C} by bilinear interpolation, and performing a Global Average Pooling (GAP) operation to obtain the target feature f_t^i ∈ ℝ^{1×1×C}, where H″, W″ and C are the height, width and number of channels of the target feature map.
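For illustration only, the following minimal PyTorch-style sketch shows how the frame sampling, segment splitting and segment-level feature extraction of step (1) could be organized; the toy 3D backbone, the tensor sizes and the names (Tiny3DBackbone, split_into_segments) are assumptions rather than the reference implementation of the invention.

```python
import torch
import torch.nn as nn

class Tiny3DBackbone(nn.Module):
    """Toy stand-in for the 3D CNN that produces the segment space-time feature map X_t."""
    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.conv = nn.Conv3d(3, out_channels, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d((1, 16, 16))   # collapse time, keep a 16x16 spatial map

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: [T, 3, 2N, H', W'] -> X_t: [T, C, H, W]
        return self.pool(self.conv(clip)).squeeze(2)

def split_into_segments(frames: torch.Tensor, n_fps: int) -> torch.Tensor:
    """frames: [S, 3, H', W'] sampled at N fps; returns [T, 3, 2N, H', W']."""
    seg_len = 2 * n_fps
    t = frames.shape[0] // seg_len
    frames = frames[: t * seg_len]
    return frames.view(t, seg_len, *frames.shape[1:]).permute(0, 2, 1, 3, 4)

if __name__ == "__main__":
    frames = torch.randn(80, 3, 64, 64)                # e.g. 10 s of video sampled at N = 8 fps
    segments = split_into_segments(frames, n_fps=8)    # [5, 3, 16, 64, 64]
    backbone = Tiny3DBackbone(out_channels=64)
    x = backbone(segments)                             # [5, 64, 16, 16], one X_t per segment
    print(x.shape)
```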
Still further, the step (2) is specifically:
(2-1) constructing a space-time relation encoder consisting of three fully-connected layers; the i-th target feature f_t^i of the t-th video segment is input into the three fully-connected layers to obtain the query feature q_t^i ∈ ℝ^d, the key feature k_t^i ∈ ℝ^d and the value feature v_t^i ∈ ℝ^C, where d denotes the number of channels of the query feature and the key feature and d < C; the key feature k_t^j and value feature v_t^j corresponding to the j-th target feature f_t^j of the t-th video segment are obtained in the same way;
(2-2) calculating the space-time relation weight between target i and target j, m_t^{i,j} = Softmax(⟨q_t^i, k_t^j⟩), and generating the space-time relation matrix M_t ∈ ℝ^{N_t×N_t} of all targets of the t-th video segment, where Softmax(·) denotes the Softmax function and ⟨·,·⟩ denotes the inner product; computing the enhanced target region feature a_t^i = Σ_j m_t^{i,j}·v_t^j, and copying a_t^i along the spatial dimension so that its size is consistent with that of the space-time feature map X_t of video segment t, thereby obtaining the target global spatial feature Ā_t^i ∈ ℝ^{H×W×C};
(2-3) splicing the target global spatial feature Ā_t^i with the video space-time feature map X_t along the channel dimension and passing the result through a two-dimensional convolution layer to obtain the target initial context feature A_t^i = Conv2D_1(Ā_t^i ∥ X_t) ∈ ℝ^{H×W×C}, where Conv2D_1(·) denotes a two-dimensional convolution layer with input channels C′ = 2·C, output channels C and convolution kernel size 1×1×C′, and ∥ denotes channel splicing.
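A minimal sketch of the space-time relation encoder of step (2), assuming a PyTorch implementation; the module name RelationEncoder and the tensor shapes are illustrative, and the attention-weighted sum follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationEncoder(nn.Module):
    """Toy version of the three-FC-layer encoder that yields M_t and the initial context features A_t^i."""
    def __init__(self, c: int = 256, d: int = 64):
        super().__init__()
        self.to_q = nn.Linear(c, d)
        self.to_k = nn.Linear(c, d)
        self.to_v = nn.Linear(c, c)
        self.conv1 = nn.Conv2d(2 * c, c, kernel_size=1)    # Conv2D_1: C' = 2C -> C

    def forward(self, f_t: torch.Tensor, x_t: torch.Tensor):
        # f_t: [N_t, C] target features; x_t: [C, H, W] segment space-time feature map
        q, k, v = self.to_q(f_t), self.to_k(f_t), self.to_v(f_t)
        m_t = F.softmax(q @ k.t(), dim=-1)                 # [N_t, N_t] space-time relation matrix M_t
        enhanced = m_t @ v                                 # [N_t, C] attention-weighted value features
        h, w = x_t.shape[1:]
        broadcast = enhanced[:, :, None, None].expand(-1, -1, h, w)           # copy along spatial dims
        stacked = torch.cat([broadcast, x_t.unsqueeze(0).expand(len(f_t), -1, -1, -1)], dim=1)
        a_t = self.conv1(stacked)                          # [N_t, C, H, W] initial context features A_t^i
        return m_t, a_t

# usage: m, a = RelationEncoder()(torch.randn(4, 256), torch.randn(256, 16, 16))
```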
Further, the step (3) is specifically:
(3-1) the progressive variable-length sliding window module consists of an action auxiliary judgment submodule and a target first-order feature library; the action auxiliary judgment submodule performs coarse-grained action judgment using histogram similarity, and the target first-order feature library stores the target initial context features of all historical video segments, yielding the historical target context feature set 𝒜 = {A_φ^i}, where A_φ^i denotes the i-th target initial context feature of the φ-th video segment;
(3-2) converting the intermediate frame of the current t-th video segment V_t into an RGB histogram matrix Z_t ∈ ℝ^{3×256}, and converting the intermediate frame of the video segment V_{t−1} at time t−1 into an RGB histogram matrix Z_{t−1} ∈ ℝ^{3×256}, where 3 corresponds to the RGB channels and 256 to the number of luminance levels;
(3-3) using the RGB histogram matrices Z_t and Z_{t−1}, calculating the histogram similarity ρ_{t,t−1} between the intermediate frames of the adjacent video segments from the per-channel, per-luminance pixel counts, where n_{t,λ}^R and n_{t−1,λ}^R denote the numbers of pixels whose channel is R and whose luminance is λ in the intermediate frames of the t-th and (t−1)-th video segments, n_{t,λ}^G and n_{t−1,λ}^G denote the corresponding numbers for channel G, n_{t,λ}^B and n_{t−1,λ}^B denote the corresponding numbers for channel B, and 0 ≤ λ ≤ 255; according to the histogram similarity ρ_{t,t−1}, calculating the number τ_t of video segments similar to the t-th video segment, where δ (0 < δ < 1) is a threshold constant;
(3-4) performing steps (3-2) and (3-3) on all video segments to obtain the similar-segment-number vector τ = [τ_1, τ_2, ..., τ_T]; the window size is set to ω = min(τ_t, L_1), and the historical target initial context feature set {A_φ^i | φ ∈ [t−ω, t)} within the time window [t−ω, t) is read from the feature library, where L_1 is a preset maximum window size and min(·,·) takes the minimum value;
(3-5) using the target space-time relation matrices M_t and M_{t−1}, calculating the similarity E_{t,t−1} between the t-th and (t−1)-th video segments; E_{t,t−2}, ..., E_{t,t−ω} are calculated in the same way; the similarity values are sorted in descending order to obtain the first α historical video segments most similar to the t-th video segment, and channel splicing is performed on the target initial context features corresponding to these video segments to obtain the target associated space-time feature Y_t^i ∈ ℝ^{H×W×C″}, where the channel number C″ = α·C; Y_t^i is then input into a two-dimensional convolution layer to obtain the long-term target first-order feature consistent with the original action duration, S_t^i = Conv2D_2(Y_t^i) ∈ ℝ^{H×W×C}, where Conv2D_2(·) denotes a two-dimensional convolution layer with input channels C″ = α·C, output channels C and convolution kernel size 1×1×C″.
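The coarse-grained histogram judgment of steps (3-2) to (3-4) could be sketched as follows; the normalized histogram-intersection similarity and the way τ_t counts consecutive similar segments are assumptions, since the exact expressions are given by the equations of the invention.

```python
import torch

def rgb_histogram(frame: torch.Tensor) -> torch.Tensor:
    """frame: [3, H, W] with integer values 0..255 -> Z: [3, 256] per-channel luminance counts."""
    return torch.stack([torch.bincount(frame[c].flatten().long(), minlength=256) for c in range(3)]).float()

def histogram_similarity(z_a: torch.Tensor, z_b: torch.Tensor) -> float:
    """Normalized histogram intersection; an assumed stand-in for the patent's rho_{t,t-1}."""
    return (torch.minimum(z_a, z_b).sum() / torch.maximum(z_a, z_b).sum().clamp(min=1)).item()

def window_size(similarities: list, delta: float, l_max: int) -> int:
    """tau_t is assumed here to count how many consecutive preceding segments stay above the
    threshold delta; the window is omega = min(tau_t, L_1)."""
    tau = 0
    for rho in reversed(similarities):      # similarities[-1] compares segment t with t-1, etc.
        if rho < delta:
            break
        tau += 1
    return min(tau, l_max)

# usage:
frames = [torch.randint(0, 256, (3, 64, 64)) for _ in range(4)]
hists = [rgb_histogram(f) for f in frames]
sims = [histogram_similarity(hists[i], hists[i + 1]) for i in range(3)]
omega = window_size(sims, delta=0.6, l_max=32)
```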
Still further, the step (4) is specifically:
(4-1) constructing a hypergraph module with shared attribute constraint and diffusion mechanism by using the relative spatial positions of the targets and the target attributes (person or object): first, the Euclidean distance dist(c_t^i, c_t^j) between target i and every other target j of the intermediate frame of the t-th video segment is calculated, where c_t^i and c_t^j are the center-position coordinates of the bounding boxes of target i and target j of the intermediate frame of the t-th video segment and dist(·,·) denotes the Euclidean distance; the distances between target i and all other targets of the current frame are calculated to obtain the constraint set Ω_t^i of target i, i.e., the set of targets whose distance to target i is less than δ′, where δ′ > 0 is a threshold constant;
(4-2) within the constraint set Ω_t^i, constructing high-order relations between target i and the other targets in the set, where the space-time relation established between target i and target j through a target r is denoted R(i, r, j), target i and target j are persons, target r is a person or an object, and i ≠ j; specifically: according to the space-time relation matrix M_t obtained in step (2-2), targets i and j are associated through the same target r, whose spatial position is given by its bounding box b_t^r; the first-order relation between a target and the common target r is expressed as an inter-target first-order feature, which is generated by cropping the i-th target initial context feature A_t^i according to the bounding box b_t^r of target r and applying a bilinear interpolation operation to obtain the first-order feature G_t^{i,r} of targets i and r; the first-order feature G_t^{j,r} of targets j and r is obtained in the same way; the high-order feature of targets i, j relative to target r is then generated as h_t^{i,j,r} = Conv2D_3(G_t^{i,r} ∥ G_t^{j,r}), and the set of high-order features related to target i is written ψ_i, where Conv2D_3(·) is a two-dimensional convolution layer with input channels C‴ = 2·C, output channels C and convolution kernel size 1×1×C‴;
(4-3) using the high-order feature set ψ_i associated with target i, calculating the short-term target high-order feature O_t^i by aggregating the high-order features in ψ_i.
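A sketch of the shared-attribute constraint and of one high-order feature of step (4), assuming a PyTorch implementation; the crop-and-interpolate step is approximated with torchvision's roi_align, and the names constraint_set and HighOrderFeatures are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

def constraint_set(centers: torch.Tensor, i: int, delta_prime: float) -> list:
    """centers: [N_t, 2] bounding-box centers; returns indices j whose distance to target i is < delta'."""
    dists = torch.cdist(centers[i : i + 1], centers)[0]
    return [j for j in range(len(centers)) if j != i and dists[j] < delta_prime]

class HighOrderFeatures(nn.Module):
    def __init__(self, c: int = 256):
        super().__init__()
        self.conv3 = nn.Conv2d(2 * c, c, kernel_size=1)    # Conv2D_3: C''' = 2C -> C

    def first_order(self, a_i: torch.Tensor, box_r: torch.Tensor, out_size: int = 7) -> torch.Tensor:
        """Crop the context feature A_t^i with target r's box (assumed to be in feature-map
        coordinates) and resample bilinearly."""
        rois = torch.cat([torch.zeros(1, 1), box_r.view(1, 4)], dim=1)
        return roi_align(a_i.unsqueeze(0), rois, output_size=out_size, aligned=True)[0]

    def forward(self, a_i, a_j, box_r):
        g_ir = self.first_order(a_i, box_r)                # first-order feature of (i, r)
        g_jr = self.first_order(a_j, box_r)                # first-order feature of (j, r)
        return self.conv3(torch.cat([g_ir, g_jr], dim=0).unsqueeze(0))[0]   # high-order feature wrt r

# short-term high-order feature for target i, assuming a mean over its set psi_i:
# o_i = torch.stack(psi_i, dim=0).mean(dim=0)
```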
Still further, the step (5) is specifically:
(5-1) inputting the long-term target first-order feature S_t^i and the short-term target high-order feature O_t^i into the target action regression module to obtain the target localization and action judgment, specifically: first, S_t^i and O_t^i are summed element by element to obtain the element-wise sum feature D_t^i = S_t^i ⊕ O_t^i, where ⊕ denotes the element-wise sum; D_t^i is then input into a two-dimensional convolution layer and global average pooling is performed along the spatial dimensions to obtain the target classification score (logits) s_t^i = GAP(Conv2D_4(D_t^i)) ∈ ℝ^K, where K denotes the number of action categories, Conv2D_4(·) denotes a two-dimensional convolution layer with input channels C, output channels K and convolution kernel size 1×1×C, and GAP(·) denotes global average pooling over the spatial dimensions;
(5-2) processing the target classification score s_t^i with the Softmax function to obtain the output probability that the action category at time t is u, p_t^{i,u} = e^{s_t^{i,u}} / Σ_{k=1}^{K} e^{s_t^{i,k}}, where e denotes the natural base;
(5-3) passing the element-wise sum feature D_t^i through two two-dimensional convolution layers to obtain the target spatial position feature P_t^i = Conv2D_6(Conv2D_5(D_t^i)), where Conv2D_5(·) denotes a two-dimensional convolution layer with input channels C, output channels 256 and convolution kernel size 3×3×C, and Conv2D_6(·) denotes a two-dimensional convolution layer with input channels 256, output channels 4 and convolution kernel size 1×1×256;
(5-4) passing the target spatial position feature P_t^i through a fully connected layer to obtain the predicted target bounding box b̂_t^i = (x̂_t^{i,1}, ŷ_t^{i,1}, x̂_t^{i,2}, ŷ_t^{i,2}), where x̂_t^{i,1} and ŷ_t^{i,1} are the abscissa and ordinate of the upper-left corner of the predicted target bounding box, and x̂_t^{i,2} and ŷ_t^{i,2} are the abscissa and ordinate of its lower-right corner.
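An illustrative sketch of the target action regression module of step (5); the channel sizes follow the description above (C to K for classification, C to 256 to 4 for the position branch), while the padding choice and the fully connected bounding-box head are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ActionRegressionHead(nn.Module):
    def __init__(self, c: int = 256, num_classes: int = 80, spatial: int = 7):
        super().__init__()
        self.conv4 = nn.Conv2d(c, num_classes, kernel_size=1)       # Conv2D_4
        self.conv5 = nn.Conv2d(c, 256, kernel_size=3, padding=1)    # Conv2D_5
        self.conv6 = nn.Conv2d(256, 4, kernel_size=1)               # Conv2D_6
        self.fc = nn.Linear(4 * spatial * spatial, 4)               # FC layer producing the box

    def forward(self, s_long: torch.Tensor, o_short: torch.Tensor):
        d = s_long + o_short                                        # element-wise sum feature D_t^i
        logits = self.conv4(d).mean(dim=(2, 3))                     # GAP over spatial dims -> [N_t, K]
        probs = logits.softmax(dim=-1)                              # action-class probabilities p_t^{i,u}
        pos = self.conv6(self.conv5(d))                             # [N_t, 4, H, W] position feature
        boxes = self.fc(pos.flatten(1))                             # predicted (x1, y1, x2, y2)
        return probs, boxes

# usage: probs, boxes = ActionRegressionHead()(torch.randn(4, 256, 7, 7), torch.randn(4, 256, 7, 7))
```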
Continuing further, the step (6) is specifically:
(6-1) constructing a space-time action positioning model consisting of the space-time relation encoder, the progressive variable-length sliding window module, the hypergraph module with shared attribute constraint and diffusion mechanism, and the target action regression module;
(6-2) sampling the training video into a frame sequence, inputting it into the space-time action positioning model to obtain the spatial positions and corresponding action categories of all targets at each moment, and calculating the cross-entropy loss function of the model, L_cls = −Σ_t Σ_i Σ_{u=1}^{K} y_t^{i,u}·log p_t^{i,u}, where y_t^{i,u} is the true label and y_t^{i,u} = 1 indicates that the i-th target of the t-th frame contains an action of category u; calculating the distance cross-correlation (distance-IoU) loss function of the model, L_reg = 1 − IoU(b̂_t^i, b_t^i) + dist²(ĉ_t^i, c_t^i) / dist²((x_t^{i,min}, y_t^{i,min}), (x_t^{i,max}, y_t^{i,max})), where IoU(b̂_t^i, b_t^i) denotes the intersection-over-union of the predicted target bounding box and the real target bounding box, b̂_t^i = (x̂_t^{i,1}, ŷ_t^{i,1}, x̂_t^{i,2}, ŷ_t^{i,2}) is the predicted target bounding box with upper-left corner (x̂_t^{i,1}, ŷ_t^{i,1}) and lower-right corner (x̂_t^{i,2}, ŷ_t^{i,2}), b_t^i is the real target bounding box, ĉ_t^i and c_t^i denote the center-position coordinates of the predicted and the real target bounding boxes, (x_t^{i,min}, y_t^{i,min}) denotes the upper-left corner and (x_t^{i,max}, y_t^{i,max}) the lower-right corner of the smallest bounding box that can enclose both the real bounding box and the predicted bounding box, and max(·,·) denotes taking the maximum value;
(6-3) optimizing the space-time action positioning model with a stochastic gradient descent algorithm and iteratively training the model until convergence to obtain the optimized space-time action positioning model;
(6-4) sampling a new video to obtain a video frame sequence, inputting it into the optimized space-time action positioning model, executing steps (1) to (5) in sequence, and outputting the spatial positions and action categories of all targets of the video segment at the current moment.
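The two training objectives of step (6) could be combined as sketched below; using torchvision's distance_box_iou_loss (available in recent torchvision releases) for the distance cross-correlation term, and summing the two losses without weighting, are assumptions.

```python
import torch
from torchvision.ops import distance_box_iou_loss

def spatiotemporal_losses(probs, labels, pred_boxes, gt_boxes):
    """probs: [N, K] class probabilities, labels: [N] class indices,
    pred_boxes / gt_boxes: [N, 4] as (x1, y1, x2, y2)."""
    cls_loss = torch.nn.functional.nll_loss(torch.log(probs.clamp(min=1e-8)), labels)
    reg_loss = distance_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    return cls_loss + reg_loss   # assumed unweighted sum of the two terms

# one SGD step (model, data batches, etc. are assumed to exist elsewhere):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# loss = spatiotemporal_losses(probs, labels, pred_boxes, gt_boxes)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```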
The invention provides a video space-time action positioning method based on a progressive attention hypergraph, which has the following characteristics: (1) using the idea of progressive cognition theory, the progressive variable-length sliding window module treats actions of different durations differently; compared with the previous fixed long-term feature bank, this avoids, to a certain extent, short-term actions learning context-irrelevant features; (2) the hypergraph module with shared attribute constraint and diffusion mechanism reduces the computational overhead of the model through the shared-attribute and spatial-position constraints between targets, while a more accurate action recognition rate can be obtained by constructing the potential relations between targets; (3) running the hypergraph and the progressive variable-length sliding window strategy in parallel effectively guarantees the real-time performance of the model.
The method is suitable for spatio-temporal action localization tasks with real-time requirements, and has the following advantages: (1) the window size is adaptively adjusted by constructing the progressive variable-length sliding window module, which reduces the learning of redundant features for short-term actions and increases the running speed of the model; (2) applying the shared attribute constraint to the target high-order relations reduces the computational overhead of the model; (3) running the hypergraph module and the progressive variable-length sliding window module in parallel improves the operating efficiency of the model. The progressive attention variable-length sliding window module and the hypergraph module with shared attribute constraint and diffusion mechanism better guarantee the computational efficiency of the model, and the method can be applied to fields such as traffic safety detection and illegal content identification.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, in the video space-time action positioning method based on the progressive attention hypergraph, an original video is uniformly sampled, and target region features and a video space-time feature map are extracted with convolutional neural networks; target context features and a space-time relation matrix are obtained with the space-time relation encoder; the target context features and the space-time relation matrix are input into the progressive variable-length sliding window module to obtain target first-order features consistent with the original action duration; meanwhile, the hypergraph module is constructed, and short-term target high-order features are generated under the shared attribute constraint; finally, the target action regression module yields the spatial positions and action categories of all targets at different moments. The method uses a progressive attention mechanism to adaptively adjust the sliding window size according to the original duration of the action, so as to reduce the possibility that short-term actions learn context-irrelevant features, and depicts the potential high-order relations between targets through the hypergraph module with shared attribute constraint and diffusion mechanism, thereby producing accurate space-time action positioning results.
Given a video data set with action categories and spatio-temporal action annotations, the method of the invention performs the following operations in sequence:
Step (1): preprocessing a video to obtain a video frame sequence, and extracting target region features and a video space-time feature map by using two-dimensional and three-dimensional convolutional neural networks; this comprises the following steps:
(1-1) sampling the original video at a sampling rate of N frames per second to obtain a video frame sequence set 𝒰 = {U_s} ⊂ ℝ^{H′×W′×3}, where ℝ denotes the real number field, U_s denotes the s-th frame, H′ and W′ denote the height and width of a video frame, 3 corresponds to the three RGB channels, and N is taken in the range 5–10;
(1-2) dividing the original video frame sequence into T video segments 𝒱 = {V_t}_{t=1}^{T}, each segment consisting of 2×N frames, where V_t denotes the t-th video segment; then inputting the t-th video segment into a three-dimensional convolutional neural network to generate the t-th video segment space-time feature map X_t ∈ ℝ^{H×W×C}, where H, W and C are the height, width and number of channels of the feature map and t = 1, 2, ..., T; the space-time feature maps of all video segments are obtained in the same way;
(1-3) performing target detection on the intermediate frame of the t-th video segment V_t with a target detection model based on a two-dimensional convolutional neural network to obtain the target bounding box sequence set B_t = {b_t^{i,β}}, i = 1, 2, ..., N_t, where N_t denotes the number of targets present in the intermediate frame of the t-th video segment and β ∈ {0, 1}: β = 0 indicates that the target is the bounding box of a person and β = 1 that the target is the bounding box of an object; b_t^i = (x_t^{i,1}, y_t^{i,1}, x_t^{i,2}, y_t^{i,2}) denotes the bounding box of the i-th target of the intermediate frame of the t-th video segment, where x_t^{i,1} and y_t^{i,1} are the abscissa and ordinate of the upper-left corner of the i-th target bounding box of the t-th video segment, and x_t^{i,2} and y_t^{i,2} are the abscissa and ordinate of its lower-right corner;
(1-4) scaling the target bounding box b_t^i to obtain the corresponding target bounding box on the video segment space-time feature map, obtaining the t-th video segment target feature map F_t^i ∈ ℝ^{H″×W″×C} by bilinear interpolation, and performing a Global Average Pooling (GAP) operation to obtain the target feature f_t^i ∈ ℝ^{1×1×C}, where H″, W″ and C are the height, width and number of channels of the target feature map.
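Step (1-4) above (bounding-box scaling, bilinear cropping and global average pooling) could, for instance, be realized with torchvision's roi_align as sketched below; the output size and the single-image batch handling are assumptions.

```python
import torch
from torchvision.ops import roi_align

def target_features(x_t: torch.Tensor, boxes: torch.Tensor, frame_size: int, out_size: int = 7):
    """x_t: [C, H, W] segment feature map; boxes: [N_t, 4] frame-coordinate boxes (x1, y1, x2, y2).
    Returns F_t: [N_t, C, out_size, out_size] target feature maps and f_t: [N_t, C] pooled features."""
    scale = x_t.shape[-1] / frame_size                      # scaling from frame to feature-map coordinates
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)
    f_maps = roi_align(x_t.unsqueeze(0), rois, output_size=out_size,
                       spatial_scale=scale, aligned=True)   # bilinear interpolation inside RoIAlign
    return f_maps, f_maps.mean(dim=(2, 3))                  # global average pooling -> f_t^i

# usage: F_t, f_t = target_features(torch.randn(256, 16, 16), torch.tensor([[10., 20., 120., 200.]]), 224)
```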
Step (2): constructing a space-time relation encoder, inputting the target region features and the video space-time feature map, and outputting the initial target context features and the space-time relation matrix; this comprises the following steps:
(2-1) constructing a space-time relation encoder consisting of three fully-connected layers; the i-th target feature f_t^i of the t-th video segment is input into the three fully-connected layers to obtain the query feature q_t^i ∈ ℝ^d, the key feature k_t^i ∈ ℝ^d and the value feature v_t^i ∈ ℝ^C, where d denotes the number of channels of the query feature and the key feature and d < C; the key feature k_t^j and value feature v_t^j corresponding to the j-th target feature f_t^j of the t-th video segment are obtained in the same way;
(2-2) calculating the space-time relation weight between target i and target j, m_t^{i,j} = Softmax(⟨q_t^i, k_t^j⟩), and generating the space-time relation matrix M_t ∈ ℝ^{N_t×N_t} of all targets of the t-th video segment, where Softmax(·) denotes the Softmax function and ⟨·,·⟩ denotes the inner product; computing the enhanced target region feature a_t^i = Σ_j m_t^{i,j}·v_t^j, and copying a_t^i along the spatial dimension so that its size is consistent with that of the space-time feature map X_t of video segment t, thereby obtaining the target global spatial feature Ā_t^i ∈ ℝ^{H×W×C};
(2-3) splicing the target global spatial feature Ā_t^i with the video space-time feature map X_t along the channel dimension and passing the result through a two-dimensional convolution layer to obtain the target initial context feature A_t^i = Conv2D_1(Ā_t^i ∥ X_t) ∈ ℝ^{H×W×C}, where Conv2D_1(·) denotes a two-dimensional convolution layer with input channels C′ = 2·C, output channels C and convolution kernel size 1×1×C′, and ∥ denotes channel splicing.
Step (3): constructing a progressive variable-length sliding window module, inputting the video frame sequence, the initial target context features and the space-time relation matrix, and outputting the long-term target first-order features; this comprises the following steps:
(3-1) the progressive variable-length sliding window module consists of an action auxiliary judgment submodule and a target first-order feature library; the action auxiliary judgment submodule performs coarse-grained action judgment using histogram similarity, and the target first-order feature library stores the target initial context features of all historical video segments, yielding the historical target context feature set 𝒜 = {A_φ^i}, where A_φ^i denotes the i-th target initial context feature of the φ-th video segment;
(3-2) converting the intermediate frame of the current t-th video segment V_t into an RGB histogram matrix Z_t ∈ ℝ^{3×256}, and converting the intermediate frame of the video segment V_{t−1} at time t−1 into an RGB histogram matrix Z_{t−1} ∈ ℝ^{3×256}, where 3 corresponds to the RGB channels and 256 to the number of luminance levels;
(3-3) using the RGB histogram matrices Z_t and Z_{t−1}, calculating the histogram similarity ρ_{t,t−1} between the intermediate frames of the adjacent video segments from the per-channel, per-luminance pixel counts, where n_{t,λ}^R and n_{t−1,λ}^R denote the numbers of pixels whose channel is R and whose luminance is λ in the intermediate frames of the t-th and (t−1)-th video segments, n_{t,λ}^G and n_{t−1,λ}^G denote the corresponding numbers for channel G, n_{t,λ}^B and n_{t−1,λ}^B denote the corresponding numbers for channel B, and 0 ≤ λ ≤ 255; according to the histogram similarity ρ_{t,t−1}, calculating the number τ_t of video segments similar to the t-th video segment, where δ (0 < δ < 1) is a threshold constant;
(3-4) performing steps (3-2) and (3-3) on all video segments to obtain the similar-segment-number vector τ = [τ_1, τ_2, ..., τ_T]; the window size is set to ω = min(τ_t, L_1), and the historical target initial context feature set {A_φ^i | φ ∈ [t−ω, t)} within the time window [t−ω, t) is read from the feature library, where L_1 is a preset maximum window size and min(·,·) takes the minimum value;
(3-5) using the target space-time relation matrices M_t and M_{t−1}, calculating the similarity E_{t,t−1} between the t-th and (t−1)-th video segments; E_{t,t−2}, ..., E_{t,t−ω} are calculated in the same way; the similarity values are sorted in descending order to obtain the first α historical video segments most similar to the t-th video segment, and channel splicing is performed on the target initial context features corresponding to these video segments to obtain the target associated space-time feature Y_t^i ∈ ℝ^{H×W×C″}, where the channel number C″ = α·C; Y_t^i is then input into a two-dimensional convolution layer to obtain the long-term target first-order feature consistent with the original action duration, S_t^i = Conv2D_2(Y_t^i) ∈ ℝ^{H×W×C}, where Conv2D_2(·) denotes a two-dimensional convolution layer with input channels C″ = α·C, output channels C and convolution kernel size 1×1×C″.
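For step (3-5), one plausible way to score segment similarity from the relation matrices and to fuse the top-α context features is sketched below; the Frobenius-norm-based similarity is an assumption standing in for the exact expression of E_{t,t-1}, and the matrices are assumed to share the same target count.

```python
import torch
import torch.nn as nn

def relation_similarity(m_a: torch.Tensor, m_b: torch.Tensor) -> float:
    """Assumed similarity between two relation matrices, based on the norm of their difference."""
    return (1.0 / (1.0 + torch.linalg.norm(m_a - m_b))).item()

def fuse_top_alpha(context_feats: list, sims: list, alpha: int, conv2: nn.Conv2d) -> torch.Tensor:
    """context_feats: list of A_phi^i tensors [C, H, W] inside the window, sims: matching E_{t,phi} scores.
    Channel-concatenate the alpha most similar ones and fuse with Conv2D_2 (alpha*C -> C)."""
    order = sorted(range(len(sims)), key=lambda k: sims[k], reverse=True)[:alpha]
    stacked = torch.cat([context_feats[k] for k in order], dim=0).unsqueeze(0)   # [1, alpha*C, H, W]
    return conv2(stacked)[0]                                                     # long-term first-order feature S_t^i

# usage:
alpha, c = 3, 256
conv2 = nn.Conv2d(alpha * c, c, kernel_size=1)
feats = [torch.randn(c, 16, 16) for _ in range(5)]
sims = [relation_similarity(torch.randn(4, 4), torch.randn(4, 4)) for _ in range(5)]
s_long = fuse_top_alpha(feats, sims, alpha, conv2)      # [256, 16, 16]
```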
Step (4): constructing a hypergraph module with shared attribute constraint and diffusion mechanism, inputting the initial target context features and the space-time relation matrix, and outputting the short-term target high-order features; this comprises the following steps:
(4-1) constructing a hypergraph module with shared attribute constraint and diffusion mechanism by using the relative spatial positions of the targets and the target attributes (person or object): first, the Euclidean distance dist(c_t^i, c_t^j) between target i and every other target j of the intermediate frame of the t-th video segment is calculated, where c_t^i and c_t^j are the center-position coordinates of the bounding boxes of target i and target j of the intermediate frame of the t-th video segment and dist(·,·) denotes the Euclidean distance; the distances between target i and all other targets of the current frame are calculated to obtain the constraint set Ω_t^i of target i, i.e., the set of targets whose distance to target i is less than δ′, where δ′ > 0 is a threshold constant;
(4-2) within the constraint set Ω_t^i, constructing high-order relations between target i and the other targets in the set, where the space-time relation established between target i and target j through a target r is denoted R(i, r, j), target i and target j are persons, target r is a person or an object, and i ≠ j; specifically: according to the space-time relation matrix M_t obtained in step (2-2), targets i and j are associated through the same target r, whose spatial position is given by its bounding box b_t^r; the first-order relation between a target and the common target r is expressed as an inter-target first-order feature, which is generated by cropping the i-th target initial context feature A_t^i according to the bounding box b_t^r of target r and applying a bilinear interpolation operation to obtain the first-order feature G_t^{i,r} of targets i and r; the first-order feature G_t^{j,r} of targets j and r is obtained in the same way; the high-order feature of targets i, j relative to target r is then generated as h_t^{i,j,r} = Conv2D_3(G_t^{i,r} ∥ G_t^{j,r}), and the set of high-order features related to target i is written ψ_i, where Conv2D_3(·) is a two-dimensional convolution layer with input channels C‴ = 2·C, output channels C and convolution kernel size 1×1×C‴;
(4-3) using the high-order feature set ψ_i associated with target i, calculating the short-term target high-order feature O_t^i by aggregating the high-order features in ψ_i.
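Putting step (4) together, the loop below sketches how the set ψ_i might be assembled over the constraint set and averaged into the short-term high-order feature O_t^i; the iteration over triples, the mean aggregation and the zero fallback are assumptions, and the diffusion mechanism is not modeled here. high_order_module is assumed to behave like the HighOrderFeatures sketch given with step (4) above.

```python
import torch

def short_term_high_order(a_feats, boxes, centers, person_mask, i, delta_prime, high_order_module):
    """a_feats: list of A_t^k context features [C, H, W]; boxes: [N_t, 4] float, feature-map coordinates;
    centers: [N_t, 2]; person_mask: [N_t] booleans (True = person)."""
    dists = torch.cdist(centers[i : i + 1], centers)[0]
    omega_i = [j for j in range(len(centers)) if j != i and dists[j] < delta_prime]
    psi_i = []
    for j in omega_i:
        if not (person_mask[i] and person_mask[j]):        # i and j must both be persons
            continue
        for r in omega_i:
            if r in (i, j):
                continue                                   # r is any third target (person or object)
            psi_i.append(high_order_module(a_feats[i], a_feats[j], boxes[r]))
    if not psi_i:                                          # no valid triple: fall back to a zero feature
        c = a_feats[i].shape[0]
        return torch.zeros(c, 7, 7)                        # 7 matches the assumed crop size of the module
    return torch.stack(psi_i).mean(dim=0)                  # assumed mean aggregation -> O_t^i
```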
Step (5): constructing a target action regression module, inputting the long-term target first-order features and the short-term target high-order features, and outputting the spatial positions and action categories of all targets at the current moment; this comprises the following steps:
(5-1) inputting the long-term target first-order feature S_t^i and the short-term target high-order feature O_t^i into the target action regression module to obtain the target localization and action judgment, specifically: first, S_t^i and O_t^i are summed element by element to obtain the element-wise sum feature D_t^i = S_t^i ⊕ O_t^i, where ⊕ denotes the element-wise sum; D_t^i is then input into a two-dimensional convolution layer and global average pooling is performed along the spatial dimensions to obtain the target classification score (logits) s_t^i = GAP(Conv2D_4(D_t^i)) ∈ ℝ^K, where K denotes the number of action categories, Conv2D_4(·) denotes a two-dimensional convolution layer with input channels C, output channels K and convolution kernel size 1×1×C, and GAP(·) denotes global average pooling over the spatial dimensions;
(5-2) processing the target classification score s_t^i with the Softmax function to obtain the output probability that the action category at time t is u, p_t^{i,u} = e^{s_t^{i,u}} / Σ_{k=1}^{K} e^{s_t^{i,k}}, where e denotes the natural base;
(5-3) passing the element-wise sum feature D_t^i through two two-dimensional convolution layers to obtain the target spatial position feature P_t^i = Conv2D_6(Conv2D_5(D_t^i)), where Conv2D_5(·) denotes a two-dimensional convolution layer with input channels C, output channels 256 and convolution kernel size 3×3×C, and Conv2D_6(·) denotes a two-dimensional convolution layer with input channels 256, output channels 4 and convolution kernel size 1×1×256;
(5-4) passing the target spatial position feature P_t^i through a fully connected layer to obtain the predicted target bounding box b̂_t^i = (x̂_t^{i,1}, ŷ_t^{i,1}, x̂_t^{i,2}, ŷ_t^{i,2}), where x̂_t^{i,1} and ŷ_t^{i,1} are the abscissa and ordinate of the upper-left corner of the predicted target bounding box, and x̂_t^{i,2} and ŷ_t^{i,2} are the abscissa and ordinate of its lower-right corner.
Step (6): optimizing the space-time action positioning model consisting of the space-time relation encoder, the progressive variable-length sliding window module, the hypergraph module and the target action regression module with a stochastic gradient descent algorithm, and sequentially executing steps (1) to (5) on a new video sequence to obtain the spatial positions and action categories of all targets at different moments; this comprises the following steps:
(6-1) constructing a space-time action positioning model consisting of the space-time relation encoder, the progressive variable-length sliding window module, the hypergraph module with shared attribute constraint and diffusion mechanism, and the target action regression module;
(6-2) sampling the training video into a frame sequence, inputting it into the space-time action positioning model to obtain the spatial positions and corresponding action categories of all targets at each moment, and calculating the cross-entropy loss function of the model, L_cls = −Σ_t Σ_i Σ_{u=1}^{K} y_t^{i,u}·log p_t^{i,u}, where y_t^{i,u} is the true label and y_t^{i,u} = 1 indicates that the i-th target of the t-th frame contains an action of category u; calculating the distance cross-correlation (distance-IoU) loss function of the model, L_reg = 1 − IoU(b̂_t^i, b_t^i) + dist²(ĉ_t^i, c_t^i) / dist²((x_t^{i,min}, y_t^{i,min}), (x_t^{i,max}, y_t^{i,max})), where IoU(b̂_t^i, b_t^i) denotes the intersection-over-union of the predicted target bounding box and the real target bounding box, b̂_t^i = (x̂_t^{i,1}, ŷ_t^{i,1}, x̂_t^{i,2}, ŷ_t^{i,2}) is the predicted target bounding box with upper-left corner (x̂_t^{i,1}, ŷ_t^{i,1}) and lower-right corner (x̂_t^{i,2}, ŷ_t^{i,2}), b_t^i is the real target bounding box, ĉ_t^i and c_t^i denote the center-position coordinates of the predicted and the real target bounding boxes, (x_t^{i,min}, y_t^{i,min}) denotes the upper-left corner and (x_t^{i,max}, y_t^{i,max}) the lower-right corner of the smallest bounding box that can enclose both the real bounding box and the predicted bounding box, and max(·,·) denotes taking the maximum value;
(6-3) optimizing the space-time action positioning model with a stochastic gradient descent algorithm and iteratively training the model until convergence to obtain the optimized space-time action positioning model;
(6-4) sampling a new video to obtain a video frame sequence, inputting it into the optimized space-time action positioning model, executing steps (1) to (5) in sequence, and outputting the spatial positions and action categories of all targets of the video segment at the current moment.
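Finally, the training procedure of step (6) could be wrapped as sketched below; model, train_loader and losses_fn follow the earlier sketches and are assumptions rather than the reference code of the invention.

```python
import torch

def train(model, train_loader, losses_fn, epochs: int = 10, lr: float = 0.01):
    """Stochastic-gradient-descent optimization of the space-time action positioning model."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):                                # in practice, iterate until convergence
        for frames, labels, gt_boxes in train_loader:      # assumed (frame sequence, labels, boxes) batches
            probs, pred_boxes = model(frames)              # steps (1)-(5) applied to the sampled frames
            loss = losses_fn(probs, labels, pred_boxes, gt_boxes)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

# inference on a new video: sample frames, run the optimized model, read off boxes and action classes
# probs, boxes = model(new_frames)
```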
The embodiment described above is only an example of an implementation of the inventive concept, and the protection scope of the present invention should not be regarded as limited to the specific form set forth in the embodiment; the protection scope of the present invention also extends to equivalent technical means that can be conceived by those skilled in the art according to the inventive concept.

Claims (7)

1. The video space-time action positioning method based on the progressive attention hypergraph is characterized in that the method sequentially performs the following operations on a given action type and action space-time marked video data set:
step (1) preprocessing a video to obtain a video frame sequence, and extracting target region features and a video space-time feature map by using two-dimensional and three-dimensional convolutional neural networks;
step (2) constructing a space-time relation encoder, inputting the target region features and the video space-time feature map, and outputting initial target context features and a space-time relation matrix;
step (3) constructing a progressive variable-length sliding window module, inputting the video frame sequence, the initial target context features and the space-time relation matrix, and outputting long-term target first-order features;
step (4) constructing a hypergraph module with shared attribute constraint and diffusion mechanism, inputting the initial target context features and the space-time relation matrix, and outputting short-term target high-order features;
step (5) constructing a target action regression module, inputting the long-term target first-order features and the short-term target high-order features, and outputting the spatial positions and action categories of all targets at the current moment;
and step (6) optimizing the space-time action positioning model consisting of the space-time relation encoder, the progressive variable-length sliding window module, the hypergraph module and the target action regression module with a stochastic gradient descent algorithm, and sequentially executing steps (1) to (5) on a new video sequence to obtain the spatial positions and action categories of all targets at different moments.
2. The video space-time action positioning method based on the progressive attention hypergraph as claimed in claim 1, wherein the step (1) is specifically:
(1-1) sampling the original video at a sampling rate of N frames per second to obtain a video frame sequence set 𝒰 = {U_s} ⊂ ℝ^{H′×W′×3}, where ℝ denotes the real number field, U_s denotes the s-th frame, H′ and W′ denote the height and width of a video frame, 3 corresponds to the three RGB channels, and N is taken in the range 5–10;
(1-2) dividing the original video frame sequence into T video segments 𝒱 = {V_t}_{t=1}^{T}, each segment consisting of 2×N frames, where V_t denotes the t-th video segment; then inputting the t-th video segment into a three-dimensional convolutional neural network to generate the t-th video segment space-time feature map X_t ∈ ℝ^{H×W×C}, where H, W and C are the height, width and number of channels of the feature map and t = 1, 2, ..., T; the space-time feature maps of all video segments are obtained in the same way;
(1-3) performing target detection on the intermediate frame of the t-th video segment V_t with a target detection model based on a two-dimensional convolutional neural network to obtain the target bounding box sequence set B_t = {b_t^{i,β}}, i = 1, 2, ..., N_t, where N_t denotes the number of targets present in the intermediate frame of the t-th video segment and β ∈ {0, 1}: β = 0 indicates that the target is the bounding box of a person and β = 1 that the target is the bounding box of an object; b_t^i = (x_t^{i,1}, y_t^{i,1}, x_t^{i,2}, y_t^{i,2}) denotes the bounding box of the i-th target of the intermediate frame of the t-th video segment, where x_t^{i,1} and y_t^{i,1} are the abscissa and ordinate of the upper-left corner of the i-th target bounding box of the t-th video segment, and x_t^{i,2} and y_t^{i,2} are the abscissa and ordinate of its lower-right corner;
(1-4) scaling the target bounding box b_t^i to obtain the corresponding target bounding box on the video segment space-time feature map, obtaining the t-th video segment target feature map F_t^i ∈ ℝ^{H″×W″×C} by bilinear interpolation, and performing a global average pooling operation to obtain the target feature f_t^i ∈ ℝ^{1×1×C}, where H″, W″ and C are the height, width and number of channels of the target feature map.
3. The method for video spatiotemporal motion localization based on progressive hyperopia as claimed in claim 2, wherein the step (2) is specifically:
(2-1) constructing a space-time relation encoder consisting of three fully-connected layers, and enabling the ith target feature of the tth video segment
Figure FDA00036276862500000211
Inputting the data into three full connection layers to obtain the query features
Figure FDA00036276862500000212
Key feature
Figure FDA00036276862500000213
And value characteristics
Figure FDA00036276862500000214
Wherein
Figure FDA00036276862500000215
D represents the number of channels of the query feature and the key feature, and d is less than C; the same method obtains the jth target feature of the tth video segment
Figure FDA00036276862500000216
Corresponding key feature
Figure FDA00036276862500000217
Sum value feature
Figure FDA00036276862500000218
(2-2) calculating the spatio-temporal relation weight between target i and target j from the inner product of the query and key features with a Softmax normalization, and generating the spatio-temporal relation matrix M_t of all targets of the t-th video segment, where Softmax(·) denotes the Softmax function and <·,·> denotes the inner product; computing the enhanced target region feature as the relation-weighted aggregation of the value features, and copying it along the spatial dimension so that its size is consistent with the spatio-temporal feature map X_t of video segment t, obtaining the global spatial feature of the target;
(2-3) the global spatial feature of the target is concatenated with the video spatio-temporal feature map X_t along the channel dimension and passed through a two-dimensional convolution layer to obtain the target initial context feature, where Conv2D_1(·) denotes a two-dimensional convolution layer with C′ = 2·C input channels, C output channels and a convolution kernel of size 1 × 1 × C′, and || denotes channel concatenation.
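The relation encoder of steps (2-1)–(2-3) is essentially a self-attention layer over the per-target features followed by broadcast-and-concatenate fusion with the segment feature map. The PyTorch sketch below follows that reading; the layer names, the channel sizes and the √d scaling inside the Softmax are assumptions added for a stable illustration, not taken from the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalRelationEncoder(nn.Module):
    def __init__(self, c=64, d=32):
        super().__init__()
        self.q = nn.Linear(c, d)                  # three fully connected layers:
        self.k = nn.Linear(c, d)                  # query, key and value projections
        self.v = nn.Linear(c, c)
        self.fuse = nn.Conv2d(2 * c, c, kernel_size=1)   # Conv2D_1: 2C -> C, 1x1 kernel

    def forward(self, target_feats, segment_map):
        # target_feats: (N_t, C) pooled target features; segment_map: (C, H, W)
        q, k, v = self.q(target_feats), self.k(target_feats), self.v(target_feats)
        M = F.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)   # (N_t, N_t) relation matrix
        enhanced = M @ v                                        # relation-weighted value sum
        H, W = segment_map.shape[1:]
        tiled = enhanced[:, :, None, None].expand(-1, -1, H, W)  # copy along spatial dims
        seg = segment_map.unsqueeze(0).expand(len(target_feats), -1, -1, -1)
        fused = torch.cat([tiled, seg], dim=1)                   # channel concatenation (N_t, 2C, H, W)
        return M, self.fuse(fused)                # relation matrix and initial context features

# toy usage
enc = SpatioTemporalRelationEncoder()
M_t, context_t = enc(torch.rand(3, 64), torch.rand(64, 14, 14))
print(M_t.shape, context_t.shape)                 # (3, 3), (3, 64, 14, 14)
```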
4. The video spatio-temporal action localization method based on a progressive attention hypergraph as claimed in claim 3, wherein the step (3) specifically comprises:
(3-1) the progressive variable-length sliding window module consists of an action auxiliary judgment submodule and a target first-order feature library; the action auxiliary judgment submodule performs coarse-grained action judgment using histogram similarity, and the target first-order feature library stores the target initial context features of all historical video segments, yielding a historical target context feature set whose elements are the i-th target initial context features of the φ-th video segment;
(3-2) converting the middle frame of the current t-th video segment V_t into an RGB histogram matrix Z_t, and converting the middle frame of the (t-1)-th video segment V_{t-1} into an RGB histogram matrix Z_{t-1}, where 3 denotes the RGB channels;
(3-3) using the RGB histogram matrices Z_t and Z_{t-1} to calculate the histogram similarity ρ_{t,t-1} between the middle frames of adjacent video segments, where the histogram entries count, for each brightness level λ with 0 ≤ λ ≤ 255, the number of pixels whose R, G and B channel value equals λ in the middle frames of the t-th and (t-1)-th video segments; according to the histogram similarity ρ_{t,t-1}, calculating the number τ_t of video segments similar to the t-th video segment, where δ with 0 < δ < 1 is a threshold constant;
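The exact similarity formula of step (3-3) appears only as an image in the source. The sketch below therefore assumes a normalized histogram-intersection measure over the 3×256 per-channel brightness counts and a consecutive-run rule for τ_t; these match the quantities the claim describes but are not necessarily the patented formulas.

```python
import numpy as np

def rgb_histogram(frame):
    """frame: uint8 array (H', W', 3); returns the (3, 256) count matrix Z."""
    return np.stack([np.bincount(frame[..., c].ravel(), minlength=256)
                     for c in range(3)])

def histogram_similarity(z_t, z_prev):
    """Histogram intersection in [0, 1]; 1 means identical middle frames (assumed form)."""
    return np.minimum(z_t, z_prev).sum() / max(z_t.sum(), 1)

def count_similar_segments(similarities, delta=0.8):
    """similarities: rho values for consecutive segments up to the current one;
    tau_t counts how many consecutive middle frames stay above the threshold delta
    (an assumed interpretation of the formula given only as an image)."""
    tau = 0
    for rho in reversed(similarities):
        if rho < delta:
            break
        tau += 1
    return tau
```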
(3-4) performing steps (3-2) and (3-3) on all video segments to obtain a vector of similar-segment counts; the window size is set to ω = min(τ_t, L_1), and the historical target initial context feature set within the time window [t−ω, t) is read from the feature library, where L_1 is a preset maximum window size and min(·,·) denotes taking the minimum value;
(3-5) using the target spatio-temporal relation matrices M_t and M_{t-1} to calculate the similarity E_{t,t-1} between the t-th and (t−1)-th video segments; E_{t,t-2}, ..., E_{t,t-ω} are obtained in the same way; the similarity values are sorted in descending order to obtain the top-α historical video segments most similar to the t-th video segment, and a channel concatenation operation is performed on the target initial context features of these segments to obtain the target associated spatio-temporal feature with channel number C″ = α·C; this feature is input into a two-dimensional convolution layer to obtain the long-term target first-order feature consistent with the original action duration, where Conv2D_2(·) denotes a two-dimensional convolution layer with C″ = α·C input channels, C output channels and a convolution kernel of size 1 × 1 × C″.
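Steps (3-4)–(3-5) clip the window to ω = min(τ_t, L_1), rank the in-window segments by similarity of their relation matrices, and fuse the top-α context features with a 1×1 convolution. A minimal sketch follows; the Frobenius-distance form of E_{t,φ}, the padding when history is short, and the equal target count N across segments are all assumptions, since the source defines these quantities only in formula images.

```python
import torch
import torch.nn as nn

def long_term_first_order(M_t, context_t, history, conv2, tau_t, L1=8, alpha=3):
    """history: list of (M_phi, context_phi) pairs for past segments, oldest first.
    M_*: (N, N) relation matrices (equal N assumed for brevity);
    context_*: (N, C, H, W) target initial context features;
    conv2: 1x1 Conv2d with alpha*C input channels and C output channels (Conv2D_2)."""
    omega = min(tau_t, L1)                       # progressive variable-length window
    window = history[-omega:] if omega > 0 else []
    # E_{t,phi}: similarity via Frobenius distance of relation matrices (assumed form)
    ranked = sorted(window, key=lambda mc: torch.norm(M_t - mc[0]).item())
    top = [ctx for _, ctx in ranked[:alpha]]
    while len(top) < alpha:                      # pad with the current context if history is short
        top.append(context_t)
    stacked = torch.cat(top, dim=1)              # channel concatenation -> (N, alpha*C, H, W)
    return conv2(stacked)                        # long-term target first-order feature (N, C, H, W)

# toy usage
C, N, H, W, alpha = 64, 3, 14, 14, 3
conv2 = nn.Conv2d(alpha * C, C, kernel_size=1)
hist = [(torch.rand(N, N), torch.rand(N, C, H, W)) for _ in range(5)]
out = long_term_first_order(torch.rand(N, N), torch.rand(N, C, H, W), hist, conv2, tau_t=4)
print(out.shape)                                 # (3, 64, 14, 14)
```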
5. The video spatio-temporal action localization method based on a progressive attention hypergraph as claimed in claim 4, wherein the step (4) specifically comprises:
(4-1) constructing a hypergraph module with shared attribute constraints and a diffusion mechanism using the relative spatial positions and attributes of the targets: first, the Euclidean distance between target i and every other target j in the middle frame of the t-th video segment is calculated from the center coordinates of their bounding boxes, where dist(·,·) denotes the Euclidean distance; the distances between target i and all other targets in the current frame are computed to obtain the constraint set of target i, namely the set of targets whose distance from target i is less than δ′, where δ′ is a threshold constant;
(4-2) within the constraint set, constructing the high-order relations between target i and the other targets in the set, where the spatio-temporal relation established between target i and target j through a target r is denoted R(i, r, j), target i and target j are persons, target r is a person or an object, and i ≠ j; the method comprises the following steps: according to the spatio-temporal relation matrix M_t obtained in step (2-2), targets i and j that are associated through the same target r are linked, where r denotes the spatial position of the common target; the first-order relation between target i and target r is expressed as a first-order feature between the targets, generated by cropping the initial context feature of the i-th target according to the bounding box of target r and applying a bilinear interpolation operation, yielding the first-order feature of target i and target r; the first-order feature of target j and target r is obtained in the same way; the high-order feature of targets i and j with respect to target r is generated by applying Conv2D_3(·) to these first-order features, and the set of high-order features related to target i is recorded as ψ_i, where Conv2D_3(·) is a two-dimensional convolution layer with C″′ = 2·C input channels, C output channels and a convolution kernel of size 1 × 1 × C″′;
(4-3) using the high-order feature set ψ_i associated with target i, calculating the short-term target high-order feature.
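The shared-attribute constraint and the higher-order aggregation of steps (4-1)–(4-3) can be sketched as below. The mean over ψ_i used for the short-term high-order feature, the concatenation order fed into Conv2D_3, and the threshold value are assumptions, since the claim gives those operations only as formula images.

```python
import torch
import torch.nn as nn

def box_center(box):                              # box: tensor [x1, y1, x2, y2]
    return torch.stack([(box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0])

def constraint_set(boxes, i, delta_prime=50.0):
    """Indices of targets whose center distance to target i is below delta_prime."""
    ci = box_center(boxes[i])
    return [j for j, b in enumerate(boxes)
            if j != i and torch.dist(ci, box_center(b)) < delta_prime]

def short_term_high_order(first_order, pairs, conv3):
    """first_order[(i, r)]: (C, H, W) first-order feature of targets i and r;
    pairs: list of (i, r, j) relations R(i, r, j) inside the constraint set;
    conv3: 1x1 Conv2d with 2C input and C output channels (Conv2D_3)."""
    psi_i = []
    for i, r, j in pairs:
        cat = torch.cat([first_order[(i, r)], first_order[(j, r)]], dim=0)
        psi_i.append(conv3(cat.unsqueeze(0)).squeeze(0))   # high-order feature of (i, j) w.r.t. r
    return torch.stack(psi_i).mean(dim=0)                  # short-term target high-order feature

# toy usage
boxes = torch.tensor([[0., 0., 20., 40.], [10., 10., 30., 50.], [200., 200., 240., 260.]])
print(constraint_set(boxes, 0))                            # -> [1]; target 2 is too far away
C, H, W = 64, 7, 7
conv3 = nn.Conv2d(2 * C, C, kernel_size=1)
fo = {(0, 2): torch.rand(C, H, W), (1, 2): torch.rand(C, H, W)}
feat = short_term_high_order(fo, [(0, 2, 1)], conv3)
print(feat.shape)                                          # (64, 7, 7)
```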
6. The video spatio-temporal action localization method based on a progressive attention hypergraph according to claim 5, wherein the step (5) specifically comprises:
(5-1) the long-term target first-order feature and the short-term target high-order feature are input into the target action regression module to obtain the target localization and action judgment, specifically as follows: first, the two features are summed element by element to obtain the element-wise sum feature, where ⊕ denotes the element-wise sum; this feature is then input into a two-dimensional convolution layer and a global average pooling operation is applied along the spatial dimension to obtain the target classification score, where K denotes the number of action categories, Conv2D_4(·) denotes a two-dimensional convolution layer with C input channels, K output channels and a convolution kernel of size 1 × 1 × C, and GAP(·) denotes global average pooling over the spatial dimension;
(5-2) the target classification score is processed with the Softmax function to obtain the output probability that the action category at time t is u, where e denotes the natural base;
(5-3) the element-wise sum feature is passed through two two-dimensional convolution layers to obtain the target spatial position feature, where Conv2D_5(·) denotes a two-dimensional convolution layer with C input channels, 256 output channels and a convolution kernel of size 3 × 3 × C, and Conv2D_6(·) denotes a two-dimensional convolution layer with 256 input channels, 4 output channels and a convolution kernel of size 1 × 1 × 256;
(5-4) the target spatial position feature is passed through a fully connected layer to obtain the predicted target bounding box, described by the abscissa and ordinate of its upper-left corner point and the abscissa and ordinate of its lower-right corner point.
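Step (5) reduces to a classification branch (1×1 convolution, spatial GAP, Softmax) and a localization branch (3×3 then 1×1 convolution followed by a fully connected layer) over the element-wise sum of the long-term and short-term features. The sketch below follows that structure; the padding of the 3×3 convolution, the 7×7 spatial size and the class count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetActionRegressionHead(nn.Module):
    def __init__(self, c=64, num_classes=80, h=7, w=7):
        super().__init__()
        self.cls_conv = nn.Conv2d(c, num_classes, kernel_size=1)      # Conv2D_4
        self.loc_conv1 = nn.Conv2d(c, 256, kernel_size=3, padding=1)  # Conv2D_5
        self.loc_conv2 = nn.Conv2d(256, 4, kernel_size=1)             # Conv2D_6
        self.loc_fc = nn.Linear(4 * h * w, 4)                         # predicted box corners

    def forward(self, long_term, short_term):
        x = long_term + short_term                       # element-wise sum feature
        cls_score = self.cls_conv(x).mean(dim=(2, 3))    # GAP over the spatial dimension
        action_prob = F.softmax(cls_score, dim=-1)       # probability per action category
        loc = self.loc_conv2(torch.relu(self.loc_conv1(x)))  # target spatial position feature
        box = self.loc_fc(loc.flatten(1))                # (x1, y1, x2, y2) of the predicted box
        return action_prob, box

# toy usage
head = TargetActionRegressionHead()
probs, boxes = head(torch.rand(3, 64, 7, 7), torch.rand(3, 64, 7, 7))
print(probs.shape, boxes.shape)                          # (3, 80), (3, 4)
```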
7. The video spatio-temporal action localization method based on a progressive attention hypergraph as claimed in claim 6, wherein the step (6) specifically comprises:
(6-1) constructing a spatio-temporal action localization model consisting of the spatio-temporal relation encoder, the progressive variable-length sliding window module, the hypergraph module with shared attribute constraints and a diffusion mechanism, and the target action regression module;
(6-2) sampling the training video into a frame sequence, inputting it into the spatio-temporal action localization model to obtain the spatial positions and corresponding action categories of all targets at each moment, and calculating the cross-entropy loss function of the model, where the true label indicates whether the i-th target of the t-th frame contains an action of category u; computing the distance cross-correlation loss function of the model, which involves the intersection-over-union of the predicted target bounding box and the ground-truth target bounding box, the upper-left and lower-right corner coordinates of the predicted target bounding box, the ground-truth target bounding box, the center coordinates of the predicted and ground-truth bounding boxes, and the upper-left and lower-right corner coordinates of the smallest bounding box that encloses both the ground-truth and predicted bounding boxes, with max(·,·) denoting taking the maximum value;
(6-3) optimizing the spatio-temporal action localization model using the stochastic gradient descent algorithm and iteratively training the model until convergence to obtain the optimized spatio-temporal action localization model;
(6-4) sampling a new video to obtain a video frame sequence, inputting it into the optimized spatio-temporal action localization model, executing steps (1) to (5) in sequence, and outputting the spatial positions and action categories of all targets of the video segment at the current moment.
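A training step combining the two losses of step (6-2) could look like the following sketch. The distance cross-correlation loss is written here as a DIoU-style penalty (1 − IoU plus a normalized center-distance term), which matches the quantities the claim enumerates but is an assumption rather than the exact patented formula; the loss weighting is likewise illustrative.

```python
import torch
import torch.nn.functional as F

def diou_style_loss(pred, gt):
    """pred, gt: (N, 4) boxes [x1, y1, x2, y2]; returns mean(1 - IoU + d^2 / c^2)."""
    ix1, iy1 = torch.max(pred[:, 0], gt[:, 0]), torch.max(pred[:, 1], gt[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], gt[:, 2]), torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + 1e-7)
    center_p = (pred[:, :2] + pred[:, 2:]) / 2
    center_g = (gt[:, :2] + gt[:, 2:]) / 2
    d2 = ((center_p - center_g) ** 2).sum(dim=1)        # squared center distance
    # smallest box enclosing both the predicted and ground-truth boxes
    ex1, ey1 = torch.min(pred[:, 0], gt[:, 0]), torch.min(pred[:, 1], gt[:, 1])
    ex2, ey2 = torch.max(pred[:, 2], gt[:, 2]), torch.max(pred[:, 3], gt[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-7      # squared enclosing-box diagonal
    return (1 - iou + d2 / c2).mean()

def training_step(cls_scores, labels, pred_boxes, gt_boxes, optimizer, w_loc=1.0):
    """cls_scores: (N, K) raw scores (F.cross_entropy applies log-softmax internally);
    labels: (N,) ground-truth action category indices."""
    loss = F.cross_entropy(cls_scores, labels) + w_loc * diou_style_loss(pred_boxes, gt_boxes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                     # stochastic gradient descent update
    return loss.item()
```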
CN202210481572.0A 2022-05-05 2022-05-05 Video space-time action positioning method based on progressive attention hypergraph Active CN114882403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210481572.0A CN114882403B (en) 2022-05-05 2022-05-05 Video space-time action positioning method based on progressive attention hypergraph

Publications (2)

Publication Number Publication Date
CN114882403A (en) 2022-08-09
CN114882403B CN114882403B (en) 2022-12-02

Family

ID=82674257

Country Status (1)

Country Link
CN (1) CN114882403B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118279786A (en) * 2024-02-29 2024-07-02 北京科技大学 Time sequence action positioning method based on diffusion model

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106611157A (en) * 2016-11-17 2017-05-03 中国石油大学(华东) Multi-people posture recognition method based on optical flow positioning and sliding window detection
CN108399380A (en) * 2018-02-12 2018-08-14 北京工业大学 A kind of video actions detection method based on Three dimensional convolution and Faster RCNN
CN108805083A (en) * 2018-06-13 2018-11-13 中国科学技术大学 The video behavior detection method of single phase
US20190050996A1 (en) * 2017-08-04 2019-02-14 Intel Corporation Methods and apparatus to generate temporal representations for action recognition systems
CN110765854A (en) * 2019-09-12 2020-02-07 昆明理工大学 Video motion recognition method
CN111291647A (en) * 2020-01-21 2020-06-16 陕西师范大学 Single-stage action positioning method based on multi-scale convolution kernel and superevent module
WO2020196985A1 (en) * 2019-03-27 2020-10-01 연세대학교 산학협력단 Apparatus and method for video action recognition and action section detection
CN112926396A (en) * 2021-01-28 2021-06-08 杭州电子科技大学 Action identification method based on double-current convolution attention
CN113239869A (en) * 2021-05-31 2021-08-10 西安电子科技大学 Two-stage behavior identification method and system based on key frame sequence and behavior information
CN113255443A (en) * 2021-04-16 2021-08-13 杭州电子科技大学 Pyramid structure-based method for positioning time sequence actions of graph attention network
CN113822172A (en) * 2021-08-30 2021-12-21 中国科学院上海微系统与信息技术研究所 Video spatiotemporal behavior detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JUNTING PAN et al.: "Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization", arXiv *
XIONG Chengxin et al.: "Temporal action detection with temporal-domain proposal optimization" (in Chinese), Journal of Image and Graphics *

Also Published As

Publication number Publication date
CN114882403B (en) 2022-12-02

Similar Documents

Publication Publication Date Title
CN112926396B (en) Action identification method based on double-current convolution attention
CN111210446B (en) Video target segmentation method, device and equipment
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
CN112508014A (en) Improved YOLOv3 target detection method based on attention mechanism
CN111968150A (en) Weak surveillance video target segmentation method based on full convolution neural network
CN110942471A (en) Long-term target tracking method based on space-time constraint
Xie et al. GhostFormer: Efficiently amalgamated CNN-transformer architecture for object detection
CN111476133A (en) Unmanned driving-oriented foreground and background codec network target extraction method
Mo et al. PVDet: Towards pedestrian and vehicle detection on gigapixel-level images
CN112101344B (en) Video text tracking method and device
CN114882403B (en) Video space-time action positioning method based on progressive attention hypergraph
WO2023036157A1 (en) Self-supervised spatiotemporal representation learning by exploring video continuity
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
Li et al. PFYOLOv4: An improved small object pedestrian detection algorithm
CN111639563B (en) Basketball video event and target online detection method based on multitasking
CN109308458B (en) Method for improving small target detection precision based on characteristic spectrum scale transformation
CN110942463B (en) Video target segmentation method based on generation countermeasure network
Wang et al. Scene uyghur recognition with embedded coordinate attention
CN116797799A (en) Single-target tracking method and tracking system based on channel attention and space-time perception
Mars et al. Combination of DE-GAN with CNN-LSTM for Arabic OCR on Images with Colorful Backgrounds
Varlik et al. Filtering airborne LIDAR data by using fully convolutional networks
CN116843719A (en) Target tracking method based on independent search twin neural network
CN113963021A (en) Single-target tracking method and system based on space-time characteristics and position changes
CN115082778A (en) Multi-branch learning-based homestead identification method and system
CN115018878A (en) Attention mechanism-based target tracking method in complex scene, storage medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant