CN114882403A - Video space-time action positioning method based on progressive attention hypergraph
- Publication number: CN114882403A
- Application number: CN202210481572.0A
- Authority: CN (China)
- Prior art keywords: target, video, space-time, action
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06N3/045: Combinations of networks
- G06N3/08: Learning methods
- G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V40/20: Movements or behaviour, e.g. gesture recognition
- G06V2201/07: Target detection
Abstract
The invention discloses a video space-time action localization method based on a progressive attention hypergraph. A given original video is first sampled into a frame sequence, and a convolutional neural network is used to obtain target region features and a video spatio-temporal feature map; a spatio-temporal relation encoder then produces target context features and a spatio-temporal relation matrix; a progressive variable-length sliding window module generates long-term target first-order features; in parallel, a hypergraph module with shared-attribute constraint and a diffusion mechanism produces short-term target high-order features; finally, a target action regression module outputs the spatial positions and action categories of all targets at different moments. The method not only adapts the window size to the original action duration, so that the target first-order features match that duration, but also captures latent relations among targets through the hypergraph module, making fuller use of target interactions and improving the accuracy of video space-time action localization.
Description
Technical Field
The invention belongs to the technical field of computer vision, in particular to the field of action localization in video processing, and relates to a video space-time action localization method based on a progressive attention hypergraph.
Background
With the rapid rise of the media industry, massive amounts of multimedia data, mainly video, are being generated. Compared with traditional image and text data, video has gradually become the mainstream media form owing to its rich visual content and intuitive form of expression. However, large volumes of video contain a great deal of complex scene information, for example many targets performing complex actions. How to quickly and accurately recognize and localize the action categories of all targets in a complex scene has therefore become an important research direction, namely the Spatio-temporal Action Localization task. The task takes as input a long, untrimmed video that may contain multiple targets, each performing multiple actions, and outputs the spatial position, start and end time, and corresponding action category of each action segment in the video. It can therefore be widely applied in practical scenarios such as security surveillance, video content moderation and traffic safety detection. For example, applied to a surveillance security system, spatio-temporal action localization can monitor and judge dangerous actions of all targets within range in real time and raise an alarm, helping to strengthen public safety; applied to a video content auditing system, it can effectively flag and filter out illegal video segments, facilitating manual review and reducing labor costs.
At present, mainstream spatio-temporal action localization methods mainly adopt a two-stage paradigm. In the first stage, Faster R-CNN (Faster Region-based Convolutional Neural Network) and the SlowFast network are used to obtain target region features and a spatio-temporal feature map; the target region features are replicated along the spatial dimensions to match the size of the spatio-temporal feature map, and the two are then concatenated along the channel dimension to generate initial target first-order relations (the spatial relations between different targets and the temporal relations of the same target). In the second stage, a Long-Term Feature Bank (LFB) serves as a memory module that stores historical target first-order features and is combined with a local attention mechanism to obtain long-term target first-order features. However, when extracting target interaction relations, such methods ignore the influence of latent relations between targets (high-order relations: a spatio-temporal relation established between two targets through a third target) on the judgment result, which causes deviations in the spatio-temporal localization of actions. Subsequent work therefore adopts a Graph Convolutional Network (GCN), describing the first-order relation between two targets through their shared spatial positions, so as to capture the context of the global scene and depict the high-order relations between targets as comprehensively as possible.
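To make the prior-art pattern described above concrete, the sketch below shows, under stated assumptions, how current target features can attend over a fixed-window long-term feature bank; the function name, tensor shapes and the 60-clip window are illustrative and do not reproduce the cited LFB or SlowFast implementations.

```python
import torch
import torch.nn.functional as F

def lfb_local_attention(query, bank, window=60):
    """Generic sketch of the prior-art pattern: current target features attend
    over a fixed-size long-term feature bank (LFB); the fixed `window` is
    exactly what the invention later makes adaptive."""
    # query: (N_t, C) current target features; bank: list of (N_s, C) tensors
    memory = torch.cat(bank[-window:], dim=0)                 # fixed temporal window
    attn = F.softmax(query @ memory.t() / query.shape[-1] ** 0.5, dim=-1)
    return query + attn @ memory                              # long-term enhanced features
```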
The shortcomings of these spatio-temporal action localization methods are mainly reflected in three aspects: (1) although a long-term feature bank with a fixed window size can capture long-term target first-order relations well, for actions of short duration an overly large time range in the feature bank causes the model to extract context-irrelevant features, reducing the accuracy of short-duration action representations; (2) the influence of a high-order relation on judging the action category at the current moment decreases as the time interval grows, while its computational cost increases, so constructing long-term high-order relations makes it difficult to meet the model's high real-time requirements; (3) a conventional graph structure can only represent pairwise relations and can hardly depict the complex and varied high-order relations among targets. Therefore, to address the reduced confidence of short-term actions caused by an inaccurate capture range for target first-order relations, and the high computational overhead caused by an unreasonable way of describing target high-order relations, a spatio-temporal action localization method that adaptively adjusts the window size according to the original action duration and correctly reflects target high-order relations is urgently needed.
Disclosure of Invention
Aiming at the shortcomings of existing methods, the invention provides a video space-time action localization method based on a progressive attention hypergraph. To handle the differences in duration among actions, the method adaptively adjusts the window size by constructing a progressive variable-length sliding window module, so as to extract more effective action feature representations; at the same time, a hypergraph module with shared-attribute constraint and a diffusion mechanism is designed to improve the ability to describe target high-order relations, thereby improving the accuracy of target action recognition.
The method of the invention performs the following operations in sequence on a video data set with given action categories and spatio-temporal action annotations:
Step (1): preprocess the video to obtain a video frame sequence, and extract target region features and a video spatio-temporal feature map using two-dimensional and three-dimensional convolutional neural networks;
Step (2): construct a spatio-temporal relation encoder that takes the target region features and the video spatio-temporal feature map as input and outputs initial target context features and a spatio-temporal relation matrix;
Step (3): construct a progressive variable-length sliding window module that takes the video frame sequence, the initial target context features and the spatio-temporal relation matrix as input and outputs long-term target first-order features;
Step (4): construct a hypergraph module with shared-attribute constraint and a diffusion mechanism that takes the initial target context features and the spatio-temporal relation matrix as input and outputs short-term target high-order features;
Step (5): construct a target action regression module that takes the long-term target first-order features and the short-term target high-order features as input and outputs the spatial positions and action categories of all targets at the current moment;
Step (6): optimize the spatio-temporal action localization model composed of the spatio-temporal relation encoder, the progressive variable-length sliding window module, the hypergraph module and the target action regression module with a stochastic gradient descent algorithm, then execute steps (1) to (5) in sequence on a new video sequence to obtain the spatial positions and action categories of all targets at different moments.
Further, the step (1) is specifically:
(1-1) Sample the original video at a rate of N frames per second to obtain a video frame sequence set with T frames, $U=\{U_1,U_2,\ldots,U_T\}$, $U_s\in\mathbb{R}^{H'\times W'\times 3}$, where $\mathbb{R}$ denotes the real number field, $U_s$ the s-th frame, H' and W' the height and width of a video frame, 3 the three RGB channels, and N = 5-10;
(1-2) Divide the original video frame sequence into T video segments, each segment consisting of 2×N frames, $V=\{V_1,V_2,\ldots,V_T\}$, where $V_t$ denotes the t-th video segment; input the t-th video segment into a three-dimensional convolutional neural network to generate the t-th segment spatio-temporal feature map $X_t\in\mathbb{R}^{H\times W\times C}$, where H, W and C are the height, width and number of channels of the feature map, and thereby obtain the spatio-temporal feature maps of all video segments;
(1-3) Use a target detection model based on a two-dimensional convolutional neural network to detect targets on the intermediate frame of the t-th video segment $V_t$, obtaining the set of target bounding boxes $B_t=\{b_t^{i,\beta}\}$, $i=1,2,\ldots,N_t$, where $N_t$ is the number of targets in the intermediate frame of the t-th segment and $\beta\in\{0,1\}$: $\beta=0$ indicates the bounding box of a person and $\beta=1$ the bounding box of an object; $b_t^i=(x_{t,1}^i,y_{t,1}^i,x_{t,2}^i,y_{t,2}^i)$ is the bounding box of the i-th target of the intermediate frame of the t-th segment, with $(x_{t,1}^i,y_{t,1}^i)$ the coordinates of its upper-left corner and $(x_{t,2}^i,y_{t,2}^i)$ the coordinates of its lower-right corner;
(1-4) According to the target bounding box $b_t^i$, obtain the corresponding bounding box on the segment spatio-temporal feature map by scaling, obtain the t-th segment target feature map of size H''×W''×C by bilinear interpolation, where H'', W'' and C are its height, width and number of channels, and apply a Global Average Pooling (GAP) operation to obtain the target feature $F_t^i\in\mathbb{R}^{C}$.
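As an illustration of step (1), the following sketch extracts the clip-level spatio-temporal feature map and the per-target features $F_t^i$; the 3D backbone is left abstract, and the temporal averaging, the 7×7 RoI size and the use of torchvision's roi_align are assumptions rather than the exact components of the embodiment.

```python
import torch
from torchvision.ops import roi_align

def extract_target_features(clip, boxes, backbone3d):
    """clip: (1, 3, 2N, H', W') frames of one video segment V_t
    boxes: (N_t, 4) detector boxes (x1, y1, x2, y2) on the intermediate frame
    returns: clip feature map X_t (C, H, W) and target features F_t (N_t, C)"""
    feat = backbone3d(clip)                  # (1, C, T', H, W) spatio-temporal features
    feat2d = feat.mean(dim=2)                # temporal average -> (1, C, H, W)

    # Scale the detector boxes from frame resolution to feature-map resolution
    # (the "scaling" of step (1-4)); roi_align then performs the bilinear
    # interpolation that crops each target's feature map.
    scale = feat2d.shape[-1] / clip.shape[-1]
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)     # prepend batch idx
    target_maps = roi_align(feat2d, rois, output_size=(7, 7),
                            spatial_scale=scale, aligned=True)       # (N_t, C, 7, 7)

    # Global average pooling (GAP) over the spatial dims gives F_t^i in R^C
    target_feats = target_maps.mean(dim=(2, 3))                      # (N_t, C)
    return feat2d.squeeze(0), target_feats
```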
Still further, the step (2) is specifically:
(2-1) Construct a spatio-temporal relation encoder consisting of three fully-connected layers; input the i-th target feature $F_t^i$ of the t-th video segment into the three fully-connected layers to obtain the query feature $q_t^i\in\mathbb{R}^{d}$, the key feature $k_t^i\in\mathbb{R}^{d}$ and the value feature $v_t^i\in\mathbb{R}^{C}$, where d denotes the number of channels of the query and key features and d < C; obtain in the same way the key feature $k_t^j$ and the value feature $v_t^j$ corresponding to the j-th target feature $F_t^j$ of the t-th video segment;
(2-2) Compute the spatio-temporal relation weight between target i and target j, $m_t^{i,j}=\mathrm{Softmax}\big(\langle q_t^i,k_t^j\rangle\big)$, and generate the spatio-temporal relation matrix of all targets of the t-th video segment, $M_t=[m_t^{i,j}]\in\mathbb{R}^{N_t\times N_t}$, where Softmax(·) denotes the Softmax function and ⟨·,·⟩ the inner product; compute the enhanced target region feature $\tilde{F}_t^i=\sum_j m_t^{i,j}v_t^j$; copy $\tilde{F}_t^i$ along the spatial dimensions so that its size matches the segment spatio-temporal feature map $X_t$, obtaining the target global spatial feature $G_t^i\in\mathbb{R}^{H\times W\times C}$;
(2-3) Concatenate the target global spatial feature $G_t^i$ with the video spatio-temporal feature map $X_t$ along the channel dimension and pass the result through a two-dimensional convolution layer to obtain the initial target context feature $A_t^i=\mathrm{Conv2D}_1(G_t^i\,\|\,X_t)$, where Conv2D_1(·) denotes a two-dimensional convolution layer with input channels C' = 2·C, output channels C and convolution kernel size 1×1×C', and ‖ denotes channel concatenation.
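A minimal sketch of the spatio-temporal relation encoder of step (2) is given below; the class name, the 1/sqrt(d) scaling of the inner product and the concrete tensor layout are assumptions, while the query/key/value projections, the relation matrix M_t, the spatial broadcasting and the 1×1 fusion convolution follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalRelationEncoder(nn.Module):
    """Sketch of step (2): three fully-connected layers produce query/key/value
    features, the relation matrix M_t is a softmax of inner products, and the
    enhanced target features are broadcast and fused with X_t."""
    def __init__(self, c, d):
        super().__init__()
        self.q = nn.Linear(c, d)   # query features, d < C
        self.k = nn.Linear(c, d)   # key features
        self.v = nn.Linear(c, c)   # value features keep C channels
        self.fuse = nn.Conv2d(2 * c, c, kernel_size=1)  # Conv2D_1 in the text
        self.d = d

    def forward(self, target_feats, clip_feat):
        # target_feats: (N_t, C), clip_feat: (C, H, W)
        q, k, v = self.q(target_feats), self.k(target_feats), self.v(target_feats)
        # m_t^{i,j} = Softmax(<q_i, k_j>); the 1/sqrt(d) scaling is an assumption
        m = F.softmax(q @ k.t() / self.d ** 0.5, dim=-1)        # (N_t, N_t)
        enhanced = m @ v                                         # (N_t, C)

        # Copy each enhanced target feature along the spatial dimensions so it
        # matches X_t, then concatenate on channels and fuse with a 1x1 conv.
        _, h, w = clip_feat.shape
        g = enhanced[:, :, None, None].expand(-1, -1, h, w)      # (N_t, C, H, W)
        x = clip_feat.unsqueeze(0).expand(len(enhanced), -1, -1, -1)
        context = self.fuse(torch.cat([g, x], dim=1))            # A_t^i, (N_t, C, H, W)
        return m, context
```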
Further, the step (3) is specifically:
(3-1) The progressive variable-length sliding window module consists of an action auxiliary judgment submodule and a target first-order feature bank: the submodule performs coarse-grained action judgment using histogram similarity, and the feature bank stores the initial target context features of all historical video segments, giving the historical target context feature set $\Omega=\{A_\phi^i\}$, where $A_\phi^i$ denotes the i-th initial target context feature of the φ-th video segment;
(3-2) Convert the intermediate frame of the current t-th video segment $V_t$ into an RGB histogram matrix $Z_t\in\mathbb{R}^{3\times 256}$ and the intermediate frame of segment $V_{t-1}$ into an RGB histogram matrix $Z_{t-1}\in\mathbb{R}^{3\times 256}$, where 3 denotes the three RGB channels;
(3-3) Use the RGB histogram matrices $Z_t$ and $Z_{t-1}$ to compute the histogram similarity $\rho_{t,t-1}$ between the intermediate frames of adjacent video segments from the per-channel pixel counts, where $R_t^{\lambda}$ and $R_{t-1}^{\lambda}$ denote the numbers of pixels of channel R with brightness λ in the intermediate frames of the t-th and (t-1)-th segments, $G_t^{\lambda}$ and $G_{t-1}^{\lambda}$ the corresponding counts for channel G, $B_t^{\lambda}$ and $B_{t-1}^{\lambda}$ the corresponding counts for channel B, and 0 ≤ λ ≤ 255; according to the histogram similarity $\rho_{t,t-1}$, compute the number $\tau_t$ of video segments similar to the t-th segment, where δ (0 < δ < 1) is a threshold constant;
(3-4) Perform steps (3-2) and (3-3) on all video segments to obtain the vector of similar-segment counts $\tau=(\tau_1,\tau_2,\ldots,\tau_T)$; set the window size to $\omega=\min(\tau_t,L_1)$ and read the historical initial target context feature set within the time window [t-ω, t) from the feature bank, where $L_1$ is a preset maximum window size and min(·,·) denotes taking the minimum;
(3-5) Use the target spatio-temporal relation matrices $M_t$ and $M_{t-1}$ to compute the similarity $E_{t,t-1}$ between the t-th and (t-1)-th video segments, and obtain $E_{t,t-2},\ldots,E_{t,t-\omega}$ in the same way; sort these similarity values in descending order to obtain the first α historical video segments most similar to the t-th segment, and perform channel concatenation on the corresponding initial target context features to obtain the target associated spatio-temporal feature with channel number C'' = α·C; input it into a two-dimensional convolution layer Conv2D_2(·) to obtain the long-term target first-order feature $O_t^i$ consistent with the original action duration, where Conv2D_2(·) denotes a two-dimensional convolution layer with input channels C'' = α·C, output channels C and convolution kernel size 1×1×C''.
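The sketch below illustrates the coarse-grained judgment of steps (3-2) to (3-4); because the exact similarity formula of step (3-3) is not reproduced here, histogram intersection is used as a stand-in assumption, and the values of delta and the maximum window L_1 are placeholders.

```python
import torch

def rgb_histogram(frame, bins=256):
    """frame: (3, H', W') uint8 intermediate frame -> (3, bins) pixel counts Z_t."""
    return torch.stack([torch.bincount(frame[c].flatten().long(), minlength=bins)
                        for c in range(3)]).float()

def window_size(hists, t, delta=0.8, max_window=32):
    """Sketch of steps (3-2)-(3-4): count how many historical segments stay
    similar to segment t, then clamp by the maximum window L_1.
    Histogram intersection is an assumed stand-in for the patent's rho."""
    tau = 0
    for s in range(t - 1, -1, -1):
        inter = torch.minimum(hists[t], hists[s]).sum()
        rho = inter / hists[t].sum()             # similarity in [0, 1]
        if rho < delta:                          # delta: threshold constant
            break
        tau += 1
    return min(tau, max_window)                  # omega = min(tau_t, L_1)
```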
Still further, the step (4) is specifically:
(4-1) Construct a hypergraph module with shared-attribute constraint and a diffusion mechanism using the relative spatial positions of the targets and the target attributes (person or object): first compute the Euclidean distance $d_{i,j}=\mathrm{dist}(c_t^i,c_t^j)$ between target i and every other target j of the intermediate frame of the t-th video segment, where $c_t^i$ and $c_t^j$ are the center coordinates of the bounding boxes of target i and target j of that frame and dist(·,·) denotes the Euclidean distance; compute the distances between target i and all other targets of the current frame to obtain the constraint set $\Theta_i$ of target i, i.e. the set of targets whose distance to target i is less than δ', where δ' is a threshold constant;
(4-2) Within the constraint set $\Theta_i$, construct high-order relations between target i and the other targets in the set: the spatio-temporal relation established between target i and target j through a target r is denoted R(i, r, j), where targets i and j are persons, target r is a person or an object, and i ≠ j; specifically:
Starting from the spatio-temporal relation matrix $M_t$ obtained in step (2-2), targets i and j are associated through the common target r, whose bounding box $b_t^r$ indicates its spatial position; the weight $m_t^{i,r}$ denotes the first-order relation between target i and target r, which is represented as the inter-target first-order feature $P_t^{i,r}$; it is generated by cropping the i-th initial target context feature $A_t^i$ according to the bounding box $b_t^r$ of target r and applying a bilinear interpolation operation;
The first-order feature $P_t^{j,r}$ of target j and target r is obtained in the same way; the high-order feature of targets i, j with respect to target r is generated through $\mathrm{Conv2D}_3(P_t^{i,r}\,\|\,P_t^{j,r})$ and written into the high-order feature set $\psi_i$ related to target i, where Conv2D_3(·) is a two-dimensional convolution layer with input channels C''' = 2·C, output channels C and convolution kernel size 1×1×C''';
(4-3) Use the high-order feature set $\psi_i$ associated with target i to compute the short-term target high-order feature $S_t^i$ by aggregating the features in $\psi_i$.
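A sketch of the hypergraph module of step (4) follows; the enumeration of (i, r, j) triples, the reuse of roi_align for the bilinear cropping and the mean aggregation of the set ψ_i into S_t^i are assumptions consistent with, but not identical to, the description above.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class HypergraphModule(nn.Module):
    """Sketch of step (4): only targets whose box centres lie within delta' of
    person i are considered (shared-attribute / spatial constraint), first-order
    features are cropped around a common target r, and pairs are fused by a
    1x1 conv (Conv2D_3); the final mean aggregation is an assumption."""
    def __init__(self, c):
        super().__init__()
        self.fuse = nn.Conv2d(2 * c, c, kernel_size=1)

    def forward(self, context, boxes, is_person, delta=2.0):
        # context: (N_t, C, H, W) initial context features A_t^i
        # boxes:   (N_t, 4) boxes on the intermediate frame, feature-map scale
        centres = (boxes[:, :2] + boxes[:, 2:]) / 2
        dist = torch.cdist(centres, centres)                 # pairwise Euclidean
        n = len(boxes)
        out = context.clone()
        for i in range(n):
            if not is_person[i]:
                continue
            hyper_feats = []
            for j in range(n):
                for r in range(n):
                    if len({i, j, r}) < 3 or not is_person[j]:
                        continue
                    if dist[i, j] > delta or dist[i, r] > delta:
                        continue                              # constraint set Theta_i
                    roi = torch.cat([torch.zeros(1, 1), boxes[r:r + 1]], dim=1)
                    p_ir = roi_align(context[i:i + 1], roi, output_size=context.shape[-2:])
                    p_jr = roi_align(context[j:j + 1], roi, output_size=context.shape[-2:])
                    hyper_feats.append(self.fuse(torch.cat([p_ir, p_jr], dim=1)))
            if hyper_feats:                                   # short-term high-order feature S_t^i
                out[i] = torch.cat(hyper_feats).mean(dim=0)
        return out
```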
Still further, the step (5) is specifically:
(5-1) Input the long-term target first-order feature $O_t^i$ and the short-term target high-order feature $S_t^i$ into the target action regression module to obtain the target localization and action judgment, specifically: first perform an element-by-element summation to obtain the element-wise sum feature $D_t^i=O_t^i\oplus S_t^i$, where ⊕ denotes the element-wise sum; then input $D_t^i$ into a two-dimensional convolution layer and perform global average pooling along the spatial dimensions to obtain the target classification scores (logits) $y_t^i=\mathrm{GAP}\big(\mathrm{Conv2D}_4(D_t^i)\big)\in\mathbb{R}^{K}$, where K denotes the number of action categories, Conv2D_4(·) denotes a two-dimensional convolution layer with input channels C, output channels K and convolution kernel size 1×1×C, and GAP(·) denotes global average pooling over the spatial dimensions;
(5-2) Process the target classification scores $y_t^i$ with the Softmax function to obtain the probability that the action category at time t is u, $p_t^i(u)=e^{y_t^i(u)}/\sum_{u'=1}^{K}e^{y_t^i(u')}$, where e denotes the natural base;
(5-3) Pass the element-wise sum feature $D_t^i$ through two two-dimensional convolution layers to obtain the target spatial position feature $\mathrm{Conv2D}_6\big(\mathrm{Conv2D}_5(D_t^i)\big)$, where Conv2D_5(·) denotes a two-dimensional convolution layer with input channels C, output channels 256 and convolution kernel size 3×3×C, and Conv2D_6(·) denotes a two-dimensional convolution layer with input channels 256, output channels 4 and convolution kernel size 1×1×256;
(5-4) Pass the target spatial position feature through a fully-connected layer to obtain the predicted target bounding box $\hat{b}_t^i=(\hat{x}_{t,1}^i,\hat{y}_{t,1}^i,\hat{x}_{t,2}^i,\hat{y}_{t,2}^i)$, where $(\hat{x}_{t,1}^i,\hat{y}_{t,1}^i)$ are the coordinates of the upper-left corner of the predicted bounding box and $(\hat{x}_{t,2}^i,\hat{y}_{t,2}^i)$ the coordinates of its lower-right corner.
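A compact sketch of the target action regression module of step (5) is given below; the ReLU between Conv2D_5 and Conv2D_6, the pooling before the fully-connected box predictor and the layer names are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ActionRegressionHead(nn.Module):
    """Sketch of step (5): the long-term first-order and short-term high-order
    features are summed element-wise; one branch predicts K action logits via
    Conv2D_4 + global average pooling + softmax, the other regresses 4 box
    coordinates via Conv2D_5, Conv2D_6 and a fully-connected layer."""
    def __init__(self, c, num_classes):
        super().__init__()
        self.cls_conv = nn.Conv2d(c, num_classes, kernel_size=1)       # Conv2D_4
        self.loc_conv = nn.Sequential(
            nn.Conv2d(c, 256, kernel_size=3, padding=1),                # Conv2D_5
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 4, kernel_size=1),                           # Conv2D_6
        )
        self.loc_fc = nn.Linear(4, 4)   # final fully-connected box predictor

    def forward(self, long_term, short_term):
        d = long_term + short_term                       # element-wise sum D_t^i
        logits = self.cls_conv(d).mean(dim=(2, 3))       # GAP -> (N_t, K)
        probs = logits.softmax(dim=-1)                   # action probabilities p_t^i(u)
        # pooling the 4-channel map before the FC layer is a simplification
        loc = self.loc_conv(d).mean(dim=(2, 3))          # (N_t, 4) spatial feature
        boxes = self.loc_fc(loc)                         # predicted (x1, y1, x2, y2)
        return probs, boxes
```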
Continuing further, the step (6) is specifically:
(6-1) Construct the spatio-temporal action localization model composed of the spatio-temporal relation encoder, the progressive variable-length sliding window module, the hypergraph module with shared-attribute constraint and diffusion mechanism, and the target action regression module;
(6-2) Sample the training video into a frame sequence, input it into the spatio-temporal action localization model to obtain the spatial positions and corresponding action categories of all targets at each moment, and compute the model's cross-entropy loss between the predicted action probabilities $p_t^i(u)$ and the ground-truth labels $g_t^{i,u}$, where $g_t^{i,u}$ indicates that the i-th target of the t-th frame contains an action of category u; compute the model's distance cross-correlation loss, an IoU-based regression loss in which IoU denotes the intersection-over-union of the predicted target bounding box $\hat{b}_t^i$ and the real target bounding box, and which further involves the upper-left and lower-right corner coordinates of the predicted and real boxes, the center coordinates of the two boxes, and the upper-left and lower-right corner coordinates of the minimum bounding box that encloses both the real and the predicted box, with max(·,·) denoting taking the maximum (a code sketch of both losses follows step (6-4) below);
(6-3) Optimize the spatio-temporal action localization model with a stochastic gradient descent algorithm and train it iteratively until convergence, obtaining the optimized spatio-temporal action localization model;
(6-4) Sample a new video into a video frame sequence, input it into the optimized spatio-temporal action localization model, execute steps (1) to (5) in sequence, and output the spatial positions and action categories of all targets of the video segments at the current moment.
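The loss sketch referenced in step (6-2) is given below; since the distance cross-correlation loss is only described through the IoU, the box centres and the minimum enclosing box, a DIoU-style formulation is assumed here.

```python
import torch
import torch.nn.functional as F

def classification_loss(logits, labels):
    """Cross-entropy over the K action categories (step (6-2))."""
    return F.cross_entropy(logits, labels)

def distance_iou_loss(pred, gt, eps=1e-7):
    """DIoU-style sketch of the 'distance cross-correlation' loss: an IoU term
    plus a penalty on the distance between box centres, normalised by the
    diagonal of the smallest box enclosing both (assumption based on (6-2))."""
    # intersection / union
    lt = torch.maximum(pred[:, :2], gt[:, :2])
    rb = torch.minimum(pred[:, 2:], gt[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)

    # squared distance between centres vs. diagonal of the enclosing box
    centre_p = (pred[:, :2] + pred[:, 2:]) / 2
    centre_g = (gt[:, :2] + gt[:, 2:]) / 2
    enc_lt = torch.minimum(pred[:, :2], gt[:, :2])
    enc_rb = torch.maximum(pred[:, 2:], gt[:, 2:])
    diag2 = ((enc_rb - enc_lt) ** 2).sum(dim=1) + eps
    centre2 = ((centre_p - centre_g) ** 2).sum(dim=1)

    return (1 - iou + centre2 / diag2).mean()
```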
The video space-time action localization method based on a progressive attention hypergraph provided by the invention has the following characteristics: (1) following the idea of progressive cognition, the progressive variable-length sliding window module treats actions of different durations differently; compared with the previous fixed long-term feature bank, this avoids, to a certain extent, short-term actions learning context-irrelevant features; (2) the hypergraph module with shared-attribute constraint and diffusion mechanism reduces the model's computational overhead through the shared attributes and spatial-position constraints among targets, while constructing latent relations among targets yields a more accurate action recognition rate; (3) running the hypergraph and the progressive variable-length sliding window in parallel effectively guarantees the real-time performance of the model.
The method is suitable for spatio-temporal action localization tasks with real-time requirements, and has the following advantages: (1) the window size is adaptively adjusted by the progressive variable-length sliding window module, which reduces the chance of short-term features learning redundant features and increases the running speed of the model; (2) applying the shared-attribute constraint to target high-order relations reduces the computational overhead of the model; (3) running the hypergraph module and the progressive variable-length sliding window module in parallel improves the operating efficiency of the model. The progressive-attention variable-length sliding window module and the hypergraph module with shared-attribute constraint and diffusion mechanism together ensure the computational efficiency of the model, and the method can be applied in fields such as traffic safety detection and illegal content identification.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, in the video space-time action localization method based on the progressive attention hypergraph, the original video is uniformly sampled and a convolutional neural network is used to extract target region features and a video spatio-temporal feature map; a spatio-temporal relation encoder produces target context features and a spatio-temporal relation matrix; the target context features and the spatio-temporal relation matrix are input into the progressive variable-length sliding window module to obtain target first-order features consistent with the original action duration; at the same time, a hypergraph module is constructed and short-term target high-order features are generated by adding the shared-attribute constraint; finally, the target action regression module yields the spatial positions and action categories of all targets at different moments. Using the progressive attention mechanism, the method adaptively adjusts the sliding window size according to the original duration of the action, reducing the possibility that short-term actions learn context-irrelevant features, and it describes the latent high-order relations among targets through the hypergraph module with shared-attribute constraint and diffusion, thereby producing accurate spatio-temporal action localization results.
Given a video data set annotated with action categories and spatio-temporal action labels, the method of the invention performs the following operations in sequence:
Step (1): preprocess the video to obtain a video frame sequence, and extract target region features and a video spatio-temporal feature map using two-dimensional and three-dimensional convolutional neural networks; specifically:
(1-1) Sample the original video at a rate of N frames per second to obtain a video frame sequence set with T frames, $U=\{U_1,U_2,\ldots,U_T\}$, $U_s\in\mathbb{R}^{H'\times W'\times 3}$, where $\mathbb{R}$ denotes the real number field, $U_s$ the s-th frame, H' and W' the height and width of a video frame, 3 the three RGB channels, and N = 5-10;
(1-2) Divide the original video frame sequence into T video segments, each segment consisting of 2×N frames, $V=\{V_1,V_2,\ldots,V_T\}$, where $V_t$ denotes the t-th video segment; input the t-th video segment into a three-dimensional convolutional neural network to generate the t-th segment spatio-temporal feature map $X_t\in\mathbb{R}^{H\times W\times C}$, where H, W and C are the height, width and number of channels of the feature map, and thereby obtain the spatio-temporal feature maps of all video segments;
(1-3) Use a target detection model based on a two-dimensional convolutional neural network to detect targets on the intermediate frame of the t-th video segment $V_t$, obtaining the set of target bounding boxes $B_t=\{b_t^{i,\beta}\}$, $i=1,2,\ldots,N_t$, where $N_t$ is the number of targets in the intermediate frame of the t-th segment and $\beta\in\{0,1\}$: $\beta=0$ indicates the bounding box of a person and $\beta=1$ the bounding box of an object; $b_t^i=(x_{t,1}^i,y_{t,1}^i,x_{t,2}^i,y_{t,2}^i)$ is the bounding box of the i-th target of the intermediate frame of the t-th segment, with $(x_{t,1}^i,y_{t,1}^i)$ the coordinates of its upper-left corner and $(x_{t,2}^i,y_{t,2}^i)$ the coordinates of its lower-right corner;
(1-4) According to the target bounding box $b_t^i$, obtain the corresponding bounding box on the segment spatio-temporal feature map by scaling, obtain the t-th segment target feature map of size H''×W''×C by bilinear interpolation, where H'', W'' and C are its height, width and number of channels, and apply a Global Average Pooling (GAP) operation to obtain the target feature $F_t^i\in\mathbb{R}^{C}$.
Step (2): construct a spatio-temporal relation encoder that takes the target region features and the video spatio-temporal feature map as input and outputs initial target context features and a spatio-temporal relation matrix; specifically:
(2-1) Construct a spatio-temporal relation encoder consisting of three fully-connected layers; input the i-th target feature $F_t^i$ of the t-th video segment into the three fully-connected layers to obtain the query feature $q_t^i\in\mathbb{R}^{d}$, the key feature $k_t^i\in\mathbb{R}^{d}$ and the value feature $v_t^i\in\mathbb{R}^{C}$, where d denotes the number of channels of the query and key features and d < C; obtain in the same way the key feature $k_t^j$ and the value feature $v_t^j$ corresponding to the j-th target feature $F_t^j$ of the t-th video segment;
(2-2) Compute the spatio-temporal relation weight between target i and target j, $m_t^{i,j}=\mathrm{Softmax}\big(\langle q_t^i,k_t^j\rangle\big)$, and generate the spatio-temporal relation matrix of all targets of the t-th video segment, $M_t=[m_t^{i,j}]\in\mathbb{R}^{N_t\times N_t}$, where Softmax(·) denotes the Softmax function and ⟨·,·⟩ the inner product; compute the enhanced target region feature $\tilde{F}_t^i=\sum_j m_t^{i,j}v_t^j$; copy $\tilde{F}_t^i$ along the spatial dimensions so that its size matches the segment spatio-temporal feature map $X_t$, obtaining the target global spatial feature $G_t^i\in\mathbb{R}^{H\times W\times C}$;
(2-3) Concatenate the target global spatial feature $G_t^i$ with the video spatio-temporal feature map $X_t$ along the channel dimension and pass the result through a two-dimensional convolution layer to obtain the initial target context feature $A_t^i=\mathrm{Conv2D}_1(G_t^i\,\|\,X_t)$, where Conv2D_1(·) denotes a two-dimensional convolution layer with input channels C' = 2·C, output channels C and convolution kernel size 1×1×C', and ‖ denotes channel concatenation.
Step (3): construct a progressive variable-length sliding window module that takes the video frame sequence, the initial target context features and the spatio-temporal relation matrix as input and outputs long-term target first-order features; specifically:
(3-1) The progressive variable-length sliding window module consists of an action auxiliary judgment submodule and a target first-order feature bank: the submodule performs coarse-grained action judgment using histogram similarity, and the feature bank stores the initial target context features of all historical video segments, giving the historical target context feature set $\Omega=\{A_\phi^i\}$, where $A_\phi^i$ denotes the i-th initial target context feature of the φ-th video segment;
(3-2) Convert the intermediate frame of the current t-th video segment $V_t$ into an RGB histogram matrix $Z_t\in\mathbb{R}^{3\times 256}$ and the intermediate frame of segment $V_{t-1}$ into an RGB histogram matrix $Z_{t-1}\in\mathbb{R}^{3\times 256}$, where 3 denotes the three RGB channels;
(3-3) Use the RGB histogram matrices $Z_t$ and $Z_{t-1}$ to compute the histogram similarity $\rho_{t,t-1}$ between the intermediate frames of adjacent video segments from the per-channel pixel counts, where $R_t^{\lambda}$ and $R_{t-1}^{\lambda}$ denote the numbers of pixels of channel R with brightness λ in the intermediate frames of the t-th and (t-1)-th segments, $G_t^{\lambda}$ and $G_{t-1}^{\lambda}$ the corresponding counts for channel G, $B_t^{\lambda}$ and $B_{t-1}^{\lambda}$ the corresponding counts for channel B, and 0 ≤ λ ≤ 255; according to the histogram similarity $\rho_{t,t-1}$, compute the number $\tau_t$ of video segments similar to the t-th segment, where δ (0 < δ < 1) is a threshold constant;
(3-4) Perform steps (3-2) and (3-3) on all video segments to obtain the vector of similar-segment counts $\tau=(\tau_1,\tau_2,\ldots,\tau_T)$; set the window size to $\omega=\min(\tau_t,L_1)$ and read the historical initial target context feature set within the time window [t-ω, t) from the feature bank, where $L_1$ is a preset maximum window size and min(·,·) denotes taking the minimum;
(3-5) Use the target spatio-temporal relation matrices $M_t$ and $M_{t-1}$ to compute the similarity $E_{t,t-1}$ between the t-th and (t-1)-th video segments, and obtain $E_{t,t-2},\ldots,E_{t,t-\omega}$ in the same way; sort these similarity values in descending order to obtain the first α historical video segments most similar to the t-th segment, and perform channel concatenation on the corresponding initial target context features to obtain the target associated spatio-temporal feature with channel number C'' = α·C; input it into a two-dimensional convolution layer Conv2D_2(·) to obtain the long-term target first-order feature $O_t^i$ consistent with the original action duration, where Conv2D_2(·) denotes a two-dimensional convolution layer with input channels C'' = α·C, output channels C and convolution kernel size 1×1×C''.
Step (4): construct a hypergraph module with shared-attribute constraint and a diffusion mechanism that takes the initial target context features and the spatio-temporal relation matrix as input and outputs short-term target high-order features; specifically:
(4-1) Construct a hypergraph module with shared-attribute constraint and a diffusion mechanism using the relative spatial positions of the targets and the target attributes (person or object): first compute the Euclidean distance $d_{i,j}=\mathrm{dist}(c_t^i,c_t^j)$ between target i and every other target j of the intermediate frame of the t-th video segment, where $c_t^i$ and $c_t^j$ are the center coordinates of the bounding boxes of target i and target j of that frame and dist(·,·) denotes the Euclidean distance; compute the distances between target i and all other targets of the current frame to obtain the constraint set $\Theta_i$ of target i, i.e. the set of targets whose distance to target i is less than δ', where δ' is a threshold constant;
(4-2) Within the constraint set $\Theta_i$, construct high-order relations between target i and the other targets in the set: the spatio-temporal relation established between target i and target j through a target r is denoted R(i, r, j), where targets i and j are persons, target r is a person or an object, and i ≠ j; specifically:
Starting from the spatio-temporal relation matrix $M_t$ obtained in step (2-2), targets i and j are associated through the common target r, whose bounding box $b_t^r$ indicates its spatial position; the weight $m_t^{i,r}$ denotes the first-order relation between target i and target r, which is represented as the inter-target first-order feature $P_t^{i,r}$; it is generated by cropping the i-th initial target context feature $A_t^i$ according to the bounding box $b_t^r$ of target r and applying a bilinear interpolation operation;
The first-order feature $P_t^{j,r}$ of target j and target r is obtained in the same way; the high-order feature of targets i, j with respect to target r is generated through $\mathrm{Conv2D}_3(P_t^{i,r}\,\|\,P_t^{j,r})$ and written into the high-order feature set $\psi_i$ related to target i, where Conv2D_3(·) is a two-dimensional convolution layer with input channels C''' = 2·C, output channels C and convolution kernel size 1×1×C''';
(4-3) Use the high-order feature set $\psi_i$ associated with target i to compute the short-term target high-order feature $S_t^i$ by aggregating the features in $\psi_i$.
Step (5): construct a target action regression module that takes the long-term target first-order features and the short-term target high-order features as input and outputs the spatial positions and action categories of all targets at the current moment; specifically:
(5-1) Input the long-term target first-order feature $O_t^i$ and the short-term target high-order feature $S_t^i$ into the target action regression module to obtain the target localization and action judgment, specifically: first perform an element-by-element summation to obtain the element-wise sum feature $D_t^i=O_t^i\oplus S_t^i$, where ⊕ denotes the element-wise sum; then input $D_t^i$ into a two-dimensional convolution layer and perform global average pooling along the spatial dimensions to obtain the target classification scores (logits) $y_t^i=\mathrm{GAP}\big(\mathrm{Conv2D}_4(D_t^i)\big)\in\mathbb{R}^{K}$, where K denotes the number of action categories, Conv2D_4(·) denotes a two-dimensional convolution layer with input channels C, output channels K and convolution kernel size 1×1×C, and GAP(·) denotes global average pooling over the spatial dimensions;
(5-2) Process the target classification scores $y_t^i$ with the Softmax function to obtain the probability that the action category at time t is u, $p_t^i(u)=e^{y_t^i(u)}/\sum_{u'=1}^{K}e^{y_t^i(u')}$, where e denotes the natural base;
(5-3) Pass the element-wise sum feature $D_t^i$ through two two-dimensional convolution layers to obtain the target spatial position feature $\mathrm{Conv2D}_6\big(\mathrm{Conv2D}_5(D_t^i)\big)$, where Conv2D_5(·) denotes a two-dimensional convolution layer with input channels C, output channels 256 and convolution kernel size 3×3×C, and Conv2D_6(·) denotes a two-dimensional convolution layer with input channels 256, output channels 4 and convolution kernel size 1×1×256;
(5-4) Pass the target spatial position feature through a fully-connected layer to obtain the predicted target bounding box $\hat{b}_t^i=(\hat{x}_{t,1}^i,\hat{y}_{t,1}^i,\hat{x}_{t,2}^i,\hat{y}_{t,2}^i)$, where $(\hat{x}_{t,1}^i,\hat{y}_{t,1}^i)$ are the coordinates of the upper-left corner of the predicted bounding box and $(\hat{x}_{t,2}^i,\hat{y}_{t,2}^i)$ the coordinates of its lower-right corner.
Step (6): optimize the spatio-temporal action localization model composed of the spatio-temporal relation encoder, the progressive variable-length sliding window module, the hypergraph module and the target action regression module with a stochastic gradient descent algorithm, then execute steps (1) to (5) in sequence on a new video sequence to obtain the spatial positions and action categories of all targets at different moments; specifically:
(6-1) Construct the spatio-temporal action localization model composed of the spatio-temporal relation encoder, the progressive variable-length sliding window module, the hypergraph module with shared-attribute constraint and diffusion mechanism, and the target action regression module;
(6-2) Sample the training video into a frame sequence, input it into the spatio-temporal action localization model to obtain the spatial positions and corresponding action categories of all targets at each moment, and compute the model's cross-entropy loss between the predicted action probabilities $p_t^i(u)$ and the ground-truth labels $g_t^{i,u}$, where $g_t^{i,u}$ indicates that the i-th target of the t-th frame contains an action of category u; compute the model's distance cross-correlation loss, an IoU-based regression loss in which IoU denotes the intersection-over-union of the predicted target bounding box $\hat{b}_t^i$ and the real target bounding box, and which further involves the upper-left and lower-right corner coordinates of the predicted and real boxes, the center coordinates of the two boxes, and the upper-left and lower-right corner coordinates of the minimum bounding box that encloses both the real and the predicted box, with max(·,·) denoting taking the maximum;
(6-3) Optimize the spatio-temporal action localization model with a stochastic gradient descent algorithm and train it iteratively until convergence, obtaining the optimized spatio-temporal action localization model;
(6-4) Sample a new video into a video frame sequence, input it into the optimized spatio-temporal action localization model, execute steps (1) to (5) in sequence, and output the spatial positions and action categories of all targets of the video segments at the current moment.
The embodiment described above is only one example of the implementation of the inventive concept; the protection scope of the invention should not be regarded as limited to the specific form set forth in the embodiment, but also covers equivalent technical means that can be conceived by those skilled in the art according to the inventive concept.
Claims (7)
1. A video space-time action localization method based on a progressive attention hypergraph, characterized in that the method performs the following operations in sequence on a video data set with given action categories and spatio-temporal action annotations:
Step (1): preprocess the video to obtain a video frame sequence, and extract target region features and a video spatio-temporal feature map using two-dimensional and three-dimensional convolutional neural networks;
Step (2): construct a spatio-temporal relation encoder that takes the target region features and the video spatio-temporal feature map as input and outputs initial target context features and a spatio-temporal relation matrix;
Step (3): construct a progressive variable-length sliding window module that takes the video frame sequence, the initial target context features and the spatio-temporal relation matrix as input and outputs long-term target first-order features;
Step (4): construct a hypergraph module with shared-attribute constraint and a diffusion mechanism that takes the initial target context features and the spatio-temporal relation matrix as input and outputs short-term target high-order features;
Step (5): construct a target action regression module that takes the long-term target first-order features and the short-term target high-order features as input and outputs the spatial positions and action categories of all targets at the current moment;
Step (6): optimize the spatio-temporal action localization model composed of the spatio-temporal relation encoder, the progressive variable-length sliding window module, the hypergraph module and the target action regression module with a stochastic gradient descent algorithm, then execute steps (1) to (5) in sequence on a new video sequence to obtain the spatial positions and action categories of all targets at different moments.
2. The video space-time action localization method based on a progressive attention hypergraph according to claim 1, characterized in that the step (1) is specifically:
(1-1) Sample the original video at a rate of N frames per second to obtain a video frame sequence set with T frames, $U=\{U_1,U_2,\ldots,U_T\}$, $U_s\in\mathbb{R}^{H'\times W'\times 3}$, where $\mathbb{R}$ denotes the real number field, $U_s$ the s-th frame, H' and W' the height and width of a video frame, 3 the three RGB channels, and N = 5-10;
(1-2) Divide the original video frame sequence into T video segments, each segment consisting of 2×N frames, $V=\{V_1,V_2,\ldots,V_T\}$, where $V_t$ denotes the t-th video segment; input the t-th video segment into a three-dimensional convolutional neural network to generate the t-th segment spatio-temporal feature map $X_t\in\mathbb{R}^{H\times W\times C}$, where H, W and C are the height, width and number of channels of the feature map, and thereby obtain the spatio-temporal feature maps of all video segments;
(1-3) Use a target detection model based on a two-dimensional convolutional neural network to detect targets on the intermediate frame of the t-th video segment $V_t$, obtaining the set of target bounding boxes $B_t=\{b_t^{i,\beta}\}$, $i=1,2,\ldots,N_t$, where $N_t$ is the number of targets in the intermediate frame of the t-th segment and $\beta\in\{0,1\}$: $\beta=0$ indicates the bounding box of a person and $\beta=1$ the bounding box of an object; $b_t^i=(x_{t,1}^i,y_{t,1}^i,x_{t,2}^i,y_{t,2}^i)$ is the bounding box of the i-th target of the intermediate frame of the t-th segment, with $(x_{t,1}^i,y_{t,1}^i)$ the coordinates of its upper-left corner and $(x_{t,2}^i,y_{t,2}^i)$ the coordinates of its lower-right corner;
(1-4) According to the target bounding box $b_t^i$, obtain the corresponding bounding box on the segment spatio-temporal feature map by scaling, obtain the t-th segment target feature map of size H''×W''×C by bilinear interpolation, where H'', W'' and C are its height, width and number of channels, and apply a global average pooling operation to obtain the target feature $F_t^i\in\mathbb{R}^{C}$.
3. The video space-time action localization method based on a progressive attention hypergraph according to claim 2, characterized in that the step (2) is specifically:
(2-1) Construct a spatio-temporal relation encoder consisting of three fully-connected layers; input the i-th target feature $F_t^i$ of the t-th video segment into the three fully-connected layers to obtain the query feature $q_t^i\in\mathbb{R}^{d}$, the key feature $k_t^i\in\mathbb{R}^{d}$ and the value feature $v_t^i\in\mathbb{R}^{C}$, where d denotes the number of channels of the query and key features and d < C; obtain in the same way the key feature $k_t^j$ and the value feature $v_t^j$ corresponding to the j-th target feature $F_t^j$ of the t-th video segment;
(2-2) Compute the spatio-temporal relation weight between target i and target j, $m_t^{i,j}=\mathrm{Softmax}\big(\langle q_t^i,k_t^j\rangle\big)$, and generate the spatio-temporal relation matrix of all targets of the t-th video segment, $M_t=[m_t^{i,j}]\in\mathbb{R}^{N_t\times N_t}$, where Softmax(·) denotes the Softmax function and ⟨·,·⟩ the inner product; compute the enhanced target region feature $\tilde{F}_t^i=\sum_j m_t^{i,j}v_t^j$; copy $\tilde{F}_t^i$ along the spatial dimensions so that its size matches the segment spatio-temporal feature map $X_t$, obtaining the target global spatial feature $G_t^i\in\mathbb{R}^{H\times W\times C}$;
(2-3) Concatenate the target global spatial feature $G_t^i$ with the video spatio-temporal feature map $X_t$ along the channel dimension and pass the result through a two-dimensional convolution layer to obtain the initial target context feature $A_t^i=\mathrm{Conv2D}_1(G_t^i\,\|\,X_t)$, where Conv2D_1(·) denotes a two-dimensional convolution layer with input channels C' = 2·C, output channels C and convolution kernel size 1×1×C', and ‖ denotes channel concatenation.
4. The video space-time action localization method based on a progressive attention hypergraph according to claim 3, characterized in that the step (3) is specifically:
(3-1) The progressive variable-length sliding window module consists of an action auxiliary judgment submodule and a target first-order feature bank: the submodule performs coarse-grained action judgment using histogram similarity, and the feature bank stores the initial target context features of all historical video segments, giving the historical target context feature set $\Omega=\{A_\phi^i\}$, where $A_\phi^i$ denotes the i-th initial target context feature of the φ-th video segment;
(3-2) Convert the intermediate frame of the current t-th video segment $V_t$ into an RGB histogram matrix $Z_t\in\mathbb{R}^{3\times 256}$ and the intermediate frame of segment $V_{t-1}$ into an RGB histogram matrix $Z_{t-1}\in\mathbb{R}^{3\times 256}$, where 3 denotes the three RGB channels;
(3-3) Use the RGB histogram matrices $Z_t$ and $Z_{t-1}$ to compute the histogram similarity $\rho_{t,t-1}$ between the intermediate frames of adjacent video segments from the per-channel pixel counts, where $R_t^{\lambda}$ and $R_{t-1}^{\lambda}$ denote the numbers of pixels of channel R with brightness λ in the intermediate frames of the t-th and (t-1)-th segments, $G_t^{\lambda}$ and $G_{t-1}^{\lambda}$ the corresponding counts for channel G, $B_t^{\lambda}$ and $B_{t-1}^{\lambda}$ the corresponding counts for channel B, and 0 ≤ λ ≤ 255; according to the histogram similarity $\rho_{t,t-1}$, compute the number $\tau_t$ of video segments similar to the t-th segment, where δ (0 < δ < 1) is a threshold constant;
(3-4) Perform steps (3-2) and (3-3) on all video segments to obtain the vector of similar-segment counts $\tau=(\tau_1,\tau_2,\ldots,\tau_T)$; set the window size to $\omega=\min(\tau_t,L_1)$ and read the historical initial target context feature set within the time window [t-ω, t) from the feature bank, where $L_1$ is a preset maximum window size and min(·,·) denotes taking the minimum;
(3-5) Use the target spatio-temporal relation matrices $M_t$ and $M_{t-1}$ to compute the similarity $E_{t,t-1}$ between the t-th and (t-1)-th video segments, and obtain $E_{t,t-2},\ldots,E_{t,t-\omega}$ in the same way; sort these similarity values in descending order to obtain the first α historical video segments most similar to the t-th segment, and perform channel concatenation on the corresponding initial target context features to obtain the target associated spatio-temporal feature with channel number C'' = α·C; input it into a two-dimensional convolution layer Conv2D_2(·) to obtain the long-term target first-order feature $O_t^i$ consistent with the original action duration, where Conv2D_2(·) denotes a two-dimensional convolution layer with input channels C'' = α·C, output channels C and convolution kernel size 1×1×C''.
5. The method for video spatiotemporal motion localization based on progressive hyperopia as claimed in claim 4, wherein the step (4) is specifically:
(4-1) constructing a hypergraph module with shared attribute constraints and a diffusion mechanism by using the relative spatial position of the target and the target attribute: firstly, the Euclidean distance between the target i and other targets j of the intermediate frame of the t-th video segment is calculated Andthe coordinates of the center positions of the target i bounding box and the target j bounding box of the t-th video segment intermediate frame,dist (·, ·) denotes the euclidean distance; calculating the distance between the target i and all other targets of the current frame to obtain a constraint set of the target i Representing a set of objects at a distance from object i less than delta',is a threshold constant;
(4-2) in a constraint setConstructing a high-order relation between a target i and other targets in the set, wherein a space-time relation established between the target i and the target j through a target R is represented as R (i, R, j), the target i and the target j are people, the target R is a person or an object, and i is not equal to j; the method comprises the following steps:
obtaining a space-time relation matrix M according to the step (2-2) t Associating using the same object Show the common purposeThe spatial position, symbol, of the symbol rThe first-order relation between the two is shown and is expressed as a first-order characteristic between the targetsIt is generated by using the initial context characteristics of the ith targetObject bounding box according to object rCutting is carried out, and first-order characteristics of the target i and the target r are obtained through bilinear interpolation operation
The first-order characteristics of the target j and the target r are obtained by the same methodBy passingGenerating high-order characteristics of the target i, j relative to the target r, and writing a high-order characteristic set related to the target iWherein Conv2D 3 (·) is a two-dimensional convolutional layer with input channel C '"2C, output channel C, and convolution kernel size 1 × 1 × C'";
(4-3) using the high-order feature set ψ_i associated with target i, the short-term target high-order feature is calculated.
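The sketch below illustrates, under stated assumptions, how the shared-attribute hypergraph step could be implemented in PyTorch. The distance threshold, the use of `torchvision.ops.roi_align` to stand in for the bounding-box crop with bilinear interpolation, and the mean pooling over ψ_i are illustrative choices, not the claim's exact formulation.

```python
# A hedged sketch of the hypergraph module with shared attribute constraint.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

def constraint_set(centers: torch.Tensor, i: int, delta: float) -> list:
    """Indices of targets whose centre lies within Euclidean distance delta
    of target i (step (4-1)). centers: (num_targets, 2) float tensor."""
    d = torch.cdist(centers[i:i + 1], centers).squeeze(0)
    return [j for j in range(len(centers)) if j != i and d[j] < delta]

def first_order_feature(context_i: torch.Tensor, box_r: torch.Tensor,
                        out_size: int = 7) -> torch.Tensor:
    """Crop target i's initial context feature (C, H, W) at object r's box
    (x1, y1, x2, y2) with bilinear interpolation (step (4-2))."""
    rois = torch.cat([torch.zeros(1, 1), box_r.view(1, 4)], dim=1)  # batch idx 0
    return roi_align(context_i.unsqueeze(0), rois, output_size=out_size)

def high_order_feature(f_ir, f_jr, conv2d_3: nn.Conv2d) -> torch.Tensor:
    """Fuse the (i, r) and (j, r) first-order features into the high-order
    feature of R(i, r, j) via Conv2D_3 (1x1 kernel, in = 2C, out = C)."""
    return conv2d_3(torch.cat([f_ir, f_jr], dim=1))

def short_term_high_order(psi_i: list) -> torch.Tensor:
    """Short-term target high-order feature of target i (step (4-3));
    mean pooling over the set psi_i is assumed here."""
    return torch.stack(psi_i).mean(dim=0)

# conv2d_3 = nn.Conv2d(2 * 256, 256, kernel_size=1)  # assuming C = 256
```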
6. The video space-time action positioning method based on progressive attention hypergraph as claimed in claim 5, wherein the step (5) is specifically:
(5-1) the long-term target first-order feature and the short-term target high-order feature are input into the target action regression module to obtain target localization and action classification, specifically: first, the two features are summed element by element to obtain the element-wise sum feature; this feature is then input into a two-dimensional convolution layer and global average pooling is performed along the spatial dimensions to obtain the target classification score, where K denotes the number of action categories, Conv2D_4(·) denotes a two-dimensional convolution layer with input channels C, output channels K, and convolution kernel size 1 × 1 × C, and GAP(·) denotes global average pooling over the spatial dimensions;
(5-2) the target classification score is processed with the Softmax function to obtain the output probability that the action class at time t is u, where e denotes the natural base;
(5-3) the element-wise sum feature is passed through two two-dimensional convolution layers to obtain the target spatial position feature, where Conv2D_5(·) denotes a two-dimensional convolution layer with input channels C, output channels 256, and convolution kernel size 3 × 3 × C, and Conv2D_6(·) denotes a two-dimensional convolution layer with input channels 256, output channels 4, and convolution kernel size 1 × 1 × 256;
(5-4) the target spatial position feature is passed through a fully connected layer to obtain the predicted target bounding box, given by the abscissa and ordinate of its upper-left corner point and the abscissa and ordinate of its lower-right corner point.
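A minimal sketch of the target action regression module of steps (5-1) to (5-4) is given below. The channel count C = 256, the number of action classes, the 7 × 7 spatial size of the input feature, and the padding choice on the 3 × 3 convolution are assumptions; only the layer shapes named in the claim are taken from the text.

```python
# A hedged sketch of the target action regression module.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionRegressionHead(nn.Module):
    def __init__(self, C: int = 256, num_classes: int = 80, feat_hw: int = 7):
        super().__init__()
        self.conv2d_4 = nn.Conv2d(C, num_classes, kernel_size=1)   # classification branch
        self.conv2d_5 = nn.Conv2d(C, 256, kernel_size=3, padding=1)
        self.conv2d_6 = nn.Conv2d(256, 4, kernel_size=1)
        self.fc = nn.Linear(4 * feat_hw * feat_hw, 4)              # predicted box (x1, y1, x2, y2)

    def forward(self, f_long: torch.Tensor, f_short: torch.Tensor):
        # Inputs assumed to be (N, C, 7, 7).
        # (5-1) element-wise sum, 1x1 conv, spatial global average pooling.
        f_sum = f_long + f_short
        cls_score = self.conv2d_4(f_sum).mean(dim=(2, 3))          # GAP -> (N, K)
        # (5-2) Softmax over action classes.
        cls_prob = F.softmax(cls_score, dim=1)
        # (5-3)/(5-4) two convolutions then a fully connected layer.
        pos = self.conv2d_6(self.conv2d_5(f_sum))                  # (N, 4, 7, 7)
        box = self.fc(pos.flatten(1))                              # (N, 4)
        return cls_prob, box

# head = ActionRegressionHead(C=256, num_classes=80)
# probs, boxes = head(f_long, f_short)
```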
7. The video space-time action positioning method based on progressive attention hypergraph as claimed in claim 6, wherein the step (6) is specifically:
(6-1) constructing a space-time action positioning model consisting of the space-time relation encoder, the progressive variable-length window module, the hypergraph module with shared attribute constraint and diffusion mechanism, and the target action regression module;
(6-2) sampling the training video into a frame sequence and inputting it into the space-time action positioning model to obtain the spatial positions and corresponding action classes of all targets at each moment; the cross-entropy loss function of the model is computed from the ground-truth labels, where a label indicates that the i-th target of frame t contains an action of class u; the distance intersection-over-union loss of the model is also computed from the intersection-over-union between the predicted target bounding box and the ground-truth target bounding box, the upper-left and lower-right corner coordinates of the predicted box, the upper-left and lower-right corner coordinates of the ground-truth box, the center-position coordinates of the predicted and ground-truth boxes, and the upper-left and lower-right corner coordinates of the smallest box that can enclose both the ground-truth and predicted boxes, where max(·,·) denotes taking the maximum value (see the loss sketch after this claim);
(6-3) optimizing the space-time action positioning model with the stochastic gradient descent algorithm, iteratively training the model until convergence to obtain the optimized space-time action positioning model;
and (6-4) sampling a new video into a video frame sequence, inputting it into the optimized space-time action positioning model, executing steps (1) to (5) in sequence, and outputting the spatial positions and action classes of all targets of the video segment at the current moment.
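The sketch below illustrates the training objective described in step (6-2): a cross-entropy classification term plus a distance-IoU style box regression term, optimised with stochastic gradient descent. The exact DIoU form and the equal weighting of the two terms are assumptions consistent with the quantities the claim enumerates (intersection-over-union, centre distance, smallest enclosing box), not the claim's exact formula.

```python
# A hedged sketch of the loss terms and optimisation in claim 7.
import torch
import torch.nn.functional as F

def diou_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """pred, gt: (N, 4) boxes given as (x1, y1, x2, y2)."""
    # Intersection-over-union of predicted and ground-truth boxes.
    x1 = torch.max(pred[:, 0], gt[:, 0]); y1 = torch.max(pred[:, 1], gt[:, 1])
    x2 = torch.min(pred[:, 2], gt[:, 2]); y2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + 1e-7)
    # Squared distance between the two box centres.
    cp = (pred[:, :2] + pred[:, 2:]) / 2
    cg = (gt[:, :2] + gt[:, 2:]) / 2
    center_dist = ((cp - cg) ** 2).sum(dim=1)
    # Squared diagonal of the smallest box enclosing both boxes.
    ex1 = torch.min(pred[:, 0], gt[:, 0]); ey1 = torch.min(pred[:, 1], gt[:, 1])
    ex2 = torch.max(pred[:, 2], gt[:, 2]); ey2 = torch.max(pred[:, 3], gt[:, 3])
    diag = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-7
    return (1 - iou + center_dist / diag).mean()

def total_loss(cls_score, labels, pred_boxes, gt_boxes):
    # Cross-entropy over action classes plus the box term; equal weighting assumed.
    return F.cross_entropy(cls_score, labels) + diou_loss(pred_boxes, gt_boxes)

# optimiser = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# loss = total_loss(cls_score, labels, pred_boxes, gt_boxes)
# loss.backward(); optimiser.step()
```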
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210481572.0A CN114882403B (en) | 2022-05-05 | 2022-05-05 | Video space-time action positioning method based on progressive attention hypergraph |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114882403A true CN114882403A (en) | 2022-08-09 |
CN114882403B CN114882403B (en) | 2022-12-02 |
Family
ID=82674257
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210481572.0A Active CN114882403B (en) | 2022-05-05 | 2022-05-05 | Video space-time action positioning method based on progressive attention hypergraph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114882403B (en) |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106611157A (en) * | 2016-11-17 | 2017-05-03 | 中国石油大学(华东) | Multi-people posture recognition method based on optical flow positioning and sliding window detection |
US20190050996A1 (en) * | 2017-08-04 | 2019-02-14 | Intel Corporation | Methods and apparatus to generate temporal representations for action recognition systems |
CN108399380A (en) * | 2018-02-12 | 2018-08-14 | 北京工业大学 | A kind of video actions detection method based on Three dimensional convolution and Faster RCNN |
CN108805083A (en) * | 2018-06-13 | 2018-11-13 | 中国科学技术大学 | The video behavior detection method of single phase |
WO2020196985A1 (en) * | 2019-03-27 | 2020-10-01 | 연세대학교 산학협력단 | Apparatus and method for video action recognition and action section detection |
CN110765854A (en) * | 2019-09-12 | 2020-02-07 | 昆明理工大学 | Video motion recognition method |
CN111291647A (en) * | 2020-01-21 | 2020-06-16 | 陕西师范大学 | Single-stage action positioning method based on multi-scale convolution kernel and superevent module |
CN112926396A (en) * | 2021-01-28 | 2021-06-08 | 杭州电子科技大学 | Action identification method based on double-current convolution attention |
CN113255443A (en) * | 2021-04-16 | 2021-08-13 | 杭州电子科技大学 | Pyramid structure-based method for positioning time sequence actions of graph attention network |
CN113239869A (en) * | 2021-05-31 | 2021-08-10 | 西安电子科技大学 | Two-stage behavior identification method and system based on key frame sequence and behavior information |
CN113822172A (en) * | 2021-08-30 | 2021-12-21 | 中国科学院上海微系统与信息技术研究所 | Video spatiotemporal behavior detection method |
Non-Patent Citations (2)
Title |
---|
JUNTING PAN et al.: "Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization", arXiv *
XIONG Chengxin et al.: "Temporal action detection with temporal-domain proposal optimization" (in Chinese), Journal of Image and Graphics *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118279786A (en) * | 2024-02-29 | 2024-07-02 | 北京科技大学 | Time sequence action positioning method based on diffusion model |
Also Published As
Publication number | Publication date |
---|---|
CN114882403B (en) | 2022-12-02 |
Similar Documents
Publication | Title
---|---
CN112926396B (en) | Action identification method based on double-current convolution attention
CN111210446B (en) | Video target segmentation method, device and equipment
CN111738055B (en) | Multi-category text detection system and bill form detection method based on same
CN112508014A (en) | Improved YOLOv3 target detection method based on attention mechanism
CN111968150A (en) | Weak surveillance video target segmentation method based on full convolution neural network
CN110942471A (en) | Long-term target tracking method based on space-time constraint
Xie et al. | GhostFormer: Efficiently amalgamated CNN-transformer architecture for object detection
CN111476133A (en) | Unmanned driving-oriented foreground and background codec network target extraction method
Mo et al. | PVDet: Towards pedestrian and vehicle detection on gigapixel-level images
CN112101344B (en) | Video text tracking method and device
CN114882403B (en) | Video space-time action positioning method based on progressive attention hypergraph
WO2023036157A1 (en) | Self-supervised spatiotemporal representation learning by exploring video continuity
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion
Li et al. | PFYOLOv4: An improved small object pedestrian detection algorithm
CN111639563B (en) | Basketball video event and target online detection method based on multitasking
CN109308458B (en) | Method for improving small target detection precision based on characteristic spectrum scale transformation
CN110942463B (en) | Video target segmentation method based on generation countermeasure network
Wang et al. | Scene uyghur recognition with embedded coordinate attention
CN116797799A (en) | Single-target tracking method and tracking system based on channel attention and space-time perception
Mars et al. | Combination of DE-GAN with CNN-LSTM for Arabic OCR on Images with Colorful Backgrounds
Varlik et al. | Filtering airborne LIDAR data by using fully convolutional networks
CN116843719A (en) | Target tracking method based on independent search twin neural network
CN113963021A (en) | Single-target tracking method and system based on space-time characteristics and position changes
CN115082778A (en) | Multi-branch learning-based homestead identification method and system
CN115018878A (en) | Attention mechanism-based target tracking method in complex scene, storage medium and equipment
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |