CN112203115A - Video identification method and related device
- Publication number
- CN112203115A (application CN202011078362.4A)
- Authority
- CN
- China
- Prior art keywords
- video
- video frame
- identified
- frame segments
- segments
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/23418—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/845—Structuring of content, e.g. decomposing content into time segments
- H04N21/8456—Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiment of the application discloses a video identification method and a related device. A video frame segment of a video to be identified is acquired, the spatio-temporal features of the video frame segment are extracted, and those features are matched against spatio-temporal features in a search library. When the matching succeeds, the matched spatio-temporal features in the search library are obtained, the target video containing the video frame segment that corresponds to the matched features is determined, and the name of the target video is determined as the name of the video to be identified. When the name of the video to be identified is recognized through technologies such as computer vision and machine learning, feature extraction is no longer performed on individual video frames of the video to be identified but on its video frame segments, yielding the spatio-temporal features of the segments. Extracting the spatio-temporal features of video frame segments improves the capability of analyzing the video to be identified, raises the probability that the video to be identified is named correctly, and improves the user experience.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video identification method and a related apparatus.
Background
With the development of internet technology, users can watch video resources through internet platforms. To attract users, internet platforms host a large number of videos that clip highlights, or portions likely to interest the user, out of complete videos, so that users can watch them in fragmented time.
When a user is interested in such a video and wants to watch the corresponding complete video, but the video was published on the internet platform without the name of that complete video, it is difficult for the user to learn which episode or title the clip belongs to, and the user experience is poor.
Disclosure of Invention
In order to solve the technical problem, the application provides a video identification method and a related device, which are used for identifying the name of a video and improving the experience of a user.
The embodiment of the application discloses the following technical scheme:
in one aspect, the present application provides a video identification method, including:
acquiring a video frame segment of a video to be identified, wherein the video frame segment comprises a plurality of continuous video frames;
extracting the spatio-temporal features of the video frame segment; the spatio-temporal features are fusion features of the spatial features of the video frame segment and the temporal features of the video frame segment and represent motion information of objects involved in the video frame segment; the spatial features identify appearance information of the objects involved in each video frame of the segment, and the temporal features identify motion information of the objects involved in the segment;
matching the spatio-temporal features of the video frame segment with spatio-temporal features in a search library; if the matching is successful, acquiring the successfully matched spatio-temporal features in the search library, determining the target video containing the video frame segment that corresponds to the matched features, and determining the name of the target video as the name of the video to be identified; the search library comprises the spatio-temporal features of a plurality of video frame segments of the target video, and the target video is the complete video corresponding to the video to be identified.
In another aspect, the present application provides a video recognition apparatus, comprising an acquisition unit, an extraction unit, and a processing unit;
the acquisition unit is used for acquiring a video frame segment in a video to be identified, wherein the video frame segment comprises a plurality of continuous video frames;
the extraction unit is used for extracting the spatio-temporal features of the video frame segment; the spatio-temporal features are fusion features of the spatial features of the video frame segment and the temporal features of the video frame segment and represent motion information of objects involved in the video frame segment; the spatial features identify appearance information of the objects involved in each video frame of the segment, and the temporal features identify motion information of the objects involved in the segment;
the processing unit is used for matching the spatio-temporal features of the video frame segment with spatio-temporal features in a search library; if the matching is successful, acquiring the successfully matched spatio-temporal features in the search library, determining the target video represented by the matched features, and determining the name of the target video as the name of the video to be identified; the search library comprises the spatio-temporal features of a plurality of video frame segments of the target video, and the target video is the complete video corresponding to the video to be identified.
In another aspect, an embodiment of the present application provides an apparatus for video identification, where the apparatus includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of the above aspect according to instructions in the program code.
In another aspect, the present application provides a computer-readable storage medium for storing a computer program for executing the method of the above aspect.
In another aspect, embodiments of the present application provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the methods provided in the various alternative implementations of the aspects described above.
According to the technical scheme, a video frame segment of the video to be identified is obtained, its spatio-temporal features are extracted and matched with the spatio-temporal features in the search library, the successfully matched spatio-temporal features in the search library are obtained, the target video containing the corresponding video frame segment is determined, and the name of the target video is determined as the name of the video to be identified. When the name of the video to be identified is recognized, feature extraction is not performed on individual video frames of the video to be identified but on its video frame segments, so that the spatio-temporal features of the segments are obtained; that is, not only the spatial features of a segment can be extracted, but also its temporal features can be extracted from the plurality of consecutive video frames, which improves the capability of analyzing the video to be identified and the probability of naming it correctly.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic view of an application scenario of a video identification method according to an embodiment of the present application;
fig. 2 is a flowchart of a video recognition method according to an embodiment of the present application;
fig. 3 is a schematic diagram of feature extraction provided in an embodiment of the present application;
fig. 4 is a schematic diagram of a video identification method according to an embodiment of the present application;
fig. 5 is a schematic view of an application scenario of a video identification method according to an embodiment of the present application;
fig. 6 is a schematic view of an application scenario of a video identification method according to an embodiment of the present application;
fig. 7 is a schematic view of an application scenario of a video identification method according to an embodiment of the present application;
fig. 8 is a schematic view of an application scenario of a video identification method according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a video recognition apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
When a video is not labeled with the name of the drama it belongs to, the user can hardly learn the name of the video and therefore cannot find the complete video when interested in the clip, which makes for a poor user experience. To identify the drama name of a video, the related art generally extracts every video frame of the video to be identified, extracts the spatial features of each frame picture through a pre-trained model, and matches them against a pre-established search library to obtain the name of the video to be identified. However, a video is data that changes over time: adjacent video frames are highly similar and exhibit strong temporal and spatial correlation. Extracting only per-frame spatial features cannot capture the correlation between consecutive frames, so a single video may produce many matching results and the correct name of the video cannot be determined.
Based on the above, the application provides a video identification method and a related device, which are used for identifying the name of a video and improving the experience of a user.
The video identification method provided by the embodiment of the application can be applied to a device with video identification capability, such as a terminal device or a server with a video identification function. The method can be executed independently by the terminal device or the server, or applied in a network scenario in which the terminal device and the server communicate and executed through their cooperation. The terminal device may be a smart phone, a notebook computer, a desktop computer, a Personal Digital Assistant (PDA), a tablet computer, or the like. The server may be an application server or a Web server; in actual deployment, it may be a standalone server or a server cluster.
The video recognition method provided by the embodiment of the application is realized based on Artificial Intelligence (AI). AI is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the capabilities of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
In the embodiment of the present application, the artificial intelligence software technology mainly involved includes the above-mentioned computer vision technology and machine learning technology.
Computer Vision (CV) technology is the science of how to make machines "see": using cameras and computers instead of human eyes to identify, track, and measure targets, and further processing the images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies theories and techniques that attempt to build artificial intelligence systems capable of obtaining information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
For convenience of description, the embodiments of the present application are explained mainly with a server acting as the video identification device that independently performs the video identification method provided by the embodiments of the present application.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a video identification method according to an embodiment of the present application. In this scenario, the aforementioned processing device is a server 100, and a search library 200 is also involved. As shown in fig. 1, the embodiment is described with the search library 200 located in the server 100, although the search library 200 may also be independent of the server 100.
The video 101 to be identified is composed of a plurality of video frames. After the server 100 acquires the video 101 to be identified, it extracts video frame segments from those video frames; there may be a single segment or multiple segments, and each segment includes a plurality of consecutive video frames. As shown in fig. 1, a video frame segment including 5 consecutive video frames is acquired from the video 101 to be identified.
Since a video is data that changes over time and has strong temporal and spatial correlation, the server 100 extracts the spatio-temporal features of the video frame segment in order to improve the accuracy of the identification result of the video 101 to be identified. The spatio-temporal features of a video frame segment are fusion features of its spatial features and its temporal features; that is, the server 100 extracts not only the spatial features but also the temporal features of the segment. Extracting features of multiple dimensions improves the capability of analyzing the video 101 to be identified and thus the accuracy of recognizing its name.
After extracting the spatio-temporal features of the video frame segment, the server 100 searches the search library 200 for spatio-temporal features that match them. The search library 200 includes a plurality of spatio-temporal features, each extracted from a video frame segment of a complete video. As shown in fig. 1, the search library 200 includes N spatio-temporal features corresponding to different video frame segments; these segments may come from the same complete video or from multiple complete videos. Spatio-temporal feature II, for instance, is extracted from a video frame segment of the complete video 102. If the spatio-temporal features of the video frame segment of the video 101 to be identified successfully match spatio-temporal feature II in the search library 200, feature II is obtained, the video frame segment corresponding to feature II is determined to come from the target video 102, and the name of the target video 102 is determined as the name of the video 101 to be identified.
According to the technical scheme, video frame segments are extracted from the video to be identified and feature extraction is performed on the segments rather than on every individual video frame, so that both the spatial features and the temporal features of a segment can be extracted, yielding its spatio-temporal features. Matching these spatio-temporal features against the spatio-temporal features in the search library yields the name of the video to be identified, which raises the probability of obtaining a correct result and improves the user experience.
The video identification method provided by the embodiment of the present application is described below with reference to the application scenario shown in fig. 1. Referring to fig. 2, the figure is a flowchart of a video identification method according to an embodiment of the present application. In the method shown in fig. 2, the following steps are included:
s201: and acquiring a video frame segment of the video to be identified.
The video to be identified is composed of a plurality of video frames, each video frame being a single image. Because human vision briefly retains each image, rapidly switching through the video frames is perceived as a continuously playing video. In order to identify the name of the video to be identified, a video frame segment of the video may be obtained; the segment comprises a plurality of consecutive video frames, so that the temporal correlation between video frames can subsequently be obtained from the consecutive frames.
The manner in which the video frame segments are acquired is not specifically limited. For example, every video frame of the video to be identified may be extracted and consecutive video frames treated as one video frame segment. As another example, the video to be identified may be divided into several groups of video frame segments according to a preset number of frames per segment. A segment may, for example, consist of 5 video frames; those skilled in the art can set the segment length according to actual needs.
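As an illustration of how such segments might be obtained, the following is a minimal sketch that assumes the video is decoded with OpenCV; the function name, the non-overlapping grouping, and the 5-frame segment length are illustrative assumptions rather than details prescribed by the patent.

```python
# Illustrative sketch only; the patent does not prescribe an implementation.
# Assumes OpenCV (cv2) for decoding; the 5-frame, non-overlapping grouping is an example.
import cv2

def extract_frame_segments(video_path, frames_per_segment=5):
    """Split a video into consecutive, non-overlapping video frame segments."""
    capture = cv2.VideoCapture(video_path)
    segments, current = [], []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        current.append(frame)
        if len(current) == frames_per_segment:
            segments.append(current)
            current = []
    capture.release()
    return segments  # each entry is a list of consecutive BGR frames
```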
S202: extracting the spatio-temporal features of the video frame segment.
Since a video frame segment is data that changes over time, adjacent video frames are highly similar and exhibit strong temporal and spatial correlation. In the related art, only the spatial features of individual video frames are extracted while the temporal features between frames are ignored, so the probability of obtaining the correct name of the video to be identified is not high.
Based on this, the present application extracts not only spatial features but also temporal features. The object of feature extraction is no longer an isolated video frame but a video frame segment comprising a plurality of consecutive video frames, so that the temporal features of the segment can be extracted. Compared with extracting only the spatial features of individual video frames, extracting the spatio-temporal features of a video frame segment better represents the motion information of the objects in the segment, thereby improving the probability of obtaining the correct name of the video to be identified. The spatio-temporal features of a video frame segment are fusion features of its spatial features and its temporal features; the spatial features identify the appearance information of the objects involved in each frame of the segment, and the temporal features identify the motion information of the objects involved in the segment.
Meanwhile, because adjacent frames are highly similar, extracting a feature vector (such as a spatial feature) for every frame produces a large number of redundant feature vectors, whereas extracting one feature vector (the spatio-temporal feature) per video frame segment reduces this redundancy. For example, if a video frame segment contains 3 video frames, the technical solution of the embodiment of the present application extracts 1 feature vector, while per-frame extraction yields 3; the single vector therefore carries less redundancy.
S203: matching the spatio-temporal features of the video frame segment with spatio-temporal features in a search library; if the matching is successful, acquiring the successfully matched spatio-temporal features in the search library, determining the target video containing the video frame segment that corresponds to the matched features, and determining the name of the target video as the name of the video to be identified.
After the spatio-temporal features of the video frame segment are obtained, they are matched with the spatio-temporal features in the search library to find the library features that match them.
The search library is established in advance: a complete video is divided into a plurality of video frame segments, the spatio-temporal features of each segment are extracted, and the extracted features are placed in the search library. For example, the search library may be a similarity search library (Faiss) in which the spatio-temporal features are clustered to build an inverted index. The search library holds the spatio-temporal features of many video frame segments from many complete videos and can be updated continuously as complete video resources are updated. In this way, spatio-temporal features matching those of the video frame segment of the video to be identified can be found in the search library, the corresponding video frame segment is determined from the successfully matched features, and the name of the target video containing that segment is determined as the name of the video to be identified, the target video being the complete video corresponding to the video to be identified.
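Since the passage names Faiss and an inverted index built by clustering the features, the following hedged sketch shows what such a retrieval step could look like with the standard Faiss Python bindings; the feature dimension, the number of clusters, the random placeholder features, and the metadata layout are illustrative assumptions, not details taken from the patent.

```python
# Hedged sketch of the retrieval step using the standard Faiss Python API.
# The feature dimension, cluster count, placeholder features, and metadata are illustrative.
import numpy as np
import faiss

dim, nlist = 1024, 256
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_L2)  # inverted index

# library_features: spatio-temporal features of segments cut from complete videos
library_features = np.random.rand(10000, dim).astype("float32")     # placeholder
# segment_meta[i]: (target video name, start second, end second) of library segment i
segment_meta = [("Target Drama Ep.1", 5 * i, 5 * i + 5) for i in range(10000)]

index.train(library_features)   # cluster the features to build the inverted lists
index.add(library_features)

def identify(query_feature, top_k=3):
    """Return metadata of the library segments most similar to one query feature."""
    _, ids = index.search(query_feature.reshape(1, dim), top_k)
    return [segment_meta[i] for i in ids[0] if i != -1]
```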
According to the technical scheme, when the name of the video to be identified is recognized, feature extraction is not performed on individual video frames of the video to be identified but on its video frame segments, so that the spatio-temporal features of the segments are obtained; that is, not only the spatial features of a segment can be extracted, but also its temporal features can be extracted from the plurality of consecutive video frames. Compared with extracting only the spatial features of video frames, extracting the spatio-temporal features of video frame segments improves the capability of analyzing the video to be identified, thereby improving the probability of naming the video to be identified correctly and improving the user experience.
When extracting the space-time characteristics of the video frame segments, a characteristic extraction model can be adopted for extraction. Referring to fig. 3, the figure is a schematic diagram of feature extraction provided in an embodiment of the present application. The feature extraction model includes at least a spatial convolution layer, a first fusion layer, and a second fusion layer. Each layer in the feature extraction model will be described below.
After the video frame segment of the video to be identified is obtained, each video frame is input into the spatial convolution layer of the feature extraction model, yielding the spatial feature of each frame. As shown in fig. 3, the video frame segment includes 3 consecutive video frames, the feature extraction model has 3 spatial convolution layers, and the 3 frames are input into the spatial convolution layers respectively to obtain 3 spatial features.
After the spatial feature of each frame is obtained through the spatial convolution layer, the spatial features of all frames are input into the first fusion layer of the feature extraction model, where they are fused into the spatial feature of the video frame segment. As shown in fig. 3, the 3 spatial features are fused into one spatial feature, i.e., the spatial feature of the segment. Meanwhile, the video frame segment itself is input into the first fusion layer, and because of the temporal correlation among its frames, the temporal feature of the segment can be obtained through the first fusion layer. As shown in fig. 3, the temporal feature of a segment of 3 video frames is obtained through the first fusion layer.
Finally, the spatial features and the temporal features of the video frame segment are input into the second fusion layer of the feature extraction model to obtain the spatio-temporal features of the segment.
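The patent does not give the internal structure of these layers, so the following is only a rough sketch, assuming PyTorch, of a model with the three described parts: a spatial convolution layer applied per frame, a first fusion layer that fuses the per-frame spatial features and extracts a temporal feature from the stacked frames, and a second fusion layer that merges the two. All layer sizes and the specific fusion operators (frame averaging, a 3D convolution, concatenation) are assumptions made for illustration.

```python
# Rough PyTorch sketch of the three-part model described above. Layer sizes, the use of
# frame averaging in the first fusion layer, a 3D convolution for the temporal feature,
# and concatenation in the second fusion layer are all illustrative assumptions.
import torch
import torch.nn as nn

class SpatioTemporalExtractor(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # Spatial convolution layer: applied to every video frame independently.
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        # Temporal branch of the first fusion layer: operates on the stacked frames.
        self.temporal = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(3, 7, 7), padding=(1, 3, 3)), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        # Second fusion layer: merges the segment's spatial and temporal features.
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, segment):            # segment: (T, 3, H, W) consecutive frames
        per_frame = self.spatial(segment)  # (T, feat_dim): one spatial feature per frame
        spatial_feat = per_frame.mean(dim=0)         # first fusion: fuse per-frame features
        temporal_feat = self.temporal(
            segment.permute(1, 0, 2, 3).unsqueeze(0)).squeeze(0)
        return self.fuse(torch.cat([spatial_feat, temporal_feat]))  # spatio-temporal feature
```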
The feature extraction model is not specifically limited in the present application; for example, a model based on Temporal Segment Networks (TSN) may be used. In the related art, obtaining the temporal features of video frames requires first computing optical flow pictures from the frames and then extracting the temporal features from the optical flow, which involves a large amount of computation. Compared with converting every frame of the video to be identified into an optical flow picture and then extracting temporal features, the embodiments of the present application input a video frame segment and extract the temporal features directly from it, which reduces the amount of calculation and therefore the resource consumption. Moreover, since optical flow pictures require a temporal convolution layer to extract temporal features, and the embodiments of the present application no longer use optical flow as the object of temporal feature extraction, the temporal features can be extracted through the first fusion layer; the temporal convolution layer is removed, the feature extraction model is simplified, and the computation it would have incurred is avoided.
Due to different clipping settings of video producers, even video frames of the same complete video differ in resolution, aspect ratio, and the like. To improve the editing resistance of the feature extraction model, multiple variants of each sample video frame segment may be constructed during training, i.e., sample segments with different resolutions and/or different aspect ratios, so as to increase the diversity of the sample data. These variants of the sample video frame segments are input into the feature extraction model as sample data for training, which improves the anti-editing capability of the model, raises the probability of naming the video to be identified correctly, and improves the user experience.
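A possible way to build such variants of a sample video frame segment is sketched below, assuming the frames are OpenCV arrays; the particular resolutions and aspect ratios are illustrative assumptions.

```python
# Hedged sketch of the training-time variants; the chosen resolutions and
# aspect ratios are illustrative assumptions.
import cv2

def build_segment_variants(segment, sizes=((1280, 720), (854, 480)),
                           aspect_ratios=(16 / 9, 4 / 3)):
    """Return resized / re-cropped copies of one sample video frame segment."""
    variants = []
    for w, h in sizes:                                    # different resolutions
        variants.append([cv2.resize(frame, (w, h)) for frame in segment])
    for ratio in aspect_ratios:                           # different aspect ratios
        fh, fw = segment[0].shape[:2]
        new_w = min(fw, int(fh * ratio))
        left = (fw - new_w) // 2
        variants.append([frame[:, left:left + new_w] for frame in segment])
    return variants  # each variant is used as an extra training sample
```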
Also because of different clipping settings of video producers, video frames of the same complete video may differ not only in resolution and aspect ratio but also in whether the picture contains occluding decorations such as lace borders. Since such occluders remain static in the video, the dynamic area and the static area in the video to be identified can be recognized, the static area being the area occupied by the occluder; the static area can then be removed from the video to be identified, which improves the editing resistance of the feature extraction model, raises the probability of naming the video to be identified correctly, and improves the user experience.
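The patent does not say how the static area is detected. One simple possibility, sketched below under the assumption that frame differencing over the segment is sufficient, is to mark pixels that never change as static and crop the frames to the dynamic bounding box; the threshold value is an illustrative assumption.

```python
# Hedged sketch: treat pixels that barely change across the segment as the static area
# (e.g., a lace border) and keep only the dynamic bounding box. The threshold is assumed.
import numpy as np
import cv2

def crop_dynamic_area(frames, diff_threshold=8):
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY).astype(np.int16) for f in frames]
    motion = np.zeros(gray[0].shape, dtype=np.int32)
    for a, b in zip(gray, gray[1:]):
        motion += (np.abs(a - b) > diff_threshold)   # count frame-to-frame changes
    ys, xs = np.where(motion > 0)                     # pixels belonging to the dynamic area
    if ys.size == 0:
        return frames                                 # nothing moved; keep frames as-is
    top, bottom, left, right = ys.min(), ys.max(), xs.min(), xs.max()
    return [f[top:bottom + 1, left:right + 1] for f in frames]
```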
Besides the name of the video to be identified, the positioning interval of the video to be identified within the target video can also be obtained through the spatio-temporal features in the search library. It should be noted that, if this positioning interval is needed, then when the search library is established, not only must the complete video be associated with its spatio-temporal features, but the positioning interval of the video frame segment corresponding to each feature within the complete video must also be recorded. After the successfully matched spatio-temporal features in the search library are obtained, the corresponding video frame segment is determined, and the positioning interval of the video to be identified within the target video is obtained from the positioning interval of that segment within the target video.
If a plurality of videos to be identified all correspond to one target video, they can be aggregated: the positioning intervals of the videos to be identified are obtained, the videos are sorted according to their positioning intervals, and they are displayed in chronological order. This provides the user with a continuous viewing experience, improves the user experience, and can prolong the time users spend on the internet platform (such as an APP or a web page).
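A minimal sketch of this aggregation step follows; the clip tuple layout and the example timestamps are illustrative assumptions.

```python
# Hedged sketch of the aggregation step; the clip tuple layout and timestamps are assumed.
clips = [("clip_b", "Target Drama Ep.1", 383, 420),   # starts at 06:23
         ("clip_a", "Target Drama Ep.1", 224, 260)]   # starts at 03:44

by_target = {}
for clip in clips:
    by_target.setdefault(clip[1], []).append(clip)    # group clips by target video

for target, group in by_target.items():
    group.sort(key=lambda c: c[2])                    # order by start of positioning interval
    # displaying `group` in this order gives the user a continuous viewing experience
```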
To further improve the accuracy of the positioning interval of the video to be identified, the interval can be matched multiple times. Two rounds of matching are taken as an example. After the first positioning interval of the video to be identified within the target video is obtained, the first positioning interval can be expanded, for example by extending it by a preset time period after it, or by a preset time period both before and after it, to obtain a second positioning interval. For instance, if the first positioning interval is [50 s, 100 s] and 30 seconds are added before and after it, the second positioning interval is [20 s, 130 s]. The spatio-temporal features of the video frame segment are then matched again against the spatio-temporal features of the target video within the second positioning interval in the search library; if the matching is successful, the video frame segment corresponding to the successfully matched features of the target video is obtained, and the positioning interval of the video to be identified within the target video is derived from the positioning interval of that segment, further improving the accuracy of localization within the target video.
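The two-pass localization could be organized as in the sketch below; `match_in_interval` is a hypothetical helper standing in for the second, restricted matching pass, and the 30-second extension simply mirrors the example above.

```python
# Hedged sketch of the two-pass localization. `match_in_interval` is a hypothetical helper
# standing in for the second, restricted matching pass; the 30 s extension mirrors the example.
def expand_interval(first_interval, video_length_s, extension_s=30):
    start, end = first_interval                              # e.g. (50, 100) seconds
    return (max(0, start - extension_s),
            min(video_length_s, end + extension_s))          # e.g. (20, 130)

# second_interval = expand_interval((50, 100), video_length_s=2700)
# refined = match_in_interval(segment_features, target_video, second_interval)  # hypothetical
```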
In order to better understand the video identification method provided by the above embodiment, an application scenario of the video identification method provided by the embodiment of the present application is described below with reference to fig. 4 to 8.
Referring to fig. 4, the figure is a schematic view of a video identification method according to an embodiment of the present application. The spatio-temporal features are obtained in the same way when the search library is built as when a video is identified; the search library can be built offline and videos can be identified online, although this is not specifically limited. The process of building the search library is described first, followed by the video identification process.
The process of building the search library is as follows: a video frame segment is extracted from the complete video, for example 5 video frames from video 402. If the segment's picture contains an occluder, occluder recognition can first be performed on the segment: the dynamic area and the static area are identified and the static area is removed. The spatio-temporal features of the segment are then extracted and stored in the Faiss library 404 for subsequent matching.
The video identification process is as follows: a video frame segment is extracted from the video to be identified, for example a segment of 5 video frames from the video 401 to be identified. If the segment's picture contains an occluder, occluder recognition can first be performed: the dynamic area and the static area are identified and the static area is removed. The spatio-temporal features of the segment are then extracted and compared with the spatio-temporal features in the Faiss library 404 to obtain the N features most similar to them, for example the top three. The video frame segments corresponding to these similar features are obtained, together with the names of the target videos containing them, thereby yielding candidate names for the video to be identified. A score histogram is constructed for each similar spatio-temporal feature, the abscissa being the positioning interval of the corresponding video frame segment within the target video and the ordinate being the similarity score. As can be seen from fig. 4, the first similar spatio-temporal feature has the highest score, so the video 401 to be identified matches it best; the name of the target video 403 containing the corresponding video frame segment is determined as the name of the video 401 to be identified, and the positioning interval of the video 401 within the target video 403 is [time A, time B].
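The score histogram described here can be thought of as a vote over candidate names and intervals, as in the following sketch; the example matches, scores, and names are illustrative assumptions.

```python
# Hedged sketch: the score histogram treated as a vote over candidate positioning intervals.
# The example matches, scores, and names are illustrative assumptions.
from collections import defaultdict

# (target video name, interval start s, interval end s, similarity score) per top-N match
matches = [("Target Drama Ep.3", 50, 55, 0.92),
           ("Target Drama Ep.3", 55, 60, 0.90),
           ("Other Drama Ep.7", 410, 415, 0.41)]

histogram = defaultdict(float)
for name, start, end, score in matches:
    histogram[(name, start, end)] += score            # abscissa: interval, ordinate: score

best_name, best_start, best_end = max(histogram, key=histogram.get)
print("name:", best_name, "interval:", best_start, "-", best_end)
```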
As shown in fig. 5, if a user watches the video 501 through a mobile phone client and the related information 1 of the video 501 does not include its name, the name of the video 501 can be identified by the video identification method of the embodiment of the present application and displayed on the page. Besides the name, a link to the full video of the video 501 can also be shown on the page; as shown in fig. 6, when the user clicks to view the full video, the user can directly watch the complete video corresponding to the video 501. Further, if several videos to be identified correspond to the same target video, they can be sorted based on their positioning intervals. As shown in fig. 7, if the video 502 corresponds to the same target video as the video 501, the video 502 can be displayed below the video 501 with its introduction shown in related information 2; if the user wants to watch more segments related to the video 501 or the complete video corresponding to the video 502, the user can tap the episode list to view other related videos. As shown in fig. 8, after the user taps it, related videos corresponding to the same target video can be displayed below the video 501. Both the video 502 and the video 503 correspond to the same target video as the video 501, and the positioning interval of the video 502 is ordered before that of the video 503: the interval of the video 502 starts at 03:44 and the interval of the video 503 starts at 06:23, so the video 502 precedes the video 503. Presenting related videos in this order gives the user a continuous viewing experience, improves the user experience, and can prolong the time users spend on the internet platform.
Aiming at the video identification method provided by the embodiment, the embodiment of the application also provides a video identification device.
Referring to fig. 9, the figure is a schematic view of a video recognition apparatus according to an embodiment of the present application. The device comprises: an acquisition unit 901, an extraction unit 902, and a processing unit 903;
the acquiring unit 901 is configured to acquire a video frame segment in a video to be identified, where the video frame segment includes a plurality of consecutive video frames;
the extracting unit 902 is configured to extract the spatio-temporal features of the video frame segment; the spatio-temporal features are fusion features of the spatial features of the video frame segment and the temporal features of the video frame segment and represent motion information of objects involved in the video frame segment; the spatial features identify appearance information of the objects involved in each video frame of the segment, and the temporal features identify motion information of the objects involved in the segment;
the processing unit 903 is configured to match the spatio-temporal features of the video frame segment with spatio-temporal features in a search library; if the matching is successful, obtain the successfully matched spatio-temporal features in the search library, determine the target video represented by the matched features, and determine the name of the target video as the name of the video to be identified; the search library comprises the spatio-temporal features of a plurality of video frame segments of the target video, and the target video is the complete video corresponding to the video to be identified.
As a possible implementation manner, the extracting unit 902 is further configured to input the video frame segment into a spatial convolution layer of a feature extraction model, and obtain a spatial feature of each frame of video frame in the video frame segment;
inputting the spatial features of each video frame in the video frame segments into a first fusion layer of the feature extraction model to obtain the spatial features of the video frame segments, and inputting the video frame segments into the first fusion layer of the feature extraction model to obtain the temporal features of the video frame segments;
and inputting the spatial characteristics of the video frame segments and the temporal characteristics of the video frame segments into a second fusion layer of the characteristic extraction model to obtain the spatio-temporal characteristics of the video frame segments.
As a possible implementation manner, the apparatus further includes a training unit, configured to, when training the feature extraction model, construct a plurality of segments of sample video frame segments, where the plurality of segments include video frame segments with different resolutions and/or different aspect ratios for the sample video frame segments;
and inputting a plurality of fragments of the sample video frame fragments as sample data into a feature extraction model for training.
As a possible implementation manner, the apparatus further includes a removing unit, configured to identify a dynamic area and a static area in the video to be identified when the video to be identified has a blocking object, where the static area is an area where the blocking object is located in the video to be identified;
and removing the static area in the video to be identified.
As a possible implementation manner, the processing unit 903 is further configured to obtain a spatio-temporal feature in the search library which is successfully matched, and determine a video frame segment corresponding to the spatio-temporal feature in the search library which is successfully matched;
and obtaining the positioning interval of the video to be identified in the target video according to the positioning interval of the corresponding video frame segment in the target video.
As a possible implementation manner, the apparatus further includes a sorting unit, configured to obtain, if multiple videos to be identified correspond to a same target video, a positioning interval of the multiple videos to be identified;
and sequencing the videos to be identified according to the positioning intervals of the videos to be identified.
As a possible implementation manner, the processing unit 903 is further configured to obtain a first positioning interval of the video to be identified in the target video;
adding a preset time period for the first positioning interval to obtain a second positioning interval;
matching the spatiotemporal features of the video frame segments with the spatiotemporal features of the target video based on the second positioning interval;
if the matching is successful, acquiring a video frame segment corresponding to the space-time characteristics of the successfully matched target video;
and obtaining the positioning interval of the video to be identified in the target video according to the positioning interval of the corresponding video frame segment in the target video.
The video identification apparatus provided in the above embodiment obtains a video frame segment of the video to be identified, extracts its spatio-temporal features, matches them with the spatio-temporal features in the search library, obtains the successfully matched features, determines the target video containing the corresponding video frame segment, and determines the name of the target video as the name of the video to be identified. When the name of the video to be identified is recognized, feature extraction is not performed on individual video frames but on video frame segments, so that the spatio-temporal features of the segments are obtained; that is, not only the spatial features of a segment can be extracted, but also its temporal features can be extracted from the plurality of consecutive video frames, which improves the capability of analyzing the video to be identified, the probability of naming it correctly, and the user experience.
The embodiment of the present application further provides an apparatus for video identification, and the apparatus for video identification provided in the embodiment of the present application will be described below from the perspective of hardware implementation.
Referring to fig. 10, fig. 10 is a schematic diagram of a server 1400 provided by an embodiment of the present application. The server 1400 may vary considerably in configuration and performance and may include one or more Central Processing Units (CPUs) 1422 (e.g., one or more processors), a memory 1432, and one or more storage media 1430 (e.g., one or more mass storage devices) storing applications 1442 or data 1444. The memory 1432 and the storage media 1430 may provide transient or persistent storage. The program stored on a storage medium 1430 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Furthermore, the central processing unit 1422 may be configured to communicate with the storage medium 1430 and execute, on the server 1400, the series of instruction operations stored in the storage medium 1430.
The server 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1458, and/or one or more operating systems 1441 such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 10.
The CPU 1422 is configured to perform the following steps:
acquiring a video frame segment of a video to be identified, wherein the video frame segment comprises a plurality of continuous video frames;
extracting the space-time characteristics of the video frame segments; the spatio-temporal characteristics are fusion characteristics of spatial characteristics of the video frame segments and temporal characteristics of the video frame segments, and represent motion information of objects involved in the video frame segments, the spatial characteristics are used for identifying appearance information of the objects involved in each video frame of the video frame segments, and the temporal characteristics are used for identifying motion information of the objects involved in the video frame segments;
matching the spatiotemporal characteristics of the video frame segments with spatiotemporal characteristics in a search library; if the matching is successful, acquiring the spatiotemporal characteristics in the search library that are successfully matched, determining the target video where the video frame segment corresponding to the successfully matched spatiotemporal characteristics is located, and determining the name of the target video as the name of the video to be identified; wherein the search library comprises space-time characteristics of a plurality of video frame segments of the target video, and the target video is a complete video corresponding to the video to be identified.
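As an illustration of the matching step above, the following numpy sketch compares one segment's spatiotemporal feature with every feature stored in the search library by cosine similarity and accepts the best hit only above a threshold; the threshold value and the (features, metadata) layout of the library are assumptions made for the example, not the embodiments' concrete matching rule.

```python
import numpy as np


def match_against_library(feature, library_features, library_meta, threshold=0.85):
    """Return (target_video, segment_index) for the best library match,
    or None if no similarity exceeds the threshold.

    `library_features` is an (N, D) array of stored spatiotemporal features;
    `library_meta[i]` is assumed to hold (target_video_name, segment_index).
    """
    f = feature / (np.linalg.norm(feature) + 1e-12)
    lib = library_features / (np.linalg.norm(library_features, axis=1, keepdims=True) + 1e-12)
    sims = lib @ f                      # cosine similarity to every stored segment
    best = int(np.argmax(sims))
    if sims[best] < threshold:
        return None                     # matching failed
    return library_meta[best]
```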
Optionally, the CPU 1422 may further perform method steps of any specific implementation manner of the video identification method in the embodiment of the present application.
For the video identification method described above, the embodiments of the present application further provide a terminal device for video identification, so that the video identification method described above can be implemented and applied in practice.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application. For convenience of explanation, only the parts related to the embodiments of the present application are shown, and specific technical details are omitted. The terminal device may be any terminal device, including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), and the like; the following takes a mobile phone as an example:
fig. 11 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 11, the mobile phone includes: a Radio Frequency (RF) circuit 1510, a memory 1520, an input unit 1530, a display unit 1540, a sensor 1550, an audio circuit 1560, a wireless fidelity (WiFi) module 1570, a processor 1580, and a power supply 1590. Those skilled in the art will appreciate that the handset structure shown in fig. 11 is not limiting; the handset may include more or fewer components than those shown, combine some components, or arrange the components differently.
The following describes each component of the mobile phone in detail with reference to fig. 11:
the RF circuit 1510 may be configured to receive and transmit signals during information transmission and reception or during a call; in particular, it receives downlink information from a base station and delivers it to the processor 1580 for processing, and transmits uplink data to the base station. In general, the RF circuit 1510 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1510 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 1520 may be used to store software programs and modules, and the processor 1580 implements various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1520. The memory 1520 may mainly include a program storage area and a data storage area: the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data (such as audio data or a phonebook) created according to the use of the mobile phone. Further, the memory 1520 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The input unit 1530 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1530 may include a touch panel 1531 and other input devices 1532. The touch panel 1531, also referred to as a touch screen, can collect touch operations of a user on or near it (for example, operations performed on or near the touch panel 1531 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection devices according to a preset program. Optionally, the touch panel 1531 may include two parts: a touch detection device and a touch controller. The touch detection device detects the position of the user's touch, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 1580, and can receive and execute commands sent by the processor 1580. The touch panel 1531 may be implemented as a resistive, capacitive, infrared, or surface-acoustic-wave panel. In addition to the touch panel 1531, the input unit 1530 may include other input devices 1532, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and a power switch key), a trackball, a mouse, a joystick, and the like.
The display unit 1540 may be used to display information input by the user or provided to the user, as well as various menus of the mobile phone. The display unit 1540 may include a display panel 1541, which may optionally be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. Further, the touch panel 1531 may cover the display panel 1541; when the touch panel 1531 detects a touch operation on or near it, the operation is passed to the processor 1580 to determine the type of touch event, and the processor 1580 then provides a corresponding visual output on the display panel 1541 according to the type of touch event. Although the touch panel 1531 and the display panel 1541 are shown in fig. 11 as two separate components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 1531 and the display panel 1541 may be integrated to implement the input and output functions.
The handset can also include at least one sensor 1550, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel 1541 according to the brightness of ambient light and a proximity sensor that turns off the display panel 1541 and/or the backlight when the mobile phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
WiFi is a short-range wireless transmission technology. Through the WiFi module 1570, the mobile phone can help the user to send and receive e-mails, browse web pages, access streaming media, and so on, providing wireless broadband Internet access. Although fig. 11 shows the WiFi module 1570, it is understood that it is not an essential component of the handset and may be omitted as needed without changing the essence of the invention.
The processor 1580 is the control center of the mobile phone. It connects the various parts of the entire handset by using various interfaces and lines, and performs the various functions of the handset and processes data by running or executing the software programs and/or modules stored in the memory 1520 and calling the data stored in the memory 1520, thereby monitoring the mobile phone as a whole. Optionally, the processor 1580 may include one or more processing units; preferably, the processor 1580 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communications. It is to be appreciated that the modem processor may alternatively not be integrated into the processor 1580.
The handset also includes a power supply 1590 (e.g., a battery) for powering the various components. Preferably, the power supply is logically coupled to the processor 1580 via a power management system, so that charging, discharging, and power consumption are managed through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In an embodiment of the present application, the handset includes a memory 1520 that can store program code and transmit the program code to the processor.
The processor 1580 included in the mobile phone can execute the video identification method provided by the above embodiments according to the instructions in the program code.
The embodiment of the present application further provides a computer-readable storage medium for storing a computer program, where the computer program is used to execute the video identification method provided by the foregoing embodiment.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the video recognition method provided in the various alternative implementations of the above aspects.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments may be completed by hardware instructed by a program; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium may be at least one of the following media capable of storing program code: a read-only memory (ROM), a RAM, a magnetic disk, or an optical disk.
It should be noted that the embodiments in this specification are described in a progressive manner: identical or similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, reference may be made to the description of the method embodiments. The apparatus and system embodiments described above are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, and those of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A method for video recognition, the method comprising:
acquiring a video frame segment of a video to be identified, wherein the video frame segment comprises a plurality of continuous video frames;
extracting the space-time characteristics of the video frame segments; the spatio-temporal characteristics are fusion characteristics of spatial characteristics of the video frame segments and temporal characteristics of the video frame segments, and represent motion information of objects involved in the video frame segments, the spatial characteristics are used for identifying appearance information of the objects involved in each video frame of the video frame segments, and the temporal characteristics are used for identifying motion information of the objects involved in the video frame segments;
matching the spatiotemporal characteristics of the video frame segments with spatiotemporal characteristics in a search library; if the matching is successful, acquiring the spatiotemporal characteristics in the search library that are successfully matched, determining a target video where the video frame segment corresponding to the successfully matched spatiotemporal characteristics is located, and determining the name of the target video as the name of the video to be identified; wherein the search library comprises space-time characteristics of a plurality of video frame segments of the target video, and the target video is a complete video corresponding to the video to be identified.
2. The method of claim 1, wherein the extracting spatiotemporal features of the video frame segments comprises:
inputting the video frame segment into a spatial convolution layer of a feature extraction model, and acquiring the spatial feature of each video frame in the video frame segment;
inputting the spatial features of each video frame in the video frame segments into a first fusion layer of the feature extraction model to obtain the spatial features of the video frame segments, and inputting the video frame segments into the first fusion layer of the feature extraction model to obtain the temporal features of the video frame segments;
and inputting the spatial characteristics of the video frame segments and the temporal characteristics of the video frame segments into a second fusion layer of the characteristic extraction model to obtain the spatio-temporal characteristics of the video frame segments.
3. The method of claim 2, further comprising:
when training the feature extraction model, constructing a plurality of fragments from sample video frame segments, the plurality of fragments including video frame segments of different resolutions and/or different aspect ratios derived from the sample video frame segments;
and inputting the plurality of fragments of the sample video frame segments as sample data into the feature extraction model for training.
4. The method according to any one of claims 1-3, wherein when the video to be identified has an obstruction, the method further comprises:
identifying a dynamic area and a static area in the video to be identified, wherein the static area is the area where the obstruction in the video to be identified is located;
and removing the static area in the video to be identified.
5. The method according to any one of claims 1-3, wherein the obtaining of the successfully matched spatiotemporal features in the search library comprises:
acquiring the spatio-temporal characteristics in the search library that are successfully matched, and determining the video frame segment corresponding to the successfully matched spatio-temporal characteristics;
and obtaining the positioning interval of the video to be identified in the target video according to the positioning interval of the corresponding video frame segment in the target video.
6. The method of claim 5, further comprising:
if a plurality of videos to be identified correspond to the same target video, acquiring positioning intervals of the plurality of videos to be identified;
and sequencing the videos to be identified according to the positioning intervals of the videos to be identified.
7. The method according to claim 5, wherein the obtaining the positioning interval of the video to be identified in the target video comprises:
obtaining a first positioning interval of the video to be identified in the target video;
extending the first positioning interval by a preset time period to obtain a second positioning interval;
matching the spatiotemporal features of the video frame segments with the spatiotemporal features of the target video based on the second positioning interval;
if the matching is successful, acquiring the video frame segment corresponding to the successfully matched space-time characteristics of the target video;
and obtaining the positioning interval of the video to be identified in the target video according to the positioning interval of the corresponding video frame segment in the target video.
8. A video recognition apparatus, the apparatus comprising: the device comprises an acquisition unit, an extraction unit and a processing unit;
the acquisition unit is used for acquiring a video frame segment in a video to be identified, wherein the video frame segment comprises a plurality of continuous video frames;
the extraction unit is used for extracting the space-time characteristics of the video frame segment; the spatio-temporal characteristics are fusion characteristics of spatial characteristics of the video frame segments and temporal characteristics of the video frame segments, and represent motion information of objects involved in the video frame segments, the spatial characteristics are used for identifying appearance information of the objects involved in each video frame of the video frame segments, and the temporal characteristics are used for identifying motion information of the objects involved in the video frame segments;
the processing unit is used for matching the spatio-temporal characteristics of the video frame segments with the spatio-temporal characteristics in a search library; if the matching is successful, obtaining the spatio-temporal characteristics in the search library that are successfully matched, determining a target video represented by the successfully matched spatio-temporal characteristics, and determining the name of the target video as the name of the video to be identified; wherein the search library comprises space-time characteristics of a plurality of video frame segments of the target video, and the target video is a complete video corresponding to the video to be identified.
9. An apparatus for video recognition, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-7 according to instructions in the program code.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program for performing the method of any one of claims 1-7.
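To make the feature-extraction flow of claims 2 and 3 easier to picture, the following is a minimal PyTorch-style sketch with a per-frame spatial convolution layer, a first fusion step yielding segment-level spatial and temporal features, a second fusion step yielding the spatiotemporal feature, and a small multi-resolution sample builder for training; the layer sizes, pooling choices, and resolutions are illustrative assumptions, not the claimed architecture.

```python
import torch
import torch.nn as nn


class SpatioTemporalExtractor(nn.Module):
    """Sketch of the extraction flow in claim 2: spatial features per frame,
    a first fusion producing segment-level spatial and temporal features,
    and a second fusion yielding the spatiotemporal feature of the segment."""

    def __init__(self, feat_dim=128):
        super().__init__()
        # Spatial convolution layer: applied to every frame independently.
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # First fusion: temporal convolution over the sequence of per-frame features.
        self.temporal = nn.Conv1d(64, 64, kernel_size=3, padding=1)
        # Second fusion: combine segment-level spatial and temporal features.
        self.fuse = nn.Linear(64 * 2, feat_dim)

    def forward(self, segment):                  # segment: (B, T, 3, H, W)
        b, t, c, h, w = segment.shape
        frames = segment.reshape(b * t, c, h, w)
        per_frame = self.spatial(frames).reshape(b, t, 64)    # spatial feature per frame
        spatial_feat = per_frame.mean(dim=1)                   # segment-level spatial feature
        temporal_feat = self.temporal(per_frame.transpose(1, 2)).mean(dim=2)  # temporal feature
        return self.fuse(torch.cat([spatial_feat, temporal_feat], dim=1))      # spatiotemporal feature


def multi_resolution_samples(segment, sizes=((112, 112), (112, 200), (224, 224))):
    """Claim 3 sketch: derive training fragments of several resolutions and
    aspect ratios from one sample segment (the sizes are illustrative)."""
    b, t, c, h, w = segment.shape
    frames = segment.reshape(b * t, c, h, w)
    return [nn.functional.interpolate(frames, size=s).reshape(b, t, c, *s)
            for s in sizes]
```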
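For the occlusion handling of claim 4, the following sketch separates dynamic and static areas by frame differencing and blanks out the static (occluded) area before recognition; the differencing threshold is an assumption chosen only for illustration.

```python
import numpy as np


def remove_static_region(frames, diff_threshold=8.0):
    """Claim 4 sketch: pixels whose values barely change across the clip are
    treated as the static (occluding) area and zeroed out, keeping only the
    dynamic area for recognition.

    `frames` is a (T, H, W) or (T, H, W, C) array with T >= 2 frames.
    """
    stack = frames.astype(np.float32)
    # Mean absolute difference between consecutive frames, per pixel.
    motion = np.abs(np.diff(stack, axis=0)).mean(axis=0)
    if motion.ndim == 3:                 # collapse the colour channel if present
        motion = motion.mean(axis=-1)
    dynamic_mask = motion > diff_threshold
    cleaned = stack.copy()
    cleaned[:, ~dynamic_mask] = 0        # blank the static / occluded area
    return cleaned, dynamic_mask
```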
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011078362.4A CN112203115B (en) | 2020-10-10 | 2020-10-10 | Video identification method and related device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011078362.4A CN112203115B (en) | 2020-10-10 | 2020-10-10 | Video identification method and related device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112203115A true CN112203115A (en) | 2021-01-08 |
CN112203115B CN112203115B (en) | 2023-03-10 |
Family
ID=74013941
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011078362.4A Active CN112203115B (en) | 2020-10-10 | 2020-10-10 | Video identification method and related device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112203115B (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002041569A (en) * | 2000-05-19 | 2002-02-08 | Nippon Telegr & Teleph Corp <Ntt> | Method and system for distributing retrieval service, method and device for retrieving information, information retrieving server, retrieval service providing method, program therefor, and recording medium the program recorded thereon |
JP2005025770A (en) * | 2000-05-19 | 2005-01-27 | Nippon Telegr & Teleph Corp <Ntt> | Method and system for distributing search service, method and apparatus for searching information, information search server, method for providing search service, its program, and recording medium with program recorded thereon |
CN101442641A (en) * | 2008-11-21 | 2009-05-27 | 清华大学 | Method and system for monitoring video copy based on content |
CN102222103A (en) * | 2011-06-22 | 2011-10-19 | 央视国际网络有限公司 | Method and device for processing matching relationship of video content |
CN103336957A (en) * | 2013-07-18 | 2013-10-02 | 中国科学院自动化研究所 | Network coderivative video detection method based on spatial-temporal characteristics |
CN106231356A (en) * | 2016-08-17 | 2016-12-14 | 腾讯科技(深圳)有限公司 | The treating method and apparatus of video |
KR20190088688A (en) * | 2018-01-19 | 2019-07-29 | 한국기술교육대학교 산학협력단 | Method of searching crime using video synopsis |
CN108985165A (en) * | 2018-06-12 | 2018-12-11 | 东南大学 | A kind of video copy detection system and method based on convolution and Recognition with Recurrent Neural Network |
CN110740389A (en) * | 2019-10-30 | 2020-01-31 | 腾讯科技(深圳)有限公司 | Video positioning method and device, computer readable medium and electronic equipment |
Non-Patent Citations (2)
Title |
---|
余佩诗 (Yu Peishi): "The boundary of the right to protect the integrity of works in the network environment", Journal of Qiqihar University (Philosophy and Social Sciences Edition) *
聂静, 程海燕 (Nie Jing, Cheng Haiyan): "Research on copyright protection of short-video content dissemination", China Publishing *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113038271A (en) * | 2021-03-25 | 2021-06-25 | 深圳市人工智能与机器人研究院 | Video automatic editing method, device and computer storage medium |
CN113038271B (en) * | 2021-03-25 | 2023-09-08 | 深圳市人工智能与机器人研究院 | Video automatic editing method, device and computer storage medium |
CN113033458A (en) * | 2021-04-09 | 2021-06-25 | 京东数字科技控股股份有限公司 | Action recognition method and device |
CN113033458B (en) * | 2021-04-09 | 2023-11-07 | 京东科技控股股份有限公司 | Action recognition method and device |
CN113627365A (en) * | 2021-08-16 | 2021-11-09 | 南通大学 | Group movement identification and time sequence analysis method |
CN113742519A (en) * | 2021-08-31 | 2021-12-03 | 杭州登虹科技有限公司 | Multi-object storage cloud video Timeline storage method and system |
CN114357248A (en) * | 2021-12-16 | 2022-04-15 | 北京旷视科技有限公司 | Video retrieval method, computer storage medium, electronic device, and computer program product |
Also Published As
Publication number | Publication date |
---|---|
CN112203115B (en) | 2023-03-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112203115B (en) | Video identification method and related device | |
CN111556278B (en) | Video processing method, video display device and storage medium | |
EP3944147A1 (en) | Target detection method, model training method, device, apparatus and storage medium | |
CN111209423B (en) | Image management method and device based on electronic album and storage medium | |
CN112990390B (en) | Training method of image recognition model, and image recognition method and device | |
CN114722937B (en) | Abnormal data detection method and device, electronic equipment and storage medium | |
CN110347858B (en) | Picture generation method and related device | |
CN113723159A (en) | Scene recognition model training method, scene recognition method and model training device | |
CN112995757B (en) | Video clipping method and device | |
CN113723378B (en) | Model training method and device, computer equipment and storage medium | |
CN113821720A (en) | Behavior prediction method and device and related product | |
CN113709385B (en) | Video processing method and device, computer equipment and storage medium | |
CN111491123A (en) | Video background processing method and device and electronic equipment | |
CN110516113A (en) | A kind of method of visual classification, the method and device of video classification model training | |
CN113822427A (en) | Model training method, image matching device and storage medium | |
CN113269279B (en) | Multimedia content classification method and related device | |
CN112270238A (en) | Video content identification method and related device | |
CN116955677A (en) | Method, device, equipment and storage medium for generating pictures based on characters | |
CN112256976B (en) | Matching method and related device | |
CN113780291B (en) | Image processing method and device, electronic equipment and storage medium | |
CN116071614A (en) | Sample data processing method, related device and storage medium | |
CN116453005A (en) | Video cover extraction method and related device | |
CN116152289A (en) | Target object tracking method, related device, equipment and storage medium | |
CN110750193B (en) | Scene topology determination method and device based on artificial intelligence | |
CN113536876A (en) | Image recognition method and related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40037430; Country of ref document: HK |
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||