CN108229527A - Training and video analysis method and apparatus, electronic equipment, storage medium, program - Google Patents
- Publication number
- CN108229527A CN108229527A CN201710530371.4A CN201710530371A CN108229527A CN 108229527 A CN108229527 A CN 108229527A CN 201710530371 A CN201710530371 A CN 201710530371A CN 108229527 A CN108229527 A CN 108229527A
- Authority
- CN
- China
- Prior art keywords
- video
- shot
- segment
- visual feature
- shot segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
Embodiments of the present invention disclose a training method and apparatus, a video analysis method and apparatus, an electronic device, a storage medium, and a program. The video analysis network includes a visual feature network, and the training method includes: for any sample segment among at least one sample segment corresponding to at least one sample video, obtaining a visual feature of the sample segment using the visual feature network, the sample video being annotated with a video type label; obtaining a video prediction label of the sample segment according to the visual feature; and training the visual feature network according to the video prediction label of the sample segment and the video type label. Embodiments of the present invention enable the analysis of long videos.
Description
Technical field
The present invention relates to computer vision technology, and in particular to a training method and apparatus, a video analysis method and apparatus, an electronic device, a storage medium, and a program.
Background
A large number of films are produced worldwide every year. Film is not only a form of entertainment; it is in essence a dramatized presentation of real human life, reflecting human culture, society, and history through a rich medium. If artificial intelligence can understand films, it can better understand the real world. Therefore, analyzing videos as long and as information-rich as films is a meaningful task in the field of computer vision.
Summary of the invention
Embodiments of the present invention provide a technical solution for video analysis.
According to one aspect of embodiments of the present invention, there is provided a training method for a video analysis network, the video analysis network including a visual feature network, the method including:
for any sample segment among at least one sample segment corresponding to at least one sample video, obtaining a visual feature of the sample segment using the visual feature network, the sample video being annotated with a video type label;
obtaining a video prediction label of the sample segment according to the visual feature;
training the visual feature network according to the video prediction label of the sample segment and the video type label.
Optionally, in any of the above training method embodiments of the present invention, before obtaining the visual feature of the sample segment using the visual feature network, the method further includes:
selecting M shot segments from the sample segment, and selecting N frames of images from each of the M shot segments; where M and N are each integers greater than 0; the sample segment includes at least one shot segment of the sample video, and each shot segment includes at least one frame of image;
obtaining the visual feature of the sample segment using the visual feature network includes: using the visual feature network, for each of the M shot segments, extracting the visual features of the N frames of images selected from that shot segment.
Optionally, in any of the above training method embodiments of the present invention, obtaining the video prediction label of the sample segment according to the visual feature includes:
obtaining the video prediction label of the sample segment according to the visual features of the N frames of images selected from each of the M shot segments.
Optionally, in any of the above training method embodiments of the present invention, the sample segment includes: a trailer segment corresponding to the sample video, or a video clip edited from the sample video.
Optionally, in any of the above training method embodiments of the present invention, the values of M and N are each integers greater than or equal to 2.
Optionally, in any of the above training method embodiments of the present invention, the value of M is 8 and the value of N is 3.
Optionally, in any of the above training method embodiments of the present invention, obtaining the video prediction label of the sample segment according to the visual features of the N frames of images selected from each of the M shot segments includes:
for each shot segment, determining the type labels of the N frames of images based on their respective visual features;
determining the type label of each shot segment based on the type labels of the N frames of images selected from it;
determining the type label of the sample segment as the video prediction label based on the type labels of the M shot segments.
Optionally, in any of the above training method embodiments of the present invention, the video analysis network further includes a temporal structure network;
the method further includes:
after the training of the visual feature network satisfies a preset condition, performing feature extraction on a given video using the visual feature network to obtain visual features of a plurality of shot segments, the given video including a plurality of consecutive shot segments;
learning, using the temporal structure network, a temporal structure feature of the given video based on the visual features and the temporal relationship of the consecutive shot segments.
Optionally, in any of the above training method embodiments of the present invention, the given video specifically includes P shot segments, where P is an integer greater than 0;
performing feature extraction on the given video using the visual feature network to obtain visual features of a plurality of shot segments includes:
selecting Q frames of images from each shot segment in the given video, where Q is an integer greater than 0;
using the visual feature network, for each shot segment, extracting the visual features of the Q frames of images as the visual feature of that shot segment;
inputting the visual features of the P shot segments into the temporal structure network in sequence, according to the order of the P shot segments in the given video.
Optionally, in any of the above training method embodiments of the present invention, learning the temporal structure feature of the given video based on the visual features and the temporal relationship of the consecutive shot segments includes:
predicting the next shot segment adjacent to the given video based on the visual features and the temporal relationship of the consecutive shot segments;
training the temporal structure network based on the prediction accuracy of the next shot segment.
Optionally, in any of the above training method embodiments of the present invention, the method further includes:
selecting Q frames of images from each of L candidate shot segments, where L is an integer greater than 1, and the L candidate shot segments include at least one correct next shot segment;
using the visual feature network, for each of the L candidate shot segments, extracting the visual features of the Q frames of images as the visual feature of that candidate shot segment;
predicting the next shot segment adjacent to the given video based on the visual features and the temporal relationship of the consecutive shot segments includes:
the temporal structure network generating a temporal structure feature of the P shot segments according to their visual features and temporal relationship;
the temporal structure network obtaining, according to the temporal structure feature of the P shot segments, a visual feature of the next shot segment adjacent to the P shot segments;
using a convolutional neural network, selecting the next shot segment adjacent to the P shot segments from the L candidate shot segments according to the visual feature of the next shot segment and the visual features of the L candidate shot segments.
Optionally, in any of the above training method embodiments of the present invention, selecting the next shot segment adjacent to the P shot segments from the L candidate shot segments according to the visual feature of the next shot segment and the visual features of the L candidate shot segments includes:
obtaining, for each candidate shot segment, a probability score of being the next shot segment, according to the visual feature of the next shot segment and the visual feature of that candidate shot segment;
selecting, from the L candidate shot segments, at least one candidate shot segment with the highest probability score as the next shot segment adjacent to the P shot segments.
Optionally, in any of the above training method embodiments of the present invention, obtaining, for each of the L candidate shot segments, a probability score of being the next shot segment according to the visual feature of the next shot segment and the visual feature of that candidate shot segment includes:
replicating the visual feature of the next shot segment to obtain L copies of the visual feature, and concatenating, according to a preset format, the L copies with the visual features of the L candidate shot segments to obtain a feature matrix;
using a convolutional neural network, obtaining, based on the feature matrix, a probability score of each of the L candidate shot segments being the adjacent next shot segment.
Optionally, in any of the above training method embodiments of the present invention, after obtaining the probability score of each candidate shot segment being the next shot segment, the method further includes:
normalizing the probability scores of the L candidate shot segments being the adjacent next shot segment, to obtain L normalized probability scores;
selecting, from the L candidate shot segments, at least one candidate shot segment with the highest probability score as the next shot segment adjacent to the P shot segments includes:
selecting, from the L normalized probability scores, the candidate shot segment corresponding to the at least one highest normalized probability score as the next shot segment adjacent to the P shot segments.
Optionally, in any of the above training method embodiments of the present invention, the temporal structure network includes a recurrent neural network.
According to another aspect of embodiments of the present invention, there is provided a video analysis method, including:
selecting at least one video clip from a video;
obtaining visual features of the at least one video clip using a visual feature network;
analyzing the video according to the visual features of the at least one video clip.
Optionally, in any of the above video analysis method embodiments of the present invention, selecting at least one video clip from the video includes:
selecting X shot segments from the video, the video clip including the X shot segments;
for each of the X shot segments, selecting Y frames of images from that shot segment; where X and Y are each integers greater than 0.
Optionally, in any of the above video analysis method embodiments of the present invention, obtaining the visual features of the at least one video clip using the visual feature network includes:
using the visual feature network, for each of the X shot segments, extracting the visual features of the Y frames of images from that shot segment;
analyzing the video according to the visual features of the at least one video clip includes:
analyzing the video according to the visual features of the Y frames of images selected from each of the X shot segments.
Optionally, in any of the above video analysis method embodiments of the present invention, selecting X shot segments from the video includes:
selecting the X shot segments from the video based on feature similarity between adjacent frames of images in the video and a preset condition.
Optionally, in any of the above video analysis method embodiments of the present invention, the values of X and Y are each integers greater than or equal to 2.
Optionally, in any of the above video analysis method embodiments of the present invention, the value of X is 8 and the value of Y is 3.
Optionally, in any of the above video analysis method embodiments of the present invention, analyzing the video according to the visual features of the Y frames of images selected from each of the X shot segments includes:
for each of the X shot segments: determining the type labels of the Y frames of images based on their respective visual features; and determining the type label of that shot segment based on the type labels of the Y frames of images in it;
determining the type label of the video as a video prediction label of the video based on the type labels of the X shot segments.
Optionally, in any of the above video analysis method embodiments of the present invention, the method further includes:
annotating the video with the video prediction label.
Optionally, in any of the above video analysis method embodiments of the present invention, selecting X shot segments from the video includes:
in response to receiving a search request, sequentially selecting X consecutive shot segments from the video, the search request including a video description field;
analyzing the video according to the visual features of the Y frames of images selected from each of the X shot segments includes:
obtaining the type label of each of the X shot segments according to the visual features of the Y frames of images selected from each of the X shot segments;
outputting the shot segments among the X shot segments whose type labels match the video description field.
Optionally, in any of the above video analysis method embodiments of the present invention, outputting the shot segments among the X shot segments whose type labels match the video description field includes:
in response to obtaining the type label of any shot segment among the X shot segments, comparing whether the type label of that shot segment matches the video description field, and outputting the shot segment whose type label matches the video description field;
or
in response to obtaining the type labels of all shot segments in the video, comparing whether the type label of each of the shot segments matches the video description field, and outputting the shot segments whose type labels match the video description field.
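To make this retrieval flow concrete, here is a small hypothetical sketch: shots carry the type labels predicted by the classifier, and those whose label matches the search request's description field are returned. The Shot fields, the exact-match rule, and the function name are assumptions; the claim leaves the matching rule open.

```python
from dataclasses import dataclass

@dataclass
class Shot:
    start: float       # start time in seconds
    end: float         # end time in seconds
    type_label: str    # predicted label, e.g. "chase" (assumed vocabulary)

def search_shots(shots, description_field: str):
    """Return the shots whose predicted type label matches the
    video-description field of a search request (exact match here)."""
    return [s for s in shots if s.type_label == description_field]

shots = [Shot(0.0, 4.2, "dialogue"), Shot(4.2, 9.8, "chase")]
print(search_shots(shots, "chase"))   # -> [Shot(start=4.2, end=9.8, ...)]
```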
Optionally, in any of the above video analysis method embodiments of the present invention, analyzing the video according to the visual features of the Y frames of images selected from each of the X shot segments includes:
obtaining the type label of each of the X shot segments according to the visual features of the Y frames of images selected from each of the X shot segments;
the method further includes:
obtaining a temporal structure feature of the X shot segments using a temporal structure network;
generating a video description of the video according to the type labels and the temporal structure feature of the X shot segments.
Optionally, in any of the above video analysis method embodiments of the present invention, selecting X shot segments from the video includes: sequentially selecting X consecutive shot segments from the video;
analyzing the video according to the visual features of the Y frames of images selected from each of the X shot segments includes: obtaining the type label of each of the X shot segments according to the visual features of the Y frames of images selected from each of the X shot segments;
the method further includes:
obtaining a temporal structure feature of all shot segments in the video using a temporal structure network;
generating a video description of the video according to the type labels and the temporal structure feature of all shot segments in the video.
Optionally, in any of the above video analysis method embodiments of the present invention, selecting X shot segments from the video includes:
in response to receiving a search request, selecting a video, and sequentially selecting X consecutive shot segments from the video; the search request includes a video description field;
analyzing the video according to the visual features of the Y frames of images selected from each of the X shot segments includes:
obtaining the type label of each of the X shot segments according to the visual features of the Y frames of images selected from each of the X shot segments, and returning to the operation of sequentially selecting X consecutive shot segments from the video, until the type labels of all shot segments in the video are obtained; obtaining, using a recurrent neural network, a temporal structure feature of all shot segments in the video;
generating, using the temporal structure network, a video description of the video according to the type labels and the temporal structure feature of all shot segments in the video;
comparing whether the video description of the video matches the video description field;
in response to the video description of the video matching the video description field, outputting the video.
Optionally, in any of the above video analysis method embodiments of the present invention, the temporal structure network includes a recurrent neural network.
According to yet another aspect of embodiments of the present invention, there is provided a training apparatus for a video analysis network, the video analysis network including a visual feature network; the apparatus includes a classification module and a first training module; wherein:
the visual feature network is configured to obtain, for any sample segment among at least one sample segment corresponding to at least one sample video, a visual feature of the sample segment; the sample video is annotated with a video type label;
the classification module is configured to obtain a video prediction label of the sample segment according to the visual feature;
the first training module is configured to train the visual feature network according to the video prediction label of the sample segment and the video type label.
Optionally, in any of the above training apparatus embodiments of the present invention, the apparatus further includes: a first selection module configured to select, for any sample segment among at least one sample segment corresponding to at least one sample video, M shot segments from the sample segment; the sample segment includes at least one shot segment of the sample video, and each shot segment includes at least one frame of image;
a second selection module configured to select N frames of images from each of the M shot segments, where M and N are each integers greater than 0;
the visual feature network is specifically configured to extract, for each of the M shot segments, the visual features of the N frames of images in that shot segment;
the classification module is specifically configured to obtain the video prediction label of the sample segment according to the visual features of the N frames of images selected from each of the M shot segments.
Optionally, in any of the above training apparatus embodiments of the present invention, the sample segment includes: a trailer segment corresponding to the sample video, or a video clip edited from the sample video.
Optionally, in any of the above training apparatus embodiments of the present invention, the classification module is specifically configured to:
for each shot segment, determine the type labels of the N frames of images based on their respective visual features;
determine the type label of each shot segment based on the type labels of the N frames of images in that shot segment;
determine the type label of the sample segment as the video prediction label based on the type labels of the M shot segments.
Optionally, in any of the above training apparatus embodiments of the present invention, the visual feature network is further configured to perform, after its training satisfies a preset condition, feature extraction on a given video to obtain visual features of a plurality of shot segments; the given video includes a plurality of consecutive shot segments;
the video analysis network further includes:
a temporal structure network configured to learn a temporal structure feature of the given video based on the visual features and the temporal relationship of the consecutive shot segments.
Optionally, in any of the above training apparatus embodiments of the present invention, the given video specifically includes P shot segments, where P is an integer greater than 0;
the second selection module is further configured to select Q frames of images from each shot segment in the given video, where Q is an integer greater than 0;
when performing feature extraction on the given video to obtain the visual features of the plurality of shot segments, the visual feature network is specifically configured to: extract, for each shot segment, the visual features of the Q frames of images in that shot segment as the visual feature of that shot segment; and input the visual features of the P shot segments into the recurrent neural network in sequence, according to the order of the P shot segments in the given video.
Optionally, in any of the above training apparatus embodiments of the present invention, the temporal structure network is specifically configured to predict the next shot segment adjacent to the given video based on the visual features and the temporal relationship of the consecutive shot segments;
the apparatus further includes:
a second training module configured to train the temporal structure network based on the prediction accuracy of the next shot segment.
Optionally, in any of the above training apparatus embodiments of the present invention, the second selection module is further configured to select Q frames of images from each of L candidate shot segments; the L candidate shot segments include at least one correct next shot segment, and L is an integer greater than 1;
the visual feature network is further configured to extract, for each of the L candidate shot segments, the visual features of the Q frames of images as the visual feature of that candidate shot segment;
the temporal structure network is specifically configured to: generate a temporal structure feature of the P shot segments according to their visual features and temporal relationship; and obtain, according to the temporal structure feature of the P shot segments, a visual feature of the next shot segment adjacent to the P shot segments;
the apparatus further includes:
a convolutional neural network configured to select the next shot segment adjacent to the P shot segments from the L candidate shot segments according to the visual feature of the next shot segment and the visual features of the L candidate shot segments.
Optionally, in any of the above training apparatus embodiments of the present invention, the convolutional neural network is specifically configured to:
obtain, for each candidate shot segment, a probability score of being the next shot segment according to the visual feature of the next shot segment and the visual feature of that candidate shot segment; and
select, from the L candidate shot segments, at least one candidate shot segment with the highest probability score as the next shot segment adjacent to the P shot segments.
Optionally, in any of the above training apparatus embodiments of the present invention, the apparatus further includes:
a normalization module configured to normalize the probability scores of the L candidate shot segments being the adjacent next shot segment, to obtain L normalized probability scores;
when selecting, from the L candidate shot segments, at least one candidate shot segment with the highest probability score as the next shot segment adjacent to the P shot segments, the convolutional neural network is specifically configured to select, from the L normalized probability scores, the candidate shot segment corresponding to the at least one highest normalized probability score as the next shot segment adjacent to the P shot segments.
Optionally, in any of the above training apparatus embodiments of the present invention, the temporal structure network includes a recurrent neural network.
According to yet another aspect of embodiments of the present invention, there is provided a video analysis apparatus, including:
a first selection module configured to select at least one video clip from a video;
a visual feature network configured to obtain visual features of the at least one video clip;
a classification module configured to analyze the video according to the visual features of the at least one video clip.
According to yet another aspect of embodiments of the present invention, there is provided an electronic device including the training apparatus for a video analysis network or the video analysis apparatus according to any of the above embodiments of the present invention.
Optionally, in any of the above video analysis apparatus embodiments of the present invention, the first selection module is specifically configured to select X shot segments from the video, the video clip including the X shot segments;
the apparatus further includes:
a second selection module configured to select, for each of the X shot segments, Y frames of images from that shot segment, where X and Y are each integers greater than 0;
the visual feature network is specifically configured to extract, for each of the X shot segments, the visual features of the Y frames of images from that shot segment;
the classification module is specifically configured to analyze the video according to the visual features of the Y frames of images selected from each of the X shot segments.
Optionally, in any of the above video analysis apparatus embodiments of the present invention, the first selection module is specifically configured to select the X shot segments from the video based on feature similarity between adjacent frames of images in the video and a preset condition.
Optionally, in any of the above video analysis apparatus embodiments of the present invention, the classification module is specifically configured to:
for each of the X shot segments: determine the type labels of the Y frames of images based on their respective visual features; and determine the type label of that shot segment based on the type labels of the Y frames of images in it;
determine the type label of the video as a video prediction label of the video based on the type labels of the X shot segments.
Optionally, in any of the above video analysis apparatus embodiments of the present invention, the apparatus further includes:
an annotation module configured to annotate the video with the video prediction label.
Optionally, in any of the above video analysis apparatus embodiments of the present invention, the apparatus further includes:
a receiving module configured to receive a search request, the search request including a video description field;
the first selection module is specifically configured to sequentially select X consecutive shot segments from the video in response to the receiving module receiving the search request;
the classification module is specifically configured to obtain the type label of each of the X shot segments according to the visual features of the Y frames of images selected from each of the X shot segments;
an output module configured to output the shot segments among the X shot segments whose type labels match the video description field.
Optionally, in any of the above video analysis apparatus embodiments of the present invention, the classification module is specifically configured to obtain the type label of each of the X shot segments according to the visual features of the Y frames of images selected from each of the X shot segments;
the apparatus further includes:
a temporal structure network configured to obtain a temporal structure feature of the X shot segments;
a generation module configured to generate a video description of the video according to the type labels and the temporal structure feature of the X shot segments.
Optionally, in any of the above video analysis apparatus embodiments of the present invention, the first selection module is specifically configured to sequentially select X consecutive shot segments from the video;
the classification module is specifically configured to obtain the type label of each of the X shot segments according to the visual features of the Y frames of images selected from each of the X shot segments;
the apparatus further includes:
a temporal structure network configured to obtain a temporal structure feature of all shot segments in the video;
a generation module configured to generate a video description of the video according to the type labels and the temporal structure feature of all shot segments in the video.
According to yet another aspect of embodiments of the present invention, there is provided an electronic device, including:
a memory configured to store executable instructions; and
a processor configured to communicate with the memory to execute the executable instructions, thereby performing the operations of the training method for a video analysis network or the video analysis method according to any of the above embodiments of the present invention.
According to yet another aspect of embodiments of the present invention, there is provided a computer storage medium for storing computer-readable instructions which, when executed, implement the operations of the training method for a video analysis network or the video analysis method according to any of the above embodiments of the present invention.
According to yet another aspect of embodiments of the present invention, there is provided a computer program including computer-readable instructions which, when run in a device, cause a processor in the device to execute executable instructions for implementing the steps of the training method for a video analysis network or the video analysis method according to any of the above embodiments of the present invention.
Based on the training method and apparatus for a video analysis network, the electronic device, the computer storage medium, and the computer program provided by the above embodiments of the present invention, a technical solution is proposed for training the visual feature network using a sample segment corresponding to a sample video (for example, a trailer segment of the sample video): a sample segment is selected from the sample video and its visual feature is extracted, a video prediction label of the sample segment is obtained according to the visual feature, and the visual feature network is then trained according to the video prediction label of the sample segment and the video type label of the sample video. Since the sample segment is selected from the sample video and contains pictures from the sample video, the visual feature network is trained on a segment of the sample video rather than on the sample video itself, which greatly reduces the amount of computation, saves computing and storage resources, and improves training efficiency; and the trained visual feature network can be effectively used for analyzing the sample video.
Based on the video analysis method and apparatus, the electronic device, the computer storage medium, and the computer program provided by the above embodiments of the present invention, video clips are selected from a video, the visual features of the video clips are extracted using a visual feature network, and the video is analyzed according to the visual features. Based on embodiments of the present invention, a video can be analyzed on the basis of its clips without analyzing the entire video, which reduces the computation required for video analysis, saves computing and storage resources, and improves analysis efficiency, making the rapid analysis of long videos such as films feasible.
The technical solutions of the present invention are described in further detail below with reference to the accompanying drawings and embodiments.
Description of the drawings
The accompanying drawings, which constitute a part of the specification, illustrate embodiments of the present invention and, together with the description, serve to explain the principles of the present invention.
The present invention can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of one embodiment of the training method for a video analysis network of the present invention.
Fig. 2 is a flowchart of another embodiment of the training method for a video analysis network of the present invention.
Fig. 3 is a schematic diagram of training the visual feature network in an embodiment of the present invention.
Fig. 4 is a flowchart of yet another embodiment of the training method for a video analysis network of the present invention.
Fig. 5 is a flowchart of a further embodiment of the training method for a video analysis network of the present invention.
Fig. 6 is a schematic diagram of training the recurrent neural network in an embodiment of the present invention.
Fig. 7 is a flowchart of one embodiment of the video analysis method of the present invention.
Fig. 8 is a flowchart of another embodiment of the video analysis method of the present invention.
Fig. 9 is a flowchart of one application embodiment of the video analysis method of the present invention.
Fig. 10 is a flowchart of another application embodiment of the video analysis method of the present invention.
Fig. 11 is a flowchart of yet another application embodiment of the video analysis method of the present invention.
Fig. 12 is a schematic structural diagram of one embodiment of the training apparatus for a video analysis network of the present invention.
Fig. 13 is a schematic structural diagram of another embodiment of the training apparatus for a video analysis network of the present invention.
Fig. 14 is a schematic structural diagram of one embodiment of the video analysis apparatus of the present invention.
Fig. 15 is a schematic structural diagram of another embodiment of the video analysis apparatus of the present invention.
Fig. 16 is a schematic structural diagram of yet another embodiment of the video analysis apparatus of the present invention.
Fig. 17 is a schematic structural diagram of a further embodiment of the video analysis apparatus of the present invention.
Fig. 18 is a schematic structural diagram of a still further embodiment of the video analysis apparatus of the present invention.
Fig. 19 is a schematic structural diagram of one application embodiment of the electronic device of the present invention.
Detailed description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present invention.
It should also be understood that, for ease of description, the sizes of the various parts shown in the accompanying drawings are not drawn according to actual proportional relationships.
The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present invention or its application or use.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be considered part of the specification.
It should be noted that similar reference numerals and letters denote similar items in the following accompanying drawings; therefore, once an item is defined in one drawing, it need not be further discussed in subsequent drawings.
Embodiments of the present invention can be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate together with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments including any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, and servers can be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, target programs, components, logic, data structures, and the like, which perform specific tasks or implement specific abstract data types. The computer system/server can be implemented in a distributed cloud computing environment, in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media including storage devices.
In the course of implementing the present invention, the inventors found through research that, because the amount of computation of video analysis techniques in the field of computer vision is very large, current video analysis techniques in the field of computer vision can only analyze short videos of tens of seconds, and cannot be directly extended to analyzing films of one or two hours or longer.
Fig. 1 is a flowchart of one embodiment of the training method for a video analysis network of the present invention. The video analysis network of the embodiment of the present invention includes a visual feature network. As shown in Fig. 1, the training method of this embodiment performs the following operations for any sample segment among at least one sample segment corresponding to at least one sample video:
102: For any sample segment among at least one sample segment corresponding to at least one sample video, obtain a visual feature of the sample segment using the visual feature network.
In embodiments of the present invention, a sample video is a video that participates in the training of the video analysis network as a sample, and a sample segment is a segment selected from the sample video; the length of the sample segment is shorter than that of the sample video.
The sample video or its sample segment is annotated with the type label of the sample video, which is referred to in the embodiments of the present invention as the video type label. The video type label of a sample video is a label annotated in advance by the producer or uploader of the sample video or its sample segment, or annotated by users after viewing the sample video or its sample segment, for example, horror, comedy, music, crime, and the like. The video type labels of sample videos can be obtained from film websites (for example, IMDb).
In the embodiments of the present invention, a visual feature refers to a feature of a video, a segment thereof, or a frame of image in the visual sense.
104: Obtain a video prediction label of the sample segment according to the visual feature.
106: Train the visual feature network according to the video prediction label of the sample segment and the video type label of the corresponding sample video.
Illustratively, the above operations 102 to 106 may be performed iteratively, and the visual feature network is trained through this iterative process until a preset condition is satisfied, for example, the difference between the video prediction label of the sample segment and the corresponding video type label is less than a preset value, or the number of training iterations of the visual feature network reaches a preset number; the training is then complete.
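The iterative process of operations 102 to 106 can be pictured with the following minimal training-loop sketch. It is not the patent's implementation: the tiny backbone, the 20-class genre head, the cross-entropy loss, and the function names are all illustrative assumptions; the only supervision used is the video-level type label, as the method requires.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: any image backbone and linear genre head work.
feature_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU())
classifier = nn.Linear(256, 20)          # 20 hypothetical genre labels
optimizer = torch.optim.SGD(
    list(feature_net.parameters()) + list(classifier.parameters()), lr=0.01
)
loss_fn = nn.CrossEntropyLoss()

def training_step(frames: torch.Tensor, genre: torch.Tensor) -> float:
    """One iteration of operations 102-106: frames come from one sample
    segment (e.g. a trailer); genre is the video type label of the
    whole sample video - the only supervision used."""
    feats = feature_net(frames)                  # 102: visual features
    logits = classifier(feats).mean(dim=0)       # 104: fuse frame predictions
    loss = loss_fn(logits.unsqueeze(0), genre)   # 106: compare with type label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call: 24 frames of a trailer, annotated with genre class 3.
loss = training_step(torch.randn(24, 3, 64, 64), torch.tensor([3]))
```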
Based on the training method for a video analysis network provided by the above embodiment of the present invention, a technical solution is proposed for training the visual feature network using a sample segment corresponding to a sample video (for example, a trailer segment of the sample video): a sample segment is selected from the sample video and its visual feature is extracted, a video prediction label of the sample segment is obtained according to the visual feature, and the visual feature network is then trained according to the video prediction label of the sample segment and the video type label of the sample video. Since the sample segment is selected from the sample video and contains pictures from the sample video, the visual feature network is trained on a segment of the sample video rather than on the sample video itself, which greatly reduces the amount of computation, saves computing and storage resources, and improves training efficiency; and the trained visual feature network can be effectively used for analyzing the sample video.
Fig. 2 is a flowchart of another embodiment of the training method for a video analysis network of the present invention. As shown in Fig. 2, the training method of this embodiment performs the following operations for any sample segment among at least one sample segment corresponding to at least one sample video:
202: Select M shot segments from the sample segment, and select N frames of images from each of the M shot segments.
Each sample segment includes at least one shot segment of the sample video, and each shot segment includes at least one frame of image. In a specific example of the embodiment of the present invention, the sample segment is specifically a trailer segment corresponding to the sample video, or a video clip edited from the sample video. In a video, a video trailer segment, or a video clip, each screen switch marks a transition from one shot segment to another.
Illustratively, the M shot segments can be selected from the sample segment based on feature similarity between adjacent frames of images in the sample segment and a preset condition. For example, in practical applications, the features of adjacent frames of images can be compared to compute a feature similarity score between consecutive frames; at the same time, based on preset conditions set from the perspective of the entire sample segment, for example, that the number of frames in a shot segment is no less than a first preset value and no more than a second preset value, the shot segments in the sample segment are identified and distinguished, and M shot segments are then selected from the sample segment.
In a specific example, the M shot segments may be selected from the sample segment randomly, or in a predetermined manner, for example, selecting one shot segment every one or more shot segments. Similarly, the N frames of images may be selected from a shot segment randomly, or in a predetermined manner, for example, selecting one frame of image every one or more frames.
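The shot segmentation and sparse sampling just described might look like the following sketch. The cosine-similarity metric, the threshold, and the min/max shot lengths are assumptions, since the embodiment only requires "feature similarity between adjacent frames and a preset condition".

```python
import random

def similarity(a, b):
    """Cosine similarity between two frame feature vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb + 1e-8)

def split_into_shots(frame_feats, sim_threshold=0.85, min_len=8, max_len=400):
    """Cut a segment into shots where adjacent-frame feature similarity
    drops below a threshold, enforcing preset min/max shot lengths."""
    shots, start = [], 0
    for i in range(1, len(frame_feats)):
        too_long = i - start >= max_len
        boundary = similarity(frame_feats[i - 1], frame_feats[i]) < sim_threshold
        if (boundary and i - start >= min_len) or too_long:
            shots.append((start, i))
            start = i
    shots.append((start, len(frame_feats)))
    return shots

def sample(shots, M=8, N=3):
    """Sparse sampling: M random shots per segment, N random frame
    indices per shot (the embodiment also allows strided selection)."""
    chosen = random.sample(shots, min(M, len(shots)))
    return [(s, sorted(random.sample(range(s[0], s[1]),
                                     min(N, s[1] - s[0])))) for s in chosen]
```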
Here, M and N are each integers greater than 0. The larger the values of M and N, the richer the visual features obtained, which can make the video prediction label of the sample segment more accurate and the performance of the trained visual feature network better; however, more computation is required and more computing resources are consumed. In a specific example of the embodiment of the present invention, the values of M and N are each integers greater than or equal to 2; for example, the value of M is 8 and the value of N is 3. The inventors found through research that when the value of M is 8 and the value of N is 3, the performance of the visual feature network can be ensured while consuming less computing resources. However, in embodiments of the present invention, the values of M and N are not limited thereto, and other values of M and N can likewise be used.
By selecting multiple shot segments from the sample segment and multiple frames of images from each shot segment, a relatively comprehensive view of the entire sample segment is obtained, which helps to understand the visual information of the entire sample segment more comprehensively, effectively, and accurately.
204: Using the visual feature network, for each of the M shot segments, extract the visual features of the N frames of images selected from that shot segment.
206: Obtain the type label of the sample segment according to the visual features of the N frames of images selected from each of the M shot segments. In the embodiments of the present invention, a type label obtained from visual features is referred to as a video prediction label, to distinguish it from the video type label.
The video prediction label of a sample segment in the embodiments of the present invention is the predicted type of the sample segment, for example, horror, comedy, music, crime, and the like.
In a specific example, operation 206 can be implemented as follows:
for each shot segment, determine the type labels of the selected N frames of images based on their respective visual features;
determine the type label of each shot segment based on the type labels of the N frames of images selected from it;
determine the video prediction label of the sample segment based on the type labels of the selected M shot segments.
208: Train the visual feature network according to the video prediction label of the sample segment and the video type label of the sample video corresponding to the sample segment.
Illustratively, the above operations 202 to 208 may be performed iteratively, and the visual feature network is trained through this iterative process until a preset condition is satisfied, for example, the difference between the video prediction label of the sample segment and the video type label of the corresponding sample video is less than a preset value, or the number of training iterations of the visual feature network reaches a preset number; the training is then complete, and the final visual feature network is obtained.
Based on the training method for a video analysis network provided by the above embodiment of the present invention, a technical solution is proposed for training an initial visual feature network model using a sample segment (for example, a trailer segment) corresponding to a sample video: M shot segments are selected from the sample segment corresponding to the sample video, N frames of images are selected from each of the M shot segments, the visual features of the N frames of images are extracted using the visual feature network for each of the M shot segments, the video prediction label of the sample segment is obtained according to the visual features of the N frames of images selected from each of the M shot segments, and the visual feature network is then trained according to the video prediction label of the sample segment and the video type label of the sample video. Since the sample segment is edited from the sample video and contains representative pictures from the sample video, training the visual feature network on the sample segment greatly reduces the amount of computation, saves computing and storage resources, and improves training efficiency; and the trained visual feature network can be effectively used for analyzing the sample video. By selecting at least one shot segment from the sample segment and at least one frame of image from each shot segment, and performing network training through sparse sampling in units of shot segments, the amount of computation is further reduced, computing and storage resources are saved, and training efficiency is improved.
Based on the above training method embodiments, the visual feature network is trained in a weakly supervised manner, learning the visual features of sample videos such as films from sample segments (for example, trailer segments), so that the visual feature network can be effectively used for analyzing the sample videos. Here, "weakly supervised" means that the training of the visual feature network is achieved only through the video type label of the sample video, without requiring a type label for each frame of image in the sample video or the sample segment; this reduces the annotation and computation workload and improves the training efficiency of the visual feature network.
Fig. 3 is a schematic diagram of training the visual feature network in an embodiment of the present invention. As shown in Fig. 3, in this embodiment the sample video is specifically a film, and the sample segment of the sample video is specifically the trailer of the film. When training the visual feature network, 8 shot segments are randomly selected from the trailer and 3 frames of images are randomly selected from each shot segment; the selected frames of images are input into the visual feature network (visual model), which extracts the visual features of these frames; a classifier classifies each frame of image according to its visual feature to obtain the type label of each frame; the type label of each shot segment is determined from the type labels of the N frames of images selected from it, for example, by average pooling or max pooling; finally, the type labels of the different shot segments are fused, for example, by average pooling or max pooling, to obtain the video prediction label of the trailer, such as the type of the film (for example, horror, comedy, and the like). The video type label of the film can be obtained from film websites (for example, IMDb), so that the visual feature network can be trained based on the video prediction label of the trailer and the video type label of the corresponding film: the network parameters of the visual feature network are adjusted, and supervised learning of the visual feature network is performed through the video type label.
Owing to the particularity of a film trailer, each frame of image in it contains representative pictures edited from the film by the producer; training the visual feature network on the trailer, a short segment of the film, achieves the intended training effect on the visual feature network, while saving computing and storage resources and improving training efficiency compared with training on the entire film as a training sample.
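The frame-to-shot and shot-to-trailer label fusion in Fig. 3 reduces to two pooling steps. Below is a sketch under the assumptions of M = 8 shots, N = 3 frames per shot, 20 genre classes, and max pooling at both levels (the embodiment allows average pooling as well):

```python
import torch

def predict_video_label(frame_logits: torch.Tensor, pool: str = "max") -> int:
    """frame_logits: (M, N, C) class scores for N frames in each of M shots.
    Returns the index of the fused video-level (trailer) prediction."""
    reduce = torch.amax if pool == "max" else torch.mean
    shot_scores = reduce(frame_logits, dim=1)   # fuse frames within a shot
    video_scores = reduce(shot_scores, dim=0)   # fuse shots within the trailer
    return video_scores.argmax().item()         # predicted genre index

genre = predict_video_label(torch.randn(8, 3, 20))  # 8 shots, 3 frames, 20 genres
```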
In addition, the video analysis network of embodiments of the present invention may further include a temporal structure network, in which case the training process of the video analysis network in embodiments of the present invention may include two stages: in the first stage, the visual feature network is trained until a preset condition is satisfied, completing the first stage; after the training of the visual feature network satisfies the preset condition in the first stage, the second stage begins, in which the recurrent neural network is trained. The embodiments shown in Fig. 1 and Fig. 2 above constitute the first stage of the training process; after the first stage is completed, the second stage is carried out.
In a specific example of the embodiment of the present invention, the sequential organization network is specifically realized by a recurrent neural network; a recurrent neural network can capture and memorize sequential organization features and is suitable for processing data with a temporal order (such as video, voice, etc.). The recurrent neural network may, for example, be a long short-term memory model (LSTM). The LSTM stores the short-term memory of sequential organization features and enhances the memory capability for the long-range past, so that the sequential organization features of large-capacity video can be acquired with a limited number of neurons.
Fig. 4 is the flow chart of another embodiment of the training method of the video analysis network of the present invention. This embodiment further comprises the training process of the second stage, carried out after the embodiment shown in Fig. 1 or Fig. 2. As shown in Fig. 4, compared with the above training method embodiments of the present invention, the embodiment of the present invention further includes:
302, feature extraction is carried out on a given video using the visual signature network, obtaining the visual signatures of a plurality of camera lens segments.

Wherein, the given video includes a plurality of continuous camera lens segments.
304, using the sequential organization network, the sequential organization feature of the given video is learned based on the visual signatures and the sequential relationship of the above continuous camera lens segments.

The sequential organization feature is used to represent the temporal feature information between segments and frame images in a video; according to the sequential organization feature, the order of segments and frame images in the video may be determined.
In a specific example of the training method embodiment shown in Fig. 4, suppose the given video specifically includes P camera lens segments, where P is an integer greater than 0. Operation 302 may specifically include:

choosing Q frame images from each camera lens segment in the given video respectively, where Q is an integer greater than 0;

using the visual signature network, extracting, for each camera lens segment, the visual signatures of the Q frame images in that segment as the visual signature of the segment;

inputting the visual signatures of the P camera lens segments into the sequential organization network in turn, according to the order of the P camera lens segments in the given video. A sketch of this procedure follows.
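The following is a minimal sketch of this procedure under assumed shapes (the 1024-dimensional frame signature and 256-dimensional hidden state are taken from the worked example later in the text; the mean over the Q frames is an assumed way of merging frame signatures into one shot signature):

```python
import torch
import torch.nn as nn

P, Q, feat_dim, hidden = 8, 3, 1024, 256
frame_feats = torch.randn(P, Q, feat_dim)   # from the trained visual network
shot_feats = frame_feats.mean(dim=1)        # assumed pooling of frame signatures

lstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden, batch_first=True)
outputs, _ = lstm(shot_feats.unsqueeze(0))  # shots enter strictly in video order
next_shot_feat = outputs[0, -1]             # 256-d prediction for segment P+1
```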
In addition, in another specific example of the training method embodiment shown in Fig. 4, operation 304 may specifically include:

predicting the adjacent next camera lens segment of the given video based on the visual signatures and the sequential relationship of the continuous camera lens segments in the given video;

training the sequential organization network based on the prediction accuracy for the next camera lens segment, to obtain the final sequential organization network.
Illustratively, the operations of the embodiments shown in Fig. 4 may be performed iteratively; the sequential organization network is trained through this iterative process until a preset condition is met, for example the probability of correctly predicting the next camera lens segment reaches a predetermined value, or the number of training iterations of the sequential organization network reaches a preset number, whereupon the training of the sequential organization network is completed.
Fig. 5 is the flow chart of a further embodiment of the training method of the video analysis network of the present invention. As shown in Fig. 5, compared with the embodiment shown in Fig. 4, in this embodiment the training process of the second stage includes:
402, choosing Q frame images from each camera lens segment in the given video respectively.

Wherein, the given video specifically includes P camera lens segments, and the values of P and Q are each integers greater than 0.

404, using the visual signature network, extracting, for each camera lens segment in the given video, the visual signatures of the Q frame images in that segment as the visual signature of the segment.

406, inputting the visual signature of each camera lens segment in the given video into the sequential organization network in turn, according to the order of the camera lens segments in the given video.

408, the sequential organization network generates the sequential organization feature of the P camera lens segments in the given video according to the visual signatures and the sequential relationship of the camera lens segments in the given video.

410, the sequential organization network obtains the visual signature of the adjacent next camera lens segment of the P camera lens segments according to the sequential organization feature of the P camera lens segments.

Afterwards, operation 412 is performed.
402', choosing Q frame images from each option camera lens segment in L option camera lens segments respectively.

Wherein, L is an integer greater than 1.

The embodiment of the present invention is used to predict the adjacent next camera lens segment of the given video; the option camera lens segments include at least one correct next camera lens segment, the rest being wrong next camera lens segments. The number of correct next camera lens segments is preset and is consistent with the number of correct next camera lens segments finally determined. In one specific example, the option camera lens segments include one correct next camera lens segment, and the remaining L-1 are wrong next camera lens segments.
404', using the visual signature network, extracting, for each option camera lens segment in the L option camera lens segments, the visual signatures of the Q frame images as the visual signature of that option camera lens segment.

Wherein, operations 402'-404' and operations 402-410 are two flows performed in parallel; there is no restriction on the execution order between them.
412, using a convolutional neural network, choosing the adjacent next camera lens segment of the above P camera lens segments from the above L option camera lens segments, according to the visual signature of the next camera lens segment obtained in operation 410 and the visual signatures of the L option camera lens segments.

414, training the sequential organization network based on the prediction accuracy for the next camera lens segment.
Illustratively, the above operations 408-414 (not including 402'-404') or 402-414 (including 402'-404') may be performed iteratively; the sequential organization network is trained through this iterative process until a preset condition is met, for example the prediction accuracy for the next camera lens segment reaches a predetermined value, or the number of training iterations of the sequential organization network reaches a preset number, whereupon the training is completed. A hedged sketch of one such training step follows.
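The following sketch shows one second-stage training step of Fig. 5. Here `temporal_net` is an LSTM as above and `scorer` stands for the convolutional network of operation 412; both names, and the use of cross-entropy over the L options, are illustrative assumptions about how "prediction accuracy" supervises the training.

```python
import torch
import torch.nn.functional as F

def second_stage_step(temporal_net, scorer, shot_feats, option_feats,
                      correct_idx, optimizer):
    """shot_feats: (P, 1024) ordered shots; option_feats: (L, 1024) options."""
    out, _ = temporal_net(shot_feats.unsqueeze(0))       # operations 406-408
    pred = out[0, -1]                                    # operation 410: next shot
    scores = scorer(pred, option_feats)                  # operation 412: (L,) scores
    loss = F.cross_entropy(scores.unsqueeze(0),          # operation 414: train on
                           torch.tensor([correct_idx]))  # prediction correctness
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return scores.argmax().item() == correct_idx         # was the prediction right?
```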
Based on the embodiment of the present invention, the sequential organization is learned from sample videos such as films in an unsupervised manner, where unsupervised means that no information is labeled on the sample video; the sequential organization network can learn the sequential organization of sample videos such as films, i.e. the sequential organization features between different segments of the sample video, in the absence of labels, thereby reducing the labeling and calculation workload and improving the training efficiency of the sequential organization network. After the training of the sequential organization network meets the preset condition, the sequential organization features of a long video can be accurately acquired, and further analysis applications of long videos can be based on them, for example predicting the next camera lens segment from the first several camera lens segments of a film, retrieving films that match a video description field, generating the video description of a film from the film, and so on.
In a specific example of the training method embodiment shown in Fig. 5, operation 412 may specifically include:

obtaining, according to the visual signature of the above next camera lens segment and the visual signature of each option camera lens segment in the L option camera lens segments, the probability score of each option camera lens segment in the L option camera lens segments as the next camera lens segment;

choosing the option camera lens segment(s) with the highest probability scores among the L option camera lens segments as the adjacent next camera lens segment of the above P camera lens segments.
In addition, after the probability score of each option camera lens segment in the L option camera lens segments as the next camera lens segment is obtained, the probability scores of the L option camera lens segments as the adjacent next camera lens segment may also optionally be normalized, obtaining the normalized probability score of each of the L option camera lens segments as the adjacent next camera lens segment. Correspondingly, when choosing the adjacent next camera lens segment of the above P camera lens segments, the option camera lens segment(s) with the highest normalized probability scores are chosen from the normalized probability scores of the above L option camera lens segments as the adjacent next camera lens segment of the above P camera lens segments. The number of option camera lens segments chosen can be determined according to a preset condition; for example, by default the option camera lens segment with the highest normalized probability score is chosen as the adjacent next camera lens segment of the above P camera lens segments, or any number of option camera lens segments whose normalized probability scores are higher than a preset score are chosen as the adjacent next camera lens segment of the above P camera lens segments.
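One concrete realization of this normalization, given only as an assumption since the text does not fix a formula, is a softmax over the L probability scores:

```python
import torch

scores = torch.randn(32)                    # raw scores for L = 32 option shots
norm_scores = torch.softmax(scores, dim=0)  # normalized probability scores
chosen = norm_scores.argmax()               # adjacent next camera lens segment
```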
Specifically, in the above embodiments, obtaining, according to the visual signature of the above next camera lens segment and the visual signature of each option camera lens segment in the L option camera lens segments, the probability score of each option camera lens segment as the next camera lens segment may specifically include:

replicating the visual signature of the next camera lens segment obtained by operation 410 to obtain L copies of the visual signature of the next camera lens segment, and splicing the L copies with the visual signatures of the above L option camera lens segments according to a preset format, obtaining a feature matrix;

using a convolutional neural network, obtaining, based on this feature matrix, the probability score of each of the L option camera lens segments as the adjacent next camera lens segment.
For example, using the convolutional neural network, the similarity between the visual signature of the option camera lens segment in each row of the feature matrix and the visual signature of the next camera lens segment can be obtained, and the probability scores of the L option camera lens segments as the adjacent next camera lens segment are obtained according to the similarities; the higher the similarity between an option camera lens segment's visual signature and the visual signature of the next camera lens segment, the higher its probability score.
Fig. 6 is a schematic diagram of training the recurrent neural network in an embodiment of the present invention. In this embodiment the sequential organization network is specifically a recurrent neural network, which can capture and memorize temporal features and is comparatively suitable for processing temporal data (such as video, voice, etc.). As shown in Fig. 6, for a given section of continuous video (i.e. the given video in the embodiment of the present invention), feature extraction is carried out on the given video and the option camera lens segments by the visual signature network respectively, obtaining visual signatures; the recurrent neural network then outputs the visual signature of the adjacent next camera lens segment of the given video, which is merged with the option camera lens segments (for example, by vector concatenation) and passed through a convolutional neural network; the convolutional neural network outputs the probability score of each option camera lens segment as the correct adjacent next camera lens segment of the given video, and the option camera lens segment with the highest probability score is the predicted adjacent next camera lens segment of the given video.
The embodiment of the present invention is illustrated below with a specific example. Suppose the first n camera lens segments are given (i.e. the given video includes n continuous camera lens segments) and the (n+1)-th camera lens segment is to be predicted; the specific flow is as follows:
first, for the above first n camera lens segments, 3 frame images are chosen from each camera lens segment and input into the trained visual signature network (visual model), obtaining the visual signatures of the 3 frame images chosen from each of the first n camera lens segments; the visual signature of each frame image is assumed to be a 1024-dimensional feature vector;
the visual signatures of the frame images in the given video are input into the recurrent neural network (such as an LSTM), which outputs the visual signature of the adjacent next camera lens segment of the above first n camera lens segments (i.e. the (n+1)-th camera lens segment), assumed to be a 256-dimensional feature vector;
suppose there are 32 option camera lens segments in total, one of which is the correct next camera lens segment and the remaining 31 are wrong next camera lens segments; the 32 option camera lens segments are input into the trained visual signature network (visual model), obtaining one visual signature for each option camera lens segment, assumed to be a 1024-dimensional feature vector, which yields a 32 × 1024 first matrix;
the above 256-dimensional feature vector is replicated 32 times, obtaining a 32 × 256 second matrix, which is spliced with the 32 × 1024 first matrix according to a preset format, for example with the second matrix in front and the first matrix behind, obtaining a 32 × 1280 feature matrix;
the above feature matrix is input into a three-layer convolutional neural network, which successively reduces the dimension of the 32 × 1280 feature matrix, for example to 32 × 256, then 32 × 64, and finally a 32 × 1 feature vector; each number in the 32 × 1 feature vector represents the probability score of its corresponding option camera lens segment as the next camera lens segment, and the one with the highest probability score is the predicted next camera lens segment. In addition, the convolutional neural network may also have one layer or another number of layers: a one-layer convolutional neural network can reduce the 32 × 1280 feature matrix directly to a 32 × 1 feature vector, while a convolutional neural network with another number of layers reduces the 32 × 1280 feature matrix to a 32 × 1 feature vector step by step. A minimal sketch of this worked example follows.
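The sketch below follows the worked example: 32 options with 1024-dimensional visual signatures, the 256-dimensional LSTM output replicated 32 times, concatenation into a 32 × 1280 feature matrix, and a three-layer CNN reducing 1280 -> 256 -> 64 -> 1. Realising the layers as 1×1 one-dimensional convolutions, which score each option row independently, is an assumption.

```python
import torch
import torch.nn as nn

pred = torch.randn(256)                     # LSTM output for segment n+1
options = torch.randn(32, 1024)             # signatures of the 32 option shots
second = pred.expand(32, 256)               # replicate 32 times: second matrix
feat = torch.cat([second, options], dim=1)  # 32 x 1280, second matrix in front

scorer = nn.Sequential(
    nn.Conv1d(1280, 256, 1), nn.ReLU(),     # 32 x 1280 -> 32 x 256
    nn.Conv1d(256, 64, 1), nn.ReLU(),       # -> 32 x 64
    nn.Conv1d(64, 1, 1))                    # -> 32 x 1 probability scores
scores = scorer(feat.t().unsqueeze(0))      # shape (1, 1, 32)
pred_idx = scores.flatten().argmax()        # highest score = predicted next shot
```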
Depending on whether this prediction result is correct, the recurrent neural network can be trained and its network parameters adjusted; the above flow of this embodiment is then re-executed until the preset condition is met and the training of the recurrent neural network is completed.

Fig. 7 is the flow chart of one embodiment of the video analysis method of the present invention. As shown in Fig. 7, the video analysis method of this embodiment includes:
502, choosing at least one video clip from a video.

Wherein, the video clip may specifically include one or more camera lens segments.

504, using the visual signature network, the visual signatures of the above at least one video clip are obtained respectively.

506, the above video is analyzed according to the visual signatures from the above at least one video clip.
Based on the video analysis method provided by the above embodiment of the present invention, video clips are chosen from a video, the visual signature network is used to extract the visual signatures of the video clips, and the long video is analyzed according to those visual signatures. The embodiment of the present invention thus realizes the analysis of a video based on segments in the video, without analyzing the entire video, which reduces the calculation amount of video analysis, saves computing resources and storage resources, and improves analysis efficiency, making the rapid analysis of long videos such as films possible.
Fig. 8 is the flow chart of another embodiment of the video analysis method of the present invention. As shown in Fig. 8, the video analysis method of this embodiment includes:
602, choosing X camera lens segments from the video.

Illustratively, the X camera lens segments can be chosen from the video based on the feature similarity between adjacent frame images in the video and a preset condition. For example, in a particular application, the features of adjacent frame images can be compared to calculate a feature similarity score between adjacent frame images, while a preset condition set with the whole video in mind is applied, such as that the number of frame pictures of each camera lens segment is no less than a first preset value and no more than a second preset value; the camera lens segments in the video are thereby identified and distinguished, and X camera lens segments are then chosen from the video.

In a specific example, the X camera lens segments may be chosen from the video randomly, or in a predetermined manner, for example choosing one camera lens segment out of every one or more camera lens segments. A sketch of similarity-based camera lens segmentation follows.
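The following is a hedged sketch of the segmentation described for operation 602: adjacent-frame feature similarity combined with minimum/maximum segment-length preset conditions. The cosine similarity measure, the threshold, and the length bounds are all assumptions; the text only requires a similarity score and length constraints.

```python
import torch
import torch.nn.functional as F

def detect_shots(frame_feats, sim_thresh=0.8, min_len=8, max_len=300):
    """frame_feats: (T, D) per-frame features; returns list of (start, end)."""
    shots, start = [], 0
    for t in range(1, frame_feats.shape[0]):
        sim = F.cosine_similarity(frame_feats[t - 1], frame_feats[t], dim=0)
        if (sim < sim_thresh and t - start >= min_len) or t - start >= max_len:
            shots.append((start, t))        # low similarity => segment boundary
            start = t
    shots.append((start, frame_feats.shape[0]))
    return shots
```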
604, for each camera lens segment in the above X camera lens segments, choosing Y frame images from that camera lens segment.

In a specific example, the Y frame images may be chosen from a camera lens segment randomly, or in a predetermined manner, for example choosing one frame image out of every one or more frame images.
Wherein, X and Y are each integers greater than 0; in one specific example of the embodiment of the present invention, the values of X and Y are each integers greater than or equal to 2, for example X is 8 and Y is 3. The larger the values of X and Y, the richer the acquired visual signatures and the more accurate the video prediction label of the video can be, but the more calculation is required and the more computing resources are consumed. The inventors discovered through research that when X is 8 and Y is 3, the visual signature of the video can be effectively obtained while consuming less computing resources. However, in the embodiment of the present invention, the values of X and Y are not limited to this, and other values of X and Y can equally be used.
When a plurality of camera lens segments are chosen from a video and a plurality of frame images are chosen from each camera lens segment, the entire video is observed from a comparatively comprehensive viewing angle, which helps to understand the visual information and temporal structure information of the entire video more comprehensively, effectively, and accurately.
606, using the visual signature network, for each camera lens segment in the above X camera lens segments, the visual signatures of the Y frame images are extracted from that camera lens segment respectively.

608, the video is analyzed according to the visual signatures of the Y frame images chosen from each of the above X camera lens segments.
Illustratively, operation 608 can specifically be realized by a convolutional neural network.
In one specific example, operation 608 can specifically be realized in the following way:

for each camera lens segment in the above X camera lens segments, determining the type labels of the Y frame images based on the visual signatures of the extracted Y frame images, and determining the type label of the camera lens segment where the Y frame images are located based on the type labels of the Y frame images, for example by average pooling or max pooling;

determining the tag type of the video as the video prediction label of the video based on the type labels of the above X camera lens segments, for example merging the type labels of the above X camera lens segments by average pooling or max pooling to obtain the video prediction label of the video. A short sketch of this pooling follows.
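The following short sketch illustrates this pooling under assumed shapes (X = 8, Y = 3, and 21 type classes are illustrative); max pooling is shown, and average pooling works the same way with mean.

```python
import torch

frame_scores = torch.randn(8, 3, 21)          # per-frame type scores
shot_scores = frame_scores.max(dim=1).values  # pool frames -> segment labels
video_scores = shot_scores.max(dim=0).values  # pool segments -> video label
video_prediction_label = video_scores.argmax()
```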
The tag types in the embodiment of the present invention are the types of frame images, camera lens segments, and videos, such as horror, comedy, musical, crime, etc.
Based on the video analysis method provided by the above embodiment of the present invention, at least one camera lens segment is chosen from a video, at least one frame image is chosen from each camera lens segment, the visual signature network is used to extract the visual signature of each frame image, and the video prediction label of the video is obtained from the visual signatures of the frame images, thereby realizing the analysis of the video. The embodiment of the present invention chooses at least one camera lens segment from the video and at least one frame image from each camera lens segment, analyzing the video by sparse sampling and with the camera lens segment as the unit, without analyzing the entire video; this reduces the calculation amount of video analysis, saves computing resources and storage resources, and improves analysis efficiency, making the rapid analysis of long videos such as films possible and thus applicable to higher-level semantic tasks, for example the search of video clips. In further embodiments of each of the above video analysis methods of the present invention, after the video prediction label of the above video is obtained, the method may also include: labeling the video with the video prediction label.

Based on this embodiment, the analysis and classification of videos can be realized.
Fig. 9 is the flow chart of one application embodiment of the video analysis method of the present invention. As shown in Fig. 9, this embodiment includes:

702, in response to receiving a search request, continuous X camera lens segments are chosen from the video in turn.

Wherein, the search request includes a video description field.

Illustratively, the X camera lens segments can be chosen from the video based on the feature similarity between adjacent frame images in the video and a preset condition.
704, Y frame images are chosen from each camera lens segment in the above X camera lens segments respectively.

Wherein, X and Y are each integers greater than 2; in one specific example of the embodiment of the present invention, X is 8 and Y is 3.

706, using the visual signature network, for each camera lens segment in the above X camera lens segments, the visual signatures of the Y frame images are extracted from that camera lens segment respectively.

708, the type label of each camera lens segment in the above X camera lens segments is obtained according to the visual signatures of the Y frame images chosen from each of the above X camera lens segments.

Illustratively, operation 708 can specifically be realized by a convolutional neural network.

710, the camera lens segments in the X camera lens segments whose type labels match the video description field are output.
In one specific example, operation 710 includes:

in response to obtaining the type label of any camera lens segment in the X camera lens segments, comparing whether the type label of that camera lens segment matches the video description field;

outputting the camera lens segments in the X camera lens segments whose type labels match the video description field.

Based on the above example, real-time output of the camera lens segments whose type labels match the video description field can be realized, i.e. as soon as one camera lens segment that meets the video description field in the search request is obtained, that camera lens segment is output, without waiting for all the camera lens segments chosen from the entire video to be analyzed before outputting the camera lens segments that meet the video description field in the search request. A minimal sketch of this streaming behavior follows.
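The following minimal sketch shows the real-time variant: each camera lens segment is emitted as soon as its type label matches, with no global wait. The callable `classify`, standing for the visual-network-plus-classifier pipeline, is an assumption for illustration.

```python
def stream_matching_shots(shots, classify, description_field):
    for shot in shots:                  # segments arrive in video order
        label = classify(shot)          # type label of this segment
        if label == description_field:  # e.g. both are "horror"
            yield shot                  # output immediately
```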
In another specific example, operation 710 includes:

in response to obtaining the type labels of all camera lens segments in the video, comparing whether the type label of each camera lens segment matches the video description field;

outputting the camera lens segments among all camera lens segments whose type labels match the video description field.

Based on the above example, after all the camera lens segments chosen from the entire video have been analyzed, the camera lens segments that meet the video description field in the search request are output uniformly.

Based on the above embodiments, the search for relevant clips in a video is realized, and camera lens segments consistent with the video description field can be retrieved from the video.
Figure 10 is the flow chart of another application embodiment of the video analysis method of the present invention. As shown in Figure 10, the video analysis method of this embodiment includes:

802, X camera lens segments are chosen from the video.

Illustratively, the X camera lens segments can be chosen from the video based on the feature similarity between adjacent frame images in the video and a preset condition.

804, for each camera lens segment in the above X camera lens segments, Y frame images are chosen from that camera lens segment.

In a specific example, the Y frame images may be chosen from a camera lens segment randomly, or in a predetermined manner, for example choosing one frame image out of every one or more frame images.

Wherein, X and Y are each integers greater than 2; in one specific example of the embodiment of the present invention, X is 8 and Y is 3. When a plurality of camera lens segments are chosen from the video and a plurality of frame images are chosen from each camera lens segment, the entire video is observed from a comparatively comprehensive viewing angle, which helps to understand the visual information and temporal structure information of the entire video more comprehensively, effectively, and accurately.
806, using the visual signature network, for each camera lens segment in the above X camera lens segments, the visual signatures of the Y frame images are extracted from that camera lens segment respectively.

808, the type labels of the X camera lens segments are obtained according to the visual signatures of the Y frame images chosen from each of the above X camera lens segments.

Illustratively, operation 808 can specifically be realized by a convolutional neural network.

810, using the recurrent neural network, the sequential organization feature of the above X camera lens segments is learned.

812, using the recurrent neural network, the video description of the above video is generated according to the visual signatures and sequential organization features of the above X camera lens segments.

Based on this embodiment, a video description can be generated directly for a video, so that users can learn the relevant information of the video. A heavily hedged sketch follows.
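The following is a heavily hedged sketch of operations 810-812: an LSTM encodes the shot features into a temporal structure summary, and an assumed greedy word decoder emits the description. The text does not specify any decoder; the whole decoder design here is illustrative only.

```python
import torch
import torch.nn as nn

class DescriptionGenerator(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256, vocab=10000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTMCell(hidden, hidden)
        self.word_out = nn.Linear(hidden, vocab)

    def forward(self, shot_feats, max_words=20):   # shot_feats: (X, feat_dim)
        _, (h, c) = self.encoder(shot_feats.unsqueeze(0))
        h, c = h[0], c[0]                   # sequential organization summary
        words, inp = [], torch.zeros_like(h)
        for _ in range(max_words):
            h, c = self.decoder(inp, (h, c))
            words.append(self.word_out(h).argmax(dim=-1))  # greedy word id
            inp = h                         # feed the hidden state forward
        return torch.stack(words, dim=1)    # (1, max_words) word ids
```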
Figure 11 is the flow chart of yet another application embodiment of the video analysis method of the present invention. As shown in Figure 11, the video analysis method of this embodiment includes:

902, in response to receiving a search request, a video is chosen, for example one video at a time from a video repository. Wherein, the search request includes a video description field.

904, continuous X camera lens segments are chosen from the video in turn. Illustratively, the X camera lens segments can be chosen from the video based on the feature similarity between adjacent frame images in the video and a preset condition.

906, for each camera lens segment in the above X camera lens segments, Y frame images are chosen from that camera lens segment.

In a specific example, the Y frame images may be chosen from a camera lens segment randomly, or in a predetermined manner, for example choosing one frame image out of every one or more frame images.

Wherein, X and Y are each integers greater than 2; in one specific example of the embodiment of the present invention, X is 9 and Y is 3. When a plurality of camera lens segments are chosen from the video and a plurality of frame images are chosen from each camera lens segment, the entire video is observed from a comparatively comprehensive viewing angle, which helps to understand the visual information and temporal structure information of the entire video more comprehensively, effectively, and accurately.
908, using the visual signature network, for each camera lens segment in the above X camera lens segments, the visual signatures of the Y frame images are extracted from that camera lens segment respectively.

910, the type label of each camera lens segment in the above X camera lens segments is obtained according to the visual signatures of the Y frame images chosen from each of the above X camera lens segments.

Illustratively, operation 910 can specifically be realized by a convolutional neural network.

Afterwards, operation 904 is performed again until the type labels of the camera lens segments in the video have been obtained, whereupon operation 912 is performed.

912, using the recurrent neural network, the sequential organization feature of all camera lens segments in the video is obtained.

914, using the recurrent neural network, the video description of the video is generated according to the type labels and sequential organization features of all camera lens segments in the video.

916, the video description of the video is compared with the video description field in the search request to see whether they match.

918, in response to the video description of the video matching the video description field in the search request, the video is output. Otherwise, if the video description of the video does not match the video description field in the search request, the operation of outputting the video is not performed.

Based on this embodiment, the retrieval of videos based on a textual description (i.e. a video description field) is realized, so that users can retrieve videos with corresponding content from a video repository (such as film websites). A minimal sketch of this retrieval loop follows.
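The following minimal sketch walks the retrieval loop of Fig. 11; the substring test standing in for the match of operation 916 is a deliberate simplification, and `generate_description` (operations 904-914) is an assumed callable.

```python
def retrieve_videos(video_store, description_field, generate_description):
    matches = []
    for video in video_store:                        # operation 902
        description = generate_description(video)    # operations 904-914
        if description_field in description:         # operation 916
            matches.append(video)                    # operation 918: output
    return matches
```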
One of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments can be completed by hardware related to program instructions; the aforementioned program can be stored in a computer-readable storage medium, and when the program is executed, the steps of the above method embodiments are performed. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
Figure 12 is the structure diagram of one embodiment of the training device of the video analysis network of the present invention. The training devices of the embodiments of the present invention can be used to realize the training methods of the above embodiments of the present invention. As shown in Figure 12, in this embodiment the video analysis network includes a visual signature network 10, and the training device of the video analysis network includes a sort module 1002 and a first training module 1004. Wherein:

the visual signature network 10 is used for obtaining, for any sample segment in the at least one sample segment corresponding to at least one sample video, the visual signature of that sample segment, where the sample video is labeled with a video type label.

In one specific example of the embodiment of the present invention, the sample segment is specifically the advance notice segment of the video corresponding to the sample video, or a video clip edited out of the sample video. In a video, an advance notice segment, or a video clip, whenever a screen switch occurs, one camera lens segment switches to another camera lens segment.

The sort module 1002 is used for obtaining the video prediction label of the sample segment according to the visual signature obtained by the visual signature network 10.

The first training module 1004 is used for training the visual signature network 10 according to the video prediction label of the sample segment and the video type label of the sample video corresponding to the sample segment.
The training device of the video analysis network provided by the above embodiment of the present invention proposes a technical solution of training the visual signature network using the sample segment corresponding to a sample video (such as the advance notice segment of the sample video): a sample segment chosen from the sample video has its visual signature extracted, the video prediction label of the sample segment is obtained according to the visual signature, and the visual signature network is then trained according to the video prediction label of the sample segment and the video type label of the sample video. Since the sample segment is chosen from the sample video and contains pictures from the sample video, the visual signature network is trained based on a segment of the sample video rather than on the sample video itself, which greatly reduces the calculation amount, saves computing resources and storage resources, and improves training efficiency, and the trained visual signature network can be effectively used for the analysis of sample videos.
Figure 13 is the structure diagram of another embodiment of the training device of the video analysis network of the present invention. As shown in Figure 13, compared with the embodiment shown in Figure 12, the training device of this embodiment further includes:

a first choosing module 1006, for choosing, for any sample segment in the at least one sample segment corresponding to at least one sample video, M camera lens segments from that sample segment. Wherein, the sample segment includes at least one camera lens segment of the sample video, and each camera lens segment includes at least one frame image.

a second choosing module 1008, for choosing N frame images from each camera lens segment in the above M camera lens segments respectively. Wherein, M and N are each integers greater than 0.

Correspondingly, in this embodiment, the visual signature network 10 is specifically used for extracting, for each camera lens segment in the above M camera lens segments, the visual signatures of the N frame images chosen from that camera lens segment. The sort module 1002 is specifically used for obtaining the video prediction label of the sample segment according to the visual signatures of the N frame images chosen from each of the above M camera lens segments.
In one specific example, the sort module 1002 is specifically used for:

for each camera lens segment, determining the type labels of the N chosen frame images based on their visual signatures;

determining the type label of each camera lens segment based on the type labels of the N frame images chosen from that camera lens segment;

determining the tag type of the sample segment as the video prediction label based on the type labels of the above M camera lens segments.
The training device of the video analysis network provided by the above embodiment of the present invention proposes a technical solution of training the initial visual signature network model using the sample segment (such as the advance notice segment) corresponding to a sample video: M camera lens segments are chosen from the sample segment corresponding to the sample video, N frame images are chosen from each camera lens segment in the M camera lens segments, the visual signature network extracts the visual signatures of the N frame images for each camera lens segment in the M camera lens segments, the video prediction label of the sample segment is obtained according to the visual signatures of the N frame images chosen from each of the M camera lens segments, and the visual signature network is then trained according to the video prediction label of the sample segment and the video type label of the sample video. Since the sample segment is edited out of the sample video and contains representative pictures from the sample video, training the visual signature network based on the sample segment greatly reduces the calculation amount, saves computing resources and storage resources, and improves training efficiency, and the trained visual signature network can be effectively used for the analysis of sample videos; at least one camera lens segment is chosen from the sample segment and at least one frame image is chosen from each camera lens segment, so network training is carried out by sparse sampling and with the camera lens segment as the unit, which further reduces the calculation amount, saves computing resources and storage resources, and improves training efficiency.
Based on the above training method embodiments, the visual signature network is trained in a weakly supervised manner, learning the visual signatures of sample videos such as films from sample segments (for example, advance notice segments), so that the visual signature network can be effectively used for the analysis of sample videos. Weakly supervised here means that the training of the visual signature network is realized only through the video type label of the sample video, without labeling each frame image in the sample video or sample segment with a video type label, which reduces the labeling and calculation workload and improves the training efficiency of the visual signature network.
In addition, in another embodiment of the training device of the video analysis network of the present invention, the visual signature network 10 can also be used for carrying out feature extraction on a given video after its training meets the preset condition, obtaining the visual signatures of a plurality of camera lens segments, where the given video includes a plurality of continuous camera lens segments. Referring back to Figure 13, in this embodiment the video analysis network also includes: a sequential organization network 20, for learning the sequential organization feature of the given video based on the visual signatures and sequential relationship of the continuous camera lens segments. In a specific example of each training device embodiment of the present invention, the sequential organization network 20 can be realized by a recurrent neural network.

Suppose the above given video specifically includes P camera lens segments, where P is an integer greater than 0. In one specific example, the second choosing module 1008 may also be used for choosing Q frame images from each camera lens segment in the given video respectively, where Q is an integer greater than 0. When the visual signature network 10 carries out feature extraction on the given video and obtains the visual signatures of the plurality of camera lens segments, it is specifically used for: extracting, for each camera lens segment, the visual signatures of the Q frame images in that segment as the visual signature of the segment; and inputting the visual signatures of the P camera lens segments into the recurrent neural network model in turn, according to the order of the above P camera lens segments in the given video.
Further, in a further embodiment of the training device of the video analysis network of the present invention, the sequential organization network 20 is specifically used for predicting the adjacent next camera lens segment of the given video based on the visual signatures and sequential relationship of the continuous camera lens segments. Referring back to Figure 13, the training device of this embodiment further includes: a second training module 1010, for training the sequential organization network 20 based on the prediction accuracy for the next camera lens segment.

In a further embodiment of the training device, the second choosing module 1008 may also be used for choosing Q frame images from each option camera lens segment in the above L option camera lens segments, where the L option camera lens segments include at least one correct next camera lens segment and L is an integer greater than 1. Correspondingly, the visual signature network 10 may also be used for extracting, for each option camera lens segment in the L option camera lens segments, the visual signatures of the Q frame images as the visual signature of that option camera lens segment. The sequential organization network 20 is specifically used for: generating the sequential organization feature of the P camera lens segments according to their visual signatures and sequential relationship; and obtaining the visual signature of the adjacent next camera lens segment of the P camera lens segments according to the sequential organization feature of the P camera lens segments. Referring back to Figure 13, the training device of this embodiment can also include: a convolutional neural network 1012, for choosing the adjacent next camera lens segment of the P camera lens segments from the L option camera lens segments according to the visual signature of the above next camera lens segment and the visual signatures of the L option camera lens segments.
In one optional example, the convolutional neural network 1012 is specifically used for: obtaining, according to the visual signature of the next camera lens segment and the visual signature of each option camera lens segment in the L option camera lens segments, the probability score of each option camera lens segment as the next camera lens segment; and choosing the option camera lens segment(s) with the highest probability scores among the L option camera lens segments as the adjacent next camera lens segment of the P camera lens segments.

Further, referring back to Figure 13, the training device of the embodiment of the present invention can also include: a normalization module 1014, for normalizing the probability scores of the option camera lens segments in the above L option camera lens segments as the adjacent next camera lens segment, obtaining L normalized probability scores. Correspondingly, when choosing the option camera lens segment(s) with the highest probability scores among the above L option camera lens segments as the adjacent next camera lens segment of the P camera lens segments, the convolutional neural network 1012 is specifically used for choosing, from the above L normalized probability scores, the option camera lens segment(s) corresponding to the highest normalized probability score(s) as the adjacent next camera lens segment of the P camera lens segments. The second training module 1010 trains the sequential organization network 20 specifically based on the prediction accuracy of the next camera lens segment chosen by the convolutional neural network 1012.
Figure 14 is the structure diagram of one embodiment of the video analysis device of the present invention. The video analysis devices of the embodiments of the present invention can be used to realize the video analysis methods of the above embodiments of the present invention. As shown in Figure 14, in this embodiment, the video analysis device includes:

a first choosing module 1006, for choosing at least one video clip from a video;

a visual signature network 10, for obtaining the visual signatures of the above at least one video clip respectively;

a sort module 1002, for analyzing the above video according to the visual signatures from the above at least one video clip.
Based on the video analysis device provided by the above embodiment of the present invention, video clips are chosen from a video, the visual signature network is used to extract the visual signatures of the video clips, and the long video is analyzed according to those visual signatures. The embodiment of the present invention thus realizes the analysis of a video based on segments in the video, without analyzing the entire video, which reduces the calculation amount of video analysis, saves computing resources and storage resources, and improves analysis efficiency, making the rapid analysis of long videos such as films possible.
Figure 15 is the structure diagram of another embodiment of the video analysis device of the present invention. Compared with the embodiment shown in Figure 14, in this embodiment the first choosing module 1006 is specifically used for choosing X camera lens segments from the above video, the above video clips including the X camera lens segments. In one optional example, the first choosing module 1006 is specifically used for choosing the X camera lens segments from the video based on the feature similarity between adjacent frame images in the video and a preset condition. As shown in Figure 15, the video analysis device of this embodiment further includes: a second choosing module 1008, for choosing, for each camera lens segment in the X camera lens segments, Y frame images from that camera lens segment, where X and Y are each integers greater than 0. Correspondingly, in this embodiment, the visual signature network 10 is specifically used for extracting, for each camera lens segment in the X camera lens segments, the visual signatures of the Y frame images from that camera lens segment. The sort module 1002 is specifically used for analyzing the video according to the visual signatures of the Y frame images chosen from each of the X camera lens segments.
In one optional example, the sort module 1002 is specifically used for:

for each camera lens segment in the chosen X camera lens segments: determining the type labels of the Y frame images based on their visual signatures, and determining the type label of the camera lens segment based on the type labels of the Y frame images chosen from that camera lens segment;

determining the tag type of the video as the video prediction label of the video based on the type labels of the above X camera lens segments.
Based on the video analysis method provided by the above embodiment of the present invention, at least one camera lens segment is chosen from a video, at least one frame image is chosen from each camera lens segment, the visual signature network is used to extract the visual signature of each frame image, and the video prediction label of the video is obtained from the visual signatures of the frame images, thereby realizing the analysis of the video. The embodiment of the present invention chooses at least one camera lens segment from the video and at least one frame image from each camera lens segment, analyzing the video by sparse sampling and with the camera lens segment as the unit, without analyzing the entire video; this reduces the calculation amount of video analysis, saves computing resources and storage resources, and improves analysis efficiency, making the rapid analysis of long videos such as films possible and thus applicable to higher-level semantic tasks, for example the search of video clips.
Figure 16 is the structure diagram of yet another embodiment of the video analysis device of the present invention. As shown in Figure 16, compared with the embodiments shown in Figure 14 or Figure 15, the video analysis device of this embodiment further includes: a labeling module 1102, for labeling the above video with the video prediction label output by the sort module 1002.
Figure 17 is the structure diagram of a further embodiment of the video analysis device of the present invention. As shown in Figure 17, compared with the embodiments shown in Figure 14 or Figure 15, the video analysis device of this embodiment further includes: a receiving module 1104 and an output module 1106. Wherein:

the receiving module 1104 is used for receiving a search request, the search request including a video description field;

the first choosing module 1006 is specifically used for choosing continuous X camera lens segments from the video in turn in response to the receiving module 1104 receiving a search request; the sort module 1002 is specifically used for obtaining the type label of each camera lens segment in the above X camera lens segments according to the visual signatures of the Y frame images chosen from each of the X camera lens segments;

the output module 1106 is used for outputting the camera lens segments in the above X camera lens segments whose type labels match the video description field.
Figure 18 is the structure diagram of a still further embodiment of the video analysis device of the present invention. As shown in Figure 18, compared with the embodiments shown in Figure 14 or Figure 15, the video analysis device of this embodiment further includes: a sequential organization network 20 and a generation module 1108. In one optional example, the sequential organization network 20 can be a recurrent neural network.

In one embodiment shown in Figure 18, the sort module 1002 is specifically used for obtaining the type label of each camera lens segment in the X camera lens segments according to the visual signatures of the Y frame images chosen from each of the above X camera lens segments. The sequential organization network 20 is used for obtaining the sequential organization feature of the above X camera lens segments. The generation module 1108 is used for generating the video description of the video according to the type labels and sequential organization features of the above X camera lens segments.

In another embodiment shown in Figure 18, the first choosing module 1006 is specifically used for choosing continuous X camera lens segments from the video in turn. Correspondingly, the sort module 1002 is specifically used for obtaining the type label of each camera lens segment in the X camera lens segments according to the visual signatures of the Y frame images chosen from each of the X camera lens segments. The sequential organization network 20 is used for obtaining the sequential organization feature of all camera lens segments in the video. The generation module 1108 is used for generating the video description of the video according to the type labels and sequential organization features of all camera lens segments in the video.
The embodiment of the present invention additionally provides an electronic device, which can include the training device of the video analysis network or the video analysis device of any of the above embodiments of the present invention.
In addition, the embodiment of the present invention additionally provides another electronic device, including:

a memory, for storing executable instructions; and

a processor, for communicating with the memory to execute the executable instructions, thereby completing the operations of the training method of the video analysis network or the video analysis method of any of the above embodiments of the present invention.
The electronic devices of the above embodiments of the present invention may be, for example, mobile terminals, personal computers (PC), tablet computers, servers, etc.
The embodiment of the present invention additionally provides a computer storage medium for storing computer-readable instructions which, when executed, perform the operations of the training method of the video analysis network or the video analysis method of any of the above embodiments of the present invention.

The embodiment of the present invention additionally provides a computer program, including computer-readable instructions which, when run in a device, cause a processor in the device to execute the executable instructions for implementing the steps in the training method of the video analysis network or the video analysis method of any of the above embodiments of the present invention.
Figure 19 is the structure diagram of one application embodiment of the electronic device of the present invention. Referring to Figure 19, it illustrates the structure of an electronic device suitable for realizing the terminal device or server of the embodiments of the present application. As shown in Figure 19, the electronic device includes one or more processors, a communication unit, etc.; the one or more processors are, for example, one or more central processing units (CPU) 1201 and/or one or more image processors (GPU) 1213, and the processors can perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) 1202 or executable instructions loaded from a storage section 1208 into a random access memory (RAM) 1203. The communication unit 1212 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (Infiniband) network card; the processors can communicate with the read-only memory 1202 and/or random access memory 1203 to execute the executable instructions, connect with the communication unit 1212 through a bus 1204, and communicate with other target devices through the communication unit 1212, so as to complete the operations corresponding to the training method of any video analysis network or the video analysis method provided by the embodiments of the present application, for example: for any sample segment in the at least one sample segment corresponding to at least one sample video, obtaining the visual signature of that sample segment using the visual signature network, the sample video being labeled with a video type label; obtaining the video prediction label of the sample segment according to the visual signature; and training the visual signature network according to the video prediction label of the sample segment and the video type label. Alternatively, for another example: choosing at least one video clip from a video; using the visual signature network, obtaining the visual signatures of the at least one video clip respectively; and analyzing the video according to the visual signatures from the at least one video clip.
In addition, various programs and data needed for the operation of the device can also be stored in the RAM 1203. The CPU 1201, ROM 1202, and RAM 1203 are connected with each other through the bus 1204. Where there is a RAM 1203, the ROM 1202 is an optional module. The RAM 1203 stores executable instructions, or executable instructions are written into the ROM 1202 at runtime, and the executable instructions cause the processor 1201 to perform the operations corresponding to the above training method of the video analysis network or the video analysis method. An input/output (I/O) interface 1205 is also connected to the bus 1204. The communication unit 1212 can be integrally disposed, or can be set to have multiple sub-modules (such as multiple IB network cards) linked on the bus.
The I/O interface 1205 is connected to the following components: an input section 1206 including a keyboard, a mouse, etc.; an output section 1207 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc. and a loudspeaker, etc.; a storage section 1208 including a hard disk, etc.; and a communication section 1209 including a network card such as a LAN card, a modem, etc. The communication section 1209 performs communication processing via a network such as the internet. A driver 1211 is also connected to the I/O interface 1205 as needed. A detachable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is mounted on the driver 1211 as needed, so that the computer program read therefrom can be mounted into the storage section 1208 as needed.
It should be noted that the architecture shown in Figure 19 is only one optional implementation. In specific practice, the number and types of the components in Figure 19 may be selected, reduced, increased, or replaced according to actual needs. Different functional components may be arranged separately or integrally; for example, the GPU and the CPU may be arranged separately, or the GPU may be integrated on the CPU, and the communication unit may be arranged separately, or may be integrated on the CPU or the GPU, and so on. All of these alternative embodiments fall within the protection scope of the present disclosure.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program tangibly embodied on a machine-readable medium; the computer program includes program code for performing the method shown in the flowchart, and the program code may include instructions corresponding to the steps of the training method of the video analysis network or the video analysis method provided by the embodiments of the present application. For example: an instruction for obtaining, for any sample segment in at least one sample segment corresponding to at least one sample video, a visual feature of the sample segment using a visual feature network, the sample video being labeled with a video type label; an instruction for obtaining a video prediction label of the sample segment according to the visual feature; and an instruction for training the visual feature network according to the video prediction label of the sample segment and the video type label. Or, as another example: an instruction for selecting at least one video clip from a video; an instruction for obtaining, using a visual feature network, visual features of the at least one video clip respectively; and an instruction for analyzing the video according to the visual features from the at least one video clip. In such embodiments, the computer program may be downloaded and installed from a network through the communication section 1209 and/or installed from the removable medium 1211. When the computer program is executed by the central processing unit (CPU) 1201, the above-described functions defined in the method of the present application are performed.
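Likewise, the video analysis operation can be sketched as follows. The uniform clip selection, the stand-in networks, and the aggregation by averaging are illustrative assumptions, not steps prescribed by the embodiments; in practice the trained visual feature network and classifier would be loaded rather than freshly constructed.

```python
# Illustrative sketch of the analysis side, under the same assumptions as the
# training sketch above; clip selection and averaging are not prescribed steps.
import torch
import torch.nn as nn

# Stand-in networks, redefined so this example runs on its own.
visual_feature_net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
classifier = nn.Linear(64, 10)

def analyze_video(video_frames, num_clips=4, frames_per_clip=8):
    """Select clips from the video, extract visual features per clip,
    and analyze the video from the aggregated clip predictions."""
    total = video_frames.shape[0]
    stride = max(total // num_clips, frames_per_clip)
    clip_logits = []
    with torch.no_grad():
        for start in range(0, total - frames_per_clip + 1, stride):
            clip = video_frames[start:start + frames_per_clip]  # (frames, 3, H, W)
            feats = visual_feature_net(clip)                    # per-frame features
            clip_logits.append(classifier(feats).mean(dim=0))   # per-clip prediction
    video_logits = torch.stack(clip_logits).mean(dim=0)         # aggregate over clips
    return video_logits.argmax().item()                         # predicted video type

video = torch.randn(64, 3, 112, 112)  # a hypothetical 64-frame video
print(analyze_video(video))
```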
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the system embodiments substantially correspond to the method embodiments, their description is relatively brief, and for related parts reference may be made to the description of the method embodiments.
The methods and apparatuses of the present invention may be implemented in many ways, for example, by software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the steps of the method is for illustration only, and the steps of the method of the present invention are not limited to the order specifically described above, unless otherwise specified. In addition, in some embodiments, the present invention may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the method according to the present invention. Thus, the present invention also covers a recording medium storing a program for executing the method according to the present invention.
The description of the present invention is given by way of example and description, and is not intended to be exhaustive or to limit the present invention to the disclosed form. Many modifications and variations will be obvious to those of ordinary skill in the art. The embodiments were selected and described to better illustrate the principles and practical applications of the present invention, and to enable those of ordinary skill in the art to understand the present invention and thereby design various embodiments with various modifications suited to particular uses.
Claims (10)
1. A training method of a video analysis network, characterized in that the video analysis network includes a visual feature network, and the method includes:
for any sample segment in at least one sample segment corresponding to at least one sample video, obtaining a visual feature of the sample segment using the visual feature network, the sample video being labeled with a video type label;
obtaining a video prediction label of the sample segment according to the visual feature; and
training the visual feature network according to the video prediction label of the sample segment and the video type label.
2. The method according to claim 1, characterized in that, before obtaining the visual feature of the sample segment using the visual feature network, the method further includes:
selecting M shot segments from the sample segment, and selecting N frames of images from each of the M shot segments, where M and N are each integers greater than 0, the sample segment includes at least one shot segment in the sample video, and each shot segment includes at least one frame of image; and
obtaining the visual feature of the sample segment using the visual feature network includes: using the visual feature network, for each of the M shot segments, extracting visual features of the N frames of images in the shot segment respectively.
3. The method according to claim 2, characterized in that obtaining the video prediction label of the sample segment according to the visual feature includes:
obtaining the video prediction label of the sample segment according to the visual features of the N frames of images selected from each of the M shot segments.
4. A video analysis method, characterized by including:
selecting at least one video clip from a video;
obtaining, using a visual feature network, visual features of the at least one video clip respectively; and
analyzing the video according to the visual features from the at least one video clip.
5. A training device of a video analysis network, characterized in that the video analysis network includes a visual feature network, and the device includes a classification module and a first training module, wherein:
the visual feature network is configured to, for any sample segment in at least one sample segment corresponding to at least one sample video, obtain a visual feature of the sample segment, the sample video being labeled with a video type label;
the classification module is configured to obtain a video prediction label of the sample segment according to the visual feature; and
the first training module is configured to train the visual feature network according to the video prediction label of the sample segment and the video type label.
6. A video analysis device, characterized by including:
a first selection module, configured to select at least one video clip from a video;
a visual feature network, configured to obtain visual features of the at least one video clip respectively; and
a classification module, configured to analyze the video according to the visual features from the at least one video clip.
7. An electronic device, characterized by including: the training device of the video analysis network according to claim 5; or the video analysis device according to claim 6.
8. An electronic device, characterized by including:
a memory for storing executable instructions; and
a processor for communicating with the memory to execute the executable instructions so as to complete the operations of the method according to any one of claims 1 to 3 or the method according to claim 4.
9. A computer storage medium for storing computer-readable instructions, characterized in that, when the instructions are executed, the operations of the method according to any one of claims 1 to 3 or the method according to claim 4 are implemented.
10. A computer program, including computer-readable instructions, characterized in that, when the computer-readable instructions are run in a device, a processor in the device executes executable instructions for implementing the steps of the method according to any one of claims 1 to 3 or the method according to claim 4.
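For illustration only, the sampling described in claim 2 (selecting M shot segments from a sample segment, then N frames of images from each selected shot segment) can be sketched as follows. Claim 2 does not fix how the selection is made; the random selection and the list-of-lists data layout below are assumptions made for the example.

```python
# Illustrative sketch of the claim-2 sampling; selection strategy and data
# layout are assumptions, not prescribed by the claim.
import random

def sample_frames(shot_segments, M, N):
    """shot_segments: a list of shot segments, each a list of frames; M, N > 0."""
    eligible = [shot for shot in shot_segments if len(shot) >= N]
    chosen_shots = random.sample(eligible, M)          # choose M shot segments
    sampled = []
    for shot in chosen_shots:
        indices = sorted(random.sample(range(len(shot)), N))
        sampled.append([shot[i] for i in indices])     # N frames, in temporal order
    return sampled

# Example: a sample segment containing 5 shot segments of varying lengths.
shots = [[f"shot{i}_frame{j}" for j in range(length)]
         for i, length in enumerate([12, 30, 8, 20, 15])]
picked = sample_frames(shots, M=3, N=4)
print([len(p) for p in picked])  # -> [4, 4, 4]
```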
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710530371.4A CN108229527A (en) | 2017-06-29 | 2017-06-29 | Training and video analysis method and apparatus, electronic equipment, storage medium, program |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710530371.4A CN108229527A (en) | 2017-06-29 | 2017-06-29 | Training and video analysis method and apparatus, electronic equipment, storage medium, program |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108229527A (en) | 2018-06-29 |
Family
ID=62658105
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710530371.4A Pending CN108229527A (en) | 2017-06-29 | 2017-06-29 | Training and video analysis method and apparatus, electronic equipment, storage medium, program |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108229527A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160379055A1 (en) * | 2015-06-25 | 2016-12-29 | Kodak Alaris Inc. | Graph-based framework for video object segmentation and extraction in feature space |
CN106682108A (en) * | 2016-12-06 | 2017-05-17 | 浙江大学 | Video retrieval method based on multi-modal convolutional neural network |
Non-Patent Citations (3)
Title |
---|
Gabriel S. Simoes, Jonatas Wehrmann, Rodrigo C. Barros, Duncan D: "Movie Genre Classification with Convolutional Neural Networks", International Joint Conference on Neural Networks |
Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torral: "MovieQA: Understanding Stories in Movies through Question-Answering", Conference on Computer Vision and Pattern Recognition |
Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov: "Unsupervised Learning of Video Representations using LSTMs", International Conference on Machine Learning |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108961317A (en) * | 2018-07-27 | 2018-12-07 | 阿依瓦(北京)技术有限公司 | A kind of method and system of video depth analysis |
WO2020019397A1 (en) * | 2018-07-27 | 2020-01-30 | 阿依瓦(北京)技术有限公司 | Video depth analysis method and system |
CN109543528A (en) * | 2018-10-19 | 2019-03-29 | 北京陌上花科技有限公司 | Data processing method and device for video features |
CN111382616A (en) * | 2018-12-28 | 2020-07-07 | 广州市百果园信息技术有限公司 | Video classification method and device, storage medium and computer equipment |
CN111382616B (en) * | 2018-12-28 | 2023-08-18 | 广州市百果园信息技术有限公司 | Video classification method and device, storage medium and computer equipment |
CN109635790A (en) * | 2019-01-28 | 2019-04-16 | 杭州电子科技大学 | A kind of pedestrian's abnormal behaviour recognition methods based on 3D convolution |
CN110119757A (en) * | 2019-03-28 | 2019-08-13 | 北京奇艺世纪科技有限公司 | Model training method, video category detection method, device, electronic equipment and computer-readable medium |
CN110119757B (en) * | 2019-03-28 | 2021-05-25 | 北京奇艺世纪科技有限公司 | Model training method, video category detection method, device, electronic equipment and computer readable medium |
CN110781960A (en) * | 2019-10-25 | 2020-02-11 | Oppo广东移动通信有限公司 | Training method, classification method, device and equipment of video classification model |
CN110781960B (en) * | 2019-10-25 | 2022-06-28 | Oppo广东移动通信有限公司 | Training method, classification method, device and equipment of video classification model |
CN110958489A (en) * | 2019-12-11 | 2020-04-03 | 腾讯科技(深圳)有限公司 | Video processing method, video processing device, electronic equipment and computer-readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180629 |