
CN108229527A - Training and video analysis method and apparatus, electronic device, storage medium, and program - Google Patents

Training and video analysis method and apparatus, electronic device, storage medium, and program

Info

Publication number
CN108229527A
CN108229527A
Authority
CN
China
Prior art keywords
video
shot
segment
visual feature
shot segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710530371.4A
Other languages
Chinese (zh)
Inventor
汤晓鸥
黄青虬
熊宇
熊元骏
林达华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN201710530371.4A
Publication of CN108229527A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the invention disclose training and video analysis methods and apparatuses, an electronic device, a storage medium, and a program. The video analysis network includes a visual feature network, and the training method includes: for any sample segment among at least one sample segment corresponding to at least one sample video, obtaining the visual feature of the sample segment using the visual feature network, where the sample video is annotated with a video type label; obtaining a video prediction label of the sample segment according to the visual feature; and training the visual feature network according to the video prediction label of the sample segment and the video type label. Embodiments of the invention enable the analysis of long videos.

Description

Training and video analysis method and apparatus, electronic device, storage medium, and program
Technical field
The present invention relates to computer vision technology, and in particular to training and video analysis methods and apparatuses, an electronic device, a storage medium, and a program.
Background
A large number of films are produced around the world every year. Film is not only a form of entertainment; it is essentially a dramatized presentation of real human life, reflecting human culture, society, and history through a rich medium. If artificial intelligence can understand films, it can also better understand the real world. Therefore, analyzing videos such as films, which are long in duration and rich in information, is a meaningful task in the field of computer vision.
Summary of the invention
Embodiments of the present invention provide a technical solution for performing video analysis.
According to one aspect of the embodiments of the present invention, there is provided a training method for a video analysis network, where the video analysis network includes a visual feature network. The method includes:
for any sample segment among at least one sample segment corresponding to at least one sample video, obtaining a visual feature of the sample segment using the visual feature network, where the sample video is annotated with a video type label;
obtaining a video prediction label of the sample segment according to the visual feature; and
training the visual feature network according to the video prediction label of the sample segment and the video type label.
Optionally, in each of the above training method embodiments, before obtaining the visual feature of the sample segment using the visual feature network, the method further includes:
selecting M shot segments from the sample segment, and selecting N frames of images from each of the M shot segments, where M and N are each integers greater than 0; the sample segment includes at least one shot segment of the sample video, and each shot segment includes at least one frame of image.
Obtaining the visual feature of the sample segment using the visual feature network includes: using the visual feature network, for each of the M shot segments, extracting the visual features of the N frames of images in that shot segment (a sampling sketch follows).
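As a concrete illustration of this sampling-and-extraction step, the following is a minimal Python sketch; the tensor layout, the random selection policy, and the `visual_feature_net` callable are assumptions made for illustration, not part of the patent.

```python
import random
import torch

def sample_frames(frames, shot_boundaries, M=8, N=3):
    """Pick M shot segments from a sample segment, then N frames per shot.

    frames: tensor of shape (T, C, H, W) holding the segment's frames.
    shot_boundaries: list of (start, end) frame-index pairs, one per shot.
    Returns a tensor of shape (M, N, C, H, W).
    """
    shots = random.sample(shot_boundaries, M)                # M shot segments
    picked = []
    for start, end in shots:
        idxs = sorted(random.sample(range(start, end), N))   # N frames per shot
        picked.append(frames[idxs])
    return torch.stack(picked)

def extract_features(visual_feature_net, sampled):
    """Run the visual feature network on every sampled frame."""
    M, N = sampled.shape[:2]
    flat = sampled.flatten(0, 1)               # (M*N, C, H, W)
    feats = visual_feature_net(flat)           # (M*N, D) per-frame features
    return feats.view(M, N, -1)                # regroup by shot: (M, N, D)
```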
Optionally, in each of the above training method embodiments, obtaining the video prediction label of the sample segment according to the visual feature includes:
obtaining the video prediction label of the sample segment according to the visual features of the N frames of images selected from each of the M shot segments.
Optionally, in each of the above training method embodiments, the sample segment includes: the trailer segment corresponding to the sample video, or a video clip edited from the sample video.
Optionally, in each of the above training method embodiments, M and N are each integers greater than or equal to 2.
Optionally, in each of the above training method embodiments, M is 8 and N is 3.
Optionally, in each of the above training method embodiments, obtaining the video prediction label of the sample segment according to the visual features of the N frames of images selected from each of the M shot segments includes (a sketch follows this list):
for each shot segment, determining the type labels of the N frames of images based on their respective visual features;
determining the type label of each shot segment based on the type labels of its N frames of images; and
determining the type label of the sample segment, as the video prediction label, based on the type labels of the M shot segments.
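A minimal sketch of this frame-to-shot-to-segment aggregation, assuming per-frame classification scores and average pooling at both levels (max pooling would work analogously); all names are illustrative.

```python
import torch

def predict_segment_label(frame_scores):
    """frame_scores: tensor (M, N, num_classes) of classification scores
    for the N frames selected from each of the M shot segments."""
    shot_scores = frame_scores.mean(dim=1)     # pool N frames -> (M, num_classes)
    segment_scores = shot_scores.mean(dim=0)   # pool M shots  -> (num_classes,)
    return segment_scores.argmax().item()      # video prediction label
```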
Optionally, in each of the above training method embodiments, the video analysis network further includes a temporal structure network.
The method further includes:
after the training of the visual feature network satisfies a preset condition, performing feature extraction on a given video using the visual feature network to obtain the visual features of multiple shot segments, where the given video includes multiple consecutive shot segments; and
using the temporal structure network, learning the temporal structure feature of the given video based on the visual features and temporal relationships of the consecutive shot segments.
Optionally, in each of the above training method embodiments, the given video includes P shot segments, where P is an integer greater than 0.
Performing feature extraction on the given video using the visual feature network to obtain the visual features of multiple shot segments includes:
selecting Q frames of images from each shot segment in the given video, where Q is an integer greater than 0;
using the visual feature network, for each shot segment, extracting the visual features of its Q frames of images as the visual feature of that shot segment; and
inputting the visual features of the P shot segments into the temporal structure network in their order within the given video.
Optionally, in each of the above training method embodiments, learning the temporal structure feature of the given video based on the visual features and temporal relationships of the consecutive shot segments includes (see the sketch after this list):
predicting the next shot segment adjacent to the given video based on the visual features and temporal relationships of the consecutive shot segments; and
training the temporal structure network based on the prediction accuracy for the next shot segment.
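One way to turn this prediction accuracy into a training signal, sketched under the assumption that accuracy is measured by picking the true next shot out of a set of candidate shots (as the following embodiments describe), is a cross-entropy loss over candidate scores; the dot-product scoring here is a simple stand-in for the convolutional scorer shown later.

```python
import torch
import torch.nn.functional as F

def next_shot_loss(temporal_net, shot_feats, option_feats, true_idx):
    """shot_feats: (P, D) features of the P consecutive shot segments.
    option_feats: (L, D) features of L candidate next shots.
    true_idx: index of the correct next shot among the L options."""
    pred_feat = temporal_net(shot_feats)    # predicted next-shot feature, (D,)
    scores = option_feats @ pred_feat       # one score per option, (L,)
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([true_idx]))
```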
Optionally, in each of the above training method embodiments, the method further includes:
selecting Q frames of images from each of L option shot segments, where L is an integer greater than 1, and the L option shot segments include at least one correct next shot segment; and
using the visual feature network, for each of the L option shot segments, extracting the visual features of its Q frames of images as the visual feature of that option shot segment.
Predicting the next shot segment adjacent to the given video based on the visual features and temporal relationships of the consecutive shot segments includes:
generating, by the temporal structure network, the temporal structure feature of the P shot segments according to their visual features and temporal relationships;
obtaining, by the temporal structure network, the visual feature of the next shot segment adjacent to the P shot segments according to the temporal structure feature of the P shot segments; and
selecting, using a convolutional neural network, the next shot segment adjacent to the P shot segments from the L option shot segments according to the visual feature of the predicted next shot segment and the visual features of the L option shot segments.
Optionally, in each of the above training method embodiments, selecting the next shot segment adjacent to the P shot segments from the L option shot segments according to the visual feature of the predicted next shot segment and the visual features of the L option shot segments includes:
obtaining, for each option shot segment, a probability score of being the next shot segment, according to the visual feature of the predicted next shot segment and the visual feature of that option shot segment; and
selecting the at least one option shot segment with the highest probability score among the L option shot segments as the next shot segment adjacent to the P shot segments.
Optionally, in each of the above training method embodiments, obtaining, for each option shot segment, the probability score of being the next shot segment according to the visual feature of the predicted next shot segment and the visual features of the L option shot segments includes:
replicating the visual feature of the predicted next shot segment to obtain L copies of it, and concatenating the L copies with the visual features of the L option shot segments in a preset format to obtain a feature matrix; and
using the convolutional neural network, based on the feature matrix, obtaining the probability score of each of the L option shot segments being the next adjacent shot segment.
Optionally, in each of the above training method embodiments, after obtaining the probability score of each option shot segment being the next shot segment, the method further includes:
normalizing the probability scores of the L option shot segments being the next adjacent shot segment, to obtain L normalized probability scores.
Selecting the at least one option shot segment with the highest probability score among the L option shot segments as the next shot segment adjacent to the P shot segments includes:
selecting, from the L normalized probability scores, the at least one option shot segment corresponding to the highest normalized probability score as the next shot segment adjacent to the P shot segments. A sketch of this replicate-concatenate-score-normalize step follows.
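A minimal sketch of the scoring step, with an assumed 1-D convolutional scorer; stacking the replicated predicted feature next to each option feature along a channel dimension is one plausible reading of the "preset format", not the patent's definitive layout.

```python
import torch
import torch.nn as nn

class OptionScorer(nn.Module):
    """Scores L option shots against one predicted next-shot feature."""
    def __init__(self, feat_dim):
        super().__init__()
        self.conv = nn.Conv1d(2, 1, kernel_size=1)  # fuse the two stacked features
        self.fc = nn.Linear(feat_dim, 1)

    def forward(self, pred_feat, option_feats):
        L = option_feats.size(0)
        pred = pred_feat.unsqueeze(0).expand(L, -1)        # replicate L times
        matrix = torch.stack([pred, option_feats], dim=1)  # feature matrix (L, 2, D)
        fused = self.conv(matrix).squeeze(1)               # (L, D)
        scores = self.fc(fused).squeeze(-1)                # probability scores (L,)
        return scores.softmax(dim=0)                       # normalized scores

# the option with the highest normalized score is taken as the next shot:
# next_idx = scorer(pred_feat, option_feats).argmax()
```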
Optionally, in each of the above training method embodiments, the temporal structure network includes a recurrent neural network.
According to another aspect of the embodiments of the present invention, there is provided a video analysis method, including:
selecting at least one video clip from a video;
obtaining the visual feature of each of the at least one video clip using a visual feature network; and
analyzing the video according to the visual features of the at least one video clip.
Optionally, in each of the above video analysis method embodiments, selecting at least one video clip from the video includes:
selecting X shot segments from the video, where the video clip includes the X shot segments; and
for each of the X shot segments, selecting Y frames of images from that shot segment, where X and Y are each integers greater than 0.
Optionally, in each of the above video analysis method embodiments, obtaining the visual feature of each of the at least one video clip using the visual feature network includes:
using the visual feature network, for each of the X shot segments, extracting the visual features of the Y frames of images from that shot segment.
Analyzing the video according to the visual features of the at least one video clip includes:
analyzing the video according to the visual features of the Y frames of images selected from each of the X shot segments.
Optionally, in each of the above video analysis method embodiments, selecting X shot segments from the video includes:
selecting the X shot segments from the video based on the feature similarity between adjacent frames in the video and a preset condition, as sketched below.
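A minimal sketch of similarity-based shot splitting, assuming per-frame feature vectors, an illustrative cosine-similarity threshold, and minimum/maximum shot-length constraints as the preset condition:

```python
import torch
import torch.nn.functional as F

def split_into_shots(frame_feats, sim_thresh=0.85, min_len=8, max_len=400):
    """frame_feats: (T, D) per-frame features. Returns (start, end) index pairs."""
    sims = F.cosine_similarity(frame_feats[:-1], frame_feats[1:], dim=1)
    shots, start = [], 0
    for t in range(1, frame_feats.size(0)):
        cut = sims[t - 1] < sim_thresh and (t - start) >= min_len
        too_long = (t - start) >= max_len
        if cut or too_long:                 # scene change or length cap reached
            shots.append((start, t))
            start = t
    shots.append((start, frame_feats.size(0)))
    return shots
```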
Optionally, in each of the above video analysis method embodiments, X and Y are each integers greater than or equal to 2.
Optionally, in each of the above video analysis method embodiments, X is 8 and Y is 3.
Optionally, in each of the above video analysis method embodiments, analyzing the video according to the visual features of the Y frames of images selected from each of the X shot segments includes:
for each of the X shot segments: determining the type labels of the Y frames of images based on their respective visual features; and
determining the type label of that shot segment based on the type labels of its Y frames of images; and
determining the type label of the video, as the video prediction label of the video, based on the type labels of the X shot segments.
Optionally, in each of the above video analysis method embodiments, the method further includes:
annotating the video with the video prediction label.
Optionally, in each of the above video analysis method embodiments, selecting X shot segments from the video includes:
in response to receiving a search request, successively selecting X consecutive shot segments from the video, where the search request includes a video description field.
Analyzing the video according to the visual features of the Y frames of images selected from each of the X shot segments includes:
obtaining the type label of each of the X shot segments according to the visual features of the Y frames of images selected from each of the X shot segments; and
outputting the shot segments among the X shot segments whose type labels match the video description field.
Optionally, in each of the above video analysis method embodiments, outputting the shot segments among the X shot segments whose type labels match the video description field includes (a matching sketch follows):
in response to obtaining the type label of any of the X shot segments, comparing whether the type label of that shot segment matches the video description field, and
outputting each shot segment whose type label matches the video description field;
or,
in response to obtaining the type labels of all shot segments in the video, comparing whether the type label of each of the shot segments matches the video description field, and
outputting the shot segments among all shot segments whose type labels match the video description field.
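A minimal sketch of the batch variant of this matching step; the equality-based matching rule and all names are assumptions, since the patent leaves the matching criterion open.

```python
def matching_shots(shot_labels, description_field):
    """shot_labels: list of (shot_id, type_label) pairs for all shots in a video.
    description_field: the query string carried by the search request."""
    query = description_field.strip().lower()
    return [shot_id for shot_id, label in shot_labels if label.lower() == query]

# e.g. matching_shots([(0, "fight"), (1, "dialogue")], "fight") -> [0]
```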
Optionally, in each of the above video analysis method embodiments, analyzing the video according to the visual features of the Y frames of images selected from each of the X shot segments includes:
obtaining the type label of each of the X shot segments according to the visual features of the Y frames of images selected from each of the X shot segments.
The method further includes:
obtaining the temporal structure feature of the X shot segments using a temporal structure network; and
generating a video description of the video according to the type labels and the temporal structure feature of the X shot segments.
Optionally, in each of the above video analysis method embodiments, selecting X shot segments from the video includes: successively selecting X consecutive shot segments from the video.
Analyzing the video according to the visual features of the Y frames of images selected from each of the X shot segments includes: obtaining the type label of each of the X shot segments according to the visual features of the Y frames of images selected from each of the X shot segments.
The method further includes:
obtaining the temporal structure feature of all shot segments in the video using a temporal structure network; and
generating the video description of the video according to the type labels and the temporal structure feature of all shot segments in the video.
Optionally, in each of the above video analysis method embodiments, selecting X shot segments from the video includes:
in response to receiving a search request, selecting a video, and successively selecting X consecutive shot segments from the video, where the search request includes a video description field.
Analyzing the video according to the visual features of the Y frames of images selected from each of the X shot segments includes:
obtaining the type label of each of the X shot segments according to the visual features of the Y frames of images selected from each of the X shot segments, and returning to the operation of successively selecting X consecutive shot segments from the video, until the type labels of all shot segments in the video are obtained; and obtaining the temporal structure feature of all shot segments in the video using a recurrent neural network;
generating the video description of the video according to the type labels and the temporal structure feature of all shot segments in the video, using the temporal structure network;
comparing whether the video description of the video matches the video description field; and
outputting the video in response to the video description of the video matching the video description field.
Optionally, in each of the above video analysis method embodiments, the temporal structure network includes a recurrent neural network.
According to yet another aspect of the embodiments of the present invention, there is provided a training apparatus for a video analysis network, where the video analysis network includes a visual feature network. The apparatus includes a classification module and a first training module, where:
the visual feature network is configured to obtain, for any sample segment among at least one sample segment corresponding to at least one sample video, the visual feature of that sample segment, where the sample video is annotated with a video type label;
the classification module is configured to obtain the video prediction label of the sample segment according to the visual feature; and
the first training module is configured to train the visual feature network according to the video prediction label of the sample segment and the video type label.
Optionally, in each of the above training apparatus embodiments, the apparatus further includes: a first selection module, configured to select, for any sample segment among the at least one sample segment corresponding to the at least one sample video, M shot segments from that sample segment, where the sample segment includes at least one shot segment of the sample video and each shot segment includes at least one frame of image; and
a second selection module, configured to select N frames of images from each of the M shot segments, where M and N are each integers greater than 0.
The visual feature network is specifically configured to extract, for each of the M shot segments, the visual features of the N frames of images in that shot segment.
The classification module is specifically configured to obtain the video prediction label of the sample segment according to the visual features of the N frames of images selected from each of the M shot segments.
Optionally, in each of the above training apparatus embodiments, the sample segment includes: the trailer segment corresponding to the sample video, or a video clip edited from the sample video.
Optionally, in each of the above training apparatus embodiments, the classification module is specifically configured to:
for each shot segment, determine the type labels of the N frames of images based on their respective visual features;
determine the type label of each shot segment based on the type labels of its N frames of images; and
determine the type label of the sample segment, as the video prediction label, based on the type labels of the M shot segments.
Optionally, in each of the above training apparatus embodiments, the visual feature network is further configured to perform feature extraction on a given video after its training satisfies a preset condition, to obtain the visual features of multiple shot segments, where the given video includes multiple consecutive shot segments.
The video analysis network further includes:
a temporal structure network, configured to learn the temporal structure feature of the given video based on the visual features and temporal relationships of the consecutive shot segments.
Optionally, in each of the above training apparatus embodiments, the given video includes P shot segments, where P is an integer greater than 0.
The second selection module is further configured to select Q frames of images from each shot segment in the given video, where Q is an integer greater than 0.
When performing feature extraction on the given video to obtain the visual features of multiple shot segments, the visual feature network is specifically configured to: for each shot segment, extract the visual features of its Q frames of images as the visual feature of that shot segment; and input the visual features of the P shot segments into the recurrent neural network in their order within the given video.
Optionally, in each of the above training apparatus embodiments, the temporal structure network is specifically configured to predict the next shot segment adjacent to the given video based on the visual features and temporal relationships of the consecutive shot segments.
The apparatus further includes:
a second training module, configured to train the temporal structure network based on the prediction accuracy for the next shot segment.
Optionally, in each of the above training apparatus embodiments, the second selection module is further configured to select Q frames of images from each of L option shot segments, where the L option shot segments include at least one correct next shot segment, and L is an integer greater than 1.
The visual feature network is further configured to extract, for each of the L option shot segments, the visual features of its Q frames of images as the visual feature of that option shot segment.
The temporal structure network is specifically configured to: generate the temporal structure feature of the P shot segments according to their visual features and temporal relationships; and obtain, according to the temporal structure feature of the P shot segments, the visual feature of the next shot segment adjacent to the P shot segments.
The apparatus further includes:
a convolutional neural network, configured to select the next shot segment adjacent to the P shot segments from the L option shot segments according to the visual feature of the predicted next shot segment and the visual features of the L option shot segments.
Optionally, in each of the above training apparatus embodiments, the convolutional neural network is specifically configured to:
obtain, for each option shot segment among the L option shot segments, a probability score of being the next shot segment according to the visual feature of the predicted next shot segment and the visual feature of that option shot segment; and
select the at least one option shot segment with the highest probability score among the L option shot segments as the next shot segment adjacent to the P shot segments.
Optionally, in each of the above training apparatus embodiments, the apparatus further includes:
a normalization module, configured to normalize the probability scores of the L option shot segments being the next adjacent shot segment, to obtain L normalized probability scores.
When selecting the at least one option shot segment with the highest probability score among the L option shot segments as the next shot segment adjacent to the P shot segments, the convolutional neural network is specifically configured to select, from the L normalized probability scores, the at least one option shot segment corresponding to the highest normalized probability score as the next shot segment adjacent to the P shot segments.
Optionally, in each of the above training apparatus embodiments, the temporal structure network includes a recurrent neural network.
According to yet another aspect of the embodiments of the present invention, there is provided a video analysis apparatus, including:
a first selection module, configured to select at least one video clip from a video;
a visual feature network, configured to obtain the visual feature of each of the at least one video clip; and
a classification module, configured to analyze the video according to the visual features of the at least one video clip.
According to yet another aspect of the embodiments of the present invention, there is provided an electronic device, including the training apparatus for a video analysis network or the video analysis apparatus according to any of the above embodiments of the present invention.
Optionally, in each of the above video analysis apparatus embodiments, the first selection module is specifically configured to select X shot segments from the video, where the video clip includes the X shot segments.
The apparatus further includes:
a second selection module, configured to select, for each of the X shot segments, Y frames of images from that shot segment, where X and Y are each integers greater than 0.
The visual feature network is specifically configured to extract, for each of the X shot segments, the visual features of the Y frames of images from that shot segment.
The classification module is specifically configured to analyze the video according to the visual features of the Y frames of images selected from each of the X shot segments.
Optionally, in each of the above video analysis apparatus embodiments, the first selection module is specifically configured to select the X shot segments from the video based on the feature similarity between adjacent frames in the video and a preset condition.
Optionally, in each of the above video analysis apparatus embodiments, the classification module is specifically configured to:
for each of the X shot segments: determine the type labels of the Y frames of images based on their respective visual features, and determine the type label of that shot segment based on the type labels of its Y frames of images; and
determine the type label of the video, as the video prediction label of the video, based on the type labels of the X shot segments.
Optionally, in each of the above video analysis apparatus embodiments, the apparatus further includes:
an annotation module, configured to annotate the video with the video prediction label.
Optionally, in each of the above video analysis apparatus embodiments, the apparatus further includes:
a receiving module, configured to receive a search request, where the search request includes a video description field;
the first selection module is specifically configured to successively select X consecutive shot segments from the video in response to the receiving module receiving a search request;
the classification module is specifically configured to obtain the type label of each of the X shot segments according to the visual features of the Y frames of images selected from each of the X shot segments; and
an output module, configured to output the shot segments among the X shot segments whose type labels match the video description field.
Optionally, in each of the above video analysis apparatus embodiments, the classification module is specifically configured to obtain the type label of each of the X shot segments according to the visual features of the Y frames of images selected from each of the X shot segments.
The apparatus further includes:
a temporal structure network, configured to obtain the temporal structure feature of the X shot segments; and
a generation module, configured to generate the video description of the video according to the type labels and the temporal structure feature of the X shot segments.
Optionally, in each of the above video analysis apparatus embodiments, the first selection module is specifically configured to successively select X consecutive shot segments from the video.
The classification module is specifically configured to obtain the type label of each of the X shot segments according to the visual features of the Y frames of images selected from each of the X shot segments.
The apparatus further includes:
a temporal structure network, configured to obtain the temporal structure feature of all shot segments in the video; and
a generation module, configured to generate the video description of the video according to the type labels and the temporal structure feature of all shot segments in the video.
According to yet another aspect of the embodiments of the present invention, there is provided an electronic device, including:
a memory, configured to store executable instructions; and
a processor, configured to communicate with the memory to execute the executable instructions, thereby completing the operations of the training method for a video analysis network or the video analysis method according to any of the above embodiments of the present invention.
According to yet another aspect of the embodiments of the present invention, there is provided a computer storage medium for storing computer-readable instructions, where the instructions, when executed, implement the operations of the training method for a video analysis network or the video analysis method according to any of the above embodiments of the present invention.
According to yet another aspect of the embodiments of the present invention, there is provided a computer program, including computer-readable instructions, where, when the computer-readable instructions are run in a device, a processor in the device executes executable instructions for implementing the steps of the training method for a video analysis network or the video analysis method according to any of the above embodiments of the present invention.
Based on the training method and apparatus for a video analysis network, the electronic device, the computer storage medium, and the computer program provided by the above embodiments of the present invention, a technical solution is proposed in which the visual feature network is trained using the sample segments corresponding to sample videos (for example, the trailer segment of a sample video): a sample segment is selected from the sample video and its visual feature is extracted, the video prediction label of the sample segment is obtained according to the visual feature, and the visual feature network is then trained according to the video prediction label of the sample segment and the video type label of the sample video. Since the sample segment is selected from the sample video, it contains frames of the sample video. Training the visual feature network based on a segment of the sample video rather than on the sample video itself greatly reduces the amount of computation, saves computing and storage resources, and improves training efficiency, and the trained visual feature network can be effectively used for analyzing the sample video.
Based on the video analysis method and apparatus, the electronic device, the computer storage medium, and the computer program provided by the above embodiments of the present invention, video clips are selected from a video, the visual features of the video clips are extracted using the visual feature network, and the video is analyzed according to the visual features. Based on the embodiments of the present invention, a video can be analyzed based on its segments without analyzing the entire video, which reduces the amount of computation for video analysis, saves computing and storage resources, and improves analysis efficiency, making it feasible to quickly analyze long videos such as films.
The technical solution of the present invention is described in further detail below through the accompanying drawings and embodiments.
Description of the drawings
The accompanying drawings, which constitute a part of the specification, illustrate embodiments of the present invention and, together with the description, serve to explain the principles of the present invention.
The present invention can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of an embodiment of the training method for a video analysis network of the present invention.
Fig. 2 is a flowchart of another embodiment of the training method for a video analysis network of the present invention.
Fig. 3 is a schematic diagram of training the visual feature network in an embodiment of the present invention.
Fig. 4 is a flowchart of yet another embodiment of the training method for a video analysis network of the present invention.
Fig. 5 is a flowchart of still another embodiment of the training method for a video analysis network of the present invention.
Fig. 6 is a schematic diagram of training the recurrent neural network in an embodiment of the present invention.
Fig. 7 is a flowchart of an embodiment of the video analysis method of the present invention.
Fig. 8 is a flowchart of another embodiment of the video analysis method of the present invention.
Fig. 9 is a flowchart of an application embodiment of the video analysis method of the present invention.
Fig. 10 is a flowchart of another application embodiment of the video analysis method of the present invention.
Fig. 11 is a flowchart of yet another application embodiment of the video analysis method of the present invention.
Fig. 12 is a structural schematic diagram of an embodiment of the training apparatus for a video analysis network of the present invention.
Fig. 13 is a structural schematic diagram of another embodiment of the training apparatus for a video analysis network of the present invention.
Fig. 14 is a structural schematic diagram of an embodiment of the video analysis apparatus of the present invention.
Fig. 15 is a structural schematic diagram of another embodiment of the video analysis apparatus of the present invention.
Fig. 16 is a structural schematic diagram of yet another embodiment of the video analysis apparatus of the present invention.
Fig. 17 is a structural schematic diagram of still another embodiment of the video analysis apparatus of the present invention.
Fig. 18 is a structural schematic diagram of a further embodiment of the video analysis apparatus of the present invention.
Fig. 19 is a structural schematic diagram of an application embodiment of the electronic device of the present invention.
Detailed description of the embodiments
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present invention.
It should also be understood that, for ease of description, the sizes of the various parts shown in the accompanying drawings are not drawn according to actual proportional relationships.
The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present invention or its application or use.
Techniques, methods, and devices known to a person of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be considered part of the specification.
It should be noted that similar reference numerals and letters denote similar items in the following accompanying drawings; therefore, once an item is defined in one drawing, it need not be further discussed in subsequent drawings.
Embodiments of the present invention can be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate together with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments including any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, and servers can be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, object programs, components, logic, data structures, and the like, which perform specific tasks or implement specific abstract data types. The computer system/server can be implemented in a distributed cloud computing environment, where tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media including storage devices.
In the course of implementing the present invention, the inventors found through research that, because the amount of computation of video analysis techniques in the field of computer vision is very large, current video analysis techniques in this field can only analyze short videos of tens of seconds and cannot be directly extended to analyzing films lasting one to two hours or longer.
Fig. 1 is a flowchart of an embodiment of the training method for a video analysis network of the present invention. The video analysis network of the embodiments of the present invention includes a visual feature network. As shown in Fig. 1, the training method of this embodiment performs the following operations for any sample segment among the at least one sample segment corresponding to at least one sample video:
102. For any sample segment among the at least one sample segment corresponding to at least one sample video, obtain the visual feature of the sample segment using the visual feature network.
In the embodiments of the present invention, a sample video is a video that participates in the training of the video analysis network as a sample, and a sample segment is a segment selected from the sample video; the length of the sample segment is less than that of the sample video.
The sample video or its sample segment is annotated with the type label of the sample video, which is referred to in the embodiments of the present invention as the video type label. The video type label of a sample video is a label annotated in advance by the producer or uploader of the sample video or its sample segment, or a label annotated by users after viewing the sample video or its sample segment, for example, horror, comedy, music, crime, and so on. The video type label of a sample video can be obtained from movie websites (for example, IMDb).
In the embodiments of the present invention, a visual feature refers to the visual characteristics of a video, its segments, or its frames.
104. Obtain the video prediction label of the sample segment according to the visual feature.
106. Train the visual feature network according to the video prediction label of the sample segment and the video type label of the corresponding sample video.
Illustratively, the above operations 102 to 106 can be performed iteratively, and the visual feature network is trained through this iterative process until a preset condition is satisfied, for example, the difference between the video prediction label of the sample segment and the corresponding video type label is less than a preset value, or the number of training iterations of the visual feature network reaches a preset number, at which point training is complete.
Based on the training method for a video analysis network provided by the above embodiment of the present invention, a technical solution is proposed in which the visual feature network is trained using the sample segments corresponding to sample videos (for example, the trailer segment of a sample video): a sample segment is selected from the sample video and its visual feature is extracted, the video prediction label of the sample segment is obtained according to the visual feature, and the visual feature network is then trained according to the video prediction label of the sample segment and the video type label of the sample video. Since the sample segment is selected from the sample video, it contains frames of the sample video. Training the visual feature network based on a segment of the sample video rather than on the sample video itself greatly reduces the amount of computation, saves computing and storage resources, and improves training efficiency, and the trained visual feature network can be effectively used for analyzing the sample video.
Fig. 2 is a flowchart of another embodiment of the training method for a video analysis network of the present invention. As shown in Fig. 2, the training method of this embodiment performs the following operations for any sample segment among the at least one sample segment corresponding to at least one sample video:
202. Select M shot segments from the sample segment, and select N frames of images from each of the M shot segments.
Each sample segment includes at least one shot segment of the sample video, and each shot segment includes at least one frame of image. In a specific example of the embodiments of the present invention, the sample segment is the trailer segment corresponding to the sample video, or a video clip edited from the sample video. Within a video, trailer segment, or video clip, whenever a scene cut occurs, one shot segment switches to another.
Illustratively, the M shot segments can be selected from the sample segment based on the feature similarity between adjacent frames in the sample segment and a preset condition. For example, in a specific application, the features of adjacent frames can be compared to calculate a feature similarity score between adjacent frames; at the same time, based on a preset condition set in view of the entire sample segment, for example, that the number of frames in a shot segment is not less than a first preset value and not more than a second preset value, the shot segments in the sample segment are identified and distinguished, and the M shot segments are then selected from the sample segment.
In specific examples, the M shot segments can be selected from the sample segment randomly, or in a preset manner, for example, selecting one shot segment every one or more shot segments. Similarly, the N frames of images can be selected from a shot segment randomly, or in a preset manner, for example, selecting one frame every one or more frames.
Here, M and N are each integers greater than 0. The larger the values of M and N, the richer the visual features obtained, which makes the video prediction label of the sample segment more accurate and the performance of the trained visual feature network better; however, the required amount of computation is also larger, and more computing resources are consumed. In a specific example of the embodiments of the present invention, M and N are each integers greater than or equal to 2, for example, M is 8 and N is 3. The inventors found through research that when M is 8 and N is 3, the performance of the visual feature network can be ensured while consuming fewer computing resources. However, in the embodiments of the present invention, the values of M and N are not limited to these, and other values of M and N can likewise be used.
Selecting multiple shot segments from the sample segment, and multiple frames of images from each shot segment, provides a fairly comprehensive view of the entire sample segment, which helps in understanding the visual information of the entire sample segment more comprehensively, effectively, and accurately.
204. Using the visual feature network, for each of the M shot segments, extract the visual features of the N frames of images selected from that shot segment.
206. Obtain the type label of the sample segment according to the visual features of the N frames of images selected from each of the M shot segments. In the embodiments of the present invention, a type label obtained from visual features is referred to as a video prediction label, to distinguish it from the video type label.
The video prediction label of a sample segment in the embodiments of the present invention is the predicted type of the sample segment, for example, horror, comedy, music, crime, and so on.
In a specific example, operation 206 can be implemented as follows:
for each shot segment, determining the type labels of the N selected frames of images based on their respective visual features;
determining the type label of each shot segment based on the type labels of the N frames of images selected from that shot segment; and
determining the video prediction label of the sample segment based on the type labels of the M selected shot segments.
208. Train the visual feature network according to the video prediction label of the sample segment and the video type label of the sample video corresponding to the sample segment.
Illustratively, the above operations 202 to 208 can be performed iteratively, and the visual feature network is trained through this iterative process until a preset condition is satisfied, for example, the difference between the video prediction label of the sample segment and the video type label of the corresponding sample video is less than a preset value, or the number of training iterations of the visual feature network reaches a preset number, at which point training is complete and the final visual feature network is obtained.
Based on the training method for a video analysis network provided by the above embodiment of the present invention, a technical solution is proposed in which the initial visual feature network model is trained using the sample segments (for example, trailer segments) corresponding to sample videos: M shot segments are selected from the sample segment corresponding to a sample video, N frames of images are selected from each of the M shot segments, the visual features of the N frames of images are extracted for each of the M shot segments using the visual feature network, the video prediction label of the sample segment is obtained according to the visual features of the N frames of images selected from each of the M shot segments, and the visual feature network is then trained according to the video prediction label of the sample segment and the video type label of the sample video. Since the sample segment is edited from the sample video and contains representative frames of the sample video, training the visual feature network based on the sample segment greatly reduces the amount of computation, saves computing and storage resources, and improves training efficiency, and the trained visual feature network can be effectively used for analyzing the sample video. Selecting at least one shot segment from the sample segment and at least one frame of image from each shot segment, that is, performing network training through sparse sampling in units of shot segments, further reduces the amount of computation, saves computing and storage resources, and improves training efficiency.
Based on the above training method embodiments, the visual feature network is trained in a weakly supervised manner, learning the visual features of sample videos such as films from sample segments (for example, trailer segments), so that the visual feature network can be effectively used for analyzing the sample videos. "Weakly supervised" here means that the training of the visual feature network is achieved only through the video type label of the sample video, without a type label for each frame of image in the sample video or sample segment, which reduces the annotation and computation workload and improves the training efficiency of the visual feature network.
Fig. 3 is a schematic diagram of training the visual feature network in an embodiment of the present invention. As shown in Fig. 3, in this embodiment the sample video is a film, and the sample segment of the sample video is the trailer of the film. When training the visual feature network, 8 shot segments are randomly selected from the trailer, and 3 frames of images are randomly selected from each shot segment; the selected frames are input into the visual feature network (the visual model), which extracts the visual features of these frames; a classifier classifies each frame according to its visual feature to obtain the type label of each frame; the type label of each shot segment is then determined from the type labels of the N frames selected from that shot segment, for example by average pooling or max pooling; finally, the type labels of the different shot segments are merged, for example by average pooling or max pooling, to obtain the video prediction label of the trailer, such as the genre of the film (for example, horror, comedy, etc.). The video type label of the film can be obtained from movie websites (for example, IMDb), so the visual feature network can be trained based on the video prediction label of the trailer and the video type label of the corresponding film, adjusting the network parameters of the visual feature network with the video type label as supervision. A sketch of one training step follows.
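Putting the Fig. 3 pipeline together, a hedged end-to-end sketch of one training step (PyTorch-style; the module names, the multi-hot genre annotation, and the BCE loss are assumptions made for illustration, not the patent's prescribed implementation):

```python
import torch
import torch.nn as nn

def train_step(visual_net, classifier, optimizer, trailer_frames, genre_label):
    """trailer_frames: (M, N, C, H, W), e.g. 8 shots x 3 frames from one trailer.
    genre_label: (num_genres,) multi-hot genre annotation, e.g. from IMDb."""
    M, N = trailer_frames.shape[:2]
    feats = visual_net(trailer_frames.flatten(0, 1))   # per-frame features (M*N, D)
    frame_logits = classifier(feats).view(M, N, -1)    # per-frame genre scores
    shot_logits = frame_logits.mean(dim=1)             # pool frames -> shots
    video_logits = shot_logits.mean(dim=0)             # pool shots  -> trailer
    loss = nn.functional.binary_cross_entropy_with_logits(
        video_logits, genre_label)                     # weak, video-level supervision
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```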
Owing to the particular nature of a film trailer, each frame image in it is a representative picture that the producer clipped from the film. Training the visual feature network on the trailer, a short segment of the film, therefore achieves the desired training effect while saving far more computing and storage resources, and training far more efficiently, than using the entire film as the training sample.
In addition, the video analysis network of the embodiments of the present invention may further include a temporal structure network, in which case training the video analysis network comprises two stages: in the first stage, the visual feature network is trained until a preset condition is met; after the visual feature network meets the preset condition, the second stage begins and the recurrent neural network is trained. Figs. 1 and 2 above illustrate the first stage of this training process; the second stage is carried out after the first stage is completed.
In a specific example of the embodiments of the present invention, the temporal structure network is implemented with a recurrent neural network, which can capture and memorize temporal structure features and is suited to processing sequential data (such as video and speech). The recurrent neural network may be, for example, a long short-term memory (LSTM) model. By storing short-term memories of temporal structure features, an LSTM strengthens the ability to remember the distant past, so that a limited number of neurons can capture the temporal structure features of a large-capacity video.
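A minimal sketch of using an LSTM as the temporal structure network follows: it consumes the per-shot visual features in order, and its last hidden state serves as the predicted feature of the next shot. The dimensions (1024-dimensional shot features, 256-dimensional hidden state) follow the worked example later in this section; the shot count is arbitrary.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=1024, hidden_size=256, batch_first=True)

shot_feats = torch.randn(1, 5, 1024)      # features of n = 5 consecutive shots, in order
out, (h_n, c_n) = lstm(shot_feats)
next_shot_feat = h_n[-1]                  # (1, 256): predicted next-shot feature
```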
Fig. 4 is a flow chart of another embodiment of the training method of the video analysis network of the present invention. This embodiment adds the second-stage training process after the embodiment shown in Fig. 1 or Fig. 2. As shown in Fig. 4, compared with the above training method embodiments, the embodiment of the present invention further includes:
302: feature extraction is performed on a given video with the visual feature network to obtain the visual features of a plurality of shot segments.

Here the given video comprises a plurality of consecutive shot segments.
304: using the temporal structure network, the temporal structure features of the given video are learned from the visual features and temporal relations of the consecutive shot segments.

The temporal structure features represent the temporal characteristics of, and information about, the segments and frame images within a video; from them, the ordering of the segments and frame images in the video can be determined.
In one specific example of the training method embodiment shown in Fig. 4, suppose the given video comprises P shot segments, where P is an integer greater than 0. Operation 302 may then include:

selecting Q frame images from each shot segment in the given video, where Q is an integer greater than 0;

using the visual feature network to extract, for each shot segment, the visual features of its Q frame images as the visual feature of that shot segment;

inputting the visual features of the P shot segments into the temporal structure network one by one, in the order in which the P shot segments appear in the given video.
In addition, in another specific example of the training method embodiment shown in Fig. 4, operation 304 may include:

predicting the next shot segment adjacent to the given video from the visual features and temporal relations of its consecutive shot segments;

training the temporal structure network according to the accuracy of this next-shot prediction, to obtain the final temporal structure network.
Illustratively, the operations of the embodiments shown in Fig. 4 may be executed iteratively: the temporal structure network is trained through this iterative process until a preset condition is met, for example the probability of correctly predicting the next shot segment reaches a predetermined value, or the number of training iterations of the temporal structure network reaches a preset count, at which point training of the temporal structure network is complete.
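A sketch of this iterate-until-condition loop is given below. The function train_step, the accuracy target, the warm-up count, and the iteration cap are hypothetical placeholders standing in for operations 402 through 414.

```python
target_accuracy, max_steps = 0.9, 100_000

def train_step():
    """One pass of the second-stage flow; returns whether the next shot
    was predicted correctly (stub for illustration)."""
    return True

correct = total = 0
for step in range(max_steps):
    correct += int(train_step())
    total += 1
    # preset condition: running prediction accuracy reaches the target,
    # or the preset number of training iterations is exhausted
    if total >= 1000 and correct / total >= target_accuracy:
        break
```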
Fig. 5 is a flow chart of a further embodiment of the training method of the video analysis network of the present invention. As shown in Fig. 5, compared with the embodiment shown in Fig. 4, the second-stage training process in this embodiment includes:
402: Q frame images are selected from each shot segment in the given video.

Here the given video comprises P shot segments, where P and Q are each integers greater than 0.

404: using the visual feature network, the visual features of the Q frame images in each shot segment of the given video are extracted as the visual feature of that shot segment.

406: the visual features of the shot segments in the given video are input into the temporal structure network one by one, in the order of the shot segments in the given video.

408: the temporal structure network generates the temporal structure features of the P shot segments in the given video from the visual features and temporal relations of those shot segments.

410: the temporal structure network obtains the visual feature of the next shot segment adjacent to the P shot segments from the temporal structure features of the P shot segments.

Operation 412 is then performed.
402': Q frame images are selected from each candidate shot segment among L candidate shot segments.

Here L is an integer greater than 1.

This embodiment of the present invention predicts the next shot segment adjacent to the given video. The candidate shot segments include at least one correct next shot segment; the rest are incorrect next shot segments. The number of correct next shot segments is preset and equals the number of correct next shot segments finally to be determined. In one specific example, the candidate shot segments include one correct next shot segment, and the remaining L-1 are incorrect next shot segments.
404': using the visual feature network, the visual features of the Q frame images of each of the L candidate shot segments are extracted as the visual feature of that candidate shot segment.

Operations 402'-404' and operations 402-410 are two flows executed in parallel; no execution order between them is imposed.
412: using a convolutional neural network, the next shot segment adjacent to the P shot segments is chosen from the L candidate shot segments according to the visual feature of the predicted next shot segment and the visual features of the L candidate shot segments.

414: the temporal structure network is trained according to the accuracy of the next-shot prediction.

Illustratively, operations 408-414 (excluding 402'-404') or operations 402-414 (including 402'-404') may be executed iteratively: the temporal structure network is trained through this iterative process until a preset condition is met, for example the accuracy of the next-shot prediction reaches a predetermined value or the number of training iterations of the temporal structure network reaches a preset count, at which point training is complete.
In this embodiment of the present invention, the temporal structure of sample videos such as films is learned in an unsupervised manner, where "unsupervised" means that the sample videos are not annotated with any information: the temporal structure network can learn the temporal structure of sample videos such as films, that is, the temporal structure features between different segments of the sample video, without annotations, which reduces the annotation and computation workload and improves the training efficiency of the temporal structure network. Once the temporal structure network has been trained to meet the preset condition, the temporal structure features of a long video can be obtained accurately, and further analysis applications can be built on them, for example predicting the next shot segment of a film from its first several shot segments, retrieving films that match a video description field, or generating the video description of a film from the film itself.
In one specific example of the training method embodiment shown in Fig. 5, operation 412 may include:

obtaining, from the visual feature of the predicted next shot segment and the visual feature of each of the L candidate shot segments, a probability score for each candidate shot segment of being the next shot segment;

choosing the one or more candidate shot segments with the highest probability scores among the L candidate shot segments as the next shot segment adjacent to the P shot segments.

In addition, after the probability score of each of the L candidate shot segments of being the next shot segment is obtained, these scores may optionally be normalized to obtain a normalized probability score for each of the L candidate shot segments of being the adjacent next shot segment. Accordingly, when choosing the next shot segment adjacent to the P shot segments, the one or more candidate shot segments with the highest normalized probability scores among the L normalized probability scores are chosen as the next shot segment adjacent to the P shot segments. The number of candidate shot segments chosen may be determined by a preset condition: for example, by default the candidate shot segment with the highest normalized probability score is chosen as the next shot segment adjacent to the P shot segments, or any number of candidate shot segments whose normalized probability scores exceed a preset fraction are chosen as the next shot segment adjacent to the P shot segments.
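The patent does not fix the normalization; a softmax is assumed in the short sketch below as one natural choice, together with the two selection rules just described (single best candidate, or all candidates above a preset fraction).

```python
import torch

scores = torch.tensor([0.2, 1.7, -0.3, 0.9])        # raw scores for L = 4 candidates
probs = torch.softmax(scores, dim=0)                # normalized probability scores
best = int(probs.argmax())                          # single highest-scoring candidate
above = (probs > 0.25).nonzero(as_tuple=True)[0]    # or: all above a preset fraction
```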
Specifically, in the above embodiments, obtaining a probability score for each of the L candidate shot segments of being the next shot segment from the visual feature of the predicted next shot segment and the visual feature of each of the L candidate shot segments may include:

replicating the visual feature of the next shot segment obtained in operation 410 to obtain L copies, and splicing these L copies with the visual features of the L candidate shot segments according to a preset format to obtain a feature matrix;

using a convolutional neural network to obtain, from this feature matrix, a probability score for each of the L candidate shot segments of being the adjacent next shot segment.

For example, the convolutional neural network may compute, for each row of the feature matrix, the similarity between the visual feature of the candidate shot segment and the visual feature of the predicted next shot segment, and derive from these similarities the probability scores of the L candidate shot segments of being the adjacent next shot segment: the more similar a candidate shot segment's visual feature is to the predicted next shot segment's visual feature, the higher its probability score.
Fig. 6 is a schematic diagram of training the recurrent neural network in an embodiment of the present invention. In this embodiment, the temporal structure network is a recurrent neural network, which can capture and memorize temporal features and is well suited to processing sequential data (such as video and speech). As shown in Fig. 6, for a segment of given continuous video (i.e., the given video of the embodiments of the present invention), the visual feature network extracts features from the given video and from the candidate shot segments to obtain their visual features. The recurrent neural network then outputs the visual feature of the next shot segment adjacent to the given video, which is merged with the candidate shot segments (for example, by vector concatenation) and passed through a convolutional neural network; the convolutional neural network outputs each candidate shot segment's probability score of being the correct next shot segment adjacent to the given video, and the candidate shot segment with the highest probability score is the predicted next shot segment adjacent to the given video.
The embodiment of the present invention is illustrated below with a specific example. Suppose the first n shot segments are given (i.e., the given video comprises n consecutive shot segments) and the (n+1)-th shot segment is to be predicted. The flow is as follows:

first, for each of the n given shot segments, 3 frame images are selected and input into the trained visual feature network (visual model) to obtain the visual features of the 3 frame images selected from each of the n shot segments, the visual feature of each frame image being assumed to be a 1024-dimensional feature vector;

the visual features of the frame images of the given video are input into the recurrent neural network (e.g., an LSTM), which outputs the visual feature of the next shot segment adjacent to the n given shot segments (i.e., the (n+1)-th shot segment), assumed to be a 256-dimensional feature vector;

suppose there are 32 candidate shot segments in total, one of which is the correct next shot segment and the remaining 31 of which are incorrect next shot segments; the 32 candidate shot segments are input into the trained visual feature network (visual model), which produces one visual feature per candidate shot segment, assumed to be a 1024-dimensional feature vector, giving a 32 × 1024 first matrix;

the 256-dimensional feature vector is replicated 32 times to obtain a 32 × 256 second matrix, which is spliced with the 32 × 1024 first matrix according to a preset format, for example with the second matrix first and the first matrix after it, to obtain a 32 × 1280 feature matrix;

the feature matrix is input into a three-layer convolutional neural network that reduces its dimensionality step by step, for example from 32 × 1280 to 32 × 256, then 32 × 64, then a 32 × 1 feature vector, in which each number represents the probability score of the corresponding candidate shot segment of being the next shot segment; the candidate with the highest probability score is the prediction result. The convolutional neural network may instead have one layer or another number of layers: a one-layer convolutional neural network reduces the 32 × 1280 feature matrix directly to a 32 × 1 feature vector, while a network with another number of layers reduces the 32 × 1280 feature matrix to a 32 × 1 feature vector step by step;
finally, the recurrent neural network is trained according to whether this prediction is correct: its network parameters are adjusted and the above flow is re-executed until the preset condition is met, at which point training of the recurrent neural network is complete.
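The following PyTorch sketch reproduces this worked example with its stated dimensions. The patent describes the scorer as a three-layer convolutional network; per-row linear layers (equivalent to 1 × 1 convolutions along the candidate axis) are used here as one way to realize the 32 × 1280 → 32 × 256 → 32 × 64 → 32 × 1 reduction it specifies, and n = 6 and the random tensors are illustrative only.

```python
import torch
import torch.nn as nn

n, L = 6, 32
lstm = nn.LSTM(input_size=1024, hidden_size=256, batch_first=True)
scorer = nn.Sequential(                    # 1280 -> 256 -> 64 -> 1, applied per row
    nn.Linear(1280, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

shot_feats = torch.randn(1, n, 1024)       # visual features of the n given shots
_, (h_n, _) = lstm(shot_feats)
pred = h_n[-1]                             # (1, 256) predicted next-shot feature

cand_feats = torch.randn(L, 1024)          # features of the 32 candidate shots
# replicate the prediction 32 times and splice: second matrix first, first matrix after
feat_matrix = torch.cat([pred.expand(L, -1), cand_feats], dim=1)  # (32, 1280)
scores = scorer(feat_matrix).squeeze(1)    # (32,) probability scores
predicted_next = int(scores.argmax())      # highest score = predicted next shot
```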
Fig. 7 is a flow chart of one embodiment of the video analysis method of the present invention. As shown in Fig. 7, the video analysis method of this embodiment includes:

502: at least one video clip is selected from a video.
Here each video clip may comprise one or more shot segments.

504: using the visual feature network, the visual feature of each of the at least one video clip is obtained.

506: the video is analyzed according to the visual features from the at least one video clip.
In the video analysis method provided by the above embodiment of the present invention, video clips are selected from a video, the visual feature network extracts the visual feature of each video clip, and the long video is analyzed according to these visual features. The embodiment of the present invention thus analyzes a video on the basis of its clips, without analyzing the entire video, which reduces the amount of computation of video analysis, saves computing and storage resources, and improves analysis efficiency, making the rapid analysis of long videos such as films possible.
Fig. 8 is a flow chart of another embodiment of the video analysis method of the present invention. As shown in Fig. 8, the video analysis method of this embodiment includes:

602: X shot segments are selected from a video.

Illustratively, the X shot segments may be selected from the video based on the feature similarity between adjacent frame images in the video together with a preset condition. For example, in a practical application, the features of adjacent frame images are compared to compute a feature similarity score between adjacent frame images, while a preset condition considered over the whole video is applied, for example that the number of frame pictures in each shot segment is no fewer than a first preset value and no more than a second preset value; the shot segments in the video are thus identified and distinguished, and X shot segments are then selected from the video.
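A sketch of such shot segmentation by adjacent-frame similarity under length constraints is shown below. The cosine similarity, the threshold, and the per-frame features (here random placeholders) are illustrative assumptions; the patent leaves the feature and threshold unspecified.

```python
import numpy as np

def split_shots(frame_feats, sim_thresh=0.9, min_len=8, max_len=300):
    """frame_feats: (T, D) array of per-frame features; returns [(start, end), ...]."""
    shots, start = [], 0
    for t in range(1, len(frame_feats)):
        a, b = frame_feats[t - 1], frame_feats[t]
        sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        length = t - start
        # cut when similarity drops, subject to the min/max shot-length constraints
        if (sim < sim_thresh and length >= min_len) or length >= max_len:
            shots.append((start, t))
            start = t
    shots.append((start, len(frame_feats)))
    return shots

boundaries = split_shots(np.random.rand(1000, 64))
```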
In a specific example, the X shot segments may be selected from the video at random, or in a predetermined manner, for example selecting one shot segment every one or more shot segments.
604: for each of the X shot segments, Y frame images are selected from that shot segment.

In a specific example, the Y frame images may be selected from the shot segment at random, or in a predetermined manner, for example selecting one frame image every one or more frame images.

Here X and Y are each integers greater than 0. In one specific example of the embodiment of the present invention, X and Y are each integers greater than or equal to 2, for example X is 8 and Y is 3. The larger the values of X and Y, the richer the visual features obtained and the more accurate the video prediction label of the video, but the more computation is required and the more computing resources are consumed. The inventors found through research that with X equal to 8 and Y equal to 3, the visual features of the video can be obtained effectively while consuming relatively few computing resources. The values of X and Y in the embodiment of the present invention are, however, not limited to these; other values of X and Y work equally well.
Selecting a plurality of shot segments from the video and a plurality of frame images from each shot segment gives a relatively comprehensive view of the entire video, which helps to understand the visual information and temporal structure information of the entire video more comprehensively, effectively, and accurately.

606: using the visual feature network, for each of the X shot segments, the visual features of the Y frame images are extracted from that shot segment.

608: the video is analyzed according to the visual features of the Y frame images selected from each of the X shot segments.

Illustratively, operation 608 may be implemented by a convolutional neural network.
In one specific example, operation 608 may be implemented as follows:

for each of the X shot segments, determining the type labels of its Y frame images from the extracted visual features of those Y frame images, and determining the type label of the shot segment from the type labels of its Y frame images, for example by average pooling or max pooling;

determining the label type of the video from the type labels of the X shot segments as the video prediction label of the video, for example by merging the type labels of the X shot segments through average pooling or max pooling.

A label type in the embodiment of the present invention is the type of a frame image, shot segment, or video, for example horror, comedy, musical, crime, etc.
In the video analysis method provided by the above embodiment of the present invention, at least one shot segment is selected from a video and at least one frame image is selected from each shot segment; the visual feature network extracts the visual feature of each frame image, and the video prediction label of the video is obtained from these visual features, thereby analyzing the video. The embodiment of the present invention analyzes the video by sparse sampling in units of shot segments, without analyzing the entire video, which reduces the amount of computation of video analysis, saves computing and storage resources, and improves analysis efficiency, making the rapid analysis of long videos such as films possible and enabling higher-level semantic tasks, for example the search of film clips.

Further embodiments based on the above video analysis methods may additionally include, after the video prediction label of the video is obtained, annotating the video with the video prediction label. Based on this embodiment, the analysis and classification of videos can be realized.
Fig. 9 is a flow chart of an application embodiment of the video analysis method of the present invention. As shown in Fig. 9, this embodiment includes:

702: in response to receiving a search request, X consecutive shot segments are selected from a video in turn.

Here the search request includes a video description field.

Illustratively, the X shot segments may be selected from the video based on the feature similarity between adjacent frame images in the video together with a preset condition.
704: Y frame images are selected from each of the X shot segments.

Here X and Y are each integers greater than 2. In one specific example of the embodiment of the present invention, X is 8 and Y is 3.

706: using the visual feature network, for each of the X shot segments, the visual features of the Y frame images are extracted from that shot segment.

708: the type label of each of the X shot segments is obtained from the visual features of the Y frame images selected from each of the X shot segments.

Illustratively, operation 708 may be implemented by a convolutional neural network.

710: the shot segments among the X shot segments whose type labels match the video description field are output.
In one specific example, operation 710 includes:

in response to obtaining the type label of any one of the X shot segments, comparing whether that shot segment's type label matches the video description field;

outputting the shot segments among the X shot segments whose type labels match the video description field.

Based on this example, shot segments whose type labels match the video description field can be output in real time: as soon as one shot segment that meets the video description field in the search request is obtained, that shot segment is output, without waiting for all the shot segments selected from the entire video to be analyzed before outputting the shot segments that meet the video description field in the search request.
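A sketch of this real-time variant follows: as soon as one shot's type label is obtained, it is compared against the query's video description field and emitted on a match, without waiting for the remaining shots. The function classify_shot and exact-string matching are illustrative stubs.

```python
def classify_shot(shot):
    return "comedy"                        # stand-in for the classifier output

def matching_shots(shots, description_field):
    for shot in shots:
        if classify_shot(shot) == description_field:
            yield shot                     # output immediately, one shot at a time

for clip in matching_shots(range(8), "comedy"):
    print("matched shot:", clip)
```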
In another specific example, operation 710 includes:

in response to obtaining the type labels of all the shot segments in the video, comparing whether the type label of each of those shot segments matches the video description field;

outputting the shot segments among all the shot segments whose type labels match the video description field.

Based on this example, the shot segments that meet the video description field in the search request are output together after all the shot segments selected from the entire video have been analyzed.
Based on above-described embodiment, the search to associated clip in video is realized, can be retrieved from video and video The consistent camera lens segment of description field.
Fig. 10 is a flow chart of another application embodiment of the video analysis method of the present invention. As shown in Fig. 10, the video analysis method of this embodiment includes:

802: X shot segments are selected from a video.

Illustratively, the X shot segments may be selected from the video based on the feature similarity between adjacent frame images in the video together with a preset condition.

804: for each of the X shot segments, Y frame images are selected from that shot segment.

In a specific example, the Y frame images may be selected from the shot segment at random, or in a predetermined manner, for example selecting one frame image every one or more frame images.

Here X and Y are each integers greater than 2; in one specific example of the embodiment of the present invention, X is 8 and Y is 3. Selecting a plurality of shot segments from the video and a plurality of frame images from each shot segment gives a relatively comprehensive view of the entire video, which helps to understand the visual information and temporal structure information of the entire video more comprehensively, effectively, and accurately.
806: using the visual feature network, for each of the X shot segments, the visual features of the Y frame images are extracted from that shot segment.

808: the type labels of the X shot segments are obtained from the visual features of the Y frame images selected from each of the X shot segments.

Illustratively, operation 808 may be implemented by a convolutional neural network.

810: using the recurrent neural network, the temporal structure features of the X shot segments are learned.

812: using the recurrent neural network, the video description of the video is generated from the visual features and temporal structure features of the X shot segments.

Based on this embodiment, a video description can be generated directly for a video, so that a user can learn the relevant information of the video.
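A heavily simplified sketch of such description generation follows: the recurrent state summarizing the shots conditions a small decoder that emits description tokens greedily. The toy vocabulary, the decoder design, and the greedy decoding are all assumptions; the patent does not specify the description generator.

```python
import torch
import torch.nn as nn

vocab = ["<eos>", "a", "tense", "chase", "comedy", "scene"]
encoder = nn.LSTM(input_size=1024, hidden_size=256, batch_first=True)
decoder = nn.LSTMCell(input_size=256, hidden_size=256)
embed = nn.Embedding(len(vocab), 256)
to_vocab = nn.Linear(256, len(vocab))

shot_feats = torch.randn(1, 8, 1024)           # features of the video's shots
_, (h, c) = encoder(shot_feats)                # temporal structure summary
h, c = h[-1], c[-1]                            # (1, 256) decoder initial state

token, words = torch.tensor([0]), []
for _ in range(10):                            # greedy decoding, at most 10 words
    h, c = decoder(embed(token), (h, c))
    token = to_vocab(h).argmax(dim=1)
    if int(token) == 0:                        # <eos> ends the description
        break
    words.append(vocab[int(token)])
description = " ".join(words)
```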
Fig. 11 is a flow chart of a further application embodiment of the video analysis method of the present invention. As shown in Fig. 11, the video analysis method of this embodiment includes:

902: in response to receiving a search request, a video is selected.

For example, videos are selected one by one from a video repository. The search request includes a video description field.

904: X consecutive shot segments are selected from the video in turn. Illustratively, the X shot segments may be selected from the video based on the feature similarity between adjacent frame images in the video together with a preset condition.

906: for each of the X shot segments, Y frame images are selected from that shot segment.

In a specific example, the Y frame images may be selected from the shot segment at random, or in a predetermined manner, for example selecting one frame image every one or more frame images.

Here X and Y are each integers greater than 2; in one specific example of the embodiment of the present invention, X is 9 and Y is 3. Selecting a plurality of shot segments from the video and a plurality of frame images from each shot segment gives a relatively comprehensive view of the entire video, which helps to understand the visual information and temporal structure information of the entire video more comprehensively, effectively, and accurately.

908: using the visual feature network, for each of the X shot segments, the visual features of the Y frame images are extracted from that shot segment.
910: the type label of each of the X shot segments is obtained from the visual features of the Y frame images selected from each of the X shot segments.

Illustratively, operation 910 may be implemented by a convolutional neural network.

Operation 904 is then executed again until the type labels of all shot segments in the video have been obtained, after which operation 912 is executed.

912: using the recurrent neural network, the temporal structure features of all the shot segments in the video are obtained.

914: using the recurrent neural network, the video description of the video is generated from the type labels and temporal structure features of all the shot segments in the video.

916: the video description of the video is compared with the video description field in the search request to determine whether they match.

918: in response to the video description of the video matching the video description field in the search request, the video is output. Otherwise, if the video description of the video does not match the video description field in the search request, the output operation is not performed.

Based on this embodiment, retrieval of videos based on a textual description (i.e., the video description field) is realized, so that a user can retrieve videos with the corresponding content from a video repository (for example, movie websites).
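The retrieval loop over a repository can be sketched as below. The function generate_description is a stub standing in for the full pipeline of operations 904-914, and substring containment is assumed as the matching rule; the patent does not define the match criterion.

```python
def generate_description(video):
    return video["desc"]                   # placeholder for the real pipeline

def search(repository, description_field):
    results = []
    for video in repository:
        if description_field in generate_description(video):
            results.append(video["name"])  # matched: output the video
    return results

repo = [{"name": "film_a", "desc": "a tense chase scene"},
        {"name": "film_b", "desc": "a quiet drama"}]
print(search(repo, "chase"))               # -> ['film_a']
```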
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium; when the program is executed, the steps of the above method embodiments are performed. The aforementioned storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
Fig. 12 is a structural diagram of one embodiment of the training apparatus for the video analysis network of the present invention. The training apparatus of the embodiments of the present invention can be used to implement the training methods of the above embodiments of the present invention. As shown in Fig. 12, in this embodiment the video analysis network includes a visual feature network 10, and the training apparatus of the video analysis network includes a classification module 1002 and a first training module 1004, wherein:

the visual feature network 10 is configured to obtain, for any sample segment among at least one sample segment corresponding to at least one sample video, the visual feature of that sample segment, the sample video being annotated with a video type label.

In one specific example of the embodiment of the present invention, the sample segment is the trailer of the video corresponding to the sample video, or a video clip cut from the sample video. Within a video, video trailer, or video clip, whenever a screen switch occurs, one shot segment switches to another shot segment.

The classification module 1002 is configured to obtain the video prediction label of the sample segment from the visual feature obtained by the visual feature network 10.

The first training module 1004 is configured to train the visual feature network 10 according to the video prediction label of the sample segment and the video type label of the sample video corresponding to that sample segment.
The training apparatus for the video analysis network provided by the above embodiment of the present invention proposes a technical solution in which the visual feature network is trained with a sample segment corresponding to a sample video (such as the trailer of the sample video): a visual feature is extracted from a sample segment selected from the sample video, the video prediction label of the sample segment is obtained from that visual feature, and the visual feature network is then trained according to the video prediction label of the sample segment and the video type label of the sample video. Because the sample segment is selected from the sample video and contains pictures of that video, training the visual feature network on a segment of the sample video rather than on the sample video itself greatly reduces the amount of computation, saves computing and storage resources, and improves training efficiency, while the trained visual feature network can be used effectively to analyze the sample video.
Fig. 13 is a structural diagram of another embodiment of the training apparatus for the video analysis network of the present invention. As shown in Fig. 13, compared with the embodiment shown in Fig. 12, the training apparatus of this embodiment further includes:

a first selection module 1006, configured to select, for any sample segment among the at least one sample segment corresponding to the at least one sample video, M shot segments from that sample segment, the sample segment comprising at least one shot segment of the sample video and each shot segment comprising at least one frame image;

a second selection module 1008, configured to select N frame images from each of the M shot segments, where M and N are each integers greater than 0.

Correspondingly, in this embodiment the visual feature network 10 is specifically configured to extract, for each of the M shot segments, the visual features of the N frame images selected from that shot segment, and the classification module 1002 is specifically configured to obtain the video prediction label of the sample segment from the visual features of the N frame images selected from each of the M shot segments.
In one specific example, the classification module 1002 is specifically configured to:

for each shot segment, determine the type labels of the selected N frame images from their visual features;

determine the type label of each shot segment from the type labels of the N frame images selected from it;

determine the label type of the sample segment from the type labels of the M shot segments as the video prediction label.
The training apparatus for the video analysis network provided by the above embodiment of the present invention realizes, in apparatus form, the technical solution and the advantages of the training method described above: training on sample segments clipped from the sample video, by sparse sampling in units of shot segments, greatly reduces the amount of computation, saves computing and storage resources, and improves training efficiency, while the trained visual feature network can be used effectively to analyze the sample video. Moreover, because the training is weakly supervised, using only the video type label of the sample video without per-frame or per-segment type labels, the annotation and computation workload is reduced and the training efficiency of the visual feature network is improved.
In addition, in another embodiment of the training apparatus for the video analysis network of the present invention, the visual feature network 10 may also be configured, after its training meets the preset condition, to perform feature extraction on a given video and obtain the visual features of a plurality of shot segments, the given video comprising a plurality of consecutive shot segments. Referring back to Fig. 13, in this embodiment the video analysis network further includes a temporal structure network 20, configured to learn the temporal structure features of the given video from the visual features and temporal relations of the consecutive shot segments. In a specific example of the training apparatus embodiments of the present invention, the temporal structure network 20 may be implemented by a recurrent neural network.

Suppose the given video comprises P shot segments, where P is an integer greater than 0. In one specific example, the second selection module 1008 may also be configured to select Q frame images from each shot segment in the given video, where Q is an integer greater than 0. When performing feature extraction on the given video to obtain the visual features of the plurality of shot segments, the visual feature network 10 is specifically configured to: extract, for each shot segment, the visual features of its Q frame images as the visual feature of that shot segment; and input the visual features of the P shot segments into the recurrent neural network model one by one, in the order of the P shot segments in the given video.
In the further embodiment of training device, second chooses module 1008, it may also be used for respectively from above-mentioned L option Q frame images are chosen in each option camera lens segment in camera lens segment, wherein, L option camera lens segment includes at least one correct Next camera lens segment, L is integer more than 1.Correspondingly, visual signature network 10, it may also be used for respectively for L option Each option camera lens segment in camera lens segment extracts visual signature of the visual signature of Q frame images as each option camera lens segment. Sequential organization network 20, is specifically used for:According to the visual signature and sequential relationship of P camera lens segment, P camera lens segment is generated Sequential organization feature;And the adjacent next camera lens of P camera lens segment is obtained according to the sequential organization feature of P camera lens segment The visual signature of segment.Referring back to Figure 13, the training device of the embodiment can also include:Convolutional neural networks 1012, are used for According to the visual signature of above-mentioned next camera lens segment and the visual signature of L option camera lens segment, from the L option lens The adjacent next camera lens segment of P camera lens segment is chosen in section.
In a wherein optional example, convolutional neural networks 1012 are specifically used for:According to regarding for next camera lens segment Feel the visual signature of each option camera lens segment in feature and L option camera lens segment, obtain each option camera lens segment conduct respectively The probability score of next camera lens segment;And choose the highest at least one option of probability score in L option camera lens segment The camera lens segment next camera lens segment adjacent as P camera lens segment.
Further, referring back to Figure 13, the training device of the embodiment of the present invention can also include:Module 1014 is normalized, For being obtained to each option camera lens segment in above-mentioned L option camera lens segment respectively as the probability of adjacent next camera lens segment Divide and be normalized, obtain L normalization probability score.Correspondingly, convolutional neural networks 1012 choose above-mentioned L option The highest at least one option camera lens segment of the probability score next lens adjacent as P camera lens segment in camera lens segment Duan Shi, specifically for from above-mentioned L normalization probability score, choosing the highest at least one normalization probability score pair of numerical value The option camera lens segment the answered next camera lens segment adjacent as P camera lens segment.Second training module 1010, is specifically based on The forecasting accuracy for next camera lens segment that convolutional neural networks 1012 are chosen, is trained sequential organization network 20.
Fig. 14 is a structural diagram of one embodiment of the video analysis apparatus of the present invention. The video analysis apparatus of the embodiments of the present invention can be used to implement the video analysis methods of the above embodiments of the present invention. As shown in Fig. 14, in this embodiment the video analysis apparatus includes:

a first selection module 1006, configured to select at least one video clip from a video;

a visual feature network 10, configured to obtain the visual feature of each of the at least one video clip;

a classification module 1002, configured to analyze the video according to the visual features from the at least one video clip.
The video analysis apparatus provided by the above embodiment of the present invention selects video clips from a video, extracts their visual features with the visual feature network, and analyzes the long video according to those visual features, achieving the same advantages as the video analysis method described above: the video is analyzed on the basis of its clips rather than in its entirety, which reduces the amount of computation of video analysis, saves computing and storage resources, and improves analysis efficiency, making the rapid analysis of long videos such as films possible.
Fig. 15 is a structural diagram of another embodiment of the video analysis apparatus for long videos of the present invention. Compared with the embodiment shown in Fig. 14, in this embodiment the first selection module 1006 is specifically configured to select X shot segments from the video, the video clips comprising the X shot segments. In one optional example, the first selection module 1006 is specifically configured to select the X shot segments from the video based on the feature similarity between adjacent frame images in the video and a preset condition. As shown in Fig. 15, the video analysis apparatus of this embodiment further includes a second selection module 1008, configured to select, for each of the X shot segments, Y frame images from that shot segment, where X and Y are each integers greater than 0. Correspondingly, in this embodiment the visual feature network 10 is specifically configured to extract, for each of the X shot segments, the visual features of the Y frame images from that shot segment, and the classification module 1002 is specifically configured to analyze the video according to the visual features of the Y frame images selected from each of the X shot segments.

In one optional example, the classification module 1002 is specifically configured to:

for each of the selected X shot segments, determine the type labels of the selected Y frame images from their visual features, and determine the type label of the shot segment from the type labels of the Y frame images selected from it;

determine the label type of the video from the type labels of the X shot segments as the video prediction label of the video.
The video analysis apparatus provided by the above embodiment of the present invention thus analyzes the video by sparse sampling in units of shot segments, with the same advantages as the corresponding method embodiment: reduced computation, savings in computing and storage resources, improved analysis efficiency, rapid analysis of long videos such as films, and applicability to higher-level semantic tasks, for example the search of film clips.
Fig. 16 is a structural diagram of another embodiment of the video analysis apparatus of the present invention. As shown in Fig. 16, compared with the embodiment shown in Fig. 14 or Fig. 15, the video analysis apparatus of this embodiment further includes a labeling module 1102, configured to annotate the video with the video prediction label output by the classification module 1002.

Fig. 17 is a structural diagram of a further embodiment of the video analysis apparatus for long videos of the present invention. As shown in Fig. 17, compared with the embodiment shown in Fig. 14 or Fig. 15, the video analysis apparatus of this embodiment further includes a receiving module 1104 and an output module 1106, wherein:
the receiving module 1104 is configured to receive a search request, the search request including a video description field;

the first selection module 1006 is specifically configured to select X consecutive shot segments from the video in turn in response to the receiving module 1104 receiving the search request, and the classification module 1002 is specifically configured to obtain the type label of each of the X shot segments from the visual features of the Y frame images selected from each of the X shot segments;

the output module 1106 is configured to output the shot segments among the X shot segments whose type labels match the video description field.
Fig. 18 is a structural diagram of a still further embodiment of the video analysis apparatus for long videos of the present invention. As shown in Fig. 18, compared with the embodiment shown in Fig. 14 or Fig. 15, the video analysis apparatus of this embodiment further includes a temporal structure network 20 and a generation module 1108. In one optional example, the temporal structure network 20 may be a recurrent neural network.

In one embodiment shown in Fig. 18, the classification module 1002 is specifically configured to obtain the type label of each of the X shot segments from the visual features of the Y frame images selected from each of the X shot segments; the temporal structure network 20 is configured to obtain the temporal structure features of the X shot segments; and the generation module 1108 is configured to generate the video description of the video from the type labels and temporal structure features of the X shot segments.

In another embodiment shown in Fig. 18, the first selection module 1006 is specifically configured to select X consecutive shot segments from the video in turn. Correspondingly, the classification module 1002 is specifically configured to obtain the type label of each of the X shot segments from the visual features of the Y frame images selected from each of the X shot segments; the temporal structure network 20 is configured to obtain the temporal structure features of all the shot segments in the video; and the generation module 1108 is configured to generate the video description of the video from the type labels and temporal structure features of all the shot segments in the video.
An embodiment of the present invention also provides an electronic device, which may include the training apparatus for the video analysis network or the video analysis apparatus of any of the above embodiments of the present invention.

In addition, an embodiment of the present invention provides another electronic device, including:

a memory for storing executable instructions; and

a processor for communicating with the memory to execute the executable instructions, thereby completing the operations of the training method for the video analysis network or the video analysis method of any of the above embodiments of the present invention.

The electronic devices of the above embodiments of the present invention may be, for example, mobile terminals, personal computers (PCs), tablet computers, servers, and so on.

An embodiment of the present invention also provides a computer storage medium for storing computer-readable instructions which, when executed, perform the operations of the training method for the video analysis network or the video analysis method of any of the above embodiments of the present invention.

An embodiment of the present invention also provides a computer program including computer-readable instructions which, when run in a device, cause a processor in the device to execute executable instructions for implementing the steps of the training method for the video analysis network or the video analysis method of any of the above embodiments of the present invention.
Fig. 19 is a structural diagram of an application embodiment of the electronic device of the present invention. Referring to Fig. 19, it shows a structural diagram of an electronic device suitable for implementing a terminal device or a server of the embodiments of the present application. As shown in Fig. 19, the electronic device includes one or more processors, a communication unit, and the like. The one or more processors are, for example, one or more central processing units (CPUs) 1201 and/or one or more graphics processors (GPUs) 1213; the processor may perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) 1202 or loaded from a storage section 1208 into a random access memory (RAM) 1203. The communication unit 1212 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card. The processor may communicate with the read-only memory 1202 and/or the random access memory 1203 to execute the executable instructions, connect with the communication unit 1212 through a bus 1204, and communicate with other target devices through the communication unit 1212, thereby completing the operations corresponding to any training method for the video analysis network or video analysis method provided by the embodiments of the present application, for example: for any sample segment among at least one sample segment corresponding to at least one sample video, obtaining the visual feature of that sample segment using the visual feature network, the sample video being annotated with a video type label; obtaining the video prediction label of the sample segment from the visual feature; and training the visual feature network according to the video prediction label of the sample segment and the video type label. Or, as another example: selecting at least one video clip from a video; obtaining the visual feature of each of the at least one video clip using the visual feature network; and analyzing the video according to the visual features from the at least one video clip.
In addition, in RAM 1203, it can also be stored with various programs and data needed for device operation.CPU1201、 ROM1202 and RAM1203 is connected with each other by bus 1204.In the case where there is RAM1203, ROM1202 is optional module. RAM1203 stores executable instruction or executable instruction is written into ROM1202 at runtime, and executable instruction makes processor 1201 perform the training method of above-mentioned video analysis network or the corresponding operation of video analysis method.Input/output (I/O) interface 1205 are also connected to bus 1204.Communication unit 1212 can be integrally disposed, may be set to be (such as more with multiple submodule A IB network interface cards), and in bus link.
The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; the storage section 1208 including a hard disk and the like; and a communication section 1209 including a network card such as a LAN card or a modem. The communication section 1209 performs communication processing via a network such as the Internet. A drive is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive as needed, so that a computer program read therefrom is installed into the storage section 1208 as needed.
It should be noted that the architecture shown in Figure 19 is only one optional implementation. In specific practice, the number and types of the components in Figure 19 may be selected, deleted, added, or replaced according to actual needs. Different functional components may also be provided separately or integrally; for example, the GPU and the CPU may be provided separately, or the GPU may be integrated on the CPU, and the communication unit may be provided separately or integrated on the CPU or the GPU. These alternative embodiments all fall within the protection scope of the present disclosure.
In particular, according to an embodiment of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for executing the method shown in the flowchart. The program code may include instructions corresponding to the steps of the training method for a video analysis network or the video analysis method provided by the embodiments of the present application, for example: instructions for obtaining, for any sample segment of at least one sample segment corresponding to at least one sample video, a visual feature of the sample segment using a visual feature network, the sample video being annotated with a video type label; instructions for obtaining a video prediction label of the sample segment according to the visual feature; and instructions for training the visual feature network according to the video prediction label of the sample segment and the video type label. Or, as another example: instructions for selecting at least one video clip from a video; instructions for respectively obtaining visual features of the at least one video clip using a visual feature network; and instructions for analyzing the video according to the visual features from the at least one video clip. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1209, and/or installed from the removable medium 1211. When the computer program is executed by the central processing unit (CPU) 1201, the above functions defined in the method of the present application are performed.
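The analysis operation admits a similarly minimal sketch. It reuses the feature_net and classifier defined in the training sketch above, and the uniform clip sampling below is merely one plausible way of selecting at least one video clip from a video, not a method mandated by this application:

```python
# Illustrative sketch only, reusing feature_net and classifier from the
# training sketch above. Clip count and length are arbitrary assumptions.
import torch

def select_clips(video_frames, num_clips=4, clip_len=8):
    """Uniformly sample num_clips clips of clip_len frames from the video."""
    total = video_frames.shape[0]
    starts = torch.linspace(0, max(total - clip_len, 0), num_clips).long()
    return [video_frames[s:s + clip_len] for s in starts]

video = torch.randn(200, 3, 64, 64)   # a long video of 200 frames
with torch.no_grad():
    clip_feats = [feature_net(c).mean(dim=0) for c in select_clips(video)]
    video_feat = torch.stack(clip_feats).mean(dim=0, keepdim=True)
    predicted_type = classifier(video_feat).argmax(dim=1)  # analyze the video
```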
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may refer to one another. Since the apparatus embodiments substantially correspond to the method embodiments, their description is relatively simple, and reference may be made to the corresponding parts of the method embodiments where relevant.
The methods and apparatuses of the present invention may be implemented in many ways. For example, the methods and apparatuses of the present invention may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the steps of the method is for illustration only, and the steps of the method of the present invention are not limited to the order specifically described above, unless otherwise specifically stated. In addition, in some embodiments, the present invention may also be embodied as programs recorded in a recording medium, these programs comprising machine-readable instructions for implementing the method according to the present invention. Thus, the present invention also covers a recording medium storing a program for executing the method according to the present invention.
The description of the present invention is provided for the purposes of illustration and description, and is not intended to be exhaustive or to limit the present invention to the disclosed form. Many modifications and variations will be obvious to those of ordinary skill in the art. The embodiments were selected and described to better illustrate the principles and practical applications of the present invention, and to enable those of ordinary skill in the art to understand the present invention so as to design various embodiments with various modifications suited to particular uses.

Claims (10)

1. A training method for a video analysis network, wherein the video analysis network includes a visual feature network, and the method comprises:
for any sample segment of at least one sample segment corresponding to at least one sample video, obtaining a visual feature of the sample segment using the visual feature network, wherein the sample video is annotated with a video type label;
obtaining a video prediction label of the sample segment according to the visual feature; and
training the visual feature network according to the video prediction label of the sample segment and the video type label.
2. The method according to claim 1, wherein before obtaining the visual feature of the sample segment using the visual feature network, the method further comprises:
selecting M shot segments from the sample segment, and selecting N frames of images from each of the M shot segments, wherein M and N are each integers greater than 0, the sample segment includes at least one shot segment of the sample video, and each shot segment includes at least one frame of image;
and wherein obtaining the visual feature of the sample segment using the visual feature network comprises: for each of the M shot segments, respectively extracting visual features of the N frames of images in the shot segment using the visual feature network.
3. The method according to claim 2, wherein obtaining the video prediction label of the sample segment according to the visual feature comprises:
obtaining the video prediction label of the sample segment according to the visual features of the N frames of images respectively selected from the M shot segments.
4. A video analysis method, comprising:
selecting at least one video clip from a video;
respectively obtaining visual features of the at least one video clip using a visual feature network; and
analyzing the video according to the visual features from the at least one video clip.
5. A training apparatus for a video analysis network, wherein the video analysis network includes a visual feature network, and the apparatus includes a classification module and a first training module, wherein:
the visual feature network is configured to obtain, for any sample segment of at least one sample segment corresponding to at least one sample video, a visual feature of the sample segment, the sample video being annotated with a video type label;
the classification module is configured to obtain a video prediction label of the sample segment according to the visual feature; and
the first training module is configured to train the visual feature network according to the video prediction label of the sample segment and the video type label.
6. A video analysis apparatus, comprising:
a first selection module configured to select at least one video clip from a video;
a visual feature network configured to respectively obtain visual features of the at least one video clip; and
a classification module configured to analyze the video according to the visual features from the at least one video clip.
7. An electronic device, comprising: the training apparatus for a video analysis network according to claim 5; or the video analysis apparatus according to claim 6.
8. An electronic device, comprising:
a memory for storing executable instructions; and
a processor for communicating with the memory to execute the executable instructions so as to complete the operations of the method according to any one of claims 1 to 3 or the method according to claim 4.
9. A computer storage medium for storing computer-readable instructions, wherein the instructions, when executed, implement the operations of the method according to any one of claims 1 to 3 or the method according to claim 4.
10. A computer program comprising computer-readable instructions, wherein, when the computer-readable instructions are run in a device, a processor in the device executes executable instructions for implementing the steps of the method according to any one of claims 1 to 3 or the method according to claim 4.
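For readers seeking a concrete picture of the sampling recited in claims 2 and 3, the following minimal sketch (illustrative only and not part of the claimed subject matter; shot boundaries are assumed to be given, and all sizes are assumptions) selects M shot segments from a sample segment and N frames of images from each shot segment:

```python
# Illustrative sketch only: choose M shot segments, then N frames per shot;
# per-frame visual features would then be extracted and fused into the
# video prediction label. Shot detection itself is outside this sketch.
import random
import torch

def sample_shots_and_frames(shots, M=3, N=4):
    """shots: list of tensors, each of shape (num_frames, 3, H, W)."""
    chosen = random.sample(shots, min(M, len(shots)))  # M shot segments
    frame_sets = []
    for shot in chosen:
        idx = torch.randint(0, shot.shape[0], (N,))    # N frames per shot
        frame_sets.append(shot[idx])                   # (N, 3, H, W)
    return frame_sets

shots = [torch.randn(30, 3, 64, 64) for _ in range(5)]  # dummy shot segments
per_shot_frames = sample_shots_and_frames(shots)
```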
CN201710530371.4A 2017-06-29 2017-06-29 Training and video analysis method and apparatus, electronic equipment, storage medium, program Pending CN108229527A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710530371.4A CN108229527A (en) 2017-06-29 2017-06-29 Training and video analysis method and apparatus, electronic equipment, storage medium, program


Publications (1)

Publication Number Publication Date
CN108229527A true CN108229527A (en) 2018-06-29

Family

ID=62658105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710530371.4A Pending CN108229527A (en) 2017-06-29 2017-06-29 Training and video analysis method and apparatus, electronic equipment, storage medium, program

Country Status (1)

Country Link
CN (1) CN108229527A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160379055A1 (en) * 2015-06-25 2016-12-29 Kodak Alaris Inc. Graph-based framework for video object segmentation and extraction in feature space
CN106682108A (en) * 2016-12-06 2017-05-17 浙江大学 Video retrieval method based on multi-modal convolutional neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GABRIEL S. SIMOES, JONATAS WEHRMANN, RODRIGO C. BARROS, DUNCAN D: "Movie Genre Classification with Convolutional Neural Networks", International Joint Conference on Neural Networks *
MAKARAND TAPASWI, YUKUN ZHU, RAINER STIEFELHAGEN, ANTONIO TORRAL: "MovieQA: Understanding Stories in Movies through Question-Answering", Conference on Computer Vision and Pattern Recognition *
NITISH SRIVASTAVA, ELMAN MANSIMOV, RUSLAN SALAKHUTDINOV: "Unsupervised Learning of Video Representations using LSTMs", International Conference on Machine Learning *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108961317A (en) * 2018-07-27 2018-12-07 阿依瓦(北京)技术有限公司 A kind of method and system of video depth analysis
WO2020019397A1 (en) * 2018-07-27 2020-01-30 阿依瓦(北京)技术有限公司 Video depth analysis method and system
CN109543528A (en) * 2018-10-19 2019-03-29 北京陌上花科技有限公司 Data processing method and device for video features
CN111382616A (en) * 2018-12-28 2020-07-07 广州市百果园信息技术有限公司 Video classification method and device, storage medium and computer equipment
CN111382616B (en) * 2018-12-28 2023-08-18 广州市百果园信息技术有限公司 Video classification method and device, storage medium and computer equipment
CN109635790A (en) * 2019-01-28 2019-04-16 杭州电子科技大学 A kind of pedestrian's abnormal behaviour recognition methods based on 3D convolution
CN110119757A (en) * 2019-03-28 2019-08-13 北京奇艺世纪科技有限公司 Model training method, video category detection method, device, electronic equipment and computer-readable medium
CN110119757B (en) * 2019-03-28 2021-05-25 北京奇艺世纪科技有限公司 Model training method, video category detection method, device, electronic equipment and computer readable medium
CN110781960A (en) * 2019-10-25 2020-02-11 Oppo广东移动通信有限公司 Training method, classification method, device and equipment of video classification model
CN110781960B (en) * 2019-10-25 2022-06-28 Oppo广东移动通信有限公司 Training method, classification method, device and equipment of video classification model
CN110958489A (en) * 2019-12-11 2020-04-03 腾讯科技(深圳)有限公司 Video processing method, video processing device, electronic equipment and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN108229527A (en) Training and video analysis method and apparatus, electronic equipment, storage medium, program
CN108229478A (en) Image, semantic segmentation and training method and device, electronic equipment, storage medium and program
CN109325148A (en) The method and apparatus for generating information
CN112015859A (en) Text knowledge hierarchy extraction method and device, computer equipment and readable medium
Shuai et al. Toward achieving robust low-level and high-level scene parsing
Malgireddy et al. Language-motivated approaches to action recognition
Sims et al. A neural architecture for detecting user confusion in eye-tracking data
US10963700B2 (en) Character recognition
Jha et al. A brief comparison on machine learning algorithms based on various applications: a comprehensive survey
Wang et al. Spatial–temporal pooling for action recognition in videos
Lebron Casas et al. Video summarization with LSTM and deep attention models
CN109446328A (en) A kind of text recognition method, device and its storage medium
CN111078881B (en) Fine-grained sentiment analysis method and system, electronic equipment and storage medium
Song et al. Temporal action localization in untrimmed videos using action pattern trees
Gui et al. Depression detection on social media with reinforcement learning
CN109271915A (en) False-proof detection method and device, electronic equipment, storage medium
CN108268629A (en) Image Description Methods and device, equipment, medium, program based on keyword
Wang et al. Long video question answering: A matching-guided attention model
Li et al. Event extraction for criminal legal text
Cheng et al. Multi-label few-shot learning for sound event recognition
Chen et al. Integrating Multisubspace Joint Learning With Multilevel Guidance for Cross-Modal Retrieval of Remote Sensing Images
Cerekovic A deep look into group happiness prediction from images
Liu et al. Key time steps selection for CFD data based on deep metric learning
Matzen et al. Bubblenet: Foveated imaging for visual discovery
Si Analysis of calligraphy Chinese character recognition technology based on deep learning and computer-aided technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20180629)