CN106503723A - Video classification method and apparatus - Google Patents
Video classification method and apparatus
- Publication number: CN106503723A
- Application number: CN201510559904.2A
- Authority: CN (China)
- Prior art keywords: information, video, matrix, confidence, objective function
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Abstract
The embodiments of the present invention disclose a video classification method, which addresses the prior-art defect that videos cannot be classified accurately and improves the accuracy of video classification. The method of the embodiments includes: obtaining information from a video, where the information includes image information, optical flow information and acoustic information; generating, with a deep neural network, first reference information corresponding to the image information, second reference information corresponding to the optical flow information, and third reference information corresponding to the acoustic information; processing the video according to the first reference information, the second reference information and the third reference information to obtain a confidence matrix of the video and a class relation matrix of the video; and substituting the confidence matrix of the video and the class relation matrix of the video into an objective function to obtain a target fusion parameter of the video, where the target fusion parameter is used to classify the video.
Description
Technical field
The present invention relates to the field of communications, and in particular to a video classification method and apparatus.
Background technology
Video classification refers to processing and analyzing a video using the visual information, auditory information and motion information it contains, so as to determine and identify the actions and events that occur in the video. Video classification applies to many practical problems, such as intelligent surveillance and video data management.
One existing means of video classification is early feature fusion, that is, fusion at the feature level. As shown in Fig. 1 and Fig. 2, different features, such as image features and audio features, are extracted from the video and concatenated into a combined feature. During training, a support vector machine (SVM) or a neural network is trained on the combined feature to produce a trained video classifier. During classification, the combined feature is extracted from the video and input into the trained classifier to obtain the classification result, as in the sketch below.
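For comparison, the following is a minimal sketch of this prior-art early-fusion baseline (not the method of the embodiments). The extract_image_features and extract_audio_features helpers, as well as train_videos, train_labels and test_video, are hypothetical stand-ins for whatever feature extractors and data are actually used; the classifier is scikit-learn's SVC.

```python
# Minimal sketch of the prior-art early-fusion baseline (not the method of the
# embodiments). extract_image_features / extract_audio_features are hypothetical
# per-video feature extractors; train_videos, train_labels and test_video are
# assumed to be provided.
import numpy as np
from sklearn.svm import SVC

def combined_feature(video):
    image_feat = extract_image_features(video)   # e.g. pooled frame descriptors
    audio_feat = extract_audio_features(video)   # e.g. pooled audio descriptors
    return np.concatenate([image_feat, audio_feat])  # early (feature-level) fusion

# Training: one concatenated feature vector per labelled video.
X_train = np.stack([combined_feature(v) for v in train_videos])
classifier = SVC(kernel="linear").fit(X_train, train_labels)

# Classification: extract the same combined feature and feed it to the classifier.
prediction = classifier.predict(combined_feature(test_video).reshape(1, -1))
```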
The problem with this kind of video classification method is that it assumes the different features of a video are simply complementary, and that the video can be represented by these features. However, a video is not a simple combination of modalities such as images and sound; dependencies may still exist between these modalities. The extracted features therefore cannot fully express the video content, and such a method cannot classify the video accurately.
Summary of the invention
The embodiments of the present invention provide a video classification method and apparatus, which address the prior-art defect that videos cannot be classified accurately and improve the accuracy of video classification.
A first aspect of the present invention provides a video classification method, including:
Obtaining information from a video, where the information includes image information, optical flow information and acoustic information;
Generating, with a deep neural network, first reference information corresponding to the image information, second reference information corresponding to the optical flow information, and third reference information corresponding to the acoustic information;
Processing the video according to the first reference information, the second reference information and the third reference information to obtain a confidence matrix of the video and a class relation matrix of the video;
Substituting the confidence matrix of the video and the class relation matrix of the video into an objective function to obtain a target fusion parameter of the video, where the target fusion parameter is used to classify the video, and where the constraint factors of the objective function include the confidence matrix, the class relation matrix and the fusion parameter; a minimal numerical sketch of this step is given below.
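The exact algebraic form of the objective function is not reproduced in this text, so the following sketch assumes a form consistent with the terms defined in the implementations below: an empirical loss L(S, Y; W) plus a Frobenius-norm term weighted by λ1 and an ℓ1 (sparse) term weighted by λ2, minimized over the fusion parameter W. The fit_fusion_parameter name, the squared-error loss and the use of scipy's L-BFGS-B solver are illustrative choices, not the patent's method.

```python
# Minimal numerical sketch of the last step above, under the assumed objective
#     L(S, Y; W) + lambda1 * ||W||_F^2 + lambda2 * ||W||_1
# The squared-error loss and the L-BFGS-B solver are illustrative only (the l1 term
# is non-smooth, so this is a rough sketch rather than a proper solver).
import numpy as np
from scipy.optimize import minimize

def fit_fusion_parameter(S, Y, lambda1=0.1, lambda2=0.01):
    # S: confidence matrix, shape (n_videos, n_scores); Y: labels, shape (n_videos, n_classes)
    n_scores, n_classes = S.shape[1], Y.shape[1]

    def objective(w_flat):
        W = w_flat.reshape(n_scores, n_classes)
        empirical_loss = np.sum((S @ W - Y) ** 2)       # assumed form of L(S, Y; W)
        frobenius_term = lambda1 * np.sum(W ** 2)        # lambda1 * ||W||_F^2
        sparsity_term = lambda2 * np.sum(np.abs(W))      # lambda2 * ||W||_1
        return empirical_loss + frobenius_term + sparsity_term

    w0 = np.zeros(n_scores * n_classes)
    result = minimize(objective, w0, method="L-BFGS-B")
    return result.x.reshape(n_scores, n_classes)         # target fusion parameter W
```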
With reference to the first aspect, in a first possible implementation, generating, with the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information includes:
Processing the image information with a first convolutional neural network to generate first information, where the first information is the first reference information;
Processing the optical flow information with a second convolutional neural network to generate second information, where the second information is the second reference information;
Processing the acoustic information with a third convolutional neural network to generate third information, where the third information is the third reference information;
Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class relation matrix of the video includes:
Processing the video using the first information, the second information and the third information to obtain the confidence matrix of the video (see the sketch after this implementation);
Processing at least one video using the first information, the second information and the third information to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function to obtain the target fusion parameter of the video includes:
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function, where the objective function minimizes, over W, the sum of the empirical loss L(S, Y; W), a Frobenius-norm regularization term weighted by λ1, and an ℓ1 (sparse) regularization term weighted by λ2; the value of W at which the objective is minimum is solved for and is the target fusion parameter of the video. Here W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video over the processing of the first information, the second information and the third information, S is the confidence matrix of the video, Y is the category of the video, V is the class relation matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖F denotes the Frobenius norm, and ‖·‖1 denotes the ℓ1 norm (the sparse regularization operator).
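As an illustration of how the three streams of this implementation could feed the confidence matrix, the following is a minimal sketch. The image_cnn, flow_cnn and audio_cnn callables are hypothetical per-video score functions, and the constructions of S and of the class relation matrix V shown here are assumptions, not the patent's specified constructions.

```python
# Illustration of assembling a confidence matrix S from three per-modality streams.
# image_cnn / flow_cnn / audio_cnn are hypothetical callables, each returning a
# per-class score vector for one video.
import numpy as np

def confidence_row(video):
    s_image = image_cnn(video.frames)         # first information  (image stream)
    s_flow = flow_cnn(video.optical_flow)     # second information (optical-flow stream)
    s_audio = audio_cnn(video.audio)          # third information  (audio stream)
    return np.concatenate([s_image, s_flow, s_audio])

# Confidence matrix over the videos used for classification.
S = np.stack([confidence_row(v) for v in videos])
# One plausible (assumed) construction of the class relation matrix V:
# correlations between class scores across the videos.
V = np.corrcoef(S, rowvar=False)
```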
With reference to the first aspect, in a second possible implementation, generating, with the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information includes:
Processing the image information with a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information with a first long short-term memory (LSTM) recurrent neural network to generate fourth information, where the first information and the fourth information together constitute the first reference information (a minimal sketch of this CNN-plus-LSTM path is given after this implementation);
Processing the optical flow information with a second convolutional neural network to generate second information, where the second information is the second reference information;
Processing the acoustic information with a third convolutional neural network to generate third information, where the third information is the third reference information;
Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class relation matrix of the video includes:
Processing the video using the first information, the second information, the third information and the fourth information to obtain the confidence matrix of the video;
Processing at least one video using the first information, the second information, the third information and the fourth information to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function to obtain the target fusion parameter of the video includes:
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function defined in the first possible implementation, where L(S, Y; W) is here the empirical loss of the video over the processing of the first information, the second information, the third information and the fourth information, and the value of W that minimizes the objective is the target fusion parameter of the video.
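The following is a minimal sketch of the CNN-plus-LSTM path of this implementation, using PyTorch's nn.LSTM. The per-frame CNN that produces the first sub-information is omitted, and the feature dimension, hidden size, number of selected frames and frame-selection rule are all assumptions.

```python
# Minimal sketch: per-frame CNN features ("first sub-information") are fed, in
# temporal order, to a first LSTM whose output plays the role of the fourth information.
import torch
import torch.nn as nn

frame_feature_dim, hidden_dim, n_frames = 512, 256, 16
lstm = nn.LSTM(input_size=frame_feature_dim, hidden_size=hidden_dim, batch_first=True)

# First sub-information: per-frame CNN features selected by some preset rule,
# shape (batch, n_frames, frame_feature_dim). A random tensor stands in for real features.
first_sub_information = torch.randn(1, n_frames, frame_feature_dim)

outputs, (h_n, c_n) = lstm(first_sub_information)
fourth_information = h_n[-1]   # final hidden state summarising the frame sequence
```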
With reference to the first aspect, in a third possible implementation, generating, with the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information includes:
Processing the image information with a first convolutional neural network to generate first information, where the first information is the first reference information;
Processing the optical flow information with a second convolutional neural network to generate second information, extracting second sub-information from the second information according to a preset rule, and processing the second sub-information with a second LSTM to generate fifth information, where the second information and the fifth information together constitute the second reference information;
Processing the acoustic information with a third convolutional neural network to generate third information, where the third information is the third reference information;
Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class relation matrix of the video includes:
Processing the video using the first information, the second information, the third information and the fifth information to obtain the confidence matrix of the video;
Processing at least one video using the first information, the second information, the third information and the fifth information to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function to obtain the target fusion parameter of the video includes:
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function defined in the first possible implementation, where L(S, Y; W) is here the empirical loss of the video over the processing of the first information, the second information, the third information and the fifth information, and the value of W that minimizes the objective is the target fusion parameter of the video.
With reference to the first aspect, in a fourth possible implementation, generating, with the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information includes:
Processing the image information with a first convolutional neural network to generate first information, where the first information is the first reference information;
Processing the optical flow information with a second convolutional neural network to generate second information, where the second information is the second reference information;
Processing the acoustic information with a third convolutional neural network to generate third information, extracting third sub-information from the third information according to a preset rule, and processing the third sub-information with a third LSTM to generate sixth information, where the third information and the sixth information together constitute the third reference information;
Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class relation matrix of the video includes:
Processing the video using the first information, the second information, the third information and the sixth information to obtain the confidence matrix of the video;
Processing at least one video using the first information, the second information, the third information and the sixth information to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function to obtain the target fusion parameter of the video includes:
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function defined in the first possible implementation, where L(S, Y; W) is here the empirical loss of the video over the processing of the first information, the second information, the third information and the sixth information, and the value of W that minimizes the objective is the target fusion parameter of the video.
With reference to the first aspect, in a fifth possible implementation, generating, with the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information includes:
Processing the image information with a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information with a first LSTM to generate fourth information, where the first information and the fourth information together constitute the first reference information;
Processing the optical flow information with a second convolutional neural network to generate second information, extracting second sub-information from the second information according to a preset rule, and processing the second sub-information with a second LSTM to generate fifth information, where the second information and the fifth information together constitute the second reference information;
Processing the acoustic information with a third convolutional neural network to generate third information, where the third information is the third reference information;
Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class relation matrix of the video includes:
Processing the video using the first information, the second information, the third information, the fourth information and the fifth information to obtain the confidence matrix of the video;
Processing at least one video using the first information, the second information, the third information, the fourth information and the fifth information to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function to obtain the target fusion parameter of the video includes:
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function defined in the first possible implementation, where L(S, Y; W) is here the empirical loss of the video over the processing of the first information, the second information, the third information, the fourth information and the fifth information, and the value of W that minimizes the objective is the target fusion parameter of the video.
With reference to the first aspect, in a sixth possible implementation, generating, with the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information includes:
Processing the image information with a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information with a first LSTM to generate fourth information, where the first information and the fourth information together constitute the first reference information;
Processing the optical flow information with a second convolutional neural network to generate second information, where the second information is the second reference information;
Processing the acoustic information with a third convolutional neural network to generate third information, extracting third sub-information from the third information according to a preset rule, and processing the third sub-information with a third LSTM to generate sixth information, where the third information and the sixth information together constitute the third reference information;
Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class relation matrix of the video includes:
Processing the video using the first information, the second information, the third information, the fourth information and the sixth information to obtain the confidence matrix of the video;
Processing at least one video using the first information, the second information, the third information, the fourth information and the sixth information to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function to obtain the target fusion parameter of the video includes:
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function defined in the first possible implementation, where L(S, Y; W) is here the empirical loss of the video over the processing of the first information, the second information, the third information, the fourth information and the sixth information, and the value of W that minimizes the objective is the target fusion parameter of the video.
With reference to the first aspect, in a seventh possible implementation, generating, with the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information includes:
Processing the image information with a first convolutional neural network to generate first information, where the first information is the first reference information;
Processing the optical flow information with a second convolutional neural network to generate second information, extracting second sub-information from the second information according to a preset rule, and processing the second sub-information with a second LSTM to generate fifth information, where the second information and the fifth information together constitute the second reference information;
Processing the acoustic information with a third convolutional neural network to generate third information, extracting third sub-information from the third information according to a preset rule, and processing the third sub-information with a third LSTM to generate sixth information, where the third information and the sixth information together constitute the third reference information;
Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class relation matrix of the video includes:
Processing the video using the first information, the second information, the third information, the fifth information and the sixth information to obtain the confidence matrix of the video;
Processing at least one video using the first information, the second information, the third information, the fifth information and the sixth information to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function to obtain the target fusion parameter of the video includes:
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function defined in the first possible implementation, where L(S, Y; W) is here the empirical loss of the video over the processing of the first information, the second information, the third information, the fifth information and the sixth information, and the value of W that minimizes the objective is the target fusion parameter of the video.
With reference to the first aspect, in an eighth possible implementation, generating, with the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information includes:
Processing the image information with a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information with a first LSTM to generate fourth information, where the first information and the fourth information together constitute the first reference information;
Processing the optical flow information with a second convolutional neural network to generate second information, extracting second sub-information from the second information according to a preset rule, and processing the second sub-information with a second LSTM to generate fifth information, where the second information and the fifth information together constitute the second reference information;
Processing the acoustic information with a third convolutional neural network to generate third information, extracting third sub-information from the third information according to a preset rule, and processing the third sub-information with a third LSTM to generate sixth information, where the third information and the sixth information together constitute the third reference information;
Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class relation matrix of the video includes:
Processing the video using the first information, the second information, the third information, the fourth information, the fifth information and the sixth information to obtain the confidence matrix of the video;
Processing at least one video using the first information, the second information, the third information, the fourth information, the fifth information and the sixth information to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function to obtain the target fusion parameter of the video includes:
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function defined in the first possible implementation, where L(S, Y; W) is here the empirical loss of the video over the processing of the first information, the second information, the third information, the fourth information, the fifth information and the sixth information, and the value of W that minimizes the objective is the target fusion parameter of the video.
A second aspect of the present invention provides a video classification apparatus, including:
An acquisition module, configured to obtain information from a video, where the information includes image information, optical flow information and acoustic information;
A generation module, configured to generate, with a deep neural network, first reference information corresponding to the image information obtained by the acquisition module, second reference information corresponding to the optical flow information obtained by the acquisition module, and third reference information corresponding to the acoustic information obtained by the acquisition module;
A processing module, configured to process the video according to the first reference information, the second reference information and the third reference information generated by the generation module, to obtain a confidence matrix of the video and a class relation matrix of the video;
A computing module, configured to substitute the confidence matrix of the video and the class relation matrix of the video obtained by the processing module into an objective function to obtain a target fusion parameter of the video, where the target fusion parameter is used to classify the video, and where the constraint factors of the objective function include the confidence matrix, the class relation matrix and the fusion parameter; a structural sketch of these modules is given below.
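The following is a minimal structural sketch of how the four modules of the second aspect could be chained. The image_cnn, flow_cnn and audio_cnn callables and the fit_fusion_parameter solver are assumed to be supplied from elsewhere (for example, the sketches given under the first aspect); nothing here is the patent's literal implementation.

```python
# Structural sketch of the apparatus of the second aspect: acquisition, generation,
# processing and computing modules chained in order.
import numpy as np

class VideoClassificationApparatus:
    def __init__(self, image_cnn, flow_cnn, audio_cnn, fit_fusion_parameter):
        self.image_cnn = image_cnn
        self.flow_cnn = flow_cnn
        self.audio_cnn = audio_cnn
        self.fit_fusion_parameter = fit_fusion_parameter

    def acquisition_module(self, video):
        # Obtain image, optical-flow and acoustic information from the video.
        return video.frames, video.optical_flow, video.audio

    def generation_module(self, frames, flow, audio):
        # First/second/third reference information from the three networks.
        return self.image_cnn(frames), self.flow_cnn(flow), self.audio_cnn(audio)

    def processing_module(self, reference_info_per_video):
        # Confidence matrix S (one row per video) and class relation matrix V.
        S = np.stack([np.concatenate(info) for info in reference_info_per_video])
        V = np.corrcoef(S, rowvar=False)   # assumed construction of V
        return S, V

    def computing_module(self, S, V, Y):
        # Substitute S (and, in the full objective, V) into the objective function
        # and solve for the target fusion parameter W; the sketched solver omits V.
        return self.fit_fusion_parameter(S, Y)
```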
With reference to the second aspect, in a first possible implementation,
The generation module is specifically configured to: process the image information obtained by the acquisition module with a first convolutional neural network to generate first information, where the first information is the first reference information; process the optical flow information obtained by the acquisition module with a second convolutional neural network to generate second information, where the second information is the second reference information; and process the acoustic information obtained by the acquisition module with a third convolutional neural network to generate third information, where the third information is the third reference information;
The processing module is specifically configured to: process the video using the first information, the second information and the third information generated by the generation module, to obtain the confidence matrix of the video; and process at least one video using the first information, the second information and the third information generated by the generation module, to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
The computing module is specifically configured to substitute the confidence matrix of the video and the class relation matrix of the video obtained by the processing module into the objective function defined in the first possible implementation of the first aspect, where L(S, Y; W) is here the empirical loss of the video over the processing of the first information, the second information and the third information, and the value of W that minimizes the objective is the target fusion parameter of the video.
With reference to the second aspect, in a second possible implementation,
The generation module is specifically configured to: process the image information obtained by the acquisition module with a first convolutional neural network to generate first information, extract first sub-information from the first information according to a preset rule, and process the first sub-information with a first long short-term memory (LSTM) recurrent neural network to generate fourth information, where the first information and the fourth information together constitute the first reference information; process the optical flow information obtained by the acquisition module with a second convolutional neural network to generate second information, where the second information is the second reference information; and process the acoustic information obtained by the acquisition module with a third convolutional neural network to generate third information, where the third information is the third reference information;
The processing module is specifically configured to: process the video using the first information, the second information, the third information and the fourth information generated by the generation module, to obtain the confidence matrix of the video; and process at least one video using the first information, the second information, the third information and the fourth information generated by the generation module, to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
The computing module is specifically configured to substitute the confidence matrix of the video and the class relation matrix of the video obtained by the processing module into the objective function defined in the first possible implementation of the first aspect, where L(S, Y; W) is here the empirical loss of the video over the processing of the first information, the second information, the third information and the fourth information, and the value of W that minimizes the objective is the target fusion parameter of the video.
With reference to the second aspect, in a third possible implementation,
The generation module is specifically configured to: process the image information obtained by the acquisition module with a first convolutional neural network to generate first information, where the first information is the first reference information; process the optical flow information obtained by the acquisition module with a second convolutional neural network to generate second information, extract second sub-information from the second information according to a preset rule, and process the second sub-information with a second LSTM to generate fifth information, where the second information and the fifth information together constitute the second reference information; and process the acoustic information obtained by the acquisition module with a third convolutional neural network to generate third information, where the third information is the third reference information;
The processing module is specifically configured to: process the video using the first information, the second information, the third information and the fifth information generated by the generation module, to obtain the confidence matrix of the video; and process at least one video using the first information, the second information, the third information and the fifth information generated by the generation module, to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
The computing module is specifically configured to substitute the confidence matrix of the video and the class relation matrix of the video obtained by the processing module into the objective function defined in the first possible implementation of the first aspect, where L(S, Y; W) is here the empirical loss of the video over the processing of the first information, the second information, the third information and the fifth information, and the value of W that minimizes the objective is the target fusion parameter of the video.
With reference to the second aspect, in a fourth possible implementation,
The generation module is specifically configured to: process the image information obtained by the acquisition module with a first convolutional neural network to generate first information, where the first information is the first reference information; process the optical flow information obtained by the acquisition module with a second convolutional neural network to generate second information, where the second information is the second reference information; and process the acoustic information obtained by the acquisition module with a third convolutional neural network to generate third information, extract third sub-information from the third information according to a preset rule, and process the third sub-information with a third LSTM to generate sixth information, where the third information and the sixth information together constitute the third reference information;
The processing module is specifically configured to: process the video using the first information, the second information, the third information and the sixth information generated by the generation module, to obtain the confidence matrix of the video; and process at least one video using the first information, the second information, the third information and the sixth information generated by the generation module, to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
The computing module is specifically configured to substitute the confidence matrix of the video and the class relation matrix of the video obtained by the processing module into the objective function defined in the first possible implementation of the first aspect, where L(S, Y; W) is here the empirical loss of the video over the processing of the first information, the second information, the third information and the sixth information, and the value of W that minimizes the objective is the target fusion parameter of the video.
With reference to the second aspect, in a fifth possible implementation,
The generation module is specifically configured to: process the image information obtained by the acquisition module with a first convolutional neural network to generate first information, extract first sub-information from the first information according to a preset rule, and process the first sub-information with a first LSTM to generate fourth information, where the first information and the fourth information together constitute the first reference information; process the optical flow information obtained by the acquisition module with a second convolutional neural network to generate second information, extract second sub-information from the second information according to a preset rule, and process the second sub-information with a second LSTM to generate fifth information, where the second information and the fifth information together constitute the second reference information; and process the acoustic information obtained by the acquisition module with a third convolutional neural network to generate third information, where the third information is the third reference information;
The processing module is specifically configured to: process the video using the first information, the second information, the third information, the fourth information and the fifth information generated by the generation module, to obtain the confidence matrix of the video; and process at least one video using the first information, the second information, the third information, the fourth information and the fifth information generated by the generation module, to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
The computing module is specifically configured to substitute the confidence matrix of the video and the class relation matrix of the video obtained by the processing module into the objective function defined in the first possible implementation of the first aspect, where L(S, Y; W) is here the empirical loss of the video over the processing of the first information, the second information, the third information, the fourth information and the fifth information, and the value of W that minimizes the objective is the target fusion parameter of the video.
With reference to the second aspect, in a sixth possible implementation,
The generation module is specifically configured to: process the image information obtained by the acquisition module with a first convolutional neural network to generate first information, extract first sub-information from the first information according to a preset rule, and process the first sub-information with a first LSTM to generate fourth information, where the first information and the fourth information together constitute the first reference information; process the optical flow information obtained by the acquisition module with a second convolutional neural network to generate second information, where the second information is the second reference information; and process the acoustic information obtained by the acquisition module with a third convolutional neural network to generate third information, extract third sub-information from the third information according to a preset rule, and process the third sub-information with a third LSTM to generate sixth information, where the third information and the sixth information together constitute the third reference information;
The processing module is specifically configured to: process the video using the first information, the second information, the third information, the fourth information and the sixth information generated by the generation module, to obtain the confidence matrix of the video; and process at least one video using the first information, the second information, the third information, the fourth information and the sixth information generated by the generation module, to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
The computing module is specifically configured to substitute the confidence matrix of the video and the class relation matrix of the video obtained by the processing module into the objective function defined in the first possible implementation of the first aspect, where L(S, Y; W) is here the empirical loss of the video over the processing of the first information, the second information, the third information, the fourth information and the sixth information, and the value of W that minimizes the objective is the target fusion parameter of the video.
With reference to the second aspect, in a seventh possible implementation,
The generation module is specifically configured to: process the image information obtained by the acquisition module with a first convolutional neural network to generate first information, where the first information is the first reference information; process the optical flow information obtained by the acquisition module with a second convolutional neural network to generate second information, extract second sub-information from the second information according to a preset rule, and process the second sub-information with a second LSTM to generate fifth information, where the second information and the fifth information together constitute the second reference information; and process the acoustic information obtained by the acquisition module with a third convolutional neural network to generate third information, extract third sub-information from the third information according to a preset rule, and process the third sub-information with a third LSTM to generate sixth information, where the third information and the sixth information together constitute the third reference information;
The processing module is specifically configured to: process the video using the first information, the second information, the third information, the fifth information and the sixth information generated by the generation module, to obtain the confidence matrix of the video; and process at least one video using the first information, the second information, the third information, the fifth information and the sixth information generated by the generation module, to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
The computing module is specifically configured to substitute the confidence matrix of the video and the class relation matrix of the video obtained by the processing module into the objective function defined in the first possible implementation of the first aspect, where L(S, Y; W) is here the empirical loss of the video over the processing of the first information, the second information, the third information, the fifth information and the sixth information, and the value of W that minimizes the objective is the target fusion parameter of the video.
With reference to the second aspect, in an eighth possible implementation,
The generation module is specifically configured to: process the image information obtained by the acquisition module with a first convolutional neural network to generate first information, extract first sub-information from the first information according to a preset rule, and process the first sub-information with a first LSTM to generate fourth information, where the first information and the fourth information together constitute the first reference information; process the optical flow information obtained by the acquisition module with a second convolutional neural network to generate second information, extract second sub-information from the second information according to a preset rule, and process the second sub-information with a second LSTM to generate fifth information, where the second information and the fifth information together constitute the second reference information; and process the acoustic information obtained by the acquisition module with a third convolutional neural network to generate third information, extract third sub-information from the third information according to a preset rule, and process the third sub-information with a third LSTM to generate sixth information, where the third information and the sixth information together constitute the third reference information;
The processing module is specifically configured to: process the video using the first information, the second information, the third information, the fourth information, the fifth information and the sixth information generated by the generation module, to obtain the confidence matrix of the video; and process at least one video using the first information, the second information, the third information, the fourth information, the fifth information and the sixth information generated by the generation module, to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
The computing module is specifically configured to substitute the confidence matrix of the video and the class relation matrix of the video obtained by the processing module into the objective function defined in the first possible implementation of the first aspect, where L(S, Y; W) is here the empirical loss of the video over the processing of the first information, the second information, the third information, the fourth information, the fifth information and the sixth information, and the value of W that minimizes the objective is the target fusion parameter of the video.
A third aspect of the present invention provides a video classification apparatus, including a memory, a processor, a transmitter and a receiver, where the memory is connected to the processor, and the processor is connected to the transmitter and to the receiver;
By invoking operation instructions stored in the memory, the processor is configured to perform the following steps:
Obtaining information from a video, where the information includes image information, optical flow information and acoustic information;
Generating, with a deep neural network, first reference information corresponding to the image information, second reference information corresponding to the optical flow information, and third reference information corresponding to the acoustic information;
Processing the video according to the first reference information, the second reference information and the third reference information to obtain a confidence matrix of the video and a class relation matrix of the video;
Substituting the confidence matrix of the video and the class relation matrix of the video into an objective function to obtain a target fusion parameter of the video, where the target fusion parameter is used to classify the video, and where the constraint factors of the objective function include the confidence matrix, the class relation matrix and the fusion parameter.
With reference to the third aspect, in a first possible implementation, generating, with the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information includes:
Processing the image information with a first convolutional neural network to generate first information, where the first information is the first reference information;
Processing the optical flow information with a second convolutional neural network to generate second information, where the second information is the second reference information;
Processing the acoustic information with a third convolutional neural network to generate third information, where the third information is the third reference information;
Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class relation matrix of the video includes:
Processing the video using the first information, the second information and the third information to obtain the confidence matrix of the video;
Processing at least one video using the first information, the second information and the third information to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function to obtain the target fusion parameter of the video includes:
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function defined in the first possible implementation of the first aspect, where L(S, Y; W) is here the empirical loss of the video over the processing of the first information, the second information and the third information, and the value of W that minimizes the objective is the target fusion parameter of the video.
With reference to the third aspect, in a second possible implementation, generating, with the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information includes:
Processing the image information with a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information with a first long short-term memory (LSTM) recurrent neural network to generate fourth information, where the first information and the fourth information together constitute the first reference information;
Processing the optical flow information with a second convolutional neural network to generate second information, where the second information is the second reference information;
Processing the acoustic information with a third convolutional neural network to generate third information, where the third information is the third reference information;
Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class relation matrix of the video includes:
Processing the video using the first information, the second information, the third information and the fourth information to obtain the confidence matrix of the video;
Processing at least one video using the first information, the second information, the third information and the fourth information to obtain the class relation matrix of the video, where the at least one video is used to classify the video;
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function to obtain the target fusion parameter of the video includes:
Substituting the confidence matrix of the video and the class relation matrix of the video into the objective function defined in the first possible implementation of the first aspect, where L(S, Y; W) is here the empirical loss of the video over the processing of the first information, the second information, the third information and the fourth information, and the value of W that minimizes the objective is the target fusion parameter of the video.
With reference to the third aspect, in a third possible implementation, the generating, by using the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information includes:
processing the image information by using a first convolutional neural network to generate first information, where the first information is the first reference information;
processing the optical flow information by using a second convolutional neural network to generate second information, extracting second sub-information from the second information according to a preset rule, and processing the second sub-information by using a second LSTM to generate fifth information, where the second information and the fifth information together constitute the second reference information;
processing the acoustic information by using a third convolutional neural network to generate third information, where the third information is the third reference information;
the processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence level matrix of the video and the class relations matrix of the video includes:
processing the video by using the first information, the second information, the third information and the fifth information to obtain the confidence level matrix of the video;
processing at least one video by using the first information, the second information, the third information and the fifth information to obtain the class relations matrix of the video, where the at least one video is used for classifying the video;
the substituting the confidence level matrix of the video and the class relations matrix of the video into the object function to obtain the subject fusion parameter of the video includes:
substituting the confidence level matrix of the video and the class relations matrix of the video into the object function, where the object function is min_W L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁, which represents that the value of W obtained by solving when L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁ is minimal is the subject fusion parameter of the video, where W is the fusion parameter in the object function, L(S, Y; W) is the empirical loss of the video in the processing by the first information, the second information, the third information and the fifth information, S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖_F denotes the Frobenius norm, and ‖·‖₁ denotes the sparse regularization operator.
With reference to the third aspect, in a fourth possible implementation, the generating, by using the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information includes:
processing the image information by using a first convolutional neural network to generate first information, where the first information is the first reference information;
processing the optical flow information by using a second convolutional neural network to generate second information, where the second information is the second reference information;
processing the acoustic information by using a third convolutional neural network to generate third information, extracting third sub-information from the third information according to a preset rule, and processing the third sub-information by using a third LSTM to generate sixth information, where the third information and the sixth information together constitute the third reference information;
the processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence level matrix of the video and the class relations matrix of the video includes:
processing the video by using the first information, the second information, the third information and the sixth information to obtain the confidence level matrix of the video;
processing at least one video by using the first information, the second information, the third information and the sixth information to obtain the class relations matrix of the video, where the at least one video is used for classifying the video;
the substituting the confidence level matrix of the video and the class relations matrix of the video into the object function to obtain the subject fusion parameter of the video includes:
substituting the confidence level matrix of the video and the class relations matrix of the video into the object function, where the object function is min_W L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁, which represents that the value of W obtained by solving when L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁ is minimal is the subject fusion parameter of the video, where W is the fusion parameter in the object function, L(S, Y; W) is the empirical loss of the video in the processing by the first information, the second information, the third information and the sixth information, S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖_F denotes the Frobenius norm, and ‖·‖₁ denotes the sparse regularization operator.
With reference to the third aspect, in a fifth possible implementation, the generating, by using the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information includes:
processing the image information by using a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information by using a first LSTM to generate fourth information, where the first information and the fourth information together constitute the first reference information;
processing the optical flow information by using a second convolutional neural network to generate second information, extracting second sub-information from the second information according to a preset rule, and processing the second sub-information by using a second LSTM to generate fifth information, where the second information and the fifth information together constitute the second reference information;
processing the acoustic information by using a third convolutional neural network to generate third information, where the third information is the third reference information;
the processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence level matrix of the video and the class relations matrix of the video includes:
processing the video by using the first information, the second information, the third information, the fourth information and the fifth information to obtain the confidence level matrix of the video;
processing at least one video by using the first information, the second information, the third information, the fourth information and the fifth information to obtain the class relations matrix of the video, where the at least one video is used for classifying the video;
the substituting the confidence level matrix of the video and the class relations matrix of the video into the object function to obtain the subject fusion parameter of the video includes:
substituting the confidence level matrix of the video and the class relations matrix of the video into the object function, where the object function is min_W L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁, which represents that the value of W obtained by solving when L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁ is minimal is the subject fusion parameter of the video, where W is the fusion parameter in the object function, L(S, Y; W) is the empirical loss of the video in the processing by the first information, the second information, the third information, the fourth information and the fifth information, S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖_F denotes the Frobenius norm, and ‖·‖₁ denotes the sparse regularization operator.
With reference to the third aspect, in a sixth possible implementation, the generating, by using the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information includes:
processing the image information by using a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information by using a first LSTM to generate fourth information, where the first information and the fourth information together constitute the first reference information;
processing the optical flow information by using a second convolutional neural network to generate second information, where the second information is the second reference information;
processing the acoustic information by using a third convolutional neural network to generate third information, extracting third sub-information from the third information according to a preset rule, and processing the third sub-information by using a third LSTM to generate sixth information, where the third information and the sixth information together constitute the third reference information;
the processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence level matrix of the video and the class relations matrix of the video includes:
processing the video by using the first information, the second information, the third information, the fourth information and the sixth information to obtain the confidence level matrix of the video;
processing at least one video by using the first information, the second information, the third information, the fourth information and the sixth information to obtain the class relations matrix of the video, where the at least one video is used for classifying the video;
the substituting the confidence level matrix of the video and the class relations matrix of the video into the object function to obtain the subject fusion parameter of the video includes:
substituting the confidence level matrix of the video and the class relations matrix of the video into the object function, where the object function is min_W L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁, which represents that the value of W obtained by solving when L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁ is minimal is the subject fusion parameter of the video, where W is the fusion parameter in the object function, L(S, Y; W) is the empirical loss of the video in the processing by the first information, the second information, the third information, the fourth information and the sixth information, S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖_F denotes the Frobenius norm, and ‖·‖₁ denotes the sparse regularization operator.
With reference to the third aspect, in a seventh possible implementation, the generating, by using the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information includes:
processing the image information by using a first convolutional neural network to generate first information, where the first information is the first reference information;
processing the optical flow information by using a second convolutional neural network to generate second information, extracting second sub-information from the second information according to a preset rule, and processing the second sub-information by using a second LSTM to generate fifth information, where the second information and the fifth information together constitute the second reference information;
processing the acoustic information by using a third convolutional neural network to generate third information, extracting third sub-information from the third information according to a preset rule, and processing the third sub-information by using a third LSTM to generate sixth information, where the third information and the sixth information together constitute the third reference information;
the processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence level matrix of the video and the class relations matrix of the video includes:
processing the video by using the first information, the second information, the third information, the fifth information and the sixth information to obtain the confidence level matrix of the video;
processing at least one video by using the first information, the second information, the third information, the fifth information and the sixth information to obtain the class relations matrix of the video, where the at least one video is used for classifying the video;
the substituting the confidence level matrix of the video and the class relations matrix of the video into the object function to obtain the subject fusion parameter of the video includes:
substituting the confidence level matrix of the video and the class relations matrix of the video into the object function, where the object function is min_W L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁, which represents that the value of W obtained by solving when L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁ is minimal is the subject fusion parameter of the video, where W is the fusion parameter in the object function, L(S, Y; W) is the empirical loss of the video in the processing by the first information, the second information, the third information, the fifth information and the sixth information, S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖_F denotes the Frobenius norm, and ‖·‖₁ denotes the sparse regularization operator.
With reference to the third aspect, in an eighth possible implementation, the generating, by using the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information includes:
processing the image information by using a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information by using a first LSTM to generate fourth information, where the first information and the fourth information together constitute the first reference information;
processing the optical flow information by using a second convolutional neural network to generate second information, extracting second sub-information from the second information according to a preset rule, and processing the second sub-information by using a second LSTM to generate fifth information, where the second information and the fifth information together constitute the second reference information;
processing the acoustic information by using a third convolutional neural network to generate third information, extracting third sub-information from the third information according to a preset rule, and processing the third sub-information by using a third LSTM to generate sixth information, where the third information and the sixth information together constitute the third reference information;
the processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence level matrix of the video and the class relations matrix of the video includes:
processing the video by using the first information, the second information, the third information, the fourth information, the fifth information and the sixth information to obtain the confidence level matrix of the video;
processing at least one video by using the first information, the second information, the third information, the fourth information, the fifth information and the sixth information to obtain the class relations matrix of the video, where the at least one video is used for classifying the video;
the substituting the confidence level matrix of the video and the class relations matrix of the video into the object function to obtain the subject fusion parameter of the video includes:
substituting the confidence level matrix of the video and the class relations matrix of the video into the object function, where the object function is min_W L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁, which represents that the value of W obtained by solving when L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁ is minimal is the subject fusion parameter of the video, where W is the fusion parameter in the object function, L(S, Y; W) is the empirical loss of the video in the processing by the first information, the second information, the third information, the fourth information, the fifth information and the sixth information, S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖_F denotes the Frobenius norm, and ‖·‖₁ denotes the sparse regularization operator.
With the above technical solution, information in a video is obtained, where the information in the video includes image information, optical flow information and acoustic information; first reference information corresponding to the image information, second reference information corresponding to the optical flow information and third reference information corresponding to the acoustic information are generated by using a deep neural network; the video is processed according to the first reference information, the second reference information and the third reference information to obtain a confidence level matrix of the video and a class relations matrix of the video; and the confidence level matrix of the video and the class relations matrix of the video are substituted into an object function to obtain a subject fusion parameter of the video, where the subject fusion parameter is used for classifying the video. It can be seen that, unlike the prior art, the present invention does not combine the information in the video directly; instead, the reference information corresponding to the information in the video is generated by the deep neural network, the video is further processed by using the reference information, and class relations are introduced during the processing to impose a regularization constraint, which reduces the risk of over-fitting. The video is then classified according to the processing result, which effectively improves the accuracy of video classification.
Description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for the embodiments of the present invention are briefly described below. Apparently, the drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these drawings without creative efforts.
Fig. 1 is a schematic diagram of an embodiment of extracting video information in the prior art;
Fig. 2 is a schematic diagram of an embodiment of video classification in the prior art;
Fig. 3 is a schematic diagram of an embodiment of a video classification method according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of another embodiment of a video classification method according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of another embodiment of a video classification method according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of another embodiment of a video classification method according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of another embodiment of a video classification method according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of another embodiment of a video classification method according to an embodiment of the present invention;
Fig. 9 is a schematic diagram of another embodiment of a video classification method according to an embodiment of the present invention;
Figure 10 is a schematic diagram of another embodiment of a video classification method according to an embodiment of the present invention;
Figure 11 is a schematic diagram of another embodiment of a video classification method according to an embodiment of the present invention;
Figure 12-a is a schematic diagram of an embodiment of a video classification application scenario according to an embodiment of the present invention;
Figure 12-b is a schematic diagram of another embodiment of a video classification application scenario according to an embodiment of the present invention;
Figure 13 is a schematic structural diagram of a video classification apparatus according to an embodiment of the present invention;
Figure 14 is another schematic structural diagram of a video classification apparatus according to an embodiment of the present invention.
Specific embodiment
The embodiments of the present invention provide a video classification method and apparatus, to solve the defect in the prior art that videos cannot be classified accurately and to improve the accuracy of video classification.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without creative efforts shall fall within the protection scope of the present invention.
The terms "first", "second", "third", "fourth" and the like in the specification, the claims and the accompanying drawings are used to distinguish different objects rather than to describe a particular order. In addition, the terms "include", "have" and any variants thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product or device.
The technical solution of the present invention can be applied to any terminal, such as a smartphone, an iPad or a personal computer, and is mainly applied to a computer; this is not specifically limited here.
Referring to Fig. 3, an embodiment of the video classification method in the embodiments of the present invention mainly includes: obtaining information in a video, where the information in the video includes image information, optical flow information and acoustic information; generating, by using a deep neural network, first reference information corresponding to the image information, second reference information corresponding to the optical flow information and third reference information corresponding to the acoustic information; processing the video according to the first reference information, the second reference information and the third reference information to obtain a confidence level matrix of the video and a class relations matrix of the video; and substituting the confidence level matrix of the video and the class relations matrix of the video into an object function to obtain a subject fusion parameter of the video, where the subject fusion parameter is used for classifying the video, and the constraint factors of the object function include the confidence level matrix, the class relations matrix and the fusion parameter. The detailed process is as follows.
301. Obtain information in a video.
The information in the video includes image information, optical flow information and acoustic information. The image information is the image corresponding to each frame that constitutes the video. The optical flow information is extracted from multiple frames of the video; optical flow is a concept in the detection of object motion within a field of view, and describes the motion of an observed object, surface or edge caused by motion relative to the observer. In practical applications, the optical flow information in the video is detected by an optical flow method, which infers the moving speed and direction of an object by detecting changes in the intensity of image pixels over time. The acoustic information is sound spectrum information obtained by performing Fourier transform on the audio signal of the video.
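As an illustration of this step, the following is a minimal sketch of how the three kinds of information could be extracted with OpenCV and NumPy; the function name, the Farneback optical flow choice and the parameters are illustrative assumptions and not part of the patent.

```python
import cv2
import numpy as np

def extract_video_information(video_path, audio_samples):
    """Sketch: extract per-frame images, optical flow between consecutive
    frames, and a sound spectrum from the audio track of a video."""
    frames, flows = [], []
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    frames.append(prev)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense optical flow between consecutive frames (Farneback method).
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        frames.append(frame)
        flows.append(flow)
        prev_gray = gray
    cap.release()
    # Sound spectrum: magnitude of the Fourier transform of the audio signal.
    spectrum = np.abs(np.fft.rfft(audio_samples))
    return frames, flows, spectrum
```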
302. Generate, by using a deep neural network, first reference information corresponding to the image information, second reference information corresponding to the optical flow information and third reference information corresponding to the acoustic information.
The image information, the optical flow information and the acoustic information are further processed by deep neural networks. A deep neural network is a neural network that simulates the analysis and learning of the human brain; it imitates the mechanism by which the human brain interprets data such as images, sound and text. The deep neural networks involved in the present invention are the convolutional neural network (CNN) and the time recurrent neural network (Long-Short Term Memory, LSTM); both produce good results when processing the information of a video. For example, a convolutional neural network is a feedforward neural network whose artificial neurons respond to surrounding units within a local coverage area, and it performs outstandingly in large-scale image processing. A convolutional neural network consists of one or more convolutional layers and fully connected layers on top (corresponding to a classical neural network), together with associated weights and pooling layers. This structure allows the convolutional neural network to exploit the two-dimensional structure of the input data. Compared with other deep learning structures, convolutional neural networks give better results in image and speech recognition; such a model can also be trained with the back-propagation algorithm, and it needs fewer parameters to estimate than other deep feedforward neural networks, which makes it an attractive deep learning structure. Owing to its design, an LSTM is well suited to processing and predicting events in a time series that are separated by long and uncertain intervals. An LSTM usually performs better than a hidden Markov model (HMM); as a nonlinear model, an LSTM can serve as a complex nonlinear unit for constructing a larger deep neural network.
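A minimal sketch of this step for one modality, assuming a PyTorch-style setup in which a CNN produces per-frame features and per-category scores while an LSTM summarizes the frame features over time; the backbone (ResNet-18), hidden size and head layers are illustrative assumptions rather than the patent's concrete networks.

```python
import torch
import torch.nn as nn
from torchvision import models

class CnnLstmBranch(nn.Module):
    """Sketch: a CNN yields per-frame features; an LSTM summarizes them over
    time; both feed a classifier over num_classes video categories."""
    def __init__(self, num_classes, hidden_size=256):
        super().__init__()
        cnn = models.resnet18()
        cnn.fc = nn.Identity()            # keep 512-d frame features
        self.cnn = cnn
        self.lstm = nn.LSTM(512, hidden_size, batch_first=True)
        self.cnn_head = nn.Linear(512, num_classes)
        self.lstm_head = nn.Linear(hidden_size, num_classes)

    def forward(self, frames):            # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        cnn_scores = self.cnn_head(feats.mean(dim=1))   # CNN-based information
        _, (h, _) = self.lstm(feats)
        lstm_scores = self.lstm_head(h[-1])             # LSTM-based information
        return cnn_scores, lstm_scores
```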
303. Process the video according to the first reference information, the second reference information and the third reference information to obtain the confidence level matrix of the video and the class relations matrix of the video.
According to the first reference information, the second reference information and the third reference information generated by the deep neural networks, the video is further processed to obtain the confidence level matrix of the video and the class relations matrix of the video. Confidence level refers to the degree to which a particular individual believes the truth of a particular proposition, that is, the probability that a population parameter value falls within a certain area of the sample statistic. For example, when the video is processed according to the first reference information, the probability of sea obtained may be 80% and the probability of blue sky 20%. A category represents a kind of video, for example videos of cats and videos of dogs. The class relations matrix records the percentage of training samples of each category that are misassigned to other categories. For example, when videos of multiple cats are processed and the probability of cat obtained is 90% while the probability of dog is 10%, the probability of the dog category being assigned by mistake among the cat videos is 10%.
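A minimal sketch of the two matrices described here, under the assumption that the confidence level matrix stacks each model's per-category confidence for one video and that the class relations matrix is a row-normalized confusion matrix over training videos; both layouts are illustrative choices, not the patent's fixed definitions.

```python
import numpy as np

def build_confidence_matrix(model_scores):
    """Sketch: stack the per-category confidences of each model for one video.
    model_scores: list of length num_models, each of shape (num_classes,)."""
    return np.stack(model_scores)          # S: (num_models, num_classes)

def build_class_relations_matrix(predictions, labels, num_classes):
    """Sketch: estimate how often training videos of each true category are
    misassigned to every other category (row-normalized confusion matrix)."""
    V = np.zeros((num_classes, num_classes))
    for pred, true in zip(predictions, labels):
        V[true, pred] += 1
    V /= np.maximum(V.sum(axis=1, keepdims=True), 1)
    return V
```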
304. Substitute the confidence level matrix of the video and the class relations matrix of the video into the object function to obtain the subject fusion parameter of the video.
The subject fusion parameter is used for classifying the video, and the constraint factors of the object function include the confidence level matrix, the class relations matrix and the fusion parameter.
The confidence level matrix of the video and the class relations matrix of the video obtained above are substituted into the object function; when the value of the object function is minimal, the subject fusion parameter of the video is obtained, so that the video can be classified accurately.
In the specific classification process, the subject fusion parameter is used to fuse the classification results of the first reference information, the second reference information and the third reference information, so as to obtain the final classification result of the video.
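A minimal sketch of how the learned fusion parameter might combine the per-model confidences into a final decision; the elementwise weighted-sum form is an illustrative assumption rather than the patent's prescribed fusion rule.

```python
import numpy as np

def fuse_and_classify(S, W):
    """Sketch: combine the confidence matrix S (num_models, num_classes) of one
    video with the learned fusion parameter W (num_models, num_classes) and
    pick the category with the highest fused score."""
    fused = (W * S).sum(axis=0)      # weighted combination of model confidences
    return int(np.argmax(fused)), fused
```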
On the basis of the embodiment shown in Fig. 3, and referring to Fig. 4, video classification is described in detail as follows.
Step 401: obtain information in a video.
The information in the video includes image information, optical flow information and acoustic information.
Step 401 is the same as or similar to step 301 in the embodiment shown in Fig. 3; for details, see step 301, which is not repeated here.
Step 402: process the image information by using a first convolutional neural network to generate first information, where the first information is the first reference information.
Step 403: process the optical flow information by using a second convolutional neural network to generate second information, where the second information is the second reference information.
Step 404: process the acoustic information by using a third convolutional neural network to generate third information, where the third information is the third reference information.
The first information is the information obtained after the first convolutional neural network processes the image information, for example model 1; the second information is the information obtained after the second convolutional neural network processes the optical flow information, for example model 2; the third information is the information obtained after the third convolutional neural network processes the acoustic information, for example model 3. The concrete processing extracts the one-dimensional information corresponding to the image information, the optical flow information and the acoustic information respectively.
It should be noted that steps 402 to 404 may be executed simultaneously or in a preset order, which is not specifically limited here. In addition, the first convolutional neural network, the second convolutional neural network and the third convolutional neural network may be identical or different, which can be decided according to actual requirements and is not specifically limited here.
Step 405: process the video by using the first information, the second information and the third information to obtain the confidence level matrix of the video.
The video is processed by model 1, model 2 and model 3 respectively, the corresponding confidence levels of the video are obtained respectively, and the confidence levels together constitute the confidence level matrix of the video. For example, after processing, model 1 judges that the probability that the video shows the sea is 80% and the probability of blue sky is 20%; model 2 judges that the probability of sea is 70% and the probability of blue sky is 30%; model 3 judges that the probability of sea is 80% and the probability of blue sky is 20%.
Step 406: process at least one video by using the first information, the second information and the third information to obtain the class relations matrix of the video.
The at least one video is used for classifying the video. For example, the at least one video is processed by model 1, model 2 and model 3, the class relations of the video are determined after processing by model 1, model 2 and model 3, and the class relations are further combined into the class relations matrix (as shown in the sketch below). For example, model 1 processes the videos of multiple cats and judges the probability of cat to be 90% and the probability of dog to be 10%, so the probability of the dog category being assigned by mistake among the cat videos is 10%; model 2 judges the probability of cat to be 80% and the probability of dog to be 20%, so the misassignment probability is 20%; model 3 judges the probability of cat to be 75% and the probability of dog to be 25%, so the misassignment probability is 25%.
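A minimal sketch of how the per-model misassignment rates in this example could be combined into a single class relations matrix, assuming the matrix is taken as the average of per-model row-normalized confusion matrices; this averaging rule and the dog-row values are illustrative assumptions, not specified by the patent.

```python
import numpy as np

# Rows/columns: [cat, dog]. Each model's row-normalized confusion over the
# cat training videos from the example above (dog row is illustrative).
model1 = np.array([[0.90, 0.10],
                   [0.00, 1.00]])
model2 = np.array([[0.80, 0.20],
                   [0.00, 1.00]])
model3 = np.array([[0.75, 0.25],
                   [0.00, 1.00]])

# One simple way to obtain a single class relations matrix: average the
# per-model confusion matrices.
V = np.mean([model1, model2, model3], axis=0)
print(V)   # [[0.8167, 0.1833], [0.0, 1.0]]
```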
It should be noted that step 405 and step 406 may be executed simultaneously or in a preset order, which is not specifically limited here.
Step 407: substitute the confidence level matrix of the video and the class relations matrix of the video into the object function.
The object function is min_W L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁, which represents that the value of W obtained by solving when L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁ is minimal is the subject fusion parameter of the video, and the subject fusion parameter is used for classifying the video, where W is the fusion parameter in the object function, L(S, Y; W) is the empirical loss of the video in the processing by the first information, the second information and the third information, S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖_F denotes the Frobenius norm, and ‖·‖₁ denotes the sparse regularization operator.
The confidence level matrix of the video and the class relations matrix of the video are substituted into the object function; when the value of the object function is minimal, the obtained W is the subject fusion parameter of the video. W can be obtained by the proximal gradient method, which is one of the most commonly used optimization algorithms for large-scale data; it generally converges quickly and solves the optimization problem efficiently.
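A minimal sketch of a proximal gradient solver for an objective of this shape, assuming a squared-error empirical loss, the elementwise fusion rule sketched earlier, and the matrix layouts used above; these choices are illustrative and not the patent's exact formulation.

```python
import numpy as np

def soft_threshold(X, tau):
    """Proximal operator of tau * ||X||_1 (elementwise soft-thresholding)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def proximal_gradient_fusion(S_list, Y, V, lam1, lam2, step=1e-3, iters=500):
    """Sketch: learn the fusion parameter W for an objective of the form
    L(S, Y; W) + lam1*||W V||_F^2 + lam2*||W||_1 by the proximal gradient method.
    S_list: per-video confidence matrices, each (n_models, n_classes);
    Y: one-hot labels (n_videos, n_classes); V: class relations (n_classes, n_classes)."""
    n_models, n_classes = S_list[0].shape
    W = np.zeros((n_models, n_classes))
    for _ in range(iters):
        grad = 2.0 * lam1 * (W @ V @ V.T)               # gradient of the Frobenius term
        for S, y in zip(S_list, Y):
            residual = (W * S).sum(axis=0) - y           # fused prediction minus label
            grad += S * residual                         # gradient of the squared loss
        W = soft_threshold(W - step * grad, step * lam2) # prox step for the L1 term
    return W
```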
In this embodiment of the present invention, the information in the video is not combined directly; instead, the first information, the second information and the third information corresponding to the information in the video are generated by deep neural networks, the video is further processed by using the first information, the second information and the third information, and class relations are introduced during the processing to impose a regularization constraint, which reduces the risk of over-fitting. The video is then classified according to the processing result, which effectively improves the accuracy of video classification.
On the basis of the embodiments shown in Fig. 3 and Fig. 4, and referring to Fig. 5, video classification is described in detail as follows.
Step 501: obtain information in a video.
The information in the video includes image information, optical flow information and acoustic information.
Step 501 is the same as or similar to step 301 in the embodiment shown in Fig. 3; for details, see step 301, which is not repeated here.
Step 502: process the image information by using a first convolutional neural network to generate first information, extract first sub-information from the first information according to a preset rule, and process the first sub-information by using a first time recurrent neural network (LSTM) to generate fourth information, where the first information and the fourth information together constitute the first reference information.
After the first information (for example model 1) is generated by using the first convolutional neural network, partial information in the first information (for example an intermediate result of the first convolutional neural network) is extracted, and the partial information is further processed by the first time recurrent neural network (LSTM) to generate the fourth information (for example model 4).
It should be noted that the preset rule is a rule set in advance, for example taking preset partial information in the output first information as the input of the first LSTM; the specific preset rule may depend on the practical application and is not specifically limited here.
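A minimal sketch of one way such a preset rule could be realized, assuming a PyTorch model in which an intermediate convolutional feature map is captured with a forward hook and then fed, frame by frame, to an LSTM; the chosen layer, pooling and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

cnn = models.resnet18()
captured = {}

def save_intermediate(module, inputs, output):
    # Keep the intermediate result selected by the preset rule.
    captured["feat"] = output

# Preset rule (assumed): take the output of the last residual stage.
cnn.layer4.register_forward_hook(save_intermediate)

lstm = nn.LSTM(input_size=512, hidden_size=256, batch_first=True)

def fourth_information(frames):              # frames: (time, 3, H, W)
    feats = []
    for frame in frames:
        cnn(frame.unsqueeze(0))              # run the CNN; the hook fills `captured`
        fmap = captured["feat"]              # (1, 512, h, w)
        feats.append(fmap.mean(dim=(2, 3)))  # pool to a 512-d frame descriptor
    seq = torch.stack(feats, dim=1)          # (1, time, 512)
    out, _ = lstm(seq)
    return out[:, -1]                        # LSTM summary ("fourth information")
```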
Step 503: process the optical flow information by using a second convolutional neural network to generate second information, where the second information is the second reference information.
Step 504: process the acoustic information by using a third convolutional neural network to generate third information, where the third information is the third reference information.
The first information is the information obtained after the first convolutional neural network processes the image information, for example model 1; the second information is the information obtained after the second convolutional neural network processes the optical flow information, for example model 2; the third information is the information obtained after the third convolutional neural network processes the acoustic information, for example model 3. The concrete processing extracts the one-dimensional information corresponding to the image information, the optical flow information and the acoustic information respectively. The fourth information is the two-dimensional information corresponding to the image information processed by the first LSTM, for example model 4.
It should be noted that steps 502 to 504 may be executed simultaneously or in a preset order, which is not specifically limited here. In addition, the first convolutional neural network, the second convolutional neural network and the third convolutional neural network may be identical or different, which can be decided according to actual requirements and is not specifically limited here.
Step 505: process the video by using the first information, the second information, the third information and the fourth information to obtain the confidence level matrix of the video.
The video is processed by model 1, model 2, model 3 and model 4 respectively, the corresponding confidence levels of the video are determined by model 1, model 2, model 3 and model 4 after processing, and the confidence levels together constitute the confidence level matrix of the video. For example, after processing, model 1 determines that the probability of sea is 80% and the probability of blue sky is 20%; model 2 determines that the probability of sea is 70% and the probability of blue sky is 30%; model 3 determines that the probability of sea is 80% and the probability of blue sky is 20%; model 4 determines that the probability of sea is 75% and the probability of blue sky is 25%.
Step 506: process at least one video by using the first information, the second information, the third information and the fourth information to obtain the class relations matrix of the video.
The at least one video is used for classifying the video. The at least one video is processed by model 1, model 2, model 3 and model 4, the class relations of the video are determined after processing by model 1, model 2, model 3 and model 4, and the class relations are further combined into the class relations matrix. For example, when videos of multiple cats are processed and the probability of cat is determined to be 90% while the probability of dog is 10%, the probability of the dog category being assigned by mistake among the cat videos is 10%.
It should be noted that step 505 and step 506 may be executed simultaneously or in a preset order, which is not specifically limited here.
Step 507: substitute the confidence level matrix of the video and the class relations matrix of the video into the object function.
The object function is min_W L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁, which represents that the value of W obtained by solving when L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁ is minimal is the subject fusion parameter of the video, and the subject fusion parameter is used for classifying the video, where W is the fusion parameter in the object function, L(S, Y; W) is the empirical loss of the video in the processing by the first information, the second information, the third information and the fourth information, S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖_F denotes the Frobenius norm, and ‖·‖₁ denotes the sparse regularization operator.
The confidence level matrix of the video and the class relations matrix of the video are substituted into the object function; when the value of the object function is minimal, the obtained W is the subject fusion parameter of the video. W can be obtained by the proximal gradient method, which is one of the most commonly used optimization algorithms for large-scale data; it generally converges quickly and solves the optimization problem efficiently.
Unlike the prior art, the present invention does not combine the information in the video directly; instead, the first information, the second information, the third information and the fourth information corresponding to the information in the video are generated by deep neural networks, the video is further processed by using the first information, the second information, the third information and the fourth information, and class relations are introduced during the processing to impose a regularization constraint, which reduces the risk of over-fitting. The video is then classified according to the processing result, which effectively improves the accuracy of video classification.
On the basis of the embodiments shown in Fig. 3 to Fig. 5, and referring to Fig. 6, video classification is described in detail as follows.
Step 601: obtain information in a video.
The information in the video includes image information, optical flow information and acoustic information.
Step 601 is the same as or similar to step 301 in the embodiment shown in Fig. 3; for details, see step 301, which is not repeated here.
Step 602: process the image information by using a first convolutional neural network to generate first information, where the first information is the first reference information.
Step 603: process the optical flow information by using a second convolutional neural network to generate second information, extract second sub-information from the second information according to a preset rule, and process the second sub-information by using a second LSTM to generate fifth information, where the second information and the fifth information together constitute the second reference information.
After the second information is generated by using the second convolutional neural network, partial information in the second information (for example an intermediate result of the second convolutional neural network) is extracted, and the partial information is further processed by the second time recurrent neural network (LSTM) to generate the fifth information, for example model 5.
Step 604: process the acoustic information by using a third convolutional neural network to generate third information, where the third information is the third reference information.
It should be noted that steps 602 to 604 may be executed simultaneously or in a preset order, which is not specifically limited here. In addition, the first convolutional neural network, the second convolutional neural network and the third convolutional neural network may be identical or different, which can be decided according to actual requirements and is not specifically limited here.
Step 605: process the video by using the first information, the second information, the third information and the fifth information to obtain the confidence level matrix of the video.
Step 606: process at least one video by using the first information, the second information, the third information and the fifth information to obtain the class relations matrix of the video.
The at least one video is used for classifying the video; for details, see step 406. It should be noted that step 605 and step 606 may be executed simultaneously or in a preset order, which is not specifically limited here.
Step 607: substitute the confidence level matrix of the video and the class relations matrix of the video into the object function.
The object function is min_W L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁, which represents that the value of W obtained by solving when L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁ is minimal is the subject fusion parameter of the video, and the subject fusion parameter is used for classifying the video, where W is the fusion parameter in the object function, L(S, Y; W) is the empirical loss of the video in the processing by the first information, the second information, the third information and the fifth information, S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖_F denotes the Frobenius norm, and ‖·‖₁ denotes the sparse regularization operator.
The confidence level matrix of the video and the class relations matrix of the video are substituted into the object function; when the value of the object function is minimal, the obtained W is the subject fusion parameter of the video. W can be obtained by the proximal gradient method, which is one of the most commonly used optimization algorithms for large-scale data; it generally converges quickly and solves the optimization problem efficiently.
Unlike the prior art, the present invention does not combine the information in the video directly; instead, the first information, the second information, the third information and the fifth information corresponding to the information in the video are generated by deep neural networks, the video is further processed by using the first information, the second information, the third information and the fifth information, and class relations are introduced during the processing to impose a regularization constraint, which reduces the risk of over-fitting. The video is then classified according to the processing result, which effectively improves the accuracy of video classification.
On the basis of the embodiments shown in Fig. 3 to Fig. 6, and referring to Fig. 7, video classification is described in detail as follows.
Step 701: obtain information in a video.
The information in the video includes image information, optical flow information and acoustic information.
Step 701 is the same as or similar to step 301 in the embodiment shown in Fig. 3; for details, see step 301, which is not repeated here.
Step 702: process the image information by using a first convolutional neural network to generate first information, where the first information is the first reference information.
Step 703: process the optical flow information by using a second convolutional neural network to generate second information, where the second information is the second reference information.
Step 704: process the acoustic information by using a third convolutional neural network to generate third information, extract third sub-information from the third information according to a preset rule, and process the third sub-information by using a third LSTM to generate sixth information, where the third information and the sixth information together constitute the third reference information.
After the third information is generated by using the third convolutional neural network, partial information in the third information (for example an intermediate result of the third convolutional neural network) is extracted, and the partial information is further processed by the third time recurrent neural network (LSTM) to generate the sixth information, for example model 6.
It should be noted that steps 702 to 704 may be executed simultaneously or in a preset order, which is not specifically limited here. In addition, the first convolutional neural network, the second convolutional neural network and the third convolutional neural network may be identical or different, which can be decided according to actual requirements and is not specifically limited here.
Step 705: process the video by using the first information, the second information, the third information and the sixth information to obtain the confidence level matrix of the video.
Step 706: process at least one video by using the first information, the second information, the third information and the sixth information to obtain the class relations matrix of the video.
The at least one video is used for classifying the video; for details, see step 406. It should be noted that step 705 and step 706 may be executed simultaneously or in a preset order, which is not specifically limited here.
Step 707: substitute the confidence level matrix of the video and the class relations matrix of the video into the object function.
The object function is min_W L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁, which represents that the value of W obtained by solving when L(S, Y; W) + λ1‖WV‖_F² + λ2‖W‖₁ is minimal is the subject fusion parameter of the video, where W is the fusion parameter in the object function, L(S, Y; W) is the empirical loss of the video in the processing by the first information, the second information, the third information and the sixth information, S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖_F denotes the Frobenius norm, and ‖·‖₁ denotes the sparse regularization operator.
The confidence level matrix of the video and the class relations matrix of the video are substituted into the object function; when the value of the object function is minimal, the obtained W is the subject fusion parameter of the video. W can be obtained by the proximal gradient method, which is one of the most commonly used optimization algorithms for large-scale data; it generally converges quickly and solves the optimization problem efficiently.
Unlike the prior art, the present invention does not combine the information in the video directly; instead, the first information, the second information, the third information and the sixth information corresponding to the information in the video are generated by deep neural networks, the video is further processed by using the first information, the second information, the third information and the sixth information, and class relations are introduced during the processing to impose a regularization constraint, which reduces the risk of over-fitting. The video is then classified according to the processing result, which effectively improves the accuracy of video classification.
On the basis of the embodiments shown in Fig. 3 to Fig. 7, and referring to Fig. 8, video classification is described in detail as follows.
Step 801: obtain information in a video.
The information in the video includes image information, optical flow information and acoustic information.
Step 801 is the same as or similar to step 301 in the embodiment shown in Fig. 3; for details, see step 301, which is not repeated here.
Step 802: process the image information by using a first convolutional neural network to generate first information, extract first sub-information from the first information according to a preset rule, and process the first sub-information by using a first LSTM to generate fourth information, where the first information and the fourth information together constitute the first reference information.
After the first information is generated by using the first convolutional neural network, partial information in the first information (for example an intermediate result of the first convolutional neural network) is extracted, and the partial information is further processed by the first time recurrent neural network (LSTM).
Step 803: process the optical flow information by using a second convolutional neural network to generate second information, extract second sub-information from the second information according to a preset rule, and process the second sub-information by using a second LSTM to generate fifth information, where the second information and the fifth information together constitute the second reference information.
After the second information is generated by using the second convolutional neural network, partial information in the second information (for example an intermediate result of the second convolutional neural network) is extracted, and the partial information is further processed by the second time recurrent neural network (LSTM).
Step 804: process the acoustic information by using a third convolutional neural network to generate third information, where the third information is the third reference information.
It should be noted that steps 802 to 804 may be executed simultaneously or in a preset order, which is not specifically limited here. In addition, the first convolutional neural network, the second convolutional neural network and the third convolutional neural network may be identical or different, which can be decided according to actual requirements and is not specifically limited here. Likewise, the first LSTM and the second LSTM may be identical or different, depending on actual needs, which is not specifically limited here.
Step 805, the video is processed using the first information, the second information, the 3rd information, the 4th information and the 5th information, to obtain the confidence level matrix of the video;
Step 806, at least one video is processed using the first information, the second information, the 3rd information, the 4th information and the 5th information, to obtain the class relations matrix of the video;
Wherein, the at least one video is used for classifying the video; for details see step 406. It should be noted that step 805 and step 806 can be executed simultaneously or in a preset order, which is not specifically limited herein.
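As an illustration of step 805, the confidence level matrix can be thought of as the per-way class scores of one video stacked side by side. The sketch below assumes NumPy and a hypothetical layout of C classes by M ways; the text only states that each way contributes one still-untreated score vector per video.

```python
import numpy as np

def confidence_matrix(per_way_scores):
    """Stack the raw class-score vectors produced by each way for one video
    into a confidence level matrix S with one column per way.

    per_way_scores: list of M arrays, each of shape (C,), where C is the
    number of video classes and M the number of ways to be fused.
    Returns S of shape (C, M)."""
    return np.stack(per_way_scores, axis=1)

# Hypothetical example: C = 4 classes, M = 5 ways (frame CNN, frame LSTM,
# flow CNN, flow LSTM, audio CNN).
scores = [np.random.rand(4) for _ in range(5)]
S = confidence_matrix(scores)   # S.shape == (4, 5)
```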
Step 807, the confidence level matrix of the video and the class relations matrix of the video are substituted into the object function.
Wherein, the object function consists of the empirical loss L(S, Y; W), a class relations bound term weighted by λ1 and built from the class relations matrix V through the Frobenius norm, and a sparse bound term λ2||W||1; it indicates that the value of W at which the object function is minimum is solved for, and that value of W is the subject fusion parameter of the video. W is the fusion parameters in the object function; L(S, Y; W) is the empirical loss of the video in the processing procedure of the first information, the second information, the 3rd information, the 4th information and the 5th information; S is the confidence level matrix of the video; Y is the category of the video; V is the class relations matrix of the video; λ1 and λ2 are weight coefficients; and ||·||1 denotes the sparse regularization operator.
Wherein, after the confidence level matrix of the video and the class relations matrix of the video are substituted into the object function, the W obtained when the value of the object function is minimum is the subject fusion parameter of the video. The specific process of obtaining W can be solved by the near-end gradient algorithm (Proximal Gradient Method); the near-end gradient algorithm is one of the most commonly used optimization algorithms for large-scale data, and generally converges quickly and solves the optimization problem efficiently.
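A minimal sketch of such a proximal-gradient (ISTA-style) solver is given below. It assumes, purely for illustration, that the class relations bound term takes the squared Frobenius form λ1·||W − V||F² (the text only specifies that the term couples W to V through the Frobenius norm) and that the caller supplies the gradient of the empirical loss; the L1 term is handled by element-wise soft-thresholding.

```python
import numpy as np

def soft_threshold(X, tau):
    """Proximal operator of tau * ||X||_1 (element-wise soft-thresholding)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def proximal_gradient(grad_loss, V, W0, lam1, lam2, step=0.01, iters=500):
    """ISTA-style sketch for
        min_W  L(S, Y; W) + lam1 * ||W - V||_F**2 + lam2 * ||W||_1,
    where the squared Frobenius term stands in for the class relations
    constraint.  grad_loss(W) must return dL/dW of the empirical loss."""
    W = W0.copy()
    for _ in range(iters):
        g = grad_loss(W) + 2.0 * lam1 * (W - V)        # gradient of the smooth part
        W = soft_threshold(W - step * g, step * lam2)  # proximal step for the L1 part
    return W
```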
Unlike the prior art, the present invention does not directly combine the information in the video; instead, the corresponding first information, second information, 3rd information, 4th information and 5th information are generated from the information in the video by the deep neural network, the video is processed using the first information, the second information, the 3rd information, the 4th information and the 5th information, class relations are introduced as a regularization constraint during this processing so as to reduce the risk of over-fitting, and the video is then classified according to the processing result, thereby effectively improving the accuracy of visual classification.
On the basis of the embodiments illustrated in Fig. 3 to Fig. 8, and referring to Fig. 9, visual classification is described in detail as follows:
Step 901, the information in a video is obtained;
Wherein, the information in the video includes image information, Optic flow information and acoustic information;
Wherein, step 901 is the same as or similar to step 301 in the embodiment illustrated in Fig. 3; for details see step 301, which are not repeated here.
Step 902, the image information is processed using the first convolutional neural network to generate the first information, the first sub-information is extracted from the first information according to preset rules, the first sub-information is processed using the 1st LSTM to generate the 4th information, and the first information and the 4th information collectively constitute the first reference information;
Wherein, after the first information is generated using the first convolutional neural network, partial information in the first information (for example, intermediate results of the first convolutional neural network) is extracted, and the partial information is further processed with the first time recurrent neural network (LSTM).
Step 903, the Optic flow information is processed using the second convolutional neural network to generate the second information, and the second information is the second reference information;
Step 904, the acoustic information is processed using the 3rd convolutional neural network to generate the 3rd information, the 3rd sub-information is extracted from the 3rd information according to preset rules, the 3rd sub-information is processed using the 3rd LSTM to generate the 6th information, and the 3rd information and the 6th information collectively constitute the 3rd reference information;
Wherein, after the 3rd information is generated using the 3rd convolutional neural network, partial information in the 3rd information (for example, intermediate results of the 3rd convolutional neural network) is extracted, and the partial information is further processed with the 3rd time recurrent neural network (LSTM).
It should be noted that step 902 to step 904 can be executed simultaneously or in a preset order, which is not specifically limited herein. In addition, the first, second and 3rd convolutional neural networks can be identical or different, as decided by actual requirements, and this is not specifically limited herein. Likewise, the 1st LSTM and the 3rd LSTM can be identical or different, depending on actual needs, and this is not specifically limited herein.
Step 905, the video is processed using the first information, the second information, the 3rd information, the 4th information and the 6th information, to obtain the confidence level matrix of the video;
Step 906, at least one video is processed using the first information, the second information, the 3rd information, the 4th information and the 6th information, to obtain the class relations matrix of the video;
Wherein, the at least one video is used for classifying the video; for details see step 406. It should be noted that step 905 and step 906 can be executed simultaneously or in a preset order, which is not specifically limited herein.
Step 907, the confidence level matrix of the video and the class relations matrix of the video are substituted into the object function.
The object function consists of the empirical loss L(S, Y; W), a class relations bound term weighted by λ1 and built from the class relations matrix V through the Frobenius norm, and a sparse bound term λ2||W||1; it indicates that the value of W at which the object function is minimum is solved for, and that value of W is the subject fusion parameter of the video. W is the fusion parameters in the object function; L(S, Y; W) is the empirical loss of the video in the processing procedure of the first information, the second information, the 3rd information, the 4th information and the 6th information; S is the confidence level matrix of the video; Y is the category of the video; V is the class relations matrix of the video; λ1 and λ2 are weight coefficients; and ||·||1 denotes the sparse regularization operator.
Wherein, after the confidence level matrix of the video and the class relations matrix of the video are substituted into the object function, the W obtained when the value of the object function is minimum is the subject fusion parameter of the video. The specific process of obtaining W can be solved by the near-end gradient algorithm (Proximal Gradient Method); the near-end gradient algorithm is one of the most commonly used optimization algorithms for large-scale data, and generally converges quickly and solves the optimization problem efficiently.
Unlike the prior art, the present invention does not directly combine the information in the video; instead, the corresponding first information, second information, 3rd information, 4th information and 6th information are generated from the information in the video by the deep neural network, the video is processed using the first information, the second information, the 3rd information, the 4th information and the 6th information, class relations are introduced as a regularization constraint during this processing so as to reduce the risk of over-fitting, and the video is then classified according to the processing result, thereby effectively improving the accuracy of visual classification.
On the basis of the embodiments illustrated in Fig. 3 to Fig. 9, and referring to Figure 10, visual classification is described in detail as follows:
Step 1001, the information in a video is obtained;
Wherein, the information in the video includes image information, Optic flow information and acoustic information;
Wherein, step 1001 is the same as or similar to step 301 in the embodiment illustrated in Fig. 3; for details see step 301, which are not repeated here.
Step 1002, the image information is processed using the first convolutional neural network to generate the first information, and the first information is the first reference information;
Step 1003, the Optic flow information is processed using the second convolutional neural network to generate the second information, the second sub-information is extracted from the second information according to preset rules, the second sub-information is processed using the 2nd LSTM to generate the 5th information, and the second information and the 5th information collectively constitute the second reference information;
Wherein, after the second information is generated using the second convolutional neural network, partial information in the second information (for example, intermediate results of the second convolutional neural network) is extracted, and the partial information is further processed with the second time recurrent neural network (LSTM).
Step 1004, the acoustic information is processed using the 3rd convolutional neural network to generate the 3rd information, the 3rd sub-information is extracted from the 3rd information according to preset rules, the 3rd sub-information is processed using the 3rd LSTM to generate the 6th information, and the 3rd information and the 6th information collectively constitute the 3rd reference information;
Wherein, after the 3rd information is generated using the 3rd convolutional neural network, partial information in the 3rd information (for example, intermediate results of the 3rd convolutional neural network) is extracted, and the partial information is further processed with the 3rd time recurrent neural network (LSTM).
It should be noted that step 1002 to step 1004 can be executed simultaneously or in a preset order, which is not specifically limited herein. In addition, the first, second and 3rd convolutional neural networks can be identical or different, as decided by actual requirements, and this is not specifically limited herein. Likewise, the 2nd LSTM and the 3rd LSTM can be identical or different, depending on actual needs, and this is not specifically limited herein.
Step 1005, the video is processed using the first information, the second information, the 3rd information, the 5th information and the 6th information, to obtain the confidence level matrix of the video;
Step 1006, at least one video is processed using the first information, the second information, the 3rd information, the 5th information and the 6th information, to obtain the class relations matrix of the video;
Wherein, the at least one video is used for classifying the video; for details see step 406. It should be noted that step 1005 and step 1006 can be executed simultaneously or in a preset order, which is not specifically limited herein.
Step 1007, the confidence level matrix of the video and the class relations matrix of the video are substituted into the object function.
The object function consists of the empirical loss L(S, Y; W), a class relations bound term weighted by λ1 and built from the class relations matrix V through the Frobenius norm, and a sparse bound term λ2||W||1; it indicates that the value of W at which the object function is minimum is solved for, and that value of W is the subject fusion parameter of the video. W is the fusion parameters in the object function; L(S, Y; W) is the empirical loss of the video in the processing procedure of the first information, the second information, the 3rd information, the 5th information and the 6th information; S is the confidence level matrix of the video; Y is the category of the video; V is the class relations matrix of the video; λ1 and λ2 are weight coefficients; and ||·||1 denotes the sparse regularization operator.
Wherein, after the confidence level matrix of the video and the class relations matrix of the video are substituted into the object function, the W obtained when the value of the object function is minimum is the subject fusion parameter of the video. The specific process of obtaining W can be solved by the near-end gradient algorithm (Proximal Gradient Method); the near-end gradient algorithm is one of the most commonly used optimization algorithms for large-scale data, and generally converges quickly and solves the optimization problem efficiently.
Unlike the prior art, the present invention does not directly combine the information in the video; instead, the corresponding first information, second information, 3rd information, 5th information and 6th information are generated from the information in the video by the deep neural network, the video is processed using the first information, the second information, the 3rd information, the 5th information and the 6th information, class relations are introduced as a regularization constraint during this processing so as to reduce the risk of over-fitting, and the video is then classified according to the processing result, thereby effectively improving the accuracy of visual classification.
On the basis of the embodiments illustrated in Fig. 3 to Fig. 10, and referring to Figure 11, visual classification is described in detail as follows:
Step 1101, the information in a video is obtained;
Wherein, the information in the video includes image information, Optic flow information and acoustic information;
Wherein, step 1101 is the same as or similar to step 301 in the embodiment illustrated in Fig. 3; for details see step 301, which are not repeated here.
Step 1102, the image information is processed using the first convolutional neural network to generate the first information, the first sub-information is extracted from the first information according to preset rules, the first sub-information is processed using the 1st LSTM to generate the 4th information, and the first information and the 4th information collectively constitute the first reference information;
Wherein, after the first information is generated using the first convolutional neural network, partial information in the first information (for example, intermediate results of the first convolutional neural network) is extracted, and the partial information is further processed with the first time recurrent neural network (LSTM).
Step 1103, the Optic flow information is processed using the second convolutional neural network to generate the second information, the second sub-information is extracted from the second information according to preset rules, the second sub-information is processed using the 2nd LSTM to generate the 5th information, and the second information and the 5th information collectively constitute the second reference information;
Wherein, after the second information is generated using the second convolutional neural network, partial information in the second information (for example, intermediate results of the second convolutional neural network) is extracted, and the partial information is further processed with the second time recurrent neural network (LSTM).
Step 1104, the acoustic information is processed using the 3rd convolutional neural network to generate the 3rd information, the 3rd sub-information is extracted from the 3rd information according to preset rules, the 3rd sub-information is processed using the 3rd LSTM to generate the 6th information, and the 3rd information and the 6th information collectively constitute the 3rd reference information;
Wherein, after the 3rd information is generated using the 3rd convolutional neural network, partial information in the 3rd information (for example, intermediate results of the 3rd convolutional neural network) is extracted, and the partial information is further processed with the 3rd time recurrent neural network (LSTM).
It should be noted that step 1102 to step 1104 can be executed simultaneously or in a preset order, which is not specifically limited herein. In addition, the first, second and 3rd convolutional neural networks can be identical or different, as decided by actual requirements, and this is not specifically limited herein. Likewise, the 1st LSTM, the 2nd LSTM and the 3rd LSTM can be identical or different, depending on actual needs, and this is not specifically limited herein.
Step 1105, the video is processed using the first information, the second information, the 3rd information, the 4th information, the 5th information and the 6th information, to obtain the confidence level matrix of the video;
Step 1106, at least one video is processed using the first information, the second information, the 3rd information, the 4th information, the 5th information and the 6th information, to obtain the class relations matrix of the video;
Wherein, the at least one video is used for classifying the video; for details see step 406. It should be noted that step 1105 and step 1106 can be executed simultaneously or in a preset order, which is not specifically limited herein.
Step 1107, the confidence level matrix of the video and the class relations matrix of the video are substituted into the object function.
Wherein, the object function consists of the empirical loss L(S, Y; W), a class relations bound term weighted by λ1 and built from the class relations matrix V through the Frobenius norm, and a sparse bound term λ2||W||1; it indicates that the value of W at which the object function is minimum is solved for, and that value of W is the subject fusion parameter of the video. W is the fusion parameters in the object function; L(S, Y; W) is the empirical loss of the video in the processing procedure of the first information, the second information, the 3rd information, the 4th information, the 5th information and the 6th information; S is the confidence level matrix of the video; Y is the category of the video; V is the class relations matrix of the video; λ1 and λ2 are weight coefficients; and ||·||1 denotes the sparse regularization operator.
Wherein, after the confidence level matrix of the video and the class relations matrix of the video are substituted into the object function, the W obtained when the value of the object function is minimum is the subject fusion parameter of the video. The specific process of obtaining W can be solved by the near-end gradient algorithm (Proximal Gradient Method); the near-end gradient algorithm is one of the most commonly used optimization algorithms for large-scale data, and generally converges quickly and solves the optimization problem efficiently.
Unlike the prior art, the present invention does not directly combine the information in the video; instead, the corresponding first information, second information, 3rd information, 4th information, 5th information and 6th information are generated from the information in the video by the deep neural network, the video is processed using the first information, the second information, the 3rd information, the 4th information, the 5th information and the 6th information, class relations are introduced as a regularization constraint during this processing so as to reduce the risk of over-fitting, and the video is then classified according to the processing result, thereby effectively improving the accuracy of visual classification.
In practical applications, taking a preferred embodiment of the present invention as an example, as shown in Figure 12-a: the image information (i.e. frames), Optic flow information and acoustic information are extracted from the training videos, and the image information, Optic flow information and acoustic information are each modelled by a convolutional neural network, generating model 1, model 3 and model 5 respectively. In order to make use of the time sequence information in the image information and the Optic flow information, the present invention further models the image information and the Optic flow information with LSTM networks. In the training process, for example, the input of the "frame: LSTM network" is the convolutional feature extracted by the "frame: convolutional neural network" (i.e. the intermediate results of that convolutional neural network), and the input of the "light stream: LSTM network" is the convolutional feature extracted by the "light stream: convolutional neural network" (i.e. the intermediate results of that convolutional neural network). Through training, model 2 and model 4 are obtained.
After the multi-way models are obtained, a crucial problem is how to comprehensively utilize this multi-modal information to improve the classification results. Conventional late-fusion approaches often over-fit the training set, which makes the precision of the fusion very poor. The present invention therefore performs adaptive fusion for each category, so that a subject fusion parameter (i.e. the optimum fusion parameters) is fully learned for every class. Additionally, during fusion, the embodiment of the present invention uses the relations between categories as a regularization constraint to guide the learning of the classifier, giving the video classification model a higher discriminating power.
In order to reduce over-fitting, the embodiment of the present invention uses the class relations as a constraint on the weights to guide the fusion process. The object function to be optimized consists of the empirical loss L(S, Y; W), a class relations bound term weighted by λ1 and built from the class relations matrix V through the Frobenius norm, and a sparse bound term λ2||W||1; it indicates that the value of W at which the object function is minimum is solved for, and that value of W is the subject fusion parameter of the video.
Wherein, the first term L(S, Y; W) is the empirical loss, obtained by logistic regression, which measures the deviation between the predicted values and the actual values of all samples over the whole training dataset. S is the confidence level matrix of the sample videos in the training set; each sample contributes to S multiple still-untreated score vectors s_m ∈ R^C (m = 1 ... M), where C represents the number of video classes and M represents the number of ways that need to be fused. Y is the category of the multiple sample videos in the training set. W is the weight that maps S to Y and is the optimization target of this formula. In the present embodiment, W is 5 matrices, each representing the proportion coefficients of one way's classification results in the final video classification result.
The second term is a bound term carefully designed to make use of the relations between categories. V is the class relations matrix of the videos; it can be calculated in the training process of each way (in the present embodiment there are 5 ways in total, each model corresponding to one way) by counting, for each class, how many samples are correctly classified and how many samples are wrongly assigned to each of the other classes, and the results of the different ways are finally stitched together into V. Its dimension is identical with that of W; it is used for constraining W and guiding the learning of W. It should be noted that the Frobenius norm used here is the square root of the sum of the squares of all entries, so its square is smooth and differentiable;
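A minimal sketch of such a class relations matrix is shown below, assuming (as one possible reading of the description above) that each way's block is its row-normalized confusion matrix over the training videos and that the per-way blocks are stitched together side by side; the exact normalization and stitching layout are not spelled out in the text.

```python
import numpy as np

def class_relations_matrix(per_way_preds, labels, num_classes):
    """Build a class relations matrix V: one confusion-style block per way,
    counting how many samples of each class are classified correctly and how
    many are assigned to other classes, with the blocks stitched together.

    per_way_preds: list of M arrays of predicted class indices, shape (N,)
    labels:        array of true class indices, shape (N,)
    Returns V of shape (num_classes, M * num_classes)."""
    blocks = []
    for preds in per_way_preds:
        conf = np.zeros((num_classes, num_classes))
        for y_true, y_pred in zip(labels, preds):
            conf[y_true, y_pred] += 1.0
        conf /= np.maximum(conf.sum(axis=1, keepdims=True), 1.0)  # row-normalize
        blocks.append(conf)
    return np.hstack(blocks)
```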
The third term λ2||W||1 is the bound term that makes the weight coefficients sparse. The above formula strikes a balance between the loss term and the regularization bound terms, so that the subject fusion parameter is learned.
The above object function is solved by the near-end gradient algorithm (Proximal Gradient Method); the near-end gradient algorithm is one of the most commonly used optimization algorithms for large-scale data, and generally converges quickly and solves the optimization problem efficiently.
The subject fusion parameter is thus obtained by adaptive learning. The subject fusion parameter W is 5 matrices, each representing the proportion coefficients of one way's classification results in the final video classification result. The 5 ways of classification models constitute the final video classification model through adaptive fusion, and the training process ends.
With the video classification model and the subject fusion parameter obtained by the above process, a video can be classified, as shown in Figure 12-b. When classifying an input video, the image information, Optic flow information and acoustic information of the video are first extracted and passed through the video classification model to obtain one classification result per way. In Figure 12-b, each way's classification result S is the probability that the video belongs to each video class, represented as a one-dimensional vector. The subject fusion parameter obtained in training is then used to merge the multi-way classification results. The subject fusion parameter W obtained in training is 5 matrices, each representing the proportion coefficients of one way's classification results in the final video classification result. Each way's classification result is multiplied by its corresponding weight in W and the weighted results are added, giving the final visual classification result, so that the video is accurately classified. The final classification result of the video is therefore the sum over the M ways of the weighted per-way results, where M represents the number of ways of video modelling; M = 5 in the present embodiment.
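The late fusion step just described can be sketched as follows (NumPy assumed; the weights are treated as either a per-class vector or a full matrix per way, since the text only states that each way's result is multiplied by its corresponding weight and the results are added):

```python
import numpy as np

def fuse_predictions(per_way_scores, per_way_weights):
    """Late fusion at test time: each way's class-score vector is weighted by
    its learned fusion parameters and the weighted results are summed.

    per_way_scores:  list of M arrays, each of shape (C,)
    per_way_weights: list of M arrays, each of shape (C, C) or (C,)
    Returns the fused class-score vector of shape (C,)."""
    fused = np.zeros_like(per_way_scores[0])
    for s, w in zip(per_way_scores, per_way_weights):
        fused += w @ s if w.ndim == 2 else w * s
    return fused

# The predicted category is then the index of the largest fused score:
# category = int(np.argmax(fuse_predictions(scores, weights)))
```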
Specifically, the present invention is better than existing video classification methods in the following aspects:
Utilization of the multi-modal information of the video: the present invention fully exploits the appearance, transitory motion, audio and long-term time sequence information present in the video, so that, compared with the prior art, the description of the video is more complete.
Utilization of semantic relations: relative to existing fusion methods, the present invention introduces class relations during late fusion, so that the probability of over-fitting is greatly reduced.
Classification speed: once the off-line training of the invention is completed, testing can be done almost in real time, which is significantly better than the kernel-based support vector machine (English full name: Support Vector Machine, abbreviated: SVM).
In the embodiment of the present invention, the information in the video is not directly combined; instead, the corresponding reference information is generated from the information in the video by the deep neural network, the video is processed using the reference information, class relations are introduced as a regularization constraint during this processing so as to reduce the risk of over-fitting, and the video is then classified according to the processing result, thereby effectively improving the accuracy of visual classification.
To better implement the above related method of the embodiment of the present invention, a related apparatus for carrying out the method is also provided below.
Referring to Figure 13, one embodiment of a visual classification device 1300 in the embodiment of the present invention includes: an acquisition module 1301, a generation module 1302, a processing module 1303 and a computing module 1304.
The acquisition module 1301 is used for obtaining the information in a video, the information in the video including image information, Optic flow information and acoustic information;
The generation module 1302 is used for generating, using a deep neural network, the first reference information corresponding to the image information obtained by the acquisition module 1301, the second reference information corresponding to the Optic flow information obtained by the acquisition module 1301, and the 3rd reference information corresponding to the acoustic information obtained by the acquisition module 1301;
The processing module 1303 is used for processing the video according to the first reference information, the second reference information and the 3rd reference information generated by the generation module 1302, to obtain the confidence level matrix of the video and the class relations matrix of the video;
The computing module 1304 is used for substituting the confidence level matrix of the video and the class relations matrix of the video obtained by the processing module 1303 into the object function, to obtain the subject fusion parameter of the video, the subject fusion parameter being used for classifying the video, wherein the constraint factors of the object function include the confidence level matrix, the class relations matrix and the fusion parameters.
In the embodiment of the present invention, the information in the video is not directly combined; instead, the corresponding reference information is generated from the information in the video by the deep neural network, the video is processed using the reference information, class relations are introduced as a regularization constraint during this processing so as to reduce the risk of over-fitting, and the video is then classified according to the processing result, thereby effectively improving the accuracy of visual classification.
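The cooperation of the four modules can be pictured with the following minimal sketch (Python; the module callables and their return types are hypothetical and only illustrate the data flow of Figure 13):

```python
class VideoClassificationDevice:
    """Sketch of the apparatus of Figure 13: four cooperating modules."""
    def __init__(self, acquisition, generation, processing, computing):
        self.acquisition = acquisition   # module 1301: extracts image/flow/audio information
        self.generation = generation     # module 1302: produces the three reference informations
        self.processing = processing     # module 1303: builds confidence and class relations matrices
        self.computing = computing       # module 1304: substitutes them into the object function

    def learn_fusion_parameter(self, video):
        info = self.acquisition(video)
        refs = self.generation(info)
        S, V = self.processing(refs)
        return self.computing(S, V)      # the subject fusion parameter W
```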
Further, the generation module 1302 is specifically used for: processing, using the first convolutional neural network, the image information obtained by the acquisition module 1301 to generate the first information, the first information being the first reference information; processing, using the second convolutional neural network, the Optic flow information obtained by the acquisition module 1301 to generate the second information, the second information being the second reference information; and processing, using the 3rd convolutional neural network, the acoustic information obtained by the acquisition module 1301 to generate the 3rd information, the 3rd information being the 3rd reference information;
The processing module 1303 is specifically used for processing the video using the first information, the second information and the 3rd information generated by the generation module 1302, to obtain the confidence level matrix of the video; and processing at least one video using the first information, the second information and the 3rd information generated by the generation module 1302, to obtain the class relations matrix of the video, wherein the at least one video is used for classifying the video;
The computing module 1304 is specifically used for substituting the confidence level matrix of the video and the class relations matrix of the video obtained by the processing module 1303 into the object function, wherein the object function consists of the empirical loss L(S, Y; W), a class relations bound term weighted by λ1 and built from the class relations matrix V through the Frobenius norm, and a sparse bound term λ2||W||1; it indicates that the value of W at which the object function is minimum is solved for, and that value of W is the subject fusion parameter of the video; W is the fusion parameters in the object function; L(S, Y; W) is the empirical loss of the video in the processing procedure of the first information, the second information and the 3rd information; S is the confidence level matrix of the video; Y is the category of the video; V is the class relations matrix of the video; λ1 and λ2 are weight coefficients; and ||·||1 denotes the sparse regularization operator.
In the embodiment of the present invention, the information in the video is not directly combined; instead, the corresponding first information, second information and 3rd information are generated from the information in the video by the deep neural network, the video is processed using the first information, the second information and the 3rd information, class relations are introduced as a regularization constraint during this processing so as to reduce the risk of over-fitting, and the video is then classified according to the processing result, thereby effectively improving the accuracy of visual classification.
Further, the generation module 1302 is specifically used for: processing, using the first convolutional neural network, the image information obtained by the acquisition module 1301 to generate the first information, extracting the first sub-information from the first information according to preset rules, and processing the first sub-information using the first time recurrent neural network (LSTM) to generate the 4th information, the first information and the 4th information collectively constituting the first reference information; processing, using the second convolutional neural network, the Optic flow information obtained by the acquisition module 1301 to generate the second information, the second information being the second reference information; and processing, using the 3rd convolutional neural network, the acoustic information obtained by the acquisition module 1301 to generate the 3rd information, the 3rd information being the 3rd reference information;
The processing module 1303 is specifically used for processing the video using the first information, the second information, the 3rd information and the 4th information generated by the generation module 1302, to obtain the confidence level matrix of the video; and processing at least one video using the first information, the second information, the 3rd information and the 4th information generated by the generation module 1302, to obtain the class relations matrix of the video, wherein the at least one video is used for classifying the video;
The computing module 1304 is specifically used for substituting the confidence level matrix of the video and the class relations matrix of the video obtained by the processing module 1303 into the object function, wherein the object function consists of the empirical loss L(S, Y; W), a class relations bound term weighted by λ1 and built from the class relations matrix V through the Frobenius norm, and a sparse bound term λ2||W||1; it indicates that the value of W at which the object function is minimum is solved for, and that value of W is the subject fusion parameter of the video; W is the fusion parameters in the object function; L(S, Y; W) is the empirical loss of the video in the processing procedure of the first information, the second information, the 3rd information and the 4th information; S is the confidence level matrix of the video; Y is the category of the video; V is the class relations matrix of the video; λ1 and λ2 are weight coefficients; and ||·||1 denotes the sparse regularization operator.
Unlike the prior art, the present invention does not directly combine the information in the video; instead, the corresponding first information, second information, 3rd information and 4th information are generated from the information in the video by the deep neural network, the video is processed using the first information, the second information, the 3rd information and the 4th information, class relations are introduced as a regularization constraint during this processing so as to reduce the risk of over-fitting, and the video is then classified according to the processing result, thereby effectively improving the accuracy of visual classification.
Further, the generation module 1302 is specifically used for: processing, using the first convolutional neural network, the image information obtained by the acquisition module 1301 to generate the first information, the first information being the first reference information; processing, using the second convolutional neural network, the Optic flow information obtained by the acquisition module 1301 to generate the second information, extracting the second sub-information from the second information according to preset rules, and processing the second sub-information using the 2nd LSTM to generate the 5th information, the second information and the 5th information collectively constituting the second reference information; and processing, using the 3rd convolutional neural network, the acoustic information obtained by the acquisition module 1301 to generate the 3rd information, the 3rd information being the 3rd reference information;
The processing module 1303 is specifically used for processing the video using the first information, the second information, the 3rd information and the 5th information generated by the generation module 1302, to obtain the confidence level matrix of the video; and processing at least one video using the first information, the second information, the 3rd information and the 5th information generated by the generation module 1302, to obtain the class relations matrix of the video, wherein the at least one video is used for classifying the video;
The computing module 1304 is specifically used for substituting the confidence level matrix of the video and the class relations matrix of the video obtained by the processing module 1303 into the object function, wherein the object function consists of the empirical loss L(S, Y; W), a class relations bound term weighted by λ1 and built from the class relations matrix V through the Frobenius norm, and a sparse bound term λ2||W||1; it indicates that the value of W at which the object function is minimum is solved for, and that value of W is the subject fusion parameter of the video; W is the fusion parameters in the object function; L(S, Y; W) is the empirical loss of the video in the processing procedure of the first information, the second information, the 3rd information and the 5th information; S is the confidence level matrix of the video; Y is the category of the video; V is the class relations matrix of the video; λ1 and λ2 are weight coefficients; and ||·||1 denotes the sparse regularization operator.
Unlike the prior art, the present invention does not directly combine the information in the video; instead, the corresponding first information, second information, 3rd information and 5th information are generated from the information in the video by the deep neural network, the video is processed using the first information, the second information, the 3rd information and the 5th information, class relations are introduced as a regularization constraint during this processing so as to reduce the risk of over-fitting, and the video is then classified according to the processing result, thereby effectively improving the accuracy of visual classification.
Further, the generation module 1302 is specifically used for: processing, using the first convolutional neural network, the image information obtained by the acquisition module 1301 to generate the first information, the first information being the first reference information; processing, using the second convolutional neural network, the Optic flow information obtained by the acquisition module 1301 to generate the second information, the second information being the second reference information; and processing, using the 3rd convolutional neural network, the acoustic information obtained by the acquisition module 1301 to generate the 3rd information, extracting the 3rd sub-information from the 3rd information according to preset rules, and processing the 3rd sub-information using the 3rd LSTM to generate the 6th information, the 3rd information and the 6th information collectively constituting the 3rd reference information;
The processing module 1303 is specifically used for processing the video using the first information, the second information, the 3rd information and the 6th information generated by the generation module 1302, to obtain the confidence level matrix of the video; and processing at least one video using the first information, the second information, the 3rd information and the 6th information generated by the generation module 1302, to obtain the class relations matrix of the video, wherein the at least one video is used for classifying the video;
The computing module 1304 is specifically used for substituting the confidence level matrix of the video and the class relations matrix of the video obtained by the processing module 1303 into the object function, wherein the object function consists of the empirical loss L(S, Y; W), a class relations bound term weighted by λ1 and built from the class relations matrix V through the Frobenius norm, and a sparse bound term λ2||W||1; it indicates that the value of W at which the object function is minimum is solved for, and that value of W is the subject fusion parameter of the video; W is the fusion parameters in the object function; L(S, Y; W) is the empirical loss of the video in the processing procedure of the first information, the second information, the 3rd information and the 6th information; S is the confidence level matrix of the video; Y is the category of the video; V is the class relations matrix of the video; λ1 and λ2 are weight coefficients; and ||·||1 denotes the sparse regularization operator.
Unlike the prior art, the present invention does not directly combine the information in the video; instead, the corresponding first information, second information, 3rd information and 6th information are generated from the information in the video by the deep neural network, the video is processed using the first information, the second information, the 3rd information and the 6th information, class relations are introduced as a regularization constraint during this processing so as to reduce the risk of over-fitting, and the video is then classified according to the processing result, thereby effectively improving the accuracy of visual classification.
Further, the generation module 1302 is specifically used for: processing, using the first convolutional neural network, the image information obtained by the acquisition module 1301 to generate the first information, extracting the first sub-information from the first information according to preset rules, and processing the first sub-information using the 1st LSTM to generate the 4th information, the first information and the 4th information collectively constituting the first reference information; processing, using the second convolutional neural network, the Optic flow information obtained by the acquisition module 1301 to generate the second information, extracting the second sub-information from the second information according to preset rules, and processing the second sub-information using the 2nd LSTM to generate the 5th information, the second information and the 5th information collectively constituting the second reference information; and processing, using the 3rd convolutional neural network, the acoustic information obtained by the acquisition module 1301 to generate the 3rd information, the 3rd information being the 3rd reference information;
The processing module 1303 is specifically used for processing the video using the first information, the second information, the 3rd information, the 4th information and the 5th information generated by the generation module 1302, to obtain the confidence level matrix of the video; and processing at least one video using the first information, the second information, the 3rd information, the 4th information and the 5th information generated by the generation module 1302, to obtain the class relations matrix of the video, wherein the at least one video is used for classifying the video;
The computing module 1304 is specifically used for substituting the confidence level matrix of the video and the class relations matrix of the video obtained by the processing module 1303 into the object function, wherein the object function consists of the empirical loss L(S, Y; W), a class relations bound term weighted by λ1 and built from the class relations matrix V through the Frobenius norm, and a sparse bound term λ2||W||1; it indicates that the value of W at which the object function is minimum is solved for, and that value of W is the subject fusion parameter of the video; W is the fusion parameters in the object function; L(S, Y; W) is the empirical loss of the video in the processing procedure of the first information, the second information, the 3rd information, the 4th information and the 5th information; S is the confidence level matrix of the video; Y is the category of the video; V is the class relations matrix of the video; λ1 and λ2 are weight coefficients; and ||·||1 denotes the sparse regularization operator.
Unlike the prior art, the present invention does not directly combine the information in the video; instead, the corresponding first information, second information, 3rd information, 4th information and 5th information are generated from the information in the video by the deep neural network, the video is processed using the first information, the second information, the 3rd information, the 4th information and the 5th information, class relations are introduced as a regularization constraint during this processing so as to reduce the risk of over-fitting, and the video is then classified according to the processing result, thereby effectively improving the accuracy of visual classification.
Further, the generation module 1302 is specifically used for: processing, using the first convolutional neural network, the image information obtained by the acquisition module 1301 to generate the first information, extracting the first sub-information from the first information according to preset rules, and processing the first sub-information using the 1st LSTM to generate the 4th information, the first information and the 4th information collectively constituting the first reference information; processing, using the second convolutional neural network, the Optic flow information obtained by the acquisition module 1301 to generate the second information, the second information being the second reference information; and processing, using the 3rd convolutional neural network, the acoustic information obtained by the acquisition module 1301 to generate the 3rd information, extracting the 3rd sub-information from the 3rd information according to preset rules, and processing the 3rd sub-information using the 3rd LSTM to generate the 6th information, the 3rd information and the 6th information collectively constituting the 3rd reference information;
The processing module 1303 is specifically used for processing the video using the first information, the second information, the 3rd information, the 4th information and the 6th information generated by the generation module 1302, to obtain the confidence level matrix of the video; and processing at least one video using the first information, the second information, the 3rd information, the 4th information and the 6th information generated by the generation module 1302, to obtain the class relations matrix of the video, wherein the at least one video is used for classifying the video;
The computing module 1304 is specifically used for substituting the confidence level matrix of the video and the class relations matrix of the video obtained by the processing module 1303 into the object function, wherein the object function consists of the empirical loss L(S, Y; W), a class relations bound term weighted by λ1 and built from the class relations matrix V through the Frobenius norm, and a sparse bound term λ2||W||1; it indicates that the value of W at which the object function is minimum is solved for, and that value of W is the subject fusion parameter of the video; W is the fusion parameters in the object function; L(S, Y; W) is the empirical loss of the video in the processing procedure of the first information, the second information, the 3rd information, the 4th information and the 6th information; S is the confidence level matrix of the video; Y is the category of the video; V is the class relations matrix of the video; λ1 and λ2 are weight coefficients; and ||·||1 denotes the sparse regularization operator.
Unlike the prior art, the present invention does not directly combine the information in the video; instead, the corresponding first information, second information, 3rd information, 4th information and 6th information are generated from the information in the video by the deep neural network, the video is processed using the first information, the second information, the 3rd information, the 4th information and the 6th information, class relations are introduced as a regularization constraint during this processing so as to reduce the risk of over-fitting, and the video is then classified according to the processing result, thereby effectively improving the accuracy of visual classification.
Further, the generation module 1302 is specifically used for: processing, using the first convolutional neural network, the image information obtained by the acquisition module 1301 to generate the first information, the first information being the first reference information; processing, using the second convolutional neural network, the Optic flow information obtained by the acquisition module 1301 to generate the second information, extracting the second sub-information from the second information according to preset rules, and processing the second sub-information using the 2nd LSTM to generate the 5th information, the second information and the 5th information collectively constituting the second reference information; and processing, using the 3rd convolutional neural network, the acoustic information obtained by the acquisition module 1301 to generate the 3rd information, extracting the 3rd sub-information from the 3rd information according to preset rules, and processing the 3rd sub-information using the 3rd LSTM to generate the 6th information, the 3rd information and the 6th information collectively constituting the 3rd reference information;
The processing module 1303 is specifically used for processing the video using the first information, the second information, the 3rd information, the 5th information and the 6th information generated by the generation module 1302, to obtain the confidence level matrix of the video; and processing at least one video using the first information, the second information, the 3rd information, the 5th information and the 6th information generated by the generation module 1302, to obtain the class relations matrix of the video, wherein the at least one video is used for classifying the video;
The computing module 1304 is specifically used for substituting the confidence level matrix of the video and the class relations matrix of the video obtained by the processing module 1303 into the object function, wherein the object function consists of the empirical loss L(S, Y; W), a class relations bound term weighted by λ1 and built from the class relations matrix V through the Frobenius norm, and a sparse bound term λ2||W||1; it indicates that the value of W at which the object function is minimum is solved for, and that value of W is the subject fusion parameter of the video; W is the fusion parameters in the object function; L(S, Y; W) is the empirical loss of the video in the processing procedure of the first information, the second information, the 3rd information, the 5th information and the 6th information; S is the confidence level matrix of the video; Y is the category of the video; V is the class relations matrix of the video; λ1 and λ2 are weight coefficients; and ||·||1 denotes the sparse regularization operator.
Unlike the prior art, the present invention does not directly combine the information in the video; instead, the corresponding first information, second information, 3rd information, 5th information and 6th information are generated from the information in the video by the deep neural network, the video is processed using the first information, the second information, the 3rd information, the 5th information and the 6th information, class relations are introduced as a regularization constraint during this processing so as to reduce the risk of over-fitting, and the video is then classified according to the processing result, thereby effectively improving the accuracy of visual classification.
Further, the generation module 1302 is specifically used for: processing, using the first convolutional neural network, the image information obtained by the acquisition module 1301 to generate the first information, extracting the first sub-information from the first information according to preset rules, and processing the first sub-information using the 1st LSTM to generate the 4th information, the first information and the 4th information collectively constituting the first reference information; processing, using the second convolutional neural network, the Optic flow information obtained by the acquisition module 1301 to generate the second information, extracting the second sub-information from the second information according to preset rules, and processing the second sub-information using the 2nd LSTM to generate the 5th information, the second information and the 5th information collectively constituting the second reference information; and processing, using the 3rd convolutional neural network, the acoustic information obtained by the acquisition module 1301 to generate the 3rd information, extracting the 3rd sub-information from the 3rd information according to preset rules, and processing the 3rd sub-information using the 3rd LSTM to generate the 6th information, the 3rd information and the 6th information collectively constituting the 3rd reference information;
The processing module 1303 is specifically used for processing the video using the first information, the second information, the 3rd information, the 4th information, the 5th information and the 6th information generated by the generation module 1302, to obtain the confidence level matrix of the video; and processing at least one video using the first information, the second information, the 3rd information, the 4th information, the 5th information and the 6th information generated by the generation module 1302, to obtain the class relations matrix of the video, wherein the at least one video is used for classifying the video;
The computing module 1304 is specifically used for substituting the confidence level matrix of the video and the class relations matrix of the video obtained by the processing module 1303 into the object function, wherein the object function consists of the empirical loss L(S, Y; W), a class relations bound term weighted by λ1 and built from the class relations matrix V through the Frobenius norm, and a sparse bound term λ2||W||1; it indicates that the value of W at which the object function is minimum is solved for, and that value of W is the subject fusion parameter of the video; W is the fusion parameters in the object function; L(S, Y; W) is the empirical loss of the video in the processing procedure of the first information, the second information, the 3rd information, the 4th information, the 5th information and the 6th information; S is the confidence level matrix of the video; Y is the category of the video; V is the class relations matrix of the video; λ1 and λ2 are weight coefficients; and ||·||1 denotes the sparse regularization operator.
Unlike the prior art, the present invention does not directly combine the information in the video; instead, the corresponding first information, second information, 3rd information, 4th information, 5th information and 6th information are generated from the information in the video by the deep neural network, the video is processed using the first information, the second information, the 3rd information, the 4th information, the 5th information and the 6th information, class relations are introduced as a regularization constraint during this processing so as to reduce the risk of over-fitting, and the video is then classified according to the processing result, thereby effectively improving the accuracy of visual classification.
The embodiment shown in Figure 13 describes the concrete structure of the visual classification device from the angle of functional modules; the embodiment below, in conjunction with Figure 14, describes the concrete structure of the visual classification device from the hardware point of view. A visual classification device 1400 includes: a memory 1401, a processor 1402, a transmitter 1403 and a receiver 1404; wherein the memory 1401 is connected with the processor 1402, the processor 1402 is connected with the transmitter 1403, and the processor 1402 is connected with the receiver 1404;
By calling the operating instructions stored in the memory 1401, the processor 1402 is used for executing the following steps:
obtaining the information in a video, the information in the video including image information, optical flow information and acoustic information;

generating, using deep neural networks, first reference information corresponding to the image information, second reference information corresponding to the optical flow information and third reference information corresponding to the acoustic information;

processing the video according to the first reference information, the second reference information and the third reference information to obtain a confidence matrix of the video and a class-relation matrix of the video;

substituting the confidence matrix of the video and the class-relation matrix of the video into an objective function to obtain a target fusion parameter of the video, the target fusion parameter being used for classifying the video, wherein the constraint factors of the objective function include the confidence matrix, the class-relation matrix and the fusion parameter.
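For orientation, a minimal sketch of these four processor steps is given below. The modality extractors are random stand-ins (the concrete CNN and LSTM architectures are elaborated later in the text), and every name, shape and file path in the sketch is illustrative rather than taken from the patent.

    # Hedged outline of the four steps; extractors are placeholders, not the patent's networks.
    import numpy as np

    rng = np.random.default_rng(1)
    N_CLASSES = 5

    def get_video_information(video_path):
        # step 1: image, optical-flow and acoustic information (placeholder arrays)
        return {'image': rng.random((16, 64)),
                'flow': rng.random((16, 64)),
                'sound': rng.random((16, 64))}

    def generate_reference_information(info):
        # step 2: one vector of class scores per modality (stand-in for CNN/LSTM reference information)
        return np.stack([m.mean(axis=0)[:N_CLASSES] for m in info.values()])   # (3, N_CLASSES)

    def build_matrices(video_paths):
        # step 3: confidence matrix S over streams, class-relation matrix V over classes
        S = np.stack([generate_reference_information(get_video_information(p))
                      for p in video_paths])                   # (n_videos, streams, classes)
        class_scores = S.mean(axis=1)                          # (n_videos, classes)
        V = np.corrcoef(class_scores, rowvar=False)            # class co-occurrence proxy
        return S, V

    S, V = build_matrices(['video_%d.mp4' % i for i in range(8)])
    # step 4: substitute S and V (together with the labels Y) into the objective function
    # and solve for the fusion parameter W, as described in the following paragraphs.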
Wherein generating, using the deep neural networks, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:

processing the image information using a first convolutional neural network to generate first information, the first information being the first reference information;

processing the optical flow information using a second convolutional neural network to generate second information, the second information being the second reference information;

processing the acoustic information using a third convolutional neural network to generate third information, the third information being the third reference information.

Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class-relation matrix of the video comprises:

processing the video using the first information, the second information and the third information to obtain the confidence matrix of the video;

processing at least one video using the first information, the second information and the third information to obtain the class-relation matrix of the video, wherein the at least one video is used for classifying the video.

Substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function to obtain the target fusion parameter of the video comprises:

substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function, the objective function combining the empirical loss L(S, Y; W) with a Frobenius-norm regularization term weighted by λ1 and an L1 sparsity term weighted by λ2; the W value that minimizes the objective function is the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video in the processing of the first information, the second information and the third information, S is the confidence matrix of the video, Y is the category of the video, V is the class-relation matrix of the video, λ1 and λ2 are weight coefficients, ||·||F denotes the Frobenius norm, and ||·||1 denotes the sparse regularization operator.
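Under the same assumed form of the objective (a squared-error empirical loss over per-stream confidence scores and a ||W V||_F regularizer; both are assumptions, since the source gives the formula only as an image), a minimal numerical sketch of solving for the fusion parameter W could look as follows. All array shapes, the toy data and the helper name objective are illustrative.

    # Hedged sketch: assumes objective = L(S, Y; W) + lam1*||W V||_F^2 + lam2*||W||_1.
    import numpy as np
    from scipy.optimize import minimize

    def objective(w_flat, S, Y, V, lam1, lam2, n_streams, n_classes):
        W = w_flat.reshape(n_streams, n_classes)          # fusion parameter
        fused = np.einsum('nsc,sc->nc', S, W)             # fuse per-stream confidences
        emp_loss = np.mean((fused - Y) ** 2)              # empirical loss L(S, Y; W)
        frob = np.linalg.norm(W @ V, 'fro') ** 2          # class-relation regularizer (assumed form)
        sparse = np.abs(W).sum()                          # L1 sparsity term
        return emp_loss + lam1 * frob + lam2 * sparse

    rng = np.random.default_rng(0)
    n, n_streams, n_classes = 100, 6, 5                   # 6 information streams, 5 classes
    S = rng.random((n, n_streams, n_classes))             # per-stream confidence scores
    Y = np.eye(n_classes)[rng.integers(0, n_classes, n)]  # one-hot video categories
    V = np.corrcoef(rng.random((n_classes, 20)))          # toy class-relation matrix
    w0 = np.full(n_streams * n_classes, 1.0 / n_streams)
    res = minimize(objective, w0, args=(S, Y, V, 0.1, 0.01, n_streams, n_classes),
                   method='L-BFGS-B')                     # note: handles the non-smooth L1 term only approximately
    W_star = res.x.reshape(n_streams, n_classes)          # target fusion parameter

In practice a proximal method would treat the L1 term exactly; the sketch only illustrates how S, Y, V, λ1 and λ2 enter the objective.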
Wherein generating, using the deep neural networks, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:

processing the image information using a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information using a first time-recurrent neural network (LSTM) to generate fourth information, the first information and the fourth information together constituting the first reference information;

processing the optical flow information using a second convolutional neural network to generate second information, the second information being the second reference information;

processing the acoustic information using a third convolutional neural network to generate third information, the third information being the third reference information.

Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class-relation matrix of the video comprises:

processing the video using the first information, the second information, the third information and the fourth information to obtain the confidence matrix of the video;

processing at least one video using the first information, the second information, the third information and the fourth information to obtain the class-relation matrix of the video, wherein the at least one video is used for classifying the video.

Substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function to obtain the target fusion parameter of the video comprises:

substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function, the objective function combining the empirical loss L(S, Y; W) with a Frobenius-norm regularization term weighted by λ1 and an L1 sparsity term weighted by λ2; the W value that minimizes the objective function is the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video in the processing of the first information, the second information, the third information and the fourth information, S is the confidence matrix of the video, Y is the category of the video, V is the class-relation matrix of the video, λ1 and λ2 are weight coefficients, ||·||F denotes the Frobenius norm, and ||·||1 denotes the sparse regularization operator.
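As an illustration of the CNN-plus-LSTM branch just described, the sketch below scores sampled frames with a small convolutional network (the first information), selects a subset of frame features as the first sub-information under a stand-in preset rule (every other frame), and runs them through an LSTM whose final hidden state plays the role of the fourth information. The framework (PyTorch), the layer sizes and the subsampling rule are all assumptions, not the patent's specification.

    # Hedged sketch of one CNN + LSTM stream; sizes and the "preset rule" are illustrative.
    import torch
    import torch.nn as nn

    class FrameCNN(nn.Module):
        def __init__(self, feat_dim=32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                nn.Linear(8 * 4 * 4, feat_dim))
        def forward(self, frames):                     # frames: (T, 3, H, W)
            return self.net(frames)                    # (T, feat_dim) = "first information"

    cnn = FrameCNN()
    lstm = nn.LSTM(input_size=32, hidden_size=16, batch_first=True)

    frames = torch.randn(10, 3, 64, 64)                # 10 sampled RGB frames of one video
    first_info = cnn(frames)                           # per-frame CNN features
    first_sub_info = first_info[::2]                   # stand-in preset rule: keep every other frame
    _, (h_n, _) = lstm(first_sub_info.unsqueeze(0))    # temporal modelling of the sub-information
    fourth_info = h_n.squeeze(0).squeeze(0)            # (16,) = "fourth information"
    first_reference = torch.cat([first_info.mean(dim=0), fourth_info])  # joint first reference information

At training time the CNN and the LSTM would be learned jointly from labelled videos; the sketch only shows how the two outputs are concatenated into one reference information vector.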
Wherein generating, using the deep neural networks, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:

processing the image information using a first convolutional neural network to generate first information, the first information being the first reference information;

processing the optical flow information using a second convolutional neural network to generate second information, extracting second sub-information from the second information according to a preset rule, and processing the second sub-information using a second LSTM to generate fifth information, the second information and the fifth information together constituting the second reference information;

processing the acoustic information using a third convolutional neural network to generate third information, the third information being the third reference information.

Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class-relation matrix of the video comprises:

processing the video using the first information, the second information, the third information and the fifth information to obtain the confidence matrix of the video;

processing at least one video using the first information, the second information, the third information and the fifth information to obtain the class-relation matrix of the video, wherein the at least one video is used for classifying the video.

Substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function to obtain the target fusion parameter of the video comprises:

substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function, the objective function combining the empirical loss L(S, Y; W) with a Frobenius-norm regularization term weighted by λ1 and an L1 sparsity term weighted by λ2; the W value that minimizes the objective function is the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video in the processing of the first information, the second information, the third information and the fifth information, S is the confidence matrix of the video, Y is the category of the video, V is the class-relation matrix of the video, λ1 and λ2 are weight coefficients, ||·||F denotes the Frobenius norm, and ||·||1 denotes the sparse regularization operator.
Wherein generating, using the deep neural networks, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:

processing the image information using a first convolutional neural network to generate first information, the first information being the first reference information;

processing the optical flow information using a second convolutional neural network to generate second information, the second information being the second reference information;

processing the acoustic information using a third convolutional neural network to generate third information, extracting third sub-information from the third information according to a preset rule, and processing the third sub-information using a third LSTM to generate sixth information, the third information and the sixth information together constituting the third reference information.

Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class-relation matrix of the video comprises:

processing the video using the first information, the second information, the third information and the sixth information to obtain the confidence matrix of the video;

processing at least one video using the first information, the second information, the third information and the sixth information to obtain the class-relation matrix of the video, wherein the at least one video is used for classifying the video.

Substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function to obtain the target fusion parameter of the video comprises:

substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function, the objective function combining the empirical loss L(S, Y; W) with a Frobenius-norm regularization term weighted by λ1 and an L1 sparsity term weighted by λ2; the W value that minimizes the objective function is the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video in the processing of the first information, the second information, the third information and the sixth information, S is the confidence matrix of the video, Y is the category of the video, V is the class-relation matrix of the video, λ1 and λ2 are weight coefficients, ||·||F denotes the Frobenius norm, and ||·||1 denotes the sparse regularization operator.
Wherein generating, using the deep neural networks, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:

processing the image information using a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information using a first LSTM to generate fourth information, the first information and the fourth information together constituting the first reference information;

processing the optical flow information using a second convolutional neural network to generate second information, extracting second sub-information from the second information according to the preset rule, and processing the second sub-information using a second LSTM to generate fifth information, the second information and the fifth information together constituting the second reference information;

processing the acoustic information using a third convolutional neural network to generate third information, the third information being the third reference information.

Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class-relation matrix of the video comprises:

processing the video using the first information, the second information, the third information, the fourth information and the fifth information to obtain the confidence matrix of the video;

processing at least one video using the first information, the second information, the third information, the fourth information and the fifth information to obtain the class-relation matrix of the video, wherein the at least one video is used for classifying the video.

Substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function to obtain the target fusion parameter of the video comprises:

substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function, the objective function combining the empirical loss L(S, Y; W) with a Frobenius-norm regularization term weighted by λ1 and an L1 sparsity term weighted by λ2; the W value that minimizes the objective function is the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video in the processing of the first information, the second information, the third information, the fourth information and the fifth information, S is the confidence matrix of the video, Y is the category of the video, V is the class-relation matrix of the video, λ1 and λ2 are weight coefficients, ||·||F denotes the Frobenius norm, and ||·||1 denotes the sparse regularization operator.
Wherein generating, using the deep neural networks, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:

processing the image information using a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information using a first LSTM to generate fourth information, the first information and the fourth information together constituting the first reference information;

processing the optical flow information using a second convolutional neural network to generate second information, the second information being the second reference information;

processing the acoustic information using a third convolutional neural network to generate third information, extracting third sub-information from the third information according to the preset rule, and processing the third sub-information using a third LSTM to generate sixth information, the third information and the sixth information together constituting the third reference information.

Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class-relation matrix of the video comprises:

processing the video using the first information, the second information, the third information, the fourth information and the sixth information to obtain the confidence matrix of the video;

processing at least one video using the first information, the second information, the third information, the fourth information and the sixth information to obtain the class-relation matrix of the video, wherein the at least one video is used for classifying the video.

Substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function to obtain the target fusion parameter of the video comprises:

substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function, the objective function combining the empirical loss L(S, Y; W) with a Frobenius-norm regularization term weighted by λ1 and an L1 sparsity term weighted by λ2; the W value that minimizes the objective function is the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video in the processing of the first information, the second information, the third information, the fourth information and the sixth information, S is the confidence matrix of the video, Y is the category of the video, V is the class-relation matrix of the video, λ1 and λ2 are weight coefficients, ||·||F denotes the Frobenius norm, and ||·||1 denotes the sparse regularization operator.
Wherein generating, using the deep neural networks, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:

processing the image information using a first convolutional neural network to generate first information, the first information being the first reference information;

processing the optical flow information using a second convolutional neural network to generate second information, extracting second sub-information from the second information according to a preset rule, and processing the second sub-information using a second LSTM to generate fifth information, the second information and the fifth information together constituting the second reference information;

processing the acoustic information using a third convolutional neural network to generate third information, extracting third sub-information from the third information according to the preset rule, and processing the third sub-information using a third LSTM to generate sixth information, the third information and the sixth information together constituting the third reference information.

Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class-relation matrix of the video comprises:

processing the video using the first information, the second information, the third information, the fifth information and the sixth information to obtain the confidence matrix of the video;

processing at least one video using the first information, the second information, the third information, the fifth information and the sixth information to obtain the class-relation matrix of the video, wherein the at least one video is used for classifying the video.

Substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function to obtain the target fusion parameter of the video comprises:

substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function, the objective function combining the empirical loss L(S, Y; W) with a Frobenius-norm regularization term weighted by λ1 and an L1 sparsity term weighted by λ2; the W value that minimizes the objective function is the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video in the processing of the first information, the second information, the third information, the fifth information and the sixth information, S is the confidence matrix of the video, Y is the category of the video, V is the class-relation matrix of the video, λ1 and λ2 are weight coefficients, ||·||F denotes the Frobenius norm, and ||·||1 denotes the sparse regularization operator.
Wherein generating, using the deep neural networks, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:

processing the image information using a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information using a first LSTM to generate fourth information, the first information and the fourth information together constituting the first reference information;

processing the optical flow information using a second convolutional neural network to generate second information, extracting second sub-information from the second information according to the preset rule, and processing the second sub-information using a second LSTM to generate fifth information, the second information and the fifth information together constituting the second reference information;

processing the acoustic information using a third convolutional neural network to generate third information, extracting third sub-information from the third information according to the preset rule, and processing the third sub-information using a third LSTM to generate sixth information, the third information and the sixth information together constituting the third reference information.

Processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class-relation matrix of the video comprises:

processing the video using the first information, the second information, the third information, the fourth information, the fifth information and the sixth information to obtain the confidence matrix of the video;

processing at least one video using the first information, the second information, the third information, the fourth information, the fifth information and the sixth information to obtain the class-relation matrix of the video, wherein the at least one video is used for classifying the video.

Substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function to obtain the target fusion parameter of the video comprises:

substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function, the objective function combining the empirical loss L(S, Y; W) with a Frobenius-norm regularization term weighted by λ1 and an L1 sparsity term weighted by λ2; the W value that minimizes the objective function is the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video in the processing of the first information, the second information, the third information, the fourth information, the fifth information and the sixth information, S is the confidence matrix of the video, Y is the category of the video, V is the class-relation matrix of the video, λ1 and λ2 are weight coefficients, ||·||F denotes the Frobenius norm, and ||·||1 denotes the sparse regularization operator.
Therefore, the present invention does not directly combine the information in the video. Instead, the reference information corresponding to the information in the video is generated by deep neural networks, the video is then processed using that reference information, and the class relations introduced during that processing impose a regularization constraint that reduces the risk of over-fitting. The video is finally classified according to the processing result, which effectively improves the accuracy of video classification.
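To close the loop, a hedged sketch of how the learned target fusion parameter might be applied at classification time follows; the class names, the shapes and the simple weighted-sum fusion rule are illustrative assumptions, not the patent's prescribed procedure.

    # Hedged sketch: classify one new video with a previously solved fusion parameter W_star.
    import numpy as np

    def classify(stream_confidences, W_star, class_names):
        # stream_confidences: (n_streams, n_classes); W_star: (n_streams, n_classes)
        fused = (stream_confidences * W_star).sum(axis=0)   # weighted fusion over streams
        return class_names[int(np.argmax(fused))]

    class_names = ['basketball', 'swimming', 'cooking', 'concert', 'news']
    rng = np.random.default_rng(2)
    W_star = rng.random((6, 5))                              # stand-in for the solved fusion parameter
    new_video_scores = rng.random((6, 5))                    # six information streams, five classes
    print(classify(new_video_scores, W_star, class_names))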
In the above embodiments, the description of each embodiment has its own emphasis; for parts that are not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed system, device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division of the units is only a division by logical function, and other divisions are possible in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices or units, and may be electrical, mechanical or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a portable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

The above embodiments are only intended to describe the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some of the technical features therein, without departing in essence from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (27)
1. A video classification method, characterized by comprising:
obtaining the information in a video, the information in the video including image information, optical flow information and acoustic information;
generating, using deep neural networks, first reference information corresponding to the image information, second reference information corresponding to the optical flow information and third reference information corresponding to the acoustic information;
processing the video according to the first reference information, the second reference information and the third reference information to obtain a confidence matrix of the video and a class-relation matrix of the video;
substituting the confidence matrix of the video and the class-relation matrix of the video into an objective function to obtain a target fusion parameter of the video, the target fusion parameter being used for classifying the video, wherein the constraint factors of the objective function include the confidence matrix, the class-relation matrix and the fusion parameter.
2. The video classification method according to claim 1, characterized in that generating, using the deep neural networks, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:
processing the image information using a first convolutional neural network to generate first information, the first information being the first reference information;
processing the optical flow information using a second convolutional neural network to generate second information, the second information being the second reference information;
processing the acoustic information using a third convolutional neural network to generate third information, the third information being the third reference information;
processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class-relation matrix of the video comprises:
processing the video using the first information, the second information and the third information to obtain the confidence matrix of the video;
processing at least one video using the first information, the second information and the third information to obtain the class-relation matrix of the video, the at least one video being used for classifying the video;
and substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function to obtain the target fusion parameter of the video comprises:
substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function, the objective function combining the empirical loss L(S, Y; W) with a Frobenius-norm regularization term weighted by λ1 and an L1 sparsity term weighted by λ2, the W value that minimizes the objective function being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video in the processing of the first information, the second information and the third information, S is the confidence matrix of the video, Y is the category of the video, V is the class-relation matrix of the video, λ1 and λ2 are weight coefficients, ||·||F denotes the Frobenius norm, and ||·||1 denotes the sparse regularization operator.
3. The video classification method according to claim 1, characterized in that generating, using the deep neural networks, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:
processing the image information using a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information using a first time-recurrent neural network (LSTM) to generate fourth information, the first information and the fourth information together constituting the first reference information;
processing the optical flow information using a second convolutional neural network to generate second information, the second information being the second reference information;
processing the acoustic information using a third convolutional neural network to generate third information, the third information being the third reference information;
processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class-relation matrix of the video comprises:
processing the video using the first information, the second information, the third information and the fourth information to obtain the confidence matrix of the video;
processing at least one video using the first information, the second information, the third information and the fourth information to obtain the class-relation matrix of the video, the at least one video being used for classifying the video;
and substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function to obtain the target fusion parameter of the video comprises:
substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function, the objective function combining the empirical loss L(S, Y; W) with a Frobenius-norm regularization term weighted by λ1 and an L1 sparsity term weighted by λ2, the W value that minimizes the objective function being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video in the processing of the first information, the second information, the third information and the fourth information, S is the confidence matrix of the video, Y is the category of the video, V is the class-relation matrix of the video, λ1 and λ2 are weight coefficients, ||·||F denotes the Frobenius norm, and ||·||1 denotes the sparse regularization operator.
4. The video classification method according to claim 1, characterized in that generating, using the deep neural networks, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:
processing the image information using a first convolutional neural network to generate first information, the first information being the first reference information;
processing the optical flow information using a second convolutional neural network to generate second information, extracting second sub-information from the second information according to a preset rule, and processing the second sub-information using a second LSTM to generate fifth information, the second information and the fifth information together constituting the second reference information;
processing the acoustic information using a third convolutional neural network to generate third information, the third information being the third reference information;
processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class-relation matrix of the video comprises:
processing the video using the first information, the second information, the third information and the fifth information to obtain the confidence matrix of the video;
processing at least one video using the first information, the second information, the third information and the fifth information to obtain the class-relation matrix of the video, the at least one video being used for classifying the video;
and substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function to obtain the target fusion parameter of the video comprises:
substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function, the objective function combining the empirical loss L(S, Y; W) with a Frobenius-norm regularization term weighted by λ1 and an L1 sparsity term weighted by λ2, the W value that minimizes the objective function being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video in the processing of the first information, the second information, the third information and the fifth information, S is the confidence matrix of the video, Y is the category of the video, V is the class-relation matrix of the video, λ1 and λ2 are weight coefficients, ||·||F denotes the Frobenius norm, and ||·||1 denotes the sparse regularization operator.
5. The video classification method according to claim 1, characterized in that generating, using the deep neural networks, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:
processing the image information using a first convolutional neural network to generate first information, the first information being the first reference information;
processing the optical flow information using a second convolutional neural network to generate second information, the second information being the second reference information;
processing the acoustic information using a third convolutional neural network to generate third information, extracting third sub-information from the third information according to a preset rule, and processing the third sub-information using a third LSTM to generate sixth information, the third information and the sixth information together constituting the third reference information;
processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class-relation matrix of the video comprises:
processing the video using the first information, the second information, the third information and the sixth information to obtain the confidence matrix of the video;
processing at least one video using the first information, the second information, the third information and the sixth information to obtain the class-relation matrix of the video, the at least one video being used for classifying the video;
and substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function to obtain the target fusion parameter of the video comprises:
substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function, the objective function combining the empirical loss L(S, Y; W) with a Frobenius-norm regularization term weighted by λ1 and an L1 sparsity term weighted by λ2, the W value that minimizes the objective function being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video in the processing of the first information, the second information, the third information and the sixth information, S is the confidence matrix of the video, Y is the category of the video, V is the class-relation matrix of the video, λ1 and λ2 are weight coefficients, ||·||F denotes the Frobenius norm, and ||·||1 denotes the sparse regularization operator.
6. The video classification method according to claim 1, characterized in that generating, using the deep neural networks, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:
processing the image information using a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information using a first LSTM to generate fourth information, the first information and the fourth information together constituting the first reference information;
processing the optical flow information using a second convolutional neural network to generate second information, extracting second sub-information from the second information according to the preset rule, and processing the second sub-information using a second LSTM to generate fifth information, the second information and the fifth information together constituting the second reference information;
processing the acoustic information using a third convolutional neural network to generate third information, the third information being the third reference information;
processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class-relation matrix of the video comprises:
processing the video using the first information, the second information, the third information, the fourth information and the fifth information to obtain the confidence matrix of the video;
processing at least one video using the first information, the second information, the third information, the fourth information and the fifth information to obtain the class-relation matrix of the video, the at least one video being used for classifying the video;
and substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function to obtain the target fusion parameter of the video comprises:
substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function, the objective function combining the empirical loss L(S, Y; W) with a Frobenius-norm regularization term weighted by λ1 and an L1 sparsity term weighted by λ2, the W value that minimizes the objective function being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video in the processing of the first information, the second information, the third information, the fourth information and the fifth information, S is the confidence matrix of the video, Y is the category of the video, V is the class-relation matrix of the video, λ1 and λ2 are weight coefficients, ||·||F denotes the Frobenius norm, and ||·||1 denotes the sparse regularization operator.
7. The video classification method according to claim 1, characterized in that generating, using the deep neural networks, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:
processing the image information using a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information using a first LSTM to generate fourth information, the first information and the fourth information together constituting the first reference information;
processing the optical flow information using a second convolutional neural network to generate second information, the second information being the second reference information;
processing the acoustic information using a third convolutional neural network to generate third information, extracting third sub-information from the third information according to the preset rule, and processing the third sub-information using a third LSTM to generate sixth information, the third information and the sixth information together constituting the third reference information;
processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class-relation matrix of the video comprises:
processing the video using the first information, the second information, the third information, the fourth information and the sixth information to obtain the confidence matrix of the video;
processing at least one video using the first information, the second information, the third information, the fourth information and the sixth information to obtain the class-relation matrix of the video, the at least one video being used for classifying the video;
and substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function to obtain the target fusion parameter of the video comprises:
substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function, the objective function combining the empirical loss L(S, Y; W) with a Frobenius-norm regularization term weighted by λ1 and an L1 sparsity term weighted by λ2, the W value that minimizes the objective function being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video in the processing of the first information, the second information, the third information, the fourth information and the sixth information, S is the confidence matrix of the video, Y is the category of the video, V is the class-relation matrix of the video, λ1 and λ2 are weight coefficients, ||·||F denotes the Frobenius norm, and ||·||1 denotes the sparse regularization operator.
8. The video classification method according to claim 1, characterized in that generating, using the deep neural networks, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:
processing the image information using a first convolutional neural network to generate first information, the first information being the first reference information;
processing the optical flow information using a second convolutional neural network to generate second information, extracting second sub-information from the second information according to a preset rule, and processing the second sub-information using a second LSTM to generate fifth information, the second information and the fifth information together constituting the second reference information;
processing the acoustic information using a third convolutional neural network to generate third information, extracting third sub-information from the third information according to the preset rule, and processing the third sub-information using a third LSTM to generate sixth information, the third information and the sixth information together constituting the third reference information;
processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class-relation matrix of the video comprises:
processing the video using the first information, the second information, the third information, the fifth information and the sixth information to obtain the confidence matrix of the video;
processing at least one video using the first information, the second information, the third information, the fifth information and the sixth information to obtain the class-relation matrix of the video, the at least one video being used for classifying the video;
and substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function to obtain the target fusion parameter of the video comprises:
substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function, the objective function combining the empirical loss L(S, Y; W) with a Frobenius-norm regularization term weighted by λ1 and an L1 sparsity term weighted by λ2, the W value that minimizes the objective function being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video in the processing of the first information, the second information, the third information, the fifth information and the sixth information, S is the confidence matrix of the video, Y is the category of the video, V is the class-relation matrix of the video, λ1 and λ2 are weight coefficients, ||·||F denotes the Frobenius norm, and ||·||1 denotes the sparse regularization operator.
9. The video classification method according to claim 1, characterized in that generating, using the deep neural networks, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:
processing the image information using a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information using a first LSTM to generate fourth information, the first information and the fourth information together constituting the first reference information;
processing the optical flow information using a second convolutional neural network to generate second information, extracting second sub-information from the second information according to the preset rule, and processing the second sub-information using a second LSTM to generate fifth information, the second information and the fifth information together constituting the second reference information;
processing the acoustic information using a third convolutional neural network to generate third information, extracting third sub-information from the third information according to the preset rule, and processing the third sub-information using a third LSTM to generate sixth information, the third information and the sixth information together constituting the third reference information;
processing the video according to the first reference information, the second reference information and the third reference information to obtain the confidence matrix of the video and the class-relation matrix of the video comprises:
processing the video using the first information, the second information, the third information, the fourth information, the fifth information and the sixth information to obtain the confidence matrix of the video;
processing at least one video using the first information, the second information, the third information, the fourth information, the fifth information and the sixth information to obtain the class-relation matrix of the video, the at least one video being used for classifying the video;
and substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function to obtain the target fusion parameter of the video comprises:
substituting the confidence matrix of the video and the class-relation matrix of the video into the objective function, the objective function combining the empirical loss L(S, Y; W) with a Frobenius-norm regularization term weighted by λ1 and an L1 sparsity term weighted by λ2, the W value that minimizes the objective function being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video in the processing of the first information, the second information, the third information, the fourth information, the fifth information and the sixth information, S is the confidence matrix of the video, Y is the category of the video, V is the class-relation matrix of the video, λ1 and λ2 are weight coefficients, ||·||F denotes the Frobenius norm, and ||·||1 denotes the sparse regularization operator.
10. A video classification device, characterized by comprising:
an acquisition module, configured to obtain the information in a video, the information in the video including image information, optical flow information and acoustic information;
a generation module, configured to generate, using deep neural networks, first reference information corresponding to the image information obtained by the acquisition module, second reference information corresponding to the optical flow information obtained by the acquisition module and third reference information corresponding to the acoustic information obtained by the acquisition module;
a processing module, configured to process the video according to the first reference information, the second reference information and the third reference information generated by the generation module, to obtain a confidence matrix of the video and a class-relation matrix of the video;
a computing module, configured to substitute the confidence matrix of the video and the class-relation matrix of the video obtained by the processing module into an objective function, to obtain a target fusion parameter of the video, the target fusion parameter being used for classifying the video, wherein the constraint factors of the objective function include the confidence matrix, the class-relation matrix and the fusion parameter.
11. The video classification device according to claim 10, characterized in that:
the generation module is specifically configured to process the image information obtained by the acquisition module using a first convolutional neural network to generate first information, the first information being the first reference information; to process the optical flow information obtained by the acquisition module using a second convolutional neural network to generate second information, the second information being the second reference information; and to process the acoustic information obtained by the acquisition module using a third convolutional neural network to generate third information, the third information being the third reference information;
the processing module is specifically configured to process the video using the first information, the second information and the third information generated by the generation module, to obtain the confidence matrix of the video; and to process at least one video using the first information, the second information and the third information generated by the generation module, to obtain the class-relation matrix of the video, the at least one video being used for classifying the video;
the computing module is specifically configured to substitute the confidence matrix of the video and the class-relation matrix of the video obtained by the processing module into the objective function, the objective function combining the empirical loss L(S, Y; W) with a Frobenius-norm regularization term weighted by λ1 and an L1 sparsity term weighted by λ2, the W value that minimizes the objective function being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video in the processing of the first information, the second information and the third information, S is the confidence matrix of the video, Y is the category of the video, V is the class-relation matrix of the video, λ1 and λ2 are weight coefficients, ||·||F denotes the Frobenius norm, and ||·||1 denotes the sparse regularization operator.
12. The video classification device according to claim 10, characterized in that:
the generation module is specifically configured to: process the image information obtained by the acquisition module using a first convolutional neural network to generate first information, extract first sub-information from the first information according to a preset rule, and process the first sub-information using a first long short-term memory (LSTM) recurrent neural network to generate fourth information, the first information and the fourth information together constituting the first reference information; process the optical flow information obtained by the acquisition module using a second convolutional neural network to generate second information, the second information being the second reference information; and process the acoustic information obtained by the acquisition module using a third convolutional neural network to generate third information, the third information being the third reference information;
the processing module is specifically configured to: process the video according to the first information, the second information, the third information and the fourth information generated by the generation module, to obtain the confidence matrix of the video; and process at least one video according to the first information, the second information, the third information and the fourth information generated by the generation module, to obtain the class relation matrix of the video, the at least one video being used to classify the video;
the computing module is specifically configured to substitute the confidence matrix of the video and the class relation matrix of the video obtained by the processing module into an objective function and to solve for the value of W that minimizes the objective function, the resulting W being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video over the processing of the first information, the second information, the third information and the fourth information, S is the confidence matrix of the video, Y is the category label of the video, V is the class relation matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖F denotes the Frobenius norm, and ‖·‖1 denotes the sparse regularization operator (L1 norm).
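As a concrete illustration of the image branch described in claim 12, the following is a minimal PyTorch sketch with assumed architecture, layer sizes and "preset rule" (none of these choices come from the filing): a small CNN stands in for the first convolutional neural network, the preset rule is modeled as keeping every k-th per-frame feature, and an LSTM produces the fourth information; the first and fourth information are then combined into the first reference information.

```python
import torch
import torch.nn as nn

class ImageBranch(nn.Module):
    def __init__(self, feat_dim=128, lstm_dim=64, subsample_step=4):
        super().__init__()
        self.subsample_step = subsample_step
        # stand-in for the "first convolutional neural network"
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # stand-in for the "first LSTM"
        self.lstm = nn.LSTM(feat_dim, lstm_dim, batch_first=True)

    def forward(self, frames):                            # frames: (T, 3, H, W)
        first_info = self.cnn(frames)                     # per-frame "first information"
        first_sub = first_info[:: self.subsample_step]    # "preset rule": every k-th frame
        _, (h_n, _) = self.lstm(first_sub.unsqueeze(0))   # LSTM over the subsequence
        fourth_info = h_n[-1].squeeze(0)                  # "fourth information"
        # combine first and fourth information into the first reference information
        return torch.cat([first_info.mean(dim=0), fourth_info])

branch = ImageBranch()
video_frames = torch.randn(32, 3, 64, 64)                 # 32 RGB frames of one video
first_reference = branch(video_frames)
print(first_reference.shape)                               # torch.Size([192])
```

The optical flow and acoustic branches of the later claims follow the same pattern, only swapping the input modality and, where claimed, adding or omitting the LSTM stage.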
13. The video classification device according to claim 10, characterized in that:
the generation module is specifically configured to: process the image information obtained by the acquisition module using a first convolutional neural network to generate first information, the first information being the first reference information; process the optical flow information obtained by the acquisition module using a second convolutional neural network to generate second information, extract second sub-information from the second information according to a preset rule, and process the second sub-information using a second LSTM to generate fifth information, the second information and the fifth information together constituting the second reference information; and process the acoustic information obtained by the acquisition module using a third convolutional neural network to generate third information, the third information being the third reference information;
the processing module is specifically configured to: process the video according to the first information, the second information, the third information and the fifth information generated by the generation module, to obtain the confidence matrix of the video; and process at least one video according to the first information, the second information, the third information and the fifth information generated by the generation module, to obtain the class relation matrix of the video, the at least one video being used to classify the video;
the computing module is specifically configured to substitute the confidence matrix of the video and the class relation matrix of the video obtained by the processing module into an objective function and to solve for the value of W that minimizes the objective function, the resulting W being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video over the processing of the first information, the second information, the third information and the fifth information, S is the confidence matrix of the video, Y is the category label of the video, V is the class relation matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖F denotes the Frobenius norm, and ‖·‖1 denotes the sparse regularization operator (L1 norm).
14. The video classification device according to claim 10, characterized in that:
the generation module is configured to: process the image information obtained by the acquisition module using a first convolutional neural network to generate first information, the first information being the first reference information; process the optical flow information obtained by the acquisition module using a second convolutional neural network to generate second information, the second information being the second reference information; and process the acoustic information obtained by the acquisition module using a third convolutional neural network to generate third information, extract third sub-information from the third information according to a preset rule, and process the third sub-information using a third LSTM to generate sixth information, the third information and the sixth information together constituting the third reference information;
the processing module is specifically configured to: process the video according to the first information, the second information, the third information and the sixth information generated by the generation module, to obtain the confidence matrix of the video; and process at least one video according to the first information, the second information, the third information and the sixth information generated by the generation module, to obtain the class relation matrix of the video, the at least one video being used to classify the video;
the computing module is specifically configured to substitute the confidence matrix of the video and the class relation matrix of the video obtained by the processing module into an objective function and to solve for the value of W that minimizes the objective function, the resulting W being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video over the processing of the first information, the second information, the third information and the sixth information, S is the confidence matrix of the video, Y is the category label of the video, V is the class relation matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖F denotes the Frobenius norm, and ‖·‖1 denotes the sparse regularization operator (L1 norm).
15. The video classification device according to claim 10, characterized in that:
the generation module is specifically configured to: process the image information obtained by the acquisition module using a first convolutional neural network to generate first information, extract first sub-information from the first information according to a preset rule, and process the first sub-information using a first LSTM to generate fourth information, the first information and the fourth information together constituting the first reference information; process the optical flow information obtained by the acquisition module using a second convolutional neural network to generate second information, extract second sub-information from the second information according to a preset rule, and process the second sub-information using a second LSTM to generate fifth information, the second information and the fifth information together constituting the second reference information; and process the acoustic information obtained by the acquisition module using a third convolutional neural network to generate third information, the third information being the third reference information;
the processing module is specifically configured to: process the video according to the first information, the second information, the third information, the fourth information and the fifth information generated by the generation module, to obtain the confidence matrix of the video; and process at least one video according to the first information, the second information, the third information, the fourth information and the fifth information generated by the generation module, to obtain the class relation matrix of the video, the at least one video being used to classify the video;
the computing module is specifically configured to substitute the confidence matrix of the video and the class relation matrix of the video obtained by the processing module into an objective function and to solve for the value of W that minimizes the objective function, the resulting W being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video over the processing of the first information, the second information, the third information, the fourth information and the fifth information, S is the confidence matrix of the video, Y is the category label of the video, V is the class relation matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖F denotes the Frobenius norm, and ‖·‖1 denotes the sparse regularization operator (L1 norm).
16. The video classification device according to claim 10, characterized in that:
the generation module is specifically configured to: process the image information obtained by the acquisition module using a first convolutional neural network to generate first information, extract first sub-information from the first information according to a preset rule, and process the first sub-information using a first LSTM to generate fourth information, the first information and the fourth information together constituting the first reference information; process the optical flow information obtained by the acquisition module using a second convolutional neural network to generate second information, the second information being the second reference information; and process the acoustic information obtained by the acquisition module using a third convolutional neural network to generate third information, extract third sub-information from the third information according to a preset rule, and process the third sub-information using a third LSTM to generate sixth information, the third information and the sixth information together constituting the third reference information;
the processing module is specifically configured to: process the video according to the first information, the second information, the third information, the fourth information and the sixth information generated by the generation module, to obtain the confidence matrix of the video; and process at least one video according to the first information, the second information, the third information, the fourth information and the sixth information generated by the generation module, to obtain the class relation matrix of the video, the at least one video being used to classify the video;
the computing module is specifically configured to substitute the confidence matrix of the video and the class relation matrix of the video obtained by the processing module into an objective function and to solve for the value of W that minimizes the objective function, the resulting W being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video over the processing of the first information, the second information, the third information, the fourth information and the sixth information, S is the confidence matrix of the video, Y is the category label of the video, V is the class relation matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖F denotes the Frobenius norm, and ‖·‖1 denotes the sparse regularization operator (L1 norm).
17. The video classification device according to claim 10, characterized in that:
the generation module is specifically configured to: process the image information obtained by the acquisition module using a first convolutional neural network to generate first information, the first information being the first reference information; process the optical flow information obtained by the acquisition module using a second convolutional neural network to generate second information, extract second sub-information from the second information according to a preset rule, and process the second sub-information using a second LSTM to generate fifth information, the second information and the fifth information together constituting the second reference information; and process the acoustic information obtained by the acquisition module using a third convolutional neural network to generate third information, extract third sub-information from the third information according to a preset rule, and process the third sub-information using a third LSTM to generate sixth information, the third information and the sixth information together constituting the third reference information;
the processing module is specifically configured to: process the video according to the first information, the second information, the third information, the fifth information and the sixth information generated by the generation module, to obtain the confidence matrix of the video; and process at least one video according to the first information, the second information, the third information, the fifth information and the sixth information generated by the generation module, to obtain the class relation matrix of the video, the at least one video being used to classify the video;
the computing module is specifically configured to substitute the confidence matrix of the video and the class relation matrix of the video obtained by the processing module into an objective function and to solve for the value of W that minimizes the objective function, the resulting W being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video over the processing of the first information, the second information, the third information, the fifth information and the sixth information, S is the confidence matrix of the video, Y is the category label of the video, V is the class relation matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖F denotes the Frobenius norm, and ‖·‖1 denotes the sparse regularization operator (L1 norm).
18. The video classification device according to claim 10, characterized in that:
the generation module is specifically configured to: process the image information obtained by the acquisition module using a first convolutional neural network to generate first information, extract first sub-information from the first information according to a preset rule, and process the first sub-information using a first LSTM to generate fourth information, the first information and the fourth information together constituting the first reference information; process the optical flow information obtained by the acquisition module using a second convolutional neural network to generate second information, extract second sub-information from the second information according to a preset rule, and process the second sub-information using a second LSTM to generate fifth information, the second information and the fifth information together constituting the second reference information; and process the acoustic information obtained by the acquisition module using a third convolutional neural network to generate third information, extract third sub-information from the third information according to a preset rule, and process the third sub-information using a third LSTM to generate sixth information, the third information and the sixth information together constituting the third reference information;
the processing module is specifically configured to: process the video according to the first information, the second information, the third information, the fourth information, the fifth information and the sixth information generated by the generation module, to obtain the confidence matrix of the video; and process at least one video according to the first information, the second information, the third information, the fourth information, the fifth information and the sixth information generated by the generation module, to obtain the class relation matrix of the video, the at least one video being used to classify the video;
the computing module is specifically configured to substitute the confidence matrix of the video and the class relation matrix of the video obtained by the processing module into an objective function and to solve for the value of W that minimizes the objective function, the resulting W being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video over the processing of the first information, the second information, the third information, the fourth information, the fifth information and the sixth information, S is the confidence matrix of the video, Y is the category label of the video, V is the class relation matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖F denotes the Frobenius norm, and ‖·‖1 denotes the sparse regularization operator (L1 norm).
19. A video classification device, characterized by comprising a memory, a processor, a transmitter and a receiver, wherein the memory is connected to the processor, the processor is connected to the transmitter, and the processor is connected to the receiver;
by invoking operation instructions stored in the memory, the processor is configured to execute the following steps:
obtaining information in a video, the information in the video comprising image information, optical flow information and acoustic information;
generating, using a deep neural network, first reference information corresponding to the image information, second reference information corresponding to the optical flow information, and third reference information corresponding to the acoustic information;
processing the video according to the first reference information, the second reference information and the third reference information, to obtain a confidence matrix of the video and a class relation matrix of the video;
substituting the confidence matrix of the video and the class relation matrix of the video into an objective function, to obtain a target fusion parameter of the video, the target fusion parameter being used to classify the video, wherein the constraint factors of the objective function include the confidence matrix, the class relation matrix and the fusion parameter.
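The four steps claim 19 assigns to the processor can be pictured as the following Python skeleton. It is a structural sketch only: the feature extractors, per-modality classifiers and the placeholder solver below are all assumptions, not the filing's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 10

def extract_modalities(video):
    # Step 1 (placeholder): obtain image, optical-flow and acoustic information.
    return [rng.normal(size=64) for _ in range(3)]

def modality_scores(features):
    # Step 2 (placeholder): a per-modality "deep network", here a random
    # projection followed by a softmax, standing in for the reference information.
    logits = features @ rng.normal(size=(features.size, NUM_CLASSES))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def confidence_matrix(video):
    # Step 3: one row of per-class scores per modality (assumed layout of S).
    return np.stack([modality_scores(m) for m in extract_modalities(video)])

def solve_objective(S, y_onehot, V, lam1=0.1):
    # Step 4 (placeholder): a ridge-like closed form instead of the full
    # objective with the L1 term (a proximal-gradient solver is sketched
    # after claim 27).
    A = S @ S.T + lam1 * np.eye(S.shape[0])
    return np.linalg.solve(A, S @ y_onehot)

# toy usage
video = object()
S = confidence_matrix(video)
V = np.eye(NUM_CLASSES)   # placeholder class-relation matrix; one possible
                          # estimator is sketched after claim 20
y = np.eye(NUM_CLASSES)[3]               # assumed training label
W = solve_objective(S, y, V)
print("predicted class:", int(np.argmax(W @ S)))
```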
20. The video classification device according to claim 19, characterized in that generating, using the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:
processing the image information using a first convolutional neural network to generate first information, the first information being the first reference information;
processing the optical flow information using a second convolutional neural network to generate second information, the second information being the second reference information;
processing the acoustic information using a third convolutional neural network to generate third information, the third information being the third reference information;
processing the video according to the first reference information, the second reference information and the third reference information, to obtain the confidence matrix of the video and the class relation matrix of the video, comprises:
processing the video according to the first information, the second information and the third information, to obtain the confidence matrix of the video;
processing at least one video according to the first information, the second information and the third information, to obtain the class relation matrix of the video, the at least one video being used to classify the video;
and substituting the confidence matrix of the video and the class relation matrix of the video into the objective function, to obtain the target fusion parameter of the video, comprises:
substituting the confidence matrix of the video and the class relation matrix of the video into the objective function and solving for the value of W that minimizes the objective function, the resulting W being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video over the processing of the first information, the second information and the third information, S is the confidence matrix of the video, Y is the category label of the video, V is the class relation matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖F denotes the Frobenius norm, and ‖·‖1 denotes the sparse regularization operator (L1 norm).
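Claim 20 obtains the class relation matrix by processing at least one further video through the same first, second and third information. The claim does not spell out the estimator, so the snippet below is purely an assumption about what a "class relation" matrix could be: the correlation of per-class confidence scores across those reference videos.

```python
import numpy as np

def class_relation_matrix(reference_score_rows):
    """reference_score_rows: array of shape (num_videos, num_classes), each row the
    fused per-class confidence of one reference video.  Returns a class-by-class
    relation matrix V (here: the correlation of class scores across videos)."""
    scores = np.asarray(reference_score_rows, dtype=float)
    return np.corrcoef(scores.T)

# toy usage: 6 reference videos, 4 classes
rng = np.random.default_rng(1)
V = class_relation_matrix(rng.random((6, 4)))
print(V.shape)   # (4, 4)
```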
21. The video classification device according to claim 19, characterized in that generating, using the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:
processing the image information using a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information using a first long short-term memory (LSTM) recurrent neural network to generate fourth information, the first information and the fourth information together constituting the first reference information;
processing the optical flow information using a second convolutional neural network to generate second information, the second information being the second reference information;
processing the acoustic information using a third convolutional neural network to generate third information, the third information being the third reference information;
processing the video according to the first reference information, the second reference information and the third reference information, to obtain the confidence matrix of the video and the class relation matrix of the video, comprises:
processing the video according to the first information, the second information, the third information and the fourth information, to obtain the confidence matrix of the video;
processing at least one video according to the first information, the second information, the third information and the fourth information, to obtain the class relation matrix of the video, the at least one video being used to classify the video;
and substituting the confidence matrix of the video and the class relation matrix of the video into the objective function, to obtain the target fusion parameter of the video, comprises:
substituting the confidence matrix of the video and the class relation matrix of the video into the objective function and solving for the value of W that minimizes the objective function, the resulting W being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video over the processing of the first information, the second information, the third information and the fourth information, S is the confidence matrix of the video, Y is the category label of the video, V is the class relation matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖F denotes the Frobenius norm, and ‖·‖1 denotes the sparse regularization operator (L1 norm).
22. The video classification device according to claim 19, characterized in that generating, using the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:
processing the image information using a first convolutional neural network to generate first information, the first information being the first reference information;
processing the optical flow information using a second convolutional neural network to generate second information, extracting second sub-information from the second information according to a preset rule, and processing the second sub-information using a second LSTM to generate fifth information, the second information and the fifth information together constituting the second reference information;
processing the acoustic information using a third convolutional neural network to generate third information, the third information being the third reference information;
processing the video according to the first reference information, the second reference information and the third reference information, to obtain the confidence matrix of the video and the class relation matrix of the video, comprises:
processing the video according to the first information, the second information, the third information and the fifth information, to obtain the confidence matrix of the video;
processing at least one video according to the first information, the second information, the third information and the fifth information, to obtain the class relation matrix of the video, the at least one video being used to classify the video;
and substituting the confidence matrix of the video and the class relation matrix of the video into the objective function, to obtain the target fusion parameter of the video, comprises:
substituting the confidence matrix of the video and the class relation matrix of the video into the objective function and solving for the value of W that minimizes the objective function, the resulting W being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video over the processing of the first information, the second information, the third information and the fifth information, S is the confidence matrix of the video, Y is the category label of the video, V is the class relation matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖F denotes the Frobenius norm, and ‖·‖1 denotes the sparse regularization operator (L1 norm).
23. The video classification device according to claim 19, characterized in that generating, using the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:
processing the image information using a first convolutional neural network to generate first information, the first information being the first reference information;
processing the optical flow information using a second convolutional neural network to generate second information, the second information being the second reference information;
processing the acoustic information using a third convolutional neural network to generate third information, extracting third sub-information from the third information according to a preset rule, and processing the third sub-information using a third LSTM to generate sixth information, the third information and the sixth information together constituting the third reference information;
processing the video according to the first reference information, the second reference information and the third reference information, to obtain the confidence matrix of the video and the class relation matrix of the video, comprises:
processing the video according to the first information, the second information, the third information and the sixth information, to obtain the confidence matrix of the video;
processing at least one video according to the first information, the second information, the third information and the sixth information, to obtain the class relation matrix of the video, the at least one video being used to classify the video;
and substituting the confidence matrix of the video and the class relation matrix of the video into the objective function, to obtain the target fusion parameter of the video, comprises:
substituting the confidence matrix of the video and the class relation matrix of the video into the objective function and solving for the value of W that minimizes the objective function, the resulting W being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video over the processing of the first information, the second information, the third information and the sixth information, S is the confidence matrix of the video, Y is the category label of the video, V is the class relation matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖F denotes the Frobenius norm, and ‖·‖1 denotes the sparse regularization operator (L1 norm).
24. The video classification device according to claim 19, characterized in that generating, using the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:
processing the image information using a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information using a first LSTM to generate fourth information, the first information and the fourth information together constituting the first reference information;
processing the optical flow information using a second convolutional neural network to generate second information, extracting second sub-information from the second information according to a preset rule, and processing the second sub-information using a second LSTM to generate fifth information, the second information and the fifth information together constituting the second reference information;
processing the acoustic information using a third convolutional neural network to generate third information, the third information being the third reference information;
processing the video according to the first reference information, the second reference information and the third reference information, to obtain the confidence matrix of the video and the class relation matrix of the video, comprises:
processing the video according to the first information, the second information, the third information, the fourth information and the fifth information, to obtain the confidence matrix of the video;
processing at least one video according to the first information, the second information, the third information, the fourth information and the fifth information, to obtain the class relation matrix of the video, the at least one video being used to classify the video;
and substituting the confidence matrix of the video and the class relation matrix of the video into the objective function, to obtain the target fusion parameter of the video, comprises:
substituting the confidence matrix of the video and the class relation matrix of the video into the objective function and solving for the value of W that minimizes the objective function, the resulting W being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video over the processing of the first information, the second information, the third information, the fourth information and the fifth information, S is the confidence matrix of the video, Y is the category label of the video, V is the class relation matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖F denotes the Frobenius norm, and ‖·‖1 denotes the sparse regularization operator (L1 norm).
25. The video classification device according to claim 19, characterized in that generating, using the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:
processing the image information using a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information using a first LSTM to generate fourth information, the first information and the fourth information together constituting the first reference information;
processing the optical flow information using a second convolutional neural network to generate second information, the second information being the second reference information;
processing the acoustic information using a third convolutional neural network to generate third information, extracting third sub-information from the third information according to a preset rule, and processing the third sub-information using a third LSTM to generate sixth information, the third information and the sixth information together constituting the third reference information;
processing the video according to the first reference information, the second reference information and the third reference information, to obtain the confidence matrix of the video and the class relation matrix of the video, comprises:
processing the video according to the first information, the second information, the third information, the fourth information and the sixth information, to obtain the confidence matrix of the video;
processing at least one video according to the first information, the second information, the third information, the fourth information and the sixth information, to obtain the class relation matrix of the video, the at least one video being used to classify the video;
and substituting the confidence matrix of the video and the class relation matrix of the video into the objective function, to obtain the target fusion parameter of the video, comprises:
substituting the confidence matrix of the video and the class relation matrix of the video into the objective function and solving for the value of W that minimizes the objective function, the resulting W being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video over the processing of the first information, the second information, the third information, the fourth information and the sixth information, S is the confidence matrix of the video, Y is the category label of the video, V is the class relation matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖F denotes the Frobenius norm, and ‖·‖1 denotes the sparse regularization operator (L1 norm).
26. The video classification device according to claim 19, characterized in that generating, using the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:
processing the image information using a first convolutional neural network to generate first information, the first information being the first reference information;
processing the optical flow information using a second convolutional neural network to generate second information, extracting second sub-information from the second information according to a preset rule, and processing the second sub-information using a second LSTM to generate fifth information, the second information and the fifth information together constituting the second reference information;
processing the acoustic information using a third convolutional neural network to generate third information, extracting third sub-information from the third information according to a preset rule, and processing the third sub-information using a third LSTM to generate sixth information, the third information and the sixth information together constituting the third reference information;
processing the video according to the first reference information, the second reference information and the third reference information, to obtain the confidence matrix of the video and the class relation matrix of the video, comprises:
processing the video according to the first information, the second information, the third information, the fifth information and the sixth information, to obtain the confidence matrix of the video;
processing at least one video according to the first information, the second information, the third information, the fifth information and the sixth information, to obtain the class relation matrix of the video, the at least one video being used to classify the video;
and substituting the confidence matrix of the video and the class relation matrix of the video into the objective function, to obtain the target fusion parameter of the video, comprises:
substituting the confidence matrix of the video and the class relation matrix of the video into the objective function and solving for the value of W that minimizes the objective function, the resulting W being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video over the processing of the first information, the second information, the third information, the fifth information and the sixth information, S is the confidence matrix of the video, Y is the category label of the video, V is the class relation matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖F denotes the Frobenius norm, and ‖·‖1 denotes the sparse regularization operator (L1 norm).
27. The video classification device according to claim 19, characterized in that generating, using the deep neural network, the first reference information corresponding to the image information, the second reference information corresponding to the optical flow information and the third reference information corresponding to the acoustic information comprises:
processing the image information using a first convolutional neural network to generate first information, extracting first sub-information from the first information according to a preset rule, and processing the first sub-information using a first LSTM to generate fourth information, the first information and the fourth information together constituting the first reference information;
processing the optical flow information using a second convolutional neural network to generate second information, extracting second sub-information from the second information according to a preset rule, and processing the second sub-information using a second LSTM to generate fifth information, the second information and the fifth information together constituting the second reference information;
processing the acoustic information using a third convolutional neural network to generate third information, extracting third sub-information from the third information according to a preset rule, and processing the third sub-information using a third LSTM to generate sixth information, the third information and the sixth information together constituting the third reference information;
processing the video according to the first reference information, the second reference information and the third reference information, to obtain the confidence matrix of the video and the class relation matrix of the video, comprises:
processing the video according to the first information, the second information, the third information, the fourth information, the fifth information and the sixth information, to obtain the confidence matrix of the video;
processing at least one video according to the first information, the second information, the third information, the fourth information, the fifth information and the sixth information, to obtain the class relation matrix of the video, the at least one video being used to classify the video;
and substituting the confidence matrix of the video and the class relation matrix of the video into the objective function, to obtain the target fusion parameter of the video, comprises:
substituting the confidence matrix of the video and the class relation matrix of the video into the objective function and solving for the value of W that minimizes the objective function, the resulting W being the target fusion parameter of the video, wherein W is the fusion parameter in the objective function, L(S, Y; W) is the empirical loss of the video over the processing of the first information, the second information, the third information, the fourth information, the fifth information and the sixth information, S is the confidence matrix of the video, Y is the category label of the video, V is the class relation matrix of the video, λ1 and λ2 are weight coefficients, ‖·‖F denotes the Frobenius norm, and ‖·‖1 denotes the sparse regularization operator (L1 norm).
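Because the (assumed) objective combines a smooth part, the empirical loss plus the Frobenius-norm term, with a non-smooth L1 term, one standard way to minimize it is proximal gradient descent (ISTA), where the L1 term is handled by soft-thresholding. The sketch below uses a squared-error empirical loss over a single labeled video and the λ1‖WV‖²F regularizer from the reconstruction given after claim 11; both of those concrete choices, as well as the shape of W, are assumptions rather than the filing's formulation, and in practice the loss would sum over many labeled training videos.

```python
import numpy as np

def soft_threshold(x, t):
    # proximal operator of t * ||x||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fit_fusion_weights(S, y, V, lam1=0.1, lam2=0.01, step=1e-2, iters=500):
    """Minimise 0.5*||(W*S).sum(0) - y||^2 + lam1*||W V||_F^2 + lam2*||W||_1.
    S: (num_modalities, num_classes) confidence matrix, y: (num_classes,) one-hot
    label, V: (num_classes, num_classes) class-relation matrix.
    Returns W with one weight per (modality, class) pair."""
    W = np.zeros_like(S)
    for _ in range(iters):
        residual = (W * S).sum(axis=0) - y               # empirical-loss residual
        grad = S * residual + 2.0 * lam1 * W @ V @ V.T   # gradient of the smooth part
        W = soft_threshold(W - step * grad, step * lam2) # proximal (L1) step
    return W

# toy usage: 3 modalities, 4 classes
rng = np.random.default_rng(2)
S = rng.random((3, 4))
y = np.eye(4)[1]
V = np.corrcoef(rng.random((6, 4)).T)
W = fit_fusion_weights(S, y, V)
print("fused scores:", (W * S).sum(axis=0))
```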
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510559904.2A CN106503723A (en) | 2015-09-06 | 2015-09-06 | A kind of video classification methods and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510559904.2A CN106503723A (en) | 2015-09-06 | 2015-09-06 | A kind of video classification methods and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106503723A true CN106503723A (en) | 2017-03-15 |
Family
ID=58286604
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510559904.2A Pending CN106503723A (en) | 2015-09-06 | 2015-09-06 | A kind of video classification methods and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106503723A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107341462A (en) * | 2017-06-28 | 2017-11-10 | 电子科技大学 | A kind of video classification methods based on notice mechanism |
CN107480213A (en) * | 2017-07-27 | 2017-12-15 | 上海交通大学 | Community's detection and customer relationship Forecasting Methodology based on sequential text network |
CN107808122A (en) * | 2017-09-30 | 2018-03-16 | 中国科学院长春光学精密机械与物理研究所 | Method for tracking target and device |
CN108023876A (en) * | 2017-11-20 | 2018-05-11 | 西安电子科技大学 | Intrusion detection method and intruding detection system based on sustainability integrated study |
CN108446631A (en) * | 2018-03-20 | 2018-08-24 | 北京邮电大学 | The smart frequency spectrum figure analysis method of deep learning based on convolutional neural networks |
CN108762245A (en) * | 2018-03-20 | 2018-11-06 | 华为技术有限公司 | Data fusion method and relevant device |
WO2019052301A1 (en) * | 2017-09-15 | 2019-03-21 | 腾讯科技(深圳)有限公司 | Video classification method, information processing method and server |
CN109522450A (en) * | 2018-11-29 | 2019-03-26 | 腾讯科技(深圳)有限公司 | A kind of method and server of visual classification |
CN109726765A (en) * | 2019-01-02 | 2019-05-07 | 京东方科技集团股份有限公司 | A kind of sample extraction method and device of visual classification problem |
CN109740621A (en) * | 2018-11-20 | 2019-05-10 | 北京奇艺世纪科技有限公司 | A kind of video classification methods, device and equipment |
CN110060264A (en) * | 2019-04-30 | 2019-07-26 | 北京市商汤科技开发有限公司 | Neural network training method, video frame processing method, apparatus and system |
CN110942011A (en) * | 2019-11-18 | 2020-03-31 | 上海极链网络科技有限公司 | Video event identification method, system, electronic equipment and medium |
CN111209970A (en) * | 2020-01-08 | 2020-05-29 | Oppo(重庆)智能科技有限公司 | Video classification method and device, storage medium and server |
CN112287893A (en) * | 2020-11-25 | 2021-01-29 | 广东技术师范大学 | Sow lactation behavior identification method based on audio and video information fusion |
WO2022033231A1 (en) * | 2020-08-10 | 2022-02-17 | International Business Machines Corporation | Dual-modality relation networks for audio-visual event localization |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101894125A (en) * | 2010-05-13 | 2010-11-24 | 复旦大学 | Content-based video classification method |
CN103218608A (en) * | 2013-04-19 | 2013-07-24 | 中国科学院自动化研究所 | Network violent video identification method |
US8533134B1 (en) * | 2009-11-17 | 2013-09-10 | Google Inc. | Graph-based fusion for video classification |
CN103299324A (en) * | 2010-11-11 | 2013-09-11 | 谷歌公司 | Learning tags for video annotation using latent subtags |
CN104331442A (en) * | 2014-10-24 | 2015-02-04 | 华为技术有限公司 | Video classification method and device |
CN104881685A (en) * | 2015-05-27 | 2015-09-02 | 清华大学 | Video classification method based on shortcut depth nerve network |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8533134B1 (en) * | 2009-11-17 | 2013-09-10 | Google Inc. | Graph-based fusion for video classification |
CN101894125A (en) * | 2010-05-13 | 2010-11-24 | 复旦大学 | Content-based video classification method |
CN103299324A (en) * | 2010-11-11 | 2013-09-11 | 谷歌公司 | Learning tags for video annotation using latent subtags |
CN103218608A (en) * | 2013-04-19 | 2013-07-24 | 中国科学院自动化研究所 | Network violent video identification method |
CN104331442A (en) * | 2014-10-24 | 2015-02-04 | 华为技术有限公司 | Video classification method and device |
CN104881685A (en) * | 2015-05-27 | 2015-09-02 | 清华大学 | Video classification method based on shortcut depth nerve network |
Non-Patent Citations (2)
Title |
---|
Zuxuan Wu et al.: "Exploring Inter-feature and Inter-class Relationships with Deep Neural Network for Video Classification", ACM Multimedia 2014 *
Zuxuan Wu et al.: "Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification", arXiv:1504.01561v1 *
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107341462A (en) * | 2017-06-28 | 2017-11-10 | 电子科技大学 | A kind of video classification methods based on notice mechanism |
CN107480213A (en) * | 2017-07-27 | 2017-12-15 | 上海交通大学 | Community's detection and customer relationship Forecasting Methodology based on sequential text network |
CN107480213B (en) * | 2017-07-27 | 2021-12-24 | 上海交通大学 | Community detection and user relation prediction method based on time sequence text network |
CN109508584B (en) * | 2017-09-15 | 2022-12-02 | 腾讯科技(深圳)有限公司 | Video classification method, information processing method and server |
US10956748B2 (en) | 2017-09-15 | 2021-03-23 | Tencent Technology (Shenzhen) Company Limited | Video classification method, information processing method, and server |
WO2019052301A1 (en) * | 2017-09-15 | 2019-03-21 | 腾讯科技(深圳)有限公司 | Video classification method, information processing method and server |
CN109508584A (en) * | 2017-09-15 | 2019-03-22 | 腾讯科技(深圳)有限公司 | The method of visual classification, the method for information processing and server |
CN107808122A (en) * | 2017-09-30 | 2018-03-16 | 中国科学院长春光学精密机械与物理研究所 | Method for tracking target and device |
CN107808122B (en) * | 2017-09-30 | 2020-08-11 | 中国科学院长春光学精密机械与物理研究所 | Target tracking method and device |
CN108023876A (en) * | 2017-11-20 | 2018-05-11 | 西安电子科技大学 | Intrusion detection method and intruding detection system based on sustainability integrated study |
US11987250B2 (en) | 2018-03-20 | 2024-05-21 | Huawei Technologies Co., Ltd. | Data fusion method and related device |
CN108762245A (en) * | 2018-03-20 | 2018-11-06 | 华为技术有限公司 | Data fusion method and relevant device |
CN108446631A (en) * | 2018-03-20 | 2018-08-24 | 北京邮电大学 | The smart frequency spectrum figure analysis method of deep learning based on convolutional neural networks |
CN109740621A (en) * | 2018-11-20 | 2019-05-10 | 北京奇艺世纪科技有限公司 | A kind of video classification methods, device and equipment |
CN109740621B (en) * | 2018-11-20 | 2021-02-05 | 北京奇艺世纪科技有限公司 | Video classification method, device and equipment |
CN109522450A (en) * | 2018-11-29 | 2019-03-26 | 腾讯科技(深圳)有限公司 | A kind of method and server of visual classification |
US12106563B2 (en) | 2018-11-29 | 2024-10-01 | Tencent Technology (Shenzhen) Company Limited | Video classification method and server |
WO2020108396A1 (en) * | 2018-11-29 | 2020-06-04 | 腾讯科技(深圳)有限公司 | Video classification method, and server |
US11741711B2 (en) | 2018-11-29 | 2023-08-29 | Tencent Technology (Shenzhen) Company Limited | Video classification method and server |
US11210522B2 (en) | 2019-01-02 | 2021-12-28 | Boe Technology Group Co., Ltd. | Sample extraction method and device targeting video classification problem |
CN109726765A (en) * | 2019-01-02 | 2019-05-07 | 京东方科技集团股份有限公司 | A kind of sample extraction method and device of visual classification problem |
CN110060264A (en) * | 2019-04-30 | 2019-07-26 | 北京市商汤科技开发有限公司 | Neural network training method, video frame processing method, apparatus and system |
CN110060264B (en) * | 2019-04-30 | 2021-03-23 | 北京市商汤科技开发有限公司 | Neural network training method, video frame processing method, device and system |
CN110942011A (en) * | 2019-11-18 | 2020-03-31 | 上海极链网络科技有限公司 | Video event identification method, system, electronic equipment and medium |
CN110942011B (en) * | 2019-11-18 | 2021-02-02 | 上海极链网络科技有限公司 | Video event identification method, system, electronic equipment and medium |
CN111209970B (en) * | 2020-01-08 | 2023-04-25 | Oppo(重庆)智能科技有限公司 | Video classification method, device, storage medium and server |
CN111209970A (en) * | 2020-01-08 | 2020-05-29 | Oppo(重庆)智能科技有限公司 | Video classification method and device, storage medium and server |
WO2022033231A1 (en) * | 2020-08-10 | 2022-02-17 | International Business Machines Corporation | Dual-modality relation networks for audio-visual event localization |
US11663823B2 (en) | 2020-08-10 | 2023-05-30 | International Business Machines Corporation | Dual-modality relation networks for audio-visual event localization |
GB2613507A (en) * | 2020-08-10 | 2023-06-07 | Ibm | Dual-modality relation networks for audio-visual event localization |
CN112287893A (en) * | 2020-11-25 | 2021-01-29 | 广东技术师范大学 | Sow lactation behavior identification method based on audio and video information fusion |
CN112287893B (en) * | 2020-11-25 | 2023-07-18 | 广东技术师范大学 | Sow lactation behavior identification method based on audio and video information fusion |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106503723A (en) | A kind of video classification methods and device | |
CN110188635B (en) | Plant disease and insect pest identification method based on attention mechanism and multi-level convolution characteristics | |
CN109583322B (en) | Face recognition deep network training method and system | |
US20190228268A1 (en) | Method and system for cell image segmentation using multi-stage convolutional neural networks | |
Qi et al. | Tea chrysanthemum detection under unstructured environments using the TC-YOLO model | |
CN104331442A (en) | Video classification method and device | |
CN107251059A (en) | Sparse reasoning module for deep learning | |
CN106897738A (en) | A kind of pedestrian detection method based on semi-supervised learning | |
US11983917B2 (en) | Boosting AI identification learning | |
CN114912612A (en) | Bird identification method and device, computer equipment and storage medium | |
CN114998220B (en) | Tongue image detection and positioning method based on improved Tiny-YOLO v4 natural environment | |
CN112686056B (en) | Emotion classification method | |
CN107247952B (en) | Deep supervision-based visual saliency detection method for cyclic convolution neural network | |
CN106682702A (en) | Deep learning method and system | |
CN108073851A (en) | A kind of method, apparatus and electronic equipment for capturing gesture identification | |
CN109978074A (en) | Image aesthetic feeling and emotion joint classification method and system based on depth multi-task learning | |
Sharma et al. | Automatic identification of bird species using audio/video processing | |
Sharan et al. | Automated cnn based coral reef classification using image augmentation and deep learning | |
Terziyan et al. | Causality-aware convolutional neural networks for advanced image classification and generation | |
CN110059765A (en) | A kind of mineral intelligent recognition categorizing system and method | |
CN113870863A (en) | Voiceprint recognition method and device, storage medium and electronic equipment | |
Liu et al. | Research on multi-cluster green persimmon detection method based on improved Faster RCNN | |
Chauhan et al. | Plant diseases concept in smart agriculture using deep learning | |
CN111062484A (en) | Data set selection method and device based on multi-task learning | |
CN117371511A (en) | Training method, device, equipment and storage medium for image classification model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | |
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170315 |