CN106503723A

CN106503723A - A kind of video classification methods and device

Info

Publication number: CN106503723A
Application number: CN201510559904.2A
Authority: CN
Inventors: 姜育刚; 吴祖煊; 顾子晨
Original assignee: Fudan University; Huawei Technologies Co Ltd
Current assignee: Fudan University; Huawei Technologies Co Ltd
Priority date: 2015-09-06
Filing date: 2015-09-06
Publication date: 2017-03-15

Abstract

The embodiment of the invention discloses a kind of video classification methods, for solving the defect that accurately video can not be classified in prior art, improve the accuracy of visual classification.Present invention method includes：The information in video is obtained, the information in the video includes image information, Optic flow information and acoustic information；Corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information are generated using deep neural network；The video is processed according to first reference information, the second reference information and the 3rd reference information to obtain the class relations matrix of the confidence level matrix and the video of the video；The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, to obtain the subject fusion parameter of the video, the subject fusion parameter is used for classifying video.

Description

A kind of video classification methods and device

Technical field

The present invention relates to the communications field, and in particular to a kind of video classification methods and device.

Background technology

Visual classification refer to using video in visual information, auditory information and action message processed to video and analyzed, and judged and identified the action and event that occur in video.Visual classification can apply to a lot of practical problems, such as intelligent monitoring, video data management etc..

In visual classification, it is a kind of means of visual classification using feature early stage fusion, early stage fusion refers to the fusion of feature hierarchy.As depicted in figs. 1 and 2, the different characteristic in extraction video, such as characteristics of image, audio frequency characteristics, couple together the different characteristic that extracts, form assemblage characteristic.During training, by support vector machine (English full name：Support Vector Machine, abbreviation：SVM) or neutral net is trained to assemblage characteristic, generate the visual classification device for training.During visual classification, assemblage characteristic is extracted from video, the visual classification device for training is input into, is obtained the result of visual classification.

This video classification methods have problems in that, it is assumed that being simple complementary between the different characteristic of video, video can be indicated with these features.But as video is not each image, the simple combination of the mode such as sound may still be present between the mode such as image and sound and contacts.Therefore the feature that extracts not can completely express video content, video can not accurately be classified by the method.

Content of the invention

A kind of video classification methods and device is embodiments provided, for solving the defect that accurately video can not be classified in prior art, the accuracy of visual classification is improved.

First aspect present invention provides a kind of video classification methods, including：

The information in video is obtained, the information in the video includes image information, Optic flow information and acoustic information；

Corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information are generated using deep neural network；

The video is processed according to first reference information, the second reference information and the 3rd reference information to obtain the class relations matrix of the confidence level matrix and the video of the video；

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, to obtain the subject fusion parameter of the video, the subject fusion parameter is used for classifying video, wherein, the constraint factor of the object function includes confidence level matrix, class relations matrix and fusion parameters.

In conjunction with first aspect, in the first possible implementation, the utilization deep neural network generates corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information to be included：

The first information is generated to described image information processing using the first convolutional neural networks, and the first information is first reference information；

The second information is generated to Optic flow information process using the second convolutional neural networks, and second information is second reference information；

The 3rd information is generated to the acoustic information system using the 3rd convolutional neural networks, and the 3rd information is the 3rd reference information；

The described video process according to first reference information, the second reference information and the 3rd reference information is included with obtaining the class relations matrix of the confidence level matrix and the video of the video：

The video is processed by the first information, second information and the 3rd information, to obtain the confidence level matrix of the video；

At least one video is processed by the first information, second information and the 3rd information, to obtain the class relations matrix of the video, at least one video is used for classifying the video；

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, is included with obtaining the subject fusion parameter of the video：

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W it is) empirical loss of the video in the processing procedure of the first information, second information and the 3rd information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with first aspect, in second possible implementation, the utilization deep neural network generates corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information to be included：

The first information is generated to described image information processing using the first convolutional neural networks, the first sub-information is extracted from the first information according to preset rules, first sub-information is carried out processing the 4th information that generates using very first time recurrent neural network LSTM, the first information collectively constitutes first reference information with the 4th information；

The video is processed by the first information, second information, the 3rd information and the 4th information, to obtain the confidence level matrix of the video；

At least one video is processed by the first information, second information, the 3rd information and the 4th information, to obtain the class relations matrix of the video, at least one video is used for classifying the video；

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information and the 4th information, the S is the confidence level matrix of the video, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with first aspect, in the third possible implementation, the utilization deep neural network generates corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information to be included：

The second information is generated to Optic flow information process using the second convolutional neural networks, the second sub-information is extracted according to preset rules from second information, second sub-information is carried out processing the 5th information that generates using the 2nd LSTM, second information collectively constitutes second reference information with the 5th information；

The video is processed by the first information, second information, the 3rd information and the 5th information, to obtain the confidence level matrix of the video；

At least one video is processed by the first information, second information, the 3rd information and the 5th information, to obtain the class relations matrix of the video, at least one video is used for classifying the video；

By object function described in the class relations matrix substitution object function of the confidence level matrix and the video of the video it isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information and the 5th information, the S is the confidence level matrix of the video, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with first aspect, in the 4th kind of possible implementation, the utilization deep neural network generates corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information to be included：

The 3rd information is generated to the acoustic information system using the 3rd convolutional neural networks, the 3rd sub-information is extracted from the 3rd information according to preset rules, the 3rd sub-information is carried out processing the 6th information that generates using the 3rd LSTM, the 3rd information collectively constitutes the 3rd reference information with the 6th information；

The video is processed by the first information, second information, the 3rd information and the 6th information, to obtain the confidence level matrix of the video；

At least one video is processed by the first information, second information, the 3rd information and the 6th information, to obtain the class relations matrix of the video, at least one video is used for classifying the video；

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the subject fusion parameter isWherein, the object function is represented and is worked asValue minimum when, obtain subject fusion parameter, wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with first aspect, in the 5th kind of possible implementation, the utilization deep neural network generates corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information to be included：

The first information is generated to described image information processing using the first convolutional neural networks, the first sub-information is extracted from the first information according to preset rules, first sub-information is carried out processing the 4th information that generates using a LSTM, the first information collectively constitutes first reference information with the 4th information；

The second information is generated to Optic flow information process using the second convolutional neural networks, the second sub-information is extracted according to rule from second information, second sub-information is carried out processing the 5th information that generates using the 2nd LSTM, second information collectively constitutes second reference information with the 5th information；

The video is processed by the first information, second information, the 3rd information, the 4th information and the 5th information, to obtain the confidence level matrix of the video；

At least one video is processed by the first information, second information, the 3rd information, the 4th information and the 5th information, to obtain the class relations matrix of the video, at least one video is used for classifying the video；

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 4th information and the 5th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with first aspect, in the 6th kind of possible implementation, the utilization deep neural network generates corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information to be included：

The video is processed by the first information, second information, the 3rd information, the 4th information and the 6th information, to obtain the confidence level matrix of the video；

At least one video is processed by the first information, second information, the 3rd information, the 4th information and the 6th information, to obtain the class relations matrix of the video, at least one video is used for classifying the video；

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 4th information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with first aspect, in the 7th kind of possible implementation, the utilization deep neural network generates corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information to be included：

The video is processed by the first information, second information, the 3rd information, the 5th information and the 6th information, to obtain the confidence level matrix of the video；

At least one video is processed by the first information, second information, the 3rd information, the 5th information and the 6th information, to obtain the class relations matrix of the video, at least one video is used for classifying the video；

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the object function isWherein, the object function is represented and is worked asValue minimum when, obtain subject fusion parameter, wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 5th information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with first aspect, in the 8th kind of possible implementation, the utilization deep neural network generates corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information to be included：

The video is processed by the first information, second information, the 3rd information, the 4th information, the 5th information and the 6th information, to obtain the confidence level matrix of the video；

At least one video is processed by the first information, second information, the 3rd information, the 4th information, the 5th information and the 6th information, to obtain the class relations matrix of the video, at least one video is used for classifying the video；

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 4th information, the 5th information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

Second aspect present invention provides a kind of visual classification device, it is characterised in that include：

Acquisition module, for obtaining the information in video, the information in the video includes image information, Optic flow information and acoustic information；

Generation module, corresponding 3rd reference information of the acoustic information that corresponding second reference information of Optic flow information and the acquisition module that corresponding first reference information of the image information for generating the acquisition module acquisition using deep neural network, the acquisition module are obtained is obtained；

Processing module, the first reference information, the second reference information and the 3rd reference information for being generated according to the generation module are processed to obtain the class relations matrix of the confidence level matrix and the video of the video to the video；

Computing module, the confidence level matrix of video and the class relations matrix of the video for obtaining the processing module substitutes into object function, to obtain the subject fusion parameter of the video, the subject fusion parameter is used for classifying video, wherein, the constraint factor of the object function includes confidence level matrix, class relations matrix and fusion parameters.

In conjunction with second aspect, in the first possible implementation,

The generation module, the Image Information Processing specifically for being obtained to the acquisition module using the first convolutional neural networks generate the first information, and the first information is first reference information；Second information that generates is processed using the Optic flow information that the second convolutional neural networks are obtained to the acquisition module, second information is second reference information；The 3rd information is generated using the acoustic information system that the 3rd convolutional neural networks are obtained to the acquisition module, the 3rd information is the 3rd reference information；

The processing module, the first information, second information and the 3rd information specifically for being generated the video by the generation module are processed, to obtain the confidence level matrix of the video；The first information, second information and the 3rd information that at least one video is generated by the generation module is processed, and to obtain the class relations matrix of the video, at least one video is used for classifying the video；

The computing module, the confidence level matrix of video and the class relations matrix of the video specifically for obtaining the processing module substitute into object function, and the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W it is) empirical loss of the video in the processing procedure of the first information, second information and the 3rd information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with second aspect, in second possible implementation,

The generation module, Image Information Processing specifically for being obtained to the acquisition module using the first convolutional neural networks generates the first information, the first sub-information is extracted from the first information according to preset rules, first sub-information is carried out processing the 4th information that generates using very first time recurrent neural network LSTM, the first information collectively constitutes first reference information with the 4th information；Second information that generates is processed using the Optic flow information that the second convolutional neural networks are obtained to the acquisition module, second information is second reference information；The 3rd information is generated using the acoustic information system that the 3rd convolutional neural networks are obtained to the acquisition module, the 3rd information is the 3rd reference information；

The processing module, the first information, second information, the 3rd information and the 4th information specifically for being generated the video by the generation module are processed, to obtain the confidence level matrix of the video；The first information, second information, the 3rd information and the 4th information that at least one video is generated by the generation module is processed, to obtain the class relations matrix of the video, at least one video is used for classifying the video；

The computing module, the confidence level matrix of video and the class relations matrix of the video specifically for obtaining the processing module substitute into object function, and the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information and the 4th information, the S is the confidence level matrix of the video, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with second aspect, in the third possible implementation,

The generation module, the Image Information Processing specifically for being obtained to the acquisition module using the first convolutional neural networks generate the first information, and the first information is first reference information；Second information that generates is processed using the Optic flow information that the second convolutional neural networks are obtained to the acquisition module, the second sub-information is extracted according to preset rules from second information, second sub-information is carried out processing the 5th information that generates using the 2nd LSTM, second information collectively constitutes second reference information with the 5th information；The 3rd information is generated using the acoustic information system that the 3rd convolutional neural networks are obtained to the acquisition module, the 3rd information is the 3rd reference information；

The processing module, the first information, second information, the 3rd information and the 5th information specifically for being generated the video by the generation module are processed, to obtain the confidence level matrix of the video；The first information, second information, the 3rd information and the 5th information that at least one video is generated by the generation module is processed, to obtain the class relations matrix of the video, at least one video is used for classifying the video；

The computing module, the confidence level matrix of video and the class relations matrix of the video specifically for obtaining the processing module substitute into object function, and the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information and the 5th information, the S is the confidence level matrix of the video, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with second aspect, in the 4th kind of possible implementation,

The generation module, the Image Information Processing for being obtained to the acquisition module using the first convolutional neural networks generate the first information, and the first information is first reference information；Second information that generates is processed using the Optic flow information that the second convolutional neural networks are obtained to the acquisition module, second information is second reference information；The 3rd information is generated using the acoustic information system that the 3rd convolutional neural networks are obtained to the acquisition module, the 3rd sub-information is extracted from the 3rd information according to preset rules, the 3rd sub-information is carried out processing the 6th information that generates using the 3rd LSTM, the 3rd information collectively constitutes the 3rd reference information with the 6th information；

The processing module, the first information, second information, the 3rd information and the 6th information specifically for being generated the video by the generation module are processed, to obtain the confidence level matrix of the video；The first information, second information, the 3rd information and the 6th information that at least one video is generated by the generation module is processed, to obtain the class relations matrix of the video, at least one video is used for classifying the video；

The computing module, the confidence level matrix of video and the class relations matrix of the video specifically for obtaining the processing module substitute into object function, and the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with second aspect, in the 5th kind of possible implementation,

The generation module, Image Information Processing specifically for being obtained to the acquisition module using the first convolutional neural networks generates the first information, the first sub-information is extracted from the first information according to preset rules, first sub-information is carried out processing the 4th information that generates using a LSTM, the first information collectively constitutes first reference information with the 4th information；Second information that generates is processed using the Optic flow information that the second convolutional neural networks are obtained to the acquisition module, the second sub-information is extracted according to rule from second information, second sub-information is carried out processing the 5th information that generates using the 2nd LSTM, second information collectively constitutes second reference information with the 5th information；The 3rd information is generated using the acoustic information system that the 3rd convolutional neural networks are obtained to the acquisition module, the 3rd information is the 3rd reference information；

The processing module, the first information, second information, the 3rd information, the 4th information and the 5th information specifically for being generated the video by the generation module are processed, to obtain the confidence level matrix of the video；The first information, second information, the 3rd information, the 4th information and the 5th information that at least one video is generated by the generation module, processed to obtain the class relations matrix of the video, at least one video is used for classifying the video；

The computing module, the confidence level matrix of video and the class relations matrix of the video specifically for obtaining the processing module substitute into object function, and the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 4th information and the 5th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with second aspect, in the 6th kind of possible implementation,

The generation module, Image Information Processing specifically for being obtained to the acquisition module using the first convolutional neural networks generates the first information, the first sub-information is extracted from the first information according to preset rules, first sub-information is carried out processing the 4th information that generates using a LSTM, the first information collectively constitutes first reference information with the 4th information；Second information that generates is processed using the Optic flow information that the second convolutional neural networks are obtained to the acquisition module, second information is second reference information；The 3rd information is generated using the acoustic information system that the 3rd convolutional neural networks are obtained to the acquisition module, the 3rd sub-information is extracted from the 3rd information according to preset rules, the 3rd sub-information is carried out processing the 6th information that generates using the 3rd LSTM, the 3rd information collectively constitutes the 3rd reference information with the 6th information；

The processing module, the first information, second information, the 3rd information, the 4th information and the 6th information specifically for being generated the video by the generation module are processed, to obtain the confidence level matrix of the video；The first information, second information, the 3rd information, the 4th information and the 6th information that at least one video is generated by the generation module is processed, to obtain the class relations matrix of the video, at least one video is used for classifying the video；

The computing module, the confidence level matrix of video and the class relations matrix of the video specifically for obtaining the processing module substitute into object function, and the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 4th information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with second aspect, in the 7th kind of possible implementation,

The generation module, the Image Information Processing specifically for being obtained to the acquisition module using the first convolutional neural networks generate the first information, and the first information is first reference information；Second information that generates is processed using the Optic flow information that the second convolutional neural networks are obtained to the acquisition module, the second sub-information is extracted according to preset rules from second information, second sub-information is carried out processing the 5th information that generates using the 2nd LSTM, second information collectively constitutes second reference information with the 5th information；The 3rd information is generated using the acoustic information system that the 3rd convolutional neural networks are obtained to the acquisition module, the 3rd sub-information is extracted from the 3rd information according to preset rules, the 3rd sub-information is carried out processing the 6th information that generates using the 3rd LSTM, the 3rd information collectively constitutes the 3rd reference information with the 6th information；

The processing module, the first information, second information, the 3rd information, the 5th information and the 6th information specifically for being generated the video by the generation module are processed, to obtain the confidence level matrix of the video；The first information, second information, the 3rd information, the 5th information and the 6th information that at least one video is generated by the generation module is processed, to obtain the class relations matrix of the video, at least one video is used for classifying the video；

The computing module, the confidence level matrix of video and the class relations matrix of the video specifically for obtaining the processing module substitute into object function, and the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 5th information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with second aspect, in the 8th kind of possible implementation,

The generation module, Image Information Processing specifically for being obtained to the acquisition module using the first convolutional neural networks generates the first information, the first sub-information is extracted from the first information according to preset rules, first sub-information is carried out processing the 4th information that generates using a LSTM, the first information collectively constitutes first reference information with the 4th information；Second information that generates is processed using the Optic flow information that the second convolutional neural networks are obtained to the acquisition module, the second sub-information is extracted according to preset rules from second information, second sub-information is carried out processing the 5th information that generates using the 2nd LSTM, second information collectively constitutes second reference information with the 5th information；The 3rd information is generated using the acoustic information system that the 3rd convolutional neural networks are obtained to the acquisition module, the 3rd sub-information is extracted from the 3rd information according to preset rules, the 3rd sub-information is carried out processing the 6th information that generates using the 3rd LSTM, the 3rd information collectively constitutes the 3rd reference information with the 6th information；

The processing module, the first information, second information, the 3rd information, the 4th information, the 5th information and the 6th information specifically for being generated the video by the generation module is processed, to obtain the confidence level matrix of the video；The first information, second information, the 3rd information, the 4th information, the 5th information and the 6th information that at least one video is generated by the generation module is processed, to obtain the class relations matrix of the video, at least one video is used for classifying the video；

The computing module, the confidence level matrix of video and the class relations matrix of the video specifically for obtaining the processing module substitute into object function, and the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 4th information, the 5th information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

Third aspect present invention provides a kind of visual classification device, including：Memorizer, processor, emitter and receptor；Wherein, the memorizer is connected with the processor, and the processor is connected with the emitter, and the processor is connected with the receptor；

It is used for executing following steps by calling the operational order stored in the memorizer, the processor：

In conjunction with the third aspect, in the first possible implementation, the utilization deep neural network generates corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information to be included：

In conjunction with the third aspect, in second possible implementation, the utilization deep neural network generates corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information to be included：

In conjunction with the third aspect, in the third possible implementation, the utilization deep neural network generates corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information to be included：

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information and the 5th information, the S is the confidence level matrix of the video, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with the third aspect, in the 4th kind of possible implementation, the utilization deep neural network generates corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information to be included：

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with the third aspect, in the 5th kind of possible implementation, the utilization deep neural network generates corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information to be included：

In conjunction with the third aspect, in the 6th kind of possible implementation, the utilization deep neural network generates corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information to be included：

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the object functionWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 4th information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with the third aspect, in the 7th kind of possible implementation, the utilization deep neural network generates corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information to be included：

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 5th information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

In conjunction with the third aspect, in the 8th kind of possible implementation, the utilization deep neural network generates corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information to be included：

Above technical scheme is applied, the information in video is obtained, the information in the video includes image information, Optic flow information and acoustic information；Corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information are generated using deep neural network；The video is processed according to first reference information, the second reference information and the 3rd reference information to obtain the class relations matrix of the confidence level matrix and the video of the video；The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, to obtain the subject fusion parameter of the video, the subject fusion parameter is used for classifying video.It can be seen that, unlike the prior art, directly the information in video is not combined in the present invention, but the information generated according to deep neural network in the video distinguishes corresponding reference information, further video is processed using reference information, introducing class relations using reference information during processing to video carries out regularization constraint so as to reducing the risk of over-fitting, further the video is classified according to the result for processing, so as to effectively increase the accuracy of visual classification.

Description of the drawings

In order to be illustrated more clearly that the technical scheme of the embodiment of the present invention, accompanying drawing to be used needed for the embodiment of the present invention will be briefly described below, apparently, drawings in the following description are only some embodiments of the present invention, for those of ordinary skill in the art, on the premise of not paying creative work, can be with according to these other accompanying drawings of accompanying drawings acquisition.

Fig. 1 is one embodiment schematic diagram for extracting video information in prior art；

Fig. 2 is one embodiment schematic diagram of visual classification in prior art；

Fig. 3 is one embodiment schematic diagram of video classification methods in the embodiment of the present invention；

Fig. 4 is another embodiment schematic diagram of video classification methods in the embodiment of the present invention；

Fig. 5 is another embodiment schematic diagram of video classification methods in the embodiment of the present invention；

Fig. 6 is another embodiment schematic diagram of video classification methods in the embodiment of the present invention；

Fig. 7 is another embodiment schematic diagram of video classification methods in the embodiment of the present invention；

Fig. 8 is another embodiment schematic diagram of video classification methods in the embodiment of the present invention；

Fig. 9 is another embodiment schematic diagram of video classification methods in the embodiment of the present invention；

Figure 10 is another embodiment schematic diagram of video classification methods in the embodiment of the present invention；

Figure 11 is another embodiment schematic diagram of video classification methods in the embodiment of the present invention；

Figure 12-a are one embodiment schematic diagram of visual classification application scenarios in the embodiment of the present invention；

Figure 12-b are another embodiment schematic diagram of visual classification application scenarios in the embodiment of the present invention；

Figure 13 is a structural representation of visual classification device in the embodiment of the present invention；

Figure 14 is another structural representation of visual classification device in the embodiment of the present invention.

Specific embodiment

Accompanying drawing in below in conjunction with the embodiment of the present invention, to the embodiment of the present invention in technical scheme be clearly and completely described, it is clear that described embodiment is only a part of embodiment of the invention, rather than whole embodiment.Embodiment in based on the present invention, the acquired every other embodiment under the premise of creative work is not made of those skilled in the art, belongs to the scope of protection of the invention.

Term " first ", " second ", " the 3rd " " the 4th " in description and claims of this specification and above-mentioned accompanying drawing etc. are for distinguishing different objects, rather than for describing particular order.Additionally, term " comprising " and " having " and their any deformations, it is intended that cover non-exclusive including.Process, method, system, product or the equipment for for example containing series of steps or unit is not limited to the step of listing or unit, but alternatively also include the step of not listing or unit, or alternatively also include other steps intrinsic for these processes, method, product or equipment or unit.

Technical scheme, is applied to any terminal, and such as smart mobile phone, ipad and PC etc. are mainly used in computer, are not specifically limited herein.

Fig. 1 is referred to, one embodiment of video classification methods, mainly includes in the embodiment of the present invention：The information in video is obtained, the information in the video includes image information, Optic flow information and acoustic information；Corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information are generated using deep neural network；The video is processed according to first reference information, the second reference information and the 3rd reference information to obtain the class relations matrix of the confidence level matrix and the video of the video；The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, to obtain the subject fusion parameter of the video, the subject fusion parameter is used for classifying video, wherein, the constraint factor of the object function includes confidence level matrix, class relations matrix and fusion parameters.Detailed process is as follows：

301st, information in video is obtained；

Wherein, the information in the video includes image information, Optic flow information and acoustic information.Image information is specially the image information corresponding to each frame for constituting video；Optic flow information specifically by video in multiframe between the Optic flow information that extracted, wherein light stream be with regard to the ken in object of which movement detection in concept, for describing the motion of observed object, surface or edge caused by the motion relative to observer.In actual applications, the Optic flow information in video, the method that optical flow method is really further inferred to object translational speed and direction over time by the intensity of detection image pixel are detected by optical flow method；Acoustic information is to carry out sound spectrum information obtained from Fourier transformation to acoustical signal.

302nd, corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information are generated using deep neural network；

Wherein, further image information, Optic flow information and acoustic information are processed by deep neural network.Depth neural network is to simulate human brain to be analyzed the neutral net of study, and it imitates the mechanism of human brain has explaining data, such as image, sound and text etc., the deep neural network being related in the present invention：Convolutional neural networks (English full name Convolutional Neural Network, abbreviation：) or time recurrent neural network (English full name CNN：Long-Short Term Memory, abbreviation：LSTM) etc., more excellent result can be gone out to the information processing of video by convolutional neural networks or LSTM.Such as：Convolutional neural networks are a kind of feedforward neural networks, and its artificial neuron can respond the surrounding cells in a part of coverage, for large-scale image procossing has outstanding performance.Convolutional neural networks are made up of the full-mesh layer (corresponding classical neutral net) on one or more convolutional layers and top, while also including associated weights and pond layer (pooling layer).This structure enables convolutional neural networks using the two-dimensional structure of input data.Compared with other deep learning structures, convolutional neural networks can provide more excellent result in terms of image and speech recognition.This model can also be trained using back-propagation algorithm.Compare other depth, feedforward neural network, and convolutional neural networks need the parameter that estimates less, make a kind of deep learning structure for having much captivation.Due to unique design structure, LSTM is suitable for processing and interval in predicted time sequence and the critical event for postponing to grow very much LSTM.Generally than HMM (HMM) more preferably, used as nonlinear model, LSTM can be used for constructing larger deep neural network as complicated non-linear unit for the performance of LSTM.

303rd, the video is processed according to first reference information, the second reference information and the 3rd reference information to obtain the class relations matrix of the confidence level matrix and the video of the video；

Wherein, according to the first reference information that depth neural network is generated, second reference information and the 3rd reference information, further video is processed, so as to the class relations matrix for obtaining the confidence level matrix and video of video, so-called confidence level, also confidence level is, it refers to that particular individual treats the degree that particular proposition verity is believed, refers to the probability that population parameter value falls in a certain area of sample statistics value.Such as:Video is processed according to the first reference information, the probability in the sea of acquisition is 80%, the probability in blue sky is 20%.Wherein, classification is used for the variety classes for representing video, the video of such as cat, the video of Canis familiaris L..Class relations matrix is assigned to the percentage ratio of other classifications in order to represent each class training sample by mistake.Such as the video of multiple cats is processed, the probability of the cat of acquisition is 90%, and the probability of the Canis familiaris L. for obtaining is 10%, then it is 10% that the video of Canis familiaris L. is assigned to shared probability in the video of cat by mistake.

304th, the class relations matrix of the confidence level matrix and the video of the video is substituted into object function, to obtain the subject fusion parameter of the video.

The subject fusion parameter is used for classifying video, and wherein, the constraint factor of the object function includes confidence level matrix, class relations matrix and fusion parameters.

Wherein, the class relations matrix of the confidence level matrix and video of the video for obtaining is substituted into object function, when the value of object function is minimum, the subject fusion parameter of the video can be obtained, so as to accurately be classified to video.

In specific categorizing process, using subject fusion parameter to the first reference information, the second reference information, the classification results of the 3rd reference information are merged, so as to obtain the final classification result of video.

On the basis of embodiment illustrated in fig. 3, Fig. 4 is referred to, in detail visual classification is described, specific as follows：

Step 401, the information obtained in video；

Wherein, the information in the video includes image information, Optic flow information and acoustic information；

Wherein, step 401 is same or like with the step 301 in embodiment illustrated in fig. 3, specifically see step 301, and here is omitted.

Step 402, using the first convolutional neural networks to described image information processing generate the first information, the first information be first reference information；

Step 403, using the second convolutional neural networks to the Optic flow information process generate the second information, second information be second reference information；

Step 404, using the 3rd convolutional neural networks to the acoustic information system generate the 3rd information, the 3rd information be the 3rd reference information；

Wherein, the first information is the information after first convolutional neural networks are processed to graphical information, such as model 1, second information is the information after second convolutional neural networks are processed to Optic flow information, such as model 2,3rd information is the information after the 3rd convolutional neural networks are processed to acoustic information, such as model 3.Wherein, concrete processing procedure is to extract image information, and Optic flow information and acoustic information distinguish corresponding one-dimension information.

It should be noted that step 402 can be executed to step 404 simultaneously, it is also possible to execute according to default sequencing, be not specifically limited herein.In addition, the first convolutional neural networks, the second convolutional neural networks and the 3rd convolutional neural networks can be identical, it is also possible to different, can specifically be decided according to the actual requirements, be not specifically limited herein.

Step 405, the video is processed by the first information, second information and the 3rd information, to obtain the confidence level matrix of the video；

Wherein, video is processed by model 1, model 2 and model 3 respectively, accordingly, obtains the corresponding confidence level of the video respectively, and each confidence level is constituted the confidence level matrix of the video.For example, the probability that the video is judged as sea by the model 1 after being processed is 80%, and the probability in blue sky is 20%；The video is judged as the probability in sea by the model 2 after being processed be 70%, and the probability in blue sky is 30%；The video is judged as the probability in sea by the model 3 after being processed be 80%, and the probability in blue sky is 20%.

Step 406, at least one video is processed by the first information, second information and the 3rd information, to obtain the class relations matrix of the video；

Wherein, at least one video is used for classifying the video, such as at least one video is processed by model 1, model 2 and model 3, the model 1, the model 2 and the model 3 determine the class relations of the video after processing, class relations matrix is further combined, for example：Model 1 is processed to the video of multiple cats, is judged to that the probability of cat is 90%, and is judged to that the probability of Canis familiaris L. is 10%, then it is 10% that the video of Canis familiaris L. is assigned to shared probability in the video of cat by mistake；Model 2 is processed to the video of multiple cats, is judged to that the probability of cat is 80%, and is judged to that the probability of Canis familiaris L. is 20%, then it is 20% that the video of Canis familiaris L. is assigned to shared probability in the video of cat by mistake；Model 3 is processed to the video of multiple cats, is judged to that the probability of cat is 75%, and is judged to that the probability of Canis familiaris L. is 25%, then it is 25% that the video of Canis familiaris L. is assigned to shared probability in the video of cat by mistake.

It should be noted that step 405 and step 406 can be executed simultaneously, it is also possible to execute according to default sequencing, be not specifically limited herein.

Step 407, the class relations matrix of the confidence level matrix and the video of the video is substituted into object function.

Wherein, object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and for being classified to video wherein, W is the fusion parameters in the object function to the subject fusion parameter, L (S, the Y；W it is) empirical loss of the video in the processing procedure of the first information, second information and the 3rd information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

Wherein, the class relations matrix of the confidence level matrix and the video of the video is substituted into object function, when the value of the object function is minimum, the W of acquisition is the subject fusion parameter of the video.The specific process for obtaining W can be solved by near-end gradient algorithm (Proximal Gradient Method), near-end gradient algorithm is to solve for the most commonly used optimized algorithm during large-scale data, generally can be compared with rapid convergence, efficiently solving-optimizing problem.

In the embodiment of the present invention, directly the information in video is not combined, but the information generated according to deep neural network in the video distinguishes the corresponding first information, second information and the 3rd information, further using the first information, second information and the 3rd information are processed to video, using the first information, second information and the 3rd information introduce class relations during processing to video and carry out regularization constraint so as to reducing the risk of over-fitting, further the video is classified according to the result for processing, so as to effectively increase the accuracy of visual classification.

On the basis of Fig. 3 to embodiment illustrated in fig. 4, Fig. 5 is referred to, in detail visual classification is described, specific as follows：

Step 501, the information obtained in video；

Wherein, step 501 is same or like with the step 301 in embodiment illustrated in fig. 3, specifically see step 301, and here is omitted.

Step 502, using the first convolutional neural networks to described image information processing generate the first information, the first sub-information is extracted from the first information according to preset rules, first sub-information is carried out processing the 4th information that generates using very first time recurrent neural network LSTM, the first information collectively constitutes first reference information with the 4th information；

Wherein, generated after the first information (such as model 1) using the first convolutional neural networks, such as, extract the partial information (intermediate results in such as the first convolutional neural networks) in the first information again, the partial information is carried out processing further with very first time recurrent neural network LSTM and generate the 4th information (such as model 4).

It should be noted that the preset rules are the rule for setting in advance, in the such as output first information, default partial information, is not specifically limited depending on specific preset rules can be according to practical application herein as the input information in a LSTM.

Step 503, using the second convolutional neural networks to the Optic flow information process generate the second information, second information be second reference information；

Step 504, using the 3rd convolutional neural networks to the acoustic information system generate the 3rd information, the 3rd information be the 3rd reference information；

Wherein, the first information is the information after first convolutional neural networks are processed to graphical information, such as model 1, second information is the information after second convolutional neural networks are processed to Optic flow information, such as model 2,3rd information is the information after the 3rd convolutional neural networks are processed to acoustic information, such as model 3.Wherein, concrete processing procedure is to extract image information, and Optic flow information and acoustic information distinguish corresponding one-dimension information.4th information is the corresponding two-dimensional signal of the image information processed by a LSTM, such as model 4.

It should be noted that step 502 can be executed to step 504 simultaneously, it is also possible to execute according to default sequencing, be not specifically limited herein.In addition, the first convolutional neural networks, the second convolutional neural networks and the 3rd convolutional neural networks can be identical, it is also possible to different, can specifically be decided according to the actual requirements, be not specifically limited herein.

Step 505, the video is processed by the first information, second information, the 3rd information and the 4th information, to obtain the confidence level matrix of the video；

Wherein, by video respectively by model 1, model 2, model 3 and model 4 are processed, accordingly, by model 1, model 2, model 3 and model 4 determine the corresponding confidence level of the video to the video after being processed, and each confidence level is constituted the confidence level matrix of the video.For example, the probability that model 1 determines sea after processing is 80%, and the probability in blue sky is 20%；It is 70% that model 2 determines the probability in sea, and the probability in blue sky is 30%；It is 80% that model 3 determines the probability in sea, and the probability in blue sky is 20%；It is 75% that model 4 determines the probability in sea, and the probability in blue sky is 25%.

Step 506, at least one video is processed by the first information, second information, the 3rd information and the 4th information, to obtain the class relations matrix of the video；

Wherein, at least one video is used for classifying the video, the video is processed by model 1, model 2, model 3 and model 4, the video passes through the model 1, model 2, model 3 and model 4 determine the class relations of the video after processing, be further combined class relations matrix, for example：The video of multiple cats is processed, the probability for determining cat is 90%, and the probability for determining Canis familiaris L. is 10%, then it is 10% that the video of Canis familiaris L. is assigned to shared probability in the video of cat by mistake.

It should be noted that step 505 and step 506 can be executed simultaneously, it is also possible to execute according to default sequencing, be not specifically limited herein.

Step 507, the class relations matrix of the confidence level matrix and the video of the video is substituted into object function.

Wherein, the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and for classifying to video, wherein, W is the fusion parameters in the object function to the subject fusion parameter, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information and the 4th information, the S is the confidence level matrix of the video, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

Unlike the prior art, directly the information in video is not combined in the present invention, but the information generated according to deep neural network in the video distinguishes the corresponding first information, second information, 3rd information and the 4th information, further using the first information, second information, 3rd information and the 4th information are processed to video, using the first information, second information, 3rd information and the 4th information introduce class relations during processing to video and carry out regularization constraint so as to reducing the risk of over-fitting, further the video is classified according to the result for processing, so as to effectively increase the accuracy of visual classification.

On the basis of Fig. 3 to embodiment illustrated in fig. 5, Fig. 6 is referred to, in detail visual classification is described, specific as follows：

Step 601, the information obtained in video；

Wherein, step 601 is same or like with the step 301 in embodiment illustrated in fig. 3, specifically see step 301, and here is omitted.

Step 602, using the first convolutional neural networks to described image information processing generate the first information, the first information be first reference information；

Step 603, using the second convolutional neural networks to the Optic flow information process generate the second information, the second sub-information is extracted according to preset rules from second information, second sub-information is carried out processing the 5th information that generates using the 2nd LSTM, second information collectively constitutes second reference information with the 5th information；

Wherein, generated after the second information using the second convolutional neural networks, such as, partial information (intermediate results in such as second convolutional neural networks) in second information is extracted again, the partial information is carried out processing the 5th information that generates, such as model 5 further with the second time recurrent neural network LSTM.

Step 604, using the 3rd convolutional neural networks to the acoustic information system generate the 3rd information, the 3rd information be the 3rd reference information；

It should be noted that step 602 can be executed to step 604 simultaneously, it is also possible to execute according to default sequencing, be not specifically limited herein.In addition, the first convolutional neural networks, the second convolutional neural networks and the 3rd convolutional neural networks can be identical, it is also possible to different, can specifically be decided according to the actual requirements, be not specifically limited herein.

Step 605, the video is processed by the first information, second information, the 3rd information and the 5th information, to obtain the confidence level matrix of the video；

Step 606, at least one video is processed by the first information, second information, the 3rd information and the 5th information, to obtain the class relations matrix of the video；

Wherein, at least one video is used for classifying the video, specifically see step 406, it should be noted that step 605 and step 606 can be executed simultaneously, it is also possible to executes according to default sequencing, is not specifically limited herein.

Step 607, the class relations matrix of the confidence level matrix and the video of the video is substituted into object function.

The object function isWherein, wherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and for classifying to video, W is the fusion parameters in the object function to the subject fusion parameter, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information and the 5th information, the S is the confidence level matrix of the video, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

Unlike the prior art, directly the information in video is not combined in the present invention, but the information generated according to deep neural network in the video distinguishes the corresponding first information, second information, 3rd information and the 5th information, further using the first information, second information, 3rd information and the 5th information are processed to video, using the first information, second information, 3rd information and the 5th information introduce class relations during processing to video and carry out regularization constraint so as to reducing the risk of over-fitting, further the video is classified according to the result for processing, so as to effectively increase the accuracy of visual classification.

On the basis of Fig. 3 to embodiment illustrated in fig. 6, Fig. 7 is referred to, in detail visual classification is described, specific as follows：

Step 701, the information obtained in video；

Wherein, step 701 is same or like with the step 301 in embodiment illustrated in fig. 3, specifically see step 301, and here is omitted.

Step 702, using the first convolutional neural networks to described image information processing generate the first information, the first information be first reference information；

Step 703, using the second convolutional neural networks to the Optic flow information process generate the second information, second information be second reference information；

Step 704, using the 3rd convolutional neural networks to the acoustic information system generate the 3rd information, the 3rd sub-information is extracted from the 3rd information according to preset rules, the 3rd sub-information is carried out processing the 6th information that generates using the 3rd LSTM, the 3rd information collectively constitutes the 3rd reference information with the 6th information；

Wherein, generated after the 3rd information using the 3rd convolutional neural networks, such as, partial information (intermediate results in such as threeth convolutional neural networks) in threeth information is extracted again, the partial information is carried out processing the 6th information that generates, such as model 6 further with the 3rd time recurrent neural network LSTM.

It should be noted that step 702 can be executed to step 704 simultaneously, it is also possible to execute according to default sequencing, be not specifically limited herein.In addition, the first convolutional neural networks, the second convolutional neural networks and the 3rd convolutional neural networks can be identical, it is also possible to different, can specifically be decided according to the actual requirements, be not specifically limited herein.

Step 705, the video is processed by the first information, second information, the 3rd information and the 6th information, to obtain the confidence level matrix of the video；

Step 706, at least one video is processed by the first information, second information, the 3rd information and the 6th information, to obtain the class relations matrix of the video；

Wherein, at least one video is used for classifying the video, specifically see step 406, it should be noted that step 705 and step 706 can be executed simultaneously, it is also possible to executes according to default sequencing, is not specifically limited herein.

Step 707, the class relations matrix of the confidence level matrix and the video of the video is substituted into object function.

Wherein, the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

Unlike the prior art, directly the information in video is not combined in the present invention, but the information generated according to deep neural network in the video distinguishes the corresponding first information, second information, 3rd information and the 6th information, further using the first information, second information, 3rd information and the 6th information are processed to video, using the first information, second information, 3rd information and the 6th information introduce class relations during processing to video and carry out regularization constraint so as to reducing the risk of over-fitting, further the video is classified according to the result for processing, so as to effectively increase the accuracy of visual classification.

On the basis of Fig. 3 to embodiment illustrated in fig. 7, Fig. 8 is referred to, in detail visual classification is described, specific as follows：

Step 801, the information obtained in video；

Wherein, step 801 is same or like with the step 301 in embodiment illustrated in fig. 3, specifically see step 301, and here is omitted.

Step 802, using the first convolutional neural networks to described image information processing generate the first information, the first sub-information is extracted from the first information according to preset rules, first sub-information is carried out processing the 4th information that generates using a LSTM, the first information collectively constitutes first reference information with the 4th information；

Wherein, generated after the first information using the first convolutional neural networks, such as, then partial information (intermediate results in such as the first convolutional neural networks) in the first information is extracted, the partial information is processed further with very first time recurrent neural network LSTM.

Step 803, using the second convolutional neural networks to the Optic flow information process generate the second information, the second sub-information is extracted according to rule from second information, second sub-information is carried out processing the 5th information that generates using the 2nd LSTM, second information collectively constitutes second reference information with the 5th information；

Wherein, generated after the second information using the second convolutional neural networks, such as, then partial information (intermediate results in such as the second convolutional neural networks) in the second information is extracted, the partial information is processed further with the second time recurrent neural network LSTM.

Step 804, using the 3rd convolutional neural networks to the acoustic information system generate the 3rd information, the 3rd information be the 3rd reference information；

It should be noted that step 802 can be executed to step 804 simultaneously, it is also possible to execute according to default sequencing, be not specifically limited herein.In addition, the first convolutional neural networks, the second convolutional neural networks and the 3rd convolutional neural networks can be identical, it is also possible to different, can specifically be decided according to the actual requirements, be not specifically limited herein.In addition, a LSTM can be identical with the 2nd LSTM, it is also possible to different, depending on being actually needed, it is not specifically limited herein.

Step 805, the video is processed by the first information, second information, the 3rd information, the 4th information and the 5th information, to obtain the confidence level matrix of the video；

Step 806, at least one video is processed by the first information, second information, the 3rd information, the 4th information and the 5th information, to obtain the class relations matrix of the video；

Wherein, at least one video is used for classifying the video, specifically see step 406, it should be noted that step 805 and step 806 can be executed simultaneously, it is also possible to executes according to default sequencing, is not specifically limited herein.

Step 807, the class relations matrix of the confidence level matrix and the video of the video is substituted into object function.

Wherein, the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 4th information and the 5th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

Unlike the prior art, directly the information in video is not combined in the present invention, but the information generated according to deep neural network in the video distinguishes the corresponding first information, second information, 3rd information, 4th information and the 5th information, further using the first information, second information, 3rd information, 4th information and the 5th information are processed to video, using the first information, second information, 3rd information, 4th information and the 5th information introduce class relations during processing to video and carry out regularization constraint so as to reducing the risk of over-fitting, further the video is classified according to the result for processing, so as to effectively increase the accuracy of visual classification.

On the basis of Fig. 3 to embodiment illustrated in fig. 8, Fig. 9 is referred to, in detail visual classification is described, specific as follows：

Step 901, the information obtained in video；

Wherein, step 901 is same or like with the step 301 in embodiment illustrated in fig. 3, specifically see step 301, and here is omitted.

Step 902, using the first convolutional neural networks to described image information processing generate the first information, the first sub-information is extracted from the first information according to preset rules, first sub-information is carried out processing the 4th information that generates using a LSTM, the first information collectively constitutes first reference information with the 4th information；

Step 903, using the second convolutional neural networks to the Optic flow information process generate the second information, second information be second reference information；

Step 904, using the 3rd convolutional neural networks to the acoustic information system generate the 3rd information, the 3rd sub-information is extracted from the 3rd information according to preset rules, the 3rd sub-information is carried out processing the 6th information that generates using the 3rd LSTM, the 3rd information collectively constitutes the 3rd reference information with the 6th information；

Wherein, generated after the 3rd information using the 3rd convolutional neural networks, such as, then partial information (intermediate results in such as the 3rd convolutional neural networks) in the 3rd information is extracted, the partial information is processed further with the 3rd time recurrent neural network LSTM.

It should be noted that step 902 can be executed to step 904 simultaneously, it is also possible to execute according to default sequencing, be not specifically limited herein.In addition, the first convolutional neural networks, the second convolutional neural networks and the 3rd convolutional neural networks can be identical, it is also possible to different, can specifically be decided according to the actual requirements, be not specifically limited herein.In addition, a LSTM can be identical with the 3rd LSTM, it is also possible to different, depending on being actually needed, it is not specifically limited herein.

Step 905, the video is processed by the first information, second information, the 3rd information, the 4th information and the 6th information, to obtain the confidence level matrix of the video；

Step 906, at least one video is processed by the first information, second information, the 3rd information, the 4th information and the 6th information, to obtain the class relations matrix of the video；

Wherein, at least one video is used for classifying the video, specifically see step 406, it should be noted that step 905 and step 906 can be executed simultaneously, it is also possible to executes according to default sequencing, is not specifically limited herein.

Step 907, the class relations matrix of the confidence level matrix and the video of the video is substituted into object function.

The object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 4th information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

Unlike the prior art, directly the information in video is not combined in the present invention, but the information generated according to deep neural network in the video distinguishes the corresponding first information, second information, 3rd information, 4th information and the 6th information, further using the first information, second information, 3rd information, 4th information and the 6th information are processed to video, using the first information, second information, 3rd information, during 4th information and the 6th information are processed to video, introducing class relations carries out regularization constraint so as to reducing the risk of over-fitting, further the video is classified according to the result for processing, so as to effectively increase the accuracy of visual classification.

On the basis of Fig. 3 to embodiment illustrated in fig. 9, Figure 10 is referred to, in detail visual classification is described, specific as follows：

Step 1001, the information obtained in video；

Wherein, step 1001 is same or like with the step 301 in embodiment illustrated in fig. 3, specifically see step 301, and here is omitted.

Step 1002, using the first convolutional neural networks to described image information processing generate the first information, the first information be first reference information；

Step 1003, using the second convolutional neural networks to the Optic flow information process generate the second information, the second sub-information is extracted according to preset rules from second information, second sub-information is carried out processing the 5th information that generates using the 2nd LSTM, second information collectively constitutes second reference information with the 5th information；

Step 1004, using the 3rd convolutional neural networks to the acoustic information system generate the 3rd information, the 3rd sub-information is extracted from the 3rd information according to preset rules, the 3rd sub-information is carried out processing the 6th information that generates using the 3rd LSTM, the 3rd information collectively constitutes the 3rd reference information with the 6th information；

It should be noted that step 1002 can be executed to step 1004 simultaneously, it is also possible to execute according to default sequencing, be not specifically limited herein.In addition, the first convolutional neural networks, the second convolutional neural networks and the 3rd convolutional neural networks can be identical, it is also possible to different, can specifically be decided according to the actual requirements, be not specifically limited herein.In addition, the 2nd LSTM can be identical with the 3rd LSTM, it is also possible to different, depending on being actually needed, it is not specifically limited herein.

Step 1005, the video is processed by the first information, second information, the 3rd information, the 5th information and the 6th information, to obtain the confidence level matrix of the video；

Step 1006, at least one video is processed by the first information, second information, the 3rd information, the 5th information and the 6th information, to obtain the class relations matrix of the video；

Wherein, at least one video is used for classifying the video, specifically see step 406, it should be noted that step 1005 and step 1006 can be executed simultaneously, it is also possible to executes according to default sequencing, is not specifically limited herein.

Step 1007, the class relations matrix of the confidence level matrix and the video of the video is substituted into object function.

The object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 5th information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

Unlike the prior art, directly the information in video is not combined in the present invention, but the information generated according to deep neural network in the video distinguishes the corresponding first information, second information, 3rd information, 5th information and the 6th information, further using the first information, second information, 3rd information, 5th information and the 6th information are processed to video, using the first information, second information, 3rd information, 5th information and the 6th information introduce class relations during processing to video and carry out regularization constraint so as to reducing the risk of over-fitting, further the video is classified according to the result for processing, so as to effectively increase the accuracy of visual classification.

On the basis of Fig. 3 to embodiment illustrated in fig. 10, Figure 11 is referred to, in detail visual classification is described, specific as follows：

Step 1101, the information obtained in video；

Wherein, step 1101 is same or like with the step 301 in embodiment illustrated in fig. 3, specifically see step 301, and here is omitted.

Step 1102, using the first convolutional neural networks to described image information processing generate the first information, the first sub-information is extracted from the first information according to preset rules, first sub-information is carried out processing the 4th information that generates using a LSTM, the first information collectively constitutes first reference information with the 4th information；

Step 1103, using the second convolutional neural networks to the Optic flow information process generate the second information, the second sub-information is extracted according to preset rules from second information, second sub-information is carried out processing the 5th information that generates using the 2nd LSTM, second information collectively constitutes second reference information with the 5th information；

Step 1104, using the 3rd convolutional neural networks to the acoustic information system generate the 3rd information, the 3rd sub-information is extracted from the 3rd information according to preset rules, the 3rd sub-information is carried out processing the 6th information that generates using the 3rd LSTM, the 3rd information collectively constitutes the 3rd reference information with the 6th information；

It should be noted that step 1102 can be executed to step 1104 simultaneously, it is also possible to execute according to default sequencing, be not specifically limited herein.In addition, the first convolutional neural networks, the second convolutional neural networks and the 3rd convolutional neural networks can be identical, it is also possible to different, can specifically be decided according to the actual requirements, be not specifically limited herein.In addition, a LSTM, the 2nd LSTM and the 3rd LSTM can be identical, it is also possible to different, depending on being actually needed, it is not specifically limited herein.

Step 1105, the video is processed by the first information, second information, the 3rd information, the 4th information, the 5th information and the 6th information, to obtain the confidence level matrix of the video；

Step 1106, at least one video is processed by the first information, second information, the 3rd information, the 4th information, the 5th information and the 6th information, to obtain the class relations matrix of the video；

Wherein, at least one video is used for classifying the video, specifically see step 406, it should be noted that step 1105 and step 1106 can be executed simultaneously, it is also possible to executes according to default sequencing, is not specifically limited herein.

Step 1107, the class relations matrix of the confidence level matrix and the video of the video is substituted into object function.

Wherein, the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 4th information, the 5th information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

Unlike the prior art, directly the information in video is not combined in the present invention, but the information generated according to deep neural network in the video distinguishes the corresponding first information, second information, 3rd information, 4th information, 5th information and the 6th information, further using the first information, second information, 3rd information, 4th information, 5th information and the 6th information are processed to video, using the first information, second information, 3rd information, 4th information, 5th information and the 6th information introduce class relations during processing to video and carry out regularization constraint so as to reducing the risk of over-fitting, further the video is classified according to the result for processing, so as to effectively increase the accuracy of visual classification.

In actual applications, by taking a preferred embodiment in the present invention as an example, as shown in Figure 12-a, for example：Extract the image information (such as frame) in the video of training, Optic flow information and acoustic information, by convolutional neural networks respectively to image information, Optic flow information and acoustic information are modeled, difference generation model 1, model 3 and model 5, in order to utilize the time sequence information in image information and Optic flow information, the present invention to be modeled image information and Optic flow information using LSTM networks.In the training process, such as：" frame：The input of LSTM networks " is " frame：The convolutional neural networks feature (i.e. the intermediate results of convolutional neural networks) that convolutional neural networks " are extracted；" light stream：The input of convolutional neural networks " is " light stream：The convolutional neural networks feature (i.e. the intermediate results of convolutional neural networks) that convolutional neural networks " are extracted.By training, model 2 and model 4 can be obtained.

After the model for having multichannel, a crucial problem is how that comprehensively utilizing these multi-modal information improves classification results.Often in training set, over-fitting is so as to cause the precision of fusion very poor, of the invention for each classification for conventional late period amalgamation mode, and adaptive fusion obtains subject fusion parameter (i.e. so as to fully be directed to per One class learning：Optimum fusion parameters).Additionally, during fusion, by the use of the relation between classification as the study of the constraint guiding grader of regularization in the embodiment of the present invention, making video classification model that there is higher discriminating power.

In order to reduce over-fitting, by the use of class relations as the process constrained to guide fusion of weight in the embodiment of the present invention, the object function of optimization is：Wherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video.

Wherein, Section 1 L (S, Y；W it is) empirical loss of predictive value deviation for weighing the actual value of all samples on whole training dataset and obtain by logistic regression, S is the confidence level matrix of the Sample video in training set, each sample constitutes S by multiple still untreated fractions^m∈R^C, wherein, (m=1 ... M), wherein C represent that the species of video, M represent the way for needing fusion.Y is the category of multiple Sample videos in training set.W be from S to Y map weight, and this formula optimize target.In the present embodiment, W is 5 matrixes, represents the proportion coefficient in final video classification results per classification results all the way respectively.

Section 2Be in order to utilize relation between classification and well-designed bound term, V is the class relations matrix of video, can pass through to calculate in per training process all the way (in the present embodiment, a total of 5 tunnel, each model is correspondingly all the way), for each class how many sample is correctly classified and assigns to other each classes how many samples by mistake, finally by different roods to V be stitched together, its dimension is identical with W, for constraining W, guides the study of W；It should be noted thatNot this norm of Luo Beini crows is represented, all items square is represented and is opened radical sign, therefore this is led square smooth afterwards；

Section 3 λ₂||W||₁It is the bound term for making weight coefficient sparse；Upper formula reaches a balance between loss item and the bound term of regularization, so as to learn to obtain subject fusion parameter.

Above-mentioned object function, is solved based on near-end gradient algorithm (Proximal Gradient Method), and near-end gradient algorithm is to solve for the most commonly used optimized algorithm during large-scale data, generally can be compared with rapid convergence, efficiently solving-optimizing problem.

Subject fusion parameter can be obtained by adaptive learning, subject fusion parameter W is 5 matrixes, represents the proportion coefficient in final video classification results per classification results all the way respectively.5 road disaggregated models constitute final video classification model by self adaptation fusion, and training process terminates.

Video classification model and subject fusion parameter are obtained by using said process, video can be classified, as shown in Figure 12-b, when classifying to input video, image information, Optic flow information and the acoustic information of video are first extracted, through the prediction of video classification model, obtain per classification results all the way.The video is per classification results S all the way in Figure 12-b and belongs to the other probability of each video class, represented with an one-dimensional vector.The subject fusion parameter obtained in training is recycled to merge multi-channel video classification results.Subject fusion parameter W obtained in training is 5 matrixes, represents the proportion coefficient in final video classification results per classification results all the way respectively.To be multiplied by per classification results all the way after corresponding weight W and be added, as final visual classification result, so as to accurately classify to video.So the classification results of final video areM represents video modeling way, M=5 in the present embodiment.

Specifically, the present invention is better than existing video classification methods in the following aspects：

Video multi-modal information utilizes aspect：The present invention fully excavated appearance present in video, transitory motions, audio frequency and long when time sequence information, compared to existing technology, to the description of video more fully.

Using semantic relation aspect：Relative to existing fusion method, this video introduces class relations so as to the probability of over-fitting greatly reduces during later stage fusion.

Classification speed aspect：Once off-line training of the invention is completed, test almost can be done in real time, and this is significantly better than (the English full name of the support vector machine based on kernel function：Support Vector Machine, abbreviation：SVM).

In the embodiment of the present invention, directly the information in video is not combined, but the information generated according to deep neural network in the video distinguishes corresponding reference information, further video is processed using reference information, introducing class relations using reference information during processing to video carries out regularization constraint so as to reducing the risk of over-fitting, further the video is classified according to the result for processing, so as to effectively increase the accuracy of visual classification.

For ease of preferably implementing the above-mentioned correlation technique of the embodiment of the present invention, it is also provided below for coordinating the relevant apparatus of said method.

Figure 13 is referred to, one embodiment of visual classification device 1300 in the embodiment of the present invention, including：Acquisition module 1301, generation module 1302, processing module 1303 and computing module 1304.

Acquisition module 1301, for obtaining the information in video, the information in the video includes image information, Optic flow information and acoustic information；

Generation module 1302, corresponding 3rd reference information of the acoustic information that corresponding second reference information of the Optic flow information that corresponding first reference information of the image information for generating the acquisition of acquisition module 1301 using deep neural network, the acquisition module 1301 are obtained and the acquisition module 1301 are obtained；

Processing module 1303, the first reference information, the second reference information and the 3rd reference information for being generated according to the generation module 1302 are processed to obtain the class relations matrix of the confidence level matrix and the video of the video to the video；

Computing module 1304, the confidence level matrix of video and the class relations matrix of the video for obtaining the processing module 1303 substitutes into object function, to obtain the subject fusion parameter of the video, the subject fusion parameter is used for classifying video, wherein, the constraint factor of the object function includes confidence level matrix, class relations matrix and fusion parameters.

Further, the generation module 1302, the Image Information Processing specifically for being obtained to the acquisition module 1301 using the first convolutional neural networks generate the first information, and the first information is first reference information；Second information that generates is processed using the Optic flow information that the second convolutional neural networks are obtained to the acquisition module 1301, second information is second reference information；The 3rd information is generated using the acoustic information system that the 3rd convolutional neural networks are obtained to the acquisition module 1301, the 3rd information is the 3rd reference information；

The processing module 1303, the first information, second information and the 3rd information specifically for being generated the video by the generation module 1302 are processed, to obtain the confidence level matrix of the video；The first information, second information and the 3rd information that at least one video is generated by the generation module 1302 is processed, and to obtain the class relations matrix of the video, wherein, at least one video is used for classifying the video；

The computing module 1304, the confidence level matrix of video and the class relations matrix of the video specifically for obtaining the processing module 1303 substitute into object function, and the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W it is) empirical loss of the video in the processing procedure of the first information, second information and the 3rd information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

Further, the generation module 1302, Image Information Processing specifically for being obtained to the acquisition module 1301 using the first convolutional neural networks generates the first information, the first sub-information is extracted from the first information according to preset rules, first sub-information is carried out processing the 4th information that generates using very first time recurrent neural network LSTM, the first information collectively constitutes first reference information with the 4th information；Second information that generates is processed using the Optic flow information that the second convolutional neural networks are obtained to the acquisition module 1301, second information is second reference information；The 3rd information is generated using the acoustic information system that the 3rd convolutional neural networks are obtained to the acquisition module 1301, the 3rd information is the 3rd reference information；

The processing module 1303, the first information, second information, the 3rd information and the 4th information specifically for being generated the video by the generation module 1302 are processed, to obtain the confidence level matrix of the video；The first information, second information, the 3rd information and the 4th information that at least one video is generated by the generation module 1302 is processed, to obtain the class relations matrix of the video, wherein, at least one video is used for classifying the video；

The computing module 1304, the confidence level matrix of video and the class relations matrix of the video specifically for obtaining the processing module 1303 substitute into object function, and the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information and the 4th information, the S is the confidence level matrix of the video, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

Further, the generation module 1302, the Image Information Processing specifically for being obtained to the acquisition module 1301 using the first convolutional neural networks generate the first information, and the first information is first reference information；Second information that generates is processed using the Optic flow information that the second convolutional neural networks are obtained to the acquisition module 1301, the second sub-information is extracted according to preset rules from second information, second sub-information is carried out processing the 5th information that generates using the 2nd LSTM, second information collectively constitutes second reference information with the 5th information；The 3rd information is generated using the acoustic information system that the 3rd convolutional neural networks are obtained to the acquisition module 1301, the 3rd information is the 3rd reference information；

The processing module 1303, the first information, second information, the 3rd information and the 5th information specifically for being generated the video by the generation module 1302 are processed, to obtain the confidence level matrix of the video；The first information, second information, the 3rd information and the 5th information that at least one video is generated by the generation module 1302 is processed, to obtain the class relations matrix of the video, wherein, at least one video is used for classifying the video；

The computing module 1304, the confidence level matrix of video and the class relations matrix of the video specifically for obtaining the processing module 1303 substitute into object function, and the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information and the 5th information, the S is the confidence level matrix of the video, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

Further, the generation module 1302, the Image Information Processing for being obtained to the acquisition module 1301 using the first convolutional neural networks generate the first information, and the first information is first reference information；Second information that generates is processed using the Optic flow information that the second convolutional neural networks are obtained to the acquisition module 1301, second information is second reference information；The 3rd information is generated using the acoustic information system that the 3rd convolutional neural networks are obtained to the acquisition module 1301, the 3rd sub-information is extracted from the 3rd information according to preset rules, the 3rd sub-information is carried out processing the 6th information that generates using the 3rd LSTM, the 3rd information collectively constitutes the 3rd reference information with the 6th information；

The processing module 1303, the first information, second information, the 3rd information and the 6th information specifically for being generated the video by the generation module 1302 are processed, to obtain the confidence level matrix of the video；The first information, second information, the 3rd information and the 6th information that at least one video is generated by the generation module 1302 is processed, to obtain the class relations matrix of the video, wherein, at least one video is used for classifying the video；

The computing module 1304, the confidence level matrix of video and the class relations matrix of the video specifically for obtaining the processing module 1303 substitute into object function, and the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator

Further, the generation module 1302, Image Information Processing specifically for being obtained to the acquisition module 1301 using the first convolutional neural networks generates the first information, the first sub-information is extracted from the first information according to preset rules, first sub-information is carried out processing the 4th information that generates using a LSTM, the first information collectively constitutes first reference information with the 4th information；Second information that generates is processed using the Optic flow information that the second convolutional neural networks are obtained to the acquisition module 1301, the second sub-information is extracted according to rule from second information, second sub-information is carried out processing the 5th information that generates using the 2nd LSTM, second information collectively constitutes second reference information with the 5th information；The 3rd information is generated using the acoustic information system that the 3rd convolutional neural networks are obtained to the acquisition module 1301, the 3rd information is the 3rd reference information；

The processing module 1303, the first information, second information, the 3rd information, the 4th information and the 5th information specifically for being generated the video by the generation module 1302 is processed, to obtain the confidence level matrix of the video；The first information, second information, the 3rd information, the 4th information and the 5th information that at least one video is generated by the generation module 1302 is processed, to obtain the class relations matrix of the video, wherein, at least one video is used for classifying the video；

The computing module 1304, the confidence level matrix of video and the class relations matrix of the video specifically for obtaining the processing module 1303 substitute into object function, and the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 4th information and the 5th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

Further, the generation module 1302, Image Information Processing specifically for being obtained to the acquisition module 1301 using the first convolutional neural networks generates the first information, the first sub-information is extracted from the first information according to preset rules, first sub-information is carried out processing the 4th information that generates using a LSTM, the first information collectively constitutes first reference information with the 4th information；Second information that generates is processed using the Optic flow information that the second convolutional neural networks are obtained to the acquisition module 1301, second information is second reference information；The 3rd information is generated using the acoustic information system that the 3rd convolutional neural networks are obtained to the acquisition module 1301, the 3rd sub-information is extracted from the 3rd information according to preset rules, the 3rd sub-information is carried out processing the 6th information that generates using the 3rd LSTM, the 3rd information collectively constitutes the 3rd reference information with the 6th information；

The processing module 1303, the first information, second information, the 3rd information, the 4th information and the 6th information specifically for being generated the video by the generation module 1302 is processed, to obtain the confidence level matrix of the video；The first information, second information, the 3rd information, the 4th information and the 6th information that at least one video is generated by the generation module 1302 is processed, to obtain the class relations matrix of the video, wherein, at least one video is used for classifying the video；

The computing module 1304, the confidence level matrix of video and the class relations matrix of the video specifically for obtaining the processing module 1303 substitute into object function, and the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 4th information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

Further, the generation module 1302, the Image Information Processing specifically for being obtained to the acquisition module 1301 using the first convolutional neural networks generate the first information, and the first information is first reference information；Second information that generates is processed using the Optic flow information that the second convolutional neural networks are obtained to the acquisition module 1301, the second sub-information is extracted according to preset rules from second information, second sub-information is carried out processing the 5th information that generates using the 2nd LSTM, second information collectively constitutes second reference information with the 5th information；The 3rd information is generated using the acoustic information system that the 3rd convolutional neural networks are obtained to the acquisition module 1301, the 3rd sub-information is extracted from the 3rd information according to preset rules, the 3rd sub-information is carried out processing the 6th information that generates using the 3rd LSTM, the 3rd information collectively constitutes the 3rd reference information with the 6th information；

The processing module 1303, the first information, second information, the 3rd information, the 5th information and the 6th information specifically for being generated the video by the generation module 1302 is processed, to obtain the confidence level matrix of the video；The first information, second information, the 3rd information, the 5th information and the 6th information that at least one video is generated by the generation module 1302 is processed, to obtain the class relations matrix of the video, wherein, at least one video is used for classifying the video；

The computing module 1304, the confidence level matrix of video and the class relations matrix of the video specifically for obtaining the processing module 1303 substitute into object function, and the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 5th information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

Further, the generation module 1302, Image Information Processing specifically for being obtained to the acquisition module 1301 using the first convolutional neural networks generates the first information, the first sub-information is extracted from the first information according to preset rules, first sub-information is carried out processing the 4th information that generates using a LSTM, the first information collectively constitutes first reference information with the 4th information；Second information that generates is processed using the Optic flow information that the second convolutional neural networks are obtained to the acquisition module 1301, the second sub-information is extracted according to preset rules from second information, second sub-information is carried out processing the 5th information that generates using the 2nd LSTM, second information collectively constitutes second reference information with the 5th information；The 3rd information is generated using the acoustic information system that the 3rd convolutional neural networks are obtained to the acquisition module 1301, the 3rd sub-information is extracted from the 3rd information according to preset rules, the 3rd sub-information is carried out processing the 6th information that generates using the 3rd LSTM, the 3rd information collectively constitutes the 3rd reference information with the 6th information；

The processing module 1303, the first information, second information, the 3rd information, the 4th information, the 5th information and the 6th information specifically for being generated the video by the generation module 1302 is processed, to obtain the confidence level matrix of the video；The first information, second information, the 3rd information, the 4th information, the 5th information and the 6th information that at least one video is generated by the generation module 1302 is processed, to obtain the class relations matrix of the video, wherein, at least one video is used for classifying the video；

The computing module 1304, the confidence level matrix of video and the class relations matrix of the video specifically for obtaining the processing module 1303 substitute into object function, and the object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are the subject fusion parameter of the video, and wherein, W is the fusion parameters in the object function, L (S, the Y；W) it is empirical loss of the video in the processing procedure of the first information, second information, the 3rd information, the 4th information, the 5th information and the 6th information, the S is the confidence level matrix of the video, Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

Embodiment shown in Figure 13 is illustrated from the angle of functional module to the concrete structure of visual classification device, and the embodiment below in conjunction with Figure 14 is illustrated from hardware point of view to the concrete structure of visual classification device：A kind of visual classification device 1400, including：Memorizer 1401, processor 1402, emitter 1403 and receptor 1404；Wherein, the memorizer 1401 is connected with the processor 1402, and the processor 1402 is connected with the emitter 1403, and the processor 1402 is connected with the receptor 1404；

It is used for executing following steps by calling the operational order stored in the memorizer 1401, the processor 1402：

Wherein, the utilization deep neural network generates corresponding first reference information of described image information, corresponding second reference information of the Optic flow information and corresponding 3rd reference information of the acoustic information and includes：

At least one video is processed by the first information, second information and the 3rd information, to obtain the class relations matrix of the video, wherein, at least one video is used for classifying the video；

At least one video is processed by the first information, second information, the 3rd information and the 4th information, to obtain the class relations matrix of the video, wherein, at least one video is used for classifying the video；

At least one video is processed by the first information, second information, the 3rd information and the 5th information, to obtain the class relations matrix of the video, wherein, at least one video is used for classifying the video；

At least one video is processed by the first information, second information, the 3rd information and the 6th information, to obtain the class relations matrix of the video, wherein, at least one video is used for classifying the video；

At least one video is processed by the first information, second information, the 3rd information, the 4th information and the 5th information, to obtain the class relations matrix of the video, wherein, at least one video is used for classifying the video；

At least one video is processed by the first information, second information, the 3rd information, the 4th information and the 6th information, to obtain the class relations matrix of the video, wherein, at least one video is used for classifying the video；

At least one video is processed by the first information, second information, the 3rd information, the 5th information and the 6th information, to obtain the class relations matrix of the video, wherein, at least one video is used for classifying the video；

At least one video is processed by the first information, second information, the 3rd information, the 4th information, the 5th information and the 6th information, to obtain the class relations matrix of the video, wherein, at least one video is used for classifying the video；

Therefore, directly the information in video is not combined in the present invention, but the information generated according to deep neural network in the video distinguishes corresponding reference information, further video is processed using reference information, introducing class relations using reference information during processing to video carries out regularization constraint so as to reducing the risk of over-fitting, further the video is classified according to the result for processing, so as to effectively increase the accuracy of visual classification.

In the above-described embodiments, the description of each embodiment is all emphasized particularly on different fields, there is no the part that describes in detail in certain embodiment, may refer to the associated description of other embodiment.

Those skilled in the art can be understood that for convenience and simplicity of description, the specific work process of the system, apparatus, and unit of foregoing description may be referred to the corresponding process in preceding method embodiment, will not be described here.

In several embodiments provided herein, it should be understood that disclosed system, apparatus and method can be realized by another way.For example, device embodiment described above is only schematically, for example, the division of the unit, be only a kind of division of logic function, can have when actually realizing other dividing mode, for example multiple units or component can in conjunction with or be desirably integrated into another system, or some features can be ignored, or do not execute.Another, shown or discussed coupling each other or direct-coupling or communication connection can be INDIRECT COUPLING or the communication connections by some interfaces, device or unit, can be electrical, mechanical or other forms.

The unit that illustrates as separating component can be or may not be physically separate, as the part that unit shows can be or may not be physical location, you can be located at a place, or can also be distributed on multiple NEs.Some or all of unit therein can be selected according to the actual needs to realize the purpose of this embodiment scheme.

In addition, each functional unit in each embodiment of the invention can be integrated in a processing unit, or unit is individually physically present, it is also possible to which two or more units are integrated in a unit.Above-mentioned integrated unit both can be realized in the form of hardware, it would however also be possible to employ the form of SFU software functional unit is realized.

If the integrated unit realized using in the form of SFU software functional unit and as independent production marketing or use when, can be stored in a computer read/write memory medium.It is based on such understanding, the part or all or part of the technical scheme that technical scheme is substantially contributed to prior art in other words can be embodied in the form of software product, the computer software product is stored in a storage medium, use so that a computer equipment (can be personal computer including some instructions, server, or the network equipment etc.) execute all or part of step of each embodiment methods described of the invention.And aforesaid storage medium includes：USB flash disk, portable hard drive, read only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disc or CD etc. are various can be with the medium of store program codes.

The above, above example only in order to technical scheme to be described, rather than a limitation；Although being described in detail to the present invention with reference to the foregoing embodiments, it will be understood by those within the art that：Which still can be modified to the technical scheme described in foregoing embodiments, or carry out equivalent to which part technical characteristic；And these modifications or replacement, do not make the essence of appropriate technical solution depart from the spirit and scope of various embodiments of the present invention technical scheme.

Claims

1. a kind of video classification methods, it is characterised in that include：

The information in video is obtained, the information in the video includes image information, Optic flow information and sound Message ceases；

Corresponding first reference information of described image information, light stream letter are generated using deep neural network Cease corresponding second reference information and corresponding 3rd reference information of the acoustic information；

Regarded to described according to first reference information, the second reference information and the 3rd reference information Frequency is processed to obtain the class relations matrix of the confidence level matrix and the video of the video；

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, with The subject fusion parameter of the video is obtained, the subject fusion parameter is used for classifying video, its In, the constraint factor of the object function includes confidence level matrix, class relations matrix and fusion parameters.

2. video classification methods according to claim 1, it is characterised in that the utilization depth Neutral net generates corresponding first reference information of described image information, the Optic flow information corresponding second Reference information and corresponding 3rd reference information of the acoustic information include：

The first information, first letter is generated to described image information processing using the first convolutional neural networks Cease for first reference information；

The second information, second letter is generated to Optic flow information process using the second convolutional neural networks Cease for second reference information；

The 3rd information, the 3rd letter is generated to the acoustic information system using the 3rd convolutional neural networks Cease for the 3rd reference information；

Described according to first reference information, the second reference information and the 3rd reference information to institute State video to be processed to obtain the class relations matrix bag of the confidence level matrix and the video of the video Include：

By the video by the first information, second information and the 3rd information at Reason, to obtain the confidence level matrix of the video；

At least one video is entered by the first information, second information and the 3rd information Row is processed, and to obtain the class relations matrix of the video, at least one video is used for regarding to described Frequency is classified；

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, with The subject fusion parameter for obtaining the video includes：

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, institute Stating object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W) it is process of the video in the first information, second information and the 3rd information During empirical loss, the S is the confidence level matrix of the video, and Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Represent not Luo Beini Wu Sifan Number, | | | |₁Represent Sparse rules operator.

3. video classification methods according to claim 1, it is characterised in that the utilization depth Neutral net generates corresponding first reference information of described image information, the Optic flow information corresponding second Reference information and corresponding 3rd reference information of the acoustic information include：

The first information is generated to described image information processing using the first convolutional neural networks, according to default rule First sub-information is then extracted from the first information, using very first time recurrent neural network LSTM pair First sub-information carries out processing the 4th information that generates, and the first information is common with the 4th information Constitute first reference information；

By the video by the first information, second information, the 3rd information and described 4th information is processed, to obtain the confidence level matrix of the video；

By at least one video by the first information, second information, the 3rd information and 4th information is processed, and to obtain the class relations matrix of the video, described at least one regards Frequency is used for classifying the video；

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, institute Stating object function isWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W it is) video in the first information, second information, the 3rd information and described Empirical loss in the processing procedure of the 4th information, the S is the confidence level matrix of the video, and Y is The category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Table Show not this norm of Luo Beini crows, | | | |₁Represent Sparse rules operator.

4. video classification methods according to claim 1, it is characterised in that the utilization depth Neutral net generates corresponding first reference information of described image information, the Optic flow information corresponding second Reference information and corresponding 3rd reference information of the acoustic information include：

The second information is generated to Optic flow information process using the second convolutional neural networks, according to default rule Then from second information, the second sub-information is extracted, second sub-information is entered using the 2nd LSTM Row processes the 5th information that generates, and second information collectively constitutes second reference with the 5th information Information；

By the video by the first information, second information, the 3rd information and described 5th information is processed, to obtain the confidence level matrix of the video；

By at least one video by the first information, second information, the 3rd information and 5th information is processed, and to obtain the class relations matrix of the video, described at least one regards Frequency is used for classifying the video；

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function institute Stating object function isSolution draws W values, the W values For the subject fusion parameter of the video, wherein, W is the fusion parameters in the object function, described L(S,Y；W it is) video in the first information, second information, the 3rd information and described Empirical loss in the processing procedure of the 5th information, the S is the confidence level matrix of the video, and Y is The category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Table Show not this norm of Luo Beini crows, | | | |₁Represent Sparse rules operator.

5. video classification methods according to claim 1, it is characterised in that the utilization depth Neutral net generates corresponding first reference information of described image information, the Optic flow information corresponding second Reference information and corresponding 3rd reference information of the acoustic information include：

The 3rd information is generated to the acoustic information system using the 3rd convolutional neural networks, according to default rule Then from the 3rd information, the 3rd sub-information is extracted, the 3rd sub-information is entered using the 3rd LSTM Row processes the 6th information that generates, and the 3rd information collectively constitutes the 3rd reference with the 6th information Information；

By the video by the first information, second information, the 3rd information and described 6th information is processed, to obtain the confidence level matrix of the video；

By at least one video by the first information, second information, the 3rd information and 6th information is processed, and to obtain the class relations matrix of the video, described at least one regards Frequency is used for classifying the video；

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, institute Stating subject fusion parameter isSolution draws W values, the W It is worth the subject fusion parameter for the video, wherein, W is the fusion parameters in the object function, institute State L (S, Y；W) be the video the first information, second information, the 3rd information and Empirical loss in the processing procedure of the 6th information, the S is the confidence level matrix of the video, Y It is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient, Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

6. video classification methods according to claim 1, it is characterised in that the utilization depth Neutral net generates corresponding first reference information of described image information, the Optic flow information corresponding second Reference information and corresponding 3rd reference information of the acoustic information include：

The first information is generated to described image information processing using the first convolutional neural networks, according to default rule From the first information, the first sub-information is then extracted, first sub-information is entered using a LSTM Row processes the 4th information that generates, and the first information collectively constitutes first reference with the 4th information Information；

Using the second convolutional neural networks to the Optic flow information process generate the second information, according to rule from The second sub-information is extracted in second information, using the 2nd LSTM to second sub-information at Reason generates the 5th information, and second information collectively constitutes second reference information with the 5th information；

By the video by the first information, second information, the 3rd information, described Four information and the 5th information are processed, to obtain the confidence level matrix of the video；

By at least one video by the first information, second information, the 3rd information, institute State the 4th information and the 5th information is processed, to obtain the class relations matrix of the video, At least one video is used for classifying the video；

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the mesh Scalar functions areSolution draws W values, and the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W) be the video the first information, second information, the 3rd information, described Empirical loss in the processing procedure of four information and the 5th information, the S are putting for the video Reliability matrix, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For power Weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

7. video classification methods according to claim 1, it is characterised in that the utilization depth Neutral net generates corresponding first reference information of described image information, the Optic flow information corresponding second Reference information and corresponding 3rd reference information of the acoustic information include：

By the video by the first information, second information, the 3rd information, described Four information and the 6th information are processed, to obtain the confidence level matrix of the video；

By at least one video by the first information, second information, the 3rd information, institute State the 4th information and the 6th information is processed, to obtain the class relations matrix of the video, At least one video is used for classifying the video；

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the mesh Scalar functions areSolution draws W values, and the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W) be the video the first information, second information, the 3rd information, described Empirical loss in the processing procedure of four information and the 6th information, the S are putting for the video Reliability matrix, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For power Weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

8. video classification methods according to claim 1, it is characterised in that the utilization depth Neutral net generates corresponding first reference information of described image information, the Optic flow information corresponding second Reference information and corresponding 3rd reference information of the acoustic information include：

By the video by the first information, second information, the 3rd information, described Five information and the 6th information are processed, to obtain the confidence level matrix of the video；

By at least one video by the first information, second information, the 3rd information, institute State the 5th information and the 6th information is processed, to obtain the class relations matrix of the video, At least one video is used for classifying the video；

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, institute Stating object function isSolution draws W values, the W values For the subject fusion parameter of the video, wherein, W is the fusion parameters in the object function, described L(S,Y；W) be the video the first information, second information, the 3rd information, described Empirical loss in the processing procedure of five information and the 6th information, the S are putting for the video Reliability matrix, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For power Weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

9. video classification methods according to claim 1, it is characterised in that the utilization depth Neutral net generates corresponding first reference information of described image information, the Optic flow information corresponding second Reference information and corresponding 3rd reference information of the acoustic information include：

By the video by the first information, second information, the 3rd information, described Four information, the 5th information and the 6th information are processed, to obtain the confidence of the video Degree matrix；

By at least one video by the first information, second information, the 3rd information, institute State the 4th information, the 5th information and the 6th information to be processed, to obtain the video Class relations matrix, at least one video are used for classifying the video；

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the mesh Scalar functions areSolution draws W values, and the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W) be the video the first information, second information, the 3rd information, described Empirical loss in the processing procedure of four information, the 5th information and the 6th information, the S It is the confidence level matrix of the video, Y is the category of the video, and V is the class relations of the video Matrix, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules calculation Son.

10. a kind of visual classification device, it is characterised in that include：

Generation module, the image information for generating the acquisition module acquisition using deep neural network are corresponding Corresponding second reference information of Optic flow information that first reference information, the acquisition module are obtained and described Corresponding 3rd reference information of acoustic information that acquisition module is obtained；

Processing module, for generated according to the generation module the first reference information, the second reference information And the 3rd reference information video is processed with obtain the video confidence level matrix and The class relations matrix of the video；

Computing module, for the confidence level matrix of video that obtains the processing module and the video Class relations matrix substitutes into object function, and to obtain the subject fusion parameter of the video, the target is melted Closing parameter is used for classifying video, and wherein, the constraint factor of the object function includes confidence level square Battle array, class relations matrix and fusion parameters.

11. visual classification devices according to claim 10, it is characterised in that

The generation module, specifically for obtained to the acquisition module using the first convolutional neural networks Image Information Processing generates the first information, and the first information is first reference information；Using second Optic flow information process the second information of generation that convolutional neural networks are obtained to the acquisition module, described second Information is second reference information；The sound acquisition module obtained using the 3rd convolutional neural networks Sound information processing generates the 3rd information, and the 3rd information is the 3rd reference information；

The processing module, specifically for believing the video by the generation module is generated first Breath, second information and the 3rd information are processed, to obtain the confidence level square of the video Battle array；The first information that at least one video is generated by the generation module, second information and 3rd information is processed, and to obtain the class relations matrix of the video, described at least one regards Frequency is used for classifying the video；

The computing module, specifically for the confidence level matrix of video that obtains the processing module and institute The class relations matrix for stating video substitutes into object function, and the object function is

\underset{W}{m i n} L (S, Y; W) + λ_{1} | | W - V | |_{F}^{2} + λ_{2} | | W | |_{1},

Wherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W) it is process of the video in the first information, second information and the 3rd information During empirical loss, the S is the confidence level matrix of the video, and Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Represent not Luo Beini Wu Sifan Number, | | | |₁Represent Sparse rules operator.

12. visual classification devices according to claim 10, it is characterised in that

The generation module, specifically for obtained to the acquisition module using the first convolutional neural networks Image Information Processing generates the first information, extracts the first son letter according to preset rules from the first information Breath, is carried out processing to first sub-information using very first time recurrent neural network LSTM and generates the 4th Information, the first information collectively constitute first reference information with the 4th information；Using second Optic flow information process the second information of generation that convolutional neural networks are obtained to the acquisition module, described second Information is second reference information；The sound acquisition module obtained using the 3rd convolutional neural networks Sound information processing generates the 3rd information, and the 3rd information is the 3rd reference information；

The processing module, specifically for believing the video by the generation module is generated first Breath, second information, the 3rd information and the 4th information are processed, described to obtain The confidence level matrix of video；The first information that at least one video is generated by the generation module, institute State the second information, the 3rd information and the 4th information to be processed, to obtain the video Class relations matrix, at least one video are used for classifying the video；

\underset{W}{m i n} L (S, Y; W) + λ_{1} | | W - V | |_{F}^{2} + λ_{2} | | W | |_{1},

Wherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W it is) video in the first information, second information, the 3rd information and described Empirical loss in the processing procedure of the 4th information, the S is the confidence level matrix of the video, and Y is The category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Table Show not this norm of Luo Beini crows, | | | |₁Represent Sparse rules operator.

13. visual classification devices according to claim 10, it is characterised in that

The generation module, specifically for obtained to the acquisition module using the first convolutional neural networks Image Information Processing generates the first information, and the first information is first reference information；Using second The Optic flow information that convolutional neural networks are obtained to the acquisition module processes second information that generates, according to default Rule extracts the second sub-information from second information, using the 2nd LSTM to second sub-information Carry out processing the 5th information that generates, second information collectively constitutes second ginseng with the 5th information Examine information；The acoustic information system acquisition module obtained using the 3rd convolutional neural networks generates the Three information, the 3rd information are the 3rd reference information；

The processing module, specifically for believing the video by the generation module is generated first Breath, second information, the 3rd information and the 5th information are processed, described to obtain The confidence level matrix of video；The first information that at least one video is generated by the generation module, institute State the second information, the 3rd information and the 5th information to be processed, to obtain the video Class relations matrix, at least one video are used for classifying the video；

\underset{W}{m i n} L (S, Y; W) + λ_{1} | | W - V | |_{F}^{2} + λ_{2} | | W | |_{1},

Wherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W it is) video in the first information, second information, the 3rd information and described Empirical loss in the processing procedure of the 5th information, the S is the confidence level matrix of the video, and Y is The category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Table Show not this norm of Luo Beini crows, | | | |₁Represent Sparse rules operator.

14. visual classification devices according to claim 10, it is characterised in that

The generation module, for the image acquisition module obtained using the first convolutional neural networks Information processing generates the first information, and the first information is first reference information；Using the second convolution The Optic flow information that neutral net is obtained to the acquisition module processes second information that generates, second information For second reference information；Believed using the sound that the 3rd convolutional neural networks are obtained to the acquisition module Breath processes the 3rd information that generates, and extracts the 3rd sub-information according to preset rules from the 3rd information, profit The 3rd sub-information is carried out processing the 6th information that generates, the 3rd information and institute with the 3rd LSTM State the 6th information and collectively constitute the 3rd reference information；

The processing module, specifically for believing the video by the generation module is generated first Breath, second information, the 3rd information and the 6th information are processed, described to obtain The confidence level matrix of video；The first information that at least one video is generated by the generation module, institute State the second information, the 3rd information and the 6th information to be processed, to obtain the video Class relations matrix, at least one video are used for classifying the video；

\underset{W}{m i n} L (S, Y; W) + λ_{1} | | W - V | |_{F}^{2} + λ_{2} | | W | |_{1},

Wherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W it is) video in the first information, second information, the 3rd information and described Empirical loss in the processing procedure of the 6th information, the S is the confidence level matrix of the video, and Y is The category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Table Show not this norm of Luo Beini crows, | | | |₁Represent Sparse rules operator.

15. visual classification devices according to claim 10, it is characterised in that

The generation module, specifically for obtained to the acquisition module using the first convolutional neural networks Image Information Processing generates the first information, extracts the first son letter according to preset rules from the first information Breath, is carried out processing the 4th information that generates, first letter to first sub-information using a LSTM Breath collectively constitutes first reference information with the 4th information；Using the second convolutional neural networks to institute The Optic flow information for stating acquisition module acquisition processes second information that generates, according to rule from second information The second sub-information is extracted, second sub-information is carried out processing the 5th information that generates using the 2nd LSTM, Second information collectively constitutes second reference information with the 5th information；Using the 3rd convolution god The 3rd information is generated through the acoustic information system that network is obtained to the acquisition module, the 3rd information is 3rd reference information；

The processing module, specifically for believing the video by the generation module is generated first At breath, second information, the 3rd information, the 4th information and the 5th information Reason, to obtain the confidence level matrix of the video；At least one video is generated by the generation module The first information, second information, the 3rd information, the 4th information and described 5th letter Breath, is processed to obtain the class relations matrix of the video, and at least one video is used for institute State video to be classified；

\underset{W}{m i n} L (S, Y; W) + λ_{1} | | W - V | |_{F}^{2} + λ_{2} | | W | |_{1},

Wherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W) be the video the first information, second information, the 3rd information, described Empirical loss in the processing procedure of four information and the 5th information, the S are putting for the video Reliability matrix, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For power Weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

16. visual classification devices according to claim 10, it is characterised in that

The generation module, specifically for obtained to the acquisition module using the first convolutional neural networks Image Information Processing generates the first information, extracts the first son letter according to preset rules from the first information Breath, is carried out processing the 4th information that generates, first letter to first sub-information using a LSTM Breath collectively constitutes first reference information with the 4th information；Using the second convolutional neural networks to institute The Optic flow information for stating acquisition module acquisition processes second information that generates, and second information is the described second ginseng Examine information；The acoustic information system acquisition module obtained using the 3rd convolutional neural networks generates the Three information, extract the 3rd sub-information from the 3rd information according to preset rules, using the 3rd LSTM pair 3rd sub-information carries out processing the 6th information that generates, and the 3rd information is common with the 6th information Constitute the 3rd reference information；

The processing module, specifically for believing the video by the generation module is generated first At breath, second information, the 3rd information, the 4th information and the 6th information Reason, to obtain the confidence level matrix of the video；At least one video is generated by the generation module The first information, second information, the 3rd information, the 4th information and described 6th letter Breath is processed, and to obtain the class relations matrix of the video, at least one video is used for institute State video to be classified；

\underset{W}{m i n} L (S, Y; W) + λ_{1} | | W - V | |_{F}^{2} + λ_{2} | | W | |_{1},

Wherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W) be the video the first information, second information, the 3rd information, described Empirical loss in the processing procedure of four information and the 6th information, the S are putting for the video Reliability matrix, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For power Weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

17. visual classification devices according to claim 10, it is characterised in that

The generation module, specifically for obtained to the acquisition module using the first convolutional neural networks Image Information Processing generates the first information, and the first information is first reference information；Using second The Optic flow information that convolutional neural networks are obtained to the acquisition module processes second information that generates, according to default Rule extracts the second sub-information from second information, using the 2nd LSTM to second sub-information Carry out processing the 5th information that generates, second information collectively constitutes second ginseng with the 5th information Examine information；The acoustic information system acquisition module obtained using the 3rd convolutional neural networks generates the Three information, extract the 3rd sub-information from the 3rd information according to preset rules, using the 3rd LSTM pair 3rd sub-information carries out processing the 6th information that generates, and the 3rd information is common with the 6th information Constitute the 3rd reference information；

The processing module, specifically for believing the video by the generation module is generated first At breath, second information, the 3rd information, the 5th information and the 6th information Reason, to obtain the confidence level matrix of the video；At least one video is generated by the generation module The first information, second information, the 3rd information, the 5th information and described 6th letter Breath is processed, and to obtain the class relations matrix of the video, at least one video is used for institute State video to be classified；

\underset{W}{m i n} L (S, Y; W) + λ_{1} | | W - V | |_{F}^{2} + λ_{2} | | W | |_{1},

Wherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W) be the video the first information, second information, the 3rd information, described Empirical loss in the processing procedure of five information and the 6th information, the S are putting for the video Reliability matrix, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For power Weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

18. visual classification devices according to claim 10, it is characterised in that

The generation module, specifically for obtained to the acquisition module using the first convolutional neural networks Image Information Processing generates the first information, extracts the first son letter according to preset rules from the first information Breath, is carried out processing the 4th information that generates, first letter to first sub-information using a LSTM Breath collectively constitutes first reference information with the 4th information；Using the second convolutional neural networks to institute The Optic flow information for stating acquisition module acquisition processes second information that generates, according to preset rules from the described second letter The second sub-information is extracted in breath, second sub-information is carried out processing using the 2nd LSTM and is generated the 5th Information, second information collectively constitute second reference information with the 5th information；Using the 3rd The acoustic information system that convolutional neural networks are obtained to the acquisition module generates the 3rd information, according to default rule Then from the 3rd information, the 3rd sub-information is extracted, the 3rd sub-information is entered using the 3rd LSTM Row processes the 6th information that generates, and the 3rd information collectively constitutes the 3rd reference with the 6th information Information；

The processing module, specifically for believing the video by the generation module is generated first Breath, second information, the 3rd information, the 4th information, the 5th information and described 6th information is processed, to obtain the confidence level matrix of the video；At least one video is passed through institute State generation module generation the first information, second information, the 3rd information, the 4th information, 5th information and the 6th information are processed, to obtain the class relations matrix of the video, At least one video is used for classifying the video；

\underset{W}{m i n} L (S, Y; W) + λ_{1} | | W - V | |_{F}^{2} + λ_{2} | | W | |_{1},

Wherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W) be the video the first information, second information, the 3rd information, described Empirical loss in the processing procedure of four information, the 5th information and the 6th information, the S It is the confidence level matrix of the video, Y is the category of the video, and V is the class relations of the video Matrix, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules calculation Son.

19. a kind of visual classification devices, it is characterised in that include：Memorizer, processor, emitter With receptor；Wherein, the memorizer is connected with the processor, the processor and the emitter Connection, the processor are connected with the receptor；

20. visual classification devices according to claim 19, it is characterised in that described using deep Degree neutral net generates corresponding first reference information of described image information, the Optic flow information corresponding the Two reference informations and corresponding 3rd reference information of the acoustic information include：

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the mesh Scalar functions areWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W) it is process of the video in the first information, second information and the 3rd information During empirical loss, the S is the confidence level matrix of the video, and Y is the category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Represent not Luo Beini Wu Sifan Number, | | | |₁Represent Sparse rules operator.

21. visual classification devices according to claim 19, it is characterised in that described using deep Degree neutral net generates corresponding first reference information of described image information, the Optic flow information corresponding the Two reference informations and corresponding 3rd reference information of the acoustic information include：

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the mesh Scalar functions areWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W it is) video in the first information, second information, the 3rd information and described Empirical loss in the processing procedure of the 4th information, the S is the confidence level matrix of the video, and Y is The category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Table Show not this norm of Luo Beini crows, | | | |₁Represent Sparse rules operator.

22. visual classification devices according to claim 19, it is characterised in that described using deep Degree neutral net generates corresponding first reference information of described image information, the Optic flow information corresponding the Two reference informations and corresponding 3rd reference information of the acoustic information include：

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the mesh Scalar functions areWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W it is) video in the first information, second information, the 3rd information and described Empirical loss in the processing procedure of the 5th information, the S is the confidence level matrix of the video, and Y is The category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Table Show not this norm of Luo Beini crows, | | | |₁Represent Sparse rules operator.

23. visual classification devices according to claim 19, it is characterised in that described using deep Degree neutral net generates corresponding first reference information of described image information, the Optic flow information corresponding the Two reference informations and corresponding 3rd reference information of the acoustic information include：

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the mesh Scalar functions areWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W it is) video in the first information, second information, the 3rd information and described Empirical loss in the processing procedure of the 6th information, the S is the confidence level matrix of the video, and Y is The category of the video, V is the class relations matrix of the video, λ₁And λ₂For weight coefficient,Table Show not this norm of Luo Beini crows, | | | |₁Represent Sparse rules operator.

24. visual classification devices according to claim 19, it is characterised in that described using deep Degree neutral net generates corresponding first reference information of described image information, the Optic flow information corresponding the Two reference informations and corresponding 3rd reference information of the acoustic information include：

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the mesh Scalar functions areWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W) be the video the first information, second information, the 3rd information, described Empirical loss in the processing procedure of four information and the 5th information, the S are putting for the video Reliability matrix, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For power Weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

25. visual classification devices according to claim 19, it is characterised in that described using deep Degree neutral net generates corresponding first reference information of described image information, the Optic flow information corresponding the Two reference informations and corresponding 3rd reference information of the acoustic information include：

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the mesh Scalar functions

\underset{W}{m i n} L (S, Y; W) + λ_{1} | | W - V | |_{F}^{2} + λ_{2} | | W | |_{1},

26. visual classification devices according to claim 19, it is characterised in that described using deep Degree neutral net generates corresponding first reference information of described image information, the Optic flow information corresponding the Two reference informations and corresponding 3rd reference information of the acoustic information include：

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the mesh Scalar functions areWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W) be the video the first information, second information, the 3rd information, described Empirical loss in the processing procedure of five information and the 6th information, the S are putting for the video Reliability matrix, Y is the category of the video, and V is the class relations matrix of the video, λ₁And λ₂For power Weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules operator.

27. visual classification devices according to claim 19, it is characterised in that described using deep Degree neutral net generates corresponding first reference information of described image information, the Optic flow information corresponding the Two reference informations and corresponding 3rd reference information of the acoustic information include：

The class relations matrix of the confidence level matrix and the video of the video is substituted into object function, the mesh Scalar functions areWherein, the object function is represented and is worked asValue minimum when, solve and draw W values, the W values are institute The subject fusion parameter of video is stated, wherein, W is the fusion parameters in the object function, described L(S,Y；W) be the video the first information, second information, the 3rd information, described Empirical loss in the processing procedure of four information, the 5th information and the 6th information, the S It is the confidence level matrix of the video, Y is the category of the video, and V is the class relations of the video Matrix, λ₁And λ₂For weight coefficient,Not this norm of Luo Beini crows is represented, | | | |₁Represent Sparse rules calculation Son.