CN105228033B - A kind of method for processing video frequency and electronic equipment - Google Patents
A kind of method for processing video frequency and electronic equipment
- Publication number
- CN105228033B CN201510535580.9A
- Authority
- CN
- China
- Prior art keywords
- video
- feature
- features
- face
- video frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8549—Creating video summaries, e.g. movie trailer
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/60—Analysis of geometric attributes
Abstract
The invention discloses a video processing method and an electronic device. The method includes: extracting a first feature set from video frames, the first feature set including color moment features, wavelet texture features, motion features, and local keypoint features; computing a second feature set based on the first feature set, the second feature set including a motion attention feature, a face attention feature based on depth information, and a semantic indication feature of the video segment; and fusing the features in the second feature set with an iteratively reweighted linear model to obtain a video summary.
Description
Technical Field
The present invention relates to video processing technologies, and in particular, to a video processing method and an electronic device.
Background
Intelligent terminals such as smartphones have become portable companions in people's work and daily life, and users can easily accumulate large numbers of videos through downloading and self-shooting. For a mobile phone equipped with a binocular camera in particular, the amount of data to be stored is even larger. Given the relatively limited capacity of mobile phone memory, the management of video files has become an urgent problem to be solved.
Disclosure of Invention
In order to solve the above technical problem, embodiments of the present invention provide a video processing method and an electronic device.
The video processing method provided by the embodiment of the invention comprises the following steps:
extracting a first feature set from a video frame, the first feature set comprising: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics;
calculating a second feature set based on the first feature set, wherein the second feature set comprises: a motion attention feature, a human face attention feature based on depth information, and a semantic indication feature of a video segment;
and performing fusion processing on each feature in the second feature set by using the linear model of iterative reweighting so as to obtain the video abstract.
The electronic equipment provided by the embodiment of the invention comprises:
an extracting unit, configured to extract a first feature set from a video frame, where the first feature set includes: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics;
the first processing unit is configured to calculate a second feature set based on the first feature set, where the second feature set includes: a motion attention feature, a human face attention feature based on depth information, and a semantic indication feature of a video segment;
and the second processing unit is used for carrying out fusion processing on each feature in the second feature set by using the linear model of iterative reweighting so as to obtain the video abstract.
In the technical scheme of the embodiment of the invention, color moment features, wavelet texture features, motion features, and local keypoint features are extracted from video frames; a motion attention feature, a human face attention feature based on depth information, and a semantic indication feature of the video segment are then computed from these extracted features; and the motion attention feature, the human face attention feature based on depth information, and the semantic indication feature of the video segment are fused to obtain the video summary. In this way, video segments with relatively refined semantics and important meaning are extracted from the original video, which effectively reduces the amount of data the electronic device needs to store, improves the utilization of the electronic device's memory and the user experience, and also helps the user later locate the video he or she most wants to find among a small number of video files. In addition, the technical scheme of the embodiment of the invention combines information from the visual modality and the textual modality, and can more effectively capture the high-level semantics of the video content. The human face attention feature incorporates the depth information of objects in the scene, so the high-level semantics can be grasped from a more comprehensive perspective. The technical scheme of the embodiment of the invention does not depend on heuristic rules formulated for specific video types, and is therefore applicable to a wider range of video types.
Drawings
Fig. 1 is a schematic flowchart of a video processing method according to a first embodiment of the invention;
fig. 2 is a flowchart illustrating a video processing method according to a second embodiment of the invention;
FIG. 3 is an overall flow chart of video summarization according to an embodiment of the present invention;
FIG. 4 is a flow chart of computing semantic indication features of a video segment according to an embodiment of the invention;
fig. 5 is a schematic structural diagram of an electronic device according to a first embodiment of the invention;
fig. 6 is a schematic structural diagram of an electronic device according to a second embodiment of the present invention.
Detailed Description
So that the manner in which the features and aspects of the embodiments of the present invention can be understood in detail, a more particular description of the embodiments of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings.
In the era of information explosion, traditional methods of browsing and managing video data face unprecedented challenges. It is therefore of great practical significance to provide video users with a short video summary that concentrates the key information of the original video. Video summaries can generally be divided into two types, dynamic and static: a dynamic video summary is a shortened version of the original video and may contain a series of video segments extracted from the original long version, while a static video summary may consist of a set of key frames extracted from the original video.
Conventional video summaries are generated by extracting visual or textual features from a video. However, most methods in this direction use heuristic rules or simple text analysis (e.g., based on word frequency statistics). In addition, traditional attention-model methods that use face features consider only information such as the planar position and size of the detected face in the scene, and do not exploit depth information.
The technical scheme of the embodiment of the invention estimates the relative importance of each video segment in an iteratively reweighted manner, based on a user attention model, the semantic information of the video, and the depth information of the video frames, thereby generating a dynamic video summary.
Fig. 1 is a schematic flowchart of a video processing method according to a first embodiment of the present invention, as shown in fig. 1, the video processing method includes the following steps:
step 101: extracting a first feature set from a video frame, the first feature set comprising: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics.
Referring to fig. 3, first, a first feature set is extracted from a video frame, the first feature set is a low-level feature set, and the first feature set includes four low-level features: color moment features, wavelet texture features, motion features, and local keypoint features.
Four low-level features in the first feature set are described in detail below.
(1) Colour moment characteristics
A video frame is spatially divided into 5 × 5 (25 in total) non-overlapping pixel blocks, and for each pixel block the first-order moment and the second- and third-order central moments are computed for the three channels of the Lab color space. The color moments of the 25 pixel blocks of a frame constitute its color moment feature vector f_cm(i).
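The following sketch (not part of the original patent text) illustrates how the per-block Lab color moments described above could be computed; OpenCV and NumPy are assumed, and the function name and exact moment ordering are hypothetical.

```python
import cv2
import numpy as np

def color_moment_feature(frame_bgr, grid=5):
    """Sketch of f_cm(i): mean plus 2nd- and 3rd-order central moments of each
    Lab channel over a grid x grid partition of the frame."""
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB).astype(np.float64)
    h, w, _ = lab.shape
    feats = []
    for r in range(grid):
        for c in range(grid):
            block = lab[r * h // grid:(r + 1) * h // grid,
                        c * w // grid:(c + 1) * w // grid]
            for ch in range(3):
                x = block[:, :, ch].ravel()
                mean = x.mean()
                feats.append(mean)                      # first-order moment
                feats.append(((x - mean) ** 2).mean())  # second-order central moment
                feats.append(((x - mean) ** 3).mean())  # third-order central moment
    return np.asarray(feats)  # 5 * 5 blocks * 3 channels * 3 moments = 225 values
```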
(2) Wavelet texture feature
Similarly, a video frame is divided into 3 × 3 (9 in total) non-overlapping pixel blocks, and the luminance component of each block is subjected to a three-level Haar wavelet decomposition; the variance of the wavelet coefficients is then computed for each level in the horizontal, vertical, and diagonal directions. All the wavelet coefficient variances of a video frame constitute its wavelet texture feature vector f_wt(i).
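A comparable sketch (again illustrative, not from the patent) for the wavelet texture feature, assuming the PyWavelets package for the three-level Haar decomposition; the aggregation into one flat vector is an assumption.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_texture_feature(gray_frame, grid=3, levels=3):
    """Sketch of f_wt(i): variance of the Haar detail coefficients (horizontal,
    vertical, diagonal) at each decomposition level, per 3x3 block."""
    h, w = gray_frame.shape
    feats = []
    for r in range(grid):
        for c in range(grid):
            block = gray_frame[r * h // grid:(r + 1) * h // grid,
                               c * w // grid:(c + 1) * w // grid].astype(np.float64)
            coeffs = pywt.wavedec2(block, 'haar', level=levels)
            for cH, cV, cD in coeffs[1:]:              # detail sub-bands per level
                feats += [cH.var(), cV.var(), cD.var()]
    return np.asarray(feats)  # 3 * 3 blocks * 3 levels * 3 directions = 81 values
```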
(3) Movement characteristics
The human eye is highly sensitive to changes in visual content. Based on this principle, a video frame is divided into M × N non-overlapping pixel blocks, each block containing 16 × 16 pixels, and a motion vector v(i, m, n) is computed for each block by a motion estimation algorithm. The M × N motion vectors constitute the motion feature f_mv(i) of the video frame.
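The patent does not name a specific motion estimation algorithm, so the sketch below (illustrative only) approximates the per-block motion vectors by averaging dense optical flow over 16 × 16 blocks, using OpenCV's Farnebäck method as a stand-in.

```python
import cv2
import numpy as np

def motion_feature(prev_gray, curr_gray, block=16):
    """Sketch of f_mv(i): one motion vector v(i, m, n) per 16x16 block, here
    obtained by averaging dense optical flow (a stand-in for block matching)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    M, N = h // block, w // block
    vectors = np.zeros((M, N, 2))
    for m in range(M):
        for n in range(N):
            patch = flow[m * block:(m + 1) * block, n * block:(n + 1) * block]
            vectors[m, n] = patch.reshape(-1, 2).mean(axis=0)  # mean (dx, dy)
    return vectors
```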
(4) Local keypoint features
In semantic-level video analysis, a bag-of-features (BoF) representation based on local keypoints can serve as a powerful complement to features computed from global information. Salient regions are therefore captured with soft-weighted local keypoint features, which are defined by the importance of keypoints with respect to a vocabulary of 500 visual words. Specifically, the keypoints in the i-th video frame are detected by a Difference of Gaussians (DoG) detector, represented by Scale-Invariant Feature Transform (SIFT) descriptors, and clustered into 500 visual words. The keypoint feature vector f_kp(i) is defined as the weighted similarity between the keypoints and their four nearest-neighbor visual words.
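As an illustrative sketch only: the soft-weighted bag-of-features could be assembled as below, assuming a precomputed 500-word SIFT vocabulary; the rank-decaying weights over the four nearest visual words are an assumption, since the patent does not spell out the weighting formula.

```python
import cv2
import numpy as np

def soft_weighted_bof(frame_gray, vocabulary, top_k=4):
    """Sketch of f_kp(i): DoG/SIFT keypoints softly assigned to their four
    nearest visual words; `vocabulary` is assumed to be a (500, 128) array."""
    sift = cv2.SIFT_create()                    # DoG detector + SIFT descriptor
    _, desc = sift.detectAndCompute(frame_gray, None)
    hist = np.zeros(len(vocabulary))
    if desc is None:
        return hist
    for d in desc:
        dists = np.linalg.norm(vocabulary - d, axis=1)
        nearest = np.argsort(dists)[:top_k]     # four nearest visual words
        for rank, idx in enumerate(nearest):
            hist[idx] += 1.0 / (2 ** rank)      # assumed rank-decaying soft weight
    return hist / max(hist.sum(), 1e-12)
```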
Step 102: calculating to obtain a second feature set based on the first feature set, wherein the second feature set comprises: the method comprises the following steps of motion attention feature, human face attention feature based on depth information and semantic indication feature of a video segment.
Next, based on these low-level features, further high-level visual and semantic features, referred to as a second feature set, are computed, including: the method comprises the following steps of (1) moving attention feature, human face attention feature based on depth information and semantic indication feature of a video segment.
Next, based on the above low-level features, further for each arbitrary given video segment χs(starting from the i-th1(s) frame, terminating at ith2(s) frames) to compute advanced visual and semantic features. Video segmentation is achieved by shot cut detection.
Each feature of the second feature set is described in detail below.
(1) Motion attention feature
Research on human attention in psychology lays an indispensable foundation for attention modeling in computer vision. The cognitive mechanism of attention is key to analyzing and understanding human thought and activity, and can therefore guide the selection of relatively important content from the original video to form the video summary. This scheme uses a motion attention model to compute high-level motion attention features suited to semantic analysis.
For the (m, n)-th pixel block in the i-th video frame, a spatial window containing 5 × 5 (25 in total) pixel blocks and a temporal window containing 7 pixel blocks are defined, both centered on the (m, n)-th block of the i-th frame. The phase range [0, 2π) is evenly divided into 8 intervals, and a spatial phase histogram is accumulated within the spatial window and a temporal phase histogram within the temporal window. The spatial consistency indicator C_s(i, m, n) and the temporal consistency indicator C_t(i, m, n) are then obtained as:
C_s(i, m, n) = -Σ_ζ p_s(ζ) log p_s(ζ)   (1a)
C_t(i, m, n) = -Σ_ζ p_t(ζ) log p_t(ζ)   (2a)
where p_s(ζ) and p_t(ζ) are the phase distributions in the spatial window and the temporal window, respectively. The motion attention feature of the i-th frame is then defined as follows:
to suppress noise in the features of neighboring video frames, the above resulting sequence of motion attention features will be processed through a 9 th order median filter. For the s video segment χsThe motion attention feature is obtained by calculating the filtered single-frame feature value:
(2) human face attention feature based on depth information
In video, the presence of a human face generally indicates more important content. In this scheme, the area A_F(j) and the position of each face (indexed by j) in a video frame are obtained by a face detection algorithm. For the j-th detected face, based on the depth image d_i corresponding to the video frame and the set of pixels {x | x ∈ Λ(j)} constituting the face, the depth saliency D(j) is defined as follows:
where |Λ(j)| is the number of pixels belonging to the j-th face. Based on the position of the face within the whole video frame, a position weight w_fp(j) is also defined to approximately reflect the relative attention the face receives from the viewer (regions closer to the center of the video frame are weighted more heavily), as shown in Table 1:
TABLE 1
Table 1 different face weights assigned to different regions in a video frame. The central area has a high weight and the edge area has a low weight.
The face attention feature of the i-th frame can be calculated as:
where A_frm is the area of the video frame and D_max(i) = max_x d_i(x). To reduce the influence of face detection inaccuracies on the overall scheme, the resulting face attention feature sequence is also smoothed with a 5th-order median filter. The face attention feature of video segment χ_s is then computed from the smoothed per-frame features {FAC(i) | i = i_1(s), ..., i_2(s)}:
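Since the patent's equations for D(j) and FAC(i), the Table 1 weights, and the segment-level aggregation are not reproduced in this text, the sketch below is only a structural stand-in: the relative face area, a caller-supplied position weight, and a depth term normalized by D_max(i) are combined in an assumed way, while the 5th-order median smoothing and segment aggregation follow the description above.

```python
import numpy as np
from scipy.signal import medfilt

def face_attention_frame(faces, depth_map, frame_area, pos_weight):
    """Structural sketch of FAC(i). `faces` holds (x, y, w, h) boxes from any
    face detector; `pos_weight` maps a face center to a Table-1-style weight.
    The combination below is an assumption, not the patent's exact formula."""
    d_max = depth_map.max() + 1e-12              # D_max(i) = max_x d_i(x)
    score = 0.0
    for (x, y, w, h) in faces:
        face_depth = depth_map[y:y + h, x:x + w]
        depth_saliency = 1.0 - face_depth.mean() / d_max  # assumed: closer faces more salient
        area_term = (w * h) / frame_area                  # A_F(j) / A_frm
        score += pos_weight((x + w / 2.0, y + h / 2.0)) * area_term * depth_saliency
    return score

def segment_face_attention(fac_sequence, i1, i2, order=5):
    """Smooth the FAC sequence with a 5th-order median filter and aggregate
    over frames i1..i2 of a segment (mean aggregation is an assumption)."""
    smoothed = medfilt(np.asarray(fac_sequence, dtype=float), kernel_size=order)
    return float(smoothed[i1:i2 + 1].mean())
```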
(3) Semantic indication features of video segments
Referring to Fig. 4, in order to mine semantic information, the scheme extracts the semantic indication features of a video segment based on the 374 concepts of VIREO-374 and three Support Vector Machines (SVMs) per concept. The support vector machines are trained on the color moment, wavelet texture, and local keypoint features, and at prediction time they estimate the probability that a given video frame is closely related to the concept. The flow of computing the semantic indication features of a video segment is shown in Fig. 4:
for video segment χsFirst, its intermediate frame i is extractedmColor moment characteristic f of(s)cm(im(s)), wavelet texture feature fwt(im(s)) and local keypoint features fkp(im(s)), and a probability value { u } is obtained by prediction using a support vector machinecm(s,j),uwt(s,j),ukp(s, j) | j ═ 1, 2.., 374}, and then the concept density is calculated:
then, the subtitle information corresponding to the video band is processed. Set gamma formed based on caption vocabularyst(s) and a set of concept vocabulary Γcp(j) And calculating the semantic Similarity of the characters by using a Similarity measurement tool WordNet of an external dictionary WordNet, wherein the Similarity comprises the following steps:
where η(γ, ω) denotes the similarity value between the subtitle word γ and the concept word ω given by WordNet::Similarity.
To reduce the impact of irrelevant concepts, the following textual relevance is defined:
where Q is a normalization coefficient. Since the support vector machines output probabilities for a two-class classification problem, a threshold of 0.5 is naturally used in the above formula.
Finally, the semantic indication feature f_E(s) of the video segment is defined as the sum of ρ(s, j) weighted by u(s, j):
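The sketch below (illustrative only) mirrors the structure of this computation: per-concept SVM probabilities are combined into a concept density, subtitle-to-concept similarity stands in for the WordNet::Similarity score η, and f_E(s) is the density-weighted sum. NLTK's WordNet interface replaces the Perl WordNet::Similarity tool, and the simple average, the max-similarity relevance, and the helper names are assumptions.

```python
import numpy as np
from nltk.corpus import wordnet as wn

def concept_density(u_cm, u_wt, u_kp):
    """Assumed combination of the three per-concept SVM probabilities
    (the patent's exact formula is not reproduced in this text)."""
    return (np.asarray(u_cm) + np.asarray(u_wt) + np.asarray(u_kp)) / 3.0

def word_similarity(gamma, omega):
    """Stand-in for eta(gamma, omega): Wu-Palmer similarity of the first
    WordNet synsets, via NLTK instead of WordNet::Similarity."""
    s1, s2 = wn.synsets(gamma), wn.synsets(omega)
    if not s1 or not s2:
        return 0.0
    return s1[0].wup_similarity(s2[0]) or 0.0

def semantic_indication_feature(subtitle_words, concept_vocab, u, threshold=0.5):
    """Sketch of f_E(s): sum over concepts of a textual relevance rho(s, j)
    weighted by the concept density u(s, j), keeping only concepts whose
    density exceeds the 0.5 threshold mentioned above."""
    f_e = 0.0
    for j, concept_words in enumerate(concept_vocab):
        if u[j] <= threshold:                   # drop irrelevant concepts
            continue
        rho = max((word_similarity(g, w)
                   for g in subtitle_words for w in concept_words), default=0.0)
        f_e += u[j] * rho
    return f_e
```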
step 103: and performing fusion processing on each feature in the second feature set by using the linear model of iterative reweighting so as to obtain the video abstract.
And finally, fusing the three high-level characteristics by utilizing an iterative reweighted linear model to generate a video abstract with the length required by a user.
In the embodiment of the invention, the video abstract is finally determined by the significance score of each video segment, so that the following linear model is adopted to fuse three high-level characteristics, and the fusion result is the significance score of the video segment:
f_SAL(s) = w_M(s)·f_M(s) + w_F(s)·f_F(s) + w_E(s)·f_E(s)   (12a)
where w_M(s), w_F(s), and w_E(s) are the feature weights. Each feature is normalized to the interval [0, 1] before linear fusion.
The feature weights are computed by an iterative reweighting method as follows. In the k-th iteration, the weight w_#(s) (# ∈ {M, F, E}) is determined by the product of a macroscopic factor α_#(s) and a microscopic factor β_#(s), i.e. w_#(s) = α_#(s)·β_#(s):
where r_#(s) is the rank of feature f_#(s) within {f_#(s) | s = 1, 2, ..., N_S} after sorting in descending order, and N_S is the total number of video segments in the video. The saliency f_SAL(s) of each video segment can then be computed and the segments arranged in descending order. According to the length required by the user, video segments are entered one by one into the selected video summary in order of f_SAL(s) from high to low.
Before the first iteration, the feature weights are initialized according to an equal-weight principle. The iterative process terminates after 15 iterations.
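Because the macroscopic/microscopic weight formulas are not reproduced in this text, the sketch below (illustrative only) keeps the overall structure of Eq. (12a) and the iteration — equal-weight initialization, per-feature reweighting, at most 15 iterations — but the specific α and β expressions are placeholders.

```python
import numpy as np

def fuse_saliency(f_M, f_F, f_E, max_iter=15):
    """Sketch of the iteratively reweighted fusion f_SAL(s) = w_M f_M +
    w_F f_F + w_E f_E; the alpha/beta updates below are assumed placeholders."""
    feats = {'M': np.asarray(f_M, float), 'F': np.asarray(f_F, float),
             'E': np.asarray(f_E, float)}
    for k, v in feats.items():                       # normalize each feature to [0, 1]
        rng = v.max() - v.min()
        feats[k] = (v - v.min()) / rng if rng > 0 else np.zeros_like(v)
    n = len(feats['M'])
    weights = {k: np.full(n, 1.0 / 3.0) for k in feats}   # equal-weight initialization
    for _ in range(max_iter):
        f_sal = sum(weights[k] * feats[k] for k in feats)
        for k in feats:
            ranks = np.argsort(np.argsort(-feats[k]))     # 0 = highest feature value
            alpha = 1.0 - ranks / max(n - 1, 1)           # assumed macroscopic (rank) factor
            beta = feats[k] / (f_sal + 1e-12)             # assumed microscopic factor
            weights[k] = alpha * beta
        total = sum(weights[k] for k in feats) + 1e-12
        for k in feats:
            weights[k] /= total                           # keep the three weights comparable
    return sum(weights[k] * feats[k] for k in feats)      # final f_SAL(s) per segment

# Segments would then be sorted by f_SAL(s) in descending order and added to the
# summary one by one until the user-requested length is reached.
```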
According to the technical scheme of the embodiment of the invention, low-level features such as color moments, wavelet textures, motion, and local keypoints are extracted from video frames. Next, based on these low-level features, high-level visual and semantic features are computed, including a motion attention feature, a face attention feature that takes depth information into account, and a semantic indication feature of the video segment. Then, an iteratively reweighted linear model is used to fuse the three high-level features and generate a video summary of the length required by the user.
Fig. 2 is a schematic flowchart of a video processing method according to a second embodiment of the present invention, and as shown in fig. 2, the video processing method includes the following steps:
step 201: extracting a first feature set from a video frame, the first feature set comprising: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics.
Referring to fig. 3, first, a first feature set is extracted from a video frame, the first feature set is a low-level feature set, and the first feature set includes four low-level features: color moment features, wavelet texture features, motion features, and local keypoint features.
Four low-level features in the first feature set are described in detail below.
(1) Colour moment characteristics
A video frame is spatially divided into 5 × 5 (25 in total) non-overlapping pixel blocks, and the first-order moment and the second- and third-order central moments are respectively computed for the three channels of the Lab color space on each pixel block. The color moments of the 25 pixel blocks of a frame constitute its color moment feature vector f_cm(i).
(2) Wavelet texture feature
Similarly, a video frame is divided into 3 × 3 (9 in total) non-overlapping pixel blocks, and the luminance component of each block is subjected to a three-level Haar wavelet decomposition; the variance of the wavelet coefficients is then computed for each level in the horizontal, vertical, and diagonal directions. All the wavelet coefficient variances of a video frame constitute its wavelet texture feature vector f_wt(i).
(3) Movement characteristics
The human eye is highly sensitive to changes in visual content. Based on this principle, a video frame is divided into M × N non-overlapping pixel blocks, each block containing 16 × 16 pixels, and a motion vector v(i, m, n) is computed for each block by a motion estimation algorithm. The M × N motion vectors constitute the motion feature f_mv(i) of the video frame.
(4) Local keypoint features
In semantic-level video analysis, a bag-of-features (BoF) representation based on local keypoints can serve as a powerful complement to features computed from global information. Salient regions are therefore captured with soft-weighted local keypoint features, which are defined by the importance of keypoints with respect to a vocabulary of 500 visual words. Specifically, the keypoints in the i-th video frame are detected by a Difference of Gaussians (DoG) detector, represented by Scale-Invariant Feature Transform (SIFT) descriptors, and clustered into 500 visual words. The keypoint feature vector f_kp(i) is defined as the weighted similarity between the keypoints and their four nearest-neighbor visual words.
Step 202: and calculating to obtain the motion attention feature according to the motion feature in the first feature set.
Next, based on these low-level features, further high-level visual and semantic features, referred to as the second feature set, are computed, including: a motion attention feature, a human face attention feature based on depth information, and a semantic indication feature of a video segment.
These high-level visual and semantic features are computed for each given video segment χ_s (starting at frame i_1(s) and ending at frame i_2(s)). Video segmentation is achieved by shot-cut detection.
Research on human attention in psychology lays an indispensable foundation for attention modeling in computer vision. The cognitive mechanism of attention is key to analyzing and understanding human thought and activity, and can therefore guide the selection of relatively important content from the original video to form the video summary. This scheme uses a motion attention model to compute high-level motion attention features suited to semantic analysis.
For the (m, n)-th pixel block in the i-th video frame, a spatial window containing 5 × 5 (25 in total) pixel blocks and a temporal window containing 7 pixel blocks are defined, both centered on the (m, n)-th block of the i-th frame. The phase range [0, 2π) is evenly divided into 8 intervals, and a spatial phase histogram is accumulated within the spatial window and a temporal phase histogram within the temporal window. The spatial consistency indicator C_s(i, m, n) and the temporal consistency indicator C_t(i, m, n) can then be obtained as:
C_s(i, m, n) = -Σ_ζ p_s(ζ) log p_s(ζ)   (1b)
C_t(i, m, n) = -Σ_ζ p_t(ζ) log p_t(ζ)   (2b)
where p_s(ζ) and p_t(ζ) are the phase distributions in the spatial window and the temporal window, respectively. The motion attention feature of the i-th frame is then defined as follows:
to suppress noise in the features of neighboring video frames, the above resulting sequence of motion attention features will be processed through a 9 th order median filter. For the s video segment χsThe motion attention feature is obtained by calculating the filtered single-frame feature value:
step 203: the area and the position of the face in each video frame are obtained through a face detection algorithm, and the face attention feature based on the depth information is obtained through calculation based on the depth image corresponding to the video frame and the pixel point set forming the face.
In video, the presence of a human face generally indicates more important content. In this scheme, the area A_F(j) and the position of each face (indexed by j) in a video frame are obtained by a face detection algorithm. For the j-th detected face, based on the depth image d_i corresponding to the video frame and the set of pixels {x | x ∈ Λ(j)} constituting the face, the depth saliency D(j) is defined as follows:
where |Λ(j)| is the number of pixels belonging to the j-th face. Based on the position of the face within the whole video frame, a position weight w_fp(j) is also defined to approximately reflect the relative attention the face receives from the viewer (regions closer to the center of the video frame are weighted more heavily), as shown in Table 1:
TABLE 1
Table 1 different face weights assigned to different regions in a video frame. The central area has a high weight and the edge area has a low weight.
The face attention feature of the i-th frame can be calculated as:
where A_frm is the area of the video frame and D_max(i) = max_x d_i(x). To reduce the influence of face detection inaccuracies on the overall scheme, the resulting face attention feature sequence is also smoothed with a 5th-order median filter. The face attention feature of video segment χ_s is then computed from the smoothed per-frame features {FAC(i) | i = i_1(s), ..., i_2(s)}:
step 204: and the support vector machine detects semantic concepts of the color moment features, the wavelet texture features and the local key point features to obtain concept density.
In the embodiment of the invention, support vector machines are trained based on the color moment features, the wavelet texture features, and the local keypoint features. The support vector machines are implemented with the LibSVM package, using a Radial Basis Function (RBF) kernel for the color moment and wavelet texture features, and a chi-square kernel for the local keypoint features.
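As a hedged illustration only (the patent names LibSVM; scikit-learn is used here as a stand-in), one trio of per-concept SVMs with the kernels described above could look like the following; the helper names and the binary label convention are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_concept_svms(X_cm, X_wt, X_kp, y):
    """RBF-kernel SVMs for the color moment and wavelet texture features and a
    chi-square-kernel SVM for the keypoint histograms; `y` holds binary labels
    (1 = frame depicts the concept)."""
    svm_cm = SVC(kernel='rbf', probability=True).fit(X_cm, y)
    svm_wt = SVC(kernel='rbf', probability=True).fit(X_wt, y)
    K_train = chi2_kernel(X_kp)                          # chi-square Gram matrix
    svm_kp = SVC(kernel='precomputed', probability=True).fit(K_train, y)
    return svm_cm, svm_wt, svm_kp, X_kp                  # keep X_kp for test-time kernels

def predict_concept_probs(models, x_cm, x_wt, x_kp):
    """Probabilities u_cm, u_wt, u_kp that one frame is related to the concept."""
    svm_cm, svm_wt, svm_kp, X_kp_train = models
    u_cm = svm_cm.predict_proba([x_cm])[0, 1]
    u_wt = svm_wt.predict_proba([x_wt])[0, 1]
    u_kp = svm_kp.predict_proba(chi2_kernel([x_kp], X_kp_train))[0, 1]
    return u_cm, u_wt, u_kp
```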
Referring to Fig. 4, in order to mine semantic information, the scheme extracts the semantic indication features of a video segment based on the 374 semantic concepts of VIREO-374 and three Support Vector Machines (SVMs) per concept. The support vector machines are trained on the color moment, wavelet texture, and local keypoint features, and at prediction time they estimate the probability that a given video frame is closely related to the concept. The flow of computing the semantic indication features of a video segment is shown in Fig. 4:
For video segment χ_s, the color moment feature f_cm(i_m(s)), wavelet texture feature f_wt(i_m(s)), and local keypoint feature f_kp(i_m(s)) of its middle frame i_m(s) are first extracted, and the probability values {u_cm(s, j), u_wt(s, j), u_kp(s, j) | j = 1, 2, ..., 374} are obtained by prediction with the support vector machines; the concept density is then calculated:
in the embodiment of the invention, the text information related to the video content is obtained from the audio signal of the video frame by utilizing a voice recognition technology; or,
and acquiring text information related to video content from the subtitles of the video frames.
Step 205: and calculating to obtain the semantic similarity of the characters based on the character information and the concept vocabulary information.
Next, the subtitle information corresponding to the video segment is processed. Based on the set Γ_st(s) formed by the subtitle vocabulary and the set Γ_cp(j) of concept vocabulary, the textual semantic similarity is computed with WordNet::Similarity, a similarity measurement tool for the external dictionary WordNet:
where η(γ, ω) denotes the similarity value between the subtitle word γ and the concept word ω given by WordNet::Similarity.
To reduce the impact of irrelevant concepts, the following textual relevance is defined:
where Q is a normalization coefficient. Since the support vector machines output probabilities for a two-class classification problem, a threshold of 0.5 is naturally used in the above formula.
Step 206: and calculating to obtain the semantic indicating features based on the word semantic similarity and the concept density.
Referring to Fig. 4, in order to mine semantic information, the scheme extracts the semantic indication features of a video segment based on the 374 concepts of VIREO-374 and three Support Vector Machines (SVMs) per concept. The support vector machines are trained on the color moment, wavelet texture, and local keypoint features, and at prediction time they estimate the probability that a given video frame is closely related to the concept. The flow of computing the semantic indication features of a video segment is shown in Fig. 4:
For video segment χ_s, the color moment feature f_cm(i_m(s)), wavelet texture feature f_wt(i_m(s)), and local keypoint feature f_kp(i_m(s)) of its middle frame i_m(s) are first extracted, and the probability values {u_cm(s, j), u_wt(s, j), u_kp(s, j) | j = 1, 2, ..., 374} are obtained by prediction with the support vector machines; the concept density is then calculated:
then, the subtitle information corresponding to the video band is processed. Set gamma formed based on caption vocabularyst(s) and a set of concept vocabulary Γcp(j) And calculating the semantic Similarity of the characters by using a Similarity measurement tool WordNet of an external dictionary WordNet, wherein the Similarity comprises the following steps:
where η(γ, ω) denotes the similarity value between the subtitle word γ and the concept word ω given by WordNet::Similarity.
To reduce the impact of irrelevant concepts, the following textual relevance is defined:
where Q is a normalization coefficient. Since the support vector machines output probabilities for a two-class classification problem, a threshold of 0.5 is naturally used in the above formula.
Finally, the semantic indication feature f_E(s) of the video segment is defined as the sum of ρ(s, j) weighted by u(s, j):
step 207: and linearly superposing the characteristics in the second characteristic set according to the characteristic weight value to obtain the significance score of the video segment.
And finally, fusing the three high-level characteristics by utilizing an iterative reweighted linear model to generate a video abstract with the length required by a user.
In the embodiment of the invention, the video abstract is finally determined by the significance score of each video segment, so that the following linear model is adopted to fuse three high-level characteristics, and the fusion result is the significance score of the video segment:
f_SAL(s) = w_M(s)·f_M(s) + w_F(s)·f_F(s) + w_E(s)·f_E(s)   (12b)
where w_M(s), w_F(s), and w_E(s) are the feature weights. Each feature is normalized to the interval [0, 1] before linear fusion.
The feature weights are computed by an iterative reweighting method as follows. In the k-th iteration, the weight w_#(s) (# ∈ {M, F, E}) is determined by the product of a macroscopic factor α_#(s) and a microscopic factor β_#(s), i.e. w_#(s) = α_#(s)·β_#(s):
where r_#(s) is the rank of feature f_#(s) within {f_#(s) | s = 1, 2, ..., N_S} after sorting in descending order, and N_S is the total number of video segments in the video. The saliency f_SAL(s) of each video segment can then be computed and the segments arranged in descending order. According to the length required by the user, video segments can be entered one by one into the selected video summary in order of f_SAL(s) from high to low.
Before the first iteration, the feature weights are initialized according to an equal-weight principle. The iterative process terminates after 15 iterations.
According to the technical scheme of the embodiment of the invention, low-level features such as color moments, wavelet textures, motion, and local keypoints are extracted from video frames. Next, based on these low-level features, high-level visual and semantic features are computed, including a motion attention feature, a face attention feature that takes depth information into account, and a semantic indication feature of the video segment. Then, an iteratively reweighted linear model is used to fuse the three high-level features and generate a video summary of the length required by the user.
Fig. 5 is a schematic structural composition diagram of an electronic device according to a first embodiment of the present invention, and as shown in fig. 5, the electronic device includes:
an extracting unit 51, configured to extract a first feature set from the video frame, where the first feature set includes: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics;
the first processing unit 52 is configured to calculate a second feature set based on the first feature set, where the second feature set includes: a motion attention feature, a human face attention feature based on depth information, and a semantic indication feature of a video segment;
and the second processing unit 53 is configured to perform fusion processing on each feature in the second feature set by using the linear model with iterative reweighting, so as to obtain the video summary.
Those skilled in the art will appreciate that the functions implemented by the units in the electronic device shown in fig. 5 can be understood by referring to the related description of the video processing method described above. The functions of the units in the electronic device shown in fig. 5 may be implemented by a program running on a processor, or may be implemented by specific logic circuits.
Fig. 6 is a schematic structural composition diagram of an electronic device according to a second embodiment of the present invention, and as shown in fig. 6, the electronic device includes:
an extracting unit 61, configured to extract a first feature set from a video frame, where the first feature set includes: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics;
the first processing unit 62 is configured to calculate a second feature set based on the first feature set, where the second feature set includes: a motion attention feature, a human face attention feature based on depth information, and a semantic indication feature of a video segment;
and the second processing unit 63 is configured to perform fusion processing on each feature in the second feature set by using the linear model with iterative reweighting, so as to obtain a video summary.
The first processing unit 62 includes:
a motion attention feature subunit 621, configured to calculate a motion attention feature from the motion features in the first feature set;
and the face attention feature subunit 622 is configured to obtain the area and the position of the face in each video frame through a face detection algorithm, and calculate, based on the depth image corresponding to the video frame and the pixel point set forming the face, the face attention feature based on the depth information.
The electronic device further includes:
and the training unit 64 is used for training a support vector machine based on the color moment features, the wavelet texture features and the local key point features.
The electronic device further includes:
a text extraction unit 65, configured to obtain text information related to video content from the audio signal of the video frame by using a voice recognition technology; or obtaining the text information related to the video content from the subtitles of the video frames.
The first processing unit 62 includes:
a semantic indicating feature subunit 623, configured to perform semantic concept detection on the color moment features, the wavelet texture features, and the local key point features by using the support vector machine, so as to obtain the concept density; calculate the textual semantic similarity based on the text information and the concept vocabulary information; and calculate the semantic indicating features based on the textual semantic similarity and the concept density.
The second processing unit 63 includes:
the linear superposition subunit 631 is configured to linearly superpose each feature in the second feature set according to the feature weight value, so as to obtain a saliency score of the video segment;
the video summarization subunit 632 is configured to select video segments one by one as the video summary according to a preset summarization length, in order from high to low according to the saliency scores of the video segments.
Those skilled in the art will appreciate that the functions implemented by the units in the electronic device shown in fig. 6 can be understood by referring to the related description of the video processing method described above. The functions of the units in the electronic device shown in fig. 6 may be implemented by a program running on a processor, or may be implemented by specific logic circuits.
The technical schemes described in the embodiments of the present invention can be combined arbitrarily without conflict.
In the embodiments provided in the present invention, it should be understood that the disclosed method and intelligent device may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one second processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.
Claims (10)
1. A method of video processing, the method comprising:
extracting a first feature set from a video frame, the first feature set comprising: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics;
acquiring a depth image corresponding to the video frame, text information related to video content, and the area and position of a human face in the video frame;
calculating to obtain a motion attention feature in a second feature set based on the motion feature;
calculating to obtain the human face attention feature based on the depth information in the second feature set based on the area and the position of the human face and the depth image;
obtaining concept density based on the color moment features, the wavelet texture features and the local key point features;
obtaining the textual semantic similarity based on the text information and the concept vocabulary information;
calculating to obtain semantic indicating characteristics of the video segments in the second characteristic set based on the concept density and the textual semantic similarity;
and performing fusion processing on each feature in the second feature set by using the linear model of iterative reweighting so as to obtain the video abstract.
2. The video processing method of claim 1, wherein the obtaining of the area and the position of the face in the video frame comprises:
obtaining the area and the position of a face in each video frame through a face detection algorithm;
correspondingly, the calculating the face attention feature based on the depth information in the second feature set based on the area and the position of the face and the depth image includes:
and calculating to obtain the human face attention feature based on the depth information based on the depth image corresponding to the video frame and the pixel point set forming the human face.
3. The video processing method of claim 1, the method further comprising:
acquiring text information related to video content from the audio signal of the video frame by utilizing a voice recognition technology; or,
and acquiring text information related to video content from the subtitles of the video frames.
4. The video processing method according to claim 3, wherein said obtaining a concept density based on the color moment features, the wavelet texture features, and the local keypoint features comprises:
training a support vector machine based on the color moment features, the wavelet texture features and the local key point features;
and the support vector machine detects semantic concepts of the color moment features, the wavelet texture features and the local key point features to obtain concept density.
5. The video processing method according to claim 1, wherein the fusion processing is performed on each feature in the second feature set by using an iterative reweighted linear model, so as to obtain a video summary; the method comprises the following steps:
linearly superposing each feature in the second feature set according to the feature weight value to obtain a significance score of the video segment;
according to the preset digest length, the video segments are selected as the video digests one by one according to the sequence of the significance scores of the video segments from high to low.
6. An electronic device, the electronic device comprising:
an extracting unit, configured to extract a first feature set from a video frame, where the first feature set includes: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics;
the extraction unit is further used for acquiring a depth image corresponding to the video frame and the area and position of the face in the video frame;
the character extraction unit is used for acquiring character information related to video content in the video frame;
the motion attention feature subunit is used for calculating motion attention features in the second feature set based on the motion features;
a face attention feature subunit, configured to calculate, based on the area and the position of the face and the depth image, a face attention feature based on depth information in the second feature set;
the semantic indicating feature subunit is used for obtaining concept density based on the color moment features, the wavelet texture features and the local key point features; calculating to obtain the textual semantic similarity based on the text information and the concept vocabulary information; calculating to obtain semantic indicating characteristics of the video segments in the second characteristic set based on the concept density and the textual semantic similarity;
and the second processing unit is used for carrying out fusion processing on each feature in the second feature set by using the linear model of iterative reweighting so as to obtain the video abstract.
7. The electronic device of claim 6, wherein:
the extraction unit is also used for obtaining the area and the position of the face in each video frame through a face detection algorithm;
and the face attention feature subunit is also used for calculating the face attention feature based on the depth image corresponding to the video frame and the pixel point set forming the face to obtain the face attention feature based on the depth information.
8. The electronic device of claim 6, the text extraction unit further comprising:
the system comprises a video frame, a voice recognition module, a display module and a display module, wherein the video frame is used for displaying video content; or obtaining the text information related to the video content from the subtitles of the video frames.
9. The electronic device of claim 8, further comprising:
the training unit is used for training a support vector machine based on the color moment features, the wavelet texture features and the local key point features;
the semantic indication feature subunit is further configured to perform semantic concept detection on the color moment features, the wavelet texture features and the local key point features by using the support vector machine to obtain concept density.
10. The electronic device of claim 9, the second processing unit comprising:
the linear superposition subunit is used for carrying out linear superposition on each feature in the second feature set according to the feature weight value to obtain a significance score of the video segment;
and the video abstract subunit is used for selecting the video segments into the video abstract one by one according to the preset abstract length and the sequence from high to low of the significance scores of the video segments.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510535580.9A CN105228033B (en) | 2015-08-27 | 2015-08-27 | A kind of method for processing video frequency and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510535580.9A CN105228033B (en) | 2015-08-27 | 2015-08-27 | A kind of method for processing video frequency and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105228033A CN105228033A (en) | 2016-01-06 |
CN105228033B true CN105228033B (en) | 2018-11-09 |
Family
ID=54996666
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510535580.9A Active CN105228033B (en) | 2015-08-27 | 2015-08-27 | A kind of method for processing video frequency and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105228033B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9936239B2 (en) * | 2016-06-28 | 2018-04-03 | Intel Corporation | Multiple stream tuning |
CN106355171A (en) * | 2016-11-24 | 2017-01-25 | 深圳凯达通光电科技有限公司 | Video monitoring internetworking system |
CN106934397B (en) | 2017-03-13 | 2020-09-01 | 北京市商汤科技开发有限公司 | Image processing method and device and electronic equipment |
CN107222795B (en) * | 2017-06-23 | 2020-07-31 | 南京理工大学 | Multi-feature fusion video abstract generation method |
CN107979764B (en) * | 2017-12-06 | 2020-03-31 | 中国石油大学(华东) | Video subtitle generating method based on semantic segmentation and multi-layer attention framework |
CN109413510B (en) * | 2018-10-19 | 2021-05-18 | 深圳市商汤科技有限公司 | Video abstract generation method and device, electronic equipment and computer storage medium |
CN111327945B (en) | 2018-12-14 | 2021-03-30 | 北京沃东天骏信息技术有限公司 | Method and apparatus for segmenting video |
CN109932617B (en) * | 2019-04-11 | 2021-02-26 | 东南大学 | Self-adaptive power grid fault diagnosis method based on deep learning |
CN110347870A (en) * | 2019-06-19 | 2019-10-18 | 西安理工大学 | The video frequency abstract generation method of view-based access control model conspicuousness detection and hierarchical clustering method |
CN110225368B (en) * | 2019-06-27 | 2020-07-10 | 腾讯科技(深圳)有限公司 | Video positioning method and device and electronic equipment |
CN111984820B (en) * | 2019-12-19 | 2023-10-27 | 重庆大学 | A video summarization method based on dual self-attention capsule network |
CN113158720B (en) * | 2020-12-15 | 2024-06-18 | 嘉兴学院 | Video abstraction method and device based on dual-mode feature and attention mechanism |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1685344A (en) * | 2002-11-01 | 2005-10-19 | 三菱电机株式会社 | Method for summarizing unknown content of video |
WO2007099496A1 (en) * | 2006-03-03 | 2007-09-07 | Koninklijke Philips Electronics N.V. | Method and device for automatic generation of summary of a plurality of images |
CN101743596A (en) * | 2007-06-15 | 2010-06-16 | 皇家飞利浦电子股份有限公司 | Method and apparatus for automatically generating summaries of a multimedia file |
CN102880866A (en) * | 2012-09-29 | 2013-01-16 | 宁波大学 | Method for extracting face features |
KR20130061058A (en) * | 2011-11-30 | 2013-06-10 | 고려대학교 산학협력단 | Video summary method and system using visual features in the video |
CN103200463A (en) * | 2013-03-27 | 2013-07-10 | 天脉聚源(北京)传媒科技有限公司 | Method and device for generating video summary |
CN103210651A (en) * | 2010-11-15 | 2013-07-17 | 华为技术有限公司 | Method and system for video summarization |
CN104508682A (en) * | 2012-08-03 | 2015-04-08 | 柯达阿拉里斯股份有限公司 | Identifying key frames using group sparsity analysis |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7409407B2 (en) * | 2004-05-07 | 2008-08-05 | Mitsubishi Electric Research Laboratories, Inc. | Multimedia event detection and summarization |
US8467610B2 (en) * | 2010-10-20 | 2013-06-18 | Eastman Kodak Company | Video summarization using sparse basis function combination |
- 2015-08-27: application CN201510535580.9A filed in China (CN); granted as patent CN105228033B, status Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1685344A (en) * | 2002-11-01 | 2005-10-19 | 三菱电机株式会社 | Method for summarizing unknown content of video |
WO2007099496A1 (en) * | 2006-03-03 | 2007-09-07 | Koninklijke Philips Electronics N.V. | Method and device for automatic generation of summary of a plurality of images |
CN101743596A (en) * | 2007-06-15 | 2010-06-16 | 皇家飞利浦电子股份有限公司 | Method and apparatus for automatically generating summaries of a multimedia file |
CN103210651A (en) * | 2010-11-15 | 2013-07-17 | 华为技术有限公司 | Method and system for video summarization |
KR20130061058A (en) * | 2011-11-30 | 2013-06-10 | 고려대학교 산학협력단 | Video summary method and system using visual features in the video |
CN104508682A (en) * | 2012-08-03 | 2015-04-08 | 柯达阿拉里斯股份有限公司 | Identifying key frames using group sparsity analysis |
CN102880866A (en) * | 2012-09-29 | 2013-01-16 | 宁波大学 | Method for extracting face features |
CN103200463A (en) * | 2013-03-27 | 2013-07-10 | 天脉聚源(北京)传媒科技有限公司 | Method and device for generating video summary |
Non-Patent Citations (2)
Title |
---|
Hierarchical 3D kernel descriptors for action recognition using depth sequences;Yu Kong et.al;《2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition》;20150508;全文 * |
Multi-scale information maximization based visual attention modeling for video summarization;Naveed Ejaz et.al;《2012 6th International Conference on Next Generation Mobile Appllications, Service and Technologies》;20120914;全文 * |
Also Published As
Publication number | Publication date |
---|---|
CN105228033A (en) | 2016-01-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105228033B (en) | A kind of method for processing video frequency and electronic equipment | |
WO2021088510A1 (en) | Video classification method and apparatus, computer, and readable storage medium | |
US9176987B1 (en) | Automatic face annotation method and system | |
Ejaz et al. | Efficient visual attention based framework for extracting key frames from videos | |
WO2020177673A1 (en) | Video sequence selection method, computer device and storage medium | |
CN103973969B (en) | Electronic device and image selection method thereof | |
US20110243452A1 (en) | Electronic apparatus, image processing method, and program | |
JP5947131B2 (en) | Search input method and system by region selection method | |
EP4083817A1 (en) | Video tag determination method, device, terminal, and storage medium | |
CN110889379B (en) | Expression package generation method and device and terminal equipment | |
CN112818995B (en) | Image classification method, device, electronic equipment and storage medium | |
Zhang et al. | Retargeting semantically-rich photos | |
CN114511810A (en) | Abnormal event detection method and device, computer equipment and storage medium | |
CN102375987A (en) | Image processing device and image feature vector extracting and image matching method | |
CN105956051A (en) | Information finding method, device and system | |
CN115909176A (en) | Video semantic segmentation method and device, electronic equipment and storage medium | |
CN106164977A (en) | Camera array analysis mechanisms | |
Miniakhmetova et al. | An approach to personalized video summarization based on user preferences analysis | |
CN109145140A (en) | One kind being based on the matched image search method of hand-drawn outline figure and system | |
CN112261321B (en) | Subtitle processing method and device and electronic equipment | |
EP3117627A1 (en) | Method and apparatus for video processing | |
KR20150101846A (en) | Image classification service system based on a sketch user equipment, service equipment, service method based on sketch and computer readable medium having computer program recorded therefor | |
CN112115740B (en) | Method and apparatus for processing image | |
Meng et al. | Human action classification using SVM_2K classifier on motion features | |
He et al. | A video summarization method based on key frames extracted by TMOF |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||