CN105228033B - A kind of method for processing video frequency and electronic equipment - Google Patents
A kind of method for processing video frequency and electronic equipment
- Publication number
- CN105228033B CN201510535580.9A
- Authority
- CN
- China
- Prior art keywords
- video
- feature
- features
- face
- video frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8549—Creating video summaries, e.g. movie trailer
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/60—Analysis of geometric attributes
Abstract
The invention discloses a video processing method and an electronic device. The method includes: extracting a first feature set from video frames, the first feature set including color moment features, wavelet texture features, motion features, and local keypoint features; computing a second feature set based on the first feature set, the second feature set including a motion attention feature, a face attention feature based on depth information, and a semantic indication feature of the video segment; and fusing the features in the second feature set with an iteratively reweighted linear model to obtain a video summary.
Description
Technical Field
The present invention relates to video processing technologies, and in particular, to a video processing method and an electronic device.
Background
Intelligent terminals such as smartphones have become portable companions in people's work and daily life, and users can easily accumulate large numbers of videos through downloading and self-shooting. For a mobile phone equipped with a binocular camera in particular, the amount of data to be stored is even larger. Given the relatively limited capacity of mobile phone memory, the management of video files has become an urgent problem to be solved.
Disclosure of Invention
In order to solve the above technical problem, embodiments of the present invention provide a video processing method and an electronic device.
The video processing method provided by the embodiment of the invention comprises the following steps:
extracting a first feature set from a video frame, the first feature set comprising: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics;
calculating a second feature set based on the first feature set, wherein the second feature set comprises: a motion attention feature, a human face attention feature based on depth information, and a semantic indication feature of a video segment;
and performing fusion processing on each feature in the second feature set by using the linear model of iterative reweighting so as to obtain the video abstract.
The electronic equipment provided by the embodiment of the invention comprises:
an extracting unit, configured to extract a first feature set from a video frame, where the first feature set includes: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics;
the first processing unit is configured to calculate a second feature set based on the first feature set, where the second feature set includes: a motion attention feature, a human face attention feature based on depth information, and a semantic indication feature of a video segment;
and the second processing unit is used for carrying out fusion processing on each feature in the second feature set by using the linear model of iterative reweighting so as to obtain the video abstract.
In the technical scheme of the embodiment of the invention, color moment features, wavelet texture features, motion features, and local keypoint features are extracted from video frames; a motion attention feature, a human face attention feature based on depth information, and a semantic indication feature of the video segment are then computed from these extracted features; and the motion attention feature, the human face attention feature based on depth information, and the semantic indication feature of the video segment are fused to obtain the video summary. In this way, video segments with relatively refined semantics and important meaning are extracted from the original video, which effectively reduces the amount of data the electronic device needs to store, improves the utilization of the electronic device's memory and the user experience, and also helps the user later locate the video he or she most wants to find among a small number of video files. In addition, the technical scheme of the embodiment of the invention combines information from the visual modality and the textual modality, and can more effectively capture the high-level semantics of the video content. The human face attention feature incorporates the depth information of objects in the scene, so the high-level semantics can be grasped from a more comprehensive perspective. The technical scheme of the embodiment of the invention does not depend on heuristic rules formulated for specific video types, and is therefore applicable to a wider range of video types.
Drawings
Fig. 1 is a schematic flowchart of a video processing method according to a first embodiment of the invention;
fig. 2 is a flowchart illustrating a video processing method according to a second embodiment of the invention;
FIG. 3 is an overall flow chart of video summarization according to an embodiment of the present invention;
FIG. 4 is a flow chart of computing semantic indication features of a video segment according to an embodiment of the invention;
fig. 5 is a schematic structural diagram of an electronic device according to a first embodiment of the invention;
fig. 6 is a schematic structural diagram of an electronic device according to a second embodiment of the present invention.
Detailed Description
So that the manner in which the features and aspects of the embodiments of the present invention can be understood in detail, a more particular description of the embodiments of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings.
In the era of information explosion, traditional methods of browsing and managing video data face unprecedented challenges. It is therefore of great practical significance to provide video users with a short video summary that concentrates the key information of the original video. Video summaries can generally be divided into two types, dynamic and static: a dynamic video summary is a shortened version of the original video and may contain a series of video segments extracted from the original long version, while a static video summary may consist of a set of key frames extracted from the original video.
Conventional video summaries are generated by extracting visual or textual features from a video. However, most methods in this direction use heuristic rules or simple text analysis (e.g., based on word frequency statistics). In addition, traditional attention-model methods that use face features consider only information such as the planar position and size of the detected face in the scene, and do not exploit depth information.
The technical scheme of the embodiment of the invention estimates the relative importance of each video segment in an iteratively reweighted manner, based on a user attention model, the semantic information of the video, and the depth information of the video frames, thereby generating a dynamic video summary.
Fig. 1 is a schematic flowchart of a video processing method according to a first embodiment of the present invention, as shown in fig. 1, the video processing method includes the following steps:
step 101: extracting a first feature set from a video frame, the first feature set comprising: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics.
Referring to fig. 3, first, a first feature set is extracted from a video frame, the first feature set is a low-level feature set, and the first feature set includes four low-level features: color moment features, wavelet texture features, motion features, and local keypoint features.
Four low-level features in the first feature set are described in detail below.
(1) Colour moment characteristics
A video frame is spatially divided into 5 × 5 (25 in total) non-overlapping pixel blocks, and for each pixel block the first-order moment and the second- and third-order central moments are computed for the three channels of the Lab color space. The color moments of the 25 pixel blocks of a frame constitute its color moment feature vector f_cm(i).
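The following sketch (not part of the original patent text) illustrates how the per-block Lab color moments described above could be computed; OpenCV and NumPy are assumed, and the function name and exact moment ordering are hypothetical.

```python
import cv2
import numpy as np

def color_moment_feature(frame_bgr, grid=5):
    """Sketch of f_cm(i): mean plus 2nd- and 3rd-order central moments of each
    Lab channel over a grid x grid partition of the frame."""
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB).astype(np.float64)
    h, w, _ = lab.shape
    feats = []
    for r in range(grid):
        for c in range(grid):
            block = lab[r * h // grid:(r + 1) * h // grid,
                        c * w // grid:(c + 1) * w // grid]
            for ch in range(3):
                x = block[:, :, ch].ravel()
                mean = x.mean()
                feats.append(mean)                      # first-order moment
                feats.append(((x - mean) ** 2).mean())  # second-order central moment
                feats.append(((x - mean) ** 3).mean())  # third-order central moment
    return np.asarray(feats)  # 5 * 5 blocks * 3 channels * 3 moments = 225 values
```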
(2) Wavelet texture feature
Similarly, a video frame is divided into 3 × 3 (9 in total) non-overlapping pixel blocks, and the luminance component of each block is subjected to a three-level Haar wavelet decomposition; the variance of the wavelet coefficients is then computed for each level in the horizontal, vertical, and diagonal directions. All the wavelet coefficient variances of a video frame constitute its wavelet texture feature vector f_wt(i).
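A comparable sketch (again illustrative, not from the patent) for the wavelet texture feature, assuming the PyWavelets package for the three-level Haar decomposition; the aggregation into one flat vector is an assumption.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_texture_feature(gray_frame, grid=3, levels=3):
    """Sketch of f_wt(i): variance of the Haar detail coefficients (horizontal,
    vertical, diagonal) at each decomposition level, per 3x3 block."""
    h, w = gray_frame.shape
    feats = []
    for r in range(grid):
        for c in range(grid):
            block = gray_frame[r * h // grid:(r + 1) * h // grid,
                               c * w // grid:(c + 1) * w // grid].astype(np.float64)
            coeffs = pywt.wavedec2(block, 'haar', level=levels)
            for cH, cV, cD in coeffs[1:]:              # detail sub-bands per level
                feats += [cH.var(), cV.var(), cD.var()]
    return np.asarray(feats)  # 3 * 3 blocks * 3 levels * 3 directions = 81 values
```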
(3) Movement characteristics
The human eye is highly sensitive to changes in visual content. Based on this principle, a video frame is divided into M × N non-overlapping pixel blocks, each block containing 16 × 16 pixels, and a motion vector v(i, m, n) is computed for each block by a motion estimation algorithm. The M × N motion vectors constitute the motion feature f_mv(i) of the video frame.
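The patent does not name a specific motion estimation algorithm, so the sketch below (illustrative only) approximates the per-block motion vectors by averaging dense optical flow over 16 × 16 blocks, using OpenCV's Farnebäck method as a stand-in.

```python
import cv2
import numpy as np

def motion_feature(prev_gray, curr_gray, block=16):
    """Sketch of f_mv(i): one motion vector v(i, m, n) per 16x16 block, here
    obtained by averaging dense optical flow (a stand-in for block matching)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    M, N = h // block, w // block
    vectors = np.zeros((M, N, 2))
    for m in range(M):
        for n in range(N):
            patch = flow[m * block:(m + 1) * block, n * block:(n + 1) * block]
            vectors[m, n] = patch.reshape(-1, 2).mean(axis=0)  # mean (dx, dy)
    return vectors
```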
(4) Local keypoint features
In semantic-level video analysis, a bag-of-features (BoF) representation based on local keypoints can serve as a powerful complement to features computed from global information. Salient regions are therefore captured with soft-weighted local keypoint features, which are defined by the importance of keypoints with respect to a vocabulary of 500 visual words. Specifically, the keypoints in the i-th video frame are detected by a Difference of Gaussians (DoG) detector, represented by Scale-Invariant Feature Transform (SIFT) descriptors, and clustered into 500 visual words. The keypoint feature vector f_kp(i) is defined as the weighted similarity between the keypoints and their four nearest-neighbor visual words.
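As an illustrative sketch only: the soft-weighted bag-of-features could be assembled as below, assuming a precomputed 500-word SIFT vocabulary; the rank-decaying weights over the four nearest visual words are an assumption, since the patent does not spell out the weighting formula.

```python
import cv2
import numpy as np

def soft_weighted_bof(frame_gray, vocabulary, top_k=4):
    """Sketch of f_kp(i): DoG/SIFT keypoints softly assigned to their four
    nearest visual words; `vocabulary` is assumed to be a (500, 128) array."""
    sift = cv2.SIFT_create()                    # DoG detector + SIFT descriptor
    _, desc = sift.detectAndCompute(frame_gray, None)
    hist = np.zeros(len(vocabulary))
    if desc is None:
        return hist
    for d in desc:
        dists = np.linalg.norm(vocabulary - d, axis=1)
        nearest = np.argsort(dists)[:top_k]     # four nearest visual words
        for rank, idx in enumerate(nearest):
            hist[idx] += 1.0 / (2 ** rank)      # assumed rank-decaying soft weight
    return hist / max(hist.sum(), 1e-12)
```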
Step 102: calculating to obtain a second feature set based on the first feature set, wherein the second feature set comprises: the method comprises the following steps of motion attention feature, human face attention feature based on depth information and semantic indication feature of a video segment.
Next, based on these low-level features, further high-level visual and semantic features, referred to as a second feature set, are computed, including: the method comprises the following steps of (1) moving attention feature, human face attention feature based on depth information and semantic indication feature of a video segment.
Next, based on the above low-level features, further for each arbitrary given video segment χs(starting from the i-th1(s) frame, terminating at ith2(s) frames) to compute advanced visual and semantic features. Video segmentation is achieved by shot cut detection.
Each feature of the second feature set is described in detail below.
(1) Motion attention feature
Research on human attention in psychology lays an indispensable foundation for attention modeling in computer vision. The cognitive mechanism of attention is key to analyzing and understanding human thought and activity, and can therefore guide the selection of relatively important content from the original video to form the video summary. This scheme uses a motion attention model to compute high-level motion attention features suited to semantic analysis.
For the (m, n)-th pixel block in the i-th video frame, a spatial window containing 5 × 5 (25 in total) pixel blocks and a temporal window containing 7 pixel blocks are defined, both centered on the (m, n)-th block of the i-th frame. The phase range [0, 2π) is evenly divided into 8 intervals, and a spatial phase histogram is accumulated within the spatial window and a temporal phase histogram within the temporal window. The spatial consistency indicator C_s(i, m, n) and the temporal consistency indicator C_t(i, m, n) are then obtained as:
C_s(i, m, n) = -Σ_ζ p_s(ζ) log p_s(ζ)   (1a)
C_t(i, m, n) = -Σ_ζ p_t(ζ) log p_t(ζ)   (2a)
where p_s(ζ) and p_t(ζ) are the phase distributions in the spatial window and the temporal window, respectively. The motion attention feature of the i-th frame is then defined as follows:
to suppress noise in the features of neighboring video frames, the above resulting sequence of motion attention features will be processed through a 9 th order median filter. For the s video segment χsThe motion attention feature is obtained by calculating the filtered single-frame feature value:
(2) human face attention feature based on depth information
In video, the presence of a human face generally indicates more important content. In this scheme, the area A_F(j) and the position of each face (indexed by j) in a video frame are obtained by a face detection algorithm. For the j-th detected face, based on the depth image d_i corresponding to the video frame and the set of pixels {x | x ∈ Λ(j)} constituting the face, the depth saliency D(j) is defined as follows:
where |Λ(j)| is the number of pixels belonging to the j-th face. Based on the position of the face within the whole video frame, a position weight w_fp(j) is also defined to approximately reflect the relative attention the face receives from the viewer (regions closer to the center of the video frame are weighted more heavily), as shown in Table 1:
TABLE 1
Table 1 different face weights assigned to different regions in a video frame. The central area has a high weight and the edge area has a low weight.
The face attention feature of the i-th frame can be calculated as:
where A_frm is the area of the video frame and D_max(i) = max_x d_i(x). To reduce the influence of face detection inaccuracies on the overall scheme, the resulting face attention feature sequence is also smoothed with a 5th-order median filter. The face attention feature of video segment χ_s is then computed from the smoothed per-frame features {FAC(i) | i = i_1(s), ..., i_2(s)}:
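Since the patent's equations for D(j) and FAC(i), the Table 1 weights, and the segment-level aggregation are not reproduced in this text, the sketch below is only a structural stand-in: the relative face area, a caller-supplied position weight, and a depth term normalized by D_max(i) are combined in an assumed way, while the 5th-order median smoothing and segment aggregation follow the description above.

```python
import numpy as np
from scipy.signal import medfilt

def face_attention_frame(faces, depth_map, frame_area, pos_weight):
    """Structural sketch of FAC(i). `faces` holds (x, y, w, h) boxes from any
    face detector; `pos_weight` maps a face center to a Table-1-style weight.
    The combination below is an assumption, not the patent's exact formula."""
    d_max = depth_map.max() + 1e-12              # D_max(i) = max_x d_i(x)
    score = 0.0
    for (x, y, w, h) in faces:
        face_depth = depth_map[y:y + h, x:x + w]
        depth_saliency = 1.0 - face_depth.mean() / d_max  # assumed: closer faces more salient
        area_term = (w * h) / frame_area                  # A_F(j) / A_frm
        score += pos_weight((x + w / 2.0, y + h / 2.0)) * area_term * depth_saliency
    return score

def segment_face_attention(fac_sequence, i1, i2, order=5):
    """Smooth the FAC sequence with a 5th-order median filter and aggregate
    over frames i1..i2 of a segment (mean aggregation is an assumption)."""
    smoothed = medfilt(np.asarray(fac_sequence, dtype=float), kernel_size=order)
    return float(smoothed[i1:i2 + 1].mean())
```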
(3) Semantic indication features of video segments
Referring to Fig. 4, in order to mine semantic information, the scheme extracts the semantic indication features of a video segment based on the 374 concepts of VIREO-374 and three Support Vector Machines (SVMs) per concept. The support vector machines are trained on the color moment, wavelet texture, and local keypoint features, and at prediction time they estimate the probability that a given video frame is closely related to the concept. The flow of computing the semantic indication features of a video segment is shown in Fig. 4:
for video segment χsFirst, its intermediate frame i is extractedmColor moment characteristic f of(s)cm(im(s)), wavelet texture feature fwt(im(s)) and local keypoint features fkp(im(s)), and a probability value { u } is obtained by prediction using a support vector machinecm(s,j),uwt(s,j),ukp(s, j) | j ═ 1, 2.., 374}, and then the concept density is calculated:
then, the subtitle information corresponding to the video band is processed. Set gamma formed based on caption vocabularyst(s) and a set of concept vocabulary Γcp(j) And calculating the semantic Similarity of the characters by using a Similarity measurement tool WordNet of an external dictionary WordNet, wherein the Similarity comprises the following steps:
where η(γ, ω) denotes the similarity value between the subtitle word γ and the concept word ω given by WordNet::Similarity.
To reduce the impact of irrelevant concepts, the following textual relevance is defined:
where Q is a normalization coefficient. Since the support vector machines output probabilities for a two-class classification problem, a threshold of 0.5 is naturally used in the above formula.
Finally, the semantic indication feature f_E(s) of the video segment is defined as the sum of ρ(s, j) weighted by u(s, j):
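The sketch below (illustrative only) mirrors the structure of this computation: per-concept SVM probabilities are combined into a concept density, subtitle-to-concept similarity stands in for the WordNet::Similarity score η, and f_E(s) is the density-weighted sum. NLTK's WordNet interface replaces the Perl WordNet::Similarity tool, and the simple average, the max-similarity relevance, and the helper names are assumptions.

```python
import numpy as np
from nltk.corpus import wordnet as wn

def concept_density(u_cm, u_wt, u_kp):
    """Assumed combination of the three per-concept SVM probabilities
    (the patent's exact formula is not reproduced in this text)."""
    return (np.asarray(u_cm) + np.asarray(u_wt) + np.asarray(u_kp)) / 3.0

def word_similarity(gamma, omega):
    """Stand-in for eta(gamma, omega): Wu-Palmer similarity of the first
    WordNet synsets, via NLTK instead of WordNet::Similarity."""
    s1, s2 = wn.synsets(gamma), wn.synsets(omega)
    if not s1 or not s2:
        return 0.0
    return s1[0].wup_similarity(s2[0]) or 0.0

def semantic_indication_feature(subtitle_words, concept_vocab, u, threshold=0.5):
    """Sketch of f_E(s): sum over concepts of a textual relevance rho(s, j)
    weighted by the concept density u(s, j), keeping only concepts whose
    density exceeds the 0.5 threshold mentioned above."""
    f_e = 0.0
    for j, concept_words in enumerate(concept_vocab):
        if u[j] <= threshold:                   # drop irrelevant concepts
            continue
        rho = max((word_similarity(g, w)
                   for g in subtitle_words for w in concept_words), default=0.0)
        f_e += u[j] * rho
    return f_e
```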
step 103: and performing fusion processing on each feature in the second feature set by using the linear model of iterative reweighting so as to obtain the video abstract.
And finally, fusing the three high-level characteristics by utilizing an iterative reweighted linear model to generate a video abstract with the length required by a user.
In the embodiment of the invention, the video abstract is finally determined by the significance score of each video segment, so that the following linear model is adopted to fuse three high-level characteristics, and the fusion result is the significance score of the video segment:
f_SAL(s) = w_M(s)·f_M(s) + w_F(s)·f_F(s) + w_E(s)·f_E(s)   (12a)
where w_M(s), w_F(s), and w_E(s) are the feature weights. Each feature is normalized to the interval [0, 1] before linear fusion.
The feature weights are computed by an iterative reweighting method as follows. In the k-th iteration, the weight w_#(s) (# ∈ {M, F, E}) is determined by the product of a macroscopic factor α_#(s) and a microscopic factor β_#(s), i.e. w_#(s) = α_#(s)·β_#(s):
where r_#(s) is the rank of feature f_#(s) within {f_#(s) | s = 1, 2, ..., N_S} after sorting in descending order, and N_S is the total number of video segments in the video. The saliency f_SAL(s) of each video segment can then be computed and the segments arranged in descending order. According to the length required by the user, video segments are entered one by one into the selected video summary in order of f_SAL(s) from high to low.
Before the first iteration, the feature weights are initialized according to an equal-weight principle. The iterative process terminates after 15 iterations.
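Because the macroscopic/microscopic weight formulas are not reproduced in this text, the sketch below (illustrative only) keeps the overall structure of Eq. (12a) and the iteration — equal-weight initialization, per-feature reweighting, at most 15 iterations — but the specific α and β expressions are placeholders.

```python
import numpy as np

def fuse_saliency(f_M, f_F, f_E, max_iter=15):
    """Sketch of the iteratively reweighted fusion f_SAL(s) = w_M f_M +
    w_F f_F + w_E f_E; the alpha/beta updates below are assumed placeholders."""
    feats = {'M': np.asarray(f_M, float), 'F': np.asarray(f_F, float),
             'E': np.asarray(f_E, float)}
    for k, v in feats.items():                       # normalize each feature to [0, 1]
        rng = v.max() - v.min()
        feats[k] = (v - v.min()) / rng if rng > 0 else np.zeros_like(v)
    n = len(feats['M'])
    weights = {k: np.full(n, 1.0 / 3.0) for k in feats}   # equal-weight initialization
    for _ in range(max_iter):
        f_sal = sum(weights[k] * feats[k] for k in feats)
        for k in feats:
            ranks = np.argsort(np.argsort(-feats[k]))     # 0 = highest feature value
            alpha = 1.0 - ranks / max(n - 1, 1)           # assumed macroscopic (rank) factor
            beta = feats[k] / (f_sal + 1e-12)             # assumed microscopic factor
            weights[k] = alpha * beta
        total = sum(weights[k] for k in feats) + 1e-12
        for k in feats:
            weights[k] /= total                           # keep the three weights comparable
    return sum(weights[k] * feats[k] for k in feats)      # final f_SAL(s) per segment

# Segments would then be sorted by f_SAL(s) in descending order and added to the
# summary one by one until the user-requested length is reached.
```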
According to the technical scheme of the embodiment of the invention, low-level features such as color moments, wavelet textures, motion, and local keypoints are extracted from video frames. Next, based on these low-level features, high-level visual and semantic features are computed, including a motion attention feature, a face attention feature that takes depth information into account, and a semantic indication feature of the video segment. Then, an iteratively reweighted linear model is used to fuse the three high-level features and generate a video summary of the length required by the user.
Fig. 2 is a schematic flowchart of a video processing method according to a second embodiment of the present invention, and as shown in fig. 2, the video processing method includes the following steps:
step 201: extracting a first feature set from a video frame, the first feature set comprising: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics.
Referring to fig. 3, first, a first feature set is extracted from a video frame, the first feature set is a low-level feature set, and the first feature set includes four low-level features: color moment features, wavelet texture features, motion features, and local keypoint features.
Four low-level features in the first feature set are described in detail below.
(1) Colour moment characteristics
A video frame is spatially divided into 5 × 5 (25 in total) non-overlapping pixel blocks, and the first-order moment and the second- and third-order central moments are respectively computed for the three channels of the Lab color space on each pixel block. The color moments of the 25 pixel blocks of a frame constitute its color moment feature vector f_cm(i).
(2) Wavelet texture feature
Similarly, a video frame is divided into 3 × 3 (9 in total) non-overlapping pixel blocks, and the luminance component of each block is subjected to a three-level Haar wavelet decomposition; the variance of the wavelet coefficients is then computed for each level in the horizontal, vertical, and diagonal directions. All the wavelet coefficient variances of a video frame constitute its wavelet texture feature vector f_wt(i).
(3) Movement characteristics
The human eye is highly sensitive to changes in visual content. Based on this principle, a video frame is divided into M × N non-overlapping pixel blocks, each block containing 16 × 16 pixels, and a motion vector v(i, m, n) is computed for each block by a motion estimation algorithm. The M × N motion vectors constitute the motion feature f_mv(i) of the video frame.
(4) Local keypoint features
In semantic-level video analysis, a bag-of-features (BoF) representation based on local keypoints can serve as a powerful complement to features computed from global information. Salient regions are therefore captured with soft-weighted local keypoint features, which are defined by the importance of keypoints with respect to a vocabulary of 500 visual words. Specifically, the keypoints in the i-th video frame are detected by a Difference of Gaussians (DoG) detector, represented by Scale-Invariant Feature Transform (SIFT) descriptors, and clustered into 500 visual words. The keypoint feature vector f_kp(i) is defined as the weighted similarity between the keypoints and their four nearest-neighbor visual words.
Step 202: and calculating to obtain the motion attention feature according to the motion feature in the first feature set.
Next, based on these low-level features, further high-level visual and semantic features, referred to as the second feature set, are computed, including: a motion attention feature, a human face attention feature based on depth information, and a semantic indication feature of a video segment.
These high-level visual and semantic features are computed for each given video segment χ_s (starting at frame i_1(s) and ending at frame i_2(s)). Video segmentation is achieved by shot-cut detection.
Research on human attention in psychology lays an indispensable foundation for attention modeling in computer vision. The cognitive mechanism of attention is key to analyzing and understanding human thought and activity, and can therefore guide the selection of relatively important content from the original video to form the video summary. This scheme uses a motion attention model to compute high-level motion attention features suited to semantic analysis.
For the (m, n)-th pixel block in the i-th video frame, a spatial window containing 5 × 5 (25 in total) pixel blocks and a temporal window containing 7 pixel blocks are defined, both centered on the (m, n)-th block of the i-th frame. The phase range [0, 2π) is evenly divided into 8 intervals, and a spatial phase histogram is accumulated within the spatial window and a temporal phase histogram within the temporal window. The spatial consistency indicator C_s(i, m, n) and the temporal consistency indicator C_t(i, m, n) can then be obtained as:
C_s(i, m, n) = -Σ_ζ p_s(ζ) log p_s(ζ)   (1b)
C_t(i, m, n) = -Σ_ζ p_t(ζ) log p_t(ζ)   (2b)
where p_s(ζ) and p_t(ζ) are the phase distributions in the spatial window and the temporal window, respectively. The motion attention feature of the i-th frame is then defined as follows:
to suppress noise in the features of neighboring video frames, the above resulting sequence of motion attention features will be processed through a 9 th order median filter. For the s video segment χsThe motion attention feature is obtained by calculating the filtered single-frame feature value:
step 203: the area and the position of the face in each video frame are obtained through a face detection algorithm, and the face attention feature based on the depth information is obtained through calculation based on the depth image corresponding to the video frame and the pixel point set forming the face.
In video, the presence of a human face generally indicates more important content. In this scheme, the area A_F(j) and the position of each face (indexed by j) in a video frame are obtained by a face detection algorithm. For the j-th detected face, based on the depth image d_i corresponding to the video frame and the set of pixels {x | x ∈ Λ(j)} constituting the face, the depth saliency D(j) is defined as follows:
where |Λ(j)| is the number of pixels belonging to the j-th face. Based on the position of the face within the whole video frame, a position weight w_fp(j) is also defined to approximately reflect the relative attention the face receives from the viewer (regions closer to the center of the video frame are weighted more heavily), as shown in Table 1:
TABLE 1
Table 1 different face weights assigned to different regions in a video frame. The central area has a high weight and the edge area has a low weight.
The face attention feature of the i-th frame can be calculated as:
where A_frm is the area of the video frame and D_max(i) = max_x d_i(x). To reduce the influence of face detection inaccuracies on the overall scheme, the resulting face attention feature sequence is also smoothed with a 5th-order median filter. The face attention feature of video segment χ_s is then computed from the smoothed per-frame features {FAC(i) | i = i_1(s), ..., i_2(s)}:
step 204: and the support vector machine detects semantic concepts of the color moment features, the wavelet texture features and the local key point features to obtain concept density.
In the embodiment of the invention, support vector machines are trained based on the color moment features, the wavelet texture features, and the local keypoint features. The support vector machines are implemented with the LibSVM package, using a Radial Basis Function (RBF) kernel for the color moment and wavelet texture features, and a chi-square kernel for the local keypoint features.
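As a hedged illustration only (the patent names LibSVM; scikit-learn is used here as a stand-in), one trio of per-concept SVMs with the kernels described above could look like the following; the helper names and the binary label convention are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_concept_svms(X_cm, X_wt, X_kp, y):
    """RBF-kernel SVMs for the color moment and wavelet texture features and a
    chi-square-kernel SVM for the keypoint histograms; `y` holds binary labels
    (1 = frame depicts the concept)."""
    svm_cm = SVC(kernel='rbf', probability=True).fit(X_cm, y)
    svm_wt = SVC(kernel='rbf', probability=True).fit(X_wt, y)
    K_train = chi2_kernel(X_kp)                          # chi-square Gram matrix
    svm_kp = SVC(kernel='precomputed', probability=True).fit(K_train, y)
    return svm_cm, svm_wt, svm_kp, X_kp                  # keep X_kp for test-time kernels

def predict_concept_probs(models, x_cm, x_wt, x_kp):
    """Probabilities u_cm, u_wt, u_kp that one frame is related to the concept."""
    svm_cm, svm_wt, svm_kp, X_kp_train = models
    u_cm = svm_cm.predict_proba([x_cm])[0, 1]
    u_wt = svm_wt.predict_proba([x_wt])[0, 1]
    u_kp = svm_kp.predict_proba(chi2_kernel([x_kp], X_kp_train))[0, 1]
    return u_cm, u_wt, u_kp
```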
Referring to Fig. 4, in order to mine semantic information, the scheme extracts the semantic indication features of a video segment based on the 374 semantic concepts of VIREO-374 and three Support Vector Machines (SVMs) per concept. The support vector machines are trained on the color moment, wavelet texture, and local keypoint features, and at prediction time they estimate the probability that a given video frame is closely related to the concept. The flow of computing the semantic indication features of a video segment is shown in Fig. 4:
For video segment χ_s, the color moment feature f_cm(i_m(s)), wavelet texture feature f_wt(i_m(s)), and local keypoint feature f_kp(i_m(s)) of its middle frame i_m(s) are first extracted, and the probability values {u_cm(s, j), u_wt(s, j), u_kp(s, j) | j = 1, 2, ..., 374} are obtained by prediction with the support vector machines; the concept density is then calculated:
in the embodiment of the invention, the text information related to the video content is obtained from the audio signal of the video frame by utilizing a voice recognition technology; or,
and acquiring text information related to video content from the subtitles of the video frames.
Step 205: and calculating to obtain the semantic similarity of the characters based on the character information and the concept vocabulary information.
Next, the subtitle information corresponding to the video segment is processed. Based on the set Γ_st(s) formed by the subtitle vocabulary and the set Γ_cp(j) of concept vocabulary, the textual semantic similarity is computed with WordNet::Similarity, a similarity measurement tool for the external dictionary WordNet:
where η(γ, ω) denotes the similarity value between the subtitle word γ and the concept word ω given by WordNet::Similarity.
To reduce the impact of irrelevant concepts, the following textual relevance is defined:
where Q is a normalization coefficient. Since the support vector machines output probabilities for a two-class classification problem, a threshold of 0.5 is naturally used in the above formula.
Step 206: and calculating to obtain the semantic indicating features based on the word semantic similarity and the concept density.
Referring to Fig. 4, in order to mine semantic information, the scheme extracts the semantic indication features of a video segment based on the 374 concepts of VIREO-374 and three Support Vector Machines (SVMs) per concept. The support vector machines are trained on the color moment, wavelet texture, and local keypoint features, and at prediction time they estimate the probability that a given video frame is closely related to the concept. The flow of computing the semantic indication features of a video segment is shown in Fig. 4:
For video segment χ_s, the color moment feature f_cm(i_m(s)), wavelet texture feature f_wt(i_m(s)), and local keypoint feature f_kp(i_m(s)) of its middle frame i_m(s) are first extracted, and the probability values {u_cm(s, j), u_wt(s, j), u_kp(s, j) | j = 1, 2, ..., 374} are obtained by prediction with the support vector machines; the concept density is then calculated:
then, the subtitle information corresponding to the video band is processed. Set gamma formed based on caption vocabularyst(s) and a set of concept vocabulary Γcp(j) And calculating the semantic Similarity of the characters by using a Similarity measurement tool WordNet of an external dictionary WordNet, wherein the Similarity comprises the following steps:
where η(γ, ω) denotes the similarity value between the subtitle word γ and the concept word ω given by WordNet::Similarity.
To reduce the impact of irrelevant concepts, the following textual relevance is defined:
where Q is a normalization coefficient. Since the support vector machines output probabilities for a two-class classification problem, a threshold of 0.5 is naturally used in the above formula.
Finally, the semantic indication feature f_E(s) of the video segment is defined as the sum of ρ(s, j) weighted by u(s, j):
step 207: and linearly superposing the characteristics in the second characteristic set according to the characteristic weight value to obtain the significance score of the video segment.
And finally, fusing the three high-level characteristics by utilizing an iterative reweighted linear model to generate a video abstract with the length required by a user.
In the embodiment of the invention, the video abstract is finally determined by the significance score of each video segment, so that the following linear model is adopted to fuse three high-level characteristics, and the fusion result is the significance score of the video segment:
f_SAL(s) = w_M(s)·f_M(s) + w_F(s)·f_F(s) + w_E(s)·f_E(s)   (12b)
where w_M(s), w_F(s), and w_E(s) are the feature weights. Each feature is normalized to the interval [0, 1] before linear fusion.
The feature weights are computed by an iterative reweighting method as follows. In the k-th iteration, the weight w_#(s) (# ∈ {M, F, E}) is determined by the product of a macroscopic factor α_#(s) and a microscopic factor β_#(s), i.e. w_#(s) = α_#(s)·β_#(s):
where r_#(s) is the rank of feature f_#(s) within {f_#(s) | s = 1, 2, ..., N_S} after sorting in descending order, and N_S is the total number of video segments in the video. The saliency f_SAL(s) of each video segment can then be computed and the segments arranged in descending order. According to the length required by the user, video segments can be entered one by one into the selected video summary in order of f_SAL(s) from high to low.
Before the first iteration, the feature weights are initialized according to an equal-weight principle. The iterative process terminates after 15 iterations.
According to the technical scheme of the embodiment of the invention, low-level features such as color moments, wavelet textures, motion, and local keypoints are extracted from video frames. Next, based on these low-level features, high-level visual and semantic features are computed, including a motion attention feature, a face attention feature that takes depth information into account, and a semantic indication feature of the video segment. Then, an iteratively reweighted linear model is used to fuse the three high-level features and generate a video summary of the length required by the user.
Fig. 5 is a schematic structural composition diagram of an electronic device according to a first embodiment of the present invention, and as shown in fig. 5, the electronic device includes:
an extracting unit 51, configured to extract a first feature set from the video frame, where the first feature set includes: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics;
the first processing unit 52 is configured to calculate a second feature set based on the first feature set, where the second feature set includes: a motion attention feature, a human face attention feature based on depth information, and a semantic indication feature of a video segment;
and the second processing unit 53 is configured to perform fusion processing on each feature in the second feature set by using the linear model with iterative reweighting, so as to obtain the video summary.
Those skilled in the art will appreciate that the functions implemented by the units in the electronic device shown in fig. 5 can be understood by referring to the related description of the video processing method described above. The functions of the units in the electronic device shown in fig. 5 may be implemented by a program running on a processor, or may be implemented by specific logic circuits.
Fig. 6 is a schematic structural composition diagram of an electronic device according to a second embodiment of the present invention, and as shown in fig. 6, the electronic device includes:
an extracting unit 61, configured to extract a first feature set from a video frame, where the first feature set includes: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics;
the first processing unit 62 is configured to calculate a second feature set based on the first feature set, where the second feature set includes: a motion attention feature, a human face attention feature based on depth information, and a semantic indication feature of a video segment;
and the second processing unit 63 is configured to perform fusion processing on each feature in the second feature set by using the linear model with iterative reweighting, so as to obtain a video summary.
The first processing unit 62 includes:
a motion attention feature subunit 621, configured to calculate a motion attention feature from the motion features in the first feature set;
and the face attention feature subunit 622 is configured to obtain the area and the position of the face in each video frame through a face detection algorithm, and calculate, based on the depth image corresponding to the video frame and the pixel point set forming the face, the face attention feature based on the depth information.
The electronic device further includes:
and the training unit 64 is used for training a support vector machine based on the color moment features, the wavelet texture features and the local key point features.
The electronic device further includes:
a text extraction unit 65, configured to obtain text information related to video content from the audio signal of the video frame by using a voice recognition technology; or obtaining the text information related to the video content from the subtitles of the video frames.
The first processing unit 62 includes:
a semantic indicating feature subunit 623, configured to perform semantic concept detection on the color moment features, the wavelet texture features, and the local key point features by using the support vector machine, so as to obtain the concept density; calculate the textual semantic similarity based on the text information and the concept vocabulary information; and calculate the semantic indicating features based on the textual semantic similarity and the concept density.
The second processing unit 63 includes:
the linear superposition subunit 631 is configured to linearly superpose each feature in the second feature set according to the feature weight value, so as to obtain a saliency score of the video segment;
the video summarization subunit 632 is configured to select video segments one by one as the video summary according to a preset summarization length, in order from high to low according to the saliency scores of the video segments.
Those skilled in the art will appreciate that the functions implemented by the units in the electronic device shown in fig. 6 can be understood by referring to the related description of the video processing method described above. The functions of the units in the electronic device shown in fig. 6 may be implemented by a program running on a processor, or may be implemented by specific logic circuits.
The technical schemes described in the embodiments of the present invention can be combined arbitrarily without conflict.
In the embodiments provided in the present invention, it should be understood that the disclosed method and intelligent device may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one second processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.
Claims (10)
1. A method of video processing, the method comprising:
extracting a first feature set from a video frame, the first feature set comprising: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics;
acquiring a depth image corresponding to the video frame, text information related to video content, and the area and position of a human face in the video frame;
calculating to obtain a motion attention feature in a second feature set based on the motion feature;
calculating to obtain the human face attention feature based on the depth information in the second feature set based on the area and the position of the human face and the depth image;
obtaining concept density based on the color moment features, the wavelet texture features and the local key point features;
obtaining the textual semantic similarity based on the text information and the concept vocabulary information;
calculating to obtain semantic indicating characteristics of the video segments in the second characteristic set based on the concept density and the textual semantic similarity;
and performing fusion processing on each feature in the second feature set by using the linear model of iterative reweighting so as to obtain the video abstract.
2. The video processing method of claim 1, wherein the obtaining of the area and the position of the face in the video frame comprises:
obtaining the area and the position of a face in each video frame through a face detection algorithm;
correspondingly, the calculating the face attention feature based on the depth information in the second feature set based on the area and the position of the face and the depth image includes:
and calculating to obtain the human face attention feature based on the depth information based on the depth image corresponding to the video frame and the pixel point set forming the human face.
3. The video processing method of claim 1, the method further comprising:
acquiring text information related to video content from the audio signal of the video frame by utilizing a voice recognition technology; or,
and acquiring text information related to video content from the subtitles of the video frames.
4. The video processing method according to claim 3, wherein said obtaining a concept density based on the color moment features, the wavelet texture features, and the local keypoint features comprises:
training a support vector machine based on the color moment features, the wavelet texture features and the local key point features;
and the support vector machine detects semantic concepts of the color moment features, the wavelet texture features and the local key point features to obtain concept density.
5. The video processing method according to claim 1, wherein the fusion processing is performed on each feature in the second feature set by using an iterative reweighted linear model, so as to obtain a video summary; the method comprises the following steps:
linearly superposing each feature in the second feature set according to the feature weight value to obtain a significance score of the video segment;
according to the preset digest length, the video segments are selected as the video digests one by one according to the sequence of the significance scores of the video segments from high to low.
6. An electronic device, the electronic device comprising:
an extracting unit, configured to extract a first feature set from a video frame, where the first feature set includes: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics;
the extraction unit is further used for acquiring a depth image corresponding to the video frame and the area and position of the face in the video frame;
the character extraction unit is used for acquiring character information related to video content in the video frame;
the motion attention feature subunit is used for calculating motion attention features in the second feature set based on the motion features;
a face attention feature subunit, configured to calculate, based on the area and the position of the face and the depth image, a face attention feature based on depth information in the second feature set;
the semantic indicating feature subunit is used for obtaining concept density based on the color moment features, the wavelet texture features and the local key point features; calculating to obtain the textual semantic similarity based on the text information and the concept vocabulary information; calculating to obtain semantic indicating characteristics of the video segments in the second characteristic set based on the concept density and the textual semantic similarity;
and the second processing unit is used for carrying out fusion processing on each feature in the second feature set by using the linear model of iterative reweighting so as to obtain the video abstract.
7. The electronic device of claim 6, wherein:
the extraction unit is also used for obtaining the area and the position of the face in each video frame through a face detection algorithm;
and the face attention feature subunit is also used for calculating the face attention feature based on the depth image corresponding to the video frame and the pixel point set forming the face to obtain the face attention feature based on the depth information.
8. The electronic device of claim 6, the text extraction unit further comprising:
the system comprises a video frame, a voice recognition module, a display module and a display module, wherein the video frame is used for displaying video content; or obtaining the text information related to the video content from the subtitles of the video frames.
9. The electronic device of claim 8, further comprising:
the training unit is used for training a support vector machine based on the color moment features, the wavelet texture features and the local key point features;
the semantic indication feature subunit is further configured to perform semantic concept detection on the color moment features, the wavelet texture features and the local key point features by using the support vector machine to obtain concept density.
10. The electronic device of claim 9, the second processing unit comprising:
the linear superposition subunit is used for carrying out linear superposition on each feature in the second feature set according to the feature weight value to obtain a significance score of the video segment;
and the video abstract subunit is used for selecting the video segments into the video abstract one by one according to the preset abstract length and the sequence from high to low of the significance scores of the video segments.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510535580.9A CN105228033B (en) | 2015-08-27 | 2015-08-27 | A kind of method for processing video frequency and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510535580.9A CN105228033B (en) | 2015-08-27 | 2015-08-27 | A kind of method for processing video frequency and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105228033A CN105228033A (en) | 2016-01-06 |
CN105228033B true CN105228033B (en) | 2018-11-09 |
Family
ID=54996666
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510535580.9A Active CN105228033B (en) | 2015-08-27 | 2015-08-27 | A kind of method for processing video frequency and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105228033B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9936239B2 (en) * | 2016-06-28 | 2018-04-03 | Intel Corporation | Multiple stream tuning |
CN106355171A (en) * | 2016-11-24 | 2017-01-25 | 深圳凯达通光电科技有限公司 | Video monitoring internetworking system |
CN106934397B (en) | 2017-03-13 | 2020-09-01 | 北京市商汤科技开发有限公司 | Image processing method and device and electronic equipment |
CN107222795B (en) * | 2017-06-23 | 2020-07-31 | 南京理工大学 | Multi-feature fusion video abstract generation method |
CN107979764B (en) * | 2017-12-06 | 2020-03-31 | 中国石油大学(华东) | Video subtitle generating method based on semantic segmentation and multi-layer attention framework |
CN109413510B (en) * | 2018-10-19 | 2021-05-18 | 深圳市商汤科技有限公司 | Video abstract generation method and device, electronic equipment and computer storage medium |
CN111327945B (en) | 2018-12-14 | 2021-03-30 | 北京沃东天骏信息技术有限公司 | Method and apparatus for segmenting video |
CN109932617B (en) * | 2019-04-11 | 2021-02-26 | 东南大学 | Self-adaptive power grid fault diagnosis method based on deep learning |
CN110347870A (en) * | 2019-06-19 | 2019-10-18 | 西安理工大学 | The video frequency abstract generation method of view-based access control model conspicuousness detection and hierarchical clustering method |
CN110225368B (en) * | 2019-06-27 | 2020-07-10 | 腾讯科技(深圳)有限公司 | Video positioning method and device and electronic equipment |
CN111984820B (en) * | 2019-12-19 | 2023-10-27 | 重庆大学 | A video summarization method based on dual self-attention capsule network |
CN113158720B (en) * | 2020-12-15 | 2024-06-18 | 嘉兴学院 | Video abstraction method and device based on dual-mode feature and attention mechanism |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1685344A (en) * | 2002-11-01 | 2005-10-19 | 三菱电机株式会社 | Method for summarizing unknown content of video |
WO2007099496A1 (en) * | 2006-03-03 | 2007-09-07 | Koninklijke Philips Electronics N.V. | Method and device for automatic generation of summary of a plurality of images |
CN101743596A (en) * | 2007-06-15 | 2010-06-16 | 皇家飞利浦电子股份有限公司 | Method and apparatus for automatically generating summaries of a multimedia file |
CN102880866A (en) * | 2012-09-29 | 2013-01-16 | 宁波大学 | Method for extracting face features |
KR20130061058A (en) * | 2011-11-30 | 2013-06-10 | 고려대학교 산학협력단 | Video summary method and system using visual features in the video |
CN103200463A (en) * | 2013-03-27 | 2013-07-10 | 天脉聚源(北京)传媒科技有限公司 | Method and device for generating video summary |
CN103210651A (en) * | 2010-11-15 | 2013-07-17 | 华为技术有限公司 | Method and system for video summarization |
CN104508682A (en) * | 2012-08-03 | 2015-04-08 | 柯达阿拉里斯股份有限公司 | Identifying key frames using group sparsity analysis |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7409407B2 (en) * | 2004-05-07 | 2008-08-05 | Mitsubishi Electric Research Laboratories, Inc. | Multimedia event detection and summarization |
US8467610B2 (en) * | 2010-10-20 | 2013-06-18 | Eastman Kodak Company | Video summarization using sparse basis function combination |
- 2015-08-27: application CN201510535580.9A filed in China (CN); granted as patent CN105228033B, status Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1685344A (en) * | 2002-11-01 | 2005-10-19 | 三菱电机株式会社 | Method for summarizing unknown content of video |
WO2007099496A1 (en) * | 2006-03-03 | 2007-09-07 | Koninklijke Philips Electronics N.V. | Method and device for automatic generation of summary of a plurality of images |
CN101743596A (en) * | 2007-06-15 | 2010-06-16 | 皇家飞利浦电子股份有限公司 | Method and apparatus for automatically generating summaries of a multimedia file |
CN103210651A (en) * | 2010-11-15 | 2013-07-17 | 华为技术有限公司 | Method and system for video summarization |
KR20130061058A (en) * | 2011-11-30 | 2013-06-10 | 고려대학교 산학협력단 | Video summary method and system using visual features in the video |
CN104508682A (en) * | 2012-08-03 | 2015-04-08 | 柯达阿拉里斯股份有限公司 | Identifying key frames using group sparsity analysis |
CN102880866A (en) * | 2012-09-29 | 2013-01-16 | 宁波大学 | Method for extracting face features |
CN103200463A (en) * | 2013-03-27 | 2013-07-10 | 天脉聚源(北京)传媒科技有限公司 | Method and device for generating video summary |
Non-Patent Citations (2)
Title |
---|
Hierarchical 3D kernel descriptors for action recognition using depth sequences;Yu Kong et.al;《2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition》;20150508;全文 * |
Multi-scale information maximization based visual attention modeling for video summarization;Naveed Ejaz et.al;《2012 6th International Conference on Next Generation Mobile Appllications, Service and Technologies》;20120914;全文 * |
Also Published As
Publication number | Publication date |
---|---|
CN105228033A (en) | 2016-01-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105228033B (en) | A kind of method for processing video frequency and electronic equipment | |
WO2021088510A1 (en) | Video classification method and apparatus, computer, and readable storage medium | |
US9176987B1 (en) | Automatic face annotation method and system | |
Ejaz et al. | Efficient visual attention based framework for extracting key frames from videos | |
WO2020177673A1 (en) | Video sequence selection method, computer device and storage medium | |
CN103973969B (en) | Electronic device and image selection method thereof | |
US20110243452A1 (en) | Electronic apparatus, image processing method, and program | |
JP5947131B2 (en) | Search input method and system by region selection method | |
EP4083817A1 (en) | Video tag determination method, device, terminal, and storage medium | |
CN110889379B (en) | Expression package generation method and device and terminal equipment | |
CN112818995B (en) | Image classification method, device, electronic equipment and storage medium | |
Zhang et al. | Retargeting semantically-rich photos | |
CN114511810A (en) | Abnormal event detection method and device, computer equipment and storage medium | |
CN102375987A (en) | Image processing device and image feature vector extracting and image matching method | |
CN105956051A (en) | Information finding method, device and system | |
CN115909176A (en) | Video semantic segmentation method and device, electronic equipment and storage medium | |
CN106164977A (en) | Camera array analysis mechanisms | |
Miniakhmetova et al. | An approach to personalized video summarization based on user preferences analysis | |
CN109145140A (en) | One kind being based on the matched image search method of hand-drawn outline figure and system | |
CN112261321B (en) | Subtitle processing method and device and electronic equipment | |
EP3117627A1 (en) | Method and apparatus for video processing | |
KR20150101846A (en) | Image classification service system based on a sketch user equipment, service equipment, service method based on sketch and computer readable medium having computer program recorded therefor | |
CN112115740B (en) | Method and apparatus for processing image | |
Meng et al. | Human action classification using SVM_2K classifier on motion features | |
He et al. | A video summarization method based on key frames extracted by TMOF |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||