
CN105228033B - Video processing method and electronic device - Google Patents

Video processing method and electronic device

Info

Publication number
CN105228033B
CN105228033B
Authority
CN
China
Prior art keywords
video
feature
features
face
video frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510535580.9A
Other languages
Chinese (zh)
Other versions
CN105228033A (en)
Inventor
董培
靳玉茹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN201510535580.9A priority Critical patent/CN105228033B/en
Publication of CN105228033A publication Critical patent/CN105228033A/en
Application granted granted Critical
Publication of CN105228033B publication Critical patent/CN105228033B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/60Analysis of geometric attributes

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video processing method and an electronic device. The method includes: extracting a first feature set from a video frame, the first feature set including color moment features, wavelet texture features, motion features, and local keypoint features; calculating a second feature set based on the first feature set, the second feature set including a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of a video segment; and fusing the features in the second feature set using an iteratively reweighted linear model to obtain a video summary.

Description

Video processing method and electronic device
Technical Field
The present invention relates to video processing technologies, and in particular, to a video processing method and an electronic device.
Background
Intelligent terminals such as smartphones have become portable companions in people's work and daily life, and users can easily accumulate large numbers of videos by downloading and self-shooting. For a mobile phone equipped with a binocular camera, the amount of data to be stored is even larger. Given the relatively limited capacity of mobile phone memory, the management of video files has become an urgent problem to be solved.
Disclosure of Invention
In order to solve the above technical problem, embodiments of the present invention provide a video processing method and an electronic device.
The video processing method provided by the embodiment of the invention comprises the following steps:
extracting a first feature set from a video frame, the first feature set comprising: color moment features, wavelet texture features, motion features and local keypoint features;
calculating a second feature set based on the first feature set, the second feature set comprising: a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of a video segment;
and performing fusion processing on each feature in the second feature set by using an iteratively reweighted linear model, so as to obtain a video summary.
The electronic equipment provided by the embodiment of the invention comprises:
an extracting unit, configured to extract a first feature set from a video frame, where the first feature set includes: color moment features, wavelet texture features, motion features and local keypoint features;
a first processing unit, configured to calculate a second feature set based on the first feature set, where the second feature set includes: a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of a video segment;
and a second processing unit, configured to perform fusion processing on each feature in the second feature set by using an iteratively reweighted linear model, so as to obtain a video summary.
In the technical solution of the embodiments of the present invention, color moment features, wavelet texture features, motion features and local keypoint features are extracted from video frames; a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of each video segment are then calculated from these extracted features; and the motion attention feature, the depth-based face attention feature and the semantic indicative feature are fused to obtain a video summary. Video segments with refined semantics and important content are thus extracted from the original video, which effectively reduces the amount of data to be stored on the electronic device, improves memory utilization and user experience, and also helps the user later locate the video they most want to find among a small number of video files. In addition, the technical solution combines information from the visual modality and the textual modality, and can therefore capture the high-level semantics of the video content more effectively. Because the face attention feature incorporates the depth information of objects in the scene, the high-level semantics can be grasped from a more comprehensive perspective. The technical solution does not depend on heuristic rules formulated for specific video types and is therefore applicable to a wider range of videos.
Drawings
Fig. 1 is a schematic flowchart of a video processing method according to a first embodiment of the invention;
fig. 2 is a flowchart illustrating a video processing method according to a second embodiment of the invention;
FIG. 3 is an overall flow chart of video summarization according to an embodiment of the present invention;
FIG. 4 is a flow chart of computing semantic indexing features of a video segment according to an embodiment of the invention;
fig. 5 is a schematic structural diagram of an electronic device according to a first embodiment of the invention;
fig. 6 is a schematic structural diagram of an electronic device according to a second embodiment of the present invention.
Detailed Description
So that the manner in which the features and aspects of the embodiments of the present invention can be understood in detail, a more particular description of the embodiments of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings.
In the era of information explosion, traditional methods of browsing and managing video data face unprecedented challenges. Providing video users with a short video summary that concentrates the key information of the original video therefore has significant practical value. Video summaries can generally be divided into two types, dynamic and static: a dynamic video summary is a shortened version of the original video and may contain a series of video segments extracted from the original long version, while a static video summary may consist of a set of key frames extracted from the original video.
Conventional video summaries are generated by extracting visual or textual features from a video. However, most methods in this direction rely on heuristic rules or simple text analysis (e.g., word-frequency statistics). In addition, traditional attention-model methods that use face features only consider information such as the planar position and size of the detected face in the scene, and do not make use of depth information.
The technical solution of the embodiments of the present invention estimates the relative importance of video segments through iterative reweighting, based on a user attention model, the semantic information of the video, and the depth information of the video frames, thereby generating a dynamic video summary.
Fig. 1 is a schematic flowchart of a video processing method according to a first embodiment of the present invention, as shown in fig. 1, the video processing method includes the following steps:
step 101: extracting a first feature set from a video frame, the first feature set comprising: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics.
Referring to fig. 3, first, a first feature set is extracted from a video frame, the first feature set is a low-level feature set, and the first feature set includes four low-level features: color moment features, wavelet texture features, motion features, and local keypoint features.
Four low-level features in the first feature set are described in detail below.
(1) Color moment features
A video frame is spatially divided into 5 × 5 (25 in total) non-overlapping pixel blocks, and for each pixel block the first-order moment and the second- and third-order central moments are computed for the three channels of the Lab color space. The color moments of the 25 pixel blocks of the frame constitute the color moment feature vector f_cm(i) of the frame.
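A minimal sketch of this block-wise Lab color moment extraction is given below, assuming OpenCV and NumPy are available; the function name color_moment_feature and the exact block partitioning are illustrative assumptions rather than the patent's implementation.

import numpy as np
import cv2  # assumed available for the BGR -> Lab conversion

def color_moment_feature(frame_bgr, grid=5):
    """Block-wise Lab color moments: for each of the grid*grid blocks,
    the mean (1st-order moment) and the 2nd- and 3rd-order central moments
    are computed per Lab channel, approximating f_cm(i)."""
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB).astype(np.float64)
    h, w, _ = lab.shape
    feats = []
    for r in range(grid):
        for c in range(grid):
            block = lab[r * h // grid:(r + 1) * h // grid,
                        c * w // grid:(c + 1) * w // grid]
            for ch in range(3):
                x = block[:, :, ch].ravel()
                mu = x.mean()                           # 1st-order moment
                sigma = ((x - mu) ** 2).mean() ** 0.5   # 2nd-order central moment
                skew = np.cbrt(((x - mu) ** 3).mean())  # 3rd-order central moment
                feats.extend([mu, sigma, skew])
    return np.asarray(feats)  # 5*5*3*3 = 225-dimensional vector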
(2) Wavelet texture feature
Similarly, a video frame is divided into 3 × 3 (9 in total) non-overlapping pixel blocks, and the luminance component of each block is subjected to a three-level Haar wavelet decomposition, from which the variance of the wavelet coefficients at each level is calculated in the horizontal, vertical and diagonal directions. All the wavelet coefficient variances of the video frame constitute the wavelet texture feature vector f_wt(i) of the frame.
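The three-level Haar decomposition of each block's luminance can be sketched as follows; PyWavelets is assumed as the wavelet library and the function name is illustrative.

import numpy as np
import pywt  # PyWavelets, assumed available for the Haar decomposition

def wavelet_texture_feature(gray_frame, grid=3, levels=3):
    """Block-wise Haar wavelet texture: variances of the horizontal,
    vertical and diagonal detail coefficients at each of the three
    decomposition levels, approximating f_wt(i)."""
    h, w = gray_frame.shape
    feats = []
    for r in range(grid):
        for c in range(grid):
            block = gray_frame[r * h // grid:(r + 1) * h // grid,
                               c * w // grid:(c + 1) * w // grid].astype(np.float64)
            coeffs = pywt.wavedec2(block, 'haar', level=levels)
            for detail in coeffs[1:]:  # one (cH, cV, cD) tuple per level
                feats.extend(band.var() for band in detail)
    return np.asarray(feats)  # 3*3*3*3 = 81-dimensional vector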
(3) Motion features
The human eye is sensitive to changes in visual content. Based on this principle, a video frame is divided into M × N non-overlapping pixel blocks, each containing 16 × 16 pixels, and a motion vector v(i, m, n) is calculated for each block by a motion estimation algorithm. The M × N motion vectors constitute the motion feature f_mv(i) of the video frame.
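The patent does not name a specific motion estimation algorithm; the sketch below uses a simple full-search block-matching scheme on 16 × 16 blocks, with the search range and SAD criterion as assumptions.

import numpy as np

def block_motion_vectors(prev_gray, curr_gray, block=16, search=8):
    """Full-search block matching: each 16x16 block of the current frame is
    matched in the previous frame within a +/- search window; the best
    displacement is the motion vector v(i, m, n)."""
    H, W = curr_gray.shape
    M, N = H // block, W // block
    mv = np.zeros((M, N, 2), dtype=np.int32)
    for m in range(M):
        for n in range(N):
            y0, x0 = m * block, n * block
            cur = curr_gray[y0:y0 + block, x0:x0 + block].astype(np.int32)
            best, best_dy, best_dx = np.inf, 0, 0
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = y0 + dy, x0 + dx
                    if y < 0 or x < 0 or y + block > H or x + block > W:
                        continue
                    ref = prev_gray[y:y + block, x:x + block].astype(np.int32)
                    sad = np.abs(cur - ref).sum()  # sum of absolute differences
                    if sad < best:
                        best, best_dy, best_dx = sad, dy, dx
            mv[m, n] = (best_dy, best_dx)
    return mv  # M x N motion vectors forming f_mv(i)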
(4) Local keypoint features
In semantic-level video analysis, a bag-of-features (BoF) representation based on local keypoints can serve as a powerful complement to features computed from global information. Salient regions are therefore captured using soft-weighted local keypoint features, defined on the importance of keypoints within a vocabulary of 500 visual words. Specifically, the keypoints in the i-th video frame are detected by a Difference of Gaussians (DoG) detector, represented by Scale-Invariant Feature Transform (SIFT) descriptors, and clustered into 500 visual words. The keypoint feature vector f_kp(i) is defined as the weighted similarity between each keypoint and its four nearest-neighbor visual words.
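A minimal sketch of the soft-weighted bag-of-features histogram is shown below; the Gaussian weighting of the four nearest visual words is an assumed soft-weighting scheme, and the function name and sigma parameter are illustrative.

import numpy as np

def soft_weighted_bof(descriptors, vocabulary, k=4, sigma=1.0):
    """Soft-weighted bag of features: each SIFT descriptor of a DoG keypoint
    contributes a similarity weight to its k nearest words of the
    500-word visual vocabulary, approximating f_kp(i)."""
    hist = np.zeros(len(vocabulary))
    for d in descriptors:
        dists = np.linalg.norm(vocabulary - d, axis=1)
        nearest = np.argsort(dists)[:k]
        weights = np.exp(-dists[nearest] ** 2 / (2 * sigma ** 2))
        hist[nearest] += weights / (weights.sum() + 1e-12)
    return hist / (hist.sum() + 1e-12)  # normalized 500-dimensional vector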
Step 102: calculating to obtain a second feature set based on the first feature set, wherein the second feature set comprises: the method comprises the following steps of motion attention feature, human face attention feature based on depth information and semantic indication feature of a video segment.
Next, based on these low-level features, further high-level visual and semantic features, referred to as the second feature set, are computed, including: a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of the video segment.
These high-level visual and semantic features are computed for each given video segment χ_s (starting at frame i_1(s) and ending at frame i_2(s)). Video segmentation is achieved by shot cut detection.
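The patent does not specify a particular shot cut detector; the following histogram-difference sketch only illustrates how such segmentation could be performed, with the bin count and threshold as assumed parameters.

import numpy as np

def shot_cut_boundaries(frames_gray, bins=64, threshold=0.5):
    """Marks a new segment whenever the L1 distance between consecutive
    normalized gray-level histograms exceeds the threshold."""
    cuts = [0]
    prev_hist = None
    for i, frame in enumerate(frames_gray):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hist = hist / max(hist.sum(), 1)
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            cuts.append(i)  # frame i starts a new shot / video segment
        prev_hist = hist
    return cuts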
Each feature of the second feature set is described in detail below.
(1) Motion attention feature
Research on human attention in psychology has laid an indispensable foundation for attention modeling in computer vision. The cognitive mechanism of attention is key to analyzing and understanding human thought and activity, and can therefore guide the selection of relatively important content from the original video when forming the video summary. The scheme uses a motion attention model to calculate high-level motion attention features suitable for semantic analysis.
For the (m, n)-th pixel block in the i-th video frame, a spatial window containing 5 × 5 (25 in total) pixel blocks and a temporal window containing 7 pixel blocks are defined, both centered on the (m, n)-th block of the i-th frame. The phase range [0, 2π) is evenly divided into 8 intervals, a spatial phase histogram is computed over the spatial window and a temporal phase histogram over the temporal window, and the spatial consistency indicator C_s(i, m, n) and the temporal consistency indicator C_t(i, m, n) are obtained from the following formulas:
C_s(i, m, n) = -∑_ζ p_s(ζ) log p_s(ζ)   (1a)
C_t(i, m, n) = -∑_ζ p_t(ζ) log p_t(ζ)   (2a)
where p_s(ζ) and p_t(ζ) are the phase distributions in the spatial window and the temporal window, respectively. The motion attention feature of the i-th frame is then defined as follows:
To suppress noise between neighboring video frames, the resulting sequence of motion attention features is processed with a 9th-order median filter. For the s-th video segment χ_s, the motion attention feature f_M(s) is obtained from the filtered single-frame feature values:
(2) Face attention feature based on depth information
In video, the presence of a human face may generally indicate more important content. In this scheme, the area A_F(j) and the position of each face (indexed by j) in every video frame are obtained through a face detection algorithm. For the j-th detected face, the depth saliency D(j) is defined based on the depth image d_i corresponding to the video frame and the set of pixels {x | x ∈ Λ(j)} constituting the face, as follows:
where |Λ(j)| is the number of pixels contained in the j-th face. Based on the position of the face within the video frame, a position weight w_fp(j) is also defined to approximately reflect the relative attention the face receives from the viewer (regions closer to the center of the frame are weighted more heavily), as shown in Table 1:
TABLE 1
Table 1: face position weights assigned to different regions of a video frame. The central region has a high weight and the edge regions have a low weight.
The face attention feature of the i-th frame can be calculated as:
where A_frm is the area of the video frame and D_max(i) = max_x d_i(x). To reduce the impact of face detection inaccuracy on the overall scheme, the resulting face attention feature sequence is also smoothed with a 5th-order median filter. The face attention feature f_F(s) of video segment χ_s is computed from the smoothed features {FAC(i) | i = i_1(s), ..., i_2(s)}:
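Because the FAC(i) formula is only referenced above, the sketch below combines the face-size ratio A_F(j)/A_frm, the position weight of Table 1 and the mean-depth saliency D(j)/D_max(i) in a plausible way; the exact combination, the region keys and the function name are assumptions.

import numpy as np

def face_attention(frame_area, faces, depth, position_weights):
    """Depth-aware per-frame face attention value (approximating FAC(i)).
    faces: list of (face_area, pixel_mask, region_key) from a face detector;
    depth: depth image d_i aligned with the frame;
    position_weights: Table-1 style weights keyed by frame region."""
    d_max = float(depth.max()) + 1e-12  # D_max(i) = max_x d_i(x)
    score = 0.0
    for face_area, mask, region in faces:
        depth_saliency = float(depth[mask].mean())  # D(j), taken as mean depth over the face pixels
        score += position_weights[region] * (face_area / frame_area) * (depth_saliency / d_max)
    return score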
(3) Semantic indicative features of video segments
Referring to fig. 4, in order to mine semantic information, the scheme extracts the semantic indicative feature of a video segment based on the 374 concepts of VIREO-374 and three Support Vector Machines (SVMs) per concept. The SVMs are trained on the color moment, wavelet texture and local keypoint features, and at prediction time estimate a probability value describing how closely a given video frame is related to the concept. The flow of computing the semantic indicative feature of a video segment is shown in fig. 4:
For video segment χ_s, the color moment feature f_cm(i_m(s)), the wavelet texture feature f_wt(i_m(s)) and the local keypoint feature f_kp(i_m(s)) of its middle frame i_m(s) are first extracted, and the probability values {u_cm(s, j), u_wt(s, j), u_kp(s, j) | j = 1, 2, ..., 374} are obtained by SVM prediction; the concept density is then calculated:
Next, the subtitle information corresponding to the video segment is processed. Based on the set Γ_st(s) formed from the subtitle vocabulary and the concept vocabulary set Γ_cp(j), the textual semantic similarity is calculated with the similarity measurement tool WordNet::Similarity of the external dictionary WordNet:
where η(γ, ω) denotes the similarity value of subtitle word γ and concept word ω given by WordNet::Similarity.
To reduce the impact of irrelevant concepts, the following textual relevance is defined:
where Q is a normalization coefficient. Since the support vector machine outputs probabilities for a two-class classification problem, a threshold of 0.5 is naturally used in the above formula.
Finally, the semantic indicative feature f_E(s) of the video segment is defined as the sum of ρ(s, j) weighted by u(s, j):
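A compact sketch of the whole semantic pipeline for one segment is given below; the averaging used for the concept density, the normalization Q and the function name are assumptions consistent with the description above.

import numpy as np

def semantic_indicative_feature(u_cm, u_wt, u_kp, similarity, threshold=0.5):
    """Semantic indicative feature f_E(s) of a video segment.
    u_cm, u_wt, u_kp: 374 per-concept SVM probabilities for the middle frame;
    similarity[j]: WordNet-based similarity between the segment's subtitle
    words and concept j."""
    u = (np.asarray(u_cm) + np.asarray(u_wt) + np.asarray(u_kp)) / 3.0  # concept density u(s, j) (mean, assumed)
    rel = np.where(u > threshold, np.asarray(similarity), 0.0)          # suppress concepts below the 0.5 threshold
    q = rel.sum()                                                       # normalization coefficient Q (assumed)
    rho = rel / q if q > 0 else rel                                     # textual relevance rho(s, j)
    return float((u * rho).sum())                                       # f_E(s): sum of rho weighted by u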
step 103: and performing fusion processing on each feature in the second feature set by using the linear model of iterative reweighting so as to obtain the video abstract.
Finally, the three high-level features are fused using an iteratively reweighted linear model to generate a video summary of the length required by the user.
In the embodiment of the invention, the video summary is ultimately determined by the saliency score of each video segment, so the following linear model is used to fuse the three high-level features; the fusion result is the saliency score of the video segment:
f_SAL(s) = w_M(s)·f_M(s) + w_F(s)·f_F(s) + w_E(s)·f_E(s)   (12a)
where w_M(s), w_F(s) and w_E(s) are the feature weights. Each feature is normalized to the interval [0, 1] before linear fusion.
The feature weights are calculated by iterative reweighting as follows. In the k-th iteration, the weight w_#(s) (# ∈ {M, F, E}) is determined by the product of a macroscopic factor α_#(s) and a microscopic factor β_#(s), i.e., w_#(s) = α_#(s)·β_#(s):
where r_#(s) is the rank of feature f_#(s) within {f_#(s) | s = 1, 2, ..., N_S} after sorting in descending order, and N_S is the total number of video segments in the video. The saliency scores f_SAL(s) of the video segments can then be computed and sorted in descending order. According to the length required by the user, video segments are added to the selected video summary one by one in descending order of f_SAL(s).
Before the first iteration, the feature weights are initialized according to an equal-weight principle. The iteration process terminates after 15 iterations.
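The sketch below illustrates the iteratively reweighted fusion and summary selection; since the published formulas for the macroscopic factor α_#(s) and microscopic factor β_#(s) are only referenced above, a rank-based term and the feature value itself are used as stand-ins, and the summary length is treated as a segment count for simplicity.

import numpy as np

def iteratively_reweighted_fusion(f_M, f_F, f_E, summary_segments, iterations=15):
    """Fuses the three segment-level features into saliency scores f_SAL(s)
    and returns the indices of the top segments for the summary.
    f_M, f_F, f_E: per-segment features, already normalized to [0, 1]."""
    feats = {'M': np.asarray(f_M, dtype=float),
             'F': np.asarray(f_F, dtype=float),
             'E': np.asarray(f_E, dtype=float)}
    n = len(feats['M'])
    weights = {k: np.full(n, 1.0 / 3.0) for k in feats}  # equal-weight initialization
    for _ in range(iterations):
        for k in feats:
            rank = np.argsort(np.argsort(-feats[k])) + 1  # r_#(s): 1 = highest-valued segment
            alpha = 1.0 / rank                            # macroscopic factor (assumed form)
            beta = feats[k]                               # microscopic factor (assumed form)
            weights[k] = alpha * beta                     # w_#(s) = alpha_#(s) * beta_#(s)
        total = sum(weights[k] for k in feats) + 1e-12
        weights = {k: weights[k] / total for k in feats}  # keep the weights comparable across features
    sal = sum(weights[k] * feats[k] for k in feats)       # f_SAL(s)
    order = np.argsort(-sal)                              # segments sorted by descending saliency
    return sal, order[:summary_segments]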
According to the technical solution of the embodiments of the present invention, low-level features such as color moments, wavelet textures, motion and local keypoints are extracted from video frames. Based on these low-level features, higher-level visual and semantic features are then computed, including a motion attention feature, a face attention feature that takes depth information into account, and a semantic indicative feature of the video segment. Finally, an iteratively reweighted linear model is used to fuse the three high-level features and generate a video summary of the length required by the user.
Fig. 2 is a schematic flowchart of a video processing method according to a second embodiment of the present invention, and as shown in fig. 2, the video processing method includes the following steps:
step 201: extracting a first feature set from a video frame, the first feature set comprising: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics.
Referring to fig. 3, first, a first feature set is extracted from a video frame, the first feature set is a low-level feature set, and the first feature set includes four low-level features: color moment features, wavelet texture features, motion features, and local keypoint features.
Four low-level features in the first feature set are described in detail below.
(1) Color moment features
A video frame is spatially divided into 5 × 5 (25 in total) non-overlapping pixel blocks, and for each pixel block the first-order moment and the second- and third-order central moments are computed for the three channels of the Lab color space. The color moments of the 25 pixel blocks of the frame constitute the color moment feature vector f_cm(i) of the frame.
(2) Wavelet texture feature
Similarly, a video frame is divided into 3 × 3 (9 in total) non-overlapping pixel blocks, and the luminance component of each block is subjected to a three-level Haar wavelet decomposition, from which the variance of the wavelet coefficients at each level is calculated in the horizontal, vertical and diagonal directions. All the wavelet coefficient variances of the video frame constitute the wavelet texture feature vector f_wt(i) of the frame.
(3) Motion features
The human eye is sensitive to changes in visual content. Based on this principle, a video frame is divided into M × N non-overlapping pixel blocks, each containing 16 × 16 pixels, and a motion vector v(i, m, n) is calculated for each block by a motion estimation algorithm. The M × N motion vectors constitute the motion feature f_mv(i) of the video frame.
(4) Local keypoint features
In semantic-level video analysis, a bag-of-features (BoF) representation based on local keypoints can serve as a powerful complement to features computed from global information. Salient regions are therefore captured using soft-weighted local keypoint features, defined on the importance of keypoints within a vocabulary of 500 visual words. Specifically, the keypoints in the i-th video frame are detected by a Difference of Gaussians (DoG) detector, represented by Scale-Invariant Feature Transform (SIFT) descriptors, and clustered into 500 visual words. The keypoint feature vector f_kp(i) is defined as the weighted similarity between each keypoint and its four nearest-neighbor visual words.
Step 202: and calculating to obtain the motion attention feature according to the motion feature in the first feature set.
Next, based on these low-level features, further high-level visual and semantic features, referred to as the second feature set, are computed, including: a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of the video segment.
These high-level visual and semantic features are computed for each given video segment χ_s (starting at frame i_1(s) and ending at frame i_2(s)). Video segmentation is achieved by shot cut detection.
Research on human attention in psychology has laid an indispensable foundation for attention modeling in computer vision. The cognitive mechanism of attention is key to analyzing and understanding human thought and activity, and can therefore guide the selection of relatively important content from the original video when forming the video summary. The scheme uses a motion attention model to calculate high-level motion attention features suitable for semantic analysis.
For the (m, n)-th pixel block in the i-th video frame, a spatial window containing 5 × 5 (25 in total) pixel blocks and a temporal window containing 7 pixel blocks are defined, both centered on the (m, n)-th block of the i-th frame. The phase range [0, 2π) is evenly divided into 8 intervals, a spatial phase histogram is computed over the spatial window and a temporal phase histogram over the temporal window, and the spatial consistency indicator C_s(i, m, n) and the temporal consistency indicator C_t(i, m, n) are obtained from the following formulas:
C_s(i, m, n) = -∑_ζ p_s(ζ) log p_s(ζ)   (1b)
C_t(i, m, n) = -∑_ζ p_t(ζ) log p_t(ζ)   (2b)
where p_s(ζ) and p_t(ζ) are the phase distributions in the spatial window and the temporal window, respectively. The motion attention feature of the i-th frame is then defined as follows:
To suppress noise between neighboring video frames, the resulting sequence of motion attention features is processed with a 9th-order median filter. For the s-th video segment χ_s, the motion attention feature f_M(s) is obtained from the filtered single-frame feature values:
step 203: the area and the position of the face in each video frame are obtained through a face detection algorithm, and the face attention feature based on the depth information is obtained through calculation based on the depth image corresponding to the video frame and the pixel point set forming the face.
In video, the presence of a human face may generally indicate more important content. In this scheme, the area A_F(j) and the position of each face (indexed by j) in every video frame are obtained through a face detection algorithm. For the j-th detected face, the depth saliency D(j) is defined based on the depth image d_i corresponding to the video frame and the set of pixels {x | x ∈ Λ(j)} constituting the face, as follows:
where |Λ(j)| is the number of pixels contained in the j-th face. Based on the position of the face within the video frame, a position weight w_fp(j) is also defined to approximately reflect the relative attention the face receives from the viewer (regions closer to the center of the frame are weighted more heavily), as shown in Table 1:
TABLE 1
Table 1: face position weights assigned to different regions of a video frame. The central region has a high weight and the edge regions have a low weight.
The face attention feature of the i-th frame can be calculated as:
where A_frm is the area of the video frame and D_max(i) = max_x d_i(x). To reduce the impact of face detection inaccuracy on the overall scheme, the resulting face attention feature sequence is smoothed with a 5th-order median filter. The face attention feature f_F(s) of video segment χ_s is computed from the smoothed features {FAC(i) | i = i_1(s), ..., i_2(s)}:
step 204: and the support vector machine detects semantic concepts of the color moment features, the wavelet texture features and the local key point features to obtain concept density.
In the embodiment of the invention, the support vector machines are trained on the color moment features, the wavelet texture features and the local keypoint features. The support vector machines are implemented with the LibSVM package, using a Radial Basis Function (RBF) kernel for the color moment and wavelet texture features and a chi-square kernel for the local keypoint features.
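For illustration, a minimal per-concept training sketch with these kernel choices is shown below; it uses scikit-learn's SVC as a stand-in for LibSVM, and the chi-square kernel form and function names are assumptions rather than the patent's exact implementation.

import numpy as np
from sklearn.svm import SVC  # stand-in for the LibSVM package mentioned above

def chi_square_kernel(X, Y, gamma=1.0):
    """Exponential chi-square kernel, commonly used for histogram-like
    keypoint (bag-of-features) vectors."""
    K = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        num = (x - Y) ** 2
        den = x + Y + 1e-12
        K[i] = np.exp(-gamma * (num / den).sum(axis=1))
    return K

def train_concept_svms(f_cm, f_wt, f_kp, labels):
    """One probabilistic SVM per feature type for a single concept."""
    svm_cm = SVC(kernel='rbf', probability=True).fit(f_cm, labels)               # RBF for color moments
    svm_wt = SVC(kernel='rbf', probability=True).fit(f_wt, labels)               # RBF for wavelet texture
    svm_kp = SVC(kernel=chi_square_kernel, probability=True).fit(f_kp, labels)   # chi-square for keypoints
    return svm_cm, svm_wt, svm_kp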
Referring to fig. 4, in order to mine semantic information, the scheme extracts the semantic indicative feature of a video segment based on the 374 semantic concepts of VIREO-374 and three Support Vector Machines (SVMs) per concept. The SVMs are trained on the color moment, wavelet texture and local keypoint features, and at prediction time estimate a probability value describing how closely a given video frame is related to the concept. The flow of computing the semantic indicative feature of a video segment is shown in fig. 4:
For video segment χ_s, the color moment feature f_cm(i_m(s)), the wavelet texture feature f_wt(i_m(s)) and the local keypoint feature f_kp(i_m(s)) of its middle frame i_m(s) are first extracted, and the probability values {u_cm(s, j), u_wt(s, j), u_kp(s, j) | j = 1, 2, ..., 374} are obtained by SVM prediction; the concept density is then calculated:
in the embodiment of the invention, the text information related to the video content is obtained from the audio signal of the video frame by utilizing a voice recognition technology; or,
and acquiring text information related to video content from the subtitles of the video frames.
Step 205: and calculating to obtain the semantic similarity of the characters based on the character information and the concept vocabulary information.
Next, the subtitle information corresponding to the video segment is processed. Based on the set Γ_st(s) formed from the subtitle vocabulary and the concept vocabulary set Γ_cp(j), the textual semantic similarity is calculated with the similarity measurement tool WordNet::Similarity of the external dictionary WordNet:
where η(γ, ω) denotes the similarity value of subtitle word γ and concept word ω given by WordNet::Similarity.
To reduce the impact of irrelevant concepts, the following textual relevance is defined:
where Q is a normalization coefficient. Since the support vector machine outputs probabilities for a two-class classification problem, a threshold of 0.5 is naturally used in the above formula.
Step 206: and calculating to obtain the semantic indicating features based on the word semantic similarity and the concept density.
Referring to fig. 4, in order to mine semantic information, the scheme extracts the semantic indicative feature of a video segment based on the 374 concepts of VIREO-374 and three Support Vector Machines (SVMs) per concept. The SVMs are trained on the color moment, wavelet texture and local keypoint features, and at prediction time estimate a probability value describing how closely a given video frame is related to the concept. The flow of computing the semantic indicative feature of a video segment is shown in fig. 4:
For video segment χ_s, the color moment feature f_cm(i_m(s)), the wavelet texture feature f_wt(i_m(s)) and the local keypoint feature f_kp(i_m(s)) of its middle frame i_m(s) are first extracted, and the probability values {u_cm(s, j), u_wt(s, j), u_kp(s, j) | j = 1, 2, ..., 374} are obtained by SVM prediction; the concept density is then calculated:
Then, the subtitle information corresponding to the video segment is processed. Based on the set Γ_st(s) formed from the subtitle vocabulary and the concept vocabulary set Γ_cp(j), the textual semantic similarity is calculated with the similarity measurement tool WordNet::Similarity of the external dictionary WordNet:
where η(γ, ω) denotes the similarity value of subtitle word γ and concept word ω given by WordNet::Similarity.
To reduce the impact of irrelevant concepts, the following textual relevance is defined:
where Q is a normalization coefficient. Since the support vector machine outputs probabilities for a two-class classification problem, a threshold of 0.5 is naturally used in the above formula.
Finally, the semantic indicative feature f_E(s) of the video segment is defined as the sum of ρ(s, j) weighted by u(s, j):
step 207: and linearly superposing the characteristics in the second characteristic set according to the characteristic weight value to obtain the significance score of the video segment.
Finally, the three high-level features are fused using an iteratively reweighted linear model to generate a video summary of the length required by the user.
In the embodiment of the invention, the video summary is ultimately determined by the saliency score of each video segment, so the following linear model is used to fuse the three high-level features; the fusion result is the saliency score of the video segment:
f_SAL(s) = w_M(s)·f_M(s) + w_F(s)·f_F(s) + w_E(s)·f_E(s)   (12b)
where w_M(s), w_F(s) and w_E(s) are the feature weights. Each feature is normalized to the interval [0, 1] before linear fusion.
The feature weights are calculated by iterative reweighting as follows. In the k-th iteration, the weight w_#(s) (# ∈ {M, F, E}) is determined by the product of a macroscopic factor α_#(s) and a microscopic factor β_#(s), i.e., w_#(s) = α_#(s)·β_#(s):
where r_#(s) is the rank of feature f_#(s) within {f_#(s) | s = 1, 2, ..., N_S} after sorting in descending order, and N_S is the total number of video segments in the video. The saliency scores f_SAL(s) of the video segments can then be computed and sorted in descending order. According to the length required by the user, video segments are added to the selected video summary one by one in descending order of f_SAL(s).
Before the first iteration, the feature weights are initialized according to an equal-weight principle. The iteration process terminates after 15 iterations.
According to the technical solution of the embodiments of the present invention, low-level features such as color moments, wavelet textures, motion and local keypoints are extracted from video frames. Based on these low-level features, higher-level visual and semantic features are then computed, including a motion attention feature, a face attention feature that takes depth information into account, and a semantic indicative feature of the video segment. Finally, an iteratively reweighted linear model is used to fuse the three high-level features and generate a video summary of the length required by the user.
Fig. 5 is a schematic structural composition diagram of an electronic device according to a first embodiment of the present invention, and as shown in fig. 5, the electronic device includes:
an extracting unit 51, configured to extract a first feature set from the video frame, where the first feature set includes: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics;
the first processing unit 52 is configured to calculate a second feature set based on the first feature set, where the second feature set includes: the method comprises the following steps of (1) moving attention characteristics, human face attention characteristics based on depth information and semantic indication characteristics of a video segment;
and the second processing unit 53 is configured to perform fusion processing on each feature in the second feature set by using the linear model with iterative reweighting, so as to obtain the video summary.
Those skilled in the art will appreciate that the functions implemented by the units in the electronic device shown in fig. 5 can be understood by referring to the related description of the video processing method described above. The functions of the units in the electronic device shown in fig. 5 may be implemented by a program running on a processor, or may be implemented by specific logic circuits.
Fig. 6 is a schematic structural composition diagram of an electronic device according to a second embodiment of the present invention, and as shown in fig. 6, the electronic device includes:
an extracting unit 61, configured to extract a first feature set from a video frame, where the first feature set includes: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics;
the first processing unit 62 is configured to calculate a second feature set based on the first feature set, where the second feature set includes: the method comprises the following steps of (1) moving attention characteristics, human face attention characteristics based on depth information and semantic indication characteristics of a video segment;
and the second processing unit 63 is configured to perform fusion processing on each feature in the second feature set by using the linear model with iterative reweighting, so as to obtain a video summary.
The first processing unit 62 includes:
a moving attention feature subunit 621, configured to calculate a moving attention feature according to the moving features in the first feature set;
and the face attention feature subunit 622 is configured to obtain the area and the position of the face in each video frame through a face detection algorithm, and calculate, based on the depth image corresponding to the video frame and the pixel point set forming the face, the face attention feature based on the depth information.
The electronic device further includes:
and the training unit 64 is used for training a support vector machine based on the color moment features, the wavelet texture features and the local key point features.
The electronic device further includes:
a text extraction unit 65, configured to obtain text information related to video content from the audio signal of the video frame by using a voice recognition technology; or obtaining the text information related to the video content from the subtitles of the video frames.
The first processing unit 62 includes:
a semantic indicating feature subunit 623, configured to perform semantic concept detection on the color moment features, the wavelet texture features, and the local key point features by using the support vector machine, so as to obtain concept density; calculating to obtain the semantic similarity of characters based on the character information and the concept vocabulary information; and calculating to obtain the semantic indicating features based on the word semantic similarity and the concept density.
The second processing unit 63 includes:
the linear superposition subunit 631 is configured to linearly superpose each feature in the second feature set according to the feature weight value, so as to obtain a saliency score of the video segment;
the video summarization subunit 632 is configured to select video segments one by one as the video summary according to a preset summarization length, in order from high to low according to the saliency scores of the video segments.
Those skilled in the art will appreciate that the functions implemented by the units in the electronic device shown in fig. 6 can be understood by referring to the related description of the video processing method described above. The functions of the units in the electronic device shown in fig. 6 may be implemented by a program running on a processor, or may be implemented by specific logic circuits.
The technical schemes described in the embodiments of the present invention can be combined arbitrarily without conflict.
In the embodiments provided in the present invention, it should be understood that the disclosed method and intelligent device may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one second processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.

Claims (10)

1. A method of video processing, the method comprising:
extracting a first feature set from a video frame, the first feature set comprising: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics;
acquiring a depth image corresponding to the video frame, text information related to video content, and the area and position of a human face in the video frame;
calculating to obtain a motion attention feature in a second feature set based on the motion feature;
calculating to obtain the human face attention feature based on the depth information in the second feature set based on the area and the position of the human face and the depth image;
obtaining concept density based on the color moment features, the wavelet texture features and the local key point features;
obtaining the semantic similarity of characters based on the character information and the concept vocabulary information;
calculating to obtain semantic indicating characteristics of the video segments in the second characteristic set based on the concept density and the character semantic similarity;
and performing fusion processing on each feature in the second feature set by using the linear model of iterative reweighting so as to obtain the video abstract.
2. The video processing method of claim 1, wherein the obtaining of the area and the position of the face in the video frame comprises:
obtaining the area and the position of a face in each video frame through a face detection algorithm;
correspondingly, the calculating the face attention feature based on the depth information in the second feature set based on the area and the position of the face and the depth image includes:
and calculating to obtain the human face attention feature based on the depth information based on the depth image corresponding to the video frame and the pixel point set forming the human face.
3. The video processing method of claim 1, the method further comprising:
acquiring text information related to video content from the audio signal of the video frame by utilizing a voice recognition technology; or,
and acquiring text information related to video content from the subtitles of the video frames.
4. The video processing method according to claim 3, wherein said deriving a concept affinity based on the color moment features, the wavelet texture features, and the local keypoint features comprises:
training a support vector machine based on the color moment features, the wavelet texture features and the local key point features;
and the support vector machine detects semantic concepts of the color moment features, the wavelet texture features and the local key point features to obtain concept density.
5. The video processing method according to claim 1, wherein the fusion processing is performed on each feature in the second feature set by using an iterative reweighted linear model, so as to obtain a video summary; the method comprises the following steps:
linearly superposing each feature in the second feature set according to the feature weight value to obtain a significance score of the video segment;
according to the preset digest length, the video segments are selected as the video digests one by one according to the sequence of the significance scores of the video segments from high to low.
6. An electronic device, the electronic device comprising:
an extracting unit, configured to extract a first feature set from a video frame, where the first feature set includes: color moment characteristics, wavelet texture characteristics, motion characteristics and local key point characteristics;
the extraction unit is further used for acquiring a depth image corresponding to the video frame and the area and position of the face in the video frame;
the character extraction unit is used for acquiring character information related to video content in the video frame;
the motion attention feature subunit is used for calculating motion attention features in the second feature set based on the motion features;
a face attention feature subunit, configured to calculate, based on the area and the position of the face and the depth image, a face attention feature based on depth information in the second feature set;
the semantic indicating feature subunit is used for obtaining concept density based on the color moment features, the wavelet texture features and the local key point features; calculating to obtain the semantic similarity of characters based on the character information and the concept vocabulary information; calculating to obtain semantic indicating characteristics of the video segments in the second characteristic set based on the concept density and the character semantic similarity;
and the second processing unit is used for carrying out fusion processing on each feature in the second feature set by using the linear model of iterative reweighting so as to obtain the video abstract.
7. The electronic device of claim 6, wherein:
the extraction unit is also used for obtaining the area and the position of the face in each video frame through a face detection algorithm;
and the face attention feature subunit is also used for calculating the face attention feature based on the depth image corresponding to the video frame and the pixel point set forming the face to obtain the face attention feature based on the depth information.
8. The electronic device of claim 6, the text extraction unit further comprising:
the system comprises a video frame, a voice recognition module, a display module and a display module, wherein the video frame is used for displaying video content; or obtaining the text information related to the video content from the subtitles of the video frames.
9. The electronic device of claim 8, further comprising:
the training unit is used for training a support vector machine based on the color moment features, the wavelet texture features and the local key point features;
the semantic indication feature subunit is further configured to perform semantic concept detection on the color moment features, the wavelet texture features and the local key point features by using the support vector machine to obtain concept density.
10. The electronic device of claim 9, the second processing unit comprising:
the linear superposition subunit is used for carrying out linear superposition on each feature in the second feature set according to the feature weight value to obtain a significance score of the video segment;
and the video abstract subunit is used for selecting the video segments into the video abstract one by one according to the preset abstract length and the sequence from high to low of the significance scores of the video segments.
CN201510535580.9A 2015-08-27 2015-08-27 A kind of method for processing video frequency and electronic equipment Active CN105228033B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510535580.9A CN105228033B (en) 2015-08-27 2015-08-27 A kind of method for processing video frequency and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510535580.9A CN105228033B (en) 2015-08-27 2015-08-27 A kind of method for processing video frequency and electronic equipment

Publications (2)

Publication Number Publication Date
CN105228033A CN105228033A (en) 2016-01-06
CN105228033B true CN105228033B (en) 2018-11-09

Family

ID=54996666

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510535580.9A Active CN105228033B (en) 2015-08-27 2015-08-27 A kind of method for processing video frequency and electronic equipment

Country Status (1)

Country Link
CN (1) CN105228033B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9936239B2 (en) * 2016-06-28 2018-04-03 Intel Corporation Multiple stream tuning
CN106355171A (en) * 2016-11-24 2017-01-25 深圳凯达通光电科技有限公司 Video monitoring internetworking system
CN106934397B (en) 2017-03-13 2020-09-01 北京市商汤科技开发有限公司 Image processing method and device and electronic equipment
CN107222795B (en) * 2017-06-23 2020-07-31 南京理工大学 Multi-feature fusion video abstract generation method
CN107979764B (en) * 2017-12-06 2020-03-31 中国石油大学(华东) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
CN109413510B (en) * 2018-10-19 2021-05-18 深圳市商汤科技有限公司 Video abstract generation method and device, electronic equipment and computer storage medium
CN111327945B (en) 2018-12-14 2021-03-30 北京沃东天骏信息技术有限公司 Method and apparatus for segmenting video
CN109932617B (en) * 2019-04-11 2021-02-26 东南大学 Self-adaptive power grid fault diagnosis method based on deep learning
CN110347870A (en) * 2019-06-19 2019-10-18 西安理工大学 The video frequency abstract generation method of view-based access control model conspicuousness detection and hierarchical clustering method
CN110225368B (en) * 2019-06-27 2020-07-10 腾讯科技(深圳)有限公司 Video positioning method and device and electronic equipment
CN111984820B (en) * 2019-12-19 2023-10-27 重庆大学 A video summarization method based on dual self-attention capsule network
CN113158720B (en) * 2020-12-15 2024-06-18 嘉兴学院 Video abstraction method and device based on dual-mode feature and attention mechanism

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1685344A (en) * 2002-11-01 2005-10-19 三菱电机株式会社 Method for summarizing unknown content of video
WO2007099496A1 (en) * 2006-03-03 2007-09-07 Koninklijke Philips Electronics N.V. Method and device for automatic generation of summary of a plurality of images
CN101743596A (en) * 2007-06-15 2010-06-16 皇家飞利浦电子股份有限公司 Method and apparatus for automatically generating summaries of a multimedia file
CN102880866A (en) * 2012-09-29 2013-01-16 宁波大学 Method for extracting face features
KR20130061058A (en) * 2011-11-30 2013-06-10 고려대학교 산학협력단 Video summary method and system using visual features in the video
CN103200463A (en) * 2013-03-27 2013-07-10 天脉聚源(北京)传媒科技有限公司 Method and device for generating video summary
CN103210651A (en) * 2010-11-15 2013-07-17 华为技术有限公司 Method and system for video summarization
CN104508682A (en) * 2012-08-03 2015-04-08 柯达阿拉里斯股份有限公司 Identifying key frames using group sparsity analysis

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7409407B2 (en) * 2004-05-07 2008-08-05 Mitsubishi Electric Research Laboratories, Inc. Multimedia event detection and summarization
US8467610B2 (en) * 2010-10-20 2013-06-18 Eastman Kodak Company Video summarization using sparse basis function combination

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1685344A (en) * 2002-11-01 2005-10-19 三菱电机株式会社 Method for summarizing unknown content of video
WO2007099496A1 (en) * 2006-03-03 2007-09-07 Koninklijke Philips Electronics N.V. Method and device for automatic generation of summary of a plurality of images
CN101743596A (en) * 2007-06-15 2010-06-16 皇家飞利浦电子股份有限公司 Method and apparatus for automatically generating summaries of a multimedia file
CN103210651A (en) * 2010-11-15 2013-07-17 华为技术有限公司 Method and system for video summarization
KR20130061058A (en) * 2011-11-30 2013-06-10 고려대학교 산학협력단 Video summary method and system using visual features in the video
CN104508682A (en) * 2012-08-03 2015-04-08 柯达阿拉里斯股份有限公司 Identifying key frames using group sparsity analysis
CN102880866A (en) * 2012-09-29 2013-01-16 宁波大学 Method for extracting face features
CN103200463A (en) * 2013-03-27 2013-07-10 天脉聚源(北京)传媒科技有限公司 Method and device for generating video summary

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hierarchical 3D kernel descriptors for action recognition using depth sequences;Yu Kong et.al;《2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition》;20150508;全文 *
Multi-scale information maximization based visual attention modeling for video summarization;Naveed Ejaz et.al;《2012 6th International Conference on Next Generation Mobile Appllications, Service and Technologies》;20120914;全文 *

Also Published As

Publication number Publication date
CN105228033A (en) 2016-01-06

Similar Documents

Publication Publication Date Title
CN105228033B (en) A kind of method for processing video frequency and electronic equipment
WO2021088510A1 (en) Video classification method and apparatus, computer, and readable storage medium
US9176987B1 (en) Automatic face annotation method and system
Ejaz et al. Efficient visual attention based framework for extracting key frames from videos
WO2020177673A1 (en) Video sequence selection method, computer device and storage medium
CN103973969B (en) Electronic device and image selection method thereof
US20110243452A1 (en) Electronic apparatus, image processing method, and program
JP5947131B2 (en) Search input method and system by region selection method
EP4083817A1 (en) Video tag determination method, device, terminal, and storage medium
CN110889379B (en) Expression package generation method and device and terminal equipment
CN112818995B (en) Image classification method, device, electronic equipment and storage medium
Zhang et al. Retargeting semantically-rich photos
CN114511810A (en) Abnormal event detection method and device, computer equipment and storage medium
CN102375987A (en) Image processing device and image feature vector extracting and image matching method
CN105956051A (en) Information finding method, device and system
CN115909176A (en) Video semantic segmentation method and device, electronic equipment and storage medium
CN106164977A (en) Camera array analysis mechanisms
Miniakhmetova et al. An approach to personalized video summarization based on user preferences analysis
CN109145140A (en) One kind being based on the matched image search method of hand-drawn outline figure and system
CN112261321B (en) Subtitle processing method and device and electronic equipment
EP3117627A1 (en) Method and apparatus for video processing
KR20150101846A (en) Image classification service system based on a sketch user equipment, service equipment, service method based on sketch and computer readable medium having computer program recorded therefor
CN112115740B (en) Method and apparatus for processing image
Meng et al. Human action classification using SVM_2K classifier on motion features
He et al. A video summarization method based on key frames extracted by TMOF

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant