CN105335755B - Speaking detection method and system relating to media segments - Google Patents
Speaking detection method and system relating to media segments
- Publication number
- CN105335755B (application CN201510719532.5A)
- Authority
- CN
- China
- Prior art keywords
- audio
- state
- result
- follows
- signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/758—Involving statistics of pixels or of feature values, e.g. histogram matching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Abstract
The present invention provides a speaking detection method and system relating to media segments. The input media signal is divided into an audio signal and a video signal, which are processed separately. For the audio signal, the per-second conditional probability is computed with a hidden Markov model according to the harmonic likelihood ratio, and the result is clustered. For the video signal of the input media file, the face region is extracted in every frame, the lip portion is extracted, and the image energy of the lip region is computed; according to the image energy, the per-second conditional probability is computed with a hidden Markov model, and the result is clustered into two classes. The cluster results obtained from the audio signal and the video signal are matched to obtain the final speaking detection result. The advantage of the invention is that speaking detection uses both audio and video information, which improves the detection rate.
Description
Technical field
The present invention relates to the technical field of speaking detection, and in particular to a speaking detection method and system relating to media segments.
Background technology
With the development of information technology, technologies such as human-computer interaction, teleconferencing and voiceprint recognition have become hot research topics, and speaking detection, as an important part of them, has received more and more attention. Speaking detection is a technique for determining whether a person in a media segment is speaking. Traditional speaking activity detection approaches, based purely on audio information or on video information, have poor robustness. To solve this problem, multi-modal speaking detection based on audio-visual information was introduced. However, the prior art usually relies on a supervised-learning classifier whose generalization ability is not strong, which causes the detection rate to decline.
Summary of the invention
Considering that different media files in different environments have different characteristics, the present invention proposes a speaking detection method and system based on matching audio and video information. Unlike traditional supervised methods, it exploits the fact that speaking activity follows the same temporal distribution in the audio and the video information, and performs speaking detection by matching the two.
To achieve the above object, the technical solution provided by the invention is a speaking detection method relating to media segments, comprising the following steps:

Step 1: divide the input media signal S(t) into an audio signal S1(t) and a video signal S2(t), and process them separately.

For the audio signal S1(t), the processing is as follows:
for the audio signal of the input media file, compute the harmonic frequency vector within a discrete Fourier window;
compute the harmonic likelihood ratio log Λ(t) of each frame as the audio feature, where t is the frame index;
according to log Λ(t), compute the per-second conditional probability P(Ot | λ) with a hidden Markov model and cluster it, obtaining two classes; in the hidden Markov model, the observable state Ot is the normalized harmonic likelihood ratio log Λ(t), and the hidden state qt indicates speaking or silence.

For the video signal S2(t), the processing is as follows:
for the video signal of the input media file, extract the face region in every frame;
extract the lip portion within the extracted face region;
extract the feature of the lip region in every frame, the feature being the image energy E[n];
according to the image energy, compute the per-second conditional probability P(Ot | λ) with a hidden Markov model and cluster it, obtaining two classes; in the hidden Markov model, the observable state Ot is the normalized image energy E[n], and the hidden state qt indicates speaking or silence.

Step 2: match the cluster results obtained from the audio signal and the video signal with the edit distance algorithm used in DNA sequence testing, to obtain the final speaking detection result. The edit distance used for matching is computed as follows.

Define LX,Y(i, j) as the edit distance between the length-i prefix of the first sequence X = x1x2…xm and the length-j prefix of the second sequence Y = y1y2…yn. Let 0 ≤ i ≤ m, 0 ≤ j ≤ n, and let Del, Ins, Sub be the costs of deletion, insertion and substitution respectively. The computation is:

if min(i, j) = 0, LX,Y(i, j) = max(i, j);

otherwise

LX,Y(i, j) = min{ LX,Y(i-1, j) + Del, LX,Y(i, j-1) + Ins, LX,Y(i-1, j-1) + Sub·1(xi ≠ yj) },

where xi denotes the audio cluster result, yj denotes the video cluster result, and 1(·) is 1 when its argument holds and 0 otherwise.
Moreover, the image energy E[n] is computed as

E[n] = Σ(i=1..M) Σ(j=1..N) vy,t(i, j)²,

where vy,t(i, j) denotes the velocity in the Y direction of pixel (i, j) in an image of size M × N.
The present invention correspondingly provides a speaking detection system relating to media segments, comprising the following modules:

an audio-video clustering module, which divides the input media signal S(t) into an audio signal S1(t) and a video signal S2(t) and processes them separately;

for the audio signal S1(t), the processing is as follows:
for the audio signal of the input media file, compute the harmonic frequency vector within a discrete Fourier window;
compute the harmonic likelihood ratio log Λ(t) of each frame as the audio feature, where t is the frame index;
according to log Λ(t), compute the per-second conditional probability P(Ot | λ) with a hidden Markov model and cluster it, obtaining two classes; in the hidden Markov model, the observable state Ot is the normalized harmonic likelihood ratio log Λ(t), and the hidden state qt indicates speaking or silence;

for the video signal S2(t), the processing is as follows:
for the video signal of the input media file, extract the face region in every frame;
extract the lip portion within the extracted face region;
extract the feature of the lip region in every frame, the feature being the image energy E[n];
according to the image energy, compute the per-second conditional probability P(Ot | λ) with a hidden Markov model and cluster it, obtaining two classes; in the hidden Markov model, the observable state Ot is the normalized image energy E[n], and the hidden state qt indicates speaking or silence;

a matching module, which matches the cluster results obtained from the audio signal and the video signal with the edit distance algorithm used in DNA sequence testing, to obtain the final speaking detection result; the edit distance used for matching is computed as follows:

define LX,Y(i, j) as the edit distance between the length-i prefix of the first sequence X = x1x2…xm and the length-j prefix of the second sequence Y = y1y2…yn; let 0 ≤ i ≤ m, 0 ≤ j ≤ n, and let Del, Ins, Sub be the costs of deletion, insertion and substitution respectively; the computation is:

if min(i, j) = 0, LX,Y(i, j) = max(i, j);

otherwise

LX,Y(i, j) = min{ LX,Y(i-1, j) + Del, LX,Y(i, j-1) + Ins, LX,Y(i-1, j-1) + Sub·1(xi ≠ yj) },

where xi denotes the audio cluster result and yj denotes the video cluster result.

Moreover, the image energy E[n] is computed as

E[n] = Σ(i=1..M) Σ(j=1..N) vy,t(i, j)²,

where vy,t(i, j) denotes the velocity in the Y direction of pixel (i, j) in an image of size M × N.
The present invention performs speaking detection from the angle of matching audio and video information, eliminating the complex training process of conventional methods while improving the correct detection rate.
Description of the drawings
Fig. 1 is the method flow diagram of the embodiment of the present invention.
Fig. 2 is the structure diagram of the embodiment of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described in detail below with reference to the embodiments and the drawings.
As shown in Fig. 1, the method provided by the embodiment of the present invention comprises the following specific steps:
Step 1: divide the input media signal S(t) into an audio signal S1(t) and a video signal S2(t), and process them separately.

For the audio signal S1(t), the processing is as follows:

(1) For the audio signal of the input media file, compute the harmonic frequency vector within a discrete Fourier window. In the embodiment, the dominant harmonic frequencies are obtained within multiple DFT (discrete Fourier transform) windows.

(2) Compute the likelihood ratio log Λ(t) that each frame contains harmonic components, as the audio feature; t is the frame index. In specific implementation, (1) and (2) can be realized with the prior art; see e.g. L. N. Tan, B. J. Borgstrom, A. Alwan. Voice activity detection using harmonic frequency components in likelihood ratio test. Acoustics, Speech and Signal Processing (ICASSP), 2010: 4466-4469.
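As a rough illustration of the kind of per-frame harmonic statistic meant here, the following sketch measures how much spectral energy sits at multiples of the strongest low-frequency peak. It is a crude stand-in for the likelihood-ratio test of the cited reference, not a reproduction of it; the frequency band, the number of harmonics and the function name are illustrative assumptions.

```python
import numpy as np

def harmonic_feature(frame, sr, n_harmonics=5):
    """frame: 1-D float array of audio samples; sr: sample rate in Hz."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    # Candidate fundamental: strongest bin in the 80-400 Hz speech range.
    band = (freqs >= 80) & (freqs <= 400)
    f0 = freqs[band][np.argmax(spec[band])]
    bin_width = freqs[1] - freqs[0]
    # Energy at the first few multiples of the candidate fundamental.
    harm = sum(spec[int(round(k * f0 / bin_width))]
               for k in range(1, n_harmonics + 1)
               if k * f0 < freqs[-1])
    total = spec.sum() + 1e-9
    # Log ratio of harmonic to non-harmonic energy, a proxy for log Λ(t).
    return float(np.log(harm / (total - harm) + 1e-9))
```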
(3) Cluster according to the harmonic likelihood ratio to obtain two classes. In the embodiment, according to the harmonic likelihood ratios log Λ(t) of all frames within one second, the per-second conditional probability P(Ot | λ) is computed with an HMM (hidden Markov model). The observable state Ot is the harmonic likelihood ratio log Λ(t) obtained in step (2), normalized into [1, 10], i.e. Ot ∈ {1, 2, …, 10}; the hidden state qt indicates speaking or silence, and the number of hidden states is Nq = 2. The model is trained with the Baum-Welch algorithm to obtain the parameters λ = (A, B, π), where A is the transition matrix of the hidden states, B is the probability of the observable state given the hidden state at a given time, i.e. the confusion matrix, and π is the initial state distribution. A sliding window of length T is designed (T is the number of frames per second of the video), and its corresponding P(Ot | λ) is computed with the forward-backward algorithm as the clustering feature. In specific implementation, see L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2): 257-286, 1989.

The embodiment clusters with the K-means algorithm, obtaining two classes denoted 0 and 1 respectively.
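A minimal sketch of this quantize-train-score-cluster pipeline is given below; it serves the audio stream here and, unchanged, the image-energy stream of step (4) further on. It assumes hmmlearn (CategoricalHMM in recent versions) and scikit-learn; the function name, the 0-based observation symbols and all hyperparameters are illustrative choices, not prescribed by the patent.

```python
import numpy as np
from hmmlearn.hmm import CategoricalHMM
from sklearn.cluster import KMeans

def cluster_feature_stream(feature, fps):
    """feature: per-frame values (log Λ(t) or E[n]); fps: frames per second T.
    Returns one 0/1 label per second of input."""
    # Quantize the feature into 10 observable states, mirroring the patent's
    # normalization of log Λ(t) / E[n] into {1, ..., 10} (0-based here).
    lo, hi = feature.min(), feature.max()
    obs = np.clip(((feature - lo) / (hi - lo + 1e-9) * 10).astype(int), 0, 9)
    obs = obs.reshape(-1, 1)

    # Two hidden states qt (speaking / silent); Baum-Welch fitting yields
    # λ = (A, B, π): transition matrix, emission (confusion) matrix, and
    # initial state distribution.
    hmm = CategoricalHMM(n_components=2, n_iter=50, random_state=0)
    hmm.fit(obs)

    # Slide a window of T frames (one second) and take the forward-algorithm
    # log likelihood of each window as the clustering feature P(Ot | λ).
    scores = np.array([hmm.score(obs[s:s + fps])
                       for s in range(0, len(obs) - fps + 1, fps)])

    # K-means into two classes, labelled 0 and 1 as in the embodiment.
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    return km.fit_predict(scores.reshape(-1, 1))
```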
For the video signal S2(t), the processing is as follows:

(1) Extract the face region in every frame of the video signal of the input media file. The embodiment uses a cascade of Haar features to extract the face area in every frame of the video. In specific implementation, the Haar-feature cascade is prior art; see e.g. P. Viola, M. Jones. Robust real-time face detection. International Journal of Computer Vision (IJCV), 2004: 137-154.

(2) Extract the lip portion within the extracted face region. In specific implementation, the extraction can be realized with the prior art; see e.g. Jie Zhang, Shiguang Shan, Meina Kan, et al. Coarse-to-Fine Auto-Encoder Networks (CFAN) for Real-Time Face Alignment. European Conference on Computer Vision (ECCV), 2014: 1-16.

The embodiment obtains 68 facial landmark points in the face area (the 68 points mark the eyes, nose, mouth and face contour), and extracts a rectangular box containing the lip region from the coordinates of the lip landmarks.
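One possible realisation of these two steps, sketched below, combines OpenCV's bundled Haar cascade with dlib's 68-landmark shape predictor as stand-ins for the cited detectors (in the usual 68-point convention the mouth occupies landmarks 48-67); the model file path and the crop logic are illustrative assumptions.

```python
import cv2
import dlib

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
# Path to a pretrained 68-landmark model (illustrative; obtained separately).
landmarks = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_region(frame):
    """Returns the rectangular lip crop of the first detected face, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1,
                                           minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    shape = landmarks(gray, dlib.rectangle(x, y, x + w, y + h))
    # Mouth landmarks 48-67 bound the lip rectangle.
    xs = [shape.part(i).x for i in range(48, 68)]
    ys = [shape.part(i).y for i in range(48, 68)]
    return frame[min(ys):max(ys), min(xs):max(xs)]
```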
(3) Extract the feature of the lip region in every frame: the image energy. The embodiment computes the optical flow vy,t(i, j) of the lip region in the Y direction between two consecutive frames, and sums its square to obtain the image energy E[n], i.e.

E[n] = Σ(i=1..M) Σ(j=1..N) vy,t(i, j)²,

where vy,t(i, j) denotes the velocity in the Y direction (the vertical direction) of pixel (i, j) in an image of size M × N. In specific implementation, the computation can follow Tiawongsombat P, Jeong M H, Yun J S, et al. Robust visual speakingness detection using bi-level HMM. Pattern Recognition, 2012, 45(2): 783-793.
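The sketch below computes this feature with Farneback dense optical flow, used here as a generic stand-in for the flow method of the cited reference; the flow parameters are common OpenCV defaults, not values from the patent.

```python
import cv2
import numpy as np

def image_energy(prev_lip, cur_lip):
    """Image energy E[n] between two equally sized consecutive lip crops."""
    g0 = cv2.cvtColor(prev_lip, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(cur_lip, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    v_y = flow[..., 1]               # vy,t(i, j): vertical velocity field
    return float(np.sum(v_y ** 2))   # E[n] = Σi Σj vy,t(i, j)²
```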
(4) Cluster according to the image energy to obtain two classes. According to the image energy of all frames within one second, the per-second conditional probability can be computed as the clustering feature. In the embodiment, according to the image energies E[n] of all frames within one second, the per-second conditional probability P(Ot | λ) is computed with the HMM, where the observable state Ot is the image energy E[n] obtained in step (3), normalized into [1, 10], i.e. Ot ∈ {1, 2, …, 10}; the hidden state qt indicates speaking or silence, and Nq = 2. The model is trained with the Baum-Welch algorithm to obtain λ = (A, B, π), where A is the transition matrix of the hidden states, B is the probability of the observable state given the hidden state at a given time, i.e. the confusion matrix, and π is the initial state distribution. A sliding window of length T is designed (T is the number of frames per second of the video), and its corresponding P(Ot | λ) is computed with the forward-backward algorithm as the clustering feature. K-means clustering then yields two classes, denoted 0 and 1 respectively; the pipeline sketched after the audio clustering step applies unchanged here.
Step 2: match the two cluster results obtained from the audio signal and the video signal in step 1 to obtain the final speaking detection result. In the embodiment, because the probability of speaking activity occurring in the audio and the probability of speaking activity in the video follow the same distribution on the time axis, the edit distance algorithm used in biological DNA sequence testing is used to compute the distance between the audio and video information and perform speaking activity matching.

Define LX,Y(i, j) as the edit distance between the length-i prefix of the first sequence X = x1x2…xm and the length-j prefix of the second sequence Y = y1y2…yn, where xi denotes the audio cluster result and yj the video cluster result. Because the audio cluster result and the video cluster result both have two classes, each can be expressed as a string over 0 and 1.

The matched result gives the final speaking detection. Assume 0 ≤ i ≤ m, 0 ≤ j ≤ n, and let Del, Ins, Sub be the costs of deletion, insertion and substitution. The computation is: if min(i, j) = 0, LX,Y(i, j) = max(i, j); otherwise

LX,Y(i, j) = min{ LX,Y(i-1, j) + Del, LX,Y(i, j-1) + Ins, LX,Y(i-1, j-1) + Sub·1(xi ≠ yj) }.

From the computed edit distance, the result of matching the audio cluster result against the video cluster result, i.e. the detection result of whether someone is speaking, is obtained. In general, the smaller the edit distance, the greater the similarity of the two strings. Using the DNA sequence testing method from biology supports the case where the video length and the audio length differ by a small error: an optimal alignment is found that maximizes the number of matches and minimizes the number of gaps and mismatches; when the two sequences have the same length, gaps generally need not be considered. For elements that match in the two strings, the person in the corresponding media segment is judged to be speaking; for elements that do not match, the detection result is that the person in the corresponding media segment is not speaking. The detection result of whether someone is speaking can further improve communication efficiency and can be applied to human-computer interaction, teleconferencing, voiceprint recognition and so on; for example, when no speaking is detected, the quality of the media signal can be reduced, e.g. by lowering the resolution of the video picture.

In specific implementation, see Vladimir Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 1966, vol. 10, p. 707.
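The recurrence above transcribes directly into a dynamic program; the sketch below uses unit Del/Ins/Sub costs, which the patent leaves as parameters.

```python
def edit_distance(x, y, del_cost=1, ins_cost=1, sub_cost=1):
    """x, y: audio / video cluster label strings over {'0', '1'}."""
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if min(i, j) == 0:           # border case: L(i, j) = max(i, j)
                L[i][j] = max(i, j)
            else:
                L[i][j] = min(
                    L[i - 1][j] + del_cost,                 # deletion
                    L[i][j - 1] + ins_cost,                 # insertion
                    L[i - 1][j - 1]                         # substitution
                    + (sub_cost if x[i - 1] != y[j - 1] else 0))
    return L[m][n]

# For example, edit_distance("0110", "0100") == 1: a single substitution
# separates the audio and video label strings, so the two streams agree on
# almost every second.
```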
The technical solution of the present invention can be run automatically as computer software, or a corresponding system can be provided in a modular fashion. The embodiment provides a speaking detection system relating to media segments, comprising the following modules:

an audio-video clustering module, which divides the input media signal S(t) into an audio signal S1(t) and a video signal S2(t) and processes them separately;

for the audio signal S1(t), the processing is as follows:
for the audio signal of the input media file, compute the harmonic frequency vector within a discrete Fourier window;
compute the harmonic likelihood ratio log Λ(t) of each frame as the audio feature, where t is the frame index;
according to log Λ(t), compute the per-second conditional probability P(Ot | λ) with a hidden Markov model and cluster it, obtaining two classes; in the hidden Markov model, the observable state Ot is the normalized harmonic likelihood ratio log Λ(t), and the hidden state qt indicates speaking or silence;

for the video signal S2(t), the processing is as follows:
for the video signal of the input media file, extract the face region in every frame;
extract the lip portion within the extracted face region;
extract the feature of the lip region in every frame, the feature being the image energy E[n];
according to the image energy, compute the per-second conditional probability P(Ot | λ) with a hidden Markov model and cluster it, obtaining two classes; in the hidden Markov model, the observable state Ot is the normalized image energy E[n], and the hidden state qt indicates speaking or silence;

a matching module, which matches the cluster results obtained from the audio signal and the video signal with the edit distance algorithm used in DNA sequence testing, to obtain the final speaking detection result; the edit distance used for matching is computed as follows:

define LX,Y(i, j) as the edit distance between the length-i prefix of the first sequence X = x1x2…xm and the length-j prefix of the second sequence Y = y1y2…yn; let 0 ≤ i ≤ m, 0 ≤ j ≤ n, and let Del, Ins, Sub be the costs of deletion, insertion and substitution respectively; the computation is:

if min(i, j) = 0, LX,Y(i, j) = max(i, j);

otherwise

LX,Y(i, j) = min{ LX,Y(i-1, j) + Del, LX,Y(i, j-1) + Ins, LX,Y(i-1, j-1) + Sub·1(xi ≠ yj) },

where xi denotes the audio cluster result and yj denotes the video cluster result.
Referring to Fig. 2, those skilled in the art can design the system in finer detail. For example, in a speaking detection system based on matching audio and video information, the audio-video clustering module comprises an audio processing part and a video processing part: the audio processing part further consists of an audio preprocessing module, an audio feature extraction module and a first clustering module; the video processing part further consists of a face detection module, a lip extraction module, a video feature extraction module and a second clustering module.
The audio preprocessing module, denoted module 1, computes, for the audio signal of the input media file, the harmonic frequency vector within the discrete Fourier window, and feeds the result to the audio feature extraction module.

The audio feature extraction module, denoted module 2, computes the per-frame harmonic likelihood ratio log Λ(t) as the audio feature and feeds it to the first clustering module.

The first clustering module, denoted module 3, computes, according to log Λ(t), the per-second conditional probability P(Ot | λ) with the hidden Markov model, clusters it to obtain two classes, and feeds the result to the matching module.

The face detection module, denoted module 4, extracts the face region in every frame of the video signal of the input media file and feeds it to the lip extraction module.

The lip extraction module, denoted module 5, extracts the lip portion within the extracted face region and feeds the result to the video feature extraction module.

The video feature extraction module, denoted module 6, extracts the feature of the lip region in every frame and feeds the result to the second clustering module.

The second clustering module, denoted module 7, computes, according to the image energy, the per-second conditional probability P(Ot | λ) with the hidden Markov model, clusters it to obtain two classes, and feeds the result to the matching module.

The matching module, denoted module 8, matches the two cluster results obtained by the first and second clustering modules to obtain the final speaking detection result.

The specific implementation of each module corresponds to the steps described above and is not detailed further here; an end-to-end sketch chaining the illustrative helpers follows below.
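For orientation only, the following sketch chains the illustrative helpers from the earlier sketches (cluster_feature_stream, lip_region, image_energy, edit_distance) into the module pipeline of Fig. 2; the fixed lip-crop size and all signatures are assumptions, not the patent's prescribed interface.

```python
import cv2
import numpy as np

def detect_speaking(log_lambda, frames, a_fps, v_fps):
    """log_lambda: per-frame log Λ(t) values; frames: BGR video frames;
    a_fps / v_fps: audio and video frame rates (frames per second)."""
    # Modules 1-3: audio branch -> one 0/1 label per second.
    audio_labels = cluster_feature_stream(np.asarray(log_lambda), a_fps)

    # Modules 4-6: face -> lip crop -> image energy. Crops are resized to a
    # fixed size so consecutive frames are comparable for optical flow.
    lips = [lip_region(f) for f in frames]
    lips = [cv2.resize(l, (64, 32)) for l in lips if l is not None]
    energies = [image_energy(a, b) for a, b in zip(lips, lips[1:])]

    # Module 7: per-second labels for the video branch.
    video_labels = cluster_feature_stream(np.asarray(energies), v_fps)

    # Module 8: edit-distance matching of the two label strings; a small
    # distance means the streams agree on when someone is speaking.
    x = "".join(map(str, audio_labels))
    y = "".join(map(str, video_labels))
    return edit_distance(x, y)
```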
The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art can make various modifications or additions to the described embodiments or substitute them in a similar way, without departing from the spirit of the invention or exceeding the scope of the appended claims.
Claims (4)
1. A speaking detection method relating to media segments, characterized by comprising the following steps:

Step 1: dividing the input media signal S(t) into an audio signal S1(t) and a video signal S2(t), and processing them separately;

for the audio signal S1(t), the processing is as follows:
for the audio signal of the input media file, computing the harmonic frequency vector within a discrete Fourier window;
computing the harmonic likelihood ratio log Λ(t) of each frame as the audio feature, t being the frame index;
according to log Λ(t), computing the per-second conditional probability P(Ot | λ) with a hidden Markov model and clustering to obtain two classes; in the hidden Markov model, the observable state Ot is the normalized harmonic likelihood ratio log Λ(t), the hidden state qt indicates speaking or silence, and the model parameters are λ = (A, B, π), where A is the transition matrix of the hidden states, B is the probability of the observable state given the hidden state at a given time, i.e. the confusion matrix, and π is the initial state distribution;

for the video signal S2(t), the processing is as follows:
for the video signal of the input media file, extracting the face region in every frame;
extracting the lip portion within the extracted face region;
extracting the feature of the lip region in every frame, the feature being the image energy E[n];
according to the image energy, computing the per-second conditional probability P(Ot | λ) with a hidden Markov model and clustering to obtain two classes; in the hidden Markov model, the observable state Ot is the normalized image energy E[n], and the hidden state qt indicates speaking or silence;

Step 2: matching the cluster results obtained from the audio signal and the video signal with the edit distance algorithm used in DNA sequence testing, to obtain the final speaking detection result; the edit distance used for matching is computed as follows:

define LX,Y(i, j) as the edit distance between the length-i prefix of the first sequence X = x1x2…xm and the length-j prefix of the second sequence Y = y1y2…yn; let 0 ≤ i ≤ m, 0 ≤ j ≤ n, and let Del, Ins, Sub be the costs of deletion, insertion and substitution respectively; the computation is:

if min(i, j) = 0, LX,Y(i, j) = max(i, j);

otherwise

LX,Y(i, j) = min{ LX,Y(i-1, j) + Del, LX,Y(i, j-1) + Ins, LX,Y(i-1, j-1) + Sub·1(xi ≠ yj) },

where xi denotes the audio cluster result and yj denotes the video cluster result.

2. The speaking detection method relating to media segments according to claim 1, characterized in that the image energy E[n] is computed as

E[n] = Σ(i=1..M) Σ(j=1..N) vy,t(i, j)²,

where vy,t(i, j) denotes the velocity in the Y direction of pixel (i, j) in an image of size M × N.
3. A speaking detection system relating to media segments, characterized by comprising the following modules:

an audio-video clustering module, which divides the input media signal S(t) into an audio signal S1(t) and a video signal S2(t) and processes them separately;

for the audio signal S1(t), the processing is as follows:
for the audio signal of the input media file, computing the harmonic frequency vector within a discrete Fourier window;
computing the harmonic likelihood ratio log Λ(t) of each frame as the audio feature, t being the frame index;
according to log Λ(t), computing the per-second conditional probability P(Ot | λ) with a hidden Markov model and clustering to obtain two classes; in the hidden Markov model, the observable state Ot is the normalized harmonic likelihood ratio log Λ(t), the hidden state qt indicates speaking or silence, and the model parameters are λ = (A, B, π), where A is the transition matrix of the hidden states, B is the probability of the observable state given the hidden state at a given time, i.e. the confusion matrix, and π is the initial state distribution;

for the video signal S2(t), the processing is as follows:
for the video signal of the input media file, extracting the face region in every frame;
extracting the lip portion within the extracted face region;
extracting the feature of the lip region in every frame, the feature being the image energy E[n];
according to the image energy, computing the per-second conditional probability P(Ot | λ) with a hidden Markov model and clustering to obtain two classes; in the hidden Markov model, the observable state Ot is the normalized image energy E[n], and the hidden state qt indicates speaking or silence;

a matching module, which matches the cluster results obtained from the audio signal and the video signal with the edit distance algorithm used in DNA sequence testing, to obtain the final speaking detection result; the edit distance used for matching is computed as follows:

define LX,Y(i, j) as the edit distance between the length-i prefix of the first sequence X = x1x2…xm and the length-j prefix of the second sequence Y = y1y2…yn; let 0 ≤ i ≤ m, 0 ≤ j ≤ n, and let Del, Ins, Sub be the costs of deletion, insertion and substitution respectively; the computation is:

if min(i, j) = 0, LX,Y(i, j) = max(i, j);

otherwise

LX,Y(i, j) = min{ LX,Y(i-1, j) + Del, LX,Y(i, j-1) + Ins, LX,Y(i-1, j-1) + Sub·1(xi ≠ yj) },

where xi denotes the audio cluster result and yj denotes the video cluster result.

4. The speaking detection system relating to media segments according to claim 3, characterized in that the image energy E[n] is computed as

E[n] = Σ(i=1..M) Σ(j=1..N) vy,t(i, j)²,

where vy,t(i, j) denotes the velocity in the Y direction of pixel (i, j) in an image of size M × N.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510719532.5A CN105335755B (en) | 2015-10-29 | 2015-10-29 | Speaking detection method and system relating to media segments
Publications (2)
Publication Number | Publication Date |
---|---|
CN105335755A CN105335755A (en) | 2016-02-17 |
CN105335755B (en) | 2018-08-21
Family
ID=55286270
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510719532.5A Active CN105335755B (en) | Speaking detection method and system relating to media segments | 2015-10-29 | 2015-10-29
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105335755B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108831462A (en) * | 2018-06-26 | 2018-11-16 | 北京奇虎科技有限公司 | Vehicle-mounted voice recognition methods and device |
CN109558788B (en) * | 2018-10-08 | 2023-10-27 | 清华大学 | Silence voice input identification method, computing device and computer readable medium |
CN110309799B (en) * | 2019-07-05 | 2022-02-08 | 四川长虹电器股份有限公司 | Camera-based speaking judgment method |
CN110706709B (en) * | 2019-08-30 | 2021-11-19 | 广东工业大学 | Multi-channel convolution aliasing voice channel estimation method combined with video signal |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7684982B2 (en) * | 2003-01-24 | 2010-03-23 | Sony Ericsson Communications Ab | Noise reduction and audio-visual speech activity detection |
CN103198833A (en) * | 2013-03-08 | 2013-07-10 | 北京理工大学 | High-precision method of confirming speaker |
CN103856689A (en) * | 2013-10-31 | 2014-06-11 | 北京中科模识科技有限公司 | Character dialogue subtitle extraction method oriented to news video |
Non-Patent Citations (2)
Title |
---|
Kinnunen T. et al. Self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2013: 7229-7233. * |
Wang Jin et al. Speaking detection based on visual saliency. Journal of Wuhan University (Natural Science Edition), 2015, vol. 61, no. 4: 363-367. * |
Legal Events

Date | Code | Title | Description
---|---|---|---
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| TR01 | Transfer of patent right | |

Effective date of registration: 2021-07-07
Address after: Unit 01, 5/F, Building A, 388 Xinping Street, Suzhou Industrial Park, Suzhou City, Jiangsu Province, 215000
Patentee after: BOOSLINK SUZHOU INFORMATION TECHNOLOGY Co., Ltd.
Address before: Wuhan University, Luojiashan, Wuchang District, Wuhan City, Hubei Province, 430072
Patentee before: WUHAN UNIVERSITY