CN105335755B - Speaking detection method and system relating to media segments - Google Patents
Speaking detection method and system relating to media segments
- Publication number
- CN105335755B (application CN201510719532.5A)
- Authority
- CN
- China
- Prior art keywords
- audio
- state
- result
- follows
- signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/758—Involving statistics of pixels or of feature values, e.g. histogram matching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Abstract
The present invention provides a speaking detection method and system relating to media segments. The input media signal is divided into an audio signal and a video signal, which are processed separately. For the audio signal, the per-second conditional probability is computed with a hidden Markov model according to the harmonic likelihood ratio, and the result is clustered. For the video signal of the input media file, the face region is extracted in every frame, the lip portion is extracted, and the image energy of the lip region is computed; according to the image energy, the per-second conditional probability is computed with a hidden Markov model, and the result is clustered into two classes. The cluster results obtained from the audio signal and the video signal are matched to obtain the final speaking detection result. The advantage of the invention is that speaking detection uses both audio and video information, which improves the detection rate.
Description
Technical field
The present invention relates to the technical field of speaking detection, and in particular to a speaking detection method and system relating to media segments.
Background technology
With the development of information technology, technologies such as human-computer interaction, teleconferencing and voiceprint recognition have become hot research topics, and speaking detection, as an important part of them, has received more and more attention. Speaking detection is a technique for determining whether a person in a media segment is speaking. Traditional speaking activity detection approaches, based purely on audio information or on video information, have poor robustness. To solve this problem, multi-modal speaking detection based on audio-visual information was introduced. However, the prior art usually relies on a supervised-learning classifier whose generalization ability is not strong, which causes the detection rate to decline.
Summary of the invention
Considering that different media files in different environments have different characteristics, the present invention proposes a speaking detection method and system based on matching audio and video information. Unlike traditional supervised methods, it exploits the fact that speaking activity follows the same temporal distribution in the audio and the video information, and performs speaking detection by matching the two.
To achieve the above object, the technical solution provided by the invention is a speaking detection method relating to media segments, comprising the following steps:

Step 1: divide the input media signal S(t) into an audio signal S1(t) and a video signal S2(t), and process them separately.

For the audio signal S1(t), the processing is as follows:
for the audio signal of the input media file, compute the harmonic frequency vector within a discrete Fourier window;
compute the harmonic likelihood ratio log Λ(t) of each frame as the audio feature, where t is the frame index;
according to log Λ(t), compute the per-second conditional probability P(Ot | λ) with a hidden Markov model and cluster it, obtaining two classes; in the hidden Markov model, the observable state Ot is the normalized harmonic likelihood ratio log Λ(t), and the hidden state qt indicates speaking or silence.

For the video signal S2(t), the processing is as follows:
for the video signal of the input media file, extract the face region in every frame;
extract the lip portion within the extracted face region;
extract the feature of the lip region in every frame, the feature being the image energy E[n];
according to the image energy, compute the per-second conditional probability P(Ot | λ) with a hidden Markov model and cluster it, obtaining two classes; in the hidden Markov model, the observable state Ot is the normalized image energy E[n], and the hidden state qt indicates speaking or silence.

Step 2: match the cluster results obtained from the audio signal and the video signal with the edit distance algorithm used in DNA sequence testing, to obtain the final speaking detection result. The edit distance used for matching is computed as follows.

Define LX,Y(i, j) as the edit distance between the length-i prefix of the first sequence X = x1x2…xm and the length-j prefix of the second sequence Y = y1y2…yn. Let 0 ≤ i ≤ m, 0 ≤ j ≤ n, and let Del, Ins, Sub be the costs of deletion, insertion and substitution respectively. The computation is:

if min(i, j) = 0, LX,Y(i, j) = max(i, j);

otherwise

LX,Y(i, j) = min{ LX,Y(i-1, j) + Del, LX,Y(i, j-1) + Ins, LX,Y(i-1, j-1) + Sub·1(xi ≠ yj) },

where xi denotes the audio cluster result, yj denotes the video cluster result, and 1(·) is 1 when its argument holds and 0 otherwise.
Moreover, the image energy E[n] is computed as

E[n] = Σ(i=1..M) Σ(j=1..N) vy,t(i, j)²,

where vy,t(i, j) denotes the velocity in the Y direction of pixel (i, j) in an image of size M × N.
The present invention correspondingly provides a speaking detection system relating to media segments, comprising the following modules:

an audio-video clustering module, which divides the input media signal S(t) into an audio signal S1(t) and a video signal S2(t) and processes them separately;

for the audio signal S1(t), the processing is as follows:
for the audio signal of the input media file, compute the harmonic frequency vector within a discrete Fourier window;
compute the harmonic likelihood ratio log Λ(t) of each frame as the audio feature, where t is the frame index;
according to log Λ(t), compute the per-second conditional probability P(Ot | λ) with a hidden Markov model and cluster it, obtaining two classes; in the hidden Markov model, the observable state Ot is the normalized harmonic likelihood ratio log Λ(t), and the hidden state qt indicates speaking or silence;

for the video signal S2(t), the processing is as follows:
for the video signal of the input media file, extract the face region in every frame;
extract the lip portion within the extracted face region;
extract the feature of the lip region in every frame, the feature being the image energy E[n];
according to the image energy, compute the per-second conditional probability P(Ot | λ) with a hidden Markov model and cluster it, obtaining two classes; in the hidden Markov model, the observable state Ot is the normalized image energy E[n], and the hidden state qt indicates speaking or silence;

a matching module, which matches the cluster results obtained from the audio signal and the video signal with the edit distance algorithm used in DNA sequence testing, to obtain the final speaking detection result; the edit distance used for matching is computed as follows:

define LX,Y(i, j) as the edit distance between the length-i prefix of the first sequence X = x1x2…xm and the length-j prefix of the second sequence Y = y1y2…yn; let 0 ≤ i ≤ m, 0 ≤ j ≤ n, and let Del, Ins, Sub be the costs of deletion, insertion and substitution respectively; the computation is:

if min(i, j) = 0, LX,Y(i, j) = max(i, j);

otherwise

LX,Y(i, j) = min{ LX,Y(i-1, j) + Del, LX,Y(i, j-1) + Ins, LX,Y(i-1, j-1) + Sub·1(xi ≠ yj) },

where xi denotes the audio cluster result and yj denotes the video cluster result.

Moreover, the image energy E[n] is computed as

E[n] = Σ(i=1..M) Σ(j=1..N) vy,t(i, j)²,

where vy,t(i, j) denotes the velocity in the Y direction of pixel (i, j) in an image of size M × N.
The present invention performs speaking detection from the angle of matching audio and video information, eliminating the complex training process of conventional methods while improving the correct detection rate.
Description of the drawings
Fig. 1 is the method flow diagram of the embodiment of the present invention.
Fig. 2 is the structure diagram of the embodiment of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described in detail below with reference to the embodiments and the drawings.
As shown in Fig. 1, the method provided by the embodiment of the present invention comprises the following specific steps:
Step 1: divide the input media signal S(t) into an audio signal S1(t) and a video signal S2(t), and process them separately.

For the audio signal S1(t), the processing is as follows:

(1) For the audio signal of the input media file, compute the harmonic frequency vector within a discrete Fourier window. In the embodiment, the dominant harmonic frequencies are obtained within multiple DFT (discrete Fourier transform) windows.

(2) Compute the likelihood ratio log Λ(t) that each frame contains harmonic components, as the audio feature; t is the frame index. In specific implementation, (1) and (2) can be realized with the prior art; see e.g. L. N. Tan, B. J. Borgstrom, A. Alwan. Voice activity detection using harmonic frequency components in likelihood ratio test. Acoustics, Speech and Signal Processing (ICASSP), 2010: 4466-4469.
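As a rough illustration of the kind of per-frame harmonic statistic meant here, the following sketch measures how much spectral energy sits at multiples of the strongest low-frequency peak. It is a crude stand-in for the likelihood-ratio test of the cited reference, not a reproduction of it; the frequency band, the number of harmonics and the function name are illustrative assumptions.

```python
import numpy as np

def harmonic_feature(frame, sr, n_harmonics=5):
    """frame: 1-D float array of audio samples; sr: sample rate in Hz."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    # Candidate fundamental: strongest bin in the 80-400 Hz speech range.
    band = (freqs >= 80) & (freqs <= 400)
    f0 = freqs[band][np.argmax(spec[band])]
    bin_width = freqs[1] - freqs[0]
    # Energy at the first few multiples of the candidate fundamental.
    harm = sum(spec[int(round(k * f0 / bin_width))]
               for k in range(1, n_harmonics + 1)
               if k * f0 < freqs[-1])
    total = spec.sum() + 1e-9
    # Log ratio of harmonic to non-harmonic energy, a proxy for log Λ(t).
    return float(np.log(harm / (total - harm) + 1e-9))
```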
(3) Cluster according to the harmonic likelihood ratio to obtain two classes. In the embodiment, according to the harmonic likelihood ratios log Λ(t) of all frames within one second, the per-second conditional probability P(Ot | λ) is computed with an HMM (hidden Markov model). The observable state Ot is the harmonic likelihood ratio log Λ(t) obtained in step (2), normalized into [1, 10], i.e. Ot ∈ {1, 2, …, 10}; the hidden state qt indicates speaking or silence, and the number of hidden states is Nq = 2. The model is trained with the Baum-Welch algorithm to obtain the parameters λ = (A, B, π), where A is the transition matrix of the hidden states, B is the probability of the observable state given the hidden state at a given time, i.e. the confusion matrix, and π is the initial state distribution. A sliding window of length T is designed (T is the number of frames per second of the video), and its corresponding P(Ot | λ) is computed with the forward-backward algorithm as the clustering feature. In specific implementation, see L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2): 257-286, 1989.

The embodiment clusters with the K-means algorithm, obtaining two classes denoted 0 and 1 respectively.
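A minimal sketch of this quantize-train-score-cluster pipeline is given below; it serves the audio stream here and, unchanged, the image-energy stream of step (4) further on. It assumes hmmlearn (CategoricalHMM in recent versions) and scikit-learn; the function name, the 0-based observation symbols and all hyperparameters are illustrative choices, not prescribed by the patent.

```python
import numpy as np
from hmmlearn.hmm import CategoricalHMM
from sklearn.cluster import KMeans

def cluster_feature_stream(feature, fps):
    """feature: per-frame values (log Λ(t) or E[n]); fps: frames per second T.
    Returns one 0/1 label per second of input."""
    # Quantize the feature into 10 observable states, mirroring the patent's
    # normalization of log Λ(t) / E[n] into {1, ..., 10} (0-based here).
    lo, hi = feature.min(), feature.max()
    obs = np.clip(((feature - lo) / (hi - lo + 1e-9) * 10).astype(int), 0, 9)
    obs = obs.reshape(-1, 1)

    # Two hidden states qt (speaking / silent); Baum-Welch fitting yields
    # λ = (A, B, π): transition matrix, emission (confusion) matrix, and
    # initial state distribution.
    hmm = CategoricalHMM(n_components=2, n_iter=50, random_state=0)
    hmm.fit(obs)

    # Slide a window of T frames (one second) and take the forward-algorithm
    # log likelihood of each window as the clustering feature P(Ot | λ).
    scores = np.array([hmm.score(obs[s:s + fps])
                       for s in range(0, len(obs) - fps + 1, fps)])

    # K-means into two classes, labelled 0 and 1 as in the embodiment.
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    return km.fit_predict(scores.reshape(-1, 1))
```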
For the video signal S2(t), the processing is as follows:

(1) Extract the face region in every frame of the video signal of the input media file. The embodiment uses a cascade of Haar features to extract the face area in every frame of the video. In specific implementation, the Haar-feature cascade is prior art; see e.g. P. Viola, M. Jones. Robust real-time face detection. International Journal of Computer Vision (IJCV), 2004: 137-154.

(2) Extract the lip portion within the extracted face region. In specific implementation, the extraction can be realized with the prior art; see e.g. Jie Zhang, Shiguang Shan, Meina Kan, et al. Coarse-to-Fine Auto-Encoder Networks (CFAN) for Real-Time Face Alignment. European Conference on Computer Vision (ECCV), 2014: 1-16.

The embodiment obtains 68 facial landmark points in the face area (the 68 points mark the eyes, nose, mouth and face contour), and extracts a rectangular box containing the lip region from the coordinates of the lip landmarks.
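One possible realisation of these two steps, sketched below, combines OpenCV's bundled Haar cascade with dlib's 68-landmark shape predictor as stand-ins for the cited detectors (in the usual 68-point convention the mouth occupies landmarks 48-67); the model file path and the crop logic are illustrative assumptions.

```python
import cv2
import dlib

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
# Path to a pretrained 68-landmark model (illustrative; obtained separately).
landmarks = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_region(frame):
    """Returns the rectangular lip crop of the first detected face, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1,
                                           minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    shape = landmarks(gray, dlib.rectangle(x, y, x + w, y + h))
    # Mouth landmarks 48-67 bound the lip rectangle.
    xs = [shape.part(i).x for i in range(48, 68)]
    ys = [shape.part(i).y for i in range(48, 68)]
    return frame[min(ys):max(ys), min(xs):max(xs)]
```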
(3) Extract the feature of the lip region in every frame: the image energy. The embodiment computes the optical flow vy,t(i, j) of the lip region in the Y direction between two consecutive frames, and sums its square to obtain the image energy E[n], i.e.

E[n] = Σ(i=1..M) Σ(j=1..N) vy,t(i, j)²,

where vy,t(i, j) denotes the velocity in the Y direction (the vertical direction) of pixel (i, j) in an image of size M × N. In specific implementation, the computation can follow Tiawongsombat P, Jeong M H, Yun J S, et al. Robust visual speakingness detection using bi-level HMM. Pattern Recognition, 2012, 45(2): 783-793.
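The sketch below computes this feature with Farneback dense optical flow, used here as a generic stand-in for the flow method of the cited reference; the flow parameters are common OpenCV defaults, not values from the patent.

```python
import cv2
import numpy as np

def image_energy(prev_lip, cur_lip):
    """Image energy E[n] between two equally sized consecutive lip crops."""
    g0 = cv2.cvtColor(prev_lip, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(cur_lip, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    v_y = flow[..., 1]               # vy,t(i, j): vertical velocity field
    return float(np.sum(v_y ** 2))   # E[n] = Σi Σj vy,t(i, j)²
```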
(4) Cluster according to the image energy to obtain two classes. According to the image energy of all frames within one second, the per-second conditional probability can be computed as the clustering feature. In the embodiment, according to the image energies E[n] of all frames within one second, the per-second conditional probability P(Ot | λ) is computed with the HMM, where the observable state Ot is the image energy E[n] obtained in step (3), normalized into [1, 10], i.e. Ot ∈ {1, 2, …, 10}; the hidden state qt indicates speaking or silence, and Nq = 2. The model is trained with the Baum-Welch algorithm to obtain λ = (A, B, π), where A is the transition matrix of the hidden states, B is the probability of the observable state given the hidden state at a given time, i.e. the confusion matrix, and π is the initial state distribution. A sliding window of length T is designed (T is the number of frames per second of the video), and its corresponding P(Ot | λ) is computed with the forward-backward algorithm as the clustering feature. K-means clustering then yields two classes, denoted 0 and 1 respectively; the pipeline sketched after the audio clustering step applies unchanged here.
Step 2: match the two cluster results obtained from the audio signal and the video signal in step 1 to obtain the final speaking detection result. In the embodiment, because the probability of speaking activity occurring in the audio and the probability of speaking activity in the video follow the same distribution on the time axis, the edit distance algorithm used in biological DNA sequence testing is used to compute the distance between the audio and video information and perform speaking activity matching.

Define LX,Y(i, j) as the edit distance between the length-i prefix of the first sequence X = x1x2…xm and the length-j prefix of the second sequence Y = y1y2…yn, where xi denotes the audio cluster result and yj the video cluster result. Because the audio cluster result and the video cluster result both have two classes, each can be expressed as a string over 0 and 1.

The matched result gives the final speaking detection. Assume 0 ≤ i ≤ m, 0 ≤ j ≤ n, and let Del, Ins, Sub be the costs of deletion, insertion and substitution. The computation is: if min(i, j) = 0, LX,Y(i, j) = max(i, j); otherwise

LX,Y(i, j) = min{ LX,Y(i-1, j) + Del, LX,Y(i, j-1) + Ins, LX,Y(i-1, j-1) + Sub·1(xi ≠ yj) }.

From the computed edit distance, the result of matching the audio cluster result against the video cluster result, i.e. the detection result of whether someone is speaking, is obtained. In general, the smaller the edit distance, the greater the similarity of the two strings. Using the DNA sequence testing method from biology supports the case where the video length and the audio length differ by a small error: an optimal alignment is found that maximizes the number of matches and minimizes the number of gaps and mismatches; when the two sequences have the same length, gaps generally need not be considered. For elements that match in the two strings, the person in the corresponding media segment is judged to be speaking; for elements that do not match, the detection result is that the person in the corresponding media segment is not speaking. The detection result of whether someone is speaking can further improve communication efficiency and can be applied to human-computer interaction, teleconferencing, voiceprint recognition and so on; for example, when no speaking is detected, the quality of the media signal can be reduced, e.g. by lowering the resolution of the video picture.

In specific implementation, see Vladimir Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 1966, vol. 10, p. 707.
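The recurrence above transcribes directly into a dynamic program; the sketch below uses unit Del/Ins/Sub costs, which the patent leaves as parameters.

```python
def edit_distance(x, y, del_cost=1, ins_cost=1, sub_cost=1):
    """x, y: audio / video cluster label strings over {'0', '1'}."""
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if min(i, j) == 0:           # border case: L(i, j) = max(i, j)
                L[i][j] = max(i, j)
            else:
                L[i][j] = min(
                    L[i - 1][j] + del_cost,                 # deletion
                    L[i][j - 1] + ins_cost,                 # insertion
                    L[i - 1][j - 1]                         # substitution
                    + (sub_cost if x[i - 1] != y[j - 1] else 0))
    return L[m][n]

# For example, edit_distance("0110", "0100") == 1: a single substitution
# separates the audio and video label strings, so the two streams agree on
# almost every second.
```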
The technical solution of the present invention can be run automatically as computer software, or a corresponding system can be provided in a modular fashion. The embodiment provides a speaking detection system relating to media segments, comprising the following modules:

an audio-video clustering module, which divides the input media signal S(t) into an audio signal S1(t) and a video signal S2(t) and processes them separately;

for the audio signal S1(t), the processing is as follows:
for the audio signal of the input media file, compute the harmonic frequency vector within a discrete Fourier window;
compute the harmonic likelihood ratio log Λ(t) of each frame as the audio feature, where t is the frame index;
according to log Λ(t), compute the per-second conditional probability P(Ot | λ) with a hidden Markov model and cluster it, obtaining two classes; in the hidden Markov model, the observable state Ot is the normalized harmonic likelihood ratio log Λ(t), and the hidden state qt indicates speaking or silence;

for the video signal S2(t), the processing is as follows:
for the video signal of the input media file, extract the face region in every frame;
extract the lip portion within the extracted face region;
extract the feature of the lip region in every frame, the feature being the image energy E[n];
according to the image energy, compute the per-second conditional probability P(Ot | λ) with a hidden Markov model and cluster it, obtaining two classes; in the hidden Markov model, the observable state Ot is the normalized image energy E[n], and the hidden state qt indicates speaking or silence;

a matching module, which matches the cluster results obtained from the audio signal and the video signal with the edit distance algorithm used in DNA sequence testing, to obtain the final speaking detection result; the edit distance used for matching is computed as follows:

define LX,Y(i, j) as the edit distance between the length-i prefix of the first sequence X = x1x2…xm and the length-j prefix of the second sequence Y = y1y2…yn; let 0 ≤ i ≤ m, 0 ≤ j ≤ n, and let Del, Ins, Sub be the costs of deletion, insertion and substitution respectively; the computation is:

if min(i, j) = 0, LX,Y(i, j) = max(i, j);

otherwise

LX,Y(i, j) = min{ LX,Y(i-1, j) + Del, LX,Y(i, j-1) + Ins, LX,Y(i-1, j-1) + Sub·1(xi ≠ yj) },

where xi denotes the audio cluster result and yj denotes the video cluster result.
Referring to Fig. 2, those skilled in the art can design the system in finer detail. For example, in a speaking detection system based on matching audio and video information, the audio-video clustering module comprises an audio processing part and a video processing part: the audio processing part further consists of an audio preprocessing module, an audio feature extraction module and a first clustering module; the video processing part further consists of a face detection module, a lip extraction module, a video feature extraction module and a second clustering module.
The audio preprocessing module, denoted module 1, computes, for the audio signal of the input media file, the harmonic frequency vector within the discrete Fourier window, and feeds the result to the audio feature extraction module.

The audio feature extraction module, denoted module 2, computes the per-frame harmonic likelihood ratio log Λ(t) as the audio feature and feeds it to the first clustering module.

The first clustering module, denoted module 3, computes, according to log Λ(t), the per-second conditional probability P(Ot | λ) with the hidden Markov model, clusters it to obtain two classes, and feeds the result to the matching module.

The face detection module, denoted module 4, extracts the face region in every frame of the video signal of the input media file and feeds it to the lip extraction module.

The lip extraction module, denoted module 5, extracts the lip portion within the extracted face region and feeds the result to the video feature extraction module.

The video feature extraction module, denoted module 6, extracts the feature of the lip region in every frame and feeds the result to the second clustering module.

The second clustering module, denoted module 7, computes, according to the image energy, the per-second conditional probability P(Ot | λ) with the hidden Markov model, clusters it to obtain two classes, and feeds the result to the matching module.

The matching module, denoted module 8, matches the two cluster results obtained by the first and second clustering modules to obtain the final speaking detection result.

The specific implementation of each module corresponds to the steps described above and is not detailed further here; an end-to-end sketch chaining the illustrative helpers follows below.
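For orientation only, the following sketch chains the illustrative helpers from the earlier sketches (cluster_feature_stream, lip_region, image_energy, edit_distance) into the module pipeline of Fig. 2; the fixed lip-crop size and all signatures are assumptions, not the patent's prescribed interface.

```python
import cv2
import numpy as np

def detect_speaking(log_lambda, frames, a_fps, v_fps):
    """log_lambda: per-frame log Λ(t) values; frames: BGR video frames;
    a_fps / v_fps: audio and video frame rates (frames per second)."""
    # Modules 1-3: audio branch -> one 0/1 label per second.
    audio_labels = cluster_feature_stream(np.asarray(log_lambda), a_fps)

    # Modules 4-6: face -> lip crop -> image energy. Crops are resized to a
    # fixed size so consecutive frames are comparable for optical flow.
    lips = [lip_region(f) for f in frames]
    lips = [cv2.resize(l, (64, 32)) for l in lips if l is not None]
    energies = [image_energy(a, b) for a, b in zip(lips, lips[1:])]

    # Module 7: per-second labels for the video branch.
    video_labels = cluster_feature_stream(np.asarray(energies), v_fps)

    # Module 8: edit-distance matching of the two label strings; a small
    # distance means the streams agree on when someone is speaking.
    x = "".join(map(str, audio_labels))
    y = "".join(map(str, video_labels))
    return edit_distance(x, y)
```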
The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art can make various modifications or additions to the described embodiments or substitute them in a similar way, without departing from the spirit of the invention or exceeding the scope of the appended claims.
Claims (4)
1. A speaking detection method relating to media segments, characterized by comprising the following steps:

Step 1: dividing the input media signal S(t) into an audio signal S1(t) and a video signal S2(t), and processing them separately;

for the audio signal S1(t), the processing is as follows:
for the audio signal of the input media file, computing the harmonic frequency vector within a discrete Fourier window;
computing the harmonic likelihood ratio log Λ(t) of each frame as the audio feature, t being the frame index;
according to log Λ(t), computing the per-second conditional probability P(Ot | λ) with a hidden Markov model and clustering to obtain two classes; in the hidden Markov model, the observable state Ot is the normalized harmonic likelihood ratio log Λ(t), the hidden state qt indicates speaking or silence, and the model parameters are λ = (A, B, π), where A is the transition matrix of the hidden states, B is the probability of the observable state given the hidden state at a given time, i.e. the confusion matrix, and π is the initial state distribution;

for the video signal S2(t), the processing is as follows:
for the video signal of the input media file, extracting the face region in every frame;
extracting the lip portion within the extracted face region;
extracting the feature of the lip region in every frame, the feature being the image energy E[n];
according to the image energy, computing the per-second conditional probability P(Ot | λ) with a hidden Markov model and clustering to obtain two classes; in the hidden Markov model, the observable state Ot is the normalized image energy E[n], and the hidden state qt indicates speaking or silence;

Step 2: matching the cluster results obtained from the audio signal and the video signal with the edit distance algorithm used in DNA sequence testing, to obtain the final speaking detection result; the edit distance used for matching is computed as follows:

define LX,Y(i, j) as the edit distance between the length-i prefix of the first sequence X = x1x2…xm and the length-j prefix of the second sequence Y = y1y2…yn; let 0 ≤ i ≤ m, 0 ≤ j ≤ n, and let Del, Ins, Sub be the costs of deletion, insertion and substitution respectively; the computation is:

if min(i, j) = 0, LX,Y(i, j) = max(i, j);

otherwise

LX,Y(i, j) = min{ LX,Y(i-1, j) + Del, LX,Y(i, j-1) + Ins, LX,Y(i-1, j-1) + Sub·1(xi ≠ yj) },

where xi denotes the audio cluster result and yj denotes the video cluster result.

2. The speaking detection method relating to media segments according to claim 1, characterized in that the image energy E[n] is computed as

E[n] = Σ(i=1..M) Σ(j=1..N) vy,t(i, j)²,

where vy,t(i, j) denotes the velocity in the Y direction of pixel (i, j) in an image of size M × N.
3. A speaking detection system relating to media segments, characterized by comprising the following modules:

an audio-video clustering module, which divides the input media signal S(t) into an audio signal S1(t) and a video signal S2(t) and processes them separately;

for the audio signal S1(t), the processing is as follows:
for the audio signal of the input media file, computing the harmonic frequency vector within a discrete Fourier window;
computing the harmonic likelihood ratio log Λ(t) of each frame as the audio feature, t being the frame index;
according to log Λ(t), computing the per-second conditional probability P(Ot | λ) with a hidden Markov model and clustering to obtain two classes; in the hidden Markov model, the observable state Ot is the normalized harmonic likelihood ratio log Λ(t), the hidden state qt indicates speaking or silence, and the model parameters are λ = (A, B, π), where A is the transition matrix of the hidden states, B is the probability of the observable state given the hidden state at a given time, i.e. the confusion matrix, and π is the initial state distribution;

for the video signal S2(t), the processing is as follows:
for the video signal of the input media file, extracting the face region in every frame;
extracting the lip portion within the extracted face region;
extracting the feature of the lip region in every frame, the feature being the image energy E[n];
according to the image energy, computing the per-second conditional probability P(Ot | λ) with a hidden Markov model and clustering to obtain two classes; in the hidden Markov model, the observable state Ot is the normalized image energy E[n], and the hidden state qt indicates speaking or silence;

a matching module, which matches the cluster results obtained from the audio signal and the video signal with the edit distance algorithm used in DNA sequence testing, to obtain the final speaking detection result; the edit distance used for matching is computed as follows:

define LX,Y(i, j) as the edit distance between the length-i prefix of the first sequence X = x1x2…xm and the length-j prefix of the second sequence Y = y1y2…yn; let 0 ≤ i ≤ m, 0 ≤ j ≤ n, and let Del, Ins, Sub be the costs of deletion, insertion and substitution respectively; the computation is:

if min(i, j) = 0, LX,Y(i, j) = max(i, j);

otherwise

LX,Y(i, j) = min{ LX,Y(i-1, j) + Del, LX,Y(i, j-1) + Ins, LX,Y(i-1, j-1) + Sub·1(xi ≠ yj) },

where xi denotes the audio cluster result and yj denotes the video cluster result.

4. The speaking detection system relating to media segments according to claim 3, characterized in that the image energy E[n] is computed as

E[n] = Σ(i=1..M) Σ(j=1..N) vy,t(i, j)²,

where vy,t(i, j) denotes the velocity in the Y direction of pixel (i, j) in an image of size M × N.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510719532.5A CN105335755B (en) | 2015-10-29 | 2015-10-29 | Speaking detection method and system relating to media segments
Publications (2)
Publication Number | Publication Date |
---|---|
CN105335755A CN105335755A (en) | 2016-02-17 |
CN105335755B (en) | 2018-08-21
Family
ID=55286270
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510719532.5A Active CN105335755B (en) | Speaking detection method and system relating to media segments | 2015-10-29 | 2015-10-29
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105335755B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108831462A (en) * | 2018-06-26 | 2018-11-16 | 北京奇虎科技有限公司 | Vehicle-mounted voice recognition methods and device |
CN109558788B (en) * | 2018-10-08 | 2023-10-27 | 清华大学 | Silence voice input identification method, computing device and computer readable medium |
CN110309799B (en) * | 2019-07-05 | 2022-02-08 | 四川长虹电器股份有限公司 | Camera-based speaking judgment method |
CN110706709B (en) * | 2019-08-30 | 2021-11-19 | 广东工业大学 | Multi-channel convolution aliasing voice channel estimation method combined with video signal |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7684982B2 (en) * | 2003-01-24 | 2010-03-23 | Sony Ericsson Communications Ab | Noise reduction and audio-visual speech activity detection |
CN103198833A (en) * | 2013-03-08 | 2013-07-10 | 北京理工大学 | High-precision method of confirming speaker |
CN103856689A (en) * | 2013-10-31 | 2014-06-11 | 北京中科模识科技有限公司 | Character dialogue subtitle extraction method oriented to news video |
Non-Patent Citations (2)
Title |
---|
Kinnunen T. et al. Self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2013: 7229-7233. * |
Wang Jin et al. Speaking detection based on visual saliency. Journal of Wuhan University (Natural Science Edition), 2015, vol. 61, no. 4: 363-367. * |
Legal Events

Date | Code | Title | Description
---|---|---|---
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| TR01 | Transfer of patent right | |

Effective date of registration: 2021-07-07
Address after: Unit 01, 5/F, Building A, 388 Xinping Street, Suzhou Industrial Park, Suzhou City, Jiangsu Province, 215000
Patentee after: BOOSLINK SUZHOU INFORMATION TECHNOLOGY Co., Ltd.
Address before: Wuhan University, Luojiashan, Wuchang District, Wuhan City, Hubei Province, 430072
Patentee before: WUHAN UNIVERSITY