
CN105335755B - Speaking detection method and system for media fragments - Google Patents

Speaking detection method and system for media fragments

Info

Publication number
CN105335755B
CN105335755B
Authority
CN
China
Prior art keywords
audio
state
result
follows
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510719532.5A
Other languages
Chinese (zh)
Other versions
CN105335755A (en)
Inventor
胡瑞敏
王瑾
梁超
王晓晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOOSLINK SUZHOU INFORMATION TECHNOLOGY Co.,Ltd.
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University (WHU)
Priority to CN201510719532.5A
Publication of CN105335755A
Application granted
Publication of CN105335755B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/758 Involving statistics of pixels or of feature values, e.g. histogram matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Acoustics & Sound (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)

Abstract

The present invention provides a speaking detection method and system for media fragments. The input media signal is split into an audio signal and a video signal, which are processed separately. For the audio signal, a per-second conditional probability is computed with a hidden Markov model from the harmonic likelihood ratio, and the results are clustered. For the video signal, the face region is extracted from every frame of the input media file, the lip portion is located, and the image energy of the lip region is computed; a per-second conditional probability is again computed with a hidden Markov model from the image energy and clustered to obtain two classes. The cluster sequences obtained from the audio and video signals are then matched to produce the final speaking detection result. The advantage of the invention is that speaking detection draws on both audio and video information, improving the detection rate.

Description

Speaking detection method and system for media fragments
Technical field
The present invention relates to the field of speaking detection, and specifically to a speaking detection method and system for media fragments.
Background art
With the development of information technology, human-computer interaction, teleconferencing, and voiceprint recognition have become hot research topics, and speaking detection, an important component of these, has received growing attention. Speaking detection distinguishes whether a person in a media fragment is speaking. Traditional speaking-activity detection methods rely purely on audio information or on video information and are not robust. To address this, multi-modal speaking detection based on audio-visual information was introduced, but the prior art usually depends on a supervised trainer whose generalization ability is weak, which lowers the detection rate.
Summary of the invention
Considering that different media files in different environments have different characteristics, the present invention proposes a speaking detection method and system based on matching audio and video information. Unlike traditional supervised methods, it exploits the fact that speaking activity follows the same temporal distribution in the audio and video information, and performs speaking detection by matching the two.
To achieve the above objective, the technical solution provided by the invention is a speaking detection method for media fragments, comprising the following steps:
Step 1: split the input media signal S(t) into an audio signal S1(t) and a video signal S2(t), and process each separately.
The audio signal S1(t) is processed as follows:
for the audio signal of the input media file, compute the harmonic frequency vector within a discrete Fourier window;
compute the harmonic likelihood ratio log Λ(t) of each frame as the audio feature, where t is the frame index;
from log Λ(t), compute the per-second conditional probability P(O_t | λ) with a hidden Markov model and cluster the results to obtain two classes; in the hidden Markov model, the observed state O_t is the normalized harmonic likelihood ratio log Λ(t), and the hidden state q_t indicates speaking or silence.
The video signal S2(t) is processed as follows:
for the video signal of the input media file, extract the face region in every frame;
extract the lip portion within the extracted face region;
extract the feature of the lip region in every frame, the feature being the image energy E[n];
from the image energy, compute the per-second conditional probability P(O_t | λ) with a hidden Markov model and cluster the results to obtain two classes; in the hidden Markov model, the observed state O_t is the normalized image energy E[n], and the hidden state q_t indicates speaking or silence.
Step 2: match the cluster sequences obtained from the audio and video signals with the edit-distance algorithm used for DNA sequence comparison to obtain the final speaking detection result. The edit distance used in matching is computed as follows.
Define L_{X,Y}(i, j) as the edit distance between the length-i prefix of the first sequence X = x_1 x_2 … x_m and the length-j prefix of the second sequence Y = y_1 y_2 … y_n, where 0 ≤ i ≤ m, 0 ≤ j ≤ n, and Del, Ins, Sub are the deletion, insertion, and substitution costs. The computation is:
if min(i, j) = 0, then L_{X,Y}(i, j) = max(i, j);
otherwise L_{X,Y}(i, j) = min{ L_{X,Y}(i−1, j) + Del, L_{X,Y}(i, j−1) + Ins, L_{X,Y}(i−1, j−1) + Sub·1(x_i ≠ y_j) },
where x_i denotes the audio cluster result and y_j the video cluster result.
Moreover, the image energy E[n] is computed as
E[n] = Σ_{i=1}^{M} Σ_{j=1}^{N} v_{y,t}(i, j)²,
where v_{y,t}(i, j) is the velocity in the Y direction of pixel (i, j) in the M × N image.
The present invention correspondingly provides a speaking detection system for media fragments, comprising the following modules:
an audio-video clustering module, which splits the input media signal S(t) into an audio signal S1(t) and a video signal S2(t) and processes each separately,
the audio signal S1(t) being processed as follows:
for the audio signal of the input media file, compute the harmonic frequency vector within a discrete Fourier window;
compute the harmonic likelihood ratio log Λ(t) of each frame as the audio feature, where t is the frame index;
from log Λ(t), compute the per-second conditional probability P(O_t | λ) with a hidden Markov model and cluster the results to obtain two classes; in the hidden Markov model, the observed state O_t is the normalized harmonic likelihood ratio log Λ(t), and the hidden state q_t indicates speaking or silence;
the video signal S2(t) being processed as follows:
for the video signal of the input media file, extract the face region in every frame;
extract the lip portion within the extracted face region;
extract the feature of the lip region in every frame, the feature being the image energy E[n];
from the image energy, compute the per-second conditional probability P(O_t | λ) with a hidden Markov model and cluster the results to obtain two classes; in the hidden Markov model, the observed state O_t is the normalized image energy E[n], and the hidden state q_t indicates speaking or silence;
and a matching module, which matches the cluster sequences obtained from the audio and video signals with the edit-distance algorithm used for DNA sequence comparison to obtain the final speaking detection result, the edit distance used in matching being computed as follows:
define L_{X,Y}(i, j) as the edit distance between the length-i prefix of the first sequence X = x_1 x_2 … x_m and the length-j prefix of the second sequence Y = y_1 y_2 … y_n, where 0 ≤ i ≤ m, 0 ≤ j ≤ n, and Del, Ins, Sub are the deletion, insertion, and substitution costs; the computation is:
if min(i, j) = 0, then L_{X,Y}(i, j) = max(i, j);
otherwise L_{X,Y}(i, j) = min{ L_{X,Y}(i−1, j) + Del, L_{X,Y}(i, j−1) + Ins, L_{X,Y}(i−1, j−1) + Sub·1(x_i ≠ y_j) },
where x_i denotes the audio cluster result and y_j the video cluster result.
Moreover, the image energy E[n] is computed as
E[n] = Σ_{i=1}^{M} Σ_{j=1}^{N} v_{y,t}(i, j)²,
where v_{y,t}(i, j) is the velocity in the Y direction of pixel (i, j) in the M × N image.
By performing speaking detection from the angle of matching audio and video information, the present invention eliminates the complex training process of conventional methods while improving the correct detection rate.
Description of the drawings
Fig. 1 is the method flow diagram of the embodiment of the present invention.
Fig. 2 is the structure diagram of the embodiment of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described in detail below with reference to the embodiments and the accompanying drawings.
As shown in Fig. 1, the method provided by the embodiment of the present invention comprises the following specific steps:
Step 1: split the input media signal S(t) into an audio signal S1(t) and a video signal S2(t), and process each separately.
The audio signal S1(t) is processed as follows:
(1) For the audio signal of the input media file, compute the harmonic frequency vector within a discrete Fourier window; in the embodiment, the harmonic frequencies shared across multiple DFT (discrete Fourier transform) windows are obtained.
(2) Compute the likelihood ratio log Λ(t) that each frame contains harmonic components, used as the audio feature; t is the frame index.
In a specific implementation, steps (1) and (2) can use existing techniques, e.g. L. N. Tan, B. J. Borgstrom, A. Alwan. Voice activity detection using harmonic frequency components in likelihood ratio test [C]. Acoustics, Speech and Signal Processing (ICASSP), 2010: 4466-4469.
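For illustration only, the following minimal Python sketch computes a per-frame harmonicity score by comparing the spectral energy on a best-fitting harmonic comb against the remaining energy; it is a simplified stand-in for the likelihood-ratio statistic of Tan et al., not a reimplementation of it, and the pitch range, grid step, and harmonic count are illustrative assumptions.

```python
import numpy as np

def harmonic_log_ratio(frame, sr, f0_min=80.0, f0_max=400.0, n_harmonics=5):
    """Log ratio of spectral energy on the best-fitting harmonic comb to the
    remaining energy: a simplified proxy for the per-frame feature log L(t)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    best = -np.inf
    for f0 in np.arange(f0_min, f0_max, 5.0):          # coarse pitch grid
        bins = [int(np.argmin(np.abs(freqs - k * f0)))
                for k in range(1, n_harmonics + 1)]    # nearest bin per harmonic
        harm = spec[bins].sum()
        rest = spec.sum() - harm
        best = max(best, np.log(harm + 1e-12) - np.log(rest + 1e-12))
    return best
```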
(3) Cluster according to the harmonic likelihood ratio to obtain two classes. In the embodiment, from the harmonic likelihood ratios log Λ(t) of all frames within one second, the per-second conditional probability P(O_t | λ) is computed with an HMM (hidden Markov model). The observed state O_t is the harmonic likelihood ratio log Λ(t) obtained in step (2), normalized to [1, 10], i.e. O_t ∈ {1, 2, …, 10}; the hidden state q_t indicates speaking or silence, so the number of hidden states is N_q = 2. Baum-Welch training yields the model parameters λ = (A, B, π), where A is the hidden-state transition matrix, B gives the probability of each observable state at a given moment given the hidden state, i.e. the confusion matrix, and π is the initial state distribution. A sliding window of length T is designed (T is the number of frames per second of the video), and the corresponding P(O_t | λ) is computed with the forward-backward algorithm as the clustering feature. In a specific implementation, see Rabiner, L. R., et al.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257-286 (1989). AT&T Bell Labs, Murray Hill.
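As an illustration of the evaluation step (training λ with Baum-Welch is assumed to have been done separately, e.g. with an off-the-shelf HMM library), a minimal sketch of the forward algorithm computing P(O | λ) for one window of T discrete observations:

```python
import numpy as np

def window_likelihood(obs, A, B, pi):
    """Forward algorithm: P(O | lambda) for a discrete-output HMM with
    lambda = (A, B, pi). obs holds the normalized symbols for one window
    of T frames (re-indexed 0..9 for array indexing); A is N_q x N_q,
    B is N_q x 10, pi has length N_q (here N_q = 2: speaking / silent)."""
    alpha = pi * B[:, obs[0]]          # initialization with the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction: one step of the recursion
        # a production version would rescale alpha here to avoid underflow
    return alpha.sum()                 # termination: sum over hidden states
```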
The embodiment clusters with the K-means algorithm, obtaining two classes denoted 0 and 1 respectively.
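A minimal sketch of this clustering step, assuming scikit-learn; the random input merely stands in for the per-second probabilities produced by the previous step:

```python
import numpy as np
from sklearn.cluster import KMeans

probs = np.random.rand(60, 1)  # placeholder: one P(O_t | lambda) per second
labels = KMeans(n_clusters=2, n_init=10).fit_predict(probs)
print(labels)                  # per-second 0/1 labels, e.g. [0 1 1 0 ...]
```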
The video signal S2(t) is processed as follows:
(1) Extract the face region in every frame of the video signal of the input media file; the embodiment uses a cascade of Haar-feature classifiers to extract the face area in every frame. Haar-feature cascades are existing technology; see, e.g., P. Viola, M. Jones. Robust real-time face detection [J]. International Journal of Computer Vision (IJCV), 2004: 137-154.
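A minimal sketch of this detection step using OpenCV's bundled Viola-Jones cascade; the scale factor and neighbor count are common defaults, not values prescribed by the patent:

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face(frame_bgr):
    """Return (x, y, w, h) of the first detected face, or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    return tuple(faces[0]) if len(faces) else None
```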
(2) Extract the lip portion within the extracted face region. In a specific implementation, existing techniques can be used, e.g. Jie Zhang, Shiguang Shan, Meina Kan, et al. Coarse-to-Fine Auto-Encoder Networks (CFAN) for Real-Time Face Alignment [C]. European Conference on Computer Vision (ECCV), 2014: 1-16.
In the embodiment, 68 facial feature points are obtained for the face area (the 68 landmarks mark the eyes, nose, mouth, and face contour), and a rectangle enclosing the lip region is extracted from the coordinates of the lip landmarks.
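As a sketch of this step: the patent uses CFAN for the 68-point alignment, and dlib's landmark predictor is substituted here purely for illustration (in the iBUG 68-point scheme, landmarks 48-67 outline the mouth; the model path and margin are assumptions):

```python
import numpy as np
import dlib

# Pretrained 68-landmark model (path is an assumption; downloaded separately).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_box(gray, face_rect, margin=5):
    """Bounding rectangle of the mouth landmarks, padded by `margin` pixels.
    face_rect is a dlib.rectangle for the detected face."""
    shape = predictor(gray, face_rect)
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    x0, y0 = pts.min(axis=0) - margin
    x1, y1 = pts.max(axis=0) + margin
    return int(x0), int(y0), int(x1), int(y1)
```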
(3) Extract the feature of the lip region in every frame: the image energy. The embodiment computes the optical flow v_{y,t}(i, j) of the lip region in the Y direction between two consecutive frames and sums its square to obtain the image energy E[n], i.e.
E[n] = Σ_{i=1}^{M} Σ_{j=1}^{N} v_{y,t}(i, j)²,
where v_{y,t}(i, j) is the velocity in the Y (vertical) direction of pixel (i, j) in the M × N image. In a specific implementation, the computation can follow Tiawongsombat P, Jeong M H, Yun J S, et al. Robust visual speakingness detection using bi-level HMM [J]. Pattern Recognition, 2012, 45(2): 783-793.
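A minimal sketch of the E[n] computation, with Farneback dense flow standing in for whichever optical-flow estimator an implementation chooses; the flow parameters are common defaults, not values from the patent:

```python
import cv2
import numpy as np

def image_energy(lip_prev_gray, lip_curr_gray):
    """E[n]: sum over the M x N lip region of the squared vertical flow
    v_{y,t}(i, j) between two consecutive grayscale lip crops."""
    flow = cv2.calcOpticalFlowFarneback(lip_prev_gray, lip_curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    v_y = flow[..., 1]             # vertical (Y-direction) component
    return float(np.sum(v_y ** 2))
```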
(4) Cluster according to the image energy to obtain two classes. From the image energies of all frames within one second, the per-second conditional probability can be computed as the clustering feature. In the embodiment, from the image energies E[n] of all frames per second, the per-second conditional probability P(O_t | λ) is computed with an HMM, where the observed state O_t is the image energy E[n] obtained in step (3), normalized to [1, 10], i.e. O_t ∈ {1, 2, …, 10}, the hidden state q_t indicates speaking or silence, and N_q = 2. Baum-Welch training yields the model parameters λ = (A, B, π), where A is the hidden-state transition matrix, B gives the probability of each observable state at a given moment given the hidden state, i.e. the confusion matrix, and π is the initial state distribution. A sliding window of length T is designed (T is the number of frames per second of the video), and the corresponding P(O_t | λ) is computed with the forward-backward algorithm as the clustering feature. K-means clustering yields two classes, denoted 0 and 1 respectively.
Step 2: match the two cluster sequences obtained from the audio signal and the video signal in step 1 to obtain the final speaking detection result. In the embodiment, because the probability of speaking activity occurring on the timeline in the audio and the probability of speaking activity in the video follow the same distribution, the edit-distance algorithm used in biology for DNA sequence comparison is applied to compute the distance between the audio and video information and thereby match speaking activity. Define L_{X,Y}(i, j) as the edit distance between the length-i prefix of the first sequence X = x_1 x_2 … x_m and the length-j prefix of the second sequence Y = y_1 y_2 … y_n, where x_i denotes the audio cluster result and y_j the video cluster result. Because both the audio and video cluster results have two classes, each sequence can be expressed as a string of 0s and 1s.
The matching result is the final speaking detection result. Let 0 ≤ i ≤ m and 0 ≤ j ≤ n, and let Del, Ins, Sub be the deletion, insertion, and substitution costs. The computation is: if min(i, j) = 0, then L_{X,Y}(i, j) = max(i, j); otherwise
L_{X,Y}(i, j) = min{ L_{X,Y}(i−1, j) + Del, L_{X,Y}(i, j−1) + Ins, L_{X,Y}(i−1, j−1) + Sub·1(x_i ≠ y_j) }.
From the computed edit distance, the result of matching the audio cluster sequence against the video cluster sequence, i.e. the detection of whether someone is speaking, is obtained. In general, the smaller the edit distance, the greater the similarity of the two strings. Using the DNA-sequence comparison method from biology tolerates small discrepancies between the video length and the audio length: the optimal alignment maximizes the number of matches and minimizes the number of gaps and mismatches, whereas ordinary comparison of two equal-length sequences does not consider gaps. For elements that match across the two strings, the person in the corresponding media clip is judged to be speaking; for unmatched elements, the detection result is that the person in the corresponding clip is not speaking. The speaking/not-speaking result can further improve communication efficiency in applications such as human-computer interaction, teleconferencing, and voiceprint recognition; for example, when no speaking is detected, the quality of the media signal can be reduced, e.g. by lowering the resolution of the video picture.
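For illustration, a minimal sketch of the recurrence above applied to the two 0/1 cluster strings; unit costs are assumed, and the first row and column implement the min(i, j) = 0 base case:

```python
def edit_distance(x, y, del_cost=1, ins_cost=1, sub_cost=1):
    """L_{X,Y}(m, n) by dynamic programming; matched symbols cost nothing."""
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        L[i][0] = i * del_cost          # base cases where min(i, j) = 0
    for j in range(1, n + 1):
        L[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            L[i][j] = min(L[i - 1][j] + del_cost,
                          L[i][j - 1] + ins_cost,
                          L[i - 1][j - 1]
                          + (0 if x[i - 1] == y[j - 1] else sub_cost))
    return L[m][n]

# e.g. edit_distance("0110", "0100") == 1: the per-second audio and video
# label sequences differ in a single second, so the clips match closely.
```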
In a specific implementation, see Vladimir Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," in Soviet Physics Doklady, 1966, vol. 10, p. 707.
The technical solution of the present invention can be run as computer software supporting an automatic workflow, or a corresponding system can be provided in modular form. The embodiment provides a speaking detection system for media fragments, comprising the following modules:
an audio-video clustering module, which splits the input media signal S(t) into an audio signal S1(t) and a video signal S2(t) and processes each separately,
the audio signal S1(t) being processed as follows:
for the audio signal of the input media file, compute the harmonic frequency vector within a discrete Fourier window;
compute the harmonic likelihood ratio log Λ(t) of each frame as the audio feature, where t is the frame index;
from log Λ(t), compute the per-second conditional probability P(O_t | λ) with a hidden Markov model and cluster the results to obtain two classes; in the hidden Markov model, the observed state O_t is the normalized harmonic likelihood ratio log Λ(t), and the hidden state q_t indicates speaking or silence;
the video signal S2(t) being processed as follows:
for the video signal of the input media file, extract the face region in every frame;
extract the lip portion within the extracted face region;
extract the feature of the lip region in every frame, the feature being the image energy E[n];
from the image energy, compute the per-second conditional probability P(O_t | λ) with a hidden Markov model and cluster the results to obtain two classes; in the hidden Markov model, the observed state O_t is the normalized image energy E[n], and the hidden state q_t indicates speaking or silence;
and a matching module, which matches the cluster sequences obtained from the audio and video signals with the edit-distance algorithm used for DNA sequence comparison to obtain the final speaking detection result, the edit distance used in matching being computed as follows:
define L_{X,Y}(i, j) as the edit distance between the length-i prefix of the first sequence X = x_1 x_2 … x_m and the length-j prefix of the second sequence Y = y_1 y_2 … y_n, where 0 ≤ i ≤ m, 0 ≤ j ≤ n, and Del, Ins, Sub are the deletion, insertion, and substitution costs; the computation is:
if min(i, j) = 0, then L_{X,Y}(i, j) = max(i, j);
otherwise L_{X,Y}(i, j) = min{ L_{X,Y}(i−1, j) + Del, L_{X,Y}(i, j−1) + Ins, L_{X,Y}(i−1, j−1) + Sub·1(x_i ≠ y_j) },
where x_i denotes the audio cluster result and y_j the video cluster result.
Referring to Fig. 2, those skilled in the art can design the system in finer detail. For example, in a speaking detection system based on matching audio and video information, the audio-video clustering module comprises an audio processing part and a video processing part: the audio processing part consists of an audio preprocessing module, an audio feature extraction module, and a first clustering module; the video processing part consists of a face detection module, a lip extraction module, a video feature extraction module, and a second clustering module.
The audio preprocessing module (module 1) computes, for the audio signal of the input media file, the harmonic frequency vector within a discrete Fourier window; the result is fed to the audio feature extraction module.
The audio feature extraction module (module 2) computes the harmonic likelihood ratio log Λ(t) of each frame as the audio feature and feeds it to the first clustering module.
The first clustering module (module 3) computes the per-second conditional probability P(O_t | λ) from log Λ(t) with a hidden Markov model and clusters it to obtain two classes; the result is fed to the matching module.
The face detection module (module 4) extracts the face region in every frame of the video signal of the input media file and feeds it to the lip extraction module.
The lip extraction module (module 5) extracts the lip portion within the extracted face region; the result is fed to the video feature extraction module.
The video feature extraction module (module 6) extracts the feature of the lip region in every frame; the result is fed to the second clustering module.
The second clustering module (module 7) computes the per-second conditional probability P(O_t | λ) from the image energy with a hidden Markov model and clusters it to obtain two classes; the result is fed to the matching module.
The matching module (module 8) matches the two cluster sequences produced by the first and second clustering modules to obtain the final speaking detection result.
The specific implementation of each module corresponds to the steps described above and is not detailed further here.
The specific embodiments described herein merely illustrate the spirit of the invention. Those skilled in the art can make various modifications or additions to the described embodiments, or substitute them in similar ways, without departing from the spirit of the invention or exceeding the scope of the appended claims.

Claims (4)

1. A speaking detection method for media fragments, characterized by comprising the following steps:
Step 1: split the input media signal S(t) into an audio signal S1(t) and a video signal S2(t), and process each separately,
the audio signal S1(t) being processed as follows:
for the audio signal of the input media file, compute the harmonic frequency vector within a discrete Fourier window;
compute the harmonic likelihood ratio log Λ(t) of each frame as the audio feature, t being the frame index;
from log Λ(t), compute the per-second conditional probability P(O_t | λ) with a hidden Markov model and cluster the results to obtain two classes; in the hidden Markov model, the observed state O_t is the normalized harmonic likelihood ratio log Λ(t), the hidden state q_t indicates speaking or silence, and the model parameters are λ = (A, B, π), where A is the hidden-state transition matrix, B gives the probability of each observable state at a given moment given the hidden state, i.e. the confusion matrix, and π is the initial state distribution;
the video signal S2(t) being processed as follows:
for the video signal of the input media file, extract the face region in every frame;
extract the lip portion within the extracted face region;
extract the feature of the lip region in every frame, the feature being the image energy E[n];
from the image energy, compute the per-second conditional probability P(O_t | λ) with a hidden Markov model and cluster the results to obtain two classes; in the hidden Markov model, the observed state O_t is the normalized image energy E[n] and the hidden state q_t indicates speaking or silence;
Step 2: match the cluster sequences obtained from the audio and video signals with the edit-distance algorithm used for DNA sequence comparison to obtain the final speaking detection result, the edit distance used in matching being computed as follows:
define L_{X,Y}(i, j) as the edit distance between the length-i prefix of the first sequence X = x_1 x_2 … x_m and the length-j prefix of the second sequence Y = y_1 y_2 … y_n, where 0 ≤ i ≤ m, 0 ≤ j ≤ n, and Del, Ins, Sub are the deletion, insertion, and substitution costs; the computation is:
if min(i, j) = 0, then L_{X,Y}(i, j) = max(i, j);
otherwise L_{X,Y}(i, j) = min{ L_{X,Y}(i−1, j) + Del, L_{X,Y}(i, j−1) + Ins, L_{X,Y}(i−1, j−1) + Sub·1(x_i ≠ y_j) },
where x_i denotes the audio cluster result and y_j denotes the video cluster result.
2. The speaking detection method for media fragments according to claim 1, characterized in that the image energy E[n] is computed as
E[n] = Σ_{i=1}^{M} Σ_{j=1}^{N} v_{y,t}(i, j)²,
where v_{y,t}(i, j) is the velocity in the Y direction of pixel (i, j) in the M × N image.
3. A speaking detection system for media fragments, characterized by comprising the following modules:
an audio-video clustering module, which splits the input media signal S(t) into an audio signal S1(t) and a video signal S2(t) and processes each separately,
the audio signal S1(t) being processed as follows:
for the audio signal of the input media file, compute the harmonic frequency vector within a discrete Fourier window;
compute the harmonic likelihood ratio log Λ(t) of each frame as the audio feature, t being the frame index;
from log Λ(t), compute the per-second conditional probability P(O_t | λ) with a hidden Markov model and cluster the results to obtain two classes; in the hidden Markov model, the observed state O_t is the normalized harmonic likelihood ratio log Λ(t), the hidden state q_t indicates speaking or silence, and the model parameters are λ = (A, B, π), where A is the hidden-state transition matrix, B gives the probability of each observable state at a given moment given the hidden state, i.e. the confusion matrix, and π is the initial state distribution;
the video signal S2(t) being processed as follows:
for the video signal of the input media file, extract the face region in every frame;
extract the lip portion within the extracted face region;
extract the feature of the lip region in every frame, the feature being the image energy E[n];
from the image energy, compute the per-second conditional probability P(O_t | λ) with a hidden Markov model and cluster the results to obtain two classes; in the hidden Markov model, the observed state O_t is the normalized image energy E[n] and the hidden state q_t indicates speaking or silence;
and a matching module, which matches the cluster sequences obtained from the audio and video signals with the edit-distance algorithm used for DNA sequence comparison to obtain the final speaking detection result, the edit distance used in matching being computed as follows:
define L_{X,Y}(i, j) as the edit distance between the length-i prefix of the first sequence X = x_1 x_2 … x_m and the length-j prefix of the second sequence Y = y_1 y_2 … y_n, where 0 ≤ i ≤ m, 0 ≤ j ≤ n, and Del, Ins, Sub are the deletion, insertion, and substitution costs; the computation is:
if min(i, j) = 0, then L_{X,Y}(i, j) = max(i, j);
otherwise L_{X,Y}(i, j) = min{ L_{X,Y}(i−1, j) + Del, L_{X,Y}(i, j−1) + Ins, L_{X,Y}(i−1, j−1) + Sub·1(x_i ≠ y_j) },
where x_i denotes the audio cluster result and y_j denotes the video cluster result.
4. The speaking detection system for media fragments according to claim 3, characterized in that the image energy E[n] is computed as
E[n] = Σ_{i=1}^{M} Σ_{j=1}^{N} v_{y,t}(i, j)²,
where v_{y,t}(i, j) is the velocity in the Y direction of pixel (i, j) in the M × N image.
CN201510719532.5A 2015-10-29 2015-10-29 Speaking detection method and system for media fragments Active CN105335755B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510719532.5A CN105335755B (en) 2015-10-29 2015-10-29 Speaking detection method and system for media fragments

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510719532.5A CN105335755B (en) 2015-10-29 2015-10-29 Speaking detection method and system for media fragments

Publications (2)

Publication Number Publication Date
CN105335755A CN105335755A (en) 2016-02-17
CN105335755B true CN105335755B (en) 2018-08-21

Family

ID=55286270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510719532.5A Active CN105335755B (en) 2015-10-29 2015-10-29 Speaking detection method and system for media fragments

Country Status (1)

Country Link
CN (1) CN105335755B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108831462A (en) * 2018-06-26 2018-11-16 北京奇虎科技有限公司 Vehicle-mounted voice recognition method and device
CN109558788B (en) * 2018-10-08 2023-10-27 清华大学 Silent voice input recognition method, computing device and computer-readable medium
CN110309799B (en) * 2019-07-05 2022-02-08 四川长虹电器股份有限公司 Camera-based speaking judgment method
CN110706709B (en) * 2019-08-30 2021-11-19 广东工业大学 Multi-channel convolution aliasing voice channel estimation method combined with video signal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7684982B2 (en) * 2003-01-24 2010-03-23 Sony Ericsson Communications Ab Noise reduction and audio-visual speech activity detection
CN103198833A (en) * 2013-03-08 2013-07-10 北京理工大学 High-precision method of confirming speaker
CN103856689A (en) * 2013-10-31 2014-06-11 北京中科模识科技有限公司 Character dialogue subtitle extraction method oriented to news video

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7684982B2 (en) * 2003-01-24 2010-03-23 Sony Ericsson Communications Ab Noise reduction and audio-visual speech activity detection
CN103198833A (en) * 2013-03-08 2013-07-10 北京理工大学 High-precision method of confirming speaker
CN103856689A (en) * 2013-10-31 2014-06-11 北京中科模识科技有限公司 Character dialogue subtitle extraction method oriented to news video

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data; Kinnunen T et al.; Proceedings of the IEEE International Conference on Acoustics; 2013-12-31; pp. 7229-7233 *
Speaking detection based on visual saliency (基于视觉显著度的说话检测); 王瑾 et al.; Journal of Wuhan University (Natural Science Edition); 2015-08-30; vol. 61, no. 4; pp. 363-367 *

Also Published As

Publication number Publication date
CN105335755A (en) 2016-02-17

Similar Documents

Publication Publication Date Title
Shillingford et al. Large-scale visual speech recognition
Raj et al. Face recognition based smart attendance system
Kong et al. Towards subject independent continuous sign language recognition: A segment and merge approach
Potamianos et al. Audio-visual automatic speech recognition: An overview
Potamianos et al. Recent advances in the automatic recognition of audiovisual speech
US7454342B2 (en) Coupled hidden Markov model (CHMM) for continuous audiovisual speech recognition
CN105335755B (en) Speaking detection method and system for media fragments
US20130226587A1 (en) Lip-password Based Speaker Verification System
KR20010039771A (en) Methods and apparatus for audio-visual speaker recognition and utterance verification
Khoury et al. Hierarchical speaker clustering methods for the nist i-vector challenge
Sui et al. Listening with your eyes: Towards a practical visual speech recognition system using deep boltzmann machines
Slimane et al. Context matters: Self-attention for sign language recognition
CN105139856B (en) Probability linear discriminant method for distinguishing speek person based on the regular covariance of priori knowledge
Ibrahim et al. Geometrical-based lip-reading using template probabilistic multi-dimension dynamic time warping
Koller et al. Read my lips: Continuous signer independent weakly supervised viseme recognition
Sarhan et al. HLR-net: a hybrid lip-reading model based on deep convolutional neural networks
Dalka et al. Visual lip contour detection for the purpose of speech recognition
Guy et al. Learning visual voice activity detection with an automatically annotated dataset
Jain et al. Visual speech recognition for isolated digits using discrete cosine transform and local binary pattern features
Liu et al. Exploring deep learning for joint audio-visual lip biometrics
Paleček et al. Audio-visual speech recognition in noisy audio environments
Benhaim et al. Designing relevant features for visual speech recognition
Radha et al. A person identification system combining recognition of face and lip-read passwords
Kuzmin et al. Magnitude-aware probabilistic speaker embeddings
Pathan et al. Recognition of spoken English phrases using visual features extraction and classification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210707

Address after: 215000 unit 01, 5 / F, building a, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Patentee after: BOOSLINK SUZHOU INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 430072 Luojiashan, Wuhan University, Wuchang District, Wuhan City, Hubei Province

Patentee before: WUHAN University