CN101625862B - Method for detecting voice interval in automatic caption generating system - Google Patents
- Publication number: CN101625862B
- Authority: CN (China)
- Legal status: Expired - Fee Related
- Classification: Telephonic Communication Services
Abstract
The invention relates to voice detection technology in an automatic caption generating system, in particular to a method for detecting voice intervals in an automatic caption generating system. The method comprises the following steps: dividing the audio sample sequence into fixed-length frames, calculating the short-time energy frequency value of each frame, and forming a short-time energy frequency value sequence; analyzing this sequence from the first frame onward and searching for its ascending and descending intervals; and determining the starting point or end point of voice by calculating the average slope of the waveform over each interval and comparing it with a threshold, thereby completing the detection of voice intervals. The method can perform voice endpoint detection on continuous speech even when the background noise changes frequently, improving the efficiency of voice endpoint detection under complex noise backgrounds.
Description
Technical field
The present invention relates to speech detection technology in automatic caption generating systems, and specifically to a method for detecting voice intervals in an automatic caption generating system.
Background art
Voice endpoint detection is a new field of speech technology research, and it is applied in automatic caption generating systems. Current caption production methods first require a caption manuscript, i.e., a text written in advance, before the TV program is produced, recording the program title, the host's lines, the interviewees' words, and similar content. When producing the program, editorial staff add audio and video material to the storyboard of the non-linear editing software and edit it according to the purpose of the program. Editing operations generally include moving material, adding effects, adding captions, and so on. To add captions, the editor typically first selects several segments of text from the caption manuscript (each segment being one sentence), generates a new subtitle file from them, and drags this file onto a track of the non-linear editing software, after which each sentence of the captions plays in order. However, the phenomenon of "sound and picture out of sync" frequently occurs: the moment a caption appears does not match the corresponding sound in the audio file. The editor then has to listen repeatedly while adjusting the in and out points of the captions. This wastes manpower and time and affects the quality and efficiency of subtitle file production.
Finding the starting point and end point of speech within complex background noise, i.e., voice endpoint detection, has always been a basic problem in speech signal processing. Because of its importance, many voice endpoint detection methods have been proposed. They can roughly be divided into two types: model-based methods and threshold-based methods.
Model-based methods use multidimensional features such as Mel cepstral coefficients, but they depend heavily on model building and data training, and the amount of computation is very large; because many feature dimensions are used, adapting to the environment requires a large amount of data, so these methods are rather difficult to implement.
Threshold-based endpoint detection methods select suitable feature parameters according to the characteristics of speech and then compare these parameters with predefined thresholds, or first apply a series of post-processing steps to the parameters and then compare with the thresholds. Traditional threshold-based methods basically all use speech parameters such as short-time energy, short-time zero-crossing rate and short-term information entropy, judge whether each of them exceeds a threshold, and then decide through AND or OR operations whether speech begins or ends.
For threshold-based endpoint detection methods, two factors mainly influence the detection result: 1. the extraction of the feature parameters; 2. the determination and adjustment of the thresholds.
Existing threshold-based endpoint detection parameters mainly include:
1) Energy: sound intensity is used as the judgment parameter. This method works well at high signal-to-noise ratios, but at low signal-to-noise ratios — for example under interference from noises such as car engines or slamming doors — its accuracy is very low.
2) Frequency: frequency-domain features are used as the basis for judgment. This method can accurately distinguish speech from noises such as car engines and slamming doors, but it distinguishes speech from music rather poorly.
Traditional threshold-based voice endpoint detection methods mainly suffer from the following deficiencies:
First, no matter which audio parameter is adopted, traditional methods perform badly under particular noise environments. For example, energy-based methods perform poorly in low-SNR environments, while entropy-based algorithms fail under a music background.
In addition, traditional endpoint detection methods are mainly used in speech recognition, voice dialing, command control and embedded systems. In these applications speech lasts only a very short time, generally a few seconds, and the background noise hardly changes during detection, so these methods generally analyze the first 5 frames of the audio to characterize the noise. If speech lasts a long time and the background noise changes frequently during detection, these methods no longer work well.
Finally, traditional methods focus on accurately extracting the endpoints of single words from background noise. By comparison, an automatic caption generating system has relatively low precision requirements but must perform continuous endpoint detection on continuous speech and ultimately detect the endpoints of sentences.
Summary of the invention
The objective of the present invention is to overcome the defects of existing voice endpoint detection methods and to provide a method for detecting voice intervals in an automatic caption generating system, so as to improve the efficiency of voice endpoint detection under complex noise backgrounds.
The technical scheme of the present invention is as follows: a method for detecting voice intervals in an automatic caption generating system comprises the following steps:
(1) Divide the audio sample sequence into fixed-length frames, calculate the short-time energy frequency value of each frame of the audio file, and form a short-time energy frequency value sequence: X_1, X_2, X_3, X_4, ..., X_n.
(2) Analyze the short-time energy frequency value sequence starting from the first frame. Suppose frame t is currently being analyzed; examine the short-time energy frequency value of each frame after frame t until a frame j is found such that
X_t <= X_{t+1} <= X_{t+2} <= ... <= X_j and X_{j+1} >= X_{j+2},
i.e., find the ascending interval of the short-time energy frequency value sequence that starts at frame t, denoted A_t.
(3) Calculate the average slope R_t of the short-time energy frequency value waveform over the ascending interval A_t that was found:
R_t = (X_j - X_t) / (j - t)
where X_t is the short-time energy frequency value of frame t and X_j is the short-time energy frequency value of frame j.
(4) Set a threshold R_m to confirm the voice starting point. If R_t >= R_m and the interval before frame t is not considered a voice interval, mark frame t as the starting point of voice, then let t = j + 1 and search for the matching voice end point, thereby confirming one voice interval. If R_t < R_m, let t = j + 1 and repeat step (2).
Further, in the above method for detecting voice intervals in an automatic caption generating system, the step of seeking the voice end point in step (4) is as follows:
(a) Starting from frame t, examine the short-time energy frequency value of each frame after frame t until a frame k is found such that
X_t >= X_{t+1} >= X_{t+2} >= ... >= X_k and X_{k+1} <= X_{k+2},
i.e., find the descending interval of the short-time energy frequency value sequence that starts at frame t, denoted D_t.
(b) Calculate the average slope R_t of the short-time energy frequency value waveform over the descending interval D_t that was found:
R_t = (X_t - X_k) / (k - t)
where X_t is the short-time energy frequency value of frame t and X_k is the short-time energy frequency value of frame k.
(c) Judge the end point of the speech signal via the set threshold R_m. If R_t >= R_m and a voice starting point has already been found, mark frame t as the voice end point corresponding to the previous voice starting point. If R_t < R_m, let t = k + 1 and repeat step (a).
Further, in the above process of seeking the voice end point, if in step (c) R_t >= R_m but no unmatched voice starting point was found before frame t — that is, a descending interval D_t corresponding to no voice starting point has been found — then this descending interval D_t is an independent voice interval: frame t is marked as the voice starting point and frame k as the voice end point.
Further, in the aforesaid method, during the search for voice starting points and end points, if a descending interval D_1 that does not belong to a voice portion lies between two ascending intervals A_1 and A_2 that do belong to voice portions, or if an ascending interval A_3 that does not belong to a voice portion lies between two descending intervals D_2 and D_3 that do belong to voice portions, then both the descending interval D_1 and the ascending interval A_3 are regarded as belonging to the voice interval.
Further, in the aforesaid method, when seeking a pair of voice starting and end points, the threshold R_m is determined as follows:
(i) Analyze the current short-time energy frequency value sequence, find its minimum, denoted EZE-feature_min, and its maximum, denoted EZE-feature_max, then compute EZE-feature_max / 100.
(ii) Compare EZE-feature_min with EZE-feature_max / 100 and take the greater of the two, denoted EZE-feature_slope.
(iii) Set the threshold R_m = EZE-feature_slope × 2.
Further, in the above method, after all voice starting points and end points have been detected, traverse the voice endpoint sequence, successively seeking a voice end point F_e and the next voice starting point F_b. If F_e and F_b are separated by more than a specified time interval, the span between F_e and F_b is confirmed as an interval between sentences, and F_e and F_b are marked as sentence endpoints. This process is repeated until all sentence endpoints are determined. The time interval used for judging sentence endpoints is 100 ms.
Further, in the aforesaid method, in step (1) the audio sample sequence is divided into frames of 10 ms length.
Further, in the aforesaid method, in step (1) the short-time energy frequency value of frame i is:
EZE-feature_i = (E_i - E_b) · (Z_i - Z_b) · (H_i - H_b)
where EZE-feature_i denotes the short-time energy frequency value of frame i; E_i, Z_i and H_i denote the short-time energy, short-time zero-crossing rate and short-term information entropy of frame i, respectively; and E_b, Z_b and H_b denote the short-time energy, short-time zero-crossing rate and short-term information entropy of the current background noise, respectively.
The beneficial effects of the present invention are as follows: the method for detecting voice intervals in an automatic caption generating system provided by the present invention can accurately find the in and out time points corresponding to each caption according to the pauses in the speaker's speech; program production staff only need to drag the file onto a track of the non-linear editing software, which greatly saves the manpower and material resources consumed in subtitle file production. In addition, the present invention takes both the time-domain and the frequency-domain features of speech into account, can perform endpoint detection on continuous speech under complex background noise, and ultimately detects the endpoints of sentences. Compared with traditional methods, the present invention achieves higher voice endpoint detection efficiency and better quality.
Description of drawings
Fig. 1 is a schematic diagram of the automatic caption generating system.
Fig. 2 is the flowchart for detecting voice starting points and end points.
Fig. 3 is a schematic diagram of special cases of the short-time energy frequency value waveform.
Fig. 4 is the flowchart for seeking sentence endpoints.
Embodiment
The present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
The method for detecting voice intervals provided by the present invention is applied in an automatic caption generating system. The system accepts as input an audio file in PCM-coded wav format with a 48 kHz sampling rate, 16-bit resolution and 2 channels (stereo), together with the corresponding caption manuscript, and outputs a subtitle file in srt format whose content is each sentence of the caption manuscript with its corresponding start and end time points. The overall system structure is shown in Fig. 1.
The voice interval detection flow provided by the present invention is shown in Fig. 2. After the audio data is input, the audio file is parsed and the digital sample values are extracted; the audio sample sequence is then band-pass filtered with a pass band of 400 Hz to 3500 Hz. The main purpose of this is to filter out noise or music outside the frequency range of human speech, which greatly reduces the influence of background music on voice endpoint detection. Voice endpoint detection then proceeds as follows.
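As an illustration of this filtering stage, the following is a minimal sketch in Python, assuming a 48 kHz mono sample array and the scipy library; the function name and default parameters are illustrative and not part of the patent.

```python
import numpy as np
from scipy.signal import butter, lfilter

def bandpass_speech(samples: np.ndarray, fs: int = 48000,
                    low: float = 400.0, high: float = 3500.0,
                    order: int = 4) -> np.ndarray:
    """Band-pass the raw samples to the 400-3500 Hz speech band."""
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return lfilter(b, a, samples)
```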
(1) Apply window processing to the audio sample sequence, divide it into frames of 10 ms length to form a frame sequence, and extract three audio feature parameters from each frame: short-time energy, short-time zero-crossing rate and short-term information entropy.
1. short-time energy
Energy is one of the most frequently used audio feature parameters and the most intuitive representation of a speech signal. Energy analysis is based on the fact that the amplitude of the speech signal varies considerably over time. Energy can be used to distinguish the unvoiced and voiced segments of pronunciation: larger energy values correspond to voiced segments and smaller energy values to unvoiced segments. For a signal with a high signal-to-noise ratio, the presence of speech can be judged from energy alone: the noise energy without speech is small and increases markedly when speech is present, which roughly distinguishes the starting point and stopping point of the speech signal. In addition, energy can be used to locate the boundary between the initial and the final of a Chinese syllable and the boundaries between connected syllables.
In the present invention, "short-time energy" is adopted as one of the main feature parameters. Short-time energy means that the audio signal is first divided into frames and the energy of each frame is then computed, defined as the sum of the squares of all sample values in the frame. The short-time energy of frame i is defined as:
E_i = Σ_{n=1..N} S_n^2
where N denotes the number of audio samples contained in frame i and S_n denotes the value of the n-th sample.
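As a sketch of the framing and energy computation (10 ms frames are 480 samples at 48 kHz), assuming numpy; the helper names are illustrative:

```python
import numpy as np

def split_frames(samples: np.ndarray, fs: int = 48000, frame_ms: int = 10) -> np.ndarray:
    """Split the sample sequence into fixed-length 10 ms frames."""
    n = fs * frame_ms // 1000                  # 480 samples per frame at 48 kHz
    usable = len(samples) - len(samples) % n   # drop the trailing partial frame
    return samples[:usable].reshape(-1, n)

def short_time_energy(frame: np.ndarray) -> float:
    """E_i: sum of squared sample values within one frame."""
    return float(np.sum(frame.astype(np.float64) ** 2))
```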
2. short-time zero-crossing rate
Zero-crossing rate is a commonly used audio feature parameter in audio signal processing. When the time-domain waveform of a discrete speech signal passes through the time axis, i.e., when the sample values at adjacent instants have different signs, a "zero crossing" occurs. The number of zero crossings per unit time is called the "zero-crossing rate", i.e., the number of sign changes of the audio sample values per unit time. As before, one frame is taken as the unit time in the present invention, and the zero-crossing rate of each frame is the "short-time zero-crossing rate". The short-time zero-crossing rate of frame i is defined as:
Z_i = (1/2) Σ_{n=1..N-1} | sgn(X_{n+1}) - sgn(X_n) |
where X_n denotes the value of the n-th sample, and sgn(·) is the sign function, defined as sgn(x) = 1 for x >= 0 and sgn(x) = -1 for x < 0.
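A matching sketch of this definition, under the same assumptions as above (illustrative names):

```python
import numpy as np

def short_time_zcr(frame: np.ndarray) -> float:
    """Z_i: half the sum of |sgn(X_{n+1}) - sgn(X_n)| over one frame."""
    signs = np.where(frame >= 0, 1, -1)   # sgn(x): 1 for x >= 0, -1 for x < 0
    return 0.5 * float(np.sum(np.abs(np.diff(signs))))
```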
Zero-crossing analysis is the simplest kind of time-domain speech analysis. It can distinguish whether a sound is unvoiced or voiced: most of the energy of unvoiced speech appears at higher frequencies, so the zero-crossing rate of unvoiced sounds is higher, while voiced speech has a spectrum that rolls off at high frequencies, so the zero-crossing rate of voiced sounds is lower. The short-time zero-crossing rate can also be used to find speech signals within background noise. In isolated-word speech recognition, a string of continuous speech must be segmented appropriately to determine the signal of each word, i.e., to find the start and end positions of each word. When the starting point of a word is determined with the average zero-crossing rate, the basis for judgment is that the zero-crossing rate before the voice starting point is low, while after the starting point it takes clearly higher values. In the presence of background noise, the average zero-crossing rate of the background noise is generally low, and the average zero-crossing rate of the initial segment of a word increases sharply, so the starting point of the word can be decided.
3. short-term information entropy
The perception of speech is closely related to the spectrum-analysis capability of the human auditory system. Spectral analysis of the speech signal is therefore an important method for understanding and processing speech. The speech signal is a typical non-stationary signal, but its non-stationarity is produced by the physical motion of the vocal organs, so the signal can be assumed to be stationary in the frequency domain over short intervals.
Information entropy is an important frequency-domain audio parameter that reflects the amount of information conveyed by the speech signal. Information entropy is often used in speech coding and decoding, and J. L. Shen first applied it to voice endpoint detection. The present invention likewise computes the information entropy of each frame, called the short-term information entropy, as follows:
(a) Use the short-time Fourier transform (FFT) to convert the signal of each frame from the time domain to the frequency domain:
S(f_i) = Σ_n s_n · w(n - k) · e^{-j·2π·f_i·n}
where s_n denotes the n-th audio sample and N is the total number of samples. Because the Fourier transform here is applied to one particular frame, it is equivalent to applying a window function w(n - k) to the Fourier transform; the value of k depends on which frame the short-time Fourier transform is applied to.
(b) Calculate the probability of occurrence of each frequency:
p_i = s(f_i) / Σ_{m=1..M} s(f_m)
where s(f_i) denotes the spectral energy of frequency f_i, p_i denotes the probability of occurrence of the corresponding frequency, and M denotes the total number of frequencies computed by the Fourier transform, i.e., the window width, taken here as 480.
The constraint conditions are defined as:
s(f_i) = 0 if f_i <= 250 Hz or f_i >= 3750 Hz
p_i = 0 if p_i >= 0.9
The first constraint formula is used to restrict attention to the frequency range of the speech signal: because the frequencies of human speech are basically concentrated between 250 Hz and 3750 Hz, frequencies are limited to this range. The second constraint formula is used to filter out noise that persists at a single frequency.
(c) Compute the speech information entropy:
H_i = - Σ_{m=1..M} p_m · log p_m
where M denotes the total number of frequencies computed by the Fourier transform, i.e., the window width; p_m denotes the probability of occurrence of the corresponding frequency; and H_i denotes the short-term information entropy of frame i.
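A minimal sketch of this entropy computation under the same assumptions (numpy, 10 ms frames at 48 kHz); the handling of the two constraints follows the formulas above:

```python
import numpy as np

def short_time_entropy(frame: np.ndarray, fs: int = 48000) -> float:
    """H_i: spectral entropy of one frame with the two constraints applied."""
    spectrum = np.abs(np.fft.rfft(frame.astype(np.float64))) ** 2   # s(f_i)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    spectrum[(freqs <= 250.0) | (freqs >= 3750.0)] = 0.0            # speech band only
    total = spectrum.sum()
    if total <= 0.0:
        return 0.0
    p = spectrum / total                                            # p_i
    p[p >= 0.9] = 0.0                                               # drop single-frequency noise
    nonzero = p[p > 0.0]
    return float(-np.sum(nonzero * np.log(nonzero)))
```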
Experiments show a large difference between the information entropy of speech signals and that of non-speech signals, which can therefore be used to locate voice endpoints. In many cases, especially when the background noise is mainly mechanical noise, using information entropy as the feature parameter is more reliable than simply using energy.
Under continuous background noise or background music, however, using information entropy for voice endpoint detection is very unreliable, because continuous background noise or background music, like speech, also carries a great deal of information. Comparatively, using energy as the feature parameter obtains better results in this case, because the superposition of speech and background noise is always larger than the background noise alone.
(2) Compute the short-time energy frequency value of each frame from the above audio feature parameters, and form a short-time energy frequency value sequence.
The short-time energy frequency value EZE-feature_i of frame i is defined as:
EZE-feature_i = (E_i - E_b) · (Z_i - Z_b) · (H_i - H_b)
where EZE-feature_i denotes the short-time energy frequency value of frame i; E_i, Z_i and H_i denote the short-time energy, short-time zero-crossing rate and short-term information entropy of frame i, respectively; and E_b, Z_b and H_b denote the short-time energy, short-time zero-crossing rate and short-term information entropy of the current background noise, respectively.
The short-time energy frequency value combines the speech features of the time domain and the frequency domain: short-time energy and short-time zero-crossing rate are time-domain audio feature parameters, while short-term information entropy is a frequency-domain audio feature parameter. Combining time-domain and frequency-domain parameters exploits their respective strengths and at the same time partly avoids their respective weaknesses, so that various different types of background noise can be handled effectively.
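Putting the three helpers sketched above together (again illustrative, reusing the hypothetical helper names from the earlier sketches):

```python
import numpy as np

def eze_sequence(frames: np.ndarray, noise_frames: np.ndarray) -> np.ndarray:
    """EZE-feature_i = (E_i - E_b)(Z_i - Z_b)(H_i - H_b) for every frame,
    with E_b, Z_b, H_b taken as means over the background-noise frames."""
    e_b = np.mean([short_time_energy(f) for f in noise_frames])
    z_b = np.mean([short_time_zcr(f) for f in noise_frames])
    h_b = np.mean([short_time_entropy(f) for f in noise_frames])
    return np.array([(short_time_energy(f) - e_b)
                     * (short_time_zcr(f) - z_b)
                     * (short_time_entropy(f) - h_b)
                     for f in frames])
```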
In view of the uncertainty of background noise and background music, the first few frames of the audio signal cannot always be used as the background noise. Instead, new audio frames should be chosen automatically as the background noise during the endpoint detection process, according to the speech detected so far.
First, the initial 10 ms of the audio file is assumed by default to be ambient sound, and the mean short-time energy, mean short-time zero-crossing rate and mean short-term information entropy of this 10 ms of signal are taken as the short-time energy E_b, short-time zero-crossing rate Z_b and short-term information entropy H_b of the initial background noise. The adaptive voice endpoint detection algorithm adopts a feedback mechanism for noise: when the background noise may have changed, the algorithm rolls back to the speech frames before the change and detects again. The process is as follows:
1) When a voice starting point, denoted frame F_h, is found and F_h is more than 300 ms away from the previous voice end point, frame F_t, the ambient noise is re-extracted.
2) Starting from frame F_t, take the following 10 frames as background noise and recompute the values of E_b, Z_b and H_b as arithmetic means; taking E_b as the example:
E_b = (1/10) Σ_{i=F_t+1..F_t+10} E_i
3) Starting from frame F_t + 1, recompute the short-time energy frequency value of each frame using the updated E_b, Z_b and H_b, obtaining a new short-time energy frequency value sequence.
4) Starting from frame F_t + 1, run the endpoint detection process again on the new short-time energy frequency value sequence.
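A sketch of steps 1)-2), reusing the helpers above; the roll-back bookkeeping of steps 3)-4) is left to the caller, and all names are illustrative:

```python
import numpy as np

def maybe_update_noise(frames: np.ndarray, start_frame: int,
                       prev_end_frame: int, frame_ms: int = 10):
    """If a new voice start lies > 300 ms after the previous voice end,
    re-estimate (E_b, Z_b, H_b) from the 10 frames after that end point."""
    if (start_frame - prev_end_frame) * frame_ms <= 300:
        return None                                   # noise assumed unchanged
    noise = frames[prev_end_frame + 1:prev_end_frame + 11]
    e_b = float(np.mean([short_time_energy(f) for f in noise]))
    z_b = float(np.mean([short_time_zcr(f) for f in noise]))
    h_b = float(np.mean([short_time_entropy(f) for f in noise]))
    return e_b, z_b, h_b   # caller recomputes the EZE sequence from prev_end_frame + 1
```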
(3) Detection process for the voice starting point
Studying the short-time energy frequency value waveforms of many audio files with the simulation tool Matlab reveals the following. In time periods where speech and music are superimposed, or where there is only speech, the short-time energy frequency value waveform varies violently: the frequency of variation is very high and the amplitude of variation is very large. In time periods with neither speech nor music, where there is only background noise, the short-time energy frequency value basically keeps a very small amplitude of variation, and the frequency of variation is relatively low. In addition, in time periods with only music and no speech, regardless of whether background noise is present, the high-frequency part of the music has been removed by the band-pass filter, so although the amplitude of variation of the short-time energy frequency value is still large, the frequency of its variation is much gentler than when speech is present.
Therefore, by computing the short-time energy frequency value sequence of an audio file and studying its waveform, the parts with violent variation and large amplitude can be found; these are the speech portions of the audio file, from which its voice endpoints can be found. The key to seeking voice endpoints is thus to find the parts of the short-time energy frequency value waveform with relatively large slopes and to judge whether they are endpoints of speech.
The flow for detecting the voice starting point is as follows:
1) Suppose the search starts from frame t (with short-time energy frequency value X_t). Examine the short-time energy frequency value of each frame after frame t until a frame j (with short-time energy frequency value X_j) is found such that
X_t <= X_{t+1} <= X_{t+2} <= ... <= X_j and X_{j+1} >= X_{j+2},
i.e., find the ascending interval of the short-time energy frequency value sequence that starts at frame t, denoted A_t.
2) Calculate the average slope of the short-time energy frequency value waveform over the ascending interval A_t that was just found. Within A_t, owing to the nature of human speech, the waveform does not rise smoothly; its slope changes continually, sometimes large and sometimes small. Therefore, although the waveform keeps an upward trend throughout A_t, only its average slope can be calculated:
R_t = (X_j - X_t) / (j - t)
3) Set a threshold R_m. If R_t >= R_m, i.e., the slope R_t is steep, the ascending interval A_t is considered to belong to a speech portion. There are two cases. If the interval before frame t is already considered a voice interval, a voice starting point has already been found and the matching voice end point must now be sought, so let t = j + 1 and search for the matching end point, thereby confirming one voice interval. Otherwise, if the interval before frame t is not considered a voice interval, mark frame t as the starting point of voice, then let t = j + 1 and search for the matching voice end point.
Conversely, if R_t < R_m, the slope R_t is relatively gentle. There are again two possibilities. One is that R_t is far smaller than R_m, mainly because the short-time energy frequency values X_t, X_j and so on are all small, indicating that the ascending interval A_t belongs to background noise. The other is that R_t is fairly large and only slightly smaller than R_m, indicating that A_t probably belongs to background music. There is no strict boundary between these two cases — in other words it cannot be determined whether a non-speech interval belongs to noise or to background music — but in both cases the interval A_t is considered not to be speech, so let t = j + 1 and loop the voice starting point detection.
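This loop can be sketched as follows (illustrative; it assumes the EZE sequence `x` produced by the sketches above and the average-slope test R_t = (X_j - X_t)/(j - t) from step 2)):

```python
def find_ascending_run(x, t):
    """Smallest j >= t with x[t] <= ... <= x[j] and the sequence turning down after j."""
    j = t
    while j + 1 < len(x) and x[j] <= x[j + 1]:
        j += 1
    return j

def detect_voice_start(x, t, r_m):
    """Scan ascending intervals from frame t; return (start_frame, next_t),
    or (None, len(x)) when no interval reaches the slope threshold R_m."""
    while t < len(x) - 1:
        j = find_ascending_run(x, t)
        if j > t and (x[j] - x[t]) / (j - t) >= r_m:
            return t, j + 1        # frame t is a voice starting point
        t = j + 1                  # gentle slope: noise or music, keep scanning
    return None, len(x)
```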
(4) Detection process for the voice end point
1) Suppose the search starts from frame t (with short-time energy frequency value X_t). Examine the short-time energy frequency value of each frame after frame t until a frame j (with short-time energy frequency value X_j) is found such that
X_t >= X_{t+1} >= X_{t+2} >= ... >= X_j and X_{j+1} <= X_{j+2},
i.e., find the descending interval of the short-time energy frequency value sequence that starts at frame t, denoted D_t.
2) Calculate the average slope of the short-time energy frequency value waveform over the descending interval D_t that was just found. Within D_t, owing to the nature of human speech, the waveform likewise does not descend smoothly. Therefore, although the waveform keeps a downward trend throughout D_t, only its average slope can be calculated. For a descending interval D_t, the average slope R_t would be negative; for convenience, X_t - X_j is used so that R_t becomes positive:
R_t = (X_t - X_j) / (j - t)
3) Similar to the detection of the voice starting point, set a threshold R_m. If R_t >= R_m, i.e., the slope R_t is steep, the descending interval D_t is considered to belong to a speech portion. There are two cases. If a voice starting point was found before frame t, the matching voice end point has now been found; mark frame t as the end point of voice, then let t = j + 1 and proceed with the detection of the next voice starting point. The other case is that no unmatched voice starting point was found before frame t, i.e., a descending interval corresponding to no voice starting point has been found; then D_t is an independent voice interval. In this case mark frame t as the voice starting point and frame j as the voice end point, then let t = j + 1 and continue seeking the next voice starting point.
Conversely, if R_t < R_m, the slope R_t is relatively gentle. As discussed for the detection of the voice starting point, the interval D_t is considered to belong to background noise or background music; in this case let t = j + 1 and loop the voice starting point detection.
The slope threshold R_m is a manually set value, since in the actual decision process there is no clear boundary between the short-time energy frequency value waveforms of speech, background music and background noise, and setting different thresholds yields different endpoint detection results. Whether the threshold is set appropriately therefore directly affects the accuracy of voice endpoint detection. Based on this analysis, the present invention proposes the following algorithm to calculate the threshold for the slope of the short-time energy frequency value waveform:
Step 1: Analyze the short-time energy frequency value sequence, find its minimum, denoted EZE-feature_min, and its maximum, denoted EZE-feature_max, then compute EZE-feature_max / 100.
Surveying the whole short-time energy frequency value waveform shows that its maximum EZE-feature_max is somewhat larger than the local maxima EZE-feature_a (the peaks of the individual waves in the waveform), whereas the minimum EZE-feature_min differs little from the local minima (the gentler parts of the waveform): both are very small values, so the difference between them can be ignored. Therefore 1/100 of the maximum EZE-feature_max is compared with the minimum EZE-feature_min.
Step 2: Compare EZE-feature_min with EZE-feature_max / 100 and take the greater of the two, denoted EZE-feature_slope.
Step 3: The threshold for the slope of the short-time energy frequency value is set to R_m = EZE-feature_slope × 2.
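As a one-function sketch of Steps 1-3 (numpy assumed; the name is illustrative):

```python
import numpy as np

def slope_threshold(eze: np.ndarray) -> float:
    """R_m = 2 * max(min(EZE), max(EZE) / 100)."""
    return 2.0 * max(float(np.min(eze)), float(np.max(eze)) / 100.0)
```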
The short-time energy frequency value slope threshold found by this method is valid for the whole short-time energy frequency value sequence, so it need not be modified again during the endpoint detection process. When no background music, or only a little, is present, this method basically satisfies the needs of seeking sentence endpoints, while its accuracy in seeking word endpoints is slightly worse. When background music is present continuously in the audio, however, the waveform of the short-time energy frequency value sequence becomes very complex, and this method no longer yields a satisfactory slope threshold, so the threshold must be set manually. In either case, manually setting and adjusting the slope threshold of the short-time energy frequency value greatly improves the accuracy of voice endpoint detection.
If a voice starting point is detected at the very end without a matching voice end point, the speech is considered to break off suddenly at the end of the audio file. This may be caused by the audio file losing some content during compression, or by other reasons.
Various waveforms may be encountered when analyzing the short-time energy frequency value sequence. For example, an ascending interval A_1 may belong to a speech portion, the immediately following descending interval D_1 may not, and the next ascending interval A_2 may again belong to a speech portion. Or a descending interval D_2 may belong to a speech portion, the immediately following ascending interval A_3 may not, and the next descending interval D_3 may again belong to a speech portion. In these two situations, although the slopes of the short-time energy frequency value waveform over the intervals D_1 and A_3 are small, these intervals lie between two voice intervals and generally last only an extremely short time; analysis shows that they correspond to the small pauses at vowel and diphthong boundaries in human pronunciation, and they should therefore not be regarded as non-speech portions. This is illustrated in Fig. 3a.
Another situation is that after a pair of voice starting and end points is found (corresponding to an ascending interval A_1 and a descending interval D_1), there follows an ascending interval A_2 that does not belong to speech, and then a descending interval D_2 that does. The descending interval D_2 then has no matching ascending interval, i.e., the voice end point found via D_2 has no voice starting point to pair with. In this case the start of the descending interval D_2 should be regarded as the voice starting point, paired with the terminus of D_2 as the voice end point. This is illustrated in Fig. 3b.
(5) Detection of sentence endpoints
Detecting the endpoints of sentences is one of the main purposes of the present invention, so the present invention does not place great emphasis on the accuracy of endpoint detection for single words, but rather on accurately finding the boundaries of sentences.
After the voice endpoint detection process above is finished, the voice endpoints of words can be found. An algorithm for seeking sentence endpoints is proposed here.
For the normal speaking rate of ordinary people, the time interval between sentences is roughly 100 ms, while the time interval between words is generally smaller, only a few tens of milliseconds. It is therefore reasonable to regard a gap of more than 100 ms between a voice end point and the next voice starting point as an interval between sentences.
Because the detected voice starting points and end points all occur in pairs, the voice endpoint sequence is traversed: first seek a voice end point F_e, then find the next voice starting point F_b. If F_e and F_b are more than 100 ms apart, the span between F_e and F_b is considered an interval between sentences. If F_e and F_b are less than 100 ms apart, they are not an interval between sentences, and F_e and F_b are marked as non-sentence endpoints. When the whole traversal is finished, all sentence endpoints have been found. The detection process is shown in Fig. 4.
The method of the present invention is not limited to the embodiments described above; other embodiments derived by those skilled in the art from the technical scheme of the present invention likewise belong to the scope of technological innovation of the present invention.
Claims (7)
1. A method for detecting voice intervals in an automatic caption generating system, comprising the following steps:
(1) Divide the audio sample sequence into fixed-length frames, calculate the short-time energy frequency value of each frame of the audio file, and form a short-time energy frequency value sequence: X_1, X_2, X_3, X_4, ..., X_n. The short-time energy frequency value of frame i is:
EZE-feature_i = (E_i - E_b) · (Z_i - Z_b) · (H_i - H_b)
where EZE-feature_i denotes the short-time energy frequency value of frame i; E_i, Z_i and H_i denote the short-time energy, short-time zero-crossing rate and short-term information entropy of frame i, respectively; and E_b, Z_b and H_b denote the short-time energy, short-time zero-crossing rate and short-term information entropy of the current background noise, respectively.
(2) Analyze the short-time energy frequency value sequence starting from the first frame. Suppose frame t is currently being analyzed; examine the short-time energy frequency value of each frame after frame t until a frame j is found such that
X_t <= X_{t+1} <= X_{t+2} <= ... <= X_j and X_{j+1} >= X_{j+2},
i.e., find the ascending interval of the sequence that starts at frame t, denoted A_t.
(3) Calculate the average slope R_t of the short-time energy frequency value waveform over the ascending interval A_t that was found:
R_t = (X_j - X_t) / (j - t)
where X_t is the short-time energy frequency value of frame t and X_j is the short-time energy frequency value of frame j.
(4) Set a threshold R_m to confirm the voice starting point. If R_t >= R_m and the interval before frame t is not considered a voice interval, mark frame t as the starting point of voice, then let t = j + 1 and search for the matching voice end point, thereby confirming one voice interval. If R_t < R_m, let t = j + 1 and repeat step (2).
2. The method for detecting voice intervals in an automatic caption generating system according to claim 1, wherein the step of seeking the voice end point in step (4) is as follows:
(a) Starting from frame t, examine the short-time energy frequency value of each frame after frame t until a frame k is found such that
X_t >= X_{t+1} >= X_{t+2} >= ... >= X_k and X_{k+1} <= X_{k+2},
i.e., find the descending interval of the sequence that starts at frame t, denoted D_t.
(b) Calculate the average slope R_t of the short-time energy frequency value waveform over the descending interval D_t that was found:
R_t = (X_t - X_k) / (k - t)
where X_t is the short-time energy frequency value of frame t and X_k is the short-time energy frequency value of frame k.
(c) Judge the end point of the speech signal via the set threshold R_m. If R_t >= R_m and a voice starting point has already been found, mark frame t as the voice end point corresponding to the previous voice starting point. If R_t < R_m, let t = k + 1 and repeat step (a).
3. The method for detecting voice intervals in an automatic caption generating system according to claim 2, wherein, if in step (c) R_t >= R_m but no unmatched voice starting point was found before frame t — that is, a descending interval D_t corresponding to no voice starting point has been found — then this descending interval D_t is an independent voice interval: frame t is marked as the voice starting point and frame k as the voice end point.
4. The method for detecting voice intervals in an automatic caption generating system according to claim 2, wherein, during the search for voice starting points and end points, if a descending interval D_1 that does not belong to a voice portion lies between two ascending intervals A_1 and A_2 that do belong to voice portions, or if an ascending interval A_3 that does not belong to a voice portion lies between two descending intervals D_2 and D_3 that do belong to voice portions, then both the descending interval D_1 and the ascending interval A_3 are regarded as belonging to the voice interval.
5. The method for detecting voice intervals in an automatic caption generating system according to claim 1 or claim 2, wherein, when seeking a pair of voice starting and end points, the threshold R_m is determined as follows:
(i) Analyze the current short-time energy frequency value sequence, find its minimum, denoted EZE-feature_min, and its maximum, denoted EZE-feature_max, then compute EZE-feature_max / 100.
(ii) Compare EZE-feature_min with EZE-feature_max / 100 and take the greater of the two, denoted EZE-feature_slope.
(iii) Set the threshold R_m = EZE-feature_slope × 2.
6. The method for detecting voice intervals in an automatic caption generating system according to claim 2, wherein, after all voice starting points and end points have been detected, the voice endpoint sequence is traversed, successively seeking a voice end point F_e and the next voice starting point F_b; if F_e and F_b are separated by more than a specified time interval, the span between F_e and F_b is confirmed as an interval between sentences, and F_e and F_b are marked as sentence endpoints; this process is repeated until all sentence endpoints are determined. The time interval used for judging sentence endpoints is 100 ms.
7. The method for detecting voice intervals in an automatic caption generating system according to claim 1, wherein in step (1) the audio sample sequence is divided into frames of 10 ms length.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN2008101164605A CN101625862B (en) | 2008-07-10 | 2008-07-10 | Method for detecting voice interval in automatic caption generating system
Publications (2)
Publication Number | Publication Date |
---|---|
CN101625862A CN101625862A (en) | 2010-01-13 |
CN101625862B true CN101625862B (en) | 2012-07-18 |
Family
ID=41521681
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2008101164605A Expired - Fee Related CN101625862B (en) | 2008-07-10 | 2008-07-10 | Method for detecting voice interval in automatic caption generating system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101625862B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1763844A (en) * | 2004-10-18 | 2006-04-26 | 中国科学院声学研究所 | End-point detecting method, device and speech recognition system based on moving window |
EP1780704A1 (en) * | 2005-10-28 | 2007-05-02 | Samsung Electronics Co., Ltd. | Voice signal detection system and method |
CN101197130A (en) * | 2006-12-07 | 2008-06-11 | 华为技术有限公司 | Sound activity detecting method and detector thereof |
Non-Patent Citations (2)
Title |
---|
Qi Li et al., "A Robust, Real-Time Endpoint Detector with Energy Normalization for ASR in Adverse Environments", Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), 2001. |
Qi Li et al., "Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition", IEEE Transactions on Speech and Audio Processing, 2002. |
Also Published As
Publication number | Publication date |
---|---|
CN101625862A (en) | 2010-01-13 |
Legal Events
Code | Title | Description
---|---|---
C06 / PB01 | Publication |
C10 / SE01 | Entry into substantive examination |
C14 / GR01 | Grant of patent or utility model |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20120718; Termination date: 20170710