US20100036663A1 - Speech Detection Using Order Statistics - Google Patents
Speech Detection Using Order Statistics
Info
- Publication number
- US20100036663A1 US12/515,536
- Authority
- US
- United States
- Prior art keywords
- frames
- threshold
- entropy
- speech
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L1/00—Arrangements for detecting or preventing errors in the information received
- H04L1/0001—Systems modifying transmission characteristics according to link quality, e.g. power backoff
- H04L1/0015—Systems modifying transmission characteristics according to link quality, e.g. power backoff characterised by the adaptation strategy
- H04L1/0019—Systems modifying transmission characteristics according to link quality, e.g. power backoff characterised by the adaptation strategy in which mode-switching is based on a statistical approach
- H04L1/0021—Systems modifying transmission characteristics according to link quality, e.g. power backoff characterised by the adaptation strategy in which mode-switching is based on a statistical approach in which the algorithm uses adaptive thresholds
Definitions
- the threshold is a moving average and it is initialized to the mean of the first two entropy values
- x min is initialized to x max , wherein if the current frame is active, then x min will be equal to minimum entropy observed over all the frames until the current frame. If the frame is marked inactive then the x min is equal to x max .
- the entropy is calculated 204 for each frame starting from the third frame and as-and-when a recorded speech frame is available. A check is performed to determine if the entropy calculated is greater than the x max 205 . If the entropy calculated 204 is greater than the x max 205 , the x max and the threshold are calculated as follows 206 ,
- the frame is marked as active by assigning bspeechframe to 1 208 .
- a check is performed to determine if a new x min is achieved, and if the x min is greater than entropy 210 , a new threshold is calculated as follows 211 :
- the frame is marked as inactive by assigning bspeechframe to 0 and initializing x min to x max 209 .
- a new threshold value is calculated as 81.3% of the sum of the threshold and increment.
- FIG. 3 illustrates the Pseudo code of the ATESOS algorithm used in the detection and the separation of the active speech frames from the inactive speech frames in an audio signal.
- FIG. 4 illustrates the system diagram that implements the separation of the active speech frames from the inactive speech frames.
- the analog audio input is taken from the sensor 401 .
- the analog audio input can be a speech file that is directly fed to the analog to digital converter 402 .
- the analog audio input is passed through an analog to digital converter 402 for analog to digital conversion.
- the digital audio signal is then passed into a fixed-sized buffer 403 to convert the digital audio signal into frames of a particular size.
- the digitized and buffered audio signal converted to frames is then passed through the central processing unit 404 .
- the microprocessor located in the central processing unit 404 applies the ATESOS algorithm and differentiates the active speech frames from the inactive speech frames.
- the network interface module 405 accepts only the active speech frames and transmits them over the internet in the form of packets.
- the central processing unit 404 computes spacings of order statistics in a statistical sample for said sampled frames, measures the entropy of each of said sampled frames, sets threshold for entropy and marks the audio frames.
- the step of marking comprises marking the audio frame as an inactive speech frame when the entropy is greater than the threshold, and marking the audio frame as an active speech frame when the entropy is lesser than the threshold.
- the inactive speech frames are not received.
- the silence created by inactive frames at the transmitter is substituted by comfort noise making the listener perceive that the inactive frames were transmitted.
- FIG. 5A illustrates the speech signal for utterances of “/Hello/, /One/, /Two/, /Three/” with deliberate pauses in between the words.
- FIGS. 5B and 5C illustrate the output waveforms from the application of the ATESOS algorithm, for a zero noise condition and 20 ms frame size.
- FIG. 5B illustrates the entropy obtained from the spacings of the order statistics using the equation (1), described earlier under the description of FIG. 2 .
- the dotted line 501 in FIG. 5B illustrates the threshold values for the respective entropy values.
- FIG. 5C illustrates the decision taken by the ATESOS algorithm.
- the speech frame is marked as active if decision is 1, and inactive if the decision is 0.
- the decision is 1 when the entropy value is less than the threshold.
- FIG. 6A illustrates the speech signal of utterances of “/Hello/, /One/, /Two/, /Three/” with deliberate pauses in between the words.
- the signal is corrupted with additive babble noise and the overall SNR is 5 dB.
- FIG. 6B illustrates the entropy obtained from the spacings of the Order Statistics using equation (1).
- the frame size considered is 20 ms.
- the dotted line in FIG. 6B illustrates the threshold values for the respective entropy values.
- FIG. 6C illustrates the decision taken by the ATESOS algorithm.
- the speech frame is marked as active if the decision is 1 and inactive if the decision is 0.
- the decision is 1 when the entropy value is less than the threshold and the decision is 0 if the entropy value is greater than the threshold.
- FIG. 7A illustrates the speech signal for utterances of “/Hello/, /One/, /Two/, /Three/” with deliberate pauses in between the words.
- FIG. 7B and FIG. 7C illustrate the output waveform from the implementation of the speech activity detection (SpAD) algorithm.
- FIG. 7B illustrates the entropy obtained from the spacings of the order statistics using equation (1).
- the frame size considered is 60 ms.
- the dotted line in FIG. 7B illustrates the adaptive threshold values for the respective entropy values.
- the adaptive threshold is computed using the ATESOS algorithm described in FIG. 2 and FIG. 3 .
- FIG. 7C illustrates the decision taken by the ATESOS algorithm.
- the speech frame is marked as active if the decision is 1 and inactive if the decision is 0.
- the decision is 1 when the entropy value is less than the threshold and the decision is 0 if the entropy value is greater than the threshold.
- FIG. 8 illustrates the calculation of x min and x max .
- 801 points to the location on the entropy curve from where the x min starts decreasing in that active speech region.
- 802 points to the least value that x min reaches in that active speech region.
- 803 points to the location in the entropy curve where x max reaches the highest value.
- the method and system disclosed herein accomplishes a greater saving in bandwidth by detection of the speech/active signal by efficaciously discriminating from non-speech.
- the speech burst is located accurately.
- Information on the location of the speech burst may be provided to an echo cancellation module (not shown in figure).
- the identification of the location of the speech burst aids in the process of subtracting the return signal in VoIP systems.
- the location of the speech burst can be provided to a speech recognition module (not shown in figure) for accurately mapping and identifying the words.
- the ATESOS algorithm may be used to preprocess audio data for a speech recognition module.
- the location of the speech burst can be provided to coding modules for reducing the level of computation required for coding the speech data.
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Quality & Reliability (AREA)
- Computer Networks & Wireless Communication (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Telephonic Communication Services (AREA)
- Monitoring And Testing Of Exchanges (AREA)
Abstract
Description
- This invention in general relates to a method for reducing the total bandwidth requirement for voice-enabled applications over the Internet and specifically relates to a method of separating speech signal from non-speech signal.
- Given the rapid growth in Internet traffic, there is a shortage in the bandwidth available for the transfer of data for voice over IP applications. Speech signals consist of non-speech segments and speech segments. Non-speech segments do not contribute to comprehension and may contain noise or disturbances which are undesirable, and may cause deterioration. However, all segments, speech or otherwise, demand bandwidth for transmission. Moreover, in the context of speech recognition, segmentation of the input speech stream into “speech” and “non-speech” is the precursor to applying recognition algorithms.
- Bandwidth optimization is achieved by speech compression using low bit rate codecs integrated with Voice Activity Detection (VAD). Further optimization is usually achieved by one of the following two methods. In the first method, a VAD scheme, usually based on energy and zero-crossing methods, is embedded in the codec. Examples are G.729, Global System for Mobile communication (GSM), Adaptive Multi Rate (AMR), G.722 and 3rd Generation Partnership Project (3GPP). In the second method, the VAD scheme is not embedded in the codec block. Selecting talk spurts and avoiding codec processing of non-speech segments at the transmitter has the additional advantage of reducing the computational load on the codec itself. This is particularly significant as the number of streams grows. In such a setup, the VAD is independent of the speech codec. Portability across codecs is an added advantage, since any codec can be used after a stand-alone VAD has removed the non-speech part of the stream.
- There is an unmet market need for a method and a system that effectively removes the non-speech component in a voice over internet protocol (VoIP) based communication system.
- The method and system disclosed herein seek to separate the speech segments from the non-speech segments in an audio signal and transmit only the speech segments over the Internet. An entropy measure derived from spacings of order statistics of speech frames is used to differentiate non-speech and/or silent (inactive) zones from speech (active) zones. Non-speech segments are not transmitted; they are replaced, in general, by “comfort noise” during playout at the receiver's end, thereby increasing the proportion of available bandwidth for other users of the Internet. The present invention achieves a greater saving in bandwidth by detecting the speech (active) signal and effectively discriminating it from non-speech. A threshold is devised and applied to detect the speech and non-speech segments in real time.
- The method and system disclosed herein enable speech activity detection through Adaptation of Threshold computed from Entropy derived from Spacings of Order Statistics, hereinafter referred to as ATESOS.
- The method and the system disclosed herein are scalable across different frame sizes.
- The method and the system disclosed herein determine the boundaries between contiguous active and inactive zones with sharper accuracy, thereby improving the effectiveness of speech spurt detection and speech recognition.
- The method and the system disclosed herein can be implemented in low signal-to-noise ratio (SNR) environments, since frame classification (or packet classification in a packet-switched network such as VoIP) is independent of the signal energy and depends only on the signal entropy.
- Thus, the method and the system disclosed herein are applicable to packet-switched networks for improving bandwidth utilization.
- The method and the system disclosed herein initialize the threshold by observing merely two initial packets. Hence, the decision to differentiate speech from non-speech segments is made almost instantaneously. This rapid decision-making process minimizes delay to the extent that it is perceived as effectively on-line and real-time in its implementation.
- The method and the system disclosed herein are applicable to VoIP, speech recognition, speech-to-text, biometrics, etc. Speech boundary segmentation in the context of speech-to-text, speech recognition and speaker recognition is one example. Further, the system and method are adaptable to varying quantization levels, for instance 8-bit and 16-bit. The scheme is therefore portable in its present form, with equal effectiveness, across different quantization levels.
- The method and the system disclosed herein are less sensitive to the characteristics of the microphone employed to capture the original speech stream.
- The foregoing summary, as well as the following detailed description of the embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the invention, there is shown in the drawings exemplary constructions of the invention; however, the invention is not limited to the specific methods and instrumentalities disclosed.
- FIG. 1 illustrates a method for reducing total bandwidth requirement for voice-enabled applications over the internet by transmitting only the frames containing active speech segments.
- FIG. 2 illustrates the flowchart for the ATESOS algorithm used in the detection and the separation of the active speech frames from the inactive speech frames in an audio signal.
- FIG. 3 illustrates the pseudocode of the ATESOS algorithm used in the detection and the separation of the active speech frames from the inactive speech frames in an audio signal.
- FIG. 4 illustrates the system diagram that implements the separation of the active speech frames from the inactive speech frames.
- FIG. 5A illustrates the speech signal for utterances of “/Hello/, /One/, /Two/, /Three/” with deliberate pauses in between the words.
- FIGS. 5B and 5C illustrate the output waveforms from the application of the ATESOS algorithm, for a zero noise condition and 20 ms frame size.
- FIG. 6 illustrates the output waveform for speech activity detection (SpAD) with 5 dB babble noise.
- FIG. 6A illustrates the speech signal of utterances of “/Hello/, /One/, /Two/, /Three/” with deliberate pauses in between the words.
- FIG. 6B illustrates the entropy obtained from the spacings of the order statistics.
- FIG. 6C illustrates the decision taken by the ATESOS algorithm.
- FIG. 7A illustrates the speech signal for utterances of “/Hello/, /One/, /Two/, /Three/” with deliberate pauses in between the words.
- FIGS. 7B and 7C illustrate the output waveform from the implementation of the speech activity detection (SpAD) algorithm.
- FIG. 8 illustrates how xmin and xmax are determined for calculating the threshold.
- FIG. 1 illustrates a method for reducing the total bandwidth requirement for voice-enabled applications over the Internet by transmitting only the frames that contain active speech segments. The analog input audio signal is sampled 101 and converted into a digital signal 102. The sampled digital audio signal is then divided into audio frames of a fixed size 103.
- The spacings of order statistics are computed for these audio frames 104. Any intelligible speech segment, such as human speech or music, contains redundant information, whereas noise or non-intelligible speech is characterized by less redundancy; i.e., it possesses “high” information content. Entropy is a measure of information content. It follows that intelligible speech segments have lower entropy, or randomness, and non-intelligible speech segments have higher entropy. A statistical analysis of intelligible speech vis-à-vis non-intelligible speech reveals that, over the mixed sample, non-intelligible speech segments have probabilities closer to the mean of the sample, whereas the probabilities associated with intelligible speech lie away from the mean and have a larger variance.
- Entropy for each of the frames is calculated 105. Entropy is measured at each of the input instances, i.e. at the occurrence of the audio signal at a given time. A threshold is set for a first set of frames based on the entropy measured 106. The first set of frames may comprise one or more frames. The threshold for a second set of frames is equal to the threshold for the first set of frames plus an increment. The increment may be positive or negative. The threshold for each frame in the second set may vary depending on the entropy of the frame and on the threshold of the preceding frame plus the increment. The second set of frames may comprise one or more frames. The maximum and the minimum values of entropy are calculated for different input instances. If the entropy of the frame under consideration is greater than the threshold, then the frame is marked inactive; otherwise the frame is marked active 107. The active speech frames are transmitted 108.
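As a rough illustration of steps 103 and 107 only, the following Python sketch cuts a sample stream into fixed-size frames and applies the stated decision rule (entropy greater than the threshold means inactive). The function names, the choice to drop a trailing partial frame, and the 8 kHz sample rate used in the usage note are assumptions made for illustration; they are not taken from the patent.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float],
                      sample_rate_hz: int,
                      frame_ms: int) -> List[List[float]]:
    """Cut a sample stream into consecutive fixed-size frames (step 103).

    Assumption: a trailing partial frame is simply dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [list(samples[k * frame_len:(k + 1) * frame_len])
            for k in range(n_frames)]


def mark_frame(entropy: float, threshold: float) -> int:
    """Decision rule of step 107: 0 (inactive) if the entropy exceeds
    the threshold, otherwise 1 (active)."""
    return 0 if entropy > threshold else 1
```

For example, with a hypothetical 8 kHz sample rate, `split_into_frames(samples, 8000, 20)` yields frames of 160 samples each, matching the 20 ms frame size used in FIGS. 5B and 5C.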
- An adaptive threshold is achieved by sensitizing the threshold to the varying entropy values of input frames as they stream in. The value to be added to, or subtracted from, the threshold, called the increment, is determined by two variables: xmax and xmin. xmax is the maximum entropy attained until the current frame; xmin depends on whether the frame is active or inactive. The increment is calculated as a percentage of the sum of xmax and xmin. In particular, a non-limiting example of the invention uses 10% of the sum of xmax and xmin as the increment. If the current frame is active, then xmin is equal to the minimum entropy observed over all the frames until the current frame in the given talk spurt. A talk spurt consists of consecutive frames marked as active. Usually, speech frames occur in bursts, and similarly silence frames occur in bursts. If the frame is marked as inactive, then xmin is equal to xmax. Therefore xmin is high if the current frame is inactive and low if the current frame is active.
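The bookkeeping for xmax, xmin and the increment described above can be sketched as follows. This is a minimal sketch, assuming the active/inactive decision for the current frame has already been made; the class and function names are hypothetical, and the sketch deliberately stops short of the threshold update itself.

```python
from dataclasses import dataclass


@dataclass
class EntropyExtremes:
    x_max: float  # maximum entropy attained until the current frame
    x_min: float  # minimum entropy within the current talk spurt


def update_extremes(state: EntropyExtremes,
                    entropy: float,
                    is_active: bool) -> EntropyExtremes:
    """Apply the stated rules for one frame.

    x_max is a running maximum. For an active frame x_min tracks the
    lowest entropy seen in the current talk spurt; for an inactive
    frame x_min is reset to x_max.
    """
    x_max = max(state.x_max, entropy)
    if is_active:
        x_min = min(state.x_min, entropy)
    else:
        x_min = x_max
    return EntropyExtremes(x_max=x_max, x_min=x_min)


def increment(state: EntropyExtremes, fraction: float = 0.10) -> float:
    """Increment as a fraction of (x_max + x_min); 10% is the
    non-limiting example value given in the text."""
    return fraction * (state.x_max + state.x_min)
```

Because x_min is reset to x_max on every inactive frame, the running minimum taken during the next talk spurt starts from x_max and then follows the entropy curve downward.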
- FIG. 8 illustrates the calculation of xmin and xmax. Initially xmin and xmax are calculated.
- As stated above, xmax is the maximum entropy attained until the current frame, so the graph of xmax is monotonically non-decreasing. Hence there is a need to sensitize the threshold to the varying nature of the input to the sensor 401. A change in xmax results in a new threshold value. This new threshold value is a step closer to xmax. The value of xmin depends on whether the current frame is active or inactive. For an active frame, the entropy of the current frame is compared with xmin. If the entropy is less than xmin, then xmin is updated and a new threshold is calculated 211. Once the least value in that active speech segment has been reached, xmin and, hence, the threshold do not change.
- In an inactive speech segment, xmin and xmax are equal. A change only in xmax results in adaptation of the threshold to the dynamics of the input.
- Due to the variation of xmin, the increment will be a small step in the direction of the movement of the entropy curve. The threshold is recalculated only if there is a change in either xmax or xmin. The active speech frames are separated from the inactive speech frames and are transmitted over the Internet 108. Thus the transmitted frames consist of only the active speech frames, thereby reducing the bandwidth requirement for voice-enabled applications over the Internet.
- FIG. 2 illustrates the flowchart for the ATESOS algorithm used in the detection and the separation of the active speech frames from the inactive speech frames in an audio signal. The ATESOS algorithm marks each speech frame as active or inactive with reference to the threshold.
- For the recorded speech frames 201, the entropy of the first two frames is calculated using spacings of order statistics. The first two frames are represented by j=1, 2. For the values of j=1 to 2 202, the entropy H for the frames is calculated using the formula
-
- where
-
- $Y_{i+m} - Y_i$ for $1 \le i < i+m \le N$ is the m-spacing of the nth order statistic.
- N is the number of samples in a frame
- Y is the set of ordered samples of a frame
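As an illustration of how an entropy value can be obtained from m-spacings of the ordered samples, the following Python sketch uses a common textbook form of the m-spacing estimator, namely the average of log((N+1)/m · (Y_{i+m} − Y_i)) over i = 1 … N−m. This particular form, the default choice m ≈ √N, the small constant `eps` that guards against zero spacings, and the function name `m_spacing_entropy` are all assumptions made for the example; the formula of the specification itself is not reproduced in this text and may differ.

```python
import math
from typing import Optional, Sequence


def m_spacing_entropy(frame: Sequence[float], m: Optional[int] = None,
                      eps: float = 1e-12) -> float:
    """Estimate the entropy of one frame from m-spacings of its order statistics.

    Assumed estimator (not taken from the specification):
        H = 1/(N-m) * sum_{i=1}^{N-m} log((N+1)/m * (Y[i+m] - Y[i]))
    """
    y = sorted(frame)            # Y: the ordered samples of the frame
    n = len(y)                   # N: number of samples in the frame
    if n < 2:
        raise ValueError("a frame needs at least two samples")
    if m is None:
        m = max(1, round(math.sqrt(n)))   # assumed default spacing
    if not 1 <= m < n:
        raise ValueError("m must satisfy 1 <= m < N")

    total = 0.0
    for i in range(n - m):
        spacing = y[i + m] - y[i]          # m-spacing Y[i+m] - Y[i]
        # eps keeps the logarithm finite when quantized samples repeat
        total += math.log((n + 1) / m * spacing + eps)
    return total / (n - m)
```

Under this estimator, a frame whose samples are spread over a wide range yields a higher value than a frame whose samples are concentrated in a narrow range, which matches the qualitative use of entropy in the text.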
- For j=3 203, the maximum value, xmax, over the first two frames is calculated, wherein xmax is the maximum entropy attained until the current frame.
- xmax = max{Ĥ(j)} ∀ j = 1 to 2
- The threshold is a moving average and is initialized to the mean of the first two entropy values.
- threshold = mean{Ĥ(1), Ĥ(2)}
- xmin is initialized to xmax. If the current frame is active, xmin will be equal to the minimum entropy observed over all the frames until the current frame. If the frame is marked inactive, xmin is equal to xmax.
- xmin = xmax
- The entropy is calculated 204 for each frame starting from the third frame, as and when a recorded speech frame becomes available. A check is performed to determine whether the calculated entropy is greater than xmax 205. If the calculated entropy 204 is greater than xmax 205, xmax and the threshold are calculated as follows 206:
- (xmax < Ĥ(j))
- xmax = Ĥ(j)
- incr = (xmax + xmin)/10
- threshold = (threshold + incr)/1.23
- If the entropy obtained for the current frame is less than the threshold 207, the frame is marked as active by setting bSpeechFrame to 1 208.
- (Ĥ(j) < threshold)
- bSpeechFrame = 1
- nCompression = nCompression + 1
- A check is performed to determine whether a new xmin is achieved. If xmin is greater than the entropy 210, a new threshold is calculated as follows 211:
- (xmin > Ĥ(j))
- xmin = Ĥ(j)
- incr = (xmax + xmin)/10
- threshold = (threshold + incr)/1.23
- If the entropy calculated for the frame is greater than the threshold, then the frame is marked as inactive by setting bSpeechFrame to 0 and resetting xmin to xmax 209. A new threshold value is calculated as 81.3% of the sum of the threshold and the increment, which is equivalent to dividing that sum by 1.23.
- bSpeechFrame = 0
- xmin = xmax
- If bSpeechFrame is zero, the transmission of speech frames is withheld, i.e., the conversation is in ‘silence’. Similarly, consecutive frames marked bSpeechFrame=1 constitute a talk spurt.
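The steps 202 to 211 above can be put together as one loop over per-frame entropy values. The Python sketch below is a reading of those steps rather than the pseudo code of FIG. 3, which is not reproduced here. The function name `atesos_decisions` and its return format are hypothetical, and two points are interpretive assumptions: a frame whose entropy exactly equals the threshold is treated as inactive, and the threshold is not recomputed in the inactive branch (only xmin is reset to xmax, as in the listed assignments).

```python
from typing import List, Sequence, Tuple


def atesos_decisions(entropies: Sequence[float]) -> List[Tuple[int, float]]:
    """Return (bSpeechFrame, threshold) for every frame from the third onward."""
    if len(entropies) < 2:
        raise ValueError("two frames are needed to initialise the threshold")

    # Initialisation from the first two frames (steps 202 and 203).
    x_max = max(entropies[0], entropies[1])
    threshold = (entropies[0] + entropies[1]) / 2.0
    x_min = x_max

    results: List[Tuple[int, float]] = []
    for h in entropies[2:]:
        # Steps 205 and 206: a new maximum moves the threshold.
        if x_max < h:
            x_max = h
            threshold = (threshold + (x_max + x_min) / 10.0) / 1.23

        if h < threshold:
            # Steps 207 and 208: the frame is active.
            speech = 1
            # Steps 210 and 211: a new minimum in the talk spurt moves the threshold.
            if x_min > h:
                x_min = h
                threshold = (threshold + (x_max + x_min) / 10.0) / 1.23
        else:
            # Step 209: the frame is inactive and x_min is reset to x_max.
            # Assumption: the threshold itself is left unchanged here.
            speech = 0
            x_min = x_max

        results.append((speech, threshold))
    return results
```

For the entropy sequence 5.0, 5.0, 2.0, 2.0, 6.0, the third and fourth frames fall below the threshold and are marked active, while the fifth frame raises xmax, stays above the updated threshold, and is marked inactive.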
- FIG. 3 illustrates the pseudo code of the ATESOS algorithm used in the detection and the separation of the active speech frames from the inactive speech frames in an audio signal.
- FIG. 4 illustrates the system diagram that implements the separation of the active speech frames from the inactive speech frames. The analog audio input is taken from the sensor 401. Optionally, the analog audio input can be a speech file that is fed directly to the analog to digital converter 402. The analog audio input is passed through an analog to digital converter 402 for analog to digital conversion. The digital audio signal is then passed into a fixed-sized buffer 403 to convert the digital audio signal into frames of a particular size. The digitized and buffered audio signal converted to frames is then passed through the central processing unit 404. The microprocessor located in the central processing unit 404 applies the ATESOS algorithm and differentiates the active speech frames from the inactive speech frames. The network interface module 405 accepts only the active speech frames and transmits them over the Internet in the form of packets. The central processing unit 404 computes spacings of order statistics in a statistical sample for said sampled frames, measures the entropy of each of said sampled frames, sets the threshold for entropy and marks the audio frames. The step of marking comprises marking the audio frame as an inactive speech frame when the entropy is greater than the threshold, and marking the audio frame as an active speech frame when the entropy is less than the threshold.
- At the receiver, the inactive speech frames are not received. During playout of the buffers, the silence created by inactive frames at the transmitter is substituted by comfort noise, making the listener perceive that the inactive frames were transmitted.
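The role of the fixed-sized buffer 403 can be illustrated with a short Python sketch that cuts a stream of digitized samples into frames of equal duration. The function name `split_into_frames`, the parameter names, and the decision to drop a trailing partial frame are assumptions for the example; the specification does not state a sampling rate, so the 8000 Hz used in the usage note is likewise only an assumed value.

```python
from typing import Iterator, Sequence


def split_into_frames(samples: Sequence[int], sample_rate_hz: int,
                      frame_ms: int) -> Iterator[Sequence[int]]:
    """Yield consecutive, non-overlapping frames of frame_ms milliseconds.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = sample_rate_hz * frame_ms // 1000   # samples per frame
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        yield samples[start:start + frame_len]
```

At an assumed sampling rate of 8000 Hz, a 20 ms frame holds 160 samples and a 60 ms frame holds 480 samples; each yielded frame would then be passed to the entropy calculation and the threshold comparison.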
- FIG. 5A illustrates the speech signal for utterances of “/Hello/, /One/, /Two/, /Three/” with deliberate pauses in between the words.
- FIGS. 5B and 5C illustrate the output waveforms from the application of the ATESOS algorithm, for a zero noise condition and a 20 ms frame size.
- FIG. 5B illustrates the entropy obtained from the spacings of the order statistics using equation (1), described earlier under the description of FIG. 2. The dotted line 501 in FIG. 5B illustrates the threshold values for the respective entropy values.
- FIG. 5C illustrates the decision taken by the ATESOS algorithm. The speech frame is marked as active if the decision is 1, and inactive if the decision is 0. The decision is 1 when the entropy value is less than the threshold.
- FIG. 6A illustrates the speech signal for utterances of “/Hello/, /One/, /Two/, /Three/” with deliberate pauses in between the words. The signal is corrupted with additive babble noise and the overall SNR is 5 dB.
- FIG. 6B illustrates the entropy obtained from the spacings of the order statistics using equation (1). The frame size considered is 20 ms. The dotted line in FIG. 6B illustrates the threshold values for the respective entropy values.
- FIG. 6C illustrates the decision taken by the ATESOS algorithm. The speech frame is marked as active if the decision is 1 and inactive if the decision is 0. The decision is 1 when the entropy value is less than the threshold and 0 when the entropy value is greater than the threshold.
- FIG. 7A illustrates the speech signal for utterances of “/Hello/, /One/, /Two/, /Three/” with deliberate pauses in between the words. FIG. 7B and FIG. 7C illustrate the output waveforms from the implementation of the speech activity detection (SpAD) algorithm.
- FIG. 7B illustrates the entropy obtained from the spacings of the order statistics using equation (1). The frame size considered is 60 ms. The dotted line in FIG. 7B illustrates the adaptive threshold values for the respective entropy values. The adaptive threshold is computed using the ATESOS algorithm described in FIG. 2 and FIG. 3.
- FIG. 7C illustrates the decision taken by the ATESOS algorithm. The speech frame is marked as active if the decision is 1 and inactive if the decision is 0. The decision is 1 when the entropy value is less than the threshold and 0 when the entropy value is greater than the threshold.
- FIG. 8 illustrates the calculation of xmin and xmax. 801 points to the location on the entropy curve from where xmin starts decreasing in that active speech region. 802 points to the least value that xmin reaches in that active speech region. 803 points to the location on the entropy curve where xmax reaches its highest value.
- The method and system disclosed herein accomplish a greater saving in bandwidth by detecting the speech/active signal and efficaciously discriminating it from non-speech.
- Using the ATESOS algorithm, the speech burst is located accurately. Information on the location of the speech burst may be provided to an echo cancellation module (not shown in figure). The identification of the location of the speech burst aids in the process of subtracting the return signal in VoIP systems.
- The location of the speech burst can be provided to a speech recognition module (not shown in figure) for accurately mapping and identifying the words. The ATESOS algorithm may be used to preprocess audio data for a speech recognition module.
- The location of the speech burst can be provided to coding modules for reducing the level of computation required for coding the speech data.
- The foregoing examples have been provided merely for the purpose of explanation and are in no way to be construed as limiting of the present method and system disclosed herein. While the invention has been described with reference to various embodiments, it is understood that the words that have been used herein are words of description and illustration, rather than words of limitations. Further, although the invention has been described herein with reference to particular means, materials and embodiments, the invention is not intended to be limited to the particulars disclosed herein; rather, the invention extends to all functionally equivalent structures, methods and uses, such as are within the scope of the appended claims. Those skilled in the art, having the benefit of the teachings of this specification, may effect numerous modifications thereto and changes may be made without departing from the scope and spirit of the invention in its aspects.
Claims (11)
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/IN2007/000028 WO2008090564A2 (en) | 2007-01-24 | 2007-01-24 | Speech activity detection |
Publications (2)
Publication Number | Publication Date |
---|---|
US20100036663A1 true US20100036663A1 (en) | 2010-02-11 |
US8380494B2 US8380494B2 (en) | 2013-02-19 |
Family
ID=39644962
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/515,536 Expired - Fee Related US8380494B2 (en) | 2007-01-24 | 2007-01-24 | Speech detection using order statistics |
Country Status (2)
Country | Link |
---|---|
US (1) | US8380494B2 (en) |
WO (1) | WO2008090564A2 (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110218803A1 (en) * | 2010-03-04 | 2011-09-08 | Deutsche Telekom Ag | Method and system for assessing intelligibility of speech represented by a speech signal |
US20120253813A1 (en) * | 2011-03-31 | 2012-10-04 | Oki Electric Industry Co., Ltd. | Speech segment determination device, and storage medium |
US20140040722A1 (en) * | 2012-08-02 | 2014-02-06 | Nuance Communications, Inc. | Methods and apparatus for voiced-enabling a web application |
US20140039885A1 (en) * | 2012-08-02 | 2014-02-06 | Nuance Communications, Inc. | Methods and apparatus for voice-enabling a web application |
US20160078873A1 (en) * | 2013-05-30 | 2016-03-17 | Huawei Technologies Co., Ltd. | Signal encoding method and device |
US9292253B2 (en) | 2012-08-02 | 2016-03-22 | Nuance Communications, Inc. | Methods and apparatus for voiced-enabling a web application |
US9292252B2 (en) | 2012-08-02 | 2016-03-22 | Nuance Communications, Inc. | Methods and apparatus for voiced-enabling a web application |
US10157612B2 (en) | 2012-08-02 | 2018-12-18 | Nuance Communications, Inc. | Methods and apparatus for voice-enabling a web application |
US11527265B2 (en) * | 2018-11-02 | 2022-12-13 | BriefCam Ltd. | Method and system for automatic object-aware video or audio redaction |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130090926A1 (en) * | 2011-09-16 | 2013-04-11 | Qualcomm Incorporated | Mobile device context information using speech detection |
CN108389575B (en) * | 2018-01-11 | 2020-06-26 | 苏州思必驰信息科技有限公司 | Audio data identification method and system |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6182035B1 (en) * | 1998-03-26 | 2001-01-30 | Telefonaktiebolaget Lm Ericsson (Publ) | Method and apparatus for detecting voice activity |
US20010014857A1 (en) * | 1998-08-14 | 2001-08-16 | Zifei Peter Wang | A voice activity detector for packet voice network |
US20010044719A1 (en) * | 1999-07-02 | 2001-11-22 | Mitsubishi Electric Research Laboratories, Inc. | Method and system for recognizing, indexing, and searching acoustic signals |
US20020165713A1 (en) * | 2000-12-04 | 2002-11-07 | Global Ip Sound Ab | Detection of sound activity |
US20040250078A1 (en) * | 2001-03-22 | 2004-12-09 | John Stach | Quantization -based data hiding employing calibration and locally adaptive quantization |
US20050060142A1 (en) * | 2003-09-12 | 2005-03-17 | Erik Visser | Separation of target acoustic signals in a multi-transducer arrangement |
US20050177364A1 (en) * | 2002-10-11 | 2005-08-11 | Nokia Corporation | Methods and devices for source controlled variable bit-rate wideband speech coding |
US20050192798A1 (en) * | 2004-02-23 | 2005-09-01 | Nokia Corporation | Classification of audio signals |
US20060053002A1 (en) * | 2002-12-11 | 2006-03-09 | Erik Visser | System and method for speech processing using independent component analysis under stability restraints |
US20060074641A1 (en) * | 2004-09-22 | 2006-04-06 | Goudar Chanaveeragouda V | Methods, devices and systems for improved codebook search for voice codecs |
US20070021958A1 (en) * | 2005-07-22 | 2007-01-25 | Erik Visser | Robust separation of speech signals in a noisy environment |
US20070265842A1 (en) * | 2006-05-09 | 2007-11-15 | Nokia Corporation | Adaptive voice activity detection |
US7412376B2 (en) * | 2003-09-10 | 2008-08-12 | Microsoft Corporation | System and method for real-time detection and preservation of speech onset in a signal |
-
2007
- 2007-01-24 WO PCT/IN2007/000028 patent/WO2008090564A2/en active Application Filing
- 2007-01-24 US US12/515,536 patent/US8380494B2/en not_active Expired - Fee Related
Non-Patent Citations (12)
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8655656B2 (en) * | 2010-03-04 | 2014-02-18 | Deutsche Telekom Ag | Method and system for assessing intelligibility of speech represented by a speech signal |
US20110218803A1 (en) * | 2010-03-04 | 2011-09-08 | Deutsche Telekom Ag | Method and system for assessing intelligibility of speech represented by a speech signal |
US20120253813A1 (en) * | 2011-03-31 | 2012-10-04 | Oki Electric Industry Co., Ltd. | Speech segment determination device, and storage medium |
US9123351B2 (en) * | 2011-03-31 | 2015-09-01 | Oki Electric Industry Co., Ltd. | Speech segment determination device, and storage medium |
US9292252B2 (en) | 2012-08-02 | 2016-03-22 | Nuance Communications, Inc. | Methods and apparatus for voiced-enabling a web application |
US20140039885A1 (en) * | 2012-08-02 | 2014-02-06 | Nuance Communications, Inc. | Methods and apparatus for voice-enabling a web application |
US9292253B2 (en) | 2012-08-02 | 2016-03-22 | Nuance Communications, Inc. | Methods and apparatus for voiced-enabling a web application |
US20140040722A1 (en) * | 2012-08-02 | 2014-02-06 | Nuance Communications, Inc. | Methods and apparatus for voiced-enabling a web application |
US9400633B2 (en) * | 2012-08-02 | 2016-07-26 | Nuance Communications, Inc. | Methods and apparatus for voiced-enabling a web application |
US9781262B2 (en) * | 2012-08-02 | 2017-10-03 | Nuance Communications, Inc. | Methods and apparatus for voice-enabling a web application |
US10157612B2 (en) | 2012-08-02 | 2018-12-18 | Nuance Communications, Inc. | Methods and apparatus for voice-enabling a web application |
US20160078873A1 (en) * | 2013-05-30 | 2016-03-17 | Huawei Technologies Co., Ltd. | Signal encoding method and device |
US9886960B2 (en) * | 2013-05-30 | 2018-02-06 | Huawei Technologies Co., Ltd. | Voice signal processing method and device |
US10692509B2 (en) | 2013-05-30 | 2020-06-23 | Huawei Technologies Co., Ltd. | Signal encoding of comfort noise according to deviation degree of silence signal |
US11527265B2 (en) * | 2018-11-02 | 2022-12-13 | BriefCam Ltd. | Method and system for automatic object-aware video or audio redaction |
US11984141B2 (en) | 2018-11-02 | 2024-05-14 | BriefCam Ltd. | Method and system for automatic pre-recordation video redaction of objects |
US12125504B2 (en) | 2018-11-02 | 2024-10-22 | BriefCam Ltd. | Method and system for automatic pre-recordation video redaction of objects |
Also Published As
Publication number | Publication date |
---|---|
US8380494B2 (en) | 2013-02-19 |
WO2008090564A3 (en) | 2009-04-16 |
WO2008090564A2 (en) | 2008-07-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8380494B2 (en) | Speech detection using order statistics | |
Sangwan et al. | VAD techniques for real-time speech transmission on the Internet | |
US9053702B2 (en) | Systems, methods, apparatus, and computer-readable media for bit allocation for redundant transmission | |
US7412376B2 (en) | System and method for real-time detection and preservation of speech onset in a signal | |
Prasad et al. | Comparison of voice activity detection algorithms for VoIP | |
KR100636317B1 (en) | Distributed Speech Recognition System and method | |
US8102872B2 (en) | Method for discontinuous transmission and accurate reproduction of background noise information | |
US8050415B2 (en) | Method and apparatus for detecting audio signals | |
CN102598119B (en) | Pitch estimation | |
EP1229520A2 (en) | Silence insertion descriptor (sid) frame detection with human auditory perception compensation | |
JP3255584B2 (en) | Sound detection device and method | |
WO2001039175A1 (en) | Method and apparatus for voice detection | |
US7072828B2 (en) | Apparatus and method for improved voice activity detection | |
JP2008058983A (en) | Method for robust classification of acoustic noise in voice or speech coding | |
US6381568B1 (en) | Method of transmitting speech using discontinuous transmission and comfort noise | |
WO2015169064A1 (en) | Network voice quality evaluation method, device and system | |
Sakhnov et al. | Approach for Energy-Based Voice Detector with Adaptive Scaling Factor. | |
US20050060149A1 (en) | Method and apparatus to perform voice activity detection | |
US8060362B2 (en) | Noise detection for audio encoding by mean and variance energy ratio | |
US20050171769A1 (en) | Apparatus and method for voice activity detection | |
Prasad et al. | SPCp1-01: Voice Activity Detection for VoIP-An Information Theoretic Approach | |
CN111128244B (en) | Short wave communication voice activation detection method based on zero crossing rate detection | |
Prasad et al. | VAD for VOIP using cepstrum | |
EP1551006A1 (en) | Apparatus and method for voice activity detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: P.E.S. INSTITUTE OF TECHNOLOGY,INDIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:R., MURALISHANKAR;S, VIJAY;R., VENKATESHA PRASAD;AND OTHERS;REEL/FRAME:022707/0005 Effective date: 20090520 Owner name: P.E.S. INSTITUTE OF TECHNOLOGY, INDIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:R., MURALISHANKAR;S, VIJAY;R., VENKATESHA PRASAD;AND OTHERS;REEL/FRAME:022707/0005 Effective date: 20090520 |
|
REMI | Maintenance fee reminder mailed | ||
LAPS | Lapse for failure to pay maintenance fees | ||
STCH | Information on status: patent discontinuation |
Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
|
FP | Lapsed due to failure to pay maintenance fee |
Effective date: 20170219 |