
EP2096629A1 - A classing method and device for sound signal - Google Patents


Info

Publication number
EP2096629A1
Authority
EP
European Patent Office
Prior art keywords
parameter
module
sub
parameters
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
EP07855800A
Other languages
German (de)
French (fr)
Other versions
EP2096629B1 (en)
EP2096629A4 (en)
Inventor
Wei Li
Lijing Xu
Qing Zhang
Jianfeng Xu
Shenghu Sang
Zhengzhong Du
Qin Yan
Haojiang Deng
Jun WANG
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of EP2096629A1 (en)
Publication of EP2096629A4 (en)
Application granted
Publication of EP2096629B1 (en)
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16: Vocoder architecture
    • G10L19/18: Vocoders using multiple modes
    • G10L19/20: Vocoders using multiple modes using sound class specific coding, hybrid encoders or object based coding

Definitions

  • the present invention relates to speech coding technologies, and in particular, to a method and apparatus for classifying sound signals.
  • the coder may encode the background noise and active speech at different rates. That is, the coder encodes the background noise at a lower rate, and encodes the active speech at a higher rate, thus reducing the average code rate and enhancing the variable-rate speech coding technology greatly.
  • VAD Voice Activity Detection
  • the VAD in the related art is developed for speech signals only, and categorizes input audio signals into only two types: noise and non-noise.
  • Later coders such as AMR-WB+ and SMV cover detection of music signals, serving as a correction and supplement to the VAD decision.
  • the AMR-WB+ coder is characterized in that, after VAD, the coding mode varies between a speech signal and a music signal depending on whether the input audio signal is a speech signal or a music signal, thus minimizing the code rate and ensuring the coding quality.
  • the two different coding modes in the AMR-WB+ are: Algebraic Code Excited Linear Prediction (ACELP)-based coding algorithm, and Transform Coded eXcitation (TCX)-based coding algorithm.
  • ACELP Algebraic Code Excited Linear Prediction
  • TCX Transform Coded eXcitation
  • the ACELP sets up a speech phonation model, makes the most of the speech characteristics, and is highly efficient in encoding speech signals.
  • the ACELP technology is so mature that the ACELP may be extended on a universal audio coder to improve the speech coding quality massively.
  • the TCX may be extended on the low-bit-rate speech coder to improve the quality of encoding broadband music.
  • the ACELP mode selection algorithm and the TCX mode selection algorithm of the AMR-WB+ coding algorithm come in two types: open-loop selection and closed-loop selection. Closed-loop selection involves high complexity and is the default option. It is a traversal search selection mode based on a perceptually weighted Signal-to-Noise Ratio (SNR). Such a selection method is rather accurate, but involves rather complicated operations and a large amount of code.
  • SNR Signal-to-Noise Ratio
  • the open-loop selection includes the following steps.
  • step 101 the VAD module judges whether the signal is a non-useful signal or a useful signal according to the Tone_flag and the sub-band energy parameter (Level[n]).
  • step 102 primary mode selection (EC) is performed.
  • step 103 the mode primarily determined in step 102 is corrected, and refined mode selection is performed to determine the coding mode to be selected. Specifically, this step is performed based on open loop pitch parameters and Immittance Spectral Frequency (ISF) parameters.
  • ISF Immittance Spectral Frequency
  • step 104 TCXS processing is performed. That is, when the number of times of selecting the speech signal coding mode continuously is less than three times, a small-sized closed-loop traversal search is performed to determine the coding mode finally, where the speech signal coding mode is ACELP and the music signal coding mode is TCX.
  • a method and apparatus for classifying sound signals are provided in an embodiment of the present invention to improve accuracy of sound signal classification.
  • a method for classifying and detecting sound signals in an embodiment of the present invention includes: receiving sound signals, and determining the update rate of background noise according to spectral distribution parameters of the background noise and spectral distribution parameters of the sound signals; and updating the noise parameters according to the update rate, and classifying the sound signals according to sub-band energy parameters and updated noise parameters.
  • An apparatus for classifying sound signals in an embodiment of the present invention includes: a background noise parameter updating module, configured to: determine the update rate of background noise according to spectral distribution parameters of the background noise and spectral distribution parameters of the current sound signals; and send the determined update rate; and a Primary Signal Classification (PSC) module, configured to: receive the update rate from the background noise parameter updating module, update the noise parameters, classify the current sound signals according to the sub-band energy parameters and updated noise parameters, and send the sound signal type determined through classification.
  • PSC Primary Signal Classification
  • the update rate of the background noise is determined, the noise parameters are updated according to the update rate, the signals are classified primarily according to the sub-band energy parameters and the updated noise parameters, and the non-useful signals and the useful signals in the received speech signals are determined, thus reducing the probability of mistaking useful signals for noise signals and improving accuracy of classifying sound signals.
  • Figure 1 shows open loop selection of AMR-WB+ coding algorithm in the related art
  • Figure 2 is a general flowchart of a method for classifying and detecting sound signals in an embodiment of the present invention
  • Figure 3 is a schematic diagram showing an apparatus for classifying sound signals in an embodiment of the present invention.
  • Figure 4 is a schematic diagram showing a system in an embodiment of the present invention.
  • Figure 5 is a flowchart of calculating various parameters on a coder parameter extracting module in an embodiment of the present invention
  • Figure 6 is a flowchart of calculating various parameters on another coder parameter extracting module in an embodiment of the present invention.
  • Figure 7 shows composition of a PSC module in an embodiment of the present invention
  • Figure 8 shows how a signal type judging module determines characteristic parameters in an embodiment of the present invention
  • Figure 9 shows how a signal type judging module performs speech judgment in an embodiment of the present invention.
  • Figure 10 shows how a signal type judging module performs music judgment in an embodiment of the present invention
  • Figure 11 shows how a signal type judging module corrects a primary judgment result in an embodiment of the present invention
  • Figure 12 shows how a signal type judging module performs primary type correction for uncertain signals in an embodiment of the present invention
  • Figure 13 shows how a signal type judging module performs final type correction for signals in an embodiment of the present invention.
  • Figure 14 shows how a signal type judging module performs parameter update in an embodiment of the present invention.
  • the update rate of the background noise is determined according to the spectral distribution parameters of the current sound signal and the background noise, and the noise parameters are updated according to the update rate. Therefore, the useful signals and the non-useful signals in the received speech signals are determined according to the updated noise parameters, thus improving the accuracy of the noise parameters in determining the useful signals and non-useful signals, and improving the accuracy of classifying sound signals.
  • Figure 2 shows a method for classifying and detecting sound signals in an embodiment of the present invention, including the following process:
  • Block 201 Sound signals are received, and the update rate of background noise is determined according to the spectral distribution parameters of the background noise and the sound signals.
  • Block 202 The noise parameters are updated according to the update rate, and the sound signals are classified according to sub-band energy parameters and updated noise parameters.
  • the sound signals are classified into two types: useful signals, and non-useful signals.
  • the useful signals may be subdivided into speech signals and music signals, depending on whether the noise converges.
  • the subdividing may be based on open loop pitch parameters, ISF parameters, and sub-band energy parameters, or based on ISF parameters and sub-band energy parameters.
  • a determined useful signal type is obtained in an embodiment of the present invention.
  • the signal hangover length is determined according to the useful signal type, and the useful signals and the non-useful signals in the received speech signals are further determined according to the signal hangover length.
  • the music signal hangover may be set to a relatively great value to improve the sound effect of the music signal.
  • an apparatus for classifying sound signals in an embodiment of the present invention includes: a background noise parameter updating module, configured to: determine the update rate of background noise according to the spectral distribution parameters of the background noise and the current sound signals, and send the determined update rate to a PSC module; and a PSC module, configured to: update the noise parameters according to the update rate received from the background noise parameter updating module, perform primary classification for the signals according to the sub-band energy parameters and updated noise parameters, and determine the received speech signal to be a useful signal or non-useful signal.
  • a background noise parameter updating module configured to: determine the update rate of background noise according to the spectral distribution parameters of the background noise and the current sound signals, and send the determined update rate to a PSC module
  • a PSC module configured to: update the noise parameters according to the update rate received from the background noise parameter updating module, perform primary classification for the signals according to the sub-band energy parameters and updated noise parameters, and determine the received speech signal to be a useful signal or non-useful signal.
  • the apparatus for classifying sound signals may further include a signal type judging module.
  • the PSC module transfers the determined signal type to the signal type judging module.
  • the signal type judging module determines the type of a useful signal based on the open loop pitch parameters, ISF parameters, and sub-band energy parameters, or based on ISF parameters and sub-band energy parameters, where the type of the useful signal includes speech and music.
  • the apparatus for classifying sound signals may further include a classification parameter extracting module.
  • the PSC module transfers the determined signal type to the signal type judging module through the classification parameter extracting module.
  • the classification parameter extracting module is further configured to: obtain ISF parameters and sub-band energy parameters, or further obtain open loop pitch parameters, process the obtained parameters into signal type characteristic parameters, and send the parameters to the signal type judging module; and process the obtained parameters into spectral distribution parameters of sound signals and background noise, and transfer the spectral distribution parameters to the background noise parameter updating module. Therefore, the signal type judging module determines the type of useful signals according to the foregoing signal type characteristic parameter and the signal type determined by the PSC module, where the type of useful signals includes speech and music.
  • the PSC module may be further configured to transfer the sound signal SNR calculated in the process of determining the signal type to the signal type judging module.
  • the signal type judging module determines the useful signal to be a speech signal or music signal according to the SNR.
  • the apparatus for classifying sound signals may further include a coder mode and rate selecting module.
  • the signal type judging module transfers the determined signal type to the coder mode and rate selecting module, and the coder mode and rate selecting module determines the coding mode and rate of sound signals according to the received signal type.
  • the apparatus for classifying sound signals may further include a coder parameter extracting module, which is configured to extract ISF parameters and sub-band energy parameters or additionally open loop pitch parameters, transfer the extracted parameters to the classification parameter extracting module, and transfer the extracted sub-band energy parameters to the PSC module.
  • a coder parameter extracting module configured to extract ISF parameters and sub-band energy parameters or additionally open loop pitch parameters, transfer the extracted parameters to the classification parameter extracting module, and transfer the extracted sub-band energy parameters to the PSC module.
  • FIG. 4 is a schematic diagram showing a system in an embodiment of the present invention.
  • the system includes a Sound Activity Detector (SAD).
  • SAD Sound Activity Detector
  • the SAD sorts the digital audio signals into three types: non-useful signal, speech, and music, thus forming a basis for the coder to select the coding mode and rate.
  • the SAD module includes: a background noise estimation control module, a PSC module, a classification parameter extracting module, and a signal type judging module.
  • the SAD makes the most of the parameters of the coder in order to reduce resource occupation and calculation complexity. Therefore, the coder parameter extracting module in the coder is used to calculate the sub-band energy parameters and coder parameters, and provide the calculated parameters for the SAD module.
  • the SAD module finally outputs a determined signal type (namely, non-useful signal, speech, or music), and provides the determined signal type for the coder mode and rate selecting module to select the coder mode and rate.
  • the SAD-related modules in the coder, sub-modules in the SAD, and the interaction processes between the sub-modules are detailed below.
  • the coder parameter extracting module in the coder calculates the sub-band energy parameters and coder parameters, and provides the calculated parameters for the SAD module.
  • the sub-band energy parameters may be calculated through filtering of a filter group.
  • the specific quantity of sub-bands (for example, 12 sub-bands in this embodiment) is determined according to the calculation complexity requirement and classification accuracy requirement.
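As a rough illustration of the filter-group step, the sub-band energies can be computed by grouping FFT bins into the 12 bands used in this embodiment (the band edges follow the list given later in the text, reading the tenth band as 3200-4000 Hz). The 12.8 kHz sampling rate and the FFT-based implementation are assumptions; the source only says a filter group is used.

```python
import numpy as np

# Band edges in Hz for the 12 sub-bands of this embodiment
# (the tenth band is read as 3200-4000 Hz).
BAND_EDGES = [0, 200, 400, 600, 800, 1200, 1600, 2000,
              2400, 3200, 4000, 4800, 6400]

def subband_energies(frame, fs=12800):
    """Return level[0..11]: the energy of each sub-band of one frame."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    levels = []
    for lo, hi in zip(BAND_EDGES[:-1], BAND_EDGES[1:]):
        band = (freqs >= lo) & (freqs < hi)  # bins falling in this band
        levels.append(float(spectrum[band].sum()))
    return levels
```

A 1 kHz tone, for example, should concentrate its energy in the fifth band (800-1200 Hz).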
  • Figure 5 or Figure 6 shows how a coder parameter extracting module calculates various parameters required by the SAD module in this embodiment.
  • the process shown in Figure 5 includes the following process:
  • Block 501 The coder parameter extracting module calculates the sub-band energy parameters first.
  • Block 502 The coder parameter extracting module decides whether it is necessary to perform ISF calculation according to the primary signal judgment result (Vad_flag) received from the PSC module, and performs block 503 if necessary; or performs block 504 if not necessary.
  • Vad_flag the primary signal judgment result
  • the decision about whether to perform ISF calculation in this block includes: If the current frame is composed of non-useful signals, the mechanism of the coder applies.
  • the mechanism of the coder is: If ISF parameters are required when the coder encodes non-useful signals, the ISF calculation needs to be performed; otherwise, the operation of the coder parameter extracting module is finished. If the current frame is composed of useful signals, the ISF calculation needs to be performed. Most coding modes require calculation of ISF parameters for useful signals. Therefore, the calculation brings no redundant complexity to the coder.
  • the technical solution to calculation of ISF parameters is detailed in the instruction manuals of coders, and is not repeated here any further.
  • Block 503 The coder parameter extracting module calculates the ISF parameters and then performs block 504.
  • Block 504 The coder parameter extracting module calculates the open loop pitch parameters.
  • the sub-band energy parameters calculated through the process in Figure 5 are provided for the PSC module and the classification parameter extracting module in the SAD, and other parameters are provided for the classification parameter extracting module in the SAD.
  • Blocks 601-603 are basically identical to blocks 501-503 in Figure 5 .
  • open-loop pitch parameters are redundant for some coding modes such as TCX. To simplify calculation, once the noise estimation converges, it is basically certain that the coding mode corresponding to the signal does not need open-loop pitch parameters, so the open-loop pitch parameters are no longer calculated.
  • before convergence, the open loop pitch parameters need to be calculated in order to ensure convergence of the noise estimation and the convergence speed. However, such calculation occurs only at the startup stage, so its complexity is negligible.
  • the technical solution to calculation of open loop pitch parameters is detailed in the instruction about ACELP-based coding, and is not repeated here any further.
  • the basis for judging whether the noise estimation converges may be: The count of determining as noise frames continuously exceeds the noise convergence threshold (THR1). In an example in this embodiment, the value of THR1 is 20.
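The convergence check described above can be sketched as a simple run-length counter. THR1 = 20 comes from the text; the class shape is illustrative.

```python
THR1 = 20  # noise convergence threshold given in this embodiment

class NoiseConvergenceTracker:
    """Noise estimation is treated as converged once the count of
    continuously determined noise frames exceeds THR1."""
    def __init__(self):
        self.noise_run = 0
    def update(self, is_noise_frame):
        # Continuous noise frames extend the run; anything else resets it.
        self.noise_run = self.noise_run + 1 if is_noise_frame else 0
        return self.noise_run > THR1
```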
  • the foregoing extracted sub-band energy parameter is level[i], where i represents a member index of the vector, and its value falls within 1...12 in this embodiment, corresponding to 0-200 Hz, 200-400 Hz, 400-600 Hz, 600-800 Hz, 800-1200 Hz, 1200-1600 Hz, 1600-2000 Hz, 2000-2400 Hz, 2400-3200 Hz, 3200-4000 Hz, 4000-4800 Hz, and 4800-6400 Hz, respectively.
  • ISF parameter is Isf n [ i ], where n represents a frame index, and the value of i falls within 1...16, representing a member index in the vector.
  • the foregoing extracted open loop pitch parameters include: open_loop pitch gain (ol_gain), open_loop pitch lag (ol_lag), and tone_flag. If the value of ol_gain is greater than the value of tone threshold (TONE_THR), the tone_flag is set to 1.
  • the PSC module may be implemented through various VAD algorithms in the related art, and includes: background noise estimating sub-module, SNR calculating sub-module, useful signal estimating sub-module, judgment threshold adjusting sub-module, comparing sub-module, and hangover protective useful signal sub-module.
  • the implementation of the PSC module may differ from the VAD algorithm module in the related art in the following aspects:
  • the SNR calculating sub-module calculates the SNR according to this parameter and the sub-band energy parameters.
  • the calculated SNR parameter is not only applied inside the PSC module, but also transferred to the signal type judging module so that the signal type judging module identifies the speech and music more accurately in the case of low SNR.
  • the VAD in the related art underperforms in identifying noise and some types of music, and improvement is made for the VAD in this embodiment:
  • the calculation of the background noise parameter is controlled by the update rate (ACC) provided by the background noise parameter updating module.
  • the background noise estimating sub-module receives the update rate from the background noise parameter updating module, updates the noise parameter, and transfers the sub-band energy estimation parameters of background noise calculated out according to the updated noise parameter to the SNR calculating sub-module.
  • the calculation of the update rate is detailed in the instruction about the background noise parameter updating module hereinafter.
  • the update rate comes in 4 levels: acc1, acc2, acc3, and acc4.
  • different upward update parameters (update_up) and downward update parameters (update_down) are determined, where update_up corresponds to the upward update rate of background noise, and update_down corresponds to the downward update rate of background noise.
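A plausible reading of the two-direction update is an asymmetric first-order smoother per sub-band: update_up applies when the frame energy rises above the current noise estimate, update_down when it falls below. The numeric coefficients below are invented placeholders; the source defines only the four levels acc1 to acc4.

```python
# Hypothetical (update_up, update_down) coefficients per level;
# the source gives no numeric values.
UPDATE_RATES = {
    "acc1": (0.50, 0.50),  # fastest update
    "acc2": (0.20, 0.30),
    "acc3": (0.05, 0.10),
    "acc4": (0.01, 0.05),  # slowest update
}

def update_noise_level(noise_level, frame_level, acc):
    """First-order smoothing of one sub-band noise estimate."""
    up, down = UPDATE_RATES[acc]
    alpha = up if frame_level > noise_level else down
    return (1.0 - alpha) * noise_level + alpha * frame_level
```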
  • hangover is used to prevent useful signals from being mistaken for noise.
  • the hangover length should be a tradeoff between signal protection and transmission efficiency.
  • the hangover length may be a constant after learning.
  • a multi-rate coder is oriented to audio signals such as music. Such signals tend to have a long low-energy hangover. It is difficult for a conventional VAD to detect such a hangover. Therefore, a relatively long hangover is required for protection.
  • the hangover length in the hangover protective useful signal sub-module is designed to be adaptive according to the SAD signal judgment result.
  • HANG_LONG 100
  • HANG_SHORT 20
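Using the HANG_LONG/HANG_SHORT values above, the adaptive hangover can be sketched as follows; the class interface and the rule of reloading the counter on every active frame are assumptions.

```python
HANG_LONG = 100   # hangover frames after music (from this embodiment)
HANG_SHORT = 20   # hangover frames after speech

class HangoverProtector:
    """Keeps the useful-signal decision up for a while after the last
    active frame, with a longer tail for music than for speech."""
    def __init__(self):
        self.counter = 0
    def decide(self, raw_active, last_type):
        if raw_active:
            # Reload the hangover according to the signal type.
            self.counter = HANG_LONG if last_type == "music" else HANG_SHORT
            return True
        if self.counter > 0:
            self.counter -= 1
            return True   # still inside the hangover period
        return False
```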
  • the classification parameter extracting module is configured to: calculate the parameters required by the signal type judging module and the background noise parameter updating module according to the Vad_flag parameter determined by the PSC module and the sub-band energy parameters, ISF parameters, and open loop pitch parameters provided by the coder parameter extracting module; and provide the sub-band energy parameters, ISF parameters, open loop pitch parameters, and calculated parameters for the signal type judging module and the background noise parameter updating module.
  • the parameters calculated by the classification parameter extracting module include:
  • Difference of continuous open loop pitch lags is compared. If the increment of the open loop pitch lag is less than a set threshold, the lag count accrues; if the sum of the lag counts of two continuous frames is great enough, the pitch is set to 1; otherwise, the pitch is set to 0.
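The lag-stability test can be sketched as below, assuming each frame carries several open-loop lag estimates. The lag-difference threshold and the two-frame count threshold are invented placeholders; the source gives no numbers.

```python
LAG_DIFF_THR = 2    # hypothetical max lag increment counted as stable
LAG_COUNT_THR = 8   # hypothetical two-frame count needed to set pitch=1

def pitch_flag(lags_prev, lags_curr):
    """Set pitch to 1 when the open-loop pitch lags of two
    continuous frames are stable enough."""
    def stable_count(lags):
        # Count consecutive lag increments below the threshold.
        return sum(1 for a, b in zip(lags, lags[1:])
                   if abs(b - a) < LAG_DIFF_THR)
    total = stable_count(lags_prev) + stable_count(lags_curr)
    return 1 if total >= LAG_COUNT_THR else 0
```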
  • the formula for calculating the open loop pitch lag is specified in the AMR-WB+/AMR-WB standard document.
  • tone_flg 1000*tone_flg.
  • Zero Cross Rate (zcr): zcr = (1/T) * Σ_{i=1}^{T-1} II{ x(i) * x(i-1) < 0 }, where II{A} is 1 when A is true, and 0 when A is false.
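The zcr formula translates directly into code:

```python
def zero_cross_rate(x):
    """zcr = (1/T) * sum over i = 1..T-1 of II{ x[i] * x[i-1] < 0 }."""
    T = len(x)
    return sum(1 for i in range(1, T) if x[i] * x[i - 1] < 0) / T
```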
  • Sub-band energy standard deviation mean (level_meanSD) parameter average of the sub-band energy standard deviation (level_SD) of two adjacent frames, where the calculation method of the level_SD parameter is similar to the calculation method of the Isf_SD described above.
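Assuming level_SD is the standard deviation across one frame's sub-band energies (by analogy with the Isf_SD calculation the text refers to), level_meanSD can be sketched as:

```python
import numpy as np

def level_sd(levels):
    """Standard deviation across the sub-band energies of one frame."""
    return float(np.std(levels))

def level_mean_sd(levels_prev, levels_curr):
    """level_meanSD: average of level_SD over two adjacent frames."""
    return 0.5 * (level_sd(levels_prev) + level_sd(levels_curr))
```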
  • the parameters provided for the background noise parameter updating module include: zcr, ra, i_flux, and t_flux; the parameters provided for the signal type judging module include: pitch, meangain, isf_meanSD, and level_meanSD.
  • the signal type judging module is configured to sort the signals into non-useful (such as noise), speech, and music according to the snr and Vad_flag parameters received from the PSC module and the sub-band energy parameter, pitch, meangain, isf_meanSD, and level_meanSD parameters received from the classification parameter extracting module.
  • the signal type judging module may include:
  • the process of determining a useful signal to be a speech signal or music signal includes:
  • this embodiment provides a parameter flag hangover mechanism.
  • the characteristic parameter values such as pitch_flag, level_meanSD_high_flag, ISF_meanSD_high_flag, ISF_meanSD_low_flag, level_meanSD_low_flag, and meangain_flag are determined according to the hangover mechanism, as shown in Figure 8 .
  • the length of the hangover period is determined according to the hangover parameter flag value.
  • This embodiment provides two types of hangover settings (namely, two solutions to determining the hangover parameter flag value).
  • if a parameter satisfies its judgment condition, the corresponding parameter hangover counter value increases by one; otherwise, the counter is set to 0. Different parameter hangover flags are set according to the value of the parameter hangover counter: a higher counter value yields a greater parameter hangover flag value. The specific mapping is determined as required at the time of setting the parameter hangover flag value according to the parameter counter, and is not described here any further.
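A minimal sketch of this counter-to-flag mechanism follows; the counter-to-flag thresholds are invented, since the source leaves the mapping to the implementer.

```python
class ParamHangover:
    """Hangover flag for one characteristic parameter (sketch).

    While the parameter condition holds, the counter grows; once it
    breaks, the counter is reset. Higher counters map to larger flag
    values, which later control the hangover length. The thresholds
    below (10 and 50) are assumptions, not values from the source.
    """
    def __init__(self):
        self.counter = 0
    def update(self, condition_holds):
        self.counter = self.counter + 1 if condition_holds else 0
        if self.counter >= 50:
            return 2   # long hangover
        if self.counter >= 10:
            return 1   # short hangover
        return 0
```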
  • the hangover length is controlled according to the Error Rate (ER) of the internal nodes of the decision tree corresponding to the training parameter. If the ER is lower, the hangover is shorter; if the ER is higher, the hangover is longer.
  • ER Error Rate
  • the signal is primarily sorted into either speech or music:
  • the first ISF speech threshold such as 1500
  • the speech flag bit is set to 1; otherwise, in block 904, a judgment is made about whether the number of continuous frames whose pitch value is 1 exceeds the preset threshold of the number of hangover frames (such as 2 frames). If yes, the speech flag bit is set to 1; otherwise, in block 905, a judgment is made about whether the meangain exceeds the preset long-time correlation speech threshold (such as 8000). If yes, the speech flag bit is set to 1; otherwise, in block 906, a judgment is made about whether either or both of the level_meanSD_high_flag value and the ISF_meanSD_high_flag value are 1. If yes, the speech flag bit is set to 1; otherwise, the value of the speech flag bit remains unchanged.
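The visible fallback chain of blocks 904-906 can be sketched as below. The entry condition of block 903 is not fully reproduced in the text, so it is omitted here, and the "exceeds" comparisons are read as strict.

```python
PITCH_HANG_FRAMES = 2       # hangover-frame threshold from the text
MEANGAIN_SPEECH_THR = 8000  # long-time correlation threshold from the text

def speech_flag(prev_flag, pitch_run, meangain,
                level_meanSD_high_flag, ISF_meanSD_high_flag):
    """Cascade of speech checks from blocks 904-906 (sketch)."""
    if pitch_run > PITCH_HANG_FRAMES:          # block 904
        return 1
    if meangain > MEANGAIN_SPEECH_THR:         # block 905
        return 1
    if level_meanSD_high_flag == 1 or ISF_meanSD_high_flag == 1:  # block 906
        return 1
    return prev_flag  # otherwise the flag remains unchanged
```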
  • the sub-band energy threshold such as 5000
  • the current signal type is set to the uncertain type, with a view to reducing the probability of mistaking noise for music; otherwise, in block 1105, a judgment is made about whether both the music flag bit and the speech flag bit are 1. If yes, the current signal type is determined to be the uncertain type; otherwise, in block 1106, a judgment is made about whether both the music flag bit and the speech flag bit are 0. If yes, the current signal type is determined to be the uncertain type; otherwise, in block 1107, a judgment is made about whether the music flag bit is 0 and the speech flag bit is 1. If yes, the current signal type is determined to be the speech type; otherwise, in block 1108, because the music flag bit is 1 and the speech flag bit is 0, the current signal type is determined to be the music type.
  • block 1109 is performed to judge whether pitch_flag is 1, the ISF_meanSD is less than the ISF music threshold (such as 900), and the number of continuous speech frames is less than 3. If yes, the signal is determined to be of the music type; otherwise, the signal is still determined to be of the uncertain type.
  • block 1110 is performed to judge whether the number of continuous music frames is greater than 3 and the ISF_meanSD is less than the ISF music threshold. If yes, the signal is determined to be a music signal; otherwise, the signal is determined to be a speech signal.
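Blocks 1105-1108 combine the two flag bits into a primary type, and block 1109 retries an uncertain frame; the string labels and function shapes below are illustrative, and the condition that routes a frame into block 1109 versus 1110 is not reproduced.

```python
def primary_type(music_flag, speech_flag):
    """Blocks 1105-1108: combine the two flag bits (sketch)."""
    if music_flag == speech_flag:        # both 1 or both 0
        return "uncertain"
    return "speech" if speech_flag == 1 else "music"

ISF_MUSIC_THR = 900  # ISF music threshold given in this embodiment

def refine_uncertain(pitch_flag, isf_meanSD, speech_run):
    """Block 1109 (sketch): an uncertain frame with a stable pitch,
    a low ISF_meanSD, and fewer than 3 continuous speech frames is
    re-sorted into music; otherwise it stays uncertain."""
    if pitch_flag == 1 and isf_meanSD < ISF_MUSIC_THR and speech_run < 3:
        return "music"
    return "uncertain"
```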
  • the signals of the uncertain type undergo the primary corrective classification process shown in Figure 12 , including:
  • if the speech hangover flag is 1 and the music hangover flag is 0, the current signal type is set to the speech class. If the music hangover flag is 1 and the speech hangover flag is 0, the current signal type is set to the music class. If both the music hangover flag and the speech hangover flag are 1 or both are 0, the signal type is set to the uncertain class. In this case, if more than 20 previous music frames are continuous, the signal is determined to be of the music class; if more than 20 previous speech frames are continuous, the signal is determined to be of the speech class.
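A sketch of this corrective pass, assuming the hangover flags and continuous-frame counters are available as plain integers:

```python
RUN_THR = 20  # continuous-frame threshold given in this embodiment

def correct_uncertain(speech_hang, music_hang, speech_run, music_run):
    """Primary corrective classification for uncertain frames (sketch)."""
    if speech_hang == 1 and music_hang == 0:
        return "speech"
    if music_hang == 1 and speech_hang == 0:
        return "music"
    # Both hangover flags equal: fall back to long context runs.
    if music_run > RUN_THR:
        return "music"
    if speech_run > RUN_THR:
        return "speech"
    return "uncertain"
```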
  • the useful signal type is corrected finally in Figure 13 .
  • the type is further corrected according to the current context.
  • the current context is music and the continuity is longer than 3 seconds, namely, the current continuous music frames are more than 150 frames
  • mandatory correction may be performed according to the ISF_meanSD value to determine the music signal.
  • the current context is speech and the continuity is longer than 3 seconds, namely, the current continuous speech frames are more than 150 frames
  • mandatory correction may be performed according to the ISF_meanSD value to determine the speech signal class.
  • the signal type is still uncertain
  • the signal type is corrected according to the previous context in block 1303, namely, the current uncertain signal type is sorted into the previous signal type.
  • the three type counters and the threshold values in the signal type judging module need to be updated.
  • the music counter music_continue_counter
  • Other type counters are processed similarly as shown in Figure 14 , and are not detailed here any further.
  • the threshold values are updated according to the SNR output by the PSC module.
  • the threshold examples given in the embodiments herein are the values learned in the case that the SNR is 20 dB.
  • the background noise parameter updating module uses some spectral distribution parameters calculated in the classification parameter extracting module in the SAD to control the update rate of the background noise.
  • the energy level of the background noise may surge abruptly. In this case, it is probable that the background noise estimation remains non-updated because the signals are continuously determined to be useful signals. Such a problem is solved by the background noise parameter updating module.
  • the background noise parameter updating module calculates the vector of relevant spectral distribution parameters according to the parameters received from the classification parameter extracting module.
  • the vector includes the following elements:
  • This embodiment makes use of the stable spectral features of the background noise.
  • the elements of the spectral distribution parameter vector are not limited to the 4 elements listed above.
  • the update rate of the current background noise is controlled by a difference ( d cb ) between the current spectral distribution parameter and the spectral distribution parameter estimation of the background noise.
  • the difference may be implemented through the algorithms such as Euclidean distance and Manhattan distance.
  • If d cb < TH1, the module outputs the update rate acc1, which represents the fastest update rate; otherwise, if d cb < TH2, the module outputs the update rate acc2; otherwise, if d cb < TH3, the module outputs the update rate acc3; otherwise, the module outputs the update rate acc4.
  • TH1, TH2 and TH3 are update thresholds, and the specific threshold values depend on the actual environment conditions.
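The threshold cascade above can be sketched as follows. The text permits either Euclidean or Manhattan distance; Euclidean is used here, and the function name and threshold arguments are placeholders whose actual values depend on the environment:

```python
import math

def noise_update_rate(p_current, p_noise, th1, th2, th3):
    """Pick the background-noise update rate from the distance between
    the current spectral distribution vector and the noise estimate's."""
    # Euclidean distance between the two 4-element vectors.
    d_cb = math.sqrt(sum((c - n) ** 2 for c, n in zip(p_current, p_noise)))
    if d_cb < th1:
        return "acc1"   # fastest update: spectra almost identical
    if d_cb < th2:
        return "acc2"
    if d_cb < th3:
        return "acc3"
    return "acc4"       # slowest update: spectra clearly differ
```

The closer the current spectrum is to the noise spectrum, the faster the noise estimate is allowed to track it.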
  • the update rate of the background noise is determined, the noise parameters are updated according to the update rate, the signals are classified primarily according to the sub-band energy parameters and the updated noise parameters, and the non-useful signals and the useful signals in the received speech signals are determined, thus reducing the probability of mistaking useful signals for noise signals and improving accuracy of classifying sound signals.
  • the embodiments of the present invention may be implemented through software in addition to a universal hardware platform or through hardware only. In most cases, however, software in addition to a universal hardware platform is preferred. Therefore, the technical solution under the present invention or contributions to the related art may be embodied by a software product.
  • the software product is stored in a storage medium and incorporates several instructions so that a computer device (for example, PC, server, or network device) may execute the method in each embodiment of the present invention.


Abstract

A method for classifying sound signals includes: receiving sound signals, and determining the update rate of background noise according to spectral distribution parameters of the background noise and the sound signals; and updating the noise parameters according to the update rate, and classifying the sound signals according to sub-band energy parameters and updated noise parameters. An apparatus for classifying sound signals includes: a background noise parameter updating module, configured to: determine the update rate of background noise according to spectral distribution parameters of the background noise and the current sound signals; and send the determined update rate; and a PSC module, configured to: receive the update rate from the background noise parameter updating module, update the noise parameters, classify the current sound signals according to the sub-band energy parameters and updated noise parameters, and send the sound signal type determined through classification.

Description

    FIELD OF THE INVENTION
  • The present invention relates to speech coding technologies, and in particular, to a method and apparatus for classifying sound signals.
  • BACKGROUND
  • In speech communication, only about 40% of the signals contain speech, and the rest are silence or background noise. In order to save transmission bandwidth, a Voice Activity Detection (VAD) technique is applied to speech coding in the speech signal processing field, so that the coder may encode the background noise and active speech at different rates. That is, the coder encodes the background noise at a lower rate and the active speech at a higher rate, thus reducing the average code rate and greatly enhancing variable-rate speech coding.
  • The VAD in the related art is developed for speech signals only, and categorizes input audio signals into only two types: noise and non-noise. Later coders such as AMR-WB+ and SMV cover detection of music signals, serving as a correction and supplement to the VAD decision. The AMR-WB+ coder is characterized in that, after VAD, the coding mode varies between a speech signal and a music signal depending on whether the input audio signal is a speech signal or a music signal, thus minimizing the code rate while ensuring the coding quality.
  • The two different coding modes in the AMR-WB+ are: Algebraic Code Excited Linear Prediction (ACELP)-based coding algorithm, and Transform Coded eXcitation (TCX)-based coding algorithm. The ACELP sets up a speech phonation model, makes the most of the speech characteristics, and is highly efficient in encoding speech signals. Moreover, the ACELP technology is so mature that the ACELP may be extended on a universal audio coder to improve the speech coding quality massively. Likewise, the TCX may be extended on the low-bit-rate speech coder to improve the quality of encoding broadband music.
  • Depending on complexity, the ACELP and TCX mode selection algorithms of the AMR-WB+ coding algorithm come in two types: open loop selection and closed loop selection. Closed-loop selection has high complexity and is the default option. It is a traversal search selection mode based on a perceptually weighted Signal-to-Noise Ratio (SNR). Evidently, such a selection method is rather accurate, but involves rather complicated operations and a huge amount of code.
  • The open-loop selection includes the following steps.
  • In step 101, the VAD module judges whether the signal is a non-usable signal or usable signal according to the Tone_flag and the sub-band energy parameter (Level[n]).
  • In step 102, primary mode selection (EC) is performed.
  • In step 103, the mode primarily determined in step 102 is corrected, and refined mode selection is performed to determine the coding mode to be selected. Specifically, this step is performed based on open loop pitch parameters and Immittance Spectral Frequency (ISF) parameters.
  • In step 104, TCXS processing is performed. That is, when the number of times of selecting the speech signal coding mode continuously is less than three times, a small-sized closed-loop traversal search is performed to determine the coding mode finally, where the speech signal coding mode is ACELP and the music signal coding mode is TCX.
  • In the process of implementing the present invention, the inventor finds that the AMR-WB+ speech signal selection algorithm in the related art involves the following defects:
    1. The VAD module in the related art underperforms in identifying noise and some music signals in the process of classifying signals, thus reducing the accuracy of classifying sound signals.
    2. Calculation of the open loop pitch parameters is necessary for the ACELP coding mode, but unnecessary for the TCX coding mode. According to the AMR-WB+ structure design, both the VAD and the open-loop mode selection algorithm use the open loop pitch parameters. Therefore, the open loop pitch needs to be calculated for all frames. However, for non-ACELP coding modes (such as TCX), the calculation of such parameters adds redundant complexity, increases the calculation load of coding mode selection, and reduces efficiency.
    3. Although the VAD algorithm is superior in speech detection and noise immunity among the coders currently available, it may mistake music signals for noise at the hangover of some special music signals, thus truncating the music hangover and making the music sound unnatural.
    4. The AMR-WB+ mode selection algorithm disregards the Signal-to-Noise Ratio (SNR) environment of the signal, and its performance in distinguishing speech from music deteriorates further at a low SNR.
    SUMMARY
  • A method and apparatus for classifying sound signals are provided in an embodiment of the present invention to improve accuracy of sound signal classification.
  • A method for classifying and detecting sound signals in an embodiment of the present invention includes: receiving sound signals, and determining the update rate of background noise according to spectral distribution parameters of the background noise and spectral distribution parameters of the sound signals; and updating the noise parameters according to the update rate, and classifying the sound signals according to sub-band energy parameters and updated noise parameters.
  • An apparatus for classifying sound signals in an embodiment of the present invention includes: a background noise parameter updating module, configured to: determine the update rate of background noise according to spectral distribution parameters of the background noise and spectral distribution parameters of the current sound signals; and send the determined update rate; and a Primary Signal Classification (PSC) module, configured to: receive the update rate from the background noise parameter updating module, update the noise parameters, classify the current sound signals according to the sub-band energy parameters and updated noise parameters, and send the sound signal type determined through classification.
  • In the embodiments of the present invention, the update rate of the background noise is determined, the noise parameters are updated according to the update rate, the signals are classified primarily according to the sub-band energy parameters and the updated noise parameters, and the non-useful signals and the useful signals in the received speech signals are determined, thus reducing the probability of mistaking useful signals for noise signals and improving accuracy of classifying sound signals.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Figure 1 shows open loop selection of AMR-WB+ coding algorithm in the related art;
  • Figure 2 is a general flowchart of a method for classifying and detecting sound signals in an embodiment of the present invention;
  • Figure 3 is a schematic diagram showing an apparatus for classifying sound signals in an embodiment of the present invention;
  • Figure 4 is a schematic diagram showing a system in an embodiment of the present invention;
  • Figure 5 is a flowchart of calculating various parameters on a coder parameter extracting module in an embodiment of the present invention;
  • Figure 6 is a flowchart of calculating various parameters on another coder parameter extracting module in an embodiment of the present invention;
  • Figure 7 shows composition of a PSC module in an embodiment of the present invention;
  • Figure 8 shows how a signal type judging module determines characteristic parameters in an embodiment of the present invention;
  • Figure 9 shows how a signal type judging module performs speech judgment in an embodiment of the present invention;
  • Figure 10 shows how a signal type judging module performs music judgment in an embodiment of the present invention;
  • Figure 11 shows how a signal type judging module corrects a primary judgment result in an embodiment of the present invention;
  • Figure 12 shows how a signal type judging module performs primary type correction for uncertain signals in an embodiment of the present invention;
  • Figure 13 shows how a signal type judging module performs final type correction for signals in an embodiment of the present invention; and
  • Figure 14 shows how a signal type judging module performs parameter update in an embodiment of the present invention.
  • DETAILED DESCRIPTION
  • In order to make the technical solution, objectives and merits of the present invention clearer, a detailed description of the present invention is given below by reference to the accompanying drawings and preferred embodiments.
  • In the embodiments of the present invention, the update rate of the background noise is determined according to the spectral distribution parameters of the current sound signal and the background noise, and the noise parameters are updated according to the update rate. Therefore, the useful signals and the non-useful signals in the received speech signals are determined according to the updated noise parameters, thus improving the accuracy of the noise parameters in determining the useful signals and non-useful signals, and improving the accuracy of classifying sound signals.
  • Figure 2 shows a method for classifying and detecting sound signals in an embodiment of the present invention, including the following process:
  • Block 201: Sound signals are received, and the update rate of background noise is determined according to the spectral distribution parameters of the background noise and the sound signals.
  • Block 202: The noise parameters are updated according to the update rate, and the sound signals are classified according to sub-band energy parameters and updated noise parameters.
  • In block 202, the sound signals are classified into two types: useful signals, and non-useful signals. Afterward, the useful signals may be subdivided into speech signals and music signals, depending on whether the noise converges. The subdividing may be based on open loop pitch parameters, ISF parameters, and sub-band energy parameters, or based on ISF parameters and sub-band energy parameters.
  • Besides, in order to prevent mistaking music signal hangovers for non-useful signals and reducing the sound effect, a determined useful signal type is obtained in an embodiment of the present invention. The signal hangover length is determined according to the useful signal type, and the useful signals and the non-useful signals in the received speech signals are further determined according to the signal hangover length. Here the music signal hangover may be set to a relatively great value to improve the sound effect of the music signal.
  • In the process of determining a useful signal as a speech signal or music signal, it is appropriate to set the signal not accurately identifiable to an uncertain type first, and then correct the uncertain type according to other parameters, and finally determine the type of useful signals.
  • Calculation of ISF parameters is not necessarily involved in the coding mode of non-useful signals. Therefore, no ISF parameters are calculated for the determined noise signals if the corresponding coding mode needs no calculation of ISF parameters, with a view to reducing the calculation load in the classification process and improving the classification efficiency.
    As shown in Figure 3, an apparatus for classifying sound signals in an embodiment of the present invention includes: a background noise parameter updating module, configured to: determine the update rate of background noise according to the spectral distribution parameters of the background noise and the current sound signals, and send the determined update rate to a PSC module; and a PSC module, configured to: update the noise parameters according to the update rate received from the background noise parameter updating module, perform primary classification for the signals according to the sub-band energy parameters and updated noise parameters, and determine the received speech signal to be a useful signal or non-useful signal.
  • The apparatus for classifying sound signals may further include a signal type judging module. The PSC module transfers the determined signal type to the signal type judging module. The signal type judging module determines the type of a useful signal based on the open loop pitch parameters, ISF parameters, and sub-band energy parameters, or based on ISF parameters and sub-band energy parameters, where the type of the useful signal includes speech and music.
  • The apparatus for classifying sound signals may further include a classification parameter extracting module. The PSC module transfers the determined signal type to the signal type judging module through the classification parameter extracting module. The classification parameter extracting module is further configured to: obtain ISF parameters and sub-band energy parameters, or further obtain open loop pitch parameters, process the obtained parameters into signal type characteristic parameters, and send the parameters to the signal type judging module; and process the obtained parameters into spectral distribution parameters of sound signals and background noise, and transfer the spectral distribution parameters to the background noise parameter updating module. Therefore, the signal type judging module determines the type of useful signals according to the foregoing signal type characteristic parameter and the signal type determined by the PSC module, where the type of useful signals includes speech and music.
  • The PSC module may be further configured to transfer the sound signal SNR calculated in the process of determining the signal type to the signal type judging module. The signal type judging module determines the useful signal to be a speech signal or music signal according to the SNR.
  • The apparatus for classifying sound signals may further include a coder mode and rate selecting module. The signal type judging module transfers the determined signal type to the coder mode and rate selecting module, and the coder mode and rate selecting module determines the coding mode and rate of sound signals according to the received signal type.
  • The apparatus for classifying sound signals may further include a coder parameter extracting module, which is configured to extract ISF parameters and sub-band energy parameters or additionally open loop pitch parameters, transfer the extracted parameters to the classification parameter extracting module, and transfer the extracted sub-band energy parameters to the PSC module.
  • The method for classifying and detecting sound signals and the apparatus for classifying sound signals in an embodiment of the present invention are detailed below.
  • Figure 4 is a schematic diagram showing a system in an embodiment of the present invention. The system includes a Sound Activity Detector (SAD). As required by the coder, the SAD sorts the audio digital signals into three types: non-useful signal, speech, and music, thus forming a basis for the coder to select the coding mode and rate.
  • As shown in Figure 4, the SAD module includes: a background noise estimation control module, a PSC module, a classification parameter extracting module, and a signal type judging module. As a signal classifier used inside the coder, the SAD makes the most of the parameters of the coder in order to reduce resource occupation and calculation complexity. Therefore, the coder parameter extracting module in the coder is used to calculate the sub-band energy parameters and coder parameters, and provide the calculated parameters for the SAD module. Moreover, the SAD module finally outputs a determined signal type (namely, non-useful signal, speech, or music), and provides the determined signal type for the coder mode and rate selecting module to select the coder mode and rate.
  • The SAD-related modules in the coder, sub-modules in the SAD, and the interaction processes between the sub-modules are detailed below.
  • The coder parameter extracting module in the coder calculates the sub-band energy parameters and coder parameters, and provides the calculated parameters for the SAD module. The sub-band energy parameters may be calculated through filtering of a filter group. The specific quantity of sub-bands (for example, 12 sub-bands in this embodiment) is determined according to the calculation complexity requirement and classification accuracy requirement.
  • Figure 5 or Figure 6 shows how a coder parameter extracting module calculates various parameters required by the SAD module in this embodiment.
  • The process shown in Figure 5 includes the following process:
  • Block 501: The coder parameter extracting module calculates the sub-band energy parameters first.
  • Block 502: The coder parameter extracting module decides whether it is necessary to perform ISF calculation according to the primary signal judgment result (Vad_flag) received from the PSC module, and performs block 503 if necessary; or performs block 504 if not necessary.
  • The decision about whether to perform ISF calculation in this block includes: If the current frame is composed of non-useful signals, the mechanism of the coder applies. The mechanism of the coder is: If ISF parameters are required when the coder encodes non-useful signals, the ISF calculation needs to be performed; otherwise, the operation of the coder parameter extracting module is finished. If the current frame is composed of useful signals, the ISF calculation needs to be performed. Most coding modes require calculation of ISF parameters for useful signals. Therefore, the calculation brings no redundant complexity to the coder. The technical solution to calculation of ISF parameters is detailed in the instruction manuals of coders, and is not repeated here any further.
  • Block 503: The coder parameter extracting module calculates the ISF parameters and then performs block 504.
  • Block 504: The coder parameter extracting module calculates the open loop pitch parameters.
  • The sub-band energy parameters calculated through the process in Figure 5 are provided for the PSC module and the classification parameter extracting module in the SAD, and other parameters are provided for the classification parameter extracting module in the SAD.
  • In the process shown in Figure 6, a block is added on the basis of the process in Figure 5: deciding whether to calculate the open loop pitch parameters depending on whether the primary noise converges. Blocks 601-603 are basically identical to blocks 501-503 in Figure 5. In block 604, a judgment is made about whether the primary noise parameter (namely, noise estimation) converges. If the primary noise parameter does not converge, the open loop pitch parameters are calculated in block 605; otherwise, no open loop pitch parameter is calculated.
  • The calculation of open loop pitch parameters is redundant for some coding modes such as TCX. To simplify the calculation, once the noise estimation converges, it is basically certain that the coding mode corresponding to the signal does not need the open loop pitch parameters, so the open loop pitch parameters are no longer calculated.
  • Before convergence of the noise estimation, the open loop pitch parameters need to be calculated in order to ensure the convergence of the noise estimation and the convergence speed. However, such calculation occurs only at the startup stage, and its complexity is negligible. The technical solution to calculation of open loop pitch parameters is detailed in the instructions about ACELP-based coding, and is not repeated here any further. The basis for judging whether the noise estimation converges may be: the count of frames continuously determined to be noise exceeds the noise convergence threshold (THR1). In an example in this embodiment, the value of THR1 is 20.
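The convergence check and the resulting skip of the open loop pitch calculation can be sketched as follows; the function names are illustrative, and THR1 = 20 is the example value from this embodiment:

```python
THR1 = 20  # noise convergence threshold used in this embodiment

def noise_estimation_converged(consecutive_noise_frames):
    """Noise estimation is taken as converged once the count of frames
    continuously judged to be noise exceeds THR1."""
    return consecutive_noise_frames > THR1

def need_open_loop_pitch(consecutive_noise_frames):
    # Before convergence the open loop pitch must still be computed to
    # secure convergence of the noise estimation; afterwards it is skipped.
    return not noise_estimation_converged(consecutive_noise_frames)
```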
  • The foregoing extracted sub-band energy parameter is: level[i], where i represents a member index of the vector, and its value falls within 1...12 in this embodiment, corresponding to 0-200 Hz, 200-400 Hz, 400-600 Hz, 600-800 Hz, 800-1200 Hz, 1200-1600 Hz, 1600-2000 Hz, 2000-2400 Hz, 2400-3200 Hz, 3200-40000 Hz, 4000-4800 Hz, and 4800-6400 Hz, respectively.
  • The foregoing extracted ISF parameter is Isfn [i], where n represents a frame index, and the value of i falls within 1...16, representing a member index in the vector.
  • The foregoing extracted open loop pitch parameters include: open_loop pitch gain (ol_gain), open_loop pitch lag (ol_lag), and tone_flag. If the value of ol_gain is greater than the value of tone threshold (TONE_THR), the tone_flag is set to 1.
  • The PSC module may be implemented through various VAD algorithms in the related art, and includes: background noise estimating sub-module, SNR calculating sub-module, useful signal estimating sub-module, judgment threshold adjusting sub-module, comparing sub-module, and hangover protective useful signal sub-module. In this embodiment, as shown in Figure 7, the implementation of the PSC module may differ from the VAD algorithm module in the related art in the following aspects:
  • I. The SNR calculating sub-module calculates the SNR according to the background noise estimation parameter and the sub-band energy parameters. The calculated SNR parameter is not only applied inside the PSC module, but also transferred to the signal type judging module so that the signal type judging module identifies the speech and music more accurately in the case of a low SNR.
  • II. The VAD in the related art underperforms in identifying noise and some types of music, and an improvement is made in this embodiment: First, the calculation of the background noise parameter is controlled by the update rate (ACC) provided by the background noise parameter updating module. The background noise estimating sub-module receives the update rate from the background noise parameter updating module, updates the noise parameter, and transfers the sub-band energy estimation parameters of background noise, calculated according to the updated noise parameter, to the SNR calculating sub-module. The calculation of the update rate is detailed in the description of the background noise parameter updating module hereinafter. In an example of this embodiment, the update rate comes in 4 levels: acc1, acc2, acc3, and acc4. For different update rates, different upward update parameters (update_up) and downward update parameters (update_down) are determined, where update_up corresponds to the upward update rate of the background noise, and update_down corresponds to the downward update rate of the background noise.
  • Afterwards, the solution to updating the noise parameter may be the solution in the AMR-WB+:

    alpha(n) = update_up, if level[m](n) > bckr_est[m](n); otherwise, alpha(n) = update_down

    Therefore, the formula for updating noise estimation is:

    bckr_est[m+1](n) = (1 - alpha(n)) * bckr_est[m](n) + alpha(n) * level[m](n)

    Therefore, the formula for updating the spectral distribution parameter vector of noise is:

    p̂[m+1](i) = (1 - alpha) * p̂[m](i) + alpha * p[m](i)

    where,
    m: frame index
    n: sub-band index
    i: element index of spectral distribution parameter vector, i = 1, 2, 3, 4
    bckr_est: sub-band energy of background noise estimation
    p̂: estimation of spectral distribution parameter vector of background noise
    p: spectral distribution parameter vector of the current signal
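A minimal sketch of this AMR-WB+-style first-order recursive update, assuming the rate is chosen per direction (update_up when the new value exceeds the current estimate, update_down otherwise); the function name and list-based interface are illustrative only:

```python
def update_noise_estimates(bckr_est, p_noise, level, p_current,
                           update_up, update_down):
    """One frame of recursive background-noise update.

    bckr_est  : per-sub-band noise energy estimate (updated in place)
    p_noise   : spectral distribution vector estimate of the noise
    level     : current frame's sub-band energies
    p_current : current frame's spectral distribution vector
    """
    for n, lvl in enumerate(level):
        # Direction-dependent smoothing factor for each sub-band.
        alpha = update_up if lvl > bckr_est[n] else update_down
        bckr_est[n] = (1.0 - alpha) * bckr_est[n] + alpha * lvl
    for i, p in enumerate(p_current):
        alpha = update_up if p > p_noise[i] else update_down
        p_noise[i] = (1.0 - alpha) * p_noise[i] + alpha * p
    return bckr_est, p_noise
```

A small update_up combined with a larger update_down would make the estimate rise cautiously but fall quickly, which is the usual behavior for noise floors.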
  • III. In the VAD in the related art, hangover is used to prevent useful signals from being mistaken for noise. The hangover length should be a tradeoff between signal protection and transmission efficiency. For traditional speech coders, the hangover length may be a constant after learning. A multi-rate coder is oriented to audio signals such as music. Such signals tend to have a long low-energy hangover, which is difficult for a conventional VAD to detect. Therefore, a relatively long hangover is required for protection. In this embodiment, the hangover length in the hangover protective useful signal sub-module is designed to be adaptive according to the SAD signal judgment result. If the judgment result is a music signal (SAD_flag = MUSIC), a long hangover (hang_len = HANG_LONG) is set; if the judgment result is a speech signal (SAD_flag = SPEECH), a short hangover (hang_len = HANG_SHORT) is set. The detailed setting mode is as follows:
        If (SAD_flag = MUSIC)
            hang_len = HANG_LONG
        else if (SAD_flag = SPEECH)
             hang_len = HANG_SHORT
        else
             hang_len = 0
        where,
        SAD_flag: SAD judgment flag
        hang_len: protective hangover length
  • In an example of this embodiment, HANG_LONG = 100, and HANG_SHORT = 20, which may be measured in frames.
  • The classification parameter extracting module is configured to: calculate the parameters required by the signal type judging module and the background noise parameter updating module according to the Vad_flag parameter determined by the PSC module and the sub-band energy parameters, ISF parameters, and open loop pitch parameters provided by the coder parameter extracting module; and provide the sub-band energy parameters, ISF parameters, open loop pitch parameters, and calculated parameters for the signal type judging module and the background noise parameter updating module. The parameters calculated by the classification parameter extracting module include:
  • 1. Pitch parameter
  • The differences between consecutive open loop pitch lags are compared. If the increment of the open loop pitch lag is less than a set threshold, the lag count accrues; if the sum of the lag counts of two consecutive frames is great enough, pitch is set to 1; otherwise, pitch is set to 0. The formula for calculating the open loop pitch lag is specified in the AMR-WB+/AMR-WB standard document.
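The pitch flag computation can be sketched as follows. LAG_DIFF_THR and LAG_COUNT_THR are assumed placeholder values, not the thresholds of the standard; the actual open loop pitch lag computation is specified in the AMR-WB+/AMR-WB standard document:

```python
LAG_DIFF_THR = 2    # assumed threshold on the lag increment
LAG_COUNT_THR = 8   # assumed threshold on the two-frame lag-count sum

def pitch_flag(ol_lags, prev_frame_lag_count):
    """ol_lags: open loop pitch lags of the current frame.
    Returns (pitch, lag_count), where pitch is the 0/1 flag."""
    lag_count = 0
    for prev, cur in zip(ol_lags, ol_lags[1:]):
        if abs(cur - prev) < LAG_DIFF_THR:   # stable lag track
            lag_count += 1
    pitch = 1 if lag_count + prev_frame_lag_count >= LAG_COUNT_THR else 0
    return pitch, lag_count
```

A steady lag track (typical of voiced speech) accumulates counts and raises the flag; erratic lags leave it at 0.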
  • 2. Longtime signal correlation value parameter (meangain)
  • The meangain is a moving average of the tone values of three adjacent frames, where tone = 1000*tone_flag. The definition of tone_flag is the same as that in the AMR-WB+.
  • 3. Zero Cross Rate (zcr)

    zcr = (1/T) * Σ_{i=1..T-1} II{ x(i) * x(i-1) < 0 }

    II{A} is 1 when A is true, and is 0 when A is false.
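The zero cross rate counts the fraction of adjacent-sample pairs whose product is negative; a straightforward sketch (function name illustrative):

```python
def zero_cross_rate(x):
    """zcr = (1/T) * sum over i of II{ x[i] * x[i-1] < 0 },
    where II{A} is 1 if A holds and 0 otherwise."""
    T = len(x)
    crossings = sum(1 for i in range(1, T) if x[i] * x[i - 1] < 0)
    return crossings / T
```

Noise tends to produce a high zcr, while voiced speech and tonal music produce a low one.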
  • 4. Time domain fluctuation of sub-band energy (t_flux)

    t_flux = Σ_{i=1..12} | level_m(i) - level_{m-1}(i) | / short_mean_level_energy

    where short_mean_level_energy represents the short-time average energy.
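The t_flux parameter sums the absolute change of each sub-band's energy between two consecutive frames and normalizes by the short-time average energy; a minimal sketch (function name illustrative):

```python
def time_domain_flux(level_m, level_m1, short_mean_level_energy):
    """level_m, level_m1: sub-band energies of the current and
    previous frame; returns the normalized time-domain fluctuation."""
    num = sum(abs(a - b) for a, b in zip(level_m, level_m1))
    return num / short_mean_level_energy
```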
  • 5. Ratio of high sub-band energy to low sub-band energy (ra)

    ra = sublevel_high_energy / sublevel_low_energy
    Given below is an instance of the present invention:
    sublevel_high_energy = level[10]+ level[11];
    sublevel_low_energy = level[0]+ level[1]+ level[2]+ level[3]+ level[4]+ level[5]+ level[6]+ level[7] + level[8]+ level[9];
  • 6. Frequency domain fluctuation of sub-band energy (f_flux)

    f_flux = Σ_{i=2..12} | level_m(i) - level_m(i-1) | / short_mean_level_energy
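The f_flux parameter is the within-frame counterpart of t_flux: it sums the absolute difference between adjacent sub-band energies of one frame and normalizes by the short-time average energy; a minimal sketch (function name illustrative):

```python
def freq_domain_flux(level_m, short_mean_level_energy):
    """level_m: sub-band energies of the current frame; returns the
    normalized frequency-domain fluctuation across adjacent sub-bands."""
    num = sum(abs(level_m[i] - level_m[i - 1])
              for i in range(1, len(level_m)))
    return num / short_mean_level_energy
```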
  • 7. ISF mean short-time distance (isf_meanSD): average of the ISF distance (Isf_SD) of five adjacent frames, where

    Isf_SD = Σ_{i=1..16} | Isf_m(i) - Isf_{m-1}(i) |
  • 8. Sub-band energy standard deviation mean (level_meanSD) parameter: average of the sub-band energy standard deviation (level_SD) of two adjacent frames, where the calculation method of the level_SD parameter is similar to the calculation method of the Isf_SD described above.
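The Isf_SD distance and its five-frame moving average can be sketched as follows (function names illustrative); the level_meanSD parameter is computed the same way, but from sub-band energy standard deviations over two adjacent frames:

```python
def isf_sd(isf_m, isf_m1):
    """Short-time ISF distance between two consecutive frames
    (sum of absolute differences over the 16 ISF elements)."""
    return sum(abs(a - b) for a, b in zip(isf_m, isf_m1))

def isf_mean_sd(isf_sd_history):
    """isf_meanSD: mean of the Isf_SD values of the five most
    recent adjacent frames."""
    return sum(isf_sd_history[-5:]) / 5.0
```

Speech tends to show larger, more variable ISF distances between frames than music, which is why this statistic helps separate the two classes.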
  • In the foregoing 8 parameters, the parameters provided for the background noise parameter updating module include: zcr, ra, i_flux, and t_flux; the parameters provided for the signal type judging module include: pitch, meangain, isf_meanSD, and level_meanSD.
  • The signal type judging module is configured to sort the signals into non-useful (such as noise), speech, and music according to the SNR and Vad_flag parameters received from the PSC module and the sub-band energy, pitch, meangain, Isf_meanSD, and level_meanSD parameters received from the classification parameter extracting module. The signal type judging module may include:
    • a parameter updating sub-module, configured to: update the threshold in the signal type judgment process according to the SNR, and provide the updated threshold for a judging sub-module; and
    • a judging sub-module, configured to: receive the sound signal type from the PSC module, determine the type of the useful signals in the sound signals based on the open loop pitch parameter, ISF parameter, sub-band energy parameter, and updated threshold, or based on the ISF parameter and sub-band energy parameter and the updated threshold, and send the determined type of the useful signals to the coder mode and rate selecting module.
  • The process of determining a useful signal to be a speech signal or music signal includes:
    • firstly, setting both the speech flag bit and the music flag bit to 0, sorting the signals primarily into speech, music, and uncertain signals according to the pitch parameter flag, longtime signal correlation value, isf_meanSD, and level_meanSD, and modifying the value of the speech flag bit or music flag bit according to the primarily determined speech or music;
    • secondly, correcting the primarily determined speech, music, and uncertain signals according to: sub-band energy, longtime signal correlation value, level_meanSD, speech_flag, music_flag, whether the number of continuous frames whose pitch value is 1 exceeds the preset hangover frame threshold, number of continuous music frames, number of continuous speech frames, and type of the previous frame; and determining the type of useful signals, where the type of a useful signal includes speech signal and music signal.
  • The process of determining a useful signal to be a speech signal or music signal is detailed below.
  • In order to ensure stability of judging signals and avoid frequent conversion of judgment results, this embodiment provides a parameter flag hangover mechanism. The characteristic parameter values such as pitch_flag, level_meanSD_high_flag, ISF_meanSD_high_flag, ISF_meanSD_low_flag, level_meanSD_low_flag, and meangain_flag are determined according to the hangover mechanism, as shown in Figure 8.
  • In Figure 8, the length of the hangover period is determined according to the hangover parameter flag value. This embodiment provides two types of hangover settings (namely, two solutions to determining the hangover parameter flag value).
  • In the first hangover setting solution, when the parameter value is above (or below) its threshold, the corresponding parameter hangover counter increases by one; otherwise, the counter is reset to 0. Different parameter hangover flag values are then set according to the counter value: the higher the counter, the greater the flag value. The specific mapping from counter values to flag values is determined as required at the time of setting, and is not described here any further.
  • In the second hangover setting solution, the hangover length is controlled according to the Error Rate (ER) of the internal nodes of the decision tree corresponding to the training parameter. If the ER is lower, the hangover is shorter; if the ER is higher, the hangover is longer.
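The first hangover setting solution can be sketched as follows. This is an illustrative sketch: the counter grows while the parameter stays beyond its threshold and resets otherwise, and the flag value grows with the counter. The counter-to-flag breakpoints (3 and 6) are hypothetical, since the text leaves the mapping open.

```python
# Illustrative sketch of the first hangover setting solution: counter-driven
# hangover flags with hypothetical breakpoints (3 and 6).
def update_hangover(counter, beyond_threshold):
    # counter grows while the parameter stays beyond its threshold
    counter = counter + 1 if beyond_threshold else 0
    # higher counter -> greater flag value (breakpoints are hypothetical)
    if counter >= 6:
        flag = 2
    elif counter >= 3:
        flag = 1
    else:
        flag = 0
    return counter, flag

counter = flag = 0
for beyond in [True, True, True, False, True]:
    counter, flag = update_hangover(counter, beyond)
print(counter, flag)  # the False resets the counter; one True follows
```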
  • Afterwards, if the current signal is classified as a useful signal, the signal is primarily sorted into either speech or music:
  • Firstly, primary speech judgment is performed. As shown in Figure 9, in block 901, the speech flag bit is set to 0, and then in block 902, a judgment is made about whether the Isf_meanSD is greater than the first ISF speech threshold (such as 1500). If the Isf_meanSD is greater than the first ISF speech threshold, the speech flag bit is set to 1; otherwise,
    in block 903, a judgment is made about whether the pitch value is 1 and the pitch lag value (t_top_mean), obtained by switching the pitch search on and off, is less than the pitch speech threshold (such as 40). If yes, the speech flag bit is set to 1; otherwise,
    in block 904, a judgment is made about whether the number of continuous frames whose pitch value is 1 exceeds the preset threshold of the number of hangover frames (such as 2 frames). If yes, the speech flag bit is set to 1; otherwise:
    in block 905, a judgment is made about whether the meangain exceeds the preset threshold of the longtime correlation speech (such as 8000). If yes, the speech flag bit is set to 1; otherwise,
    in block 906, a judgment is made about whether either or both of the level_meanSD_high_flag value and the ISF_meanSD_high_flag value are 1. If yes, the speech flag bit is set to 1; otherwise, the value of the speech flag bit remains unchanged.
  • Afterwards, primary music judgment is performed, as shown in Figure 10:
  • In block 1001, the music flag bit is set to 0 first, and then in block 1002, a judgment is made about whether the signal fulfills both ISF_meanSD_low_flag = 1 and level_meanSD_low_flag = 1. If yes, the music signal flag (music_flag) is set; otherwise, the value of the music flag bit remains unchanged.
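The primary speech and music judgments (Figures 9 and 10) can be condensed into one sketch. This is an illustrative sketch using the example thresholds quoted in the text (1500, 40, 2, 8000); parameter names mirror the text, and `pitch_run` stands for the number of continuous frames whose pitch value is 1.

```python
# Illustrative condensed sketch of the primary speech/music flag setting.
def primary_judgment(isf_meanSD, pitch, t_top_mean, pitch_run, meangain,
                     level_meanSD_high_flag, isf_meanSD_high_flag,
                     isf_meanSD_low_flag, level_meanSD_low_flag):
    speech_flag = 1 if (
        isf_meanSD > 1500                       # block 902
        or (pitch == 1 and t_top_mean < 40)     # block 903
        or pitch_run > 2                        # block 904
        or meangain > 8000                      # block 905
        or level_meanSD_high_flag == 1          # block 906
        or isf_meanSD_high_flag == 1
    ) else 0
    music_flag = 1 if (isf_meanSD_low_flag == 1
                       and level_meanSD_low_flag == 1) else 0   # block 1002
    return speech_flag, music_flag

print(primary_judgment(1600, 0, 0, 0, 0, 0, 0, 0, 0))  # isf_meanSD above 1500
```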
  • Afterwards, as shown in Figure 11, the primary judgment result is corrected:
  • In block 1101, a judgment is made about whether the instant energy of the sub-band is less than the sub-band energy threshold (such as 5000). If yes, the process proceeds to block 1102; otherwise, the signal is determined to be of the uncertain type.
  • In block 1102, a judgment is made about whether meangain_flag is 1 and the continuous music count is less than the speech judgment threshold of the continuous music count (such as 3). If yes, the signal is determined to be a speech signal; otherwise,
    in block 1103, a judgment is made about whether the ISF_meanSD value exceeds the preset second ISF speech threshold (such as 2000). If yes, the signal is determined to be a speech signal; otherwise,
    in block 1104, a judgment is made about whether the level_energy is less than 10000 and more than five frames are previously determined to be noise. If yes, the current signal type is set to the uncertain type, with a view to reducing the probability of mistaking noise for music; otherwise,
    in block 1105, a judgment is made about whether both the music flag bit and the speech flag bit are 1. If yes, the current signal type is determined to be the uncertain type; otherwise,
    in block 1106, a judgment is made about whether both the music flag bit and the speech flag bit are 0. If yes, the current signal type is determined to be the uncertain type; otherwise,
    in block 1107, a judgment is made about whether the music flag bit is 0 and the speech flag bit is 1. If yes, the current signal type is determined to be the speech type; otherwise,
    in block 1108, because the music flag bit is 1 and the speech flag bit is 0, the current signal type is determined to be the music type.
  • After the signal is determined to be of the uncertain type in the foregoing blocks 1104, 1105 and 1106, block 1109 is performed to judge whether pitch_flag is 1, the ISF_meanSD is less than the ISF music threshold (such as 900), and the number of continuous speech frames is less than 3. If yes, the signal is determined to be of the music type; otherwise, the signal is still determined to be of the uncertain type.
  • After the signal is determined to be of the speech type in the foregoing blocks 1103 and 1107, block 1110 is performed to judge whether the number of continuous music frames is greater than 3 and the ISF_meanSD is less than the ISF music threshold. If yes, the signal is determined to be a music signal; otherwise, the signal is determined to be a speech signal.
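The correction flow of blocks 1101-1108 can be condensed into one decision chain. This is an illustrative sketch using the example thresholds quoted in the text; the post-corrections of blocks 1109 and 1110 are omitted, and `music_run`/`noise_run` stand for the continuous music and noise frame counts.

```python
# Illustrative condensed sketch of the correction flow in Figure 11
# (blocks 1101-1108 only).
UNCERTAIN, SPEECH, MUSIC = "uncertain", "speech", "music"

def correct_primary(level_energy, meangain_flag, music_run, isf_meanSD,
                    noise_run, music_flag, speech_flag):
    if level_energy >= 5000:                       # block 1101
        return UNCERTAIN
    if meangain_flag == 1 and music_run < 3:       # block 1102
        return SPEECH
    if isf_meanSD > 2000:                          # block 1103
        return SPEECH
    if level_energy < 10000 and noise_run > 5:     # block 1104
        return UNCERTAIN
    if music_flag == speech_flag:                  # blocks 1105-1106
        return UNCERTAIN
    return SPEECH if speech_flag == 1 else MUSIC   # blocks 1107-1108

print(correct_primary(4000, 0, 0, 2500, 0, 0, 0))  # falls through to block 1103
```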
  • After the speech signals and music signals are determined through the foregoing process, the signals of the uncertain type undergo the primary corrective classification process shown in Figure 12, including:
  • In block 1201, a judgment is made about whether the level_energy is less than the threshold (such as 5000) of the uncertain type of sub-band energy. If yes, the signal type is still determined to be the uncertain class; otherwise,
    in block 1202, a judgment is made about whether the number of continuous music frames is greater than 1 and ISF_meanSD is less than the ISF music threshold. If yes, the signal is determined to be of the music class; otherwise,
    the speech and music hangover flags are cleared. If the signals before this frame are continuous speech signals and the continuity is strong, the speech is judged according to the characteristic parameters of the speech. If the speech conditions are fulfilled, the speech_hangover_flag is set to 1, as illustrated in blocks 1203 to 1206 in Figure 12. If the signals before this frame are continuous music signals and the continuity is strong, the music is judged according to the characteristic parameters of the music. If the music conditions are fulfilled, the music_hangover_flag is set to 1, as illustrated in blocks 1207 to 1210 in Figure 12.
  • Afterwards, as illustrated in blocks 1211 to 1216 in Figure 12, if the speech hangover flag is 1 and the music hangover flag is 0, the current signal type is set to the speech class. If the music hangover flag is 1 and the speech hangover flag is 0, the current signal type is set to the music class. If both the music hangover flag and the speech hangover flag are 1 or both are 0, the signal type is set to the uncertain class. In this case, if more than 20 previous music frames are continuous, the signal is determined to be of the music class; if more than 20 previous speech frames are continuous, the signal is determined to be of the speech class.
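The resolution of an uncertain frame from the hangover flags (blocks 1211-1216) can be sketched as follows. This is an illustrative sketch; `music_run` and `speech_run` stand for the counts of continuous previous music and speech frames.

```python
# Illustrative sketch of resolving an uncertain frame from the two hangover
# flags, falling back to long continuous context when the flags agree.
def resolve_uncertain(speech_hangover_flag, music_hangover_flag,
                      music_run, speech_run):
    if speech_hangover_flag == 1 and music_hangover_flag == 0:
        return "speech"
    if music_hangover_flag == 1 and speech_hangover_flag == 0:
        return "music"
    # flags both 1 or both 0: more than 20 continuous frames decides
    if music_run > 20:
        return "music"
    if speech_run > 20:
        return "speech"
    return "uncertain"

print(resolve_uncertain(0, 0, 25, 0))  # long music context wins
```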
  • After the foregoing primary correction is performed, the useful signal type is finally corrected, as shown in Figure 13, according to the current context. In block 1301, if the current context is music and it has lasted longer than 3 seconds, namely, the current continuous music frames are more than 150 frames, mandatory correction may be performed according to the ISF_meanSD value to determine the music signal. In block 1302, if the current context is speech and it has lasted longer than 3 seconds, namely, the current continuous speech frames are more than 150 frames, mandatory correction may be performed according to the ISF_meanSD value to determine the speech signal class. Afterwards, if the signal type is still uncertain, the signal type is corrected according to the previous context in block 1303, namely, the current uncertain signal is sorted into the previous signal type.
  • After the type of useful signals is determined in the foregoing process, the three type counters and the threshold values in the signal type judging module need to be updated. For the three type counters, if the current type is music (signal_sort = music), the music counter (music_continue_counter) increases by 1; otherwise, the music counter is cleared. The other type counters are processed similarly, as shown in Figure 14, and are not detailed here any further. The threshold values are updated according to the SNR output by the PSC module. The threshold examples given in the embodiments herein are values learned at an SNR of 20 dB.
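The type counter update can be sketched as follows. This is an illustrative sketch of the rule in Figure 14: the counter of the current type increments and the other counters are cleared.

```python
# Illustrative sketch of the type counter update: increment the counter
# matching the current signal_sort, clear the rest.
def update_counters(counters, signal_sort):
    for sort in counters:
        counters[sort] = counters[sort] + 1 if sort == signal_sort else 0
    return counters

counters = {"music": 4, "speech": 2, "uncertain": 0}
print(update_counters(counters, "music"))
```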
  • The background noise parameter updating module uses some spectral distribution parameters calculated in the classification parameter extracting module in the SAD to control the update rate of the background noise. In an actual application environment, the energy level of the background noise may surge abruptly. In this case, the background noise estimation probably stops being updated because the signals are continuously determined to be useful signals. This problem is solved by the background noise parameter updating module.
  • The background noise parameter updating module calculates the vector of relevant spectral distribution parameters according to the parameters received from the classification parameter extracting module. The vector includes the following elements:
    • zero cross rate short-time mean (zcr_mean)
    • short-time mean of ratio of high sub-band energy to low sub-band energy (RA)
    • short-time mean of frequency domain fluctuation (f_flux) of sub-band energy
    • short-time mean of time domain fluctuation (t_flux) of sub-band energy
    where the zcr_mean is calculated in the following way, and the other elements are calculated similarly: zcr_mean(m) = ALPHA · zcr_mean(m−1) + (1 − ALPHA) · zcr(m)

    where ALPHA = 0.96 and m represents a frame index.
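The short-time mean update can be sketched directly; this is an illustrative sketch of the recursion applied to each element of the spectral distribution vector, with ALPHA = 0.96 as in the text.

```python
# Illustrative sketch of the short-time mean recursion used for zcr_mean
# and the other spectral distribution elements.
ALPHA = 0.96

def short_time_mean(prev_mean, current):
    # heavy weight on the running mean, light weight on the new frame
    return ALPHA * prev_mean + (1.0 - ALPHA) * current

print(short_time_mean(0.5, 1.0))  # 0.96 * 0.5 + 0.04 * 1.0
```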
  • This embodiment makes use of the stable spectral features of the background noise. The elements of the spectral distribution parameter vector are not limited to the 4 elements listed above. The update rate of the current background noise is controlled by a difference (d_cb) between the current spectral distribution parameter and the spectral distribution parameter estimation of the background noise. The difference may be implemented through algorithms such as the Euclidean distance and the Manhattan distance. An instance of the present invention adopts the Manhattan distance (a distance calculation method similar to the Euclidean distance): d_cb = Σ (i = 1 to 4) |p(i) − p̃(i)|

    where p is the spectral distribution parameter vector of the current signal, and p̃ is the spectral distribution parameter vector estimation of the background noise.
  • In an example of this embodiment, if d_cb < TH1, the module outputs an update rate acc1, which represents the fastest update rate; otherwise, if d_cb < TH2, the module outputs an update rate acc2; otherwise, if d_cb < TH3, the module outputs an update rate acc3; otherwise, the module outputs an update rate acc4. TH1, TH2, and TH3 are update thresholds, and the specific threshold values depend on the actual environment conditions.
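The distance calculation and rate selection above can be sketched together. This is an illustrative sketch; the threshold values and the rate constants acc1..acc4 below are hypothetical placeholders, since the text leaves them environment-dependent.

```python
# Illustrative sketch of the update rate selection: the Manhattan distance
# between the current spectral distribution vector and the noise estimate
# is mapped to one of four rates.
TH1, TH2, TH3 = 0.1, 0.3, 0.6                  # hypothetical thresholds
ACC1, ACC2, ACC3, ACC4 = 1.0, 0.5, 0.1, 0.01   # hypothetical rates

def manhattan_distance(p, p_noise):
    # d_cb = sum over i of |p(i) - p_noise(i)|
    return sum(abs(a - b) for a, b in zip(p, p_noise))

def update_rate(d_cb):
    if d_cb < TH1:
        return ACC1   # fastest: current spectrum close to the noise estimate
    if d_cb < TH2:
        return ACC2
    if d_cb < TH3:
        return ACC3
    return ACC4

d = manhattan_distance([0.2, 0.1, 0.4, 0.3], [0.2, 0.1, 0.4, 0.35])
print(update_rate(d))  # distance below TH1
```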
  • In the embodiments of the present invention, the update rate of the background noise is determined, the noise parameters are updated according to the update rate, the signals are classified primarily according to the sub-band energy parameters and the updated noise parameters, and the non-useful signals and the useful signals in the received speech signals are determined, thus reducing the probability of mistaking useful signals for noise signals and improving accuracy of classifying sound signals.
  • It is understandable to those skilled in the art that the embodiments of the present invention may be implemented through software running on a universal hardware platform, or through hardware only. In most cases, however, software running on a universal hardware platform is preferred. Therefore, the technical solution under the present invention, or its contributions to the related art, may be embodied by a software product. The software product is stored in a storage medium and incorporates several instructions so that a computer device (for example, a PC, server, or network device) may execute the method in each embodiment of the present invention.
  • Described above are preferred embodiments of the present invention. In practice, those skilled in the art may make modifications to the method under the present invention to meet the specific requirements. Although the invention has been described through some exemplary embodiments, the invention is not limited to such embodiments.
  • Claims (17)

    1. A method for classifying sound signals, comprising:
      (a) receiving the sound signals, and determining an update rate of background noise according to spectral distribution parameters of the background noise and spectral distribution parameters of the sound signals; and
      (b) updating noise parameters according to the update rate, and classifying the sound signals according to sub-band energy parameters and the updated noise parameters.
    2. The method of claim 1, wherein after (b), the method further comprises:
      (c) determining the type of useful signals obtained through classification based on an open loop pitch parameter, an Immittance Spectral Frequency (ISF) parameter, and a sub-band energy parameter, wherein the type of the useful signals comprises speech and music.
    3. The method of claim 2, wherein before (c), the method further comprises:
      (c0) detecting whether noise estimation converges; if the noise estimation converges, performing (c1); otherwise, performing (c); and
      (c1) determining the type of the useful signals obtained through the classification based on the ISF parameter and the sub-band energy parameter, wherein the type of the useful signals comprises the speech and the music.
    4. The method of claim 3, wherein the process of detecting whether primary noise converges in (c0) is:
      judging whether the number of continuous noise frames before a received sound signal exceeds a preset noise convergence threshold; if the number of continuous noise frames exceeds a preset noise convergence threshold, determining that the noise estimation converges; otherwise, determining that the noise estimation does not converge.
    5. The method of claim 2, wherein (b) further comprises:
      obtaining the determined type of the useful signals, determining a signal hangover length according to the type of the useful signals, and classifying the sound signals according to the signal hangover length.
    6. The method of claim 2, wherein (c) further comprises:
      initializing a speech flag bit and a music flag bit; determining the type of the useful signals primarily according to a pitch parameter flag, a longtime signal correlation parameter, an isf_meanSD parameter, a level_meanSD parameter, and corresponding thresholds, wherein the type is speech, music, or uncertain; and modifying the speech flag bit and the music flag bit according to the primarily determined speech and music;
      correcting the primarily determined speech, music, and uncertain signals according to: sub-band energy, the longtime signal correlation parameter, the level_meanSD parameter, the speech flag bit, the music flag bit, whether a count of continuous frames whose pitch parameter flag value is 1 exceeds a preset hangover frame threshold, a count of continuous music frames, a count of continuous speech frames, and the type of a previous frame and corresponding thresholds; and correcting the primarily determined speech, music or uncertain signals; and finally determining the type of the useful signals, where the type of the useful signals comprises speech and music.
    7. The method of claim 6, wherein the threshold is adjusted according to a Signal-to-Noise Ratio (SNR) of the sound signals.
    8. The method of claim 1, wherein after (b), the method further comprises:
      (d) determining a coding mode corresponding to non-useful signals obtained through the classification, and determining whether it is necessary to calculate an Immittance Spectral Frequency (ISF) parameter according to the determined coding mode.
    9. The method of claim 1, wherein the noise parameters in (b) comprise: a noise estimation parameter, and a noise spectral distribution parameter.
    10. The method of claim 1 or 9, wherein (a) comprises:
      calculating a difference between the spectral distribution parameter of the sound signals and the spectral distribution parameter of the background noise, and determining the update rate according to the difference.
    11. The method of claim 10, wherein the spectral distribution parameters involved in calculation of the difference comprise:
      Zero Cross Rate (ZCR) short-time mean, short-time mean of ratio of high sub-band energy to low sub-band energy, short-time mean of sub-band energy frequency domain fluctuation, and short-time mean of sub-band energy time domain fluctuation.
    12. An apparatus for classifying sound signals, comprising:
      a background noise parameter updating module, configured to: determine an update rate of background noise according to a spectral distribution parameter of the background noise and spectral distribution parameters of current sound signals, and send the determined update rate; and
      a Primary Signal Classification (PSC) module, configured to: receive the update rate from the background noise parameter updating module, update noise parameters, classify the current sound signals according to a sub-band energy parameter and the updated noise parameters, and send a sound signal type determined through classification.
    13. The apparatus of claim 12, further comprising a signal type judging module, configured to:
      receive the sound signal type from the PSC module;
      determine the type of useful signals in the sound signals based on an open loop pitch parameter, an Immittance Spectral Frequency (ISF) parameter, and a sub-band energy parameter, or based on the ISF parameter and the sub-band energy parameter, wherein the type of the useful signals comprises speech and music; and
      send the determined type of the useful signals.
    14. The apparatus of claim 13, further comprising a classification parameter extracting module, configured to:
      receive the sound signal type from the PSC module, and transfer the sound signal type to the signal type judging module; and
      obtain the ISF parameter and the sub-band energy parameter, or further obtain the open loop pitch parameter, process the obtained parameters into signal type characteristic parameters, and send the parameters to the signal type judging module; and
      process the obtained parameters into the spectral distribution parameter of the sound signals and the spectral distribution parameter of the background noise, and transfer the spectral distribution parameters to the background noise parameter updating module; and
      the signal type judging module determines the type of the useful signals according to the signal type characteristic parameter and the sound signal type determined by the PSC module, wherein the type of the useful signals comprises speech and music.
    15. The apparatus of claim 13 or claim 14, wherein the PSC module comprises:
      a background noise estimating sub-module, a Signal-to-Noise Ratio (SNR) calculating sub-module, a useful signal estimating sub-module, a judgment threshold adjusting sub-module, a comparing sub-module, and a hangover protective useful signal sub-module; wherein
      the background noise estimating sub-module is configured to: receive the update rate from the background noise parameter updating module, update the noise parameters, and transfer the sub-band energy estimation parameter of the background noise, calculated according to the updated noise parameters, to the SNR calculating sub-module;
      the SNR calculating sub-module is configured to: receive the sub-band energy estimation parameter of the background noise, calculate an SNR according to this parameter and the sub-band energy parameter, and transfer the SNR to the signal type judging module;
      the signal type judging module comprises a parameter updating sub-module and a judging sub-module, wherein the parameter updating sub-module is configured to update thresholds in a signal type judgment process according to the SNR and provide the updated threshold to the judging sub-module; and
      the judging sub-module is configured to: receive the sound signal type from the PSC module, determine the type of the useful signals in the sound signals based on the open loop pitch parameter, ISF parameter, sub-band energy parameter, and updated thresholds, or based on the ISF parameter and sub-band energy parameter and the updated threshold, and send the determined type of the useful signals.
    16. The apparatus of claim 13, further comprising:
      a coder mode and rate selecting module, configured to: receive the type of the useful signals from the signal type judging module, and determine a coding mode and rate of the sound signals according to the received type of the useful signals.
    17. The apparatus of claim 14, further comprising:
      a coder parameter extracting module, configured to: extract the ISF parameter and the sub-band energy parameter or additionally the open loop pitch parameter, transfer the extracted parameters to the classification parameter extracting module, and transfer the extracted sub-band energy parameter to the PSC module.
    EP07855800A 2006-12-05 2007-12-26 Method and apparatus for classifying sound signals Active EP2096629B1 (en)

    Applications Claiming Priority (2)

    Application Number Priority Date Filing Date Title
    CN 200610164456 CN100483509C (en) 2006-12-05 2006-12-05 Aural signal classification method and device
    PCT/CN2007/003798 WO2008067735A1 (en) 2006-12-05 2007-12-26 A classing method and device for sound signal

    Publications (3)

    Publication Number Publication Date
    EP2096629A1 true EP2096629A1 (en) 2009-09-02
    EP2096629A4 EP2096629A4 (en) 2011-01-26
    EP2096629B1 EP2096629B1 (en) 2012-10-24

    Family

    ID=39491665

    Family Applications (1)

    Application Number Title Priority Date Filing Date
    EP07855800A Active EP2096629B1 (en) 2006-12-05 2007-12-26 Method and apparatus for classifying sound signals

    Country Status (3)

    Country Link
    EP (1) EP2096629B1 (en)
    CN (1) CN100483509C (en)
    WO (1) WO2008067735A1 (en)

    Cited By (6)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    WO2012146290A1 (en) * 2011-04-28 2012-11-01 Telefonaktiebolaget L M Ericsson (Publ) Frame based audio signal classification
    CN102928713A (en) * 2012-11-02 2013-02-13 北京美尔斯通科技发展股份有限公司 Background noise measuring method of magnetic antennas
    JP2014517938A (en) * 2011-05-24 2014-07-24 クゥアルコム・インコーポレイテッド Mode classification of noise robust speech coding
    US9224403B2 (en) 2010-07-02 2015-12-29 Dolby International Ab Selective bass post filter
    RU2630889C2 (en) * 2012-11-13 2017-09-13 Самсунг Электроникс Ко., Лтд. Method and device for determining the coding mode, method and device for coding audio signals and a method and device for decoding audio signals
    US10090003B2 (en) 2013-08-06 2018-10-02 Huawei Technologies Co., Ltd. Method and apparatus for classifying an audio signal based on frequency spectrum fluctuation

    Families Citing this family (15)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    JP5168162B2 (en) * 2009-01-16 2013-03-21 沖電気工業株式会社 SOUND SIGNAL ADJUSTMENT DEVICE, PROGRAM AND METHOD, AND TELEPHONE DEVICE
    EP2490214A4 (en) * 2009-10-15 2012-10-24 Huawei Tech Co Ltd Signal processing method, device and system
    CN102299693B (en) * 2010-06-28 2017-05-03 瀚宇彩晶股份有限公司 Message adjustment system and method
    CN102446506B (en) * 2010-10-11 2013-06-05 华为技术有限公司 Classification identifying method and equipment of audio signals
    US9099098B2 (en) * 2012-01-20 2015-08-04 Qualcomm Incorporated Voice activity detection in presence of background noise
    US8712076B2 (en) 2012-02-08 2014-04-29 Dolby Laboratories Licensing Corporation Post-processing including median filtering of noise suppression gains
    US9173025B2 (en) 2012-02-08 2015-10-27 Dolby Laboratories Licensing Corporation Combined suppression of noise, echo, and out-of-location signals
    EP3301676A1 (en) * 2012-08-31 2018-04-04 Telefonaktiebolaget LM Ericsson (publ) Method and device for voice activity detection
    CN106328169B (en) * 2015-06-26 2018-12-11 中兴通讯股份有限公司 A kind of acquisition methods, activation sound detection method and the device of activation sound amendment frame number
    CN106328152B (en) * 2015-06-30 2020-01-31 芋头科技(杭州)有限公司 automatic indoor noise pollution identification and monitoring system
    CN105654944B (en) * 2015-12-30 2019-11-01 中国科学院自动化研究所 It is a kind of merged in short-term with it is long when feature modeling ambient sound recognition methods and device
    CN107123419A (en) * 2017-05-18 2017-09-01 北京大生在线科技有限公司 The optimization method of background noise reduction in the identification of Sphinx word speeds
    CN108257617B (en) * 2018-01-11 2021-01-19 会听声学科技(北京)有限公司 Noise scene recognition system and method
    CN110992989B (en) * 2019-12-06 2022-05-27 广州国音智能科技有限公司 Voice acquisition method and device and computer readable storage medium
    CN113257276B (en) * 2021-05-07 2024-03-29 普联国际有限公司 Audio scene detection method, device, equipment and storage medium

    Citations (2)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    WO1996005592A1 (en) * 1994-08-10 1996-02-22 Qualcomm Incorporated Method and apparatus for selecting an encoding rate in a variable rate vocoder
    WO2002065457A2 (en) * 2001-02-13 2002-08-22 Conexant Systems, Inc. Speech coding system with a music classifier

    Family Cites Families (6)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    US6691084B2 (en) * 1998-12-21 2004-02-10 Qualcomm Incorporated Multiple mode variable rate speech coding
    US6456964B2 (en) * 1998-12-21 2002-09-24 Qualcomm, Incorporated Encoding of periodic speech using prototype waveforms
    JP3454206B2 (en) * 1999-11-10 2003-10-06 三菱電機株式会社 Noise suppression device and noise suppression method
    US6983242B1 (en) * 2000-08-21 2006-01-03 Mindspeed Technologies, Inc. Method for robust classification in speech coding
    CN1175398C (en) * 2000-11-18 2004-11-10 中兴通讯股份有限公司 Sound activation detection method for identifying speech and music from noise environment
    WO2002080148A1 (en) * 2001-03-28 2002-10-10 Mitsubishi Denki Kabushiki Kaisha Noise suppressor

    Patent Citations (2)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    WO1996005592A1 (en) * 1994-08-10 1996-02-22 Qualcomm Incorporated Method and apparatus for selecting an encoding rate in a variable rate vocoder
    WO2002065457A2 (en) * 2001-02-13 2002-08-22 Conexant Systems, Inc. Speech coding system with a music classifier

    Non-Patent Citations (3)

    * Cited by examiner, † Cited by third party
    Title
    "Digital cellular telecommunications system (Phase 2+); Universal Mobile Telecommunications System (UMTS); Audio codec processing functions; Extended Adaptive Multi-Rate - Wideband (AMR-WB+) codec; Transcoding functions (3GPP TS 26.290 version 6.3.0 Release 6); ETSI TS 126 290", ETSI STANDARDS, LIS, SOPHIA ANTIPOLIS CEDEX, FRANCE, vol. 3-SA4, no. V6.3.0, 1 June 2005 (2005-06-01), XP014030612, ISSN: 0000-0001 *
    JELINEK M ET AL: "Robust signal/noise discrimination for wideband speech and audio coding", SPEECH CODING, 2000. PROCEEDINGS. 2000 IEEE WORKSHOP ON SEPTEMBER 17-20, 2000, PISCATAWAY, NJ, USA,IEEE, 17 September 2000 (2000-09-17), pages 151-153, XP010520072, ISBN: 978-0-7803-6416-5 *
    See also references of WO2008067735A1 *

    Cited By (27)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    US9830923B2 (en) 2010-07-02 2017-11-28 Dolby International Ab Selective bass post filter
    US11996111B2 (en) 2010-07-02 2024-05-28 Dolby International Ab Post filter for audio signals
    US9858940B2 (en) 2010-07-02 2018-01-02 Dolby International Ab Pitch filter for audio signals
    US9224403B2 (en) 2010-07-02 2015-12-29 Dolby International Ab Selective bass post filter
    US11610595B2 (en) 2010-07-02 2023-03-21 Dolby International Ab Post filter for audio signals
    US9343077B2 (en) 2010-07-02 2016-05-17 Dolby International Ab Pitch filter for audio signals
    US9396736B2 (en) 2010-07-02 2016-07-19 Dolby International Ab Audio encoder and decoder with multiple coding modes
    US9552824B2 (en) 2010-07-02 2017-01-24 Dolby International Ab Post filter
    US9558754B2 (en) 2010-07-02 2017-01-31 Dolby International Ab Audio encoder and decoder with pitch prediction
    US9558753B2 (en) 2010-07-02 2017-01-31 Dolby International Ab Pitch filter for audio signals
    US9595270B2 (en) 2010-07-02 2017-03-14 Dolby International Ab Selective post filter
    US11183200B2 (en) 2010-07-02 2021-11-23 Dolby International Ab Post filter for audio signals
    US10811024B2 (en) 2010-07-02 2020-10-20 Dolby International Ab Post filter for audio signals
    US10236010B2 (en) 2010-07-02 2019-03-19 Dolby International Ab Pitch filter for audio signals
    WO2012146290A1 (en) * 2011-04-28 2012-11-01 Telefonaktiebolaget L M Ericsson (Publ) Frame based audio signal classification
    US9240191B2 (en) 2011-04-28 2016-01-19 Telefonaktiebolaget L M Ericsson (Publ) Frame based audio signal classification
    JP2014517938A (en) * 2011-05-24 2014-07-24 クゥアルコム・インコーポレイテッド Mode classification of noise robust speech coding
    CN102928713A (en) * 2012-11-02 2013-02-13 北京美尔斯通科技发展股份有限公司 Background noise measuring method of magnetic antennas
    RU2656681C1 (en) * 2012-11-13 2018-06-06 Самсунг Электроникс Ко., Лтд. Method and device for determining the coding mode, the method and device for coding of audio signals and the method and device for decoding of audio signals
    US11004458B2 (en) 2012-11-13 2021-05-11 Samsung Electronics Co., Ltd. Coding mode determination method and apparatus, audio encoding method and apparatus, and audio decoding method and apparatus
    RU2630889C2 (en) * 2012-11-13 2017-09-13 Самсунг Электроникс Ко., Лтд. Method and device for determining the coding mode, method and device for coding audio signals and a method and device for decoding audio signals
    US10468046B2 (en) 2012-11-13 2019-11-05 Samsung Electronics Co., Ltd. Coding mode determination method and apparatus, audio encoding method and apparatus, and audio decoding method and apparatus
    RU2680352C1 (en) * 2012-11-13 2019-02-19 Самсунг Электроникс Ко., Лтд. Encoding mode determining method and device, the audio signals encoding method and device and the audio signals decoding method and device
    US10529361B2 (en) 2013-08-06 2020-01-07 Huawei Technologies Co., Ltd. Audio signal classification method and apparatus
    US11289113B2 (en) 2013-08-06 2022-03-29 Huawei Technolgies Co. Ltd. Linear prediction residual energy tilt-based audio signal classification method and apparatus
    US11756576B2 (en) 2013-08-06 2023-09-12 Huawei Technologies Co., Ltd. Classification of audio signal as speech or music based on energy fluctuation of frequency spectrum
    US10090003B2 (en) 2013-08-06 2018-10-02 Huawei Technologies Co., Ltd. Method and apparatus for classifying an audio signal based on frequency spectrum fluctuation

    Also Published As

    Publication number Publication date
    EP2096629B1 (en) 2012-10-24
    EP2096629A4 (en) 2011-01-26
    CN101197135A (en) 2008-06-11
    CN100483509C (en) 2009-04-29
    WO2008067735A1 (en) 2008-06-12

    Similar Documents

    Publication Publication Date Title
    EP2096629B1 (en) Method and apparatus for classifying sound signals
    JP3197155B2 (en) Method and apparatus for estimating and classifying a speech signal pitch period in a digital speech coder
    CN101197130B (en) Sound activity detecting method and detector thereof
    RU2441286C2 (en) Method and apparatus for detecting sound activity and classifying sound signals
    EP2159788B1 (en) A voice activity detecting device and method
    US6202046B1 (en) Background noise/speech classification method
    US6424938B1 (en) Complex signal activity detection for improved speech/noise classification of an audio signal
    RU2417456C2 (en) Systems, methods and devices for detecting changes in signals
    CN103548081B (en) Noise-robust mode classification for speech coding
    EP1147515A1 (en) Wide band speech synthesis by means of a mapping matrix
    US7478042B2 (en) Speech decoder that detects stationary noise signal regions
    CN101149921A (en) Silence detection method and device
    CN101393741A (en) Audio signal classification apparatus and method used in wideband audio encoder and decoder
    US6564182B1 (en) Look-ahead pitch determination
    JPH10105194A (en) Pitch detecting method, and method and device for encoding speech signal
    JP3331297B2 (en) Background sound / speech classification method and apparatus, and speech coding method and apparatus
    Wang et al. Phonetic segmentation for low rate speech coding
    CN101393744A (en) Method for regulating threshold and detection module
    Zhang et al. A CELP variable rate speech codec with low average rate
    JPH08305388A (en) Speech segment detection device
    Benyassine et al. A robust low complexity voice activity detection algorithm for speech communication systems
    Liu et al. Efficient voice activity detection algorithm based on sub-band temporal envelope and sub-band long-term signal variability
    ZHANG L., WANG T., CUPERMAN V. (School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada; Department of Electrical and Computer Engineering, University of California, Santa Barbara, USA) A CELP variable rate speech codec with low average rate
    NO309831B1 (en) Method and apparatus for classifying speech signals

    Legal Events

    Date Code Title Description
    PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

    Free format text: ORIGINAL CODE: 0009012

    17P Request for examination filed

    Effective date: 20090608

    AK Designated contracting states

    Kind code of ref document: A1

    Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC MT NL PL PT RO SE SI SK TR

    DAX Request for extension of the european patent (deleted)
    A4 Supplementary search report drawn up and despatched

    Effective date: 20101223

    RIC1 Information provided on ipc code assigned before grant

    Ipc: G10L 11/02 20060101AFI20080710BHEP

    17Q First examination report despatched

    Effective date: 20110524

    GRAP Despatch of communication of intention to grant a patent

    Free format text: ORIGINAL CODE: EPIDOSNIGR1

    RTI1 Title (correction)

    Free format text: METHOD AND APPARATUS FOR CLASSIFYING SOUND SIGNALS

    GRAS Grant fee paid

    Free format text: ORIGINAL CODE: EPIDOSNIGR3

    GRAA (expected) grant

    Free format text: ORIGINAL CODE: 0009210

    AK Designated contracting states

    Kind code of ref document: B1

    Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC MT NL PL PT RO SE SI SK TR

    REG Reference to a national code

    Ref country code: GB

    Ref legal event code: FG4D

    REG Reference to a national code

    Ref country code: CH

    Ref legal event code: EP

    REG Reference to a national code

    Ref country code: AT

    Ref legal event code: REF

    Ref document number: 581291

    Country of ref document: AT

    Kind code of ref document: T

    Effective date: 20121115

    REG Reference to a national code

    Ref country code: IE

    Ref legal event code: FG4D

    REG Reference to a national code

    Ref country code: DE

    Ref legal event code: R096

    Ref document number: 602007026318

    Country of ref document: DE

    Effective date: 20121220

    REG Reference to a national code

    Ref country code: AT

    Ref legal event code: MK05

    Ref document number: 581291

    Country of ref document: AT

    Kind code of ref document: T

    Effective date: 20121024

    REG Reference to a national code

    Ref country code: NL

    Ref legal event code: VDEP

    Effective date: 20121024

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: SE

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: NL

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: FI

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: IS

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20130224

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: CY

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: BE

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: LV

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: GR

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20130125

    Ref country code: PT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20130225

    Ref country code: SI

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: PL

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: AT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: BG

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20130124

    Ref country code: SK

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: CZ

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: MC

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20121231

    Ref country code: DK

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: EE

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    REG Reference to a national code

    Ref country code: CH

    Ref legal event code: PL

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: IT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: RO

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    PLBE No opposition filed within time limit

    Free format text: ORIGINAL CODE: 0009261

    STAA Information on the status of an ep patent application or granted ep patent

    Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

    REG Reference to a national code

    Ref country code: IE

    Ref legal event code: MM4A

    26N No opposition filed

    Effective date: 20130725

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: CH

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20121231

    Ref country code: IE

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20121226

    Ref country code: LI

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20121231

    Ref country code: ES

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20130204

    REG Reference to a national code

    Ref country code: DE

    Ref legal event code: R097

    Ref document number: 602007026318

    Country of ref document: DE

    Effective date: 20130725

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: MT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: TR

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: LU

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20121226

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: LT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: HU

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20071226

    REG Reference to a national code

    Ref country code: FR

    Ref legal event code: PLFP

    Year of fee payment: 9

    REG Reference to a national code

    Ref country code: FR

    Ref legal event code: PLFP

    Year of fee payment: 10

    REG Reference to a national code

    Ref country code: FR

    Ref legal event code: PLFP

    Year of fee payment: 11

    P01 Opt-out of the competence of the unified patent court (upc) registered

    Effective date: 20230524

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: GB

    Payment date: 20231102

    Year of fee payment: 17

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: FR

    Payment date: 20231108

    Year of fee payment: 17

    Ref country code: DE

    Payment date: 20231031

    Year of fee payment: 17