
EP2096629A1 - A classing method and device for sound signal - Google Patents


Info

Publication number
EP2096629A1
Authority
EP
European Patent Office
Prior art keywords
parameter
module
sub
parameters
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
EP07855800A
Other languages
German (de)
French (fr)
Other versions
EP2096629B1 (en)
EP2096629A4 (en)
Inventor
Wei Li
Lijing Xu
Qing Zhang
Jianfeng Xu
Shenghu Sang
Zhengzhong Du
Qin Yan
Haojiang Deng
Jun WANG
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of EP2096629A1 (en)
Publication of EP2096629A4 (en)
Application granted
Publication of EP2096629B1 (en)
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16: Vocoder architecture
    • G10L19/18: Vocoders using multiple modes
    • G10L19/20: Vocoders using multiple modes using sound class specific coding, hybrid encoders or object based coding

Definitions

  • the present invention relates to speech coding technologies, and in particular, to a method and apparatus for classifying sound signals.
  • the coder may encode the background noise and active speech at different rates. That is, the coder encodes the background noise at a lower rate, and encodes the active speech at a higher rate, thus reducing the average code rate and enhancing the variable-rate speech coding technology greatly.
  • VAD Voice Activity Detection
  • the VAD in the related art is developed for speech signals only, and categorizes input audio signals into only two types: noise and non-noise.
  • Later coders such as AMR-WB+ and SMV cover detection of music signals, serving as a correction and supplement to the VAD decision.
  • the AMR-WB+ coder is characterized in that, after VAD, the coding mode varies between a speech signal and a music signal depending on whether the input audio signal is a speech signal or a music signal, thus minimizing the code rate and ensuring the coding quality.
  • the two different coding modes in the AMR-WB+ are: Algebraic Code Excited Linear Prediction (ACELP)-based coding algorithm, and Transform Coded eXcitation (TCX)-based coding algorithm.
  • ACELP Algebraic Code Excited Linear Prediction
  • TCX Transform Coded eXcitation
  • the ACELP sets up a speech phonation model, makes the most of the speech characteristics, and is highly efficient in encoding speech signals.
  • the ACELP technology is so mature that the ACELP may be extended on a universal audio coder to improve the speech coding quality massively.
  • the TCX may be extended on the low-bit-rate speech coder to improve the quality of encoding broadband music.
  • the ACELP mode selection algorithm and the TCX mode selection algorithm of the AMR-WB+ coding algorithm come in two types: open-loop selection and closed-loop selection. Closed-loop selection involves high complexity and is the default option. It is a traversal search selection mode based on a perceptually weighted Signal-to-Noise Ratio (SNR). Such a selection method is rather accurate, but involves rather complicated operations and a large amount of code.
  • SNR Signal-to-Noise Ratio
  • the open-loop selection includes the following steps.
  • step 101 the VAD module judges whether the signal is a non-useful signal or a useful signal according to the Tone_flag and the sub-band energy parameter (Level[n]).
  • step 102 primary mode selection (EC) is performed.
  • step 103 the mode primarily determined in step 102 is corrected, and refined mode selection is performed to determine the coding mode to be selected. Specifically, this step is performed based on open loop pitch parameters and Immittance Spectral Frequency (ISF) parameters.
  • ISF Immittance Spectral Frequency
  • step 104 TCXS processing is performed. That is, when the number of times of selecting the speech signal coding mode continuously is less than three times, a small-sized closed-loop traversal search is performed to determine the coding mode finally, where the speech signal coding mode is ACELP and the music signal coding mode is TCX.
  • a method and apparatus for classifying sound signals are provided in an embodiment of the present invention to improve accuracy of sound signal classification.
  • a method for classifying and detecting sound signals in an embodiment of the present invention includes: receiving sound signals, and determining the update rate of background noise according to spectral distribution parameters of the background noise and spectral distribution parameters of the sound signals; and updating the noise parameters according to the update rate, and classifying the sound signals according to sub-band energy parameters and updated noise parameters.
  • An apparatus for classifying sound signals in an embodiment of the present invention includes: a background noise parameter updating module, configured to: determine the update rate of background noise according to spectral distribution parameters of the background noise and spectral distribution parameters of the current sound signals; and send the determined update rate; and a Primary Signal Classification (PSC) module, configured to: receive the update rate from the background noise parameter updating module, update the noise parameters, classify the current sound signals according to the sub-band energy parameters and updated noise parameters, and send the sound signal type determined through classification.
  • PSC Primary Signal Classification
  • the update rate of the background noise is determined, the noise parameters are updated according to the update rate, the signals are classified primarily according to the sub-band energy parameters and the updated noise parameters, and the non-useful signals and the useful signals in the received speech signals are determined, thus reducing the probability of mistaking useful signals for noise signals and improving accuracy of classifying sound signals.
  • Figure 1 shows open loop selection of AMR-WB+ coding algorithm in the related art
  • Figure 2 is a general flowchart of a method for classifying and detecting sound signals in an embodiment of the present invention
  • Figure 3 is a schematic diagram showing an apparatus for classifying sound signals in an embodiment of the present invention.
  • Figure 4 is a schematic diagram showing a system in an embodiment of the present invention.
  • Figure 5 is a flowchart of calculating various parameters on a coder parameter extracting module in an embodiment of the present invention
  • Figure 6 is a flowchart of calculating various parameters on another coder parameter extracting module in an embodiment of the present invention.
  • Figure 7 shows composition of a PSC module in an embodiment of the present invention
  • Figure 8 shows how a signal type judging module determines characteristic parameters in an embodiment of the present invention
  • Figure 9 shows how a signal type judging module performs speech judgment in an embodiment of the present invention.
  • Figure 10 shows how a signal type judging module performs music judgment in an embodiment of the present invention
  • Figure 11 shows how a signal type judging module corrects a primary judgment result in an embodiment of the present invention
  • Figure 12 shows how a signal type judging module performs primary type correction for uncertain signals in an embodiment of the present invention
  • Figure 13 shows how a signal type judging module performs final type correction for signals in an embodiment of the present invention.
  • Figure 14 shows how a signal type judging module performs parameter update in an embodiment of the present invention.
  • the update rate of the background noise is determined according to the spectral distribution parameters of the current sound signal and the background noise, and the noise parameters are updated according to the update rate. Therefore, the useful signals and the non-useful signals in the received speech signals are determined according to the updated noise parameters, thus improving the accuracy of the noise parameters in determining the useful signals and non-useful signals, and improving the accuracy of classifying sound signals.
  • Figure 2 shows a method for classifying and detecting sound signals in an embodiment of the present invention, including the following process:
  • Block 201 Sound signals are received, and the update rate of background noise is determined according to the spectral distribution parameters of the background noise and the sound signals.
  • Block 202 The noise parameters are updated according to the update rate, and the sound signals are classified according to sub-band energy parameters and updated noise parameters.
  • the sound signals are classified into two types: useful signals, and non-useful signals.
  • the useful signals may be subdivided into speech signals and music signals, depending on whether the noise converges.
  • the subdividing may be based on open loop pitch parameters, ISF parameters, and sub-band energy parameters, or based on ISF parameters and sub-band energy parameters.
  • a determined useful signal type is obtained in an embodiment of the present invention.
  • the signal hangover length is determined according to the useful signal type, and the useful signals and the non-useful signals in the received speech signals are further determined according to the signal hangover length.
  • the music signal hangover may be set to a relatively great value to improve the sound effect of the music signal.
  • an apparatus for classifying sound signals in an embodiment of the present invention includes: a background noise parameter updating module, configured to: determine the update rate of background noise according to the spectral distribution parameters of the background noise and the current sound signals, and send the determined update rate to a PSC module; and a PSC module, configured to: update the noise parameters according to the update rate received from the background noise parameter updating module, perform primary classification for the signals according to the sub-band energy parameters and updated noise parameters, and determine the received speech signal to be a useful signal or non-useful signal.
  • a background noise parameter updating module configured to: determine the update rate of background noise according to the spectral distribution parameters of the background noise and the current sound signals, and send the determined update rate to a PSC module
  • a PSC module configured to: update the noise parameters according to the update rate received from the background noise parameter updating module, perform primary classification for the signals according to the sub-band energy parameters and updated noise parameters, and determine the received speech signal to be a useful signal or non-useful signal.
  • the apparatus for classifying sound signals may further include a signal type judging module.
  • the PSC module transfers the determined signal type to the signal type judging module.
  • the signal type judging module determines the type of a useful signal based on the open loop pitch parameters, ISF parameters, and sub-band energy parameters, or based on ISF parameters and sub-band energy parameters, where the type of the useful signal includes speech and music.
  • the apparatus for classifying sound signals may further include a classification parameter extracting module.
  • the PSC module transfers the determined signal type to the signal type judging module through the classification parameter extracting module.
  • the classification parameter extracting module is further configured to: obtain ISF parameters and sub-band energy parameters, or further obtain open loop pitch parameters, process the obtained parameters into signal type characteristic parameters, and send the parameters to the signal type judging module; and process the obtained parameters into spectral distribution parameters of sound signals and background noise, and transfer the spectral distribution parameters to the background noise parameter updating module. Therefore, the signal type judging module determines the type of useful signals according to the foregoing signal type characteristic parameter and the signal type determined by the PSC module, where the type of useful signals includes speech and music.
  • the PSC module may be further configured to transfer the sound signal SNR calculated in the process of determining the signal type to the signal type judging module.
  • the signal type judging module determines the useful signal to be a speech signal or music signal according to the SNR.
  • the apparatus for classifying sound signals may further include a coder mode and rate selecting module.
  • the signal type judging module transfers the determined signal type to the coder mode and rate selecting module, and the coder mode and rate selecting module determines the coding mode and rate of sound signals according to the received signal type.
  • the apparatus for classifying sound signals may further include a coder parameter extracting module, which is configured to extract ISF parameters and sub-band energy parameters or additionally open loop pitch parameters, transfer the extracted parameters to the classification parameter extracting module, and transfer the extracted sub-band energy parameters to the PSC module.
  • a coder parameter extracting module configured to extract ISF parameters and sub-band energy parameters or additionally open loop pitch parameters, transfer the extracted parameters to the classification parameter extracting module, and transfer the extracted sub-band energy parameters to the PSC module.
  • FIG. 4 is a schematic diagram showing a system in an embodiment of the present invention.
  • the system includes a Sound Activity Detector (SAD).
  • SAD Sound Activity Detector
  • the SAD sorts the digital audio signals into three types: non-useful signal, speech, and music, thus forming a basis for the coder to select the coding mode and rate.
  • the SAD module includes: a background noise estimation control module, a PSC module, a classification parameter extracting module, and a signal type judging module.
  • the SAD makes the most of the parameters of the coder in order to reduce resource occupation and calculation complexity. Therefore, the coder parameter extracting module in the coder is used to calculate the sub-band energy parameters and coder parameters, and provide the calculated parameters for the SAD module.
  • the SAD module finally outputs a determined signal type (namely, non-useful signal, speech, or music), and provides the determined signal type for the coder mode and rate selecting module to select the coder mode and rate.
  • the SAD-related modules in the coder, sub-modules in the SAD, and the interaction processes between the sub-modules are detailed below.
  • the coder parameter extracting module in the coder calculates the sub-band energy parameters and coder parameters, and provides the calculated parameters for the SAD module.
  • the sub-band energy parameters may be calculated through filtering of a filter group.
  • the specific quantity of sub-bands (for example, 12 sub-bands in this embodiment) is determined according to the calculation complexity requirement and classification accuracy requirement.
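As a rough illustration of the filter-group step, the sub-band energies can be computed by grouping FFT bins into the 12 bands used in this embodiment (the band edges follow the list given later in the text, reading the tenth band as 3200-4000 Hz). The 12.8 kHz sampling rate and the FFT-based implementation are assumptions; the source only says a filter group is used.

```python
import numpy as np

# Band edges in Hz for the 12 sub-bands of this embodiment
# (the tenth band is read as 3200-4000 Hz).
BAND_EDGES = [0, 200, 400, 600, 800, 1200, 1600, 2000,
              2400, 3200, 4000, 4800, 6400]

def subband_energies(frame, fs=12800):
    """Return level[0..11]: the energy of each sub-band of one frame."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    levels = []
    for lo, hi in zip(BAND_EDGES[:-1], BAND_EDGES[1:]):
        band = (freqs >= lo) & (freqs < hi)  # bins falling in this band
        levels.append(float(spectrum[band].sum()))
    return levels
```

A 1 kHz tone, for example, should concentrate its energy in the fifth band (800-1200 Hz).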
  • Figure 5 or Figure 6 shows how a coder parameter extracting module calculates various parameters required by the SAD module in this embodiment.
  • the process shown in Figure 5 includes the following process:
  • Block 501 The coder parameter extracting module calculates the sub-band energy parameters first.
  • Block 502 The coder parameter extracting module decides whether it is necessary to perform ISF calculation according to the primary signal judgment result (Vad_flag) received from the PSC module, and performs block 503 if necessary; or performs block 504 if not necessary.
  • Vad_flag the primary signal judgment result
  • the decision about whether to perform ISF calculation in this block includes: If the current frame is composed of non-useful signals, the mechanism of the coder applies.
  • the mechanism of the coder is: If ISF parameters are required when the coder encodes non-useful signals, the ISF calculation needs to be performed; otherwise, the operation of the coder parameter extracting module is finished. If the current frame is composed of useful signals, the ISF calculation needs to be performed. Most coding modes require calculation of ISF parameters for useful signals. Therefore, the calculation brings no redundant complexity to the coder.
  • the technical solution to calculation of ISF parameters is detailed in the instruction manuals of coders, and is not repeated here any further.
  • Block 503 The coder parameter extracting module calculates the ISF parameters and then performs block 504.
  • Block 504 The coder parameter extracting module calculates the open loop pitch parameters.
  • the sub-band energy parameters calculated through the process in Figure 5 are provided for the PSC module and the classification parameter extracting module in the SAD, and other parameters are provided for the classification parameter extracting module in the SAD.
  • Blocks 601-603 are basically identical to blocks 501-503 in Figure 5 .
  • open-loop pitch parameters are redundant for some coding modes such as TCX. To simplify calculation, once the noise estimation converges, it is basically certain that the coding mode corresponding to the signal does not need open-loop pitch parameters, so the open-loop pitch parameters are no longer calculated.
  • before convergence, the open loop pitch parameters need to be calculated in order to ensure convergence of the noise estimation and the convergence speed. However, such calculation occurs only at the startup stage, so its complexity is negligible.
  • the technical solution to calculation of open loop pitch parameters is detailed in the instruction about ACELP-based coding, and is not repeated here any further.
  • the basis for judging whether the noise estimation converges may be: The count of determining as noise frames continuously exceeds the noise convergence threshold (THR1). In an example in this embodiment, the value of THR1 is 20.
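The convergence check described above can be sketched as a simple run-length counter. THR1 = 20 comes from the text; the class shape is illustrative.

```python
THR1 = 20  # noise convergence threshold given in this embodiment

class NoiseConvergenceTracker:
    """Noise estimation is treated as converged once the count of
    continuously determined noise frames exceeds THR1."""
    def __init__(self):
        self.noise_run = 0
    def update(self, is_noise_frame):
        # Continuous noise frames extend the run; anything else resets it.
        self.noise_run = self.noise_run + 1 if is_noise_frame else 0
        return self.noise_run > THR1
```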
  • the foregoing extracted sub-band energy parameter is level[i], where i represents a member index of the vector, and its value falls within 1...12 in this embodiment, corresponding to 0-200 Hz, 200-400 Hz, 400-600 Hz, 600-800 Hz, 800-1200 Hz, 1200-1600 Hz, 1600-2000 Hz, 2000-2400 Hz, 2400-3200 Hz, 3200-4000 Hz, 4000-4800 Hz, and 4800-6400 Hz, respectively.
  • ISF parameter is Isf n [ i ], where n represents a frame index, and the value of i falls within 1...16, representing a member index in the vector.
  • the foregoing extracted open loop pitch parameters include: open_loop pitch gain (ol_gain), open_loop pitch lag (ol_lag), and tone_flag. If the value of ol_gain is greater than the value of tone threshold (TONE_THR), the tone_flag is set to 1.
  • the PSC module may be implemented through various VAD algorithms in the related art, and includes: background noise estimating sub-module, SNR calculating sub-module, useful signal estimating sub-module, judgment threshold adjusting sub-module, comparing sub-module, and hangover protective useful signal sub-module.
  • the implementation of the PSC module may differ from the VAD algorithm module in the related art in the following aspects:
  • the SNR calculating sub-module calculates the SNR according to this parameter and the sub-band energy parameters.
  • the calculated SNR parameter is not only applied inside the PSC module, but also transferred to the signal type judging module so that the signal type judging module identifies the speech and music more accurately in the case of low SNR.
  • the VAD in the related art underperforms in identifying noise and some types of music, and improvement is made for the VAD in this embodiment:
  • the calculation of the background noise parameter is controlled by the update rate (ACC) provided by the background noise parameter updating module.
  • the background noise estimating sub-module receives the update rate from the background noise parameter updating module, updates the noise parameter, and transfers the sub-band energy estimation parameters of background noise calculated out according to the updated noise parameter to the SNR calculating sub-module.
  • the calculation of the update rate is detailed in the instruction about the background noise parameter updating module hereinafter.
  • the update rate comes in 4 levels: acc1, acc2, acc3, and acc4.
  • different upward update parameters (update_up) and downward update parameters (update_down) are determined, where update_up corresponds to the upward update rate of background noise, and update_down corresponds to the downward update rate of background noise.
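A plausible reading of the two-direction update is an asymmetric first-order smoother per sub-band: update_up applies when the frame energy rises above the current noise estimate, update_down when it falls below. The numeric coefficients below are invented placeholders; the source defines only the four levels acc1 to acc4.

```python
# Hypothetical (update_up, update_down) coefficients per level;
# the source gives no numeric values.
UPDATE_RATES = {
    "acc1": (0.50, 0.50),  # fastest update
    "acc2": (0.20, 0.30),
    "acc3": (0.05, 0.10),
    "acc4": (0.01, 0.05),  # slowest update
}

def update_noise_level(noise_level, frame_level, acc):
    """First-order smoothing of one sub-band noise estimate."""
    up, down = UPDATE_RATES[acc]
    alpha = up if frame_level > noise_level else down
    return (1.0 - alpha) * noise_level + alpha * frame_level
```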
  • hangover is used to prevent useful signals from being mistaken for noise.
  • the hangover length should be a tradeoff between signal protection and transmission efficiency.
  • the hangover length may be a constant after learning.
  • a multi-rate coder is oriented to audio signals such as music. Such signals tend to have a long low-energy hangover. It is difficult for a conventional VAD to detect such a hangover. Therefore, a relatively long hangover is required for protection.
  • the hangover length in the hangover protective useful signal sub-module is designed to be adaptive according to the SAD signal judgment result.
  • HANG_LONG 100
  • HANG_SHORT 20
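Using the HANG_LONG/HANG_SHORT values above, the adaptive hangover can be sketched as follows; the class interface and the rule of reloading the counter on every active frame are assumptions.

```python
HANG_LONG = 100   # hangover frames after music (from this embodiment)
HANG_SHORT = 20   # hangover frames after speech

class HangoverProtector:
    """Keeps the useful-signal decision up for a while after the last
    active frame, with a longer tail for music than for speech."""
    def __init__(self):
        self.counter = 0
    def decide(self, raw_active, last_type):
        if raw_active:
            # Reload the hangover according to the signal type.
            self.counter = HANG_LONG if last_type == "music" else HANG_SHORT
            return True
        if self.counter > 0:
            self.counter -= 1
            return True   # still inside the hangover period
        return False
```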
  • the classification parameter extracting module is configured to: calculate the parameters required by the signal type judging module and the background noise parameter updating module according to the Vad_flag parameter determined by the PSC module and the sub-band energy parameters, ISF parameters, and open loop pitch parameters provided by the coder parameter extracting module; and provide the sub-band energy parameters, ISF parameters, open loop pitch parameters, and calculated parameters for the signal type judging module and the background noise parameter updating module.
  • the parameters calculated by the classification parameter extracting module include:
  • Difference of continuous open loop pitch lags is compared. If the increment of the open loop pitch lag is less than a set threshold, the lag count accrues; if the sum of the lag counts of two continuous frames is great enough, the pitch is set to 1; otherwise, the pitch is set to 0.
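The lag-stability test can be sketched as below, assuming each frame carries several open-loop lag estimates. The lag-difference threshold and the two-frame count threshold are invented placeholders; the source gives no numbers.

```python
LAG_DIFF_THR = 2    # hypothetical max lag increment counted as stable
LAG_COUNT_THR = 8   # hypothetical two-frame count needed to set pitch=1

def pitch_flag(lags_prev, lags_curr):
    """Set pitch to 1 when the open-loop pitch lags of two
    continuous frames are stable enough."""
    def stable_count(lags):
        # Count consecutive lag increments below the threshold.
        return sum(1 for a, b in zip(lags, lags[1:])
                   if abs(b - a) < LAG_DIFF_THR)
    total = stable_count(lags_prev) + stable_count(lags_curr)
    return 1 if total >= LAG_COUNT_THR else 0
```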
  • the formula for calculating the open loop pitch lag is specified in the AMR-WB+/AMR-WB standard document.
  • tone_flg 1000*tone_flg.
  • Zero Cross Rate (zcr): zcr = (1/T) * Σ_{i=1}^{T-1} II{ x(i) * x(i-1) < 0 }, where II{A} is 1 when A is true, and 0 when A is false.
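The zcr formula translates directly into code:

```python
def zero_cross_rate(x):
    """zcr = (1/T) * sum over i = 1..T-1 of II{ x[i] * x[i-1] < 0 }."""
    T = len(x)
    return sum(1 for i in range(1, T) if x[i] * x[i - 1] < 0) / T
```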
  • Sub-band energy standard deviation mean (level_meanSD) parameter average of the sub-band energy standard deviation (level_SD) of two adjacent frames, where the calculation method of the level_SD parameter is similar to the calculation method of the Isf_SD described above.
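Assuming level_SD is the standard deviation across one frame's sub-band energies (by analogy with the Isf_SD calculation the text refers to), level_meanSD can be sketched as:

```python
import numpy as np

def level_sd(levels):
    """Standard deviation across the sub-band energies of one frame."""
    return float(np.std(levels))

def level_mean_sd(levels_prev, levels_curr):
    """level_meanSD: average of level_SD over two adjacent frames."""
    return 0.5 * (level_sd(levels_prev) + level_sd(levels_curr))
```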
  • the parameters provided for the background noise parameter updating module include: zcr, ra, i_flux, and t_flux; the parameters provided for the signal type judging module include: pitch, meangain, isf_meanSD, and level_meanSD.
  • the signal type judging module is configured to sort the signals into non-useful (such as noise), speech, and music according to the snr and Vad_flag parameters received from the PSC module and the sub-band energy parameter, pitch, meangain, isf_meanSD, and level_meanSD parameters received from the classification parameter extracting module.
  • the signal type judging module may include:
  • the process of determining a useful signal to be a speech signal or music signal includes:
  • this embodiment provides a parameter flag hangover mechanism.
  • the characteristic parameter values such as pitch_flag, level_meanSD_high_flag, ISF_meanSD_high_flag, ISF_meanSD_low_flag, level_meanSD_low_flag, and meangain_flag are determined according to the hangover mechanism, as shown in Figure 8 .
  • the length of the hangover period is determined according to the hangover parameter flag value.
  • This embodiment provides two types of hangover settings (namely, two solutions to determining the hangover parameter flag value).
  • if a parameter satisfies its judgment condition, the corresponding parameter hangover counter value increases by one; otherwise, the counter is set to 0. Different parameter hangover flags are set according to the value of the parameter hangover counter: a higher counter value yields a greater parameter hangover flag value. The specific mapping is determined as required at the time of setting the parameter hangover flag value according to the parameter counter, and is not described here any further.
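A minimal sketch of this counter-to-flag mechanism follows; the counter-to-flag thresholds are invented, since the source leaves the mapping to the implementer.

```python
class ParamHangover:
    """Hangover flag for one characteristic parameter (sketch).

    While the parameter condition holds, the counter grows; once it
    breaks, the counter is reset. Higher counters map to larger flag
    values, which later control the hangover length. The thresholds
    below (10 and 50) are assumptions, not values from the source.
    """
    def __init__(self):
        self.counter = 0
    def update(self, condition_holds):
        self.counter = self.counter + 1 if condition_holds else 0
        if self.counter >= 50:
            return 2   # long hangover
        if self.counter >= 10:
            return 1   # short hangover
        return 0
```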
  • the hangover length is controlled according to the Error Rate (ER) of the internal nodes of the decision tree corresponding to the training parameter. If the ER is lower, the hangover is shorter; if the ER is higher, the hangover is longer.
  • ER Error Rate
  • the signal is primarily sorted into either speech or music:
  • the first ISF speech threshold such as 1500
  • the speech flag bit is set to 1; otherwise, in block 904, a judgment is made about whether the number of continuous frames whose pitch value is 1 exceeds the preset threshold of the number of hangover frames (such as 2 frames). If yes, the speech flag bit is set to 1; otherwise, in block 905, a judgment is made about whether the meangain exceeds the preset long-time correlation speech threshold (such as 8000). If yes, the speech flag bit is set to 1; otherwise, in block 906, a judgment is made about whether either or both of the level_meanSD_high_flag value and the ISF_meanSD_high_flag value are 1. If yes, the speech flag bit is set to 1; otherwise, the value of the speech flag bit remains unchanged.
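The visible fallback chain of blocks 904-906 can be sketched as below. The entry condition of block 903 is not fully reproduced in the text, so it is omitted here, and the "exceeds" comparisons are read as strict.

```python
PITCH_HANG_FRAMES = 2       # hangover-frame threshold from the text
MEANGAIN_SPEECH_THR = 8000  # long-time correlation threshold from the text

def speech_flag(prev_flag, pitch_run, meangain,
                level_meanSD_high_flag, ISF_meanSD_high_flag):
    """Cascade of speech checks from blocks 904-906 (sketch)."""
    if pitch_run > PITCH_HANG_FRAMES:          # block 904
        return 1
    if meangain > MEANGAIN_SPEECH_THR:         # block 905
        return 1
    if level_meanSD_high_flag == 1 or ISF_meanSD_high_flag == 1:  # block 906
        return 1
    return prev_flag  # otherwise the flag remains unchanged
```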
  • the sub-band energy threshold such as 5000
  • the current signal type is set to the uncertain type, with a view to reducing the probability of mistaking noise for music; otherwise, in block 1105, a judgment is made about whether both the music flag bit and the speech flag bit are 1. If yes, the current signal type is determined to be the uncertain type; otherwise, in block 1106, a judgment is made about whether both the music flag bit and the speech flag bit are 0. If yes, the current signal type is determined to be the uncertain type; otherwise, in block 1107, a judgment is made about whether the music flag bit is 0 and the speech flag bit is 1. If yes, the current signal type is determined to be the speech type; otherwise, in block 1108, because the music flag bit is 1 and the speech flag bit is 0, the current signal type is determined to be the music type.
  • block 1109 is performed to judge whether pitch_flag is 1, the ISF_meanSD is less than the ISF music threshold (such as 900), and the number of continuous speech frames is less than 3. If yes, the signal is determined to be of the music type; otherwise, the signal is still determined to be of the uncertain type.
  • block 1110 is performed to judge whether the number of continuous music frames is greater than 3 and the ISF_meanSD is less than the ISF music threshold. If yes, the signal is determined to be a music signal; otherwise, the signal is determined to be a speech signal.
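Blocks 1105-1108 combine the two flag bits into a primary type, and block 1109 retries an uncertain frame; the string labels and function shapes below are illustrative, and the condition that routes a frame into block 1109 versus 1110 is not reproduced.

```python
def primary_type(music_flag, speech_flag):
    """Blocks 1105-1108: combine the two flag bits (sketch)."""
    if music_flag == speech_flag:        # both 1 or both 0
        return "uncertain"
    return "speech" if speech_flag == 1 else "music"

ISF_MUSIC_THR = 900  # ISF music threshold given in this embodiment

def refine_uncertain(pitch_flag, isf_meanSD, speech_run):
    """Block 1109 (sketch): an uncertain frame with a stable pitch,
    a low ISF_meanSD, and fewer than 3 continuous speech frames is
    re-sorted into music; otherwise it stays uncertain."""
    if pitch_flag == 1 and isf_meanSD < ISF_MUSIC_THR and speech_run < 3:
        return "music"
    return "uncertain"
```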
  • the signals of the uncertain type undergo the primary corrective classification process shown in Figure 12 , including:
  • if the speech hangover flag is 1 and the music hangover flag is 0, the current signal type is set to the speech class. If the music hangover flag is 1 and the speech hangover flag is 0, the current signal type is set to the music class. If both the music hangover flag and the speech hangover flag are 1 or both are 0, the signal type is set to the uncertain class. In this case, if more than 20 previous music frames are continuous, the signal is determined to be of the music class; if more than 20 previous speech frames are continuous, the signal is determined to be of the speech class.
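A sketch of this corrective pass, assuming the hangover flags and continuous-frame counters are available as plain integers:

```python
RUN_THR = 20  # continuous-frame threshold given in this embodiment

def correct_uncertain(speech_hang, music_hang, speech_run, music_run):
    """Primary corrective classification for uncertain frames (sketch)."""
    if speech_hang == 1 and music_hang == 0:
        return "speech"
    if music_hang == 1 and speech_hang == 0:
        return "music"
    # Both hangover flags equal: fall back to long context runs.
    if music_run > RUN_THR:
        return "music"
    if speech_run > RUN_THR:
        return "speech"
    return "uncertain"
```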
  • the useful signal type is corrected finally in Figure 13 .
  • the type is further corrected according to the current context.
  • the current context is music and the continuity is longer than 3 seconds, namely, the current continuous music frames are more than 150 frames
  • mandatory correction may be performed according to the ISF_meanSD value to determine the music signal.
  • the current context is speech and the continuity is longer than 3 seconds, namely, the current continuous speech frames are more than 150 frames
  • mandatory correction may be performed according to the ISF_meanSD value to determine the speech signal class.
  • the signal type is still uncertain
  • the signal type is corrected according to the previous context in block 1303, namely, the current uncertain signal type is sorted into the previous signal type.
  • the three type counters and the threshold values in the signal type judging module need to be updated.
  • the music counter music_continue_counter
  • Other type counters are processed similarly as shown in Figure 14 , and are not detailed here any further.
  • the threshold values are updated according to the SNR output by the PSC module.
  • the threshold examples given in the embodiments herein are the values learned in the case that the SNR is 20 dB.
  • the background noise parameter updating module uses some spectral distribution parameters calculated in the classification parameter extracting module in the SAD to control the update rate of the background noise.
  • the energy level of the background noise may surge abruptly. In this case, it is probable that the background noise estimation remains non-updated because the signals are continuously determined to be useful signals. Such a problem is solved by the background noise parameter updating module.
  • the background noise parameter updating module calculates the vector of relevant spectral distribution parameters according to the parameters received from the classification parameter extracting module.
  • the vector includes the following elements:
  • This embodiment makes use of the stable spectral features of the background noise.
  • the elements of the spectral distribution parameter vector are not limited to the 4 elements listed above.
  • the update rate of the current background noise is controlled by a difference ( d cb ) between the current spectral distribution parameter and the spectral distribution parameter estimation of the background noise.
  • the difference may be implemented through the algorithms such as Euclidean distance and Manhattan distance.
  • If d cb < TH1, the module outputs the update rate acc1, which represents the fastest update rate; otherwise, if d cb < TH2, the module outputs the update rate acc2; otherwise, if d cb < TH3, the module outputs the update rate acc3; otherwise, the module outputs the update rate acc4.
  • TH1, TH2 and TH3 are update thresholds, and the specific threshold values depend on the actual environment conditions.
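The threshold cascade above can be sketched as follows. The text permits either Euclidean or Manhattan distance; Euclidean is used here, and the function name and threshold arguments are placeholders whose actual values depend on the environment:

```python
import math

def noise_update_rate(p_current, p_noise, th1, th2, th3):
    """Pick the background-noise update rate from the distance between
    the current spectral distribution vector and the noise estimate's."""
    # Euclidean distance between the two 4-element vectors.
    d_cb = math.sqrt(sum((c - n) ** 2 for c, n in zip(p_current, p_noise)))
    if d_cb < th1:
        return "acc1"   # fastest update: spectra almost identical
    if d_cb < th2:
        return "acc2"
    if d_cb < th3:
        return "acc3"
    return "acc4"       # slowest update: spectra clearly differ
```

The closer the current spectrum is to the noise spectrum, the faster the noise estimate is allowed to track it.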
  • the update rate of the background noise is determined, the noise parameters are updated according to the update rate, the signals are classified primarily according to the sub-band energy parameters and the updated noise parameters, and the non-useful signals and the useful signals in the received speech signals are determined, thus reducing the probability of mistaking useful signals for noise signals and improving accuracy of classifying sound signals.
  • the embodiments of the present invention may be implemented through software in addition to a universal hardware platform or through hardware only. In most cases, however, software in addition to a universal hardware platform is preferred. Therefore, the technical solution under the present invention or contributions to the related art may be embodied by a software product.
  • the software product is stored in a storage medium and incorporates several instructions so that a computer device (for example, PC, server, or network device) may execute the method in each embodiment of the present invention.


Abstract

A method for classifying sound signals includes: receiving sound signals, and determining the update rate of background noise according to spectral distribution parameters of the background noise and the sound signals; and updating the noise parameters according to the update rate, and classifying the sound signals according to sub-band energy parameters and updated noise parameters. An apparatus for classifying sound signals includes: a background noise parameter updating module, configured to: determine the update rate of background noise according to spectral distribution parameters of the background noise and the current sound signals; and send the determined update rate; and a PSC module, configured to: receive the update rate from the background noise parameter updating module, update the noise parameters, classify the current sound signals according to the sub-band energy parameters and updated noise parameters, and send the sound signal type determined through classification.

Description

    FIELD OF THE INVENTION
  • The present invention relates to speech coding technologies, and in particular, to a method and apparatus for classifying sound signals.
  • BACKGROUND
  • In speech communication, only about 40% of the signals contain speech, and the rest are silence or background noise. In order to save transmission bandwidth, a Voice Activity Detection (VAD) technique is applied to speech coding in the speech signal processing field, so that the coder may encode the background noise and active speech at different rates. That is, the coder encodes the background noise at a lower rate and the active speech at a higher rate, thus reducing the average code rate and greatly enhancing variable-rate speech coding.
  • The VAD in the related art is developed for speech signals only, and categorizes input audio signals into only two types: noise and non-noise. Later coders such as AMR-WB+ and SMV cover detection of music signals, serving as a correction and supplement to the VAD decision. The AMR-WB+ coder is characterized in that, after VAD, the coding mode varies between a speech signal and a music signal depending on whether the input audio signal is a speech signal or a music signal, thus minimizing the code rate while ensuring the coding quality.
  • The two different coding modes in the AMR-WB+ are: Algebraic Code Excited Linear Prediction (ACELP)-based coding algorithm, and Transform Coded eXcitation (TCX)-based coding algorithm. The ACELP sets up a speech phonation model, makes the most of the speech characteristics, and is highly efficient in encoding speech signals. Moreover, the ACELP technology is so mature that the ACELP may be extended on a universal audio coder to improve the speech coding quality massively. Likewise, the TCX may be extended on the low-bit-rate speech coder to improve the quality of encoding broadband music.
  • Depending on complexity, the ACELP and TCX mode selection algorithms of the AMR-WB+ coding algorithm come in two types: open loop selection and closed loop selection. Closed-loop selection has high complexity and is the default option. It is a traversal search selection mode based on a perceptually weighted Signal-to-Noise Ratio (SNR). Evidently, such a selection method is rather accurate, but involves rather complicated operations and a huge amount of code.
  • The open-loop selection includes the following steps.
  • In step 101, the VAD module judges whether the signal is a non-usable signal or usable signal according to the Tone_flag and the sub-band energy parameter (Level[n]).
  • In step 102, primary mode selection (EC) is performed.
  • In step 103, the mode primarily determined in step 102 is corrected, and refined mode selection is performed to determine the coding mode to be selected. Specifically, this step is performed based on open loop pitch parameters and Immittance Spectral Frequency (ISF) parameters.
  • In step 104, TCXS processing is performed. That is, when the number of times of selecting the speech signal coding mode continuously is less than three times, a small-sized closed-loop traversal search is performed to determine the coding mode finally, where the speech signal coding mode is ACELP and the music signal coding mode is TCX.
  • In the process of implementing the present invention, the inventor finds that the AMR-WB+ speech signal selection algorithm in the related art involves the following defects:
    1. The VAD module in the related art underperforms in identifying noise and some music signals in the process of classifying signals, thus reducing the accuracy of classifying sound signals.
    2. Calculation of the open loop pitch parameters is necessary for the ACELP coding mode, but unnecessary for the TCX coding mode. According to the AMR-WB+ structure design, both the VAD and the open-loop mode selection algorithm use the open loop pitch parameters. Therefore, the open loop pitch needs to be calculated for all frames. However, for non-ACELP coding modes (such as TCX), the calculation of such parameters adds redundant complexity, increases the calculation load of coding mode selection, and reduces efficiency.
    3. Although the VAD algorithm is superior in speech detection and noise immunity among the coders currently available, it may mistake music signals for noise at the hangover of some special music signals, thus truncating the music hangover and making the music sound unnatural.
    4. The AMR-WB+ mode selection algorithm disregards the Signal-to-Noise Ratio (SNR) environment of the signal, and its performance in distinguishing speech from music deteriorates further at a low SNR.
    SUMMARY
  • A method and apparatus for classifying sound signals are provided in an embodiment of the present invention to improve accuracy of sound signal classification.
  • A method for classifying and detecting sound signals in an embodiment of the present invention includes: receiving sound signals, and determining the update rate of background noise according to spectral distribution parameters of the background noise and spectral distribution parameters of the sound signals; and updating the noise parameters according to the update rate, and classifying the sound signals according to sub-band energy parameters and updated noise parameters.
  • An apparatus for classifying sound signals in an embodiment of the present invention includes: a background noise parameter updating module, configured to: determine the update rate of background noise according to spectral distribution parameters of the background noise and spectral distribution parameters of the current sound signals; and send the determined update rate; and a Primary Signal Classification (PSC) module, configured to: receive the update rate from the background noise parameter updating module, update the noise parameters, classify the current sound signals according to the sub-band energy parameters and updated noise parameters, and send the sound signal type determined through classification.
  • In the embodiments of the present invention, the update rate of the background noise is determined, the noise parameters are updated according to the update rate, the signals are classified primarily according to the sub-band energy parameters and the updated noise parameters, and the non-useful signals and the useful signals in the received speech signals are determined, thus reducing the probability of mistaking useful signals for noise signals and improving accuracy of classifying sound signals.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Figure 1 shows open loop selection of AMR-WB+ coding algorithm in the related art;
  • Figure 2 is a general flowchart of a method for classifying and detecting sound signals in an embodiment of the present invention;
  • Figure 3 is a schematic diagram showing an apparatus for classifying sound signals in an embodiment of the present invention;
  • Figure 4 is a schematic diagram showing a system in an embodiment of the present invention;
  • Figure 5 is a flowchart of calculating various parameters on a coder parameter extracting module in an embodiment of the present invention;
  • Figure 6 is a flowchart of calculating various parameters on another coder parameter extracting module in an embodiment of the present invention;
  • Figure 7 shows composition of a PSC module in an embodiment of the present invention;
  • Figure 8 shows how a signal type judging module determines characteristic parameters in an embodiment of the present invention;
  • Figure 9 shows how a signal type judging module performs speech judgment in an embodiment of the present invention;
  • Figure 10 shows how a signal type judging module performs music judgment in an embodiment of the present invention;
  • Figure 11 shows how a signal type judging module corrects a primary judgment result in an embodiment of the present invention;
  • Figure 12 shows how a signal type judging module performs primary type correction for uncertain signals in an embodiment of the present invention;
  • Figure 13 shows how a signal type judging module performs final type correction for signals in an embodiment of the present invention; and
  • Figure 14 shows how a signal type judging module performs parameter update in an embodiment of the present invention.
  • DETAILED DESCRIPTION
  • In order to make the technical solution, objectives and merits of the present invention clearer, a detailed description of the present invention is given below by reference to the accompanying drawings and preferred embodiments.
  • In the embodiments of the present invention, the update rate of the background noise is determined according to the spectral distribution parameters of the current sound signal and the background noise, and the noise parameters are updated according to the update rate. Therefore, the useful signals and the non-useful signals in the received speech signals are determined according to the updated noise parameters, thus improving the accuracy of the noise parameters in determining the useful signals and non-useful signals, and improving the accuracy of classifying sound signals.
  • Figure 2 shows a method for classifying and detecting sound signals in an embodiment of the present invention, including the following process:
  • Block 201: Sound signals are received, and the update rate of background noise is determined according to the spectral distribution parameters of the background noise and the sound signals.
  • Block 202: The noise parameters are updated according to the update rate, and the sound signals are classified according to sub-band energy parameters and updated noise parameters.
  • In block 202, the sound signals are classified into two types: useful signals, and non-useful signals. Afterward, the useful signals may be subdivided into speech signals and music signals, depending on whether the noise converges. The subdividing may be based on open loop pitch parameters, ISF parameters, and sub-band energy parameters, or based on ISF parameters and sub-band energy parameters.
  • Besides, in order to prevent mistaking music signal hangovers for non-useful signals and reducing the sound effect, a determined useful signal type is obtained in an embodiment of the present invention. The signal hangover length is determined according to the useful signal type, and the useful signals and the non-useful signals in the received speech signals are further determined according to the signal hangover length. Here the music signal hangover may be set to a relatively great value to improve the sound effect of the music signal.
  • In the process of determining a useful signal as a speech signal or music signal, it is appropriate to set the signal not accurately identifiable to an uncertain type first, and then correct the uncertain type according to other parameters, and finally determine the type of useful signals.
  • Calculation of ISF parameters is not necessarily involved in the coding mode of non-useful signals. Therefore, no ISF parameters are calculated for the determined noise signals if the corresponding coding mode needs no calculation of ISF parameters, with a view to reducing the calculation load in the classification process and improving the classification efficiency.
    As shown in Figure 3, an apparatus for classifying sound signals in an embodiment of the present invention includes: a background noise parameter updating module, configured to: determine the update rate of background noise according to the spectral distribution parameters of the background noise and the current sound signals, and send the determined update rate to a PSC module; and a PSC module, configured to: update the noise parameters according to the update rate received from the background noise parameter updating module, perform primary classification for the signals according to the sub-band energy parameters and updated noise parameters, and determine the received speech signal to be a useful signal or non-useful signal.
  • The apparatus for classifying sound signals may further include a signal type judging module. The PSC module transfers the determined signal type to the signal type judging module. The signal type judging module determines the type of a useful signal based on the open loop pitch parameters, ISF parameters, and sub-band energy parameters, or based on ISF parameters and sub-band energy parameters, where the type of the useful signal includes speech and music.
  • The apparatus for classifying sound signals may further include a classification parameter extracting module. The PSC module transfers the determined signal type to the signal type judging module through the classification parameter extracting module. The classification parameter extracting module is further configured to: obtain ISF parameters and sub-band energy parameters, or further obtain open loop pitch parameters, process the obtained parameters into signal type characteristic parameters, and send the parameters to the signal type judging module; and process the obtained parameters into spectral distribution parameters of sound signals and background noise, and transfer the spectral distribution parameters to the background noise parameter updating module. Therefore, the signal type judging module determines the type of useful signals according to the foregoing signal type characteristic parameter and the signal type determined by the PSC module, where the type of useful signals includes speech and music.
  • The PSC module may be further configured to transfer the sound signal SNR calculated in the process of determining the signal type to the signal type judging module. The signal type judging module determines the useful signal to be a speech signal or music signal according to the SNR.
  • The apparatus for classifying sound signals may further include a coder mode and rate selecting module. The signal type judging module transfers the determined signal type to the coder mode and rate selecting module, and the coder mode and rate selecting module determines the coding mode and rate of sound signals according to the received signal type.
  • The apparatus for classifying sound signals may further include a coder parameter extracting module, which is configured to extract ISF parameters and sub-band energy parameters or additionally open loop pitch parameters, transfer the extracted parameters to the classification parameter extracting module, and transfer the extracted sub-band energy parameters to the PSC module.
  • The method for classifying and detecting sound signals and the apparatus for classifying sound signals in an embodiment of the present invention are detailed below.
  • Figure 4 is a schematic diagram showing a system in an embodiment of the present invention. The system includes a Sound Activity Detector (SAD). As required by the coder, the SAD sorts the audio digital signals into three types: non-useful signal, speech, and music, thus forming a basis for the coder to select the coding mode and rate.
  • As shown in Figure 4, the SAD module includes: a background noise estimation control module, a PSC module, a classification parameter extracting module, and a signal type judging module. As a signal classifier used inside the coder, the SAD makes the most of the parameters of the coder in order to reduce resource occupation and calculation complexity. Therefore, the coder parameter extracting module in the coder is used to calculate the sub-band energy parameters and coder parameters, and provide the calculated parameters for the SAD module. Moreover, the SAD module finally outputs a determined signal type (namely, non-useful signal, speech, or music), and provides the determined signal type for the coder mode and rate selecting module to select the coder mode and rate.
  • The SAD-related modules in the coder, sub-modules in the SAD, and the interaction processes between the sub-modules are detailed below.
  • The coder parameter extracting module in the coder calculates the sub-band energy parameters and coder parameters, and provides the calculated parameters for the SAD module. The sub-band energy parameters may be calculated through filtering of a filter group. The specific quantity of sub-bands (for example, 12 sub-bands in this embodiment) is determined according to the calculation complexity requirement and classification accuracy requirement.
  • Figure 5 or Figure 6 shows how a coder parameter extracting module calculates various parameters required by the SAD module in this embodiment.
  • The process shown in Figure 5 includes the following process:
  • Block 501: The coder parameter extracting module calculates the sub-band energy parameters first.
  • Block 502: The coder parameter extracting module decides whether it is necessary to perform ISF calculation according to the primary signal judgment result (Vad_flag) received from the PSC module, and performs block 503 if necessary; or performs block 504 if not necessary.
  • The decision about whether to perform ISF calculation in this block includes: If the current frame is composed of non-useful signals, the mechanism of the coder applies. The mechanism of the coder is: If ISF parameters are required when the coder encodes non-useful signals, the ISF calculation needs to be performed; otherwise, the operation of the coder parameter extracting module is finished. If the current frame is composed of useful signals, the ISF calculation needs to be performed. Most coding modes require calculation of ISF parameters for useful signals. Therefore, the calculation brings no redundant complexity to the coder. The technical solution to calculation of ISF parameters is detailed in the instruction manuals of coders, and is not repeated here any further.
  • Block 503: The coder parameter extracting module calculates the ISF parameters and then performs block 504.
  • Block 504: The coder parameter extracting module calculates the open loop pitch parameters.
  • The sub-band energy parameters calculated through the process in Figure 5 are provided for the PSC module and the classification parameter extracting module in the SAD, and other parameters are provided for the classification parameter extracting module in the SAD.
  • In the process shown in Figure 6, a block is added on the basis of the process in Figure 5: deciding whether to calculate the open loop pitch parameters depending on whether the primary noise converges. Blocks 601-603 are basically identical to blocks 501-503 in Figure 5. In block 604, a judgment is made about whether the primary noise parameter (namely, noise estimation) converges. If the primary noise parameter does not converge, the open loop pitch parameters are calculated in block 605; otherwise, no open loop pitch parameter is calculated.
  • The calculation of open loop pitch parameters is redundant for some coding modes such as TCX. To simplify the calculation, once the noise estimation converges, it is basically certain that the coding mode corresponding to the signal does not need the open loop pitch parameters, so the open loop pitch parameters are no longer calculated.
  • Before convergence of the noise estimation, the open loop pitch parameters need to be calculated in order to ensure the convergence of the noise estimation and the convergence speed. However, such calculation occurs only at the startup stage, and its complexity is negligible. The technical solution to calculation of open loop pitch parameters is detailed in the instructions about ACELP-based coding, and is not repeated here any further. The basis for judging whether the noise estimation converges may be: the count of frames continuously determined to be noise exceeds the noise convergence threshold (THR1). In an example in this embodiment, the value of THR1 is 20.
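The convergence check and the resulting skip of the open loop pitch calculation can be sketched as follows; the function names are illustrative, and THR1 = 20 is the example value from this embodiment:

```python
THR1 = 20  # noise convergence threshold used in this embodiment

def noise_estimation_converged(consecutive_noise_frames):
    """Noise estimation is taken as converged once the count of frames
    continuously judged to be noise exceeds THR1."""
    return consecutive_noise_frames > THR1

def need_open_loop_pitch(consecutive_noise_frames):
    # Before convergence the open loop pitch must still be computed to
    # secure convergence of the noise estimation; afterwards it is skipped.
    return not noise_estimation_converged(consecutive_noise_frames)
```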
  • The foregoing extracted sub-band energy parameter is: level[i], where i represents a member index of the vector, and its value falls within 1...12 in this embodiment, corresponding to 0-200 Hz, 200-400 Hz, 400-600 Hz, 600-800 Hz, 800-1200 Hz, 1200-1600 Hz, 1600-2000 Hz, 2000-2400 Hz, 2400-3200 Hz, 3200-40000 Hz, 4000-4800 Hz, and 4800-6400 Hz, respectively.
  • The foregoing extracted ISF parameter is Isfn [i], where n represents a frame index, and the value of i falls within 1...16, representing a member index in the vector.
  • The foregoing extracted open loop pitch parameters include: open_loop pitch gain (ol_gain), open_loop pitch lag (ol_lag), and tone_flag. If the value of ol_gain is greater than the value of tone threshold (TONE_THR), the tone_flag is set to 1.
  • The PSC module may be implemented through various VAD algorithms in the related art, and includes: background noise estimating sub-module, SNR calculating sub-module, useful signal estimating sub-module, judgment threshold adjusting sub-module, comparing sub-module, and hangover protective useful signal sub-module. In this embodiment, as shown in Figure 7, the implementation of the PSC module may differ from the VAD algorithm module in the related art in the following aspects:
  • I. The SNR calculating sub-module calculates the SNR according to the background noise estimation parameter and the sub-band energy parameters. The calculated SNR parameter is not only applied inside the PSC module, but also transferred to the signal type judging module so that the signal type judging module identifies the speech and music more accurately in the case of a low SNR.
  • II. The VAD in the related art underperforms in identifying noise and some types of music, and an improvement is made in this embodiment: First, the calculation of the background noise parameter is controlled by the update rate (ACC) provided by the background noise parameter updating module. The background noise estimating sub-module receives the update rate from the background noise parameter updating module, updates the noise parameter, and transfers the sub-band energy estimation parameters of background noise, calculated according to the updated noise parameter, to the SNR calculating sub-module. The calculation of the update rate is detailed in the description of the background noise parameter updating module hereinafter. In an example of this embodiment, the update rate comes in 4 levels: acc1, acc2, acc3, and acc4. For different update rates, different upward update parameters (update_up) and downward update parameters (update_down) are determined, where update_up corresponds to the upward update rate of the background noise, and update_down corresponds to the downward update rate of the background noise.
  • Afterwards, the solution to updating the noise parameter may be the solution in the AMR-WB+:

    alpha(n) = update_up, if level[m](n) > bckr_est[m](n); otherwise, alpha(n) = update_down

    Therefore, the formula for updating noise estimation is:

    bckr_est[m+1](n) = (1 - alpha(n)) * bckr_est[m](n) + alpha(n) * level[m](n)

    Therefore, the formula for updating the spectral distribution parameter vector of noise is:

    p̂[m+1](i) = (1 - alpha) * p̂[m](i) + alpha * p[m](i)

    where,
    m: frame index
    n: sub-band index
    i: element index of spectral distribution parameter vector, i = 1, 2, 3, 4
    bckr_est: sub-band energy of background noise estimation
    p̂: estimation of spectral distribution parameter vector of background noise
    p: spectral distribution parameter vector of the current signal
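A minimal sketch of this AMR-WB+-style first-order recursive update, assuming the rate is chosen per direction (update_up when the new value exceeds the current estimate, update_down otherwise); the function name and list-based interface are illustrative only:

```python
def update_noise_estimates(bckr_est, p_noise, level, p_current,
                           update_up, update_down):
    """One frame of recursive background-noise update.

    bckr_est  : per-sub-band noise energy estimate (updated in place)
    p_noise   : spectral distribution vector estimate of the noise
    level     : current frame's sub-band energies
    p_current : current frame's spectral distribution vector
    """
    for n, lvl in enumerate(level):
        # Direction-dependent smoothing factor for each sub-band.
        alpha = update_up if lvl > bckr_est[n] else update_down
        bckr_est[n] = (1.0 - alpha) * bckr_est[n] + alpha * lvl
    for i, p in enumerate(p_current):
        alpha = update_up if p > p_noise[i] else update_down
        p_noise[i] = (1.0 - alpha) * p_noise[i] + alpha * p
    return bckr_est, p_noise
```

A small update_up combined with a larger update_down would make the estimate rise cautiously but fall quickly, which is the usual behavior for noise floors.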
  • III. In the VAD in the related art, hangover is used to prevent useful signals from being mistaken for noise. The hangover length should be a tradeoff between signal protection and transmission efficiency. For traditional speech coders, the hangover length may be a constant after learning. A multi-rate coder is oriented to audio signals such as music. Such signals tend to have a long low-energy hangover, which is difficult for a conventional VAD to detect. Therefore, a relatively long hangover is required for protection. In this embodiment, the hangover length in the hangover protective useful signal sub-module is designed to be adaptive according to the SAD signal judgment result. If the judgment result is a music signal (SAD_flag = MUSIC), a long hangover (hang_len = HANG_LONG) is set; if the judgment result is a speech signal (SAD_flag = SPEECH), a short hangover (hang_len = HANG_SHORT) is set. The detailed setting mode is as follows:
        If (SAD_flag = MUSIC)
            hang_len = HANG_LONG
        else if (SAD_flag = SPEECH)
             hang_len = HANG_SHORT
        else
             hang_len = 0
        where,
        SAD_flag: SAD judgment flag
        hang_len: protective hangover length
  • In an example of this embodiment, HANG_LONG = 100, and HANG_SHORT = 20, which may be measured in frames.
  • The classification parameter extracting module is configured to: calculate the parameters required by the signal type judging module and the background noise parameter updating module according to the Vad_flag parameter determined by the PSC module and the sub-band energy parameters, ISF parameters, and open loop pitch parameters provided by the coder parameter extracting module; and provide the sub-band energy parameters, ISF parameters, open loop pitch parameters, and calculated parameters for the signal type judging module and the background noise parameter updating module. The parameters calculated by the classification parameter extracting module include:
  • 1. Pitch parameter
  • The differences between consecutive open loop pitch lags are compared. If the increment of the open loop pitch lag is less than a set threshold, the lag count accrues; if the sum of the lag counts of two consecutive frames is great enough, pitch is set to 1; otherwise, pitch is set to 0. The formula for calculating the open loop pitch lag is specified in the AMR-WB+/AMR-WB standard document.
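The pitch flag computation can be sketched as follows. LAG_DIFF_THR and LAG_COUNT_THR are assumed placeholder values, not the thresholds of the standard; the actual open loop pitch lag computation is specified in the AMR-WB+/AMR-WB standard document:

```python
LAG_DIFF_THR = 2    # assumed threshold on the lag increment
LAG_COUNT_THR = 8   # assumed threshold on the two-frame lag-count sum

def pitch_flag(ol_lags, prev_frame_lag_count):
    """ol_lags: open loop pitch lags of the current frame.
    Returns (pitch, lag_count), where pitch is the 0/1 flag."""
    lag_count = 0
    for prev, cur in zip(ol_lags, ol_lags[1:]):
        if abs(cur - prev) < LAG_DIFF_THR:   # stable lag track
            lag_count += 1
    pitch = 1 if lag_count + prev_frame_lag_count >= LAG_COUNT_THR else 0
    return pitch, lag_count
```

A steady lag track (typical of voiced speech) accumulates counts and raises the flag; erratic lags leave it at 0.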
  • 2. Longtime signal correlation value parameter (meangain)
  • The meangain is a moving average of the tone values of three adjacent frames, where tone = 1000*tone_flag. The definition of tone_flag is the same as that in the AMR-WB+.
  • 3. Zero Cross Rate (zcr)

    zcr = (1/T) * Σ_{i=1..T-1} II{ x(i) * x(i-1) < 0 }

    II{A} is 1 when A is true, and is 0 when A is false.
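The zero cross rate counts the fraction of adjacent-sample pairs whose product is negative; a straightforward sketch (function name illustrative):

```python
def zero_cross_rate(x):
    """zcr = (1/T) * sum over i of II{ x[i] * x[i-1] < 0 },
    where II{A} is 1 if A holds and 0 otherwise."""
    T = len(x)
    crossings = sum(1 for i in range(1, T) if x[i] * x[i - 1] < 0)
    return crossings / T
```

Noise tends to produce a high zcr, while voiced speech and tonal music produce a low one.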
  • 4. Time domain fluctuation of sub-band energy (t_flux)

    t_flux = Σ_{i=1..12} | level_m(i) - level_{m-1}(i) | / short_mean_level_energy

    where short_mean_level_energy represents the short-time average energy.
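The t_flux parameter sums the absolute change of each sub-band's energy between two consecutive frames and normalizes by the short-time average energy; a minimal sketch (function name illustrative):

```python
def time_domain_flux(level_m, level_m1, short_mean_level_energy):
    """level_m, level_m1: sub-band energies of the current and
    previous frame; returns the normalized time-domain fluctuation."""
    num = sum(abs(a - b) for a, b in zip(level_m, level_m1))
    return num / short_mean_level_energy
```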
  • 5. Ratio of high sub-band energy to low sub-band energy (ra)

    ra = sublevel_high_energy / sublevel_low_energy
    Given below is an instance of the present invention:
    sublevel_high_energy = level[10]+ level[11];
    sublevel_low_energy = level[0]+ level[1]+ level[2]+ level[3]+ level[4]+ level[5]+ level[6]+ level[7] + level[8]+ level[9];
  • 6. Frequency domain fluctuation of sub-band energy (f_flux)

    f_flux = Σ_{i=2..12} | level_m(i) - level_m(i-1) | / short_mean_level_energy
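The f_flux parameter is the within-frame counterpart of t_flux: it sums the absolute difference between adjacent sub-band energies of one frame and normalizes by the short-time average energy; a minimal sketch (function name illustrative):

```python
def freq_domain_flux(level_m, short_mean_level_energy):
    """level_m: sub-band energies of the current frame; returns the
    normalized frequency-domain fluctuation across adjacent sub-bands."""
    num = sum(abs(level_m[i] - level_m[i - 1])
              for i in range(1, len(level_m)))
    return num / short_mean_level_energy
```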
  • 7. ISF mean short-time distance (isf_meanSD): average of the ISF distance (Isf_SD) of five adjacent frames, where

    Isf_SD = Σ_{i=1..16} | Isf_m(i) - Isf_{m-1}(i) |
  • 8. Sub-band energy standard deviation mean (level_meanSD) parameter: average of the sub-band energy standard deviation (level_SD) of two adjacent frames, where the calculation method of the level_SD parameter is similar to the calculation method of the Isf_SD described above.
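The Isf_SD distance and its five-frame moving average can be sketched as follows (function names illustrative); the level_meanSD parameter is computed the same way, but from sub-band energy standard deviations over two adjacent frames:

```python
def isf_sd(isf_m, isf_m1):
    """Short-time ISF distance between two consecutive frames
    (sum of absolute differences over the 16 ISF elements)."""
    return sum(abs(a - b) for a, b in zip(isf_m, isf_m1))

def isf_mean_sd(isf_sd_history):
    """isf_meanSD: mean of the Isf_SD values of the five most
    recent adjacent frames."""
    return sum(isf_sd_history[-5:]) / 5.0
```

Speech tends to show larger, more variable ISF distances between frames than music, which is why this statistic helps separate the two classes.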
  • In the foregoing 8 parameters, the parameters provided for the background noise parameter updating module include: zcr, ra, i_flux, and t_flux; the parameters provided for the signal type judging module include: pitch, meangain, isf_meanSD, and level_meanSD.
  • The signal type judging module is configured to sort the signals into non-useful (such as noise), speech, and music according to the SNR and Vad_flag parameters received from the PSC module and the sub-band energy, pitch, meangain, Isf_meanSD, and level_meanSD parameters received from the classification parameter extracting module. The signal type judging module may include:
    • a parameter updating sub-module, configured to: update the threshold in the signal type judgment process according to the SNR, and provide the updated threshold for a judging sub-module; and
    • a judging sub-module, configured to: receive the sound signal type from the PSC module, determine the type of the useful signals in the sound signals based on the open loop pitch parameter, ISF parameter, sub-band energy parameter, and updated threshold, or based on the ISF parameter and sub-band energy parameter and the updated threshold, and send the determined type of the useful signals to the coder mode and rate selecting module.
  • The process of determining a useful signal to be a speech signal or music signal includes:
    • firstly, setting both the speech flag bit and the music flag bit to 0, sorting the signals primarily into speech, music, and uncertain signals according to the pitch parameter flag, longtime signal correlation value, isf_meanSD, and level_meanSD, and modifying the value of the speech flag bit or music flag bit according to the primarily determined speech or music;
    • secondly, correcting the primarily determined speech, music, and uncertain signals according to: sub-band energy, longtime signal correlation value, level_meanSD, speech_flag, music_flag, whether the number of continuous frames whose pitch value is 1 exceeds the preset hangover frame threshold, number of continuous music frames, number of continuous speech frames, and type of the previous frame; and determining the type of useful signals, where the type of a useful signal includes speech signal and music signal.
  • The process of determining a useful signal to be a speech signal or music signal is detailed below.
  • In order to ensure stability of judging signals and avoid frequent conversion of judgment results, this embodiment provides a parameter flag hangover mechanism. The characteristic parameter values such as pitch_flag, level_meanSD_high_flag, ISF_meanSD_high_flag, ISF_meanSD_low_flag, level_meanSD_low_flag, and meangain_flag are determined according to the hangover mechanism, as shown in Figure 8.
  • In Figure 8, the length of the hangover period is determined according to the hangover parameter flag value. This embodiment provides two types of hangover settings (namely, two solutions to determining the hangover parameter flag value).
  • In the first hangover setting solution, when the parameter value is above (or below) its threshold, the corresponding parameter hangover counter increases by one; otherwise, the counter is reset to 0. Different parameter hangover flag values are then set according to the counter value: the higher the counter, the greater the flag value. The specific mapping from counter values to flag values is determined as required at the time of setting, and is not described here any further.
  • In the second hangover setting solution, the hangover length is controlled according to the Error Rate (ER) of the internal nodes of the decision tree corresponding to the training parameter. If the ER is lower, the hangover is shorter; if the ER is higher, the hangover is longer.
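The first hangover setting solution can be sketched as follows. This is an illustrative sketch: the counter grows while the parameter stays beyond its threshold and resets otherwise, and the flag value grows with the counter. The counter-to-flag breakpoints (3 and 6) are hypothetical, since the text leaves the mapping open.

```python
# Illustrative sketch of the first hangover setting solution: counter-driven
# hangover flags with hypothetical breakpoints (3 and 6).
def update_hangover(counter, beyond_threshold):
    # counter grows while the parameter stays beyond its threshold
    counter = counter + 1 if beyond_threshold else 0
    # higher counter -> greater flag value (breakpoints are hypothetical)
    if counter >= 6:
        flag = 2
    elif counter >= 3:
        flag = 1
    else:
        flag = 0
    return counter, flag

counter = flag = 0
for beyond in [True, True, True, False, True]:
    counter, flag = update_hangover(counter, beyond)
print(counter, flag)  # the False resets the counter; one True follows
```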
  • Afterwards, if the current signal is classified as a useful signal, the signal is primarily sorted into either speech or music:
  • Firstly, primary speech judgment is performed. As shown in Figure 9, in block 901, the speech flag bit is set to 0, and then in block 902, a judgment is made about whether the Isf_meanSD is greater than the first ISF speech threshold (such as 1500). If the Isf_meanSD is greater than the first ISF speech threshold, the speech flag bit is set to 1; otherwise,
    in block 903, a judgment is made about whether the pitch value is 1 and the pitch lag value (t_top_mean), obtained by switching the pitch search on and off, is less than the pitch speech threshold (such as 40). If yes, the speech flag bit is set to 1; otherwise,
    in block 904, a judgment is made about whether the number of continuous frames whose pitch value is 1 exceeds the preset threshold of the number of hangover frames (such as 2 frames). If yes, the speech flag bit is set to 1; otherwise:
    in block 905, a judgment is made about whether the meangain exceeds the preset threshold of the longtime correlation speech (such as 8000). If yes, the speech flag bit is set to 1; otherwise,
    in block 906, a judgment is made about whether either or both of the level_meanSD_high_flag value and the ISF_meanSD_high_flag value are 1. If yes, the speech flag bit is set to 1; otherwise, the value of the speech flag bit remains unchanged.
  • Afterwards, primary music judgment is performed, as shown in Figure 10:
  • In block 1001, the music flag bit is set to 0 first, and then in block 1002, a judgment is made about whether the signal fulfills both ISF_meanSD_low_flag = 1 and level_meanSD_low_flag = 1. If yes, the music signal flag (music_flag) is set; otherwise, the value of the music flag bit remains unchanged.
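The primary speech and music judgments (Figures 9 and 10) can be condensed into one sketch. This is an illustrative sketch using the example thresholds quoted in the text (1500, 40, 2, 8000); parameter names mirror the text, and `pitch_run` stands for the number of continuous frames whose pitch value is 1.

```python
# Illustrative condensed sketch of the primary speech/music flag setting.
def primary_judgment(isf_meanSD, pitch, t_top_mean, pitch_run, meangain,
                     level_meanSD_high_flag, isf_meanSD_high_flag,
                     isf_meanSD_low_flag, level_meanSD_low_flag):
    speech_flag = 1 if (
        isf_meanSD > 1500                       # block 902
        or (pitch == 1 and t_top_mean < 40)     # block 903
        or pitch_run > 2                        # block 904
        or meangain > 8000                      # block 905
        or level_meanSD_high_flag == 1          # block 906
        or isf_meanSD_high_flag == 1
    ) else 0
    music_flag = 1 if (isf_meanSD_low_flag == 1
                       and level_meanSD_low_flag == 1) else 0   # block 1002
    return speech_flag, music_flag

print(primary_judgment(1600, 0, 0, 0, 0, 0, 0, 0, 0))  # isf_meanSD above 1500
```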
  • Afterwards, as shown in Figure 11, the primary judgment result is corrected:
  • In block 1101, a judgment is made about whether the instant energy of the sub-band is less than the sub-band energy threshold (such as 5000). If yes, the process proceeds to block 1102; otherwise, the signal is determined to be of the uncertain type.
  • In block 1102, a judgment is made about whether meangain_flag is 1 and the continuous music count is less than the speech judgment threshold of the continuous music count (such as 3). If yes, the signal is determined to be a speech signal; otherwise,
    in block 1103, a judgment is made about whether the ISF_meanSD value exceeds the preset second ISF speech threshold (such as 2000). If yes, the signal is determined to be a speech signal; otherwise,
    in block 1104, a judgment is made about whether the level_energy is less than 10000 and more than five frames are previously determined to be noise. If yes, the current signal type is set to the uncertain type, with a view to reducing the probability of mistaking noise for music; otherwise,
    in block 1105, a judgment is made about whether both the music flag bit and the speech flag bit are 1. If yes, the current signal type is determined to be the uncertain type; otherwise,
    in block 1106, a judgment is made about whether both the music flag bit and the speech flag bit are 0. If yes, the current signal type is determined to be the uncertain type; otherwise,
    in block 1107, a judgment is made about whether the music flag bit is 0 and the speech flag bit is 1. If yes, the current signal type is determined to be the speech type; otherwise,
    in block 1108, because the music flag bit is 1 and the speech flag bit is 0, the current signal type is determined to be the music type.
  • After the signal is determined to be of the uncertain type in the foregoing blocks 1104, 1105 and 1106, block 1109 is performed to judge whether pitch_flag is 1, the ISF_meanSD is less than the ISF music threshold (such as 900), and the number of continuous speech frames is less than 3. If yes, the signal is determined to be of the music type; otherwise, the signal is still determined to be of the uncertain type.
  • After the signal is determined to be of the speech type in the foregoing blocks 1103 and 1107, block 1110 is performed to judge whether the number of continuous music frames is greater than 3 and the ISF_meanSD is less than the ISF music threshold. If yes, the signal is determined to be a music signal; otherwise, the signal is determined to be a speech signal.
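The correction flow of blocks 1101-1108 can be condensed into one decision chain. This is an illustrative sketch using the example thresholds quoted in the text; the post-corrections of blocks 1109 and 1110 are omitted, and `music_run`/`noise_run` stand for the continuous music and noise frame counts.

```python
# Illustrative condensed sketch of the correction flow in Figure 11
# (blocks 1101-1108 only).
UNCERTAIN, SPEECH, MUSIC = "uncertain", "speech", "music"

def correct_primary(level_energy, meangain_flag, music_run, isf_meanSD,
                    noise_run, music_flag, speech_flag):
    if level_energy >= 5000:                       # block 1101
        return UNCERTAIN
    if meangain_flag == 1 and music_run < 3:       # block 1102
        return SPEECH
    if isf_meanSD > 2000:                          # block 1103
        return SPEECH
    if level_energy < 10000 and noise_run > 5:     # block 1104
        return UNCERTAIN
    if music_flag == speech_flag:                  # blocks 1105-1106
        return UNCERTAIN
    return SPEECH if speech_flag == 1 else MUSIC   # blocks 1107-1108

print(correct_primary(4000, 0, 0, 2500, 0, 0, 0))  # falls through to block 1103
```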
  • After the speech signals and music signals are determined through the foregoing process, the signals of the uncertain type undergo the primary corrective classification process shown in Figure 12, including:
  • In block 1201, a judgment is made about whether the level_energy is less than the threshold (such as 5000) of the uncertain type of sub-band energy. If yes, the signal type is still determined to be the uncertain class; otherwise,
    in block 1202, a judgment is made about whether the number of continuous music frames is greater than 1 and ISF_meanSD is less than the ISF music threshold. If yes, the signal is determined to be of the music class; otherwise,
    the speech and music hangover flags are cleared. If the signals before this frame are continuous speech signals and the continuity is strong, the speech is judged according to the characteristic parameters of the speech. If the speech conditions are fulfilled, the speech_hangover_flag is set to 1, as illustrated in blocks 1203 to 1206 in Figure 12. If the signals before this frame are continuous music signals and the continuity is strong, the music is judged according to the characteristic parameters of the music. If the music conditions are fulfilled, the music_hangover_flag is set to 1, as illustrated in blocks 1207 to 1210 in Figure 12.
  • Afterwards, as illustrated in blocks 1211 to 1216 in Figure 12, if the speech hangover flag is 1 and the music hangover flag is 0, the current signal type is set to the speech class. If the music hangover flag is 1 and the speech hangover flag is 0, the current signal type is set to the music class. If both the music hangover flag and the speech hangover flag are 1 or both are 0, the signal type is set to the uncertain class. In this case, if more than 20 previous music frames are continuous, the signal is determined to be of the music class; if more than 20 previous speech frames are continuous, the signal is determined to be of the speech class.
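The resolution of an uncertain frame from the hangover flags (blocks 1211-1216) can be sketched as follows. This is an illustrative sketch; `music_run` and `speech_run` stand for the counts of continuous previous music and speech frames.

```python
# Illustrative sketch of resolving an uncertain frame from the two hangover
# flags, falling back to long continuous context when the flags agree.
def resolve_uncertain(speech_hangover_flag, music_hangover_flag,
                      music_run, speech_run):
    if speech_hangover_flag == 1 and music_hangover_flag == 0:
        return "speech"
    if music_hangover_flag == 1 and speech_hangover_flag == 0:
        return "music"
    # flags both 1 or both 0: more than 20 continuous frames decides
    if music_run > 20:
        return "music"
    if speech_run > 20:
        return "speech"
    return "uncertain"

print(resolve_uncertain(0, 0, 25, 0))  # long music context wins
```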
  • After the foregoing primary correction is performed, the useful signal type is finally corrected, as shown in Figure 13, according to the current context. In block 1301, if the current context is music and it has lasted longer than 3 seconds, namely, the current continuous music frames are more than 150 frames, mandatory correction may be performed according to the ISF_meanSD value to determine the music signal. In block 1302, if the current context is speech and it has lasted longer than 3 seconds, namely, the current continuous speech frames are more than 150 frames, mandatory correction may be performed according to the ISF_meanSD value to determine the speech signal class. Afterwards, if the signal type is still uncertain, the signal type is corrected according to the previous context in block 1303, namely, the current uncertain signal is sorted into the previous signal type.
  • After the type of useful signals is determined in the foregoing process, the three type counters and the threshold values in the signal type judging module need to be updated. For the three type counters, if the current type is music (signal_sort = music), the music counter (music_continue_counter) increases by 1; otherwise, the music counter is cleared. The other type counters are processed similarly, as shown in Figure 14, and are not detailed here any further. The threshold values are updated according to the SNR output by the PSC module. The threshold examples given in the embodiments herein are values learned at an SNR of 20 dB.
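The type counter update can be sketched as follows. This is an illustrative sketch of the rule in Figure 14: the counter of the current type increments and the other counters are cleared.

```python
# Illustrative sketch of the type counter update: increment the counter
# matching the current signal_sort, clear the rest.
def update_counters(counters, signal_sort):
    for sort in counters:
        counters[sort] = counters[sort] + 1 if sort == signal_sort else 0
    return counters

counters = {"music": 4, "speech": 2, "uncertain": 0}
print(update_counters(counters, "music"))
```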
  • The background noise parameter updating module uses some spectral distribution parameters calculated in the classification parameter extracting module in the SAD to control the update rate of the background noise. In an actual application environment, the energy level of the background noise may surge abruptly. In this case, the background noise estimation probably stops being updated because the signals are continuously determined to be useful signals. This problem is solved by the background noise parameter updating module.
  • The background noise parameter updating module calculates the vector of relevant spectral distribution parameters according to the parameters received from the classification parameter extracting module. The vector includes the following elements:
    • zero cross rate short-time mean (zcr_mean)
    • short-time mean of ratio of high sub-band energy to low sub-band energy (RA)
    • short-time mean of frequency domain fluctuation (f_flux) of sub-band energy
    • short-time mean of time domain fluctuation (t_flux) of sub-band energy
    where the zcr_mean is calculated in the following way, and the other elements are calculated similarly: zcr_mean(m) = ALPHA · zcr_mean(m−1) + (1 − ALPHA) · zcr(m)

    where ALPHA = 0.96 and m represents a frame index.
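The short-time mean update can be sketched directly; this is an illustrative sketch of the recursion applied to each element of the spectral distribution vector, with ALPHA = 0.96 as in the text.

```python
# Illustrative sketch of the short-time mean recursion used for zcr_mean
# and the other spectral distribution elements.
ALPHA = 0.96

def short_time_mean(prev_mean, current):
    # heavy weight on the running mean, light weight on the new frame
    return ALPHA * prev_mean + (1.0 - ALPHA) * current

print(short_time_mean(0.5, 1.0))  # 0.96 * 0.5 + 0.04 * 1.0
```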
  • This embodiment makes use of the stable spectral features of the background noise. The elements of the spectral distribution parameter vector are not limited to the 4 elements listed above. The update rate of the current background noise is controlled by a difference (d_cb) between the current spectral distribution parameter and the spectral distribution parameter estimation of the background noise. The difference may be implemented through algorithms such as the Euclidean distance and the Manhattan distance. An instance of the present invention adopts the Manhattan distance (a distance calculation method similar to the Euclidean distance): d_cb = Σ (i = 1 to 4) |p(i) − p̃(i)|

    where p is the spectral distribution parameter vector of the current signal, and p̃ is the spectral distribution parameter vector estimation of the background noise.
  • In an example of this embodiment, if d_cb < TH1, the module outputs an update rate acc1, which represents the fastest update rate; otherwise, if d_cb < TH2, the module outputs an update rate acc2; otherwise, if d_cb < TH3, the module outputs an update rate acc3; otherwise, the module outputs an update rate acc4. TH1, TH2, and TH3 are update thresholds, and the specific threshold values depend on the actual environment conditions.
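The distance calculation and rate selection above can be sketched together. This is an illustrative sketch; the threshold values and the rate constants acc1..acc4 below are hypothetical placeholders, since the text leaves them environment-dependent.

```python
# Illustrative sketch of the update rate selection: the Manhattan distance
# between the current spectral distribution vector and the noise estimate
# is mapped to one of four rates.
TH1, TH2, TH3 = 0.1, 0.3, 0.6                  # hypothetical thresholds
ACC1, ACC2, ACC3, ACC4 = 1.0, 0.5, 0.1, 0.01   # hypothetical rates

def manhattan_distance(p, p_noise):
    # d_cb = sum over i of |p(i) - p_noise(i)|
    return sum(abs(a - b) for a, b in zip(p, p_noise))

def update_rate(d_cb):
    if d_cb < TH1:
        return ACC1   # fastest: current spectrum close to the noise estimate
    if d_cb < TH2:
        return ACC2
    if d_cb < TH3:
        return ACC3
    return ACC4

d = manhattan_distance([0.2, 0.1, 0.4, 0.3], [0.2, 0.1, 0.4, 0.35])
print(update_rate(d))  # distance below TH1
```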
  • In the embodiments of the present invention, the update rate of the background noise is determined, the noise parameters are updated according to the update rate, the signals are classified primarily according to the sub-band energy parameters and the updated noise parameters, and the non-useful signals and the useful signals in the received speech signals are determined, thus reducing the probability of mistaking useful signals for noise signals and improving accuracy of classifying sound signals.
  • It is understandable to those skilled in the art that the embodiments of the present invention may be implemented through software running on a universal hardware platform, or through hardware only. In most cases, however, software running on a universal hardware platform is preferred. Therefore, the technical solution under the present invention, or its contributions to the related art, may be embodied by a software product. The software product is stored in a storage medium and incorporates several instructions so that a computer device (for example, a PC, server, or network device) may execute the method in each embodiment of the present invention.
  • Described above are preferred embodiments of the present invention. In practice, those skilled in the art may make modifications to the method under the present invention to meet the specific requirements. Although the invention has been described through some exemplary embodiments, the invention is not limited to such embodiments.
  • Claims (17)

    1. A method for classifying sound signals, comprising:
      (a) receiving the sound signals, and determining an update rate of background noise according to spectral distribution parameters of the background noise and spectral distribution parameters of the sound signals; and
      (b) updating noise parameters according to the update rate, and classifying the sound signals according to sub-band energy parameters and the updated noise parameters.
    2. The method of claim 1, wherein after (b), the method further comprises:
      (c) determining the type of useful signals obtained through classification based on an open loop pitch parameter, an Immittance Spectral Frequency (ISF) parameter, and a sub-band energy parameter, wherein the type of the useful signals comprises speech and music.
    3. The method of claim 2, wherein before (c), the method further comprises:
      (c0) detecting whether noise estimation converges; if the noise estimation converges, performing (c1); otherwise, performing (c); and
      (c1) determining the type of the useful signals obtained through the classification based on the ISF parameter and the sub-band energy parameter, wherein the type of the useful signals comprises the speech and the music.
    4. The method of claim 3, wherein the process of detecting whether primary noise converges in (c0) is:
      judging whether the number of continuous noise frames before a received sound signal exceeds a preset noise convergence threshold; if the number of continuous noise frames exceeds a preset noise convergence threshold, determining that the noise estimation converges; otherwise, determining that the noise estimation does not converge.
    5. The method of claim 2, wherein (b) further comprises:
      obtaining the determined type of the useful signals, determining a signal hangover length according to the type of the useful signals, and classifying the sound signals according to the signal hangover length.
    6. The method of claim 2, wherein (c) further comprises:
      initializing a speech flag bit and a music flag bit; determining the type of the useful signals primarily according to a pitch parameter flag, a longtime signal correlation parameter, an isf_meanSD parameter, a level_meanSD parameter, and corresponding thresholds, wherein the type is speech, music, or uncertain; and modifying the speech flag bit and the music flag bit according to the primarily determined speech and music;
      correcting the primarily determined speech, music, and uncertain signals according to: sub-band energy, the longtime signal correlation parameter, the level_meanSD parameter, the speech flag bit, the music flag bit, whether a count of continuous frames whose pitch parameter flag value is 1 exceeds a preset hangover frame threshold, a count of continuous music frames, a count of continuous speech frames, and the type of a previous frame and corresponding thresholds; and correcting the primarily determined speech, music or uncertain signals; and finally determining the type of the useful signals, where the type of the useful signals comprises speech and music.
    7. The method of claim 6, wherein the threshold is adjusted according to a Signal-to-Noise Ratio (SNR) of the sound signals.
    8. The method of claim 1, wherein after (b), the method further comprises:
      (d) determining a coding mode corresponding to non-useful signals obtained through the classification, and determining whether it is necessary to calculate an Immittance Spectral Frequency (ISF) parameter according to the determined coding mode.
    9. The method of claim 1, wherein the noise parameters in (b) comprise: a noise estimation parameter, and a noise spectral distribution parameter.
    10. The method of claim 1 or 9, wherein (a) comprises:
      calculating a difference between the spectral distribution parameter of the sound signals and the spectral distribution parameter of the background noise, and determining the update rate according to the difference.
    11. The method of claim 10, wherein the spectral distribution parameters involved in calculation of the difference comprise:
      Zero Cross Rate (ZCR) short-time mean, short-time mean of ratio of high sub-band energy to low sub-band energy, short-time mean of sub-band energy frequency domain fluctuation, and short-time mean of sub-band energy time domain fluctuation.
    12. An apparatus for classifying sound signals, comprising:
      a background noise parameter updating module, configured to: determine an update rate of background noise according to a spectral distribution parameter of the background noise and spectral distribution parameters of current sound signals, and send the determined update rate; and
      a Primary Signal Classification (PSC) module, configured to: receive the update rate from the background noise parameter updating module, update noise parameters, classify the current sound signals according to a sub-band energy parameter and the updated noise parameters, and send a sound signal type determined through classification.
    13. The apparatus of claim 12, further comprising a signal type judging module, configured to:
      receive the sound signal type from the PSC module;
      determine the type of useful signals in the sound signals based on an open loop pitch parameter, an Immittance Spectral Frequency (ISF) parameter, and a sub-band energy parameter, or based on the ISF parameter and the sub-band energy parameter, wherein the type of the useful signals comprises speech and music; and
      send the determined type of the useful signals.
    14. The apparatus of claim 13, further comprising a classification parameter extracting module, configured to:
      receive the sound signal type from the PSC module, and transfer the sound signal type to the signal type judging module; and
      obtain the ISF parameter and the sub-band energy parameter, or further obtain the open loop pitch parameter, process the obtained parameters into signal type characteristic parameters, and send the parameters to the signal type judging module; and
      process the obtained parameters into the spectral distribution parameter of the sound signals and the spectral distribution parameter of the background noise, and transfer the spectral distribution parameters to the background noise parameter updating module; and
      the signal type judging module determines the type of the useful signals according to the signal type characteristic parameter and the sound signal type determined by the PSC module, wherein the type of the useful signals comprises speech and music.
    15. The apparatus of claim 13 or claim 14, wherein the PSC module comprises:
      a background noise estimating sub-module, a Signal-to-Noise Ratio (SNR) calculating sub-module, a useful signal estimating sub-module, a judgment threshold adjusting sub-module, a comparing sub-module, and a hangover protective useful signal sub-module; wherein
      the background noise estimating sub-module is configured to: receive the update rate from the background noise parameter updating module, update the noise parameters, and transfer the sub-band energy estimation parameter of the background noise, calculated according to the updated noise parameters, to the SNR calculating sub-module;
      the SNR calculating sub-module is configured to: receive the sub-band energy estimation parameter of the background noise, calculate an SNR according to this parameter and the sub-band energy parameter, and transfer the SNR to the signal type judging module;
      the signal type judging module comprises a parameter updating sub-module and a judging sub-module, wherein the parameter updating sub-module is configured to update thresholds in a signal type judgment process according to the SNR and provide the updated threshold to the judging sub-module; and
      the judging sub-module is configured to: receive the sound signal type from the PSC module, determine the type of the useful signals in the sound signals based on the open loop pitch parameter, ISF parameter, sub-band energy parameter, and updated thresholds, or based on the ISF parameter and sub-band energy parameter and the updated threshold, and send the determined type of the useful signals.
    16. The apparatus of claim 13, further comprising:
      a coder mode and rate selecting module, configured to: receive the type of the useful signals from the signal type judging module, and determine a coding mode and rate of the sound signals according to the received type of the useful signals.
    17. The apparatus of claim 14, further comprising:
      a coder parameter extracting module, configured to: extract the ISF parameter and the sub-band energy parameter or additionally the open loop pitch parameter, transfer the extracted parameters to the classification parameter extracting module, and transfer the extracted sub-band energy parameter to the PSC module.
    EP07855800A 2006-12-05 2007-12-26 Method and apparatus for classifying sound signals Active EP2096629B1 (en)

    Applications Claiming Priority (2)

    Application Number Priority Date Filing Date Title
    CN 200610164456 CN100483509C (en) 2006-12-05 2006-12-05 Aural signal classification method and device
    PCT/CN2007/003798 WO2008067735A1 (en) 2006-12-05 2007-12-26 A classing method and device for sound signal

    Publications (3)

    Publication Number Publication Date
    EP2096629A1 true EP2096629A1 (en) 2009-09-02
    EP2096629A4 EP2096629A4 (en) 2011-01-26
    EP2096629B1 EP2096629B1 (en) 2012-10-24

    Family

    ID=39491665

    Family Applications (1)

    Application Number Title Priority Date Filing Date
    EP07855800A Active EP2096629B1 (en) 2006-12-05 2007-12-26 Method and apparatus for classifying sound signals

    Country Status (3)

    Country Link
    EP (1) EP2096629B1 (en)
    CN (1) CN100483509C (en)
    WO (1) WO2008067735A1 (en)

    Cited By (6)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    WO2012146290A1 (en) * 2011-04-28 2012-11-01 Telefonaktiebolaget L M Ericsson (Publ) Frame based audio signal classification
    CN102928713A (en) * 2012-11-02 2013-02-13 北京美尔斯通科技发展股份有限公司 Background noise measuring method of magnetic antennas
    JP2014517938A (en) * 2011-05-24 2014-07-24 クゥアルコム・インコーポレイテッド Mode classification of noise robust speech coding
    US9224403B2 (en) 2010-07-02 2015-12-29 Dolby International Ab Selective bass post filter
    RU2630889C2 (en) * 2012-11-13 2017-09-13 Самсунг Электроникс Ко., Лтд. Method and device for determining the coding mode, method and device for coding audio signals and a method and device for decoding audio signals
    US10090003B2 (en) 2013-08-06 2018-10-02 Huawei Technologies Co., Ltd. Method and apparatus for classifying an audio signal based on frequency spectrum fluctuation

    Families Citing this family (15)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    JP5168162B2 (en) * 2009-01-16 2013-03-21 沖電気工業株式会社 SOUND SIGNAL ADJUSTMENT DEVICE, PROGRAM AND METHOD, AND TELEPHONE DEVICE
    EP2490214A4 (en) * 2009-10-15 2012-10-24 Huawei Tech Co Ltd Signal processing method, device and system
    CN102299693B (en) * 2010-06-28 2017-05-03 瀚宇彩晶股份有限公司 Message adjustment system and method
    CN102446506B (en) * 2010-10-11 2013-06-05 华为技术有限公司 Classification identifying method and equipment of audio signals
    US9099098B2 (en) * 2012-01-20 2015-08-04 Qualcomm Incorporated Voice activity detection in presence of background noise
    US8712076B2 (en) 2012-02-08 2014-04-29 Dolby Laboratories Licensing Corporation Post-processing including median filtering of noise suppression gains
    US9173025B2 (en) 2012-02-08 2015-10-27 Dolby Laboratories Licensing Corporation Combined suppression of noise, echo, and out-of-location signals
    EP3301676A1 (en) * 2012-08-31 2018-04-04 Telefonaktiebolaget LM Ericsson (publ) Method and device for voice activity detection
    CN106328169B (en) * 2015-06-26 2018-12-11 中兴通讯股份有限公司 A kind of acquisition methods, activation sound detection method and the device of activation sound amendment frame number
    CN106328152B (en) * 2015-06-30 2020-01-31 芋头科技(杭州)有限公司 automatic indoor noise pollution identification and monitoring system
    CN105654944B (en) * 2015-12-30 2019-11-01 中国科学院自动化研究所 It is a kind of merged in short-term with it is long when feature modeling ambient sound recognition methods and device
    CN107123419A (en) * 2017-05-18 2017-09-01 北京大生在线科技有限公司 The optimization method of background noise reduction in the identification of Sphinx word speeds
    CN108257617B (en) * 2018-01-11 2021-01-19 会听声学科技(北京)有限公司 Noise scene recognition system and method
    CN110992989B (en) * 2019-12-06 2022-05-27 广州国音智能科技有限公司 Voice acquisition method and device and computer readable storage medium
    CN113257276B (en) * 2021-05-07 2024-03-29 普联国际有限公司 Audio scene detection method, device, equipment and storage medium

    Citations (2)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    WO1996005592A1 (en) * 1994-08-10 1996-02-22 Qualcomm Incorporated Method and apparatus for selecting an encoding rate in a variable rate vocoder
    WO2002065457A2 (en) * 2001-02-13 2002-08-22 Conexant Systems, Inc. Speech coding system with a music classifier

    Family Cites Families (6)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    US6691084B2 (en) * 1998-12-21 2004-02-10 Qualcomm Incorporated Multiple mode variable rate speech coding
    US6456964B2 (en) * 1998-12-21 2002-09-24 Qualcomm, Incorporated Encoding of periodic speech using prototype waveforms
    JP3454206B2 (en) * 1999-11-10 2003-10-06 三菱電機株式会社 Noise suppression device and noise suppression method
    US6983242B1 (en) * 2000-08-21 2006-01-03 Mindspeed Technologies, Inc. Method for robust classification in speech coding
    CN1175398C (en) * 2000-11-18 2004-11-10 中兴通讯股份有限公司 Sound activation detection method for identifying speech and music from noise environment
    WO2002080148A1 (en) * 2001-03-28 2002-10-10 Mitsubishi Denki Kabushiki Kaisha Noise suppressor

    Patent Citations (2)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    WO1996005592A1 (en) * 1994-08-10 1996-02-22 Qualcomm Incorporated Method and apparatus for selecting an encoding rate in a variable rate vocoder
    WO2002065457A2 (en) * 2001-02-13 2002-08-22 Conexant Systems, Inc. Speech coding system with a music classifier

    Non-Patent Citations (3)

    * Cited by examiner, † Cited by third party
    Title
    "Digital cellular telecommunications system (Phase 2+); Universal Mobile Telecommunications System (UMTS); Audio codec processing functions; Extended Adaptive Multi-Rate - Wideband (AMR-WB+) codec; Transcoding functions (3GPP TS 26.290 version 6.3.0 Release 6); ETSI TS 126 290", ETSI STANDARDS, LIS, SOPHIA ANTIPOLIS CEDEX, FRANCE, vol. 3-SA4, no. V6.3.0, 1 June 2005 (2005-06-01), XP014030612, ISSN: 0000-0001 *
    JELINEK M ET AL: "Robust signal/noise discrimination for wideband speech and audio coding", SPEECH CODING, 2000. PROCEEDINGS. 2000 IEEE WORKSHOP ON SEPTEMBER 17-20, 2000, PISCATAWAY, NJ, USA,IEEE, 17 September 2000 (2000-09-17), pages 151-153, XP010520072, ISBN: 978-0-7803-6416-5 *
    See also references of WO2008067735A1 *

    Cited By (27)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    US9830923B2 (en) 2010-07-02 2017-11-28 Dolby International Ab Selective bass post filter
    US11996111B2 (en) 2010-07-02 2024-05-28 Dolby International Ab Post filter for audio signals
    US9858940B2 (en) 2010-07-02 2018-01-02 Dolby International Ab Pitch filter for audio signals
    US9224403B2 (en) 2010-07-02 2015-12-29 Dolby International Ab Selective bass post filter
    US11610595B2 (en) 2010-07-02 2023-03-21 Dolby International Ab Post filter for audio signals
    US9343077B2 (en) 2010-07-02 2016-05-17 Dolby International Ab Pitch filter for audio signals
    US9396736B2 (en) 2010-07-02 2016-07-19 Dolby International Ab Audio encoder and decoder with multiple coding modes
    US9552824B2 (en) 2010-07-02 2017-01-24 Dolby International Ab Post filter
    US9558754B2 (en) 2010-07-02 2017-01-31 Dolby International Ab Audio encoder and decoder with pitch prediction
    US9558753B2 (en) 2010-07-02 2017-01-31 Dolby International Ab Pitch filter for audio signals
    US9595270B2 (en) 2010-07-02 2017-03-14 Dolby International Ab Selective post filter
    US11183200B2 (en) 2010-07-02 2021-11-23 Dolby International Ab Post filter for audio signals
    US10811024B2 (en) 2010-07-02 2020-10-20 Dolby International Ab Post filter for audio signals
    US10236010B2 (en) 2010-07-02 2019-03-19 Dolby International Ab Pitch filter for audio signals
    WO2012146290A1 (en) * 2011-04-28 2012-11-01 Telefonaktiebolaget L M Ericsson (Publ) Frame based audio signal classification
    US9240191B2 (en) 2011-04-28 2016-01-19 Telefonaktiebolaget L M Ericsson (Publ) Frame based audio signal classification
    JP2014517938A (en) * 2011-05-24 2014-07-24 クゥアルコム・インコーポレイテッド Mode classification of noise robust speech coding
    CN102928713A (en) * 2012-11-02 2013-02-13 北京美尔斯通科技发展股份有限公司 Background noise measuring method of magnetic antennas
    RU2656681C1 (en) * 2012-11-13 2018-06-06 Самсунг Электроникс Ко., Лтд. Method and device for determining the coding mode, the method and device for coding of audio signals and the method and device for decoding of audio signals
    US11004458B2 (en) 2012-11-13 2021-05-11 Samsung Electronics Co., Ltd. Coding mode determination method and apparatus, audio encoding method and apparatus, and audio decoding method and apparatus
    RU2630889C2 (en) * 2012-11-13 2017-09-13 Самсунг Электроникс Ко., Лтд. Method and device for determining the coding mode, method and device for coding audio signals and a method and device for decoding audio signals
    US10468046B2 (en) 2012-11-13 2019-11-05 Samsung Electronics Co., Ltd. Coding mode determination method and apparatus, audio encoding method and apparatus, and audio decoding method and apparatus
    RU2680352C1 (en) * 2012-11-13 2019-02-19 Самсунг Электроникс Ко., Лтд. Encoding mode determining method and device, the audio signals encoding method and device and the audio signals decoding method and device
    US10529361B2 (en) 2013-08-06 2020-01-07 Huawei Technologies Co., Ltd. Audio signal classification method and apparatus
    US11289113B2 (en) 2013-08-06 2022-03-29 Huawei Technolgies Co. Ltd. Linear prediction residual energy tilt-based audio signal classification method and apparatus
    US11756576B2 (en) 2013-08-06 2023-09-12 Huawei Technologies Co., Ltd. Classification of audio signal as speech or music based on energy fluctuation of frequency spectrum
    US10090003B2 (en) 2013-08-06 2018-10-02 Huawei Technologies Co., Ltd. Method and apparatus for classifying an audio signal based on frequency spectrum fluctuation

    Also Published As

    Publication number Publication date
    EP2096629B1 (en) 2012-10-24
    EP2096629A4 (en) 2011-01-26
    CN101197135A (en) 2008-06-11
    CN100483509C (en) 2009-04-29
    WO2008067735A1 (en) 2008-06-12

    Similar Documents

    Publication Publication Date Title
    EP2096629B1 (en) Method and apparatus for classifying sound signals
    JP3197155B2 (en) Method and apparatus for estimating and classifying a speech signal pitch period in a digital speech coder
    CN101197130B (en) Sound activity detecting method and detector thereof
    RU2441286C2 (en) Method and apparatus for detecting sound activity and classifying sound signals
    EP2159788B1 (en) A voice activity detecting device and method
    US6202046B1 (en) Background noise/speech classification method
    US6424938B1 (en) Complex signal activity detection for improved speech/noise classification of an audio signal
    RU2417456C2 (en) Systems, methods and devices for detecting changes in signals
    CN103548081B (en) Noise-robust mode classification for speech coding
    EP1147515A1 (en) Wide band speech synthesis by means of a mapping matrix
    US7478042B2 (en) Speech decoder that detects stationary noise signal regions
    CN101149921A (en) Silence detection method and device
    CN101393741A (en) Audio signal classification apparatus and method used in wideband audio encoder and decoder
    US6564182B1 (en) Look-ahead pitch determination
    JPH10105194A (en) Pitch detecting method, and method and device for encoding speech signal
    JP3331297B2 (en) Background sound / speech classification method and apparatus, and speech coding method and apparatus
    Wang et al. Phonetic segmentation for low rate speech coding
    CN101393744A (en) Method for regulating threshold and detection module
    Zhang et al. A CELP variable rate speech codec with low average rate
    JPH08305388A (en) Speech segment detection device
    Benyassine et al. A robust low complexity voice activity detection algorithm for speech communication systems
    Liu et al. Efficient voice activity detection algorithm based on sub-band temporal envelope and sub-band long-term signal variability
    ZHANG L., WANG T., CUPERMAN V. (School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada; Department of Electrical and Computer Engineering, University of California, Santa Barbara, USA) A CELP variable rate speech codec with low average rate
    NO309831B1 (en) Method and apparatus for classifying speech signals

    Legal Events

    Date Code Title Description
    PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

    Free format text: ORIGINAL CODE: 0009012

    17P Request for examination filed

    Effective date: 20090608

    AK Designated contracting states

    Kind code of ref document: A1

    Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC MT NL PL PT RO SE SI SK TR

    DAX Request for extension of the european patent (deleted)
    A4 Supplementary search report drawn up and despatched

    Effective date: 20101223

    RIC1 Information provided on ipc code assigned before grant

    Ipc: G10L 11/02 20060101AFI20080710BHEP

    17Q First examination report despatched

    Effective date: 20110524

    GRAP Despatch of communication of intention to grant a patent

    Free format text: ORIGINAL CODE: EPIDOSNIGR1

    RTI1 Title (correction)

    Free format text: METHOD AND APPARATUS FOR CLASSIFYING SOUND SIGNALS

    GRAS Grant fee paid

    Free format text: ORIGINAL CODE: EPIDOSNIGR3

    GRAA (expected) grant

    Free format text: ORIGINAL CODE: 0009210

    AK Designated contracting states

    Kind code of ref document: B1

    Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC MT NL PL PT RO SE SI SK TR

    REG Reference to a national code

    Ref country code: GB

    Ref legal event code: FG4D

    REG Reference to a national code

    Ref country code: CH

    Ref legal event code: EP

    REG Reference to a national code

    Ref country code: AT

    Ref legal event code: REF

    Ref document number: 581291

    Country of ref document: AT

    Kind code of ref document: T

    Effective date: 20121115

    REG Reference to a national code

    Ref country code: IE

    Ref legal event code: FG4D

    REG Reference to a national code

    Ref country code: DE

    Ref legal event code: R096

    Ref document number: 602007026318

    Country of ref document: DE

    Effective date: 20121220

    REG Reference to a national code

    Ref country code: AT

    Ref legal event code: MK05

    Ref document number: 581291

    Country of ref document: AT

    Kind code of ref document: T

    Effective date: 20121024

    REG Reference to a national code

    Ref country code: NL

    Ref legal event code: VDEP

    Effective date: 20121024

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: SE

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: NL

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: FI

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: IS

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20130224

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: CY

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: BE

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: LV

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: GR

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20130125

    Ref country code: PT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20130225

    Ref country code: SI

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: PL

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: AT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: BG

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20130124

    Ref country code: SK

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: CZ

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: MC

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20121231

    Ref country code: DK

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: EE

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    REG Reference to a national code

    Ref country code: CH

    Ref legal event code: PL

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: IT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: RO

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    PLBE No opposition filed within time limit

    Free format text: ORIGINAL CODE: 0009261

    STAA Information on the status of an ep patent application or granted ep patent

    Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

    REG Reference to a national code

    Ref country code: IE

    Ref legal event code: MM4A

    26N No opposition filed

    Effective date: 20130725

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: CH

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20121231

    Ref country code: IE

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20121226

    Ref country code: LI

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20121231

    Ref country code: ES

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20130204

    REG Reference to a national code

    Ref country code: DE

    Ref legal event code: R097

    Ref document number: 602007026318

    Country of ref document: DE

    Effective date: 20130725

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: MT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: TR

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: LU

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20121226

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: LT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20121024

    Ref country code: HU

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20071226

    REG Reference to a national code

    Ref country code: FR

    Ref legal event code: PLFP

    Year of fee payment: 9

    REG Reference to a national code

    Ref country code: FR

    Ref legal event code: PLFP

    Year of fee payment: 10

    REG Reference to a national code

    Ref country code: FR

    Ref legal event code: PLFP

    Year of fee payment: 11

    P01 Opt-out of the competence of the unified patent court (upc) registered

    Effective date: 20230524

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: GB

    Payment date: 20231102

    Year of fee payment: 17

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: FR

    Payment date: 20231108

    Year of fee payment: 17

    Ref country code: DE

    Payment date: 20231031

    Year of fee payment: 17