TWI559295B - Elimination of non-steady-state noise - Google Patents
Elimination of non-steady-state noise
- Publication number: TWI559295B
- Authority
- TW
- Taiwan
- Prior art keywords
- noise
- background
- signal
- speech
- threshold
- Prior art date
Description
The present invention relates to a method for eliminating non-stationary (non-steady-state) noise, and in particular to a method for eliminating non-stationary noise that incorporates background-signal detection.
Voice-activated operation has gradually become an indispensable part of daily life and is widely used in devices such as smartphones and tablet computers, where it can replace traditional keyboard or touch input. The processing of mixed speech-and-noise signals is the most critical factor affecting the correctness of voice-activated operation, and also the part most in need of a solution, because it is directly related to the practicality and convenience of voice control.
In recent years, considerable research has been published on speech applications including noise cancellation, speech separation, and even music-related singing-voice or instrument separation. Before processing, most algorithms must estimate either the target sound or the background signal to be removed (which may include background speech and background noise). However, handheld wireless communication systems often operate in time-varying, non-stationary environments. Because such non-stationary noise changes over time, ordinary noise-filtering processes can easily distort the original speech signal through errors in background-noise estimation. How to eliminate non-stationary noise so as to filter out the correct target speech is therefore the primary object of the present invention.
In view of the above problems of the prior art, the object of the present invention is to provide a method for eliminating non-stationary noise, so as to solve the prior-art problem that, when noise filtering is performed in a time-varying non-stationary environment, the original speech signal is easily distorted because the background noise is estimated incorrectly.
According to the object of the present invention, a method for eliminating non-stationary noise is provided, comprising the following steps: inputting an audio stream for estimation, so as to distinguish a primary speech from a background signal comprising a background speech and a background noise; obtaining a spectral-characteristic basis of the estimated background signal by a basis-based background-noise estimation method; performing basis-based noise separation on the spectral-characteristic basis to obtain a clean signal with the background noise removed; and obtaining a noise-removed clean speech signal from that clean signal by a vocal endpoint detection method.
Preferably, the method for eliminating non-stationary noise further comprises the following steps: dividing the audio stream into a plurality of frames and obtaining a two-dimensional spectrogram of the audio stream frame by frame; and feeding the two-dimensional spectrogram into a filter bank with two-dimensional impulse responses in time and frequency to compute a harmonic frequency-modulation energy value, and comparing that value against a threshold, such that a frame whose harmonic frequency-modulation energy value is below the threshold is regarded as background noise, and a frame whose value is above the threshold is regarded as background speech.
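The framing and two-dimensional spectrogram of the step above can be sketched as follows (a minimal NumPy illustration; the frame length, hop size, and window are illustrative choices, not values prescribed by the patent):

```python
import numpy as np

def stft_magnitude(x, frame_len=512, hop=256):
    """Split a 1-D signal into overlapping windowed frames and return the
    magnitude short-time Fourier spectrogram (frequency bins x frames)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)], axis=1)
    return np.abs(np.fft.rfft(frames, axis=0))

# toy check: a 1 kHz tone sampled at 8 kHz peaks in the expected frequency bin
sr = 8000
t = np.arange(sr) / sr
spec = stft_magnitude(np.sin(2 * np.pi * 1000 * t))
print(spec.shape)             # (257, 30)
print(np.argmax(spec[:, 0]))  # 64, i.e. 1000 Hz * 512 / 8000
```

Each column of the resulting matrix is one frame's short-time spectrum; the full matrix is the two-dimensional spectrogram that the filter bank then processes.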
The threshold is updated by combining the noisy speech signal, the noise signal, and an adjustment coefficient.
The threshold is computed as: γ = ρ·{M[FME_S+N(t)] - M[FME_N(t)]} + M[FME_N(t)]; where γ is the threshold, FME_S+N(t) is the harmonic frequency-modulation energy value over the interval of the utterance with the largest modulation energy, FME_N(t) is the harmonic frequency-modulation energy value over the interval with the smallest modulation energy, ρ is the adjustment coefficient, and M denotes taking the average of the harmonic frequency-modulation energy values over that interval.
For example, ρ may be 0.25.
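Assuming the maximum- and minimum-energy intervals are located by sliding a fixed-length window over the per-frame FME values (the embodiment later uses 250 ms; the frame count `win` below is an illustrative stand-in), the threshold update can be sketched as:

```python
import numpy as np

RHO = 0.25  # adjustment coefficient, as suggested above

def fme_threshold(fme, win=25):
    """gamma = rho * (M[FME_S+N] - M[FME_N]) + M[FME_N], where the means
    are taken over the windows with the largest and smallest total
    modulation energy in the whole utterance."""
    sums = np.convolve(fme, np.ones(win), mode="valid")
    hi = np.argmax(sums)   # window with the largest modulation energy
    lo = np.argmin(sums)   # window with the smallest modulation energy
    m_sn = fme[hi:hi + win].mean()
    m_n = fme[lo:lo + win].mean()
    return RHO * (m_sn - m_n) + m_n

# frames whose FME exceeds gamma are treated as speech, the rest as noise
fme = np.concatenate([np.full(50, 0.1), np.full(50, 1.0), np.full(50, 0.1)])
gamma = fme_threshold(fme)
is_speech = fme > gamma
print(round(gamma, 3))   # 0.325 = 0.1 + 0.25 * (1.0 - 0.1)
print(is_speech.sum())   # 50 frames classified as speech
```

Because γ sits only a quarter of the way between the noise floor and the speech peak, low-energy speech frames are still kept while sustained low-FME regions are classified as background.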
Preferably, the method for eliminating non-stationary noise further comprises the following step: separating the spectral-characteristic basis by a basis-based noise-separation method to obtain a speech structure and a speech excitation matrix, thereby obtaining the clean signal with the background noise removed.
As described above, the method for eliminating non-stationary noise according to the present invention may have one or more of the following features:
1. Conventional techniques that estimate noise from non-speech segments sometimes cannot produce accurate estimates, especially for non-stationary noise, which in turn degrades the cancellation result. In contrast, the background-signal detection of the present invention, which combines the frequency-modulation energy of the speech harmonics, analyzes the speech structure to accurately locate the background signal to be separated, and can effectively distinguish noise even in segments where the target speech and the noise are present simultaneously. This accurate noise estimation improves the subsequent noise-cancellation performance.
2. To improve noise cancellation, conventional techniques estimate noise in non-speech segments, so their cancellation of unpredictable non-stationary noise is extremely poor (because non-stationary noise occupies only a small portion of the non-speech segments). The present invention instead analyzes the specific structure that natural sounds possess and finds the bases of the non-stationary noise structure to eliminate it; it can therefore be applied to any audio signal to remove non-stationary noise interference.
3. Conventional speech endpoint detection cannot distinguish the target speech signal when it encounters non-stationary noise interference (the noise may also contain components of external background speech). By adding the proposed non-stationary-noise elimination, which estimates possible non-stationary noise and separates it, the present invention not only effectively improves speech-endpoint-detection performance but also removes the non-stationary background signal from the speech.
S11~S14‧‧‧Steps
FIG. 1 is a flowchart of the method for eliminating non-stationary noise according to the present invention.
FIG. 2 is a schematic diagram of the joint time-domain and frequency-domain impulse responses of the present invention.
FIG. 3 is a schematic diagram of noise-model training and recognition according to the present invention.
FIG. 4 is an example of noise cancellation according to an embodiment of the present invention.
In order that the examiner may understand the technical features, contents, and advantages of the present invention and the effects it can achieve, the present invention is described in detail below with reference to the accompanying drawings and in the form of embodiments. The drawings are intended only for illustration and to assist the specification; they do not necessarily reflect the true proportions and precise configurations of the invention as implemented, and the proportions and layouts of the attached drawings should therefore not be interpreted as limiting the scope of the invention in actual practice.
Referring to FIG. 1, a flowchart of the method for eliminating non-stationary noise of the present invention, the method comprises four main steps. First, in step S11, a short-time Fourier transform is used to obtain a time-frequency representation of the sound: the audio stream is input, and the short-time Fourier spectrum of each frame is computed to obtain a two-dimensional spectrogram of the entire signal, which is then processed by specially designed two-dimensional time-frequency-domain impulse-response band-pass filter banks. The concept of this analysis method comes from an auditory model built on the known neural responses of the auditory region of the cerebral cortex; this region represents the frequency-modulation energy of the speech harmonics, which is the most robust cue for recognizing speech segments. Through this analysis we can extract speech-specific features and use a threshold (computed as in equation (4) below) to classify each segment as speech or non-speech. Next, in step S12, the background-signal segments are input into the training process of basis-based sparse non-negative matrix factorization (NMF), which trains the noise bases B_noise. B_noise is a two-dimensional matrix composed of several background-noise basis vectors. Step S13 is then executed: with the trained noise bases B_noise, speech and noise are separated through the sparse NMF method, finally yielding clean speech. Finally, step S14 performs speech/non-speech recognition once more. The equations are given in order as (1) to (3). In equation (1), during basis-based sparse NMF, the original mixed speech signal X can be approximately decomposed into B and H; to avoid the situation where, with sparse data, seeking a relatively large number of bases makes the B_noise error too large, a sparsity parameter λ is added as a sparseness constraint on H, lowering the importance (i.e., the weights in H) of bases with larger errors so as to reduce the B_noise error:
This part involves several parameters: the sparsity parameter λ, the number of noise bases B_noise, and bases with time-domain characteristics. According to the non-negative matrix factorization method, a signal X containing speech and noise can be approximated by equation (2). Using the previously trained bases B_noise in the separation process of basis-based sparse NMF, the speech basis matrix B_speech and the excitation matrix H_speech are obtained after iterative computation; the speech basis matrix and the speech excitation matrix are then used to reconstruct a cleaner signal as in equation (3), which is input to the vocal endpoint detection of the fourth part. This part is similar in function to the first part, but because the signal is cleaner, more accurate speech segments can be cut out.
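Since equations (1) to (3) are not reproduced on this page, the sketch below assumes the standard Euclidean sparse-NMF objective ||X - BH||² + λΣH with multiplicative updates: the training stage learns B_noise from background-labelled frames, and the separation stage keeps those columns frozen while updating B_speech, H_speech, and H_noise (all names, sizes, and iteration counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
EPS = 1e-9

def sparse_nmf(X, r, lam=0.1, iters=200, B_fixed=None):
    """Multiplicative-update sparse NMF, X ~= B H, with an L1 penalty lam
    on H. If B_fixed is given, its columns (trained noise bases) occupy
    the last columns of B and are held constant during the updates."""
    n_fixed = 0 if B_fixed is None else B_fixed.shape[1]
    B = rng.random((X.shape[0], r)) + EPS
    if n_fixed:
        B[:, -n_fixed:] = B_fixed
    H = rng.random((r, X.shape[1])) + EPS
    for _ in range(iters):
        H *= (B.T @ X) / (B.T @ B @ H + lam + EPS)
        Bu = B * (X @ H.T) / (B @ H @ H.T + EPS)
        if n_fixed:                    # keep the noise bases frozen
            Bu[:, -n_fixed:] = B_fixed
        B = Bu
    return B, H

# training stage: learn noise bases from background-labelled spectrogram frames
X_noise = rng.random((64, 40))
B_noise, _ = sparse_nmf(X_noise, r=4)

# separation stage: mixed signal, 8 speech bases plus the 4 frozen noise bases
X_mix = rng.random((64, 100))
B, H = sparse_nmf(X_mix, r=12, B_fixed=B_noise)
X_speech = B[:, :8] @ H[:8, :]         # equation-(3)-style reconstruction
print(X_speech.shape)                  # (64, 100)
```

The multiplicative updates preserve non-negativity of B and H by construction, which is what lets the product of the speech parts be read back as a magnitude spectrogram.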
In detail, the present invention computes the Fourier spectrogram of an audio stream. According to neurophysiological findings, the auditory model may assume that the auditory region of the cerebral cortex essentially treats the auditory spectrogram output by the midbrain as a two-dimensional image for processing; we therefore design, for the conventional Fourier spectrogram, two filter banks whose impulse responses are two-dimensional in time and frequency. FIG. 2 shows the two most discriminative filters after selection: on the left, the impulse response of the filter that responds most strongly to a downward FM signal (rate = 1 Hz, scale = 5 ms); on the right, the impulse response of the filter that responds most strongly to an upward FM signal (rate = -1 Hz, scale = 5 ms). In principle the number of filters is not limited; to simplify computation, at this stage we actually design a total of two two-dimensional band-pass filter banks, whose envelope change rates along the frequency axis and the time axis are the combinations of <5> ms and <1> Hz with <downward, upward>. Finally, a threshold is set from the energy values obtained by these two filters, computed as in equation (4), to judge whether the signal in each frame is speech or noise. After some noise segments have been obtained, the bases of the noise (which may also be called the speech structure of the noise) can be computed through basis-based noise estimation; the present invention implements this with non-negative matrix operations, as shown in FIG. 3. As for why the noise segments are chosen for training: speech generally varies so much that higher-dimensional bases are needed to represent the speech structure of a speech signal, whereas estimating bases from noise requires smaller bases and far less computation. Through the training process of basis-based sparse NMF, the noise bases B_noise are obtained. Once the spectral characteristics (also called the bases) of the noise have been trained in this way, basis-based background-noise separation can separate the mixed speech. The present invention uses the separation process of non-negative matrix factorization to cancel the noise, as in equation (2); now only the three parts B_speech, H_speech, and H_noise need to be updated. After B_speech, H_speech, and H_noise are obtained through the separation process of basis-based sparse NMF, B_speech and H_speech are finally substituted into equation (3) to obtain the separated target speech signal X_speech.
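As a rough illustration of the two-dimensional filtering, the toy Gabor-style kernels below respond to upward and downward time-frequency sweeps; they are stand-ins for the rate = ±1 Hz, scale = 5 ms responses of FIG. 2, whose exact design is not reproduced on this page:

```python
import numpy as np

def fm_kernel(direction, size=9):
    """Toy Gabor-like kernel oriented to upward (+1) or downward (-1)
    frequency sweeps over time; a stand-in for the patent's filter pair."""
    ax = np.arange(size) - size // 2
    t, f = np.meshgrid(ax, ax)
    env = np.exp(-(t**2 + f**2) / (2 * (size / 3.0) ** 2))
    return env * np.cos(2 * np.pi * (t + direction * f) / size)

def correlate2d_same(a, k):
    """Minimal zero-padded 'same'-size 2-D correlation."""
    kf, kt = k.shape
    pad = np.pad(a, ((kf // 2, kf // 2), (kt // 2, kt // 2)))
    out = np.zeros_like(a)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i, j] = (pad[i:i + kf, j:j + kt] * k).sum()
    return out

def fme(spec):
    """Per-frame harmonic frequency-modulation energy: squared responses of
    the spectrogram to both sweep filters, summed over frequency."""
    up = correlate2d_same(spec, fm_kernel(+1))
    down = correlate2d_same(spec, fm_kernel(-1))
    return (up**2 + down**2).sum(axis=0)

# a rising spectro-temporal ridge yields a per-frame FME contour, to which
# a threshold such as equation (4) can then be applied
spec = np.zeros((32, 40))
spec[np.clip(np.arange(40) // 2, 0, 31), np.arange(40)] = 1.0
contour = fme(spec)
print(contour.shape)  # (40,)
```

In a real system the kernel shapes would come from the auditory-model design the patent describes; the point here is only that the per-frame FME contour is the quantity that the threshold γ is compared against.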
In addition, FIG. 4 shows an example from a real case: a mixed signal, recorded by a mobile phone in a restaurant environment, containing the target speech and interfering vocal noise. This is a situation commonly encountered in normal use, where in addition to the speech there is unpredictable vocal interference. We found that, because speech possesses harmonicity and directional frequency modulation (FM), the harmonic frequency-modulation energy (FME) contour of FIG. 4 can be inspected to see whether the interfering vocal noise has been completely filtered out. Directly observing this 10-second recording, the target speech signal appears between 0.7 s and 2.8 s, while the remaining parts are interfering vocal noise that must be filtered out. As shown by the harmonic FME contour in FIG. 4, a threshold is taken for the speech/non-speech recognition of the first step in FIG. 1, the threshold γ being chosen as: γ = ρ·{M[FME_S+N(t)] - M[FME_N(t)]} + M[FME_N(t)] ... equation (4).
Here M is the averaging operator, FME_S+N(t) is the FME value over the 250 ms with the largest modulation energy in the whole utterance, FME_N(t) is the FME value over the 250 ms with the smallest modulation energy, ρ is the adjustment coefficient 0.25, and M denotes averaging these 250 ms energy values. It can be seen that after the speech/non-speech recognition, in the upper half of FIG. 4, before the non-negative matrix factorization, some interfering vocal noise is temporarily treated as ordinary speech because its energy exceeds the set threshold γ. Although the spectral modulation energy of this vocal noise exceeds the threshold, its spectral characteristics are similar to those of the other background vocal noise; so, by feeding the non-speech parts into the training process of basis-based sparse NMF, the spectral bases B_noise of the interfering vocal noise are trained and then input into the separation part of the basis-based sparse NMF of step S13 in FIG. 1. B_speech and H_speech are obtained, and applying equation (3) reconstructs the clean signal X_speech with the background noise removed. Finally, the vocal endpoint detection of step S14 in FIG. 1 is performed: the separated target speech signal is combined with the threshold-update formula (4) to obtain the modulation-energy threshold, and signals whose modulation energy falls below the threshold are filtered out; through this detection, the target speech with the background noise removed is obtained. The threshold used here is computed in the same way as equation (4), but because FME_S+N(t) and FME_N(t) have undergone noise removal in these two stages, the final threshold differs from the previous one, with the result shown in FIG. 4. It can be seen that, for the signal processed by basis-based sparse NMF, the harmonic frequency-modulation energy of the noise has been removed, and the contour makes it much easier to segment the speech accurately.
In summary, the method for eliminating non-stationary noise of the present invention first uses background-signal detection to quickly locate the background signal (including background speech and noise), then marks the background signal and, through basis-based background estimation, trains the spectral bases of stationary or non-stationary noise. This method can effectively estimate the spectral characteristics of the noise even in segments where the target speech and the noise are present simultaneously. The trained noise spectral bases are then used to perform background-noise separation on the original audio stream to eliminate the non-stationary noise, and finally the clean speech signal is input for vocal endpoint detection, filtering out the correct target speech.
The above description is intended to be illustrative only and not restrictive. Any equivalent modification or alteration that does not depart from the spirit and scope of the present invention shall be included in the scope of the appended claims.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW103134971A TWI559295B (en) | 2014-10-08 | 2014-10-08 | Elimination of non-steady-state noise |
Publications (2)
Publication Number | Publication Date |
---|---|
TW201614640A TW201614640A (en) | 2016-04-16 |
TWI559295B true TWI559295B (en) | 2016-11-21 |
Family
ID=56361262
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW201142831A (en) * | 2010-04-09 | 2011-12-01 | Dts Inc | Adaptive environmental noise compensation for audio playback |
TW201335931A (en) * | 2012-02-22 | 2013-09-01 | Htc Corp | Method and apparatus for audio intelligibility enhancement and computing apparatus |
Non-Patent Citations (1)
Title |
---|
林宗翰 et al., "Bayesian Non-negative Matrix Factorization with Group Sparseness Applied to Music Signal Separation," graduate thesis collection (academic year 99), Department of Computer Science and Information Engineering, National Cheng Kung University, 2012/9/14. * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MM4A | Annulment or lapse of patent due to non-payment of fees |