JPH1152977A

JPH1152977A - Method and device for voice processing

Info

Publication number: JPH1152977A
Application number: JP9206366A
Authority: JP
Inventors: Takehiko Isaka; 岳彦井阪; Hitoshi Nagata; 仁史永田
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1997-07-31
Filing date: 1997-07-31
Publication date: 1999-02-26
Anticipated expiration: 2017-07-31
Also published as: JP3677143B2

Abstract

PROBLEM TO BE SOLVED: To accurately detect the speech section of a target sound source with a small number of microphones in an environment where the S/N ratio is low and the direction of the noise source cannot be identified. SOLUTION: A speech input section 10 receives speech signals at terminals 10-1 to 10-n through a plurality of channels ch1 to chn. A beamformer processing section 20 applies beamformer processing to the speech signals received by the section 10 to suppress the signal arriving from the target sound source. A target sound source direction estimating section 30 obtains the direction of the target sound source from the filter coefficients obtained by the section 20. A speech/non-speech determining section 40 determines the speech section of the speech signals based on the estimated direction of the target sound source.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech processing method and apparatus that detect the speech section of an input speech signal and suppress noise to enhance speech.

[0002]

[Description of the Related Art] One method of detecting a speech section in a noisy environment uses energy and the number of zero crossings, as disclosed, for example, in Reference 1: Yasunaga Niimi, "Speech Recognition" (Kyoritsu Shuppan). With this method, however, it is difficult to detect the speech section accurately when the SN ratio drops significantly.

[0003] To enable speech input in environments with a low SN ratio, noise suppression using microphone arrays has been studied. For example, Reference 2: "Acoustic Systems and Digital Processing" (edited by the Institute of Electronics, Information and Communication Engineers) discloses a method of improving the SN ratio with an adaptive microphone array that uses a small number of microphones. However, in an environment where many noise sources exist and their directions cannot be identified, this method can hardly improve the SN ratio, so it is difficult to detect the speech section accurately from the output power of the microphone array.

[0004]

[Problems to Be Solved by the Invention] As described above, a method that improves the SN ratio with a microphone array of a small number of microphones cannot be expected to improve the SN ratio in a noise environment where the direction of the noise source cannot be identified. It is therefore difficult to detect the speech section accurately from the output power of the microphone array.

[0005] The present invention has been made to solve the above problem. Its object is to provide a speech processing method and apparatus that can accurately detect the speech section of a target sound source with a small number of microphones, even in an environment where the SN ratio is low and the direction of the noise source cannot be identified. Another object of the present invention is to provide a speech processing method and apparatus that can reliably suppress noise and enhance only the speech.

[0006]

[Means for Solving the Problems] To solve the above problems, the basic feature of the present invention is as follows. Speech signals input through a plurality of channels are subjected to beamformer processing, that is, digital arithmetic processing in which a beamformer suppresses the signal arriving from the target sound source. The direction of the target sound source is estimated from the filter coefficients obtained by this beamformer processing, and the speech section of the speech signal is determined based on this direction.

[0007] In an environment where the direction of the noise source cannot be identified, it is difficult for a beamformer to improve the SN ratio of the target sound source. However, because speech from the target sound source is directional, the direction of the target sound source can be estimated from the beamformer's filter coefficients during a speech section, and the speech section can be detected based on this estimated direction.

[0008] In addition to the first beamformer, which performs beamformer processing to suppress the signal arriving from the target sound source, the present invention provides a second beamformer, which performs beamformer processing to suppress the signal arriving from a noise source and output the signal from the target sound source. The direction of the noise source is estimated from the filter coefficients obtained by the second beamformer. The second beamformer is controlled based on the direction of the target sound source and the output powers of the first and second beamformers, and the first beamformer is controlled based on the direction of the noise source and the output powers of the first and second beamformers.

[0009] With this arrangement, even when a directional noise source is present, the input direction of the first beamformer follows the direction of the noise source, so the direction of the target sound source can be estimated with high accuracy and the speech section can be detected more reliably.

[0010] The speech section may be determined using the power of the speech signal in addition to the estimated direction of the target sound source. The present invention is further characterized in that noise in the output of the second beamformer is suppressed to enhance the speech, using at least one of the output of the first beamformer and the estimated direction of the target sound source.

[0011] That is, in an environment where the noise sources are so numerous that their directions cannot be identified, the noise suppression performance of a beamformer degrades. However, because the speech signal is directional, the first beamformer, whose target direction is set to the direction of the noise source, can extract a noise-only output in which the target signal is suppressed. Using this output, speech enhancement can be applied to the output of the second beamformer by spectral subtraction.
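As a rough illustration of the spectral subtraction step, the following Python sketch subtracts a noise power spectrum (taken from the first beamformer's output) from the power spectrum of the second beamformer's output, bin by bin. The function name, the over-subtraction factor `alpha`, and the spectral floor are assumptions made for illustration; the text only names the technique and gives no parameters.

```python
def spectral_subtract(speech_power, noise_power, alpha=1.0, floor=0.01):
    """Power spectral subtraction for one frame (illustrative sketch).

    speech_power: power spectrum of the second beamformer's output
    noise_power:  power spectrum of the first beamformer's output,
                  used here as the noise estimate
    alpha, floor: hypothetical tuning parameters, not taken from the text
    """
    if len(speech_power) != len(noise_power):
        raise ValueError("spectra must have the same number of bins")
    enhanced = []
    for s, n in zip(speech_power, noise_power):
        # Subtract the scaled noise estimate, but never drop below a small
        # fraction of the original bin, so the result stays non-negative.
        enhanced.append(max(s - alpha * n, floor * s))
    return enhanced
```

The floor is one common way to avoid negative power values after subtraction; other choices are possible and the text does not prescribe one.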

[0012] When the directions of the target sound source and the noise source are fixed and known, estimating the target sound source direction and controlling the first and second beamformers are unnecessary. It suffices to point the first beamformer toward the strongest noise source and the second beamformer toward the target sound source. In this case, speech enhancement can be applied to the output of the second beamformer based on the output of the first beamformer.

[0013] Furthermore, in the present invention the speech section can also be detected using the target sound source direction estimated as described above together with the speech-enhanced signal, which further improves speech section detection performance.

[0014]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

(First Embodiment) This embodiment describes a speech processing apparatus that estimates the direction of a target sound source from speech signals input through a plurality of channels and detects the speech section.

[0015] As shown in FIG. 1, the speech processing apparatus according to this embodiment comprises: a speech input unit 10 that receives speech signals from input terminals 10-1 to 10-n through a plurality of (n) channels ch1 to chn; a beamformer 20 that applies beamformer processing to these speech signals to suppress the signal arriving from the target sound source; a target sound source direction estimating unit 30 that estimates the direction of the target sound source from the filter coefficients obtained from the beamformer 20; and a speech/non-speech determining unit 40 that decides whether the speech signal is speech or non-speech based on the time series of the estimated target sound source direction and on either or both of the time series of the power of the signal obtained from the speech input unit 10 and the time series of the inter-channel correlation of that signal.
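The two auxiliary time series used by the speech/non-speech determining unit 40, signal power and inter-channel correlation, could be computed per frame along the following lines. This is a minimal Python sketch: the non-overlapping framing, the zero-lag normalized correlation, and the function names are assumptions, since the text does not define how either quantity is calculated.

```python
import math


def frame_power(x, frame_len):
    """Mean power of each non-overlapping frame of x."""
    return [
        sum(v * v for v in x[i:i + frame_len]) / frame_len
        for i in range(0, len(x) - frame_len + 1, frame_len)
    ]


def frame_correlation(x0, x1, frame_len):
    """Normalized zero-lag correlation between two channels, per frame."""
    values = []
    usable = min(len(x0), len(x1))
    for i in range(0, usable - frame_len + 1, frame_len):
        a = x0[i:i + frame_len]
        b = x1[i:i + frame_len]
        num = sum(p * q for p, q in zip(a, b))
        den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
        # A silent frame has no defined correlation; report 0.0 for it.
        values.append(num / den if den > 0.0 else 0.0)
    return values
```

Each function returns one value per frame, so the results line up with a per-frame sequence of direction estimates.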

[0016] For simplicity, the case where the number of channels n is 2 is described here as an example. The beamformer 20 applies to the signal from the speech input unit 10 a filter operation, called adaptive beamformer processing, that suppresses the target sound source. Various processing methods are known for the inside of the beamformer 20, such as the generalized sidelobe canceller (GSC), the Frost beamformer, and the reference signal method, as disclosed for example in Reference 2 above and in Reference 3: Haykin, "Adaptive Filter Theory" (Prentice Hall). This embodiment is applicable to any adaptive beamformer; a two-channel GSC is used as the example here.

[0017] FIG. 2 shows, as an example of the beamformer 20, the configuration of a Griffiths-Jim type GSC, which is common among two-channel GSCs. As described for example in Reference 2, this GSC consists of a subtractor 21, an adder 22, a delay unit 23, an adaptive filter 24, and a subtractor 25. Various adaptive filters such as LMS, RLS, and projection-type LMS can be used as the adaptive filter 24, with a filter length La of, for example, La = 50. The delay of the delay unit 23 is set to, for example, La/2.

[0018] When an LMS adaptive filter is used as the adaptive filter 24 of the two-channel Griffiths-Jim type GSC of FIG. 2 that constitutes the beamformer 20, the filter update is expressed by the following equations, where n is the time, W(n) is the coefficient vector of the adaptive filter 24, x_i(n) is the input signal of the i-th channel, and X_i(n) = (x_i(n), x_i(n−1), …, x_i(n−La+1)) is the input signal vector of the i-th channel.

[0019]

y(n) = x_0(n) + x_1(n)   (1)
X′(n) = X_1(n) − X_0(n)   (2)
e(n) = y(n) − W(n) X′(n)   (3)
W(n+1) = W(n) − μ X′(n) e(n)   (4)

The input direction of the GSC of FIG. 2 is set to a direction other than that of the target sound source, for example 90° relative to the target sound source direction. Here, a delay is applied to the two-channel signals so that a signal from the set input direction equivalently arrives at the array simultaneously. To this end, a delay unit 26 is inserted on the channel 1 side of the beamformer 20 of FIG. 2, as shown in FIG. 3. When the input direction is 90°, the delay time of the delay unit 26 is τ = d/c, where c is the speed of sound and d is the distance between the microphones.
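The update in equations (1) to (4) and the steering delay τ can be tried out with a short Python sketch. It transcribes the equations as written, including the minus sign in equation (4); with e(n) defined as in equation (3), the conventional LMS descent step would carry the opposite sign, which here corresponds to passing a negative `mu`. That observation, the zero padding before n = 0, the generalization τ = d·sin θ / c (the text gives only τ = d/c for 90°), the assumed speed of sound of 340 m/s, and the function names are assumptions for illustration and are not part of the original text. The delay unit 23 of FIG. 2 is left out because it does not appear in the equations.

```python
import math


def steering_delay(d, theta_deg, c=340.0):
    """Delay in seconds that aligns a plane wave arriving from theta_deg.

    For theta_deg = 90 this reduces to tau = d / c as stated in the text.
    """
    return d * math.sin(math.radians(theta_deg)) / c


def gsc_lms(x0, x1, la, mu):
    """Run equations (1)-(4) literally over two equal-length channels.

    Returns (errors, w): errors[n] is e(n), w is the final coefficient list.
    """
    w = [0.0] * la
    errors = []
    for n in range(len(x0)):
        # X'(n) = X1(n) - X0(n), newest sample first, zeros before n = 0.
        xp = [
            (x1[n - k] - x0[n - k]) if n - k >= 0 else 0.0
            for k in range(la)
        ]
        y = x0[n] + x1[n]                                  # equation (1)
        e = y - sum(wk * xk for wk, xk in zip(w, xp))      # equation (3)
        w = [wk - mu * xk * e for wk, xk in zip(w, xp)]    # equation (4)
        errors.append(e)
    return errors, w
```

The list `xp` plays the role of X′(n) from equation (2); keeping it newest-first matches the definition of X_i(n) in the preceding paragraph.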

[0020] When a signal arrives from the direction of the target sound source, the filter in the beamformer 20 has low sensitivity in that direction. The direction of the target sound source can therefore be estimated by examining the directivity, that is, the direction dependence of the sensitivity, from the filter coefficients of this filter.

[0021] FIG. 4 shows the procedure by which the target sound source direction estimating unit 30 estimates the direction of the target sound source. First, as initial settings, the search range θr of the target direction, the filter length L, the FFT length (number of FFT points) N, the number of channels M, and so on are set (step S101), for example θr = 20°, L = 50, N = 64, and M = 2. Because the beamformer searches only the range of arrival directions of the signal from the target sound source, the search angle range is set to, for example, ±θr relative to the direction of the target sound source.

[0022] Next, if the beamformer is a GSC, its filter coefficients are converted into a form equivalent to a transversal beamformer (step S102). For example, for a two-channel Griffiths-Jim type GSC, let the coefficients of the adaptive filter in the GSC be

w_g = (w_0, w_1, w_2, …, w_{L−2}, w_{L−1}).

The coefficients of the equivalent filter for the first channel ch1 are then

w_e1 = (−w_0, −w_1, −w_2, …, −w_{L/2} + 1, …, −w_{L−2}, −w_{L−1}),

and those for the second channel ch2 are

w_e2 = (w_0, w_1, w_2, …, w_{L/2} − 1, …, w_{L−2}, w_{L−1}).

[0023] Next, the FFT of the filter coefficients is computed for each channel to obtain the frequency components W_ei(k) (step S103), where k is the index of the frequency component and i is the channel number.

[0024] Next, taking θ as one direction within the search range, a direction vector S(k, θ) is generated that represents the propagation phase delay of each channel for a signal arriving from direction θ (step S104). For the microphone arrangement shown in FIG. 5, for example, with the first channel ch1 as the reference, the direction vector is

S(k, θ) = (1, exp(−j (k/N) f_s d sin(θ))),

where f_s is the sampling frequency and d is the distance between the microphones.

[0025] Next, the squared absolute value |S·W|² of the inner product of the filter's frequency components obtained by the FFT, W_e = (W_e1(k), W_e2(k)), and the direction vector S(k, θ) is computed (step S105).

[0026] The processing of steps S103 to S106 is performed for all frequencies, that is, from k = 1 to k = N/2. For each direction θ, the squared inner products are summed over the frequency index k to obtain the sensitivity for that direction over the whole band,

D(θ) = Σ_k |W(k)·S(k, θ)|²

(step S106). The direction is varied in steps of, for example, 1° so that all directions in the search range are examined (step S107). Finally, the direction θmin at which the sensitivity D(θ) is smallest is found and taken as the arrival direction of the signal (the signal from the target sound source or from a noise source) (step S108).
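Steps S103 to S108 can be sketched in Python as a scan over candidate directions. The sketch assumes a physically scaled phase term, 2π·(k/N)·f_s·(d/c)·sin θ, for the second element of the direction vector; the factor 2π and the speed of sound c (taken as 340 m/s) do not appear in the formula as printed in this document and are assumptions, as are the function names, the unconjugated inner product, and the hand-written DFT used in place of an FFT library.

```python
import cmath
import math


def dft(coeffs, n_fft):
    """Zero-pad coeffs to n_fft points and return their DFT (step S103)."""
    padded = list(coeffs) + [0.0] * (n_fft - len(coeffs))
    return [
        sum(padded[t] * cmath.exp(-2j * math.pi * k * t / n_fft)
            for t in range(n_fft))
        for k in range(n_fft)
    ]


def estimate_direction(w_e1, w_e2, fs, d, n_fft=64, theta_r=20, step=1,
                       c=340.0):
    """Return the angle in degrees at which the filter pair is least sensitive."""
    w1 = dft(w_e1, n_fft)
    w2 = dft(w_e2, n_fft)
    best_theta, best_sens = None, math.inf
    for theta in range(-theta_r, theta_r + 1, step):        # step S107
        sin_t = math.sin(math.radians(theta))
        sens = 0.0
        for k in range(1, n_fft // 2 + 1):
            # Direction vector S(k, theta) = (1, exp(-j * phase)), step S104.
            phase = 2.0 * math.pi * (k / n_fft) * fs * (d / c) * sin_t
            inner = w1[k] + w2[k] * cmath.exp(-1j * phase)   # step S105
            sens += abs(inner) ** 2                          # step S106
        if sens < best_sens:                                 # step S108
            best_theta, best_sens = theta, sens
    return best_theta
```

A simple way to exercise the sketch is a two-tap filter pair that delays channel 1 by one sample and subtracts channel 2: it cancels any signal that reaches channel 2 exactly one sample late, so the scan should return the angle whose inter-microphone delay equals one sample.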

[0027] Next, the processing of the speech/non-speech determining unit 40 is described. The speech/non-speech determining unit 40 decides between speech and non-speech based on either or both of the time series of the target sound source direction estimated by the target sound source direction estimating unit 30 and the time series of the input signal power. The time series of the correlation value between the two channels can also be used.

[0028] Speech/non-speech can be determined, for example, by either of the following two methods:
(1) a method using the temporal variation of the target sound source direction;
(2) a method using the temporal variation of the target sound source direction and the power of the input signal.
The temporal variation of the direction is used, rather than the direction itself, for the following reason. When no signal is arriving from the target sound source, the input contains no directional signal and the estimated direction takes random values; when a signal is arriving from the target sound source, the estimated direction stays within a certain range. Speech can therefore be detected by regarding the input as speech when the temporal variation of the target sound source direction is within a certain range.

[0029] First, the speech/non-speech determination procedure of method (1) is described with reference to FIG. 6. FIG. 6 is a state transition diagram showing the flow of the speech/non-speech determination, with the non-speech state as the starting point. Let Δθ(n) = |θ(n) − θ(n−1)| be the temporal variation of the target sound source direction at time n, and let θth (for example θth = 5°) be the maximum temporal variation of θ(n) allowed for a speech fragment. When Δθ(n) ≤ θth, that time is taken as the tentative start of speech, and the state changes to the tentative speech state, which indicates that a tentative start has been found.

[0030] In the tentative speech state, let T1 (for example T1 = 20 msec) be the minimum duration required for a speech fragment. If Δθ(n) > θth occurs within T1, the state returns to the non-speech state. Otherwise, the time at which Δθ(n) > θth occurs is taken as the tentative end of speech, and the state changes to the end-wait state, which indicates waiting for the end of speech to be confirmed.

[0031] In the end-wait state, let T2 (for example T2 = 100 msec) be the minimum duration required to judge that speech has ended. If Δθ(n) ≤ θth occurs within T2, the state changes to the tentative speech continuation state, which indicates that speech is continuing. Otherwise, the time of the last transition to the end-wait state is taken as the tentative end of speech. If the time from the tentative start to the tentative end is no longer than T3 (for example T3 = 300 msec), the minimum duration required to accept the segment as speech, the state returns to the non-speech state; otherwise, the span from the tentative start to the tentative end is taken as the speech section and the state changes to the end state.

[0032] In the tentative speech continuation state, if Δθ(n) > θth occurs within T1, the state returns to the end-wait state; otherwise, the state changes to the speech continuation state, which indicates that speech is continuing.
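The transitions of method (1), together with the speech-continuation rule stated at the start of the next paragraph, can be sketched as a small state machine in Python. Several points are one reading of the text rather than something it states: time is counted in frames of `frame_ms` milliseconds, every entry into the end-wait state restarts the T2 timer and updates the tentative end, the machine returns to the non-speech state after reporting a section instead of stopping in the end state, and nothing is flushed when the input runs out. The names are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    NON_SPEECH = auto()
    TENTATIVE_SPEECH = auto()
    END_WAIT = auto()
    TENTATIVE_CONTINUATION = auto()
    SPEECH_CONTINUATION = auto()


def detect_speech_sections(dtheta, frame_ms=10.0, theta_th=5.0,
                           t1_ms=20.0, t2_ms=100.0, t3_ms=300.0):
    """Return speech sections as (start_frame, end_frame), end exclusive."""
    state = State.NON_SPEECH
    sections = []
    start = end = entered = 0
    for n, dth in enumerate(dtheta):
        stable = dth <= theta_th
        elapsed = (n - entered) * frame_ms  # time since entering the state
        if state is State.NON_SPEECH:
            if stable:
                state, start, entered = State.TENTATIVE_SPEECH, n, n
        elif state is State.TENTATIVE_SPEECH:
            if not stable:
                if elapsed <= t1_ms:
                    state = State.NON_SPEECH          # too short a fragment
                else:
                    state, end, entered = State.END_WAIT, n, n
        elif state is State.END_WAIT:
            if stable and elapsed <= t2_ms:
                state, entered = State.TENTATIVE_CONTINUATION, n
            elif elapsed >= t2_ms:
                # T2 expired: accept the section only if longer than T3.
                if (end - start) * frame_ms > t3_ms:
                    sections.append((start, end))
                state = State.NON_SPEECH
        elif state is State.TENTATIVE_CONTINUATION:
            if not stable:
                state, end, entered = State.END_WAIT, n, n
            elif elapsed >= t1_ms:
                state = State.SPEECH_CONTINUATION
        else:  # State.SPEECH_CONTINUATION
            if not stable:
                state, end, entered = State.END_WAIT, n, n
    return sections
```

With the default thresholds from the text (θth = 5°, T1 = 20 msec, T2 = 100 msec, T3 = 300 msec), a run of stable direction estimates shorter than T3 is discarded, while a short unstable gap inside a longer run is bridged through the end-wait state.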

[0033] In the speech continuation state, the state changes to the end-wait state when Δθ(n) > θth.

Next, the speech/non-speech determination procedure of method (2) is described with reference to FIG. 7. Two thresholds, Pth1 and Pth2 (Pth1 > Pth2), are provided as minimum values of the input signal power required for the input to be accepted as speech. In FIG. 7, the non-speech state is again the starting point. With Δθ(n) the temporal variation of the target sound source direction at time n and θth the maximum temporal variation of θ(n) allowed for a speech fragment, when Δθ(n) ≤ θth or P(n) > Pth1, that time is taken as the tentative start of speech, and the state changes to the tentative speech state, which indicates that a tentative start has been found.

[0034] In the tentative speech state, the state returns to the non-speech state if either "Δθ(n) > θth and P(n) ≤ Pth1 occurs within T1" or "the maximum value of P(n) up to the time when Δθ(n) > θth and P(n) ≤ Pth1 is no greater than the threshold Pth" holds. Otherwise, the state changes to the end-wait state, which indicates waiting for the end of speech to be confirmed. Here, Pth is the minimum input signal power required for the input to be accepted as speech.

[0035] In the end-wait state, if Δθ(n) ≤ θth or P(n) > Pth1 occurs within T2, the state changes to the tentative speech continuation state, which indicates that speech is continuing. Otherwise, the time of the last transition to the end-wait state is taken as the tentative end of speech. If the time from the tentative start to the tentative end is no longer than T3 (for example T3 = 300 msec), the minimum duration required to accept the segment as speech, the state returns to the non-speech state; otherwise, the span from the tentative start to the tentative end is taken as the speech section and the state changes to the end state.

[0036] In the tentative speech continuation state, the state returns to the end-wait state if either "Δθ(n) > θth and P(n) ≤ Pth1 occurs within T1" or "the maximum value of P(n) up to the time when Δθ(n) > θth and P(n) ≤ Pth1 is no greater than the threshold Pth" holds. Otherwise, the state changes to the speech continuation state, which indicates that speech is continuing.

[0037] In the speech continuation state, the state changes to the end-wait state when Δθ(n) > θth and P(n) ≤ Pth1. In speech/non-speech determination method (2), within the speech section obtained by the above procedure, the portion that additionally satisfies P(n) > Pth2 is taken as the speech section. Here, Pth2 is the second threshold for P(n) mentioned above.
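The final refinement of method (2), keeping only the part of a detected section where P(n) > Pth2, might look like the following Python sketch. Treating the section as a half-open frame range and returning each contiguous run of qualifying frames as its own section are assumptions; the text only states that the portion satisfying P(n) > Pth2 becomes the speech section. The function name is hypothetical.

```python
def refine_by_power(section, power, pth2):
    """Split [start, end) into the runs of frames whose power exceeds pth2."""
    start, end = section
    runs = []
    run_start = None
    for n in range(start, end):
        if power[n] > pth2:
            if run_start is None:
                run_start = n          # a qualifying run begins here
        elif run_start is not None:
            runs.append((run_start, n))  # the run ended at the previous frame
            run_start = None
    if run_start is not None:
        runs.append((run_start, end))
    return runs
```

Here `power` holds one value per frame in the same units as `pth2`, for example decibels relative to the background noise level.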

[0038] In method (2), when the SN ratio is low, setting Pth and Pth2 to large values may make the speech section undetectable. Pth and Pth2 are therefore set to smaller values than would be used for detection by power alone. Even with small Pth and Pth2, the estimated target sound source direction is given priority, so speech detection performance is reliably improved. For example, Pth, Pth1, and Pth2 are set relative to the background noise level as Pth = 5 dB, Pth1 = 2 dB, and Pth2 = 5 dB. These values are preferably determined experimentally according to the background noise level.

[0039] According to this embodiment, the beamformer is not used to suppress noise; instead, the direction of the target sound source is obtained from the filter coefficients of the filter inside the beamformer. The speech section of the target sound source can therefore be detected accurately even in an environment where the direction of the noise source cannot be identified.

[0040] Next, other embodiments of the present invention are described. In the block diagrams used in the following embodiments, blocks with the same name basically have the same function, and their detailed description is omitted.

[0041] (Second Embodiment) This embodiment describes a case in which the input direction of the beamformer that suppresses the target sound source signal is made to follow the direction of the noise, so that the direction of the target sound source can be extracted with high accuracy even when a directional noise source is present.

[0042] To make the noise source direction set in the beamformer follow the actual noise source direction, this embodiment provides a second beamformer in addition to the first beamformer, which suppresses the signal arriving from the target sound source. The direction of the noise source is estimated from the directivity of the filter in the second beamformer, and the first beamformer is controlled based on the result.

[0043] FIG. 8 shows the configuration of a speech processing apparatus with a speech section detection function according to this embodiment. For simplicity, processing for two channels is described as the example, but the embodiment is not limited to two channels.

[0044] Speech signals input to the speech input unit 50 from input terminals 50-1 and 50-2 through channels ch1 and ch2 are fed to both the first beamformer 61 and the second beamformer 62. The target sound source direction estimating unit 63 estimates the direction of the target sound source from the filter coefficients of the filter in the first beamformer 61 and passes the result to the first control unit 64. The noise source direction estimating unit 65 estimates the direction of the noise source from the filter coefficients of the filter in the second beamformer 62 and passes the result to the second control unit 66.

[0045] The speech/non-speech determining unit 70 decides between speech and non-speech based on the time series of the target sound source direction estimated by the target sound source direction estimating unit 63 and on at least one of the time series of the power of the signal obtained from the speech input unit 50 and the time series of the inter-channel correlation of that signal. Hereinafter, the directions of the noise source and the target sound source that are set in the first and second beamformers 61 and 62 are called input directions.

[0046] The first control unit 64 controls the second beamformer 62 so that the target sound source direction estimated by the target sound source direction estimating unit 63 is set as its input direction. The second control unit 66 controls the first beamformer 61 so that the noise source direction estimated by the noise source direction estimating unit 65 is set as its input direction. The input direction of the first beamformer 61 is set to the noise source direction to prevent the first beamformer 61 from estimating the direction of the noise source, and the input direction of the second beamformer 62 is set to the target sound source direction to prevent the second beamformer 62 from estimating the direction of the target sound source.

[0047] As already described, each of the first and second beamformers 61 and 62 may be a GSC, a Frost-type beamformer, or a reference-signal-type beamformer. In this case, the filter in the first beamformer 61 has low sensitivity in the direction of the target sound source, and the filter in the second beamformer 62 has low sensitivity in the direction of the noise source. The directions of the target sound source and the noise source can therefore be estimated by examining, from the filter coefficients of each filter, its directivity, that is, the dependence of its sensitivity on direction.

[0048] As described above, the target sound source direction estimating unit 63 and the noise source direction estimating unit 65 estimate the directions of the target sound source and the noise source from the directivity of the filters in the first and second beamformers 61 and 62, and therefore perform processing according to the procedure shown in FIG. 4. The initially set search range for the target sound source arrival direction in the first beamformer 61 is 20°, and the search range for the noise arrival direction in the second beamformer 62 is, for example, 90°.
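
As a rough illustration of estimating a direction from filter directivity, the following Python sketch scans candidate angles within a search range and picks the angle at which a two-channel filter pair is least sensitive. It is not the procedure of FIG. 4: the far-field plane-wave model, the microphone spacing, the sampling rate, the frequency grid, and all function names are assumptions made for this example only.

```python
import cmath
import math

SPEED_OF_SOUND = 340.0  # m/s (assumed value)


def filter_response(coeffs, freq_hz, fs):
    """Frequency response of an FIR filter at freq_hz (fs = sampling rate)."""
    omega = 2.0 * math.pi * freq_hz / fs
    return sum(c * cmath.exp(-1j * omega * n) for n, c in enumerate(coeffs))


def sensitivity(w1, w2, theta_deg, mic_distance, fs, freqs):
    """Summed squared gain of the filter pair for a plane wave from theta_deg.

    Assumed model: channel 2 receives the channel 1 signal delayed by
    tau = d * sin(theta) / c, with 0 deg perpendicular to the microphone axis.
    """
    tau = mic_distance * math.sin(math.radians(theta_deg)) / SPEED_OF_SOUND
    total = 0.0
    for f in freqs:
        phase = cmath.exp(-2j * math.pi * f * tau)
        gain = filter_response(w1, f, fs) + filter_response(w2, f, fs) * phase
        total += abs(gain) ** 2
    return total


def estimate_null_direction(w1, w2, center_deg, search_range_deg,
                            mic_distance=0.1, fs=16000.0, step_deg=1.0):
    """Angle within center +/- search_range that has the lowest sensitivity."""
    freqs = [100.0 * k for k in range(1, 41)]  # 100 Hz to 4 kHz (assumed grid)
    steps = int(round(2 * search_range_deg / step_deg))
    candidates = [center_deg - search_range_deg + i * step_deg
                  for i in range(steps + 1)]
    return min(candidates,
               key=lambda th: sensitivity(w1, w2, th, mic_distance, fs, freqs))


# Toy filter pair with a null for a 2-sample inter-channel delay:
# output = (channel 1 delayed by 2 samples) - (channel 2).
w1 = [0.0, 0.0, 1.0]
w2 = [-1.0]
print(estimate_null_direction(w1, w2, center_deg=0.0, search_range_deg=90.0))
```

With the assumed 0.1 m spacing and 16 kHz sampling, a 2-sample delay corresponds to sin θ = 0.425, so the scan should report an angle near 25°. Narrowing `search_range_deg` restricts the candidates in the same way the search ranges of 20° and 90° do in the text.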

[0049] The control units 64 and 66 weight the estimated sound source direction by the output power of the beamformers and update the input direction while averaging it with previously estimated sound source directions. The calculation follows, for example, the formula disclosed in Japanese Patent Application No. 9-9794. This update can be controlled so that it is fast when the power of the signal from the target sound source is large and the noise power is small, and slow otherwise.

[0050] FIG. 9 shows the overall processing flow of the present embodiment, including the estimation processing described above. First, as initial settings, the range Φ allowed as the direction of the target sound source is set, the input direction θ1 of the first beamformer 61 is set to, for example, 0°, the input direction θ2 of the second beamformer 62 to, for example, 90°, the search range θr1 of the target sound source direction estimating unit 63 to, for example, 20°, and the search range θr2 of the noise source direction estimating unit 65 to, for example, 90° (step S201). The allowable range Φ is provided around the target sound source direction so that a signal arriving within a certain angular range is regarded as a signal from the target sound source. Φ is set, for example, to the same value as the search range θr1 of the first beamformer 61, that is, Φ = θr1 = 20°. As shown in FIG. 5, the direction perpendicular to the straight line connecting the two microphones is taken as the 0° reference.

[0051] Next, the input direction of the first beamformer 61 is set (step S202). Here, a delay is applied to the two-channel signals so that signals from the set input direction equivalently arrive at the array simultaneously. For this purpose, in the first beamformer 61, the delay applied to the signal of the first channel ch1 by the delay unit 26 shown in FIG. 3 is calculated as τ = d sin(θ1)/c, where c is the speed of sound and d is the distance between the microphones.
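
The delay formula can be evaluated directly. The short Python sketch below computes τ = d sin(θ)/c in seconds and in samples; the speed of sound of 340 m/s, the 0.1 m microphone spacing, the 16 kHz sampling rate, and the function names are illustrative assumptions, not values taken from the text.

```python
import math

SPEED_OF_SOUND = 340.0  # m/s (assumed; the text only names it c)


def steering_delay_seconds(theta_deg, mic_distance_m, c=SPEED_OF_SOUND):
    """tau = d * sin(theta) / c, with theta measured from broadside (0 deg)."""
    return mic_distance_m * math.sin(math.radians(theta_deg)) / c


def steering_delay_samples(theta_deg, mic_distance_m, fs_hz, c=SPEED_OF_SOUND):
    """The same delay expressed in (possibly fractional) samples."""
    return steering_delay_seconds(theta_deg, mic_distance_m, c) * fs_hz


for theta in (0.0, 30.0, 90.0):
    print(theta, steering_delay_samples(theta, mic_distance_m=0.1, fs_hz=16000.0))
```

Under these assumptions the delay is zero at 0° and reaches its maximum of d/c at 90°, so the delay unit generally has to realise a fractional-sample delay.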

[0052] Next, the processing of the first beamformer 61 is performed (step S203), and the direction of the target sound source is estimated from the obtained filter coefficients within the search range ±θr1 by the method described above (step S204). The estimated direction of the target sound source is denoted θn.

[0053] Next, it is determined whether the target sound source direction θn estimated in step S204 is in the vicinity of the noise source direction (0° ± Φ) (step S205). If it is, the processing proceeds directly to step S207.

[0054] If, on the other hand, the target sound source direction θn estimated in step S204 is not in the vicinity of the noise source direction, the input direction of the second beamformer 62 is set so that the estimated target sound source direction becomes the input direction (step S206). That is, the value of θ2 is updated by the averaging described above. As in step S202, a delay is applied to the signal of the second channel ch2 so that signals from the input direction equivalently arrive at the array simultaneously. For this purpose, in the second beamformer 62, the delay applied to the first channel ch1 by the delay unit 26 shown in FIG. 3 is calculated as τ = d sin(θ2)/c.

[0055] Next, the processing of the second beamformer 62 is performed (step S207), and the direction of the noise source is estimated within the search range ±θr2 (step S208). The processing then returns to step S202, where the input direction of the first beamformer 61 is set so that the estimated noise source direction becomes the input direction. Here too, the input direction is updated by the averaging described above. The above processing is then repeated.

[0056] The voice/non-voice determination unit 70 determines voice or non-voice by the processing procedures shown in FIGS. 6 and 7. The two methods described in the first embodiment can be used as the specific determination method; their description is not repeated here.

[0057] As described above, according to the present embodiment, two beamformers are provided, one estimating the direction of the target sound source and the other estimating the direction of the noise source. The voice section of the target sound source can therefore be detected accurately even when a directional noise source is present.

[0058] (Third Embodiment) The present embodiment describes a method that, in the two-beamformer configuration described in the second embodiment, performs voice enhancement instead of voice section detection and extracts the target voice with high accuracy. FIG. 10 shows the configuration of the present embodiment.

[0059] The voice processing apparatus shown in FIG. 10 comprises: a voice input unit 80 that inputs voice via a plurality of channels; a first beamformer 91 that filters the input voice and suppresses the signal from the target sound source; a second beamformer 92 that filters the input voice, suppresses noise, and extracts the target voice; a target sound source direction estimating unit 93 that estimates the target sound source direction from the filter coefficients of the first beamformer 91; a first control unit 94 that sets the target direction of the second beamformer 92 to the target sound source direction estimated by the target sound source direction estimating unit 93; a noise source direction estimating unit 95 that estimates the noise source direction from the filter of the second beamformer 92; a second control unit 96 that sets the target direction of the first beamformer 91 to the estimated noise source direction; and a voice enhancement unit 100 that suppresses the noise component in the output signal of the second beamformer 92 to enhance the voice.

[0060] This configuration is essentially that of the second embodiment shown in FIG. 8 with the voice/non-voice determination unit 70 replaced by the voice enhancement unit 100. The second embodiment did not use the output signal of the beamformer 91, whereas the present embodiment uses it as the noise reference signal for voice enhancement.

[0061] As described earlier, in an environment with so many noise sources that the noise source direction cannot be identified, the noise suppression performance of a beamformer degrades. The input voice, however, is directional, so a beamformer whose target direction is set to the noise direction can suppress the signal from the target sound source and output noise only. The output of the beamformer 91 is thus a noise-only signal, and it is used to enhance the voice by the well-known technique of spectral subtraction (SS). Spectral subtraction is described in detail in, for example, Reference 4: S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans., ASSP-27, No. 2, pp. 113-120, 1979.

[0062] Spectral subtraction includes 2chSS, which uses two channels, a reference noise signal and a voice signal, and 1chSS, which uses only a one-channel voice signal. The present embodiment performs voice enhancement by 2chSS using the output of the beamformer 91 as the reference noise. Normally, the noise signal for 2chSS is taken from a microphone placed at a distance from the microphone that collects the target voice, so that the target voice does not enter it. The properties of that noise signal then differ from those of the noise mixed into the target voice microphone, which lowers the accuracy of SS.

[0063] In the present embodiment, by contrast, no microphone dedicated to noise collection is used; the noise signal is extracted from the voice collection microphones. The noise properties therefore do not differ, and SS can be performed accurately. Only the 2chSS part differs from the second embodiment, and the other parts are the same, so 2chSS is described first.

[0064] 2chSS has, for example, the configuration shown in FIG. 13. The input data is divided into blocks, and the processing in the figure is performed for each block. The 2chSS shown in FIG. 13 comprises: a first FFT 101 that Fourier-transforms the noise signal; a first band power conversion unit 102 that converts the frequency components obtained by the first FFT into band powers; a noise power calculation unit 103 that averages the obtained band powers in the time direction; a second FFT 104 that Fourier-transforms the voice signal; a second band power conversion unit 105 that converts the frequency components obtained by the second FFT into band powers; a voice power calculation unit 106 that averages the obtained band powers in the time direction; a band weight calculation unit 107 that calculates a weight for each band from the obtained noise power and voice power; a weighting unit 108 that weights the frequency spectrum obtained from the voice signal by the second FFT with the per-band weights; and an inverse FFT unit 109 that applies an inverse FFT to the weighted frequency spectrum and outputs the voice.

[0065] The block length is, for example, 256 points, equal to the number of FFT points. Before the FFT, the block is windowed with, for example, a Hanning window, and the same processing is repeated while shifting by 128 points, half the block length. Finally, the waveforms obtained by the inverse FFT are added with an overlap of 128 points to undo the deformation caused by windowing, and the result is output.
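
The reason half-overlapped addition undoes the windowing can be checked numerically. The following Python sketch windows a signal in 256-point blocks shifted by 128 points and adds the blocks back together, with the FFT, weighting, and inverse FFT stages left out. It assumes the periodic form of the Hanning window, for which two copies shifted by half the block length sum exactly to 1; the text does not say which form is used, and the function names are illustrative.

```python
import math


def hann_periodic(n_points):
    """Periodic Hanning window; copies shifted by n_points / 2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n_points)
            for i in range(n_points)]


def windowed_blocks(signal, block_len, shift):
    """Cut the signal into overlapping blocks and window each block."""
    window = hann_periodic(block_len)
    return [[signal[start + i] * window[i] for i in range(block_len)]
            for start in range(0, len(signal) - block_len + 1, shift)]


def overlap_add(blocks, shift):
    """Add the blocks back together, each offset by `shift` samples."""
    block_len = len(blocks[0])
    out = [0.0] * (shift * (len(blocks) - 1) + block_len)
    for b, block in enumerate(blocks):
        for i, x in enumerate(block):
            out[b * shift + i] += x
    return out


signal = [1.0] * 1024
blocks = windowed_blocks(signal, block_len=256, shift=128)
restored = overlap_add(blocks, shift=128)
# Away from the first and last half block, the input is recovered.
print(max(abs(r - 1.0) for r in restored[128:-128]))
```

Only the first and last 128 samples, which are covered by a single window, remain attenuated.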

[0066] For conversion to band power, the frequency components are divided and grouped into 16 bands as shown, for example, in Table 1, and the sum of squares of the frequency components in each band is taken as the band power. The noise power and the voice power are calculated for each band by, for example, a first-order recursive filter as follows.

[0067]

$$p_{k,n} = a \cdot \mathit{pp}_k + (1 - a) \cdot p_{k,n-1} \tag{5}$$

$$v_{k,n} = a \cdot \mathit{vv}_k + (1 - a) \cdot v_{k,n-1} \tag{6}$$

Here, k is the band number, n is the block number, p is the averaged band power of the noise channel, pp is the band power of the noise channel in the current block, v is the averaged band power of the voice channel, vv is the band power of the voice channel in the current block, and a is a constant, for example 0.5.
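
A minimal Python sketch of the band power conversion and of the recursive averaging in equations (5) and (6) follows. Table 1 is not reproduced in this text, so the bin-to-band mapping `band_of_bin` is a hypothetical stand-in, and the zero initial value of the averaged power is likewise an assumption.

```python
def band_powers(spectrum, band_of_bin, n_bands):
    """Sum of squared magnitudes of the frequency components in each band."""
    powers = [0.0] * n_bands
    for i, x in enumerate(spectrum):
        powers[band_of_bin[i]] += abs(x) ** 2
    return powers


def smooth(previous, current, a=0.5):
    """First-order recursive average, as in equations (5) and (6)."""
    return [a * cur + (1.0 - a) * prev for prev, cur in zip(previous, current)]


# Hypothetical mapping of 8 bins to 2 bands (Table 1 is not available here).
band_of_bin = [0, 0, 0, 0, 1, 1, 1, 1]
p = [0.0, 0.0]  # averaged noise band power, assumed to start at zero
for block in ([1, 1, 1, 1, 2, 2, 2, 2], [1, 1, 1, 1, 0, 0, 0, 0]):
    pp = band_powers(block, band_of_bin, 2)  # band power of the current block
    p = smooth(p, pp)                        # updated average, equation (5)
    print(p)
```

The same `smooth` function serves the voice channel in equation (6); a larger `a` makes the average follow the current block more quickly.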

[0068] Next, the band weight calculation unit calculates the weight $w_{k,n}$ for each band from the obtained noise and voice band powers by, for example, the following equation.

$$w_{k,n} = \frac{|v_{k,n} - p_{k,n}|}{v_{k,n}} \tag{7}$$

The frequency components of the voice channel are then weighted with the per-band weights by, for example, the following equation.

$$Y_{i,n} = X_{i,n} \cdot w_{k,n} \tag{8}$$

Here, $Y_{i,n}$ is the weighted frequency component, $X_{i,n}$ is the frequency component obtained by the second FFT of the voice channel, and i is the frequency component number. The weight $w_{k,n}$ of the band k that corresponds to frequency component number i in Table 1 is used.
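
Equations (7) and (8) translate into a few lines of Python. In this sketch the small constant `eps`, which guards against division by zero when a band is silent, and the bin-to-band table `band_of_bin` are assumptions added for the example; neither appears in the text.

```python
def band_weights(voice_power, noise_power, eps=1e-12):
    """w_k = |v_k - p_k| / v_k, equation (7); eps avoids division by zero."""
    return [abs(v - p) / max(v, eps) for v, p in zip(voice_power, noise_power)]


def apply_weights(spectrum, weights, band_of_bin):
    """Y_i = X_i * w_k for the band k that contains bin i, equation (8)."""
    return [x * weights[band_of_bin[i]] for i, x in enumerate(spectrum)]


band_of_bin = [0, 0, 1, 1]  # hypothetical bin-to-band table
w = band_weights([4.0, 2.0], [1.0, 2.0])
y = apply_weights([1 + 1j, 2.0, 3.0, 4.0], w, band_of_bin)
print(w, y)
```

A band whose averaged voice power equals the averaged noise power receives weight 0 and is removed entirely, while a band with little noise passes almost unchanged. Because the weights are real, the phase of each frequency component is preserved.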

[0069]

[Table 1]

[0070] The processing flow of the voice enhancement unit based on 2chSS is described with reference to FIG. 14. First, initial settings are made, for example block length = 256, number of FFT points = 256, number of shift points = 128, and number of bands = 16 (step S301). Next, the first FFT reads the noise channel data and performs windowing and the FFT to obtain the frequency components of the noise (step S302). The second FFT then reads the voice channel data and performs windowing and the FFT to obtain the frequency components of the voice (step S303). The first band power conversion unit calculates the band powers of the noise from the noise frequency components according to the correspondence in Table 1 (step S304), and the second band power conversion unit calculates the band powers of the voice from the voice frequency components in the same way (step S305). The noise power calculation unit obtains the average noise power according to equation (5) (step S306), and the voice power calculation unit obtains the average voice power according to equation (6) (step S307). The band weight calculation unit obtains the band weights according to equation (7) (step S308). The weighting unit applies the weights obtained in step S308 to the voice frequency components according to equation (8) (step S309). Finally, the inverse FFT unit applies an inverse FFT to the frequency components weighted in step S309 to obtain a waveform, which is superimposed on the last 128 points of the waveform obtained up to the previous block and output (step S310).

[0071] Steps S302 to S310 are repeated until there is no more input. It is convenient to perform this processing block by block in synchronization with the overall processing, including the beamformer processing; in that case, the block length of the beamformer is made equal to the 128-point shift length of the voice enhancement unit.

[0072] (Fourth Embodiment) FIG. 11 shows a voice processing apparatus according to the present embodiment. In the third embodiment, two beamformers are used and controlled so that their target directions point toward the noise source direction and the target sound source direction, respectively. When the target sound source and the noise source are fixed and their directions are known, however, the target directions of the beamformers need not be controlled, so the target sound source direction estimating unit 93 and the first and second control units 94 and 96 of FIG. 10 can be omitted, as in the present embodiment. In this case, the first beamformer 121 is directed toward the strongest noise source, and the second beamformer 122 is directed toward the target sound source. The processing in this case can be implemented simply by omitting the sound source direction estimating units and the beamformer target direction control units from the second embodiment, so a detailed description is omitted.

[0073] (Fifth Embodiment) FIG. 12 shows the configuration of a voice processing apparatus having a voice enhancement function according to the present embodiment. When there is no noise source stronger than the target voice, the second beamformer, which suppresses noise, can also be omitted, as in the present embodiment. This case too only requires omitting the processing of the second beamformer, so it is easy to implement and is not described further.

[0074] (Sixth Embodiment) FIG. 15 shows the configuration of a voice processing apparatus having a voice section detection function according to the present embodiment. The second embodiment described a method of improving voice section detection performance in a noisy environment by using, for voice section detection, the target sound source direction obtained from the filter of the first beamformer, which suppresses the signal from the target sound source. The present embodiment further improves voice section detection performance by detecting the voice section using both the target sound source direction and the output of the voice enhancement processing described in the third embodiment.

[0075] As shown in FIG. 15, the present embodiment adds the voice/non-voice determination unit 70 described in the second embodiment to the configuration of the third embodiment. Its distinguishing feature is that the voice section detection processing uses the enhanced output from the voice enhancement unit 190 in place of the output of the second beamformer used in the second embodiment.

[0076] In this way, performing voice enhancement by 2chSS with the output of the first beamformer, which suppresses the signal from the target sound source, as the noise signal suppresses noise more accurately than conventional 2chSS. Detecting the voice section based on the voice enhancement output and the target sound source direction then substantially improves voice section detection performance under non-stationary noise.

[0077] The parameters used for detection in the above voice section detection are not limited to the output power of the beamformer and the target sound source direction. Parameters such as the number of zero crossings, spectral slope, LPC cepstrum, Δ-cepstrum, Δ2-cepstrum, LPC residual, autocorrelation coefficients, reflection coefficients, log area ratio, and pitch, as well as combinations of these, can also be used.

[0078]

[Effect of the Invention] As described above, according to the present invention, the voice section of the target sound source can be detected accurately, and voice enhancement can also be performed, in an environment where the S/N ratio is low and the direction of the noise source cannot be identified.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a voice processing apparatus according to a first embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of the adaptive beamformer processing unit in the first embodiment.

FIG. 3 is a block diagram showing the configuration of a beamformer in which a delay unit is inserted on the input side of one channel.

FIG. 4 is a flowchart showing the procedure of the sound source direction estimation processing in the first embodiment.

FIG. 5 is an explanatory diagram of the time delay between the signals from two microphones.

FIG. 6 is a state transition diagram showing the processing flow of the first method for determining voice/non-voice in the first embodiment.

FIG. 7 is a state transition diagram showing the processing flow of the first method for determining voice/non-voice in the first embodiment.

FIG. 8 is a block diagram showing the configuration of a voice processing apparatus according to a second embodiment of the present invention.

FIG. 9 is a flowchart showing the processing flow in the second embodiment.

FIG. 10 is a block diagram showing the configuration of a voice processing apparatus according to a third embodiment of the present invention.

FIG. 11 is a block diagram showing the configuration of a voice processing apparatus according to a fourth embodiment of the present invention.

FIG. 12 is a block diagram showing the configuration of a voice processing apparatus according to a fifth embodiment of the present invention.

FIG. 13 is a block diagram showing the configuration of a voice enhancement unit based on two-channel spectral subtraction.

FIG. 14 is a flowchart showing the processing procedure of the voice enhancement unit based on two-channel spectral subtraction.

FIG. 15 is a block diagram showing the configuration of a voice processing apparatus according to a sixth embodiment of the present invention.

[Explanation of symbols]

10-1 to 10-n: voice signal input terminals
10: voice input unit
20: beamformer processing unit
21: subtractor
22: adder
23: delay unit
24: adaptive filter
25: subtractor
26: delay unit
27: beamformer main body
30: target sound source direction estimating unit
40: voice/non-voice determination unit
50-1 to 50-n: voice signal input terminals
50: voice input unit
61: first beamformer
62: second beamformer
63: target sound source direction estimating unit
64: first control unit
65: noise source direction estimating unit
66: second control unit
70: voice/non-voice determination unit
80-1 to 80-n: voice signal input terminals
80: voice input unit
91: first beamformer processing unit
92: second beamformer processing unit
93: target sound source direction estimating unit
94: first control unit
95: noise source direction estimating unit
96: second control unit
100: voice enhancement unit
101: FFT unit
102: band power conversion unit
103: noise power calculation unit
104: FFT unit
105: band power conversion unit
106: voice power calculation unit
107: band weight calculation unit
108: weighting unit
109: inverse FFT unit
110-1 to 110-n: voice signal input terminals
110: voice input unit
121: first beamformer processing unit
122: second beamformer processing unit
130: voice enhancement unit
140-1 to 140-n: voice signal input terminals
140: voice input unit
150: first beamformer processing unit
160: voice enhancement unit
170-1 to 170-n: voice signal input terminals
170: voice input unit
181: first beamformer processing unit
182: second beamformer processing unit
183: target sound source direction estimating unit
184: first control unit
185: noise source direction estimating unit
186: second control unit
190: voice enhancement unit
200: voice/non-voice determination unit

Claims

[Claims]

1. A voice processing method comprising: a voice input step of inputting a voice signal through a plurality of channels; a beamformer processing step of applying, to the voice signal input in the voice input step, beamformer processing that suppresses a signal arriving from a target sound source; a target sound source direction estimating step of estimating the direction of the target sound source from the filter coefficients obtained in the beamformer processing step; and a voice section determining step of determining a voice section of the voice signal based on the direction of the target sound source estimated in the target sound source direction estimating step.

2. A voice processing method comprising: a voice input step of inputting a voice signal through a plurality of channels; a first beamformer processing step of applying, to the voice signal input in the voice input step, beamformer processing that suppresses a signal arriving from a target sound source; a target sound source direction estimating step of estimating the direction of the target sound source from the filter coefficients obtained in the first beamformer processing step; a second beamformer processing step of applying, to the voice signal input in the voice input step, beamformer processing that suppresses a signal arriving from a noise source and outputs the signal from the target sound source; a noise source direction estimating step of estimating the direction of the noise source from the filter coefficients obtained in the second beamformer processing step; a first control step of controlling the second beamformer processing step based on the direction of the target sound source estimated in the target sound source direction estimating step and the output powers obtained in the first and second beamformer processing steps; a second control step of controlling the first beamformer processing step based on the direction of the noise source estimated in the noise source direction estimating step and the output powers obtained in the first and second beamformer processing steps; and a voice section determining step of determining a voice section of the voice signal based on the direction of the target sound source estimated in the target sound source direction estimating step.

3. The voice processing method according to claim 1 or 2, wherein the voice section determining step determines the voice section of the audio signal based on the direction of the target sound source estimated in the target sound source direction estimating step and the power of the audio signal.
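A voice section decision that uses both the estimated direction and the signal power could be sketched as follows. The claim names only the two inputs; the per-frame rule (direction within a tolerance of an expected target direction and power above a threshold), the threshold values, and the function names are hypothetical choices made for illustration.

```python
def is_voice_frame(direction_deg, power, expected_deg=0.0,
                   tolerance_deg=20.0, power_threshold=1e-3):
    """Treat a frame as voice when the estimated target direction lies near
    the expected direction and the frame power is high enough.
    All default values are illustrative assumptions."""
    near_target = abs(direction_deg - expected_deg) <= tolerance_deg
    return near_target and power >= power_threshold


def voice_sections(directions, powers, **kwargs):
    """Group consecutive voice frames into (start, end) index pairs,
    with end exclusive."""
    sections = []
    start = None
    n = min(len(directions), len(powers))
    for i in range(n):
        if is_voice_frame(directions[i], powers[i], **kwargs):
            if start is None:
                start = i          # a new voice section begins
        elif start is not None:
            sections.append((start, i))  # the current section ends
            start = None
    if start is not None:
        sections.append((start, n))
    return sections


# Hypothetical per-frame direction estimates (degrees) and powers.
frame_directions = [50, 5, -3, 2, 60, 1]
frame_powers = [0.5, 0.5, 0.4, 0.0001, 0.5, 0.3]
sections = voice_sections(frame_directions, frame_powers)
```

With these inputs, frames 1 and 2 form one section and frame 5 another: frame 3 is rejected on power and frames 0 and 4 on direction.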

4. A voice processing device comprising: audio input means for inputting an audio signal through a plurality of channels; a beamformer that performs, on the audio signal input by the audio input means, beamformer processing that suppresses a signal arriving from a target sound source; target sound source direction estimating means for obtaining the direction of the target sound source from filter coefficients obtained by the beamformer; and voice section determining means for determining a voice section of the audio signal based on the direction of the target sound source estimated by the target sound source direction estimating means.

5. A voice processing device comprising: audio input means for inputting an audio signal through a plurality of channels; a first beamformer that performs, on the audio signal input by the audio input means, beamformer processing that suppresses a signal arriving from a target sound source; target sound source direction estimating means for estimating the direction of the target sound source from filter coefficients obtained by the first beamformer; a second beamformer that performs, on the audio signal input by the audio input means, beamformer processing that suppresses a signal arriving from a noise source and outputs a signal from the target sound source; noise source direction estimating means for estimating the direction of the noise source from filter coefficients obtained by the second beamformer; first control means for controlling the second beamformer based on the direction of the target sound source estimated by the target sound source direction estimating means and the output powers of the first and second beamformers; second control means for controlling the first beamformer based on the direction of the noise source estimated by the noise source direction estimating means and the output powers of the first and second beamformers; and voice section determining means for determining a voice section of the audio signal based on the direction of the target sound source estimated by the target sound source direction estimating means.

6. The voice processing device according to claim 4 or 5, wherein the voice section determining means determines the voice section of the audio signal based on the direction of the target sound source estimated by the target sound source direction estimating means and the power of the audio signal.

7. A voice processing method comprising: an audio input step of inputting an audio signal through a plurality of channels; a first beamformer processing step of performing, on the audio signal input in the audio input step, beamformer processing that suppresses a signal arriving from a target sound source; a target sound source direction estimating step of estimating the target sound source direction from filter coefficients obtained by the first beamformer processing; a second beamformer processing step of performing, on the audio signal input in the audio input step, beamformer processing that suppresses a signal arriving from a noise source and outputs a signal from the target sound source; a noise source direction estimating step of estimating the noise source direction from filter coefficients obtained by the second beamformer processing; a first control step of controlling the second beamformer processing step based on the target sound source direction estimated in the target sound source direction estimating step and the output powers of the first and second beamformer processing; a second control step of controlling the first beamformer processing step based on the noise source direction estimated in the noise source direction estimating step and the output powers obtained in the first and second beamformer processing steps; and a voice enhancement step of suppressing noise in the output obtained in the second beamformer processing step and enhancing voice, based on at least one of the output obtained in the first beamformer processing step and the target sound source direction.

8. The voice processing method according to claim 7, further comprising a voice section detection step of detecting a voice section of the audio signal based on the target sound source direction estimated in the target sound source direction estimating step and the audio signal in which voice has been enhanced in the voice enhancement step.

9. A voice processing method comprising: an audio input step of inputting an audio signal through a plurality of channels; a first beamformer processing step of performing, on the audio signal input in the audio input step, beamformer processing that suppresses a signal arriving from a target sound source; a second beamformer processing step of performing, on the audio signal input in the audio input step, beamformer processing that suppresses a signal arriving from a noise source and outputs a signal from the target sound source; and a voice enhancement step of suppressing noise in the output obtained in the second beamformer processing step, based on the output obtained in the first beamformer processing step, to enhance voice.

10. A voice processing method comprising: an audio input step of inputting an audio signal through a plurality of channels; a beamformer processing step of performing, on the audio signal input in the audio input step, beamformer processing that suppresses a signal arriving from a target sound source; and a voice enhancement step of suppressing noise in the audio signal input via any one of the plurality of channels, based on the output obtained in the beamformer processing step, to enhance voice.
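In the enhancement steps of claims 9 and 10, the output of the beamformer that suppresses the target sound source serves as the basis for noise suppression. The claims do not name a specific technique; the sketch below assumes magnitude-domain spectral subtraction, a widely used method, in which the target-suppressed output is treated as a per-bin noise estimate. The function name, the over-subtraction factor `alpha`, and the spectral floor `floor` are hypothetical parameters chosen for illustration.

```python
def spectral_subtract(signal_mag, noise_mag, alpha=1.0, floor=0.1):
    """Subtract a scaled noise magnitude spectrum from a signal magnitude
    spectrum, bin by bin, keeping each bin at or above floor * signal.

    signal_mag: magnitudes of the signal to enhance (for example the second
                beamformer output, or one input channel).
    noise_mag:  magnitudes of the target-suppressed beamformer output,
                used here as the noise estimate (an assumption).
    """
    if len(signal_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    enhanced = []
    for s, n in zip(signal_mag, noise_mag):
        # The floor avoids negative magnitudes where noise exceeds signal.
        enhanced.append(max(s - alpha * n, floor * s))
    return enhanced


# Hypothetical three-bin magnitude spectra.
enhanced = spectral_subtract([1.0, 0.5, 0.2], [0.2, 0.6, 0.1])
```

The enhanced magnitudes would normally be recombined with the phase of the signal being enhanced and transformed back to the time domain; that step is omitted here.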

11. A voice processing device comprising: audio input means for inputting an audio signal through a plurality of channels; a first beamformer that performs, on the audio signal input by the audio input means, beamformer processing that suppresses a signal arriving from a target sound source; target sound source direction estimating means for estimating the target sound source direction from filter coefficients obtained by the first beamformer; a second beamformer that performs, on the audio signal input by the audio input means, beamformer processing that suppresses a signal arriving from a noise source and outputs a signal from the target sound source; noise source direction estimating means for estimating the noise source direction from filter coefficients obtained by the second beamformer; first control means for controlling the processing of the second beamformer based on the target sound source direction estimated by the target sound source direction estimating means and the output powers of the first and second beamformers; second control means for controlling the processing of the first beamformer based on the noise source direction estimated by the noise source direction estimating means and the output powers of the first and second beamformers; and voice enhancement means for suppressing noise in the output of the second beamformer and enhancing voice, based on at least one of the output of the first beamformer and the target sound source direction estimated by the target sound source direction estimating means.

12. The voice processing device according to claim 11, further comprising voice section detecting means for detecting a voice section of the audio signal based on the target sound source direction estimated by the target sound source direction estimating means and the signal in which voice has been enhanced by the voice enhancement means.

13. A voice processing device comprising: audio input means for inputting an audio signal through a plurality of channels; a first beamformer that performs, on the audio signal input by the audio input means, beamformer processing that suppresses a signal arriving from a target sound source; a second beamformer that performs, on the audio signal input by the audio input means, beamformer processing that suppresses a signal arriving from a noise source and outputs a signal from the target sound source; and voice enhancement means for suppressing noise in the output of the second beamformer, based on the output of the first beamformer, to enhance voice.

14. A voice processing device comprising: audio input means for inputting an audio signal through a plurality of channels; a beamformer that performs, on the audio signal input by the audio input means, beamformer processing that suppresses a signal arriving from a target sound source; and voice enhancement means for suppressing noise in the signal input via any one of the plurality of channels, based on the output of the beamformer, and for enhancing and outputting voice.