JP4973287B2

JP4973287B2 - Sound processing apparatus and program

Info

Publication number: JP4973287B2
Application number: JP2007100756A
Authority: JP
Inventors: 健一山内
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2007-04-06
Filing date: 2007-04-06
Publication date: 2012-07-11
Anticipated expiration: 2027-04-06
Also published as: JP2008257048A

Description

The present invention relates to a technique for detecting, on the time axis, the section of a target sound in a mixture of sound arriving from an intended sound source (hereinafter “target sound”) and sound other than the target sound (hereinafter “interfering sound”).

Techniques for detecting the section of speech on the time axis from a mixture of speech and noise have been proposed. For example, Patent Document 1 discloses a technique that distinguishes speech from noise based on the shape (flatness) of the frequency spectrum of the input speech. Patent Document 2 discloses a technique that distinguishes speech from noise according to the pitch (number of zero crossings) of the input speech.
Patent Document 1: JP 2004-272052 A
Patent Document 2: Japanese Patent Laid-Open No. Hei 1-286643

The techniques of Patent Documents 1 and 2 distinguish speech from noise by the difference in their acoustic characteristics. Consequently, when other speech is mixed with the speech to be detected as the target sound, sections of the non-target speech are detected in addition to the section of the target sound. In other words, the section of the target sound alone cannot be detected with high accuracy. In view of these circumstances, one object of the present invention is to detect the section of the target sound with high accuracy even when speech other than the target sound is present.

To solve the above problem, a sound processing apparatus according to one aspect of the present invention comprises: selection means that compares, at each of a plurality of frequencies, the component values of each frame of a sound signal generated by a first sound collector with those of each frame of a sound signal generated by a second sound collector spaced apart from the first sound collector, and thereby sorts the plurality of frequencies of each frame into dominant frequencies, at which the target sound is dominant, and inferior frequencies, at which the target sound is inferior; and section detection means that determines, for each of a plurality of frames, whether the frame lies inside or outside the target sound section, based on the number of dominant frequencies (for example, the number Ns in FIGS. 3 and 7) and the number of inferior frequencies (for example, the number Ni in FIGS. 3 and 7) that the selection means has sorted for that frame.

In this configuration, the section of the target sound is detected from the number of dominant frequencies and the number of inferior frequencies obtained by comparing, frame by frame, the sound signal generated by the first sound collector with the sound signal generated by the second sound collector. The target sound section can therefore be detected with high accuracy not only when the interfering sound is noise such as environmental sound but also when it is human speech. The target sound is not limited to human speech. The characteristics (directivity) of the first and second sound collectors are immaterial to the present invention, although omnidirectional or substantially omnidirectional microphones are particularly suitable.

In a preferred aspect of the present invention, the section detection means determines that a frame for which the value obtained by subtracting the number of inferior frequencies from the number of dominant frequencies exceeds a threshold is a frame within the section of the target sound. According to this aspect, the section of the target sound is detected from the difference between the number of dominant frequencies and the number of inferior frequencies, so the amount of processing needed to detect the section can be reduced.

A sound processing apparatus according to a preferred aspect of the present invention comprises: smoothing means (for example, step SB4 in FIG. 7) that smooths, on the time axis, the changes from one of the dominant frequency and the inferior frequency to the other at each of the plurality of frequencies; and counting means (for example, steps SB5 to SB8 in FIG. 7) that counts a first change number (for example, the change number Ms_i in FIG. 7) corresponding to the number of frequencies changed from dominant to inferior by the smoothing, and a second change number (for example, the change number Mi_s in FIG. 7) corresponding to the number of frequencies changed from inferior to dominant by the smoothing. The section detection means detects the section of the target sound based on a first relative ratio (for example, the relative ratio R1 in FIG. 7), which is the ratio of the second change number to the number of inferior frequencies, and a second relative ratio (for example, the relative ratio R2 in FIG. 7), which is the ratio of the first change number to the number of dominant frequencies. According to this aspect, the section of the target sound is detected from the number of frequencies that the smoothing changed from one of dominant and inferior to the other, so erroneous selection by the selection means is compensated and the section of the target sound can be detected with high accuracy.

In a further preferred aspect, the section detection means determines that a frame for which the value obtained by subtracting the second relative ratio from the first relative ratio exceeds a threshold is a frame within the section of the target sound. According to this aspect, the section of the target sound is detected from the difference between the first relative ratio and the second relative ratio, so the amount of processing needed to detect the section can be reduced.
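By way of illustration only, the decision rule of this aspect can be sketched in Python as follows. The function name `is_target_frame_by_ratio`, the default threshold of 0.0, and the treatment of a zero denominator are assumptions made for the sketch and are not stated in the text.

```python
def is_target_frame_by_ratio(ns, ni, ms_i, mi_s, threshold=0.0):
    """Return True if the frame is judged to lie within the target sound section.

    ns, ni : numbers of dominant / inferior frequencies in the frame
    ms_i   : first change number (dominant -> inferior by smoothing)
    mi_s   : second change number (inferior -> dominant by smoothing)
    """
    # Assumption: a ratio whose denominator is zero is treated as 0.0.
    r1 = mi_s / ni if ni else 0.0  # first relative ratio R1 = Mi_s / Ni
    r2 = ms_i / ns if ns else 0.0  # second relative ratio R2 = Ms_i / Ns
    # The frame is inside the section when R1 - R2 exceeds the threshold.
    return (r1 - r2) > threshold
```

For instance, a frame with Ns = 6, Ni = 4, Ms_i = 0 and Mi_s = 2 gives R1 = 0.5 and R2 = 0, so the frame is judged to be within the section under the assumed threshold.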

The sound processing apparatus according to the present invention may be realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to each process, or by the cooperation of a program and a general-purpose arithmetic processing device such as a CPU (Central Processing Unit). A program according to the present invention causes a computer to execute: a selection process that compares, at each of a plurality of frequencies, the component values of each frame of a sound signal generated by a first sound collector with those of each frame of a sound signal generated by a second sound collector spaced apart from the first sound collector, and thereby sorts the plurality of frequencies of each frame into dominant frequencies, at which the target sound is dominant, and inferior frequencies, at which the target sound is inferior; and a section detection process that determines, for each of a plurality of frames, whether the frame lies inside or outside the target sound section, based on the number of dominant frequencies and the number of inferior frequencies sorted for that frame in the selection process. This program provides the same operations and effects as the sound processing apparatus according to the present invention. The program of the present invention may be provided to a user stored on a portable recording medium such as a CD-ROM and installed on a computer, or may be provided from a server device by distribution over a network and installed on a computer.

The present invention is also specified as a method for detecting the section of the target sound. In a sound processing method according to one aspect of the present invention, the component values at each of a plurality of frequencies are compared between each frame of a sound signal generated by a first sound collector and each frame of a sound signal generated by a second sound collector spaced apart from the first sound collector, whereby the plurality of frequencies of each frame are sorted into dominant frequencies, at which the target sound is dominant, and inferior frequencies, at which the target sound is inferior; and, for each of a plurality of frames, it is determined whether the frame lies inside or outside the target sound section, based on the number of dominant frequencies and the number of inferior frequencies sorted for that frame. This method provides the same operations and effects as the sound processing apparatus according to the present invention.

<A: Configuration and operation of sound processing apparatus>
FIG. 1 is a block diagram showing the configuration of a sound processing apparatus according to a first embodiment of the present invention. The sound processing apparatus 10 separates a target sound from an interfering sound and detects the section of the target sound on the time axis. As shown in FIG. 1, a first sound collector 31 and a second sound collector 32 are connected to the sound processing apparatus 10. Each of the first sound collector 31 and the second sound collector 32 is an omnidirectional or substantially omnidirectional microphone that picks up the ambient sound in which the target sound and the interfering sound are mixed. The first sound collector 31 generates a sound signal SA, and the second sound collector 32 generates a sound signal SB.

The first sound collector 31 and the second sound collector 32 are spaced apart from each other. The first sound collector 31 is closer than the second sound collector 32 to the sound source 51 of the target sound (for example, a speaker using the sound processing apparatus 10). The second sound collector 32 is closer than the first sound collector 31 to the sound source 52 of the interfering sound.

As shown in FIG. 1, the sound processing apparatus 10 comprises a frequency analysis unit 11, a selection unit 13, a sound source separation unit 15, a section detection unit 17, and a processing unit 19. Each element of the sound processing apparatus 10 may be realized by an arithmetic processing device such as a CPU executing a program, or by an electronic circuit such as a DSP (Digital Signal Processor) dedicated to sound processing. A configuration in which these elements are distributed over separate electronic circuits may also be adopted.

The frequency analysis unit 11 determines the power spectrum PA of the sound signal SA and the power spectrum PB of the sound signal SB. More specifically, the frequency analysis unit 11 determines the power spectrum PA by performing frequency analysis, such as a discrete Fourier transform, on each of a plurality of frames into which the sound signal SA is divided. The frequency analysis unit 11 determines the power spectrum PB of each frame of the sound signal SB in the same manner. Successive frames partially overlap on the time axis.
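By way of illustration only, the framing and frequency analysis can be sketched in Python as follows. The function names `split_frames` and `power_spectrum`, the frame length and hop size, and the use of a naive discrete Fourier transform without a window function are assumptions made for the sketch; the text does not specify them.

```python
import cmath

def split_frames(signal, frame_len, hop):
    """Split a sample sequence into frames of frame_len samples.

    A hop smaller than frame_len makes successive frames overlap.
    """
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def power_spectrum(frame):
    """Return the power |X(k)|^2 of every DFT bin of one frame.

    Naive O(n^2) DFT, adequate only for short illustrative frames.
    """
    n = len(frame)
    spectrum = []
    for k in range(n):
        x_k = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(x_k) ** 2)
    return spectrum
```

Applying `power_spectrum` to each frame of SA and of SB yields one power spectrum per frame for each signal; a practical implementation would more likely use a fast Fourier transform.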

FIG. 2 is a schematic diagram of the power spectra PA and PB. As shown in the figure, each of the power spectra PA and PB is represented by K frequency bins, each corresponding to a distinct frequency F (F1 to FK), where K is a natural number of 2 or more. One frequency bin of the power spectrum PA is data containing the power of the power spectrum PA at the frequency F corresponding to that bin (hereinafter “target frequency”).

The selection unit 13 in FIG. 1 compares the power spectra PA and PB determined by the frequency analysis unit 11 and thereby sorts all the target frequencies F corresponding to the frequency bins into frequencies fs at which the target sound is dominant (hereinafter “dominant frequencies”) and frequencies fi at which the target sound is inferior (hereinafter “inferior frequencies”). That is, as shown in FIG. 2, the selection unit 13 compares the power of the power spectra PA and PB at the same frequency for all the target frequencies F1 to FK, sorts each target frequency F at which the power of the power spectrum PA is larger as a dominant frequency fs, and sorts each target frequency F at which the power of the power spectrum PB is larger as an inferior frequency fi. The selection unit 13 then sets selection data (flags) D that indicate whether each of the target frequencies F1 to FK of one frame has been sorted as a dominant frequency fs or as an inferior frequency fi. Specifically, the selection unit 13 sets the selection data D to “1” for a target frequency F sorted as a dominant frequency fs and to “0” for a target frequency F sorted as an inferior frequency fi.
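By way of illustration only, the selection for one frame can be sketched in Python as follows. The function name `select_frequencies` is hypothetical, and classifying a tie (equal power in PA and PB) as inferior is an assumption, since the text does not address that case.

```python
def select_frequencies(pa, pb):
    """Return the selection data D for one frame.

    pa, pb : power per frequency bin of the spectra PA and PB (same length K)
    D[k] is 1 for a dominant frequency fs and 0 for an inferior frequency fi.
    """
    if len(pa) != len(pb):
        raise ValueError("PA and PB must have the same number of frequency bins")
    # Assumption: equal power is classified as inferior (0).
    return [1 if a > b else 0 for a, b in zip(pa, pb)]
```

With PA = [3, 1, 2, 5] and PB = [1, 4, 2, 0], the first and fourth bins are sorted as dominant and the others as inferior.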

The sound source separation unit 15 separates the target sound from the interfering sound based on the result of the selection by the selection unit 13. The sound source separation unit 15 of this embodiment generates, from the power spectra PA and PB, a power spectrum QA in which the target sound is emphasized and a power spectrum QB in which the interfering sound is emphasized (that is, the target sound is suppressed). More specifically, the sound source separation unit 15 generates the power spectrum QA of the target sound by arranging along the frequency axis the frequency bins of the power spectrum PA at the target frequencies F that the selection unit 13 sorted as dominant frequencies fs (D = 1), and generates the power spectrum QB of the interfering sound by arranging the frequency bins of the power spectrum PB at the target frequencies F that the selection unit 13 sorted as inferior frequencies fi (D = 0). The frequency bins of the power spectrum PA at target frequencies F sorted as inferior frequencies fi, and the frequency bins of the power spectrum PB at target frequencies F sorted as dominant frequencies fs, are discarded. A configuration in which the sound source separation unit 15 generates and outputs only one of the power spectra QA and QB may also be adopted.
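By way of illustration only, the separation for one frame can be sketched in Python as follows. The function name `separate_sources` is hypothetical, and representing a discarded bin by a power of 0.0 (so that QA and QB keep K bins each) is an assumption made for the sketch.

```python
def separate_sources(pa, pb, d):
    """Build the spectra QA (target emphasized) and QB (interference emphasized).

    pa, pb : power per frequency bin of PA and PB
    d      : selection data D (1 = dominant frequency, 0 = inferior frequency)
    """
    # Keep PA bins only at dominant frequencies; discarded bins become 0.0.
    qa = [a if flag == 1 else 0.0 for a, flag in zip(pa, d)]
    # Keep PB bins only at inferior frequencies; discarded bins become 0.0.
    qb = [b if flag == 0 else 0.0 for b, flag in zip(pb, d)]
    return qa, qb
```

Because each bin is kept in exactly one of QA and QB, the number of retained bins in QA equals Ns and the number retained in QB equals Ni.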

The section detection unit 17 in FIG. 1 detects the section on the time axis in which the target sound is present (hereinafter “target sound section”) based on the result of the selection by the selection unit 13. More specifically, the section detection unit 17 determines that a frame in which the number Ns of dominant frequencies fs (the number of frequency bins of the power spectrum QA) is relatively large compared with the number Ni of inferior frequencies fi (the number of frequency bins of the power spectrum QB) is a frame within the target sound section, and that a frame in which the number Ns is relatively small compared with the number Ni is a frame outside the target sound section. The target sound section is thus defined in units of frames.

FIG. 3 is a flowchart showing the processing performed by the section detection unit 17. The process of FIG. 3 is executed each time the selection unit 13 sorts the K target frequencies F (F1 to FK) of one frame into dominant frequencies fs and inferior frequencies fi. As shown in FIG. 3, the section detection unit 17 counts the total number Ns of target frequencies F, among the K target frequencies F, that the selection unit 13 sorted as dominant frequencies fs (the total number of selection data D set to “1”) (step SA1). The section detection unit 17 also counts the total number Ni of target frequencies F, among the K target frequencies F, sorted as inferior frequencies fi (the total number of selection data D set to “0”) (step SA2). The sum of the number Ns of dominant frequencies fs and the number Ni of inferior frequencies fi equals the total number K of target frequencies F (K = Ns + Ni).

The section detection unit 17 calculates an index value A (A = Ns − Ni) by subtracting the number Ni counted in step SA2 from the number Ns counted in step SA1 (step SA3). The section detection unit 17 then determines whether the index value A calculated in step SA3 exceeds a predetermined threshold TH (step SA4). The threshold TH is set to zero, for example.

If the result of step SA4 is affirmative (A > TH), the section detection unit 17 determines that the frame currently being processed is a frame within the target sound section (step SA5). If the result of step SA4 is negative (A ≤ TH), the section detection unit 17 determines that the current frame is a frame outside the target sound section (step SA6). The section detection unit 17 then notifies the processing unit 19 of the result of the determination in step SA5 or SA6 (step SA7).
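By way of illustration only, steps SA1 to SA6 can be sketched in Python as follows. The function name `detect_frame` is hypothetical; the default threshold of zero follows the example value given for TH, and the notification of step SA7 is represented simply by the return value.

```python
def detect_frame(d, threshold=0):
    """Return True if the frame lies within the target sound section.

    d : selection data D of one frame (1 = dominant, 0 = inferior)
    """
    ns = sum(1 for flag in d if flag == 1)  # step SA1: count dominant frequencies
    ni = sum(1 for flag in d if flag == 0)  # step SA2: count inferior frequencies
    a = ns - ni                             # step SA3: index value A = Ns - Ni
    return a > threshold                    # steps SA4 to SA6: compare A with TH
```

Note that a frame with Ns equal to Ni gives A = 0, which does not exceed a threshold of zero, so such a frame is judged to be outside the section.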

FIG. 4 is a schematic diagram showing the result of the processing by the section detection unit 17. The horizontal axis indicates time (frame number), and the vertical axis indicates the number Ns of dominant frequencies fs and the number Ni of inferior frequencies fi. In FIG. 4, the target sound sections detected by the section detection unit 17 (the hatched sections) are shown along the horizontal axis. As shown in the figure, a section in which the number Ns of dominant frequencies fs exceeds the number Ni of inferior frequencies fi is detected as a target sound section.
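By way of illustration only, the per-frame determinations can be grouped into contiguous target sound sections, as depicted along the horizontal axis of FIG. 4, with a Python sketch such as the following. The function name `frames_to_sections` and the representation of a section as a pair of first and last frame indices are assumptions made for the sketch.

```python
def frames_to_sections(in_section):
    """Group consecutive in-section frames into (first_frame, last_frame) pairs.

    in_section : list of booleans, one per frame (True = within the section)
    """
    sections = []
    start = None
    for j, flag in enumerate(in_section):
        if flag and start is None:
            start = j                        # a section begins at frame j
        elif not flag and start is not None:
            sections.append((start, j - 1))  # the section ended at frame j - 1
            start = None
    if start is not None:
        # The last section runs to the final frame.
        sections.append((start, len(in_section) - 1))
    return sections
```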

The processing unit 19 in FIG. 1 executes predetermined processing based on the power spectra QA and QB generated by the sound source separation unit 15 and the target sound section detected by the section detection unit 17. The processing unit 19 of this embodiment executes speech recognition processing. Specifically, the processing unit 19 generates a time-domain sound signal (that is, a signal representing the waveform of the target sound) by applying an inverse discrete Fourier transform to the power spectrum QA, and executes speech recognition processing on the components within the target sound section detected by the section detection unit 17.

As described above, in this embodiment the target sound section is detected from the number Ns of dominant frequencies fs and the number Ni of inferior frequencies fi obtained by comparing the sound signal SA generated by the first sound collector 31 with the sound signal SB generated by the second sound collector 32. The target sound section is therefore detected with high accuracy not only when the interfering sound is noise such as environmental sound but also when it is human speech. This makes it possible to improve the accuracy of the speech recognition processing by the processing unit 19. Because the sound source separation unit 15 emphasizes the target sound (separates the target sound from the interfering sound) in addition to the high-accuracy detection of the target sound section, the improvement in the accuracy of speech recognition processing on the target sound is particularly marked.

Furthermore, the result of the processing by the selection unit 13 is shared between the emphasis of the target sound in the sound source separation unit 15 and the detection of the target sound section in the section detection unit 17. Compared with a configuration in which the emphasis of the target sound and the detection of the target sound section are performed on separate criteria, this has the advantage of reducing the amount of processing in the sound processing apparatus 10.

<B: Second Embodiment>
Next, a sound processing apparatus 10 according to a second embodiment of the present invention will be described. The configuration and operation of the sound processing apparatus 10 according to this embodiment are the same as in the first embodiment except for the section detection unit 17. The following description therefore focuses on the operation of the section detection unit 17, and description of the other elements is omitted as appropriate.

The result of the selection unit 13 sorting a target frequency F as a dominant frequency fs or an inferior frequency fi may momentarily be the reverse of the actual relative strength of the target sound and the interfering sound, owing to noise that occurs abruptly in the sound signal SA or SB. For example, if noise in the sound signal SA or SB in the j-th frame causes the power of the power spectrum PA at one target frequency F to exceed that of the power spectrum PB, then, as shown in FIG. 5, the target frequency F is momentarily sorted as a dominant frequency fs in the j-th frame even though it is an inferior frequency fi in the frames up to the (j-1)th and from the (j+1)th onward. In other words, a target frequency F that should have been sorted as an inferior frequency fi in the j-th frame as well is erroneously sorted as a dominant frequency fs because of the noise. Similarly, if noise in the j-th frame causes the power of the power spectrum PB at one target frequency F to momentarily exceed that of the power spectrum PA, then, as shown in FIG. 6, a target frequency F that should have been sorted as a dominant frequency fs in the j-th frame as well is erroneously sorted as an inferior frequency fi. The section detection unit 17 of this embodiment detects the target sound section while compensating for such erroneous selection.

FIG. 7 is a flowchart showing the operation of the section detection unit 17 according to this embodiment. Each time the selection unit 13 has sorted the K target frequencies F1 to FK into dominant frequencies fs and inferior frequencies fi for each of the 2W+1 frames from the (j-W)th to the (j+W)th, the process of FIG. 7 is executed for the j-th frame, which lies at the center on the time axis (W is a natural number).

For the j-th frame, the section detection unit 17 counts the number Ns of dominant frequencies fs (step SB1) and the total number Ni of inferior frequencies fi (step SB2) among the K target frequencies F, in the same manner as steps SA1 and SA2 in FIG. 3. The section detection unit 17 also initializes to “1” a variable k that identifies one of the K target frequencies F1 to FK, and initializes the change numbers Ms_i and Mi_s to “0” (step SB3).

The section detection unit 17 executes the processing from step SB4 to step SB8 in order for the target frequency Fk and then adds “1” to the variable k (step SB9). The section detection unit 17 then determines whether the variable k after the addition in step SB9 exceeds the total number K of target frequencies F (step SB10). If the result of step SB10 is negative, the section detection unit 17 executes the processing from step SB4 to step SB8 for the target frequency Fk corresponding to the updated variable k; if the result of step SB10 is affirmative, it proceeds to step SB11. In other words, the processing from step SB4 to step SB8 is repeated in turn for each of the K target frequencies F1 to FK.

In step SB4, the section detection unit 17 smooths, for the target frequency Fk, the changes on the time axis from one of the dominant frequency fs and the inferior frequency fi to the other over the 2W+1 frames from the (j-W)th to the (j+W)th. More specifically, the section detection unit 17 applies median-filter smoothing to the sequence of selection data D (D(j-W) to D(j+W)) set for the target frequency Fk in each of the frames from the (j-W)th to the (j+W)th. That is, the selection data D located at the center (the (W+1)th position) when the selection data D(j-W) to D(j+W) are arranged in order of magnitude is taken as the updated selection data D(j) for the target frequency Fk in the j-th frame.

Smoothing with a median filter removes fluctuations in the selection data D that persist for up to W consecutive frames within the range of 2W+1 frames. For example, the selection data D(j) of the j-th frame in FIG. 5 changes from “1” (dominant frequency fs) to “0” (inferior frequency fi) as a result of the smoothing, as indicated by the arrow in that figure. Likewise, the selection data D(j) of the j-th frame in FIG. 6 changes from “0” (inferior frequency fi) to “1” (dominant frequency fs) as a result of the smoothing.
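By way of illustration only, the median-filter smoothing of step SB4 for one target frequency can be sketched in Python as follows. The function name `median_smooth` is hypothetical, and inferring W from the length of the window passed in is an assumption made for the sketch.

```python
def median_smooth(d_window):
    """Return the smoothed selection data D(j) for one target frequency.

    d_window : selection data D(j-W) .. D(j+W) of that frequency,
               i.e. an odd number (2W+1) of values, each 0 or 1
    """
    if len(d_window) % 2 == 0:
        raise ValueError("window must contain 2W+1 (an odd number of) frames")
    w = len(d_window) // 2
    # The median is the (W+1)th value when the data are sorted by magnitude.
    return sorted(d_window)[w]
```

With W = 2, the window [0, 0, 1, 0, 0] yields 0, so an isolated “1” in the center frame is changed to “0”, and [1, 1, 0, 1, 1] yields 1, so an isolated “0” is changed to “1”. For 0/1 data the median amounts to a majority vote over the window.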

After executing step SB4, the section detection unit 17 determines whether the selection data D(j) changed from 1 to 0 as a result of the smoothing (step SB5). If the result of step SB5 is affirmative, the section detection unit 17 adds 1 to the change count Ms_i (step SB6). If the result of step SB5 is negative, the section detection unit 17 proceeds to step SB7 without executing step SB6.

The selection data D(j) of a target frequency Fk that should have been selected as an inferior frequency fi but was erroneously selected as a dominant frequency fs because of momentary noise changes from 1 to 0 through the smoothing of step SB4, as indicated by the arrow in FIG. 5. Therefore, once steps SB4 through SB8 have been repeated K times, the change count Ms_i corresponds to the number of target frequencies Fk erroneously selected as the dominant frequency fs in one frame (the number of erroneous selections).

Next, the section detection unit 17 determines whether the selection data D(j) changed from 0 to 1 as a result of the smoothing (step SB7), and adds 1 to the change count Mi_s only if the result of step SB7 is affirmative (step SB8). The selection data D(j) of a target frequency Fk that is actually a dominant frequency fs but was erroneously selected as an inferior frequency fi changes from 0 to 1 through the smoothing, as shown in FIG. 6. Therefore, once steps SB4 through SB8 have been repeated K times, the change count Mi_s corresponds to the number of target frequencies Fk erroneously selected as the inferior frequency fi (the number of erroneous selections).
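The counting in steps SB5 through SB8 can be sketched in Python as follows. The function name `count_changes` and the two input lists (the selection data of one frame before and after smoothing, one entry per target frequency) are assumptions made for illustration.

```python
def count_changes(before, after):
    """Count smoothing-induced changes over the K target frequencies of one frame.

    before, after: lists of 0/1 selection data D(j), indexed by target
    frequency, before and after the smoothing (assumed layout).
    Returns (ms_i, mi_s):
      ms_i: frequencies changed from 1 (dominant) to 0 (inferior)
      mi_s: frequencies changed from 0 (inferior) to 1 (dominant)
    """
    ms_i = sum(1 for b, a in zip(before, after) if b == 1 and a == 0)
    mi_s = sum(1 for b, a in zip(before, after) if b == 0 and a == 1)
    return ms_i, mi_s


print(count_changes([1, 1, 0, 0, 1], [0, 1, 1, 0, 0]))  # -> (2, 1)
```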

When steps SB4 through SB8 have been repeated K times (step SB10: YES), the section detection unit 17 calculates the relative ratios R1 and R2 (step SB11). The relative ratio R1 is the ratio of the change count Mi_s counted in step SB8 to the number Ni of inferior frequencies fi counted in step SB2 (R1 = Mi_s / Ni). The relative ratio R2 is the ratio of the change count Ms_i counted in step SB6 to the number Ns of dominant frequencies fs counted in step SB1 (R2 = Ms_i / Ns). The section detection unit 17 then calculates the index value B by subtracting the relative ratio R2 from the relative ratio R1 (B = R1 - R2) (step SB12).

Next, the section detection unit 17 determines whether the index value B calculated in step SB12 exceeds a predetermined threshold TH (step SB13). The threshold TH is set to zero, for example. If the result of step SB13 is affirmative (B > TH), the section detection unit 17 determines that the frame currently being processed is a frame inside the target sound section (step SB14). If the result of step SB13 is negative (B ≦ TH), the section detection unit 17 determines that the current frame is a frame outside the target sound section (step SB15). The section detection unit 17 then notifies the processing unit 19 of the determination made in step SB14 or step SB15 and ends the processing of FIG. 7 (step SB16).
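Steps SB11 through SB15 can be sketched in Python as follows. The function names are hypothetical, and treating a relative ratio as zero when its denominator (Ni or Ns) is zero is an assumption of this sketch; the description does not state how that case is handled.

```python
def index_b(ns, ni, ms_i, mi_s):
    """Index value B = R1 - R2 with R1 = Mi_s / Ni and R2 = Ms_i / Ns."""
    # Assumption: a ratio with a zero denominator is treated as 0.
    r1 = mi_s / ni if ni else 0.0
    r2 = ms_i / ns if ns else 0.0
    return r1 - r2


def in_target_section(ns, ni, ms_i, mi_s, th=0.0):
    """True if the frame is judged inside the target sound section (B > TH)."""
    return index_b(ns, ni, ms_i, mi_s) > th


# R1 = 2/4 = 0.5 and R2 = 1/6, so B > 0 and the frame is inside the section.
print(in_target_section(ns=6, ni=4, ms_i=1, mi_s=2))  # -> True
```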

FIG. 8 is a schematic diagram showing the result of processing by the section detection unit 17 of this embodiment. The vertical axis shows the values of the relative ratios R1 and R2, and the horizontal axis shows time (number of frames). The target sound sections, in which the relative ratio R1 exceeds the relative ratio R2, are shown along the horizontal axis. The sound signals SA and SB assumed in FIG. 8 have the same waveforms as those illustrated in FIG. 4. As shown in FIG. 8, this embodiment, like the first embodiment, can detect the target sound section with high accuracy even when the interfering sound is a human voice. Moreover, because the target sound section is detected with the influence of erroneous selection taken into account, this embodiment has the advantage of detecting the target sound section more accurately than the first embodiment.

<C: Modifications>
Various modifications can be made to the embodiments described above. Specific examples of such modifications are given below. The following aspects may also be combined as appropriate.

(1) Modification 1
The method of determining the target sound section (that is, of determining whether each frame lies within the target sound section) based on the result of selection by the selection unit 13 may be changed as appropriate. For example, in the first embodiment, a configuration may be adopted in which the target sound section is detected based on the ratio between the number Ns of dominant frequencies fs and the number Ni of inferior frequencies fi. That is, in step SA3 of FIG. 3 the section detection unit 17 calculates the ratio of the number Ns to the number Ni as the index value A (A = Ns / Ni). The threshold TH in step SA4 of FIG. 3 is set to 1, for example. This configuration provides the same effect as the first embodiment.

Similarly, in the second embodiment, a configuration may be adopted in which the target sound section is detected based on the ratio between the relative ratios R1 and R2. For example, in step SB12 of FIG. 7 the section detection unit 17 calculates the ratio of the relative ratio R1 to the relative ratio R2 as the index value B (B = R1 / R2). The threshold TH in step SB13 of FIG. 7 is set to 1, for example. This configuration provides the same effect as the second embodiment. As these examples show, the section detection unit 17 in a preferred aspect of the present invention need only be means that detects the target sound section based on the number Ns of dominant frequencies fs and the number Ni of inferior frequencies fi.
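The ratio-based index values of this modification (A = Ns / Ni and B = R1 / R2, each compared with a threshold of 1) can be sketched in Python as follows. The helper name `ratio_index` is hypothetical, and the handling of a zero denominator (infinity for a positive numerator, zero otherwise) is an assumption of this sketch that the description does not cover.

```python
def ratio_index(numerator, denominator):
    """Ratio-type index value, e.g. A = Ns / Ni or B = R1 / R2."""
    if denominator == 0:
        # Assumption: a positive numerator over zero counts as "infinitely
        # large"; zero over zero is treated as 0 (frame outside the section).
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


def exceeds(index_value, th=1.0):
    """Frame is judged inside the target sound section if the index exceeds TH."""
    return index_value > th


print(exceeds(ratio_index(30, 20)))  # Ns = 30, Ni = 20 -> True
print(exceeds(ratio_index(10, 20)))  # Ns = 10, Ni = 20 -> False
```

When the denominator is positive, a ratio above 1 is equivalent to a positive difference between the same two quantities, which is why a threshold of 1 here plays the role that a threshold of 0 plays for the difference B = R1 - R2.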

(2) Modification 2
The configurations above count the numbers Ns and Ni over the entire band of the power spectra PA and PB specified by the frequency analysis unit 11. Alternatively, the numbers Ns and Ni may be counted only for the target frequencies F within a specific band of each of the power spectra PA and PB (for example, the band from 300 Hz to 4000 Hz, to which human speech mainly belongs).

A configuration may also be adopted in which the numbers Ns and Ni are counted with a separate weight for each band to which the target frequencies F belong. For example, when n1 dominant frequencies fs in one frame belong to a band B1 and n2 dominant frequencies fs belong to a band B2 different from the band B1, a weight value α1 corresponding to the band B1 and a weight value α2 corresponding to the band B2 are set separately, and the number Ns is calculated by the following equation (a).
Ns = n1 × α1 + n2 × α2 ……(a)
With this configuration, the section of a target sound belonging to a specific band can be detected with particularly high accuracy.

(3) Modification 3
The embodiments above use a fixed threshold TH, but a configuration in which the threshold TH is variably controlled may also be adopted. For example, the threshold TH may be increased as the sound volume around the first sound collector 31 and the second sound collector 32 increases, or as the number of sound collectors connected to the sound processing apparatus 10 increases.

(4) Modification 4
The embodiments above detect, as the target sound section, the frames in which the index value A of the first embodiment or the index value B of the second embodiment exceeds the threshold TH. A configuration may also be adopted in which a section wider than the run of frames in which the index value A or B exceeds the threshold TH is detected as the target sound section. For example, the start point of the target sound section may be set a predetermined number of frames before the frame at which the index value B (or the index value A) begins to exceed the threshold TH, and the end point of the target sound section may be set a predetermined number of frames after the frame at which the index value B (or the index value A) begins to fall below the threshold TH.

(5) Modification 5
In the second embodiment, the means for smoothing the variation of the selection data D over a plurality of frames is not limited to a median filter. For example, a Kalman filter may be used instead of the median filter. A configuration may also be adopted in which the moving average of the selection data D over a plurality of frames is used as the selection data D(j) of the j-th frame.

(6) Modification 6
As a method of dividing the plurality of target frequencies F into dominant frequencies fs and inferior frequencies fi, any known technique may be adopted in place of the methods exemplified in the embodiments above. For example, the technique disclosed in Japanese Patent Application Laid-Open No. 2006-197552 may be used to select the dominant frequencies fs and the inferior frequencies fi. More specifically, the first sound collector 31 and the second sound collector 32 are arranged along a direction perpendicular to the direction from which the target sound arrives. The selection unit 13 compares a power spectrum PA, obtained by frequency analysis of the difference between the sound signal SA generated by the first sound collector 31 and the sound signal SB generated by the second sound collector 32, with a power spectrum PB, obtained by frequency analysis of the difference between a delayed version of the sound signal SA and the sound signal SB. The selection unit 13 selects as a dominant frequency fs each target frequency F at which the power of the power spectrum PA is smaller than that of the power spectrum PB, and selects as an inferior frequency fi each target frequency F at which the power of the power spectrum PB is smaller than that of the power spectrum PA. This method provides the same effect.

The embodiments above compare the power at each target frequency F in the sound signals SA and SB, but the component value (acoustic feature quantity) compared in order to select the dominant frequencies fs and the inferior frequencies fi is not limited to power. For example, a configuration may be adopted in which each target frequency F is selected as a dominant frequency fs or an inferior frequency fi by comparing the phase at that target frequency F between each frame of the sound signal SA and each frame of the sound signal SB.

(7) Modification 7
The processing that the processing unit 19 executes using the detection result of the section detection unit 17 is not limited to speech recognition. For example, the processing unit 19 may execute processing that reduces the volume of the sound represented by the power spectrum QA in sections other than the target sound section (mainly the interfering sound).

The processing unit 19 may also execute processing that suppresses the interfering sound in a sound in which the target sound and the interfering sound are mixed (for example, the spectral subtraction method). In the spectral subtraction method, the power spectrum of the target sound is generated by subtracting, from the power spectrum QA (or PA) of the target sound, the product of the power spectrum QB, in which the interfering sound is dominant, and a predetermined coefficient (hereinafter referred to as the "suppression coefficient"). The processing unit 19 changes the suppression coefficient between the inside and the outside of the target sound section detected by the section detection unit 17. For example, the suppression coefficient is made larger inside the target sound section than outside it. If the interfering sound outside the target sound section is suppressed excessively, the resulting sound may be unnatural. Changing the suppression coefficient between the inside and the outside of the target sound section therefore has the advantage that the interfering sound is removed effectively within the target sound section, while outside the target sound section a natural sound is generated in which the interfering sound is suppressed only moderately.
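Spectral subtraction with a section-dependent suppression coefficient can be sketched in Python as follows. The function name, the example coefficient values (1.0 inside, 0.5 outside), and the flooring of negative results at zero are assumptions of this sketch; the description specifies only that the coefficient is larger inside the target sound section than outside it.

```python
def spectral_subtract(qa, qb, in_section, coeff_in=1.0, coeff_out=0.5):
    """Subtract the scaled interference spectrum QB from QA for one frame.

    qa, qb: power values per frequency for the frame (lists of equal length).
    in_section: True if the frame lies inside the detected target sound section.
    coeff_in / coeff_out: hypothetical suppression coefficients.
    """
    alpha = coeff_in if in_section else coeff_out
    # Assumption: power cannot be negative, so results are floored at zero.
    return [max(a - alpha * b, 0.0) for a, b in zip(qa, qb)]


qa = [4.0, 2.0, 1.0]
qb = [1.0, 1.0, 2.0]
print(spectral_subtract(qa, qb, in_section=True))   # -> [3.0, 1.0, 0.0]
print(spectral_subtract(qa, qb, in_section=False))  # -> [3.5, 1.5, 0.0]
```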

(8) Modification 8
A known method for detecting sections in which speech is present may be combined with the embodiments above. For example, a pitch detection unit that detects the pitch of the sound signal SA is added to the sound processing apparatus 10 of FIG. 1. Of the target sound sections detected by the processing of FIG. 3 or FIG. 7, the section detection unit 17 notifies the processing unit 19 of those sections in which the pitch detection unit detected a pitch, as the new target sound sections. A clear pitch can be identified for human speech and for the sound of musical instruments, whereas no clear pitch is detected for noise. This configuration therefore makes it possible to detect, with high accuracy, target sound sections that contain human speech or the sound of a musical instrument.

FIG. 1 is a block diagram showing the configuration of a sound processing apparatus according to a first embodiment of the present invention.
FIG. 2 is a conceptual diagram for explaining the selection of dominant frequencies and inferior frequencies.
FIG. 3 is a flowchart showing the operation of the section detection unit.
FIG. 4 is a conceptual diagram showing the relationship between the numbers Ns and Ni and the target sound section.
FIG. 5 is a conceptual diagram for explaining erroneous selection by the selection unit.
FIG. 6 is a conceptual diagram for explaining erroneous selection by the selection unit.
FIG. 7 is a flowchart showing the operation of the section detection unit.
FIG. 8 is a conceptual diagram showing the relationship between the relative ratios R1 and R2 and the target sound section.

Explanation of symbols

10 ... sound processing apparatus, 11 ... frequency analysis unit, 13 ... selection unit, 15 ... sound source separation unit, 17 ... section detection unit, 19 ... processing unit.

Claims

A sound processing apparatus comprising:
selection means for comparing component values at each of a plurality of frequencies between each frame of a sound signal generated by a first sound collector and each frame of a sound signal generated by a second sound collector spaced apart from the first sound collector, and for selecting each of the plurality of frequencies of each frame as either a dominant frequency at which a target sound is dominant or an inferior frequency at which the target sound is inferior; and
section detection means for determining, for each of a plurality of frames, whether the frame is a frame inside a target sound section or a frame outside the target sound section, based on the number of dominant frequencies and the number of inferior frequencies selected by the selection means for that frame.

The sound processing apparatus according to claim 1, wherein the section detection unit determines that a frame in which a value obtained by subtracting the number of inferior frequencies from the number of dominant frequencies exceeds a threshold is a frame in a section of the target sound.

The sound processing apparatus according to claim 1, further comprising:
smoothing means for smoothing, at each of the plurality of frequencies, a change on the time axis from one of the dominant frequency and the inferior frequency to the other; and
counting means for counting a first change number corresponding to the number of frequencies changed from the dominant frequency to the inferior frequency by the smoothing performed by the smoothing means, and a second change number corresponding to the number of frequencies changed from the inferior frequency to the dominant frequency by the smoothing,
wherein the section detection means detects the target sound section based on a first relative ratio, which is the ratio of the second change number to the number of inferior frequencies, and a second relative ratio, which is the ratio of the first change number to the number of dominant frequencies.

The sound processing apparatus according to claim 3, wherein the section detection unit determines that a frame in which a numerical value obtained by subtracting the second relative ratio from the first relative ratio exceeds a threshold value is a frame in the target sound section.

The sound processing apparatus according to any one of claims 1 to 4, wherein the section detection means calculates the number of dominant frequencies as a weighted sum of the numbers of dominant frequencies in each of a plurality of bands, using weight values set individually for each band, and calculates the number of inferior frequencies as a weighted sum of the numbers of inferior frequencies in each of the plurality of bands, using weight values set individually for each band.

A program that causes a computer to execute:
a selection process of comparing component values at each of a plurality of frequencies between each frame of a sound signal generated by a first sound collector and each frame of a sound signal generated by a second sound collector spaced apart from the first sound collector, and of selecting each of the plurality of frequencies of each frame as either a dominant frequency at which a target sound is dominant or an inferior frequency at which the target sound is inferior; and
a section detection process of determining, for each of a plurality of frames, whether the frame is a frame inside a target sound section or a frame outside the target sound section, based on the number of dominant frequencies and the number of inferior frequencies selected by the selection process for that frame.