JP5740914B2 - Audio output device

Info

Publication number: JP5740914B2
Application number: JP2010241588A
Authority: JP
Inventors: 一浩里吉; 好史大泉
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2010-10-28
Filing date: 2010-10-28
Publication date: 2015-07-01
Anticipated expiration: 2030-10-28
Also published as: JP2012093594A

Description

The present invention relates to an audio output device that outputs a masker sound.

Conventionally, it has been proposed to attach a loudspeaker to a partition in an office or the like and to output, as a masker sound, a sound that has little relation to the speaker's voice, thereby making the speaker's voice hard to hear for people in other nearby spaces (see, for example, Patent Document 1). Because this makes the content of the speaker's remarks difficult to understand, the speaker's privacy can be maintained.

Patent Document 1: Japanese Patent Laid-Open No. 06-175666

However, in the method of Patent Document 1 the output position of the masker sound is fixed. A listener may therefore become habituated to the masker sound and, through the so-called cocktail party effect, pick out the speaker's voice and understand what is being said.

An object of the present invention is therefore to provide an audio output device that can appropriately suppress the cocktail party effect.

The audio output device of the present invention includes a masker sound generation unit that generates a masker sound, a plurality of speakers that output the masker sound, and a localization control unit that controls the localization position of the masker sound and supplies an audio signal of the masker sound to the plurality of speakers. The localization control unit dynamically changes the localization position of the masker sound. Specifically, the localization control unit randomly changes the localization position within a predetermined range centered on a predetermined position. The localization position can be changed by changing the delay amounts of the audio signals supplied to the plurality of speakers.

The localization control unit may also change the localization position dynamically such that, taking the predetermined position as the center, the center position is chosen as the localization position with the highest probability and positions farther from the center are chosen with progressively lower probability. For example, the localization position is changed dynamically with probabilities that follow a Gaussian distribution. The closer the localization position is to the actual position of the speaker, the less the sound source position of the masker sound is separated from that of the speaker's voice, and the greater the masking effect. However, if a third party always hears the masker sound from the same direction, the third party becomes habituated to it and, through the cocktail party effect, can pick out the speaker's voice and understand what is said. Therefore, to suppress the cocktail party effect while keeping the masking effect high, it is preferable to change the sound source position dynamically and to set the probability of the localization position appearing to be high near the speaker's position and lower with increasing distance from it.

The masker sound may be any kind of sound, but it is desirable to provide a microphone that picks up the speaker's voice and to generate the masker sound based on the picked-up sound. For example, the speaker's speech is held for a predetermined time and modified on the time axis or the frequency axis so that it carries no lexical meaning (the conversation content cannot be understood). Alternatively, general-purpose speech that consists of the voices of multiple people, including men and women, and carries no lexical meaning may be output, or the frequency characteristics of this general-purpose speech, such as its formants, may be approximated to those of the speaker's voice.

In this case, the audio output device desirably includes an echo canceller that cancels, from the sound picked up by the microphone, a pseudo echo signal simulating the echo component of the masker sound that travels from the loudspeakers to the microphone, and supplies the result to the masker sound generation unit. This removes the masker sound that is output from the loudspeakers and leaks into the microphone, so that the masker sound can be generated based only on the speaker's voice.

According to the present invention, the output position of the masker sound changes dynamically, so the cocktail party effect can be appropriately suppressed.

FIG. 1 is a layout diagram showing the configuration of the masking system.
FIG. 2 is a block diagram showing the configuration of the microphone, the speaker array, and the audio processing device.
FIG. 3 is a diagram showing the virtual sound source localization method using the speaker array.
FIG. 4 is a diagram explaining the dynamic change of the virtual sound source position.
FIG. 5 is a flowchart showing the operation of the audio processing device.
FIG. 6 is a block diagram showing the configuration of the audio processing device when an echo canceller is provided.

FIG. 1 is a layout diagram showing the configuration of a masking system that includes the audio output device of the present invention. The masking system is installed at a face-to-face counter in, for example, a bank or a dispensing pharmacy, and emits toward third parties a masker sound that prevents them from understanding what the people conversing across the counter are saying.

In FIG. 1, a speaker H1 and a listener H2 face each other across the counter, and a plurality of third parties H3 are located away from the counter. The speaker H1 is, for example, a pharmacist explaining a medicine, the listener H2 is a patient listening to the explanation, and the third parties H3 are patients waiting their turn.

A microphone 1 is installed on the top surface of the counter. The microphone 1 picks up sound around the counter, mainly the voice of the speaker H1. On the side of the counter facing the third parties (the lower side of the drawing), a speaker array 2 that outputs sound toward the third parties is installed. The speaker array 2 is installed, for example, under the desk, so that the sound it outputs is hard for the listener H2 to hear.

The microphone 1 and the speaker array 2 are connected to an audio processing device 3. The microphone 1 picks up the voice of the speaker H1 and outputs it to the audio processing device 3. Based on the voice of the speaker H1 picked up by the microphone 1, the audio processing device 3 generates a masker sound for masking that voice and outputs it to the speaker array 2. In doing so, the audio processing device 3 controls the delay amount of the audio signal supplied to each speaker of the speaker array 2, thereby dynamically changing the sound source position of the masker sound as perceived by the third parties H3 (the virtual sound source position). To the third parties H3, the sound source of the masker sound therefore seems to be constantly moving, and the cocktail party effect caused by habituation can be appropriately suppressed.

A specific configuration and operation for realizing the masking system are described below. FIG. 2 is a block diagram showing the configuration of the microphone 1, the speaker array 2, and the audio processing device 3. The audio processing device 3 includes an A/D converter 51, a control unit 72, a masker sound generation unit 73, a delay processing unit 8, and D/A converters 61 to 68. The speaker array 2 includes eight speakers 21 to 28. The number of speakers in the speaker array is not limited to this example.

The A/D converter 51 receives the sound picked up by the microphone 1 and converts it into a digital audio signal. The digital audio signal produced by the A/D converter 51 is input to the masker sound generation unit 73.

Based on the speaker's voice carried by the input digital audio signal, the masker sound generation unit 73 generates a masker sound for masking that voice. The masker sound may be any kind of sound, but it preferably causes little discomfort to the third parties H3 located away from the counter. For example, the speech of the speaker H1 is held for a predetermined time and modified on the time axis or the frequency axis so that it carries no lexical meaning (the conversation content cannot be understood). Alternatively, general-purpose speech that consists of the voices of multiple people, including men and women, and carries no lexical meaning may be stored in a built-in storage unit (not shown), and either this general-purpose speech may be output as is or its frequency characteristics, such as its formants, may be approximated to those of the voice of the speaker H1. A background sound such as air-conditioning noise may also be mixed into the masker sound. When the third parties H3 hear such a masker sound together with the voice of the speaker H1, the content of the speaker H1's remarks becomes difficult to understand. The generated masker sound is output to each of the delays 81 to 88 of the delay processing unit 8.
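The description leaves the exact time-axis modification open. As one possible illustration only, the Python sketch below cuts a buffered stretch of speech into short segments, reverses each segment, and shuffles their order. The function name `scramble_speech`, the segment length, and the reverse-and-shuffle scheme are all assumptions made for this example and are not taken from the patent.

```python
import random

def scramble_speech(samples, segment_len, rng=random):
    """Return a time-scrambled copy of `samples` (hypothetical scheme).

    The buffer is cut into segments of `segment_len` samples, each segment
    is reversed in time, and the segments are shuffled into a random order.
    """
    segments = [samples[i:i + segment_len]
                for i in range(0, len(samples), segment_len)]
    segments = [seg[::-1] for seg in segments]   # reverse each segment
    rng.shuffle(segments)                        # randomize segment order
    return [s for seg in segments for s in seg]  # flatten back to one list
```

Because every original sample is reused exactly once, the output keeps the overall level and spectral character of the speaker's voice while the word order and phoneme order are destroyed.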

The delays 81 to 88 of the delay processing unit 8 are provided in correspondence with the speakers 21 to 28 of the speaker array 2, respectively, and individually change the delay amount of the audio signal supplied to each speaker. The delay amounts of the delays 81 to 88 are controlled by the control unit 72.

By controlling the delay amounts of the delays 81 to 88, the control unit 72 can set a virtual sound source at a predetermined position. FIG. 3 is a diagram showing the virtual sound source localization method using the speaker array.

As shown in the figure, the control unit 72 sets a virtual sound source V at a predetermined position (for example, the position of the speaker H1). The distance from the virtual sound source V to each speaker of the speaker array 2 differs. The masker sound is output first from the speaker closest to the virtual sound source V (speaker 21 in the figure) and then, as time passes, from speaker 22 through speaker 28 in order. As a result, the third parties H3 located away from the counter perceive the sound as if speakers existed at positions equidistant from the focal virtual sound source position (the speaker positions shown by dotted lines in the figure) and the masker sound were emitted simultaneously from these virtual speaker positions. The third parties H3 therefore perceive the masker sound as if it were emitted from the position of the speaker H1.
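The delay rule described above can be sketched as follows, assuming a two-dimensional layout in metres and a speed of sound of 343 m/s. The function name `focus_delays`, the 0.1 m speaker spacing, and the coordinates are hypothetical values chosen for illustration; the patent does not give them.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def focus_delays(speaker_positions, virtual_source):
    """Return per-speaker delays in seconds that emulate a source at virtual_source.

    The speaker nearest the virtual source gets zero delay; every other
    speaker is delayed by its extra distance divided by the speed of sound.
    """
    distances = [math.dist(p, virtual_source) for p in speaker_positions]
    nearest = min(distances)
    return [(d - nearest) / SPEED_OF_SOUND for d in distances]

# Hypothetical layout: eight speakers 0.1 m apart along the x axis, with the
# virtual source 0.5 m behind the array at the end where speaker 21 sits.
speakers = [(0.1 * i, 0.0) for i in range(8)]
delays = focus_delays(speakers, (0.0, -0.5))
```

With this layout the first speaker has zero delay and the delays grow toward the far end of the array, which matches the output order of speaker 21 through speaker 28 described for FIG. 3.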

Here, the control unit 72 dynamically changes the position of the virtual sound source V by dynamically changing the delay amounts of the masker sound signals supplied to the speakers. FIG. 4 is a diagram explaining the dynamic change of the virtual sound source position. The figure shows an example in which, as seen from the third parties H3 looking toward the speaker H1, the virtual sound source is moved from the position of a virtual sound source V1 on the right to the position of a virtual sound source V2 on the left.

The control unit 72 changes the delay amounts of the delays 81 to 88 each time a predetermined time elapses (for example, every second). For example, as in FIG. 4, to set the virtual sound source V1 located on the right as seen from the third parties H3 looking toward the speaker H1, the delay amount of the audio signal supplied to the right-hand speaker 21 is set small and the delay amount of the audio signal supplied to the left-hand speaker 28 is set large. To set the virtual sound source V2 located on the left, the delay amount for the speaker 21 is set large and the delay amount for the speaker 28 is set small. The third parties H3 then perceive the output position of the masker sound as having moved from the position of the virtual sound source V1 to that of the virtual sound source V2. Consequently, even when the same masker sound is output, the sound source position changes, and the combined sound heard together with the voice of the speaker H1 changes as well. This prevents the third parties H3 located away from the counter from becoming habituated, and the cocktail party effect can be appropriately suppressed.

In the example of the figure, the control unit 72 sets a movement region Z inside a circle centered on a center position S (which coincides with the position of the microphone 1 in this example) and randomly changes the position of the virtual sound source within the movement region Z. The virtual sound source may of course be set outside the movement region Z. However, the farther it is from the position of the speaker H1, the more easily a listener perceives the masker sound and the voice of the speaker H1 as coming from different positions, and the lower the masking effect becomes. It is therefore desirable to vary the position within a limited range near the speaker H1 while suppressing the cocktail party effect.

Furthermore, the control unit 72 may change the virtual sound source position dynamically such that the probability of setting it at the center position S is highest and the probability decreases with distance from the center position S. For example, the virtual sound source position is changed dynamically with probabilities that follow a Gaussian distribution. In the example of FIG. 4, within the movement region Z, the virtual sound source appears with higher probability at darker positions and with lower probability at lighter positions. Because the masking effect is higher when the virtual sound source is close to the speaker H1, the appearance probability is set high near the position of the speaker H1 and lower with increasing distance.
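One way to realize such a Gaussian-weighted random position is sketched below. It assumes a two-dimensional Gaussian with equal spread on both axes and redraws any sample that falls outside the circle; this redraw step, the function name `sample_virtual_source`, and the `sigma` parameter are assumptions for illustration, since the patent states only that the probabilities follow a Gaussian distribution.

```python
import math
import random

def sample_virtual_source(center, radius, sigma, rng=random):
    """Draw one virtual sound source position inside movement region Z.

    Positions follow a 2-D Gaussian centered on `center` (center position S);
    draws that land outside the circle of `radius` are rejected and redrawn.
    """
    while True:
        x = rng.gauss(center[0], sigma)
        y = rng.gauss(center[1], sigma)
        if math.hypot(x - center[0], y - center[1]) <= radius:
            return (x, y)
```

Calling this once per update interval (for example, every second) and converting the result to per-speaker delays gives positions that cluster near the center position S while still moving unpredictably.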

The center position S may be set in advance on the assumption of the microphone position or the speaker's position, or it may be set to an arbitrary position behind the speaker array (for example, about 0.5 m behind the center of the speaker array). The movement region Z may be set to an arbitrary value such as a radius of 1 m, or an operation unit (not shown) may be provided to accept manual input from a user. These values may also be set automatically according to the width of the speaker array. For example, a straight line connecting the end speakers 21 and 28 of the speaker array is taken as the longest side, and a right triangle or an equilateral triangle connecting the speaker 21, the speaker 28, and the center position S is constructed. The radius of the circle of the movement region Z is then set to the distance between the speaker 21 (or the speaker 28) and the center position S.
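The automatic setting from the array width can be illustrated with a little geometry. The sketch below assumes that the right triangle is isosceles with its right angle at S, and that "behind the array" is the side given by rotating the array direction clockwise; both choices, and the function name `region_from_array`, are assumptions for this example.

```python
import math

def region_from_array(p_first, p_last, shape="equilateral"):
    """Return (center_S, radius) of movement region Z from the two end speakers.

    shape="equilateral": S is the apex of an equilateral triangle on the array.
    shape="right": S is the right-angle apex of an isosceles right triangle
    whose hypotenuse is the line between the end speakers (assumed variant).
    """
    mx, my = (p_first[0] + p_last[0]) / 2, (p_first[1] + p_last[1]) / 2
    dx, dy = p_last[0] - p_first[0], p_last[1] - p_first[1]
    width = math.hypot(dx, dy)
    # Unit normal to the array line; which side is "behind" is an assumption.
    nx, ny = dy / width, -dx / width
    height = width * (math.sqrt(3) / 2 if shape == "equilateral" else 0.5)
    center = (mx + nx * height, my + ny * height)
    radius = math.dist(p_first, center)  # distance from speaker 21 to S
    return center, radius
```

For an array 0.7 m wide, the equilateral variant gives a radius equal to the array width, while the right-triangle variant gives the width divided by the square root of 2.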

FIG. 5 is a flowchart showing the operation of the audio processing device 3. The audio processing device 3 starts this operation at initial startup (power-on) and thereafter performs it each time a predetermined time elapses (for example, every second). First, the audio processing device 3 waits until the speaker's voice is picked up (s11). For example, it determines that the speaker's voice has been picked up when sound at or above a predetermined level, high enough to be judged as speech, is detected. When no speaker's voice is picked up and no conversation is taking place, no masker sound is needed, so generation and localization of the masker sound are put on hold. This step may be omitted, however, so that the masker sound is always generated and localized.

When the speaker's voice is picked up, the audio processing device 3 generates the masker sound with the masker sound generation unit 73 (s12). The volume of the masker sound desirably changes according to the level of the picked-up speaker's voice. When that level is low, the speaker's voice reaches the third parties H3 at a low level and the conversation is hard to follow, so the level of the masker sound can also be lowered. When that level is high, the speaker's voice reaches the third parties H3 at a high level and the conversation is easy to follow, so the level of the masker sound is preferably raised. The level of the masker sound may also be changed at the moment the virtual sound source position changes, so that the third parties H3 perceive the position of the virtual sound source as changing gradually, which reduces discomfort.
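Steps s11 and s12 can be sketched as a level gate followed by a level-tracking gain. The example below assumes frames of samples normalized to the range -1 to 1, an RMS level measure, and a detection threshold of -50 dB relative to full scale; the function names, the threshold, and the optional `offset_db` are hypothetical and not specified in the patent.

```python
import math

def rms_dbfs(frame):
    """RMS level of a frame of samples in [-1, 1], in dB relative to full scale."""
    if not frame:
        return float("-inf")
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10 * math.log10(mean_square) if mean_square > 0 else float("-inf")

def masker_gain(frame, threshold_db=-50.0, offset_db=0.0):
    """Linear masker gain for this frame, or 0.0 when no voice is detected."""
    level = rms_dbfs(frame)
    if level < threshold_db:
        return 0.0  # s11: keep waiting, no masker sound needed
    # s12: the masker level follows the voice level (plus an optional offset)
    return 10 ** ((level + offset_db) / 20)
```

A quiet frame below the threshold yields a gain of zero, and louder speech yields a proportionally louder masker sound.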

The audio processing device 3 then sets the delay amounts with the control unit 72 so that the localization position of the masker sound changes randomly (s13). For example, as shown in FIG. 4, the delay amounts of the audio signals supplied to the speakers are changed dynamically so that, within a predetermined range (the movement region Z) around the center position S (a position close to the speaker H1), the virtual sound source appears with higher probability nearer the center and with lower probability farther from it.

In this way, the audio processing device 3 dynamically changes the virtual sound source position of the masker sound. To the third parties H3, the masker sound seems to be constantly moving, and the cocktail party effect can be appropriately suppressed.

As shown in FIG. 6, the audio processing device 3 may include an echo canceller. FIG. 6 is a block diagram showing the configuration of the audio processing device 3 when an echo canceller is provided. Components in common with FIG. 1 are given the same reference signs, and their description is omitted.

The audio processing device 3 in this example includes an echo canceller 75 that receives the audio signal output from the A/D converter 51. The echo canceller 75 receives the audio signal of the masker sound from the masker sound generation unit 73, filters it with an adaptive filter that simulates the transfer characteristic of the acoustic path from the loudspeakers to the microphone, and subtracts the result from the signal input from the A/D converter 51, thereby reducing the echo component. As many echo cancellers 75 as there are speaker units in the speaker array may also be provided. Because there are as many acoustic paths (echo paths) from the loudspeakers to the microphone as there are loudspeakers, it is ideally desirable to provide an adaptive filter that estimates the echo path for each loudspeaker, filter the audio signal supplied to each loudspeaker to estimate its echo component, and subtract it.
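The patent does not name a particular adaptation algorithm. As a minimal sketch, the example below uses a normalized least-mean-squares (NLMS) filter, one common choice for adaptive echo cancellation, with a single reference signal. The function name, tap count, step size, and the synthetic echo path in the demo are all assumptions for illustration.

```python
import random

def nlms_echo_cancel(mic, reference, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `reference` from `mic` (NLMS)."""
    weights = [0.0] * taps   # estimate of the loudspeaker-to-microphone path
    history = [0.0] * taps   # most recent reference samples, newest first
    output = []
    for m, r in zip(mic, reference):
        history = [r] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = m - echo_estimate  # residual: voice plus unmodelled echo
        norm = sum(x * x for x in history) + eps
        weights = [w + mu * error * x / norm
                   for w, x in zip(weights, history)]
        output.append(error)
    return output, weights

# Demo with a synthetic echo path: the microphone hears the masker signal
# attenuated to 0.6 and delayed by two samples, with no speaker's voice present.
rng = random.Random(1)
masker = [rng.uniform(-1, 1) for _ in range(2000)]
mic = [0.0, 0.0] + [0.6 * s for s in masker[:-2]]
residual, w = nlms_echo_cancel(mic, masker)
```

In this demo the third filter weight approaches the echo gain and the residual shrinks toward zero, which corresponds to the masker sound being removed before the signal reaches the masker sound generation unit. A per-loudspeaker arrangement would run one such filter per delayed speaker feed and subtract the sum of the estimates.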

The audio processing device 3 need not be a device dedicated to the masking system of this embodiment; it can be realized with the hardware and software of an information processing device such as a general-purpose personal computer.

Although this embodiment shows an example with a single microphone that picks up the voice of the speaker H1, a plurality of microphones may be provided. A microphone array in which a plurality of microphones are arranged may also be provided. In that case, the position of the speaker H1 can be detected by detecting the phase differences between the sounds picked up by the microphones of the array, and the center position S and the movement region Z described above can be set at the detected position of the speaker H1 (or at a position close to the speaker H1).
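As a simplified illustration of locating the speaker from inter-microphone differences, the sketch below estimates the arrival-time difference between two microphones by cross-correlation and converts it to a far-field bearing. This is a time-domain stand-in for the phase-difference detection mentioned above and yields only a direction, not a full position; the function names, microphone spacing, and sample rate are assumptions.

```python
import math
import random

def estimate_lag(sig_a, sig_b, max_lag):
    """Return the lag in samples by which sig_b trails sig_a (cross-correlation)."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(sig_a[i] * sig_b[i + lag]
                    for i in range(len(sig_a))
                    if 0 <= i + lag < len(sig_b))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def bearing_from_lag(lag, sample_rate, mic_spacing, c=343.0):
    """Far-field arrival angle in radians from broadside for a two-microphone pair."""
    ratio = max(-1.0, min(1.0, lag * c / (sample_rate * mic_spacing)))
    return math.asin(ratio)

# Demo: the second microphone receives the same noise three samples later.
rng = random.Random(0)
sig_a = [rng.uniform(-1, 1) for _ in range(200)]
sig_b = [0.0] * 3 + sig_a[:-3]
lag = estimate_lag(sig_a, sig_b, max_lag=10)
angle = bearing_from_lag(lag, sample_rate=48000, mic_spacing=0.1)
```

Bearings from two or more microphone pairs could then be intersected to estimate a position at which to place the center position S.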

The position may also be identified by a method based on information other than sound, such as image recognition or a sensor.

H1 ... speaker
H2 ... listener
H3 ... third party
1 ... microphone
2 ... speaker array
3 ... audio processing device

Claims

1. An audio output device comprising:
a masker sound generation unit that generates a masker sound;
a plurality of speakers that output the masker sound; and
a localization control unit that controls a localization position of the masker sound and supplies an audio signal of the masker sound to the plurality of speakers,
wherein the localization control unit randomly changes the localization position of the masker sound within a predetermined range centered on a predetermined position.

2. The audio output device according to claim 1, wherein the localization control unit dynamically changes the localization position such that, with the predetermined position as the center, the center position is set as the localization position with the highest probability and the localization position is set with a lower probability as the distance from the center position increases.

3. The audio output device according to claim 1 or 2, further comprising a microphone that picks up the voice of a speaker,
wherein the masker sound generation unit generates the masker sound based on the sound picked up by the microphone.

4. The audio output device according to claim 3, further comprising an echo canceller that cancels, from the sound picked up by the microphone, a pseudo echo signal simulating the echo component of the masker sound that travels from the speakers to the microphone, and supplies the result to the masker sound generation unit.