JP3468389B2

JP3468389B2 - Speech recognition dialogue apparatus and speech recognition dialogue processing method

Info

Publication number: JP3468389B2
Application number: JP21224995A
Authority: JP
Inventors: 康永宮沢; 満広稲積; 浩長谷川; 伊佐央枝常; 治浦野
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1995-08-21
Filing date: 1995-08-21
Publication date: 2003-11-17
Anticipated expiration: 2015-08-21
Also published as: JPH0962287A

Abstract

PROBLEM TO BE SOLVED: To improve the speech recognition rate, which varies with differences in age and sex, and to enable conversation suited to the speaker's age and sex, by providing as a cartridge the ROM that stores the contents preset for speech recognition and response output. SOLUTION: In a device that analyzes input speech to generate speech feature data, compares it with pre-registered standard speech feature data to output word detection data, understands the meaning of the input speech, and determines and outputs the corresponding response content, a memory means that stores the contents preset for speech recognition and the contents preset for response output is provided in a cartridge 20 that is detachable from the device main body 1. Mounting the cartridge 20 on the device main body 1 connects it to a speech recognition response processing part 10 installed in the main body, and the response content for the input speech is output on the basis of the data stored in the cartridge 20.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition dialogue apparatus and a speech recognition dialogue processing method that recognize speech and perform a response or a specific operation corresponding to the recognition result.

[0002]

[Prior Art] Speech recognition devices of this kind include specific-speaker speech recognition devices, which can recognize only the speech of a specific speaker, and unspecified-speaker speech recognition devices, which can recognize the speech of unspecified speakers.

[0003] A specific-speaker speech recognition device registers the standard speech signal patterns of a particular speaker by having that speaker input each recognizable word, one word at a time, according to a predetermined procedure. After registration is complete, when the speaker utters a registered word, the device performs recognition by comparing the feature pattern obtained by analyzing the input speech with the registered feature patterns. A speech recognition toy is one example of this kind of speech recognition dialogue device. For example, the child who uses the toy registers in advance about ten words, such as "good morning", "good night", and "hello", as command words that serve as voice commands. When the speaker then says, for example, "good morning", the toy compares that speech signal with the registered "good morning" speech signal. When the two signals match, the toy outputs the electrical signal assigned to that voice command and performs a specific action on the basis of that signal.
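The register-then-compare behavior described above can be sketched roughly as follows. This is only an illustration: the text does not say how feature patterns are represented or compared, so the fixed-length feature vectors, the Euclidean distance, the `threshold` parameter, and the class name `SpecificSpeakerRecognizer` are all assumptions introduced here.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class SpecificSpeakerRecognizer:
    """Hypothetical recognizer that only knows patterns one speaker registered."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed cutoff for "the two signals match"
        self.templates = {}         # word -> feature pattern registered by the user

    def register(self, word, features):
        # Registration: the speaker inputs each word once, one word at a time.
        self.templates[word] = list(features)

    def recognize(self, features):
        # Compare the input pattern with every registered pattern and keep the closest.
        best_word, best_dist = None, float("inf")
        for word, template in self.templates.items():
            d = distance(features, template)
            if d < best_dist:
                best_word, best_dist = word, d
        # Report a word only when the closest pattern is close enough.
        return best_word if best_dist <= self.threshold else None


toy = SpecificSpeakerRecognizer(threshold=1.0)
toy.register("good morning", [1.0, 0.2, 0.7])
toy.register("good night", [0.1, 0.9, 0.3])
```

An input close to a registered pattern returns that word, while an input far from every pattern returns `None`, which mirrors the limitation that only the registered speaker, or a very similar voice, is recognized.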

[0004] Such a specific-speaker speech recognition device recognizes only the speech of the specific speaker or speech with a closely similar pattern. In addition, as initial setup, every word to be recognized must be registered one word at a time, which is extremely laborious.

[0005] In contrast, an unspecified-speaker speech recognition device uses speech uttered by a large number of speakers (for example, about 200) to create and store (register) in advance the standard speech feature data of the recognition target words described above, so that speech uttered by an unspecified speaker can be recognized for these pre-registered recognizable words.
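As a rough illustration of how speech from many speakers might be pooled into standard speech feature data, the sketch below averages per-speaker feature vectors for each word. The text only says that the data is created from the speech of many speakers; the averaging itself, the vector representation, and the function name `build_standard_features` are assumptions made for this example.

```python
def build_standard_features(utterances_by_word):
    """Average many speakers' feature vectors into one reference vector per word.

    utterances_by_word maps each word to a list of equal-length feature
    vectors, one per speaker (a hypothetical input format).
    """
    standard = {}
    for word, vectors in utterances_by_word.items():
        count = len(vectors)
        # zip(*vectors) groups the i-th component of every speaker's vector.
        standard[word] = [sum(column) / count for column in zip(*vectors)]
    return standard


# Two speakers stand in for the roughly 200 mentioned in the text.
standard = build_standard_features({
    "good morning": [[1.0, 0.0], [3.0, 2.0]],
})
```

Because the reference pattern is derived from whoever was in the speaker pool, a pool dominated by adults would yield reference patterns that sit far from a small child's speech.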

[0006]

[Problems to Be Solved by the Invention] An unspecified-speaker speech recognition device of this kind does achieve a relatively high recognition rate for standard speech, but it does not necessarily achieve a high recognition rate for nearly all speech. Speech characteristics differ greatly with age and sex, as between the voices of infants, adults, women, and men. A device may therefore achieve a very high recognition rate for an adult's utterances yet recognize almost none of an infant's utterances.

[0007] Furthermore, when this kind of speech recognition device is applied to a toy such as a stuffed toy, the dialogue content usually varies with the age and sex of the child who plays with it. For example, infants and upper-grade elementary school children, or boys and girls, generally want different dialogue content.

[0008] However, in this kind of speech recognition device the registered words that can be recognized are limited, and the responses to them are generally also limited to some degree. As a result, children usually tire of such toys within a short time, and, as noted above, the recognition rate varies for better or worse with age- and sex-related speech characteristics. The same applies not only to toys but to all electronic devices that use speech recognition.

[0009] The present invention has been made to solve these problems. The ROM portions, such as the standard speech feature data storage means that stores standard speech feature data and the response content storage means that stores response content, are provided in cartridge form, and a suitable cartridge can be selected and mounted. The object is thereby to enable recognition matched to speech characteristics that differ with age, sex, and the like, improving the recognition rate, and to enable conversation suited to various ages and sexes.

[0010]

[Means for Solving the Problems] The speech recognition dialogue apparatus of the present invention analyzes speech input from a speaker to generate speech feature data, compares this speech feature data with standard speech feature data of pre-registered recognizable words to output word detection data, receives this word detection data to understand the meaning of the input speech, and determines and outputs the corresponding response content. The apparatus is characterized as follows.

A plurality of cartridges are provided, each detachably mountable on the apparatus main body and each having storage means that stores data preset in correspondence with the speaker's age, sex, or the like for producing a response output to recognized speech. Mounting at least one cartridge selected from the plurality on the apparatus main body connects it to a speech recognition response processing unit provided on the main body side.

The storage means comprises conversation content storage means, which stores response content corresponding to recognizable words pre-registered in correspondence with the speaker's age, sex, or the like, and response data instruction content storage means, which stores instruction content specifying what kind of speech synthesis output is to be generated.

The speech recognition response processing unit comprises: state detection means; speech input means for inputting the speaker's speech; speech analysis means for analyzing the speech input by the speech input means to generate speech feature data; standard speech feature data storage means storing standard speech feature data of pre-registered recognizable words; word detection means for outputting word detection data for the input speech on the basis of the contents of the standard speech feature data storage means; speech understanding conversation control means for receiving the word detection data from the word detection means to understand the meaning of the input speech, selecting the most appropriate meaning on the basis of the contents from the state detection means and the conversation content storage means, and determining the corresponding response content; speech synthesis means for generating, for the response content determined by the speech understanding conversation control means, a speech synthesis output based on the contents of the response data instruction content storage means; and speech output means for outputting the speech synthesis output from the speech synthesis means to the outside.

[0011] In a further aspect, the storage means comprises standard speech feature data storage means storing standard speech feature data for recognizable words pre-registered in correspondence with the speaker's age, sex, or the like. The speech recognition response processing unit then comprises: speech input means for inputting the speaker's speech; speech analysis means for analyzing the speech input by the speech input means to generate speech feature data; word detection means for receiving the speech feature data from the speech analysis means and outputting word detection data for the input speech on the basis of the contents of the standard speech feature data storage means; speech understanding conversation control means for receiving the word detection data from the word detection means to understand the meaning of the input speech and determining the corresponding response content; response data instruction content storage means for generating a speech synthesis output based on the response content determined by the speech understanding conversation control means; speech synthesis means for generating, for the response content determined by the speech understanding conversation control means, a speech synthesis output based on the contents of the response data instruction content storage means; and speech output means for outputting the speech synthesis output from the speech synthesis means to the outside.

[0012] In a further aspect, a speech recognition dialogue apparatus that analyzes speech input from a speaker to generate speech feature data, compares this speech feature data with standard speech feature data of pre-registered recognizable words to output word detection data, receives this word detection data to understand the meaning of the input speech, and determines and outputs the corresponding response content is characterized as follows.

A plurality of cartridges are provided, each detachably mountable on the apparatus main body and each having storage means that stores, among other data, data preset in correspondence with the speaker's age, sex, or the like for performing speech recognition and data preset in correspondence with the speaker's age, sex, or the like for producing a response output to recognized speech. Mounting at least one cartridge selected from the plurality on the apparatus main body connects it to a speech recognition response processing unit provided on the main body side.

The storage means comprises: standard speech feature data storage means storing standard speech feature data for recognizable words pre-registered in correspondence with the speaker's age, sex, or the like; conversation content storage means storing response content corresponding to recognizable words pre-registered in correspondence with the speaker's age, sex, or the like; and response data instruction content storage means storing instruction content specifying what kind of speech synthesis output is to be generated.

The speech recognition response processing unit comprises: state detection means; speech input means for inputting the speaker's speech; speech analysis means for analyzing the speech input by the speech input means to generate speech feature data; word detection means for receiving the speech feature data from the speech analysis means and outputting word detection data for the input speech on the basis of the contents of the standard speech feature data storage means; speech understanding conversation control means for receiving the word detection data from the word detection means to understand the meaning of the input speech, selecting the most appropriate meaning on the basis of the contents from the state detection means and the conversation content storage means, and determining the corresponding response content; speech synthesis means for generating, for the response content determined by the speech understanding conversation control means, a speech synthesis output based on the contents of the response data instruction content storage means; and speech output means for outputting the speech synthesis output from the speech synthesis means to the outside.

[0013]

[0014] Further, a speech recognition dialogue method that analyzes speech input from a speaker to generate speech feature data, compares this speech feature data with standard speech feature data of pre-registered recognizable words to output word detection data, receives this word detection data to understand the meaning of the input speech, and determines and outputs the corresponding response content is characterized as follows.

A plurality of cartridges are provided, each detachably mountable on the apparatus main body and each having storage means that stores data preset in correspondence with the speaker's age, sex, or the like for producing a response output to recognized speech. Mounting at least one cartridge selected from the plurality on the apparatus main body connects it to a speech recognition response processing unit provided on the main body side.

The contents stored in the storage means consist of response content corresponding to recognizable words pre-registered in correspondence with the speaker's age, sex, or the like, and response data instruction content specifying what kind of speech synthesis output is to be generated.

The speech recognition response processing unit performs: a state detection step of detecting a state; a speech analysis step of analyzing speech input through speech input means to generate speech feature data; a word detection step of outputting word detection data for the input speech on the basis of standard speech feature data of pre-registered recognizable words; a speech understanding conversation control step of receiving the word detection data from the word detection step to understand the meaning of the input speech, selecting the most appropriate meaning on the basis of the state detected in the state detection step and the response content, and determining the corresponding response content by referring to the conversation content stored on the storage means side; a speech synthesis step of generating, for the content determined in the speech understanding conversation control step, a speech synthesis output based on the response data instruction content stored on the storage means side; and a speech output step of outputting the speech synthesis output from the speech synthesis step to the outside.

[0015] In a further aspect, a speech recognition dialogue method that analyzes speech input from a speaker to generate speech feature data, compares this speech feature data with standard speech feature data of pre-registered recognizable words to output word detection data, receives this word detection data to understand the meaning of the input speech, and determines and outputs the corresponding response content is characterized as follows.

A plurality of cartridges are provided, each detachably mountable on the apparatus main body and each having storage means that stores, among other data, data preset in correspondence with the speaker's age, sex, or the like for performing speech recognition and data preset in correspondence with the speaker's age, sex, or the like for producing a response output to recognized speech. Mounting at least one cartridge selected from the plurality on the apparatus main body connects it to a speech recognition response processing unit provided on the main body side.

The contents stored in the storage means consist of standard speech feature data for recognizable words pre-registered in correspondence with the speaker's age, sex, or the like; response content corresponding to recognizable words pre-registered in correspondence with the speaker's age, sex, or the like; and response data instruction content specifying what kind of speech synthesis output is to be generated.

The speech recognition response processing unit performs: a state detection step of detecting a state; a speech analysis step of analyzing speech input through speech input means to generate speech feature data; a word detection step of receiving the speech feature data from the speech analysis step and outputting word detection data for the input speech on the basis of the contents stored in the storage means; a speech understanding conversation control step of receiving the word detection data from the word detection step to understand the meaning of the input speech, selecting the most appropriate meaning on the basis of the state detected in the state detection step and the response content, and determining the corresponding response content by referring to the conversation content stored in the storage means; a speech synthesis step of generating, for the response content determined in the speech understanding conversation control step, a speech synthesis output based on the response data instruction content stored in the storage means; and a speech output step of outputting the speech synthesis output from the speech synthesis step to the outside.

[0016] In a further aspect, the contents stored in the storage means are standard speech feature data for recognizable words pre-registered in correspondence with the speaker's age, sex, or the like. The speech recognition response processing unit then performs: a speech analysis step of analyzing speech input through speech input means to generate speech feature data; a word detection step of receiving the speech feature data from the speech analysis step and outputting word detection data for the input speech on the basis of the contents of the standard speech feature data storage means; a speech understanding conversation control step of receiving the word detection data from the word detection step to understand the meaning of the input speech and determining the corresponding response content; a speech synthesis step of synthesizing speech, for the response content determined in the speech understanding conversation control step, on the basis of response data instruction content that specifies what kind of speech synthesis output is to be produced; and a speech output step of outputting the speech synthesis output from the speech synthesis step to the outside.

[0017] In a further aspect, the contents stored in the storage means are response content corresponding to recognizable words pre-registered in correspondence with the speaker's age, sex, or the like, and response data instruction content specifying what kind of speech synthesis output is to be generated. The speech recognition response processing unit then performs: a speech analysis step of analyzing speech input through speech input means to generate speech feature data; a word detection step of outputting word detection data for the input speech on the basis of standard speech feature data of pre-registered recognizable words; a speech understanding conversation control step of receiving the word detection data from the word detection step to understand the meaning of the input speech and determining the corresponding response content by referring to the conversation content stored on the storage means side; a speech synthesis step of generating, for the content determined in the speech understanding conversation control step, a speech synthesis output based on the response data instruction content stored on the storage means side; and a speech output step of outputting the speech synthesis output from the speech synthesis step to the outside.

[0018]

[0019] In a further aspect, the contents stored in the storage means are standard speech feature data for recognizable words pre-registered in correspondence with the speaker's age, sex, or the like; response content corresponding to recognizable words pre-registered in correspondence with the speaker's age, sex, or the like; and response data instruction content specifying what kind of speech synthesis output is to be generated. The speech recognition response processing unit then performs: a speech analysis step of analyzing speech input through speech input means to generate speech feature data; a word detection step of receiving the speech feature data from the speech analysis step and outputting word detection data for the input speech on the basis of the contents stored in the storage means; a speech understanding conversation control step of receiving the word detection data from the word detection step to understand the meaning of the input speech and determining the corresponding response content by referring to the conversation content stored in the storage means; a speech synthesis step of generating, for the response content determined in the speech understanding conversation control step, a speech synthesis output based on the response data instruction content stored in the storage means; and a speech output step of outputting the speech synthesis output from the speech synthesis step to the outside.

[0020]

[Operation] In the present invention, the contents preset for performing speech recognition, the contents preset for producing a response output to recognized speech, and the like are stored in a cartridge that is detachably mountable on the apparatus main body. When the cartridge is mounted on the main body, the response content for the input speech is output on the basis of the data stored in that cartridge. Thus, even with a single apparatus main body, changing the cartridge enables dialogue suited to various ages, sexes, and so on. A cartridge suited to the user can therefore be selected, and because the recognizable words and the response content can also be selected cartridge by cartridge, the dialogue content can be given wide variety and the apparatus can be made to perform correspondingly varied actions.
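The effect of swapping cartridges on a single main body can be pictured with a small sketch. Everything named here is hypothetical: the `Cartridge` and `MainBody` classes, the two example cartridges, and the response strings are invented for illustration, and recognition is reduced to looking up an already detected word.

```python
from dataclasses import dataclass, field


@dataclass
class Cartridge:
    """Hypothetical cartridge: a label plus the responses it carries."""
    label: str
    responses: dict = field(default_factory=dict)  # recognizable word -> response


class MainBody:
    """Hypothetical main body that holds no dialogue data of its own."""

    def __init__(self):
        self.cartridge = None

    def mount(self, cartridge):
        # Mounting replaces whatever cartridge was previously in use.
        self.cartridge = cartridge

    def respond(self, word):
        # Without a cartridge, or for a word the cartridge lacks, there is no response.
        if self.cartridge is None:
            return None
        return self.cartridge.responses.get(word)


toddler = Cartridge("toddler", {"good morning": "Good morning! Let's play!"})
older_child = Cartridge("older child", {
    "good morning": "Morning. Ready for school?",
    "homework": "Let's finish it together.",
})

toy = MainBody()
toy.mount(toddler)
```

With the toddler cartridge mounted, "homework" is not a recognizable word; after `toy.mount(older_child)` the same main body both recognizes it and answers "good morning" differently.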

[0021]

[Embodiments] Embodiments of the present invention are described below with reference to the drawings. In these embodiments the invention is applied to a toy, specifically a stuffed toy such as a dog. The examples also apply the invention to an unspecified-speaker speech recognition device, which can recognize the speech of unspecified speakers.

[0022] (First Embodiment) FIG. 1 illustrates the overall schematic configuration of the present invention. In outline, it consists of a speech recognition response processing unit 10 (described in detail later) housed inside a stuffed toy dog (the apparatus main body) 1, and a cartridge unit 20 (described in detail later) that can be detachably mounted in a predetermined part of the stuffed toy dog 1.

[0023] FIG. 2 is a block diagram illustrating the configuration of the speech recognition response processing unit 10 and the cartridge unit 20 according to the first embodiment. The first embodiment describes an example in which three ROMs, namely a standard speech feature data storage unit 21, a conversation content storage unit 22, and a response data instruction content storage unit 23, are provided on the cartridge side.

[0024] The speech recognition response processing unit 10 comprises a speech input unit 11, a speech analysis unit 12, a word detection unit 13, a speech understanding conversation control unit 14, a speech synthesis unit 15, a speech output unit 16, and so on. Of these components, the speech analysis unit 12, the word detection unit 13, the speech understanding conversation control unit 14, the speech synthesis unit 15, and so on are housed in, for example, the abdomen of the stuffed toy 1. The speech input unit (microphone) 11 is provided in, for example, an ear of the stuffed toy 1, and the speech output unit (speaker) 16 in, for example, its mouth.

[0025] The cartridge unit 20, for its part, comprises the standard voice feature data storage unit 21, the conversation content storage unit 22, and the response data instruction content storage unit 23. It can easily be attached and detached from outside at a cartridge mounting portion (not shown) provided, for example, near the abdomen of the stuffed toy 1. When the cartridge unit 20 is correctly mounted in the cartridge mounting portion, it is connected to the respective parts of the voice recognition response processing unit 10 so that signals can be exchanged. Specifically, the standard voice feature data storage unit 21 is connected to the word detection unit 13, the conversation content storage unit 22 is connected to the voice understanding conversation control unit 14, and the response data instruction content storage unit 23 is connected to the voice understanding conversation control unit 14 and the voice synthesis unit 15.

[0026] The standard voice feature data storage unit 21 is a ROM that stores (registers) standard patterns of the recognizable words (called registered words). Each pattern is created in advance from the voices of many speakers (for example, about 200) uttering that word. Because a stuffed toy is the example here, there are about ten registered words, many of them words used in greetings, such as "good morning", "good night", "hello", "tomorrow", and "weather". The registered words are not limited to these, however: a wide variety of words can be registered, and the number of registered words is not limited to ten. The conversation content storage unit 22 stores which words are registered words and what response is to be made to each registered word. The conversation content storage unit 22 would ordinarily be provided inside the voice understanding conversation control unit 14, but it is placed on the cartridge unit 20 side because the registered words and the like may change from cartridge to cartridge. The response data instruction content storage unit 23 stores, for each registered word, content that specifies what kind of voice synthesis output to produce. What is preset here is mainly an instruction about voice quality, so that the same response content can be output, for example, in a boy's manner of speaking or in a girl's manner of speaking.

[0027] As described above, cartridge units 20 are prepared in many varieties that differ in standard voice feature data, registered words, the response content for each registered word, voice quality, and so on, and the user of the stuffed toy 1 can choose freely among them. In a cartridge for infants, for example, the standard voice feature data storage unit 21 stores a plurality of registered words as standard voice feature data suited to recognizing infants' voices. The conversation content storage unit 22 stores which words are registered words and what response is made to each of them, and the response data instruction content storage unit 23 stores instructions on what kind of voice synthesis output to produce for each registered word.
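
The three ROMs of the cartridge unit 20 can be pictured as three lookup tables keyed by registered word. The following is only a minimal sketch: the class name `Cartridge`, its field names, and the sample patterns and responses are hypothetical and do not come from the patent, which does not specify any data layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cartridge:
    """Hypothetical model of the three ROMs in cartridge unit 20."""
    standard_patterns: dict  # unit 21: registered word -> standard pattern
    conversation: dict       # unit 22: registered word -> response content
    voice_style: dict        # unit 23: registered word -> voice quality


# Illustrative infant cartridge; every value below is made up.
infant = Cartridge(
    standard_patterns={"good morning": [0.2, 0.7, 0.1],
                       "good night": [0.6, 0.1, 0.3]},
    conversation={"good morning": "Good morning!",
                  "good night": "Sleep tight!"},
    voice_style={"good morning": "girl", "good night": "girl"},
)


def registered_words(cartridge):
    """Unit 22 records which words are registered, so list its keys."""
    return sorted(cartridge.conversation)
```

Swapping the cartridge then amounts to passing a different `Cartridge` instance to the same processing code, which mirrors the single device body with interchangeable cartridges.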

[0028] Next, the functions of each of the above units, and then the overall processing, are described in turn.

[0029] Although not shown, the voice input unit 11 of the voice recognition response processing unit 10 comprises a microphone, an amplifier, a low-pass filter, an A/D converter, and so on. Voice input from the microphone is passed through the amplifier and the low-pass filter to obtain a suitable voice waveform, converted by the A/D converter into a digital signal (for example, 12 kHz, 16 bit), and sent to the voice analysis unit 12. The voice analysis unit 12 uses an arithmetic unit (CPU) to perform frequency analysis on the voice waveform signal from the voice input unit 11 at short time intervals, extracts a feature vector of several dimensions representing the frequency characteristics (LPC cepstrum coefficients are typical), and outputs the time series of these feature vectors (hereinafter called the voice feature vector sequence).
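
The short-time analysis described above can be sketched as follows. This is an illustration under stated assumptions, not the analysis actually used in the device: the frame length (240 samples, i.e. 20 ms at 12 kHz), the hop size, the LPC order, and all function names are hypothetical, and no pre-emphasis or windowing is applied. The sketch uses the textbook autocorrelation method with the Levinson-Durbin recursion and the standard LPC-to-cepstrum recursion.

```python
def autocorrelation(frame, order):
    """Autocorrelation r[0..order] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Predictor coefficients a_1..a_p with x[n] ~ sum_k a_k * x[n-k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c_1..c_n of the all-pole model 1 / (1 - sum_k a_k z^-k)."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] is unused
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]


def feature_sequence(signal, frame_len=240, hop=120, order=10, n_ceps=10):
    """Voice feature vector sequence: one LPC cepstrum vector per frame."""
    seq = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        a, _ = levinson_durbin(autocorrelation(frame, order), order)
        seq.append(lpc_to_cepstrum(a, n_ceps))
    return seq
```

For a signal generated by a single-pole model with coefficient 0.5, the recovered cepstrum follows the known closed form 0.5^n / n, which gives a simple sanity check on the recursions.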

[0030] Although not shown, the word detection unit 13 consists mainly of an arithmetic unit (CPU) and a ROM storing a processing program. It detects in which part of the input voice, and with what degree of certainty, each word registered in the standard voice feature data storage unit 21 of the cartridge unit 20 is present. The word detection unit 13 could use the hidden Markov model (HMM) method, the DP matching method, or the like. Here, however, it is assumed to use keyword spotting based on a DRNN (dynamic recurrent neural network), a technique for which the present applicant has already filed patent applications (JP-A-6-4097 and JP-A-6-119476), and to output word detection data that enable recognition close to continuous speech recognition for unspecified speakers.
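
The DRNN keyword spotting adopted in this embodiment is not reproduced here. As a simpler illustration of comparing an input feature vector sequence with stored standard patterns, the DP matching method that the text names as an alternative can be sketched as below. This swaps in plain dynamic time warping over whole utterances; the function names and the toy one-dimensional patterns are hypothetical.

```python
import math


def frame_distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def dp_matching(seq_a, seq_b):
    """Accumulated distance along the best time alignment of two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Allow a match, or stretching either sequence in time.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def best_match(input_seq, patterns):
    """Registered word whose standard pattern is closest to the input."""
    return min(patterns, key=lambda w: dp_matching(input_seq, patterns[w]))
```

Unlike keyword spotting, this matches the whole input against each pattern, so it suits isolated words rather than keywords embedded in continuous speech.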

[0031] The specific processing of the word detection unit 13 is briefly described with reference to FIG. 3. The word detection unit 13 detects in which part of the input voice, and with what degree of certainty, each word registered in the standard voice feature data storage unit 21 is present. Suppose the speaker inputs a voice such as "Tomorrow's weather is ..." and the voice signal shown in FIG. 3(a) is output. In this phrase, "tomorrow" and "weather" are the keywords, and their patterns are stored in the standard voice feature data storage unit 21 as members of the set of about ten pre-registered words. If there are, say, ten registered words (call them word 1, word 2, word 3, ...), a detection signal is output for each of the ten words, and information such as the value of each detection signal indicates with what certainty the corresponding word is present in the input voice. That is, when the word "weather" (word 1) is present in the input voice, the detection signal waiting for "weather" rises at the "weather" portion of the input voice, as in FIG. 3(b). Similarly, when the word "tomorrow" (word 2) is present in the input voice, the detection signal waiting for "tomorrow" rises at the "tomorrow" portion of the input voice, as in FIG. 3(c). In FIG. 3(b) and (c), values such as 0.9 and 0.8 indicate certainty (degree of similarity). A registered word with a high value such as 0.9 or 0.8 can be regarded as a recognition candidate for the input voice. Thus the registered word "tomorrow" is present in portion w1 on the time axis of the input voice signal with a certainty of 0.8, as shown in FIG. 3(c), and the registered word "weather" is present in portion w2 on the time axis with a certainty of 0.9, as shown in FIG. 3(b).

[0032] In the example of FIG. 3, the signal waiting for word 3 (assumed here to be the registered word "what time") also rises in response to the input "weather", in portion w2 on the time axis, with some certainty (about 0.6), as shown in FIG. 3(d). When two or more registered words are present as recognition candidates at the same time in the input voice signal, a single recognition candidate is determined by, for example, selecting the word with the highest degree of similarity (the value indicating certainty), or preparing in advance a correlation table that expresses correlation rules between words and selecting one word by means of that table. If the former method is used, then in this case the detection signal for "weather" has the highest degree of similarity in portion w2 on the time axis, so the recognition candidate for that portion of the input voice is judged to be "weather". Recognition of the input voice on the basis of these degrees of similarity is performed by the voice understanding conversation control unit 14.
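
The first selection method (keep the candidate with the highest degree of similarity when detections coincide in time) can be sketched as below. The `Detection` type, the time values, and the 0.5 threshold are assumptions made for illustration; only the words and the similarity values 0.9, 0.8, and 0.6 come from the FIG. 3 example.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    word: str
    similarity: float  # degree of similarity (certainty)
    start: float       # start point s on the time axis
    end: float         # end point e on the time axis


def overlaps(a, b):
    """True when two detections share part of the time axis."""
    return a.start < b.end and b.start < a.end


def select_keywords(lattice, threshold=0.5):
    """Keep the best detection in each overlapping group, in time order."""
    chosen = []
    for det in sorted(lattice, key=lambda d: d.similarity, reverse=True):
        if det.similarity < threshold:
            continue
        if all(not overlaps(det, c) for c in chosen):
            chosen.append(det)
    return sorted(chosen, key=lambda d: d.start)


# FIG. 3 example; the time values for w1 and w2 are hypothetical.
lattice = [
    Detection("weather", 0.9, 0.6, 1.0),    # word 1 in portion w2
    Detection("tomorrow", 0.8, 0.0, 0.4),   # word 2 in portion w1
    Detection("what time", 0.6, 0.6, 1.0),  # word 3, also in portion w2
]
keywords = [d.word for d in select_keywords(lattice)]
```

Here "what time" is dropped because "weather" occupies the same portion w2 with a higher similarity, leaving "tomorrow" and "weather" as the keywords. The correlation table method mentioned in the text would replace the greedy rule with a lookup and is not shown.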

[0033] The voice understanding conversation control unit 14 consists mainly of an arithmetic unit (CPU) and a ROM storing a processing program. It receives the word detection data from the word detection unit 13 and recognizes the voice on the basis of those data (that is, it understands the meaning of the input voice as a whole). It then refers to the conversation content storage unit 22 in the cartridge unit 20 to determine response content appropriate to the meaning of the input voice, and refers to the response data instruction content storage unit 23 to send the voice synthesis unit 15 (which consists mainly of a CPU and a ROM) a signal indicating what kind of voice synthesis output to produce.

[0034] For example, suppose detection data such as those shown in FIG. 3(b) to (e) are input from the word detection unit 13. These data are called a word lattice and include the registered word name, the degree of similarity, and signals indicating the start point s and end point e of the word. The unit first determines one or more words as keywords in the input voice on the basis of the word lattice. In this example the input voice is "Tomorrow's weather is ...", so "tomorrow" and "weather" are detected. From these two keywords the unit understands the content of the continuous input voice "Tomorrow's weather is ...", and selects and outputs the corresponding response content. The response in this case is something like "Tomorrow's weather will be fine." This is made possible by state detection means not shown here (a temperature detection unit, an atmospheric pressure detection unit, a calendar unit, a clock unit, and so on). For weather information, for example, the change in the weather is judged from the trend in atmospheric pressure reported by the atmospheric pressure detection unit, and if the pressure is rising, the corresponding response content is read from the response data instruction content storage unit 23. Responses concerning temperature, time, date, and so on are possible in the same way.
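
A minimal sketch of this keyword-to-response step follows. The text only states that a rising pressure trend selects the corresponding response; the trend rule (last reading minus first), the responses for falling and steady pressure, and all names are assumptions added for illustration.

```python
def pressure_trend(readings_hpa):
    """Classify the trend of atmospheric pressure readings (oldest first)."""
    if len(readings_hpa) < 2:
        return "steady"
    delta = readings_hpa[-1] - readings_hpa[0]
    if delta > 0:
        return "rising"
    if delta < 0:
        return "falling"
    return "steady"


# Hypothetical response table; only the "rising" case follows the text.
WEATHER_RESPONSES = {
    "rising": "Tomorrow's weather will be fine.",
    "falling": "It might rain tomorrow.",
    "steady": "Tomorrow will be much like today.",
}


def respond(keywords, readings_hpa):
    """Pick a response when both 'tomorrow' and 'weather' were detected."""
    if {"tomorrow", "weather"} <= set(keywords):
        return WEATHER_RESPONSES[pressure_trend(readings_hpa)]
    return None  # no matching conversation content in this sketch
```

For example, `respond(["tomorrow", "weather"], [1008.0, 1010.5, 1013.0])` selects the fine-weather response because the readings rise. Temperature, time, and date responses would follow the same pattern with other state detection inputs.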

[0035] The voice recognition processing described above, which uses keyword spotting to come close to continuous speech recognition, is applicable not only to Japanese but also to other languages. If the language is English, for example, the registered recognizable words might include "good morning", "time", "tomorrow", and "good night", and the feature data of these registered words are stored in the standard voice feature data storage unit 21. Suppose the speaker asks "what time is it now". In this phrase the word "time" is the keyword. When the word "time" is present in the input voice, the detection signal waiting for the voice signal of "time" rises at the "time" portion of the input voice. When the detection data (word lattice) from the word detection unit 13 are input, one or more words are first determined as keywords in the input voice on the basis of the word lattice. In this example the input voice is "what time is it now", so "time" is detected as the keyword, and on that basis the content of the continuous input voice "what time is it now" is understood.

[0036] A separate CPU may be provided for each of the control functions described above (voice analysis, word detection, voice understanding conversation control, voice synthesis, and so on). Alternatively, a single main CPU may be provided to perform all of these processes, so that one CPU handles the entire processing of the present invention.

[0037] With this configuration, suppose the mounted cartridge is intended for infants. When an infant says "good morning", the input voice is analyzed by the voice analysis unit 12 and then sent to the word detection unit 13. The word detection unit 13 performs the processing described above on the basis of the feature data stored in the standard voice feature data storage unit 21, and outputs word detection data (a word lattice) for the input voice. Because the cartridge unit 20 is for infants, the contents of the standard voice feature data storage unit 21 are standard voice feature data derived from the characteristics of infants' voices, so recognition is possible at a high recognition rate.

[0038] On receiving the word lattice from the word detection unit 13, the voice understanding conversation control unit 14 refers to the conversation content storage unit 22 of the cartridge unit 20 on the basis of the word lattice, understands that the input voice is "good morning", obtains the response content for it, and then refers to the response data instruction content storage unit 23. The voice synthesis unit 15 then produces a voice synthesis output on the basis of the information obtained from the response data instruction content storage unit 23, and the voice output unit 16 outputs it as the response. Because the response is addressed to an infant, the device answers, for example, "good morning" in a manner of speaking suited to infants.
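
The flow from word lattice to spoken response can be sketched as a pair of table lookups. This is a simplified illustration: the function name `dialogue_turn`, the reduction of the word lattice to `(word, similarity)` pairs, the `"default"` fallback, and the sample table contents are all hypothetical.

```python
def dialogue_turn(word_lattice, conversation_rom, voice_rom):
    """Return (response content, voice quality) for one input utterance.

    word_lattice     -- list of (registered word, similarity) pairs
    conversation_rom -- contents of unit 22: word -> response content
    voice_rom        -- contents of unit 23: word -> voice quality
    """
    candidates = [(w, s) for w, s in word_lattice if w in conversation_rom]
    if not candidates:
        return None  # nothing recognizable was said
    word, _ = max(candidates, key=lambda ws: ws[1])
    return conversation_rom[word], voice_rom.get(word, "default")


# Hypothetical contents of an infant cartridge.
infant_conversation = {"good morning": "Good morning!"}
infant_voice = {"good morning": "infant-friendly"}

reply = dialogue_turn([("good morning", 0.9)], infant_conversation, infant_voice)
```

The returned pair stands in for what the voice understanding conversation control unit 14 hands to the voice synthesis unit 15: what to say and in which voice.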

[0039] In this way, mounting a cartridge unit 20 freely chosen by the user on the stuffed toy 1 enables dialogue that matches the contents of that cartridge unit 20. For example, mounting the infant cartridge described above enables dialogue suited to infants, and mounting a cartridge for elementary school children enables dialogue suited to them. Cartridges can be prepared for various ages and for each sex. Fine-grained categories are also possible, such as cartridges for infant boys, for infant girls, for boys in the lower grades of elementary school, and for girls in those grades.

[0040] Thus, even with a single device body (here, the stuffed toy), changing the cartridge enables dialogue suited to various ages, sexes, and so on. In this case, because the standard voice feature data storage unit 21, the conversation content storage unit 22, and the response data instruction content storage unit 23 are all built into the cartridge unit 20, the recognizable words and their standard voice feature data can differ from cartridge to cartridge, as can the response content and voice quality for those words. Increasing the variety of cartridges therefore enables dialogue suited to a wide range of ages and to each sex.

[0041] (Second Embodiment) In the first embodiment described above, the standard voice feature data storage unit 21, the conversation content storage unit 22, and the response data instruction content storage unit 23 were all provided in the cartridge unit 20. The cartridge unit 20 need not hold all three of these elements, however. For example, only the standard voice feature data storage unit 21 may be provided in the cartridge unit 20, or only the response data instruction content storage unit 23 may be provided there. Various combinations are conceivable.

[0042] Three cases are described here in turn: the case where only the standard voice feature data storage unit 21 is provided in the cartridge unit 20, the case where the conversation content storage unit 22 and the response data instruction content storage unit 23 are provided in the cartridge unit 20, and the case where only the response data instruction content storage unit 23 is provided in the cartridge unit 20.

[0043] First, the case where only the standard voice feature data storage unit 21 is provided in the cartridge unit 20 is described. FIG. 4 is a block diagram showing this configuration; parts identical to those in FIG. 2 bear the same reference numerals. In this case, the conversation content storage unit 22 and the response data instruction content storage unit 23 are provided on the voice recognition response processing unit 10 side. The conversation content storage unit 22 may be provided inside the voice understanding conversation control unit 14, but it is shown here as a separate unit.

[0044] When only the standard voice feature data storage unit 21 is on the cartridge unit 20 side and the conversation content storage unit 22 and the response data instruction content storage unit 23 are on the device (stuffed toy 1) side, the response to each registered word is determined in advance on the device side. The registered words are therefore limited to those determined in advance on the device side (for example, about ten words such as "good morning", "hello", and "good night", as mentioned above), but the standard voice feature data for each registered word can be prepared in many versions by age and sex, one per cartridge. For example, there may be a cartridge storing infants' standard voice feature data for each registered word, a cartridge storing elementary school children's standard voice feature data for each registered word, a cartridge storing adult women's standard voice feature data for each registered word, and a cartridge storing adult men's standard voice feature data for each registered word. In this way, a cartridge storing standard voice feature data for each registered word is prepared for each age group and sex.

[0045] Thus, when the device is used by an infant, for example, selecting the cartridge that stores voice feature data for infants and mounting it in the device body allows the input to be compared with voice feature data derived from the characteristics of infants' voices, so recognition is achieved at a high recognition rate. The predetermined response content for the recognized word is then synthesized and output. Selecting a standard voice feature data cartridge to match the user's age and sex in this way can greatly improve the recognition rate.

[0046] Next, the case where the conversation content storage unit 22 and the response data instruction content storage unit 23 are provided in the cartridge unit 20 is described. FIG. 5 is a block diagram showing this configuration; parts identical to those in FIG. 2 bear the same reference numerals. In this case, the standard voice feature data storage unit 21 is provided on the voice recognition response processing unit 10 side, and the conversation content storage unit 22 and the response data instruction content storage unit 23 are provided on the cartridge unit 20 side.

[0047] When only the conversation content storage unit 22 and the response data instruction content storage unit 23 are on the cartridge unit 20 side and the standard voice feature data storage unit 21 is on the voice recognition response processing unit 10 side, the registered words are limited to those determined in advance on the device side (for example, about ten words such as "good morning", "hello", and "good night", as mentioned above). However, the response content for each registered word, and the voice synthesis output (voice quality and so on) for each response content, can be prepared in many versions, one per cartridge. For example, several alternatives are decided in advance as to what response content to give to the word "good morning" and what kind of voice synthesis output to use for that response content, and these are stored, cartridge by cartridge, in the conversation content storage unit 22 and the response data instruction content storage unit 23. Specifically, a cartridge for infants responds to registered words such as "good morning" and "good night" with content suited to infants and in a voice quality suited to infants. A cartridge for elementary school children responds to the registered words, for example, in imitation of the manner of speaking of a television cartoon character and with content suited to elementary school children. In this way, various cartridges are prepared that respond with content and voice quality matched to each age and sex.

[0048] Thus, when the device is used by an elementary school child, for example, the cartridge for elementary school children described above is selected and mounted in the device. When the child then speaks one of the registered words, the response content set in that cartridge comes back in a voice that imitates the manner of speaking of the television cartoon character.

[0049] Next, the case where only the response data instruction content storage unit 23 is provided on the cartridge unit 20 side is described. FIG. 6 is a block diagram showing this configuration; parts identical to those in FIG. 2 bear the same reference numerals. In this case, the standard voice feature data storage unit 21 and the conversation content storage unit 22 are provided on the voice recognition response processing unit 10 side of the device body, and only the response data instruction content storage unit 23 is provided on the cartridge unit 20 side.

【００５０】このように応答データ指示内容記憶部２３
のみをカートリッジ部２０側に設け、会話内容記憶部２
２と標準音声特徴データ記憶部２１を音声認識応答処理
部１０側に設けた場合は、登録単語は装置側で予め決め
られた単語（たとえば、前記したように「おはよう」、
「こんにちは」、「おやすみ」などというような１０単
語程度）のみであり、また、これらの登録単語に対する
応答内容は基本的には装置側で予め設定された内容とな
るが、その応答内容に対してどのような音声合成出力と
するかをカートリッジ毎に様々設定することができる。
たとえば、「おはよう」という単語に対する応答内容
を、どのような声の質で出力とするかを予め何通りか決
めておき、それらをカートリッジ毎に応答データ指示内
容記憶部２３に記憶させておく。具体的には、幼児向け
のカートリッジの場合は、「おはよう」、「おやすみ」
などの種々の登録単語に対しては、それらの登録単語に
対して、母親のような声での応答を行うような指示内
容、あるいは幼児向けテレビアニメのキャラクタの声に
似せた応答を行うような指示内容が記憶され、小学生向
けのカートリッジとしては、登録単語に対する応答を小
学生向けのテレビアニメのキャラクタの話し方を真似て
応答を行うなどの指示内容が記憶されるというように、
それぞれの年齢や性別に合わせて、登録単語毎に、どの
ような音声合成出力とするかを指示する内容が記憶され
たカートリッジを種々用意しておく。なお、このとき、
それぞれの登録単語に対するするそれぞれの応答内容は
基本的には予め設定された内容である。When only the response data instruction content storage unit 23 is provided on the cartridge unit 20 side in this way, and the conversation content storage unit 22 and the standard voice feature data storage unit 21 are provided on the voice recognition response processing unit 10 side, the registered words are limited to words predetermined on the device side (for example, about ten words such as "good morning", "hello", and "good night", as described above), and the response content for these registered words is basically preset on the device side. However, the kind of voice synthesis output used for that response content can be set in various ways for each cartridge. For example, several voice qualities in which the response to the word "good morning" may be output are decided in advance, and these are stored in the response data instruction content storage unit 23 of each cartridge. Specifically, a cartridge for infants stores instruction content such that responses to registered words such as "good morning" and "good night" are given in a mother-like voice, or in a voice resembling that of a character from a TV cartoon for infants. A cartridge for elementary school students stores instruction content such that responses to registered words imitate the way a character from a TV cartoon for elementary school students speaks. In this way, various cartridges are prepared, each storing content that specifies, for each registered word, what kind of voice synthesis output to use according to age and sex. In this case, the response content itself for each registered word is basically preset.
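The split described in this paragraph can be sketched as follows. This is only an illustrative sketch, not the patent's implementation: the response text is fixed on the device side, while the cartridge supplies the instruction for how that text should sound. The names `DEVICE_RESPONSES`, `VoiceStyle`, `Cartridge`, and `respond`, and the `pitch` and `rate` parameters, are assumptions introduced for illustration; the patent only states that the cartridge stores what kind of voice synthesis output to use.

```python
from dataclasses import dataclass

# Device-side conversation content (unit 22 on the main body in this configuration).
# The response text is the same whichever cartridge is mounted.
DEVICE_RESPONSES = {
    "good morning": "Good morning!",
    "good night": "Good night, sleep well.",
}


@dataclass(frozen=True)
class VoiceStyle:
    """Hypothetical synthesis parameters standing in for 'what kind of voice'."""
    label: str
    pitch: float  # assumed relative pitch multiplier
    rate: float   # assumed relative speaking rate


@dataclass(frozen=True)
class Cartridge:
    """Cartridge-side response data instruction content (unit 23)."""
    name: str
    styles: dict          # registered word -> VoiceStyle
    default: VoiceStyle   # style used when a word has no specific entry


def respond(word, cartridge):
    """Return (text, style) for a registered word, or None if the word is not registered."""
    text = DEVICE_RESPONSES.get(word)
    if text is None:
        return None
    return text, cartridge.styles.get(word, cartridge.default)


mother = VoiceStyle("mother-like", pitch=1.1, rate=0.9)
cartoon = VoiceStyle("cartoon character", pitch=1.4, rate=1.2)
infant_cartridge = Cartridge("infant", {"good night": mother}, default=mother)
school_cartridge = Cartridge("elementary school", {}, default=cartoon)
```

With either cartridge mounted, `respond("good morning", ...)` yields the same text; only the returned style differs, which mirrors the statement that the response content is preset while the voice is chosen per cartridge.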

【００５１】このようにして、たとえば、小学生が使用
する場合は、前記したような小学生向けのカートリッジ
を選択して、それを装置に装着することにより、小学生
が何らかの登録単語を話しかけると、応答内容は基本的
には予め設定された内容であるが、前記したような小学
生向けのテレビアニメのキャラクタの話し方を真似た声
での応答が返ってくるというようなことが可能となる。Thus, for example, when an elementary school student uses the device, the student selects a cartridge for elementary school students as described above and mounts it on the device. When the student then speaks one of the registered words, the response content is basically preset, but the response comes back in a voice that imitates the way a character from a TV cartoon for elementary school students speaks.

【００５２】以上説明したように本発明では、標準音声
特徴データ記憶部２１、会話内容記憶部２２、応答デー
タ指示内容記憶部２３などのＲＯＭの部分をカートリッ
ジ式とし、ユーザの年齢や性別などに応じた標準音声特
徴データ、応答内容などを有するカートリッジを種々用
意しておき、ユーザが任意に選択できるようにしてい
る。したがって、認識応答装置そのものは１台であって
も、カートリッジを取り替えることにより、幅広い人が
利用でき、ユーザに応じた音声認識および対話が行え
る。As described above, in the present invention, the ROM portions such as the standard voice feature data storage unit 21, the conversation content storage unit 22, and the response data instruction content storage unit 23 are provided in cartridge form, and various cartridges holding standard voice feature data, response content, and the like suited to the user's age, sex, and so on are prepared so that the user can choose freely among them. Therefore, even with a single recognition response device, replacing the cartridge lets a wide range of people use it, with voice recognition and dialogue suited to each user.

【００５３】なお、以上の各実施例では、本発明を玩具
としてぬいぐるみに適用した例を説明したが、ぬいぐる
みに限られるものではなく。他の玩具にも適用できるこ
とは勿論であり、さらに、玩具だけではなく、ゲーム機
や、日常使われる様々な電子機器などにも適用でき、そ
の適用範囲は極めて広いものと考えられる。In each of the above embodiments, the present invention is applied to a stuffed toy as an example of a toy, but it is not limited to stuffed toys. It can of course be applied to other toys, and beyond toys to game machines and the various electronic devices used in daily life, so its range of application is considered extremely wide.

【００５４】[0054]

【発明の効果】以上説明したように、本発明の音声認識
対話装置は、音声認識および認識された音声に対する応
答を行うために予め設定された記憶内容を記憶する記憶
手段を、装置本体に対して着脱自在に装着可能なカート
リッジ側に設け、このカートリッジが装置本体側に装着
されることにより、そのカートリッジ内に記憶されたデ
ータを基に、入力音声に対する応答内容を出力するよう
にしたので、装置本体は１台であっても、カートリッジ
を変えることにより、様々な年代あるいは性別などに応
じた対話が可能となる。したがって、本発明をたとえ
ば、玩具などに適用した場合には、子どもの成長に合わ
せたカートリッジを選択することができ、また、認識可
能な単語、応答内容もカートリッジを選択することによ
り色々選ぶことができるため、１台の玩具でも途中で飽
きてしまうことが少なく、長い期間使用することがで
き、また、年代や性別にとらわれることなく幅広く使用
可能となる。さらに、玩具だけでなく、電子機器などの
適用した場合にも、ユーザに適応したカートリッジを選
択することにより、対話内容などに幅広いバリエーショ
ンを持たせることができ、それに対応して様々な動作を
させることが可能となるなど、その効果はきわめて大き
いものとなる。As described above, in the voice recognition dialogue apparatus of the present invention, storage means for storing the contents preset for voice recognition and for responding to the recognized voice is provided on a cartridge that can be detachably mounted on the apparatus main body, and when this cartridge is mounted on the apparatus main body, the response content for the input voice is output based on the data stored in the cartridge. Therefore, even with a single apparatus main body, changing the cartridge enables dialogue suited to various ages, sexes, and so on. Accordingly, when the present invention is applied to a toy, for example, a cartridge can be selected to match a child's growth, and the recognizable words and response content can also be varied by selecting a cartridge. A single toy is therefore less likely to bore its user partway through, can be used over a long period, and can be used widely regardless of age or sex. Furthermore, when the invention is applied not only to toys but also to electronic devices and the like, selecting a cartridge suited to the user gives wide variation in dialogue content and allows various operations to be performed in response, so the effect is extremely large.

【００５５】また、予め登録された認識可能な単語の標
準音声特徴データを記憶する標準音声特徴データ記憶手
段をカートリッジ側に設けるようにしたので、それぞれ
の登録単語における年代や性別に応じた標準音声特徴デ
ータをカートリッジ毎に様々用意することができ、ユー
ザに適応したカートリッジを選択して使用することによ
り、認識率の大幅な向上を図ることができる。Further, since standard voice feature data storage means for storing the standard voice feature data of pre-registered recognizable words is provided on the cartridge side, standard voice feature data suited to age and sex can be prepared for each registered word on a per-cartridge basis, and selecting and using a cartridge suited to the user greatly improves the recognition rate.
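The effect described here can be illustrated with a small sketch in which each cartridge carries its own reference templates for the registered words. The matching rule used below (nearest template by Euclidean distance with a rejection threshold) is an assumption chosen for brevity and is not stated in this passage; the feature vectors and the names `detect_word`, `ADULT_TEMPLATES`, and `CHILD_TEMPLATES` are likewise made up for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def detect_word(features, templates, threshold=1.0):
    """Return (word, distance) for the nearest template.

    If even the nearest template is farther than `threshold`,
    the word is rejected and (None, distance) is returned.
    """
    best_word, best_dist = None, float("inf")
    for word, template in templates.items():
        d = distance(features, template)
        if d < best_dist:
            best_word, best_dist = word, d
    if best_dist > threshold:
        return None, best_dist
    return best_word, best_dist


# Made-up standard voice feature data for two hypothetical cartridges.
ADULT_TEMPLATES = {"good morning": [1.0, 2.0, 3.0], "good night": [3.0, 1.0, 0.5]}
CHILD_TEMPLATES = {"good morning": [2.0, 3.5, 4.0], "good night": [4.5, 2.0, 1.0]}

# Made-up features for a child saying "good morning".
child_utterance = [2.1, 3.4, 4.1]
```

In this toy data the child's utterance is accepted with the child cartridge's templates and rejected with the adult ones, which is the kind of mismatch that per-cartridge standard voice feature data is meant to avoid.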

【００５６】また、予め登録された認識可能な単語に対
応する応答内容を記憶する会話内容記憶手段と、どのよ
うな音声合成出力を発生するかを指示する指示内容を記
憶する応答データ指示内容記憶手段をカートリッジ側に
設けるようにしたので、それぞれの年齢や性別に合わせ
た応答内容を持ったカートリッジを種々用意することが
でき、たとえば、子供向けのカートリッジを選択すれ
ば、何らかの登録単語を話しかけると、子供向けの応答
内容で、かつ、子供向けの音声で応答するというような
ことが可能となる。Further, since conversation content storage means for storing the response content corresponding to pre-registered recognizable words, and response data instruction content storage means for storing instruction content specifying what kind of voice synthesis output is to be generated, are provided on the cartridge side, various cartridges with response content suited to each age and sex can be prepared. For example, if a cartridge for children is selected, speaking one of the registered words produces a response whose content is aimed at children and which is delivered in a voice aimed at children.

【００５７】また、どのような音声合成出力を発生する
かを指示するための指示内容を記憶する応答データ指示
内容記憶手段をカートリッジ側に設けるようにしたの
で、それぞれの年齢や性別に合わせて、登録単語毎に、
どのような音声合成出力とするかを指示する内容が記憶
されたカートリッジを種々用意することができ、たとえ
ば、小学生向けのカートリッジを選択すれば、何らかの
登録単語を話しかけると、応答内容は基本的には予め設
定された内容であるが、前記したような小学生向けのテ
レビアニメのキャラクタの話し方を真似た声での応答が
返ってくるというようなことが可能となる。Further, since response data instruction content storage means for storing the instruction content specifying what kind of voice synthesis output is to be generated is provided on the cartridge side, various cartridges can be prepared that store, for each registered word, content specifying what kind of voice synthesis output to use according to age and sex. For example, if a cartridge for elementary school students is selected, then when one of the registered words is spoken, the response content is basically preset, but the response comes back in a voice that imitates the way a character from a TV cartoon for elementary school students speaks.

【００５８】また、予め登録された認識可能な単語の標
準音声特徴データを記憶する標準音声特徴データ記憶手
段と、前記登録された認識可能な単語に対応する応答内
容を記憶する会話内容記憶手段と、どのような音声合成
出力を発生するかを指示する応答データ指示内容記憶手
段を、カートリッジ側に設けるようにしたので、それぞ
れの登録単語における年代や性別に応じた標準音声特徴
データをカートリッジ毎に様々用意することができ、ユ
ーザに適応したカートリッジを選択して使用することに
より、認識率の大幅な向上を図ることができ、また、認
識可能な登録単語もカートリッジ単位で設定でき、会話
のバリエーションを大幅に増やすことができる。さら
に、それぞれの年齢や性別に合わせた応答内容および音
声合成出力を持ったカートリッジを種々用意することが
できる。これにより、１台の装置本体であっても、カー
トリッジを変えることにより、様々な年代あるいは性別
などに応じたバリエーションの豊富な対話が可能とな
る。Further, since the standard voice feature data storage means for storing the standard voice feature data of pre-registered recognizable words, the conversation content storage means for storing the response content corresponding to the registered recognizable words, and the response data instruction content storage means for specifying what kind of voice synthesis output is to be generated are all provided on the cartridge side, standard voice feature data suited to age and sex can be prepared for each registered word on a per-cartridge basis, and selecting and using a cartridge suited to the user greatly improves the recognition rate. In addition, the recognizable registered words can be set per cartridge, which greatly increases the variety of conversation. Further, various cartridges with response content and voice synthesis output suited to each age and sex can be prepared. As a result, even with a single apparatus main body, changing the cartridge enables richly varied dialogue suited to various ages, sexes, and so on.
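The configuration in which all three storage means sit on the cartridge can be sketched as a main body that holds no vocabulary of its own and simply works from whichever cartridge is mounted. This is a rough sketch under stated assumptions: the class names `Cartridge` and `MainBody`, the squared-distance matching, the feature vectors, and the bracketed voice label standing in for actual voice synthesis are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cartridge:
    """All three ROM areas live on the cartridge in this configuration."""
    templates: dict  # unit 21: registered word -> reference feature vector
    responses: dict  # unit 22: registered word -> response text
    voice: str       # unit 23: label of the voice to synthesize with


class MainBody:
    """Stands in for the voice recognition response processing unit 10."""

    def __init__(self):
        self.cartridge = None

    def mount(self, cartridge):
        self.cartridge = cartridge

    def converse(self, features):
        if self.cartridge is None:
            raise RuntimeError("no cartridge mounted")
        templates = self.cartridge.templates
        # Word detection: pick the registered word whose template is closest
        # (squared Euclidean distance, an assumed matching rule).
        word = min(
            templates,
            key=lambda w: sum((a - b) ** 2 for a, b in zip(features, templates[w])),
        )
        # The voice label marks where a real device would run voice synthesis.
        return f"[{self.cartridge.voice}] {self.cartridge.responses[word]}"


toddler = Cartridge(
    templates={"good morning": [2.0, 3.5]},
    responses={"good morning": "Morning, sweetie!"},
    voice="mother-like voice",
)
pupil = Cartridge(
    templates={"good morning": [1.0, 2.0], "homework": [5.0, 0.5]},
    responses={
        "good morning": "Morning! Ready for school?",
        "homework": "Let's finish it together.",
    },
    voice="cartoon-character voice",
)

toy = MainBody()
toy.mount(toddler)
first = toy.converse([2.1, 3.4])
toy.mount(pupil)
second = toy.converse([4.8, 0.6])
```

Swapping the cartridge changes the recognizable words, the response text, and the voice at once, while `MainBody` itself stays unchanged.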

【００５９】また、本発明の音声認識対話処理方法は、
音声認識および認識された音声に対する応答を行うため
に予め設定された記憶内容を、装置本体に対して着脱自
在に装着可能なカートリッジ側に設け、このカートリッ
ジが装置本体側に装着されることにより、そのカートリ
ッジ内に記憶されたデータを基に、入力音声に対する応
答内容を発生するようにしたので、装置本体は１台であ
っても、カートリッジを変えることにより、様々な年代
あるいは性別などに応じた対話が可能となる。したがっ
て、本発明をたとえば、玩具などに適用した場合には、
子どもの成長に合わせたカートリッジを選択することが
でき、また、認識可能な単語、応答内容もカートリッジ
を選択することにより色々選ぶことができるため、１台
の玩具でも途中で飽きてしまうことが少なく、長い期間
使用することができ、また、年代や性別にとらわれるこ
となく幅広く使用可能となる。さらに、玩具だけでな
く、電子機器などの適用した場合にも、ユーザに適応し
たカートリッジを選択することにより、対話内容などに
幅広いバリエーションを持たせることができ、それに対
応して様々な動作をさせることが可能となるなど、その
適用範囲はきわめて広いものとなる。In the voice recognition dialogue processing method of the present invention, the contents preset for voice recognition and for responding to the recognized voice are stored on a cartridge that can be detachably mounted on the apparatus main body, and when this cartridge is mounted on the apparatus main body, the response content for the input voice is generated based on the data stored in the cartridge. Therefore, even with a single apparatus main body, changing the cartridge enables dialogue suited to various ages, sexes, and so on. Accordingly, when the present invention is applied to a toy, for example, a cartridge can be selected to match a child's growth, and the recognizable words and response content can also be varied by selecting a cartridge. A single toy is therefore less likely to bore its user partway through, can be used over a long period, and can be used widely regardless of age or sex. Furthermore, when the invention is applied not only to toys but also to electronic devices and the like, selecting a cartridge suited to the user gives wide variation in dialogue content and allows various operations to be performed in response, so the range of application is extremely wide.

【００６０】また、予め登録された認識可能な単語の標
準音声特徴データをカートリッジ側に記憶させるように
したので、それぞれの登録単語における年代や性別に応
じた標準音声特徴データをカートリッジ毎に様々用意す
ることができ、ユーザに適応したカートリッジを選択し
て使用することにより、認識率の大幅な向上を図ること
ができる。Further, since the standard voice feature data of pre-registered recognizable words is stored on the cartridge side, standard voice feature data suited to age and sex can be prepared for each registered word on a per-cartridge basis, and selecting and using a cartridge suited to the user greatly improves the recognition rate.

【００６１】また、予め登録された認識可能な単語に対
応する応答内容およびどのような音声合成出力を発生す
るかを指示する指示内容を、カートリッジ側に記憶させ
るようにしたので、それぞれの年齢や性別に合わせた応
答内容を持ったカートリッジを種々用意することがで
き、たとえば、子供向けのカートリッジを選択すれば、
何らかの登録単語を話しかけると、子供向けの応答内容
で、かつ、子供向けの音声で応答するというようなこと
が可能となる。Further, since the response content corresponding to pre-registered recognizable words and the instruction content specifying what kind of voice synthesis output is to be generated are stored on the cartridge side, various cartridges with response content suited to each age and sex can be prepared. For example, if a cartridge for children is selected, speaking one of the registered words produces a response whose content is aimed at children and which is delivered in a voice aimed at children.

【００６２】また、どのような音声合成出力を発生する
かを指示するための指示内容を、カートリッジ側に記憶
させるようにようにしたので、それぞれの年齢や性別に
合わせて、登録単語毎に、どのような音声合成出力とす
るかを指示する内容が記憶されたカートリッジを種々用
意することができ、たとえば、小学生向けのカートリッ
ジを選択すれば、何らかの登録単語を話しかけると、応
答内容は基本的には予め設定された内容であるが、前記
したような小学生向けのテレビアニメのキャラクタの話
し方を真似た声での応答が返ってくるというようなこと
が可能となる。Further, since the instruction content specifying what kind of voice synthesis output is to be generated is stored on the cartridge side, various cartridges can be prepared that store, for each registered word, content specifying what kind of voice synthesis output to use according to age and sex. For example, if a cartridge for elementary school students is selected, then when one of the registered words is spoken, the response content is basically preset, but the response comes back in a voice that imitates the way a character from a TV cartoon for elementary school students speaks.

【００６３】また、予め登録された認識可能な単語の標
準音声特徴データ、前記登録された認識可能な単語に対
応する応答内容、どのような音声合成出力を発生するか
を指示する応答データ指示内容を、カートリッジ側に記
憶させるようにしたので、それぞれの登録単語における
年代や性別に応じた標準音声特徴データをカートリッジ
毎に様々用意することができ、ユーザに適応したカート
リッジを選択して使用することにより、認識率の大幅な
向上を図ることができ、また、認識可能な登録単語もカ
ートリッジ単位で設定でき、会話のバリエーションを大
幅に増やすことができる。さらに、それぞれの年齢や性
別に合わせた応答内容および音声合成出力を持ったカー
トリッジを種々用意することができる。これにより、１
台の装置本体であっても、カートリッジを変えることに
より、様々な年代あるいは性別などに応じたバリエーシ
ョンの豊富な対話が可能となる。Further, since the standard voice feature data of pre-registered recognizable words, the response content corresponding to the registered recognizable words, and the response data instruction content specifying what kind of voice synthesis output is to be generated are all stored on the cartridge side, standard voice feature data suited to age and sex can be prepared for each registered word on a per-cartridge basis, and selecting and using a cartridge suited to the user greatly improves the recognition rate. In addition, the recognizable registered words can be set per cartridge, which greatly increases the variety of conversation. Further, various cartridges with response content and voice synthesis output suited to each age and sex can be prepared. As a result, even with a single apparatus main body, changing the cartridge enables richly varied dialogue suited to various ages, sexes, and so on.

[Brief description of drawings]

【図１】本発明の概略を説明する図。FIG. 1 is a diagram illustrating an outline of the present invention.

【図２】本発明の第１の実施例を説明するブロック図。FIG. 2 is a block diagram illustrating a first embodiment of the present invention.

【図３】単語検出部による単語検出処理および音声理解
会話制御部による音声認識処理を説明する図。FIG. 3 is a diagram illustrating a word detection process by a word detection unit and a voice recognition process by a voice understanding conversation control unit.

【図４】本発明の第２の実施例（その１）を説明するブ
ロック図。FIG. 4 is a block diagram illustrating a second embodiment (No. 1) of the present invention.

【図５】本発明の第２の実施例（その２）を説明するブ
ロック図。FIG. 5 is a block diagram illustrating a second embodiment (No. 2) of the present invention.

【図６】本発明の第２の実施例（その３）を説明するブ
ロック図。FIG. 6 is a block diagram illustrating a second embodiment (No. 3) of the present invention.

[Explanation of symbols]

１・・・ぬいぐるみ（装置本体）
１０・・・音声認識応答処理部
１１・・・音声入力部
１２・・・音声分析部
１３・・・単語検出部
１４・・・音声理解会話制御部
１５・・・音声合成部
１６・・・音声出力部
２０・・・カートリッジ部
２１・・・標準音声特徴データ記憶部
２２・・・会話内容記憶部
２３・・・応答データ指示内容記憶部

1 ... Stuffed toy (apparatus main body)
10 ... Voice recognition response processing unit
11 ... Voice input unit
12 ... Voice analysis unit
13 ... Word detection unit
14 ... Voice understanding conversation control unit
15 ... Voice synthesis unit
16 ... Voice output unit
20 ... Cartridge unit
21 ... Standard voice feature data storage unit
22 ... Conversation content storage unit
23 ... Response data instruction content storage unit

フロントページの続き
(72)発明者 長谷川浩 長野県諏訪市大和３丁目３番５号 セイコーエプソン株式会社内
(72)発明者 枝常伊佐央 長野県諏訪市大和３丁目３番５号 セイコーエプソン株式会社内
(72)発明者 浦野治 長野県諏訪市大和３丁目３番５号 セイコーエプソン株式会社内
(56)参考文献 特開平６−133039（ＪＰ，Ａ) 特開平４−93899（ＪＰ，Ａ) 特開平４−167071（ＪＰ，Ａ)

Continued from front page
(72) Inventor: Hiroshi Hasegawa, c/o Seiko Epson Corporation, 3-5 Owa 3-chome, Suwa City, Nagano Prefecture
(72) Inventor: Isao Edatsune, c/o Seiko Epson Corporation, 3-5 Owa 3-chome, Suwa City, Nagano Prefecture
(72) Inventor: Osamu Urano, c/o Seiko Epson Corporation, 3-5 Owa 3-chome, Suwa City, Nagano Prefecture
(56) References: JP-A-6-133039 (JP, A); JP-A-4-93899 (JP, A); JP-A-4-167071 (JP, A)

Claims

(57) [Claims]

1. A voice recognition dialogue apparatus that analyzes a voice input from a speaker to generate voice feature data, compares this voice feature data with standard voice feature data of recognizable words registered in advance to output word detection data, and, on receiving this word detection data, understands the meaning of the input voice and determines and outputs the response content corresponding to it, wherein: storage means for storing data set in advance according to the age or sex of the speaker in order to output a response to the recognized voice is provided in a plurality of cartridges detachably mountable to the apparatus main body; at least one cartridge selected from the plurality of cartridges, when mounted on the apparatus main body, is connected to a voice recognition response processing unit provided on the apparatus main body side; the storage means comprises conversation content storage means for storing response content corresponding to recognizable words registered in advance according to the age or sex of the speaker, and response data instruction content storage means for storing instruction content specifying what kind of voice synthesis output is to be generated; and the voice recognition response processing unit comprises: state detection means; voice input means for inputting the voice of the speaker; voice analysis means for analyzing the voice input by the voice input means to generate voice feature data; standard voice feature data storage means for storing standard voice feature data of recognizable words registered in advance; word detection means for outputting word detection data for the input voice based on the stored contents of the standard voice feature data storage means; voice understanding conversation control means for receiving the word detection data from the word detection means to understand the meaning of the input voice, selecting an optimum meaning based on the contents from the state detection means and the conversation content storage means, and determining the response content corresponding to it; voice synthesis means for generating a voice synthesis output for the response content determined by the voice understanding conversation control means, based on the contents of the response data instruction content storage means; and voice output means for outputting the voice synthesis output from the voice synthesis means to the outside.

2. A voice recognition dialogue apparatus that analyzes a voice input from a speaker to generate voice feature data, compares this voice feature data with standard voice feature data of recognizable words registered in advance to output word detection data, and, on receiving this word detection data, understands the meaning of the input voice and determines and outputs the response content corresponding to it, wherein: storage means for storing data set in advance according to the age, sex, or the like of the speaker in order to perform voice recognition, and data set in advance according to the age, sex, or the like of the speaker in order to output a response to the recognized voice, is provided in a plurality of cartridges detachably mountable to the apparatus main body; at least one cartridge selected from the plurality of cartridges, when mounted on the apparatus main body, is connected to a voice recognition response processing unit provided on the apparatus main body side; the storage means comprises standard voice feature data storage means for storing standard voice feature data for recognizable words registered in advance according to the age, sex, or the like of the speaker, conversation content storage means for storing response content corresponding to recognizable words registered in advance according to the age, sex, or the like of the speaker, and response data instruction content storage means for storing instruction content specifying what kind of voice synthesis output is to be generated; and the voice recognition response processing unit comprises: state detection means; voice input means for inputting the voice of the speaker; voice analysis means for analyzing the voice input by the voice input means to generate voice feature data; word detection means for receiving the voice feature data from the voice analysis means and outputting word detection data for the input voice based on the stored contents of the standard voice feature data storage means; voice understanding conversation control means for receiving the word detection data from the word detection means to understand the meaning of the input voice, selecting an optimum meaning based on the contents from the state detection means and the conversation content storage means, and determining the response content corresponding to it; voice synthesis means for generating a voice synthesis output for the response content determined by the voice understanding conversation control means, based on the contents of the response data instruction content storage means; and voice output means for outputting the voice synthesis output from the voice synthesis means to the outside.

3. A voice recognition dialogue processing method that analyzes a voice input from a speaker to generate voice feature data, compares this voice feature data with standard voice feature data of recognizable words registered in advance to output word detection data, and, on receiving this word detection data, understands the meaning of the input voice and determines and outputs the response content corresponding to it, wherein: storage means for storing data set in advance according to the age, sex, or the like of the speaker in order to output a response to the recognized voice is provided in a plurality of cartridges detachably mountable to the apparatus main body; at least one cartridge selected from the plurality of cartridges, when mounted on the apparatus main body, is connected to a voice recognition response processing unit provided on the apparatus main body side; the stored contents held in the storage means consist of response content corresponding to recognizable words registered in advance according to the age or sex of the speaker, and response data instruction content specifying what kind of voice synthesis output is to be generated; and the voice recognition response processing unit performs: a state detection step of detecting a state; a voice analysis step of analyzing the voice input by voice input means to generate voice feature data; a word detection step of outputting word detection data for the input voice based on standard voice feature data of recognizable words registered in advance; a voice understanding conversation control step of receiving the word detection data from the word detection step to understand the meaning of the input voice, determining an optimum meaning based on the state detected by the state detection step and the response content, and determining the corresponding response content with reference to the conversation content stored on the storage means side; a voice synthesis step of generating a voice synthesis output for the response content determined by the voice understanding conversation control step, based on the response data instruction content stored on the storage means side; and a voice output step of outputting the voice synthesis output from the voice synthesis step to the outside.

4. A voice recognition dialogue processing method that analyzes a voice input from a speaker to generate voice feature data, compares this voice feature data with standard voice feature data of recognizable words registered in advance to output word detection data, and, on receiving this word detection data, understands the meaning of the input voice and determines and outputs the response content corresponding to it, wherein: storage means for storing data set in advance according to the age, sex, or the like of the speaker in order to perform voice recognition, and data set in advance according to the age, sex, or the like of the speaker in order to output a response to the recognized voice, is provided in a plurality of cartridges detachably mountable to the apparatus main body; at least one cartridge selected from the plurality of cartridges, when mounted on the apparatus main body, is connected to a voice recognition response processing unit provided on the apparatus main body side; the stored contents held in the storage means consist of standard voice feature data for recognizable words registered in advance according to the age, sex, or the like of the speaker, response content corresponding to recognizable words registered in advance according to the age, sex, or the like of the speaker, and response data instruction content specifying what kind of voice synthesis output is to be generated; and the voice recognition response processing unit performs: a state detection step of detecting a state; a voice analysis step of analyzing the voice input by voice input means to generate voice feature data; a word detection step of receiving the voice feature data from the voice analysis step and outputting word detection data for the input voice based on the stored contents held in the storage means; a voice understanding conversation control step of receiving the word detection data from the word detection step to understand the meaning of the input voice, selecting an optimum meaning based on the state detected by the state detection step and the response content, and determining the corresponding response content with reference to the conversation content stored in the storage means; a voice synthesis step of generating a voice synthesis output for the response content determined by the voice understanding conversation control step, based on the response data instruction content stored in the storage means; and a voice output step of outputting the voice synthesis output from the voice synthesis step to the outside.