JPH1097280A

JPH1097280A - Speech image recognition and translation device

Info

Publication number: JPH1097280A
Application number: JP8247939A
Authority: JP
Inventors: Shinji Wakizaka; 新路脇坂; Kazuo Kondo; 和夫近藤; Toshihisa Tsukada; 俊久塚田; Akio Amano; 明雄天野; Koji Ito; 功二伊東; Hiroko Sato; 裕子佐藤; Kazuyoshi Ishiwatari; 一嘉石渡
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1996-09-19
Filing date: 1996-09-19
Publication date: 1998-04-14

Abstract

PROBLEM TO BE SOLVED: To improve translation accuracy by judging a span of input continuous speech to be a phrase of the sentence when the features of the image of the whole mouth and its surroundings, corresponding to the features of the speech, indicate the end of a phrase. SOLUTION: The device consists of a microphone 101, A/D conversion ICs 102 and 107, a speech input part 103, memories 104 and 109, a speech recognition processing part 105, a camera 106, an image input part 108, an image recognition processing part 110, a translation processing part 111, a speech synthesis processing part 112, a D/A conversion IC 113, an LCD 114, and a speaker 115. The device recognizes the phrases of a sentence consisting of continuous input speech, and the words that make up the sentence, to perform speech recognition and translation. When the features of the mouth image corresponding to the speech indicate the end of a phrase, the device judges that span to be a phrase of the sentence, recognizes the character or character string at the end of the phrase, and adds a mark indicating the phrase, so that the phrases and word endings of the spoken input sentence are recognized accurately.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech image recognition and translation device suitable for application to small information devices, typified by portable speech translators and PDAs, and to portable translators and the like. In particular, it relates to a speech recognition and translation device that captures, in both directions, the conversational speech needed at overseas travel destinations and the like, recognizes that speech, and translates it into each speaker's native language.

[0002]

[Prior Art] With the rapid increase in the number of overseas travelers, portable translators such as spoken collections of set phrases are becoming widespread as a way to overcome the communication difficulties caused by language barriers. These portable translators register the sentences used in conversation in advance as speech data, and the user selects and plays back the sentence needed for the situation at hand. With this approach, therefore, the user can only choose an expression from a fixed range of set sentences and play a question or request one-sidedly in the other party's language; the device cannot translate the user's own words or the other party's words when they are phrased with some freedom.

[0003] Japanese Patent Application Laid-Open No. 5-35776 (title: "Translation Device with Automatic Language Selection Function") discloses a portable translation device that recognizes and translates an operator's speech input from a microphone and outputs speech in the translated language.

[0004] FIG. 10 is a block diagram showing an example of such a conventional speech translation device. In the figure, 1001 is a control unit, 1002 is a speech segment extraction unit, 1003 is a speech recognition unit, 1004 is a display unit, 1005 is a speech synthesis unit, 1006 is a memory card for translated-word data, 1007 is a speech recognition dictionary unit, 1008 is a speaker, 1009 is a microphone, 1010 is a speaker amplifier, and 1011 is an operation signal.

[0005] The control unit 1001 is built around a microprocessor and controls each part of the device.

[0006] The speech segment extraction unit 1002 converts the speech input from the microphone 1009 into a digital signal, extracts the speech segment, and sends it to the speech recognition unit 1003.

[0007] On instruction from the control unit 1001, which has received an operation signal 1011 input via a keyboard, switch, or the like, the speech recognition unit 1003 analyzes the speech that was input from the microphone 1009 and extracted by the speech segment extraction unit 1002. It then compares the result with the standard speech patterns stored in the speech recognition dictionary unit 1007 to perform speech recognition.

[0008] The speech synthesis unit 1005 reads the translated word corresponding to the speech recognized by the speech recognition unit 1003 from the translated-word data memory card 1006, converts it into a speech signal, and outputs it as speech via the speaker amplifier 1010 and the speaker 1008.

[0009] The display unit 1004 displays instructions for the user of the translation device, shows translated words as text, and so on.

[0010] The translated-word data memory card 1006 consists of a ROM, flash memory, hard disk, or the like, and stores speech data for cases in which translated words are output as synthesized speech. Character codes corresponding to the translated words are also read from this memory card 1006 and displayed on the display unit 1004. Replacing the memory card 1006 with one for another language makes translation into multiple languages possible.

[0011] The speech recognition dictionary unit 1007 consists of ROM, RAM, or the like, and stores standard speech patterns corresponding to the operator's utterances. The operator registers and stores these standard speech patterns in advance.

[0012] Techniques that combine image information when recognizing speech information are also disclosed in, for example, Japanese Patent Application Laid-Open No. 60-188998 and Japanese Patent Application Laid-Open No. 63-191198, but none of them discloses a concrete recognition method that enables accurate translation.

[0013]

[Problems to Be Solved by the Invention] Although the conventional portable speech translation device described above recognizes speech uttered by the operator, it is functionally no different from set-sentence translators such as spoken phrase books, in which sentences used in conversation are registered in advance as speech data and the required sentence is selected and played back to suit the situation. That is, the user can play a question or request one-sidedly in the other party's language, but the device cannot recognize and translate the natural conversational speech of an unspecified other party. As a result, the user may fail to understand what the other party is saying, and the conversation breaks down. For a portable speech recognition device, having the other party's words translated is more important than translating what the user wants to say. Achieving this requires solving many difficult technical problems.

[0014] To recognize and translate continuous speech consisting of relatively short sentences, such as conversations at overseas travel destinations, the device must clearly recognize the individual words that make up the continuous speech (in Japanese, for example, the distinction between nouns, verbs, particles, and so on) and the breaks within the sentence; otherwise it cannot translate correctly. One option is to have the speaker input speech one phrase at a time, but pausing too long between phrases makes the conversation unnatural. In other words, the device must recognize and translate speech that is input as continuous speech consisting of relatively short sentences, spoken with only slight attention to phrase boundaries.

[0015] Accordingly, an object of the present invention is to provide a speech recognition and translation device that recognizes the phrases of a sentence consisting of a series of input speech, and the words that make up the sentence, and can thereby recognize conversational speech and translate it correctly, in order to realize a portable speech recognition translator that enables mutual speech recognition and translation that feels as conversational as possible.

[0016]

[Means for Solving the Problems] To achieve the above object, the speech recognition and translation device according to the present invention is a speech image recognition and translation device comprising: means for capturing speech and images; a memory for storing the captured speech data; a speech recognition processing unit that extracts features of the captured series of speech and performs speech recognition processing; a memory for storing the captured image data; an image recognition processing unit that extracts features of the captured series of images and performs image recognition processing; and a translation processing unit that translates the words and sentences obtained by speech recognition and image recognition into the desired target words and sentences. The device recognizes the phrases of a sentence consisting of a series of input speech, and the words that make up the sentence, from the correlation between two things, namely the time-varying features of the speech and the image features corresponding to those speech features, and thereby performs speech recognition and translation. The device is characterized in that, when the features of the mouth image corresponding to the speech indicate the end of a phrase of the sentence, it judges that span to be a phrase of the sentence consisting of the series of input speech, recognizes the character or character string at the end of the phrase, and further adds a mark indicating the phrase. In this way, the phrases and endings (particles, in Japanese) of the spoken input sentence can be recognized accurately, which improves translation accuracy.

[0017] Specifically, a series of speech and images is captured over the same time interval at the same sampling frequency. When, in the course of the extracted speech features changing from a consonant to a vowel and then from the vowel to silence, the features of the image of the whole mouth and its surrounding area corresponding to those speech features indicate the end of a phrase of the sentence, the device judges that span to be a phrase of the sentence consisting of the series of input speech.

[0018] Likewise, for a series of Japanese speech and images captured over the same time interval at the same sampling frequency, when, in the course of the extracted speech features changing from a consonant to a vowel and then from the vowel to silence, the features of the image of the whole mouth and its surrounding area corresponding to those speech features indicate the end of a phrase of the sentence, the device judges that span to be a phrase of the sentence consisting of the series of input speech and also recognizes the particle at the end of that phrase.
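
The phrase-boundary rule of [0017] and [0018] can be illustrated with a minimal sketch. It is not taken from the patent: it assumes each frame has already been labeled as consonant, vowel, or silence by the speech side and carries a flag from the image side saying whether the mouth image indicates a phrase end. The `Frame` type, the labels "C"/"V"/"S", the `mouth_closed` flag, and the "/" mark are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    sound: str          # assumed label: "C" consonant, "V" vowel, "S" silence
    mouth_closed: bool  # assumed image feature: mouth image indicates a phrase end
    text: str = ""      # recognized character(s) for this frame, if any


def mark_phrases(frames: List[Frame], mark: str = "/") -> str:
    """Insert a phrase mark where consonant -> vowel -> silence coincides
    with a mouth image that indicates the end of a phrase."""
    out = []
    runs = []        # run-length-collapsed history of sound labels
    marked = False   # at most one mark per silence run
    for f in frames:
        if not runs or runs[-1] != f.sound:
            runs.append(f.sound)
            marked = False
        out.append(f.text)
        if not marked and runs[-3:] == ["C", "V", "S"] and f.mouth_closed:
            out.append(mark)
            marked = True
    return "".join(out)


frames = [
    Frame("C", False), Frame("V", False, "ha"), Frame("S", True),
    Frame("C", False), Frame("V", False, "ka"), Frame("S", False),
]
print(mark_phrases(frames))  # ha/ka
```

Only the first silence is marked, because only there do the speech transition and the mouth image agree; the second silence has no supporting image feature and is left unmarked.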

[0019] The memory for storing the captured speech data consists of a memory (1) that captures sound data at all times and a memory (2) that captures speech data once speech is input. Likewise, the memory for storing the captured image data consists of a memory (3) that captures image data at all times and a memory (4) that captures image data once speech is input. When the device judges that speech has been input, it starts capturing speech data and image data; when it judges that the speech input has ended, it stops capturing speech data and image data and performs speech recognition and translation on the stored speech data and image data.

[0020] The process of starting the capture of speech data and image data when speech is judged to have been input, and the process of reading out the stored speech data and image data, are as follows. Sound data is captured at all times into memory (1) at an arbitrarily set period (sampling frequency) at which speech features appear. When the difference ΔP_i between the intensity P_i of the sound data captured at time T_i and the intensity P_(i-1) of the sound data captured at time T_(i-1), one period earlier, exceeds an arbitrarily set boundary value P_th for sound data intensity, the sound data is judged to be speech data, and the sound data from the next sample onward is written sequentially to memory (2). For image capture, from the moment the sound data is judged to be speech data, writing of image data to memory (3), which has been capturing image data at all times, ends, and the image data from the next frame onward is written sequentially to memory (4). The speech data stored in memory (1) and the speech data stored in memory (2) are then read out together in order along the time axis t to form speech waveform data, and the image data stored in memory (3) and the image data stored in memory (4) are read out together in order along the time axis t to form image data showing a series of mouth movements.
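
The onset rule of [0020] can be sketched for the sound side as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation: memory (1) is modeled as a fixed-size ring buffer whose length `pre_roll` is a hypothetical parameter (the text gives no size), ΔP_i is taken as the signed rise P_i − P_(i-1), and end-of-speech detection is left out because the text does not specify its rule.

```python
from collections import deque
from typing import Iterable, List, Tuple


def capture_utterance(
    intensities: Iterable[float], p_th: float, pre_roll: int = 4
) -> Tuple[List[float], List[float]]:
    """Return the contents of memory (1) and memory (2) after capture."""
    memory1 = deque(maxlen=pre_roll)  # memory (1): always capturing, keeps latest samples
    memory2: List[float] = []         # memory (2): filled only after speech onset
    prev = None
    speech = False
    for p in intensities:
        if speech:
            memory2.append(p)         # samples after the onset sample
            continue
        memory1.append(p)
        # Onset: the rise from the previous sample exceeds the boundary value P_th.
        if prev is not None and (p - prev) > p_th:
            speech = True
        prev = p
    return list(memory1), memory2


m1, m2 = capture_utterance([0.1, 0.1, 0.2, 0.1, 0.1, 0.9, 0.8, 0.7], p_th=0.5)
waveform = m1 + m2  # read out memory (1) then memory (2) along the time axis
print(waveform)     # [0.2, 0.1, 0.1, 0.9, 0.8, 0.7]
```

Keeping a short always-on buffer means the samples just before the detected onset are not lost, which is the apparent purpose of pairing memory (1) with memory (2). The image side (memories (3) and (4)) would follow the same pattern, switched by the same onset decision.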

[0021] In addition, a recognition result correction unit is provided that corrects or amends the recognition results for the words and sentences obtained by speech recognition and image recognition.

[0022] The image features are extracted from the person's face image, and from the image of the whole mouth and its surrounding area, after normalization by resolution conversion and binarization of the image data.

[0023] The means for capturing speech and images is formed by integrating a camera and a microphone, yielding a portable speech image recognition and translation device.

[0024] Using a speech recognition and translation device configured as described above yields a portable speech recognition translator that enables mutual speech recognition and translation that feels as conversational as possible.

[0025]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings.

[0026] FIG. 1 is a block diagram showing the configuration of a speech image recognition and translation device according to one embodiment of the present invention. The device shown in FIG. 1 is a portable speech recognition translator. Such a device can be built from several LSIs, such as a CPU, memory, and dedicated ICs. It can also be implemented as a chip on a semiconductor element.

[0027] In FIG. 1, 101 is a directional microphone for capturing speech. It captures conversational speech exchanged at overseas travel destinations, for example at airports, in train stations, on airplanes, in hotels, at tourist sites, in restaurants, and while shopping.

[0028] 102 is a 16-bit analog-to-digital (A/D) conversion IC. It takes the continuous analog signal of the speech data, from which sounds outside the speech band have been removed as noise by the filter and amplifier in the microphone 101, samples it at the speech sampling frequency, for example 12 kHz, and converts it into a digital signal.

[0029] 103 is a speech capture unit. It performs serial-to-parallel conversion on the 16-bit speech data sampled by the A/D conversion IC 102 and temporarily stores the result in a register or the like.
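
The serial-to-parallel step of [0029] can be illustrated in software as packing a bit stream into 16-bit words. This is a sketch only: the patent does not state the bit order, so most-significant-bit-first is an assumption, and the function name `serial_to_parallel` is hypothetical.

```python
from typing import List, Sequence


def serial_to_parallel(bits: Sequence[int], width: int = 16) -> List[int]:
    """Pack a serial bit stream into width-bit words (MSB first, an assumption).

    Trailing bits that do not fill a whole word are dropped.
    """
    words = []
    for start in range(0, len(bits) - width + 1, width):
        word = 0
        for b in bits[start:start + width]:
            word = (word << 1) | (b & 1)  # shift in one bit at a time
        words.append(word)
    return words


stream = [0] * 15 + [1] + [1] * 16
print(serial_to_parallel(stream))  # [1, 65535]
```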

[0030] 104 is a memory for storing the speech data captured by the speech capture unit 103, for example the continuous speech data for one phrase of conversational speech. It has the minimum capacity needed to hold that continuous speech data. Writing the continuous speech data to the memory may be done by software running on a CPU or the like, or by dedicated hardware.

[0031] 105 is a speech recognition processing unit. It performs a series of speech recognition steps on the continuous speech data written to the memory 104: digital filtering, speech analysis, speech segment detection, matching, decision, and so on. The acoustic model data, dictionary data, and grammar data needed for speech recognition are registered and stored in a memory or the like within the speech recognition processing unit 105. Speech recognition may be performed in software on a CPU, DSP, or the like, or by dedicated hardware.

[0032] 106 is a high-resolution camera for capturing images, for example a CCD camera. In step with the conversational speech exchanged at overseas travel destinations (airports, train stations, airplanes, hotels, tourist sites, restaurants, shopping, and so on), it captures the mouth movements of the person speaking as image data.

[0033] 107 is a 16-bit analog-to-digital (A/D) conversion IC. It samples the analog signal from the CCD camera 106 in synchronization with the speech sampling frequency, for example at 12 kHz, and converts it into a digital signal.

[0034] 108 is an image capture unit. It performs serial-to-parallel conversion on the 16-bit image data sampled by the A/D conversion IC 107 and temporarily stores the result in a register or the like.

[0035] 109 is a memory for storing the image data captured by the image capture unit 108, for example the continuous image data for one phrase of conversational speech. It has the minimum capacity needed to hold that continuous image data. Writing the continuous image data to the memory may be done by software running on a CPU or the like, or by dedicated hardware.

[0036] 110 is an image recognition processing unit. It performs a series of image recognition steps on the continuous image data written to the memory 109: digital filtering, image conversion, binarization, image analysis, feature extraction, matching, decision, and so on. The image model data, dictionary data, and grammar data needed for image recognition are registered and stored in a memory or the like within the image recognition processing unit 110. Image recognition may be performed in software on a CPU, DSP, or the like, or by dedicated hardware. The result of image recognition is passed to the speech recognition processing unit 105.

[0037] 111 is a translation processing unit that translates the conversational speech recognition result output from the speech recognition processing unit 105 into the desired target language. For Japanese, for example, the recognition result output from the speech recognition processing unit 105 is a text sentence in mixed kana and kanji made up of nouns, particles, verbs, adverbs, and so on. Translation applies syntactic analysis to this mixed kana-kanji sentence, generates a sentence from dictionaries, grammar rules, examples, and the like, and outputs the translation result.

[0038] 112 is a speech synthesis processing unit that converts the translation result output from the translation processing unit 111 into speech suited to conversation and outputs it. To make the conversational speech more natural, the speech synthesis processing unit 112 optimizes the pronunciation and accent of the words making up the sentence, as well as the intonation of the sentence as a whole, when synthesizing the speech, so that the output is natural speech that is easy for the other party to understand.

[0039] 113 is a 16-bit digital-to-analog (D/A) conversion IC. It converts the digital speech signal output from the speech synthesis processing unit 112 into an analog signal in a speech frequency band of 12 kHz, for example via a low-pass filter (LPF).

[0040] 114 is a liquid crystal display (LCD) for displaying intermediate speech recognition results and translation results as text.

[0041] 115 is a speaker for outputting, as synthesized speech, the speech recognition results, their intermediate progress, and the translation results.
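
The data flow among blocks 105, 110, 111, and 112 described above can be summarized in a short sketch. It only mirrors the order in which results are handed from one unit to the next; the function `run_pipeline` and its callable parameters are hypothetical, and the stub recognizers in the usage example are placeholders, not real recognition or translation.

```python
from typing import Any, Callable, Sequence, Tuple


def run_pipeline(
    speech: Sequence[Any],
    images: Sequence[Any],
    recognize_images: Callable[[Sequence[Any]], Any],
    recognize_speech: Callable[[Sequence[Any], Any], str],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> Tuple[str, bytes]:
    image_result = recognize_images(images)        # image recognition processing unit 110
    text = recognize_speech(speech, image_result)  # unit 105 receives the result of unit 110
    translated = translate(text)                   # translation processing unit 111
    audio = synthesize(translated)                 # speech synthesis processing unit 112
    return translated, audio                       # text for LCD 114, audio for speaker 115


# Usage with toy stand-ins for each unit.
text, audio = run_pipeline(
    speech=[0, 1, 2],
    images=["frame0", "frame1"],
    recognize_images=lambda imgs: {"phrase_ends": [1]},
    recognize_speech=lambda s, hint: "hello",
    translate=lambda t: t.upper(),
    synthesize=lambda t: t.encode("utf-8"),
)
print(text, audio)  # HELLO b'HELLO'
```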

[0042] FIG. 2 illustrates an example of a method that supports speech recognition by capturing images of the speaker's mouth movements, recognizing those images, and using them together with speech recognition to decode the content of the speech.

[0043] The camera 106 shown in FIG. 1, for example a CCD camera, captures images of the speaker's mouth movements in synchronization with the speech input.

[0044] 201 is a still image, captured in synchronization with the speech input, of the person speaking at an arbitrary moment. The image resolution is m dots high × n dots wide × l bits deep. Image 201 is captured in sync with the speech input: for example, if the speech sampling frequency is set to 12 kHz and the frame used to extract speech features contains 240 sampling points, images are captured in sync with the speech input in 20 ms frame units. Applying the 20 ms speech frame to image capture means that one image frame is captured every 20 ms, so 50 image frames are processed per second. This gives a higher time-axis resolution than the 30 frames per second of current video such as NTSC.
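
The frame arithmetic in [0044] can be checked directly. The helper name `frame_timing` is hypothetical; the numbers are those given in the text.

```python
from typing import Tuple


def frame_timing(sample_rate_hz: int, samples_per_frame: int) -> Tuple[float, float]:
    """Return (frame length in ms, frames per second)."""
    frame_ms = 1000 * samples_per_frame / sample_rate_hz
    frames_per_second = sample_rate_hz / samples_per_frame
    return frame_ms, frames_per_second


frame_ms, fps = frame_timing(12_000, 240)
print(frame_ms, fps)  # 20.0 50.0
```

At 50 frames per second the image stream is sampled more finely in time than 30 frames per second video, which is the comparison the paragraph draws.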

[0045] 202 is an image obtained from the captured image 201 by grayscale conversion (from a color image to a grayscale image) followed by binarization to produce a binary image. It is prepared in order to extract the features of the mouth shape.

[0046] 203 is an image obtained by applying normalization, edge detection, and smoothing to image 202 to extract features such as the contour of the mouth. The resolution of image 203 is set to the minimum resolution needed to represent the image features adequately.
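
The processing chain from image 201 to image 203 can be sketched on a tiny image stored as nested lists. This is an illustrative sketch, not the patent's algorithm: the luma weights (the common ITU-R BT.601 values), the fixed threshold of 128, and the 4-neighbour edge rule are assumptions, and the normalization and smoothing steps are omitted for brevity.

```python
from typing import List, Tuple

RGBImage = List[List[Tuple[int, int, int]]]
GrayImage = List[List[int]]


def to_gray(rgb: RGBImage) -> GrayImage:
    # Weighted sum of R, G, B (BT.601 luma weights, an assumed choice).
    return [[(299 * r + 587 * g + 114 * b) // 1000 for (r, g, b) in row] for row in rgb]


def binarize(gray: GrayImage, threshold: int = 128) -> GrayImage:
    # 1 for pixels at or above the (assumed) threshold, 0 otherwise.
    return [[1 if v >= threshold else 0 for v in row] for row in gray]


def edges(binary: GrayImage) -> GrayImage:
    """Keep foreground pixels that touch background or the image border."""
    h, w = len(binary), len(binary[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if binary[y][x] == 0:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if not (0 <= ny < h and 0 <= nx < w) or binary[ny][nx] == 0:
                    out[y][x] = 1
                    break
    return out


white, black = (255, 255, 255), (0, 0, 0)
rgb = [[white if 1 <= y <= 3 and 1 <= x <= 3 else black for x in range(5)]
       for y in range(5)]
contour = edges(binarize(to_gray(rgb)))
for row in contour:
    print(row)
```

For the 3 × 3 bright block in this example, the eight outer pixels form the contour and the center pixel is dropped, which is the kind of outline from which a mouth-shape feature could then be derived.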

[0047] FIG. 3 illustrates the configuration of a microphone with a built-in camera for capturing speech and images. It integrates the microphone 101 and the camera 106 of the speech image recognition and translation device shown in FIG. 1. For a device such as a portable speech recognition translator in particular, reducing the parts count, the cost, and the price is important. In addition, because mouth movements are recognized from images as described with reference to FIG. 2, building the camera into the microphone that captures the speech makes it relatively easy to capture images of the mouth speaking toward the microphone.

[0048] 301 is the body of a microphone with a built-in camera for capturing speech and images. Its outer shape may be cylindrical or rectangular. 302 is a microphone, such as a condenser microphone or a resistive microphone. 303 is a lens, tuned so that images of the speaker's face and mouth can be captured. 304 is a CCD camera, which captures the image of the speaker's face and mouth entering through the lens 303.

[0049] 305 is a cable for transmitting speech and image data, comprising a speech signal cable 306 and an image signal cable 307. In the speech image recognition and translation device shown in FIG. 1, the speech signal cable 306 connects to the A/D conversion IC 102, and the image signal cable 307 connects to the A/D conversion IC 107.

[0050] FIG. 4 illustrates how, in the speech image recognition and translation device according to the present invention, a spoken conversational sentence is actually recognized through speech recognition and image recognition and then translated.

【００５１】The example conversational sentence that is input by voice is "Kouen wa doko desu ka" (コウエンハドコデスカ, "Where is the park?"). The input speech is uttered at a natural conversational speed, and is pronounced carefully and clearly in order to raise the recognition rate.

【００５２】Reference numeral 401 denotes the speech waveform of "Kouen wa doko desu ka" in the example conversational sentence. The amplitude of the speech waveform along the time axis t represents the speech intensity. If the speech sampling frequency for this waveform is set to 12 kHz and the frame used for extracting speech features is set to 240 sampling points, speech is captured in frame units of 20 ms.
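The 20 ms figure follows directly from 240 samples divided by 12,000 samples per second. The short Python sketch below reproduces that arithmetic and cuts a sample sequence into frames of that size. The function names are hypothetical, and the use of non-overlapping frames is an assumption; the text does not say whether frames overlap.

```python
def frame_duration_ms(sampling_hz: int, samples_per_frame: int) -> float:
    """Duration of one feature-extraction frame in milliseconds."""
    return 1000.0 * samples_per_frame / sampling_hz


def split_into_frames(samples, samples_per_frame=240):
    """Cut a sample sequence into consecutive frames.

    Assumption: frames do not overlap, and a trailing partial frame is dropped.
    """
    n_frames = len(samples) // samples_per_frame
    return [
        samples[k * samples_per_frame:(k + 1) * samples_per_frame]
        for k in range(n_frames)
    ]


if __name__ == "__main__":
    print(frame_duration_ms(12_000, 240))            # duration of one frame
    print(len(split_into_frames(list(range(1000)))))  # number of full frames
```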

【００５３】Reference numerals 402 to 404 denote speech feature frames in 20 ms units, obtained when the frame used for extracting speech features is set to 240 sampling points. The speech features are those extracted by the speech analysis generally employed in speech recognition. The feature represented by speech feature frame 402 is the dynamic speech feature of "kouen wa" (コウエンハ). The feature represented by speech feature frame 403 is the static speech feature of "wa" (ハ). The feature represented by speech feature frame 404 is again a dynamic speech feature, that of "doko desu ka" (ドコデスカ).

【００５４】Speech recognition can be performed from speech features extracted in this way, but misrecognition tends to occur at the beginning (word onset) and end (word ending) of speech. For example, "kouen wa" may become "kouen e", or "doko desu ka" may become "koko desu ka". Moreover, when this speech recognition translator is used at overseas travel destinations, ambient noise lowers the recognition rate further. Translation accuracy, in turn, depends on the result of speech recognition. Taking translation from Japanese into English as an example, if misrecognition at a word onset or ending turns "kouen wa" into "kouen e" or "doko desu ka" into "koko desu ka", the sentence can no longer be translated correctly. Correct translation also becomes difficult if the phrases of the sentence cannot be recognized. Therefore, as a means of raising speech recognition accuracy for translation, attention was paid to the movement of the mouth at the beginning (word onset) and end (word ending) of speech. Reference numerals 405 to 407 denote images of mouth movement that are captured in correspondence with the 20 ms speech feature frames 402 to 404 (240 sampling points per frame) and from which features are extracted by the processing described with reference to FIG. 2. The feature represented by image 405 is the image feature produced by the continuous, dynamic movement of the mouth shape for "kouen wa". The feature represented by image 406 is the image feature of the static mouth shape for "wa". The feature represented by image 407 is again the image feature produced by the continuous, dynamic movement of the mouth shape, this time for "doko desu ka". Accordingly, for the recognition of continuous speech such as a conversational sentence, image recognition of mouth movement raises the recognition accuracy at the beginning and end of speech, and recognition of phrases raises the translation accuracy. Features are extracted from a series of speech and images captured over the same period at the same sampling frequency. In the course of the extracted speech features changing from a consonant to a vowel and then from the vowel to silence, if the features of the image of the whole mouth and its vicinity corresponding to those speech features indicate the end of a phrase, the input is judged to be a phrase of the sentence made up of the series of input speech, and the character or character string at the end of the phrase is recognized. For Japanese, it is preferable to recognize the particle at the end of the phrase. Furthermore, image recognition is also performed on, for example, the image features produced by the continuous, dynamic movement of the mouth shape for "kouen wa", so that "公園は" (kouen wa) is recognized together with speech recognition. Alternatively, even when speech recognition fails, "公園は" is recognized from image recognition alone. An improvement in robustness can thereby be expected.
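The phrase-end decision described above combines two cues: the speech features pass from consonant to vowel to silence, and the mouth image at that moment indicates the end of a phrase. The Python sketch below is one possible reading of that rule, not the patent's implementation. It assumes each frame has already been labeled "C" (consonant), "V" (vowel), or "S" (silence) by the speech side, and that the image side supplies one boolean per frame. Both inputs and the function name are hypothetical.

```python
def phrase_end_frames(audio_labels, image_says_end):
    """Return indices of frames judged to be phrase ends.

    audio_labels   : per-frame labels, "C", "V" or "S" (assumed input format)
    image_says_end : per-frame booleans from mouth-image recognition (assumed)

    A frame counts as a phrase end when silence follows a consonant-to-vowel
    transition AND the image cue agrees at that frame.
    """
    ends = []
    seen_consonant = False
    seen_vowel = False
    for i, label in enumerate(audio_labels):
        if label == "C":
            seen_consonant = True
            seen_vowel = False      # wait for the vowel that follows
        elif label == "V":
            if seen_consonant:
                seen_vowel = True
        elif label == "S":
            if seen_consonant and seen_vowel and image_says_end[i]:
                ends.append(i)
            seen_consonant = False  # start looking for the next phrase
            seen_vowel = False
    return ends
```

Requiring both cues means that a short pause inside a phrase, with no matching mouth-shape cue, is not treated as a boundary. How the two cues would be weighted in practice is left open by the text.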

【００５５】Reference numerals 408 to 412 denote the speech recognition result for the example conversational sentence "Kouen wa doko desu ka". The result is output as a text sentence in mixed kana and kanji. By recognizing the words and phrases that make up the sentence, a slash 409 is inserted between "公園" (park) 408 and "は" (wa) 410, and a slash 411 is inserted between "は" 410 and "どこですか" (doko desu ka) 412. This simplifies the translation process and raises the translation accuracy.

【００５６】Reference numeral 413 denotes the translation result for the recognition result "公園／は／どこですか", and "Where is the park?" is output as text. This translation result 413 is then speech-synthesized and output as voice.

【００５７】FIG. 5 is a diagram for explaining how voice and images are captured in the speech image recognition and translation apparatus according to the present invention.

【００５８】FIG. 5(a) shows the speech waveform of "Yamada" (山田) as actually uttered by the inventor. FIG. 5(b) shows the memories used for capturing voice and images.

【００５９】In FIG. 5(a), the horizontal axis (rightward in the figure) represents time t, and the vertical axis (up and down in the figure) represents the amplitude of the speech waveform.

【００６０】The waveform shown in section 501 is the waveform of the sound while sound is being captured continuously. The waveform shown in zone 1 is that of the silent state, and the waveform shown in zone 2 is the first speech waveform, where speech begins from the silent state. While no speech is being uttered, waveforms like the one in zone 1 continue, so a waveform like that of zone 1 also appears in zone 2. Consider, then, a memory of the minimum necessary capacity that records only a certain time span of the continuously observed sound waveform, always storing the newest sound data and discarding the oldest. For example, it is given enough capacity to store about 0.1 second of sound data.

【００６１】Memory (1) 506 in FIG. 5(b) is a memory of the minimum necessary capacity n that records only a certain time span of the continuously observed sound waveform, always storing the newest sound data and discarding the oldest. Reference numeral 507 denotes the write address WA of memory (1) 506, and 508 denotes its read address RA. When sound is captured, the sound data is written to the address indicated by the write address WA 507, and the write address WA 507 is then incremented. This operation is repeated in order from the first address of memory (1) 506; once the last address has been written, writing returns to the first address and continues in the same way. For example, the waveform data of the silent zone 1 in section 501 of FIG. 5(a) is written to area 1 of memory (1) 506 in FIG. 5(b). Likewise, the first speech waveform data of zone 2 in section 501 of FIG. 5(a), where speech begins from the silent state, is written to area 2 of memory (1) in FIG. 5(b). In section 501, the data of zone 2 is drawn ahead of the data of zone 1 in order to show that, in memory (1) 506, the data of area 2 is newer than the data of area 1. For speech recognition, therefore, the data of area 1 is read before the data of area 2, as in the waveform data shown in section 502. This makes it possible to capture sound data at all times. Reading sound data from memory (1) 506 is described later.
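Memory (1) as described behaves like a ring (circular) buffer: the write address advances by one per sample and wraps to the first address after the last one, so the oldest data is overwritten first. A minimal Python sketch of that behavior follows. The class and method names are hypothetical, and a Python list stands in for the hardware memory.

```python
class RingBuffer:
    """Fixed-capacity buffer that overwrites its oldest entry (a sketch of memory (1))."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = [0] * capacity
        self.wa = 0       # write address WA: next slot to be written
        self.count = 0    # number of valid entries, at most `capacity`

    def write(self, sample) -> None:
        self.data[self.wa] = sample
        self.wa = (self.wa + 1) % self.capacity  # wrap to the first address
        self.count = min(self.count + 1, self.capacity)

    def read_oldest_first(self):
        """Return the stored samples in time order, oldest first."""
        if self.count < self.capacity:
            return self.data[:self.count]
        # Once full, the slot at WA holds the oldest sample.
        return self.data[self.wa:] + self.data[:self.wa]


if __name__ == "__main__":
    rb = RingBuffer(4)
    for s in range(6):
        rb.write(s)
    print(rb.read_oldest_first())
```

Reading "oldest first" corresponds to reading area 1 before area 2 in the description, even though area 2 may sit at a lower address after a wrap.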

【００６２】Next, once the first speech waveform, where speech begins from the silent state, has been observed, the sound data captured from then on is judged to be speech and is stored in a separate memory. Section 503 in FIG. 5(a) is this speech data. This allows the memory capacity to be reduced substantially.

【００６３】Memory (2) 509 in FIG. 5(b) is the memory that stores this speech data: once the first speech waveform from the silent state has been observed, the sound data captured from then on is judged to be speech and is stored here. Reference numeral 510 denotes the write address WA of memory (2) 509, and 511 denotes its read address RA. For recognition at roughly the word level, for example, the memory is given enough capacity to store 3 to 5 seconds of speech data. Area 3 of memory (2) stores the speech data 3 of section 503 in FIG. 5(a). If the sound returns from speech to the silent state and is judged to be silence, writing of speech data to memory (2) 509 may be stopped. Through the speech capture process described above, the series of speech waveform data 1, 2, and 3 in sections 502 and 503 of FIG. 5(a) is written to memory (1) 506 and memory (2) 509 of FIG. 5(b), and the speech waveform data written to memories (1) and (2) can be read out in time series as shown at 512.
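To give a feel for the capacities mentioned (about 0.1 second for memory (1), 3 to 5 seconds for memory (2)), the sketch below converts a duration into a sample count and a byte count. The 12 kHz rate is the example value used elsewhere in the description. The 16-bit sample width is an assumption borrowed from the A/D converter described for the FIG. 9 embodiment, and the function names are hypothetical.

```python
def samples_needed(seconds: float, sampling_hz: int = 12_000) -> int:
    """Number of samples covering `seconds` of audio."""
    return round(seconds * sampling_hz)


def bytes_needed(seconds: float, sampling_hz: int = 12_000,
                 bits_per_sample: int = 16) -> int:
    """Storage in bytes, assuming fixed-width samples (16 bits is an assumption)."""
    return samples_needed(seconds, sampling_hz) * bits_per_sample // 8


if __name__ == "__main__":
    for label, secs in [("memory (1)", 0.1), ("memory (2), low", 3), ("memory (2), high", 5)]:
        print(label, samples_needed(secs), "samples,", bytes_needed(secs), "bytes")
```

Under these assumptions the pre-trigger buffer is small compared with the utterance buffer. This matches the description's point that only a short history needs to be kept at all times.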

【００６４】Memory (3) 513 in FIG. 5(b) is, like memory (1) 506, a memory of the minimum necessary capacity m. In correspondence with the continuously observed sound waveform, it records only a certain time span of mouth-movement image data, always storing the newest image data and discarding the oldest. Reference numeral 514 denotes the write address WA of memory (3) 513, and 515 denotes its read address RA. When an image is captured, the image data is written to the address indicated by the write address WA 514, and the write address WA 514 is then incremented. This operation is repeated in order from the first address of memory (3) 513; once the last address has been written, writing returns to the first address and continues in the same way. For example, the mouth-movement image data corresponding to the waveform data of the silent zone 1 in section 501 of FIG. 5(a) is written to area 1 of memory (3) 513 in FIG. 5(b).

【００６５】Likewise, the mouth-movement image data corresponding to the first speech waveform data of zone 2 in section 501 of FIG. 5(a), where speech begins from the silent state, is written to area 2 of memory (3) 513 in FIG. 5(b). In section 501, the data of zone 2 is drawn ahead of the data of zone 1 in order to show that, in memory (3) 513, the data of area 2 is newer than the data of area 1. To perform image recognition that corresponds to the speech recognition, therefore, the data of area 1 in memory (3) 513 is read before the data of area 2, just as for the waveform data shown in section 502. This makes it possible to capture, at all times, the image data that corresponds to the sound data.

【００６６】Next, once the first speech waveform from the silent state has been observed, the sound data captured from then on is judged to be speech, and the image data is likewise stored in a separate memory. Section 503 in FIG. 5(a) is speech data, and there is image data representing the dynamic mouth movement that corresponds to this speech data. This allows the memory capacity to be reduced substantially.

【００６７】Memory (4) 516 in FIG. 5(b) is the memory that stores the image data corresponding to the speech data: once the first speech waveform from the silent state has been observed and the subsequent sound data is judged to be speech, the corresponding image data is stored here. Reference numeral 517 denotes the write address WA of memory (4) 516, and 518 denotes its read address RA. Area 3 of memory (4) 516 stores the image data corresponding to the speech data 3 of section 503 in FIG. 5(a). If the sound returns from speech to the silent state and is judged to be silence, writing of image data to memory (4) 516 may be stopped.

【００６８】Through the image capture process described above, the image data corresponding to the series of speech waveform data 1, 2, and 3 in sections 502 and 503 of FIG. 5(a) is written to memory (3) 513 and memory (4) 516 of FIG. 5(b), and the data written to memories (3) and (4) can be read out in time series as shown at 519.

【００６９】FIG. 6 is a flowchart showing the voice and image capture process of FIG. 5.

【００７０】In processing step 601, sound data is continuously captured and written to memory (1), and image data is continuously captured and written to memory (3). The sound data is, for example, sound data containing speech sampled at 12 kHz. The image data is face image data of a person, including mouth movement, sampled in synchronization with the sound capture.

【００７１】In processing step 602, the amplitude P of the sound data written to memory (1) is observed for each sample. If the amplitude P(i) of the current sample does not exceed an arbitrarily set threshold level Pth, the write addresses WA 507 and 514 are incremented and processing returns to step 601. If the amplitude P(i) exceeds the threshold level Pth, the sound data is judged to be speech.

【００７２】In processing step 603, when the sound data has been judged to be speech in step 602, the write addresses WA 507 and 514 of memory (1) 506 and memory (3) 513 are stored.

【００７３】In processing step 604, the sound data sampled next is judged to be speech data. Speech data is written to memory (2) 509 starting from P(i+1), and image data is written to memory (4) 516 starting from I(i+1). In this way, the speech data and image data needed for speech recognition can be captured.

【００７４】In processing step 605, the speech data written to memory (1) 506 and memory (2) 509 and the image data written to memory (3) 513 and memory (4) 516 are read out for speech recognition. The address following the write address WA 507 of memory (1) 506 stored in step 603 is taken as the initial read address RA (= WA + 1) 508 of memory (1) 506, and all the speech data stored in memory (1) 506 is read out. Similarly, the address following the write address WA 514 of memory (3) 513 stored in step 603 is taken as the initial read address RA (= WA + 1) 515 of memory (3) 513, and all the image data stored in memory (3) 513 is read out.

【００７５】Finally, in processing step 606, all the speech data stored in memory (2) 509, which holds most of the speech data, is read out starting from the initial read address RA (first address) 511. Likewise, all the image data stored in memory (4) 516, which holds most of the image data, is read out starting from the initial read address RA (first address) 518.
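Steps 601 to 606 can be traced in a small simulation. The Python sketch below is an illustration under stated assumptions, not the patent's implementation: lists stand in for memories (1) to (4), the audio and image streams are given as two parallel sequences with one image entry per audio sample, and the function and variable names are hypothetical. The optional stop-on-silence behavior for memories (2) and (4) is omitted.

```python
def capture_utterance(samples, images, ring_capacity, pth):
    """Simulate steps 601-606 and return (audio, images) in time order."""
    mem1 = [None] * ring_capacity   # memory (1): audio ring buffer
    mem3 = [None] * ring_capacity   # memory (3): image ring buffer
    wa = 0                          # shared write address WA
    written = 0
    trigger = None
    for i, (p, img) in enumerate(zip(samples, images)):
        mem1[wa] = p                # step 601: always write audio ...
        mem3[wa] = img              # ... and the synchronized image
        written += 1
        if abs(p) > pth:            # step 602: amplitude exceeds Pth
            trigger = i
            break
        wa = (wa + 1) % ring_capacity
    if trigger is None:
        return [], []               # no speech observed
    stored_wa = wa                  # step 603: remember WA at the trigger
    mem2 = list(samples[trigger + 1:])   # step 604: later audio to memory (2)
    mem4 = list(images[trigger + 1:])    #           later images to memory (4)
    # Step 605: read the rings starting at RA = WA + 1 (the oldest slot).
    order = [(stored_wa + 1 + k) % ring_capacity for k in range(ring_capacity)]
    valid = [j for j in order if j < written]  # skip slots never written
    # Step 606: append memories (2) and (4) from their first address.
    audio = [mem1[j] for j in valid] + mem2
    video = [mem3[j] for j in valid] + mem4
    return audio, video
```

Because the ring buffers keep a short history, the returned sequences include the samples just before the threshold crossing. This is the point of storing WA in step 603 and starting the read at WA + 1.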

【００７６】FIG. 7 illustrates how it can be determined, while sound data is being captured continuously, whether the data is silence or speech. In FIG. 5(a), waveform 504 is the first waveform captured that is judged to be speech, and 505 is a trigger (flag) or signal generated at the moment the sound is judged to be speech. Seen in detail, this first waveform judged to be speech looks like waveform 701 in FIG. 7. Reference numeral 702 denotes the speech data P(i) sampled at time t(i), and 703 denotes the speech data P(i-1) sampled at time t(i-1).

【００７７】For example, the difference ΔP between the speech data P(i) 702 and the speech data P(i-1) 703 may be observed and compared with an arbitrarily set threshold level Pth in order to perform process 602 of FIG. 6. Alternatively, the integrated value ΣΔP of the differences ΔP may be observed and compared with an arbitrarily set threshold level Pth in order to perform process 602 of FIG. 6. Furthermore, the difference ΔP may be observed a plurality of times (k times) within a certain period, and the number of observations compared with an arbitrarily set count (threshold count) in order to perform process 602 of FIG. 6.
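The three alternatives for process 602 can be sketched as follows. These are hedged interpretations rather than the patent's definitions. The text does not say whether ΔP is signed, over what span ΣΔP is accumulated, or exactly what is counted in the third variant. The sketch assumes absolute differences, a sliding window for the sum, and a count of how many differences within the window exceed a level. All function and parameter names are hypothetical.

```python
def abs_diffs(samples):
    """|ΔP| between consecutive samples P(i-1) and P(i)."""
    return [abs(b - a) for a, b in zip(samples, samples[1:])]


def onset_by_difference(samples, pth):
    """Index i of the first sample whose |ΔP| exceeds Pth, or None."""
    for i, d in enumerate(abs_diffs(samples), start=1):
        if d > pth:
            return i
    return None


def onset_by_summed_difference(samples, pth, window):
    """Index where the sum of the last `window` |ΔP| values first exceeds Pth."""
    d = abs_diffs(samples)
    for end in range(1, len(d) + 1):
        if sum(d[max(0, end - window):end]) > pth:
            return end
    return None


def onset_by_count(samples, level, window, min_count):
    """Index where at least `min_count` of the last `window` |ΔP| values exceed `level`."""
    d = abs_diffs(samples)
    for end in range(1, len(d) + 1):
        recent = d[max(0, end - window):end]
        if sum(1 for x in recent if x > level) >= min_count:
            return end
    return None
```

The count-based variant is the least sensitive to a single spike, at the cost of firing slightly later. Which variant suits a given noise environment is not stated in the text.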

【００７８】FIG. 8 shows an example of how the portable speech image translator described above is used and what it looks like. FIG. 8(a) shows an overseas traveler using the portable speech translator according to the present invention. When talking with a shop clerk while shopping, for example, the traveler uses the display and the voice input/output means of the portable translator to translate what he or she says into a language the other party understands and so convey the intended meaning, and conversely to translate what the other party says into a language the traveler understands and so grasp the other party's meaning. The languages of the conversation, such as Japanese, English, German, French, Italian, Russian, and Chinese, are not limited; the languages of any country can be supported. FIG. 8(b) is an external view of the portable translator, and 801 denotes the main body of the portable translator.

【００７９】Reference numeral 802 denotes a multidirectional microphone. It is used to remove the noise of each location that is mixed into conversational speech in places such as airports, station premises, airplanes, vehicles such as buses, subways, and taxis, and buildings at tourist sites. When there is no conversational speech, it captures the overall ambient sound of the location.

【００８０】Reference numeral 803 denotes a directional microphone with a built-in CCD camera as shown in FIG. 3. It captures, as analog signals, the conversational speech exchanged at overseas travel destinations such as airports, station premises, airplanes, hotels, tourist sites, restaurants, and shops, together with images of the face including the movement of the mouth.

【００８１】Reference numeral 804 denotes the audio output means, which consists of a speaker or an earphone for sounding the content that has been recognized and translated.

【００８２】Reference numeral 805 denotes a display for showing the speech recognition result, its corrections and amendments, and the translation result.

【００８３】Reference numeral 806 denotes an IC card. It carries, stored in memory or on a hard disk, an acoustic model, a word dictionary, a grammar dictionary, a translation example dictionary, a speech dictionary for speech synthesis, and the like, for example for speech recognition and translation from Japanese into Chinese.

【００８４】Reference numeral 807 denotes an IC card. It carries, stored in memory or on a hard disk, an acoustic model, a word dictionary, a grammar dictionary, a translation example dictionary, a speech dictionary for speech synthesis, and the like, for example for speech recognition and translation from Chinese into Japanese.

【００８５】FIG. 9 is a block diagram showing the configuration of another embodiment of the speech image recognition and translation apparatus according to the present invention. The apparatus shown in FIG. 9 is, for example, a portable speech recognition translator. It may be a device made up of several LSIs such as a CPU, memory, and dedicated ICs, or it may be a chip formed on a single semiconductor element.

【００８６】In FIG. 9, reference numeral 901 denotes a directional microphone for capturing voice. It captures, for example, the conversational speech exchanged at overseas travel destinations such as airports, station premises, airplanes, hotels, tourist sites, restaurants, and shops.

【００８７】Reference numeral 902 denotes a 16-bit analog/digital (A/D) conversion IC. Sounds outside the voice band are removed by a filter and an amplifier in the microphone 901, and the resulting noise-processed, continuous analog speech signal is sampled at the speech sampling frequency, for example 12 kHz, and converted into a digital signal.

【００８８】Reference numeral 903 denotes a voice capture unit. It performs serial-to-parallel conversion on the 16-bit speech data sampled by the A/D conversion IC 102 and temporarily stores the result in a register or the like.

【００８９】Reference numeral 904 denotes a memory for storing the speech data captured by the voice capture unit 903, for example the continuous speech data for one phrase of conversational speech. It has the minimum capacity needed to hold that continuous speech data. The continuous speech data may be written to the memory either by software running on a CPU or the like, or by dedicated hardware.

【００９０】Reference numeral 905 denotes a speech recognition processing unit. It performs a series of speech recognition processes on the continuous speech data written to the memory 904, such as digital filtering, speech analysis, speech section detection, matching, and decision. The acoustic model data, dictionary data, and grammar data needed for speech recognition are registered and stored in a memory or the like within the speech recognition processing unit 905. The speech recognition processing may be performed either by software running on a CPU, DSP, or the like, or by dedicated hardware.

【００９１】Reference numeral 906 denotes a high-resolution camera for capturing images, for example a CCD camera. In step with the conversational speech exchanged at overseas travel destinations such as airports, station premises, airplanes, hotels, tourist sites, restaurants, and shops, the high-resolution camera 906 captures the mouth movement of the person speaking as image data.

【００９２】Reference numeral 907 denotes a 16-bit analog/digital (A/D) conversion IC. It samples the analog signal from the CCD camera 906 in synchronization with the speech sampling frequency, for example at 12 kHz, and converts it into a digital signal.

【００９３】Reference numeral 908 denotes an image reading unit. It performs serial-to-parallel conversion on the 16-bit image data sampled by the A/D conversion IC 907 and temporarily stores the result in a register or the like.

【００９４】Reference numeral 909 denotes a memory for storing the image data captured by the image capture unit 908, for example the continuous image data for one phrase of conversational speech. It has the minimum capacity needed to hold that continuous image data. The continuous image data may be written to the memory either by software running on a CPU or the like, or by dedicated hardware.

[0095] Reference numeral 910 denotes an image recognition processing unit, which performs a series of image recognition processes, such as digital filtering, image conversion, binarization, image analysis, feature extraction, collation, and determination, on the continuous image data written in the memory 909. The image model data, dictionary data, and grammar data required for image recognition are registered and stored in a memory or the like within the image recognition processing unit 910. The image recognition processing may be performed in software on a CPU, DSP, or the like, or by dedicated hardware. The result of the image recognition processing is passed to the speech recognition processing unit.
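As a rough illustration of the binarization and feature extraction steps named above, the following Python sketch thresholds a grayscale frame and measures the dark region. The threshold value, the treatment of dark pixels as the open mouth, and the choice of area, width, and height as features are all assumptions for illustration; the specification does not say which features the image recognition processing unit uses.

```python
from typing import List, Tuple

Frame = List[List[int]]  # grayscale rows, pixel values 0..255


def binarize(frame: Frame, threshold: int = 128) -> Frame:
    """Map each pixel to 1 if it is darker than `threshold`, else 0."""
    return [[1 if px < threshold else 0 for px in row] for row in frame]


def mouth_features(binary: Frame) -> Tuple[int, int, int]:
    """Return (area, width, height) of the region marked 1.

    Width and height are those of the bounding box around all marked pixels.
    """
    rows = [r for r, row in enumerate(binary) if any(row)]
    cols = [c for row in binary for c, v in enumerate(row) if v]
    if not rows:
        return (0, 0, 0)  # nothing dark enough, e.g. a closed mouth
    area = sum(sum(row) for row in binary)
    width = max(cols) - min(cols) + 1
    height = max(rows) - min(rows) + 1
    return (area, width, height)
```

Computing such a feature tuple for every frame would give a time series that could then be collated with the speech features, which is the kind of correlation the device relies on.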

[0096] Reference numeral 911 denotes a translation processing unit, which translates the recognition result of the conversational speech output from the speech recognition processing unit 905 into the target language. The recognition result output from the speech recognition processing unit 905 is, for example in the case of Japanese, a text sentence in mixed kana and kanji consisting of nouns, particles, verbs, adverbs, and so on. In the translation processing, this mixed kana-kanji sentence is subjected to syntactic analysis and to sentence generation based on dictionaries, grammar rules, examples, and the like, and the translation result is output.

[0097] Reference numeral 912 denotes a speech synthesis processing unit, which converts the translation result output from the translation processing unit 911 into speech suited to conversational sentences and outputs it. To make the conversational speech more natural, the speech synthesis processing unit 912 also optimizes the pronunciation and accent of the words constituting the sentence and the intonation of the sentence as a whole when synthesizing the speech, so that natural speech that is easy for the other party to understand is output.

[0098] Reference numeral 913 denotes a 16-bit digital/analog (D/A) conversion IC, which converts the digital speech signal output from the speech synthesis processing unit 912 into an analog signal in a 12 kHz audio frequency band, for example, via a low-pass filter (LPF).

[0099] Reference numeral 914 denotes a liquid crystal display (LCD) for displaying the intermediate progress of speech recognition and the translation result as text.

[0100] Reference numeral 915 denotes a speaker for outputting, as synthesized speech, the speech recognition result, its intermediate progress, and the translation result.

[0101] Reference numeral 916 denotes a recognition result correction unit, which modifies and corrects misrecognized portions of the speech and image recognition result output from the speech recognition processing unit 905. If a recognition result containing misrecognized portions were transferred immediately to the translation processing unit 911 as the recognition result of the conversational sentence to be translated, the translation would be wrong. The recognition result correction unit 916 therefore allows the user to correct the spoken sentence into the sentence they actually want translated before the recognition result is translated, which improves translation accuracy.

[0102]

[Effects of the Invention] As described above, the present invention makes it possible to realize a speech recognition translation device with high translation accuracy for the age of internationalization, as well as a portable speech recognition translator with high translation accuracy that assists mutual communication resembling natural conversation, at least to some degree, at overseas travel destinations and the like.

[Brief description of the drawings]

FIG. 1 is a block diagram showing one embodiment of a speech image recognition and translation device according to the present invention.

FIG. 2 is an explanatory diagram showing image capture and image processing in the speech image recognition device according to the present invention shown in FIG. 1.

FIG. 3 is a configuration diagram of a microphone with a built-in image capturing camera for the speech image recognition device according to the present invention shown in FIG. 1.

FIG. 4 is an explanatory diagram showing the speech and image capturing method and the speech image recognition and translation in the speech image recognition device according to the present invention shown in FIG. 1.

FIG. 5 is an explanatory diagram showing the speech and image capturing method of the speech image recognition device according to the present invention shown in FIG. 1.

FIG. 6 is a flowchart showing the speech and image capturing method in the speech image recognition device according to the present invention shown in FIG. 1.

FIG. 7 is an explanatory diagram showing the method of determining speech data in the speech image recognition device according to the present invention shown in FIG. 1.

FIG. 8 is an explanatory diagram showing an example in which the speech image recognition device according to the present invention shown in FIG. 1 is applied to a portable translator.

FIG. 9 is a block diagram showing another embodiment of the speech image recognition and translation device according to the present invention.

FIG. 10 is a block diagram showing the configuration of a conventional portable speech translation device.

[Explanation of symbols]

101: microphone, 102: A/D conversion IC, 103: speech capturing unit, 104: memory, 105: speech recognition processing unit, 106: camera, 107: A/D conversion IC, 108: image capturing unit, 109: memory, 110: image recognition processing unit, 111: translation processing unit, 112: speech synthesis processing unit, 113: D/A conversion IC, 114: LCD, 115: speaker.

Continued from the front page

(51) Int.Cl.⁶, identification symbol, FI: G06F 15/38 A

(72) Inventor: Akio Amano, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo

(72) Inventor: Koji Ito, c/o Semiconductor Division, Hitachi, Ltd., 5-20-1 Josuihoncho, Kodaira-shi, Tokyo

(72) Inventor: Hiroko Sato, c/o Semiconductor Division, Hitachi, Ltd., 5-20-1 Josuihoncho, Kodaira-shi, Tokyo

(72) Inventor: Kazuyoshi Ishiwatari, c/o Semiconductor Division, Hitachi, Ltd., 5-20-1 Josuihoncho, Kodaira-shi, Tokyo

Claims

[Claims]

1. A speech image recognition and translation device comprising: means for capturing speech and images; a memory for storing the captured speech data; a speech recognition processing unit that extracts speech features from a series of captured speech and performs speech recognition processing; a memory for storing the captured image data; an image recognition processing unit that extracts image features from a series of captured images and performs image recognition processing; and a translation processing unit that translates the words and sentences recognized by the speech recognition and the image recognition into the words or sentences of the target language, the device recognizing, from the correlation between the time-varying speech features and the image features corresponding to those speech features, the phrases of a sentence composed of a series of input speech and the words constituting the sentence, and thereby performing speech recognition and translation, wherein, when the features of the image of the mouth indicate the end of a phrase of the sentence, the device determines that this is a phrase of the sentence composed of the series of input speech, recognizes the character or character string at the end of the phrase, and adds a mark indicating the phrase.

2. The speech image recognition and translation device according to claim 1, wherein, in the process in which the speech features extracted from a series of speech and images captured at the same time and at the same sampling frequency change from a consonant to a vowel and then from the vowel to silence, when the features of the image of the whole mouth and its periphery corresponding to those speech features indicate the end of a phrase of the sentence, the device determines that this is a phrase of the sentence composed of the series of input speech.

3. A speech image recognition and translation device wherein, in the process in which the speech features extracted from a series of Japanese speech and images captured at the same time and at the same sampling frequency change from a consonant to a vowel and then from the vowel to silence, when the features of the image of the whole mouth and its periphery corresponding to those speech features indicate the end of a phrase of the sentence, the device determines that this is a phrase of the sentence composed of the series of input speech and recognizes the particle at the end of the phrase.

4. The speech image recognition and translation device according to claim 1, wherein the memory for storing the captured speech data comprises a memory (1) that constantly captures sound data and a memory (2) that captures speech data when speech is input; the memory for storing the captured image data comprises a memory (3) that constantly captures image data and a memory (4) that captures image data when speech is input; and the device starts capturing the speech data and image data when it determines that speech has been input, stops capturing the speech data and image data when it determines that the speech input has ended, and performs speech recognition and translation on the stored speech data and image data.

5. The speech image recognition and translation device according to claim 4, wherein, in the process of starting the capture of the speech data and image data when speech is determined to have been input and the process of reading out the stored speech data and image data: at an arbitrarily set period (sampling frequency) at which speech features appear, for the sound data stored in the memory (1) that constantly captures sound data, when the difference ΔPi between the intensity Pi of the sound data captured at time Ti and the intensity Pi-1 of the sound data captured at time Ti-1 of the preceding period exceeds an arbitrarily set boundary value Pth of the sound data intensity, the sound data is determined to be speech data and the subsequent speech data is written sequentially into the memory (2); from the time the sound data is determined to be speech data, once the writing of image data into the memory (3) that constantly captures image data is completed, the subsequent image data is written sequentially into the memory (4); and the speech data stored in the memory (1) and the memory (2), together with the image data stored in the memory (3) and the memory (4), is read out sequentially along the time axis t to form a series of speech waveform data and image data showing the movement of the mouth.

6. The speech image recognition and translation device according to claim 1, further comprising a recognition result correction unit that modifies or corrects the recognition result of the words or sentences recognized by the speech recognition and the image recognition.

7. The speech image recognition and translation device according to claim 1, wherein the image features are extracted from an image of a person's face, the whole mouth, and its periphery that has been normalized by resolution conversion and binarization of the image data.

8. A portable speech image recognition and translation device according to claim 1, wherein the means for capturing the speech and the image is formed by integrating a camera and a microphone.
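The onset detection and two-memory capture recited in claim 5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: memory (1) is modelled as a fixed-length ring buffer of size `pre_roll`, ΔPi is taken as the signed rise Pi − Pi-1, and the end-of-speech test of claim 4 is omitted because the claims do not give its criterion. The function name `capture_speech` and its parameters are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple


def capture_speech(
    intensities: Iterable[float], p_th: float, pre_roll: int = 4
) -> Tuple[List[float], int]:
    """Return (captured samples in time order, onset index), or ([], -1).

    memory1 models memory (1): it is written constantly and keeps only the
    most recent `pre_roll` samples. memory2 models memory (2): it receives
    every sample that follows the detected onset.
    """
    memory1: Deque[float] = deque(maxlen=pre_roll)
    memory2: List[float] = []
    onset = -1
    prev = None  # intensity P(i-1) from the preceding period
    for i, p in enumerate(intensities):
        if onset >= 0:
            memory2.append(p)          # after onset: write to memory (2)
            continue
        memory1.append(p)              # before onset: constant capture
        if prev is not None and p - prev > p_th:
            onset = i                  # delta P(i) exceeded the boundary P_th
        prev = p
    if onset < 0:
        return [], -1
    # Read out memory (1) and then memory (2) along the time axis.
    return list(memory1) + memory2, onset
```

Keeping a short history in memory (1) means the samples just before and at the detected onset are not lost when the readout is assembled. The image data of memories (3) and (4) could be handled by the same scheme, driven by the onset found in the sound data.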