JP2013088488A

JP2013088488A - Speech search device, speech search method, and program

Info

Publication number: JP2013088488A
Application number: JP2011226266A
Authority: JP
Inventors: Hideaki Inoue; 秀昭井上
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2011-10-13
Filing date: 2011-10-13
Publication date: 2013-05-13

Abstract

PROBLEM TO BE SOLVED: To provide a speech search device, a speech search method, and a program which have high search accuracy.

SOLUTION: A search object acquisition unit 62 acquires a triphone string A in which triphones are arranged in time sequence, each triphone having a phoneme included in the speech data of a search object as its center phoneme and including the phonemes directly preceding and directly succeeding that center phoneme. An input unit 64 inputs a search word with which the speech data is to be searched. A search word acquisition unit 66 converts the search word inputted by the input unit 64 into a phoneme string to acquire a triphone string B. A search unit 67 calculates similarities between a triphone string C, obtained by eliminating the first and last biphones of the triphone string B, and partial strings included in the triphone string A, and extracts from the triphone string A a partial string for which the calculated similarity meets a prescribed condition. An output unit 69 displays the playback start time, within the speech data, of the partial string extracted by the search unit 67.

Description

The present invention relates to a voice search device, a voice search method, and a program.

When speech contained in voice data is searched for by keyword, the voice data is generally modeled. One such model is the triphone (see, for example, Patent Documents 1 and 2). A triphone is a phoneme model made up of three phonemes: taking each phoneme in the voice data as a central phoneme, it comprises that central phoneme, the phoneme immediately preceding it, and the phoneme immediately following it. The voice data is ultimately modeled as a triphone string in which the triphones are arranged in chronological order.

To search voice data that has been converted into a triphone string for a keyword, the keyword must also be modeled as a triphone string. The keyword's triphone string is compared with portions of the voice data's triphone string, and the portions that are highly similar to the keyword's triphone string are retrieved from the voice data's triphone string.

Patent Document 1: JP 2006-11257 A
Patent Document 2: JP 2011-39468 A

Because no phoneme precedes the first phoneme of the keyword, the phoneme model centered on that first phoneme is not a triphone but a two-phoneme biphone. Likewise, because no phoneme follows the last phoneme of the keyword, the phoneme model centered on that last phoneme is also a biphone. In other words, the first and last phoneme models of the keyword's triphone string are biphones.

Comparing the triphone string of the voice data with the triphone string of the keyword therefore means comparing a string that consists entirely of triphones with a string that contains some biphones. The biphones in the keyword's triphone string lower the similarity between the two strings and act as noise in the search. As a result, the keyword search accuracy decreases.

The present invention has been made in view of the above circumstances, and its object is to provide a voice search device, a voice search method, and a program with high search accuracy.

To achieve the above object, a voice search device according to the present invention comprises:
a search target acquisition unit that acquires a first phoneme model string in which phoneme models are arranged in chronological order, each phoneme model taking a phoneme included in the voice data to be searched as a central phoneme and including the central phoneme and at least one of the phoneme immediately preceding it and the phoneme immediately following it;
a search word acquisition unit that converts a search word, with which the voice data is to be searched, into a phoneme string and acquires a second phoneme model string in which phoneme models are arranged in chronological order, each phoneme model taking a phoneme of the converted phoneme string as a central phoneme and including the central phoneme and at least one of the phoneme immediately preceding it and the phoneme immediately following it;
a search unit that calculates a first similarity between a third phoneme model string, which is the second phoneme model string with at least one of its first and last phoneme models removed, and partial strings included in the first phoneme model string, and extracts from the first phoneme model string a partial string whose calculated first similarity satisfies a predetermined condition; and
an output unit that outputs information on the voice data corresponding to the partial string extracted by the search unit.

According to the present invention, search accuracy is improved.

A block diagram showing the hardware configuration of the voice search device according to Embodiment 1 of the present invention.
A block diagram showing the functional configuration of the voice search device according to Embodiment 1 of the present invention.
A diagram showing the triphone string A generated from the voice data to be searched.
A diagram showing the phoneme string, the triphone string B, and the triphone string C generated from the search word.
A diagram showing the nodes defined by two triphone strings.
A diagram showing the contents of the inter-triphone distance table.
A diagram explaining the recurrence formula used in continuous DP matching.
A diagram explaining the method of extracting a partial string.
A flowchart of the voice search process according to Embodiment 1 of the present invention.
A flowchart of the search process according to Embodiment 1 of the present invention.
A flowchart of the search process according to Embodiment 2 of the present invention.
A flowchart of the search process according to Embodiment 3 of the present invention.
A flowchart of the search process according to Embodiment 4 of the present invention.

(Embodiment 1)
Embodiment 1 of the present invention will be described in detail with reference to the drawings. First, the hardware configuration of the voice search device 100 according to this embodiment will be described with reference to FIG. 1. As shown in FIG. 1, the voice search device 100 includes a ROM 1 (Read Only Memory), a RAM 2 (Random Access Memory), an external storage device 3, an input device 4, an output device 5, and a CPU 6 (Central Processing Unit).

The ROM 1 stores an initial program that performs various initial settings, hardware checks, program loading, and the like. The RAM 2 temporarily stores the software programs executed by the CPU 6 and the data needed to execute them. The external storage device 3 is, for example, a hard disk, and stores various software programs and data. These software programs include application software programs and basic software programs such as an OS (Operating System).

The data includes voice data. The voice data is, for example, the audio of a news broadcast, a recorded meeting, or a movie.

The input device 4 is, for example, a keyboard. The input device 4 passes text data and the like entered by the user on the keyboard to the CPU 6. The output device 5 includes, for example, a screen such as a liquid crystal display and a speaker. The output device 5 displays text data output by the CPU 6 on the screen and outputs voice data from the speaker.

The CPU 6 reads a software program stored in the external storage device 3 into the RAM 2 and controls its execution, thereby realizing the functional configuration described below.

Next, the functional configuration of the voice search device 100 will be described with reference to FIG. 2. The voice search device 100 includes a voice data storage unit 61, a search target acquisition unit 62, a triphone string storage unit 63, an input unit 64, a conversion table storage unit 65, a search word acquisition unit 66, a search unit 67, an inter-triphone distance storage unit 68, and an output unit 69. The voice data storage unit 61, the triphone string storage unit 63, the conversion table storage unit 65, and the inter-triphone distance storage unit 68 are built in the storage area of the external storage device 3.

The voice data storage unit 61 stores the voice data to be searched. The search target acquisition unit 62 acquires a triphone string A (first phoneme model string) based on the voice data stored in the voice data storage unit 61. A triphone is a phoneme model that includes a central phoneme, the phoneme immediately preceding it, and the phoneme immediately following it. The triphone string A is a chronological arrangement of triphones, each centered on one of the phonemes included in the voice data to be searched. The triphone string A is generated using, for example, a triphone HMM (Hidden Markov Model) as the acoustic model. The search target acquisition unit 62 functions as a triphone phoneme recognition engine that generates the triphone string A using the triphone HMM.

A triphone is written in the form "immediately preceding phoneme-central phoneme+immediately following phoneme". Triphones arranged in chronological order form a triphone string. For example, as shown in FIG. 3, part of the triphone string A generated from the voice data is "a-o+N", "o-N+s", "N-s+e", "s-e:+sh", "e-sh+i", "sh-i+N", "i-N+g", "N-g+o", "g-o:+sh", "o-sh+o", "sh-o+r", "o-r+i", "r-i+k", "i-k+o", "k-o:+z", "o-z+a", "z-a+g".

The search target acquisition unit 62 generates the triphone string A from the voice data in advance and stores it in the triphone string storage unit 63. At this time, the search target acquisition unit 62 creates an information table that associates each generated triphone with the playback start time of the voice data corresponding to that triphone, and stores the table in the triphone string storage unit 63. The search target acquisition unit 62 also stores the voice data corresponding to each generated triphone in the triphone string storage unit 63, in association with the playback start time in the information table.

The input unit 64 inputs the search word with which the voice data is to be searched. More specifically, the input unit 64 passes to the search word acquisition unit 66 the text data of the search word that the user entered on the keyboard of the input device 4, for example "信号処理" (shingō shori, "signal processing").

The conversion table storage unit 65 stores a conversion table that associates text data with phonemes. The search word acquisition unit 66 acquires a triphone string B (second phoneme model string) based on the search word input by the input unit 64. More specifically, the search word acquisition unit 66 first converts the search word into a phoneme string: it refers to the conversion table stored in the conversion table storage unit 65 to convert the text data into phonemes, and then arranges the converted phonemes in chronological order to form the phoneme string.
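As a rough illustration of the table lookup described above, the following sketch converts text to a phoneme string by greedy longest-match lookup. The table entries, the longest-match strategy, and the assumption that the search word is supplied as its kana reading ("しんごうしょり" for "信号処理") are all hypothetical; the text does not specify the table's contents or how kanji are handled.

```python
# Hypothetical conversion table: kana text -> phonemes (entries are illustrative only).
CONVERSION_TABLE = {
    "し": ["sh", "i"],
    "ん": ["N"],
    "ごう": ["g", "o:"],
    "しょ": ["sh", "o"],
    "り": ["r", "i"],
}


def text_to_phonemes(text, table=CONVERSION_TABLE):
    """Convert text to a phoneme string by greedy longest-match table lookup."""
    longest = max(len(key) for key in table)
    phonemes = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first so that "しょ" wins over "し".
        for size in range(min(longest, len(text) - pos), 0, -1):
            chunk = text[pos:pos + size]
            if chunk in table:
                phonemes.extend(table[chunk])
                pos += size
                break
        else:
            raise KeyError(f"no table entry for text starting at {text[pos:]!r}")
    return phonemes


print(text_to_phonemes("しんごうしょり"))
```

Trying the longest key first is one simple way to keep multi-character entries from being split into shorter ones.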

The search word acquisition unit 66 then generates the triphone string B (second phoneme model string), in which triphones are arranged in chronological order, each taking a phoneme of the converted phoneme string as its central phoneme and including that central phoneme, the phoneme immediately preceding it, and the phoneme immediately following it.

For example, when the search word input by the input unit 64 is "信号処理", the search word acquisition unit 66 converts it into the phoneme string "sh", "i", "N", "g", "o:", "sh", "o", "r", "i", as shown in FIG. 4. Based on this phoneme string, the search word acquisition unit 66 then generates, as the triphone string B, "sh+i", "sh-i+N", "i-N+g", "N-g+o", "g-o:+sh", "o-sh+o", "sh-o+r", "o-r+i", "r-i". As noted above, the generated triphone string B has a biphone at its beginning and at its end.
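The construction of the triphone string B from a phoneme string can be sketched as follows. The function name is hypothetical. Dropping the length mark ":" from context phonemes is an assumption inferred from the examples in the text (e.g. "N-g+o" next to "g-o:+sh"), not a stated rule.

```python
def to_triphone_string(phonemes):
    """Build phoneme models in "left-center+right" notation from a phoneme string.

    The first and last models come out as biphones because they lack a
    preceding or following phoneme, respectively.
    """
    def context(phoneme):
        # Assumption: context phonemes are written without the length mark ":".
        return phoneme.rstrip(":")

    models = []
    for k, center in enumerate(phonemes):
        left = context(phonemes[k - 1]) + "-" if k > 0 else ""
        right = "+" + context(phonemes[k + 1]) if k < len(phonemes) - 1 else ""
        models.append(left + center + right)
    return models


phoneme_string = ["sh", "i", "N", "g", "o:", "sh", "o", "r", "i"]
print(to_triphone_string(phoneme_string))
```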

The search unit 67 calculates the similarity between a triphone string C (third phoneme model string), which is the triphone string B without its first and last phoneme models, and partial strings included in the triphone string A. The search unit 67 generates the triphone string C from the triphone string B output by the search word acquisition unit 66. As shown in FIG. 4, the triphone string C is obtained by removing the first biphone "sh+i" and the last biphone "r-i" from the triphone string B, giving "sh-i+N", "i-N+g", "N-g+o", "g-o:+sh", "o-sh+o", "sh-o+r", "o-r+i".
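A minimal sketch of deriving the triphone string C from the triphone string B is shown below. The function name is hypothetical, and detecting a biphone by the absence of "-" or "+" in the label is an assumption based on the notation used in this description.

```python
def drop_edge_biphones(triphone_string_b):
    """Return a copy of B without a leading or trailing biphone."""
    models = list(triphone_string_b)
    # A leading biphone has no preceding context, so its label contains no "-".
    if models and "-" not in models[0]:
        models = models[1:]
    # A trailing biphone has no following context, so its label contains no "+".
    if models and "+" not in models[-1]:
        models = models[:-1]
    return models


triphone_string_b = ["sh+i", "sh-i+N", "i-N+g", "N-g+o", "g-o:+sh",
                     "o-sh+o", "sh-o+r", "o-r+i", "r-i"]
print(drop_edge_biphones(triphone_string_b))
```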

The search unit 67 calculates the similarity by continuous DP (Dynamic Programming) matching between the triphone string A and the triphone string C. Any known continuous DP matching method may be used; in this embodiment, the search unit 67 performs continuous DP matching as follows.

First, the search unit 67 defines a set of nodes made up of rows corresponding to the triphones of the triphone string C and columns corresponding to the triphones of the triphone string A. The following description uses the example shown in FIG. 5, in which the nodes are arranged in 7 rows corresponding to the 7 triphones of the triphone string C and 10 columns corresponding to 10 triphones included in the triphone string A. The node in row i and column j is called node (i, j), and the set of nodes shown in FIG. 5 is called the search table.

Next, the search unit 67 defines a cost X(i, j) for each node (i, j). The cost X(i, j) represents the matching error, that is, the similarity, between the triphone string C and a partial string of the triphone string A along the nodes traversed from a node corresponding to the first triphone of the triphone string C (a node with i = 1) to node (i, j). A smaller cost means a closer match.

The cost X(i, j) is updated as follows. First, for the nodes in the first row of the search table (node (1, 1) to node (1, 10)), the search unit 67 assigns to each cost X (cost X(1, 1) to cost X(1, 10)), as its initial value, the distance between the first triphone of the triphone string C and the corresponding triphone of the triphone string A.

More specifically, letting d(i, j) be the distance between the i-th triphone of the triphone string C and the j-th triphone of the triphone string A, the search unit 67 assigns d(1, j) to the cost X(1, j). For the nodes in the first column other than node (1, 1) (node (2, 1) to node (7, 1)), the search unit 67 assigns to each cost X (cost X(2, 1) to cost X(7, 1)), as its initial value, a value sufficiently larger than the largest cost X that can be expected.

The distance d(i, j) is obtained by referring to the inter-triphone distance table stored in the inter-triphone distance storage unit 68. As shown in FIG. 6, the inter-triphone distance table stores experimentally determined distances d for all combinations of triphones. In the example of FIG. 6, the distances d from "a-a+a" to "a-a+i", "a-a+u", and "o-w+o" are 5, 7, and 100, respectively. The distances d from "a-a+i" to "a-a+u" and "o-w+o" are 20 and 99, respectively. The distance d from "a-a+u" to "o-w+o" is 98. The distance d between identical triphones is 0.
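The table lookup can be pictured with the small sketch below, which holds only the six values quoted for FIG. 6. Storing each pair once and treating the distance as symmetric is an assumption; the names are hypothetical.

```python
# Distances quoted in the text for the FIG. 6 example, keyed by unordered pair.
# Assumption: the distance is symmetric, so each pair is stored once.
_FIG6_DISTANCES = {
    frozenset(("a-a+a", "a-a+i")): 5,
    frozenset(("a-a+a", "a-a+u")): 7,
    frozenset(("a-a+a", "o-w+o")): 100,
    frozenset(("a-a+i", "a-a+u")): 20,
    frozenset(("a-a+i", "o-w+o")): 99,
    frozenset(("a-a+u", "o-w+o")): 98,
}


def table_distance(triphone_1, triphone_2):
    """Look up the distance d between two triphones; identical triphones give 0."""
    if triphone_1 == triphone_2:
        return 0
    return _FIG6_DISTANCES[frozenset((triphone_1, triphone_2))]


print(table_distance("a-a+i", "a-a+u"))
```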

The distance d between two triphones may be, for example, the average of the distances between their three corresponding phonemes. In that case, the distances between phonemes are determined in advance from the feature values of each phoneme using an acoustic model. The feature value of a phoneme is, for example, the short-time spectrum of the phoneme's waveform data in the frequency domain.
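The averaging alternative can be sketched as follows, comparing the preceding, central, and following phonemes position by position. The phoneme distance values are invented purely for illustration, and the function names are hypothetical.

```python
# Hypothetical phoneme-to-phoneme distances (illustrative values, not from the text).
PHONEME_DISTANCES = {
    frozenset(("a", "i")): 6.0,
    frozenset(("a", "u")): 9.0,
    frozenset(("i", "u")): 3.0,
}


def phoneme_distance(p, q):
    """Distance between two phonemes; identical phonemes are at distance 0."""
    return 0.0 if p == q else PHONEME_DISTANCES[frozenset((p, q))]


def split_triphone(label):
    """Split "left-center+right" into its three phonemes."""
    left, rest = label.split("-", 1)
    center, right = rest.split("+", 1)
    return left, center, right


def averaged_triphone_distance(triphone_1, triphone_2):
    """Average the distances between the phonemes at the same positions."""
    pairs = zip(split_triphone(triphone_1), split_triphone(triphone_2))
    return sum(phoneme_distance(p, q) for p, q in pairs) / 3


print(averaged_triphone_distance("a-a+a", "a-a+i"))
```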

The nodes to which initial values have been assigned are connected to one another by paths corresponding to a match or substitution, an insertion, or a dropout of a triphone. These are described in detail with reference to FIG. 7. As shown in FIG. 7, a match or substitution is a connection from node (i-1, j-1) to node (i, j). An insertion is a connection from node (i-1, j-2) to node (i, j). A dropout is a connection from node (i-2, j-1) to node (i, j).

Next, the search unit 67 updates the cost X(i, j) using the following recurrence formula, whose rows correspond to a match or substitution, an insertion, and a dropout, respectively.

The top row of the recurrence formula corresponds to a match or substitution. In this case, the search unit 67 adds d(i, j) to the cost X(i-1, j-1). The middle row corresponds to an insertion. In this case, the search unit 67 adds the average of d(i, j-1) and d(i, j) to the cost X(i-1, j-2), and further adds an insertion cost α. The bottom row corresponds to a dropout. In this case, the search unit 67 adds d(i, j) and d(i-1, j) to the cost X(i-2, j-1), and further adds a dropout cost β. The insertion cost α and the dropout cost β are constants set in advance.
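The recurrence formula itself does not survive in this text. Going only by the description of its three rows, it can be reconstructed as shown below. Taking the minimum over the three candidates is an assumption (the usual choice in DP matching) and is not stated in the text.

```latex
X(i,j) = \min
\begin{cases}
X(i-1,\,j-1) + d(i,j) & \text{(match or substitution)} \\[4pt]
X(i-1,\,j-2) + \dfrac{d(i,j-1) + d(i,j)}{2} + \alpha & \text{(insertion)} \\[4pt]
X(i-2,\,j-1) + d(i,j) + d(i-1,j) + \beta & \text{(dropout)}
\end{cases}
```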

The recurrence formula may be modified as desired. For example, to handle the insertion of two triphones, a row may be added that adds the average of d(i, j-2), d(i, j-1), and d(i, j) to the cost X(i-1, j-3) and further adds an insertion cost γ.

The search unit 67 updates X(i, j) as described above. Next, among the nodes with i = 7, the search unit 67 selects the node with the smallest cost X(7, j). The following description takes the case where the cost X(7, 9) is the smallest as an example.

As shown in FIG. 8, the search unit 67 traces back the path along which the selected cost X(7, 9) was calculated, and thereby identifies the third triphone of the triphone string A as the one matched with the first triphone of the triphone string C. As a result, the search unit 67 extracts the partial string running from the third to the ninth triphone of the triphone string A. Here, the cost X(7, 9) corresponds to the similarity between the triphone string C and that partial string of the triphone string A.

By performing continuous DP matching in this way, the search unit 67 extracts from the triphone string A the partial string whose calculated cost X is the smallest, which is the predetermined condition here.
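The initialization, cost update, and trace-back described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the minimum over the three transitions is assumed, `math.inf` stands in for the "sufficiently large" initial value, the 0/1 `unit_distance` is a placeholder for the inter-triphone distance table, and indices are 0-based.

```python
import math


def continuous_dp_match(query, target, dist, alpha=1.0, beta=1.0):
    """Return (start, end, cost) of the best-matching partial string of target.

    start and end are 0-based, inclusive indices into target.
    """
    if not query or not target:
        raise ValueError("query and target must be non-empty")
    rows, cols = len(query), len(target)
    cost = [[math.inf] * cols for _ in range(rows)]  # column 0 stays "very large"
    back = [[None] * cols for _ in range(rows)]
    for j in range(cols):
        cost[0][j] = dist(query[0], target[j])  # a match may start at any column
    for i in range(1, rows):
        for j in range(1, cols):
            d_ij = dist(query[i], target[j])
            # Match or substitution: from node (i-1, j-1).
            candidates = [(cost[i - 1][j - 1] + d_ij, (i - 1, j - 1))]
            if j >= 2:  # Insertion: from node (i-1, j-2).
                avg = (dist(query[i], target[j - 1]) + d_ij) / 2
                candidates.append((cost[i - 1][j - 2] + avg + alpha, (i - 1, j - 2)))
            if i >= 2:  # Dropout: from node (i-2, j-1).
                extra = d_ij + dist(query[i - 1], target[j]) + beta
                candidates.append((cost[i - 2][j - 1] + extra, (i - 2, j - 1)))
            cost[i][j], back[i][j] = min(candidates, key=lambda c: c[0])
    end = min(range(cols), key=lambda j: cost[rows - 1][j])
    best = cost[rows - 1][end]
    if math.isinf(best):
        raise ValueError("no alignment found")
    i, j = rows - 1, end
    while i > 0:  # trace the chosen path back to the first row
        i, j = back[i][j]
    return j, end, best


def unit_distance(a, b):
    """Placeholder distance: 0 for identical triphones, 1 otherwise."""
    return 0.0 if a == b else 1.0


TRIPHONE_STRING_A = ["a-o+N", "o-N+s", "N-s+e", "s-e:+sh", "e-sh+i", "sh-i+N",
                     "i-N+g", "N-g+o", "g-o:+sh", "o-sh+o", "sh-o+r", "o-r+i",
                     "r-i+k", "i-k+o", "k-o:+z", "o-z+a", "z-a+g"]
TRIPHONE_STRING_C = ["sh-i+N", "i-N+g", "N-g+o", "g-o:+sh", "o-sh+o", "sh-o+r", "o-r+i"]

print(continuous_dp_match(TRIPHONE_STRING_C, TRIPHONE_STRING_A, unit_distance))
```

With the FIG. 3 excerpt as the target, the triphone string C occurs verbatim, so the best path is a pure diagonal with zero cost.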

Although the search unit 67 here selects the node with the smallest cost X, it may instead extract partial strings by selecting nodes under a different predetermined condition: either that the cost X is less than a predetermined threshold, or that the cost X is within a predetermined rank when the costs are ordered from smallest to largest.
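The three selection conditions mentioned here (smallest cost, below a threshold, within the top ranks) could be expressed as in the sketch below. The function name, the parameter names, and the sample costs are hypothetical.

```python
def select_end_columns(last_row_costs, threshold=None, top_n=None):
    """Pick end columns from the costs of the last row of the search table.

    - threshold given: every column whose cost is below the threshold
    - top_n given: the top_n columns with the smallest costs
    - neither given: only the column with the smallest cost
    Columns are returned as 0-based indices in ascending order of cost.
    """
    ranked = sorted(range(len(last_row_costs)), key=lambda j: last_row_costs[j])
    if threshold is not None:
        return [j for j in ranked if last_row_costs[j] < threshold]
    if top_n is not None:
        return ranked[:top_n]
    return ranked[:1]


sample_costs = [9.0, 4.0, 0.5, 7.0, 2.0]  # illustrative values only
print(select_end_columns(sample_costs, threshold=3.0))
```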

Having extracted the partial string, the search unit 67 outputs information on the voice data corresponding to it. This information is, for example, the playback start time corresponding to the first triphone of the extracted partial string and the voice data corresponding to the extracted partial string. The search unit 67 refers to the information table stored in the triphone string storage unit 63 to obtain the playback start time and the voice data, and outputs them to the output unit 69.
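One way to picture the information table lookup is the sketch below, which maps the first triphone of an extracted partial string to its playback start time. The table layout, the time values, and the function name are hypothetical; the text only says that each triphone is associated with a playback start time.

```python
# Hypothetical information table: one (triphone, playback start time in seconds)
# entry per triphone of the triphone string A, in chronological order.
INFO_TABLE = [
    ("s-e:+sh", 12.31),
    ("e-sh+i", 12.40),
    ("sh-i+N", 12.48),
    ("i-N+g", 12.55),
    ("N-g+o", 12.63),
]


def playback_start_time(info_table, start_index):
    """Return the playback start time of the triphone at start_index (0-based)."""
    _triphone, start_time = info_table[start_index]
    return start_time


# If the extracted partial string starts at index 2 ("sh-i+N"):
print(playback_start_time(INFO_TABLE, 2))
```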

The output unit 69 outputs the information on the voice data corresponding to the partial string extracted by the search unit 67 to the output device 5. More specifically, the output unit 69 displays the playback start time output by the search unit 67 on the screen of the output device 5. When the user inputs, via the input device 4, an instruction to play back the voice data corresponding to the extracted partial string, the output unit 69 may output the voice data received from the search unit 67 through the speaker.

Next, the flow of the voice search process performed by the voice search device 100 of this embodiment will be described in detail with reference to FIG. 9. It is assumed that the triphone string A shown in FIG. 3 is already stored in the triphone string storage unit 63.

The input unit 64 inputs the text data corresponding to the search word "信号処理" to the search word acquisition unit 66 (step S1). The search word acquisition unit 66 generates the triphone string B from the input text data (step S2). The search unit 67 then executes a search process that extracts a partial string from the triphone string A (step S3).

The search process (step S3) will now be described in detail with reference to FIG. 10. The search unit 67 generates the triphone string C by removing the first and last biphones from the triphone string B (step S11). The search unit 67 then performs continuous DP matching between the triphone string C and the triphone string A (step S12). From the result of the continuous DP matching, the search unit 67 extracts the partial string with the smallest cost X (step S13). Returning to FIG. 9, the output unit 69 displays the playback start time corresponding to the extracted partial string on the screen (step S4). The voice search device 100 then ends the voice search process.

As described in detail above, the voice search device 100 according to this embodiment performs continuous DP matching using the triphone string C, which is the triphone string B without its first and last biphones. Compared with performing the continuous DP matching with the triphone string B, which contains biphones, this prevents the increase in the cost X that arises because a biphone has a different number of phonemes from a triphone. Search accuracy is improved as a result.

(Embodiment 2)
Next, Embodiment 2 of the present invention will be described in detail. The hardware configuration and functional configuration of the voice search device 100 in this embodiment are the same as in Embodiment 1, but the function of the search unit 67 differs. When the search word is short, that is, when the triphone string B consists of only a few phoneme models, removing the biphones leaves the triphone string C with even fewer, and the amount of data representing the search word can become extremely small. Search accuracy may then decrease instead of improving.

Therefore, when the number of triphones in the triphone string B is less than a threshold, the search unit 67 calculates a cost Y (second similarity) between the triphone string B and partial strings included in the triphone string A, and extracts from the triphone string A the partial string with the smallest calculated cost Y. When the number of triphones in the triphone string B is equal to or greater than the threshold, the search unit 67 calculates the cost X (first similarity) between the triphone string C and partial strings included in the triphone string A.

The flow of the voice search process performed by the voice search device 100 in the present embodiment will now be described in detail. It differs from the voice search process of Embodiment 1 in the search process (step S3). The following description, given with reference to FIG. 11, focuses on the parts of the search process that differ from Embodiment 1.

The search unit 67 determines whether the number of triphones constituting the triphone sequence B is equal to or greater than the threshold (step S21). If it is (step S21; Yes), the search unit 67 executes steps S22 to S24 in the same manner as steps S11 to S13 of Embodiment 1. The voice search device 100 then returns to the voice search process, executes step S4, and ends the voice search process.

On the other hand, if the number of triphones constituting the triphone sequence B is less than the threshold (step S21; No), the search unit 67 performs continuous DP matching between the triphone sequence B and the triphone sequence A (step S25). The search unit 67 then extracts the partial sequence with the smallest cost Y (step S24). The voice search device 100 then returns to the voice search process, executes step S4, and ends the voice search process.

As described in detail above, according to the present embodiment, the number of triphones constituting the triphone sequence B determines whether the cost X is calculated using the triphone sequence C or the cost Y is calculated using the triphone sequence B. This prevents the amount of data from becoming extremely small, as it would if the first and last biphones were removed from a triphone sequence B that contains only a few triphones. As a result, search accuracy improves.
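The selection made in step S21 can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_query_sequence` and the default threshold of 12 are hypothetical choices made here, and triphones are represented as plain strings.

```python
def select_query_sequence(triphones_b, threshold=12):
    """Choose the sequence to match against triphone sequence A (step S21).

    Returns (sequence, cost_name): sequence C and "X" when B has at least
    `threshold` elements, otherwise B itself and "Y".
    The threshold value 12 is an illustrative assumption.
    """
    if len(triphones_b) >= threshold:
        # Drop the first and last elements (the biphones) to form C.
        return triphones_b[1:-1], "X"
    # Short search word: keep the biphones so the query does not get too small.
    return list(triphones_b), "Y"


short_b = ["sh+i", "sh-i+o", "i-o+r", "o-r+i", "r-i"]  # toy search word
long_b = ["p%d" % k for k in range(12)]                # placeholder labels

print(select_query_sequence(short_b))
print(select_query_sequence(long_b))
```

The short sequence is returned unchanged with cost name "Y", while the 12-element sequence loses its first and last elements and is matched under cost name "X".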

(Embodiment 3)
Next, Embodiment 3 of the present invention will be described in detail. The hardware configuration and functional configuration of the voice search device 100 in the present embodiment are the same as in Embodiment 1, but the function of the search unit 67 differs. The search unit 67 calculates both the cost X (first similarity) and the cost Y (second similarity), and extracts the partial sequence for which the cost X or Y is smallest.

The flow of the voice search process performed by the voice search device 100 in the present embodiment will now be described in detail. It differs from the voice search process of Embodiment 1 in the search process (step S3). The following description, given with reference to FIG. 12, focuses on the parts of the search process that differ from Embodiment 1.

The search unit 67 performs continuous DP matching between the triphone sequence B and the triphone sequence A (step S31). Next, the search unit 67 generates the triphone sequence C (step S32) and performs continuous DP matching between the triphone sequence C and the triphone sequence A (step S33). The search unit 67 then extracts the partial sequence for which the cost X or Y is smallest (step S34). The voice search device 100 then returns to the voice search process, executes step S4, and ends the voice search process.

As described in detail above, according to the present embodiment, when the triphone sequence B consists of few triphones and removing the biphones would make the amount of data extremely small, the partial sequence is extracted on the basis of the cost Y. When there are enough triphones and removing the biphones is advantageous, the partial sequence is extracted on the basis of the cost X. Because the partial sequence is extracted in a way that adapts to the number of triphones, search accuracy can be improved further.
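Steps S31 to S34 can be sketched as below. The sketch is hedged in two ways: `window_cost` is a fixed-length sliding-window mismatch count used here only as a stand-in for continuous DP matching, and the raw costs X and Y are compared directly because the text does not say whether they are normalized by sequence length. All function names are hypothetical, and the target is assumed to be at least as long as the query.

```python
def window_cost(query, target):
    """Stand-in matcher: best (cost, start) over fixed-length windows of target."""
    best = None
    for start in range(len(target) - len(query) + 1):
        window = target[start:start + len(query)]
        cost = sum(q != t for q, t in zip(query, window))
        if best is None or cost < best[0]:
            best = (cost, start)
    return best


def search_with_both_costs(triphones_b, triphones_a, match=window_cost):
    cost_y, start_y = match(triphones_b, triphones_a)  # S31: B against A
    triphones_c = triphones_b[1:-1]                     # S32: generate C
    cost_x, start_x = match(triphones_c, triphones_a)  # S33: C against A
    # S34: keep whichever match has the smaller cost (ties go to X here).
    if cost_x <= cost_y:
        return "X", cost_x, start_x
    return "Y", cost_y, start_y


triphones_a = ["a-k+a", "k-sh+i", "sh-i+o", "i-o+r", "o-r+i", "r-i+t", "i-t+a"]
triphones_b = ["sh+i", "sh-i+o", "i-o+r", "o-r+i", "r-i"]
print(search_with_both_costs(triphones_b, triphones_a))
```

In this toy data the biphones at both ends of B never match a triphone of A, so the match based on C wins.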

(Embodiment 4)
Next, Embodiment 4 of the present invention will be described in detail. The hardware configuration and functional configuration of the voice search device 100 in the present embodiment are the same as in Embodiment 1, but the function of the search unit 67 differs. The search unit 67 gives the first and last biphones of the triphone sequence B a smaller weight than the remaining triphones of the triphone sequence B, calculates the cost Z between the triphone sequence B and the partial sequences included in the triphone sequence A, and extracts from the triphone sequence A the partial sequence with the smallest calculated cost Z.

For example, the search unit 67 calculates the value obtained by multiplying by 1/2 the distance d between the first biphone "sh+i" of the triphone sequence B and the corresponding triphone of the triphone sequence A, and the value obtained by multiplying by 1/2 the distance d between the last biphone "r-i" of the triphone sequence B and the corresponding triphone of the triphone sequence A. The search unit 67 performs continuous DP matching using these values. The search unit 67 then calculates the cost Z between the triphone sequence B and the partial sequences included in the triphone sequence A, and extracts the partial sequence with the smallest calculated cost Z.

Next, the flow of the voice search process performed by the voice search device 100 in the present embodiment will be described in detail. It differs from the voice search process of Embodiment 1 in the search process (step S3). The following description, given with reference to FIG. 13, focuses on the parts of the search process that differ from Embodiment 1.

The search unit 67 performs continuous DP matching between the triphone sequence B and the triphone sequence A (step S41). In calculating the cost Z, however, the search unit 67 multiplies by 1/2 the distance d between each of the first and last biphones of the triphone sequence B and the triphones of the triphone sequence A. The search unit 67 then extracts the partial sequence with the smallest cost Z (step S42). The voice search device 100 then returns to the voice search process, executes step S4, and ends the voice search process.

As described above, according to the present embodiment, the weight of the distance between the first and last biphones of the triphone sequence B and the triphones of the triphone sequence A is reduced in the cost calculation. This reduces the increase in cost Z that arises because the first and last biphones of the triphone sequence B do not contain the same number of phonemes as the triphones of the triphone sequence A. As a result, search accuracy improves.
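The effect of halving the biphone distances can be illustrated with the following sketch. It rests on assumptions: the recurrence in `continuous_dp_cost` is a simple DTW-style recurrence with free start and end points chosen for illustration, not necessarily the recurrence used in Embodiment 1, and `label_distance` is a toy 0/1 distance on labels rather than the distance d of the embodiments. All names are hypothetical.

```python
def continuous_dp_cost(query, target, dist, weights=None):
    """Smallest accumulated cost of aligning all of `query` with some
    contiguous part of `target`; returns (cost, end_index_in_target)."""
    if weights is None:
        weights = [1.0] * len(query)
    # Row for the first query element: a match may start anywhere in target.
    prev = [weights[0] * dist(query[0], t) for t in target]
    for i in range(1, len(query)):
        cur = []
        for j, t in enumerate(target):
            best = prev[j]
            if j > 0:
                best = min(best, prev[j - 1], cur[j - 1])
            cur.append(best + weights[i] * dist(query[i], t))
        prev = cur
    end = min(range(len(target)), key=prev.__getitem__)
    return prev[end], end


def edge_weights(n, edge_weight=0.5):
    """Weight 1/2 for the first and last elements (the biphones), 1 elsewhere."""
    return [edge_weight if k in (0, n - 1) else 1.0 for k in range(n)]


def label_distance(a, b):
    return 0.0 if a == b else 1.0  # toy distance, not the distance d itself


triphones_a = ["a-k+a", "k-sh+i", "sh-i+o", "i-o+r", "o-r+i", "r-i+t", "i-t+a"]
triphones_b = ["sh+i", "sh-i+o", "i-o+r", "o-r+i", "r-i"]

cost_plain, _ = continuous_dp_cost(triphones_b, triphones_a, label_distance)
cost_z, _ = continuous_dp_cost(
    triphones_b, triphones_a, label_distance, edge_weights(len(triphones_b))
)
print(cost_plain, cost_z)
```

With the toy data, the two biphones are the only mismatching elements, so reducing their weight lowers the accumulated cost without affecting the interior triphones.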

In each of the above embodiments, the search unit 67 may extract partial sequences using, as the predetermined condition, either that the cost is less than a predetermined threshold or that the cost ranks within a predetermined number of places in ascending order of cost. In this way, when multiple partial sequences that match or resemble the search word are extracted, the user can check multiple search results.
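The two alternative conditions can be sketched as follows. The candidate format, a list of (reproduction start time, cost) pairs, and the function name are assumptions made for this illustration.

```python
def select_partial_sequences(candidates, max_cost=None, top_n=None):
    """candidates: list of (start_time_seconds, cost) pairs.

    Keep the candidates whose cost is below `max_cost`, or the `top_n`
    lowest-cost candidates; with neither given, keep only the best one.
    """
    ranked = sorted(candidates, key=lambda c: c[1])  # ascending cost
    if max_cost is not None:
        return [c for c in ranked if c[1] < max_cost]
    if top_n is not None:
        return ranked[:top_n]
    return ranked[:1]


candidates = [(12.5, 0.8), (3.0, 0.2), (40.1, 1.5)]
print(select_partial_sequences(candidates, max_cost=1.0))
print(select_partial_sequences(candidates, top_n=1))
```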

In each of the above embodiments, the similarity is calculated by continuous DP matching. This allows the similarity to be calculated efficiently.

In each of the above embodiments, the phoneme model is a triphone. If speech is represented by plain phonemes without triphones, the phoneme "r", for example, is represented as "r" regardless of the phonemes immediately before and after it. With triphones, the same phoneme "r" can be represented together with the information that the immediately preceding phoneme is "o" and the immediately following phoneme is "i", as in "o-r+i". Because a triphone carries more information in this way, it represents the phoneme more precisely, and search accuracy improves.
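As an illustration of the notation, the following sketch builds triphone labels of the form "o-r+i" from a list of phonemes. The function name and the phoneme list are assumptions chosen for the example; the first and last elements come out as biphones because they lack a preceding or a following phoneme.

```python
def to_triphones(phonemes):
    """Label each phoneme with its immediate neighbours as 'prev-center+next'."""
    labels = []
    for k, center in enumerate(phonemes):
        left = phonemes[k - 1] + "-" if k > 0 else ""
        right = "+" + phonemes[k + 1] if k < len(phonemes) - 1 else ""
        labels.append(left + center + right)
    return labels


print(to_triphones(["sh", "i", "o", "r", "i"]))
```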

The audio data to be searched that is stored in the audio data storage unit 61 may be read by the CPU 6 from a medium such as a CD (Compact Disc) or a DVD (Digital Versatile Disc), or may be downloaded from a server or the like via an Internet connection. The audio data may be in WAVE format, AIFF (Audio Interchange File Format) format, or the like.

The search word is not limited to a single word such as "信号処理" (signal processing); it may be a keyword, phrase, or the like containing multiple words.

The search unit 67 may generate, as the triphone sequence C, a triphone sequence obtained by removing only one of the first and last biphones from the triphone sequence B.

In Embodiment 2, the threshold for the number of triphones constituting the triphone sequence B is preferably at least 10 and less than 20, and more preferably a little over 10.

The input unit 64 may also apply speech recognition or similar processing to audio data entered as a search word via the input device 4. In that case, the input unit 64 passes the text data obtained by speech recognition to the search word acquisition unit 66. More specifically, the input device 4 includes, for example, a microphone, and the input unit 64 converts the audio data corresponding to "信号処理" (signal processing) spoken by the user into the microphone into text data by speech recognition and passes it to the search word acquisition unit 66. This lets the user enter a search word by speaking it, which further improves convenience.

In each of the above embodiments, the search unit 67 outputs to the output unit 69 the reproduction start time of the first triphone among the triphones constituting the extracted partial sequence. The user can then specify the displayed reproduction start time and instruct reproduction, so that the audio data is played back from the part matching the search word. The search unit 67 may also refer to the information table stored in the triphone sequence storage unit 63, acquire from the triphone sequence storage unit 63 audio data containing the extracted partial sequence and several seconds before and after it, and output that audio data to the output unit 69. In this case, the output unit 69 outputs the audio data received from the search unit 67 through a speaker of the output device 5. This lets the user check a search result together with the audio immediately before and after the matching speech, which further improves convenience.

The search unit 67 may also output to the output unit 69 the reproduction start time of the first triphone in the extracted partial sequence together with the audio data containing the extracted partial sequence and several seconds before and after it. In this case, the search unit 67 may output the reproduction start times of the first triphones of multiple partial sequences that satisfy the predetermined condition. For example, the search unit 67 may output to the output unit 69 the reproduction start times of the five partial sequences with the lowest cost X, together with the audio data associated with each reproduction start time, which contains the partial sequence and several seconds before and after it. In this case, the output unit 69 displays the reproduction start times of the top five partial sequences on the screen. The output unit 69 may further output the audio data corresponding to the reproduction start time that the user selects from among the displayed reproduction start times of the top five partial sequences.

When multiple pieces of audio data are searched, the search unit 67 may acquire from the audio data storage unit 61 the data name of the audio data containing the extracted partial sequence and output it to the output unit 69. In this case, the output unit 69 displays on the screen the data name of the audio data received from the search unit 67.

The search unit 67 may also omit the process of tracing back the path and instead extract a partial sequence containing the triphones that correspond to a predetermined number of nodes counted from the node with the smallest cost X among the nodes at i = 7. Omitting the trace-back in this way means the partial sequence is extracted only approximately, but it reduces the amount of computation required for the extraction.
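A rough sketch of extraction without trace-back is given below. It assumes that the final row of accumulated costs is already available, that the node with the smallest cost marks the end of the match, and that the fixed number of nodes extends backward from it; the text does not spell out the direction, so this is one possible reading. The function name and data are hypothetical.

```python
def extract_without_backtrace(final_row_costs, triphones_a, length):
    """Take `length` triphones of A ending at the lowest-cost final node.

    No path is traced back, so the start position is only approximate.
    Returns (start_index, partial_sequence).
    """
    end = min(range(len(final_row_costs)), key=final_row_costs.__getitem__)
    start = max(0, end - length + 1)  # clamp at the beginning of A
    return start, triphones_a[start:end + 1]


final_row = [3.0, 2.5, 0.4, 1.9]  # accumulated costs at the last query row
sequence_a = ["t0", "t1", "t2", "t3"]
print(extract_without_backtrace(final_row, sequence_a, length=2))
```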

In each of the above embodiments, the feature quantity of a phoneme is the short-time spectrum in the frequency band of the phoneme's waveform data, but the feature quantity of a phoneme may instead be, for example, the logarithm of the short-time spectrum in the frequency band of the phoneme.

The search target acquisition unit 62 and the search word acquisition unit 66 may also generate biphone sequences from the audio data to be searched and from the phoneme sequence into which the search word is converted, respectively. In this case, because there is no phoneme immediately before the first phoneme or immediately after the last phoneme, the first and last phoneme models are monophones consisting of a single phoneme, such as "sh" and "i". The search unit 67 then performs continuous DP matching between the biphone sequence of the search word with its first and last monophones ("sh" and "i") removed and the partial sequences included in the biphone sequence of the search target.

In each of the above embodiments, the distance d between triphones is the average of the distances between the three pairs of phonemes contained in the triphones. It may instead be a weighted average of those three phoneme distances in which the distance between the central phonemes is given a larger weight.
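The plain and weighted averages can be compared with the following sketch. The tuple representation of a triphone, the 0/1 phoneme distance, and the center weight of 2 are assumptions made for illustration; the text does not give a specific weight.

```python
def triphone_distance(tri_a, tri_b, phoneme_dist, center_weight=1.0):
    """Weighted average of the three phoneme-to-phoneme distances.

    tri_a and tri_b are (previous, center, next) tuples. With
    center_weight=1.0 this reduces to the plain average.
    """
    weights = (1.0, center_weight, 1.0)
    total = sum(w * phoneme_dist(a, b) for w, a, b in zip(weights, tri_a, tri_b))
    return total / sum(weights)


def same_or_not(a, b):
    return 0.0 if a == b else 1.0  # toy phoneme distance


context_differs = triphone_distance(("o", "r", "i"), ("a", "r", "i"), same_or_not, 2.0)
center_differs = triphone_distance(("o", "r", "i"), ("o", "l", "i"), same_or_not, 2.0)
print(context_differs, center_differs)
```

With the larger center weight, a mismatch in the central phoneme contributes more to the distance than a mismatch in a neighbouring phoneme.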

The voice search device 100 may be a television receiver, a terrestrial digital broadcast receiver, a video recorder, an audio recorder, a PC (Personal Computer), a mobile terminal such as a mobile phone, a PHS (Personal Handy-phone System) terminal, a PDA (Personal Digital Assistant or Personal Data Assistance), a game machine, or the like.

In each of the above embodiments, the software program to be executed may be stored on and distributed via a computer-readable recording medium such as a flexible disk, a CD-ROM (Compact Disc Read-Only Memory), a DVD, or an MO (Magneto-Optical disc), and a system that executes the above-described processing may be configured by installing the software program.

The software program may also be stored on a disk device or the like of a predetermined server device on a communication network such as the Internet and, for example, superimposed on a carrier wave and downloaded.

When the above-described functions are realized partly by an OS, or by cooperation between the OS and an application software program, only the portion other than the OS may be stored on a medium and distributed, or downloaded.

Preferred embodiments of the present invention have been described above, but the present invention is not limited to these specific embodiments; it includes the inventions described in the claims and their equivalents. The inventions described in the claims as originally filed are appended below.

(Appendix 1)
A voice search device comprising:
a search target acquisition unit that acquires a first phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme included in audio data to be searched as its central phoneme and including the central phoneme and at least one of the phonemes immediately preceding and immediately following it;
a search word acquisition unit that converts a search word to be searched for in the audio data into a phoneme sequence and acquires a second phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme of the converted phoneme sequence as its central phoneme and including the central phoneme and at least one of the phonemes immediately preceding and immediately following it;
a search unit that calculates a first similarity between a third phoneme model sequence, which is the second phoneme model sequence with at least one of its first and last phoneme models removed, and partial sequences included in the first phoneme model sequence, and that extracts from the first phoneme model sequence a partial sequence for which the calculated first similarity satisfies a predetermined condition; and
an output unit that outputs information on the audio data corresponding to the partial sequence extracted by the search unit.

(Appendix 2)
The voice search device according to Appendix 1, wherein,
when the number of phoneme models constituting the second phoneme model sequence is less than a threshold, the search unit calculates a second similarity between the second phoneme model sequence and partial sequences included in the first phoneme model sequence, and extracts from the first phoneme model sequence a partial sequence for which the calculated second similarity satisfies a predetermined condition.

(Appendix 3)
The voice search device according to Appendix 1, wherein
the search unit calculates both the first similarity and a second similarity between the second phoneme model sequence and partial sequences included in the first phoneme model sequence, and extracts from the first phoneme model sequence a partial sequence for which the first and second similarities satisfy a predetermined condition.

(Appendix 4)
A voice search device comprising:
a search target acquisition unit that acquires a first phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme included in audio data to be searched as its central phoneme and including the central phoneme and at least one of the phonemes immediately preceding and immediately following it;
a search word acquisition unit that converts a search word to be searched for in the audio data into a phoneme sequence and acquires a second phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme of the converted phoneme sequence as its central phoneme and including the central phoneme and at least one of the phonemes immediately preceding and immediately following it;
a search unit that calculates a similarity between the second phoneme model sequence and partial sequences included in the first phoneme model sequence while giving at least one of the first and last phoneme models of the second phoneme model sequence a smaller weight than the remaining phoneme models of the second phoneme model sequence, and that extracts from the first phoneme model sequence a partial sequence for which the calculated similarity satisfies a predetermined condition; and
an output unit that outputs information on the audio data corresponding to the partial sequence extracted by the search unit.

(Appendix 5)
The voice search device according to any one of Appendices 1 to 4, wherein
the search unit extracts the partial sequence using, as the predetermined condition, either that the similarity is equal to or greater than a predetermined threshold or that the similarity ranks within a predetermined number of places in descending order of similarity.

(Appendix 6)
The voice search device according to any one of Appendices 1 to 5, wherein
the search unit calculates the similarity by continuous DP matching.

(Appendix 7)
The voice search device according to any one of Appendices 1 to 6, wherein
the phoneme model is a triphone including a central phoneme, the phoneme immediately preceding the central phoneme, and the phoneme immediately following the central phoneme.

(Appendix 8)
A voice search method comprising:
a search target acquisition step of acquiring a first phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme included in audio data to be searched as its central phoneme and including the central phoneme and at least one of the phonemes immediately preceding and immediately following it;
a search word acquisition step of converting a search word to be searched for in the audio data into a phoneme sequence and acquiring a second phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme of the converted phoneme sequence as its central phoneme and including the central phoneme and at least one of the phonemes immediately preceding and immediately following it;
a search step of calculating a first similarity between a third phoneme model sequence, which is the second phoneme model sequence with at least one of its first and last phoneme models removed, and partial sequences included in the first phoneme model sequence, and extracting from the first phoneme model sequence a partial sequence for which the calculated first similarity satisfies a predetermined condition; and
an output step of outputting information on the audio data corresponding to the partial sequence extracted in the search step.

(Appendix 9)
A voice search method comprising:
a search target acquisition step of acquiring a first phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme included in audio data to be searched as its central phoneme and including the central phoneme and at least one of the phonemes immediately preceding and immediately following it;
a search word acquisition step of converting a search word to be searched for in the audio data into a phoneme sequence and acquiring a second phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme of the converted phoneme sequence as its central phoneme and including the central phoneme and at least one of the phonemes immediately preceding and immediately following it;
a search step of calculating a similarity between the second phoneme model sequence and partial sequences included in the first phoneme model sequence while giving at least one of the first and last phoneme models of the second phoneme model sequence a smaller weight than the remaining phoneme models of the second phoneme model sequence, and extracting from the first phoneme model sequence a partial sequence for which the calculated similarity satisfies a predetermined condition; and
an output step of outputting information on the audio data corresponding to the partial sequence extracted in the search step.

(Appendix 10)
A program that causes a computer to function as:
a search target acquisition unit that acquires a first phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme included in audio data to be searched as its central phoneme and including the central phoneme and at least one of the phonemes immediately preceding and immediately following it;
a search word acquisition unit that converts a search word to be searched for in the audio data into a phoneme sequence and acquires a second phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme of the converted phoneme sequence as its central phoneme and including the central phoneme and at least one of the phonemes immediately preceding and immediately following it;
a search unit that calculates a first similarity between a third phoneme model sequence, which is the second phoneme model sequence with at least one of its first and last phoneme models removed, and partial sequences included in the first phoneme model sequence, and that extracts from the first phoneme model sequence a partial sequence for which the calculated first similarity satisfies a predetermined condition; and
an output unit that outputs information on the audio data corresponding to the partial sequence extracted by the search unit.

(Appendix 11)
A program that causes a computer to function as:
a search target acquisition unit that acquires a first phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme included in the speech data to be searched as its central phoneme and including the central phoneme and at least one of the phoneme immediately preceding and the phoneme immediately following the central phoneme;
a search word acquisition unit that converts a search word to be searched for in the speech data into a phoneme string and acquires a second phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme of the converted phoneme string as its central phoneme and including the central phoneme and at least one of the phoneme immediately preceding and the phoneme immediately following the central phoneme;
a search unit that assigns at least one of the first and last phoneme models of the second phoneme model sequence a lower weight than the remaining phoneme models constituting the second phoneme model sequence, calculates a similarity between the second phoneme model sequence and a subsequence included in the first phoneme model sequence, and extracts from the first phoneme model sequence a subsequence whose calculated similarity satisfies a predetermined condition; and
an output unit that outputs information on the speech data corresponding to the subsequence extracted by the search unit.

DESCRIPTION OF SYMBOLS
1 ... ROM, 2 ... RAM, 3 ... external storage device, 4 ... input device, 5 ... output device, 6 ... CPU, 61 ... speech data storage unit, 62 ... search target acquisition unit, 63 ... triphone string storage unit, 64 ... input unit, 65 ... conversion table storage unit, 66 ... search word acquisition unit, 67 ... search unit, 68 ... inter-triphone distance storage unit, 69 ... output unit, 100 ... voice search device

Claims

A voice search device comprising:
a search target acquisition unit that acquires a first phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme included in the speech data to be searched as its central phoneme and including the central phoneme and at least one of the phoneme immediately preceding and the phoneme immediately following the central phoneme;
a search word acquisition unit that converts a search word to be searched for in the speech data into a phoneme string and acquires a second phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme of the converted phoneme string as its central phoneme and including the central phoneme and at least one of the phoneme immediately preceding and the phoneme immediately following the central phoneme;
a search unit that calculates a first similarity between a third phoneme model sequence, which is the second phoneme model sequence with at least one of its first and last phoneme models excluded, and a subsequence included in the first phoneme model sequence, and extracts from the first phoneme model sequence a subsequence whose calculated first similarity satisfies a predetermined condition; and
an output unit that outputs information on the speech data corresponding to the subsequence extracted by the search unit.
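As a rough illustration of the search unit recited in this claim, the following Python sketch drops the first and last phoneme models from a query sequence and slides the remaining sequence over the target sequence. The function names, the tuple representation of a phoneme model, the match-ratio similarity, and the romanized phoneme labels are all assumptions made for illustration; the claim does not fix how the similarity is computed.

```python
from typing import List, Tuple

# Hypothetical representation: (preceding phoneme, central phoneme, following phoneme)
Triphone = Tuple[str, str, str]


def trim_edges(query: List[Triphone]) -> List[Triphone]:
    """Return the 'third' sequence: the query without its first and last models."""
    return query[1:-1]


def similarity(a: List[Triphone], b: List[Triphone]) -> float:
    """Assumed similarity: fraction of positions whose models are identical."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def search(target: List[Triphone], query: List[Triphone],
           min_sim: float) -> List[Tuple[int, float]]:
    """Return (start index, similarity) for each target window meeting min_sim."""
    core = trim_edges(query)
    if not core:
        return []  # nothing left to compare after trimming
    hits = []
    for start in range(len(target) - len(core) + 1):
        window = target[start:start + len(core)]
        s = similarity(core, window)
        if s >= min_sim:
            hits.append((start, s))
    return hits
```

In this sketch a returned start index marks where the trimmed sequence begins in the target, which is one model later than where the full query would begin.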

The voice search device according to claim 1, wherein,
when the number of phoneme models constituting the second phoneme model sequence is less than a threshold, the search unit calculates a second similarity between the second phoneme model sequence and a subsequence included in the first phoneme model sequence,
and extracts from the first phoneme model sequence a subsequence whose calculated second similarity satisfies a predetermined condition.
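The length-dependent behavior in this claim can be sketched as a small selection function. The name `choose_query_sequence` and the parameter `min_models` are hypothetical, and trimming both ends in the long-query branch is just one of the options the independent claim allows.

```python
from typing import List, Sequence, Tuple

Triphone = Tuple[str, str, str]  # hypothetical (preceding, central, following)


def choose_query_sequence(query: Sequence[Triphone],
                          min_models: int) -> List[Triphone]:
    """Pick the sequence to compare against the target.

    Queries with fewer than `min_models` phoneme models are used whole
    (the second similarity); longer queries lose their first and last
    models (the first similarity).
    """
    if len(query) < min_models:
        return list(query)
    return list(query[1:-1])
```

Using the whole sequence for short queries keeps enough phoneme models for a meaningful comparison, since trimming a very short query could leave little or nothing to match.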

The voice search device according to claim 1, wherein
the search unit calculates both the first similarity and a second similarity between the second phoneme model sequence and a subsequence included in the first phoneme model sequence,
and extracts from the first phoneme model sequence a subsequence for which the first and second similarities satisfy a predetermined condition.

A voice search device comprising:
a search target acquisition unit that acquires a first phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme included in the speech data to be searched as its central phoneme and including the central phoneme and at least one of the phoneme immediately preceding and the phoneme immediately following the central phoneme;
a search word acquisition unit that converts a search word to be searched for in the speech data into a phoneme string and acquires a second phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme of the converted phoneme string as its central phoneme and including the central phoneme and at least one of the phoneme immediately preceding and the phoneme immediately following the central phoneme;
a search unit that assigns at least one of the first and last phoneme models of the second phoneme model sequence a lower weight than the remaining phoneme models constituting the second phoneme model sequence, calculates a similarity between the second phoneme model sequence and a subsequence included in the first phoneme model sequence, and extracts from the first phoneme model sequence a subsequence whose calculated similarity satisfies a predetermined condition; and
an output unit that outputs information on the speech data corresponding to the subsequence extracted by the search unit.
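The weighting described in this claim can be illustrated with a weighted match ratio in which the first and last phoneme models count for less than the interior ones. The weight value 0.5, the function names, and the match-ratio measure are assumptions for illustration only; the claim requires only that the edge weights be lower than the rest.

```python
from typing import List, Sequence, Tuple

Triphone = Tuple[str, str, str]  # hypothetical (preceding, central, following)


def edge_weights(n: int, edge_weight: float) -> List[float]:
    """Weight 1.0 for interior models, `edge_weight` for the first and last."""
    weights = [1.0] * n
    if n > 0:
        weights[0] = edge_weight
        weights[-1] = edge_weight
    return weights


def weighted_similarity(query: Sequence[Triphone],
                        window: Sequence[Triphone],
                        edge_weight: float = 0.5) -> float:
    """Assumed measure: weighted fraction of positions with identical models."""
    weights = edge_weights(len(query), edge_weight)
    total = sum(weights)
    if total == 0:
        return 0.0  # empty query, or all weight removed
    matched = sum(w for w, q, t in zip(weights, query, window) if q == t)
    return matched / total
```

With three models and an edge weight of 0.5, a window that matches only the middle model scores 0.5 here, whereas an unweighted match ratio would give one third.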

The voice search device according to claim 1, wherein
the search unit extracts the subsequence using, as the predetermined condition, either that the similarity is equal to or higher than a predetermined threshold or that the similarity ranks within a predetermined rank in descending order of similarity.
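The two alternative conditions in this claim, a similarity threshold or a rank cutoff, can be sketched as follows. The `Hit` tuple layout and the function name `select_hits` are hypothetical.

```python
from typing import List, Optional, Tuple

Hit = Tuple[int, float]  # hypothetical (start index in the target, similarity)


def select_hits(hits: List[Hit],
                threshold: Optional[float] = None,
                top_n: Optional[int] = None) -> List[Hit]:
    """Keep hits by one of two conditions, most similar first.

    If `threshold` is given, keep hits whose similarity is at least that value.
    Otherwise, if `top_n` is given, keep the `top_n` most similar hits.
    """
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    if threshold is not None:
        return [h for h in ranked if h[1] >= threshold]
    if top_n is not None:
        return ranked[:top_n]
    return ranked
```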

The voice search device according to claim 1, wherein
the search unit calculates the similarity by continuous DP matching.
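The following is a simplified sketch in the spirit of continuous DP matching: a dynamic-programming alignment in which the query may start at any position of the target, so every target position receives the cost of the best alignment ending there. The 0/1 local distance and the three-way step pattern are assumptions; a device such as the one described would presumably look up distances between triphones (compare the inter-triphone distance storage unit 68) rather than test for equality. Lower cumulative distance corresponds to higher similarity.

```python
from typing import List, Sequence, Tuple

Triphone = Tuple[str, str, str]  # hypothetical (preceding, central, following)


def local_distance(a: Triphone, b: Triphone) -> float:
    """Stand-in distance: 0 for identical models, 1 otherwise (an assumption)."""
    return 0.0 if a == b else 1.0


def continuous_dp(target: Sequence[Triphone],
                  query: Sequence[Triphone]) -> List[float]:
    """For each target position j, return the lowest cumulative distance of an
    alignment of the whole query that ends at j. The start point is free."""
    n, m = len(query), len(target)
    if n == 0 or m == 0:
        return []
    inf = float("inf")
    # First query model may align with any target position at no prior cost.
    prev = [local_distance(query[0], target[j]) for j in range(m)]
    for i in range(1, n):
        cur = [inf] * m
        for j in range(m):
            best = prev[j]                     # query advances, target stays
            if j > 0:
                best = min(best,
                           prev[j - 1],        # both advance
                           cur[j - 1])         # target advances, query stays
            cur[j] = best + local_distance(query[i], target[j])
        prev = cur
    return prev
```

The index of the smallest value in the returned list marks the target position where the best-matching subsequence ends.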

The voice search device according to any one of claims 1 to 6, wherein
the phoneme model is a triphone including a central phoneme, the phoneme immediately preceding the central phoneme, and the phoneme immediately following the central phoneme.
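To make the triphone structure concrete, the sketch below turns a phoneme string into a time-ordered list of triphones. The function name `to_triphones` and the padding of both ends with a `sil` label are assumptions; the source does not say here how the ends of a sequence are represented.

```python
from typing import List, Tuple

Triphone = Tuple[str, str, str]  # (preceding, central, following)


def to_triphones(phonemes: List[str], boundary: str = "sil") -> List[Triphone]:
    """Build one triphone per phoneme, in order.

    `boundary` pads the two ends so that the first and last phonemes also
    have a neighbor on each side; this padding is an assumption.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]
```

For the phonemes `k`, `i`, `t` this yields three triphones, of which only the middle one, `("k", "i", "t")`, is free of the assumed boundary label.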

A voice search method including:
a search target acquisition step of acquiring a first phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme included in the speech data to be searched as its central phoneme and including the central phoneme and at least one of the phoneme immediately preceding and the phoneme immediately following the central phoneme;
a search word acquisition step of converting a search word to be searched for in the speech data into a phoneme string and acquiring a second phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme of the converted phoneme string as its central phoneme and including the central phoneme and at least one of the phoneme immediately preceding and the phoneme immediately following the central phoneme;
a search step of calculating a first similarity between a third phoneme model sequence, which is the second phoneme model sequence with at least one of its first and last phoneme models excluded, and a subsequence included in the first phoneme model sequence, and extracting from the first phoneme model sequence a subsequence whose calculated first similarity satisfies a predetermined condition; and
an output step of outputting information on the speech data corresponding to the subsequence extracted in the search step.

A voice search method including:
a search target acquisition step of acquiring a first phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme included in the speech data to be searched as its central phoneme and including the central phoneme and at least one of the phoneme immediately preceding and the phoneme immediately following the central phoneme;
a search word acquisition step of converting a search word to be searched for in the speech data into a phoneme string and acquiring a second phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme of the converted phoneme string as its central phoneme and including the central phoneme and at least one of the phoneme immediately preceding and the phoneme immediately following the central phoneme;
a search step of assigning at least one of the first and last phoneme models of the second phoneme model sequence a lower weight than the remaining phoneme models constituting the second phoneme model sequence, calculating a similarity between the second phoneme model sequence and a subsequence included in the first phoneme model sequence, and extracting from the first phoneme model sequence a subsequence whose calculated similarity satisfies a predetermined condition; and
an output step of outputting information on the speech data corresponding to the subsequence extracted in the search step.

A program that causes a computer to function as:
a search target acquisition unit that acquires a first phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme included in the speech data to be searched as its central phoneme and including the central phoneme and at least one of the phoneme immediately preceding and the phoneme immediately following the central phoneme;
a search word acquisition unit that converts a search word to be searched for in the speech data into a phoneme string and acquires a second phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme of the converted phoneme string as its central phoneme and including the central phoneme and at least one of the phoneme immediately preceding and the phoneme immediately following the central phoneme;
a search unit that calculates a first similarity between a third phoneme model sequence, which is the second phoneme model sequence with at least one of its first and last phoneme models excluded, and a subsequence included in the first phoneme model sequence, and extracts from the first phoneme model sequence a subsequence whose calculated first similarity satisfies a predetermined condition; and
an output unit that outputs information on the speech data corresponding to the subsequence extracted by the search unit.

A program that causes a computer to function as:
a search target acquisition unit that acquires a first phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme included in the speech data to be searched as its central phoneme and including the central phoneme and at least one of the phoneme immediately preceding and the phoneme immediately following the central phoneme;
a search word acquisition unit that converts a search word to be searched for in the speech data into a phoneme string and acquires a second phoneme model sequence in which phoneme models are arranged in time-series order, each phoneme model taking a phoneme of the converted phoneme string as its central phoneme and including the central phoneme and at least one of the phoneme immediately preceding and the phoneme immediately following the central phoneme;
a search unit that assigns at least one of the first and last phoneme models of the second phoneme model sequence a lower weight than the remaining phoneme models constituting the second phoneme model sequence, calculates a similarity between the second phoneme model sequence and a subsequence included in the first phoneme model sequence, and extracts from the first phoneme model sequence a subsequence whose calculated similarity satisfies a predetermined condition; and
an output unit that outputs information on the speech data corresponding to the subsequence extracted by the search unit.