JP2012059126A

JP2012059126A - Search device

Info

Publication number: JP2012059126A
Application number: JP2010203280A
Authority: JP
Inventors: Toshiyuki Hanazawa; 利行花沢; Yohei Okato; 洋平岡登
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2010-09-10
Filing date: 2010-09-10
Publication date: 2012-03-22
Anticipated expiration: 2030-09-10
Also published as: JP5404563B2

Abstract

PROBLEM TO BE SOLVED: When a user searches for a facility whose formal name is unknown, a conventional device compares the input character string with each candidate facility name word by word or syllable by syllable, calculates search scores from the number of matching words or syllables, and presents the candidates in descending order of score. Depending on the order of the words or syllables, such a device can return unnatural search results.
SOLUTION: A search device includes: search means which collates an input character string with a plurality of search-target documents and outputs a plurality of documents together with search scores that reflect how often the character string appears in those documents; a morpheme dictionary which holds the morphemes of the search-target documents and a penalty value assigned to each morpheme according to its importance in searching; and search ranking correction means which refers to the morpheme dictionary and reorders the search results on the basis of corrected search scores, obtained by subtracting the penalty values of morphemes that exist in a document but were not extracted from the character string.

Description

The present invention relates to a large-scale search device that efficiently retrieves a desired document or facility name from a large number of documents or facility names.

When building a search system whose targets are a wide variety of facility names, users may not know a facility's official name. Conventional techniques therefore decompose facility names into morphemes or syllables and search using unigrams or bigrams of those morphemes or syllables as the matching unit. One such technique is disclosed in Patent Document 1 below.

Patent Document 1 discloses a technique that compares the input character string with each search-target facility name in units such as words or syllables, calculates a search score from the number of matched word or syllable unigrams and bigrams, and presents candidates in descending order of score.
However, this technique has the following problem. Suppose that a department store named "Umibe" (ウミベ) in Ofuna has the official name "Umibe Ofuna" (ウミベ大船, read "うみべおーふな"). If the user enters "Ofuna Umibe" (大船ウミベ, read "おーふなうみべ") and the search is scored by the number of matching syllable bigrams, the unnatural candidate "A Shobo Ofuna Umibe-ten" (A書房大船ウミベ店, read "えーしょぼーおーふなうみべてん") is ranked above the correct answer "Umibe Ofuna". This happens because the syllable bigram "なう" in the input does not match "うみべおーふな" but does match "えーしょぼーおーふなうみべてん", so the latter receives the higher search score.

Patent Document 1: JP 2008-262279 A (特開2008-262279号公報)

The present invention has been made to solve the above problem, and its object is to suppress such unnatural search results and improve search accuracy.

The search device according to the present invention is a search device that retrieves a desired document from a plurality of search-target documents on the basis of an input character string, and comprises:
search means that takes the character string as input, collates it with the plurality of search-target documents, and outputs, as search results, a plurality of documents that partially or completely match the character string together with search scores corresponding to the number of times the character string appears in those documents;
a morpheme dictionary that holds the morphemes of each of the search-target documents and a penalty value assigned to each morpheme according to its importance in searching; and
search ranking correction means that takes the character string and the search results of the search means as input, extracts morphemes from the character string for each document in the search results by referring to the morpheme dictionary, corrects the search score by subtracting the penalty values of morphemes that exist in the document but were not extracted from the character string, and reorders and outputs the search results on the basis of the corrected search scores.

According to the search device of the present invention, the search ranking correction means refers to the morpheme dictionary, which holds the morphemes of each search-target document and a penalty value assigned to each morpheme according to its importance in searching, and corrects the search results produced by the search means (the documents retrieved for the input character string and their search scores, which reflect the number of times the character string appears in the documents). For each morpheme that exists in a search-target document but was not extracted from the input character string, the penalty value is subtracted from the search score, and the search results are reordered and output on the basis of the corrected scores. This has the effect of suppressing unnatural search results.

FIG. 1 is a block diagram showing the configuration of Embodiment 1 of the search device according to the present invention.
FIG. 2 is an explanatory diagram of the search-target facility names used to create the text search dictionary.
FIG. 3 is an explanatory diagram of the text search dictionary created from the search-target facility names.
FIG. 4 is an explanatory diagram showing an example of a morpheme dictionary in which penalty values are set.
FIG. 5 is an explanatory diagram showing an output example of the intermediate search result, that is, the pairs of facility-name ID number and search score produced by the search means.
FIG. 6 is a flowchart of the processing procedure of the search ranking correction means.
FIG. 7 is an explanatory diagram showing the result of sorting by corrected search score in the search ranking correction means.
FIG. 8 is a block diagram showing the configuration of Embodiment 2 of the search device according to the present invention.
FIG. 9 is an explanatory diagram showing the result of sorting by corrected search score in the search ranking correction means of Embodiment 2.

Embodiment 1.
This embodiment is described using the example of searching for the names of facilities and sightseeing spots (hereinafter, for simplicity, facilities and sightseeing spots are collectively called "facilities").
FIG. 1 is a block diagram showing the configuration of Embodiment 1 of the search device according to the present invention.
In the figure, 1 is a character string input terminal, 2 is a character string, 3 is search means, 4 is a search dictionary memory, 5 is an intermediate search result, 6 is search ranking correction means, 7 is a morpheme dictionary memory, and 8 is a search result.

A text search dictionary is created in advance and stored in the search dictionary memory 4. It is created as follows. As shown in FIG. 2, assume that the search-target facility names include "A書房大船ウミベ店" (A Shobo Ofuna Umibe-ten, read "えーしょぼーおーふなうみべてん") and "ウミベ大船" (Umibe Ofuna, read "うみべおーふな"). The text in parentheses gives the reading of each facility name. Here "ウミベ" (Umibe) is a proper noun, taken in this example to be the name of a department store.

The text search dictionary is organized as an inverted index whose index terms are the linguistic units that make up the facility names. In this example, the index terms are pairs of consecutive syllables (syllable bigrams) taken from the reading of each facility name. The reading "えーしょぼーおーふなうみべてん" of "A書房大船ウミベ店" contains 10 syllable bigrams: "えーしょ", "しょぼー", "ぼーおー", "おーふ", "ふな", "なう", "うみ", "みべ", "べて", and "てん". The reading "うみべおーふな" of "ウミベ大船" contains 5 syllable bigrams: "うみ", "みべ", "べおー", "おーふ", and "ふな". The search dictionary memory 4 holds, as the text search dictionary, each of these syllable bigrams as an index term together with the ID numbers of the facility names that contain it. FIG. 3 shows the text search dictionary created from these facility names.
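
As a rough illustration of how such an inverted index could be built, the following Python sketch splits a hiragana reading into syllables and indexes its syllable bigrams. The function names and the segmentation rule (a small ゃ, ゅ, ょ or a long-vowel mark ー attaches to the preceding kana) are assumptions made for this example; the text does not specify how syllables are segmented, only which bigrams result.

```python
from collections import defaultdict

# Assumed rule: these characters attach to the preceding kana to form one syllable.
ATTACHING = set("ゃゅょー")

def split_syllables(reading):
    """Split a hiragana reading into syllables, e.g. 'おーふな' -> ['おー', 'ふ', 'な']."""
    syllables = []
    for ch in reading:
        if ch in ATTACHING and syllables:
            syllables[-1] += ch
        else:
            syllables.append(ch)
    return syllables

def syllable_bigrams(reading):
    """Return every pair of consecutive syllables, in order of appearance."""
    s = split_syllables(reading)
    return [a + b for a, b in zip(s, s[1:])]

def build_inverted_index(facilities):
    """Map each syllable bigram to the set of facility IDs whose reading contains it."""
    index = defaultdict(set)
    for facility_id, reading in facilities.items():
        for bigram in syllable_bigrams(reading):
            index[bigram].add(facility_id)
    return dict(index)

# The two facility readings used in the text, keyed by facility ID.
FACILITIES = {1: "えーしょぼーおーふなうみべてん", 2: "うみべおーふな"}
INDEX = build_inverted_index(FACILITIES)
```

With these two readings, the index entry for "なう" contains only facility ID 1, while "うみ" maps to both IDs, which is the asymmetry behind the ranking problem described earlier.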

A morpheme dictionary is created in advance and stored in the morpheme dictionary memory 7. It is created as follows. First, each search-target facility name is split into morphemes using a morphological analyzer or the like; the split may be corrected by hand where necessary. For languages such as English, whose text is already divided into words, no splitting is needed and each word is treated as a morpheme. Next, each morpheme is assigned a predetermined penalty value according to its importance in searching, and the penalty values are stored together with the morphemes as the morpheme dictionary. In this embodiment, the less likely a morpheme is to be omitted when searching for the facility, the larger its penalty value. FIG. 4 shows example morpheme dictionaries for "A書房大船ウミベ店" and "ウミベ大船". The morpheme dictionary for "A書房大船ウミベ店" is "えーしょぼー" (3), "おーふな" (1), "うみべ" (1), where the number in parentheses is the penalty value. A user searching for "A書房大船ウミベ店" is unlikely to omit the morpheme "えーしょぼー" from character string 2, so it is given a larger penalty value than the other morphemes. Likewise, in the morpheme dictionary for "ウミベ大船", the morpheme "うみべ" is unlikely to be omitted from an utterance when searching for that facility, so it is given a larger penalty value than the other morphemes.

Next, the search operation is described.
When character string 2 is input from the character string input terminal 1, the search means 3 first extracts all syllable bigrams that make up character string 2. For example, if character string 2 is "おーふなうみべ", five syllable bigrams are extracted: "おーふ", "ふな", "なう", "うみ", and "みべ".

Next, the search means 3 refers to the text search dictionary stored in the search dictionary memory 4 and, for each extracted syllable bigram, adds 1 to the search score of every facility that contains that bigram. This score addition is performed for all extracted syllable bigrams. In this example, "A書房大船ウミベ店" (facility ID = 1) matches all five bigrams of character string 2 ("おーふ", "ふな", "なう", "うみ", "みべ"), so its search score is 5. "ウミベ大船" (facility ID = 2) matches four of them ("おーふ", "ふな", "うみ", "みべ"), so its search score is 4. After the addition is complete, the search means 3 outputs, as the intermediate search result 5, the pairs of ID number and search score for the N facility names whose search score is 1 or more, where N is an integer of 1 or more. FIG. 5 shows an output example of the intermediate search result 5.
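
The scoring step can be sketched in Python as follows. This is a minimal, self-contained illustration rather than the device's actual implementation: the helper names and the syllable segmentation rule (ゃ, ゅ, ょ and ー attach to the preceding kana) are assumptions, and the index is rebuilt on each call only to keep the example short.

```python
from collections import Counter

ATTACHING = set("ゃゅょー")  # assumed: these attach to the preceding kana

def split_syllables(reading):
    syllables = []
    for ch in reading:
        if ch in ATTACHING and syllables:
            syllables[-1] += ch
        else:
            syllables.append(ch)
    return syllables

def syllable_bigrams(reading):
    s = split_syllables(reading)
    return [a + b for a, b in zip(s, s[1:])]

def search(query, facilities):
    """Return (facility_id, score) pairs with score >= 1, highest score first."""
    # Inverted index: syllable bigram -> IDs of facilities containing it.
    index = {}
    for facility_id, reading in facilities.items():
        for bigram in syllable_bigrams(reading):
            index.setdefault(bigram, set()).add(facility_id)
    # Add 1 to a facility's score for every query bigram it contains.
    scores = Counter()
    for bigram in syllable_bigrams(query):
        for facility_id in index.get(bigram, ()):
            scores[facility_id] += 1
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

FACILITIES = {1: "えーしょぼーおーふなうみべてん", 2: "うみべおーふな"}
print(search("おーふなうみべ", FACILITIES))  # [(1, 5), (2, 4)]
```

The output reproduces the scores in the text: facility ID 1 scores 5 and facility ID 2 scores 4, so the uncorrected ranking puts the unnatural candidate first.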

Next, the search ranking correction means 6 takes as input character string 2 from the character string input terminal 1 and the intermediate search result 5 from the search means 3. For each of the N facility names in the intermediate search result 5, it collates character string 2 against that facility name's morpheme dictionary, stored in the morpheme dictionary memory 7, and extracts the morphemes contained in character string 2. It then compares the extracted morphemes with the facility's morpheme dictionary and, for each morpheme that is in the dictionary but was not extracted from character string 2, applies the penalty value preset in the morpheme dictionary shown in FIG. 4 to rescore the search score.

The specific processing procedure of the search ranking correction means 6 is described below with reference to FIG. 6.
Step 1) Set k = 1 (st101 in FIG. 6).
Step 2) Referring to the morpheme dictionary held in the morpheme dictionary memory 7, collate the morphemes of the k-th ranked facility name in the intermediate search result 5 of the search means 3 shown in FIG. 5 (here k = 1, so the first-ranked one) with character string 2, and extract the morphemes contained in character string 2 (st102 in FIG. 6). Here, collation means checking whether a combination of one or more morphemes in the morpheme dictionary matches character string 2; if it does, those morphemes are judged to be contained in character string 2 and are extracted.
For example, when k = 1, the first-ranked search result is the facility name with facility ID = 1, as described above, and, as shown in FIG. 4, the morphemes in its morpheme dictionary are "えーしょぼー", "おーふな", "うみべ", and "てん". Collating these morphemes with character string 2, "おーふなうみべ", extracts two morphemes: "おーふな" and "うみべ".

Step 3) Compare the morphemes extracted from character string 2 in Step 2 with the morphemes in the morpheme dictionary of the k-th ranked search result and, for the morphemes that exist in the morpheme dictionary but not in character string 2, sum their penalty values from the dictionary to obtain the accumulated penalty P(k) (st103 in FIG. 6).
For example, when k = 1, the morphemes contained in character string 2 are "おーふな" and "うみべ", as described above, and the morphemes in the morpheme dictionary are "えーしょぼー", "おーふな", "うみべ", and "てん". The morphemes that exist in the morpheme dictionary but not in character string 2 are therefore "えーしょぼー" and "てん". As shown in FIG. 4, their penalty values are 3 and 0 respectively, so the accumulated penalty is P(1) = 3 + 0 = 3.

Step 4) From the accumulated penalty P(k) calculated in Step 3 and the search score S(k), calculate the corrected search score S'(k) using equation (1) below (st104 in FIG. 6). In equation (1), α is a constant determined experimentally in advance; in this embodiment, α = 0.5.

S'(k) = S(k) - αP(k) ... (1)
As a result, in the k = 1 example above, S'(1) = 5 - 0.5 × 3 = 3.5.

Step 5) If k = N, go to Step 6. If k < N, set k = k + 1 and return to Step 2 (st105 and st106 in FIG. 6).
Step 6) Using the corrected scores S'(k) (k = 1 to N) calculated in Step 4, sort the search results in descending order of S'(k) and output them as the search result 8 (st107 in FIG. 6).
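
Steps 1 to 6 can be sketched in Python as shown below. Several details are assumptions made for illustration: the function names are hypothetical, the collation in Step 2 is interpreted as finding an ordering of dictionary morphemes whose concatenation equals the whole of character string 2, and the penalty values for facility ID 2 are placeholders (the text only states that "うみべ" receives the larger value).

```python
ALPHA = 0.5  # value of α given for this embodiment

# Morpheme dictionaries keyed by facility ID: morpheme -> penalty value.
MORPHEME_DICT = {
    1: {"えーしょぼー": 3, "おーふな": 1, "うみべ": 1, "てん": 0},
    2: {"うみべ": 3, "おーふな": 1},  # placeholder values, not taken from FIG. 4
}

def extract_morphemes(query, morphemes):
    """Step 2: return morphemes whose concatenation, in some order, equals query.

    Returns an empty list when no such combination exists.
    """
    def cover(rest, unused):
        if not rest:
            return []
        for m in unused:
            if rest.startswith(m):
                tail = cover(rest[len(m):], [u for u in unused if u != m])
                if tail is not None:
                    return [m] + tail
        return None
    return cover(query, list(morphemes)) or []

def correct_ranking(query, intermediate, morpheme_dict, alpha=ALPHA):
    """Apply S'(k) = S(k) - alpha * P(k) to each result and re-sort."""
    corrected = []
    for facility_id, score in intermediate:            # Steps 1 and 5: k = 1..N
        morphemes = morpheme_dict[facility_id]
        found = extract_morphemes(query, morphemes)    # Step 2
        penalty = sum(p for m, p in morphemes.items() if m not in found)  # Step 3
        corrected.append((facility_id, score - alpha * penalty))          # Step 4
    return sorted(corrected, key=lambda item: -item[1])                   # Step 6

print(correct_ranking("おーふなうみべ", [(1, 5), (2, 4)], MORPHEME_DICT))
# [(2, 4.0), (1, 3.5)]
```

For the query "おーふなうみべ", facility ID 1 loses 0.5 × 3 = 1.5 points for the missing "えーしょぼー" and "てん", while facility ID 2 loses nothing, so the order of the two candidates is reversed.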

This completes the processing procedure. For facility ID = 2, ranked second in the output of the search means, the morphemes in the morpheme dictionary are "うみべ" and "おーふな", as shown in FIG. 4. Collating them with character string 2, "おーふなうみべ", extracts both morphemes, "おーふな" and "うみべ". Since every morpheme in the morpheme dictionary is present in character string 2, the accumulated penalty P(2) is 0, and the corrected search score from equation (1) is S'(2) = S(2) = 4.
FIG. 7 shows the result of reordering the search results in descending order of corrected search score. "ウミベ大船" is now ranked first.

As described above, this embodiment provides a morpheme dictionary for each facility name and sets, for each morpheme, a penalty value that is applied when the morpheme is not contained in character string 2. The less likely a morpheme is to be omitted when searching for the facility, the larger its penalty value, and the search results are output in descending order of the corrected score S'(k), obtained by subtracting the accumulated penalty P(k) as described above. This suppresses the unnatural result in which, for the utterance "大船ウミベ" (Ofuna Umibe), "A書房大船ウミベ店" is ranked above "ウミベ大船".

In this example the search means 3 uses syllable bigrams as the index terms of the inverted index, but any unit may be used as an index term, for example word bigrams, or word or syllable unigrams. Likewise, although this example describes a search method that uses an inverted index, the search means 3 may use any search method that allows partial matching between character string 2 and the search targets.

As for the penalty values assigned to the morphemes in the morpheme dictionary, for a facility name whose final morpheme is "店" (store), the first morpheme may be given a larger penalty value than the other morphemes. This is because facilities located inside parks or department stores are commonly named in the pattern "proper noun such as the facility's brand name + (park name or department store name) + 店", and the first morpheme of such a name is thought to be almost never omitted when searching for the facility. Assigning penalty values by this rule makes the penalty assignment work more efficient.
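
The rule above could be automated along the following lines. This Python sketch is only an illustration: the function name and the concrete values (3 for the emphasized first morpheme, 1 for all others) are assumptions, since the text states only that the first morpheme receives a larger value than the rest.

```python
def assign_penalties(morphemes, high=3, default=1):
    """Assign a penalty to each morpheme of one facility name.

    If the last morpheme is "店", the first morpheme gets the larger value `high`;
    every other morpheme gets `default`. Both values are hypothetical.
    """
    penalties = {m: default for m in morphemes}
    if morphemes and morphemes[-1] == "店":
        penalties[morphemes[0]] = high
    return penalties

print(assign_penalties(["A書房", "大船", "ウミベ", "店"]))
# {'A書房': 3, '大船': 1, 'ウミベ': 1, '店': 1}
```

Names that do not end in "店", such as "ウミベ大船", are left with the default value for every morpheme and would still need their penalties set by another means, for example by hand.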

Embodiment 2.
As in Embodiment 1, this embodiment is described using the example of searching for facility names.
FIG. 8 is a block diagram showing the configuration of Embodiment 2 of the search device according to the present invention.
In the figure, parts equivalent to those of Embodiment 1 are given the same reference numerals, and their description is omitted. 9 is a speech input terminal, 10 is input speech, 11 is speech recognition means, 12 is a language model memory, and 13 is an acoustic model memory.

A statistical language model is created in advance and stored in the language model memory 12. In this example, syllable-unit trigrams are trained using as training data the syllable strings of the names of all search-target facilities, and are stored. The advantage of using syllables as the unit is that the number of distinct syllables stays at a few hundred or fewer regardless of the number of facilities in the training data, so a language model can be built that limits the increase in computation at recognition time.
The acoustic model memory 13 stores an acoustic model that models the characteristics of speech. In this embodiment, the acoustic model is, for example, an HMM (Hidden Markov Model).
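
To make the syllable trigram language model more concrete, the Python sketch below counts syllable trigrams over facility readings and derives unsmoothed maximum-likelihood probabilities. The text does not say how the trigram model is estimated, so the estimation method, the sentence boundary markers, the function names, and the syllable segmentation rule are all assumptions for illustration; a practical recognizer would normally add smoothing.

```python
from collections import Counter

ATTACHING = set("ゃゅょー")  # assumed: these attach to the preceding kana

def split_syllables(reading):
    syllables = []
    for ch in reading:
        if ch in ATTACHING and syllables:
            syllables[-1] += ch
        else:
            syllables.append(ch)
    return syllables

def train_trigrams(readings):
    """Count syllable trigrams and their two-syllable contexts."""
    trigrams, contexts = Counter(), Counter()
    for reading in readings:
        # Pad with assumed boundary markers so the first syllables have a context.
        s = ["<s>", "<s>"] + split_syllables(reading) + ["</s>"]
        for a, b, c in zip(s, s[1:], s[2:]):
            trigrams[(a, b, c)] += 1
            contexts[(a, b)] += 1
    return trigrams, contexts

def trigram_prob(trigrams, contexts, a, b, c):
    """Maximum-likelihood estimate of P(c | a, b); 0.0 for an unseen context."""
    seen = contexts[(a, b)]
    return trigrams[(a, b, c)] / seen if seen else 0.0

TRI, CTX = train_trigrams(["えーしょぼーおーふなうみべてん", "うみべおーふな"])
print(trigram_prob(TRI, CTX, "う", "み", "べ"))  # 1.0
print(trigram_prob(TRI, CTX, "ふ", "な", "う"))  # 0.5
```

In this two-name training set, "うみ" is always followed by "べ", whereas "ふな" is followed once by "う" and once by the end of the name, which gives the probabilities 1.0 and 0.5.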

Next, the speech recognition and search operations are described.
When speech 10 is input from the speech input terminal 9, the speech recognition means 11 performs speech recognition, for example with the Viterbi algorithm, using the language model stored in the language model memory 12 and the acoustic model stored in the acoustic model memory 13, and outputs character string 2 as the speech recognition result. In this example, character string 2 is written in hiragana.
For example, when the utterance in speech 10 is "大船ウミベ" (Ofuna Umibe), the output of the speech recognition means 11 might be "おーふなうみで". In this example, the last syllable of "うみべ" is assumed to have been misrecognized as "で".

Next, the search means 3 takes character string 2, "おーふなうみで", as input and performs the search as follows. First, it extracts all syllable bigrams that make up "おーふなうみで": in this example, the five bigrams "おーふ", "ふな", "なう", "うみ", and "みで". Next, it refers to the text search dictionary stored in the search dictionary memory 4 and, for each extracted syllable bigram, adds 1 to the search score of every facility that contains that bigram. This score addition is performed for all extracted syllable bigrams. In this example, "A書房大船ウミベ店" (facility ID = 1) matches four bigrams of character string 2 ("おーふ", "ふな", "なう", "うみ"), so its search score is 4. "ウミベ大船" (facility ID = 2) matches three ("おーふ", "ふな", "うみ"), so its search score is 3. After the addition is complete, the search means 3 outputs, as the intermediate search result 5, the pairs of ID number and search score for the N facility names whose search score is 1 or more, where N is an integer of 1 or more.

Next, the search ranking correction means 6 takes character string 2 and the intermediate search result 5 as input. For each of the N facility names in the intermediate search result 5, it collates character string 2 against that facility name's morpheme dictionary and extracts the morphemes contained in character string 2. It then compares the extracted morphemes with the facility's morpheme dictionary and, for each morpheme that is in the dictionary but was not extracted from the recognition result, applies the preset penalty value to rescore the search score.

The specific processing procedure of the search ranking correction means 6 is almost the same as in Embodiment 1. The difference lies in Step 2 of the procedure described in Embodiment 1, namely in how the morphemes of each retrieved facility name are collated with character string 2. In Embodiment 1, collation checks whether a combination of one or more morphemes in the morpheme dictionary matches character string 2. In this embodiment, collation is performed by DP (dynamic programming) matching between a combination of one or more morphemes in the morpheme dictionary and character string 2, allowing substitution, deletion, or insertion of syllables or phonemes. If the number of substitutions, deletions, and insertions is no more than a predetermined number c, those morphemes are judged to be contained in character string 2 and are extracted. In this embodiment, c = 1. DP matching is used so that morphemes can still be extracted when character string 2 contains speech recognition errors and its syllables or phonemes do not exactly match the morphemes in the morpheme dictionary.

For example, when k = 1, the first-ranked search result is the facility name with facility ID = 1, and, as shown in FIG. 4, the morphemes in its morpheme dictionary are "えーしょぼー", "おーふな", "うみべ", and "てん". Collation by DP matching is performed between these morphemes and the speech recognition result "おーふなうみで". This extracts two morphemes, "おーふな" and "うみべ", from character string 2. No syllable string in "おーふなうみで" exactly matches "うみべ", but the only difference is a single substitution of the syllable "で" for "べ", so DP matching makes the extraction possible.

When k = 2, the second-ranked search result is facility ID = 2, whose morpheme dictionary contains "うみべ" and "おーふな", as shown in FIG. 4. DP matching between these morphemes and character string 2, the speech recognition result "おーふなうみで", extracts the two morphemes "おーふな" and "うみべ".
The processing from Step 3 onward is the same as in Embodiment 1, so its description is omitted.

FIG. 9 shows the result of reordering the search ranking in descending order of corrected search score through the above processing. According to FIG. 9, 「ウミベ大船」 (Umibe Ōfuna) is ranked first. Compared with the search scores and corrected search scores of Embodiment 1 shown in FIG. 7, those of this embodiment are each smaller by 1. As noted above, this is because the final syllable 「で」 of 「おーふなうみで」, character string 2 of the speech recognition result, is a misrecognition of 「べ」, so one fewer syllable bigram matches when the search means 3 calculates the search score.
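The one-point drop can be illustrated by counting shared bigrams. The following Python sketch is only an illustration under stated assumptions: each kana character is treated as one syllable, the score is taken to be a plain count of distinct shared bigrams, and the reading of each facility name is assumed to be the concatenation of the morphemes listed for it in this embodiment. The function names are hypothetical, and the sketch does not reproduce the actual values in FIG. 7 or FIG. 9.

```python
def bigrams(units):
    """Set of adjacent pairs of units (here: kana characters)."""
    return {units[i:i + 2] for i in range(len(units) - 1)}


def shared_bigram_count(query, name):
    """Number of distinct bigrams that the query and the name have in common."""
    return len(bigrams(query) & bigrams(name))


# Assumed readings, built from the morphemes listed for each facility.
facility_1 = "えーしょぼー" + "おーふな" + "うみべ" + "てん"
facility_2 = "うみべ" + "おーふな"

correct = "おーふなうみべ"        # what the user said
misrecognized = "おーふなうみで"  # recognition result with the last syllable wrong

for name in (facility_1, facility_2):
    drop = shared_bigram_count(correct, name) - shared_bigram_count(misrecognized, name)
    print(drop)  # prints 1 for both facilities
```

Only the bigram 「みべ」 is lost through the misrecognition, so each count falls by exactly one.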

As for the penalty values assigned to the morphemes of the morpheme dictionary held in the morpheme dictionary memory 7, for a facility name whose last morpheme is 「店」 ("store"), the first morpheme of that facility name may be given a larger penalty value than the other morphemes. This is because names of facilities located in parks or department stores commonly follow the pattern "proper noun such as the facility's brand name + (park name or department store name) + 店", and the first morpheme of such a name is almost never omitted when searching for the facility. Assigning penalty values in this way makes the penalty assignment work more efficient.
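This assignment rule could be automated along the following lines. The Python sketch below is a minimal illustration, not the method of the source: the penalty values `DEFAULT_PENALTY` and `HEAD_PENALTY`, the function name `assign_penalties`, and the use of the reading 「てん」 to detect a trailing 「店」 are all assumptions.

```python
DEFAULT_PENALTY = 1.0  # hypothetical value for ordinary morphemes
HEAD_PENALTY = 3.0     # hypothetical larger value for the leading morpheme


def assign_penalties(morphemes, store_suffix="てん"):
    """Pair each morpheme of one facility name with a penalty value.

    If the last morpheme is the store suffix, the first morpheme receives
    the larger penalty; every other morpheme receives the default.
    """
    penalties = [DEFAULT_PENALTY] * len(morphemes)
    if len(morphemes) >= 2 and morphemes[-1] == store_suffix:
        penalties[0] = HEAD_PENALTY
    return list(zip(morphemes, penalties))


# Facility name ending in 「てん」: the leading morpheme is weighted more heavily.
print(assign_penalties(["えーしょぼー", "おーふな", "うみべ", "てん"]))
# Facility name without the suffix: all morphemes keep the default penalty.
print(assign_penalties(["うみべ", "おーふな"]))
```

A rule of this kind would supply initial values automatically, leaving only exceptional entries to be adjusted by hand.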

The present invention relates to a search device that efficiently performs large-scale searches for a desired document or facility name from among a large number of documents and facility names based on a character string, and is applicable to various navigation systems, such as portable terminals and car navigation systems.

1, 9: character string input terminal; 2: character string; 3: search means; 4: search dictionary memory; 5: intermediate search result; 6: search rank correction means; 7: morpheme dictionary memory; 8: search result; 10: input speech; 11: speech recognition means; 12: language model memory; 13: acoustic model memory.

Claims

A search device for searching for a desired document from among a plurality of documents to be searched based on an input character string, comprising:
search means that takes the character string as input, collates the character string with the plurality of documents to be searched, and outputs, as search results, a plurality of documents that partially or completely match the character string together with search scores corresponding to the number of times the character string appears in those documents;
a morpheme dictionary that holds, for each of the plurality of documents to be searched, its morphemes and a penalty value assigned to each morpheme according to its importance in searching; and
search rank correction means that takes the character string and the search results of the search means as input, extracts, for each document in the search results, morphemes from the character string with reference to the morpheme dictionary, corrects the search score by subtracting the penalty values of morphemes that exist in the document but were not extracted from the character string, and reorders and outputs the search results based on the corrected search scores.

2. The search device according to claim 1, wherein the input character string is a recognition result output as a character string by speech recognition means that recognizes input speech.

The search device, wherein a larger penalty value is assigned to a morpheme that is less likely to be omitted from the input character string when searching for the document.

The search device according to claim 1, wherein the search rank correction means uses DP matching on the character string as the method of extracting morphemes from the character string with reference to the morpheme dictionary, and extracts morphemes from the character string even when the character string and the morphemes in the morpheme dictionary do not completely match.

The search device according to claim 1, wherein the documents to be searched are a plurality of facility names, and the first morpheme of a facility name whose last morpheme is 「店」 ("store") is given a larger penalty value than the other morphemes.