JP2006107070A - Different notation word generation program and apparatus

Info

Publication number: JP2006107070A
Application number: JP2004292057A
Authority: JP
Inventors: Naoto Akira (秋良 直人)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2004-10-05
Filing date: 2004-10-05
Publication date: 2006-04-20

Abstract

PROBLEM TO BE SOLVED: To provide a technique for generating different notation words of a word that includes kanji and hiragana.

SOLUTION: A different notation word generation program that generates different notation words of a designated word generates a plurality of notation fluctuation character string pairs from a plurality of different notation word pairs, replaces character strings included in the designated word by using the notation fluctuation character string pairs, and thereby generates the different notation words of the designated word.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a program and an apparatus for generating notation fluctuation character string pairs, which are used to generate different notation words or to determine whether words are different notation words.

In text processing systems such as document retrieval, different notation words, which cause the same word to be treated as distinct words, are a problem. Examples include fluctuations in katakana or kanji notation, differences in the okurigana of kanji, differences between kana and Latin-alphabet spellings, and abbreviations. In document retrieval, different notation words cause relevant documents to be missed. One way to treat different notation words as the same word is to build a dictionary of them by hand. However, because the population of words is unknown and the number of distinct words is large, such a dictionary can cover the different notation words of only some words, and because it requires a great deal of manual work, the dictionary is costly to create.

Techniques that create different notation expansion rules and generate different notation words from them have been developed to solve this problem. For example, Japanese Patent Laid-Open No. 6-44295 (Patent Document 1) creates highly general different notation expansion rules, describes less general notation fluctuations individually for each word registered in a dictionary, and uses both to generate different notation words of katakana words. Among methods that do not use different notation expansion rules, there are techniques that extract different notation word candidates for a specified word from a corpus and register the different notation words in a dictionary. For example, Japanese Patent Laid-Open No. 2004-110633 (Patent Document 2) extracts words related to the target word based on word co-occurrence information, generates as different notation word candidates those related words whose notation is highly similar to that of the target word, and registers in a dictionary the words that a person has judged to be different notation words.

Patent Document 1: Japanese Patent Laid-Open No. 6-44295 (JP-A-6-44295)

Patent Document 2: Japanese Patent Laid-Open No. 2004-110633 (JP 2004-110633 A)

With the conventional techniques described above, when different notation words are generated using different notation expansion rules, there is a limit to how many rules can be written by hand, so only different notation words caused by highly general notation fluctuations, such as those of katakana words, can be handled. When different notation words are generated without using different notation expansion rules, different notation words of words containing kanji or hiragana can be generated, but it is difficult to generate different notation words for words not registered in the dictionary or for low-frequency words.
The objects of the present invention are to generate different notation words that also cover fluctuations in katakana and kanji notation, differences in the okurigana of kanji, and the like, and to generate the different notation expansion rules used for generating different notation words with minimal human intervention.

To achieve the above objects, the invention disclosed in the present application is outlined as follows.
The different notation word generation program of the present invention reads different notation word data from a storage device, generates partial character string correspondences from the different notation word pairs in that data, generates notation fluctuation character string pairs from the partial character string correspondences, and stores them in a storage device such as a memory or a hard disk.
Next, for each character string that is contained in the word for which different notation words are sought and that also appears in the notation fluctuation character string pair data, the program replaces the character string with its counterpart in the notation fluctuation character string pair to generate a different notation word, and outputs the different notation word to a display device or a storage device.

According to the present invention, generating different notation words using notation fluctuation character string pairs generated from different notation word data makes it possible to generate different notation words caused by fluctuations in kanji notation, differences in the okurigana of kanji, kanji readings, and the like, which were difficult to generate with methods that create different notation expansion rules by hand.
In addition, because notation fluctuation character string pairs are general-purpose, different notation words can be generated even for words that were not used to generate the pairs and for unknown words.

A first embodiment of the present invention is described below with reference to the drawings.
FIG. 1 is a configuration diagram of the different notation word generation device of this embodiment. The device comprises a central processing unit (CPU) 101, a main memory 102, a display device 103, an input device 104, and a storage device 110.
The storage device 110 stores an OS (operating system) 111, different notation word data 112, notation fluctuation character string pair data 113, word data 114, a partial character string correspondence generation program 115, a notation fluctuation character string pair generation program 116, a different notation word generation program 117, and a different notation word display program 118.

Word pairs that are in a notation fluctuation relationship are registered in the different notation word data 112. The notation fluctuation character string pairs generated from the different notation word data 112 are registered in the notation fluctuation character string pair data 113. A list of words extracted from arbitrary text is registered in the word data 114. The partial character string correspondence generation program 115 generates partial character string correspondences from the different notation word data 112. The notation fluctuation character string pair generation program 116 generates notation fluctuation character string pairs from the partial character string correspondences generated by the partial character string correspondence generation program 115, and registers them in the notation fluctuation character string pair data 113. The different notation word generation program 117 receives a word from the input device 104, replaces character strings contained in the word using the notation fluctuation character string pair data 113, and outputs as different notation words those resulting words that are registered in the word data 114. The different notation word display program 118 displays the different notation words generated by the different notation word generation program 117 on the display device 103.
These programs are loaded into the main memory 102 and executed under the control of the CPU 101.

Next, the processing flow of this embodiment is described with reference to the flowchart of FIG. 2.
First, the partial character string correspondence generation program 115 reads the different notation word data 112 from a storage device such as a memory or a hard disk (S201), and generates partial character string correspondences for each different notation word pair contained in the different notation word data, an example of which is shown in FIG. 3 (S202). Here, a partial character string correspondence is a correspondence between word-constituting character strings of one or more characters, obtained by aligning the characters of the two words so that the number of differing characters between the words is minimized. Using the DP matching method (dynamic programming), which is well known in fields such as pattern recognition and is described in Richard Bellman, "Dynamic Programming", Princeton University Press, Princeton, New Jersey, 1957, the partial character string correspondences are generated from the correspondence between Ai and Bj that minimizes the number of differing characters D(WA, WB) between word WA and word WB, computed by Equation (1) (not reproduced in this text), where

Word WA = A1A2...Ai...AI (Ai is the i-th character)
Word WB = B1B2...Bj...BJ (Bj is the j-th character)

For example, in the different notation word pair "書き込み‐書込み", the characters "書", "込", and "み" are common to both words, so the partial character string correspondences "書き‐書" and "込み‐込み" shown in FIG. 4, and "書‐書" and "き込み‐込み" shown in FIG. 5, are generated. Here, the character strings that form a partial character string are generated by giving priority to joining characters along horizontal or vertical arrows, as in the examples of FIGS. 4 and 5, and by joining consecutive common characters into one string. Likewise, in the different notation word pair "書込み‐かき込み", "込" and "み" are common, so the partial character string correspondences "書‐かき" and "込み‐込み", and "書‐か" and "込み‐き込み", are generated. The inter-character distance d(i, j) may also be defined differently for each combination of character types, such as kanji to kanji, kanji to hiragana, and hiragana to hiragana.
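The alignment step described above can be sketched in Python as follows. This is a minimal sketch, not the patent's implementation: since Equation (1) is not available in this text, it assumes the standard edit-distance recurrence with unit insertion and deletion costs and a substitution cost d(i, j) of 0 for identical characters and 1 otherwise. The names `unit_cost`, `align`, and `correspondences` are hypothetical, and the grouping rule (an unmatched character is attached to the preceding segment) is an assumption that reproduces only the FIG. 4 style grouping, not the FIG. 5 alternative.

```python
def unit_cost(x, y):
    """Assumed character distance d(i, j): 0 if equal, 1 otherwise."""
    return 0 if x == y else 1


def align(a, b, sub_cost=unit_cost):
    """Return (distance, steps); each step is a (char_of_a, char_of_b) pair,
    with "" standing for a gap."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,                      # skip a[i-1]
                          D[i][j - 1] + 1,                      # skip b[j-1]
                          D[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    # Trace one minimum-cost path back from the bottom-right corner.
    steps, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]):
            steps.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            steps.append((a[i - 1], ""))
            i -= 1
        else:
            steps.append(("", b[j - 1]))
            j -= 1
    steps.reverse()
    return D[n][m], steps


def correspondences(a, b, sub_cost=unit_cost):
    """Group alignment steps into partial character string correspondences."""
    _, steps = align(a, b, sub_cost)
    segs = []  # each entry: [left, right, is_common]
    for x, y in steps:
        is_gap = x == "" or y == ""
        if segs and (is_gap or "" in segs[-1][:2]):
            # Attach gaps to the preceding segment so both sides stay non-empty.
            segs[-1][0] += x
            segs[-1][1] += y
            segs[-1][2] = False
        elif segs and x == y and segs[-1][2]:
            # Join consecutive common characters into one string.
            segs[-1][0] += x
            segs[-1][1] += y
        else:
            segs.append([x, y, x == y])
    return [(left, right) for left, right, _ in segs]


print(correspondences("書き込み", "書込み"))   # [('書き', '書'), ('込み', '込み')]
print(correspondences("書込み", "かき込み"))   # [('書', 'かき'), ('込み', '込み')]
```

A distance that depends on character type, as mentioned in the text, could be supplied through the `sub_cost` parameter.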

Next, the notation fluctuation character string pair generation program 116 examines each of the partial character string correspondences that the partial character string correspondence generation program generated from the different notation word pairs, and outputs those that satisfy a specified condition as notation fluctuation character string pairs (S203). For example, as shown in FIG. 6, the generated pairs include "お‐御" (hiragana versus kanji), "取‐取り" (difference in okurigana), "タ‐ター" (presence or absence of a long-vowel mark), "バ‐ヴァ" (variation in katakana notation), "２‐二" (Arabic numeral versus kanji numeral), and "竜‐龍" (variation in kanji notation).

The specified condition can be, for example, that when exactly one partial character string correspondence of a word pair differs and all the others are identical, the differing correspondence is taken as a notation fluctuation character string pair. Under this condition, the correspondences "書‐かき" and "込み‐込み" obtained from the different notation word pair "書込み‐かき込み" satisfy the condition, so "書‐かき" is generated as a notation fluctuation character string pair, whereas the correspondences "書‐か" and "込み‐き込み" do not satisfy it, so no pair is generated from them. In addition, notation fluctuation character string pairs generated from fewer different notation word pairs than a specified number are regarded as likely noise and are excluded. Any condition may be used as the specified condition as long as it reduces the number of noisy notation fluctuation character string pairs.
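The condition above, together with the minimum-support filter, can be sketched as follows. This is an illustrative sketch only: the function name `fluctuation_pairs`, the `min_count` parameter, and the sample data (including the hypothetical word pair 書留‐かき留) are assumptions, and the partial character string correspondences are written out by hand rather than computed.

```python
from collections import Counter


def fluctuation_pairs(correspondence_lists, min_count=2):
    """Collect notation fluctuation pairs from per-word-pair correspondences.

    A correspondence list contributes a pair only when exactly one of its
    correspondences differs; pairs supported by fewer than `min_count`
    lists are dropped as likely noise.
    """
    counts = Counter()
    for correspondence in correspondence_lists:
        differing = [(l, r) for l, r in correspondence if l != r]
        if len(differing) == 1:
            counts[differing[0]] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


sample = [
    [("書", "かき"), ("込み", "込み")],   # satisfies the condition
    [("書", "か"), ("込み", "き込み")],   # two differing parts: rejected
    [("書", "かき"), ("留", "留")],       # hypothetical second supporting pair
    [("取", "取り"), ("引", "引")],       # only one supporting pair
]
print(fluctuation_pairs(sample))               # {('書', 'かき'): 2}
print(fluctuation_pairs(sample, min_count=1))  # also keeps ('取', '取り')
```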

The notation fluctuation character string pair generation program 116 stores the generated notation fluctuation character string pairs in the notation fluctuation character string pair data storage area 113 (S204). Information that characterizes each pair may be stored with it, such as the character strings that appeared before and after the pair in the different notation words used to generate it, and the frequency of the different notation word pairs used to generate it.
Steps S201 to S204 can be omitted if the notation fluctuation character string pair data has already been generated.

Next, a word for which different notation words are to be generated (hereinafter, the word of interest) is received from the input device 104, such as a mouse or keyboard (S205). If a character string in the word of interest is contained in the notation fluctuation character string pair data 113, that character string is replaced with its counterpart in the notation fluctuation character string pair to generate a different notation word (S206). To prevent the generation of character strings that are not words, results that are not contained in the word data 114, which is generated in advance by morphological analysis of arbitrary text, can be excluded from the different notation words. When several notation fluctuation character string pairs apply to one word of interest, several different notation words are generated from it. For example, if the notation fluctuation character string pairs "書‐かき" and "込み‐こみ" are available and the word of interest "かき込み" is input, "かき" and "込み" are contained in the word of interest, so the different notation words "書き込み", "かきこみ", and "書きこみ" are generated.

When the word of interest "書き込み" is input, the "書" of the notation fluctuation character string pair "書‐かき" is contained in it, so the incorrect different notation word "かきき込み" is generated. In applications such as document retrieval, however, an incorrect different notation word causes no problem as long as it is a nonexistent word or a word that does not affect the search results. In applications where incorrect different notation words do cause problems, checking whether each generated word is contained in the word data 114 shows that "かきき込み" is not a word, so incorrect different notation words can be prevented.
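The substitution and the optional word-data check can be sketched as follows. This is a minimal sketch under stated assumptions: the name `generate_variants` and the `vocabulary` parameter (standing in for the word data 114) are hypothetical, each pair is treated as applicable in both directions, and each span of the input is replaced at most once so that a pair such as "取‐取り" cannot be applied repeatedly to its own output.

```python
from functools import lru_cache


def generate_variants(word, pairs, vocabulary=None):
    """Return the set of strings obtained by substituting pair members in `word`."""
    # Treat each pair as bidirectional: either side may be replaced by the other.
    table = {}
    for left, right in pairs:
        table.setdefault(left, set()).add(right)
        table.setdefault(right, set()).add(left)

    @lru_cache(maxsize=None)
    def expand(pos):
        # All spellings of word[pos:], replacing each original span at most once.
        if pos == len(word):
            return frozenset({""})
        results = {word[pos] + rest for rest in expand(pos + 1)}
        for source, targets in table.items():
            if source and word.startswith(source, pos):
                for rest in expand(pos + len(source)):
                    results.update(target + rest for target in targets)
        return frozenset(results)

    variants = set(expand(0)) - {word}
    if vocabulary is not None:
        variants &= set(vocabulary)  # keep only strings known to be words
    return variants


pairs = [("書", "かき"), ("込み", "こみ")]
# Without a vocabulary, the non-word "かきき込み" is among the results.
print(generate_variants("書き込み", pairs))
# With a (hypothetical) vocabulary, only registered words remain.
print(generate_variants("書き込み", pairs, vocabulary={"書きこみ", "かき込み"}))
```

Passing a vocabulary corresponds to the check against the word data 114; omitting it corresponds to the document retrieval case, where spurious strings are tolerated.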

The generated different notation words are displayed on the display device 103, such as a PC monitor (S207). They can also be registered as a different notation word dictionary in a storage device such as a hard disk. If there are further words of interest, steps S205 to S207 are repeated, and processing ends when no words of interest remain.

According to this embodiment, notation fluctuation character string pairs are generated from different notation word pairs, so pairs containing kanji and hiragana, not only katakana, can be generated, and different notation words can be generated that cover fluctuations in katakana and kanji notation and differences in the okurigana and readings of kanji. Moreover, even when the word of interest is an unknown word or a word that was not used to generate the notation fluctuation character string pairs, its different notation words can be generated using notation fluctuation character string pair data generated from existing different notation word pairs.

Next, a second embodiment of the present invention, a method for determining whether a given word pair is a different notation word pair, is described with reference to the drawings. FIG. 7 is a flowchart showing the processing flow of the second embodiment. As in the first embodiment, notation fluctuation character string pairs are generated from different notation word pairs and stored in a storage device such as a memory or a hard disk (S701 to S704). Next, a word pair is received from an input device such as a mouse or keyboard (S705), and partial character string correspondences are generated from the word pair in the same way as in the first embodiment (S706). The word pair may be supplied by any method, for example by retrieving word pairs stored in a storage device such as a hard disk in a specified order.

Next, for each generated partial character string correspondence whose two character strings differ, whether it is a notation fluctuation is determined by collating it with the generated notation fluctuation character string pairs. If all such differing correspondences are notation fluctuations, the input word pair is determined to be a different notation word pair (S707), and the determination result is output to a storage device such as a hard disk or memory (S708). For example, when the word pair "読み込み‐読込み" is input, the partial character string correspondences "読み‐読" and "込み‐込み" are generated, and if "読み‐読" is registered in the notation fluctuation character string pair data, the word pair "読み込み‐読込み" is determined to be a different notation word pair.
According to this embodiment, when a word pair that may be a different notation word pair is given, whether it is one can be determined efficiently.
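The determination step of the second embodiment can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the function name `is_different_notation_pair` is hypothetical, the partial character string correspondences are supplied by hand, pairs are matched in either direction, and treating two identical words as not being a different notation pair is an assumption.

```python
def is_different_notation_pair(correspondence, fluctuation_pairs):
    """Return True if every differing correspondence is a known fluctuation pair."""
    # Accept each registered pair in both directions.
    known = set(fluctuation_pairs) | {(b, a) for a, b in fluctuation_pairs}
    differing = [(l, r) for l, r in correspondence if l != r]
    # Assumption: identical words (no differing correspondence) are not reported.
    return bool(differing) and all(pair in known for pair in differing)


registered = {("読み", "読")}
# "読み込み‐読込み": the only differing part "読み‐読" is registered.
print(is_different_notation_pair([("読み", "読"), ("込み", "込み")], registered))  # True
# "書込み‐かき込み" split as "書‐か" / "込み‐き込み": nothing is registered.
print(is_different_notation_pair([("書", "か"), ("込み", "き込み")], registered))  # False
```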

This method, which generates notation fluctuation character string pair data from different notation word data and can also generate different notation words containing kanji and hiragana, is applicable to document processing apparatuses in general, such as document retrieval systems and text mining systems, in which different notation words are a problem.

FIG. 1 is a configuration diagram of the different notation word generation device according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing the procedure for generating different notation words according to the first embodiment of the present invention.
FIG. 3 is a diagram showing an example of different notation word data.
FIG. 4 is a diagram showing an example of partial character string correspondences.
FIG. 5 is a diagram showing an example of partial character string correspondences.
FIG. 6 is a diagram showing an example of notation fluctuation character string pair data.
FIG. 7 is a flowchart showing the procedure for determining whether a word pair is a different notation word pair according to the second embodiment of the present invention.

Explanation of symbols

101: CPU, 102: main memory, 103: display device, 104: input device, 110: storage device, 111: OS, 112: different notation word data, 113: notation fluctuation character string pair data, 114: word data, 115: partial character string correspondence generation program, 116: notation fluctuation character string pair generation program, 117: different notation word generation program, 118: different notation word display program.

Claims

A program for causing a computer to execute a method for generating notation fluctuation character string pair data, the method comprising the steps of:
reading a word pair, which is a combination of a plurality of words, from a storage means;
generating partial character string correspondence data from the word pair;
generating notation fluctuation character string pair data from the partial character string correspondence data; and
writing the notation fluctuation character string pair data to a storage means.

A program for causing a computer to execute a method for generating different notation words, the method comprising the steps of:
receiving input of a word of interest from an input means; and
generating different notation words for the word of interest using the notation fluctuation character string pair data.

A notation fluctuation character string pair data generation device comprising a storage device that stores a plurality of word pairs, and an arithmetic unit,
wherein the arithmetic unit reads a word pair from the storage device, generates partial character string correspondence data from the word pair, and generates notation fluctuation character string pair data from the partial character string correspondence data.

A different notation word generation device comprising a storage device that stores notation fluctuation character string pair data generated from a plurality of word pairs, an arithmetic unit, a display device, and an input device,
wherein the arithmetic unit generates different notation words for a word input from the input device using the notation fluctuation character string pair data.

A program for causing a computer to execute a method for determining different notation words, the method comprising the steps of:
receiving input of a word pair from an input means;
reading the notation fluctuation character string pair data from a storage means;
generating a partial character string correspondence from the word pair; and
determining whether the partial character string correspondence is a notation fluctuation by collating it with the notation fluctuation character string pair data.

A different notation word generation device comprising a storage device that stores notation fluctuation character string pair data generated from a plurality of word pairs, an arithmetic unit, and an input device,
wherein the arithmetic unit generates partial character string correspondences from a word pair input from the input device, collates those partial character string correspondences whose character strings differ with the notation fluctuation character string pair data, and determines whether the word pair is a pair of different notation words.

The program according to claim 1, wherein extracting the partial character string correspondence includes a step of associating characters between the character strings, and the step associates the characters between the character strings using different weights according to character characteristics.