JP5094486B2

JP5094486B2 - Synonymity determination device, method, program, and recording medium

Info

Publication number: JP5094486B2
Application number: JP2008065256A
Authority: JP
Inventors: いづみ高橋; 久子浅野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-03-14
Filing date: 2008-03-14
Publication date: 2012-12-12
Anticipated expiration: 2028-03-14
Also published as: JP2009223463A

Description

The present invention relates to a technique for determining the synonymity of character string expressions contained in text. More specifically, it relates to a technique that extracts synonym candidate expressions (character string expressions that are candidates for synonyms) from text, generates synonym candidate pairs each consisting of two synonym candidate expressions, and determines whether the two expressions in each pair are synonymous, that is, whether they refer to the same information.

Synonymity determination is used, among other things, for detecting spelling errors in text editors. The synonyms it yields can also be aggregated into a synonym dictionary, which can then be built into a search device and used for query expansion.

"PlayStation", "プレイステーション", and "プレーステーション", which appear in the specification and drawings, are registered trademarks, and "アップル社" (Apple Inc.), "ペヨンジュン" (Bae Yong-joon), "木村拓哉" (Takuya Kimura), "安藤美姫" (Miki Ando), "安めぐみ" (Megumi Yasu), and "ハリーポッター" (Harry Potter) are the names of well-known companies, people, characters, and the like. Because this application concerns language processing, inserting the characters "(registered trademark)" or altering the notation would change the meaning, so the names are written as they are.

When a language is used to express a given thing or event, many different expressions can be chosen, so the same information appears in text under several different expressions (character string expressions). The more text there is and the more people write it, the more variants arise for a single piece of information. Collecting all instances of the same information in text without omission therefore requires a synonymity determination method that decides whether two character string expressions refer to the same information. The granularity of "character string expressions that refer to the same information" can range from whole documents down to single words; the synonymity determination in the present invention operates at the level of nouns and compound nouns.

Conventional methods for synonymity determination at the noun and compound-noun granularity fall into two broad groups: identification methods and generation methods. An identification method extracts synonym candidate expressions from arbitrary text, generates synonym candidate pairs, and determines whether each pair is synonymous (see, for example, Non-Patent Document 1). A generation method generates every expression that could be a synonym candidate for a given character string expression, and in some cases then confirms that the generated expressions actually exist by using the Web or a similar resource (see, for example, Non-Patent Document 2).

Both kinds of method first create synonym candidate pairs and then narrow them down to the correct ones (the synonymity determination). The coverage of obtainable synonyms depends on the pair generation method, while the accuracy depends on both pair generation and narrowing. Generating more diverse candidate pairs at the outset (differing character types, abbreviations, notational variants, and so on) widens the coverage but lowers the accuracy. Raising the accuracy requires either restricting the types of synonym at pair generation time so that only more plausible candidate pairs are collected, or adopting a narrowing (synonymity determination) method that remains accurate even for diverse synonyms.

Identification methods collect synonym candidates in one of two ways. One uses meta-information such as table structure, tags and symbols, or special expressions such as "○○こと××" (roughly, "○○, also known as ××") (Non-Patent Document 3). The other exploits notational similarity, restricts itself to a specific type such as abbreviations or katakana spelling variants, and collects candidates by pattern matching on the features of that type (Non-Patent Document 1). Both secure a certain level of accuracy by imposing constraints, on meta-information or on the type of notation, when pairs are generated.

A generation method generates every expression that could be a synonym candidate for a given text expression, and in some cases then confirms that the generated expressions actually exist by using the Web or a similar resource (for example, Non-Patent Document 2). Because generation methods produce synonym candidate pairs with heuristic rules, probabilistic models, or the like, the candidates they can generate are limited to specific types such as abbreviations and katakana spelling variants.

Non-Patent Document 1: Hiroyuki Sakai, Shigeru Masuyama, "Automatic Acquisition of Correspondences between Nouns and Abbreviations from a Corpus," Proceedings of the 9th Annual Meeting of the Association for Natural Language Processing, 2003, pp. 226-229
Non-Patent Document 2: Norifumi Murayama, Manabu Okuyama, "Automatic Abbreviation Estimation Using a Noisy-Channel Model," Proceedings of the 12th Annual Meeting of the Association for Natural Language Processing, 2006, pp. 763-766
Non-Patent Document 3: Tsunehito Seki, Kazutaka Shimada, Tsutomu Endo, "Synonym Extraction Using Table Structure," Proceedings of the 11th Annual Meeting of the Association for Natural Language Processing, 2005, C1-6

Among the conventional identification methods, those that collect synonym candidate pairs using meta-information cannot use anything other than notations written in a special descriptive form as candidate pairs. The others are specialized for particular types (abbreviations, katakana spelling variants), so what they can acquire is confined to those types. In both cases the coverage is narrow.

In generation methods, as noted above, synonym candidate pairs are generated either by heuristic rules or by a probabilistic model. The former is costly and has narrow coverage; the latter has extremely low accuracy.

Synonyms are a mixture of cases: some, such as notational variants, are more likely to be synonymous the more similar their notations are; for others, such as abbreviations, notational similarity alone cannot measure synonymity; and still others have both properties. The narrow coverage of conventional methods stems from the fact that each method has to be specialized for a particular type of synonym.

However, when attention is paid to how synonyms arise, their many types can be reduced to three main causes: (a) addition of a fixed-form character string, (b) notation conversion that preserves the reading, and (c) abbreviation. The diversity of synonyms grows because these three can occur individually or together, giving seven patterns: a, b, c, a+b, a+c, b+c, and a+b+c.

The character strings added in (a) are fixed-form expressions, such as the affixes "ちゃん" and "ティ" or particular symbols. Deleting the added fixed-form string therefore restores the notation from which the synonym was derived. In (b), the notation is converted while the reading is preserved, as when "PlayStation" becomes "プレイステーション" or "プレーステーション", so assigning readings to both synonyms yields identical or very similar readings. In (c), characters are deleted while the character order is preserved, as when "国際連合" (United Nations) becomes "国連"; the longer form therefore contains the shorter one, and there is a degree of regularity in which characters are deleted.
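Process (c) deletes characters while keeping the remaining ones in order, so the short form is a character subsequence of the long form. The following is a minimal sketch of that check, assuming the containment described here can be read as a subsequence relation; the function name is hypothetical and does not come from the patent.

```python
def is_order_preserving_subsequence(short: str, long: str) -> bool:
    """Return True if every character of `short` occurs in `long` in the same order."""
    remaining = iter(long)
    # `ch in remaining` consumes the iterator up to the match, which enforces order.
    return all(ch in remaining for ch in short)


print(is_order_preserving_subsequence("国連", "国際連合"))  # True
print(is_order_preserving_subsequence("連国", "国際連合"))  # False: order is not preserved
```

A check like this only establishes that an abbreviation is possible; deciding whether it actually is one needs further evidence.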

Accordingly, the notation is first normalized by deleting the fixed-form strings added in (a), and the notation converted in (b) is normalized by assigning it a reading. If the resulting notations or readings are identical or very similar, the pair is judged synonymous. Only after the variation introduced by (a) and (b) has been absorbed in this way is it determined whether the abbreviation of (c) has occurred. Performing the determination in this order makes it possible to judge all of the diverse kinds of synonym and so to widen the coverage.

The present invention has been made in view of the above problems. When collecting synonym candidates, it generates synonym candidate pairs from all combinations of the nouns within one text (one coherent passage containing at least one sentence), regardless of the type of synonym. To extend the range over which the resulting diverse candidate pairs can be judged to nearly all types, each candidate is normalized with respect to both its notation and its reading.

Because candidate pairs are generated from all combinations of nouns, they can be generated independently of how the text is written. In addition, restricting generation to a single text prevents the generation of unrelated candidate pairs whose members have similar notations but occur in different texts.

At determination time, each generated candidate pair is normalized with respect to both notation and reading, and conditions are tested to establish which type of synonym the pair is. Separating the pairs by type before judging them allows a synonymity determination method suited to each type to be applied, which improves accuracy.

Furthermore, because the synonymity determination method of the present invention can handle nearly all types of synonym (abbreviations, katakana spelling variants, and so on), it can determine synonymity whatever method was used to collect the synonym candidates, and it can be combined with existing candidate collection methods.

The present invention is characterized in that, when text is input, it generates synonym candidate pairs from the nouns and compound nouns contained in the text and normalizes the notation and reading of each word in a pair; pairs whose notations become identical or whose readings become very similar in the process are judged synonymous; and pairs not judged synonymous but standing in an inclusion relation are judged by a classifier as to whether they are synonyms.

According to the present invention, the synonymity of diverse character string expressions involving (a) addition of a fixed-form character string, (b) notation conversion that preserves the reading, or (c) abbreviation can be determined with high accuracy.

The present invention is described in detail below with reference to the illustrated embodiment.

FIG. 1 shows an example of an embodiment of the synonymity determination device of the present invention. The device comprises a synonym candidate pair generation unit 1, a synonymity determination unit 2, an inverse conversion rule storage unit 3, a syllable normalization rule storage unit 4, a syllable similarity table storage unit 5, an omission determination model storage unit 6, an analysis processing result table storage unit 7, and a synonym candidate pair list storage unit 8.

The synonym candidate pair generation unit 1 receives text that is typed directly on a keyboard or similar device (not shown), read from a storage medium, or sent from another device over a communication medium. Taking one text (text data corresponding to one coherent passage containing at least one sentence) as the unit of processing, it performs well-known analysis such as morphological analysis and named entity extraction, extracts synonym candidate expressions from the text based on the analysis results, and stores each synonym candidate expression together with its analysis result in the analysis processing result table storage unit 7. It then normalizes the notation of each synonym candidate expression using the inverse conversion rules stored in the inverse conversion rule storage unit 3, normalizes the reading using the syllable normalization rules stored in the syllable normalization rule storage unit 4, and stores the results in the analysis processing result table storage unit 7. Finally, it combines the synonym candidate expressions stored in the analysis processing result table storage unit 7 in all possible pairs to create synonym candidate pairs and stores them in the synonym candidate pair list storage unit 8.

The synonymity determination unit 2 takes a synonym candidate pair from the synonym candidate pair list storage unit 8 and determines, as described below, whether the two synonym candidate expressions in the pair are synonymous. For this it uses the analysis results and normalization results stored in the analysis processing result table storage unit 7 for each expression in the pair, the syllable similarity table stored in the syllable similarity table storage unit 5, and the omission determination model stored in the omission determination model storage unit 6. If the expressions are synonymous, the pair is output as a synonym pair. Repeating this for every synonym candidate pair stored in the synonym candidate pair list storage unit 8 yields the synonym pair list that is output.

Synonymity is determined in three stages. First, the pair is judged by whether the normalized notations are exactly the same. If that does not establish synonymity, the pair is judged by whether the normalized readings are similar: the similarity of the normalized readings is computed using the syllable similarity table stored in the syllable similarity table storage unit 5, and the pair is judged synonymous if the similarity is at or above a predetermined value. If synonymity is still not established and the notations or readings stand in an inclusion relation, the omission determination model stored in the omission determination model storage unit 6 is used to judge whether the two expressions are in an abbreviation relation.

The inverse conversion rule storage unit 3 stores inverse conversion rules for normalizing the notation of a synonym candidate expression and, along with it, the reading. These include at least rules that delete affixes, such as prefixes and suffixes, that are commonly attached when a noun is used, and rules that delete the repeated expressions used in forming nicknames.

The syllable normalization rule storage unit 4 stores syllable normalization rules that are applied to vowel sequences, long vowels, and geminate consonants in the reading of a synonym candidate expression. They normalize at least the difference in the unit of reading length between native Japanese words and loanwords (mora versus syllable), colloquial forms, and variation introduced in transliteration.

The syllable similarity table storage unit 5 stores a syllable similarity table used to compute the similarity between the readings of the two synonym candidate expressions in a pair. Its keys are "syllable pairs that are written differently but read similarly", and its values are the "distance (similarity)".
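One way to picture such a table is as a mapping from unordered syllable pairs to a number. The sketch below is illustrative only: the entries and their values are hypothetical, the stored value is treated as a distance where smaller means more similar, and the rule that identical syllables have distance 0.0 is an added assumption rather than something stated in this section.

```python
# Hypothetical entries; the actual table contents are not given in this section.
SYLLABLE_DISTANCE = {
    frozenset(("ヴァ", "バ")): 0.1,
    frozenset(("ティ", "チ")): 0.2,
}


def syllable_distance(a: str, b: str):
    """Look up the distance for a syllable pair, regardless of argument order."""
    if a == b:
        return 0.0  # assumption: identical syllables are at distance zero
    # Returns None when the pair is not listed in the table.
    return SYLLABLE_DISTANCE.get(frozenset((a, b)))
```

Using `frozenset` keys makes the lookup symmetric, so `("バ", "ヴァ")` and `("ヴァ", "バ")` resolve to the same entry.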

The omission determination model storage unit 6 stores an omission determination model for judging whether the two synonym candidate expressions in a pair are in an abbreviation relation. The model is generated in advance by machine learning and decides whether two words are in an abbreviation relation.

Each of the units described above is explained in more detail below. Note that in the following description the contents of storage units 3 to 8 are sometimes referred to by the reference numeral of the storage unit itself.

[Synonym candidate pair generation unit 1]
FIG. 2 shows the synonym candidate pair generation unit 1 in detail. It comprises an analysis processing unit 11, a normalization processing unit 12, and a pair generation unit 13. Taking one text as input, the analysis processing unit 11 performs text analysis such as morphological analysis and named entity extraction, and synonym candidate expressions are cut out based on the results. After the normalization processing unit 12 normalizes the synonym candidate expressions, the pair generation unit 13 generates synonym candidate pairs. This embodiment is described taking as an example the case where named entities are the synonym candidate expressions.

The analysis processing unit 11 analyzes the text with established, well-known techniques such as morphological analysis and named entity extraction, and extracts synonym candidate expressions. Morphological analysis annotates the text with information such as morphemes (notation), readings, and parts of speech (including named entity classes). Simple normalization, such as unifying half-width and full-width characters in the text, is also done at this point. For the reading, either the single most likely reading or the N-best readings may be used. Where the notation is an unknown word, such as one written in the alphabet, and morphological analysis alone does not assign the correct reading, the reading is assigned again. This is done with a method that correctly estimates the readings of unknown words such as alphabetic ones (see, for example, Japanese Patent Application Laid-Open No. 2001-142877, titled "Alphabetic character / Japanese reading correspondence device and method, alphabetic word transliteration device and method, and recording medium recording the processing program").

Synonym candidate expressions are cut out by the named entity extraction technique and written, together with their analysis results (notation, reading, part of speech, and so on), to the synonym candidate expression column and the analysis result column of the analysis processing result table 7. However, if a synonym candidate expression with exactly the same notation already exists in table 7, it is not written, so that records are not duplicated. Morpheme boundary information is retained in each of the notation, reading, and part of speech, using for example the symbol "/". FIG. 3 shows an example of the analysis processing result table 7. At this point only the synonym candidate expression column and the analysis result column of the table are filled; the other columns are empty.
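The write-with-deduplication step can be sketched as follows. This is a rough illustration, not the layout of FIG. 3: the record fields, the sequential IDs starting at 1, and the part-of-speech strings are all assumptions made for the example.

```python
def add_candidate(table: list, notation: str, reading: str, pos: str) -> int:
    """Append a record unless the same notation is already present; return its candidate ID.

    `notation`, `reading`, and `pos` keep morpheme boundaries as "/" separators.
    """
    surface = notation.replace("/", "")  # candidate expression without boundary marks
    for record in table:
        if record["candidate"] == surface:
            return record["id"]  # identical notation already stored: do not duplicate
    record = {
        "id": len(table) + 1,  # hypothetical ID scheme
        "candidate": surface,
        "analysis": {"notation": notation, "reading": reading, "pos": pos},
    }
    table.append(record)
    return record["id"]
```

Calling `add_candidate` twice with "木村/拓哉" leaves a single record in the table and returns the same ID both times.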

Although named entities are the cut-out target here for the sake of explanation, nouns and compound nouns obtained from the morphological analysis results may also be used as synonym candidate expressions. The resulting analysis processing result table 7 contains one record for each distinct synonym candidate expression in the text.

The flow of processing in the normalization processing unit 12 is described with reference to FIG. 4. The normalization processing unit 12 normalizes the notation and the reading of each extracted synonym candidate expression. Its input is the list of analysis result columns of all records in the analysis processing result table 7 created by the analysis processing unit 11, and the following processing is repeated for each record in the list. When all records have been processed, the processing in the normalization processing unit 12 ends.

(Step s12-1) If the notation in the analysis result column of the analysis processing result table 7 is alphabetic, its upper-case and lower-case letters are unified to upper case and the result overwrites the same column. Proceed to step s12-2.

(Step s12-2) The notation rules and reading rules of the inverse conversion rules stored in the inverse conversion rule storage unit 3 (described in detail later) are applied to the notation and the reading, respectively, in the analysis result column of the analysis processing result table 7, and the results are written to the table. The result of applying the notation rules is written to the notation normalization column, and the result of applying the reading rules is written to the notation + reading normalization column. (If no rule applies, the notation in the analysis result column is written unchanged to the notation normalization column, and the reading in the analysis result column is written unchanged to the notation + reading normalization column.) Proceed to step s12-3.

(Step s12-3) The syllable normalization rules stored in the syllable normalization rule storage unit 4 (described in detail later) are applied both to the reading in the analysis result column of the analysis processing result table 7 and to the reading in the notation + reading normalization column written in step s12-2, and the results are written to the table. The normalized form of the analysis result reading is written to the reading normalization column, and the normalized form of the notation + reading normalized reading overwrites the notation + reading normalization column. (If no rule applies, the reading in the analysis result column is written unchanged to the reading normalization column, and the notation + reading normalization column is left as it is, without overwriting.)
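Steps s12-1 to s12-3 can be sketched as a small pipeline over one record. The rule data below is hypothetical: a single suffix rule for "ちゃん" stands in for the inverse conversion rules, and dropping the long-vowel mark and the geminate mark stands in for the syllable normalization rules, whose actual contents are described elsewhere in the specification. Morpheme boundary marks are omitted for brevity.

```python
from dataclasses import dataclass

# Hypothetical rule data, for illustration only.
AFFIX_RULES = [("ちゃん", "チャン")]  # (notation suffix, reading suffix) to delete
DROPPED_MARKS = "ーッ"                # assumed syllable rule: drop long-vowel and geminate marks


@dataclass
class Record:
    notation: str                    # analysis result column: notation
    reading: str                     # analysis result column: reading
    notation_norm: str = ""          # notation normalization column
    reading_norm: str = ""           # reading normalization column
    notation_reading_norm: str = ""  # notation + reading normalization column


def drop_suffix(text: str, suffix: str) -> str:
    """Remove `suffix` from the end of `text`, never emptying the string."""
    if text.endswith(suffix) and len(text) > len(suffix):
        return text[: -len(suffix)]
    return text


def normalize_syllables(reading: str) -> str:
    return "".join(ch for ch in reading if ch not in DROPPED_MARKS)


def normalize(rec: Record) -> Record:
    # Step s12-1: unify alphabetic notation to upper case, overwriting the column.
    if rec.notation.isascii() and rec.notation.isalpha():
        rec.notation = rec.notation.upper()
    # Step s12-2: apply the inverse conversion rules; if no rule matches,
    # the analysis-result values are carried over unchanged.
    rec.notation_norm = rec.notation
    rec.notation_reading_norm = rec.reading
    for notation_suffix, reading_suffix in AFFIX_RULES:
        if rec.notation_norm.endswith(notation_suffix):
            rec.notation_norm = drop_suffix(rec.notation_norm, notation_suffix)
            rec.notation_reading_norm = drop_suffix(rec.notation_reading_norm, reading_suffix)
    # Step s12-3: apply the syllable normalization rules to both readings.
    rec.reading_norm = normalize_syllables(rec.reading)
    rec.notation_reading_norm = normalize_syllables(rec.notation_reading_norm)
    return rec
```

With these toy rules, `normalize(Record("めぐみちゃん", "メグミチャン"))` ends with "めぐみ" in the notation normalization column and "メグミ" in the notation + reading normalization column, while the reading normalization column keeps "メグミチャン".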

The pair generation unit 13 combines all records of the analysis processing result table 7 that have been processed by the normalization processing unit 12 in all possible pairs to create synonym candidate pairs, and stores the IDs (candidate IDs) of the synonym candidate expressions in each pair in the synonym candidate pair list storage unit 8 as the synonym candidate pair list. FIG. 5 shows an example of the synonym candidate pair list. If coverage is not a priority, a pair creation method other than all-pairs combination, for example one that uses meta-information, may be used, and the remainder of the present invention can still be applied.
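All-pairs generation over candidate IDs maps directly onto `itertools.combinations`. The sketch below assumes the pair list is simply a list of ID tuples; the function name is hypothetical.

```python
from itertools import combinations


def generate_candidate_pairs(candidate_ids):
    """Return every unordered pair of distinct candidate IDs."""
    return list(combinations(candidate_ids, 2))


pairs = generate_candidate_pairs([1, 2, 3, 4])
print(len(pairs))  # 6, that is n * (n - 1) / 2 for n = 4
```

The number of pairs grows quadratically with the number of distinct candidates, which is one practical reason for limiting generation to a single text.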

[Synonymity determination unit 2]
FIG. 6 shows the synonymity determination unit 2 in detail. It comprises a notation similarity determination unit 21, a reading similarity determination unit 22, and an omission determination unit 23. FIG. 7 shows the flow of processing in the synonymity determination unit 2.

The synonymity determination unit 2 takes the analysis processing result table 7 and the synonym candidate pair list 8 as input, uses the notation similarity determination unit 21, the reading similarity determination unit 22, and the omission determination unit 23 to determine whether the synonym candidate expressions in each pair are synonymous, and outputs a synonym pair list.

Specifically, the notation similarity determination unit 21 judges synonymity from the normalized notations (step s21), the reading similarity determination unit 22 judges synonymity from the normalized readings using the syllable similarity table 5 (step s22), and the omission determination unit 23 judges synonymity by using the omission determination model 6 to decide, from both the normalized notations and the normalized readings, whether the pair is an abbreviation (step s23). Steps s21 to s23 are repeated for each record of the synonym candidate pair list 8. As soon as a pair is judged synonymous at any stage, its synonym candidate expressions are recognized as synonymous (step s24) and processing moves on to the next pair. A pair that is not judged synonymous by the end is not recognized as a synonym pair (step s25). The notation, reading, part of speech, and other information needed for each synonym candidate expression in a pair is obtained by looking up the corresponding entries in the analysis processing result table 7 using the candidate IDs in the synonym candidate pair list 8. When all records of the synonym candidate pair list 8 have been processed, the pairs recognized as synonyms are output as the synonym pair list.
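The control flow of steps s21 to s25 can be sketched as a loop with early exit. In this sketch the three judgments are passed in as plain functions, since their internals are described separately; the function and parameter names, and the use of a dictionary keyed by candidate ID for the table, are assumptions made for illustration.

```python
def judge_pairs(pair_list, table, same_notation, similar_reading, is_abbreviation):
    """Run steps s21 to s23 on each pair and collect the pairs judged synonymous."""
    synonyms = []
    for id_a, id_b in pair_list:
        a, b = table[id_a], table[id_b]  # look up both records by candidate ID
        checks = (same_notation, similar_reading, is_abbreviation)  # s21, s22, s23
        # any() stops at the first check that succeeds, so later stages are skipped (s24).
        if any(check(a, b) for check in checks):
            synonyms.append((id_a, id_b))
        # Otherwise the pair is not recognized as a synonym pair (s25).
    return synonyms
```

Ordering the checks from cheapest and most certain (exact notation match) to the learned model keeps the classifier for the cases the earlier stages cannot settle.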

表記類似判定部２１での処理の流れを図８を用いて説明する。入力は同義語候補ペアリスト８の１レコードとする。ここでは同義語候補ペア中の同義語候補表現の各々の正規化後の表記を見て判定を行う。 The flow of processing in the notation similarity determination unit 21 will be described with reference to FIG. The input is one record of the synonym candidate pair list 8. Here, the determination is made by looking at the normalized notation of each synonym candidate expression in the synonym candidate pair.

（ステップｓ２１−１）同義語候補ペア中の各同義語候補表現の表記正規化カラムの表記同士が全く同じである場合はステップｓ２１−２へ、それ以外の場合は表記類似判定部２１での処理を終了し、読み類似判定部２２での処理に進む。 (Step s21-1) If the notation normalization column notation of each synonym candidate expression in the synonym candidate pair is exactly the same, go to step s21-2, otherwise in the notation similarity determination unit 21 The process ends, and the process proceeds to the reading similarity determination unit 22.

（ステップｓ２１−２）同義と判定し、同義性判定部２での処理を終了する。 (Step s21-2) It is determined to be synonymous, and the process in the synonymity determining unit 2 is terminated.

読み類似判定部２２での処理の流れを図９を用いて説明する。表記類似判定部２１で同義と判定されなかった同義語候補ペアのレコードを入力とする。ここでは同義語候補ペア中の同義語候補表現の各々の読みを見て判定を行う。１つの同義語候補表現について読みは、解析結果カラムの読み、読み正規化カラムの読み、表記＋読み正規化カラムの読みの３つが存在するため、それぞれについて以下の処理を３回繰り返し行い、そのいずれかの過程で同義と判定されれば同義語であると認定して同義性判定部２での処理を終了し、そうでない場合は省略判定部２３での処理に進める。この際、解析結果カラムの読み、読み正規化カラムの読み、表記＋読み正規化カラムの読みを単に“読み”と記述する。また、以下の処理において、マッチングの際は形態素の区切り情報（“／”）は無視する。 The flow of processing in the reading similarity determination unit 22 will be described with reference to FIG. A record of a synonym candidate pair that is not determined to be synonymous by the notation similarity determination unit 21 is input. Here, the determination is made by looking at each reading of the synonym candidate expression in the synonym candidate pair. There are three readings for one synonym candidate expression: reading of the analysis result column, reading of the reading normalization column, and reading of the notation + reading normalization column. Therefore, the following processing is repeated three times for each, If the synonym is determined to be synonymous in any process, the synonym is recognized and the process in the synonym determination unit 2 is terminated. If not, the process proceeds to the process in the omission determination unit 23. At this time, the reading of the analysis result column, the reading of the reading normalization column, and the reading of the notation + reading normalization column are simply described as “reading”. In the following processing, the morpheme delimiter information (“/”) is ignored during matching.

（ステップｓ２２−１）同義語候補ペア中の各同義語候補表現の読みが全く同じである場合はステップｓ２２−５へ、それ以外はステップｓ２２−２へ進む。 (Step s22-1) If the readings of the synonym candidate expressions in the synonym candidate pair are exactly the same, the process proceeds to step s22-5; otherwise, it proceeds to step s22-2.

（ステップｓ２２−２）同義語候補ペア中の各同義語候補表現の読みの音節数をカウントし、同じである場合はステップｓ２２−３へ進む。それ以外の場合は処理を終了する（読み類似判定部２２での処理の繰り返し回数が２回以下ならステップｓ２２−１へ戻り、当該同義語候補ペアの次の読みに対する処理へ移る。３回目であれば読み類似判定部２２での処理を終了する。）。 (Step s22-2) The number of syllables of reading of each synonym candidate expression in the synonym candidate pair is counted, and if it is the same, the process proceeds to step s22-3. In other cases, the process ends (if the number of repetitions of the process in the reading similarity determination unit 22 is 2 or less, the process returns to step s22-1 and proceeds to the process for the next reading of the synonym candidate pair. If there is, the process in the reading similarity determination unit 22 ends.)

（ステップｓ２２−３）同義語候補ペア中の各同義語候補表現の読みの、音節位置が同じで読みが異なる音節間の距離を音節類似度テーブル５（詳細は後述する。）を用いて求める。音節位置が同じで読みが異なる音節が多数存在する場合は、ペア間で異なる音節間の距離の総和を用いる。ステップｓ２２−４へ進む。 (Step s22-3) The distance between syllables having the same syllable position but different readings in the reading of each synonym candidate expression in the synonym candidate pair is obtained using the syllable similarity table 5 (details will be described later). . When there are many syllables having the same syllable position but different readings, the sum of distances between syllables different between pairs is used. Proceed to step s22-4.

（ステップｓ２２−４）距離の総和が予め設定した閾値より小さければステップｓ２２−５へ進む。それ以外の場合は処理を終了する（読み類似判定部２２での処理の繰り返し回数が２回以下ならステップｓ２２−１へ戻り、当該同義語候補ペアの次の読みの処理へ移る。３回目であれば読み類似判定部２２での処理を終了する。）。 (Step s22-4) If the sum of the distances is smaller than a preset threshold value, the process proceeds to Step s22-5. In other cases, the process ends (if the number of repetitions of the process in the reading similarity determination unit 22 is 2 or less, the process returns to step s22-1 and proceeds to the process of the next reading of the synonym candidate pair. If there is, the process in the reading similarity determination unit 22 ends.)

（ステップｓ２２−５）同義と判定し、同義性判定部２での処理を終了する。 (Step s22-5) The pair is determined to be synonymous, and the process in the synonymity determination unit 2 ends.
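Steps s22-1 to s22-5 can be sketched as below. Several points are assumptions made for illustration: the syllable splitter (small kana, 「ッ」 and 「ー」 attach to the preceding kana), the table layout keyed by unordered syllable pairs, and the choice to reject a pair when a differing syllable pair is missing from the table, which the text does not specify.

```python
# Kana that are assumed to extend the preceding syllable rather than start a new one.
ATTACHING_KANA = set("ャュョァィゥェォヮッー")


def split_syllables(reading):
    """Split a katakana reading into syllable units, ignoring the morpheme delimiter."""
    syllables = []
    for ch in reading.replace("／", ""):
        if ch in ATTACHING_KANA and syllables:
            syllables[-1] += ch
        else:
            syllables.append(ch)
    return syllables


def readings_match(reading_a, reading_b, distance_table, threshold):
    syl_a, syl_b = split_syllables(reading_a), split_syllables(reading_b)
    if syl_a == syl_b:                     # step s22-1
        return True
    if len(syl_a) != len(syl_b):           # step s22-2
        return False
    total = 0.0
    for x, y in zip(syl_a, syl_b):         # step s22-3
        if x != y:
            distance = distance_table.get(frozenset((x, y)))
            if distance is None:           # assumption: unlisted pairs are rejected
                return False
            total += distance
    return total < threshold               # step s22-4


def reading_similarity_judge(reading_pairs, distance_table, threshold):
    """Try each available pair of readings in turn; any success means synonymous."""
    return any(readings_match(a, b, distance_table, threshold) for a, b in reading_pairs)
```

With a table containing only the pair 「バ」/「ヴァ」 at distance 0.87 and a threshold of 0.9, `readings_match("エバ", "エヴァ", table, 0.9)` accepts the pair, since the single differing position contributes 0.87.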

省略判定部２３での処理の流れを図１０を用いて説明する。読み類似判定部２２で同義と判定されなかった同義語候補ペアのレコードを入力とする。ここでは同義語候補ペアについて、解析結果カラムの表記同士、表記正規化カラムの表記同士、読み正規化カラムの読み同士、表記＋読み正規化カラムの読み同士の４パターンそれぞれについて以下の処理を繰り返し行い、そのいずれかの過程で同義と判定されれば同義語であると認定し、そうでない場合は同義語でないと認定して同義性判定部２での処理を終了する。この際、解析結果カラムの表記、表記正規化カラムの表記を単に“表記”と記述し、読み正規化カラムの読み、表記＋読み正規化カラムの読みを単に“読み”と記述する。また、以下の処理において、マッチングの際は形態素の区切り情報（“／”）は無視する。 The flow of processing in the omission determination unit 23 will be described with reference to FIG. A record of a synonym candidate pair that has not been determined to be synonymous by the reading similarity determination unit 22 is input. Here, for the synonym candidate pairs, the following processing is repeated for each of the four patterns of the analysis result column notations, notation normalization column notations, reading normalization column notations, notation + reading normalization column notations If it is determined to be synonymous in any of the processes, it is recognized as a synonym. Otherwise, it is determined as not a synonym and the process in the synonym determination unit 2 is terminated. At this time, the notation of the analysis result column and the notation normalization column are simply described as “notation”, and the reading normalization column reading and the notation + reading normalization column reading are simply described as “reading”. In the following processing, the morpheme delimiter information (“/”) is ignored during matching.

（ステップｓ２３−１）表記を対象としている場合は表記同士、読みを対象としている場合は読み同士が包含関係にある場合はステップｓ２３−２へ進む。それ以外の場合は処理を終了する（省略判定部２３での処理の繰り返し回数が３回以下なら当該同義語候補ペアの次の表記または読みに対する処理へ移る。４回目であれば省略判定部２３での処理を終了する。）。 (Step s23-1) If the two notations (when notation is the target) or the two readings (when reading is the target) are in an inclusion relationship, the process proceeds to step s23-2. Otherwise, the process ends (if the process in the omission determination unit 23 has been repeated three times or fewer, it moves on to the next notation or reading of the synonym candidate pair; on the fourth iteration, the process in the omission determination unit 23 ends).

（ステップｓ２３−２）ＤＰマッチング法（ＲｉｃｈａｒｄＥ．Ｂｅｌｌｍａｎ，“ＤｙｎａｍｉｃＰｒｏｇｒａｍｍｉｎｇ”，１９５７）等を用いて位置合わせを行う。ステップｓ２３−３へ進む。 (Step s23-2) Alignment is performed using a DP matching method (Richard E. Bellman, “Dynamic Programming”, 1957) or the like. Proceed to step s23-3.

（ステップｓ２３−３）それぞれのペアのうち長い文字数（読みの場合は音節数）の方を省略前、短い方を省略後として、省略前後の差異を元に、分類器にかけるための素性（詳細は後述）を抽出する。ステップｓ２３−４へ進む。 (Step s23-3) Within each pair, the expression with more characters (more syllables, in the case of readings) is treated as the form before omission and the shorter one as the form after omission, and features for the classifier (details are described later) are extracted based on the differences between the two. The process proceeds to step s23-4.

（ステップｓ２３−４）ステップｓ２３−３で抽出した素性と分類器のモデルである省略判定モデル６を用いて同義語候補ペア中の同義語候補表現同士が省略語関係にあるかを判定し、省略語であると判定した場合はステップｓ２３−５へ進む。それ以外の場合、処理を終了する（省略判定部２３での処理の繰り返し回数が３回以下なら当該同義語候補ペアの次の表記または読みの処理へ移る。４回目であれば省略判定部２３での処理を終了する。）。 (Step s23-4) Using the features extracted in step s23-3 and the omission determination model 6, which is the classifier model, it is determined whether the synonym candidate expressions in the synonym candidate pair are in an abbreviation relationship. If they are judged to be, the process proceeds to step s23-5. Otherwise, the process ends (if the process in the omission determination unit 23 has been repeated three times or fewer, it moves on to the next notation or reading of the synonym candidate pair; on the fourth iteration, the process in the omission determination unit 23 ends).

（ステップｓ２３−５）同義と判定し、同義性判定部２での処理を終了する。 (Step s23-5) The pair is determined to be synonymous, and the process in the synonymity determination unit 2 ends.
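The flow of steps s23-1 to s23-5 might be sketched as follows. The text does not define the inclusion relationship precisely, so this sketch assumes it means that the shorter form is an in-order subsequence of the longer one. The alignment uses a plain edit-distance dynamic program as one possible form of DP matching, and the classifier is passed in as a hypothetical callable standing in for the omission determination model 6. For simplicity, lengths are compared by character count even for readings.

```python
def is_included(short_seq, long_seq):
    """True if every item of short_seq appears in long_seq in the same order."""
    remaining = iter(long_seq)
    return all(item in remaining for item in short_seq)


def align(long_seq, short_seq):
    """Align two sequences by edit distance; None marks an omitted position."""
    n, m = len(long_seq), len(short_seq)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if long_seq[i - 1] == short_seq[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        sub = 0 if i > 0 and j > 0 and long_seq[i - 1] == short_seq[j - 1] else 1
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub:
            pairs.append((long_seq[i - 1], short_seq[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((long_seq[i - 1], None))
            i -= 1
        else:
            pairs.append((None, short_seq[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def omission_judge(patterns, is_abbreviation):
    """patterns: up to four (a, b) pairs of notations or readings."""
    for a, b in patterns:
        a, b = a.replace("／", ""), b.replace("／", "")
        long_form, short_form = (a, b) if len(a) >= len(b) else (b, a)
        if not is_included(short_form, long_form):   # step s23-1
            continue
        alignment = align(long_form, short_form)      # step s23-2
        if is_abbreviation(alignment):                # steps s23-3 and s23-4
            return True                               # step s23-5
    return False
```

Passing the classifier as a parameter keeps the control flow separate from the learned model, so the same loop works whether the model is an SVM, a decision tree, or another learner.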

前述した（ステップｓ２３−３）で用いる素性としては、同義語候補ペア中の各同義語候補表現の表記（解析結果カラムの表記または表記正規化カラムの表記）に対して抽出を行う場合、
＜省略前後の同義語候補表現＞
・省略前：形態素数，文字数，品詞，固有表現クラス，文字種
・省略後：形態素数，文字数，品詞，固有表現クラス，文字種
＜形態素単位の素性＞
・形態素が丸ごと省略された場合：品詞，表記，文字数，文字種，位置情報（先頭か末尾か真中か），先頭の形態素が残っているか
・形態素が丸ごと残った場合：品詞，表記，文字数，文字種，位置情報（先頭か末尾か真中か），末尾の形態素を省略したか
＜文字単位の素性＞
・文字単位で省略された場合：品詞，表記，文字種，位置情報（先頭か末尾か真中か），表記内で先頭の文字を省略したか
・文字単位で残った場合：品詞，表記，文字種，位置情報（先頭か末尾か真中か），形態素内で先頭の文字が残っているか
を用いる。しかし、ここに挙げた以外にも、形態素解析情報、文脈情報などを利用しても良い。この時、用いる品詞や表記などの情報は解析処理結果テーブル７の情報を利用する。
When features are extracted from the notation of each synonym candidate expression in the synonym candidate pair (the notation in the analysis result column or in the notation normalization column), the following features are used in step s23-3 described above:
<Synonym candidate expressions before and after omission>
・Before omission: number of morphemes, number of characters, part of speech, named entity class, character type
・After omission: number of morphemes, number of characters, part of speech, named entity class, character type
<Morpheme-level features>
・When a whole morpheme is omitted: part of speech, notation, number of characters, character type, position information (head, tail, or middle), whether the first morpheme remains
・When a whole morpheme remains: part of speech, notation, number of characters, character type, position information (head, tail, or middle), whether the last morpheme was omitted
<Character-level features>
・When individual characters are omitted: part of speech, notation, character type, position information (head, tail, or middle), whether the first character of the notation was omitted
・When individual characters remain: part of speech, notation, character type, position information (head, tail, or middle), whether the first character of the morpheme remains
Information other than the above, such as morphological analysis information and context information, may also be used. The part-of-speech, notation, and similar information used here is taken from the analysis processing result table 7.
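A reduced version of the character-level part of this feature set can be sketched as follows. It covers only lengths, character types, positions, and whether the first character was omitted; morpheme counts, parts of speech, and named entity classes would require the output of a morphological analyzer and are left out. The function names, the feature keys, and the Unicode ranges used to classify character types are assumptions made for illustration.

```python
def char_type(ch):
    """Classify a character by Unicode block (a simplification)."""
    code = ord(ch)
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if (ch.isascii() and ch.isalpha()) or 0xFF21 <= code <= 0xFF3A or 0xFF41 <= code <= 0xFF5A:
        return "alphabet"
    return "other"


def position(index, length):
    if index == 0:
        return "head"
    if index == length - 1:
        return "tail"
    return "middle"


def extract_features(long_form, short_form):
    """Compare the form before omission (long) with the form after omission (short)."""
    kept, j = [], 0
    for ch in long_form:  # greedy left-to-right matching of the short form
        if j < len(short_form) and ch == short_form[j]:
            kept.append(True)
            j += 1
        else:
            kept.append(False)
    n = len(long_form)
    return {
        "before_length": n,
        "after_length": len(short_form),
        "before_types": sorted({char_type(c) for c in long_form}),
        "after_types": sorted({char_type(c) for c in short_form}),
        "omitted": [(c, char_type(c), position(i, n))
                    for i, c in enumerate(long_form) if not kept[i]],
        "remaining": [(c, char_type(c), position(i, n))
                      for i, c in enumerate(long_form) if kept[i]],
        "first_char_omitted": bool(kept) and not kept[0],
    }
```

For 「アップル社」 and 「アップル」, the only omitted character is 「社」, a kanji at the tail, and the first character is retained.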

また、同義語候補ペア中の各同義語候補表現の読み（読み正規化カラムの読みまたは表記＋読み正規化カラムの読み）に対して素性を抽出する際は、上記で述べた素性例において、表記の素性には読みを、文字数の素性には音節数を、位置情報の素性には音節で数えた場合の何音節目かを用いる。また、文字種の素性は「カタカナ」で統一する。 Also, when extracting features for each synonym candidate expression reading (reading normalization column reading or notation + reading normalization column reading) in the synonym candidate pair, in the feature example described above, Reading is used as the feature of the notation, the number of syllables is used as the feature of the number of characters, and the number of syllables when counted by syllable is used as the feature of the position information. Also, the character type features are unified with “Katakana”.

［逆変換ルール（記憶部）３］
図１１は逆変換ルール３の一例を示すもので、名詞を利用する際に一般的に挿入されると思われる接頭辞や接尾辞等の接辞形を削除するルール、愛称を作成する際に利用される繰り返し表現を削除するルール等をヒューリスティックに記述したものである。一組の同義語候補ペアに対して適用可能なルールは全て適用する。本ルールは、同義語候補表現における、より一般的な表記に対して挿入されると思われる定型的な文字列を当該表記から削除し、当該文字列の削除に併せて読みを訂正するルールであるため、省略語を作成するためのルールとは異なる（本発明においては、省略語は逆変換ルールを用いて判定せず、省略判定部において分類器を用いて判定する。）。よってルールは正規表現などを用いて簡単に書き表すことができ、多くの同義語に共通して挿入されるような文字列を削除するルールとする。なお、図１１での正規表現はＰｅｒｌで書くことを例に説明を行っている。よって、他の表現を用いる場合には、その手法に準じる。
[Reverse conversion rule (storage unit) 3]
FIG. 11 shows an example of the reverse conversion rules 3. These heuristically describe rules such as rules for deleting affixes (prefixes, suffixes, and the like) that are commonly attached when a noun is used, and rules for deleting the repeated expressions used when creating nicknames. All rules applicable to a synonym candidate pair are applied. These rules delete, from a synonym candidate expression, a fixed character string that is considered to have been inserted into a more general notation, and correct the reading in line with that deletion. They therefore differ from rules for creating abbreviations (in the present invention, abbreviations are not determined using the reverse conversion rules but by a classifier in the omission determination unit). The rules can thus be written simply using regular expressions or the like, and are limited to deleting character strings that are commonly inserted into many synonyms. The regular expressions in FIG. 11 are written in Perl as an example; when another notation is used, the rules follow the conventions of that notation.

逆変換ルール３は表記に適用するルール、読みに適用するルールが対になっており、表記用正規表現が適用できない場合は、対になった読み用正規表現も適用しない。しかし、その逆に読み用正規表現が適用できない、または存在しない場合に関しては表記用正規表現のみを適用して良い（表記が変化しないのに読みが変化することはあり得ないが、表記が変化しても読みが変化しない場合はあり得るため）。逆変換ルールは図１１で挙げたもの以外にも任意に作成して登録可能で、例えば接尾辞の削除ルールで「ちゃん」や「氏」などを加えることなどが考えられる。 Each reverse conversion rule 3 pairs a rule applied to the notation with a rule applied to the reading. If the regular expression for the notation does not apply, the paired regular expression for the reading is not applied either. Conversely, if the regular expression for the reading does not apply or does not exist, the regular expression for the notation alone may be applied (the reading cannot change when the notation does not change, but the notation can change without the reading changing). Reverse conversion rules other than those shown in FIG. 11 can be freely created and registered; for example, 「ちゃん」 (chan) or 「氏」 (shi) could be added to the suffix deletion rules.
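The pairing behaviour of notation and reading rules can be sketched with Python's `re` module. The two rules below are hypothetical examples chosen to resemble suffix deletion; they are not the contents of FIG. 11, which is written in Perl. The morpheme delimiter is stripped first as a simplification.

```python
import re

# Hypothetical (notation pattern, reading pattern) pairs; a reading pattern may be None.
REVERSE_RULES = [
    (re.compile(r"社$"), re.compile(r"シャ$")),
    (re.compile(r"様$"), re.compile(r"サマ$")),
]


def apply_reverse_rules(notation, reading, rules=REVERSE_RULES):
    """Apply every applicable rule; the reading rule fires only with its notation rule."""
    notation = notation.replace("／", "")
    reading = reading.replace("／", "")
    for notation_re, reading_re in rules:
        new_notation, hits = notation_re.subn("", notation)
        if hits == 0:
            continue  # notation rule did not apply, so the paired reading rule is skipped
        notation = new_notation
        if reading_re is not None:
            reading = reading_re.sub("", reading)
    return notation, reading
```

For example, `apply_reverse_rules("アップル／社", "アップル／シャ")` yields `("アップル", "アップル")`, whereas a word such as 「馬車」 with reading 「バシャ」 is left unchanged, because its notation does not match even though its reading ends in 「シャ」.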

［音節正規化ルール（記憶部）４］
図１２は音節正規化ルール４の一例を示すもので、和語と外来語とで異なる読みの長さの単位（モーラと音節）、口語表現、音訳時のゆれ等を正規化するために、同義語候補表現の読みの母音連続や長音、促音に適用するルール等からなる。ルールの適応順序はルール番号順とし、適用可能なルールは全て適用する。
[Syllable normalization rule (storage unit) 4]
FIG. 12 shows an example of the syllable normalization rules 4. To normalize the differing units of reading length between native Japanese words and loanwords (mora versus syllable), colloquial expressions, variation in transliteration, and the like, these consist of rules applied to consecutive vowels, long vowels, and geminate consonants (sokuon) in the readings of synonym candidate expressions. The rules are applied in rule-number order, and all applicable rules are applied.

表記変換による同義性判定には、位置が同じで読みが異なる音節間の距離を用いる。その際、表記変換による同義語間では音節の長さが等しいことが条件となる。しかし、和語はモーラ、外来語では音節と、読みの単位が異なる。さらに他言語に和語の読みを与える際（音訳時）には、外来語間でも同じ音節数にならないという問題があった。その例として以下のような場合が挙げられる。 Synonymity determination for notation variants uses the distance between syllables that are at the same position but have different readings. This requires that synonyms related by notation variation have the same length in syllables. However, the unit of reading differs: native Japanese words are counted in morae, whereas loanwords are counted in syllables. Furthermore, when words from another language are given Japanese readings (transliteration), even variants of the same loanword may not have the same number of syllables. Examples include the following cases.

・和語、口語「ユウコ」，「ユーコ」
モーラ数で数えると３モーラであるが、音節数で数えると３音節，２音節となる。
・Native Japanese word and its colloquial form: 「ユウコ」 (yuuko), 「ユーコ」 (yūko)
Counted in morae, both have 3 morae, but counted in syllables they have 3 syllables and 2 syllables, respectively.

・外来語「スパゲティ」，「スパゲティー」，「スパゲッティー」
音節数で数えれば３つとも４音節であるが、モーラ数で数えると４モーラ，５モーラ，６モーラとなる。
・Loanword: 「スパゲティ」 (supageti), 「スパゲティー」 (supagetī), 「スパゲッティー」 (supagettī)
Counted in syllables, all three have 4 syllables, but counted in morae they have 4, 5, and 6 morae, respectively.

・他言語の和語読み「ウインブルドン」，「ウィンブルドン」
音節数で数えると７音節，６音節、モーラ数で数えても７モーラ，６モーラとなる。
・Japanese reading of a foreign word: 「ウインブルドン」 (u-i-n-bu-ru-do-n), 「ウィンブルドン」 (wi-n-bu-ru-do-n)
Counted in syllables they have 7 and 6 syllables, and counted in morae they likewise have 7 and 6 morae.
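The counts in the examples above can be reproduced with a small sketch. The counting conventions are assumptions inferred from those examples: small kana never start a new mora or syllable, 「ッ」 and 「ー」 count as a mora but extend the preceding syllable, and 「ン」 counts as both a mora and a syllable.

```python
SMALL_KANA = set("ャュョァィゥェォヮ")
SYLLABLE_EXTENDERS = SMALL_KANA | set("ッー")


def count_morae(reading):
    """Every kana except the small ones counts as one mora."""
    return sum(1 for ch in reading if ch not in SMALL_KANA)


def count_syllables(reading):
    """Small kana, the geminate mark and the long-vowel mark extend the previous syllable."""
    count = 0
    for ch in reading:
        if ch in SYLLABLE_EXTENDERS and count > 0:
            continue
        count += 1
    return count
```

Under these conventions 「スパゲティ」, 「スパゲティー」 and 「スパゲッティー」 all have 4 syllables but 4, 5 and 6 morae, which is the mismatch the normalization rules are meant to remove.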

これらに対し、図１２に示すような音節正規化ルールを用いることにより、全て音節数で数えられるように読みの長さの単位を統一でき、（ステップｓ２２−３）において位置合わせが可能となる。 On the other hand, by using the syllable normalization rule as shown in FIG. 12, the unit of the reading length can be unified so that all can be counted by the number of syllables, and the alignment can be performed in (step s22-3). .
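Ordered application of syllable normalization rules might look like the following sketch. The three rules are hypothetical stand-ins chosen so that the readings mentioned in this description normalize as stated (for example 「アップル」 to 「アプル」 and 「シーパラ」 to 「シパラ」); the actual rules of FIG. 12 are not reproduced here and may differ.

```python
import re

# (rule number, pattern, replacement); all three rules are illustrative assumptions.
SYLLABLE_RULES = [
    (1, re.compile("ッ"), ""),                      # drop the geminate mark
    (2, re.compile("ー"), ""),                      # drop the long-vowel mark
    (3, re.compile("(?<=[ョュユヨ])ウ"), ""),       # drop ウ that lengthens a preceding vowel
]


def normalize_reading(reading, rules=SYLLABLE_RULES):
    """Apply every rule in rule-number order; the morpheme delimiter is left untouched."""
    for _, pattern, replacement in sorted(rules, key=lambda rule: rule[0]):
        reading = pattern.sub(replacement, reading)
    return reading
```

After normalization, variants such as 「スパゲティー」 and 「スパゲッティー」 collapse to the same string, so their syllables can be compared position by position.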

［音節類似度テーブル（記憶部）５］
図１３（ａ）は音節類似度テーブルの作成手順、同図（ｂ）は音節類似度テーブルの一例を示すもので、音節類似度テーブル５は、キー：表記は異なるが読みが類似する音節ペア、値：距離（類似度）、により構成される。このテーブル５は図１３（ａ）に示すように、形態素解析辞書から標準表記が同じで発音（読み）が異なる単語を収集し、読み正規化処理部１２と同様に音節正規化ルール４を適用して読みの長さの単位を音節に統一し、音節数が等しい場合に位置合わせを行う。そして音節位置が同じで読みが異なる音節ペアを抜き出してカウントし、音節ペアの出現数を、音節ペアを構成する音節それぞれの出現回数の和で割った値を音節間の距離（類似度）とすることで作成する。
[Syllable similarity table (storage unit) 5]
FIG. 13(a) shows the procedure for creating the syllable similarity table, and FIG. 13(b) shows an example of the table. The syllable similarity table 5 consists of keys (syllable pairs that differ in notation but have similar readings) and values (distance, i.e. similarity). As shown in FIG. 13(a), the table 5 is created as follows. Words with the same standard notation but different pronunciations (readings) are collected from a morphological analysis dictionary. The syllable normalization rules 4 are applied, as in the reading normalization processing unit 12, to unify the unit of reading length to the syllable, and the readings are aligned when their syllable counts are equal. Syllable pairs at the same position with different readings are then extracted and counted, and the number of occurrences of each syllable pair divided by the sum of the numbers of occurrences of the syllables making up the pair is used as the distance (similarity) between those syllables.
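Written as a formula, the value stored for a syllable pair is as follows. The symbols are introduced here for illustration only: C(s1, s2) is the number of times the pair is extracted at the same position, and C(s1), C(s2) are the occurrence counts of the individual syllables. The text does not state over which data the individual counts are taken.

```latex
d(s_1, s_2) = \frac{C(s_1, s_2)}{C(s_1) + C(s_2)}
```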

［省略判定モデル（記憶部）６］
省略判定モデル６は２つの単語が省略語関係にあるか否かを判定するためのモデルで、判定を行いたい同義語候補ペア中の各同義語候補表現の表記及び読み、形態素解析情報、位置合わせ情報等を入力とし、同義か否かを２値判定する識別関数からなる。識別関数としては、例えばＶ．Ｖａｐｎｉｋ，“Ｔｈｅｎａｔｕｒｅｏｆｓｔａｔｉｓｔｉｃａｌｌｅａｒｎｉｎｇｔｈｅｏｒｙ”，Ｓｐｒｉｎｇｅｒ，１９９５で述べられているＳｕｐｐｏｒｔＶｅｃｔｏｒＭａｃｈｉｎｅ（ＳＶＭ）の識別関数を用い、識別関数のパラメータは予め省略判定部２３で述べた素性からなる学習データをＳＶＭで学習して決定しておく。ここでは学習アルゴリズムとしてＳＶＭを挙げたが、決定木、最大エントロピー法等のほかの学習アルゴリズムを利用しても良い。
[Omission determination model (storage unit) 6]
The omission determination model 6 is a model for determining whether two words are in an abbreviation relationship. It consists of a discriminant function that takes as input the notation and reading of each synonym candidate expression in the synonym candidate pair to be judged, morphological analysis information, alignment information, and the like, and makes a binary decision as to whether the expressions are synonymous. As the discriminant function, for example, that of the Support Vector Machine (SVM) described in V. Vapnik, "The Nature of Statistical Learning Theory", Springer, 1995 is used. Its parameters are determined in advance by training an SVM on training data composed of the features described for the omission determination unit 23. Although SVM is given here as the learning algorithm, other learning algorithms such as decision trees or the maximum entropy method may be used.
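For reference, the discriminant function of a kernel SVM is commonly written in the following standard form. Here x is the feature vector of the pair being judged, the x_i are training examples with labels y_i in {+1, -1}, the coefficients alpha_i and the bias b are learned during training, and K is the kernel. The description does not specify which kernel or parameters are used.

```latex
f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{N} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \right)
```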

前述した実施の形態における具体的な処理の実施例を詳細に説明する。ここで、同義性判定部２の読み類似判定部２２で用いる閾値には「０．９」を用い、また、省略判定モデル６の分類器としてはＳＶＭを用いることとする。まず、同義語候補ペア生成部１及び同義性判定部２で行う処理を説明し、その後、逆変換ルール３及び音節正規化ルール４の詳細な適用例、音節類似度テーブル５及び省略判定モデル６の作成例について説明する。 A working example of the specific processing in the embodiment described above is now explained in detail. Here, 0.9 is used as the threshold in the reading similarity determination unit 22 of the synonymity determination unit 2, and an SVM is used as the classifier of the omission determination model 6. First, the processing performed by the synonym candidate pair generation unit 1 and the synonymity determination unit 2 is described, followed by detailed application examples of the reverse conversion rules 3 and the syllable normalization rules 4, and examples of creating the syllable similarity table 5 and the omission determination model 6.

［Ｉ］同義語候補ペア生成部１及び同義性判定部２で行う処理
［同義語候補ペア生成部１］
解析処理部１１への入力テキストが図１４に示すようなものであった場合、「アップル社」，「ｅｖａ」などの同義語候補表現が抽出される。入力テキスト中では「アップル社」という表現が２度出現しているが、解析処理結果テーブル７には１度だけ書き出す。１テキスト全ての解析を終えた状態が図１５に示すようになったものとして、以下説明を行う。
[I] Processing performed by the synonym candidate pair generation unit 1 and the synonymity determination unit 2
[Synonym candidate pair generation unit 1]
When the input text to the analysis processing unit 11 is as shown in FIG. 14, synonym candidate expressions such as 「アップル社」 (Apple Inc.) and 「ｅｖａ」 are extracted. Although the expression 「アップル社」 appears twice in the input text, it is written to the analysis processing result table 7 only once. The following description assumes that the state after analyzing the entire text is as shown in FIG. 15.

正規化処理部１２への入力を図１５に示す解析処理結果テーブル７の全レコードの同義語候補表現カラムのリストとし、リスト内のレコードごとに以下の処理を繰り返す。 The input to the normalization processing unit 12 is a list of synonym candidate expression columns of all records in the analysis processing result table 7 shown in FIG. 15, and the following processing is repeated for each record in the list.

まずＩＤ１の「アップル社」から処理を開始する。 First, processing is started from “Apple Inc.” of ID1.

（ステップｓ１２−１）解析結果カラムの表記「アップル／社」はアルファベットではないため、そのままステップｓ１２−２へ進む（ここで処理を行う例；レコードＩＤ２：ｅｖａは大文字ＥＶＡへと変換し、上書きする。）。 (Step s12-1) Since the notation 「アップル／社」 in the analysis result column is not alphabetic, the process proceeds directly to step s12-2 (an example where processing does occur at this step: for record ID 2, 「ｅｖａ」 is converted to uppercase 「ＥＶＡ」 and overwritten).

（ステップｓ１２−２）解析結果カラムの表記「アップル／社」、解析結果カラムの読み「アップル／シャ」に逆変換ルール３（詳細は後述）の表記用正規表現、読み用正規表現をそれぞれ適用すると、それぞれ「アップル」、「アップル」となる。その結果を解析処理結果テーブル７の表記正規化カラム、表記＋読み正規化カラムへ書き出し、ステップｓ１２−３へ進む。 (Step s12-2) Applying the notation regular expression and the reading regular expression of the reverse conversion rules 3 (details are described later) to the analysis result column notation 「アップル／社」 and the analysis result column reading 「アップル／シャ」, respectively, yields 「アップル」 in both cases. The results are written to the notation normalization column and the notation + reading normalization column of the analysis processing result table 7, and the process proceeds to step s12-3.

（ステップｓ１２−３）解析結果カラムの読み「アップル／シャ」と、ステップｓ１２−２で書き出した表記＋読み正規化カラムの読み「アップル」に対して音節正規化ルール４（詳細は後述）を適用する。前者は「アプル／シャ」となり、読み正規化カラムへ書き出される。後者は「アプル」となり、表記＋読み正規化カラムへ上書きされる。 (Step s12-3) The syllable normalization rules 4 (details are described later) are applied to the analysis result column reading 「アップル／シャ」 and to the notation + reading normalization column reading 「アップル」 written in step s12-2. The former becomes 「アプル／シャ」 and is written to the reading normalization column. The latter becomes 「アプル」 and overwrites the notation + reading normalization column.

以上の処理をＩＤ２以後も同様に繰り返す（図１５ではＩＤ１４まで表示）。その結果が図３となる。 The above processing is repeated in the same manner after ID2 (in FIG. 15, ID14 is displayed). The result is shown in FIG.

ここで、（ステップｓ１２−２）で逆変換ルール３が適用されるのは、図３に示す処理結果テーブル７内のレコードのうち（以下の例からは形態素区切り記号“／”は必要のない限り省略する。）、
・レコードＩＤ１：アップル社（→表記：アップル，表記＋読み：アップル）
・レコードＩＤ１０：ショコタン（→表記：ショコ，表記＋読み：ショコ）
・レコードＩＤ１２：ヨン様（→表記：ヨン，表記＋読み：ヨン）
・レコードＩＤ１４：ミキティ（→表記：ミキ，表記＋読み：ミキ）
の４つである。
Here, the reverse conversion rules 3 are applied in step s12-2 to the following four records in the processing result table 7 shown in FIG. 3 (in the examples from here on, the morpheme delimiter "／" is omitted unless it is needed):
・Record ID 1: アップル社 (→ notation: アップル, notation + reading: アップル)
・Record ID 10: ショコタン (→ notation: ショコ, notation + reading: ショコ)
・Record ID 12: ヨン様 (→ notation: ヨン, notation + reading: ヨン)
・Record ID 14: ミキティ (→ notation: ミキ, notation + reading: ミキ)

また、（ステップｓ１２−３）で音節正規化ルール４が適用されるのは、図３に示す処理結果テーブル７内のレコードのうち、
・レコードＩＤ１：アップル社（→読み：アプルシャ，表記＋読み：アプル）
・レコードＩＤ２：ショウコ（→読み：ショコ，表記＋読み：ショコ）
・レコードＩＤ４：八景島シーパラダイス（→読み：ハケイジマシパラダイス，表記＋読み：ハケイジマシパラダイス）
・レコードＩＤ８：アップル（→読み：アプル，表記＋読み：アプル）
・レコードＩＤ１１：シーパラ（→読み：シパラ，表記＋読み：シパラ）
の５つである。
Similarly, the syllable normalization rules 4 are applied in step s12-3 to the following five records in the processing result table 7 shown in FIG. 3:
・Record ID 1: アップル社 (→ reading: アプルシャ, notation + reading: アプル)
・Record ID 2: ショウコ (→ reading: ショコ, notation + reading: ショコ)
・Record ID 4: 八景島シーパラダイス (→ reading: ハケイジマシパラダイス, notation + reading: ハケイジマシパラダイス)
・Record ID 8: アップル (→ reading: アプル, notation + reading: アプル)
・Record ID 11: シーパラ (→ reading: シパラ, notation + reading: シパラ)

ペア生成部１３では正規化処理部１２で作成した図３に示す解析処理結果テーブル７の全レコード総当たりで同義語候補ペアを作成し、図５に示す同義語候補ペアリスト８を出力する。 The pair generation unit 13 creates synonym candidate pairs by exhaustively pairing all records in the analysis processing result table 7 shown in FIG. 3, which was created by the normalization processing unit 12, and outputs the synonym candidate pair list 8 shown in FIG. 5.
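Exhaustive pairing of candidates can be sketched with `itertools.combinations`. The record layout and the numbering order of the generated pairs are assumptions made for illustration; the ordering used in FIG. 5 is not reproduced here.

```python
from itertools import combinations


def generate_candidate_pairs(candidate_ids):
    """Pair every candidate with every other candidate exactly once."""
    return [
        {"record_id": number, "candidate_ids": (a, b)}
        for number, (a, b) in enumerate(combinations(candidate_ids, 2), start=1)
    ]
```

For n candidates this yields n(n-1)/2 pairs, so the list grows quadratically with the number of extracted expressions.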

［同義性判定部２］
同義性判定部２への入力は、図５に示す同義語候補ペアリスト８及び図３に示す解析処理結果テーブル７である。以下、図５のリスト８のレコードごとに表記類似判定部２１から省略判定部２３までの処理を繰り返し、どこかの過程で同義と判定された時点でその同義語候補ペアを同義であると認定し、次の同義語候補ペアの処理へと移行する。最後まで同義と判定されなかったペアは同義語であると認定しない。全てのレコードの処理が終了した時点で、同義語と認定された同義語候補ペアを同義語ペアリストとして出力する。
[Synonymity determination unit 2]
The inputs to the synonymity determination unit 2 are the synonym candidate pair list 8 shown in FIG. 5 and the analysis processing result table 7 shown in FIG. 3. For each record in the list 8 of FIG. 5, the processing from the notation similarity determination unit 21 through the omission determination unit 23 is repeated. As soon as a pair is determined to be synonymous at any stage, that synonym candidate pair is recognized as synonymous and processing moves on to the next pair. A pair that is not determined to be synonymous by the end is not recognized as a synonym pair. When all records have been processed, the synonym candidate pairs recognized as synonyms are output as the synonym pair list.

まず、同義語ペアリスト８内のレコードＩＤ１（「アップル社」，「ＥＶＡ」）から処理を開始する。 First, processing is started from record ID 1 (“Apple”, “EVA”) in the synonym pair list 8.

表記類似判定部２１では同義語ペアリスト８のレコードＩＤ１に対応する候補ＩＤ１，ＩＤ２から、解析処理結果テーブル７を参照すると、表記正規化カラムの表記は「アップル」，「ＥＶＡ」となっており、異なるため表記類似判定部２１での処理を終了し、読み類似判定部２２へと進む。 When the notation similarity determination unit 21 refers to the analysis processing result table 7 from the candidate IDs 1 and 2 corresponding to the record ID 1 of the synonym pair list 8, the notation normalization columns are “Apple” and “EVA”. Therefore, the processing in the notation similarity determination unit 21 is terminated, and the process proceeds to the reading similarity determination unit 22.

読み類似判定部２２では同義語ペアリスト８のレコードＩＤ１に対応する候補ＩＤ１，ＩＤ２から処理結果テーブル７の読みを参照し、判定を行う。この同義語候補ペアの読みはそれぞれ、解析結果カラムの読み：「アップルシャ」，「エバ」、読み正規化カラムの読み：「アプルシャ」，「エバ」、表記＋読み正規化カラムの読み：「アプル」，「エバ」の３つとなる。それぞれについて以下の処理を３回繰り返し行う。 The reading similarity determination unit 22 refers to the readings in the processing result table 7 via candidate IDs 1 and 2, which correspond to record ID 1 of the synonym pair list 8, and performs the determination. This synonym candidate pair has three sets of readings: the analysis result column readings 「アップルシャ」 and 「エバ」; the reading normalization column readings 「アプルシャ」 and 「エバ」; and the notation + reading normalization column readings 「アプル」 and 「エバ」. The following processing is repeated for each of the three.

まず、解析結果カラムの読み：「アップルシャ」，「エバ」について処理を行う。 First, the analysis result column readings: “Applesha” and “Eva” are processed.

（ステップｓ２２−１）同義語候補ペアの読みが異なるためステップｓ２２−２へ進む。 (Step s22-1) Since reading of synonym candidate pairs is different, the process proceeds to Step s22-2.

（ステップｓ２２−２）読みの音節数をカウントすると「アップルシャ」は４音節、「エバ」は２音節で異なるため、次の繰り返し処理へ移る。 (Step s22-2) Counting the syllables of the readings gives 4 syllables for 「アップルシャ」 and 2 syllables for 「エバ」. Since these differ, the process moves on to the next iteration.

ステップｓ２２−１へ戻り、「アプルシャ」，「エバ」、「アプル」，「エバ」と処理を繰り返すが、両者とも音節数が異なり、同義とならないため、省略判定部２３へ進む。 The process returns to step s22-1 and is repeated for 「アプルシャ」 and 「エバ」, and then for 「アプル」 and 「エバ」. In both cases the syllable counts differ and the pair is not judged synonymous, so the process proceeds to the omission determination unit 23.

省略判定部２３で同義語ペアリスト８のレコードＩＤ１のペアについて次に挙げる４通りの情報について、解析処理テーブル７を参照して判定を行う。解析結果カラムの表記：「アップル社」，「ＥＶＡ」、表記正規化カラムの表記：「アップル」，「ＥＶＡ」、読み正規化後カラムの読み：「アプルシャ」，「エバ」、表記＋読み正規化カラムの読み：「アプル」，「エバ」の４パターンである。それぞれについて以下の処理を繰り返し行う。 The omission determination unit 23 judges the pair of record ID 1 in the synonym pair list 8 using the following four kinds of information, looked up in the analysis processing table 7: the analysis result column notations 「アップル社」 and 「ＥＶＡ」; the notation normalization column notations 「アップル」 and 「ＥＶＡ」; the reading normalization column readings 「アプルシャ」 and 「エバ」; and the notation + reading normalization column readings 「アプル」 and 「エバ」. The following processing is repeated for each of these four patterns.

まず、解析結果カラムの表記：「アップル社」，「ＥＶＡ」から処理を開始する。 First, the processing starts from the notation of the analysis result column: “Apple”, “EVA”.

（ステップｓ２３−１）表記が包含関係にないため、次の繰り返し処理に移る。続いて「アップル」，「ＥＶＡ」、「アプルシャ」，「エバ」、「アプル」，「エバ」の処理を行っていくが、全て包含関係に無いため省略判定部２３での処理を終了する。 (Step s23-1) Since the notations are not in an inclusion relationship, the process moves on to the next iteration. The pairs 「アップル」 and 「ＥＶＡ」, 「アプルシャ」 and 「エバ」, and 「アプル」 and 「エバ」 are then processed in turn, but none of them is in an inclusion relationship, so the process in the omission determination unit 23 ends.

同義語ペアリスト８のレコードＩＤ１は同義性判定部２の処理中に１度も同義と判定されなかったため、同義語ペアと認定せず次のレコード、即ちＩＤ２の処理に移る。 Since the record ID 1 in the synonym pair list 8 has never been determined to be synonymous during the process of the synonym determination unit 2, the record ID 1 is not recognized as a synonym pair and the process proceeds to the next record, that is, ID2.

ここで、レコードＩＤ２（「アップル社」，「八景島シーパラダイス」）、レコードＩＤ３（「アップル社」，「翔子」）もレコードＩＤ１と同様に同義とならないので、以下、図５のリスト８内で最終的に同義と判定されるレコードＩＤ７，ＩＤ２２，ＩＤ４５，ＩＤ６３，ＩＤ７８，ＩＤ９０，ＩＤ１２１のうち、代表的なパターンであるＩＤ７，ＩＤ２２，ＩＤ４５，ＩＤ６３，ＩＤ１２１に絞って同義性判定部２での処理を説明する。 Here, the record ID 2 (“Apple Inc.”, “Hakkeijima Sea Paradise”) and the record ID 3 (“Apple Inc.”, “Shoko”) are not synonymous with the record ID 1, so in the list 8 of FIG. Of the record ID7, ID22, ID45, ID63, ID78, ID90, and ID121 that are finally determined to be synonymous, the synonymity determination unit 2 narrows down to representative patterns ID7, ID22, ID45, ID63, and ID121. Processing will be described.

★レコードＩＤ７（「アップル社」，「アップル」）
［表記類似判定部２１］
（ステップｓ２１−１）同義語候補ペアリスト８の候補ＩＤ１，ＩＤ８から解析処理結果テーブル７を参照すると、表記正規化カラムの表記同士は「アップル」，「アップル」となっており、全く同じなためステップｓ２１−２へ進む。
★ Record ID 7 (「アップル社」, 「アップル」)
[Notation similarity determination unit 21]
(Step s21-1) Referring to the analysis processing result table 7 via candidate IDs 1 and 8 of the synonym candidate pair list 8, the notations in the notation normalization column are 「アップル」 and 「アップル」, which are exactly the same, so the process proceeds to step s21-2.

★レコードＩＤ２２（「ＥＶＡ」，「エヴァ」）
［表記類似判定部２１］
（ステップｓ２１−１）同義語候補ペアリスト８の候補ＩＤ２，ＩＤ９から解析処理結果テーブル７を参照すると、表記正規化カラムの表記同士は「ＥＶＡ」，「エヴァ」で異なるため表記類似判定部２１の処理を終了し、読み類似判定部２２へと進む。
★ Record ID 22 (「ＥＶＡ」, 「エヴァ」)
[Notation similarity determination unit 21]
(Step s21-1) Referring to the analysis processing result table 7 via candidate IDs 2 and 9 of the synonym candidate pair list 8, the notations in the notation normalization column are 「ＥＶＡ」 and 「エヴァ」, which differ, so the process in the notation similarity determination unit 21 ends and proceeds to the reading similarity determination unit 22.

［読み類似判定部２２］
同義語候補ペアリスト８の候補ＩＤ２，ＩＤ９から解析処理結果テーブル７を参照して
・解析結果カラムの読み：「エバ」，「エヴァ」
・読み正規化カラムの読み：「エバ」，「エヴァ」
・表記＋読み正規化カラムの読み：「エバ」，「エヴァ」
を求め、繰り返し処理を行う。
[Reading similarity determination unit 22]
Referring to the analysis processing result table 7 via candidate IDs 2 and 9 of the synonym candidate pair list 8, the following readings are obtained and processed in turn:
・Analysis result column readings: 「エバ」 (eba), 「エヴァ」 (eva)
・Reading normalization column readings: 「エバ」, 「エヴァ」
・Notation + reading normalization column readings: 「エバ」, 「エヴァ」

まず、解析結果カラムの読み：「エバ」，「エヴァ」について処理を行う。 First, the analysis result column readings 「エバ」 and 「エヴァ」 are processed.

（ステップｓ２２−１）読みが異なるためステップｓ２２−２へ進む。 (Step s22-1) Since the reading is different, the process proceeds to Step s22-2.

（ステップｓ２２−２）読みの音節数をカウントし、両者とも２音節なためステップｓ２２−３へ進む。 (Step s22-2) The number of syllables to be read is counted, and since both are two syllables, the process proceeds to Step s22-3.

（ステップｓ２２−３）音節位置が同じで読みが異なる音節は、「バ」と「ヴァ」で、音節類似度テーブル４（図１３（ｂ）：詳細は後述）から距離が０．８７と求まる。この同義語候補ペアには音節位置が同じで読みが異なるペアは１つしかないため、ペア間の距離は０．８７となる。ステップｓ２２−４へ進む。 (Step s22-3) The syllables having the same syllable position but different readings are “B” and “V”, and the distance is obtained as 0.87 from the syllable similarity table 4 (FIG. 13B: details will be described later). . Since this synonym candidate pair has only one pair with the same syllable position and different readings, the distance between the pairs is 0.87. Proceed to step s22-4.

（ステップｓ２２−４）距離の総和は０．８７で予め設定した閾値０．９より小さいためステップｓ２２−５へ進む。 (Step s22-4) Since the total sum of the distances is 0.87 and is smaller than the preset threshold value 0.9, the process proceeds to Step s22-5.
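Steps s22-1 to s22-4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the syllable splitter simply attaches a small kana to the kana before it, the table `SYLLABLE_DISTANCE` holds only the one distance quoted in the text (バ/ヴァ = 0.87), and treating a syllable pair that is missing from the table as "not similar" is an assumption made here. The names `split_syllables`, `SYLLABLE_DISTANCE`, and `readings_similar` are hypothetical.

```python
SMALL_KANA = set("ァィゥェォャュョ")

def split_syllables(reading):
    """Split a katakana reading into syllables; a small kana joins the kana before it."""
    syllables = []
    for ch in reading:
        if ch in SMALL_KANA and syllables:
            syllables[-1] += ch
        else:
            syllables.append(ch)
    return syllables

# Hypothetical excerpt of a syllable similarity table.
# Only the distance for the pair バ/ヴァ (0.87) is taken from the text.
SYLLABLE_DISTANCE = {frozenset(("バ", "ヴァ")): 0.87}
THRESHOLD = 0.9  # preset threshold quoted in step s22-4

def readings_similar(a, b, table=SYLLABLE_DISTANCE, threshold=THRESHOLD):
    """Return True when two readings are judged similar by steps s22-1 to s22-4."""
    if a == b:                       # step s22-1: identical readings
        return True
    sa, sb = split_syllables(a), split_syllables(b)
    if len(sa) != len(sb):           # step s22-2: syllable counts must match
        return False
    total = 0.0
    for x, y in zip(sa, sb):         # step s22-3: sum distances of differing syllables
        if x != y:
            key = frozenset((x, y))
            if key not in table:     # assumption: an unknown pair is not similar
                return False
            total += table[key]
    return total < threshold         # step s22-4: compare with the threshold
```

With this sketch, `readings_similar("エバ", "エヴァ")` follows the path described for record ID 22: two syllables each, one differing pair, total distance 0.87, below the threshold 0.9.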

★ Record ID 45 ("翔子", "ショコタン")
[Notation similarity determination unit 21]
(Step s21-1) Looking up candidate IDs 3 and 10 of the synonym candidate pair list 8 in the analysis processing result table 7 gives the notation normalization column entries "翔子" and "ショコ". Because they differ, the notation similarity determination unit 21 ends its processing and the process proceeds to the reading similarity determination unit 22.

[Reading similarity determination unit 22]
Candidate IDs 3 and 10 of the synonym candidate pair list 8 are looked up in the analysis processing result table 7 to obtain
・readings in the analysis result column: "ショウコ", "ショコタン"
・readings in the reading normalization column: "ショコ", "ショコタン"
・readings in the notation + reading normalization column: "ショコ", "ショコ"
and the following processing is repeated for each of them.

First, the readings in the analysis result column, "ショウコ" and "ショコタン", are processed.

(Step s22-2) The syllables of each reading are counted. "ショウコ" has three syllables and "ショコタン" has four, so the counts differ and the process moves on to the next iteration.

Next, the readings in the reading normalization column, "ショコ" and "ショコタン", are processed.

(Step s22-2) The syllables of each reading are counted. "ショコ" has two syllables and "ショコタン" has four, so the counts differ and the process moves on to the next iteration.

Next, the readings in the notation + reading normalization column, "ショコ" and "ショコ", are processed.

(Step s22-1) The readings are identical, so the process proceeds to step s22-5.
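The column-by-column fallback walked through for record ID 45 can be sketched as a loop that tries each reading column in turn and stops at the first one that either matches or needs a distance comparison. This is only an illustrative sketch: the function name `trace_columns`, the column labels, and the outcome strings are hypothetical, and the distance comparison of steps s22-3 and s22-4 is left out.

```python
SMALL_KANA = set("ァィゥェォャュョ")

def split_syllables(reading):
    """Split a katakana reading into syllables; a small kana joins the kana before it."""
    syllables = []
    for ch in reading:
        if ch in SMALL_KANA and syllables:
            syllables[-1] += ch
        else:
            syllables.append(ch)
    return syllables

def trace_columns(columns):
    """Try each (column name, (reading_a, reading_b)) in order and record the outcome."""
    trace = []
    for name, (a, b) in columns:
        if a == b:                                   # step s22-1
            trace.append((name, "equal readings -> synonymous"))
            break
        na, nb = len(split_syllables(a)), len(split_syllables(b))
        if na != nb:                                 # step s22-2
            trace.append((name, f"{na} vs {nb} syllables -> next column"))
        else:                                        # steps s22-3/s22-4 would follow here
            trace.append((name, "same syllable count -> compare distances"))
            break
    return trace

# The three reading patterns listed for record ID 45.
COLUMNS_ID45 = [
    ("analysis result", ("ショウコ", "ショコタン")),
    ("reading normalization", ("ショコ", "ショコタン")),
    ("notation + reading normalization", ("ショコ", "ショコ")),
]
```

Calling `trace_columns(COLUMNS_ID45)` reproduces the order of events in the text: the first two columns fail on syllable count, and the third succeeds because the readings are identical.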

★ Record ID 63 ("八景島シーパラダイス", "シーパラ")
[Notation similarity determination unit 21]
(Step s21-1) Looking up candidate IDs 4 and 11 of the synonym candidate pair list 8 in the analysis processing result table 7 gives the notation normalization column entries "八景島シーパラダイス" and "シーパラ". Because they differ, the notation similarity determination unit 21 ends its processing and the process proceeds to the reading similarity determination unit 22.

[Reading similarity determination unit 22]
Candidate IDs 4 and 11 of the synonym candidate pair list 8 are looked up in the analysis processing result table 7 to obtain
・readings in the analysis result column: "ハッケイジマシーパラダイス", "シーパラ"
・readings in the reading normalization column: "ハケイジマシパラダイス", "シパラ"
・readings in the notation + reading normalization column: "ハケイジマシパラダイス", "シパラ"
and the following processing is repeated for each of them.

When the same processing as above is repeated for each of these three patterns, the number of syllables differs between the two readings in every case, so the pair is not judged synonymous and the process proceeds to the omission determination unit 23.

[Omission determination unit 23]
Candidate IDs 4 and 11 of the synonym candidate pair list 8 are looked up in the analysis processing result table 7 to obtain
・notations in the analysis result column: "八景島シーパラダイス", "シーパラ"
・notations in the notation normalization column: "八景島シーパラダイス", "シーパラ"
・readings in the reading normalization column: "ハケイジマシパラダイス", "シパラ"
・readings in the notation + reading normalization column: "ハケイジマシパラダイス", "シパラ"

The following processing is repeated for each of these four patterns.

First, the notations in the analysis result column, "八景島シーパラダイス" and "シーパラ", are processed.

(Step s23-1) The notations are in an inclusion relation, so the process proceeds to step s23-2.

(Step s23-2) Alignment by the DP matching method gives the result shown at the upper left of FIG. 16. The process proceeds to step s23-3.

(Step s23-3) Features are extracted as shown for step s23-3 in FIG. 16. The expression before deletion is "八景島シーパラダイス", the expression after deletion is "シーパラ", the morpheme remaining after deletion is "シー", the deleted morpheme is "八景島", the deleted characters are "ダ", "イ", and "ス", and the remaining characters are "パ" and "ラ". For each of these six items, features such as the number of morphemes, the number of characters, and the part of speech are extracted with reference to the analysis processing result table 7, as shown on the right side of FIG. 16.

(Step s23-4) The omission determination model 6, which is a classifier model, is used to determine whether the synonym candidate pair is in an abbreviation relationship. The pair is judged synonymous, so the process proceeds to step s23-5.
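The inclusion check of step s23-1 and the split into remaining and deleted parts used in step s23-3 can be illustrated with the sketch below. It rests on several assumptions: "inclusion relation" is read here as "the short form is a subsequence of the long form", a greedy left-to-right match stands in for the DP matching of step s23-2, and the morpheme segmentation 八景島 / シー / パラダイス is supplied by hand. The function names and dictionary keys are hypothetical, and the numeric features and the classifier of step s23-4 are not modeled.

```python
def align_kept(long_form, short_form):
    """Flag each character of long_form as kept (True) or deleted (False).

    Returns None when short_form is not a subsequence of long_form.
    Greedy left-to-right matching is used as a simple stand-in for DP matching.
    """
    flags, j = [], 0
    for ch in long_form:
        if j < len(short_form) and ch == short_form[j]:
            flags.append(True)
            j += 1
        else:
            flags.append(False)
    return flags if j == len(short_form) else None

def extract_parts(morphemes, short_form):
    """Classify morphemes and characters of the long form as remaining or deleted."""
    flags = align_kept("".join(morphemes), short_form)
    if flags is None:                      # step s23-1: no inclusion relation
        return None
    parts = {"kept_morphemes": [], "deleted_morphemes": [],
             "kept_chars": [], "deleted_chars": []}
    pos = 0
    for m in morphemes:
        f = flags[pos:pos + len(m)]
        if all(f):                         # morpheme survives whole
            parts["kept_morphemes"].append(m)
        elif not any(f):                   # morpheme deleted whole
            parts["deleted_morphemes"].append(m)
        else:                              # morpheme partly deleted: go to characters
            parts["kept_chars"] += [c for c, k in zip(m, f) if k]
            parts["deleted_chars"] += [c for c, k in zip(m, f) if not k]
        pos += len(m)
    return parts
```

For `extract_parts(["八景島", "シー", "パラダイス"], "シーパラ")` the sketch yields the same four groups named in step s23-3: remaining morpheme シー, deleted morpheme 八景島, remaining characters パ and ラ, and deleted characters ダ, イ, and ス.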

★ Record ID 121 ("安藤美姫", "ミキティ")
[Notation similarity determination unit 21]
(Step s21-1) Looking up candidate IDs 7 and 14 of the synonym candidate pair list 8 in the analysis processing result table 7 gives the notation normalization column entries "安藤美姫" and "ミキ". Because they differ, the notation similarity determination unit 21 ends its processing and the process proceeds to the reading similarity determination unit 22.

[Reading similarity determination unit 22]
Candidate IDs 7 and 14 of the synonym candidate pair list 8 are looked up in the analysis processing result table 7 to obtain
・readings in the analysis result column: "アンドウミキ", "ミキティ"
・readings in the reading normalization column: "アンドウミキ", "ミキティ"
・readings in the notation + reading normalization column: "アンドウミキ", "ミキ"
and the following processing is repeated for each of them.

[Omission determination unit 23]
Candidate IDs 7 and 14 of the synonym candidate pair list 8 are looked up in the analysis processing result table 7 to obtain
・notations in the analysis result column: "安藤美姫", "ミキティ"
・notations in the notation normalization column: "安藤美姫", "ミキ"
・readings in the reading normalization column: "アンドウミキ", "ミキティ"
・readings in the notation + reading normalization column: "アンドウミキ", "ミキ"

The following processing is repeated for each of these four patterns.

First, the notations in the analysis result column, "安藤美姫" and "ミキティ", are processed.

(Step s23-1) The notations are not in an inclusion relation, so the process moves on to the next iteration.

The pairs "安藤美姫" / "ミキ" and "アンドウミキ" / "ミキティ" are then processed in turn, but neither pair is in an inclusion relation, so neither is judged synonymous. The last pair, "アンドウミキ" / "ミキ", is processed as follows.

(Step s23-1) The readings are in an inclusion relation, so the process proceeds to step s23-2.

Steps s23-2 to s23-5 are then carried out in the same way as in the example of record ID 63 ("八景島シーパラダイス", "シーパラ").

After the synonymity of every synonym candidate pair in the synonym candidate pair list 8 has been determined in this way, the records judged synonymous, IDs 7, 22, 45, 63, 78, 90, and 121, are output as synonym pairs.

[II] Detailed application examples of the reverse conversion rules 3 and the syllable normalization rules 4, and examples of creating the syllable similarity table 5 and the omission determination model 6
For the reverse conversion rules 3, this embodiment describes the following rule types listed in FIG. 11: prefix deletion, suffix deletion, reading-kana deletion, repeated-expression deletion, and ellipsis-mark deletion. In the examples, "／" marks a morpheme boundary.

Prefix deletion: applied to "notation: お／吉, reading: オ／キチ", the rule deletes the prefix "お" and gives "notation: 吉, reading: キチ".

Suffix deletion: applied to "notation: アップル／社, reading: アップル／シャ", the rule deletes the suffix "社" and gives "notation: アップル, reading: アップル".

Reading-kana deletion: applied to "notation: 安／（／やす／）／めぐみ, reading: ヤス／ヤス／メグミ", the rule gives "notation: 安／めぐみ, reading: ヤス／メグミ".

Repeated-expression deletion: applied to "notation: キョンキョン, reading: キョンキョン", the rule gives "notation: キョン, reading: キョン".

Ellipsis-mark deletion: applied to "notation: ハリーポッター３／／炎／の／−, reading: ハリーポッターサン／／ホノオ／ノ", the rule gives "notation: ハリーポッター３／／炎／の, reading: ハリーポッターサン／／ホノオ／ノ".
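Three of these rules (prefix, suffix, and repeated-expression deletion) can be sketched as small functions that edit the notation and the reading together, so that the reading is corrected whenever a string is removed from the notation. The affix sets `PREFIXES` and `SUFFIXES` contain only the two affixes used in the examples and are assumptions, as are the function names; how the actual rules of FIG. 11 identify affixes (for example by part of speech) is not modeled here.

```python
# Hypothetical affix lists holding only the affixes from the examples in the text.
PREFIXES = {"お"}
SUFFIXES = {"社"}

def delete_prefix(notation, reading):
    """Drop a leading prefix morpheme from the notation and its reading together."""
    if len(notation) > 1 and notation[0] in PREFIXES:
        return notation[1:], reading[1:]
    return notation, reading

def delete_suffix(notation, reading):
    """Drop a trailing suffix morpheme from the notation and its reading together."""
    if len(notation) > 1 and notation[-1] in SUFFIXES:
        return notation[:-1], reading[:-1]
    return notation, reading

def delete_repetition(text):
    """Reduce an expression made of one string repeated twice to a single copy."""
    half = len(text) // 2
    if half > 0 and len(text) % 2 == 0 and text[:half] == text[half:]:
        return text[:half]
    return text
```

The notation and reading are passed as parallel lists of morphemes, mirroring the "／"-separated form used in the examples, for instance `delete_suffix(["アップル", "社"], ["アップル", "シャ"])`.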

For the syllable normalization rules 4, this embodiment describes the four examples listed in FIG. 12: "ユウコ", "ウィンブルドン", "イェルサレム", and "スウィング".

★ "ユウコ"
The same vowel continues across "ユ" and "ウ", so rule 1 applies and the repeated vowel becomes a long vowel mark, giving "ユーコ". Rule 5 then applies to the long vowel produced by rule 1, giving "ユコ".

★ "ウィンブルドン"
The vowel "ウ" is followed by "ィ", the small-kana (yōon) form of a different vowel, so rule 2 applies, giving "ウインブルドン".

★ "イェルサレム"
The vowel "イ" is followed by "ェ", the small-kana (yōon) form of a different vowel, so rule 2 applies, giving "イエルサレム". In addition, the vowel and small kana that matched rule 2 satisfy the condition of rule 3, so the vowel "イ" is deleted, giving "エルサレム".

★ "スウィング"
The vowel "ウ" is followed by "ィ", the small-kana (yōon) form of a different vowel, so rule 2 applies, giving "スウイング". As a result of applying rule 2, the vowels now run "ウウイ", which satisfies the condition of rule 4, so one of the repeated vowels "ウ" is deleted, giving "スイング".
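The four worked examples can be reproduced with the sketch below, which applies rules 1 to 5 in order of rule number. It is a simplified reading of the rules and not the rule set of FIG. 12 itself: rule 1 is applied per syllable (a small kana is attached to the kana before it, so "ウィ" is one syllable), rule 3 is reduced to the "イ" + "ェ" case without any further condition, and the kana-to-vowel table covers only basic katakana. All names (`VOWEL_OF`, `split_syllables`, `vowel_of`, `normalize`) are hypothetical.

```python
VOWELS = "アイウエオ"
ROWS = ["アイウエオ", "カキクケコ", "サシスセソ", "タチツテト", "ナニヌネノ",
        "ハヒフヘホ", "マミムメモ", "ラリルレロ", "ガギグゲゴ", "ザジズゼゾ",
        "ダヂヅデド", "バビブベボ", "パピプペポ"]
# Vowel carried by each full-size kana, taken from its column in the kana table.
VOWEL_OF = {kana: VOWELS[i] for row in ROWS for i, kana in enumerate(row)}
VOWEL_OF.update({"ヤ": "ア", "ユ": "ウ", "ヨ": "オ", "ワ": "ア", "ヴ": "ウ"})
SMALL_VOWELS = {"ァ": "ア", "ィ": "イ", "ゥ": "ウ", "ェ": "エ", "ォ": "オ"}
SMALL_Y = {"ャ": "ア", "ュ": "ウ", "ョ": "オ"}

def split_syllables(reading):
    """Split a katakana reading into syllables; a small kana joins the kana before it."""
    syllables = []
    for ch in reading:
        if (ch in SMALL_VOWELS or ch in SMALL_Y) and syllables:
            syllables[-1] += ch
        else:
            syllables.append(ch)
    return syllables

def vowel_of(syllable):
    """Return the vowel a syllable ends in, or None for ン, ッ, and ー."""
    last = syllable[-1]
    return SMALL_VOWELS.get(last) or SMALL_Y.get(last) or VOWEL_OF.get(last)

def normalize(reading):
    """Apply simplified versions of rules 1 to 5 in rule-number order."""
    # Rule 1: a bare vowel that repeats the previous syllable's vowel becomes ー.
    step1 = []
    for syl in split_syllables(reading):
        if step1 and len(syl) == 1 and syl in VOWELS and vowel_of(step1[-1]) == syl:
            step1.append("ー")
        else:
            step1.append(syl)
    step2 = []
    for syl in step1:
        vowel_plus_small = (len(syl) == 2 and syl[0] in VOWELS
                            and SMALL_VOWELS.get(syl[1], syl[0]) != syl[0])
        if not vowel_plus_small:
            step2.append(syl)
            continue
        # Rule 2: the small vowel becomes a normal vowel.
        first, second = syl[0], SMALL_VOWELS[syl[1]]
        if (first, second) == ("イ", "エ"):
            step2.append(second)            # Rule 3 (simplified): drop イ
        elif step2 and vowel_of(step2[-1]) == first:
            step2.append(second)            # Rule 4: drop one of the repeated vowels
        else:
            step2.extend([first, second])
    # Rule 5: delete long vowel marks and geminate marks.
    return "".join(s for s in step2 if s not in ("ー", "ッ"))
```

Under these assumptions `normalize` maps "ユウコ" to "ユコ", "ウィンブルドン" to "ウインブルドン", "イェルサレム" to "エルサレム", and "スウィング" to "スイング", matching the four examples.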

The method of creating the syllable similarity table 5 is described using the example of "アーティスト" and its variant spellings shown in FIG. 13. The morphological analysis dictionary used is one that handles variant spellings and spelling fluctuations and holds their readings and standard notation. First, the words "アーティスト", "アーテスト", and "アーチスト", which share the same standard notation but differ in pronunciation, are collected from the morphological analysis dictionary. Next, the syllable normalization rules 4 are applied to each expression, and the expressions with the same number of syllables are aligned. The results are "アティスト", "アテスト", and "アチスト", which yield three pairs of syllables at the same position with different readings: "テ and ティ", "チ and ティ", and "テ and チ". In the same way, for each standard notation in the morphological analysis dictionary, the variant spellings are collected and normalized, and the pairs that differ in reading are collected. Once the whole dictionary has been processed, the pairs of same-position syllables with different readings and the syllables that occur are each counted by type. The table is then created by calculating distances with the formula in FIG. 13.
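The pair-collection part of this procedure can be sketched as follows. The sketch takes readings that have already been normalized, aligns those with equal syllable counts, and counts each unordered pair of differing syllables. The distance formula of FIG. 13 is not reproduced in this text, so no distances are computed; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

SMALL_KANA = set("ァィゥェォャュョ")

def split_syllables(reading):
    """Split a katakana reading into syllables; a small kana joins the kana before it."""
    syllables = []
    for ch in reading:
        if ch in SMALL_KANA and syllables:
            syllables[-1] += ch
        else:
            syllables.append(ch)
    return syllables

def differing_pairs(readings):
    """Count unordered syllable pairs that share a position but differ in reading."""
    pairs = Counter()
    for a, b in combinations(readings, 2):
        sa, sb = split_syllables(a), split_syllables(b)
        if len(sa) != len(sb):       # only readings with equal syllable counts are aligned
            continue
        for x, y in zip(sa, sb):
            if x != y:
                pairs[frozenset((x, y))] += 1
    return pairs
```

For the normalized variants `["アティスト", "アテスト", "アチスト"]` the counter holds the three pairs named in the text, each seen once. Running the same function over every standard notation in a dictionary and summing the counters would give the pair counts that feed the distance calculation.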

As a method of creating the omission determination model 6, an example that uses an SVM as the discriminant function is described. The text data used for training should, if possible, be taken from the same domain as the input text used in actual operation and be analyzed with the same analyzer as in actual operation. Synonym candidate pairs are created from the training text by the same method as in Embodiment 1, and only those whose notations are in an inclusion relation are retained. Each entry is then labeled by hand as synonymous or not. Features are extracted in the same way as in the omission determination unit 23, and the parameters of the discriminant function are learned to create the omission determination model. If training also uses the data in the reading column, the notation normalization column, the reading normalization column, and the notation + reading normalization column of the synonym candidate pairs created by the synonym candidate pair generation unit 1, an omission determination model corresponding to each column can be created.

Note that the terms reverse conversion rule storage unit, syllable normalization rule storage unit, syllable similarity table storage unit, omission determination model storage unit, analysis processing result table storage unit, and synonym candidate pair list storage unit used in the embodiment are expressions based on the functional difference in the kind of data stored, and do not mean that physically separate storage units (storage devices) are required. The present invention can also be realized by installing, on a well-known computer via a medium or a communication line, a program that implements the functions shown in the configuration diagrams of FIGS. 1, 2, and 6, or a program comprising the procedures shown in the flowcharts of FIGS. 4 and 7 to 10.

FIG. 1: Schematic block diagram showing an example of an embodiment of the synonymity determination device of the present invention
FIG. 2: Block diagram showing details of the synonym candidate pair generation unit
FIG. 3: Explanatory diagram showing an example of the analysis processing result table
FIG. 4: Flowchart of processing in the normalization processing unit
FIG. 5: Explanatory diagram showing an example of the synonym candidate pair list
FIG. 6: Block diagram showing details of the synonymity determination unit
FIG. 7: Flowchart of processing in the synonymity determination unit
FIG. 8: Flowchart of processing in the notation similarity determination unit
FIG. 9: Flowchart of processing in the reading similarity determination unit
FIG. 10: Flowchart of processing in the omission determination unit
FIG. 11: Explanatory diagram showing an example of the reverse conversion rules
FIG. 12: Explanatory diagram showing an example of the syllable normalization rules
FIG. 13: Explanatory diagram showing the procedure for creating the syllable similarity table and an example of the table
FIG. 14: Explanatory diagram showing an example of input text
FIG. 15: Explanatory diagram showing an example of the analysis processing result table at the end of analysis processing
FIG. 16: Explanatory diagram showing an example of feature extraction processing in the omission determination unit

1: synonym candidate pair generation unit, 2: synonymity determination unit, 3: reverse conversion rule storage unit, 4: syllable normalization rule storage unit, 5: syllable similarity table storage unit, 6: omission determination model storage unit, 7: analysis processing result table storage unit, 8: synonym candidate pair list storage unit, 11: analysis processing unit, 12: normalization processing unit, 13: pair generation unit, 21: notation similarity determination unit, 22: reading similarity determination unit, 23: omission determination unit.

Claims

A synonymity determination device that extracts, from text, synonym candidate expressions, which are character string expressions serving as synonym candidates, generates synonym candidate pairs, and determines whether the synonym candidate expressions in each synonym candidate pair are synonymous, the device comprising:
a reverse conversion rule storage unit that stores reverse conversion rules including at least a rule for deleting affixes that are commonly attached to nouns in use and a rule for deleting repeated expressions used in forming nicknames, each of these rules further comprising a notation rule for deleting a character string from the notation and a reading rule for correcting the reading in conjunction with the deletion of that character string;
a syllable normalization rule storage unit that stores syllable normalization rules including at least rule 1, which changes consecutive identical vowels into the vowel followed by a long vowel mark; rule 2, which, when a vowel is followed by the small-kana (yōon) form of a different vowel, changes the small kana into the normal vowel; rule 3, which deletes the vowel "イ" when the vowel matching rule 2 is "イ", the small kana is "ェ", and the vowels of the following two syllables change from one higher than “d” to one lower than “d”; rule 4, which, when applying rule 2 results in two identical vowels followed by a different vowel, deletes one of the identical vowels; and rule 5, which deletes long vowel marks and geminate consonant marks (sokuon), the rules being applied in order of rule number;
a syllable similarity table storage unit that stores a syllable similarity table whose keys are "pairs of syllables that differ in notation but are similar in reading" and whose values are "distances";
an omission determination model storage unit that stores an omission determination model, generated in advance by machine learning, for determining whether two words are in an abbreviation relationship;
synonym candidate pair generation means that performs at least morphological analysis on input text, extracts synonym candidate expressions from the text based on the analysis results and attaches the corresponding analysis results to them, applies the reverse conversion rules stored in the reverse conversion rule storage unit to the notation and reading in the analysis results of each synonym candidate expression, further applies the syllable normalization rules stored in the syllable normalization rule storage unit to the reading in the analysis results of the synonym candidate expression and to the reading obtained after applying the reverse conversion rules, and then combines the synonym candidate expressions with one another to generate synonym candidate pairs each consisting of a pair of synonym candidate expressions; and
synonymity determination means that determines the synonym candidate expressions in a synonym candidate pair to be synonymous if their notations are exactly the same; or if their readings are exactly the same; or if, when their readings have the same number of syllables, the sum of the distances between syllables that occupy the same syllable position but differ in reading, obtained using the syllable similarity table stored in the syllable similarity table storage unit, is smaller than a preset threshold; or if, when their notations or readings are in an inclusion relation, the omission determination model stored in the omission determination model storage unit indicates that they are in an abbreviation relationship; and that outputs the synonym candidate pairs determined to be synonymous as synonym pairs.

A synonymity determination method that extracts, from text, synonym candidate expressions, which are character string expressions serving as synonym candidates, generates synonym candidate pairs, and determines whether the synonym candidate expressions in each synonym candidate pair are synonymous, the method using:
a reverse conversion rule storage unit that stores reverse conversion rules including at least a rule for deleting affixes that are commonly attached to nouns in use and a rule for deleting repeated expressions used in forming nicknames, each of these rules further comprising a notation rule for deleting a character string from the notation and a reading rule for correcting the reading in conjunction with the deletion of that character string;
a syllable normalization rule storage unit that stores syllable normalization rules including at least rule 1, which changes consecutive identical vowels into the vowel followed by a long vowel mark; rule 2, which, when a vowel is followed by the small-kana (yōon) form of a different vowel, changes the small kana into the normal vowel; rule 3, which deletes the vowel "イ" when the vowel matching rule 2 is "イ", the small kana is "ェ", and the vowels of the following two syllables change from one higher than “d” to one lower than “d”; rule 4, which, when applying rule 2 results in two identical vowels followed by a different vowel, deletes one of the identical vowels; and rule 5, which deletes long vowel marks and geminate consonant marks (sokuon), the rules being applied in order of rule number;
a syllable similarity table storage unit that stores a syllable similarity table whose keys are "pairs of syllables that differ in notation but are similar in reading" and whose values are "distances"; and
an omission determination model storage unit that stores an omission determination model, generated in advance by machine learning, for determining whether two words are in an abbreviation relationship;
the method comprising:
a step in which synonym candidate pair generation means performs at least morphological analysis on input text, extracts synonym candidate expressions from the text based on the analysis results and attaches the corresponding analysis results to them, applies the reverse conversion rules stored in the reverse conversion rule storage unit to the notation and reading in the analysis results of each synonym candidate expression, further applies the syllable normalization rules stored in the syllable normalization rule storage unit to the reading in the analysis results of the synonym candidate expression and to the reading obtained after applying the reverse conversion rules, and then combines the synonym candidate expressions with one another to generate synonym candidate pairs each consisting of a pair of synonym candidate expressions; and
a step in which synonymity determination means determines the synonym candidate expressions in a synonym candidate pair to be synonymous if their notations are exactly the same; or if their readings are exactly the same; or if, when their readings have the same number of syllables, the sum of the distances between syllables that occupy the same syllable position but differ in reading, obtained using the syllable similarity table stored in the syllable similarity table storage unit, is smaller than a preset threshold; or if, when their notations or readings are in an inclusion relation, the omission determination model stored in the omission determination model storage unit indicates that they are in an abbreviation relationship; and outputs the synonym candidate pairs determined to be synonymous as synonym pairs.

A program for causing a computer to function as each means of the synonymity determination device according to claim 1.

A computer-readable recording medium on which the program according to claim 3 is recorded.