JP2013200794A

JP2013200794A - Device, method, and program for attribute extraction

Info

Publication number: JP2013200794A
Application number: JP2012069728A
Authority: JP
Inventors: Hisako Asano; 久子浅野; Kenji Hara; 謙治原; Sakura Homma; 咲来本間
Original assignee: NTT Communications Corp
Current assignee: NTT Communications Corp
Priority date: 2012-03-26
Filing date: 2012-03-26
Publication date: 2013-10-03
Anticipated expiration: 2032-03-26
Also published as: JP5690300B2

Abstract

PROBLEM TO BE SOLVED: To extract an appropriate range of attribute information from text data and to prevent erroneous extraction of attribute information. SOLUTION: An attribute extraction device 1 has: a language analysis means 11 that performs language analysis on text data including a user profile and assigns a semantic attribute to each nominal head included in the text data; a head semantic attribute extraction pattern storage means 11 that stores at least one semantic attribute pattern and a corresponding attribute classification; and a head semantic attribute extraction means 11 that, if the semantic attribute assigned to a nominal head matches any of the semantic attribute patterns stored in the head semantic attribute extraction pattern storage means, extracts the noun phrase of the phrase (bunsetsu) containing that nominal head as attribute information of the attribute classification corresponding to the matched semantic attribute pattern.

Description

The present invention relates to an attribute extraction device, an attribute extraction method, and an attribute extraction program that analyze text data and extract attribute information about a user.

Text data circulating on the Internet is commonly analyzed by morphological analysis or similar techniques to extract words and phrases (Non-Patent Document 1). The extracted words are then compared against an attribute dictionary, which lists word notations together with their corresponding attributes, and classified.

Non-Patent Document 1: "Basic Japanese Analysis Technology as a Base for Knowledge Extraction from Text", NTT Technical Journal, 2008.6, pp. 20-23

If an extracted word matches a word registered in the attribute dictionary, an attribute such as occupation can be determined. However, when the registered word is only part of an attribute expression, the whole expression cannot be retrieved as attribute information. For example, if "engineer (occupation)" is registered in the attribute dictionary and the word extracted from the text is "network engineer", only "engineer" can be extracted as the occupation.

In addition, when the structure of the sentence shows that a word does not express the attribute, the appropriate attribute may not be extracted, or an erroneous extraction may occur. For example, if "engineer (occupation)" is registered in the attribute dictionary and the text contains "インフラエンジニア勉強会主催" ("hosting an infrastructure engineer study group"), "engineer" is erroneously extracted as the occupation.

One conceivable way to extract an appropriate range of attribute information and prevent erroneous extraction is to build an attribute dictionary that registers a wide range of notations and words. However, creating and maintaining a dictionary with such broad coverage imposes a very heavy burden.

The present invention has been made in view of the above circumstances. Its object is to provide an attribute extraction device, an attribute extraction method, and an attribute extraction program that extract an appropriate range of attribute information from text data and prevent erroneous extraction of attribute information.

To achieve the above object, the present invention is an attribute extraction device comprising: language analysis means that performs language analysis on text data including a user profile and assigns a semantic attribute to each nominal head included in the text data; head semantic attribute extraction pattern storage means that stores at least one semantic attribute pattern and a corresponding attribute classification; and head semantic attribute extraction means that, when the semantic attribute assigned to a nominal head matches any of the semantic attribute patterns stored in the head semantic attribute extraction pattern storage means, extracts the noun phrase of the phrase (bunsetsu) containing that nominal head as attribute information of the attribute classification corresponding to the matched semantic attribute pattern.

The present invention is also an attribute extraction method performed by an attribute extraction device, comprising: a language analysis step of performing language analysis on text data including a user profile and assigning a semantic attribute to each nominal head included in the text data; and a head semantic attribute extraction step of, when the semantic attribute assigned to a nominal head matches any of the semantic attribute patterns stored in head semantic attribute extraction pattern storage means, extracting the noun phrase of the phrase containing that nominal head as attribute information of the attribute classification corresponding to the matched semantic attribute pattern.

The present invention is also an attribute extraction program executed by an attribute extraction device, the program causing the attribute extraction device to function as: language analysis means that performs language analysis on text data including a user profile and assigns a semantic attribute to each nominal head included in the text data; head semantic attribute extraction pattern storage means that stores at least one semantic attribute pattern and a corresponding attribute classification; and head semantic attribute extraction means that, when the semantic attribute assigned to a nominal head matches any of the semantic attribute patterns stored in the head semantic attribute extraction pattern storage means, extracts the noun phrase of the phrase containing that nominal head as attribute information of the attribute classification corresponding to the matched semantic attribute pattern.

According to the present invention, it is possible to provide an attribute extraction device, an attribute extraction method, and an attribute extraction program that extract an appropriate range of attribute information from text data and prevent erroneous extraction of attribute information.

FIG. 1 is a configuration diagram showing the configuration of an attribute extraction device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the attribute extraction unit.
FIG. 3 is a diagram showing a specific example of the processing of the attribute extraction unit.
FIG. 4 is a diagram showing an example of the head semantic attribute extraction pattern storage unit.
FIG. 5 is a diagram showing an example of the part-of-speech/notation extraction pattern storage unit.
FIG. 6 is a diagram showing an example of the named entity extraction pattern storage unit.
FIG. 7 is a diagram showing an example of attributed review data and aggregated data.
FIG. 8 is a diagram showing an example of an analysis result.

Embodiments of the present invention are described below.

FIG. 1 is a configuration diagram showing the configuration of an attribute extraction device 1 according to an embodiment of the present invention. The attribute extraction device 1 of this embodiment takes as input a large volume of text data published on a network, and analyzes and classifies the data using the user attribute information extracted from each piece of text data.

In this embodiment, the text data is input to the attribute extraction device 1 as a pair consisting of a profile text 21, in which a user profile is written, and a review text 22, such as a tweet, review, or impression entered by the user of that profile text 21. Specifically, a review text 22 acquired from the network is paired with the profile text 21 of the user who entered that review text 22, and the pair is input to the attribute extraction device 1.

Possible sources of the input text data (profile text 21 and review text 22) include, for example, Twitter, blogs, and Facebook.

The attribute extraction device 1 of this embodiment includes an attribute extraction unit 11, a review extraction unit 12, an attributed review information storage unit 13, and an analysis unit 14.

The attribute extraction unit 11 analyzes the input profile text 21 and extracts attribute information. The attribute information includes occupation, hobbies and preferences, age and age group, location, and the like.

FIG. 2 is a block diagram showing the configuration of the attribute extraction unit 11. The illustrated attribute extraction unit 11 includes a basic language analysis unit 101, a head semantic attribute extraction unit 102, a part-of-speech/notation extraction unit 103, a named entity extraction unit 104, a head semantic attribute extraction pattern storage unit 105, a part-of-speech/notation extraction pattern storage unit 106, and a named entity extraction pattern storage unit 107.

When the profile text 21 is input, the basic language analysis unit 101 performs basic language analysis on it, including morphological analysis, named entity extraction, and dependency parsing. Morphological analysis splits the input text into words and attaches information such as part of speech to each word. Named entity extraction identifies named entities such as person names, place names, and organization names in the word sequence produced by morphological analysis, and assigns a named entity type (person name, place name, organization name, etc.) to each identified entity. Dependency parsing determines which phrase modifies (depends on) which phrase.

The basic language analysis unit 101 of this embodiment also assigns a semantic attribute to each word. Semantic attributes belong to a predetermined classification system; for example, the Nihongo Goi-Taikei Japanese lexicon (Iwanami Shoten, ISBN4-00-130101-6 C3881) may be used. The nominal head referred to in the following description is the nominal (a word whose part of speech is noun, noun suffix, numeral, classifier, or the like) closest to the end of a phrase (bunsetsu).
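
The definition of a nominal head can be illustrated with a minimal Python sketch. The `Word` class, the part-of-speech labels, and the `nominal_head` function are hypothetical names chosen for illustration; they are not part of the patent, and a real system would take its tags from a morphological analyzer.

```python
from dataclasses import dataclass
from typing import List, Optional

# Parts of speech treated as nominals, following the examples in the text.
# The label strings themselves are assumptions of this sketch.
NOMINAL_POS = {"noun", "noun_suffix", "numeral", "classifier"}


@dataclass
class Word:
    surface: str                          # notation of the word
    pos: str                              # part of speech
    semantic_attr: Optional[str] = None   # semantic attribute, if assigned


def nominal_head(phrase: List[Word]) -> Optional[Word]:
    """Return the nominal closest to the end of the phrase, or None."""
    for word in reversed(phrase):
        if word.pos in NOMINAL_POS:
            return word
    return None


# The phrase "河川/エンジニア/を" from the embodiment.
phrase = [
    Word("河川", "noun"),
    Word("エンジニア", "noun", "professional/technical occupation"),
    Word("を", "particle"),
]
head = nominal_head(phrase)
```

Scanning from the end of the phrase skips the trailing particle and stops at "エンジニア", which matches the head identified in the embodiment.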

The head semantic attribute extraction unit 102 extracts attribute information from the profile text 21 using the head semantic attribute extraction pattern storage unit 105. That is, when the semantic attribute assigned to a nominal head matches any of the semantic attribute patterns stored in the head semantic attribute extraction pattern storage unit 105, the head semantic attribute extraction unit 102 extracts the noun phrase of the phrase containing that nominal head as attribute information of the attribute classification corresponding to the matched semantic attribute pattern. The head semantic attribute extraction pattern storage unit 105 stores at least one semantic attribute pattern and a corresponding attribute classification (attribute type and attribute class).

The part-of-speech/notation extraction unit 103 refers to the part-of-speech/notation extraction pattern storage unit 106 and extracts attribute information from the profile text 21 based on the part of speech and the notation of words. That is, the part-of-speech/notation extraction unit 103 extracts data, or part of the data, that matches any of the part-of-speech/notation patterns stored in the part-of-speech/notation extraction pattern storage unit 106 as attribute information of the attribute classification corresponding to the matched pattern. The part-of-speech/notation extraction pattern storage unit 106 stores at least one part-of-speech/notation pattern, which is a combination of part of speech and notation, and a corresponding attribute classification.

The named entity extraction unit 104 extracts attribute information from the profile text 21 using the named entity extraction pattern storage unit 107. That is, when the named entity type assigned to a named entity matches any of the named entity patterns stored in the named entity extraction pattern storage unit 107, the named entity extraction unit 104 extracts that named entity as attribute information of the attribute classification corresponding to the matched named entity pattern. The named entity extraction pattern storage unit 107 stores at least one named entity pattern and a corresponding attribute classification.

The review extraction unit 12 analyzes the review text 22 that is input as a pair with the profile text 21 and extracts review information. For example, for predetermined keywords, the review extraction unit 12 extracts as review information the evaluation attribute (the aspect being evaluated), the evaluation expression (an expression written in the review text 22, such as "cool" or "refreshing"), and the polarity (positive or negative).

The attributed review information storage unit 13 stores the attribute information extracted by the attribute extraction unit 11 and the review information extracted by the review extraction unit 12 in association with each other (as a pair). However, for a review text 22 from which no keyword was extracted, nothing is stored in the attributed review information storage unit 13.

The analysis unit 14 analyzes the attribute information and review information stored in the attributed review information storage unit 13. That is, the analysis unit 14 analyzes the review information extracted from the review text 22 using at least one piece of attribute information extracted from the profile text 21. The analysis unit 14 shown in FIG. 1 includes an analysis target attribute analysis unit 15, a competitive comparison attribute analysis unit 16, and an attribute-specific trend analysis unit 17.

The analysis target attribute analysis unit 15 analyzes the attribute information of users for a given analysis target keyword and displays the analysis result. For example, it analyzes and tallies the attribute information of each profile text 21 corresponding to each review text 22 that contains the analysis target keyword (for example, tallies by occupation class, by hobby or preference, or by gender), and displays each tally as a pie chart, bar chart, or the like.

The competitive comparison attribute analysis unit 16 analyzes the attribute information of users for a plurality of analysis target keywords and displays the analysis results side by side for comparison. For example, it analyzes and tallies the attribute information of each profile text 21 corresponding to each review text 22 that contains each analysis target keyword (for example, tallies by occupation class, by hobby or preference, or by gender), and displays the tallies as pie charts, bar charts, or the like.
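
The per-keyword tally described above can be sketched in Python as follows. This is only an illustration under assumed data: the record layout (keyword, occupation class) and the sample values are hypothetical, and the chart display step is omitted.

```python
from collections import Counter, defaultdict


def occupation_by_keyword(records):
    """Tally occupation classes separately for each analysis target keyword.

    `records` is an iterable of (keyword, occupation_class) pairs, one per
    review whose author's profile yielded an occupation attribute.
    """
    table = defaultdict(Counter)
    for keyword, occupation_class in records:
        table[keyword][occupation_class] += 1
    return {keyword: dict(counts) for keyword, counts in table.items()}


# Hypothetical sample data for two competing keywords.
records = [
    ("Convenience store A", "student"),
    ("Convenience store A", "student"),
    ("Convenience store A", "professional/technical occupation"),
    ("Department store B", "professional/technical occupation"),
    ("Department store B", "professional/technical occupation"),
    ("Department store B", "student"),
]
distribution = occupation_by_keyword(records)
```

Passing records for a single keyword gives the tally performed by the analysis target attribute analysis unit 15; passing several keywords gives the side-by-side data that unit 16 would chart.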

The attribute-specific trend analysis unit 17 displays the tally of analysis target keywords for a given set of user attributes. For example, it displays a ranking of analysis target keywords tallied only over users whose attributes are gender = "female" and hobby/preference = "music".
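
A minimal Python sketch of this filtered ranking is shown below. The dictionary-based record layout, the attribute keys, and the keyword "YY-02" are assumptions made for illustration only.

```python
from collections import Counter


def keyword_ranking(records, **required_attrs):
    """Rank analysis keywords over users whose attributes match all filters."""
    counts = Counter(
        record["keyword"]
        for record in records
        if all(record["attrs"].get(key) == value
               for key, value in required_attrs.items())
    )
    # most_common() returns (keyword, count) pairs in descending count order.
    return counts.most_common()


# Hypothetical attributed review records.
records = [
    {"attrs": {"gender": "female", "hobby": "music"}, "keyword": "XX-01"},
    {"attrs": {"gender": "female", "hobby": "music"}, "keyword": "XX-01"},
    {"attrs": {"gender": "female", "hobby": "music"}, "keyword": "YY-02"},
    {"attrs": {"gender": "male", "hobby": "music"}, "keyword": "YY-02"},
    {"attrs": {"gender": "female", "hobby": "sports"}, "keyword": "YY-02"},
]
ranking = keyword_ranking(records, gender="female", hobby="music")
```

Only the first three records satisfy both filters, so the ranking counts "XX-01" twice and "YY-02" once.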

The attribute extraction device 1 described above can be implemented using, for example, a general-purpose computer system that includes a CPU, a memory, an external storage device such as an HDD, an input device, and an output device. In this computer system, each function of the attribute extraction device 1 is realized by the CPU executing a program for the attribute extraction device 1 loaded into the memory. The program for the attribute extraction device 1 can be stored on a computer-readable recording medium such as a hard disk, flexible disk, CD-ROM, MO, or DVD-ROM, or distributed via a network.

Next, the processing of this embodiment is described.

FIG. 3 shows the attribute extraction processing of the attribute extraction unit 11.

In the illustrated example, the profile text 301, "河川エンジニアをやってます。外車好きです。宮沢賢治は全部読みました。" ("I work as a river engineer. I like foreign cars. I have read everything by Kenji Miyazawa."), is input to the attribute extraction unit 11.

The basic language analysis unit 101 performs basic language analysis on the input profile text 301 and outputs analysis result data 302 (S11). That is, the basic language analysis unit 101 performs morphological analysis, splits the text into words at the positions marked by "/" or "//", and assigns a part of speech (not shown) to each word. In this embodiment, a semantic attribute (a classification expressing the meaning of the word) is also assigned to every word during morphological analysis. In the illustrated example, the nominal head "エンジニア" (engineer) is assigned the semantic attribute "professional/technical occupation". The nominal head "エンジニア" is the last noun (or pronoun) in the phrase "河川/エンジニア/を" (river / engineer / object particle). Semantic attributes are assigned to the other words as well, but they are omitted here.

The basic language analysis unit 101 also performs named entity extraction, which extracts "宮沢賢治" (Kenji Miyazawa) as a person-name named entity and assigns the named entity type "(person name (named entity))" to it. In addition, the basic language analysis unit 101 performs dependency parsing, which detects the phrases formed by the words that morphological analysis separated with "/" and places "//" at each phrase boundary. Through this language analysis, the basic language analysis unit 101 outputs the analysis result data 302.
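
The delimiter convention can be illustrated with a short Python sketch that reads a delimited string back into phrases and words. The text only states that "/" separates words and "//" marks phrase boundaries; the exact layout of the analysis result data 302 appears in FIG. 3, so the input string below, including the segmentation of "やってます", is an assumption.

```python
from typing import List


def parse_analysis(text: str) -> List[List[str]]:
    """Split a delimited analysis string into phrases, each a list of words."""
    # "//" is split first so that phrase boundaries are not mistaken
    # for two adjacent word boundaries.
    return [phrase.split("/") for phrase in text.split("//") if phrase]


phrases = parse_analysis("河川/エンジニア/を//やっ/て/ます")
```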

The head semantic attribute extraction unit 102, the part-of-speech/notation extraction unit 103, and the named entity extraction unit 104 each extract attribute information from the analysis result data 302 of the profile text (S21).

The head semantic attribute extraction unit 102 extracts attribute information using the head semantic attribute extraction pattern storage unit 105. Specifically, for each nominal head, the head semantic attribute extraction unit 102 compares the semantic attribute assigned to that head with each semantic attribute pattern in the head semantic attribute extraction pattern storage unit 105. For a nominal head that matches one of the semantic attribute patterns, it extracts the noun phrase contained in the same phrase as that nominal head as attribute information. In the analysis result data 302 shown in FIG. 3, the nominal heads are "エンジニア" (engineer), "好き" (like), "賢治" (Kenji), and "全部" (all).

FIG. 4 is a diagram showing an example of the head semantic attribute extraction pattern storage unit 105. The illustrated head semantic attribute extraction pattern storage unit 105 stores attribute types and attribute classes (the attribute classification) in association with semantic attribute patterns. The head semantic attribute extraction unit 102 compares the semantic attribute assigned to each of "エンジニア", "好き", "賢治", and "全部" with the semantic attribute patterns in the head semantic attribute extraction pattern storage unit 105. Here, the semantic attribute of "エンジニア" (professional/technical occupation) matches the semantic attribute pattern (professional/technical occupation or technical occupation) in the storage unit, and the others do not match. The head semantic attribute extraction unit 102 extracts the noun phrase "河川エンジニア" (river engineer), which is contained in the same phrase as the matched "エンジニア", as attribute information (the attribute value). As the attribute classification of the extracted "河川エンジニア", it identifies the attribute type (occupation) and attribute class (professional/technical occupation) of the matched record in the head semantic attribute extraction pattern storage unit 105. The head semantic attribute extraction unit 102 then outputs the extracted attribute value ("河川エンジニア") together with the identified attribute classification (attribute type: occupation; attribute class: professional/technical occupation) as attribute information.
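
The matching step above can be sketched in Python as follows. This is a simplified illustration, not the patented implementation: the `Word` type, the pattern table layout, the part-of-speech labels, and the semantic attributes given to words other than "エンジニア" are assumptions, and the noun phrase is approximated as the run of consecutive nominals ending at the head.

```python
from typing import List, NamedTuple, Optional, Tuple


class Word(NamedTuple):
    surface: str
    pos: str
    semantic_attr: Optional[str] = None


NOMINAL_POS = {"noun", "noun_suffix", "numeral", "classifier"}

# (attribute type, attribute class, accepted semantic attributes).
# One row modeled on the FIG. 4 example described in the text.
HEAD_PATTERNS = [
    ("occupation", "professional/technical occupation",
     {"professional/technical occupation", "technical occupation"}),
]


def extract_by_head(phrase: List[Word]) -> Optional[Tuple[str, str, str]]:
    """Return (attribute type, attribute class, attribute value) or None."""
    # The nominal head is the nominal closest to the end of the phrase.
    head_idx = next(
        (i for i in range(len(phrase) - 1, -1, -1)
         if phrase[i].pos in NOMINAL_POS),
        None,
    )
    if head_idx is None:
        return None
    head = phrase[head_idx]
    for attr_type, attr_class, accepted in HEAD_PATTERNS:
        if head.semantic_attr in accepted:
            # Assumption: the noun phrase is the contiguous run of
            # nominals that ends at the head.
            start = head_idx
            while start > 0 and phrase[start - 1].pos in NOMINAL_POS:
                start -= 1
            value = "".join(w.surface for w in phrase[start:head_idx + 1])
            return (attr_type, attr_class, value)
    return None


river = [
    Word("河川", "noun", "river"),
    Word("エンジニア", "noun", "professional/technical occupation"),
    Word("を", "particle"),
]
study_group = [
    Word("インフラ", "noun", "facility"),
    Word("エンジニア", "noun", "professional/technical occupation"),
    Word("勉強会", "noun", "meeting"),
    Word("主催", "noun", "hosting"),
]
river_result = extract_by_head(river)
study_group_result = extract_by_head(study_group)
```

Because only the head is matched, the second phrase yields nothing: its head is "主催", not "エンジニア", which is how the approach avoids the erroneous extraction described in the background, while the first phrase yields the whole of "河川エンジニア" rather than just "エンジニア".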

The part-of-speech/notation extraction unit 103 extracts attribute information using the part-of-speech/notation extraction pattern storage unit 106. It extracts an expression, or part of an expression, that matches any of the part-of-speech/notation patterns in the part-of-speech/notation extraction pattern storage unit 106.

FIG. 5 is a diagram showing an example of the part-of-speech/notation extraction pattern storage unit 106. The illustrated storage unit stores attribute types, attribute classes, and part-of-speech/notation patterns in association with each other, and gives examples for extracting age, hobbies and preferences, and gender. When the analysis result data 302 contains a sequence that matches the part-of-speech/notation pattern "part of speech = noun" followed by "notation = 好き (like)", the part-of-speech/notation extraction unit 103 judges it to be hobby/preference attribute information and extracts the noun portion as attribute information. In the example shown in FIG. 3, "外車好き" (liking foreign cars) matches the pattern, so "外車" (foreign car) is extracted as attribute information (the attribute value). As the attribute classification of the extracted "外車", the part-of-speech/notation extraction unit 103 identifies the attribute type (hobby/preference) and attribute class (other) of the matched pattern in the part-of-speech/notation extraction pattern storage unit 106. It then outputs the extracted attribute value together with the identified attribute classification as attribute information.
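
The hobby/preference pattern can be sketched in Python as a scan over adjacent word pairs. The (surface, part of speech) tuple layout and the part-of-speech labels are assumptions of this sketch; only the pattern itself (a noun followed by the notation "好き") comes from the text.

```python
from typing import List, Tuple


def extract_hobbies(words: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Find a noun immediately followed by the notation "好き".

    `words` is a list of (surface, part_of_speech) pairs. Each match is
    returned as (attribute type, attribute class, attribute value).
    """
    results = []
    for previous, current in zip(words, words[1:]):
        if previous[1] == "noun" and current[0] == "好き":
            results.append(("hobby/preference", "other", previous[0]))
    return results


hobbies = extract_hobbies([("外車", "noun"), ("好き", "noun"), ("です", "auxiliary")])
```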

Likewise, when the analysis result data 302 contains a sequence that matches the part-of-speech/notation pattern "part of speech = numeral" followed by "part of speech = classifier, notation = 才 or 歳" (both meaning "years old"), the part-of-speech/notation extraction unit 103 judges it to be age attribute information and extracts the numeral (for example, 24) as attribute information (the attribute value). As the attribute classification of the extracted attribute value, it identifies the attribute type (age) and attribute class (10s / 20s / 30s / ...) of the matched pattern in the part-of-speech/notation extraction pattern storage unit 106. The attribute class is set to the age group that corresponds to the attribute value (for example, 20s for 24). The part-of-speech/notation extraction unit 103 then outputs the extracted attribute value together with the identified attribute classification as attribute information. Gender attribute information is extracted in the same way using the part-of-speech/notation extraction pattern storage unit 106.
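
The age pattern and the mapping from age to age-group class can be sketched in Python as follows. The word layout and labels are assumptions of this sketch, and deriving the class by integer division is one straightforward way to obtain "20s" from 24; the text does not say how the device computes it.

```python
from typing import List, Optional, Tuple


def extract_age(words: List[Tuple[str, str]]) -> Optional[Tuple[str, str, int]]:
    """Find a numeral followed by the classifier 才 or 歳.

    `words` is a list of (surface, part_of_speech) pairs. Returns
    (attribute type, attribute class, attribute value) or None.
    """
    for previous, current in zip(words, words[1:]):
        is_age_classifier = current[1] == "classifier" and current[0] in ("才", "歳")
        if previous[1] == "numeral" and is_age_classifier:
            age = int(previous[0])
            age_group = f"{age // 10 * 10}s"  # e.g. 24 -> "20s"
            return ("age", age_group, age)
    return None


age_info = extract_age([("24", "numeral"), ("歳", "classifier"), ("です", "auxiliary")])
```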

The named entity extraction unit 104 extracts attribute information using the named entity extraction pattern storage unit 107. Specifically, for each named entity extracted in the analysis result data 302, the named entity extraction unit 104 compares the named entity type of that entity with the named entity patterns in the named entity extraction pattern storage unit 107 and, if they match, extracts the named entity as attribute information.

In the example shown in FIG. 6, the named entity in the analysis result data 302 is "宮沢賢治" (Kenji Miyazawa), and its type is "person name". The named entity extraction unit 104 therefore extracts "宮沢賢治" as attribute information (the attribute value) and, as its attribute classification, identifies the attribute type (hobby/preference) and attribute class (person name) of the matched named entity pattern in the named entity extraction pattern storage unit 107. It then outputs the extracted attribute value together with the identified attribute classification as attribute information. Attribute information with attribute type "hobby/preference" and attribute class "artifact" (product names, book titles, etc.), and attribute information with attribute type "location" and attribute class "location", are extracted in the same way using the named entity extraction pattern storage unit 107.
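
The named entity lookup can be sketched in Python as a table from named entity type to attribute classification. The three rows follow the classifications named in the text; the type label strings, the tuple layout, and the absence of a row for organization names are assumptions of this sketch.

```python
from typing import List, Tuple

# Named entity type -> (attribute type, attribute class).
NE_PATTERNS = {
    "person": ("hobby/preference", "person name"),
    "artifact": ("hobby/preference", "artifact"),
    "location": ("location", "location"),
}


def extract_from_entities(entities: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Map (text, entity_type) pairs to (attribute type, class, value).

    Entities whose type has no registered pattern are skipped.
    """
    return [
        (*NE_PATTERNS[entity_type], text)
        for text, entity_type in entities
        if entity_type in NE_PATTERNS
    ]


entity_attrs = extract_from_entities([("宮沢賢治", "person"), ("NTT", "organization")])
```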

FIG. 7(a) is a diagram showing an example of the attributed review data stored in the attributed review information storage unit 13. In the illustrated attributed review data, the basic review data (review ID, posting date and time, review text), the basic profile data (user ID, profile text), the review extraction results (analysis keyword, evaluation attribute / evaluation expression / polarity), and the profile extraction results (attribute type / attribute class / attribute value of each extracted item of attribute information) are stored in association with one another. The review extraction results correspond to the review information shown in FIG. 1, and the profile extraction results correspond to the attribute information shown in FIG. 1.
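One way to picture a single record of the attributed review data is the sketch below. The field and class names are assumptions chosen for illustration, and the ID, timestamp, and profile values in the sample record are invented; only the grouping into four parts follows the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReviewExtraction:
    analysis_keyword: str
    evaluation_attribute: Optional[str]  # None when no evaluation attribute is found
    evaluation_expression: str
    polarity: str


@dataclass
class ProfileAttribute:
    attribute_type: str
    attribute_class: str
    value: str


@dataclass
class AttributedReview:
    # Basic review data
    review_id: str
    posted_at: str
    review_text: str
    # Basic profile data
    user_id: str
    profile_text: str
    # Extraction results stored in association with the basic data
    review_results: List[ReviewExtraction] = field(default_factory=list)
    profile_results: List[ProfileAttribute] = field(default_factory=list)


# Sample record; all concrete values here are hypothetical.
record = AttributedReview(
    review_id="r001",
    posted_at="2012-03-26T12:00:00",
    review_text="新しく出たXX-01ってかっこいいよね！",
    user_id="u001",
    profile_text="(free-format profile text)",
    review_results=[ReviewExtraction("XX-01", None, "cool", "positive")],
    profile_results=[ProfileAttribute("hobby/preference", "person name", "Kenji Miyazawa")],
)
```

Keeping both result lists on the same record is what lets a later analysis step join review polarity with profile attributes without a separate lookup.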

When the illustrated review data "新しく出たXX-01ってかっこいいよね！" ("The newly released XX-01 is cool, isn't it!") is input, the review extraction unit 12 performs language analysis on the data and extracts the evaluation attribute (none), the evaluation expression ("cool"), and the polarity (positive) as the evaluation/review of the analysis keyword (XX-01).

FIG. 7(b) is an example of data obtained when the analysis unit 14 analyzes and aggregates the attributed review data of FIG. 7(a). The analysis unit 14 uses such aggregated data to perform the desired analysis and outputs the analysis result.

FIG. 8 shows an example of a competitive comparison attribute analysis result produced by the competitive comparison attribute analysis unit 16. In the illustrated example, the attribute information (occupation) of each profile text 21 corresponding to each review text 22 that contains one of several analysis target keywords (the names of convenience stores and department stores) is analyzed and aggregated, and the result is displayed as a bar graph for each analysis keyword. The graph shows, for example, that the distribution of occupations differs between department stores and convenience stores. Specifically, it shows that students are the main group for convenience stores and that professional and technical workers are the main group for department stores (although department store A shows a convenience-store-like tendency). Such comparative analysis makes it easy to measure the effect of events and campaigns, estimate target segments, and so on.
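The per-keyword occupation distribution behind a chart like FIG. 8 could be computed along the following lines. This is a minimal sketch: the function name and the input shape (pairs of analysis keyword and occupation, one per review) are assumptions, not something the text specifies.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def occupation_distribution(
    records: Iterable[Tuple[str, str]],
) -> Dict[str, Dict[str, float]]:
    """Return, for each analysis keyword, the share of each occupation.

    `records` holds one (analysis keyword, occupation) pair per review,
    where the occupation comes from the reviewer's profile attributes.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for keyword, occupation in records:
        counts[keyword][occupation] += 1

    distribution = {}
    for keyword, counter in counts.items():
        total = sum(counter.values())
        distribution[keyword] = {occ: n / total for occ, n in counter.items()}
    return distribution
```

Normalizing to shares rather than raw counts makes keywords with very different review volumes comparable, which is what a side-by-side bar graph per keyword needs.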

Although FIG. 8 is an example of a competitive comparison attribute analysis result, each individual bar graph in FIG. 8 is an example of the result for a single analysis target as analyzed by the analysis target attribute analysis unit 15. In addition, the analysis result of the attribute-specific trend analysis unit 17 (not shown) makes it possible to analyze and aggregate what (which analysis target keywords) is currently trending for specified attribute information (and other conditions), and this can be put to use in new product development and the like. For example, it is possible to analyze a ranking of the keywords that "students" post most often late at night, or a ranking of the keywords posted by "women" living in "Tokyo" who are "music lovers".
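A keyword ranking restricted to posters with given attributes might look like the sketch below. The function name, the dictionary keys, and the representation of attributes as (attribute type, value) pairs are all assumptions made for illustration; further conditions such as time of day would be one more filter of the same kind.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Set, Tuple

Attribute = Tuple[str, str]  # (attribute type, attribute value), hypothetical shape


def keyword_ranking(
    posts: Iterable[Mapping],
    required_attributes: Set[Attribute],
    top_n: int = 3,
) -> List[Tuple[str, int]]:
    """Rank keywords among posts whose author has every required attribute."""
    counter: Counter = Counter()
    for post in posts:
        # Subset test: the poster must carry all of the specified attributes.
        if required_attributes <= set(post["attributes"]):
            counter.update(post["keywords"])
    return counter.most_common(top_n)
```

For the second example in the text, `required_attributes` would hold three pairs, one each for location "Tokyo", sex "female", and hobby/preference "music".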

In the embodiment described above, attribute information such as occupation and hobbies/preferences can be extracted with an appropriate span from profile text written in free format (user profiles on Twitter, blogs, and the like), and erroneous extraction of attribute information can be prevented. Specifically, even attribute information (attribute values) that is not registered in the head semantic attribute extraction pattern storage unit 105, the part-of-speech/notation extraction pattern storage unit 106, or the named entity extraction pattern storage unit 107 can be extracted appropriately. Moreover, erroneous extraction can be prevented by assigning an appropriate attribute classification to the extracted attribute information (attribute value).

In this embodiment, attribute information can also be extracted automatically, across a wide range of expressions, from profile text written in free format, so the target information can be collected without conducting a questionnaire to survey hobbies and preferences. Furthermore, the desired attribute information can be extracted even when the profile text (a user profile field or the like) has no specific entry for each category such as occupation or hobbies/preferences.

In this embodiment, the review extraction unit 12 also analyzes the review text 22 and extracts the evaluation attribute, evaluation expression, and polarity as the evaluation/review of the analysis keyword. Analyzing the extracted review information using the attribute information enables more fine-grained analysis, such as positive/negative analysis.

The present invention is not limited to the above embodiment, and various modifications are possible within the scope of its gist.

DESCRIPTION OF SYMBOLS
1: Attribute extraction device
11: Attribute extraction unit
101: Basic language analysis unit
102: Head semantic attribute extraction unit
103: Part-of-speech/notation extraction unit
104: Named entity extraction unit
105: Head semantic attribute extraction pattern storage unit
106: Part-of-speech/notation extraction pattern storage unit
107: Named entity extraction pattern storage unit
12: Review extraction unit
13: Attributed review information storage unit
14: Analysis unit
15: Analysis target attribute analysis unit
16: Competitive comparison attribute analysis unit
17: Attribute-specific trend analysis unit
21: Profile text
22: Review text

Claims

An attribute extraction device comprising:
language analysis means for performing language analysis on text data including a user profile and assigning a semantic attribute to an indeclinable head included in the text data;
head semantic attribute extraction pattern storage means storing at least one semantic attribute pattern and a corresponding attribute classification; and
head semantic attribute extraction means for extracting, when the semantic attribute assigned to the indeclinable head matches any semantic attribute pattern stored in the head semantic attribute extraction pattern storage means, the noun phrase of the bunsetsu (phrase unit) containing the indeclinable head as attribute information of the attribute classification corresponding to the matched semantic attribute pattern.

The attribute extraction device according to claim 1,
wherein the language analysis means divides the text data into words and assigns a part of speech to each word,
the device further comprising:
part-of-speech/notation extraction means for extracting attribute information from the text data based on the parts of speech and the notations of the words; and
part-of-speech/notation extraction pattern storage means storing at least one part-of-speech/notation pattern, which is a combination of a part of speech and a notation, and a corresponding attribute classification,
wherein the part-of-speech/notation extraction means extracts data that matches any part-of-speech/notation pattern stored in the part-of-speech/notation extraction pattern storage means, or a part of that data, as attribute information of the attribute classification corresponding to the matched part-of-speech/notation pattern.

The attribute extraction device according to claim 1,
wherein the language analysis means identifies named entity data included in the text data and assigns a named entity type to the named entity data,
the device further comprising:
named entity extraction pattern storage means storing at least one named entity type pattern and a corresponding attribute classification; and
named entity extraction means for extracting, when the named entity type assigned to the named entity data matches any named entity type pattern stored in the named entity extraction pattern storage means, the named entity data as attribute information of the attribute classification corresponding to the matched named entity type pattern.

The attribute extraction device according to any one of claims 1 to 3,
wherein the attribute information includes at least one of occupation, age, sex, and location.

The attribute extraction device according to claim 1,
wherein the head semantic attribute extraction means extracts attribute information on occupation.

The attribute extraction device according to claim 3,
wherein the part-of-speech/notation extraction means extracts attribute information on at least one of age and hobby/preference.

The attribute extraction device according to any one of claims 1 to 6,
wherein the text data is data published on a network and is associated with input text data that was input by the user of the user profile included in the text data and published on the network.

The attribute extraction device according to claim 7,
further comprising analysis means for analyzing the input text data using at least one item of attribute information extracted from the text data.

An attribute extraction method performed by an attribute extraction device, the method comprising:
a language analysis step of performing language analysis on text data including a user profile and assigning a semantic attribute to an indeclinable head included in the text data; and
a head semantic attribute extraction step of extracting, when the semantic attribute assigned to the indeclinable head matches any semantic attribute pattern stored in head semantic attribute extraction pattern storage means, the noun phrase of the bunsetsu (phrase unit) containing the indeclinable head as attribute information of the attribute classification corresponding to the matched semantic attribute pattern.

An attribute extraction program executed by an attribute extraction device, the program causing the attribute extraction device to function as:
language analysis means for performing language analysis on text data including a user profile and assigning a semantic attribute to an indeclinable head included in the text data;
head semantic attribute extraction pattern storage means storing at least one semantic attribute pattern and a corresponding attribute classification; and
head semantic attribute extraction means for extracting, when the semantic attribute assigned to the indeclinable head matches any semantic attribute pattern stored in the head semantic attribute extraction pattern storage means, the noun phrase of the bunsetsu (phrase unit) containing the indeclinable head as attribute information of the attribute classification corresponding to the matched semantic attribute pattern.