JP2013242676A

JP2013242676A - User attribute estimation device, user attribute estimation method and program

Info

Publication number: JP2013242676A
Application number: JP2012115106A
Authority: JP
Inventors: Yuki Kurauchi; 雄貴蔵内; Takeshi Kurashima; 健倉島
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-05-18
Filing date: 2012-05-18
Publication date: 2013-12-05
Anticipated expiration: 2032-05-18
Also published as: JP5791565B2

Abstract

PROBLEM TO BE SOLVED: To estimate an unknown user attribute with high accuracy by restricting the propagation of user attributes to those that the receiving user (the user to whom attributes are propagated) has. SOLUTION: A user attribute estimation device comprises: means for extracting a set of users having a specific attribute from a set of users with known attributes stored in user attribute storage means; means for extracting, on the basis of the set of users having the specific attribute and the conversation logs stored in conversation log storage means, feature quantities that appear characteristically in that set of users, calculating an attribute probability, which is the probability that each feature quantity belongs to each user attribute, and storing the attribute probability in attribute probability storage means; means for calculating the intimacy between an input user and each neighboring user on the basis of the conversation logs stored in the conversation log storage means; and means for calculating a propagation probability with which each attribute of a neighboring user is propagated to the input user, on the basis of the intimacy and the attribute probability of each feature quantity stored in the attribute probability storage means.

Description

The present invention relates to a technique for estimating user attributes, and more particularly to a technique for estimating the attributes of a user who does not disclose them on a social network.

As a first conventional technique for estimating user attributes, there is a user attribute estimation device that learns the relationship between geographic attributes and user attributes and uses it to estimate an unknown user attribute from a geographic attribute (see, for example, Patent Document 1).

As a second conventional technique, there is a user attribute estimation device that defines closeness on a social network and estimates an unknown user attribute by propagating attributes from close users (see, for example, Non-Patent Document 1).

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2011-238169 (JP2011-238169A)

Non-Patent Document 1: Alan Mislove, Bimal Viswanath, Krishna P. Gummadi, Peter Druschel, "You are who you know: inferring user profiles in online social networks". In Proceedings of the third ACM international conference on Web search and data mining, pages 251-260, 2010.

Using a method similar to the first conventional technique, user attributes can be estimated from feature quantities such as the words a user uses, geographic information, and time information. However, user attributes cannot be estimated for inactive users who have no such information.

The second conventional technique addresses this problem by using and propagating the attributes of users who are close on the social network, which makes it possible to estimate user attributes even for inactive users. However, because all attributes are propagated from close users, attributes that the receiving user does not have are propagated as well.

The present invention has been made in view of the above, and its object is to provide a technique that estimates unknown user attributes with high accuracy by restricting propagation to the attributes that the receiving user has.

To solve the above problem, the present invention is configured as a user attribute estimation device for estimating a user attribute of an input user whose user attribute is unknown, on the basis of conversation logs between users stored in conversation log storage means and an attribute set of a set of users whose user attributes are known, stored in user attribute storage means, the device comprising:
user set extraction means for extracting a set of users having a specific attribute from the set of users stored in the user attribute storage means;
attribute probability calculation means for extracting, on the basis of the set of users having the specific attribute and the conversation logs stored in the conversation log storage means, feature quantities that appear characteristically in the set of users having the specific attribute, calculating an attribute probability, which is the probability that each feature quantity belongs to each user attribute, and storing it in attribute probability storage means;
intimacy calculation means for calculating the intimacy between the input user and each neighboring user on the basis of the conversation logs stored in the conversation log storage means; and
propagation probability calculation means for calculating a propagation probability with which each attribute of a neighboring user is propagated to the input user, on the basis of the intimacy and the attribute probability of each feature quantity stored in the attribute probability storage means.

The user attribute estimation device may further comprise output means that, on the basis of the result calculated by the propagation probability calculation means, outputs the attribute value with the highest propagation probability among the plurality of attribute values belonging to an attribute name as the input user's attribute value for that attribute name.

The propagation probability calculation means calculates the propagation probability by, for example, multiplying the probability that a feature quantity is included in the conversation between the input user and each neighboring user by the attribute probability of that feature quantity, and then multiplying the resulting probability by the intimacy.

The present invention can also be configured as a user attribute estimation method executed by the user attribute estimation device. The present invention can further be configured as a program for causing a computer to function as the user set extraction means, the attribute probability calculation means, the intimacy calculation means, and the propagation probability calculation means of the user attribute estimation device.

According to the present invention, when the attributes of the propagating user are propagated on the basis of the conversation logs and the user attributes of the learning user set, propagation can be restricted to the attributes that the receiving user has. This solves the problem of simultaneously propagating attributes that the receiving user does not have, and makes it possible to estimate unknown user attributes with high accuracy.

FIG. 1 is a block diagram of the user attribute estimation device 100 according to an embodiment of the present invention.
FIG. 2 is a flowchart of the operation of the learning unit 110 according to the embodiment of the present invention.
FIG. 3 is a flowchart of the operation of the inference unit 120 according to the embodiment of the present invention.
FIG. 4 is an example of the information stored in the user attribute storage unit 10 according to the embodiment of the present invention.
FIG. 5 is an example of the information stored in the conversation log storage unit 20 according to the embodiment of the present invention.
FIG. 6 is an example of the information stored in the attribute probability storage unit 50 according to the embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings. The embodiment described below is only an example, and the embodiments to which the present invention applies are not limited to it.

(Outline of the embodiment)
First, an outline of the present embodiment is given. The technique according to the present embodiment rests on two assumptions: that people who interact on a social network share some of their attributes rather than all of them, and that what two users talk about on a social network tends to reflect what the two have in common. By analyzing the content of the conversation logs, the technique estimates which of the propagating user's attributes are shared with the receiving user, and propagates only those shared attributes.

(Configuration of user attribute estimation device 100)
FIG. 1 is a block diagram of the user attribute estimation device 100, which calculates the probability that a user has a given attribute on the basis of the conversation logs and the user attributes of the learning user set.

As shown in FIG. 1, the user attribute estimation device 100 includes a user attribute storage unit 10, a conversation log storage unit 20, a user set extraction unit 30, an attribute probability calculation unit 40, an attribute probability storage unit 50, an input unit 60, an intimacy calculation unit 70, a propagation probability calculation unit 80, and an output unit 90.

Of the above functional units, the user attribute storage unit 10, the user set extraction unit 30, the conversation log storage unit 20, the attribute probability calculation unit 40, and the attribute probability storage unit 50 perform the learning process described later and constitute the learning unit 110. The input unit 60, the conversation log storage unit 20, the intimacy calculation unit 70, the user attribute storage unit 10, the attribute probability storage unit 50, the propagation probability calculation unit 80, and the output unit 90 perform the inference process described later and constitute the inference unit 120.

The user attribute estimation device 100 according to the present embodiment can be realized by causing a computer equipped with storage devices (memory, hard disk, etc.) serving as the storage units to execute a program corresponding to the processing of the user set extraction unit 30, the attribute probability calculation unit 40, the input unit 60, the intimacy calculation unit 70, the propagation probability calculation unit 80, and the output unit 90. The program may be stored on and distributed via a storage medium such as a portable memory and installed on the computer, or may be downloaded from a server on a network and installed on the computer. Any or all of the user attribute storage unit 10, the conversation log storage unit 20, and the attribute probability storage unit 50 may be provided not in the computer that performs the learning and inference processing but in an external device that the computer can access via a network.

(Processing of learning unit 110)
The learning unit 110 is described in detail below. On the basis of the user attributes stored in the user attribute storage unit 10 and the conversation logs stored in the conversation log storage unit 20, the learning unit 110 calculates the attribute probability, which is the probability that each feature quantity contained in the conversation logs belongs to each user attribute, and stores it in the attribute probability storage unit 50.

The configuration and operation of each part of the learning unit 110 are described in more detail below. FIG. 2 is a flowchart showing the operation of the learning unit 110; in the following description, the corresponding step numbers of FIG. 2 are indicated where appropriate.

<User attribute storage unit 10>
FIG. 4 shows an example of the data stored in the user attribute storage unit 10.

The user attribute storage unit 10 stores the attribute set A of the user set U. Each attribute in A is a pair of an attribute name a_i and an attribute value a_ij. Attribute names include gender, age group, place of residence, hometown, native language, languages spoken, occupation, workplace, educational background (universities or high schools attended or currently attending, etc.), names of groups the user belongs to, religion, supported political party, sports played, preferences (favorite food, music, books, etc.), hobbies, and so on. For example, FIG. 4 shows that, for the user whose ID is IDa, the attribute named "occupation" has the attribute value "student".
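As a minimal sketch of how such name/value pairs might be held in memory, the following Python fragment uses a nested dictionary. Only the entry for IDa with occupation "student" comes from the description of FIG. 4; the user IDb, its values, and the helper `get_attribute` are hypothetical and serve only as illustration.

```python
# Hypothetical in-memory stand-in for the user attribute storage unit 10.
# Keys are user IDs; each value maps an attribute name a_i to an attribute value a_ij.
user_attributes = {
    "IDa": {"occupation": "student"},  # example described for FIG. 4
    "IDb": {"occupation": "company employee", "residence": "Tokyo"},  # illustrative only
}


def get_attribute(store, user_id, attribute_name):
    """Return the stored attribute value, or None if the user has not disclosed it."""
    return store.get(user_id, {}).get(attribute_name)
```

An undisclosed attribute is represented here simply by the absence of the key, which is what makes a user a candidate for estimation.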

Alternatively, the user attribute storage unit 10 may store a profile set P of the user set U. A profile includes a self-introduction, introductions written by friends, and the like. In this case, as preprocessing, each profile is morphologically analyzed and the attribute set A is extracted by, for example, matching whether the analysis result contains an attribute name.
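The preprocessing can be sketched roughly as follows. This is not the morphological analysis the text refers to: as a simplifying assumption, plain case-insensitive substring matching stands in for it, and the vocabulary `ATTRIBUTE_VALUES` and the function `extract_attributes` are hypothetical names introduced for illustration.

```python
# Hypothetical vocabulary: attribute name -> attribute values to look for in a profile.
ATTRIBUTE_VALUES = {
    "occupation": ["student", "company employee"],
    "residence": ["Tokyo", "Kanagawa"],
}


def extract_attributes(profile_text, vocabulary=ATTRIBUTE_VALUES):
    """Return {attribute name: value} for values that literally occur in the profile.

    Substring matching is an assumption standing in for morphological analysis.
    """
    text = profile_text.lower()
    found = {}
    for name, values in vocabulary.items():
        for value in values:
            if value.lower() in text:
                found[name] = value
                break  # keep the first matching value per attribute name
    return found
```

For Japanese profiles a real morphological analyzer would be needed, since words are not separated by spaces and substring matches can cross word boundaries.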

The user attribute storage unit 10 may be anything in which this information can be stored and from which it can be retrieved, and is not limited to a specific form. Examples include a database, a specific area of a general-purpose storage device (memory or hard disk device) provided in advance, a Web server holding Web pages, and a database server equipped with a database.

<Conversation log storage unit 20>
FIG. 5 shows an example of the data stored in the conversation log storage unit 20.

The conversation log storage unit 20 stores the conversation log set L of the user set U. A conversation log contains a post ID, which identifies the post itself; a reply-destination post ID, which identifies the post to which the post replies; a source user ID, which identifies the user who posted; a destination user ID, which identifies the user to whom the post is addressed; and the post content. A conversation log may also contain information such as the posting time, the posting location, supplementary information on the content such as hyperlinks, and friend information. The reply-destination post ID and the destination user ID may each hold one value, several values, or no value; that is, posts that are not replies and posts that have received no reply are also included in the conversation logs. The post content may be any of text, the morphological analysis result of text, images, video, "empathy information", "rating information", and the like. "Empathy information" is information expressing empathy, such as the "Like!" button on Facebook (registered trademark). "Rating information" is information indicating a score given to posted content, such as the rating scores on Tabelog (registered trademark). The reply-destination post ID, the source user ID, the destination user ID, and the like may be embedded in the post content in a fixed format. The posting location may be latitude and longitude or a place name. Friend information is friend registration information such as a follow on Twitter (registered trademark), and may be registered from one side only or from both sides.
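One possible record layout for a conversation log entry is sketched below. The field names are assumptions chosen to mirror the items listed above; they are not taken from FIG. 5.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Post:
    """Hypothetical record for one entry of the conversation log set L."""
    post_id: str                       # identifies the post itself
    source_user_id: str                # user who posted
    content: str                       # text (or an already analyzed form of it)
    reply_to_post_ids: List[str] = field(default_factory=list)     # may be empty
    destination_user_ids: List[str] = field(default_factory=list)  # may be empty
    posted_at: Optional[str] = None    # optional posting time
    location: Optional[str] = None     # optional place name or "lat,lon"


def is_reply(post: Post) -> bool:
    """A post is a reply when it names at least one reply-destination post."""
    return bool(post.reply_to_post_ids)
```

Using lists for the reply-destination and destination fields covers the "one, several, or no value" cases with a single type.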

The conversation log storage unit 20 may be anything in which this information can be stored and from which it can be retrieved, and is not limited to a specific form. Examples include a database, a specific area of a general-purpose storage device (memory or hard disk device) provided in advance, a Web server holding Web pages, and a database server equipped with a database.

<User set extraction unit 30>
The user set extraction unit 30 receives the attribute set A of the user set U from the user attribute storage unit 10 as input. For each attribute name a_i, it extracts the same-attribute user set U_ij of users sharing the attribute value a_ij, and outputs it to the attribute probability calculation unit 40 (step 21 in FIG. 2).
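Step 21 amounts to grouping users by (attribute name, attribute value). A minimal sketch, assuming the attributes are available as a nested dictionary (the function name and data layout are hypothetical):

```python
from collections import defaultdict


def extract_same_attribute_sets(user_attributes):
    """Group user IDs by (attribute name a_i, attribute value a_ij).

    user_attributes: {user_id: {attribute_name: attribute_value}}
    Returns {(attribute_name, attribute_value): set of user IDs}, i.e. the sets U_ij.
    """
    groups = defaultdict(set)
    for user_id, attrs in user_attributes.items():
        for name, value in attrs.items():
            groups[(name, value)].add(user_id)
    return dict(groups)
```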

In addition, as in the first conventional technique, the same-attribute user set U_ij may be expanded by learning the relationship between the feature quantities contained in the conversation log storage unit 20 and each attribute, using the post feature quantities f' and attribute values a_ij of the same-attribute user set U_ij as training data, and then using the learned relationship to estimate unknown user attributes (steps 22 and 23 in FIG. 2). The feature quantities used here may include the words each user uses in posts, geographic information, time information, posting frequency, time of day of posting, and so on.

<Attribute probability calculation unit 40>
The attribute probability calculation unit 40 receives as input the same-attribute user set U_ij from the user set extraction unit 30 and the conversation log set L_Uij of the same-attribute user set U_ij from the conversation log storage unit 20. For each feature quantity f contained in the given feature quantity set F, it calculates the attribute probability P(a_ij|f), which is the probability that f belongs to each user attribute, and stores it in the attribute probability storage unit 50 (step 24 in FIG. 2). The feature quantities f include the words each pair of users uses in conversation, geographic information, time information, conversation frequency, number of characters, the number of consecutive replies, the time of day of replies, and so on. The attribute probability P(a_ij|f) can be calculated by, for example, the following method.

[Equation not reproduced.]

Here, C(f, a_ij) is the number of users who have both the feature quantity f and the attribute value a_ij.
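The equation for P(a_ij|f) does not survive in this text. One reading that agrees with the worked example later in the description (500 of the 5000 users who use a word are students, giving 0.1) is to divide C(f, a_ij) by the number of users who have the feature quantity f. The sketch below implements that reading; the denominator, the function name, and the data layout are assumptions.

```python
def attribute_probability(user_features, user_attribute_values, feature, attribute_value):
    """Estimate P(a_ij | f) as C(f, a_ij) / (number of users having f).

    user_features:         {user_id: set of feature quantities observed for the user}
    user_attribute_values: {user_id: set of attribute values the user is known to have}
    The choice of denominator is an assumption, not taken from the source equation.
    """
    users_with_feature = [u for u, feats in user_features.items() if feature in feats]
    if not users_with_feature:
        return 0.0
    both = sum(
        1 for u in users_with_feature
        if attribute_value in user_attribute_values.get(u, set())
    )
    return both / len(users_with_feature)
```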

This processing may be applied to only part of the feature quantity set F rather than all of it. By applying a statistical test or similar processing to the entire feature quantity set F, the feature quantities whose frequency of appearance is skewed, and which therefore appear characteristically in the same-attribute user set U_ij, can be extracted. As one example of a test, the χ² statistic is given by:

χ² = Σ (O − E)² / E

where O is the observed frequency and E is the expected (theoretical) frequency derived from the null hypothesis.
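The χ² statistic can be computed directly from paired observed and expected frequencies. The sketch below is a plain implementation of Pearson's statistic; how the expected frequencies are derived and which significance threshold is applied are left open, as in the description.

```python
def chi_squared(observed, expected):
    """Pearson's chi-squared statistic: sum over cells of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Illustrative numbers only: a word seen in 30 posts inside a user set and 70 outside,
# against an expectation of 50 and 50 under the null hypothesis.
statistic = chi_squared([30, 70], [50, 50])
```

A feature quantity would be kept as characteristic when the statistic exceeds the critical value for the chosen significance level.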

<Attribute probability storage unit 50>
FIG. 6 shows an example of the data stored in the attribute probability storage unit 50. The attribute probability storage unit 50 stores the attribute probabilities input from the attribute probability calculation unit 40. For example, FIG. 6 shows that the probability that the feature quantity "final exam" belongs to the user attribute "student" is 0.7.

The attribute probability storage unit 50 may be anything in which this information can be stored and from which it can be retrieved, and is not limited to a specific form. Examples include a database, a specific area of a general-purpose storage device (memory or hard disk device) provided in advance, a Web server holding Web pages, and a database server equipped with a database.

(Processing of inference unit 120)
The inference unit 120 is described in detail below. On the basis of the user attributes stored in the user attribute storage unit 10, the conversation logs stored in the conversation log storage unit 20, and the attribute probabilities stored in the attribute probability storage unit 50, the inference unit 120 estimates unknown user attributes by propagating user attributes, and outputs them.

The configuration and operation of each part of the inference unit 120 are described in more detail below. FIG. 3 is a flowchart showing the operation of the inference unit 120; in the following description, the corresponding step numbers of FIG. 3 are indicated where appropriate.

<Input unit 60>
The input unit 60 receives information on the user to be predicted as input, converts it into a user ID, and outputs the user ID to the intimacy calculation unit 70 (step 31 in FIG. 3).

<Intimacy calculation unit 70>
The intimacy calculation unit 70 receives as input the user ID from the input unit 60 and the conversation log set L from the conversation log storage unit 20. It first extracts the neighboring user set S_u of the input user u in the social network (step 32 in FIG. 3). It then calculates the intimacy w_uu' between the input user u and each neighboring user u' and outputs it to the propagation probability calculation unit 80 (step 33 in FIG. 3).

The neighboring user set S_u of the input user u in the social network may consist of the users with whom u has conversed in the conversation logs, or of the users having a friend relationship with u, extracted from the friend information in the conversation logs. The intimacy w_uu' between two users can be calculated using feature quantities such as conversation frequency, number of characters, the number of consecutive replies, the time taken until a reply is made, and the number of mutual friends. For example, the second conventional technique obtains it by the following equation, but a method that adds all or some of the above feature quantities to the equation may also be used. The values w_uu' are normalized so that they sum to 1 over the neighboring user set S_u.

[Equation not reproduced.]

Here, dist is the following function.

[Equation not reproduced.]

Here, K is the set of paths that pass through user u, user u', and their common neighboring users. strength is the following function.

[Equation not reproduced.]

Here, X'_ij is obtained as follows.

[Equation not reproduced.]

Here, X_ij is the number of times user i and user j have conversed in the past.
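Because the dist and strength functions are not available in this text, the following sketch does not attempt to reproduce them. It illustrates only the two steps that are stated in prose: taking as neighbors the users who have conversed with u, and normalizing the intimacy so that it sums to 1 over S_u. Using the raw conversation count as the unnormalized score is an assumption made purely for illustration.

```python
from collections import Counter


def normalized_intimacy(conversations, user):
    """Return {neighbor: w_uu'} with the weights summing to 1 over the neighbor set.

    conversations: iterable of (user_a, user_b) pairs, one per exchange.
    The unnormalized score is the number of exchanges with `user` (an assumption).
    """
    counts = Counter()
    for a, b in conversations:
        if a == user and b != user:
            counts[b] += 1
        elif b == user and a != user:
            counts[a] += 1
    total = sum(counts.values())
    if total == 0:
        return {}  # no neighbors: nothing can be propagated
    return {neighbor: c / total for neighbor, c in counts.items()}
```

Any other non-negative score (reply latency, number of mutual friends, and so on) could replace the count before the same normalization is applied.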

<Propagation probability calculation unit 80>
The propagation probability calculation unit 80 receives as input the intimacy w_uu' from the intimacy calculation unit 70, the attribute set A_Su of the neighboring user set S_u from the user attribute storage unit 10, the conversation log set L_uSu between user u and the neighboring user set S_u from the conversation log storage unit 20, and the attribute probability P(a_ij|f) from the attribute probability storage unit 50. On the basis of the conversation log set L_uSu, it first calculates the probability P(f|u,u') that the feature quantity f is included in the conversation between user u and each neighboring user u' (steps 34 and 35 in FIG. 3), and then calculates the propagation probability P(a_ij|u,u') (step 36 in FIG. 3). By propagating attributes from the neighboring user set S_u on the basis of the propagation probability, it calculates the probability P(a_ij|u) that the input user u has each attribute, and outputs it to the output unit 90 (step 37 in FIG. 3).

The probability P(f|u,u') that the feature quantity f is included in the conversation between user u and each neighboring user u' can be calculated by, for example, the following method.

[Equation not reproduced.]

Here, C(f,u,u') is the number of times the feature quantity f was included in the conversation between users u and u'.
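The equation for P(f|u,u') is missing here. A natural reading, assumed in the sketch below, is the relative frequency of f among all feature occurrences in the conversation between u and u', that is, C(f,u,u') divided by the sum of C(f',u,u') over all f'. The denominator and the function name are assumptions.

```python
from collections import Counter


def feature_probabilities(conversation_features):
    """Estimate P(f | u, u') as a relative frequency.

    conversation_features: list of feature quantities (e.g. words) observed in the
    conversation between u and u', one entry per occurrence.
    """
    counts = Counter(conversation_features)  # C(f, u, u') for each f
    total = sum(counts.values())
    if total == 0:
        return {}
    return {f: c / total for f, c in counts.items()}
```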

The propagation probability can be calculated by, for example, the following method.

[Equation not reproduced.]
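Following the prose summary given earlier (the probability that a feature quantity appears in the conversation, multiplied by the attribute probability of that feature quantity, multiplied by the intimacy), the propagation probability can be sketched as below. Summing the product over all feature quantities is an assumption taken from the worked example's remark that the products are added over the feature quantity set F; the function name and data layout are hypothetical.

```python
def propagation_probability(intimacy, feature_probs, attribute_probs, attribute):
    """Sketch of P(a_ij | u, u') = w_uu' * sum_f P(f | u, u') * P(a_ij | f).

    intimacy:        w_uu' for the pair (u, u')
    feature_probs:   {feature: P(f | u, u')}
    attribute_probs: {(feature, attribute): P(a_ij | f)}; missing pairs count as 0
    """
    shared = sum(
        p_f * attribute_probs.get((f, attribute), 0.0)
        for f, p_f in feature_probs.items()
    )
    return intimacy * shared


# Illustrative values; 0.7 is the "final exam" -> "student" probability described for FIG. 6.
p = propagation_probability(
    intimacy=0.5,
    feature_probs={"final exam": 0.5, "lunch": 0.5},
    attribute_probs={("final exam", "student"): 0.7},
    attribute="student",
)
```

With these inputs only "final exam" contributes, so the attribute "student" is propagated with a weight of about 0.175, while an attribute unrelated to anything the two users discuss would receive 0.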

When propagation is performed, the probability P(a_ij|u) that the input user u has each attribute can be calculated for the attribute values a_ij held by user u', for example as in the following equation.

[Equation not reproduced.]

In doing so, all attribute values held by user u' may be propagated on the basis of the propagation probability P(a_ij|u), or propagation may be restricted to the attribute value with the highest propagation probability P(a_ij|u) among those held by user u'.
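The aggregation step can be sketched as follows for the variant in which all attribute values of each neighbor are propagated. Adding up the propagation probabilities of the neighbors who hold a given attribute value is an assumption, since the aggregation equation is not available in this text; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def aggregate_attributes(neighbor_attributes, propagation):
    """Sketch of P(a_ij | u): sum the propagation probabilities over neighbors holding a_ij.

    neighbor_attributes: {neighbor: {attribute_name: attribute_value}}
    propagation:         {(neighbor, attribute_name, attribute_value): P(a_ij | u, u')}
    Returns {(attribute_name, attribute_value): aggregated score}.
    """
    totals = defaultdict(float)
    for neighbor, attrs in neighbor_attributes.items():
        for name, value in attrs.items():
            totals[(name, value)] += propagation.get((neighbor, name, value), 0.0)
    return dict(totals)
```

The alternative variant would keep, for each neighbor, only the attribute value with the highest propagation probability before summing.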

<Output unit 90>
The output unit 90 receives as input, from the propagation probability calculation unit 80, the probability that the input user u has each attribute. Among the propagated attribute values a_ij belonging to an attribute name a_i, it outputs the one with the highest probability as user u's attribute value a_ij for the attribute name a_i (step 38 in FIG. 3).
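Step 38 is an arg-max per attribute name. A minimal sketch, assuming the probabilities are keyed by (attribute name, attribute value) pairs (a hypothetical layout):

```python
def select_attribute_values(attribute_probabilities):
    """For each attribute name, pick the attribute value with the highest probability.

    attribute_probabilities: {(attribute_name, attribute_value): probability}
    Returns {attribute_name: attribute_value}. Ties keep the value seen first.
    """
    best = {}
    for (name, value), p in attribute_probabilities.items():
        if name not in best or p > best[name][1]:
            best[name] = (value, p)
    return {name: value for name, (value, _) in best.items()}
```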

Here, output is a concept that includes display on a display, printing on a printer, sound output, transmission to an external device, and the like. The output unit 90 may or may not be regarded as including an output device such as a display or speaker. The output unit 90 can be realized by driver software for an output device, or by such driver software together with the output device, among other options.

(Specific example)
The processing of the present embodiment is described below using a specific example under the following conditions.

・The input is an input user u.
・The attribute names a_i include occupation and residence.
・The attribute values a_{occupation j} include student and company employee.
・The attribute values a_{residence j} include Tokyo and Kanagawa.
・The user attribute storage unit 10 stores profiles.
・The conversation logs include morphologically analyzed post content, time information, and friend information.

The user set extraction unit 30 extracts users whose profiles contain the string "student", thereby obtaining the user set U_{occupation: student} whose occupation attribute value is student. U_{occupation: company employee}, U_{residence: Tokyo}, and U_{residence: Kanagawa} are extracted in the same way. The unit then queries the conversation log storage unit 20 and extracts the conversation logs of the users in each user set. By applying a statistical test to the feature quantities contained in the extracted conversation logs, it finds that the conversation logs of U_{occupation: student} contain words such as "final exam" and "club activities" significantly often, and it also obtains feature quantities such as "many posts are made in the 16:00 hour". Then, for example, if 5,000 users have used the word "final exam" and 500 of them have the attribute occupation: student, P(occupation: student | final exam) is calculated as 0.1.

Multiplying the probability P(f|u) that user u exhibits feature quantity f by the probability P(a|f) that feature quantity f belongs to attribute a gives the probability P(a|u) that user u has attribute a. For example, if a user uses the word "final exam" once in every 100 times, the result is 0.01 × 0.1 = 0.001. These products are summed over all feature quantities in the feature set F. This process assumes that the feature quantities are independent. By using it to calculate the probability that each user belongs to U_{occupation: student}, users can be added to U_{occupation: student} even if their profiles do not contain the string "student", and users who are not actually students can be excluded even if their profiles do contain the string "student".
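The arithmetic in this example can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names and the dictionary layout are hypothetical, and P(a|f) is taken to be a simple ratio of user counts, as in the "final exam" example.

```python
def attribute_probability(users_with_feature, users_with_feature_and_attr):
    """P(a|f): share of the users exhibiting feature f who have attribute a."""
    return users_with_feature_and_attr / users_with_feature


def user_attribute_probability(feature_probs, attr_probs):
    """P(a|u) = sum over f of P(f|u) * P(a|f), assuming independent features.

    feature_probs maps feature -> P(f|u); attr_probs maps feature -> P(a|f).
    Features with no stored attribute probability contribute nothing.
    """
    return sum(p_f * attr_probs.get(f, 0.0) for f, p_f in feature_probs.items())


# Figures from the text: 5,000 users used "final exam", 500 of them are students.
p_student_given_exam = attribute_probability(5000, 500)

# A user who uses "final exam" once in every 100 times.
p_student_given_user = user_attribute_probability(
    {"final exam": 0.01},
    {"final exam": p_student_given_exam},
)
```

With a single feature, the result is approximately 0.001, matching the text. With more features, the same function adds their contributions.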

Next, the attribute probability calculation unit 40 queries the conversation log storage unit 20 for the user sets obtained in the previous step and extracts the conversation logs between users included in U_{occupation: student}. By applying a statistical test to the feature quantities contained in conversations between these users, it finds that conversations among members of U_{occupation: student} contain words such as "school festival" and emoticons significantly often, and it obtains feature quantities such as "few characters per message" and "many consecutive replies". Then, in the same manner as in the user set extraction unit 30, it calculates the probability P(a|f) that feature quantity f belongs to attribute a and stores it in the attribute probability storage unit 50.

The closeness calculation unit 70 queries the conversation logs, extracts the friend set S_u of user u, and calculates the closeness between user u and each neighboring user u'. The closeness is calculated by summing, over all common friends of the two users, the number of conversations conducted via that common friend divided by the number of hops. For example, suppose the common friends are u_a and u_b, and that user u and user u' have conversed 5 times, user u and user u_a 3 times, user u' and user u_a once, user u and user u_b 4 times, and user u' and user u_b 6 times. The closeness is then 5 + (3+1)/2 + (4+6)/2 = 12, where the direct conversations count as one hop and the conversations through each common friend as two hops. Suppose that normalizing this value yields, for example, 0.6.
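The hop-weighted sum in this example can be sketched as below. This is an illustrative reading of the worked example only: the function name, the `frozenset`-keyed count table, and the user labels are hypothetical, and the normalization step is omitted because the text does not specify how it is performed.

```python
def raw_closeness(u, v, conversation_counts, common_friends):
    """Unnormalized closeness between users u and v.

    conversation_counts maps frozenset({a, b}) -> number of conversations
    between a and b. Direct conversations count as one hop; conversations
    routed through a common friend count as two hops.
    """
    def count(a, b):
        return conversation_counts.get(frozenset((a, b)), 0)

    score = count(u, v) / 1  # direct conversations, one hop
    for friend in common_friends:
        score += (count(u, friend) + count(v, friend)) / 2  # two hops
    return score


# Counts taken from the example in the text.
counts = {
    frozenset(("u", "u_prime")): 5,
    frozenset(("u", "u_a")): 3,
    frozenset(("u_prime", "u_a")): 1,
    frozenset(("u", "u_b")): 4,
    frozenset(("u_prime", "u_b")): 6,
}
closeness = raw_closeness("u", "u_prime", counts, ["u_a", "u_b"])
```

For these counts the function returns 12.0, the value given in the text before normalization.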

The propagation probability calculation unit 80 calculates which of the attribute values held by user u' should be propagated between user u and user u'. First, it queries the user attribute storage unit 10 and extracts the attributes of the friend set S_u. It then queries the conversation log storage unit 20 and extracts the conversation logs between user u and the friend set S_u. P(occupation: student | u) is calculated by summing, over all feature quantities, the per-feature attribute probabilities such as P(conversation at 16:00 | u, u') P(occupation: student | conversation at 16:00). Then, among the probability values for the attribute values, only the highest one between user u and user u' is propagated from user u' to user u. For example, suppose P(occupation: student | u, u') is 0.8, P(occupation: company employee | u, u') is 0.2, P(residence: Tokyo | u, u') is 0.4, and P(residence: Kanagawa | u, u') is 0.6. In this case, either approach may be used: propagating only "occupation: student", which has the highest probability value, and not propagating "residence: Kanagawa", which has the next highest value, at all; or propagating all attributes based on their probability values.

At the time of propagation, the previously calculated closeness is multiplied by the probability value, and the products are summed over all users. That is, the probability that the attribute value occupation: student is propagated from user u' to user u is 0.6 × 0.8 = 0.48. Summing this over all users gives, for example, P(occupation: student | u) = 0.7. Similarly, P(occupation: company employee | u) can be calculated as 0.3, P(residence: Tokyo | u) as 0.6, and P(residence: Kanagawa | u) as 0.4.
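The propagation step can be sketched as follows, covering both variants mentioned in the text (propagating every attribute value, or only the single most probable one per neighbor). The function name, the `top_only` flag, and the tuple-keyed dictionaries are assumptions made for illustration. No final normalization is applied, since the text does not state how values such as 0.7 are obtained from the sum.

```python
from collections import defaultdict


def propagate(neighbors, top_only=False):
    """Sum closeness * P(a|u,u') over all neighboring users u'.

    neighbors is a list of (closeness, probs) pairs, where probs maps
    (attribute_name, attribute_value) -> P(a|u,u') for that neighbor.
    With top_only=True, only each neighbor's single most probable
    attribute value is propagated.
    """
    totals = defaultdict(float)
    for closeness, probs in neighbors:
        if top_only:
            best = max(probs, key=probs.get)
            items = [(best, probs[best])]
        else:
            items = list(probs.items())
        for attribute, p in items:
            totals[attribute] += closeness * p
    return dict(totals)


# One neighbor u' with normalized closeness 0.6 and the probabilities from the text.
neighbor_probs = {
    ("occupation", "student"): 0.8,
    ("occupation", "company employee"): 0.2,
    ("residence", "Tokyo"): 0.4,
    ("residence", "Kanagawa"): 0.6,
}
all_values = propagate([(0.6, neighbor_probs)])
top_value = propagate([(0.6, neighbor_probs)], top_only=True)
```

In both variants the contribution for occupation: student from this neighbor is 0.6 × 0.8, approximately 0.48. With `top_only=True` it is the only entry returned.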

The output unit 90 outputs, for each attribute name a_i, the attribute value a_ij with the highest probability. For example, because the attribute value student has a higher probability than company employee for user u's occupation, the occupation of user u is output as student. With the above method, even if user u lives in Tokyo but attends a school in Kanagawa and therefore has many friends who live in Kanagawa, only the school-related attribute value is propagated from those friends, while the attribute value Tokyo is propagated from, for example, friends from a school the user attended in the past. The correct attribute value, Tokyo, can therefore be estimated.
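The selection performed by the output unit 90 amounts to taking, per attribute name, the attribute value with the highest probability. A minimal sketch, assuming the same hypothetical (attribute name, attribute value) keys as an illustration and using the final probabilities quoted in the example:

```python
def select_attributes(probabilities):
    """Return, for each attribute name, the attribute value with the highest probability.

    probabilities maps (attribute_name, attribute_value) -> P(a|u).
    """
    best = {}
    for (name, value), p in probabilities.items():
        if name not in best or p > best[name][1]:
            best[name] = (value, p)
    return {name: value for name, (value, _) in best.items()}


# Final probabilities for user u from the example in the text.
final_probs = {
    ("occupation", "student"): 0.7,
    ("occupation", "company employee"): 0.3,
    ("residence", "Tokyo"): 0.6,
    ("residence", "Kanagawa"): 0.4,
}
estimated = select_attributes(final_probs)
```

For these values the result assigns student as the occupation and Tokyo as the residence, in line with the outcome described in the text.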

The present invention is not limited to the above-described embodiment, and various modifications and applications are possible within the scope of the claims.

10 user attribute storage unit
20 conversation log storage unit
30 user set extraction unit
40 attribute probability calculation unit
50 attribute probability storage unit
60 input unit
70 closeness calculation unit
80 propagation probability calculation unit
90 output unit
100 user attribute estimation device
110 learning unit
120 inference unit

Claims

A user attribute estimation device for estimating a user attribute of an input user whose user attributes are unknown, based on conversation logs between users stored in conversation log storage means and on the attributes of a user set whose user attributes are known, stored in user attribute storage means, the device comprising:
user set extraction means for extracting a user set having a specific attribute from the user set stored in the user attribute storage means;
attribute probability calculation means for extracting, based on the user set having the specific attribute and each conversation log stored in the conversation log storage means, feature quantities that appear characteristically in the user set having the specific attribute, calculating an attribute probability that is the probability that each feature quantity belongs to each user attribute, and storing the attribute probability in attribute probability storage means;
closeness calculation means for calculating a closeness between the input user and each neighboring user based on each conversation log stored in the conversation log storage means; and
propagation probability calculation means for calculating a propagation probability with which each attribute of a neighboring user is propagated to the input user, based on the closeness and the attribute probability of each feature quantity stored in the attribute probability storage means.

The user attribute estimation device according to claim 1, further comprising output means for outputting, based on the result calculated by the propagation probability calculation means, the attribute value having the highest propagation probability among the plurality of attribute values included in an attribute name as the attribute value of the input user for that attribute name.

The user attribute estimation device according to claim 1 or 2, wherein the propagation probability calculation means calculates the propagation probability by multiplying the closeness by the product of the probability that a feature quantity is included in conversations between the input user and each neighboring user and the attribute probability of that feature quantity.

A user attribute estimation method executed by a user attribute estimation device for estimating a user attribute of an input user whose user attributes are unknown, based on conversation logs between users stored in conversation log storage means and on the attributes of a user set whose user attributes are known, stored in user attribute storage means, the method comprising:
a user set extraction step of extracting a user set having a specific attribute from the user set stored in the user attribute storage means;
an attribute probability calculation step of extracting, based on the user set having the specific attribute and each conversation log stored in the conversation log storage means, feature quantities that appear characteristically in the user set having the specific attribute, calculating an attribute probability that is the probability that each feature quantity belongs to each user attribute, and storing the attribute probability in attribute probability storage means;
a closeness calculation step of calculating a closeness between the input user and each neighboring user based on each conversation log stored in the conversation log storage means; and
a propagation probability calculation step of calculating a propagation probability with which each attribute of a neighboring user is propagated to the input user, based on the closeness and the attribute probability of each feature quantity stored in the attribute probability storage means.

The user attribute estimation method according to claim 4, further comprising an output step of outputting, based on the result calculated in the propagation probability calculation step, the attribute value having the highest propagation probability among the plurality of attribute values included in an attribute name as the attribute value of the input user for that attribute name.

The user attribute estimation method according to claim 4 or 5, wherein, in the propagation probability calculation step, the user attribute estimation device calculates the propagation probability by multiplying the closeness by the product of the probability that a feature quantity is included in conversations between the input user and each neighboring user and the attribute probability of that feature quantity.

A program for causing a computer to function as the user set extraction means, the attribute probability calculation means, the closeness calculation means, and the propagation probability calculation means of the user attribute estimation device according to any one of claims 1 to 3.