JP2017126158A

JP2017126158A - Binary classification learning device, binary classification device, method, and program

Info

Publication number: JP2017126158A
Application number: JP2016004441A
Authority: JP
Inventors: Akinori Fujino (藤野 昭典); Naonori Ueda (上田 修功)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-01-13
Filing date: 2016-01-13
Publication date: 2017-07-20
Anticipated expiration: 2036-01-13
Also published as: JP6482481B2

Abstract

PROBLEM TO BE SOLVED: To learn a score function that enables accurate binary classification even when the numbers of positive and negative examples differ greatly.

SOLUTION: For each unlabeled sample, a score calculation section 32 uses an evaluation value model, a generation probability model of positive example data, and a generation probability model of negative example data to calculate a score indicating whether or not the sample is positive example data. On the basis of the score for each unlabeled sample, the labeled samples, and the unlabeled samples, an evaluation value model calculation section 34 calculates the evaluation value model, and a generation probability model calculation section 36 calculates the generation probability models of positive example data and of negative example data. The score calculation section 32, the evaluation value model calculation section 34, and the generation probability model calculation section 36 repeat this processing until convergence.

SELECTED DRAWING: Figure 2

Description

The present invention relates to a binary classification learning device, a binary classification device, a method, and a program for binary classification.

In binary classification of content based on statistical methods, a score function that represents the strength of the dependency between content and a category is given as a function of model parameters and a feature vector, and content is classified into two classes by estimating, from the score function, the degree to which the content relates to a specific type. The values of the model parameters are generally obtained from content for which it is known whether or not it relates to that type (hereinafter, labeled samples). In methods based on this framework, the accuracy of binary classification of new content can be improved by increasing the number of labeled samples used to calculate the model parameter values. However, obtaining labeled samples requires classifying content by hand, so it is not easy to prepare a large number of them. Semi-supervised learning techniques address this difficulty: a large amount of content for which it is not known whether it relates to the type (hereinafter, unlabeled samples) is collected and used in calculating the model parameters, which compensates for insufficient learning of features that do not appear in the labeled samples and improves binary classification accuracy compared with using labeled samples alone.

The technique of Non-Patent Document 1 calculates the parameter values of the score function by margin-maximization learning so that the AUC value, which is commonly used to express the accuracy of binary classification, is maximized. AUC (Area Under the Curve) is an evaluation measure based on the receiver operating characteristic (ROC) curve; the larger the AUC value, the more correctly the content is ranked by score, with positive examples ahead of negative examples. In this technique, the parameter values of the score function are learned so that the score of content related to the type (hereinafter, a positive example) minus the score of content not related to the type (hereinafter, a negative example) is at least a fixed margin. This parameter learning takes into account the difference for every combination of a positive example and a negative example. The technique also predicts whether each unlabeled sample is a positive or a negative example. If the prediction is negative, learning makes the score of each labeled positive sample minus the score of the unlabeled sample at least a fixed margin; if the prediction is positive, learning makes the score of the unlabeled sample minus the score of each labeled negative sample at least a fixed margin.

The techniques of Non-Patent Document 2 and Patent Document 1 address the problem of classifying each content item into exactly one of several categories, such as sports, music, science, and economics. They learn the discriminant function used for classification from labeled and unlabeled samples, and they give the discriminant function as a weighted integration of a generation probability model of the content in each category and a conditional probability model of the category given the content (a discriminative probability model). As described in Non-Patent Document 3, it is generally known that a generation probability model is effective for learning the distribution of unlabeled samples, whereas a discriminative probability model is effective for classifying labeled samples correctly. By combining the two models appropriately, the techniques of Non-Patent Document 2 and Patent Document 1 make effective use of the statistical information in both labeled and unlabeled samples when learning the discriminant function, and thereby improve the classification accuracy for new samples.

Patent Document 1: Japanese Patent No. 5308360

Non-Patent Document 1: Shijun Wang, Diana Li, Nicholas Petrick, Berkman Sahiner, Marius George Linguraru, and Ronald M. Summers: Optimizing area under the ROC curve using semi-supervised learning. Pattern Recognition, Elsevier, 48, 276-287 (2015).
Non-Patent Document 2: A. Fujino, N. Ueda, and M. Nagata: Adaptive semi-supervised learning on labeled and unlabeled data with different distributions. Knowledge and Information Systems (KAIS), Springer, 37 (1), 129-154 (2013).
Non-Patent Document 3: M. Seeger: Learning with labeled and unlabeled data. Technical report, University of Edinburgh (2001).

The technique of Non-Patent Document 1 extends margin-maximization learning, which is widely used in supervised learning with labeled samples, so that unlabeled samples are also used to learn the score function. However, as described in Non-Patent Document 3, generation probability models are generally known to be effective for learning the distribution of unlabeled samples, so using a generation probability model in learning may yield a score function that gives high classification accuracy. The techniques of Non-Patent Document 2 and Patent Document 1 aim to predict, for each content item, the appropriate category from among multiple candidate categories, and they learn the parameters of the discriminant function so as to increase the accuracy rate of category prediction. In the binary classification problem addressed by the present invention, however, the number of content items related to a specific type (positive examples) is often far smaller than the number of unrelated items (negative examples). In such a case, a discriminant function that judges every item to be a negative example achieves a category prediction accuracy rate close to 100%, yet it does not necessarily extract the positive examples correctly. Therefore, applying the techniques of Non-Patent Document 2 and Patent Document 1 as they are to a binary classification problem with a large difference between the numbers of positive and negative examples does not guarantee that positive and negative examples are classified with high accuracy. The problem to be solved is to obtain, for binary classification problems with such a large difference, a score function that gives high classification accuracy by using unlabeled samples effectively in learning.
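The imbalance problem described above can be made concrete with a small numeric sketch. The data set (1 positive and 99 negative examples) and the helper names `accuracy` and `recall` are illustrative assumptions, not taken from the source.

```python
def accuracy(labels, predictions):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def recall(labels, predictions, positive="+"):
    """Fraction of the true positive examples that were predicted positive."""
    for_positives = [p for y, p in zip(labels, predictions) if y == positive]
    return sum(1 for p in for_positives if p == positive) / len(for_positives)


# Hypothetical imbalanced data set: 1 positive example, 99 negative examples.
labels = ["+"] + ["-"] * 99
# A degenerate discriminant function that judges every item to be negative.
all_negative = ["-"] * 100

print(accuracy(labels, all_negative))  # 0.99
print(recall(labels, all_negative))    # 0.0
```

The accuracy rate is 99% even though not a single positive example is extracted, which is why a ranking measure such as AUC is the more informative target in this setting.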

The present invention has been made in view of the above circumstances, and an object thereof is to provide a binary classification learning device, method, and program capable of learning a score function that enables accurate binary classification even when the difference between the numbers of positive and negative examples is large.

Another object is to provide a binary classification device, method, and program capable of accurate binary classification.

To achieve the above objects, a binary classification learning device according to the present invention learns a score function for binary classification on the basis of training data consisting of labeled samples, for each of which it is given whether the sample is positive example data related to a specific type or negative example data not related to the specific type, and unlabeled samples, for each of which it is unknown whether the sample is positive example data or negative example data. The device includes: a score calculation unit that calculates, for each of the unlabeled samples, a score indicating whether or not the unlabeled sample is positive example data, using an evaluation value model expressed with a function that outputs a value indicating whether or not data is positive example data, a generation probability model of positive example data, and a generation probability model of negative example data; an evaluation value model calculation unit that calculates the evaluation value model on the basis of the score calculated by the score calculation unit for each of the unlabeled samples, the labeled samples, and the unlabeled samples; a generation probability model calculation unit that calculates the generation probability model of positive example data and the generation probability model of negative example data on the basis of the score calculated by the score calculation unit for each of the unlabeled samples, the labeled samples, and the unlabeled samples; and a convergence determination unit that causes the calculation by the score calculation unit, the calculation by the evaluation value model calculation unit, and the calculation by the generation probability model calculation unit to be repeated until a predetermined convergence determination condition is satisfied, and that outputs the score function using the evaluation value model and the generation probability models.
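The control flow among the four units can be sketched as a simple alternating loop. This is only a structural sketch: the function name `train_score_function`, the scalar parameters, the toy update rules, and the convergence test on parameter change are all assumptions made for illustration; the source states only that a predetermined convergence determination condition is used.

```python
def train_score_function(score_step, eval_step, gen_step, w0, theta0,
                         tol=1e-9, max_iter=200):
    """Alternate the three calculations until the parameters stop changing.

    score_step(w, theta)    -> scores for the unlabeled samples
    eval_step(scores, w)    -> updated evaluation value model parameter
    gen_step(scores, theta) -> updated generation probability model parameter
    Parameters are scalars here purely to keep the sketch short.
    """
    w, theta = w0, theta0
    iteration = 0
    for iteration in range(1, max_iter + 1):
        scores = score_step(w, theta)          # score calculation unit
        new_w = eval_step(scores, w)           # evaluation value model calculation unit
        new_theta = gen_step(scores, theta)    # generation probability model calculation unit
        change = max(abs(new_w - w), abs(new_theta - theta))
        w, theta = new_w, new_theta
        if change < tol:                       # assumed convergence criterion
            break
    return w, theta, iteration


# Toy stand-ins for the three units (hypothetical, for illustration only).
w, theta, n_iter = train_score_function(
    score_step=lambda w, theta: [w + theta],
    eval_step=lambda scores, w: 0.5 * w + 1.0,          # fixed point at 2.0
    gen_step=lambda scores, theta: 0.5 * theta - 1.0,   # fixed point at -2.0
    w0=0.0, theta0=0.0)
```

Passing the three calculations in as callables keeps the loop independent of the particular evaluation value model and generation probability model that are plugged in.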

A binary classification learning method according to the present invention is performed in a binary classification learning device that includes a score calculation unit, an evaluation value model calculation unit, a generation probability model calculation unit, and a convergence determination unit, and that learns a score function for binary classification on the basis of training data consisting of labeled samples, for each of which it is given whether the sample is positive example data related to a specific type or negative example data not related to the specific type, and unlabeled samples, for each of which it is unknown whether the sample is positive example data or negative example data. In the method, the score calculation unit calculates, for each of the unlabeled samples, a score indicating whether or not the unlabeled sample is positive example data, using an evaluation value model expressed with a function that outputs a value indicating whether or not data is positive example data, a generation probability model of positive example data, and a generation probability model of negative example data; the evaluation value model calculation unit calculates the evaluation value model on the basis of the score calculated by the score calculation unit for each of the unlabeled samples, the labeled samples, and the unlabeled samples; the generation probability model calculation unit calculates the generation probability model of positive example data and the generation probability model of negative example data on the basis of the score calculated by the score calculation unit for each of the unlabeled samples, the labeled samples, and the unlabeled samples; and the convergence determination unit causes the calculation by the score calculation unit, the calculation by the evaluation value model calculation unit, and the calculation by the generation probability model calculation unit to be repeated until a predetermined convergence determination condition is satisfied, and outputs the score function using the evaluation value model and the generation probability models.

A binary classification device according to the present invention includes a score computation unit that computes, on the basis of input test data and the score function learned by the above binary classification learning device, a score value indicating that the test data is a positive example.

A binary classification method according to the present invention is performed in a binary classification device including a score computation unit, in which the score computation unit computes, on the basis of input test data and the score function learned by the above binary classification learning method, a score value indicating that the test data is a positive example.

A program according to the present invention causes a computer to function as each unit of the above binary classification learning device or binary classification device.

As described above, the binary classification learning device, method, and program of the present invention repeat the following: for each of the unlabeled samples, a score indicating whether or not the unlabeled sample is positive example data is calculated using the evaluation value model, the generation probability model of positive example data, and the generation probability model of negative example data; then, on the basis of the score for each of the unlabeled samples, the labeled samples, and the unlabeled samples, the evaluation value model is calculated and the generation probability models of positive example data and of negative example data are calculated. This makes it possible to learn a score function that enables accurate binary classification even when the difference between the numbers of positive and negative examples is large.

In addition, the binary classification device, method, and program of the present invention enable accurate binary classification.

FIG. 1 is a block diagram showing the functional configuration of the binary classification device according to the present embodiment.
FIG. 2 is a block diagram showing the functional configuration of the score function generation unit of the binary classification device according to the present embodiment.
FIG. 3 is a flowchart of the binary classification learning processing routine in the binary classification device according to the present embodiment.
FIG. 4 is a flowchart of the binary classification processing routine in the binary classification device according to the present embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Outline of the Embodiment of the Present Invention>
First, an outline of the present embodiment will be described.

The binary classification device according to the present embodiment handles sets of content that can be represented by feature vectors: content consisting of text information, such as papers, patents, and other documents contained in a database, online news data, and e-mail; content consisting of text information and link information, such as Web data and blog data; and content such as image data. When content related to a specific type (for example, sports, music, science, or economics) is to be extracted from such a set, the device uses statistical information from content for which it is known whether or not it relates to the specific type and from content for which this is unknown. With this information it learns a binary classifier that takes the feature vector of a content item as input and outputs a score value representing the degree to which the item relates to the specific type, and it then uses the binary classifier to classify content into two classes.

In the binary classification device according to the present embodiment, the score function is given by a weighted integration of an evaluation value model, whose parameters are learned so as to maximize the AUC value, and a generation probability model, and the parameters of both models are obtained by a calculation that uses the statistical information of labeled and unlabeled samples simultaneously. The score function and the parameter values are given by a weighted maximization of four quantities: the expected value of the logarithm of the evaluation value model, the expected log-likelihood of the generation probability model, an entropy regularization term on the probability values given by the score function, and regularization terms on the model parameters of the evaluation value model and the generation probability model.

<Configuration of the Binary Classification Device According to the First Embodiment of the Present Invention>
Next, the configuration of the binary classification device according to the first embodiment of the present invention will be described. As shown in FIG. 1, the binary classification device 100 according to the present embodiment can be implemented by a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the binary classification learning processing routine and the binary classification processing routine described later. Functionally, as shown in FIG. 1, the binary classification device 100 includes an input unit 10, a calculation unit 20, and an output unit 90.

The input unit 10 receives a training data set generated by collecting examples of content that has the same format as the content to be classified.

For example, for binary classification of Web articles, training data is used that records examples of Web articles together with a class indicating whether or not each example relates to a specific type (music, sports, business, and so on). There are two classes: positive example (+), indicating that the example relates to the specific type, and negative example (−), indicating that it does not. Each content example is assigned exactly one of the two classes.

The training data set includes labeled samples, each consisting of a pair of a content body and the class assigned as described above, and unlabeled samples, each consisting of a content body whose class is unknown.

The input unit 10 also receives a test data set, which is the content to be classified.

The calculation unit 20 includes a training data database 22, a score function generation unit 24, a test data database 26, and a ranking unit 28. The score function generation unit 24 is an example of the binary classification learning device.

The training data database 22 stores the training data set received by the input unit 10.

The score function generation unit 24 generates a score function for binary classification on the basis of the training data set stored in the training data database 22.

As shown in FIG. 2, the score function generation unit 24 includes an initialization unit 30, a score calculation unit 32, an evaluation value model calculation unit 34, a generation probability model calculation unit 36, a convergence determination unit 38, and a score function storage unit 40.

The principle by which the score function is generated is described below.

First, x denotes the feature vector of a content item, and ω ∈ {+, −} denotes the class, whose value is either positive example (+) or negative example (−). The evaluation value model is denoted F(x; W), and the generation probability model is denoted p(x; θ_ω). The parameters of the score function are W and Θ = [θ_+ θ_−]. Of the labeled samples stored in the training data database 22, the set of positive examples is denoted D^+ and the set of negative examples is denoted D^−. A sample in D^− is written x^−_j, and a sample in the unlabeled sample set is written x_m.

In the present embodiment, the case is described, as an example, in which a linear function is used for the evaluation value model and a Naive Bayes model (hereinafter, NB model) is used for the generation probability model.
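The two model families named here can be sketched as follows. The text at this point does not give their formulas, so the inner-product form of the linear function and the multinomial form of the NB model are assumptions, as are the function names and the example parameter values.

```python
import math


def linear_score(x, w):
    """Linear evaluation function: inner product of parameters and features."""
    return sum(w_v * x_v for w_v, x_v in zip(w, x))


def nb_log_likelihood(x, theta):
    """Log-likelihood of count vector x under a multinomial Naive Bayes model.

    theta[v] is the probability of feature v in the class. The multinomial
    coefficient is dropped because it does not depend on the class.
    """
    return sum(x_v * math.log(t_v) for x_v, t_v in zip(x, theta) if x_v > 0)


x = [2, 1, 0]                 # hypothetical feature counts
w = [1.0, -0.5, 0.3]          # hypothetical evaluation value model parameters
theta_pos = [0.6, 0.3, 0.1]   # hypothetical feature probabilities, positive class
theta_neg = [0.2, 0.3, 0.5]   # hypothetical feature probabilities, negative class

print(linear_score(x, w))     # 1.5
print(nb_log_likelihood(x, theta_pos) > nb_log_likelihood(x, theta_neg))  # True
```

In this toy case the count vector is more likely under the positive-class parameters, because its most frequent feature has the highest probability there.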

Let the feature space, which consists of the words, pixels, links, or combinations of these contained in content, be {t_1, ..., t_V}. The feature vector x of a content item is then expressed in terms of the frequency x_v with which each t_v occurs in the content, as x = (x_1, ..., x_V)^T. V denotes the number of kinds of features that can occur in content. For example, when the content is text data, V is the total number of vocabulary items that can appear in the content. A^T denotes the transpose of a matrix (vector) A.
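For text content, the frequency-based feature vector described above can be sketched as a simple word count. The vocabulary, the example sentence, and the function name `feature_vector` are hypothetical.

```python
from collections import Counter


def feature_vector(tokens, vocabulary):
    """Return [x_1, ..., x_V], where x_v counts occurrences of vocabulary[v]."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]


vocabulary = ["goal", "match", "guitar", "stock"]   # hypothetical t_1..t_V, V = 4
tokens = "the match ended after a late goal and another goal".split()
x = feature_vector(tokens, vocabulary)
print(x)  # [2, 1, 0, 0]
```

Words outside the vocabulary are ignored, so every content item maps to a vector of the same length V.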

In the evaluation value model, a function f(x; W) is designed that outputs a large scalar value when the feature vector of a content item corresponds to a positive example. When the evaluation value model is applied to the positive set D^+ and the negative set D^− of labeled samples, the AUC value representing the accuracy of binary classification is the fraction of all pairs of a positive sample and a negative sample for which a step function outputs 1, where the step function outputs 1 when the value of f for the positive sample exceeds the value of f for the negative sample and 0 otherwise.

When the evaluation value model is learned from the labeled data sets D^+ and D^− alone, it suffices to solve the optimization problem of finding the W that maximizes the AUC. However, the step function is not differentiable, so this optimization problem is not easy to solve. The problem is therefore replaced by that of finding the W that maximizes the objective function of Equation (3), in which the step function is approximated by a sigmoid function, which makes the evaluation value model easy to learn.

R(W) in Equation (3) is a regularization term on the parameter W of the evaluation value model, and C is a hyperparameter that gives the weight of the regularization term. Regularization terms are commonly used to suppress overfitting, in which the model fits the labeled sample set too closely and its prediction accuracy for new samples drops.
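The relationship between the pairwise AUC and its sigmoid approximation can be sketched numerically. This is a sketch under stated assumptions: the `scale` parameter, the plain averaging of sigmoid values, and the example evaluation values are illustrative, and the regularization term R(W) of Equation (3) is omitted.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly (step function)."""
    hits = sum(1 for p in pos_scores for n in neg_scores if p > n)
    return hits / (len(pos_scores) * len(neg_scores))


def smooth_auc(pos_scores, neg_scores, scale=1.0):
    """Differentiable surrogate: the step function replaced by a sigmoid."""
    total = sum(sigmoid(scale * (p - n)) for p in pos_scores for n in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))


pos = [2.0, 0.5]   # hypothetical evaluation values of positive samples
neg = [1.0, 0.0]   # hypothetical evaluation values of negative samples
print(auc(pos, neg))  # 0.75
```

Three of the four pairs are ranked correctly, so the exact AUC is 0.75. The surrogate changes smoothly with the evaluation values, which is what allows gradient-based optimization, and it approaches the exact value as `scale` grows.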

In the present embodiment, the unlabeled sample set is used together with the labeled samples to learn the parameters of the score function. When the probability P(+|x_m) that an unlabeled sample x_m is a positive example and the probability P(−|x_m) that it is a negative example are given, W can be calculated as the parameter estimate of the evaluation value model by, for example, maximizing the objective function J(W, P) of Equation (4).

Equation (4) contains a term that compares the unlabeled sample x_m with every sample x^−_j in the labeled negative set D^−. This term takes its maximum value of 1 when x_m is evaluated sufficiently higher than all of those samples and its minimum value of 0 when x_m is evaluated sufficiently lower. It is introduced so that, when x_m is predicted to be a positive example, the parameter W is learned so that x_m is evaluated higher than the labeled negative samples. Similarly, a second term compares x_m with the labeled positive set D^+; it is introduced so that, when x_m is predicted to be a negative example, W is learned so that the labeled positive samples are evaluated higher than x_m.

R(P_m) in Equation (4) is an entropy regularization term on P(+|x_m) and P(−|x_m); for example, the entropy of these class probabilities can be used. γ is a hyperparameter that gives the weight of the regularization term R(P_m). Using the method of Lagrange multipliers under the constraint that the class probabilities of each unlabeled sample sum to 1, the W and P(ω|x_m), ω ∈ {+, −}, ∀m that maximize J(W, P) are obtained.
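The effect of an entropy regularization term under a sum-to-one constraint can be illustrated generically. The sketch below is not Equation (4) itself: it assumes an objective that is linear in the class probabilities, with hypothetical per-class gains `a`, plus γ times the Shannon entropy. For such an objective, the method of Lagrange multipliers gives probabilities proportional to exp(a/γ).

```python
import math


def entropy(p):
    """Shannon entropy of a discrete distribution (natural logarithm)."""
    return -sum(q * math.log(q) for q in p if q > 0)


def objective(p, a, gamma):
    """Linear gain sum_k p[k] * a[k] plus gamma times the entropy of p."""
    return sum(q * g for q, g in zip(p, a)) + gamma * entropy(p)


def maximizer(a, gamma):
    """Closed-form maximizer on the simplex: p[k] proportional to exp(a[k] / gamma)."""
    m = max(a)  # subtract the maximum for numerical stability
    weights = [math.exp((g - m) / gamma) for g in a]
    total = sum(weights)
    return [w / total for w in weights]


a = [1.0, 0.0]                     # hypothetical gains for classes "+" and "-"
p_star = maximizer(a, gamma=1.0)   # roughly [0.731, 0.269]
```

A larger γ pulls the probabilities toward 0.5 each, while a smaller γ pushes them toward a hard 0/1 assignment of the unlabeled sample to the class with the larger gain.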

However, in learning based on J(W, P) of Equation (4), the parameter values are learned by predicting the classes of the unlabeled samples solely from the information in the labeled positive and negative samples, so the benefit of using unlabeled samples in learning is small. In the present embodiment, the distribution characteristics of content that the generation probability model assumes for each kind of content, such as documents or images, are used as prior knowledge, which increases the benefit of using unlabeled samples in learning. A generation probability model p(x; θ_+) of positive content and a generation probability model p(x; θ_−) of negative content are introduced, and the parameter W of the evaluation value model, the parameters Θ = [θ_+ θ_−] of the generation probability models, and the class probabilities P(ω|x_m), ω ∈ {+, −}, ∀m of the unlabeled samples are learned so as to maximize the objective function J(W, Θ, P) of Equation (6).

p(Θ) in equation (6) denotes the prior probability distribution of the parameters Θ of the generation probability models and acts as a regularization term that suppresses overfitting of Θ. β is a hyperparameter that sets how much the generation probability models contribute to learning.

Under the constraint

the method of Lagrange multipliers gives the solution P̂(ω|x_m; W, Θ) for the P(ω|x_m), ω ∈ {+, −}, ∀m, that maximize J(W, Θ, P) in equation (6):

where

In the present embodiment, the function P̂(+|x; W, Θ) of x given by equation (7) is used as the score function for determining whether content is a positive example.

Substituting equation (7) into equation (6) yields the objective function Ĵ(W, Θ) of W and Θ given by equation (10).

The W and Θ that maximize the objective function Ĵ(W, Θ) given by equation (10) can be obtained, for example, by introducing a function Q_1(W, Θ, W^(t), Θ^(t)) that satisfies

and repeatedly computing the W and Θ that maximize it. This gives a locally optimal solution that maximizes Ĵ(W, Θ) in the neighborhood of the initial values of W and Θ. Q_1 is defined as follows.

where

Computing the estimates at step (t + 1) as

yields estimates of W and Θ that monotonically increase Ĵ(W, Θ). The solution of equation (14) is obtained by computing W^(t+1) and Θ^(t+1) independently, as follows.

When the function f(x, W) used in the evaluation value model is designed as a linear function of a parameter vector w having the same dimension as the content feature vector x,

and the regularization term R(W) of W is

the solution W^(t+1) = w^(t+1) of equation (15) may be obtained by repeatedly computing the w that maximizes the following Q_2(w, w^(u)), which satisfies

so as to find the w that maximizes J^(t)_s(w) in the neighborhood of the initial value w^(t).

Computing the estimate at step (u + 1) as

yields an estimate of w that monotonically increases J^(t)_s(w). The estimate ŵ of w obtained by parameter estimation with equation (20) satisfies

so the value of Q_1(W, Θ, W^(t), Θ^(t)) in equation (11) is larger for W = ŵ than for W = w^(t). Therefore, obtaining the estimate of w by the iterative computation of equation (20) at step (t + 1) of equation (14) gives a W^(t+1) = w^(t+1) that monotonically increases Ĵ(W, Θ).

Because Q_2(w, w^(u)) is concave in w, the globally optimal w^(u+1) satisfying equation (20) can be found. The solution w^(u+1) of equation (20) can be computed, for example, with the BFGS algorithm, a quasi-Newton method (see Non-Patent Document 4).

Non-Patent Document 4: D. C. Liu and J. Nocedal: On the limited memory BFGS method for large scale optimization, Math. Programming, Ser. B, Vol. 45, No. 3, pp. 503-528 (1989).

When the generation probability models are designed with the NB model, the generation probability model p(x; θ_+) of positive example content and the generation probability model p(x; θ_−) of negative example content are defined by

where

and

The prior probability distribution of Θ is designed with a Dirichlet distribution as

where ξ (> 0) is a hyperparameter. When the generation probability models are designed with the NB model in this way, the solution satisfying equation (16) can be computed as follows.

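The kind of update this paragraph describes can be sketched as follows. This is an illustration under stated assumptions, not the patent's equations, which are not reproduced in this text: it assumes a multinomial NB model over word counts, assumes the Dirichlet prior contributes a pseudo-count of ξ per word, and assumes each unlabeled sample contributes its counts weighted by its estimated class probability. The function and argument names are hypothetical.

```python
def estimate_word_probs(labeled_counts, unlabeled_counts, class_probs, xi):
    """Smoothed word probabilities for one class of an NB model.

    labeled_counts:   word-count vectors of labeled samples of this class
    unlabeled_counts: word-count vectors of unlabeled samples
    class_probs:      estimated probability that each unlabeled sample
                      belongs to this class (same order as unlabeled_counts)
    xi:               pseudo-count per word (assumed role of the
                      Dirichlet hyperparameter)
    """
    first = labeled_counts[0] if labeled_counts else unlabeled_counts[0]
    totals = [float(xi)] * len(first)
    # Hard counts from labeled samples of this class.
    for doc in labeled_counts:
        for v, n in enumerate(doc):
            totals[v] += n
    # Soft (expected) counts from unlabeled samples.
    for doc, p in zip(unlabeled_counts, class_probs):
        for v, n in enumerate(doc):
            totals[v] += p * n
    norm = sum(totals)
    return [c / norm for c in totals]
```

With one labeled document `[2, 0]`, one unlabeled document `[0, 2]` assigned probability 0.5, and `xi = 1`, the totals are `[3, 2]`, so the estimate is `[0.6, 0.4]`.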

After the parameter estimates W^(t+1) and Θ^(t+1) of the score function have been computed at learning step (t + 1), it is checked whether a convergence condition, given for example by the following expression, is satisfied.

In equation (25), ||W^(t)|| and ||Θ^(t)|| denote the Frobenius norms of the matrices W^(t) and Θ^(t), and ε is a small value set by the designer. If the convergence condition is satisfied, W^(t+1) and Θ^(t+1) are stored in the score function storage unit 40 as the parameter values Ŵ and Θ̂ of the score function. If it is not satisfied, t ← t + 1 is set and the above processing is repeated.
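A convergence check of this kind might look like the sketch below. The exact form of equation (25) is not reproduced in this text, so the criterion used here (the sum of the relative Frobenius-norm changes of W and Θ compared with ε) is an assumption, and the names `frobenius_norm` and `has_converged` are hypothetical.

```python
import math

def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a list-of-rows matrix."""
    return math.sqrt(sum(v * v for row in matrix for v in row))

def relative_change(new, old):
    """||new - old|| / ||old||, guarded against a zero denominator."""
    diff = [[a - b for a, b in zip(rn, ro)] for rn, ro in zip(new, old)]
    return frobenius_norm(diff) / max(frobenius_norm(old), 1e-12)

def has_converged(w_new, w_old, theta_new, theta_old, eps):
    """Assumed criterion: total relative parameter change below eps."""
    d = relative_change(w_new, w_old) + relative_change(theta_new, theta_old)
    return d < eps
```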

In accordance with the principle described above, the initialization unit 30 of the score function generation unit 24 sets initial values for the parameter W of the evaluation value model and the parameters Θ of the generation probability models.

For each unlabeled sample x_m, the score calculation unit 32 computes the probability P̂(+|x_m; W, Θ) that x_m is positive example data and the probability P̂(−|x_m; W, Θ) that it is negative example data, according to equation (7) above, which takes the entropy regularization term R(P_m) of the probability values into account. It uses the parameter W of the evaluation value model, either as initialized by the initialization unit 30 or as last computed by the evaluation value model calculation unit 34, and the parameters Θ of the generation probability models, either as initialized by the initialization unit 30 or as last computed by the generation probability model calculation unit 36.

Based on the probabilities P̂(+|x_m; W, Θ) and P̂(−|x_m; W, Θ) computed by the score calculation unit 32 for each unlabeled sample x_m, the labeled samples x⁺_i and x⁻_j, and the unlabeled samples x_m, the evaluation value model calculation unit 34 computes the parameter W of the evaluation value model according to equations (19) and (20) above, which take the regularization term R(W) of the model parameters into account, so as to maximize the AUC (Area Under the Curve) value of the evaluation value model F(x; W) approximated with a sigmoid function.
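The idea of a sigmoid-approximated AUC can be sketched as follows for a linear score f(x) = w·x. This is a simplified illustration that covers labeled positive/negative pairs only; the unlabeled-sample terms and the regularization term R(W) that the patent includes are omitted, and all function names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def linear_score(w, x):
    """Linear evaluation value f(x) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def smoothed_auc(w, positives, negatives):
    """Average of sigmoid(f(x+) - f(x-)) over all positive/negative pairs.

    The exact AUC counts pairs with f(x+) > f(x-); replacing that step
    function with a sigmoid makes the quantity differentiable in w.
    """
    total = 0.0
    for xp in positives:
        fp = linear_score(w, xp)
        for xn in negatives:
            total += sigmoid(fp - linear_score(w, xn))
    return total / (len(positives) * len(negatives))
```

Because the quantity is defined over pairs rather than per-class error counts, it does not favor the majority class when positive and negative examples are heavily imbalanced, which is the setting the patent targets.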

Based on the probabilities P̂(+|x_m; W, Θ) and P̂(−|x_m; W, Θ) computed by the score calculation unit 32 for each unlabeled sample x_m, the labeled samples x⁺_i and x⁻_j, and the unlabeled samples x_m, the generation probability model calculation unit 36 computes the parameters θ_{+,v} of the generation probability model of positive example data and the parameters θ_{−,v} of the generation probability model of negative example data according to equations (23) and (24) above, which take the regularization term p(Θ) of the model parameters into account. The parameters are computed so as to maximize the sum of the log-likelihood of the labeled samples and the expected log-likelihood of the unlabeled samples, both obtained using the generation probability models p(x; θ_+) and p(x; θ_−).

The convergence determination unit 38 causes the computations by the score calculation unit 32, the evaluation value model calculation unit 34, and the generation probability model calculation unit 36 to be repeated until the convergence condition of equation (25) above is satisfied. It then stores the estimate Ŵ of the parameter of the evaluation value model and the estimate Θ̂ of the parameters of the generation probability models, both used in the score function, in the score function storage unit 40.

The test data database 26 stores the test data set

received by the input unit 10.

For each test data item x_z in the test data set D_p stored in the test data database 26, the ranking unit 28 computes the probability P̂(+|x_z; W, Θ) according to equation (7) above as a score value indicating that x_z is a positive example, using the parameter W of the evaluation value model and the parameters Θ of the generation probability models stored in the score function storage unit 40. It then sorts the test data x_z in descending order of score value, and the output unit 90 outputs the score values and the resulting order of the test data as the binary classification result. A threshold may also be set on the score value so that only test data whose score value is at or above the threshold are output as positive examples.
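The ranking step can be sketched as below, assuming the learned score function is available as a callable. The names `rank_by_score` and `score_fn` are hypothetical stand-ins, not identifiers from the patent.

```python
def rank_by_score(samples, score_fn, threshold=None):
    """Return (score, sample) pairs sorted by descending score.

    If a threshold is given, only pairs whose score is at or above it
    are kept, i.e. only the samples output as positive examples.
    """
    scored = [(score_fn(x), x) for x in samples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if threshold is not None:
        scored = [pair for pair in scored if pair[0] >= threshold]
    return scored
```

For example, `rank_by_score([0.2, 0.9, 0.5], lambda x: x, threshold=0.5)` keeps the samples scored 0.9 and 0.5, in that order.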

＜Operation of the Binary Classification Device According to the First Embodiment of the Present Invention＞
Next, the operation of the binary classification device 100 according to the first embodiment of the present invention will be described. When the binary classification device 100 receives a training data set through the input unit 10 and stores it in the training data database 22, and receives a test data set and stores it in the test data database 26, the binary classification device 100 executes the binary classification learning processing routine shown in FIG. 3.

First, in step S101, the positive example set D⁺ of labeled samples, the negative example set D⁻ of labeled samples, and the unlabeled sample set D_u contained in the training data set stored in the training data database 22 are read. Here, i denotes the ID number of a labeled sample in the positive example set, j denotes the ID number of a labeled sample in the negative example set, and m denotes the ID number of an unlabeled sample in the unlabeled sample set.

In the next step S102, the initialization unit 30 sets the hyperparameter values used to compute the parameter values. Specifically, it sets the values of the hyperparameters C, ξ, γ, and β, as well as the convergence condition parameter ε and the maximum number of iterations t_max.

Then, in step S103, the initialization unit 30 sets the initial value t = 0 of the learning step t and the initial values Θ^(0) of the parameters of the generation probability models that make up the score function. For example, 1/V is assigned to each element θ^(0)_{ω,v} (ω ∈ {+, −}) of Θ^(0).
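The uniform initialization in step S103 can be written as a short sketch. The dictionary layout keyed by class sign and the function name `init_theta` are illustrative assumptions.

```python
def init_theta(vocab_size):
    """Uniform initial parameters: every element is 1/V for both classes."""
    uniform = 1.0 / vocab_size
    return {"+": [uniform] * vocab_size, "-": [uniform] * vocab_size}
```

Starting both class models from the same uniform distribution means the generation probability models initially express no preference between the classes.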

In step S104, the initialization unit 30 computes the parameter values of the evaluation value model using the positive example set D⁺ and the negative example set D⁻ of labeled samples, and uses the result as the initial value W^(0) of the parameter of the evaluation value model that makes up the score function.

Step S104 is realized, for example, by the following steps 1 to 4.

1. Set the convergence condition parameter ε′ and the maximum number of iterations u_max.

2. Assign 0 to u and 0 to each element of w^(u). Assign 0 to P̂(ω|x_m; w^(0), Θ^(0)), ω ∈ {+, −}.

3. Using the BFGS algorithm, compute the solution w^(u+1) that satisfies equation (20) above.

4. Perform the following termination checks (a) and (b).

(a) If the convergence condition

is not satisfied and u < u_max:
i. Assign u + 1 to u.
ii. Return to step 3 above.

(b) Otherwise, assign w^(u+1) to w^(t).

Then, in step S105, the score calculation unit 32 uses as parameter values the parameter W^(t) obtained in step S104 above or in step S106 described later, and the parameters Θ^(t) obtained in step S103 above or in step S107 described later. For each unlabeled sample x_m, it computes, according to equation (7) above, the probability P̂(+|x_m; W^(t), Θ^(t)) that x_m is a positive example and the probability P̂(−|x_m; W^(t), Θ^(t)) that it is a negative example.

In step S106, the evaluation value model calculation unit 34 computes the estimate W^(t+1) of the parameter of the evaluation value model using the probabilities P̂(+|x_m; W, Θ) and P̂(−|x_m; W, Θ) computed in step S105 for each unlabeled sample x_m, the labeled samples x⁺_i and x⁻_j, and the unlabeled samples x_m.

Step S106 is realized, for example, by the following steps (a) to (d).

(a) Set the convergence condition parameter ε′ and the maximum number of iterations u_max.

(b) Assign 0 to u and w^(t) to w^(u).

(c) Using the BFGS algorithm, compute the solution w^(u+1) that satisfies equation (20) above.

(d) Perform the following termination checks i and ii.

i. If the convergence condition

is not satisfied and u < u_max:
A. Assign u + 1 to u.
B. Return to (c) above.

ii. Otherwise, assign w^(u+1) to W^(t+1).

In step S107, the generation probability model calculation unit 36 computes the estimates Θ^(t+1) of the parameters of the generation probability models using the probabilities P̂(+|x_m; W, Θ) and P̂(−|x_m; W, Θ) computed in step S105 for each unlabeled sample x_m, the labeled samples x⁺_i and x⁻_j, and the unlabeled samples x_m. Specifically, Θ^(t+1) is computed using equations (23) and (24) above.

In step S108, the convergence determination unit 38 computes the change d(t+1, t) in the estimates of the parameters W and Θ and determines whether the convergence condition d(t+1, t) < ε of equation (25) above is satisfied. If it is, in step S110 the estimates are set as Ŵ ← W^(t+1) and Θ̂ ← Θ^(t+1) and stored in the score function storage unit 40, and the binary classification learning processing routine ends. If it is not, in step S109 the learning step is updated as t ← t + 1, the process returns to step S105, and steps S105 to S108 are performed again. This is repeated until the convergence condition is satisfied or the learning step t reaches the maximum value t_max set in advance by the designer.
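The control flow of steps S105 to S110 can be sketched as an alternating loop. The four callables are hypothetical placeholders for the computations of the score calculation unit, the evaluation value model calculation unit, the generation probability model calculation unit, and the convergence measure d(t+1, t); none of their internals are taken from the patent.

```python
def train(w0, theta0, compute_probs, update_w, update_theta,
          change, eps, t_max):
    """Alternate the three computations until convergence or t_max steps."""
    w, theta = w0, theta0
    for _ in range(t_max):
        probs = compute_probs(w, theta)             # S105: class probabilities
        w_next = update_w(w, theta, probs)          # S106: evaluation value model
        theta_next = update_theta(theta, probs)     # S107: generation models
        d = change(w_next, w, theta_next, theta)    # S108: amount of change
        w, theta = w_next, theta_next               # S109: advance the step
        if d < eps:                                 # S110: converged, stop
            break
    return w, theta
```

Both updates in one pass use the probabilities computed at the start of that pass, which reflects the statement that W^(t+1) and Θ^(t+1) are computed independently.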

The binary classification device 100 also executes the binary classification processing routine shown in FIG. 4.

First, in step S201, the test data set D_p stored in the test data database 26 is read.

In the next step S202, the score function is acquired from the score function generation unit 24. In step S203, the parameter values Ŵ and Θ̂ of the score function are read from the score function storage unit 40.

Then, in step S204, using the test data set D_p read in step S201 and the score function P̂(+|x; Ŵ, Θ̂) with computed parameters read in steps S202 and S203, the score value P̂(+|x_z; Ŵ, Θ̂) is computed for each test data item x_z, and the test data are sorted in descending order of score value. The output unit 90 then outputs the score values and the resulting order of the test data as the binary classification result and presents it to the user. If necessary, the test data set sorted in descending order of score is saved to an appropriate location, and the binary classification processing routine ends.

As described above, the binary classification device according to the first embodiment computes, for each unlabeled sample, the probability that the sample is positive example data and the probability that it is negative example data, using the evaluation value model, the generation probability model of positive example data, and the generation probability model of negative example data. Based on these probabilities, the labeled samples, and the unlabeled samples, it then computes the evaluation value model and the generation probability models of positive and negative example data. By repeating these computations, it can learn a score function that performs binary classification accurately even when the numbers of positive and negative examples differ greatly. The score function learned in this way can then be used to classify test data accurately.

Furthermore, in a binary classification problem that determines whether content represented by a feature vector relates to a specific type, the parameters of the score function, namely the parameter of the evaluation value model and the parameters of the generation probability models, are computed using the statistical information of both labeled and unlabeled samples at the same time. This learns a score function that gives positive examples larger scores than negative examples on the labeled sample set, while effectively incorporating the statistical information of the unlabeled samples to compensate for insufficient learning of the score function on features that do not appear in the labeled samples. With this technique, a classification device can be realized that extracts content related to the specific type from a set of new content with high accuracy, even in binary classification problems where the numbers of positive and negative examples differ greatly.

＜Second Embodiment＞
Next, a binary classification device according to a second embodiment will be described. Because the configuration of the binary classification device according to the second embodiment is the same as that of the first embodiment, the same reference numerals are used and its description is omitted.

The second embodiment differs from the first embodiment in the objective function used to compute the evaluation value model.

In the first embodiment, the objective function of equation (4) above was defined in order to use unlabeled samples for learning the evaluation value model. In the second embodiment, the evaluation value model is learned using the following objective function.

The first embodiment designs an objective function that compares, in parallel, the function values of the unlabeled samples with those of the positive labeled samples and the function values of the unlabeled samples with those of the negative labeled samples. The second embodiment instead designs an objective function for finding the parameter values that maximize an estimate of the AUC computed, each time a single unlabeled sample is added, from the function value of that unlabeled sample and the function values of the labeled sample set.

As in the first embodiment, a generation probability model p(x; θ_+) of positive example content and a generation probability model p(x; θ_−) of negative example content are introduced, and the parameter W of the evaluation value model, the parameters Θ = [θ_+ θ_−] of the generation probability models, and the class probabilities P(ω|x_m), ω ∈ {+, −}, ∀m, of the unlabeled samples are learned so as to maximize the following objective function.

Under the constraint

the method of Lagrange multipliers gives the following solution for the P(ω|x_m), ω ∈ {+, −}, ∀m, that maximize J(W, Θ, P) in equation (27):

where

In the present embodiment, the function P̂(+|x; W, Θ) of x given by equation (28) is used as the score function for determining whether content is a positive example.

Substituting equation (28) into equation (27) yields the objective function Ĵ(W, Θ) of W and Θ given by equation (31).

The W and Θ that maximize the objective function Ĵ(W, Θ) given by equation (31) above can be computed, for example, by introducing a function Q_1(W, Θ, W^(t), Θ^(t)) that satisfies

and repeatedly computing the W and Θ that maximize it. This gives a locally optimal solution that maximizes Ĵ(W, Θ) in the neighborhood of the initial values of W and Θ. Q_1 is defined as follows.

where

and J^(t)_g(Θ) satisfies equation (13) above. Unlike in the first embodiment, however, P̂(ω|x_m; W^(t), Θ^(t)) in equation (13) is the probability estimate computed by equation (28) above. Computing the estimates at step (t + 1) as

yields estimates of W and Θ that monotonically increase Ĵ(W, Θ). The solution of equation (20) above is obtained by computing W^(t+1) and Θ^(t+1) independently, as follows.

When the function f(x, W) used in the evaluation value model is designed as a linear function as in equation (17) and equation (18) is used as the regularization term R(W) of W, the solution W^(t+1) = w^(t+1) of equation (35) may be obtained by repeatedly computing the w that maximizes the following Q_2(w, w^(u)), which satisfies

so as to find the w that maximizes J^(t)_a(w) in the neighborhood of the initial value w^(t).

where

Because Q_2(w, w^(u)) is concave in w, the globally optimal w^(u+1) that maximizes Q_2(w, w^(u)) can be found using, for example, the BFGS algorithm.

In accordance with the principle described above, the initialization unit 30 of the score function generation unit 24 sets initial values for the parameter W of the evaluation value model and the parameters Θ of the generation probability models, as in the first embodiment.

For each unlabeled sample x_m, the score calculation unit 32 computes, according to equation (7) above, the probability P̂(+|x_m; W, Θ) that x_m is positive example data and the probability P̂(−|x_m; W, Θ) that it is negative example data. It uses the parameter W of the evaluation value model, either as initialized by the initialization unit 30 or as last computed by the evaluation value model calculation unit 34, and the parameters Θ of the generation probability models, either as initialized by the initialization unit 30 or as last computed by the generation probability model calculation unit 36.

Based on the probabilities P̂(+|x_m; W, Θ) and P̂(−|x_m; W, Θ) computed by the score calculation unit 32 for each unlabeled sample x_m, the labeled samples x⁺_i and x⁻_j, and the unlabeled samples x_m, the evaluation value model calculation unit 34 computes the parameter W of the evaluation value model according to equations (27) and (28) above so as to maximize the AUC (Area Under the Curve) value of the evaluation value model approximated with a sigmoid function.

Based on the probabilities P̂(+|x_m; W, Θ) and P̂(−|x_m; W, Θ) computed by the score calculation unit 32 for each unlabeled sample x_m, the labeled samples x⁺_i and x⁻_j, and the unlabeled samples x_m, the generation probability model calculation unit 36 computes, as in the first embodiment and according to equations (23) and (24) above, the parameters θ_{+,v} of the generation probability model of positive example data and the parameters θ_{−,v} of the generation probability model of negative example data.

As in the first embodiment, the convergence determination unit 38 causes the computations by the score calculation unit 32, the evaluation value model calculation unit 34, and the generation probability model calculation unit 36 to be repeated until the convergence condition of equation (25) above is satisfied, and then stores the parameter W of the evaluation value model and the parameters Θ of the generation probability models used in the score function in the score function storage unit 40.

The other configurations and operations of the binary classification device according to the second embodiment are the same as those of the first embodiment, and their description is therefore omitted.

<Experimental example>
Next, Tables 1 and 2 show experimental results obtained when the binary classification device 100 according to the first embodiment was applied to the 20 newsgroups database (20News, see Non-Patent Document 2), which is often used to evaluate the performance of automatic text classification devices. Table 1 shows the results for 100 labeled samples, and Table 2 shows the results for 200 labeled samples.

In this database, each entry consists of a content body and the category to which the content belongs, and there are 20 categories in total. For the performance evaluation, an evaluation data set was created for classifying content into one target category (TC) versus the other 19 categories. Using each of the 20 categories in the database as the target category (TC) in turn gave 20 evaluation data sets.
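The construction of the 20 evaluation data sets can be sketched as a one-versus-rest relabeling. The function name and the integer category encoding (0 to 19) are assumptions made for illustration only.

```python
def make_one_vs_rest_datasets(category_ids, n_categories=20):
    """Return {target_category: list of +1/-1 labels}, one entry per TC.

    A document gets label +1 (positive example) when its category equals
    the target category and -1 (negative example) otherwise.
    """
    datasets = {}
    for tc in range(n_categories):
        datasets[tc] = [1 if c == tc else -1 for c in category_ids]
    return datasets

# Toy input (assumption): the category ids of four documents.
category_ids = [0, 3, 19, 7]
datasets = make_one_vs_rest_datasets(category_ids)
```

With one category treated as positive and the other 19 as negative, each data set has far fewer positive than negative examples, which is the imbalanced setting the invention targets.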

In the experimental evaluation with each evaluation data set, the labeled samples and 5000 unlabeled samples used to calculate the parameter values of the binary classification device 100 were randomly extracted from that data set. Here, a labeled sample is a sample for which both the content body and the class information (positive or negative example) are used as training data, and an unlabeled sample is a sample for which only the content body is used as training data. That is, for content extracted as unlabeled samples, the parameters of the score function are calculated without using the class information recorded in the database. From the remaining content that was extracted as neither labeled nor unlabeled samples, 2000 documents were randomly extracted as content that a user wishes to classify (hereinafter, test samples) and used to evaluate binary classification performance. The AUC value, which is often used to evaluate binary classification, was used as the performance measure.
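The sampling protocol and the AUC measure described above can be sketched as follows. The function names, the random seed, and the data set size of 18000 are illustrative assumptions; only the counts of 5000 unlabeled and 2000 test samples come from the text.

```python
import numpy as np

def split_indices(n_total, n_labeled, n_unlabeled=5000, n_test=2000, seed=0):
    """Randomly draw disjoint labeled, unlabeled, and test index sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)
    labeled = perm[:n_labeled]
    unlabeled = perm[n_labeled:n_labeled + n_unlabeled]
    # Test samples come only from documents not used for training.
    test = perm[n_labeled + n_unlabeled:][:n_test]
    return labeled, unlabeled, test

def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels != 1]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

labeled, unlabeled, test = split_indices(n_total=18000, n_labeled=100)
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, -1, 1, -1])  # 3 of 4 pairs correct
```

In the small example, three of the four positive/negative pairs are ordered correctly, so `example_auc` is 0.75.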

Tables 1 and 2 show the results when the parameter values of the discriminant function of the binary classification device 100 were calculated using 100 and 200 labeled samples, respectively. For each combination of the number of labeled samples and the target class, random sample extraction and experimental evaluation were repeated 10 times, and the average AUC value was calculated. In the tables, the "Method 1" column shows the average AUC value obtained with the first embodiment, "Method 2" shows the average AUC value obtained by applying the technique of Non-Patent Document 2 to binary classification, and "Method 3" shows the average AUC value obtained when the function of the evaluation value model described in the first embodiment was trained using only the labeled samples by maximizing the objective function of equation (10). The values in parentheses are the standard deviations of the AUC values obtained in the 10 experimental evaluations.

Tables 1 and 2 show that Method 1 achieved higher classification accuracy than Methods 2 and 3 with both 100 and 200 labeled samples. The comparison between Method 1 and Method 3 confirms that using unlabeled samples to learn the score function in the binary classification device according to the present invention improves classification accuracy. The comparison between Method 1 and Method 2 shows that the binary classification device according to the first embodiment of the present invention is superior in classification performance to the classification device based on the prior art of Non-Patent Document 2.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the above embodiments describe a case in which the binary classification learning device and the binary classification device are implemented as a single binary classification device, but the invention is not limited to this, and the binary classification learning device and the binary classification device may be provided separately. In that case, the binary classification learning device may include the input unit 10, the training data database 22, the score function generation unit 24, and the output unit 90, and the binary classification device may include the input unit 10, the test data database 26, the score function storage unit 40, the ranking unit 28, and the output unit 90.

Although this specification describes embodiments in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium or provided via a network.

Description of symbols
10 Input unit
20 Operation unit
22 Training data database
24 Score function generation unit
26 Test data database
28 Ranking unit
30 Initialization unit
32 Score calculation unit
34 Evaluation value model calculation unit
36 Generation probability model calculation unit
38 Convergence determination unit
40 Score function storage unit
90 Output unit
100 Binary classification device

Claims

A binary classification learning device that learns a score function for binary classification based on training data consisting of labeled samples, each of which is labeled as either positive example data related to a specific type or negative example data not related to the specific type, and unlabeled samples, for which it is unknown whether they are the positive example data or the negative example data, the device comprising:
a score calculation unit that calculates, for each of the unlabeled samples, a score indicating whether the unlabeled sample is the positive example data, using an evaluation value model expressed using a function that outputs a value indicating whether data is the positive example data, a generation probability model of the positive example data, and a generation probability model of the negative example data;
an evaluation value model calculation unit that calculates the evaluation value model based on the score for each of the unlabeled samples calculated by the score calculation unit, the labeled samples, and the unlabeled samples;
a generation probability model calculation unit that calculates the generation probability model of the positive example data and the generation probability model of the negative example data based on the score for each of the unlabeled samples calculated by the score calculation unit, the labeled samples, and the unlabeled samples; and
a convergence determination unit that causes the calculation by the score calculation unit, the calculation by the evaluation value model calculation unit, and the calculation by the generation probability model calculation unit to be repeated until a predetermined convergence determination condition is satisfied, and outputs the score function using the evaluation value model and the generation probability models.

The binary classification learning device according to claim 1, wherein the function of the evaluation value model is a linear function of a feature vector extracted from data, and
the evaluation value model calculation unit calculates the evaluation value model so as to maximize an AUC (Area Under the Curve) value approximated by applying a sigmoid function to the evaluation value model.

The binary classification learning device according to claim 1, wherein the generation probability model of the positive example data is obtained by modeling the probability distribution of the positive example data using a naive Bayes model,
the generation probability model of the negative example data is obtained by modeling the probability distribution of the negative example data using a naive Bayes model, and
the generation probability model calculation unit calculates the generation probability model of the positive example data and the generation probability model of the negative example data so as to maximize the sum of the log likelihood of the labeled samples and the expected log likelihood of the unlabeled samples, both obtained using the generation probability model of the positive example data and the generation probability model of the negative example data.

The binary classification learning device according to claim 3, wherein the score calculation unit calculates the score for each of the unlabeled samples taking into account an entropy regularization term of the probability value given by the score,
the evaluation value model calculation unit calculates the evaluation value model taking into account a regularization term on the model parameters of the evaluation value model, and
the generation probability model calculation unit calculates the generation probability model of the positive example data and the generation probability model of the negative example data taking into account a regularization term on the model parameters of the generation probability models.

A binary classification device comprising a score calculation unit that calculates a score value indicating that input test data is a positive example, based on the test data and the score function learned by the binary classification learning device according to any one of claims 1 to 4.

A binary classification learning method in a binary classification learning device that includes a score calculation unit, an evaluation value model calculation unit, a generation probability model calculation unit, and a convergence determination unit, and that learns a score function for binary classification based on training data consisting of labeled samples, each of which is labeled as either positive example data related to a specific type or negative example data not related to the specific type, and unlabeled samples, for which it is unknown whether they are the positive example data or the negative example data, the method comprising:
the score calculation unit calculating, for each of the unlabeled samples, a score indicating whether the unlabeled sample is the positive example data, using an evaluation value model expressed using a function that outputs a value indicating whether data is the positive example data, a generation probability model of the positive example data, and a generation probability model of the negative example data;
the evaluation value model calculation unit calculating the evaluation value model based on the score for each of the unlabeled samples calculated by the score calculation unit, the labeled samples, and the unlabeled samples;
the generation probability model calculation unit calculating the generation probability model of the positive example data and the generation probability model of the negative example data based on the score for each of the unlabeled samples calculated by the score calculation unit, the labeled samples, and the unlabeled samples; and
the convergence determination unit causing the calculation by the score calculation unit, the calculation by the evaluation value model calculation unit, and the calculation by the generation probability model calculation unit to be repeated until a predetermined convergence determination condition is satisfied, and outputting the score function using the evaluation value model and the generation probability models.

A binary classification method in a binary classification device including a score calculation unit, wherein
the score calculation unit calculates a score value indicating that input test data is a positive example, based on the test data and the score function learned by the binary classification learning method according to claim 6.

A program for causing a computer to function as each unit of the binary classification learning device according to any one of claims 1 to 4 or of the binary classification device according to claim 5.