CN109522954A

CN109522954A - Heterogeneous information network link prediction device

Info

Publication number: CN109522954A
Application number: CN201811357907.8A
Authority: CN
Inventors: 陈可佳; 张培
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2018-11-14
Filing date: 2018-11-14
Publication date: 2019-03-26

Abstract

A heterogeneous information network link prediction device, comprising: a setting unit, adapted to set the meta-paths between node pairs in the heterogeneous network to be predicted, the maximum length of the meta-paths, and a type label corresponding to each meta-path type; a construction unit, adapted to extract heterogeneous topological features between node pairs based on the meta-paths, construct sample vectors, and form a sample set, the sample set including a training set and a test set; a classification learning unit, adapted to perform multi-label classification learning based on the training set and the test set in the sample set to obtain a corresponding multi-label classifier; and a prediction unit, adapted to use the trained multi-label classifier to predict the unknown relationships between nodes in the heterogeneous network to be predicted. The above solution can improve the accuracy of link prediction in heterogeneous information networks.

Description

Heterogeneous Information Network Link Prediction Device

Technical Field

The invention belongs to the technical field of data analysis, and in particular relates to a heterogeneous information network link prediction device.

Background

Many complex systems in the real world can be formalized as networks, with nodes representing objects and links representing interactions between objects. Most of these networks are heterogeneous networks, which contain various types of objects and relationships and usually consist of multiple sub-networks. For example, the online social network Twitter contains information of types such as basic user information, user location, and user tweeting actions, with relationship types such as posting/replying to/retweeting tweets, following/being followed, check-ins, and so on.

As a key problem in link mining, link prediction aims to predict the formation of future links based on the current or historical network. It has wide applications in bibliographic networks, biological networks, social networks, and other fields. Most existing link prediction methods are designed for homogeneous information networks, whose nodes and links are all of the same type. Recently, there has been great interest in advancing link prediction in heterogeneous networks because of its broader application prospects.

However, link prediction in heterogeneous networks in the prior art suffers from low prediction accuracy.

Summary of the Invention

The technical problem solved by the present invention is how to improve the accuracy of link prediction in heterogeneous information networks.

To achieve the above object, an embodiment of the present invention provides a heterogeneous information network link prediction device, the device comprising:

a setting unit, adapted to set the meta-paths between node pairs in the heterogeneous network to be predicted, the maximum length of the meta-paths, and a type label corresponding to each meta-path type;

a construction unit, adapted to extract heterogeneous topological features between node pairs based on the meta-paths, construct sample vectors, and form a sample set, the sample set including a training set and a test set;

a classification learning unit, adapted to perform multi-label classification learning based on the training set and the test set in the sample set to obtain a corresponding multi-label classifier;

a prediction unit, adapted to use the trained multi-label classifier to predict the unknown relationships between nodes in the heterogeneous network to be predicted.

Optionally, the classification learning unit is adapted to: select from the training set the training subset corresponding to each label pair formed by every two of the set type labels, and perform binary classification learning on each selected training subset to obtain multiple binary classifiers in one-to-one correspondence with the label pairs; input the test set into each of the trained binary classifiers, and calculate the first votes that the instances corresponding to the samples in the test set obtain on each type label; add the corresponding virtual label to each sample in the corresponding training subset to obtain the training subset corresponding to the label pair formed by each type label and the virtual label, and train on the resulting training subsets to obtain multiple auxiliary binary classifiers in one-to-one correspondence with the type labels, the virtual label being used to mark the split point between the type labels that are relevant and those that are irrelevant to the samples in the corresponding training subset; input the test set into each of the trained auxiliary binary classifiers, and calculate the second votes that the instances corresponding to the samples in the test set obtain on each type label and the third votes that they obtain on the virtual label; and determine the final multi-label classifier based on the first and second votes obtained on each type label and the third votes obtained on the virtual label by the instances corresponding to the test samples.

Optionally, the heterogeneous topological features between the node pairs include path count features and random walk features.

Optionally, the classification learning unit is adapted to calculate, using the following formula, the first votes that the instances corresponding to the samples in the test set obtain on each type label:

where $\zeta(x_i, l_j)$ denotes the votes obtained by instance $x_i$ on label $l_j$, and $\mathrm{Clf}_{jk}$ denotes the binary classifier corresponding to the label pair $(l_j, l_k)$; one indicator term of the formula denotes that the sample is correctly predicted as a negative example in the training subset, and the other denotes that the sample is correctly predicted as a positive example in the training subset.

Optionally, the classification learning unit is adapted to calculate, using the following formula, the second votes that the instances corresponding to the samples in the test set obtain on each type label:

where $\zeta(x_i, l_j)$ denotes the not-yet-updated votes obtained by instance $x_i$ on label $l_j$, $\mathrm{Clf}_{jk}$ denotes the binary classifier corresponding to the label pair $(l_j, l_k)$, and the indicator term denotes that the sample is correctly predicted as a positive example in the training subset.

where $\zeta^*(x_i, l_s)$ denotes the votes obtained by instance $x_i$ on the virtual label $l_s$, and the indicator term denotes that the sample is correctly predicted as a negative example in the training subset.

Optionally, the classification learning unit is adapted to determine the final multi-label classifier, based on the first and second votes obtained on each type label and the third votes obtained on the virtual label by the instances corresponding to the test samples, using the following formula:

$h(x) = \{\, l_j \mid \zeta^*(x, l_j) > \zeta^*(x, l_s) \,\}$;

where $h(x)$ denotes the multi-label classifier, $l_j$ denotes the j-th type label, $\zeta^*(x, l_j)$ denotes the final votes obtained by instance $x$ on label $l_j$, and $\zeta^*(x, l_s)$ denotes the votes obtained by instance $x$ on the virtual label $l_s$.

Optionally, the device may further include:

a calculation and output unit, adapted to calculate and output the dependency scores between the label pairs.

Compared with the prior art, the beneficial effects of the present invention are as follows:

In the above solution, multi-label classification learning is performed based on the training set and the test set in the sample set to obtain a corresponding multi-label classifier, and the trained multi-label classifier is used to predict the unknown relationships between nodes in the heterogeneous network to be predicted. This makes it possible to predict not only the direct links between nodes but also other relationships between nodes, namely meta-paths, so that an instance has multiple type labels rather than a single one, which can improve the accuracy of link prediction in heterogeneous networks.

Further, calculating and outputting the dependency scores between the label pairs can provide suggestions on how new links form and help to explore the regularities of link and relationship formation in complex networks, which is convenient and practical.

Description of Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a heterogeneous information network link prediction method according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a training method for a multi-label classifier according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a heterogeneous information network link prediction device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application. Directional indications in the embodiments of the present invention (such as up, down, left, right, front, and back) are only used to explain the relative positional relationships, movements, and the like of the components in a particular posture (as shown in the drawings); if that posture changes, the directional indications change accordingly.

As described in the Background section, link prediction in heterogeneous networks faces the following problems:

(1) Representation of topological features: because objects and links are heterogeneous, the topological features used in homogeneous networks cannot be applied directly. The neighbors of two nodes can be of different types, so the number of common neighbors cannot express this heterogeneity.

(2) Interdependence between different relationships: in the present invention, a relationship refers not only to explicit links but also to indirect connections. Modeling the correlations between different types of relationships is important because they may influence one another. For example, in a co-authorship network, two scholars may become co-authors (one type of relationship) by attending the same conference (another type of relationship).

The simple approach of studying homogeneous projections of the network separately avoids the first problem but loses information, because it ignores the dependency patterns between types. Although heterogeneity makes link prediction considerably more difficult, it also provides rich and valuable information for understanding the underlying mechanisms of link formation. Research on link prediction follows the common understanding that a few links may be generated randomly, but most links are generated according to latent patterns. The information of a target link depends on the latent relationships between the nodes of the target link.

If one considers predicting not only the direct links between nodes but also other relationships between nodes, namely meta-paths, the traditional binary supervised classification approach no longer applies, and multi-label learning is required.

Real-world objects often do not have a single unique meaning but may be polysemous. For example, a picture may convey several pieces of information at once, such as "blue sky", "stream", "cattle", and "cooking smoke". Because polysemous objects no longer have a unique meaning, the traditional single-label supervised learning framework struggles to achieve good results on them.

In the technical solution of the present invention, multi-label classification learning is performed based on the training set and the test set in the sample set to obtain a corresponding multi-label classifier, and the trained multi-label classifier is used to predict the unknown relationships between nodes in the heterogeneous network to be predicted. This makes it possible to predict not only the direct links between nodes but also other relationships between nodes, namely meta-paths, so that an instance has multiple type labels rather than a single one, which can improve the accuracy of link prediction in heterogeneous networks.

To make the above objects, features, and beneficial effects of the present invention easier to understand, specific embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a schematic flowchart of a heterogeneous information network link prediction method according to an embodiment of the present invention. Referring to FIG. 1, a heterogeneous information network link prediction method may specifically include the following steps:

Step S101: Set the meta-paths between node pairs in the heterogeneous network to be predicted, the maximum length of the meta-paths, and a type label corresponding to each meta-path type.

In a specific implementation, a meta-path is a path in the heterogeneous network from one node to another through a series of typed links. For example, the meta-path $P_k = (V_1 E_1 V_2 E_2 \ldots E_{n-1} V_n)$ denotes a path on which node $V_1$ reaches node $V_n$ through a series of typed links $E_i$ ($i = 1, 2, \ldots, n-1$), which can be written as $V_1 \xrightarrow{E_1} V_2 \xrightarrow{E_2} \cdots \xrightarrow{E_{n-1}} V_n$.

For simplicity, the embodiments of the present invention denote each node type by the capitalized first letter of its English name. For example, in the DBLP co-author network, P denotes paper, A denotes author, T denotes topic, V denotes venue (conference), and U denotes user; the meta-path describing the citation relationship between authors can then be abbreviated as "APPA".

In a specific implementation, those skilled in the art can set the meta-paths between node pairs in the heterogeneous network to be predicted, the maximum length of the meta-paths, and the type label corresponding to each meta-path type according to actual needs, which is not limited here.
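As an illustration of step S101, the following minimal sketch enumerates candidate meta-paths between two node types up to a maximum length and assigns one type label per meta-path. The schema `SCHEMA`, the function `enumerate_meta_paths`, and the label names `l1`, `l2` are hypothetical and are not taken from the patent; node types are assumed to be single capital letters, as in the "APPA" shorthand.

```python
from collections import deque

# Hypothetical DBLP-style schema: which node types can be linked directly.
SCHEMA = {
    "A": ["P"],                 # author writes paper
    "P": ["A", "P", "T", "V"],  # paper: written by, cites, has topic, published at
    "T": ["P"],
    "V": ["P"],
}

def enumerate_meta_paths(schema, start, end, max_length):
    """Return all node-type sequences from `start` to `end` that use at
    most `max_length` links, written as strings such as "APPA"."""
    found = []
    queue = deque([start])
    while queue:
        path = queue.popleft()
        links = len(path) - 1
        if links >= 1 and path[-1] == end:
            found.append(path)
        if links < max_length:
            for next_type in schema[path[-1]]:
                queue.append(path + next_type)
    return found

# Meta-paths between two authors with at most three links.
meta_paths = enumerate_meta_paths(SCHEMA, "A", "A", max_length=3)
# One type label per meta-path type (label names are illustrative).
type_labels = {path: f"l{idx}" for idx, path in enumerate(meta_paths, start=1)}
```

Under this assumed schema and a maximum length of 3, only "APA" (co-authorship) and "APPA" (citation between authors) qualify; raising the maximum length admits longer paths such as "APVPA".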

Step S102: Extract heterogeneous topological features between node pairs based on the meta-paths, construct sample vectors, and form a sample set; the sample set includes a training set and a test set.

In an embodiment of the present invention, the heterogeneous topological features between the node pairs include path count features and random walk features. The path count is the total number of meta-paths of the corresponding type between a node pair. The random walk features between nodes are in one-to-one correspondence with the meta-path types, that is, each meta-path type has one corresponding random walk feature, which can be calculated by the following formula:

where the first quantity is the number of meta-paths of the k-th type that start at node $v_i$ and end at node $v_j$, and the second quantity is the total number of meta-paths of the k-th type that start at $v_i$.
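The two features can be sketched as follows, under the assumption that the random walk feature is the ratio of the two quantities just defined, that is, the path count from $v_i$ to $v_j$ divided by the total path count leaving $v_i$ along the same meta-path type. The helper names and the toy authorship matrix are hypothetical; path counts are obtained by chaining the adjacency matrices along the meta-path.

```python
def mat_mul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def path_count_matrix(link_matrices):
    """Chain the adjacency matrices along one meta-path; entry [i][j] is
    the number of path instances from start node i to end node j."""
    result = link_matrices[0]
    for matrix in link_matrices[1:]:
        result = mat_mul(result, matrix)
    return result

def random_walk(pc, i, j):
    """Share of the meta-path instances leaving node i that end at node j."""
    total = sum(pc[i])
    return pc[i][j] / total if total else 0.0

# Toy data (hypothetical): 3 authors x 2 papers authorship matrix.
author_paper = [[1, 0],
                [1, 1],
                [0, 1]]
paper_author = [list(col) for col in zip(*author_paper)]  # transpose

pc_apa = path_count_matrix([author_paper, paper_author])  # meta-path "APA"
rw_01 = random_walk(pc_apa, 0, 1)
```

Note that this sketch also counts paths that return to the start node (the diagonal of `pc_apa`); whether such paths should be excluded is a design choice the text does not settle.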

As can be seen from the above, when the number of meta-path types in the heterogeneous network to be predicted is K, and each meta-path type between a node pair is quantified by two heterogeneous topological features, path count and random walk, each node pair can be described by 2K heterogeneous topological features. In other words, each node pair is described by a corresponding 2K-dimensional vector, which is called the instance corresponding to that node pair.

Let $X$ denote the 2K-dimensional instance space and $L = \{l_1, l_2, \ldots, l_K\}$ the label space, which contains K meta-path types. In the sample set formed by the sample vectors constructed from the heterogeneous topological features extracted between node pairs based on the meta-paths, $D = \{(x_i, L_i) \mid 1 \le i \le m\}$ denotes the multi-label training set, where $m$ is the number of samples in the training set and $(x_i, L_i)$ is a multi-label sample: $x_i \in X$ is the 2K-dimensional feature vector describing a node pair $(v_i, v_j)$, each dimension of $x_i$ can be a heterogeneous feature or a homogeneous feature, and $L_i \subseteq L$ is the label set of $x_i$. $T = \{(x_i, L_i) \mid m+1 \le i \le n\}$ is the multi-label test set, where $n$ is the total number of samples in the sample set.
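The sample structure described above can be sketched as follows. The ordering of the 2K dimensions (all path counts first, then all random walk values), the helper names, and the toy values are assumptions made for illustration only.

```python
def build_instance(path_counts, random_walks):
    """Concatenate K path-count values and K random-walk values into one
    2K-dimensional instance vector (the ordering is an assumption)."""
    if len(path_counts) != len(random_walks):
        raise ValueError("expected one value of each feature per meta-path type")
    return list(path_counts) + list(random_walks)

def split_samples(samples, m):
    """The first m samples form the training set D, the rest the test set T."""
    return samples[:m], samples[m:]

# Toy samples with K = 2 meta-path types; each sample is (x_i, L_i).
samples = [
    (build_instance([1, 0], [0.5, 0.0]), {"l1"}),
    (build_instance([2, 1], [0.5, 1.0]), {"l1", "l2"}),
    (build_instance([0, 3], [0.0, 0.75]), {"l2"}),
]
D, T = split_samples(samples, m=2)
```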

Step S103: Perform multi-label classification learning based on the training set and the test set in the sample set to obtain a corresponding multi-label classifier.

In a specific implementation, once the corresponding training set and test set have been constructed, the resulting multi-label training set D can be used for multi-label classification learning to train a multi-label classifier, and the trained multi-label classifier is then optimized using the multi-label test set T. See the detailed description of FIG. 2, which is not repeated here.

Step S104: Use the trained multi-label classifier to predict the unknown relationships between nodes in the heterogeneous network to be predicted.

In a specific implementation, once the corresponding multi-label classifier has been trained, it can be used to predict the unknown part of the network to be predicted, that is, to predict the unknown connection relationships between unconnected target node pairs in the network, yielding the predicted new network graph.

In an embodiment of the present invention, the method may further include:

Step S105: Calculate and output the dependency scores between the label pairs.

In an embodiment of the present invention, the dependency scores between the label pairs can be calculated with a chi-square test tool and output. This provides users with suggestions on how new links between node pairs form and helps to explore the regularities of link and relationship formation in complex networks, which is convenient and practical.
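One way to obtain such a dependency score is the Pearson chi-square statistic of the 2x2 co-occurrence table of two labels. The sketch below is a plain-Python illustration without continuity correction; the function name and the toy label sets are hypothetical, and the text does not say which chi-square tool or variant is actually used.

```python
def chi_square_score(label_sets, a, b):
    """Pearson chi-square statistic for the 2x2 co-occurrence table of
    labels `a` and `b` over a list of label sets (no continuity correction)."""
    n = len(label_sets)
    if n == 0:
        return 0.0
    observed = [[0, 0], [0, 0]]
    for labels in label_sets:
        observed[int(a in labels)][int(b in labels)] += 1
    row = [sum(r) for r in observed]
    col = [observed[0][c] + observed[1][c] for c in range(2)]
    score = 0.0
    for r in range(2):
        for c in range(2):
            expected = row[r] * col[c] / n
            if expected:
                score += (observed[r][c] - expected) ** 2 / expected
    return score

# l1 and l2 always occur together in this toy data, so the score is high.
label_sets = [{"l1", "l2"}, {"l1", "l2"}, {"l3"}, {"l3"}]
dependent = chi_square_score(label_sets, "l1", "l2")
```

A larger score suggests that the two relationship types tend to appear (or be absent) together, while a score near zero suggests they are independent.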

FIG. 2 is a schematic flowchart of a training method for a multi-label classifier according to an embodiment of the present invention. Referring to FIG. 2, a training method for a multi-label classifier may specifically include the following steps:

Step S201: Select from the training set the training subset corresponding to each label pair formed by every two of the set type labels, and perform binary classification learning on each selected training subset to obtain multiple binary classifiers in one-to-one correspondence with the label pairs.

In a specific implementation, for the label space $L = \{l_1, l_2, \ldots, l_K\}$, the samples in the training subset corresponding to the label pair formed by every two type labels are selected from the multi-label training set D, and the subset can be defined as:

$D_{jk} = \{(x_i, \psi(L_i, l_j, l_k)) \mid \phi(L_i, l_j) \neq \phi(L_i, l_k)\}$ (2)

where $D_{jk}$ denotes the training subset corresponding to the label pair $(l_j, l_k)$.

In a specific implementation, once the training subset $D_{jk}$ corresponding to the label pair $(l_j, l_k)$ has been selected from the multi-label training set D, a classification learning algorithm can be applied to it for binary classification learning, yielding multiple binary classifiers in one-to-one correspondence with the label pairs $(l_j, l_k)$.
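The construction of the pairwise subsets in formula (2) can be sketched as follows. The text does not define $\phi$ and $\psi$; the sketch assumes the usual convention that $\phi(L_i, l)$ is +1 when label $l$ belongs to $L_i$ and -1 otherwise, and that $\psi$ is +1 when $l_j$ is the relevant label of the pair and -1 when $l_k$ is. All names and the toy data are illustrative.

```python
from itertools import combinations

def phi(label_set, label):
    """+1 if `label` is relevant to the sample, otherwise -1 (assumed convention)."""
    return 1 if label in label_set else -1

def pairwise_subsets(train, labels):
    """Build D_jk for every label pair: keep the samples whose relevance
    differs on l_j and l_k, with target +1 when l_j is the relevant one."""
    subsets = {}
    for lj, lk in combinations(labels, 2):
        subsets[(lj, lk)] = [
            (x, 1 if phi(label_set, lj) == 1 else -1)
            for x, label_set in train
            if phi(label_set, lj) != phi(label_set, lk)
        ]
    return subsets

train = [([1.0, 0.2], {"l1"}), ([0.0, 0.9], {"l2"}), ([0.5, 0.5], {"l1", "l2"})]
D_pairs = pairwise_subsets(train, ["l1", "l2", "l3"])
```

With K type labels this yields K(K-1)/2 subsets, one per binary classifier; samples that carry both labels of a pair, or neither, carry no preference information for that pair and are left out.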

Step S202: Input the test set into each of the trained binary classifiers, and calculate the first votes that the instances corresponding to the samples in the test set obtain on each type label.

In an embodiment of the present invention, the first votes that the instances corresponding to the samples in the test set obtain on each type label are calculated using the following formula:
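A form of the first-vote formula that is consistent with the symbol definitions given in this document would be the following. It assumes the standard pairwise voting scheme of calibrated label ranking and the sign convention that a positive output of $\mathrm{Clf}_{jk}$ favors $l_j$; both are assumptions rather than statements of the patent.

```latex
% Assumed reconstruction: l_j receives one vote from every pairwise
% classifier that prefers l_j over the other label of its pair.
\zeta(x_i, l_j) = \sum_{k=1}^{j-1} \mathbf{1}\left[\mathrm{Clf}_{kj}(x_i) \le 0\right]
               + \sum_{k=j+1}^{K} \mathbf{1}\left[\mathrm{Clf}_{jk}(x_i) > 0\right]
```

Here $\mathbf{1}[\cdot]$ equals 1 when its condition holds and 0 otherwise, so with K type labels each label can collect at most K-1 first votes.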

Step S203: Add the corresponding virtual label to each sample in the corresponding training subset to obtain the training subset corresponding to the label pair formed by each type label and the virtual label, and train on the resulting training subsets to obtain multiple auxiliary binary classifiers in one-to-one correspondence with the type labels.

In a specific implementation, a virtual label $l_s$ is added to each sample to mark the split point between the labels that are relevant to $x_i$ and those that are irrelevant to it. A corresponding training set $D_{js}$ is obtained for each new label pair $(l_j, l_s)$, and a binary classification learning algorithm is applied to each $D_{js}$, yielding the corresponding K auxiliary binary classifiers $\mathrm{Clf}_{js}$.
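The training sets for the auxiliary classifiers can be sketched as follows, under the assumption that every training sample contributes to every $D_{js}$ with target +1 when $l_j$ is relevant (ranked above the virtual label) and -1 otherwise. The function name and the toy data are hypothetical.

```python
def virtual_label_subsets(train, labels):
    """Build D_js for each real label l_j: every training sample is kept,
    with target +1 if l_j is relevant (above the virtual label l_s) and
    -1 otherwise (below it)."""
    return {
        lj: [(x, 1 if lj in label_set else -1) for x, label_set in train]
        for lj in labels
    }

train = [([1.0, 0.2], {"l1"}), ([0.0, 0.9], {"l2"}), ([0.5, 0.5], {"l1", "l2"})]
D_virtual = virtual_label_subsets(train, ["l1", "l2", "l3"])
```

Unlike the pairwise subsets, no sample is discarded here, because the virtual label is by construction never in the same relevance class as a real label.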

Step S204: Input the test set into each of the trained auxiliary binary classifiers, and calculate the second votes that the instances corresponding to the samples in the test set obtain on each type label and the third votes that they obtain on the virtual label.

In an embodiment of the present invention, the second votes that the instances corresponding to the samples in the test set obtain on each type label are calculated using the following formula:

In an embodiment of the present invention, the third votes that the instances corresponding to the samples in the test set obtain on the virtual label are calculated using the following formula:
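Forms of the second-vote and third-vote formulas that agree with the symbol definitions in this document and with the final decision rule (8) would be the following. They assume the standard calibrated label ranking update, in which each auxiliary classifier $\mathrm{Clf}_{js}$ gives its vote either to $l_j$ or to the virtual label $l_s$; this is an assumption, not a statement of the patent.

```latex
% Assumed reconstruction of the second vote (update of a real label)
% and the third vote (votes collected by the virtual label l_s).
\begin{aligned}
\zeta^{*}(x_i, l_j) &= \zeta(x_i, l_j) + \mathbf{1}\left[\mathrm{Clf}_{js}(x_i) > 0\right] \\
\zeta^{*}(x_i, l_s) &= \sum_{j=1}^{K} \mathbf{1}\left[\mathrm{Clf}_{js}(x_i) \le 0\right]
\end{aligned}
```

Under this reading, $\zeta(x_i, l_j)$ is the not-yet-updated first vote, and the virtual label can collect between 0 and K votes.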

Step S205: Determine the final multi-label classifier based on the first and second votes obtained on each type label and the third votes obtained on the virtual label by the instances corresponding to the test samples.

In an embodiment of the present invention, the final multi-label classifier can be expressed as:

$h(x) = \{\, l_j \mid \zeta^*(x, l_j) > \zeta^*(x, l_s) \,\}$ (8)
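The decision rule (8) can be sketched end to end as follows. The classifiers are stand-in scoring functions written by hand, not trained models, and the sign convention (a positive output favors the first label of a pair, or the real label over the virtual one) is an assumption; `predict_labels` and all other names are illustrative.

```python
from itertools import combinations

def predict_labels(x, labels, pair_clf, aux_clf):
    """Calibrated pairwise voting.
    pair_clf[(lj, lk)](x) > 0 votes for lj, otherwise for lk;
    aux_clf[lj](x) > 0 votes for lj, otherwise for the virtual label."""
    votes = {label: 0 for label in labels}
    for (lj, lk), clf in pair_clf.items():
        votes[lj if clf(x) > 0 else lk] += 1   # first votes
    virtual_votes = 0
    for lj, clf in aux_clf.items():
        if clf(x) > 0:
            votes[lj] += 1                     # second vote
        else:
            virtual_votes += 1                 # third vote
    # Rule (8): keep the labels that beat the virtual label.
    return {label for label in labels if votes[label] > virtual_votes}

# Stand-in classifiers (hypothetical): x holds one score per label.
labels = ["l1", "l2", "l3"]
index = {label: i for i, label in enumerate(labels)}
pair_clf = {
    (a, b): (lambda x, a=a, b=b: x[index[a]] - x[index[b]])
    for a, b in combinations(labels, 2)
}
aux_clf = {a: (lambda x, a=a: x[index[a]] - 0.5) for a in labels}

predicted = predict_labels([0.8, 0.6, 0.1], labels, pair_clf, aux_clf)
```

In this toy run l1 collects 3 votes, l2 collects 2, l3 collects 0, and the virtual label collects 1, so the predicted set is {l1, l2}. The virtual label acts as a per-instance threshold, so the number of predicted labels can vary from one node pair to the next.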

The heterogeneous information network link prediction method in the embodiments of the present invention has been described in detail above; the device corresponding to the method is introduced below.

FIG. 3 is a schematic structural diagram of a heterogeneous information network link prediction device according to an embodiment of the present invention. Referring to FIG. 3, a heterogeneous information network link prediction device 30 may include a setting unit 301, a construction unit 302, a classification learning unit 303, and a prediction unit 304, wherein:

The setting unit 301 is adapted to set the meta-paths between node pairs in the heterogeneous network to be predicted, the maximum length of the meta-paths, and a type label corresponding to each meta-path type.

The construction unit 302 is adapted to extract heterogeneous topological features between node pairs based on the meta-paths, construct sample vectors, and form a sample set; the sample set includes a training set and a test set.

The classification learning unit 303 is adapted to perform multi-label classification learning based on the training set and the test set in the sample set to obtain a corresponding multi-label classifier.

The prediction unit 304 is adapted to use the trained multi-label classifier to predict the unknown relationships between nodes in the heterogeneous network to be predicted.

In a specific implementation, the classification learning unit 303 is adapted to: select from the training set, for each label pair formed by any two of the set type labels, the corresponding training subset, and perform binary classification learning on each selected training subset to obtain a plurality of binary classifiers in one-to-one correspondence with the label pairs; input the test set into each of the trained binary classifiers and calculate the first vote obtained on each type label by the instance corresponding to each sample in the test set; add the corresponding virtual label to each sample in the corresponding training subset to obtain the training subset corresponding to each label pair formed by a type label and the virtual label, and train on the obtained training subsets to obtain a plurality of auxiliary binary classifiers in one-to-one correspondence with the type labels, where the virtual label marks the split point between the type labels that are relevant and those that are irrelevant to a sample in the corresponding training subset; input the test set into each of the trained auxiliary binary classifiers and calculate the second vote obtained on each type label and the third vote obtained on the virtual label by the instance corresponding to each sample in the test set; and determine the final multi-label classifier based on the first vote and the second vote obtained on each type label and the third vote obtained on the virtual label by the instance corresponding to the test sample.
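The construction of the training subsets can be illustrated with a minimal Python sketch. The text does not state how samples are selected for a label pair, so the rule used here is an assumption borrowed from calibrated label ranking: a sample is kept for the pair (j, k) only when exactly one of the two labels is relevant to it. The function names `pairwise_subsets` and `virtual_subsets` and the toy data are hypothetical.

```python
from itertools import combinations


def pairwise_subsets(samples, label_sets, num_labels):
    """Build one binary training subset per label pair (j, k) with j < k.

    Assumption: a sample is kept only if exactly one of the two labels is
    relevant to it; its target is 1 when label j is the relevant one.
    """
    subsets = {}
    for j, k in combinations(range(num_labels), 2):
        rows = []
        for features, labels in zip(samples, label_sets):
            if (j in labels) != (k in labels):
                rows.append((features, 1 if j in labels else 0))
        subsets[(j, k)] = rows
    return subsets


def virtual_subsets(samples, label_sets, num_labels):
    """Build one binary training subset per (type label, virtual label) pair.

    Every sample is used; the target is 1 when the type label is relevant,
    i.e. when it is ranked above the virtual label.
    """
    return {
        j: [(features, 1 if j in labels else 0)
            for features, labels in zip(samples, label_sets)]
        for j in range(num_labels)
    }


# Toy data (hypothetical): three node-pair samples and three type labels.
samples = [[1.0], [2.0], [3.0]]
label_sets = [{0}, {0, 1}, {2}]
pair_data = pairwise_subsets(samples, label_sets, 3)
virtual_data = virtual_subsets(samples, label_sets, 3)
```

Each subset can then be passed to any binary learner; the choice of learner is left open here because the text does not fix one.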

In a specific implementation, the heterogeneous topological features between the node pairs include path count features and random walk features.
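As a rough illustration of these two feature types, the sketch below computes them for a single meta-path with NumPy. It assumes the commonly used definitions: the path count is the number of path instances between two nodes along the meta-path, and the random walk feature is that count normalized by the total number of path instances leaving the source node. These definitions, the function names, and the toy author-paper matrix are assumptions for illustration only.

```python
import numpy as np


def path_count(adjacencies):
    """Count path instances along a meta-path.

    `adjacencies` is the list of adjacency matrices of the relations that
    make up the meta-path, in order; their product gives the counts.
    """
    counts = np.asarray(adjacencies[0], dtype=float)
    for adjacency in adjacencies[1:]:
        counts = counts @ np.asarray(adjacency, dtype=float)
    return counts


def random_walk(adjacencies):
    """Normalize the path counts by the total count leaving each source node."""
    counts = path_count(adjacencies)
    totals = counts.sum(axis=1, keepdims=True)
    # Source nodes with no outgoing path instances keep an all-zero row.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)


# Toy network (hypothetical): 3 authors, 2 papers, meta-path author-paper-author.
author_paper = np.array([[1, 0], [1, 1], [0, 1]])
meta_path = [author_paper, author_paper.T]
pc = path_count(meta_path)
rw = random_walk(meta_path)
```

A sample vector for a node pair could then concatenate the `pc` and `rw` entries of that pair over all meta-paths up to the configured maximum length.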

In an embodiment of the present invention, the classification learning unit 303 is adapted to use the following formula to calculate the first vote obtained on each type label by the instance corresponding to a sample in the test set:

where $\zeta(x_i, l_j)$ denotes the votes obtained by instance $x_i$ on label $l_j$, and $\mathrm{Clf}_{jk}$ denotes the binary classifier corresponding to the label pair $(l_j, l_k)$; one indicator term denotes that the sample is correctly predicted as a negative example in the training subset, and the other denotes that the sample is correctly predicted as a positive example in the training subset.

In an embodiment of the present invention, the classification learning unit 303 is adapted to use the following formula to calculate the second vote obtained on each type label by the instance corresponding to a sample in the test set:

where $\zeta(x_i, l_j)$ denotes the votes obtained by instance $x_i$ on label $l_j$ before the update, $\mathrm{Clf}_{jk}$ denotes the binary classifier corresponding to the label pair $(l_j, l_k)$, and the indicator term denotes that the sample is correctly predicted as a positive example in the training subset.

where $\zeta^{*}(x_i, l_s)$ denotes the votes obtained by instance $x_i$ on the virtual label $l_s$, and the indicator term denotes that the sample is correctly predicted as a negative example in the training subset.
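The three vote definitions above follow the pattern of calibrated label ranking. Under that assumption, they might be written as shown below. The indicator bracket $[\![\cdot]\!]$ (1 when the condition holds, 0 otherwise), the number of type labels $q$, the auxiliary classifier name $\mathrm{Clf}_{js}$ for the pair $(l_j, l_s)$, and the convention that $\mathrm{Clf}_{jk}$ outputs a positive value when $l_j$ is preferred over $l_k$ are all assumptions introduced for illustration, not taken from the text.

$$
\zeta(x_i, l_j) = \sum_{k=1}^{j-1} [\![\, \mathrm{Clf}_{kj}(x_i) \le 0 \,]\!] + \sum_{k=j+1}^{q} [\![\, \mathrm{Clf}_{jk}(x_i) > 0 \,]\!]
$$

$$
\zeta^{*}(x_i, l_j) = \zeta(x_i, l_j) + [\![\, \mathrm{Clf}_{js}(x_i) > 0 \,]\!]
$$

$$
\zeta^{*}(x_i, l_s) = \sum_{j=1}^{q} [\![\, \mathrm{Clf}_{js}(x_i) \le 0 \,]\!]
$$

Read this way, each pairwise classifier contributes one vote to whichever of its two labels it prefers, and each auxiliary classifier contributes one vote either to its type label or to the virtual label.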

In an embodiment of the present invention, the classification learning unit 303 is adapted to use the following formula to determine the final multi-label classifier based on the first vote and the second vote obtained on each type label and the third vote obtained on the virtual label by the instance corresponding to the test sample:

$h(x) = \{\, l_j \mid \zeta^{*}(x, l_j) > \zeta^{*}(x, l_s) \,\}$

where $h(x)$ denotes the multi-label classifier, $l_j$ denotes the j-th type label, $\zeta^{*}(x, l_j)$ denotes the final votes obtained by instance $x$ on label $l_j$, and $\zeta^{*}(x, l_s)$ denotes the votes obtained by instance $x$ on the virtual label $l_s$.
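The decision rule can be sketched in a few lines of Python. This is an illustrative sketch, not the patented implementation: the classifiers are passed in as plain callables, a positive output is assumed to mean that the first label of a pair (or the type label, for an auxiliary classifier) is preferred, and the name `predict_labels` and the toy score-based classifiers are hypothetical.

```python
from itertools import combinations


def predict_labels(x, num_labels, pairwise, auxiliary):
    """Return the type labels whose final votes exceed the virtual label's votes.

    pairwise[(j, k)] (j < k) returns a value > 0 when label j is preferred
    over label k; auxiliary[j] returns a value > 0 when label j is preferred
    over the virtual label. Both conventions are assumptions.
    """
    votes = [0] * num_labels
    # First vote: each pairwise classifier votes for the label it prefers.
    for j, k in combinations(range(num_labels), 2):
        if pairwise[(j, k)](x) > 0:
            votes[j] += 1
        else:
            votes[k] += 1
    # Second and third votes: each auxiliary classifier votes either for its
    # type label or for the virtual label.
    virtual_votes = 0
    for j in range(num_labels):
        if auxiliary[j](x) > 0:
            votes[j] += 1
        else:
            virtual_votes += 1
    return {j for j in range(num_labels) if votes[j] > virtual_votes}


# Toy classifiers (hypothetical): x is a list of per-label scores.
pairwise = {(j, k): (lambda x, j=j, k=k: x[j] - x[k])
            for j, k in combinations(range(3), 2)}
auxiliary = {j: (lambda x, j=j: x[j] - 0.5) for j in range(3)}
predicted = predict_labels([0.9, 0.6, 0.1], 3, pairwise, auxiliary)
print(sorted(predicted))  # [0, 1]
```

In the toy run, labels 0 and 1 collect 3 and 2 votes against 1 vote for the virtual label, so both are returned, while label 2 collects none and is dropped.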

In an embodiment of the present invention, the device 30 may further include a calculation and output unit 305, wherein:

The calculation and output unit 305 is adapted to calculate and output the dependency scores between the label pairs.

An embodiment of the present invention further provides a computer-readable storage medium storing computer instructions which, when run, execute the steps of the heterogeneous information network link prediction method. For the heterogeneous information network link prediction method, refer to the foregoing relevant description; it is not repeated here.

An embodiment of the present invention further provides a terminal including a memory and a processor, where the memory stores computer instructions capable of running on the processor, and the processor executes the steps of the heterogeneous information network link prediction method when running the computer instructions. For the heterogeneous information network link prediction method, refer to the foregoing relevant description; it is not repeated here.

With the above solution in the embodiments of the present invention, multi-label classification learning is performed based on the training set and the test set in the sample set to obtain a corresponding multi-label classifier, and the trained multi-label classifier is used to predict unknown relationships between nodes in the heterogeneous network to be predicted. In this way, not only direct links between nodes but also other relationships between nodes, namely new types of meta-paths, can be predicted, so the accuracy of link prediction in heterogeneous networks can be improved.

The foregoing shows and describes the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited by the above embodiments; the above embodiments and the description merely illustrate the principles of the present invention. Various changes and improvements may be made without departing from the spirit and scope of the present invention, and the scope of protection claimed is defined by the appended claims, the description, and their equivalents.

Claims

1. A heterogeneous information network link prediction device, characterized by comprising:

a setting unit, adapted to set the meta-paths between node pairs in the heterogeneous network to be predicted, the maximum length of the meta-paths, and a type label corresponding to each meta-path type;

a construction unit, adapted to extract heterogeneous topological features between pairs of nodes based on the meta-path, construct a sample vector, and form a sample set; the sample set includes a training set and a test set;

a classification learning unit, adapted to perform multi-label classification learning based on the training set and the test set in the sample set to obtain a corresponding multi-label classifier;

a prediction unit, adapted to use the trained multi-label classifier to predict unknown relationships between nodes in the heterogeneous network to be predicted.

2. The heterogeneous information network link prediction device according to claim 1, wherein the classification learning unit is adapted to: select from the training set, for each label pair formed by any two of the set type labels, the corresponding training subset, and perform binary classification learning on each selected training subset to obtain a plurality of binary classifiers in one-to-one correspondence with the label pairs; input the test set into each of the trained binary classifiers and calculate the first vote obtained on each type label by the instance corresponding to each sample in the test set; add the corresponding virtual label to each sample in the corresponding training subset to obtain the training subset corresponding to each label pair formed by a type label and the virtual label, and train on the obtained training subsets to obtain a plurality of auxiliary binary classifiers in one-to-one correspondence with the type labels, where the virtual label marks the split point between the type labels that are relevant and those that are irrelevant to a sample in the corresponding training subset; input the test set into each of the trained auxiliary binary classifiers and calculate the second vote obtained on each type label and the third vote obtained on the virtual label by the instance corresponding to each sample in the test set; and determine the final multi-label classifier based on the first vote and the second vote obtained on each type label and the third vote obtained on the virtual label by the instance corresponding to the test sample.

3. The heterogeneous information network link prediction device according to claim 2, wherein the heterogeneous topological features between the node pairs include path count features and random walk features.

4. The heterogeneous information network link prediction device according to claim 3, wherein the classification learning unit is adapted to use the following formula to calculate the first vote obtained on each type label by the instance corresponding to a sample in the test set:

where $\zeta(x_i, l_j)$ denotes the votes obtained by instance $x_i$ on label $l_j$, and $\mathrm{Clf}_{jk}$ denotes the binary classifier corresponding to the label pair $(l_j, l_k)$; one indicator term denotes that the sample is correctly predicted as a negative example in the training subset, and the other denotes that the sample is correctly predicted as a positive example in the training subset.

5. The heterogeneous information network link prediction device according to claim 4, wherein the classification learning unit is adapted to use the following formula to calculate the second vote obtained on each type label by the instance corresponding to a sample in the test set:

where $\zeta(x_i, l_j)$ denotes the votes obtained by instance $x_i$ on label $l_j$ before the update, $\mathrm{Clf}_{jk}$ denotes the binary classifier corresponding to the label pair $(l_j, l_k)$, and the indicator term denotes that the sample is correctly predicted as a positive example in the training subset.

6. The heterogeneous information network link prediction device according to claim 5, wherein the classification learning unit is adapted to use the following formula to calculate the second vote obtained on each type label by the instance corresponding to a sample in the test set:

where $\zeta^{*}(x_i, l_s)$ denotes the votes obtained by instance $x_i$ on the virtual label $l_s$, and the indicator term denotes that the sample is correctly predicted as a negative example in the training subset.

7. The heterogeneous information network link prediction device according to any one of claims 3 to 6, wherein the classification learning unit is adapted to use the following formula to determine the final multi-label classifier based on the first vote and the second vote obtained on each type label and the third vote obtained on the virtual label:

$h(x) = \{\, l_j \mid \zeta^{*}(x, l_j) > \zeta^{*}(x, l_s) \,\}$

where $h(x)$ denotes the multi-label classifier, $l_j$ denotes the j-th type label, $\zeta^{*}(x, l_j)$ denotes the final votes obtained by instance $x$ on label $l_j$, and $\zeta^{*}(x, l_s)$ denotes the votes obtained by instance $x$ on the virtual label $l_s$.

8. The heterogeneous information network link prediction device according to any one of claims 3-6, further comprising:

a calculation and output unit, adapted to calculate and output the dependency scores between the label pairs.