CN113222610A - Risk identification method and device

Info

Publication number: CN113222610A
Application number: CN202110493573.2A
Authority: CN
Inventors: 王膂; 刘丹丹; 曾威龙; 李迪; 王彦
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2021-05-07
Filing date: 2021-05-07
Publication date: 2021-08-06
Anticipated expiration: 2041-05-07
Also published as: CN113222610B

Abstract

The embodiment of the specification provides a risk identification method and device. According to the method of the embodiment, firstly, at least one incidence relation path which takes an entity to be identified as an end node is generated according to incidence relation data to form a path set of the entity to be identified; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities; then, carrying out risk identification on the incidence relation path by utilizing a path identification model obtained by pre-training; and determining a risk identification result of the entity to be identified by utilizing the risk identification result of each incidence relation path in the path set of the entity to be identified.

Description

Risk identification method and device

Technical Field

One or more embodiments of the present disclosure relate to the field of computer application technologies, and in particular, to a risk identification method and apparatus.

Background

There are various risks in the economic system, such as a series of risks of false funding, group operations, hidden architecture, evasive regulation, related transactions, drastic expansions, illegal fund flows, etc., which may have very serious consequences. However, in an economic system, a large number of entities are involved, and the entities include intricate associations. Therefore, how to efficiently and accurately identify the risk of an entity from an economic system becomes a difficult and urgent problem to be solved.

Disclosure of Invention

One or more embodiments of the present specification describe a risk identification method and apparatus to facilitate efficient and accurate risk identification for an entity.

According to a first aspect, there is provided a risk identification method comprising:

generating at least one incidence relation path taking an entity to be identified as an end node according to incidence relation data to form a path set of the entity to be identified; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;

carrying out risk identification on the incidence relation path by utilizing a path identification model obtained by pre-training;

and determining a risk identification result of the entity to be identified by using the risk identification result of each incidence relation path in the path set of the entity to be identified.

In one embodiment, the generating at least one incidence relation path with the entity to be identified as the end node according to the incidence relation data includes:

taking the entity to be identified as a starting point to perform path search from the incidence relation data to obtain an incidence relation path taking the entity to be identified as an end node; or,

respectively taking each entity in the incidence relation data as a starting end node, and performing a path search according to the incidence relations; and acquiring, from the obtained incidence relation paths, the incidence relation paths with the entity to be identified as an end node.

In another embodiment, the performing risk identification on the association relationship path by using a path identification model obtained through pre-training includes:

coding the incidence relation path to obtain a coding sequence of the incidence relation path;

and inputting the coding sequence of the incidence relation path into the path identification model to obtain a risk identification result of the path identification model for the incidence relation path.

In one embodiment, encoding the association relationship path to obtain an encoding sequence of the association relationship path includes:

respectively encoding the entity and the incidence relation in the incidence relation path;

and splicing the coding results of the entities and the coding results of the association relations according to the sequence in the association relation path to obtain the coding results of the association relation path.

In another embodiment, determining the risk identification result for the entity to be identified by using the risk identification result of each association relationship path in the path set of the entity to be identified includes:

and inputting the risk identification result of each incidence relation path in the path set of the entity to be identified and the statistical characteristics of the entity to be identified in the incidence relation data into a risk identification model obtained by pre-training to obtain the risk identification result of the entity to be identified.

In one embodiment, the statistical characteristics of the entity to be identified in the association relation data include at least one of the following:

the length of the shortest path in the path set of the entity to be identified;

the length of the longest path in the path set of the entity to be identified;

the path set of the entity to be identified comprises the number of entities of a preset type;

the number of entities with the same preset attribute as the entity to be identified;

whether the entity to be identified has a preset relationship characteristic or not;

the proportion of the entities to be identified having the specific attribute;

and whether a cycle exists in the incidence relation path of the entity to be identified.
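A few of the statistical characteristics listed above can be computed directly from a path set. The following is a minimal sketch, assuming each path is a list of entity identifiers ordered so that the entity to be identified comes last, that path length is counted in edges, and that a cycle means an entity appearing more than once in a path. The function name `path_set_statistics` and the `entity_type` mapping are hypothetical and do not come from the specification.

```python
from typing import Dict, List, Sequence


def path_set_statistics(
    paths: Sequence[Sequence[str]],
    entity_type: Dict[str, str],
    preset_type: str,
) -> Dict[str, object]:
    """Compute a few path-set statistics for one entity (illustrative only)."""
    # Path length is counted in edges, i.e. number of nodes minus one (assumption).
    lengths: List[int] = [len(path) - 1 for path in paths]
    # Distinct entities that occur anywhere in the path set.
    entities = {node for path in paths for node in path}
    return {
        "shortest_path_length": min(lengths, default=0),
        "longest_path_length": max(lengths, default=0),
        "preset_type_count": sum(
            1 for entity in entities if entity_type.get(entity) == preset_type
        ),
        # A path is treated as cyclic if some entity appears in it twice (assumption).
        "has_cycle": any(len(set(path)) < len(path) for path in paths),
    }


example_paths = [
    ["person1", "enterprise3", "enterprise2", "enterprise1"],
    ["person2", "enterprise1"],
]
example_types = {
    "person1": "person",
    "person2": "person",
    "enterprise1": "enterprise",
    "enterprise2": "enterprise",
    "enterprise3": "enterprise",
}
stats = path_set_statistics(example_paths, example_types, "person")
```

The resulting dictionary could be flattened into a feature vector and combined with the path-level risk results; how the two are combined is left to the risk identification model.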

In another embodiment, the association data includes share structure data; the entity to be identified is an enterprise or a natural person; the association relationship is a stock control relationship.

According to a second aspect, there is provided a method of obtaining a risk identification model, comprising:

acquiring training data, wherein the training data comprises a path set of an entity sample formed by incidence relation paths taking the entity sample as end nodes and a risk condition label labeled on the entity sample; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;

and taking the risk identification result of each incidence relation path in the path set of the entity sample as the input of a classification model, taking the risk condition label as the target output of the classification model, and training the classification model to obtain a risk identification model.

In one embodiment, the obtaining training data comprises:

taking an entity marked with a risk state label as an entity sample, and taking the entity sample as a starting point to perform path search from the incidence relation data to obtain a path set of the entity sample formed by incidence relation paths taking the entity sample as an end node so as to obtain the training data; or,

respectively taking each entity in the incidence relation data as an initial end node, and searching a path according to the incidence relation; selecting entity samples from the end nodes of the obtained incidence relation paths, labeling risk state labels on the entity samples, and obtaining a path set of the entity samples formed by the incidence relation paths taking the entity samples as end points to obtain the training data.

In one embodiment, the using the risk identification result of each incidence relation path in the path set of the entity sample as the input of the classification model includes:

and taking the risk identification result of each incidence relation path in the path set of the entity sample and the statistical characteristics of the entity sample in the incidence relation data as the input of the classification model.

In another embodiment, the statistical characteristics of the entity sample in the incidence relation data comprise at least one of the following:

a length of a shortest path in a set of paths of the entity sample;

a length of a longest path in the set of paths of the entity sample;

the path set of the entity sample comprises the number of entities of a preset type;

the number of entities with the same preset attribute as the entity sample;

whether the entity sample has a preset relationship characteristic or not;

the proportion of the entity sample having a particular attribute;

whether a cycle exists in the incidence relation path of the entity sample.

According to a third aspect, there is provided a method of obtaining a path recognition model, comprising:

acquiring training data, wherein the training data comprises an incidence relation path and a risk condition label labeled on the incidence relation path; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;

and taking the coding sequence of the incidence relation path as the input of a classification model, taking the risk condition label of the incidence relation path as the target output of the classification model, and training the classification model to obtain the path recognition model.

According to a fourth aspect, there is provided a risk identification apparatus comprising:

the path generation unit is configured to generate at least one incidence relation path with an entity to be identified as an end node according to the incidence relation data to form a path set of the entity to be identified; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;

the path identification unit is configured to perform risk identification on the incidence relation path by using a path identification model obtained through pre-training;

and the risk identification unit is configured to determine a risk identification result of the entity to be identified by using the risk identification result of each incidence relation path in the path set of the entity to be identified.

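In an embodiment, the path generating unit is specifically configured to: take the entity to be identified as a starting point to perform path search from the incidence relation data to obtain an incidence relation path taking the entity to be identified as an end node; or, respectively take each entity in the incidence relation data as an initial end node and search a path according to the incidence relation, and acquire, from the obtained incidence relation paths, the incidence relation path with the entity to be identified as an end node.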

In another embodiment, the path identifying unit includes:

the path coding subunit is configured to code the incidence relation path to obtain a coding sequence of the incidence relation path;

and the path identification subunit is configured to input the coding sequence of the incidence relation path into the path identification model to obtain a risk identification result of the path identification model on the incidence relation path.

In an embodiment, the risk identification unit is specifically configured to input a risk identification result of each association relationship path in the path set of the entity to be identified and a statistical characteristic of the entity to be identified in the association relationship data into a risk identification model obtained through pre-training, so as to obtain a risk identification result of the entity to be identified.

According to a fifth aspect, there is provided an apparatus for obtaining a risk identification model, comprising:

a first obtaining unit configured to obtain training data, wherein the training data comprises a path set of an entity sample formed by incidence relation paths taking the entity sample as end nodes, and a risk condition label labeled on the entity sample; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;

and the first training unit is configured to take a risk identification result of each incidence relation path in the path set of the entity sample as an input of a classification model, take the risk condition label as a target output of the classification model, and train the classification model to obtain a risk identification model.

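In an embodiment, the first obtaining unit is specifically configured to use an entity labeled with a risk state label as an entity sample, perform path search from the association relationship data using the entity sample as a starting point, and obtain a path set of the entity sample formed by association relationship paths using the entity sample as an end node, so as to obtain the training data; or, respectively take each entity in the association relationship data as an initial end node and search a path according to the association relationship, select entity samples from the end nodes of the obtained association relationship paths, label the entity samples with risk state labels, and obtain a path set of the entity samples formed by the association relationship paths using the entity samples as end nodes, so as to obtain the training data.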

In another embodiment, the path identifying unit is specifically configured to encode the association relationship path to obtain an encoding sequence of the association relationship path; and inputting the coding sequence of the incidence relation path into the path identification model to obtain a risk identification result of the path identification model for the incidence relation path.

In an embodiment, the first training unit is specifically configured to use a risk identification result of each association relationship path in the path set of the entity sample and a statistical feature of the entity sample in the association relationship data as inputs of the classification model.

According to a sixth aspect, there is provided an apparatus for obtaining a path recognition model, comprising:

a second obtaining unit configured to obtain training data, the training data including an incidence relation path and a risk condition label labeled to the incidence relation path; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;

the path coding unit is configured to code the incidence relation path to obtain a coding sequence of the incidence relation path;

and the second training unit is configured to take the coding sequence of the incidence relation path as the input of a classification model, take the risk condition label of the incidence relation path as the target output of the classification model, and train the classification model to obtain the path recognition model.

In an embodiment, the path encoding unit is specifically configured to encode the entity and the association relationship in the association relationship path respectively; and splicing the coding results of the entities and the coding results of the association relations according to the sequence in the association relation path to obtain the coding results of the association relation path.

According to a seventh aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor which, when executing the executable code, implements the method of the first aspect.

According to the method and the device provided by the embodiment of the specification, the characteristics of the incidence relation between the entity and the entity are reflected through the incidence relation path, the risk identification result of the incidence relation path and the statistical characteristics of the entity to be identified are used as the basis of risk identification, and the entity is subjected to risk identification. Therefore, the purpose of efficiently and accurately identifying the risk of the entity is achieved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 shows a flow diagram of a risk identification method according to one embodiment;

FIG. 2 illustrates an example diagram of an incidence relation path, according to one embodiment;

FIG. 3 illustrates a flow diagram of a method of risk identification for a path, according to one embodiment;

FIG. 4 illustrates a flow diagram of a method of training a path recognition model, according to one embodiment;

FIG. 5 shows a schematic diagram of a cyclic path according to one embodiment;

FIG. 6 illustrates a flow diagram of a method of training a risk recognition model, according to one embodiment;

FIG. 7 shows a schematic block diagram of a risk identification apparatus according to an embodiment;

FIG. 8 shows a schematic block diagram of an apparatus for obtaining a risk identification model according to one embodiment;

FIG. 9 shows a schematic block diagram of an apparatus for obtaining a path recognition model according to one embodiment.

Detailed Description

The scheme provided by the specification is described below with reference to the accompanying drawings.

FIG. 1 shows a flow diagram of a risk identification method according to one embodiment. It is to be appreciated that the method can be performed by any apparatus, device, platform, cluster of devices having computing and processing capabilities. As shown in fig. 1, the method includes:

Step 101: generating at least one incidence relation path with an entity to be identified as an end node according to incidence relation data to form a path set of the entity to be identified; the nodes in the association relationship path are entities, and the edges are the association relationships among the entities.

Step 103: carrying out risk identification on the incidence relation path by using a path identification model obtained by pre-training.

Step 105: determining a risk identification result of the entity to be identified by using the risk identification result of each incidence relation path in the path set of the entity to be identified.

As can be seen from the method flow provided by the embodiment shown in FIG. 1, the method of the present disclosure captures the characteristics of the association relationships between entities through association relationship paths, and performs risk identification on the entity by using the risk identification results of the association relationship paths and the statistical characteristics of the entity to be identified as the basis for risk identification. The entity's risk can thereby be identified efficiently and accurately.

The steps shown in FIG. 1 are described in detail below with reference to specific embodiments. As a typical application scenario, the present disclosure can be applied to risk identification based on the equity structure. For example, investigation of the UBO (Ultimate Beneficial Owner) is a mandatory requirement of the central bank and banking regulators. Failure to comply with increasingly stringent laws and regulations can pose significant legal, financial, and reputational risks to both companies and the associated responsible parties. An enterprise or enterprise group controlled by a UBO through complex equity relationships is exposed to a series of risks such as false funding, group operation, hidden architecture, evasion of supervision, related-party transactions, and rapid expansion. Risk identification of UBOs from complex economic relationship data (e.g., stock control relationship data) is therefore required. In this scenario, the association relationship data is economic relationship data, the entity to be identified may be an enterprise or a natural person, and the association relationship is a stock control relationship. For ease of understanding, the following embodiments are described using this specific scenario as an example. However, it should be noted that the present disclosure is not limited to this exemplary application scenario, and may also be applied to other application scenarios.

Firstly, the above step 101, that is, "generating at least one association relationship path using the entity to be identified as an end node according to the association relationship data, and forming a path set of the entity to be identified" will be described in detail.

When generating an association relationship path with the entity to be identified as an end node according to the association relationship data, the following two modes may be adopted, without limitation:

The first mode: performing a path search in the incidence relation data with the entity to be identified as a starting point, to obtain incidence relation paths taking the entity to be identified as an end node.

Taking the above typical scenario as an example, economic relationship data, such as stock control relationship data, may be obtained from an interface provided by a department of industry and commerce. When risk identification needs to be carried out on a certain specific entity (the specific entity is referred to as an entity to be identified in the present disclosure, and the entity to be identified can be one or more entities), path search is carried out in the stock control relationship data by taking the entity to be identified as a starting point.

Take the case where the entity to be identified is an enterprise, for example, enterprise 1. An entity (which may be an enterprise or a natural person) holding a controlling stake in enterprise 1 is searched for in the stock control relationship data; if that entity is enterprise 2, an entity holding a controlling stake in enterprise 2 is further searched for in the stock control relationship data. The search is repeated in this way until the entity found is a natural person. An enterprise may be controlled by one or more entities.

Assuming that the final path search obtained several incidence relation paths (i.e., stock control paths) for business 1, two of them are shown in FIG. 2. The nodes in the path shown in the figure are entities, which can be enterprises or natural persons. The edges are the association between the nodes, in this example the association is the stock control relationship, and the numbers in the figure are the stock control ratio. The entity to be identified, enterprise 1, is the end node in each of these paths.

Take the case where the entity to be identified is a natural person, for example, natural person 1. The entities found in the stock control relationship data as being controlled by a natural person can only be enterprises; assume they are enterprise 3 and enterprise 4. This forms a bifurcation into two paths. The entities held by enterprise 3 and the entities held by enterprise 4 are then further searched for in the stock control relationship data, and so on, until an entity that controls no further entity is reached.
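The upward search described for an enterprise (from the enterprise to be identified, through its controlling entities, until a natural person is reached) can be sketched as a depth-first traversal. This is only an illustration under stated assumptions: the `holders` mapping, the `persons` set, the function name `search_paths`, and the example ratios are hypothetical, and the sketch simply skips a holder that already appears on the current path, so a fully circular holding chain yields no path here.

```python
from typing import Dict, List, Set, Tuple

# holders[x] lists (holder, holding ratio in percent) for entities that control x.
Holders = Dict[str, List[Tuple[str, float]]]


def search_paths(target: str, holders: Holders, persons: Set[str]) -> List[List[str]]:
    """Return node paths that end at `target`, found by walking up the holders."""
    results: List[List[str]] = []

    def walk(entity: str, suffix: List[str]) -> None:
        # `suffix` is the path from `entity` down to the target (target is last).
        parents = holders.get(entity, [])
        if entity in persons or not parents:
            results.append(suffix)
            return
        for holder, _ratio in parents:
            if holder in suffix:
                # Skip cross-holdings so the walk terminates (simplification).
                continue
            walk(holder, [holder] + suffix)

    walk(target, [target])
    return results


example_holders: Holders = {
    "enterprise1": [("enterprise2", 29.0), ("person2", 40.0)],
    "enterprise2": [("enterprise3", 35.0)],
    "enterprise3": [("person1", 60.0)],
}
found = search_paths("enterprise1", example_holders, {"person1", "person2"})
# found == [["person1", "enterprise3", "enterprise2", "enterprise1"],
#           ["person2", "enterprise1"]]
```

Only nodes are recorded in this sketch; a fuller version would also keep the holding ratio and start year on each edge so that the path can be encoded later.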

The second mode: respectively taking each entity in the incidence relation data as a starting end node, and performing a path search according to the incidence relations; and acquiring, from the obtained incidence relation paths, the incidence relation paths with the entity to be identified as an end node.

In this way, each entity in the association relationship data is used as a starting end node, and the path search method described in the first way is used to perform search, so as to obtain an association relationship path with each entity as an end point, which is equivalent to a total association relationship path set. And then finding out the incidence relation path of the entity to be identified from the incidence relation paths as a path set of the entity to be identified.

Taking the above typical scenario as an example, all enterprises are found from the stock control relationship data. By performing a path search for each enterprise, the incidence relation paths with each enterprise as an end node can be obtained. If the entity to be identified is enterprise 1, the incidence relation paths that take enterprise 1 as an end node are then found from the obtained incidence relation paths.
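In the second mode, the lookup step amounts to filtering a precomputed full path set by end node. A minimal sketch follows, assuming each path is a list of entity identifiers whose last element is the end node; the helper name `index_by_end_node` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def index_by_end_node(all_paths: Sequence[Sequence[str]]) -> Dict[str, List[List[str]]]:
    """Group a full path set by the entity at the end of each path."""
    index: Dict[str, List[List[str]]] = defaultdict(list)
    for path in all_paths:
        if path:  # ignore empty paths
            index[path[-1]].append(list(path))
    return index


all_paths = [
    ["person1", "enterprise3", "enterprise2", "enterprise1"],
    ["person2", "enterprise1"],
    ["person1", "enterprise3", "enterprise2"],
]
path_index = index_by_end_node(all_paths)
paths_for_enterprise1 = path_index["enterprise1"]
```

Building the index once makes sense when many entities need to be identified against the same incidence relation data; for a single entity, the first mode avoids enumerating the full path set.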

The following describes step 103, that is, "perform risk identification on the association relationship path by using the path identification model obtained through pre-training" in detail with reference to the embodiment.

In this step, risk identification is performed on the association relationship paths one by one to obtain risk state information of each association relationship path. As a preferred embodiment, as shown in fig. 3, the method specifically includes the following steps:

Step 301: encoding the incidence relation path to obtain a coding sequence of the incidence relation path.

When the incidence relation path is coded in the step, the entity and the incidence relation in the incidence relation path can be respectively coded; and splicing the coding results of the entities and the coding results of the association relations according to the sequence in the association relation path to obtain the coding results of the association relation path. When the entity is coded, at least one attribute information of the entity can be coded and then spliced according to a preset sequence to obtain a coding result of the entity.

Taking the above typical scenario as an example, after the stock control relationship path of the entity to be identified is obtained, the goal is to represent the stock control relationship path as a text sequence. Since the stock control relationship path is composed of the entities and the stock control relationship, each entity and stock control relationship can be first encoded separately.

Take the enterprise name as an example. It contains a great deal of information indicating the region, as well as personalized information reflecting the tastes or preferences of the registrant. For example, in the enterprise name "Shenzhen City beauty ounce Limited", "Shenzhen City" indicates the region and "beauty ounce" is personalized information reflecting the tastes or preferences of the registrant. However, such information contributes little to risk identification, and encoding it makes the sequence long and dilutes its information, which hinders effective use of the path identification model. Therefore, in this embodiment, some important attribute information of the enterprise may be encoded and then spliced in a preset order to obtain the encoding result of the enterprise.

The important attributes may include business type, industry of interest, registered capital, etc. Examples are as follows:

For enterprise types, dozens of common enterprise types can be converted into type codes. For example, "limited liability company" is encoded as "TYP1", "limited liability company (state-controlled)" is encoded as "TYP344", "joint-stock limited company" is encoded as "TYP189", "listed joint-stock limited company" is encoded as "TYP126", and so on.

For the industry, the primary and secondary classification codes of the national economic industry classification can be used, 118 codes in total. For example, "agriculture, forestry, animal husbandry, and fishery" is encoded as "INDA", "leasing and business services" is encoded as "INDL", "information transmission, software and information technology services" is encoded as "INDI", and so on.

For the registered capital, the range to which the registered capital belongs may be encoded. For example, registered capital of 10,000 to 50,000 is encoded as "AST1", registered capital of 100,000 to 1,000,000 is encoded as "AST2", registered capital of 1,010,000 to 5,000,000 is encoded as "AST3", and so on.

A node that is a natural person may be encoded directly, for example as "PERSON", to indicate that its type is a natural person.

When encoding the stock control relationship, the holding ratio and the year in which the holding began may be used. For example, a 60% holding that began in 2014 may be encoded as "Y2014R60".

Taking path 1 in FIG. 2 as an example, natural person 1 is encoded as "PERSON", the stock control relationship between natural person 1 and enterprise 3 is encoded as "Y2014R60", enterprise 3 is encoded as "TYP1IND0AST6", the stock control relationship between enterprise 3 and enterprise 2 is encoded as "Y2008R35", enterprise 2 is encoded as "TYP1IND0AST5", the stock control relationship between enterprise 2 and enterprise 1 is encoded as "Y1988R29", enterprise 1 is encoded as "TYP344INDLAST7", and the coding sequence of the whole path is "PERSON Y2014R60 TYP1IND0AST6 Y2008R35 TYP1IND0AST5 Y1988R29 TYP344INDLAST7".

It should be noted that the encoding manner shown in the above example is only for illustrative purposes, but is not intended to limit the specific encoding manner. Any coding method under the above spirit principle falls into the protection scope of the present application.
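The illustrative encoding above (separate codes for entities and stock control relationships, spliced in path order) can be sketched as follows. This is a minimal sketch, not the specification's implementation: the class names `Enterprise`, `Person`, and `Holding`, the function names, and the choice of a single space as the separator between codes are assumptions, and the attribute codes are taken as already assigned.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union


@dataclass
class Enterprise:
    type_code: str      # e.g. "TYP1"
    industry_code: str  # e.g. "INDL"
    capital_code: str   # e.g. "AST7"


@dataclass
class Person:
    pass


@dataclass
class Holding:
    start_year: int     # year in which the holding began
    ratio_percent: int  # holding ratio in percent


Entity = Union[Enterprise, Person]


def encode_entity(entity: Entity) -> str:
    """Splice an entity's attribute codes in a fixed order."""
    if isinstance(entity, Person):
        return "PERSON"
    return entity.type_code + entity.industry_code + entity.capital_code


def encode_holding(holding: Holding) -> str:
    return f"Y{holding.start_year}R{holding.ratio_percent}"


def encode_path(nodes: Sequence[Entity], edges: Sequence[Holding]) -> str:
    """Interleave entity and relationship codes in path order."""
    if len(edges) != len(nodes) - 1:
        raise ValueError("a path with n nodes must have n - 1 edges")
    tokens: List[str] = [encode_entity(nodes[0])]
    for edge, node in zip(edges, nodes[1:]):
        tokens.append(encode_holding(edge))
        tokens.append(encode_entity(node))
    return " ".join(tokens)


sequence = encode_path(
    [
        Person(),
        Enterprise("TYP1", "IND0", "AST6"),
        Enterprise("TYP1", "IND0", "AST5"),
        Enterprise("TYP344", "INDL", "AST7"),
    ],
    [Holding(2014, 60), Holding(2008, 35), Holding(1988, 29)],
)
# sequence == "PERSON Y2014R60 TYP1IND0AST6 Y2008R35 TYP1IND0AST5 Y1988R29 TYP344INDLAST7"
```

Keeping each entity and each relationship as one token means the sequence can be fed to a word-level text classifier without further tokenization.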

Step 303: inputting the coding sequence of the incidence relation path into the path identification model to obtain a risk identification result of the path identification model for the incidence relation path.

The path identification model adopted in this embodiment may be a text classification model; that is, classification is performed based on the coding sequence of the association relationship path, and the classification result is a specific risk state. Examples include high risk, medium risk, and low risk; or high risk and low risk; or primary risk, secondary risk, tertiary risk, and so on. That is, the classification may be binary or multi-class.

The text classification model employed may include, but is not limited to, a Word2vec-based text classification model, a fastText-based text classification model, or a textCNN-based text classification model.

Still taking the above exemplary application scenario as an example, the coding sequence obtained from the stock control relationship path is similar to text in a natural language. Therefore, the coding sequence of the stock control relationship path can be vectorized and then mapped to a specific classification result through a fully connected layer.
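The idea of vectorizing a coding sequence and mapping it to a class through a fully connected layer can be sketched in a few lines. This is a toy illustration under stated assumptions: the token vectors, weights, and biases below are hand-picked placeholders rather than trained parameters, plain averaging stands in for whichever vectorization the chosen model uses, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def average_vector(
    tokens: Sequence[str], embeddings: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue
        known += 1
        for i, value in enumerate(vector):
            total[i] += value
    return [value / known for value in total] if known else total


def softmax(scores: Sequence[float]) -> List[float]:
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]  # shift for stability
    norm = sum(exps)
    return [value / norm for value in exps]


def classify(
    tokens: Sequence[str],
    embeddings: Dict[str, List[float]],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """One fully connected layer followed by softmax over the risk classes."""
    x = average_vector(tokens, embeddings, len(weights[0]))
    logits = [
        sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)
    ]
    return softmax(logits)


toy_embeddings = {"PERSON": [1.0, 0.0], "Y2014R60": [0.0, 1.0]}
toy_weights = [[1.0, 0.0], [0.0, 1.0]]  # one row per class (placeholder values)
toy_bias = [0.0, 0.0]
probabilities = classify(["PERSON", "Y2014R60"], toy_embeddings, toy_weights, toy_bias)
```

In practice the embeddings and the layer parameters would be learned from labeled incidence relation paths, as described for the training process.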

The following describes the training process of the path recognition model in detail by using a specific embodiment. As shown in fig. 4, the training process may include the steps of:

Step 401: acquiring training data, wherein the training data comprises an incidence relation path and a risk condition label labeled on the incidence relation path; the nodes in the association relationship path are entities, and the edges are the association relationships among the entities.

In this step, some association relationship paths may be labeled with risk condition labels in advance, that is, the training data includes some association relationship paths with known risk conditions.

As an implementation manner, each entity in the association relationship data can be respectively used as a starting end node, and path search is performed according to the association relationships to obtain the association relationship paths. Association relationship paths covering the various risk conditions are then selected from these paths and labeled with risk condition labels. Taking binary classification as an example, the labels may be low risk and high risk. For the path search process, refer to the description of step 101 in the above embodiment; it is not repeated here.

The risk condition labels may be applied to the incidence relation paths manually. Automatic labeling may also be used. For example, in a blacklist-based manner, the incidence relation paths of enterprises or natural persons involved in tax evasion may be labeled as high risk; in a whitelist-based manner, the incidence relation paths of enterprises or natural persons such as class-A taxpayers, outstanding contributors, and star enterprises may be labeled as low risk.
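Blacklist-based and whitelist-based labeling can be sketched as a simple set lookup over the entities on a path. This is a minimal sketch under assumptions that the text does not state: a path is labeled if any entity on it appears in a list, the blacklist takes precedence when both lists match, and paths matching neither list are left unlabeled. The function name `label_path` and the example identifiers are hypothetical.

```python
from typing import Iterable, Optional, Set


def label_path(
    path_entities: Iterable[str], blacklist: Set[str], whitelist: Set[str]
) -> Optional[str]:
    """Return "high", "low", or None (unlabeled) for one incidence relation path."""
    entities = set(path_entities)
    if entities & blacklist:
        # Blacklist wins over whitelist when both match (assumption).
        return "high"
    if entities & whitelist:
        return "low"
    return None  # leave the path for manual labeling or exclude it


blacklist = {"enterprise9"}   # e.g. entities involved in tax evasion
whitelist = {"enterprise1"}   # e.g. class-A taxpayers
label_a = label_path(["person1", "enterprise9", "enterprise1"], blacklist, whitelist)
label_b = label_path(["person2", "enterprise1"], blacklist, whitelist)
label_c = label_path(["person3", "enterprise4"], blacklist, whitelist)
```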

Step 403: encode the incidence relation path to obtain a coding sequence of the incidence relation path.

The manner of encoding the association relationship path may refer to the related description of step 301 in the above embodiment, and is not described herein again.

Step 405: take the coding sequence of the incidence relation path as the input of the classification model and the risk condition label of the incidence relation path as the target output of the classification model, and train the classification model to obtain the path recognition model.

The text classification model first vectorizes the coding sequence of the incidence relation path, and then maps it to a specific classification result through the fully connected layer.

In the Word2vec (word vector)-based text classification model, words are vectorized using the Word2vec CBOW (Continuous Bag-of-Words) model, sentence vectors are obtained by weighted averaging, and finally some secondary features are removed by principal component analysis. In the Word2vec-based text classification model, when hierarchical Softmax is used in the fully connected layer, all word vectors in the training data are located at the leaf nodes of the Huffman tree.

fastText is a fast text classification algorithm. A fastText-based text classification model superimposes and averages the vectors of all words and n-grams in the coding sequence. That is, in addition to the word vector of each word, fastText also generates vectors for character-level n-grams as additional features, so that character-level n-gram vectors are additionally introduced into the vector features input to the fully connected layer. This vectorization manner produces better word vectors for low-frequency words, and for words outside the training corpus, word vectors can still be constructed by superimposing character-level n-gram vectors. When hierarchical Softmax is used in the fully connected layer, the leaf nodes of the Huffman tree are the vectors of the class labels.
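As a rough illustration only, the superposition and averaging of word vectors and character-level n-gram vectors can be sketched as below. This is not the fastText library itself: the dimension `DIM`, the bucket count `BUCKETS`, and the deterministic `bucket_vector` function are hypothetical stand-ins for embeddings that would in practice be learned during training.

```python
import zlib
from typing import Dict, List

DIM = 4          # hypothetical embedding dimension
BUCKETS = 1000   # hypothetical number of hashed n-gram buckets


def char_ngrams(token: str, n: int = 3) -> List[str]:
    """Character-level n-grams of a token padded with boundary markers."""
    padded = f"<{token}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def bucket_vector(ngram: str) -> List[float]:
    """Deterministic stand-in for a learned n-gram vector (illustration only)."""
    h = zlib.crc32(ngram.encode("utf-8")) % BUCKETS
    return [((h * (k + 1)) % 7) / 7.0 for k in range(DIM)]


def average(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sequence_vector(tokens: List[str],
                    word_vectors: Dict[str, List[float]]) -> List[float]:
    """Average all word vectors and character n-gram vectors of a coding sequence."""
    vectors: List[List[float]] = []
    for token in tokens:
        # Known tokens contribute their own word vector.
        if token in word_vectors:
            vectors.append(word_vectors[token])
        # Every token, including one unseen in training, contributes n-gram vectors.
        vectors.extend(bucket_vector(g) for g in char_ngrams(token))
    return average(vectors)


# Hypothetical coding sequence; "C2" has no word vector and relies on n-grams only.
known = {"C1": [0.1, 0.2, 0.3, 0.4]}
features = sequence_vector(["C1", "ctrl", "C2"], known)
```

The resulting `features` vector is what would be passed to the fully connected layer in this sketch.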

fastText is the more preferred approach because the present scenario is to classify the risk condition of coding sequences. In addition, in this embodiment, when training the text classification model, the path length (i.e., the number of nodes included) of the training data used is preferably greater than 2, because short paths carry little information and would reduce the classification accuracy of the model.

In the specific training process, the training objective is to make the classification result of the text classification model on the risk condition of the incidence relation path consistent with the risk condition label in the training data as much as possible. A loss function can be constructed according to the training target, parameters of the text classification model are updated according to the value of the loss function in each iteration until an iteration ending condition is reached, and the finally obtained text classification model is used as a path recognition model. The iteration ending condition may be that the iteration number reaches a preset iteration number threshold, or that the value of the loss function is less than or equal to a preset loss function threshold, or the like.
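The iterative training with the two ending conditions described above can be illustrated with the following minimal sketch. A plain logistic regression over already-vectorized paths stands in for the text classification model; the learning rate, thresholds and toy data are assumptions of this sketch rather than values given in the embodiment.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: List[float], b: float, x: List[float]) -> float:
    """High-risk probability of one vectorized path."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train(samples: List[List[float]], labels: List[int],
          lr: float = 0.5, max_iters: int = 1000,
          loss_threshold: float = 0.1) -> Tuple[List[float], float, float, int]:
    """Gradient descent that stops on an iteration limit or a loss threshold."""
    w, b = [0.0] * len(samples[0]), 0.0
    loss, it, eps = float("inf"), 0, 1e-12
    for it in range(1, max_iters + 1):
        probs = [predict(w, b, x) for x in samples]
        # Cross-entropy loss between predictions and risk condition labels,
        # measured on the parameters before this iteration's update.
        loss = -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                    for p, y in zip(probs, labels)) / len(samples)
        if loss <= loss_threshold:   # ending condition: loss small enough
            break
        for j in range(len(w)):      # parameter update from the loss gradient
            grad = sum((p - y) * x[j]
                       for p, y, x in zip(probs, labels, samples)) / len(samples)
            w[j] -= lr * grad
        b -= lr * sum(p - y for p, y in zip(probs, labels)) / len(samples)
    return w, b, loss, it            # it == max_iters: iteration limit reached


# Hypothetical vectorized paths with labels (1 = high risk, 0 = low risk).
X = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.2]]
y = [0, 0, 1, 1]
w, b, loss, iters = train(X, y)
```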

For the path identification model, the output of the fully connected layer is in effect the high-risk probability of each incidence relation path, and the risk condition of an incidence relation path can be identified by comparing its high-risk probability against a high-risk probability threshold.
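A minimal sketch of this thresholding is given below. The default threshold of 0.5 and the choice to treat a probability equal to the threshold as high risk are assumptions; the embodiment does not fix either.

```python
def classify_path(high_risk_probability: float, threshold: float = 0.5) -> str:
    """Map a path's high-risk probability to a risk condition."""
    # Assumption: a probability equal to the threshold counts as high risk.
    return "high risk" if high_risk_probability >= threshold else "low risk"
```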

The following describes in detail the step 105 of determining the risk identification result of the entity to be identified by using the risk identification result of each association relationship path in the path set of the entity to be identified, with reference to an embodiment.

When determining the risk identification result of the entity to be identified, one achievable manner is to determine it directly from the risk identification results of the incidence relation paths in the path set of the entity to be identified, for example, according to the proportion of each risk identification result among the incidence relation paths in the path set. For example, if high-risk paths account for more than 50% of the incidence relation paths in the path set of an enterprise, the risk identification result of the enterprise may be determined to be high risk.
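The proportion-based determination can be sketched as follows. The function name, the string labels and the handling of an empty path set are illustrative assumptions; the 50% default mirrors the example above.

```python
from typing import Sequence


def entity_risk_by_ratio(path_results: Sequence[str],
                         ratio_threshold: float = 0.5) -> str:
    """Decide an entity's risk from the share of high-risk paths in its path set."""
    if not path_results:
        raise ValueError("the path set is empty")
    high = sum(1 for result in path_results if result == "high risk")
    # The entity is high risk only if the high-risk share exceeds the threshold.
    return "high risk" if high / len(path_results) > ratio_threshold else "low risk"
```

For instance, a path set with two high-risk paths and one low-risk path has a high-risk share of about 67% and would be judged high risk, whereas an even split would not exceed 50%.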

However, the text classification model has limited recall on the large number of short paths, and there are cases in which an enterprise is controlled through a plurality of short stock control relationships. Therefore, in addition to the above manner of directly determining the risk identification result of the entity to be identified from the risk identification results of the incidence relation paths, a preferred embodiment is provided herein: the risk identification result of each incidence relation path in the path set and the statistical characteristics of the entity to be identified in the incidence relation data are input into a risk identification model obtained by pre-training, to obtain the risk identification result of the entity to be identified.

Wherein, the statistical characteristics of the entity to be identified in the association relation data may include, but are not limited to, at least one of the following:

the number of paths in the path set of the entity to be identified;

the length of the shortest path in the path set of the entity to be identified;

the length of the longest path in the path set of the entity to be identified;

the number of entities having the same preset attribute as the entity to be identified; the preset attributes can comprise a legal person, UBO, a contact way and the like;

whether the entity to be identified has a preset relationship characteristic or not; for example, whether the overseas enterprise benefits exceed a preset proportion;

the proportion of entities to be identified having a particular attribute; such as the proportion of the overseas benefit stock;

and whether a cycle exists in the incidence relation path of the entity to be identified. For example, as shown in FIG. 5, if natural person P1 holds 1% of the equity of enterprise C1, enterprise C1 holds 100% of the equity of enterprise C2, and enterprise C2 in turn holds 99% of the equity of enterprise C1, then the incidence relation path is determined to contain a cycle.
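The cycle feature in the last item above can be illustrated with a depth-first search over the directed shareholding relationships. The adjacency-list representation and the function name `has_cycle` are assumptions of this sketch; the example graph reproduces the P1, C1, C2 situation described for FIG. 5.

```python
from typing import Dict, List


def has_cycle(graph: Dict[str, List[str]]) -> bool:
    """Return True if the directed graph contains a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on the current DFS stack, finished
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    color = {node: WHITE for node in nodes}

    def visit(node: str) -> bool:
        color[node] = GRAY
        for nxt in graph.get(node, []):
            if color[nxt] == GRAY:          # edge back to the current stack
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in nodes)


# Holder -> held enterprise, following the FIG. 5 example.
fig5 = {"P1": ["C1"], "C1": ["C2"], "C2": ["C1"]}
acyclic = {"P1": ["C1"], "C1": ["C2"]}
```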

When the risk identification result of each incidence relation path in the path set is input into the risk identification model, the classification identification result of each incidence relation path can be directly input into the risk identification model as a plurality of features, or one or any combination of the maximum risk probability value, the minimum risk probability value, the average risk probability value and the like of each incidence relation path can be input into the risk identification model as features.

Continuing with the exemplary application scenario, after risk identification is performed on each incidence relation path in the path set of enterprise 1, the maximum risk probability value, the minimum risk probability value and the average risk probability value of the incidence relation paths in the path set are used as one part of the feature input of the risk identification model. Statistical features such as the number of paths in the path set of enterprise 1, the length of the shortest path, the length of the longest path, the number of enterprises included, whether the overseas enterprise benefits exceed 25%, the overseas benefit stock proportion, the number of enterprises with the same legal person, the number of enterprises with the same UBO, the number of enterprises with the same contact way, the number of impassable enterprises, and whether a cycle exists in the stock control paths are used as another part of the feature input of the risk identification model. The risk identification model classifies enterprise 1 according to the input features to obtain the risk identification result of enterprise 1.
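As a hedged sketch, part of this feature input could be assembled from a path set as shown below. Only the probability aggregates and the path-count and path-length statistics are covered; the feature names are hypothetical, and path length is taken as the number of nodes, consistent with the definition used for the training data.

```python
from typing import Dict, Sequence


def path_set_features(paths: Sequence[Sequence[str]],
                      high_risk_probs: Sequence[float]) -> Dict[str, float]:
    """Aggregate per-path risk probabilities and simple path statistics."""
    if not paths or len(paths) != len(high_risk_probs):
        raise ValueError("need one probability per path and a non-empty path set")
    lengths = [len(path) for path in paths]   # length = number of nodes
    return {
        "max_risk_prob": max(high_risk_probs),
        "min_risk_prob": min(high_risk_probs),
        "avg_risk_prob": sum(high_risk_probs) / len(high_risk_probs),
        "num_paths": float(len(paths)),
        "shortest_path_len": float(min(lengths)),
        "longest_path_len": float(max(lengths)),
    }


# Hypothetical path set of an enterprise C1 with per-path high-risk probabilities.
feats = path_set_features([["P1", "C1"], ["P2", "C2", "C1"]], [0.2, 0.8])
```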

The risk identification model is pre-established, and the establishment process can be seen in fig. 6. As shown in fig. 6, the method may include the steps of:

Step 601: acquire training data, wherein the training data comprises a path set of each entity sample, formed by the incidence relation paths taking the entity sample as an end node, and a risk condition label annotated on the entity sample; the nodes in an incidence relation path are entities, and the edges are the incidence relations among the entities.

In this embodiment, some entities with known risk conditions may be used as entity samples and annotated with risk condition labels, and a path search is performed in the association relationship data taking each entity sample as a starting point, so as to obtain the path set of the entity sample formed by the incidence relation paths taking the entity sample as an end node.

Alternatively, each entity in the association relation data is taken in turn as a start end node, and paths are searched according to the association relations; entities with known risk conditions are then selected from the end nodes of the searched incidence relation paths as entity samples and annotated with risk condition labels, and the path set of each entity sample, formed by the incidence relation paths taking the entity sample as an end node, is acquired to obtain the training data.

The above path searching process may refer to the relevant description in step 101 in the above embodiment, and is not described herein again. In the above exemplary scenario, the entity sample may be a business or a natural person.
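The second manner of building path sets, searching from every entity and grouping the resulting paths by end node, might be sketched as follows. The adjacency-list graph, the `max_nodes` limit, the exclusion of repeated nodes within a path, and the reading of "end node" as the last node of a path are all assumptions of this sketch; the actual search of step 101 may differ.

```python
from collections import defaultdict
from typing import Dict, List


def search_paths(graph: Dict[str, List[str]], start: str,
                 max_nodes: int = 4) -> List[List[str]]:
    """Enumerate simple paths from `start` containing 2 to `max_nodes` nodes."""
    results: List[List[str]] = []

    def extend(path: List[str]) -> None:
        if len(path) >= 2:
            results.append(list(path))
        if len(path) == max_nodes:
            return
        for nxt in graph.get(path[-1], []):
            if nxt not in path:            # assumption: no repeated nodes
                path.append(nxt)
                extend(path)
                path.pop()

    extend([start])
    return results


def path_sets_by_end_node(graph: Dict[str, List[str]],
                          max_nodes: int = 4) -> Dict[str, List[List[str]]]:
    """Search from every entity and group the paths by their last node."""
    path_sets: Dict[str, List[List[str]]] = defaultdict(list)
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    for start in sorted(nodes):
        for path in search_paths(graph, start, max_nodes):
            path_sets[path[-1]].append(path)
    return dict(path_sets)


# Hypothetical stock control data: P1 holds C1, and C1 holds C2.
path_sets = path_sets_by_end_node({"P1": ["C1"], "C1": ["C2"]})
```

Entity samples with known risk conditions would then be looked up as keys of `path_sets` to obtain their path sets.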

Step 602: perform risk identification on the incidence relation paths by using a path identification model obtained by pre-training.

For the process of risk identification on the incidence relation paths, reference may be made to the relevant description in the embodiment shown in fig. 3, which is not repeated here.

Step 603: take the risk identification result of each incidence relation path in the path set of the entity sample and the statistical characteristics of the entity sample in the incidence relation data as the input of a classification model and the risk condition label as the target output of the classification model, and train the classification model to obtain the risk identification model.

Likewise, the statistical characteristics of the entity sample in the association relationship data may include, but are not limited to, at least one of the following:

the number of paths in a path set of the entity sample;

the length of the shortest path in the path set of the entity sample;

the length of the longest path in the path set of the entity sample;

the number of entities having the same preset attribute as the entity sample; the preset attributes can comprise a legal person, UBO, a contact way and the like;

whether the entity sample has a preset relationship characteristic or not; for example, whether the overseas enterprise benefits exceed a preset proportion;

a proportion of the entity samples having a particular attribute; such as the proportion of the overseas benefit stock;

whether a cycle exists in the incidence relation path of the entity sample.

Whatever feature inputs are used when training the risk identification model must also be used when performing risk identification with the risk identification model; that is, the features adopted in the training process and in the identification process need to be kept consistent.
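One simple way to keep the features consistent, offered here only as an illustrative sketch, is to define a single ordered feature list and use it to build the input vector in both training and identification. The feature names in `FEATURE_ORDER` are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical feature names; the same list is shared by training and identification.
FEATURE_ORDER: List[str] = [
    "max_risk_prob", "min_risk_prob", "avg_risk_prob",
    "num_paths", "shortest_path_len", "longest_path_len", "has_cycle",
]


def to_vector(features: Dict[str, float],
              order: Sequence[str] = FEATURE_ORDER) -> List[float]:
    """Build a model input vector in a fixed order, rejecting missing features."""
    missing = [name for name in order if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(features[name]) for name in order]


sample = {
    "max_risk_prob": 0.8, "min_risk_prob": 0.2, "avg_risk_prob": 0.5,
    "num_paths": 2, "shortest_path_len": 2, "longest_path_len": 3,
    "has_cycle": 0,
}
vector = to_vector(sample)
```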

In this embodiment, the classification model may adopt a decision tree model, for example, an XGBoost (Extreme Gradient Boosting) model.

In a specific training process, the training goal is to make the classification result of the classification model on the risk condition of the entity sample as consistent as possible with the risk condition label in the training data. A loss function can be constructed according to the training target, parameters of the classification model are updated according to values of the loss function in each iteration until an iteration ending condition is reached, and the finally obtained classification model is used as a risk identification model. The iteration ending condition may be that the iteration number reaches a preset iteration number threshold, or that the value of the loss function is less than or equal to a preset loss function threshold, or the like.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

According to an embodiment of another aspect, a risk identification device is provided. Fig. 7 shows a schematic block diagram of a risk identification device according to an embodiment. It is to be appreciated that the apparatus can be implemented by any apparatus, device, platform, and cluster of devices having computing and processing capabilities. As shown in fig. 7, the apparatus 700 includes:

a path generating unit 710 configured to generate, according to the incidence relation data, at least one incidence relation path taking the entity to be identified as an end node, to form a path set of the entity to be identified; wherein the nodes in the association relationship path are entities, and the edges are the association relationships among the entities;

a path identification unit 720 configured to perform risk identification on the association relationship path by using a path identification model obtained through pre-training;

and a risk identification unit 730 configured to determine a risk identification result of the entity to be identified by using the risk identification result of each incidence relation path in the path set of the entity to be identified.

The path generating unit 710 may be specifically configured to: perform a path search in the incidence relation data taking the entity to be identified as a starting point, to obtain incidence relation paths taking the entity to be identified as an end node; or take each entity in the association relation data in turn as a start end node and search paths according to the association relations, and then acquire, from the obtained incidence relation paths, those taking the entity to be identified as an end node.

As one preferred embodiment, the path identifying unit 720 may specifically include: a path encoding subunit 721 and a path identifying subunit 722.

The path encoding subunit 721 is configured to encode the association relationship path, and obtain an encoding sequence of the association relationship path.

When encoding the association relationship path, the path encoding subunit 721 may encode the entity and the association relationship in the association relationship path, respectively; and splicing the coding results of the entities and the coding results of the association relations according to the sequence in the association relation path to obtain the coding results of the association relation path. When the entity is encoded, the path encoding subunit 721 may encode at least one attribute information of the entity and then splice the encoded attribute information according to a preset sequence to obtain an encoding result of the entity.
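The encode-then-splice procedure can be illustrated with the following sketch. The attribute names in `ATTRIBUTE_ORDER`, the token formats, and the bucketing of a shareholding ratio into `ctrl` and `minor` are hypothetical choices made for illustration; the embodiment does not prescribe a particular code book.

```python
from typing import Dict, List

# Hypothetical preset sequence in which entity attributes are spliced.
ATTRIBUTE_ORDER = ["type", "region"]


def encode_entity(entity: Dict[str, str]) -> str:
    """Encode selected attributes of an entity and splice them in a preset order."""
    return "_".join(str(entity[attr]) for attr in ATTRIBUTE_ORDER)


def encode_relation(share: float) -> str:
    """Encode a stock control relationship (hypothetical two-bucket scheme)."""
    return "ctrl" if share >= 0.5 else "minor"


def encode_path(entities: List[Dict[str, str]], shares: List[float]) -> List[str]:
    """Splice entity codes and relation codes in their order along the path."""
    if len(shares) != len(entities) - 1:
        raise ValueError("a path with n entities needs n - 1 relations")
    tokens: List[str] = []
    for i, entity in enumerate(entities):
        tokens.append(encode_entity(entity))
        if i < len(shares):
            tokens.append(encode_relation(shares[i]))
    return tokens


# Hypothetical path: a natural person holds 1% of C1, which holds 100% of C2.
path_entities = [
    {"type": "person", "region": "CN"},
    {"type": "company", "region": "CN"},
    {"type": "company", "region": "overseas"},
]
sequence = encode_path(path_entities, [0.01, 1.0])
```

The resulting token list plays the role of the coding sequence that is fed to the path identification model.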

And the path identifying subunit 722 is configured to input the coding sequence of the incidence relation path into the path identifying model, so as to obtain a risk identification result of the path identifying model on the incidence relation path.

The path recognition model may adopt, but is not limited to, a Word2vec-based text classification model, a fastText-based text classification model or a textCNN-based text classification model.

As a preferred embodiment, the risk identifying unit 730 may be specifically configured to input the risk identification result of each association relationship path in the path set of the entity to be identified and the statistical characteristics of the entity to be identified in the association relationship data into a risk identification model obtained by pre-training, so as to obtain the risk identification result of the entity to be identified.

Wherein, the statistical characteristics of the entity to be identified in the association relationship data may include at least one of the following:

the number of paths in the path set of the entity to be identified;

the length of the shortest path in the path set of the entity to be identified;

the length of the longest path in the path set of the entity to be identified;

the number of entities having the same preset attribute as the entity to be identified;

the proportion of entities to be identified having a particular attribute;

According to an embodiment of another aspect, an apparatus for obtaining a risk identification model is also provided. FIG. 8 shows a schematic block diagram of an apparatus for obtaining a risk identification model according to one embodiment. It is to be appreciated that the apparatus can be implemented by any apparatus, device, platform, and cluster of devices having computing and processing capabilities. As shown in fig. 8, the apparatus 800 includes: a first obtaining unit 801, a path identification unit 802 and a first training unit 803. The main functions of each component unit are as follows:

a first obtaining unit 801 configured to obtain training data, where the training data includes a path set of an entity sample, formed by the incidence relation paths taking the entity sample as an end node, and a risk condition label annotated on the entity sample; the nodes in the association relationship path are entities, and the edges are the association relationships among the entities.

A path identification unit 802 configured to perform risk identification on the association relationship path by using a path identification model obtained through pre-training.

The first training unit 803 is configured to train the classification model to obtain a risk identification model by taking a risk identification result of each association relation path in the path set of the entity sample as an input of the classification model and taking a risk condition label as a target output of the classification model.

The first obtaining unit 801 may be specifically configured to use an entity labeled with a risk state label as an entity sample, perform path search from the association relationship data using the entity sample as a starting point, and obtain a path set of the entity sample formed by association relationship paths using the entity sample as an end node, so as to obtain training data; or,

respectively taking each entity in the incidence relation data as an initial end node, and searching a path according to the incidence relation; and selecting entity samples from the end nodes of the searched incidence relation paths, labeling risk state labels on the entity samples, and acquiring a path set of the entity samples formed by the incidence relation paths with the entity samples as end points to obtain training data.

As a preferred embodiment, the path identifying unit 802 may be specifically configured to encode the association relationship path to obtain an encoding sequence of the association relationship path; and inputting the coding sequence of the incidence relation path into the path identification model to obtain a risk identification result of the path identification model to the incidence relation path.

When encoding the association relationship path, the path identifying unit 802 may encode the entity and the association relationship in the association relationship path respectively; and splicing the coding results of the entities and the coding results of the association relations according to the sequence in the association relation path to obtain the coding results of the association relation path. When an entity is encoded, the path identifying unit 802 may encode at least one attribute information of the entity and then splice the encoded attribute information according to a preset sequence to obtain an encoding result of the entity.

As a preferred embodiment, the first training unit 803 may be specifically configured to use the risk identification result of each association relationship path in the path set of the entity sample and the statistical features of the entity sample in the association relationship data as the input of the classification model.

Wherein, the statistical features of the entity sample in the association relationship data may include at least one of the following:

the number of paths in a path set of the entity sample;

the length of the shortest path in the path set of the entity sample;

the length of the longest path in the path set of the entity sample;

the number of entities having the same preset attribute as the entity sample;

whether the entity sample has a preset relationship characteristic or not;

a proportion of the entity samples having a particular attribute;

whether a cycle exists in the incidence relation path of the entity sample.

In this embodiment, the classification model may adopt a decision tree model, such as an XGBoost model.

According to an embodiment of another aspect, an apparatus for obtaining a path recognition model is provided. FIG. 9 shows a schematic block diagram of an apparatus for obtaining a path recognition model according to one embodiment. It is to be appreciated that the apparatus can be implemented by any apparatus, device, platform, and cluster of devices having computing and processing capabilities. As shown in fig. 9, the apparatus 900 includes: a second obtaining unit 901, a path encoding unit 902 and a second training unit 903. The main functions of each component unit are as follows:

a second obtaining unit 901 configured to obtain training data, where the training data includes association relation paths and risk condition labels labeled to the association relation paths; the nodes in the association relationship path are entities, and the edges are the association relationship among the entities.

As an implementation manner, the second obtaining unit 901 may use each entity in the association relationship data as a start end node, and perform path search according to the association relationship to obtain each association relationship path. And then selecting the incidence relation paths of various risk conditions from the incidence relation paths and labeling the risk condition labels.

A path encoding unit 902 configured to encode the association relationship path to obtain an encoding sequence of the association relationship path.

And a second training unit 903, configured to train the classification model to obtain the path recognition model by taking the coding sequence of the association relationship path as an input of the classification model and taking the risk condition label of the association relationship path as a target output of the classification model.

As a preferred embodiment, the path encoding unit 902 may be specifically configured to encode the entity and the association relationship in the association relationship path respectively; and splicing the coding results of the entities and the coding results of the association relations according to the sequence in the association relation path to obtain the coding results of the association relation path.

According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 1, 3, 4 and 6.

According to an embodiment of still another aspect, there is also provided a computing device including a memory and a processor, the memory having stored therein executable code, the processor, when executing the executable code, implementing the method in conjunction with fig. 1, fig. 3, fig. 4 and fig. 6.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.

The above-mentioned embodiments further describe the objects, technical solutions and advantages of the present invention in detail. It should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the scope of the present invention; any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims

1. A risk identification method, comprising:

2. The method of claim 1, wherein the generating at least one incidence relation path with the entity to be identified as an end node according to the incidence relation data comprises:

3. The method of claim 1, wherein the performing risk identification on the incidence relation path by using a pre-trained path identification model comprises:

4. The method according to claim 3, wherein encoding the incidence relation path to obtain the encoding sequence of the incidence relation path comprises:

5. The method of claim 1, wherein determining the risk identification result for the entity to be identified by using the risk identification result of each incidence relation path in the path set of the entity to be identified comprises:

6. The method of claim 5, wherein the statistical features of the entity to be identified in the association relationship data comprise at least one of:

the length of the shortest path in the path set of the entity to be identified;

the length of the longest path in the path set of the entity to be identified;

the proportion of the entities to be identified having the specific attribute;

7. The method according to any one of claims 1 to 6, wherein the association data comprises stock right structure data; the entity to be identified is an enterprise or a natural person; the association relationship is a stock control relationship.

8. A method of obtaining a risk identification model, comprising:

9. The method of claim 8, wherein the acquiring training data comprises:

10. The method of claim 8, wherein the risk identification of the incidence relation path by using the pre-trained path identification model comprises:

11. The method of claim 8, wherein using the risk identification result of each incidence relation path in the path set of the entity sample as an input of a classification model comprises:

12. The method of claim 11, wherein the statistical features of the entity sample in the association relationship data comprise at least one of:

a length of a shortest path in a set of paths of the entity sample;

a length of a longest path in the set of paths of the entity sample;

the number of entities with the same preset attribute as the entity sample;

whether the entity sample has a preset relationship characteristic or not;

the proportion of the entity sample having a particular attribute;

whether a cycle exists in the incidence relation path of the entity sample.

13. A method of obtaining a path recognition model, comprising:

14. The method according to claim 13, wherein encoding the incidence relation path to obtain the encoding sequence of the incidence relation path comprises:

15. A risk identification device comprising:

16. The apparatus according to claim 15, wherein the path generation unit is specifically configured to: taking the entity to be identified as a starting point to perform path search from the incidence relation data to obtain an incidence relation path taking the entity to be identified as an end node; or,

17. The apparatus of claim 15, wherein the path identifying unit comprises:

18. The apparatus according to claim 15, wherein the risk identification unit is specifically configured to input a risk identification result of each association relationship path in the path set of the entity to be identified and a statistical feature of the entity to be identified in the association relationship data into a risk identification model obtained through pre-training, so as to obtain a risk identification result of the entity to be identified.

19. An apparatus for obtaining a risk identification model, comprising:

20. The apparatus according to claim 19, wherein the first obtaining unit is specifically configured to use an entity labeled with a risk state label as an entity sample, perform a path search from the association relationship data using the entity sample as a starting point, and obtain a path set in which association relationship paths using the entity sample as an end node constitute the entity sample, so as to obtain the training data; or,

21. The apparatus according to claim 19, wherein the path identifying unit is specifically configured to encode the association relationship path to obtain an encoding sequence of the association relationship path; and inputting the coding sequence of the incidence relation path into the path identification model to obtain a risk identification result of the path identification model for the incidence relation path.

22. The apparatus according to claim 19, wherein the first training unit is specifically configured to use a risk identification result of each association relationship path in the path set of the entity sample and a statistical feature of the entity sample in the association relationship data as inputs of the classification model.

23. Apparatus for obtaining a path recognition model, comprising:

24. The apparatus according to claim 23, wherein the path encoding unit is specifically configured to encode the entity and the association relationship in the association relationship path respectively; and splicing the coding results of the entities and the coding results of the association relations according to the sequence in the association relation path to obtain the coding results of the association relation path.

25. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that, when executed by the processor, performs the method of any of claims 1-14.