CN113222610A - Risk identification method and device - Google Patents
- Publication number
- CN113222610A CN202110493573.2A
- Authority
- CN
- China
- Prior art keywords
- path
- incidence relation
- entity
- identified
- risk
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q20/00—Payment architectures, schemes or protocols
- G06Q20/38—Payment protocols; Details thereof
- G06Q20/40—Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
- G06Q20/401—Transaction verification
- G06Q20/4016—Transaction verification involving fraud or risk level assessment in transaction processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
Abstract
The embodiment of the specification provides a risk identification method and device. According to the method of the embodiment, firstly, at least one incidence relation path which takes an entity to be identified as an end node is generated according to incidence relation data to form a path set of the entity to be identified; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities; then, carrying out risk identification on the incidence relation path by utilizing a path identification model obtained by pre-training; and determining a risk identification result of the entity to be identified by utilizing the risk identification result of each incidence relation path in the path set of the entity to be identified.
Description
Technical Field
One or more embodiments of the present disclosure relate to the field of computer application technologies, and in particular, to a risk identification method and apparatus.
Background
There are various risks in the economic system, such as false funding, group operations, hidden architecture, evasion of regulation, related transactions, drastic expansion, and illegal fund flows, which may have very serious consequences. However, an economic system involves a large number of entities, and the associations among these entities are intricate. Therefore, how to efficiently and accurately identify the risk of an entity in an economic system has become a difficult and urgent problem.
Disclosure of Invention
One or more embodiments of the present specification describe a risk identification method and apparatus to facilitate efficient and accurate risk identification for an entity.
According to a first aspect, there is provided a risk identification method comprising:
generating at least one incidence relation path taking an entity to be identified as an end node according to incidence relation data to form a path set of the entity to be identified; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;
carrying out risk identification on the incidence relation path by utilizing a path identification model obtained by pre-training;
and determining a risk identification result of the entity to be identified by using the risk identification result of each incidence relation path in the path set of the entity to be identified.
In one embodiment, the generating at least one incidence relation path with the entity to be identified as the end node according to the incidence relation data includes:
taking the entity to be identified as a starting point to perform path search from the incidence relation data to obtain an incidence relation path taking the entity to be identified as an end node; or,
respectively taking each entity in the incidence relation data as an initial end node, and searching a path according to the incidence relation; and acquiring the incidence relation path with the entity to be identified as an end node from the obtained incidence relation paths.
In another embodiment, the performing risk identification on the association relationship path by using a path identification model obtained through pre-training includes:
coding the incidence relation path to obtain a coding sequence of the incidence relation path;
and inputting the coding sequence of the incidence relation path into the path identification model to obtain a risk identification result of the path identification model for the incidence relation path.
In one embodiment, encoding the association relationship path to obtain an encoding sequence of the association relationship path includes:
respectively encoding the entity and the incidence relation in the incidence relation path;
and splicing the coding results of the entities and the coding results of the association relations according to the sequence in the association relation path to obtain the coding results of the association relation path.
In another embodiment, determining the risk identification result for the entity to be identified by using the risk identification result of each association relationship path in the path set of the entity to be identified includes:
and inputting the risk identification result of each incidence relation path in the path set of the entity to be identified and the statistical characteristics of the entity to be identified in the incidence relation data into a risk identification model obtained by pre-training to obtain the risk identification result of the entity to be identified.
In one embodiment, the statistical characteristics of the entity to be identified in the association relation data include at least one of the following:
the length of the shortest path in the path set of the entity to be identified;
the length of the longest path in the path set of the entity to be identified;
the number of entities of a preset type contained in the path set of the entity to be identified;
the number of entities with the same preset attribute as the entity to be identified;
whether the entity to be identified has a preset relationship characteristic or not;
the proportion of the entities to be identified having the specific attribute;
and whether a cycle exists in the incidence relation path of the entity to be identified.
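Several of the statistical characteristics listed above can be computed directly from a path set. The following is a minimal sketch, not the patented implementation: it assumes a path is a tuple of entity names, that path length means the number of edges, and that a cycle shows up as an entity repeated within one path. The function name `path_set_statistics` and the `entity_types` lookup are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Path = Tuple[str, ...]


def path_set_statistics(
    path_set: Sequence[Path],
    entity_types: Dict[str, str],
    preset_type: str = "enterprise",
) -> Dict[str, object]:
    """Compute a few statistical characteristics of one entity's path set."""
    if not path_set:
        raise ValueError("path set is empty")
    # Assumption: the length of a path is its number of edges.
    lengths = [len(path) - 1 for path in path_set]
    # Distinct entities appearing anywhere in the path set.
    entities = {entity for path in path_set for entity in path}
    return {
        "shortest_path_length": min(lengths),
        "longest_path_length": max(lengths),
        "preset_type_count": sum(
            1 for entity in entities if entity_types.get(entity) == preset_type
        ),
        # Assumption: a path has a cycle when some entity occurs in it twice.
        "has_cycle": any(len(set(path)) < len(path) for path in path_set),
    }


example_types = {
    "person A": "natural person",
    "person B": "natural person",
    "enterprise 1": "enterprise",
    "enterprise 2": "enterprise",
    "enterprise 3": "enterprise",
    "enterprise 4": "enterprise",
}
example_path_set = [
    ("person A", "enterprise 2", "enterprise 1"),
    ("person B", "enterprise 1"),
    ("enterprise 3", "enterprise 4", "enterprise 3", "enterprise 1"),
]
stats = path_set_statistics(example_path_set, example_types)
```

The remaining characteristics (shared preset attributes, preset relationship characteristics, attribute proportions) depend on attribute data that this sketch does not model.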
In another embodiment, the association data includes share structure data; the entity to be identified is an enterprise or a natural person; the association relationship is a stock control relationship.
According to a second aspect, there is provided a method of obtaining a risk identification model, comprising:
acquiring training data, wherein the training data comprises a path set of an entity sample formed by incidence relation paths taking the entity sample as end nodes and a risk condition label labeled on the entity sample; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;
carrying out risk identification on the incidence relation path by utilizing a path identification model obtained by pre-training;
and taking the risk identification result of each incidence relation path in the path set of the entity sample as the input of a classification model, taking the risk condition label as the target output of the classification model, and training the classification model to obtain a risk identification model.
In one embodiment, the obtaining training data comprises:
taking an entity marked with a risk state label as an entity sample, and taking the entity sample as a starting point to perform path search from the incidence relation data to obtain a path set of the entity sample formed by incidence relation paths taking the entity sample as an end node so as to obtain the training data; or,
respectively taking each entity in the incidence relation data as an initial end node, and searching a path according to the incidence relation; selecting entity samples from the end nodes of the obtained incidence relation paths, labeling risk state labels on the entity samples, and obtaining a path set of the entity samples formed by the incidence relation paths taking the entity samples as end points to obtain the training data.
In another embodiment, the performing risk identification on the association relationship path by using a path identification model obtained through pre-training includes:
coding the incidence relation path to obtain a coding sequence of the incidence relation path;
and inputting the coding sequence of the incidence relation path into the path identification model to obtain a risk identification result of the path identification model for the incidence relation path.
In one embodiment, the using the risk identification result of each incidence relation path in the path set of the entity sample as the input of the classification model includes:
and taking the risk identification result of each incidence relation path in the path set of the entity sample and the statistical characteristics of the entity sample in the incidence relation data as the input of the classification model.
In another embodiment, the statistical characteristics of the entity sample in the incidence relation data comprise at least one of the following:
a length of a shortest path in a set of paths of the entity sample;
a length of a longest path in the set of paths of the entity sample;
the number of entities of a preset type contained in the path set of the entity sample;
the number of entities with the same preset attribute as the entity sample;
whether the entity sample has a preset relationship characteristic or not;
the proportion of the entity sample having a particular attribute;
whether a cycle exists in the incidence relation path of the entity sample.
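One way to picture the input of the classification model in the second aspect is as one fixed-length feature row per entity sample, paired with its risk condition label. The sketch below is an illustration under assumptions that the text does not state: the per-path results are taken to be scores in [0, 1], they are summarized by their maximum, mean and count so that path sets of different sizes give rows of equal length, and only three statistical characteristics are used. The names `build_feature_row` and `STAT_ORDER` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed fixed order in which statistical characteristics enter the row.
STAT_ORDER = ("shortest_path_length", "longest_path_length", "has_cycle")


def build_feature_row(
    path_scores: Sequence[float], stats: Dict[str, float]
) -> List[float]:
    """Combine per-path risk results with statistical characteristics."""
    if not path_scores:
        raise ValueError("an entity sample needs at least one path result")
    # Assumption: summarize a variable number of path results by max, mean, count.
    summary = [
        max(path_scores),
        sum(path_scores) / len(path_scores),
        float(len(path_scores)),
    ]
    return summary + [float(stats[name]) for name in STAT_ORDER]


# One training example: (feature row, risk condition label).
training_example: Tuple[List[float], int] = (
    build_feature_row(
        [0.5, 0.25],
        {"shortest_path_length": 1, "longest_path_length": 2, "has_cycle": False},
    ),
    1,  # hypothetical label: 1 = risky, 0 = not risky
)
```

Rows built this way could be passed to any classifier that accepts numeric feature vectors; the choice of classifier is left open here.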
According to a third aspect, there is provided a method of obtaining a path recognition model, comprising:
acquiring training data, wherein the training data comprises an incidence relation path and a risk condition label labeled on the incidence relation path; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;
coding the incidence relation path to obtain a coding sequence of the incidence relation path;
and taking the coding sequence of the incidence relation path as the input of a classification model, taking the risk condition label of the incidence relation path as the target output of the classification model, and training the classification model to obtain the path recognition model.
In one embodiment, encoding the association relationship path to obtain an encoding sequence of the association relationship path includes:
respectively encoding the entity and the incidence relation in the incidence relation path;
and splicing the coding results of the entities and the coding results of the association relations according to the sequence in the association relation path to obtain the coding results of the association relation path.
According to a fourth aspect, there is provided a risk identification apparatus comprising:
the path generation unit is configured to generate at least one incidence relation path with an entity to be identified as an end node according to the incidence relation data to form a path set of the entity to be identified; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;
the path identification unit is configured to perform risk identification on the incidence relation path by using a path identification model obtained through pre-training;
and the risk identification unit is configured to determine a risk identification result of the entity to be identified by using the risk identification result of each incidence relation path in the path set of the entity to be identified.
In an embodiment, the path generating unit is specifically configured to: taking the entity to be identified as a starting point to perform path search from the incidence relation data to obtain an incidence relation path taking the entity to be identified as an end node; or,
respectively taking each entity in the incidence relation data as an initial end node, and searching a path according to the incidence relation; and acquiring the incidence relation path with the entity to be identified as an end node from the obtained incidence relation paths.
In another embodiment, the path identifying unit includes:
the path coding subunit is configured to code the incidence relation path to obtain a coding sequence of the incidence relation path;
and the path identification subunit is configured to input the coding sequence of the incidence relation path into the path identification model to obtain a risk identification result of the path identification model on the incidence relation path.
In an embodiment, the risk identification unit is specifically configured to input a risk identification result of each association relationship path in the path set of the entity to be identified and a statistical characteristic of the entity to be identified in the association relationship data into a risk identification model obtained through pre-training, so as to obtain a risk identification result of the entity to be identified.
According to a fifth aspect, there is provided an apparatus for obtaining a risk identification model, comprising:
a first obtaining unit configured to obtain training data, wherein the training data comprises a path set of entity samples formed by incidence relation paths taking the entity samples as end nodes, and risk condition labels labeled on the entity samples; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;
the path identification unit is configured to perform risk identification on the incidence relation path by using a path identification model obtained through pre-training;
and the first training unit is configured to take a risk identification result of each incidence relation path in the path set of the entity sample as an input of a classification model, take the risk condition label as a target output of the classification model, and train the classification model to obtain a risk identification model.
In an embodiment, the first obtaining unit is specifically configured to use an entity labeled with a risk state label as an entity sample, perform path search from the association relationship data using the entity sample as a starting point, and obtain a path set of the entity sample formed by association relationship paths using the entity sample as an end node, so as to obtain the training data; or,
respectively taking each entity in the incidence relation data as an initial end node, and searching a path according to the incidence relation; selecting entity samples from the end nodes of the obtained incidence relation paths, labeling risk state labels on the entity samples, and obtaining a path set of the entity samples formed by the incidence relation paths taking the entity samples as end points to obtain the training data.
In another embodiment, the path identifying unit is specifically configured to encode the association relationship path to obtain an encoding sequence of the association relationship path; and inputting the coding sequence of the incidence relation path into the path identification model to obtain a risk identification result of the path identification model for the incidence relation path.
In an embodiment, the first training unit is specifically configured to use a risk identification result of each association relationship path in the path set of the entity sample and a statistical feature of the entity sample in the association relationship data as inputs of the classification model.
According to a sixth aspect, there is provided an apparatus for obtaining a path recognition model, comprising:
a second obtaining unit configured to obtain training data, the training data including an incidence relation path and a risk condition label labeled to the incidence relation path; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;
the path coding unit is configured to code the incidence relation path to obtain a coding sequence of the incidence relation path;
and the second training unit is configured to take the coding sequence of the incidence relation path as the input of a classification model, take the risk condition label of the incidence relation path as the target output of the classification model, and train the classification model to obtain the path recognition model.
In an embodiment, the path encoding unit is specifically configured to encode the entity and the association relationship in the association relationship path respectively; and splicing the coding results of the entities and the coding results of the association relations according to the sequence in the association relation path to obtain the coding results of the association relation path.
According to a seventh aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor which, when executing the executable code, implements the method of the first aspect.
According to the method and the device provided by the embodiment of the specification, the characteristics of the incidence relation between the entity and the entity are reflected through the incidence relation path, the risk identification result of the incidence relation path and the statistical characteristics of the entity to be identified are used as the basis of risk identification, and the entity is subjected to risk identification. Therefore, the purpose of efficiently and accurately identifying the risk of the entity is achieved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 shows a flow diagram of a risk identification method according to one embodiment;
FIG. 2 illustrates an example diagram of an incidence relation path, according to one embodiment;
FIG. 3 illustrates a flow diagram of a method of risk identification for a path, according to one embodiment;
FIG. 4 illustrates a flow diagram of a method of training a path recognition model, according to one embodiment;
FIG. 5 shows a schematic diagram of a cyclic path according to one embodiment;
FIG. 6 illustrates a flow diagram of a method of training a risk recognition model, according to one embodiment;
FIG. 7 shows a schematic block diagram of a risk identification apparatus according to an embodiment;
FIG. 8 shows a schematic block diagram of an apparatus for obtaining a risk identification model according to one embodiment;
FIG. 9 shows a schematic block diagram of an apparatus for obtaining a path recognition model according to one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
FIG. 1 shows a flow diagram of a risk identification method according to one embodiment. It is to be appreciated that the method can be performed by any apparatus, device, platform, cluster of devices having computing and processing capabilities. As shown in fig. 1, the method includes:
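Step 101: generating at least one incidence relation path taking an entity to be identified as an end node according to incidence relation data to form a path set of the entity to be identified.
Step 103: carrying out risk identification on the incidence relation path by using a path identification model obtained by pre-training.
Step 105: determining a risk identification result of the entity to be identified by using the risk identification result of each incidence relation path in the path set of the entity to be identified.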
As can be seen from the method flow provided by the embodiment shown in fig. 1, the method disclosed by the present disclosure processes the characteristics of the association relationship between the entity and the entity through the association relationship path, and performs risk identification on the entity by using the risk identification result of the association relationship path and the statistical characteristics of the entity to be identified as the basis for risk identification. Therefore, the purpose of efficiently and accurately identifying the risk of the entity is achieved.
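The three-step flow of fig. 1 can be sketched as follows. This is a simplified illustration rather than the patented implementation: the path identification model is replaced by a hypothetical toy scoring function, the per-path results are combined by a plain mean (an assumption), and the statistical characteristics are left out. A path is assumed to be a tuple of entity names whose last element is the end node.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Path = Tuple[str, ...]


def generate_path_set(paths: Sequence[Path], target: str) -> List[Path]:
    """Step 101: keep the paths whose end node is the entity to be identified."""
    return [path for path in paths if path and path[-1] == target]


def identify_entity_risk(
    paths: Sequence[Path],
    target: str,
    path_model: Callable[[Path], float],
    combine: Callable[[List[float]], float] = mean,
) -> float:
    """Steps 103 and 105: score each path, then combine the per-path results."""
    path_set = generate_path_set(paths, target)
    if not path_set:
        raise ValueError(f"no path ends at {target!r}")
    scores = [path_model(path) for path in path_set]  # step 103
    return combine(scores)  # step 105; combining by mean is an assumption


def toy_path_model(path: Path) -> float:
    # Hypothetical stand-in for the pre-trained model: longer paths score higher.
    return min(1.0, 0.25 * (len(path) - 1))


all_paths = [
    ("person A", "enterprise 2", "enterprise 1"),
    ("person B", "enterprise 1"),
    ("person C", "enterprise 9"),
]
risk = identify_entity_risk(all_paths, "enterprise 1", toy_path_model)
```

Passing `combine` as a parameter keeps step 105 replaceable, for example by a trained model instead of an average.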
The steps shown in fig. 1 above are described in detail with reference to specific embodiments. As a typical application scenario, the present disclosure can be applied to risk identification based on the equity structure. For example, the investigation of the UBO (Ultimate Beneficial Owner) is a mandatory requirement of the central bank and the banking regulator. Failure to comply with increasingly stringent laws and regulations can pose significant legal, financial, and reputational risks to both companies and the responsible parties. An enterprise or enterprise group controlled by a UBO through complex equity relationships is exposed to a series of risks such as false funding, group operation, hidden architecture, evasion of supervision, related transactions, and rapid expansion. It is therefore necessary to perform risk identification on UBOs from complex economic relationship data (e.g., stock control relationship data). In this scenario, the association relationship data is economic relationship data, the entity to be identified may be an enterprise or a natural person, and the association relationship is a stock control relationship. For ease of understanding, the following embodiments are described using this specific scenario as an example. However, it should be noted that the present disclosure is not limited to this exemplary application scenario and may also be applied to other application scenarios.
Firstly, the above step 101, that is, "generating at least one association relationship path using the entity to be identified as an end node according to the association relationship data, and forming a path set of the entity to be identified" will be described in detail.
When generating an association relationship path with the entity to be identified as an end node according to the association relationship data, the following two modes may be adopted, without being limited thereto:
The first mode is as follows: performing path search from the incidence relation data by taking the entity to be identified as a starting point to obtain an incidence relation path taking the entity to be identified as an end node.
Taking the above typical scenario as an example, economic relationship data, such as stock control relationship data, may be obtained from an interface provided by a department of industry and commerce. When risk identification needs to be carried out on a certain specific entity (the specific entity is referred to as an entity to be identified in the present disclosure, and the entity to be identified can be one or more entities), path search is carried out in the stock control relationship data by taking the entity to be identified as a starting point.
Take the case where the entity to be identified is an enterprise, for example, enterprise 1. An entity (which may be an enterprise or a natural person) holding stock control over enterprise 1 is searched for in the stock control relationship data; if that entity is enterprise 2, an entity holding stock control over enterprise 2 is further searched for in the stock control relationship data. The path search continues in this way until the entity found is a natural person. An enterprise may be controlled by one or more entities.
Suppose the path search finally yields several incidence relation paths (i.e., stock control paths) for enterprise 1, two of which are shown in FIG. 2. The nodes in the paths shown in the figure are entities, which can be enterprises or natural persons. The edges are the association relationships between the nodes, in this example the stock control relationship, and the numbers in the figure are the stock control ratios. The entity to be identified, enterprise 1, is the end node in each of these paths.
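The upward search described for enterprise 1 can be sketched as a depth-first traversal. This is a minimal illustration under assumptions: the stock control data is a hypothetical in-memory dictionary, paths are listed from the top-level controller down to the entity to be identified, and a repeated entity ends the path so that the search terminates on cyclic data (the text only says to search until a natural person is found).

```python
from typing import Dict, List, Tuple

# Hypothetical stock control data: controlled entity -> [(controller, ratio)].
HOLDERS: Dict[str, List[Tuple[str, float]]] = {
    "enterprise 1": [("enterprise 2", 0.6), ("person B", 0.4)],
    "enterprise 2": [("person A", 1.0)],
}
NATURAL_PERSONS = {"person A", "person B"}


def search_control_paths(target: str) -> List[List[str]]:
    """Return every stock control path that ends at `target`."""
    paths: List[List[str]] = []

    def walk(entity: str, trail: List[str]) -> None:
        holders = HOLDERS.get(entity, [])
        # Stop at a natural person, or when no controller is recorded.
        if entity in NATURAL_PERSONS or not holders:
            paths.append(list(reversed(trail)))
            return
        for holder, _ratio in holders:
            if holder in trail:
                # Assumption: close the path at a repeated entity to avoid looping.
                paths.append(list(reversed(trail + [holder])))
                continue
            walk(holder, trail + [holder])

    walk(target, [target])
    return paths


control_paths = search_control_paths("enterprise 1")
```

The ratios are ignored during traversal here; a fuller version would keep them on the edges so they can be used when the path is encoded.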
Take the case where the entity to be identified is a natural person, for example, natural person 1. An entity found in the stock control relationship data to be controlled by a natural person can only be an enterprise; assume these are enterprise 3 and enterprise 4. This forms a bifurcation into two paths. The entities controlled by enterprise 3 and the entities controlled by enterprise 4 are then further searched for in the stock control relationship data, and so on, until no further controlled entity is found.
The second mode is as follows: each entity in the incidence relation data is taken as a starting end node, and a path search is performed according to the incidence relation; the incidence relation paths taking the entity to be identified as an end node are then acquired from the obtained incidence relation paths.
In this mode, each entity in the association relationship data is used as a starting end node, and the search is performed with the path search method described in the first mode, yielding association relationship paths with each entity as an end node, which amounts to a full set of association relationship paths. The association relationship paths of the entity to be identified are then found among them and used as the path set of the entity to be identified.
Taking the above typical scenario as an example, all enterprises are found from the stock control relationship data. By performing a path search for each enterprise, the association relationship paths starting from each enterprise can be obtained. If the entity to be identified is enterprise 1, the association relationship paths taking enterprise 1 as an end node are found among the obtained association relationship paths.
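In the second mode the full path set is computed once and then looked up per entity. A minimal sketch of that lookup, assuming each path is a tuple of entity names whose last element is its end node (the function name `index_paths_by_end_node` and the sample data are hypothetical):

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Path = Tuple[str, ...]


def index_paths_by_end_node(paths: Sequence[Path]) -> Dict[str, List[Path]]:
    """Group the full set of paths by the entity at the end of each path."""
    index: Dict[str, List[Path]] = defaultdict(list)
    for path in paths:
        if path:  # ignore empty paths
            index[path[-1]].append(path)
    return dict(index)


# Hypothetical result of searching from every entity in the data.
full_path_set: List[Path] = [
    ("person A", "enterprise 2", "enterprise 1"),
    ("person B", "enterprise 1"),
    ("person A", "enterprise 2"),
]
path_index = index_paths_by_end_node(full_path_set)
path_set_of_enterprise_1 = path_index.get("enterprise 1", [])
```

Compared with the first mode, this trades a larger up-front search for cheap lookups when many entities have to be identified.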
The following describes step 103, that is, "perform risk identification on the association relationship path by using the path identification model obtained through pre-training" in detail with reference to the embodiment.
In this step, risk identification is performed on the association relationship paths one by one to obtain risk state information of each association relationship path. As a preferred embodiment, as shown in fig. 3, the method specifically includes the following steps:
Step 301: coding the incidence relation path to obtain a coding sequence of the incidence relation path.
When the incidence relation path is coded in the step, the entity and the incidence relation in the incidence relation path can be respectively coded; and splicing the coding results of the entities and the coding results of the association relations according to the sequence in the association relation path to obtain the coding results of the association relation path. When the entity is coded, at least one attribute information of the entity can be coded and then spliced according to a preset sequence to obtain a coding result of the entity.
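The splicing described above can be sketched as follows. This is an illustration under assumptions: each entity is a dictionary of already-encoded attributes, the preset attribute order is type then industry, and a relation is encoded as a hypothetical token built from the control ratio in whole percent. The attribute codes used in the example are placeholder tokens.

```python
from typing import Dict, List, Sequence

# Assumed preset order in which an entity's attribute codes are spliced.
ATTRIBUTE_ORDER = ("type", "industry")


def encode_entity(attributes: Dict[str, str]) -> List[str]:
    """Splice the attribute codes of one entity in the preset order."""
    return [attributes[name] for name in ATTRIBUTE_ORDER if name in attributes]


def encode_relation(percent: int) -> str:
    # Hypothetical relation code: the control ratio in whole percent.
    return f"R{percent}"


def encode_path(
    entities: Sequence[Dict[str, str]], percents: Sequence[int]
) -> List[str]:
    """Splice entity codes and relation codes in the order of the path."""
    if len(percents) != len(entities) - 1:
        raise ValueError("a path with n entities needs n - 1 relations")
    tokens: List[str] = []
    for position, attributes in enumerate(entities):
        tokens.extend(encode_entity(attributes))
        if position < len(percents):
            tokens.append(encode_relation(percents[position]))
    return tokens


coding_sequence = encode_path(
    [
        {"type": "TYP189", "industry": "INDI"},
        {"type": "TYP1", "industry": "INDL"},
    ],
    [60],
)
```

The resulting token list plays the role of the coding sequence that is fed to the path identification model.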
Taking the above typical scenario as an example, after the stock control relationship path of the entity to be identified is obtained, the goal is to represent the stock control relationship path as a text sequence. Since the stock control relationship path is composed of the entities and the stock control relationship, each entity and stock control relationship can be first encoded separately.
Taking the enterprise name as an example, a name contains a great deal of personalized information, such as the territory and the aesthetics or preferences of the registrant. For example, in the enterprise name "Shenzhen City beauty ounce Limited", "Shenzhen City" represents a territory and "beauty ounce" is personalized information reflecting the aesthetics or preferences of the registrant. Such information, however, has little effect on risk identification, and encoding it yields sequences with low information density and long length, which prevents the performance of the path identification model from being used effectively. Therefore, in this embodiment, some important attribute information of the enterprise may be encoded and then spliced in a preset order to obtain the coding result of the enterprise.
The important attributes may include enterprise type, industry, registered capital, and the like. Examples are as follows:
For the enterprise type, tens of common enterprise types can be converted into type codes. For example, "limited liability company" is encoded as "TYP1", "limited liability company (state stock control)" is encoded as "TYP344", "stocks limited company" is encoded as "TYP189", "listed stocks limited company" is encoded as "TYP126", and so on.
For the industry, the primary and secondary classification codes of the national economy industry classification can be encoded, and the number of codes is 118. For example, "agriculture, forestry, animal husbandry, fisheries" is encoded as "INDA", "rental and business services" is encoded as "INDL", "information transfer, software and information technology services" is encoded as "INDI", and so on.
For the registered capital, the range to which the registered capital belongs may be encoded. For example, a registered capital of 10,000 to 50,000 is encoded as "AST1", a registered capital of 100,000 to 1,000,000 is encoded as "AST2", a registered capital of 1,010,000 to 5,000,000 is encoded as "AST3", and so on.
A node that is a natural person may be encoded directly, for example as "PERSON", to indicate that its type is a natural person.
When the stock control relationship is encoded, the stock control ratio and the initial year of stock holding may be used. For example, a 60% holding that began in 2014 may be encoded as "Y2014R60".
Taking path 1 in fig. 2 as an example, natural person 1 is encoded as "PERSON", the stock control relationship between natural person 1 and enterprise 3 is encoded as "Y2014R60", enterprise 3 is encoded as "TYP1IND0AST6", the stock control relationship between enterprise 3 and enterprise 2 is encoded as "Y2008R35", enterprise 2 is encoded as "TYP1IND0AST5", the stock control relationship between enterprise 2 and enterprise 1 is encoded as "Y1988R29", enterprise 1 is encoded as "TYP344INDLAST7", and the coding sequence of the whole path is "PERSON Y2014R60 TYP1IND0AST6 Y2008R35 TYP1IND0AST5 Y1988R29 TYP344INDLAST7".
It should be noted that the encoding manner shown in the above example is only for illustrative purposes, but is not intended to limit the specific encoding manner. Any coding method under the above spirit principle falls into the protection scope of the present application.
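As an illustration of the encoding described above, the following sketch encodes entities and stock control relationships separately and splices them in path order. It is a sketch under stated assumptions only: the dictionary layout of entities and relationships, the capital range boundaries in `encode_capital`, and all function names are hypothetical, and only a few of the type and industry codes mentioned in the examples are included.

```python
TYPE_CODES = {
    "limited liability company": "TYP1",
    "limited liability company (state stock control)": "TYP344",
}
INDUSTRY_CODES = {
    "rental and business services": "INDL",
    "information transfer, software and information technology services": "INDI",
}

def encode_capital(amount):
    """Map registered capital to a range token (range bounds are assumed)."""
    bounds = [50_000, 1_000_000, 5_000_000]
    for i, upper in enumerate(bounds, start=1):
        if amount <= upper:
            return f"AST{i}"
    return f"AST{len(bounds) + 1}"

def encode_entity(entity):
    """Splice the attribute codes of an entity in a preset order."""
    if entity["kind"] == "person":
        return "PERSON"
    return (TYPE_CODES[entity["type"]]
            + INDUSTRY_CODES[entity["industry"]]
            + encode_capital(entity["capital"]))

def encode_relation(relation):
    """Encode initial holding year and holding ratio, e.g. Y2014R60."""
    return f"Y{relation['year']}R{relation['ratio']}"

def encode_path(entities, relations):
    """Interleave entity codes and relationship codes in path order."""
    tokens = []
    for i, entity in enumerate(entities):
        tokens.append(encode_entity(entity))
        if i < len(relations):
            tokens.append(encode_relation(relations[i]))
    return " ".join(tokens)

# Hypothetical path: a natural person holds one enterprise, which holds another.
entities = [
    {"kind": "person"},
    {"kind": "enterprise", "type": "limited liability company",
     "industry": "rental and business services", "capital": 300_000},
    {"kind": "enterprise",
     "type": "limited liability company (state stock control)",
     "industry": "information transfer, software and information technology services",
     "capital": 2_000_000},
]
relations = [{"year": 2014, "ratio": 60}, {"year": 2008, "ratio": 35}]
sequence = encode_path(entities, relations)
```

Each entity or relationship becomes one whitespace-separated token, so the resulting sequence can be handled like a short text.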
Step 303: Input the coding sequence of the association relationship path into the path identification model to obtain the risk identification result of the path identification model for the association relationship path.
The path identification model adopted in this embodiment may be a text classification model, that is, classification is performed based on the coding sequence of the association relationship path, and the classification result is a specific risk state, for example high, medium, and low risk; or high and low risk; or primary risk, secondary risk, tertiary risk, and so on. That is, the classification may be binary or multi-class.
The text classification model employed may include, but is not limited to, a Word2vec-based text classification model, a fastText-based text classification model, or a textCNN-based text classification model.
Still taking the above exemplary application scenario as an example, the coding sequence obtained from the stock control relationship path is similar to text in a natural language. Therefore, the coding sequence of the stock control relationship path can be vectorized and then mapped to a specific classification result through the fully connected layer.
The following describes the training process of the path recognition model in detail by using a specific embodiment. As shown in fig. 4, the training process may include the steps of:
Step 401: Acquire training data, wherein the training data comprises association relationship paths and the risk condition labels labeled on the association relationship paths; the nodes in an association relationship path are entities, and the edges are the association relationships among the entities.
In this step, some association relationship paths may be labeled with risk condition labels in advance, that is, the training data includes some association relationship paths with known risk conditions.
As an implementation manner, each entity in the association relationship data can be used in turn as a starting end node, and a path search is performed according to the association relationships to obtain the association relationship paths. Association relationship paths of the various risk conditions are then selected from these paths and labeled with risk condition labels. Taking binary classification as an example, the paths can be labeled as low risk and high risk. For the path search process, reference may be made to the relevant description of step 101 in the above embodiment, which is not repeated here.
The risk condition labels may be attached to the association relationship paths manually. Automatic labeling may also be used. For example, in a blacklist-based manner, the association relationship paths of enterprises or natural persons involved in tax evasion can be labeled as high risk; in a whitelist-based manner, the association relationship paths of enterprises or natural persons such as class A taxpayers, outstanding contributors, and star enterprises can be labeled as low risk.
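The list-based automatic labeling can be sketched as below. The rule that a path is labeled when any of its nodes appears in a list, the precedence of the blacklist over the whitelist, and the name `label_path` are assumptions made for illustration; the embodiment does not fix these details.

```python
def label_path(path, blacklist, whitelist):
    """Return "high", "low", or None when neither list applies.

    path: sequence of entity identifiers along one association relationship path.
    Assumption: the blacklist takes precedence when both lists match.
    """
    nodes = set(path)
    if nodes & blacklist:
        return "high"
    if nodes & whitelist:
        return "low"
    return None  # left for manual labeling or excluded from training

# Hypothetical lists and paths.
blacklist = {"enterprise9"}     # e.g. entities involved in tax evasion
whitelist = {"enterprise1"}     # e.g. class A taxpayers
labels = [
    label_path(("person1", "enterprise9"), blacklist, whitelist),
    label_path(("person2", "enterprise1"), blacklist, whitelist),
    label_path(("person3", "enterprise5"), blacklist, whitelist),
]
```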
Step 403: Encode the association relationship path to obtain a coding sequence of the association relationship path.
For the manner of encoding the association relationship path, reference may be made to the related description of step 301 in the above embodiment, which is not repeated here.
Step 405: Take the coding sequence of the association relationship path as the input of a classification model, take the risk condition label of the association relationship path as the target output of the classification model, and train the classification model to obtain the path identification model.
The path identification model adopted in this embodiment may be a text classification model, that is, classification is performed based on the coding sequence of the association relationship path, and the classification result is a specific risk state, for example high, medium, and low risk; or high and low risk; or primary risk, secondary risk, tertiary risk, and so on. That is, the classification may be binary or multi-class.
The text classification model employed may include, but is not limited to, a Word2vec-based text classification model, a fastText-based text classification model, or a textCNN-based text classification model.
The text classification model firstly carries out vectorization on the coding sequence of the incidence relation path, and then the coding sequence is mapped to a specific classification result through the full connection layer.
In the Word2vec (word vector) based text classification model, words are vectorized using the Word2vec CBOW (Continuous Bag-of-Words) model, sentence vectors are obtained by weighted averaging, and some secondary features are finally removed by principal component analysis. In the Word2vec-based text classification model, when hierarchical softmax is used in the fully connected layer, all word vectors in the training data are located at the leaf nodes of the Huffman tree.
fastText is a fast text classification algorithm. A fastText-based text classification model superimposes and averages the vectors of all words and n-grams in the coding sequence. That is, in addition to the word vector of each word, fastText generates vectors for character-level n-grams as additional features, so that character-level n-gram vectors are additionally introduced into the vector features input to the fully connected layer. This vectorization manner produces better word vectors for low-frequency words, and for words outside the training corpus, word vectors can still be constructed by superimposing character-level n-gram vectors. When hierarchical softmax is used in the fully connected layer, the leaf nodes of the Huffman tree correspond to the class labels.
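To make the character-level n-gram features concrete, the sketch below extracts them from the tokens of a coding sequence. It is a simplified illustration, not the fastText library itself: the function names are hypothetical, a single n-gram length is used, and the hashing of n-grams into buckets is omitted. The "<" and ">" boundary markers follow the convention fastText uses for word boundaries.

```python
def char_ngrams(token, n=3):
    """Return the character n-grams of a token wrapped in boundary markers."""
    padded = f"<{token}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def sequence_features(sequence, n=3):
    """Collect each token plus its character n-grams as features."""
    features = []
    for token in sequence.split():
        features.append(token)
        features.extend(char_ngrams(token, n))
    return features

ngrams = char_ngrams("TYP1")
features = sequence_features("PERSON Y2014R60")
```

Because an unseen token such as a new year and ratio combination still shares n-grams like "Y20" with tokens seen in training, a vector can be composed for it from the n-gram vectors.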
fastText is the more preferred approach here, because the task in this scenario is to classify the risk condition of the coding sequence. In addition, in this embodiment, when the text classification model is trained, the path length (i.e., the number of nodes included) of the training data adopted is preferably greater than 2. The reason is that a short path carries little information, which affects the accuracy of model classification.
In the specific training process, the training objective is to make the classification result of the text classification model on the risk condition of the incidence relation path consistent with the risk condition label in the training data as much as possible. A loss function can be constructed according to the training target, parameters of the text classification model are updated according to the value of the loss function in each iteration until an iteration ending condition is reached, and the finally obtained text classification model is used as a path recognition model. The iteration ending condition may be that the iteration number reaches a preset iteration number threshold, or that the value of the loss function is less than or equal to a preset loss function threshold, or the like.
For the path identification model, the output of the full connection layer is actually the high risk probability of each incidence relation path, and the risk condition of the incidence relation path can be identified according to the high risk probability and the high risk probability threshold.
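The training loop and the probability threshold described above can be sketched with a deliberately small stand-in model. This is not fastText, Word2vec, or textCNN: it averages one-hot token vectors and trains a single logistic output with batch gradient descent, purely to show the loss-driven iteration, the two stopping conditions, and the thresholding of the high-risk probability. The training samples, hyperparameters, and function names are hypothetical.

```python
import math

def featurize(sequence, vocab):
    """Average one-hot token vectors (a crude stand-in for averaged embeddings)."""
    tokens = sequence.split()
    vec = [0.0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] += 1.0 / len(tokens)
    return vec

def predict_proba(weights, bias, vec):
    """High-risk probability from a single fully connected logistic unit."""
    z = bias + sum(w * x for w, x in zip(weights, vec))
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, vocab, lr=1.0, max_iters=2000, loss_threshold=0.2):
    weights, bias = [0.0] * len(vocab), 0.0
    data = [(featurize(seq, vocab), label) for seq, label in samples]
    for _ in range(max_iters):                      # stop: iteration threshold
        loss, grad_w, grad_b = 0.0, [0.0] * len(vocab), 0.0
        for vec, label in data:
            p = min(max(predict_proba(weights, bias, vec), 1e-12), 1 - 1e-12)
            loss -= label * math.log(p) + (1 - label) * math.log(1 - p)
            for i, x in enumerate(vec):
                grad_w[i] += (p - label) * x
            grad_b += p - label
        if loss / len(data) <= loss_threshold:      # stop: loss threshold
            break
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(data)
    return weights, bias

# Hypothetical labeled coding sequences (1 = high risk, 0 = low risk).
samples = [
    ("PERSON Y2014R60 TYP1INDLAST2", 1),
    ("PERSON Y2019R90 TYP344INDIAST3", 0),
    ("TYP1INDLAST2 Y2014R60 TYP1INDLAST2", 1),
    ("TYP344INDIAST3 Y2019R90 TYP344INDIAST3", 0),
]
vocab = {}
for seq, _ in samples:
    for tok in seq.split():
        vocab.setdefault(tok, len(vocab))

weights, bias = train(samples, vocab)

def classify(sequence, threshold=0.5):
    """Compare the high-risk probability with a high-risk probability threshold."""
    p_high = predict_proba(weights, bias, featurize(sequence, vocab))
    return "high" if p_high >= threshold else "low"
```

Raising `threshold` trades recall of high-risk paths for precision; the value 0.5 is only a default chosen for this sketch.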
The following describes in detail the step 105 of determining the risk identification result of the entity to be identified by using the risk identification result of each association relationship path in the path set of the entity to be identified, with reference to an embodiment.
When the risk identification result of the entity to be identified is determined, as one achievable manner, it can be determined directly from the risk identification results of the association relationship paths in the path set of the entity to be identified, for example from the proportion that each risk identification result accounts for among the association relationship paths in the path set. For example, if the high-risk results account for more than 50% of the risk identification results of the association relationship paths in the path set of an enterprise, the risk identification result of the enterprise may be determined to be high risk.
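The proportion-based rule can be sketched in a few lines. The function name, the treatment of an empty path set, and the "low" result for everything that does not exceed the threshold are assumptions for a binary setting.

```python
def entity_risk_by_share(path_results, high_share_threshold=0.5):
    """Return "high" when the share of high-risk paths exceeds the threshold."""
    if not path_results:
        return None  # no paths, so no decision can be made by this rule
    share = sum(1 for r in path_results if r == "high") / len(path_results)
    return "high" if share > high_share_threshold else "low"

result = entity_risk_by_share(["high", "high", "low"])
```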
However, the text classification model has limited recall on the large number of short paths, and there are cases in which an enterprise is controlled through a plurality of short stock control relationships. In view of this, in addition to the above manner of directly determining the risk identification result of the entity to be identified from the risk identification results of the association relationship paths, a preferred embodiment is provided here: the risk identification result of each association relationship path in the path set and the statistical characteristics of the entity to be identified in the association relationship data are input into a risk identification model obtained by pre-training, so as to obtain the risk identification result of the entity to be identified.
Wherein, the statistical characteristics of the entity to be identified in the association relation data may include, but are not limited to, at least one of the following:
the number of paths in the path set of the entity to be identified;
the length of the shortest path in the path set of the entity to be identified;
the length of the longest path in the path set of the entity to be identified;
the number of entities of a preset type contained in the path set of the entity to be identified;
the number of entities having the same preset attribute as the entity to be identified; the preset attributes can comprise a legal person, UBO, a contact way and the like;
whether the entity to be identified has a preset relationship characteristic or not; for example, whether the overseas enterprise benefits exceed a preset proportion;
the proportion of entities to be identified having a particular attribute; such as the proportion of the overseas benefit stock;
and whether a cycle exists in the incidence relation path of the entity to be identified. For example, as shown in FIG. 5, if natural person P1 has 1% equity to business C1, business C1 has 100% equity to business C2, and business C2 in turn has 99% equity to business C1, then the relationship path is determined to have a loop.
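Some of the statistical characteristics listed above can be computed as in the following sketch, which covers the path count, the shortest and longest path lengths (counted in nodes), and cycle detection on the example of FIG. 5. The function names and the use of a three-colour depth-first search are illustrative choices, not requirements of the embodiment.

```python
def path_statistics(paths):
    """Count paths and measure shortest/longest length in number of nodes."""
    lengths = [len(p) for p in paths]
    return {
        "num_paths": len(paths),
        "shortest": min(lengths, default=0),
        "longest": max(lengths, default=0),
    }

def has_cycle(edges):
    """Detect a directed cycle with a three-colour depth-first search."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY                 # node is on the current DFS stack
        for nxt in graph[node]:
            if colour[nxt] == GREY:         # back edge: a cycle is closed
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK                # fully explored, no cycle through it
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)

# FIG. 5 example: P1 holds C1, C1 holds C2, and C2 in turn holds C1.
fig5_edges = [("P1", "C1"), ("C1", "C2"), ("C2", "C1")]
stats = path_statistics([("P1", "C1"), ("P1", "C1", "C2")])
cycle = has_cycle(fig5_edges)
```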
When the risk identification result of each incidence relation path in the path set is input into the risk identification model, the classification identification result of each incidence relation path can be directly input into the risk identification model as a plurality of features, or one or any combination of the maximum risk probability value, the minimum risk probability value, the average risk probability value and the like of each incidence relation path can be input into the risk identification model as features.
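The aggregated probability features can be combined with the statistical characteristics into one feature vector, for example as sketched below. The feature names in `FEATURE_ORDER`, the zero default for an empty path set, and the function names are hypothetical; the fixed ordering simply ensures that the same layout is used whenever a vector is built.

```python
FEATURE_ORDER = ["num_paths", "shortest", "longest", "has_cycle"]

def probability_features(path_probs):
    """Maximum, minimum, and average high-risk probability over the path set."""
    if not path_probs:
        return [0.0, 0.0, 0.0]
    return [max(path_probs), min(path_probs), sum(path_probs) / len(path_probs)]

def build_features(path_probs, stats):
    """Concatenate probability features and statistics in a fixed order."""
    return probability_features(path_probs) + [float(stats[k]) for k in FEATURE_ORDER]

# Hypothetical inputs for one entity to be identified.
vector = build_features(
    [0.75, 0.25, 0.5],
    {"num_paths": 3, "shortest": 2, "longest": 4, "has_cycle": True},
)
```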
Continuing with the exemplary application scenario, after risk identification has been performed on each association relationship path in the path set of enterprise 1, the maximum risk probability value, the minimum risk probability value, and the average risk probability value of the association relationship paths in the path set are used as one part of the feature input of the risk identification model. Statistical features such as the number of paths in the path set of enterprise 1, the length of the shortest path, the length of the longest path, the number of enterprises included, whether the overseas enterprise benefits exceed 25%, the proportion of overseas benefit rights, the number of enterprises with the same legal person, the number of enterprises with the same UBO, the number of enterprises with the same contact way, the number of impassable enterprises, and whether a cycle exists in the stock control paths are used as another part of the feature input of the risk identification model. The risk identification model classifies enterprise 1 according to the input features to obtain the risk identification result of enterprise 1.
The risk identification model is pre-established, and the establishment process can be seen in fig. 6. As shown in fig. 6, the method may include the steps of:
Step 601: Acquire training data, wherein the training data comprises a path set of each entity sample, formed by the association relationship paths that take the entity sample as an endpoint, and the risk condition label labeled on the entity sample; the nodes in an association relationship path are entities, and the edges are the association relationships among the entities.
In this embodiment, some entities with known risk conditions may be used as entity samples and labeled with risk condition labels, and a path search is performed in the association relationship data using each entity sample as a starting point, so as to obtain the path set of the entity sample, formed by the association relationship paths that take the entity sample as an endpoint.
Alternatively, each entity in the association relationship data is used in turn as a starting end node, and a path search is performed according to the association relationships; entities with known risk conditions are selected from the end nodes of the association relationship paths found as entity samples and labeled with risk state labels, and the path set of each entity sample, formed by the association relationship paths that take the entity sample as an endpoint, is acquired to obtain the training data.
The above path searching process may refer to the relevant description in step 101 in the above embodiment, and is not described herein again. In the above exemplary scenario, the entity sample may be a business or a natural person.
Step 602: Perform risk identification on the association relationship paths by using the path identification model obtained by pre-training.
For the process of risk identification on the association relationship paths, reference may be made to the relevant description of the embodiment shown in fig. 3, which is not repeated here.
Step 603: Take the risk identification result of each association relationship path in the path set of the entity sample and the statistical characteristics of the entity sample in the association relationship data as the input of a classification model, take the risk condition label as the target output of the classification model, and train the classification model to obtain the risk identification model.
Likewise, the statistical characteristics of the entity sample in the association relationship data may include, but are not limited to, at least one of the following:
the number of paths in a path set of the entity sample;
length of shortest path in path set of entity sample;
the length of the longest path in the path set of the entity sample;
the number of entities of a preset type contained in the path set of the entity sample;
the number of entities having the same preset attribute as the entity sample; the preset attributes can comprise a legal person, UBO, a contact way and the like;
whether the entity sample has a preset relationship characteristic or not; for example, whether the overseas enterprise benefits exceed a preset proportion;
a proportion of the entity samples having a particular attribute; such as the proportion of the overseas benefit stock;
whether a cycle exists in the incidence relation path of the entity sample.
When the risk identification result of each incidence relation path in the path set is input into the risk identification model, the classification identification result of each incidence relation path can be directly input into the risk identification model as a plurality of features, or one or any combination of the maximum risk probability value, the minimum risk probability value, the average risk probability value and the like of each incidence relation path can be input into the risk identification model as features.
Whatever feature inputs are adopted when the risk identification model is trained must also be adopted when risk identification is performed with the risk identification model; that is, the features used in the training process and in the identification process need to be kept consistent.
In this embodiment, the classification model may adopt a decision tree model, for example an XGBoost (Extreme Gradient Boosting) model.
In a specific training process, the training goal is to make the classification result of the classification model on the risk condition of the entity sample as consistent as possible with the risk condition label in the training data. A loss function can be constructed according to the training target, parameters of the classification model are updated according to values of the loss function in each iteration until an iteration ending condition is reached, and the finally obtained classification model is used as a risk identification model. The iteration ending condition may be that the iteration number reaches a preset iteration number threshold, or that the value of the loss function is less than or equal to a preset loss function threshold, or the like.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
According to an embodiment of another aspect, a risk identification device is provided. Fig. 7 shows a schematic block diagram of a risk identification device according to an embodiment. It is to be appreciated that the apparatus can be implemented by any apparatus, device, platform, and cluster of devices having computing and processing capabilities. As shown in fig. 7, the apparatus 700 includes:
a path generating unit 710 configured to generate at least one incidence relation path using the entity to be identified as an end node according to the incidence relation data, and form a path set of the entity to be identified; the nodes in the association relationship path are entities, and the edges are the association relationship among the entities.
A path identification unit 720, configured to perform risk identification on the association relationship path by using a path identification model obtained through pre-training;
and the risk identification unit 730 is configured to determine a risk identification result of the entity to be identified by using the risk identification result of each incidence relation path in the path set of the entity to be identified.
The path generating unit 710 may be specifically configured to: perform a path search in the association relationship data taking the entity to be identified as a starting point, to obtain the association relationship paths that take the entity to be identified as an end node; or take each entity in the association relationship data in turn as a starting end node, perform a path search according to the association relationships, and acquire, from the association relationship paths obtained, the association relationship paths that take the entity to be identified as an end node.
As one preferred embodiment, the path identifying unit 720 may specifically include: a path encoding subunit 721 and a path identifying subunit 722.
The path encoding subunit 721 is configured to encode the association relationship path, and obtain an encoding sequence of the association relationship path.
When encoding the association relationship path, the path encoding subunit 721 may encode the entity and the association relationship in the association relationship path, respectively; and splicing the coding results of the entities and the coding results of the association relations according to the sequence in the association relation path to obtain the coding results of the association relation path. When the entity is encoded, the path encoding subunit 721 may encode at least one attribute information of the entity and then splice the encoded attribute information according to a preset sequence to obtain an encoding result of the entity.
And the path identifying subunit 722 is configured to input the coding sequence of the incidence relation path into the path identifying model, so as to obtain a risk identification result of the path identifying model on the incidence relation path.
The path identification model can adopt, but is not limited to, a Word2vec-based text classification model, a fastText-based text classification model, or a textCNN-based text classification model.
As a preferred embodiment, the risk identifying unit 730 may be specifically configured to input the risk identification result of each association relationship path in the path set of the entity to be identified and the statistical characteristic of the entity to be identified in the association relationship data into a risk identification model obtained by pre-training, so as to obtain the risk identification result of the entity to be identified.
Wherein, the statistical characteristics of the entity to be identified in the association relation data may include, but are not limited to, at least one of the following:
the number of paths in the path set of the entity to be identified;
the length of the shortest path in the path set of the entity to be identified;
the length of the longest path in the path set of the entity to be identified;
the path set of the entity to be identified comprises the number of entities of a preset type;
the number of entities having the same preset attribute as the entity to be identified;
whether the entity to be identified has a preset relationship characteristic or not;
the proportion of entities to be identified having a particular attribute;
and whether a cycle exists in the incidence relation path of the entity to be identified.
When the risk identification result of each incidence relation path in the path set is input into the risk identification model, the classification identification result of each incidence relation path can be directly input into the risk identification model as a plurality of features, or one or any combination of the maximum risk probability value, the minimum risk probability value, the average risk probability value and the like of each incidence relation path can be input into the risk identification model as features.
According to an embodiment of another aspect, an apparatus for obtaining a risk identification model is also provided. FIG. 8 shows a schematic block diagram of an apparatus for obtaining a risk identification model according to one embodiment. It is to be appreciated that the apparatus can be implemented by any apparatus, device, platform, and cluster of devices having computing and processing capabilities. As shown in fig. 8, the apparatus 800 includes: a first acquisition unit 801, a path recognition unit 802 and a first training unit 803. The main functions of each component unit are as follows:
a first obtaining unit 801 configured to obtain training data, where the training data includes a path set of each entity sample, formed by the association relationship paths that take the entity sample as an end node, and the risk condition label labeled on the entity sample; the nodes in an association relationship path are entities, and the edges are the association relationships among the entities.
A path identification unit 802 configured to perform risk identification on the association relationship path by using a path identification model obtained through pre-training.
The first training unit 803 is configured to train the classification model to obtain a risk identification model by taking a risk identification result of each association relation path in the path set of the entity sample as an input of the classification model and taking a risk condition label as a target output of the classification model.
The first obtaining unit 801 may be specifically configured to use an entity labeled with a risk state label as an entity sample, perform path search from the association relationship data using the entity sample as a starting point, and obtain a path set of the entity sample formed by association relationship paths using the entity sample as an end node, so as to obtain training data; or,
respectively taking each entity in the incidence relation data as an initial end node, and searching a path according to the incidence relation; and selecting entity samples from the end nodes of the searched incidence relation paths, labeling risk state labels on the entity samples, and acquiring a path set of the entity samples formed by the incidence relation paths with the entity samples as end points to obtain training data.
As a preferred embodiment, the path identifying unit 802 may be specifically configured to encode the association relationship path to obtain an encoding sequence of the association relationship path; and inputting the coding sequence of the incidence relation path into the path identification model to obtain a risk identification result of the path identification model to the incidence relation path.
When encoding the association relationship path, the path identifying unit 802 may encode the entity and the association relationship in the association relationship path respectively; and splicing the coding results of the entities and the coding results of the association relations according to the sequence in the association relation path to obtain the coding results of the association relation path. When an entity is encoded, the path identifying unit 802 may encode at least one attribute information of the entity and then splice the encoded attribute information according to a preset sequence to obtain an encoding result of the entity.
As a preferred embodiment, the first training unit 803 may be specifically configured to use the risk identification result of each association relationship path in the path set of the entity sample and the statistical features of the entity sample in the association relationship data as the input of the classification model.
Likewise, the statistical characteristics of the entity sample in the association relationship data may include, but are not limited to, at least one of the following:
the number of paths in a path set of the entity sample;
length of shortest path in path set of entity sample;
the length of the longest path in the path set of the entity sample;
the number of entities of a preset type contained in the path set of the entity sample;
the number of entities having the same preset attribute as the entity sample;
whether the entity sample has a preset relationship characteristic or not;
a proportion of the entity samples having a particular attribute;
whether a cycle exists in the incidence relation path of the entity sample.
When the risk identification result of each incidence relation path in the path set is input into the risk identification model, the classification identification result of each incidence relation path can be directly input into the risk identification model as a plurality of features, or one or any combination of the maximum risk probability value, the minimum risk probability value, the average risk probability value and the like of each incidence relation path can be input into the risk identification model as features.
In this embodiment, the classification model may adopt a decision tree model, such as an XGBoost model.
According to an embodiment of another aspect, an apparatus for obtaining a path recognition model is provided. FIG. 9 shows a schematic block diagram of an apparatus for obtaining a path recognition model according to one embodiment. It is to be appreciated that the apparatus can be implemented by any apparatus, device, platform, and cluster of devices having computing and processing capabilities. As shown in fig. 9, the apparatus 900 includes: a second acquisition unit 901, a path coding unit 902 and a second training unit 903. The main functions of each component unit are as follows:
a second obtaining unit 901 configured to obtain training data, where the training data includes association relation paths and risk condition labels labeled to the association relation paths; the nodes in the association relationship path are entities, and the edges are the association relationship among the entities.
As an implementation manner, the second obtaining unit 901 may take each entity in the association relationship data as a starting end node and perform a path search along the association relationships to obtain the association relationship paths. It then selects, from these paths, association relationship paths representing various risk conditions and labels them with risk condition labels.
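The path search described above could, for example, be a bounded depth-first traversal. The sketch below is illustrative only: the adjacency-list layout, the `max_edges` limit and the function names are assumptions, and the specification does not say which search algorithm is used. For simplicity the sketch never revisits an entity, so it terminates on cyclic data but does not emit paths that contain a cycle.

```python
from typing import Dict, List, Set, Tuple

# Assumed layout: entity -> list of (relation, neighbouring entity).
Graph = Dict[str, List[Tuple[str, str]]]


def search_paths(graph: Graph, start: str, max_edges: int = 3) -> List[List[str]]:
    """Enumerate alternating entity/relation paths that begin at `start`."""
    results: List[List[str]] = []

    def dfs(path: List[str], visited: Set[str]) -> None:
        if len(path) > 1:
            results.append(list(path))  # every non-trivial prefix is a path
        if len(path) // 2 >= max_edges:
            return  # edge budget exhausted
        for relation, neighbour in graph.get(path[-1], []):
            if neighbour in visited:
                continue  # skip revisits so the search terminates
            dfs(path + [relation, neighbour], visited | {neighbour})

    dfs([start], {start})
    return results


def all_paths(graph: Graph, max_edges: int = 3) -> List[List[str]]:
    """Run the search with every entity in the data as the starting node."""
    return [p for entity in graph for p in search_paths(graph, entity, max_edges)]
```

Labeling the selected paths with risk condition labels is a separate, typically manual or rule-based, step that the sketch does not cover.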
A path encoding unit 902 configured to encode the association relationship path to obtain an encoding sequence of the association relationship path.
And a second training unit 903, configured to train the classification model to obtain the path recognition model by taking the coding sequence of the association relationship path as an input of the classification model and taking the risk condition label of the association relationship path as a target output of the classification model.
As a preferred embodiment, the path encoding unit 902 may be specifically configured to encode the entity and the association relationship in the association relationship path respectively; and splicing the coding results of the entities and the coding results of the association relations according to the sequence in the association relation path to obtain the coding results of the association relation path.
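The splicing step described above can be illustrated with the following sketch. It assumes a path stored as an alternating sequence of entities and relations, and it uses simple vocabulary lookups that produce prefixed tokens; the token format, the `build_vocab` helper and the handling of unknown items are all hypothetical, since the specification does not fix a particular encoding.

```python
from typing import Dict, Iterable, List, Sequence


def build_vocab(items: Iterable[str]) -> Dict[str, int]:
    """Assign each distinct item an id starting at 1 (0 means unknown)."""
    vocab: Dict[str, int] = {}
    for item in items:
        vocab.setdefault(item, len(vocab) + 1)
    return vocab


def encode_path(
    path: Sequence[str],
    entity_vocab: Dict[str, int],
    relation_vocab: Dict[str, int],
) -> List[str]:
    """Encode entities and relations separately, then splice in path order."""
    encoded: List[str] = []
    for position, item in enumerate(path):
        if position % 2 == 0:
            # Even positions hold entities.
            encoded.append(f"E{entity_vocab.get(item, 0)}")
        else:
            # Odd positions hold association relationships.
            encoded.append(f"R{relation_vocab.get(item, 0)}")
    return encoded
```

Joining the resulting tokens with spaces yields a sentence-like string, which is one way such a sequence could be passed to a text classification model.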
The text classification model employed therein may include, but is not limited to, a Word2vec-based text classification model, a fastText-based text classification model, or a textCNN-based text classification model.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 1, 3, 4 and 6.
According to an embodiment of still another aspect, there is also provided a computing device including a memory and a processor, the memory having stored therein executable code, the processor, when executing the executable code, implementing the method in conjunction with fig. 1, fig. 3, fig. 4 and fig. 6.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments further describe in detail the objects, technical solutions and advantages of the present invention. It should be understood that they are only exemplary embodiments of the present invention and are not intended to limit its scope; any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.
Claims (25)
1. A risk identification method, comprising:
generating at least one incidence relation path taking an entity to be identified as an end node according to incidence relation data to form a path set of the entity to be identified; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;
carrying out risk identification on the incidence relation path by utilizing a path identification model obtained by pre-training;
and determining a risk identification result of the entity to be identified by using the risk identification result of each incidence relation path in the path set of the entity to be identified.
2. The method of claim 1, wherein the generating at least one incidence relation path with the entity to be identified as an end node according to the incidence relation data comprises:
taking the entity to be identified as a starting point to perform path search from the incidence relation data to obtain an incidence relation path taking the entity to be identified as an end node; or,
respectively taking each entity in the incidence relation data as an initial end node, and searching a path according to the incidence relation; and acquiring the incidence relation path with the entity to be identified as a starting point from the obtained incidence relation paths.
3. The method of claim 1, wherein the performing risk identification on the incidence relation path by using a pre-trained path identification model comprises:
coding the incidence relation path to obtain a coding sequence of the incidence relation path;
and inputting the coding sequence of the incidence relation path into the path identification model to obtain a risk identification result of the path identification model for the incidence relation path.
4. The method according to claim 3, wherein encoding the incidence relation path to obtain the encoding sequence of the incidence relation path comprises:
respectively encoding the entity and the incidence relation in the incidence relation path;
and splicing the coding results of the entities and the coding results of the association relations according to the sequence in the association relation path to obtain the coding results of the association relation path.
5. The method of claim 1, wherein determining the risk identification result for the entity to be identified by using the risk identification result of each incidence relation path in the path set of the entity to be identified comprises:
and inputting the risk identification result of each incidence relation path in the path set of the entity to be identified and the statistical characteristics of the entity to be identified in the incidence relation data into a risk identification model obtained by pre-training to obtain the risk identification result of the entity to be identified.
6. The method of claim 5, wherein the statistical features of the entity to be identified in the association relationship data comprise at least one of:
the length of the shortest path in the path set of the entity to be identified;
the length of the longest path in the path set of the entity to be identified;
the number of entities of a preset type contained in the path set of the entity to be identified;
the number of entities with the same preset attribute as the entity to be identified;
whether the entity to be identified has a preset relationship characteristic or not;
the proportion of the entities to be identified having the specific attribute;
and whether a cycle exists in the incidence relation path of the entity to be identified.
7. The method according to any one of claims 1 to 6, wherein the association relationship data comprises equity structure data; the entity to be identified is an enterprise or a natural person; and the association relationship is a controlling shareholding relationship.
8. A method of obtaining a risk identification model, comprising:
acquiring training data, wherein the training data comprises a path set of an entity sample formed by incidence relation paths taking the entity sample as end nodes and a risk condition label labeled on the entity sample; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;
carrying out risk identification on the incidence relation path by utilizing a path identification model obtained by pre-training;
and taking the risk identification result of each incidence relation path in the path set of the entity sample as the input of a classification model, taking the risk condition label as the target output of the classification model, and training the classification model to obtain a risk identification model.
9. The method of claim 8, wherein the acquiring training data comprises:
taking an entity marked with a risk state label as an entity sample, and taking the entity sample as a starting point to perform path search from the incidence relation data to obtain a path set of the entity sample formed by incidence relation paths taking the entity sample as an end node so as to obtain the training data; or,
respectively taking each entity in the incidence relation data as an initial end node, and searching a path according to the incidence relation; selecting entity samples from the end nodes of the obtained incidence relation paths, labeling risk state labels on the entity samples, and obtaining a path set of the entity samples formed by the incidence relation paths taking the entity samples as end points to obtain the training data.
10. The method of claim 8, wherein the risk identification of the incidence relation path by using the pre-trained path identification model comprises:
coding the incidence relation path to obtain a coding sequence of the incidence relation path;
and inputting the coding sequence of the incidence relation path into the path identification model to obtain a risk identification result of the path identification model for the incidence relation path.
11. The method of claim 8, wherein using the risk identification result of each incidence relation path in the path set of the entity sample as an input of a classification model comprises:
and taking the risk identification result of each incidence relation path in the path set of the entity sample and the statistical characteristics of the entity sample in the incidence relation data as the input of the classification model.
12. The method of claim 11, wherein the statistical features of the entity sample in the association relationship data comprise at least one of:
a length of a shortest path in a set of paths of the entity sample;
a length of a longest path in the set of paths of the entity sample;
the number of entities of a preset type contained in the path set of the entity sample;
the number of entities with the same preset attribute as the entity sample;
whether the entity sample has a preset relationship characteristic or not;
the proportion of the entity sample having a particular attribute;
whether a cycle exists in the incidence relation path of the entity sample.
13. A method of obtaining a path recognition model, comprising:
acquiring training data, wherein the training data comprises an incidence relation path and a risk condition label labeled on the incidence relation path; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;
coding the incidence relation path to obtain a coding sequence of the incidence relation path;
and taking the coding sequence of the incidence relation path as the input of a classification model, taking the risk condition label of the incidence relation path as the target output of the classification model, and training the classification model to obtain the path recognition model.
14. The method according to claim 13, wherein encoding the incidence relation path to obtain the encoding sequence of the incidence relation path comprises:
respectively encoding the entity and the incidence relation in the incidence relation path;
and splicing the coding results of the entities and the coding results of the association relations according to the sequence in the association relation path to obtain the coding results of the association relation path.
15. A risk identification device comprising:
the path generation unit is configured to generate at least one incidence relation path with an entity to be identified as an end node according to the incidence relation data to form a path set of the entity to be identified; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;
the path identification unit is configured to perform risk identification on the incidence relation path by using a path identification model obtained through pre-training;
and the risk identification unit is configured to determine a risk identification result of the entity to be identified by using the risk identification result of each incidence relation path in the path set of the entity to be identified.
16. The apparatus according to claim 15, wherein the path generation unit is specifically configured to: taking the entity to be identified as a starting point to perform path search from the incidence relation data to obtain an incidence relation path taking the entity to be identified as an end node; or,
respectively taking each entity in the incidence relation data as an initial end node, and searching a path according to the incidence relation; and acquiring the incidence relation path with the entity to be identified as a starting point from the obtained incidence relation paths.
17. The apparatus of claim 15, wherein the path identifying unit comprises:
the path coding subunit is configured to code the incidence relation path to obtain a coding sequence of the incidence relation path;
and the path identification subunit is configured to input the coding sequence of the incidence relation path into the path identification model to obtain a risk identification result of the path identification model on the incidence relation path.
18. The apparatus according to claim 15, wherein the risk identification unit is specifically configured to input a risk identification result of each association relationship path in the path set of the entity to be identified and a statistical feature of the entity to be identified in the association relationship data into a risk identification model obtained through pre-training, so as to obtain a risk identification result of the entity to be identified.
19. An apparatus for obtaining a risk identification model, comprising:
a first obtaining unit configured to obtain training data, wherein the training data comprises a path set of an entity sample formed by incidence relation paths taking the entity sample as end nodes, and a risk condition label labeled on the entity sample; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;
the path identification unit is configured to perform risk identification on the incidence relation path by using a path identification model obtained through pre-training;
and the first training unit is configured to take a risk identification result of each incidence relation path in the path set of the entity sample as an input of a classification model, take the risk condition label as a target output of the classification model, and train the classification model to obtain a risk identification model.
20. The apparatus according to claim 19, wherein the first obtaining unit is specifically configured to use an entity labeled with a risk state label as an entity sample, and perform a path search from the association relationship data using the entity sample as a starting point to obtain a path set of the entity sample formed by association relationship paths taking the entity sample as an end node, so as to obtain the training data; or,
respectively taking each entity in the incidence relation data as an initial end node, and searching a path according to the incidence relation; selecting entity samples from the end nodes of the obtained incidence relation paths, labeling risk state labels on the entity samples, and obtaining a path set of the entity samples formed by the incidence relation paths taking the entity samples as end points to obtain the training data.
21. The apparatus according to claim 19, wherein the path identifying unit is specifically configured to encode the association relationship path to obtain an encoding sequence of the association relationship path; and inputting the coding sequence of the incidence relation path into the path identification model to obtain a risk identification result of the path identification model for the incidence relation path.
22. The apparatus according to claim 19, wherein the first training unit is specifically configured to use a risk identification result of each association relationship path in the path set of the entity sample and a statistical feature of the entity sample in the association relationship data as inputs of the classification model.
23. Apparatus for obtaining a path recognition model, comprising:
a second obtaining unit configured to obtain training data, the training data including an incidence relation path and a risk condition label labeled to the incidence relation path; wherein, the nodes in the incidence relation path are entities, and the edges are incidence relations among the entities;
the path coding unit is configured to code the incidence relation path to obtain a coding sequence of the incidence relation path;
and the second training unit is configured to take the coding sequence of the incidence relation path as the input of a classification model, take the risk condition label of the incidence relation path as the target output of the classification model, and train the classification model to obtain the path recognition model.
24. The apparatus according to claim 23, wherein the path encoding unit is specifically configured to encode the entity and the association relationship in the association relationship path respectively; and splicing the coding results of the entities and the coding results of the association relations according to the sequence in the association relation path to obtain the coding results of the association relation path.
25. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that, when executed by the processor, performs the method of any of claims 1-14.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110493573.2A CN113222610B (en) | 2021-05-07 | 2021-05-07 | Risk identification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113222610A (en) | 2021-08-06 |
CN113222610B (en) | 2022-08-23 |
Family ID: 77091240
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110493573.2A Active CN113222610B (en) | 2021-05-07 | 2021-05-07 | Risk identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113222610B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109003089A (en) * | 2018-06-28 | 2018-12-14 | 中国工商银行股份有限公司 | risk identification method and device |
CN109523153A (en) * | 2018-11-12 | 2019-03-26 | 平安科技(深圳)有限公司 | Acquisition methods, device, computer equipment and the storage medium of illegal fund collection enterprise |
CN110570111A (en) * | 2019-08-30 | 2019-12-13 | 阿里巴巴集团控股有限公司 | Enterprise risk prediction method, model training method, device and equipment |
CN111476508A (en) * | 2020-05-15 | 2020-07-31 | 支付宝(杭州)信息技术有限公司 | Risk identification method and system for target operation |
CN112463981A (en) * | 2020-11-26 | 2021-03-09 | 福建正孚软件有限公司 | Enterprise internal operation management risk identification and extraction method and system based on deep learning |
Non-Patent Citations (2)
Title |
---|
吕华揆等: "金融股权知识图谱构建与应用", 《数据分析与知识发现》 * |
王金凤等: "基于"路径――目标"权变理论的全面风险管理案例研究――一个煤炭企业的调查", 《审计研究》 * |
Also Published As
Publication number | Publication date |
---|---|
CN113222610B (en) | 2022-08-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||