
CN112328709B - Entity labeling method and device, server and storage medium

Info

Publication number: CN112328709B
Application number: CN202011301554.7A
Authority: CN
Inventors: 黄佳洋; 丘宇彬; 陈枫; 徐维黛; 朱易文; 陈清财; 李东方; 付冠宇
Original assignee: Shenzhen Turing Robot Co ltd
Current assignee: Shenzhen Turing Robot Co ltd
Priority date: 2020-11-19
Filing date: 2020-11-19
Publication date: 2024-07-09
Anticipated expiration: 2040-11-19
Also published as: CN112328709A

Abstract

The embodiment of the application discloses an entity labeling method and device, a server and a storage medium, wherein the entity labeling method comprises the following steps: acquiring a target entity, wherein the target entity comprises a target entity name and target entity content; determining the entity type of the target entity according to the name and the content of the target entity; determining the position information of at least one candidate entity content in a preset document according to the entity type and the target entity content; determining nearest neighbor entity names of each candidate entity content in a preset document according to the position information of at least one candidate entity content, and calculating confidence between every two entity names in a set to be clustered; clustering each entity name in the to-be-clustered set according to the confidence between every two entity names to obtain a first clustered group; and determining the position information of the candidate entity contents corresponding to the nearest entity names in the first cluster group as the labeling result of the target entity contents. By adopting the method and the device, the entity labeling quality can be improved.

Description

Entity labeling method and device, server and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a method and apparatus for labeling entities, a server, and a storage medium.

Background

At present, as the financial market has matured in recent years, more and more people participate in investment, production and other activities, generating massive volumes of financial texts such as contracts and specifications. Such financial texts often contain important information related to production interactions, such as names, amounts, dates, and contact information. In actual business, to facilitate recording and statistics, finance companies often define categories of important information, and then store the important information found in unstructured data in a structured (entity name-entity value) form according to those categories. Because the structured data does not contain the position information of the entity, entities are usually labeled manually in the prior art, which incurs high labor cost and yields low labeling quality.

Disclosure of Invention

The embodiment of the application provides an entity labeling method and device, a server and a storage medium, so as to improve entity labeling quality.

In a first aspect, an embodiment of the present application provides an entity labeling method, including:

acquiring a target entity, wherein the target entity comprises a target entity name and target entity content;

Determining the entity type of the target entity according to the name and the content of the target entity;

determining the position information of at least one candidate entity content in a preset document according to the entity type and the target entity content;

Determining nearest neighbor entity names of each candidate entity content in a preset document according to the position information of at least one candidate entity content, and calculating confidence between every two entity names in a to-be-clustered set, wherein the to-be-clustered set comprises the nearest neighbor entity name and a target entity name of each candidate entity content;

Clustering each entity name in the to-be-clustered set according to the confidence coefficient between every two entity names to obtain a first clustered group, wherein the first clustered group comprises a target entity name and at least one nearest neighbor entity name;

determining the position information of the candidate entity content corresponding to each nearest neighbor entity name in the first cluster group as the labeling result of the target entity content.

Optionally, determining the location information of at least one candidate entity content in the preset document according to the entity type and the target entity content includes:

if the entity type is the text type, calculating the difference degree between the target entity content and each entity content in the preset document to obtain a plurality of difference degree values;

And determining the position information of at least one entity content corresponding to the difference value smaller than the preset difference threshold value in the difference values as the position information of at least one candidate entity content.

Optionally, determining the location information of at least one candidate entity content in the preset document according to the entity type and the target entity content, and further includes:

if the entity type is a non-text type, converting the target entity content according to the non-text type to obtain target entity conversion content;

and determining the position information of at least one entity content consistent with the target entity content and the target entity conversion content in the preset document as the position information of at least one candidate entity content.

Optionally, the at least one candidate entity content includes a first candidate entity content, and the location information includes a location start value and a location end value;

determining the nearest neighbor entity name of each candidate entity content in a preset document according to the position information of at least one candidate entity content, wherein the method comprises the following steps:

Determining a median between a position start value and a position end value of the first candidate entity content as an absolute position of the first candidate entity content;

And determining the entity name with the smallest distance between the absolute position in the preset document and the absolute position of the first candidate entity content as the nearest neighbor entity name of the first candidate entity content, and further obtaining the nearest neighbor entity name of each candidate entity content.

Optionally, the at least one nearest neighbor entity name comprises a first nearest neighbor entity name;

calculating the confidence coefficient between every two entity names in the set to be clustered comprises the following steps:

acquiring position information of a plurality of entity names consistent with the first nearest neighbor entity name in a preset document;

determining the nearest neighbor entity name of each entity name according to the position information of each entity name to obtain a nearest neighbor entity name set;

And determining the occurrence probability of the target entity name in the nearest entity name set as the confidence coefficient between the first nearest entity name and the target entity name, and further obtaining the confidence coefficient between every two entity names.

Optionally, clustering each entity name in the set to be clustered according to the confidence between every two entity names to obtain a first cluster group, including:

Determining a confidence vector of each entity name according to the confidence between every two entity names;

traversing the distance between the confidence coefficient vector of each entity name and each initial cluster center, and distributing the confidence coefficient vector of each entity name to the cluster group corresponding to the initial cluster center with the smallest distance, so as to obtain n initial cluster groups;

Calculating the distance between the cluster center of each initial cluster group and the initial cluster center of each initial cluster group, and dividing the set to be clustered into n cluster groups when the distance meets the convergence condition;

and determining a cluster group comprising the target entity name from the n cluster groups as a first cluster group.

Optionally, determining the entity type of the target entity according to the target entity name and the target entity content includes:

Determining at least one preset entity type containing target entity content in the entity content range of the plurality of preset entity types as at least one candidate entity type;

And determining the candidate entity type corresponding to the keyword set containing the target entity name in the keyword set of the at least one candidate entity type as the entity type of the target entity.

In a second aspect, an embodiment of the present application provides an entity labeling device, including:

The target entity acquisition module is used for acquiring a target entity, wherein the target entity comprises a target entity name and target entity content;

the entity type determining module is used for determining the entity type of the target entity according to the target entity name and the target entity content;

The position information determining module is used for determining the position information of at least one candidate entity content in a preset document according to the entity type and the target entity content;

The determining and calculating module is used for determining the nearest neighbor entity name of each candidate entity content in a preset document according to the position information of at least one candidate entity content, calculating the confidence coefficient between every two entity names in a to-be-clustered set, wherein the to-be-clustered set comprises the nearest neighbor entity name and the target entity name of each candidate entity content;

the cluster group determining module is used for clustering each entity name in the to-be-clustered set according to the confidence coefficient between every two entity names to obtain a first cluster group, wherein the first cluster group comprises a target entity name and at least one nearest neighbor entity name;

and the labeling result determining module is used for determining the position information of the candidate entity content corresponding to each nearest entity name in the first cluster group as the labeling result of the target entity content.

Optionally, the location information determining module includes:

the difference degree calculating unit is used for calculating the difference degree between the target entity content and each entity content in the preset document to obtain a plurality of difference degree values if the entity type is the text type;

The first position determining unit is used for determining the position information of at least one entity content corresponding to the difference value smaller than the preset difference threshold value in the difference values as the position information of at least one candidate entity content.

Optionally, the location information determining module further includes:

The content conversion unit is used for converting the target entity content according to the non-text type if the entity type is the non-text type so as to obtain target entity conversion content;

And the second position determining unit is used for determining the position information of at least one entity content consistent with the target entity content and the target entity conversion content in the preset document as the position information of at least one candidate entity content.

The determining and calculating module includes a nearest neighbor name determining unit.

The nearest neighbor name determining unit is used for determining a median between a position start value and a position end value of the first candidate entity content as an absolute position of the first candidate entity content.

The determining and calculating module further includes a confidence calculating unit.

The confidence calculating unit is used for obtaining position information of a plurality of entity names consistent with the first nearest neighbor entity name in the preset document.

Optionally, the cluster group determining module includes:

the vector determining unit is used for determining the confidence vector of each entity name according to the confidence coefficient between every two entity names;

The traversal distribution unit is used for traversing the distance between the confidence coefficient vector of each entity name and each initial cluster center, distributing the confidence coefficient vector of each entity name to the cluster group corresponding to the initial cluster center with the smallest distance, and further obtaining n initial cluster groups;

the calculation dividing unit is used for calculating the distance between the cluster center of each initial cluster group and the initial cluster center of each initial cluster group, and dividing the set to be clustered into n cluster groups when the distance meets the convergence condition;

and the cluster group determining unit is used for determining a cluster group comprising the target entity name from the n cluster groups as a first cluster group.

Optionally, the entity type determining module is configured to determine at least one preset entity type including the target entity content in the entity content ranges of the plurality of preset entity types as at least one candidate entity type; and determine the candidate entity type corresponding to the keyword set containing the target entity name in the keyword set of the at least one candidate entity type as the entity type of the target entity.

In a third aspect, a server is provided for an embodiment of the present application, where the server includes a processor, a memory, and a transceiver, where the processor, the memory, and the transceiver are connected to each other, and the memory is configured to store a computer program that supports an electronic device to execute the entity labeling method described above, where the computer program includes program instructions; the processor is configured to invoke program instructions to perform the entity labeling method as in one aspect of the embodiments of the present application described above.

In a fourth aspect, a storage medium is provided for an embodiment of the present application, where the storage medium stores a computer program, and the computer program includes program instructions; the program instructions, when executed by a processor, cause the processor to perform a method of labeling entities as in an aspect of embodiments of the application.

In the embodiment of the application, a target entity is acquired, wherein the target entity comprises a target entity name and target entity content; determining the entity type of the target entity according to the name and the content of the target entity; determining the position information of at least one candidate entity content in a preset document according to the entity type and the target entity content; determining nearest neighbor entity names of each candidate entity content in a preset document according to the position information of at least one candidate entity content, and calculating confidence between every two entity names in a set to be clustered; clustering each entity name in the to-be-clustered set according to the confidence between every two entity names to obtain a first clustered group; and determining the position information of the candidate entity contents corresponding to the nearest entity names in the first cluster group as the labeling result of the target entity contents. By adopting the method and the device, the entity labeling quality can be improved.

Drawings

In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of an entity labeling method according to an embodiment of the present application;

FIG. 2 is a schematic flow chart of an entity labeling method according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of an entity labeling device according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a server according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

Fig. 1 is a schematic flow chart of an entity labeling method according to an embodiment of the present application. As shown in fig. 1, this embodiment of the method includes the steps of:

S101, acquiring a target entity, wherein the target entity comprises a target entity name and target entity content.

In some possible embodiments, the entity labeling device obtains the target entity from the structured database according to the target entity name and the target entity content, where the structured database may be understood as a database obtained by the finance company storing the important information in the unstructured data in a structured (e.g. entity name-entity value) manner according to the predefined important information category.

S102, determining the entity type of the target entity according to the name and the content of the target entity.

In some possible embodiments, determining at least one preset entity type including the target entity content in the entity content range of the plurality of preset entity types as at least one candidate entity type; and determining the candidate entity type corresponding to the keyword set containing the target entity name in the keyword set of the at least one candidate entity type as the entity type of the target entity.

The preset entity types may include an amount type, a name type, a number type, a ratio type, a serial number type, a letter type, a currency type, a date type, and a short string type. An amount type entity generally refers to larger values such as asset scale or turnover, for example 15000; a name type entity may include textual entities such as company names and personal names, for example "the state asset company, inc"; a number type entity may include a mobile phone number; a ratio type entity may include numeric entities such as ratios and percentages, for example 0.132; a serial number type entity may be a mixed entity of digits and English letters, for example 0U34K8; a currency type entity may be a monetary entity with units, for example 10 yuan; a date type entity may be an entity that includes a time unit of day, month or year, for example 12 months; a short string type entity may be a text entity of length 1 that represents option information, for example "no" or "female".

For example, assume that the entity content range of the amount type is any number greater than 500, and its keyword set is {sales, liabilities, turnover, …}; the entity content range of the number type is any natural number with 11 digits, and its keyword set is {mobile phone number, …}; the entity content range of the letter type is text consisting entirely of English letters, and its keyword set is {rating information, rating, …}. The entity "turnover 15000000000" includes the entity content "15000000000" and the entity name "turnover". The entity content ranges of both the amount type and the number type include the entity content "15000000000", so the amount type and the number type are determined as candidate entity types. Because the entity name "turnover" is included in the keyword set {sales, liabilities, turnover, …} of the amount type, the amount type is determined as the entity type of the entity "turnover 15000000000".
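The two-stage selection described above (filter by entity content range, then pick by keyword set) can be sketched as follows. This is a minimal illustration only: the `EntityType` class, the `PRESET_TYPES` table and the function names are hypothetical, and the ranges and keyword sets simply mirror the three example types from the text rather than any complete configuration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass
class EntityType:
    name: str
    in_range: Callable[[str], bool]  # hypothetical test for the entity content range
    keywords: FrozenSet[str]         # hypothetical keyword set of entity names


def _is_number(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


# Assumed preset types, taken from the worked example only.
PRESET_TYPES = [
    EntityType("amount", lambda s: _is_number(s) and float(s) > 500,
               frozenset({"sales", "liabilities", "turnover"})),
    EntityType("number", lambda s: s.isdigit() and len(s) == 11,
               frozenset({"mobile phone number"})),
    EntityType("letter", lambda s: s.isascii() and s.isalpha(),
               frozenset({"rating information", "rating"})),
]


def determine_entity_type(entity_name: str, entity_content: str) -> Optional[str]:
    # Stage 1: every preset type whose content range contains the content is a candidate.
    candidates = [t for t in PRESET_TYPES if t.in_range(entity_content)]
    # Stage 2: the candidate whose keyword set contains the entity name wins.
    for entity_type in candidates:
        if entity_name in entity_type.keywords:
            return entity_type.name
    return None  # no candidate type lists this entity name
```

With these assumed tables, `determine_entity_type("turnover", "15000000000")` selects the amount type even though the content also falls inside the number type's range.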

S103, determining the position information of at least one candidate entity content in the preset document according to the entity type and the target entity content.

In some possible embodiments, the entity labeling device determines the location information of at least one candidate entity content of the target entity content in the preset document in different manners according to different entity types. Here, entity types can be largely classified into text types and non-text types. For a specific implementation of this step, please refer to the description of the following embodiments.

S104, determining nearest neighbor entity names of each candidate entity content in a preset document according to the position information of at least one candidate entity content, and calculating the confidence between every two entity names in the to-be-clustered set.

The at least one candidate entity content comprises a first candidate entity content, and the set to be clustered comprises a nearest neighbor entity name and a target entity name of each candidate entity content.

In some possible implementations, a median between the position start value and the position end value of the first candidate entity content is determined as the absolute position of the first candidate entity content; and determining the entity name with the smallest distance between the absolute position in the preset document and the absolute position of the first candidate entity content as the nearest neighbor entity name of the first candidate entity content, and further obtaining the nearest neighbor entity name of each candidate entity content.

For example, assume that the Word document contains the text "… … the performance of company A in 2019 is eye-catching, the annual growth rate is 13 billion, and the annual growth rate is as high as 30%", in which the position start value and the position end value of the candidate entity content "13 billion" are 10 and 11, respectively. The absolute position of the candidate entity content "13 billion" is then calculated to be 10.5. The entity names immediately before and after the candidate entity content "13 billion" have absolute positions of 6.5 and 16, respectively. Because the distance from 6.5 to 10.5 (that is, 4) is smaller than the distance from 16 to 10.5 (that is, 5.5), the entity name at absolute position 6.5 is determined as the nearest neighbor entity name of the candidate entity content "13 billion".
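The midpoint rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Span` tuple and both function names are hypothetical, and the start and end values of the two entity names are invented so that their midpoints match the 6.5 and 16 of the example.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    text: str
    start: float  # position start value
    end: float    # position end value


def absolute_position(span: Span) -> float:
    # The absolute position is the median (midpoint) of start and end.
    return (span.start + span.end) / 2


def nearest_neighbor_name(candidate: Span, names: List[Span]) -> Span:
    # Pick the entity name whose absolute position is closest to the candidate's.
    position = absolute_position(candidate)
    return min(names, key=lambda name: abs(absolute_position(name) - position))


candidate = Span("13 billion", 10, 11)        # absolute position 10.5
names = [Span("name before", 5, 8),            # absolute position 6.5 (assumed span)
         Span("name after", 14, 18)]           # absolute position 16 (assumed span)
nearest = nearest_neighbor_name(candidate, names)
```

Here `nearest` is the span at absolute position 6.5, matching the outcome of the example.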

And then, the entity labeling device calculates the confidence coefficient between every two entity names in the to-be-clustered set, wherein at least one nearest neighbor entity name in the to-be-clustered set comprises a first nearest neighbor entity name.

In some possible embodiments, acquiring position information of a plurality of entity names consistent with the first nearest neighbor entity name in a preset document; determining the nearest neighbor entity name of each entity name according to the position information of each entity name to obtain a nearest neighbor entity name set; and determining the occurrence probability of the target entity name in the nearest entity name set as the confidence coefficient between the first nearest entity name and the target entity name, and further obtaining the confidence coefficient between every two entity names.

Specifically, the confidence between the entity name a and the entity name b can be calculated as follows: obtaining position information of a plurality of entity names consistent with the entity name a in a preset document, determining nearest neighbor entity names of each entity name according to the position information of each entity name, forming a nearest neighbor entity name set A according to the nearest neighbor entity names of each entity name, and if the entity name b appears k times in the nearest neighbor entity name set A and the nearest neighbor entity name set A shares N nearest neighbor entity names, the appearance probability of the entity name b in the nearest neighbor entity name set A is k/N, namely the confidence degree between the entity name a and the entity name b is k/N. And calculating the confidence coefficient between every two entity names in the to-be-clustered set according to the mode.
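The k/N calculation can be sketched as follows. The function name and the tuple layout are hypothetical, and one point is an assumption not stated in the text: the nearest neighbor of each occurrence of entity name a is searched among all other entity name occurrences in the document, excluding that occurrence itself.

```python
from typing import List, Tuple

NameSpan = Tuple[str, float, float]  # (entity name, position start, position end)


def confidence(name_a: str, name_b: str, name_spans: List[NameSpan]) -> float:
    """Return k/N: how often name_b is the nearest neighbor of an occurrence of name_a."""
    nearest_names = []  # the nearest neighbor entity name set A
    for index, (name, start, end) in enumerate(name_spans):
        if name != name_a:
            continue
        position = (start + end) / 2
        # Assumption: the occurrence itself is not its own neighbor.
        others = [span for j, span in enumerate(name_spans) if j != index]
        if not others:
            continue
        nearest = min(others, key=lambda s: abs((s[1] + s[2]) / 2 - position))
        nearest_names.append(nearest[0])
    if not nearest_names:
        return 0.0  # name_a never occurs, so no evidence is available
    return nearest_names.count(name_b) / len(nearest_names)


# Invented occurrences, for illustration only.
spans = [("turnover", 0, 2), ("revenue", 3, 5),
         ("turnover", 20, 22), ("growth", 23, 25),
         ("turnover", 40, 42), ("revenue", 44, 46)]
```

In this invented document "turnover" occurs three times and its nearest neighbor is "revenue" in two of them, so the confidence between "turnover" and "revenue" is 2/3.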

S105, clustering each entity name in the to-be-clustered set according to the confidence between every two entity names to obtain a first clustered group.

The first cluster group comprises a target entity name and at least one nearest neighbor entity name.

In some possible implementations, a confidence vector for each entity name is determined based on the confidence between the entity names; traversing the distance between the confidence coefficient vector of each entity name and each initial cluster center, and distributing the confidence coefficient vector of each entity name to the cluster group corresponding to the initial cluster center with the smallest distance, so as to obtain n initial cluster groups; calculating the distance between the cluster center of each initial cluster group and the initial cluster center of each initial cluster group, and dividing the set to be clustered into n cluster groups when the distance meets the convergence condition; and determining a cluster group comprising the target entity name from the n cluster groups as a first cluster group.

Specifically, the entity labeling device can obtain a confidence matrix according to the confidence between every two entity names. Illustratively, the confidence matrix is as follows:

$$
\begin{pmatrix}
P_{11} & P_{12} & \cdots & P_{1n} \\
P_{21} & P_{22} & \cdots & P_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
P_{n1} & P_{n2} & \cdots & P_{nn}
\end{pmatrix}
$$

Wherein, P₁₁ is the confidence between entity name A1 and entity name A1, P₁₂ is the confidence between entity name A1 and entity name A2, and P_nn is the confidence between entity name An and entity name An.

The confidence vector of each entity name may be obtained according to the confidence matrix. For example, the confidence vector of the entity name A1 may be (P₁₁, P₁₂, ..., P_1n) or (P₁₁, P₂₁, ..., P_n1)^T, which is not limited in the present application.

Specifically, the entity labeling device clusters the confidence vectors of the entity names in the to-be-clustered set by using a k-means algorithm. The implementation process is as follows: 1) randomly select the confidence vectors of n entity names from the confidence vectors of the plurality of entity names as the initial cluster centers of n cluster groups, that is, the initial cluster centers of cluster groups C₁, C₂, ..., C_n are Q₁, Q₂, ..., Q_n respectively; 2) traverse the distances (such as the Euclidean distance) between the confidence vectors of the remaining entity names and each of Q₁, Q₂, ..., Q_n and compare them; if the confidence vector of the entity name A1 is closest to Q₁, assign the entity name A1 to cluster group C₁, and complete the assignment of every entity name in this way; 3) recalculate the cluster centers of cluster groups C₁, C₂, ..., C_n. Steps 2) and 3) are repeated until the distance between each recalculated cluster center and the cluster center it replaces is smaller than a preset threshold, at which point the k-means algorithm has reached the convergence condition, clustering ends, and n cluster groups are obtained. Then, the cluster group in which the target entity name is located among the n cluster groups is determined as the first cluster group.
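The three steps above can be sketched as a small k-means loop over the rows of the confidence matrix. This is a sketch under stated assumptions, not the implementation of the application: the function names, the `init_indices`, `tol`, `max_iter` and `seed` parameters, and the example matrix values are all hypothetical, and for simplicity every vector (not only the remaining ones) is reassigned in each pass.

```python
import math
import random
from typing import List, Optional, Sequence


def kmeans(vectors: Sequence[Sequence[float]], n: int,
           init_indices: Optional[List[int]] = None,
           tol: float = 1e-6, max_iter: int = 100, seed: int = 0) -> List[int]:
    """Return one cluster label per vector."""
    if init_indices is None:
        # Step 1: randomly pick n confidence vectors as initial cluster centers.
        init_indices = random.Random(seed).sample(range(len(vectors)), n)
    centers = [list(vectors[i]) for i in init_indices]
    labels = [0] * len(vectors)
    for _ in range(max_iter):
        # Step 2: assign each vector to the center with the smallest Euclidean distance.
        labels = [min(range(n), key=lambda c: math.dist(v, centers[c])) for v in vectors]
        # Step 3: recompute each center as the mean of its members.
        new_centers = []
        for c in range(n):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center as is
        shift = max(math.dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # convergence condition: centers moved less than the threshold
            break
    return labels


def first_cluster_group(names: List[str], confidence_matrix, target_name: str,
                        n: int, **kmeans_options) -> List[str]:
    labels = kmeans(confidence_matrix, n, **kmeans_options)
    target_label = labels[names.index(target_name)]
    return [name for name, label in zip(names, labels) if label == target_label]


# Invented entity names and confidence values, for illustration only.
names = ["turnover", "sales", "revenue", "phone", "mobile"]
matrix = [[1.0, 0.6, 0.5, 0.0, 0.0],
          [0.6, 1.0, 0.5, 0.0, 0.0],
          [0.5, 0.5, 1.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 1.0, 0.8],
          [0.0, 0.0, 0.0, 0.8, 1.0]]
group = first_cluster_group(names, matrix, "turnover", 2, init_indices=[0, 3])
```

Passing `init_indices` fixes the initial centers so the run is reproducible; leaving it out falls back to the random selection that step 1) describes.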

S106, determining the position information of the candidate entity contents corresponding to the nearest entity names in the first cluster group as the labeling result of the target entity contents.

Wherein the first cluster group includes a target entity name and a nearest neighbor entity name of at least one candidate entity content.

For example, assume that the target entity content is "1 million" and that the first cluster group includes the target entity name "business" and the nearest neighbor entity name "business" of the candidate entity content "1 million". The position information of the candidate entity content "1 million" corresponding to that nearest neighbor entity name, namely the position start value 21 and the position end value 22, is determined as the labeling result of the target entity content "1 million".
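The final selection step can be sketched as a simple filter. The dictionary keys and the function name are hypothetical, and the second candidate (positions 57 and 58, nearest name "registered capital") is invented purely to show a candidate being rejected.

```python
from typing import Dict, Iterable, List, Tuple


def labeling_result(candidates: List[Dict], first_cluster: Iterable[str]) -> List[Tuple[int, int]]:
    """Keep the positions of candidates whose nearest neighbor name is in the first cluster group."""
    cluster = set(first_cluster)
    return [(c["start"], c["end"]) for c in candidates if c["nearest_name"] in cluster]


candidates = [
    {"content": "1 million", "start": 21, "end": 22, "nearest_name": "business"},
    # Invented second occurrence whose nearest neighbor name is outside the cluster.
    {"content": "1 million", "start": 57, "end": 58, "nearest_name": "registered capital"},
]
result = labeling_result(candidates, ["business"])
```

Only the occurrence at positions 21 to 22 survives, which corresponds to the labeling result in the example.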

In the embodiment of the application, the entity labeling device determines the entity type of the target entity according to the target entity name and the target entity content, determines the position information of at least one candidate entity content in a preset document according to the entity type and the target entity content, determines the nearest neighbor entity name of each candidate entity content in the preset document according to the position information of at least one candidate entity content, calculates the confidence coefficient between every two entity names in the to-be-clustered set, clusters each entity name in the to-be-clustered set according to the confidence coefficient between every two entity names to obtain a first clustered group comprising the target entity name and at least one nearest neighbor entity name, and determines the position information of the candidate entity content corresponding to each nearest neighbor entity name in the first clustered group as the labeling result of the target entity content, thereby improving the entity labeling quality and the labeling efficiency.

Fig. 2 is a schematic flow chart of an entity labeling method according to an embodiment of the present application. As shown in fig. 2, this method embodiment includes the steps of:

S201, acquiring a target entity, wherein the target entity comprises a target entity name and target entity content.

S202, determining the entity type of the target entity according to the name and the content of the target entity.

Here, the specific implementation manner of step S201 to step S202 may refer to the description of step S101 to step S102 in the corresponding embodiment of fig. 1, which is not repeated here.

S203, if the entity type is the text type, determining the position information of at least one candidate entity content according to the difference degree between the target entity content and each entity content in the preset document.

In some possible embodiments, if the entity type is a text type, calculating a difference degree between the target entity content and each entity content in the preset document to obtain a plurality of difference degree values, and determining location information of at least one entity content corresponding to a difference degree value smaller than a preset difference degree threshold value in the plurality of difference degree values as location information of at least one candidate entity content.

Here, the text type may include the name type and the short string type, and the preset document may be a document in a readable text format such as Word or TXT obtained by converting a PDF document. The degree of difference between two entity contents may be the edit distance between them. The preset difference threshold can be adjusted according to the length of the entity content. For example, if the length of the entity content is less than or equal to 3, the preset difference threshold may be set to 0; if the length is greater than 3 and less than or equal to 8, the preset difference threshold may be set to 1; if the length is greater than 8 and less than 12, the preset difference threshold may be set to 2; if the length is greater than or equal to 12 and less than or equal to 16, the preset difference threshold may be set to 3. Preset difference thresholds for entity contents of other lengths can be obtained in the same manner. The location information may include a position start value and a position end value.

For example, the entity labeling device calculates that an entity content in the Word document differs from the target entity content "thai resultant force share limited company" by only one character, that is, the difference degree value (edit distance) between the two is 1. Since the length of the target entity content "thai resultant force share limited company" is 10, the preset difference degree threshold is 2, and the difference degree value 1 is smaller than this threshold, so the entity content is determined as a candidate entity content of the target entity content in the Word document. Since the first character and the last character of the candidate entity content are the 30th character and the 39th character in the Word document, respectively, the location start value and the location end value of the candidate entity content are 30 and 39, respectively.
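The text-type matching of step S203 can be sketched as follows. This is a minimal illustration rather than the patented implementation. The function names, the same-length sliding window, the cap of 3 for lengths above 16, and the acceptance of exact matches (which the strict "smaller than" rule would otherwise exclude when the threshold is 0) are all assumptions made for the sketch.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def difference_threshold(length: int) -> int:
    """Length-dependent threshold following the example ranges in the text."""
    if length <= 3:
        return 0
    if length <= 8:
        return 1
    if length < 12:
        return 2
    # The text stops at length 16; capping longer content at 3 is an assumption.
    return 3


def find_candidates(document: str, target: str) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) character positions of candidates.

    Assumption: only windows with the same length as the target are compared.
    """
    n = len(target)
    threshold = difference_threshold(n)
    hits = []
    for i in range(len(document) - n + 1):
        d = edit_distance(document[i:i + n], target)
        # Text: difference must be smaller than the threshold.
        # Assumption: an exact match (d == 0) is always accepted.
        if d == 0 or d < threshold:
            hits.append((i + 1, i + n))
    return hits


# Hypothetical data mirroring the example: a 10-character target, one character off.
document = "x" * 29 + "ABCDEFGHIX" + "yyy"
print(find_candidates(document, "ABCDEFGHIJ"))
```

With this hypothetical document the only window within the threshold starts at the 30th character and ends at the 39th, matching the location start and end values in the example above.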

S204, if the entity type is a non-text type, determining the position information of at least one candidate entity content in the preset document according to the target entity content and the conversion content of the target entity content.

In some possible embodiments, if the entity type is a non-text type, the entity labeling device converts the target entity content according to the non-text type to obtain target entity conversion content, and determines the location information of at least one entity content in the preset document that is consistent with the target entity content and the target entity conversion content as the location information of at least one candidate entity content.

Here, the non-text type may include an amount type, a ratio type, a date type, and a currency type. For target entity content whose entity type is the amount type, the target entity content can be converted into different numerical units (such as ten, hundred, thousand, and the like) and into Chinese capital numerals; for target entity content whose entity type is the ratio type, the target entity content can be converted into forms such as the Chinese expression "… percent", the percentage form, and the ratio units "BP" or "BPS"; for target entity content whose entity type is the date type, the target entity content can be converted into forms such as the fixed date expressions "xxxx/xx/xx", "xxxx.xx.xx", or "xxxx year xx month xx day", Chinese capital numerals, and the like; for target entity content whose entity type is the currency type, the target entity content can be converted into the form "pure numeric string + currency unit". For example, see Table 1, which lists the conversion styles of the non-text types.

TABLE 1
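Independently of Table 1, the conversion and exact-match lookup of step S204 can be sketched as follows for the date type and the ratio type. This is a minimal illustration under stated assumptions. The function names are hypothetical, the Chinese renderings ("年/月/日", "百分之") are assumed forms of the "year, month, day" and "… percent" expressions, and the ratio is assumed to be supplied as a decimal fraction. The sketch relies on the standard definition that one basis point (BP) equals 0.01%.

```python
from datetime import date
from decimal import Decimal
from typing import List, Tuple


def date_variants(d: date) -> List[str]:
    """Fixed date expressions named in the text (Chinese form is an assumption)."""
    return [
        f"{d.year:04d}/{d.month:02d}/{d.day:02d}",
        f"{d.year:04d}.{d.month:02d}.{d.day:02d}",
        f"{d.year:04d}年{d.month:02d}月{d.day:02d}日",
    ]


def ratio_variants(fraction: Decimal) -> List[str]:
    """Percentage and basis-point renderings of a ratio given as a fraction."""
    percent = (fraction * 100).normalize()
    bp = (fraction * 10000).normalize()  # 1 BP = 0.01%
    return [f"{percent:f}%", f"百分之{percent:f}", f"{bp:f}BP", f"{bp:f}BPS"]


def find_exact(document: str, variants: List[str]) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) positions of every exact occurrence."""
    hits = []
    for variant in variants:
        start = document.find(variant)
        while start != -1:
            hits.append((start + 1, start + len(variant)))
            start = document.find(variant, start + 1)
    return sorted(hits)


# Hypothetical document text and target values.
document = "Signed on 2020.11.19 at 150BP."
variants = date_variants(date(2020, 11, 19)) + ratio_variants(Decimal("0.015"))
for start, end in find_exact(document, variants):
    print(start, end, document[start - 1:end])
```

Every converted form is searched verbatim, so a date written as "2020.11.19" is found even if the target entity content was supplied as "2020/11/19".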

S205, determining nearest neighbor entity names of each candidate entity content in a preset document according to the position information of at least one candidate entity content, and calculating the confidence between every two entity names in the to-be-clustered set.

S206, clustering each entity name in the to-be-clustered set according to the confidence between every two entity names to obtain a first clustered group.

S207, determining the position information of the candidate entity content corresponding to each nearest neighbor entity name in the first cluster group as the labeling result of the target entity content.

Here, the specific implementation manner of step S205 to step S207 may refer to the description of step S104 to step S106 in the corresponding embodiment of fig. 1, which is not repeated here.

In the embodiment of the application, the entity labeling device determines the position information of at least one candidate entity content in a preset document according to the entity type of the target entity and the target entity content, and then determines the nearest neighbor entity name of each candidate entity content. The device clusters the nearest neighbor entity names of the candidate entity contents together with the target entity name according to the confidence vectors of the entity names to obtain a first cluster group containing the target entity name, and determines the position information of the candidate entity contents corresponding to all nearest neighbor entity names in the first cluster group as the labeling result of the target entity content. This improves entity labeling quality and labeling efficiency and reduces labor cost.

Fig. 3 is a schematic structural diagram of an entity labeling device according to an embodiment of the present application. As shown in fig. 3, the entity labeling device 3 includes a target entity acquisition module 31, an entity type determination module 32, a location information determination module 33, a determination calculation module 34, a cluster determination module 35, and a labeling result determination module 36.

A target entity obtaining module 31, configured to obtain a target entity, where the target entity includes a target entity name and target entity content;

an entity type determining module 32, configured to determine an entity type of the target entity according to the target entity name and the target entity content;

a location information determining module 33, configured to determine location information of at least one candidate entity content in a preset document according to the entity type and the target entity content;

a determining and calculating module 34, configured to determine, according to the location information of at least one candidate entity content, a nearest neighbor entity name of each candidate entity content in a preset document, and to calculate a confidence between every two entity names in a to-be-clustered set, where the to-be-clustered set includes the nearest neighbor entity name of each candidate entity content and the target entity name;

The cluster group determining module 35 is configured to cluster each entity name in the to-be-clustered set according to the confidence level between every two entity names to obtain a first cluster group, where the first cluster group includes a target entity name and at least one nearest neighbor entity name;

the labeling result determining module 36 is configured to determine, as a labeling result of the target entity content, location information of candidate entity content corresponding to each nearest neighbor entity name in the first cluster group.

Optionally, the location information determining module 33 includes:

The difference calculating unit 331 is configured to calculate a difference between the target entity content and each entity content in the preset document to obtain a plurality of difference values if the entity type is a text type;

The first location determining unit 332 is configured to determine location information of at least one entity content corresponding to a difference value less than a preset difference threshold value among the plurality of difference values as location information of at least one candidate entity content.

Optionally, the location information determining module 33 further includes:

A content conversion unit 333, configured to convert the target entity content according to the non-text type to obtain target entity conversion content if the entity type is the non-text type;

The second location determining unit 334 is configured to determine location information of at least one entity content consistent with the target entity content and the target entity conversion content in the preset document as location information of at least one candidate entity content.

The determination calculation module 34 includes: nearest neighbor name determination unit 341.

A nearest neighbor name determining unit 341, configured to determine a median between a position start value and a position end value of the first candidate entity content as an absolute position of the first candidate entity content;

the determination calculation module 34 includes: the confidence calculation unit 342.

A confidence calculating unit 342, configured to obtain location information of a plurality of entity names consistent with the first nearest neighbor entity name in the preset document;
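The absolute position used by the nearest neighbor name determining unit 341 can be sketched as follows. The midpoint computation follows the text. The selection rule (choosing the entity name whose own midpoint is closest to the candidate's midpoint) and all names in the code are assumptions made for illustration, since the remaining sub-steps are not reproduced in this section.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (location start value, location end value)


def absolute_position(span: Span) -> float:
    """Median (midpoint) of the location start value and location end value."""
    start, end = span
    return (start + end) / 2


def nearest_neighbor_name(candidate: Span, names: Dict[str, List[Span]]) -> str:
    """Pick the entity name closest to the candidate.

    Assumption: closeness is the absolute difference between midpoints.
    """
    center = absolute_position(candidate)
    best_name, best_dist = None, float("inf")
    for name, spans in names.items():
        for span in spans:
            dist = abs(absolute_position(span) - center)
            if dist < best_dist:
                best_name, best_dist = name, dist
    if best_name is None:
        raise ValueError("no entity names were supplied")
    return best_name


# Hypothetical entity names and their positions in the preset document.
names = {"Party A": [(20, 26)], "Party B": [(100, 106)]}
print(absolute_position((30, 39)))             # 34.5
print(nearest_neighbor_name((30, 39), names))  # Party A
```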

Optionally, the cluster group determining module 35 includes:

A vector determining unit 351, configured to determine a confidence vector of each entity name according to the confidence between every two entity names;

The traversal allocation unit 352 is configured to traverse the distance between the confidence vector of each entity name and each initial cluster center, allocate the confidence vector of each entity name to the cluster group corresponding to the initial cluster center with the smallest distance, and further obtain n initial cluster groups;

A calculation dividing unit 353, configured to calculate a distance between a cluster center of each initial cluster group and an initial cluster center of each initial cluster group, and divide the set to be clustered into n cluster groups when the distance meets a convergence condition;

The cluster group determining unit 354 is configured to determine a cluster group including the target entity name from among the n cluster groups as a first cluster group.
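The clustering performed by units 351 to 354 resembles a k-means procedure and can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation. The confidence vector of an entity name is taken to be its row in the pairwise confidence matrix, the distance is Euclidean, the initial cluster centers are chosen by a farthest-first rule, and the convergence condition is a small center shift. None of these choices is specified in this section, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def initial_centers(vectors: Sequence[Sequence[float]], n: int) -> List[List[float]]:
    """Assumption: first vector, then repeatedly the vector farthest from chosen centers."""
    centers = [list(vectors[0])]
    while len(centers) < n:
        far = max(vectors, key=lambda v: min(math.dist(v, c) for c in centers))
        centers.append(list(far))
    return centers


def kmeans_labels(vectors: Sequence[Sequence[float]], n: int,
                  tol: float = 1e-9, max_iter: int = 100) -> List[int]:
    """Assign each confidence vector to one of n cluster groups."""
    centers = initial_centers(vectors, n)
    labels: List[int] = []
    for _ in range(max_iter):
        # Traverse distances and assign each vector to its nearest center.
        labels = [min(range(n), key=lambda k: math.dist(v, centers[k])) for v in vectors]
        new_centers = []
        for k in range(n):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[k])  # keep the center of an empty group
        # Convergence condition (assumed): centers have effectively stopped moving.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return labels


def first_cluster_group(names: List[str], confidence: List[List[float]],
                        target: str, n: int) -> List[str]:
    """Return the cluster group that contains the target entity name."""
    labels = kmeans_labels(confidence, n)
    target_label = labels[names.index(target)]
    return [name for name, label in zip(names, labels) if label == target_label]


# Hypothetical pairwise confidences; row i is the confidence vector of names[i].
names = ["target", "a", "b", "c"]
confidence = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
print(first_cluster_group(names, confidence, "target", n=2))  # ['target', 'a']
```

The farthest-first initialization is used here only to keep the sketch deterministic; k-means results generally depend on how the initial cluster centers are chosen.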

Optionally, the entity type determining module 32 is configured to determine at least one preset entity type whose entity content range contains the target entity content, among the entity content ranges of a plurality of preset entity types, as at least one candidate entity type; and to determine the candidate entity type whose keyword set contains the target entity name, among the keyword sets of the at least one candidate entity type, as the entity type of the target entity.
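The two-stage entity type determination described above (filter by entity content range, then select by keyword set) can be sketched as follows. The data structure, the regular expressions standing in for entity content ranges, and the keyword sets are all hypothetical and serve only to illustrate the flow.

```python
import re
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional


@dataclass(frozen=True)
class PresetEntityType:
    name: str
    content_in_range: Callable[[str], bool]  # membership test for the content range
    keywords: FrozenSet[str]                 # keyword set of entity names


def determine_entity_type(target_name: str, target_content: str,
                          presets: List[PresetEntityType]) -> Optional[str]:
    # Stage 1: candidate types are those whose content range contains the content.
    candidates = [p for p in presets if p.content_in_range(target_content)]
    # Stage 2: pick the candidate whose keyword set contains the entity name.
    for preset in candidates:
        if target_name in preset.keywords:
            return preset.name
    return None


# Hypothetical preset types; the patterns and keywords are illustrative only.
PRESETS = [
    PresetEntityType(
        "date",
        lambda s: re.fullmatch(r"\d{4}[/.]\d{2}[/.]\d{2}", s) is not None,
        frozenset({"signing date", "maturity date"}),
    ),
    PresetEntityType(
        "ratio",
        lambda s: re.fullmatch(r"\d+(\.\d+)?%", s) is not None,
        frozenset({"interest rate"}),
    ),
]

print(determine_entity_type("signing date", "2020/11/19", PRESETS))  # date
print(determine_entity_type("interest rate", "1.5%", PRESETS))       # ratio
```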

It will be appreciated that the entity labeling device 3 is configured to implement the steps performed by the entity labeling device in the embodiments of fig. 1 and 2. For the specific implementation manner and corresponding beneficial effects of the functional blocks included in the entity labeling device 3 of fig. 3, reference may be made to the foregoing specific description of the embodiments of fig. 1 and 2, which is not repeated here.

The entity labeling device 3 in the embodiment shown in fig. 3 may be implemented as the server 400 shown in fig. 4. Referring to fig. 4, a schematic structural diagram of a server is provided in an embodiment of the present application. As shown in fig. 4, the server 400 may include: one or more processors 401, a memory 402, and a transceiver 403. The processor 401, the memory 402, and the transceiver 403 are connected by a bus 404. The transceiver 403 is configured to receive or transmit data, and the memory 402 is configured to store a computer program, the computer program including program instructions; the processor 401 is configured to execute the program instructions stored in the memory 402, and perform the following operations:

Optionally, the processor 401 determines, according to the entity type and the target entity content, location information of at least one candidate entity content in a preset document, and specifically performs the following operations:

Optionally, the processor 401 determines, according to the entity type and the target entity content, location information of at least one candidate entity content in a preset document, and specifically further performs the following operations:

The processor 401 determines, in a preset document, a nearest neighbor entity name of each candidate entity content according to the location information of at least one candidate entity content, and specifically performs the following operations:

the processor 401 calculates the confidence between every two entity names in the set to be clustered, and specifically performs the following operations:

Optionally, the processor 401 clusters each entity name in the to-be-clustered set according to the confidence between every two entity names to obtain a first clustered group, and specifically performs the following operations:

Optionally, the processor 401 determines the entity type of the target entity according to the name of the target entity and the content of the target entity, and specifically performs the following operations:

In an embodiment of the present application, a computer storage medium is further provided, which may be used to store computer software instructions used by the entity labeling device in the embodiment shown in fig. 3, where the computer software instructions include a program designed for the entity labeling device to execute in the foregoing embodiments. The storage medium includes, but is not limited to, a flash memory, a hard disk, and a solid state disk.

In an embodiment of the present application, there is further provided a computer program product which, when executed by a computing device, can perform the entity labeling method designed for the entity labeling device in the embodiment shown in fig. 3.

The terms "first," "second," "third," and "fourth" and the like in the description and in the claims and drawings are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.

In the present application, "A and/or B" means one of the following: A; B; or A and B. "At least one of …" refers to any combination of the listed items in any number; for example, "at least one of A, B and C" refers to any one of the following seven cases: A; B; C; A and B; B and C; A and C; or A, B and C.

Those skilled in the art will appreciate that all or part of the methods in the above-described embodiments may be implemented by a computer program stored on a computer readable storage medium, and the program, when executed, may include the steps of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The method and related apparatus provided in the embodiments of the present application are described with reference to the method flowcharts and/or schematic structural diagrams provided in the embodiments of the present application. Each flow and/or block of the method flowcharts and/or schematic structural diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or structural diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in the flowchart flow or flows and/or structural diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or structural diagram block or blocks.

The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims

1. A method for labeling entities, comprising:

determining the entity type of the target entity according to the target entity name and the target entity content;

Determining nearest neighbor entity names of each candidate entity content in the preset document according to the position information of the at least one candidate entity content, and calculating confidence between every two entity names in a set to be clustered, wherein the set to be clustered comprises the nearest neighbor entity names of each candidate entity content and the target entity names;

clustering each entity name in the to-be-clustered set according to the confidence between every two entity names to obtain a first clustered group, wherein the first clustered group comprises the target entity name and at least one nearest neighbor entity name;

and determining the position information of the candidate entity content corresponding to each nearest neighbor entity name in the first cluster group as a labeling result of the target entity content.

2. The method according to claim 1, wherein determining location information of at least one candidate entity content in a preset document according to the entity type and the target entity content comprises:

if the entity type is a text type, calculating the difference degree between the target entity content and each entity content in the preset document to obtain a plurality of difference degree values;

And determining the position information of at least one entity content corresponding to the difference value smaller than a preset difference threshold value in the difference values as the position information of the at least one candidate entity content.

3. The method according to claim 1, wherein determining location information of at least one candidate entity content in a preset document according to the entity type and the target entity content, further comprises:

And determining the position information of at least one entity content consistent with the target entity content and the target entity conversion content in the preset document as the position information of the at least one candidate entity content.

4. The method of claim 1, wherein the at least one candidate entity content comprises a first candidate entity content, and the location information comprises a location start value and a location end value;

The determining, according to the location information of the at least one candidate entity content, the nearest neighbor entity name of each candidate entity content in the preset document includes:

5. The method of claim 1, wherein the at least one nearest neighbor entity name comprises a first nearest neighbor entity name;

the calculating the confidence coefficient between every two entity names in the set to be clustered comprises the following steps:

Acquiring position information of a plurality of entity names consistent with the first nearest neighbor entity name in the preset document;

6. The method of claim 1, wherein clustering each entity name in the set to be clustered to obtain a first cluster group according to the confidence between the entity names comprises:

calculating the distance between the cluster center of each initial cluster group and the initial cluster center of each initial cluster group, and dividing the set to be clustered into n cluster groups when the distance meets a convergence condition;

And determining a cluster group comprising the target entity name from the n cluster groups as the first cluster group.

7. The method according to any one of claims 1-6, wherein said determining an entity type of the target entity from the target entity name and the target entity content comprises:

determining at least one preset entity type containing the target entity content in the entity content range of a plurality of preset entity types as at least one candidate entity type;

8. An entity labeling apparatus, comprising:

The determining and calculating module is used for determining the nearest neighbor entity name of each candidate entity content in the preset document according to the position information of the at least one candidate entity content, and calculating the confidence between every two entity names in a to-be-clustered set, wherein the to-be-clustered set comprises the nearest neighbor entity name of each candidate entity content and the target entity name;

the cluster group determining module is used for clustering each entity name in the to-be-clustered set according to the confidence degree between every two entity names to obtain a first cluster group, wherein the first cluster group comprises the target entity name and at least one nearest neighbor entity name;

9. A server comprising a processor, a memory and a transceiver, the processor, the memory and the transceiver being interconnected, wherein the transceiver is configured to receive or transmit data, the memory is configured to store program code, and the processor is configured to invoke the program code to perform the entity labeling method of any of claims 1-7.

10. A storage medium storing a computer program, the computer program comprising program instructions; the program instructions, when executed by a processor, cause the processor to perform the entity labeling method of any of claims 1-7.