CN112328709B - Entity labeling method and device, server and storage medium - Google Patents

Info

Publication number
CN112328709B
Authority
CN
China
Prior art keywords
entity
content
name
target
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011301554.7A
Other languages
Chinese (zh)
Other versions
CN112328709A (en)
Inventor
黄佳洋
丘宇彬
陈枫
徐维黛
朱易文
陈清财
李东方
付冠宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Turing Robot Co ltd
Original Assignee
Shenzhen Turing Robot Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Turing Robot Co ltd
Priority to CN202011301554.7A
Publication of CN112328709A
Application granted
Publication of CN112328709B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses an entity labeling method and device, a server and a storage medium. The entity labeling method comprises the following steps: acquiring a target entity, wherein the target entity comprises a target entity name and target entity content; determining the entity type of the target entity according to the target entity name and the target entity content; determining the position information of at least one candidate entity content in a preset document according to the entity type and the target entity content; determining the nearest neighbor entity name of each candidate entity content in the preset document according to the position information of the at least one candidate entity content, and calculating the confidence between every two entity names in a to-be-clustered set; clustering the entity names in the to-be-clustered set according to the confidence between every two entity names to obtain a first cluster group; and determining the position information of the candidate entity contents corresponding to the nearest neighbor entity names in the first cluster group as the labeling result of the target entity content. By adopting the method and the device, the entity labeling quality can be improved.

Description

Entity labeling method and device, server and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and apparatus for labeling entities, a server, and a storage medium.
Background
At present, as the financial market has matured in recent years, more and more people participate in investment, production and other activities, which generates massive volumes of financial text such as contracts and specifications. Such financial text often contains important information related to these activities, such as names, amounts, dates, and contact information. In actual business, to facilitate recording and statistics, finance companies often define categories of important information, and then store the important information found in unstructured data in a structured (entity name-entity value) manner according to those categories. Because the structured data does not contain the position information of the entity, entities are usually labeled manually in the prior art, which incurs high labor cost and yields low labeling quality.
Disclosure of Invention
The embodiment of the application provides an entity labeling method and device, a server and a storage medium, so as to improve entity labeling quality.
In a first aspect, an embodiment of the present application provides an entity labeling method, including:
acquiring a target entity, wherein the target entity comprises a target entity name and target entity content;
Determining the entity type of the target entity according to the name and the content of the target entity;
determining the position information of at least one candidate entity content in a preset document according to the entity type and the target entity content;
Determining nearest neighbor entity names of each candidate entity content in a preset document according to the position information of at least one candidate entity content, and calculating confidence between every two entity names in a to-be-clustered set, wherein the to-be-clustered set comprises the nearest neighbor entity name and a target entity name of each candidate entity content;
Clustering each entity name in the to-be-clustered set according to the confidence coefficient between every two entity names to obtain a first clustered group, wherein the first clustered group comprises a target entity name and at least one nearest neighbor entity name;
and determining the position information of the candidate entity contents corresponding to the nearest neighbor entity names in the first cluster group as the labeling result of the target entity content.
Optionally, determining the location information of at least one candidate entity content in the preset document according to the entity type and the target entity content includes:
if the entity type is the text type, calculating the difference degree between the target entity content and each entity content in the preset document to obtain a plurality of difference degree values;
and determining the position information of at least one entity content whose difference value is smaller than a preset difference threshold as the position information of at least one candidate entity content.
Optionally, determining the location information of at least one candidate entity content in the preset document according to the entity type and the target entity content further includes:
if the entity type is a non-text type, converting the target entity content according to the non-text type to obtain target entity conversion content;
and determining the position information of at least one entity content consistent with the target entity content and the target entity conversion content in the preset document as the position information of at least one candidate entity content.
Optionally, the at least one candidate entity content includes a first candidate entity content, and the location information includes a location start value and a location end value;
determining the nearest neighbor entity name of each candidate entity content in a preset document according to the position information of at least one candidate entity content, wherein the method comprises the following steps:
Determining a median between a position start value and a position end value of the first candidate entity content as an absolute position of the first candidate entity content;
and determining the entity name in the preset document whose absolute position has the smallest distance to the absolute position of the first candidate entity content as the nearest neighbor entity name of the first candidate entity content, thereby obtaining the nearest neighbor entity name of each candidate entity content.
Optionally, the at least one nearest neighbor entity name comprises a first nearest neighbor entity name;
calculating the confidence coefficient between every two entity names in the set to be clustered comprises the following steps:
acquiring position information of a plurality of entity names consistent with the first nearest neighbor entity name in a preset document;
determining the nearest neighbor entity name of each entity name according to the position information of each entity name to obtain a nearest neighbor entity name set;
and determining the occurrence probability of the target entity name in the nearest neighbor entity name set as the confidence between the first nearest neighbor entity name and the target entity name, thereby obtaining the confidence between every two entity names.
Optionally, clustering each entity name in the set to be clustered according to the confidence between every two entity names to obtain a first cluster group, including:
Determining a confidence vector of each entity name according to the confidence between every two entity names;
traversing the distance between the confidence coefficient vector of each entity name and each initial cluster center, and distributing the confidence coefficient vector of each entity name to the cluster group corresponding to the initial cluster center with the smallest distance, so as to obtain n initial cluster groups;
Calculating the distance between the cluster center of each initial cluster group and the initial cluster center of each initial cluster group, and dividing the set to be clustered into n cluster groups when the distance meets the convergence condition;
and determining a cluster group comprising the target entity name from the n cluster groups as a first cluster group.
Optionally, determining the entity type of the target entity according to the target entity name and the target entity content includes:
Determining at least one preset entity type containing target entity content in the entity content range of the plurality of preset entity types as at least one candidate entity type;
and determining, among the keyword sets of the at least one candidate entity type, the candidate entity type whose keyword set contains the target entity name as the entity type of the target entity.
In a second aspect, an embodiment of the present application provides an entity labeling device, including:
The target entity acquisition module is used for acquiring a target entity, wherein the target entity comprises a target entity name and target entity content;
the entity type determining module is used for determining the entity type of the target entity according to the target entity name and the target entity content;
The position information determining module is used for determining the position information of at least one candidate entity content in a preset document according to the entity type and the target entity content;
The determining and calculating module is used for determining the nearest neighbor entity name of each candidate entity content in a preset document according to the position information of at least one candidate entity content, calculating the confidence coefficient between every two entity names in a to-be-clustered set, wherein the to-be-clustered set comprises the nearest neighbor entity name and the target entity name of each candidate entity content;
the cluster group determining module is used for clustering each entity name in the to-be-clustered set according to the confidence coefficient between every two entity names to obtain a first cluster group, wherein the first cluster group comprises a target entity name and at least one nearest neighbor entity name;
and the labeling result determining module is used for determining the position information of the candidate entity content corresponding to each nearest neighbor entity name in the first cluster group as the labeling result of the target entity content.
Optionally, the location information determining module includes:
the difference degree calculating unit is used for calculating the difference degree between the target entity content and each entity content in the preset document to obtain a plurality of difference degree values if the entity type is the text type;
The first position determining unit is used for determining the position information of at least one entity content corresponding to the difference value smaller than the preset difference threshold value in the difference values as the position information of at least one candidate entity content.
Optionally, the location information determining module further includes:
The content conversion unit is used for converting the target entity content according to the non-text type if the entity type is the non-text type so as to obtain target entity conversion content;
the second position determining unit is used for determining the position information of at least one entity content consistent with the target entity content and the target entity conversion content in the preset document as the position information of at least one candidate entity content.
Optionally, the at least one candidate entity content includes a first candidate entity content, and the location information includes a location start value and a location end value;
the determining and calculating module includes a nearest neighbor name determining unit;
the nearest neighbor name determining unit is used for determining a median between a position start value and a position end value of the first candidate entity content as an absolute position of the first candidate entity content;
and for determining the entity name in the preset document whose absolute position has the smallest distance to the absolute position of the first candidate entity content as the nearest neighbor entity name of the first candidate entity content, thereby obtaining the nearest neighbor entity name of each candidate entity content.
Optionally, the at least one nearest neighbor entity name comprises a first nearest neighbor entity name;
the determining and calculating module includes a confidence calculating unit;
the confidence calculating unit is used for obtaining position information of a plurality of entity names consistent with the first nearest neighbor entity name in the preset document;
determining the nearest neighbor entity name of each entity name according to the position information of each entity name to obtain a nearest neighbor entity name set;
and determining the occurrence probability of the target entity name in the nearest neighbor entity name set as the confidence between the first nearest neighbor entity name and the target entity name, thereby obtaining the confidence between every two entity names.
Optionally, the cluster group determining module includes:
the vector determining unit is used for determining the confidence vector of each entity name according to the confidence coefficient between every two entity names;
The traversal distribution unit is used for traversing the distance between the confidence coefficient vector of each entity name and each initial cluster center, distributing the confidence coefficient vector of each entity name to the cluster group corresponding to the initial cluster center with the smallest distance, and further obtaining n initial cluster groups;
the calculation dividing unit is used for calculating the distance between the cluster center of each initial cluster group and the initial cluster center of each initial cluster group, and dividing the set to be clustered into n cluster groups when the distance meets the convergence condition;
and the cluster group determining unit is used for determining a cluster group comprising the target entity name from the n cluster groups as a first cluster group.
Optionally, the entity type determining module is configured to determine at least one preset entity type including the target entity content in the entity content ranges of the plurality of preset entity types as at least one candidate entity type; and determine the candidate entity type corresponding to the keyword set containing the target entity name in the keyword set of the at least one candidate entity type as the entity type of the target entity.
In a third aspect, an embodiment of the present application provides a server, where the server includes a processor, a memory, and a transceiver that are connected to each other; the memory is configured to store a computer program that supports an electronic device in executing the entity labeling method described above, where the computer program includes program instructions; and the processor is configured to invoke the program instructions to perform the entity labeling method of the first aspect of the embodiments of the present application.
In a fourth aspect, an embodiment of the present application provides a storage medium, where the storage medium stores a computer program, and the computer program includes program instructions; the program instructions, when executed by a processor, cause the processor to perform the entity labeling method of the first aspect of the embodiments of the present application.
In the embodiment of the application, a target entity is acquired, wherein the target entity comprises a target entity name and target entity content; the entity type of the target entity is determined according to the target entity name and the target entity content; the position information of at least one candidate entity content in a preset document is determined according to the entity type and the target entity content; the nearest neighbor entity name of each candidate entity content in the preset document is determined according to the position information of the at least one candidate entity content, and the confidence between every two entity names in a to-be-clustered set is calculated; the entity names in the to-be-clustered set are clustered according to the confidence between every two entity names to obtain a first cluster group; and the position information of the candidate entity contents corresponding to the nearest neighbor entity names in the first cluster group is determined as the labeling result of the target entity content. By adopting the method and the device, the entity labeling quality can be improved.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of an entity labeling method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of an entity labeling method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an entity labeling device according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Fig. 1 is a schematic flow chart of an entity labeling method according to an embodiment of the present application. As shown in fig. 1, this embodiment of the method includes the steps of:
S101, acquiring a target entity, wherein the target entity comprises a target entity name and target entity content.
In some possible embodiments, the entity labeling device obtains the target entity from the structured database according to the target entity name and the target entity content, where the structured database may be understood as a database obtained by the finance company storing the important information in the unstructured data in a structured (e.g. entity name-entity value) manner according to the predefined important information category.
S102, determining the entity type of the target entity according to the name and the content of the target entity.
In some possible embodiments, the entity labeling device determines, among a plurality of preset entity types, at least one preset entity type whose entity content range contains the target entity content as at least one candidate entity type; and determines, among the keyword sets of the at least one candidate entity type, the candidate entity type whose keyword set contains the target entity name as the entity type of the target entity.
The preset entity types may include an amount type, a name type, a number type, a ratio type, a numbered type, a letter type, a currency type, a date type, and a short string type. An amount type entity generally refers to larger values such as asset scale or turnover, for example 15000; a name type entity may include text entities such as company names and person names, for example "the state asset company, inc"; a number type entity may include a mobile phone number; a ratio type entity may include numeric entities such as ratios and percentages, for example 0.132; a numbered type entity may be an entity that mixes digits and English letters, for example 0U34K8; a currency type entity may be a monetary entity with a unit, for example 10 yuan; a date type entity may be an entity containing a time unit of day, month or year, for example 12 months; a short string type entity may be a text entity of length 1 that represents option information, for example "no" or "female".
For example, assume that the entity content range of the amount type is any number greater than 500, and its keyword set is {sales, liabilities, turnover, ……}; the entity content range of the number type is any natural number with 11 digits, and its keyword set is {mobile phone number, ……}; the entity content range of the letter type is all-English text, and its keyword set is {rating information, rating, ……}. The entity "turnover 15000000000" includes the entity content "15000000000" and the entity name "turnover". The entity content ranges of both the amount type and the number type contain the entity content "15000000000", so the amount type and the number type are determined as candidate entity types. Because the entity name "turnover" is included in the keyword set {sales, liabilities, turnover, ……} of the amount type, the amount type is determined as the entity type of the entity "turnover 15000000000".
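The two-stage selection in this example (filter by entity content range, then by keyword set) can be sketched as follows. This is a minimal illustration only: the `EntityType` structure, the `PRESET_TYPES` table and the function names are hypothetical, and the three presets simply mirror the ranges and keyword sets assumed in the example above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class EntityType:
    name: str
    in_range: Callable[[str], bool]  # True if the content falls in this type's content range
    keywords: Set[str]               # entity names associated with this type


def _is_number_above(content: str, lower: float) -> bool:
    try:
        return float(content) > lower
    except ValueError:
        return False


# Hypothetical presets mirroring the example: amount, number and letter types.
PRESET_TYPES: List[EntityType] = [
    EntityType("amount", lambda c: _is_number_above(c, 500),
               {"sales", "liabilities", "turnover"}),
    EntityType("number", lambda c: c.isdigit() and len(c) == 11,
               {"mobile phone number"}),
    EntityType("letter", lambda c: c.isascii() and c.isalpha(),
               {"rating information", "rating"}),
]


def determine_entity_type(name: str, content: str,
                          presets: List[EntityType] = PRESET_TYPES) -> Optional[str]:
    # Stage 1: candidate types are those whose content range contains the content.
    candidates = [t for t in presets if t.in_range(content)]
    # Stage 2: among the candidates, pick the type whose keyword set contains the name.
    for t in candidates:
        if name in t.keywords:
            return t.name
    return None  # no preset type matches both the content and the name
```

With these presets, `determine_entity_type("turnover", "15000000000")` selects the amount type even though the content also falls within the number type's range, because only the amount type's keyword set contains "turnover".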
S103, determining the position information of at least one candidate entity content in the preset document according to the entity type and the target entity content.
In some possible embodiments, the entity labeling device determines the location information of at least one candidate entity content of the target entity content in the preset document in different manners according to different entity types. Here, entity types can be largely classified into text types and non-text types. For a specific implementation of this step, please refer to the description of the following embodiments.
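As a rough illustration of the two branches, the sketch below treats a text type entity by thresholding a difference degree and a non-text type entity by exact matching of the content and its converted form. Several points are assumptions rather than part of this description: the difference degree is taken to be one minus the `difflib` similarity ratio, positions are character offsets with an exclusive end, and `with_thousands_separators` is a hypothetical conversion for numeric content.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive (an assumption)


def difference_degree(a: str, b: str) -> float:
    """Assumed difference measure: 1 - similarity ratio, a value in [0, 1]."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def text_candidates(target: str, contents: List[Tuple[str, Span]],
                    threshold: float) -> List[Span]:
    """Text type: keep entity contents whose difference from the target is below the threshold."""
    return [span for text, span in contents
            if difference_degree(target, text) < threshold]


def find_all(document: str, needle: str) -> List[Span]:
    """All occurrences of `needle` in `document` as (start, end) spans."""
    spans, start = [], document.find(needle)
    while start != -1:
        spans.append((start, start + len(needle)))
        start = document.find(needle, start + 1)
    return spans


def non_text_candidates(target: str, document: str,
                        convert: Callable[[str], str]) -> List[Span]:
    """Non-text type: exact matches of the target content or of its converted form."""
    forms = {target, convert(target)}
    return sorted({span for form in forms for span in find_all(document, form)})


def with_thousands_separators(number: str) -> str:
    """Hypothetical conversion for numeric content, e.g. '15000' -> '15,000'."""
    return f"{int(number):,}"
```

For instance, `non_text_candidates("15000", doc, with_thousands_separators)` returns the spans of both "15000" and "15,000" in `doc`, so that a value written in either form becomes a candidate.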
S104, determining nearest neighbor entity names of each candidate entity content in a preset document according to the position information of at least one candidate entity content, and calculating the confidence between every two entity names in the to-be-clustered set.
The at least one candidate entity content comprises a first candidate entity content, and the set to be clustered comprises a nearest neighbor entity name and a target entity name of each candidate entity content.
In some possible implementations, the median between the position start value and the position end value of the first candidate entity content, that is, their midpoint, is determined as the absolute position of the first candidate entity content; and the entity name in the preset document whose absolute position has the smallest distance to the absolute position of the first candidate entity content is determined as the nearest neighbor entity name of the first candidate entity content, thereby obtaining the nearest neighbor entity name of each candidate entity content.
For example, assume that a Word document contains the text "…… company A's performance in 2019 was eye-catching, …… 13 billion, and the annual growth rate is as high as 30% ……", in which the start position value and the end position value of the candidate entity content "13 billion" are 10 and 11, respectively. The absolute position of the candidate entity content "13 billion" is then (10 + 11) / 2 = 10.5. If the entity names adjacent to the candidate entity content "13 billion" have absolute positions 6.5 and 16, respectively, their distances to 10.5 are 4 and 5.5, so the entity name at absolute position 6.5 is determined as the nearest neighbor entity name of the candidate entity content "13 billion".
Then, the entity labeling device calculates the confidence between every two entity names in the to-be-clustered set, wherein the at least one nearest neighbor entity name in the to-be-clustered set comprises a first nearest neighbor entity name.
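The midpoint and nearest neighbor computation can be sketched as below. The `Mention` tuple layout and the function names are illustrative assumptions; the description only fixes that the absolute position is the midpoint of the start and end values and that the entity name at the smallest distance is chosen.

```python
from typing import List, Tuple

# (entity name, position start value, position end value); a hypothetical layout
Mention = Tuple[str, float, float]


def absolute_position(start: float, end: float) -> float:
    """Midpoint between the position start value and the position end value."""
    return (start + end) / 2


def nearest_neighbor_name(candidate_start: float, candidate_end: float,
                          name_mentions: List[Mention]) -> str:
    """Entity name whose absolute position is closest to the candidate's absolute position."""
    pos = absolute_position(candidate_start, candidate_end)
    closest = min(name_mentions,
                  key=lambda m: abs(absolute_position(m[1], m[2]) - pos))
    return closest[0]
```

Using the positions from the example, a candidate at (10, 11) has absolute position 10.5; of two entity names at absolute positions 6.5 and 16, the one at 6.5 is returned because its distance 4 is smaller than 5.5.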
In some possible embodiments, the entity labeling device acquires the position information of a plurality of entity names in the preset document that are consistent with the first nearest neighbor entity name; determines the nearest neighbor entity name of each of these entity names according to its position information to obtain a nearest neighbor entity name set; and determines the occurrence probability of the target entity name in the nearest neighbor entity name set as the confidence between the first nearest neighbor entity name and the target entity name, thereby obtaining the confidence between every two entity names.
Specifically, the confidence between entity name a and entity name b can be calculated as follows: obtain the position information of the plurality of entity names in the preset document that are consistent with entity name a; determine the nearest neighbor entity name of each of these entity names according to its position information; and form a nearest neighbor entity name set A from these nearest neighbor entity names. If entity name b appears k times in the nearest neighbor entity name set A, and the set A contains N nearest neighbor entity names in total, the occurrence probability of entity name b in the set A is k/N, that is, the confidence between entity name a and entity name b is k/N. The confidence between every two entity names in the to-be-clustered set is calculated in this manner.
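A minimal sketch of the k/N computation follows. It assumes that every entity name occurrence in the preset document is available as a (name, absolute position) pair, and that the nearest neighbor of an occurrence is searched among the other occurrences (the occurrence itself is excluded); both points, like the function names, are assumptions made for illustration.

```python
from typing import List, Tuple

Mention = Tuple[str, float]  # (entity name, absolute position); a hypothetical layout


def nearest_other_name(index: int, mentions: List[Mention]) -> str:
    """Name of the mention closest to mentions[index], excluding that mention itself."""
    _, pos = mentions[index]
    others = [m for i, m in enumerate(mentions) if i != index]
    return min(others, key=lambda m: abs(m[1] - pos))[0]


def confidence(a: str, b: str, mentions: List[Mention]) -> float:
    """k/N: share of the occurrences of `a` whose nearest neighbor entity name is `b`."""
    if len(mentions) < 2:
        return 0.0  # no neighbor can exist
    # Nearest neighbor entity name set A, one entry per occurrence of `a`.
    neighbor_set = [nearest_other_name(i, mentions)
                    for i, (name, _) in enumerate(mentions) if name == a]
    if not neighbor_set:
        return 0.0  # `a` does not occur in the document
    return neighbor_set.count(b) / len(neighbor_set)


def confidence_matrix(names: List[str], mentions: List[Mention]) -> List[List[float]]:
    """Row i, column j holds the confidence between names[i] and names[j]."""
    return [[confidence(a, b, mentions) for b in names] for a in names]
```

For example, if entity name a occurs three times and its nearest neighbors are b, c and b, then N = 3, k = 2 for b, and the confidence between a and b is 2/3.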
S105, clustering each entity name in the to-be-clustered set according to the confidence between every two entity names to obtain a first clustered group.
The first cluster group comprises a target entity name and at least one nearest neighbor entity name.
In some possible implementations, a confidence vector for each entity name is determined based on the confidence between the entity names; traversing the distance between the confidence coefficient vector of each entity name and each initial cluster center, and distributing the confidence coefficient vector of each entity name to the cluster group corresponding to the initial cluster center with the smallest distance, so as to obtain n initial cluster groups; calculating the distance between the cluster center of each initial cluster group and the initial cluster center of each initial cluster group, and dividing the set to be clustered into n cluster groups when the distance meets the convergence condition; and determining a cluster group comprising the target entity name from the n cluster groups as a first cluster group.
Specifically, the entity labeling device can obtain a confidence matrix according to the confidence between every two entity names. Illustratively, the confidence matrix is as follows:
$$
\begin{pmatrix}
P_{11} & P_{12} & \cdots & P_{1n} \\
P_{21} & P_{22} & \cdots & P_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
P_{n1} & P_{n2} & \cdots & P_{nn}
\end{pmatrix}
$$
Wherein, $P_{11}$ is the confidence between entity name A1 and entity name A1, $P_{12}$ is the confidence between entity name A1 and entity name A2, and $P_{nn}$ is the confidence between entity name An and entity name An.
The confidence vector of each entity name may be obtained according to the confidence matrix. For example, the confidence vector of the entity name A1 may be $(P_{11}, P_{12}, \ldots, P_{1n})$ or $(P_{11}, P_{21}, \ldots, P_{n1})^{T}$, which is not limited in the present application.
Specifically, the entity labeling device clusters the confidence vectors of the entity names in the to-be-clustered set by using the k-means algorithm. The implementation process is as follows: 1) randomly select the confidence vectors of n entity names from the confidence vectors of the plurality of entity names as the initial cluster centers of n cluster groups, that is, the initial cluster centers of cluster groups $C_1, C_2, \ldots, C_n$ are $Q_1, Q_2, \ldots, Q_n$, respectively; 2) traverse the distance (for example, the Euclidean distance) between the confidence vector of each remaining entity name and each of $Q_1, Q_2, \ldots, Q_n$, and compare the distances; if the confidence vector of entity name A1 is closest to $Q_1$, assign entity name A1 to cluster group $C_1$, and complete the assignment of every entity name in this manner; 3) recalculate the cluster centers of cluster groups $C_1, C_2, \ldots, C_n$, and repeat steps 2) and 3) until the distance between each recalculated cluster center and the cluster center it replaces (the initial cluster center in the first iteration) is smaller than a preset threshold, at which point the k-means algorithm has reached the convergence condition, clustering ends, and n cluster groups are obtained. Then, the cluster group in which the target entity name is located among the n cluster groups is determined as the first cluster group.
S106, determining the position information of the candidate entity contents corresponding to the nearest neighbor entity names in the first cluster group as the labeling result of the target entity content.
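The clustering procedure can be sketched in plain Python as below. This is an illustrative implementation under stated assumptions, not the exact one used by the entity labeling device: the seeded random generator, the `tol` and `max_iter` parameters, the handling of empty groups, and the function names are all choices made for the sketch, and every entity name (including those chosen as initial centers) goes through the assignment step.

```python
import math
import random
from typing import Dict, List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def mean_vector(vectors: List[Sequence[float]]) -> List[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(vectors: Dict[str, Sequence[float]], n: int, tol: float = 1e-6,
           max_iter: int = 100, seed: int = 0) -> List[List[str]]:
    """Cluster entity names by their confidence vectors into n cluster groups."""
    names = list(vectors)
    rng = random.Random(seed)
    # 1) Pick the confidence vectors of n entity names as the initial cluster centers.
    centers = [list(vectors[name]) for name in rng.sample(names, n)]
    groups: List[List[str]] = []
    for _ in range(max_iter):
        # 2) Assign each entity name to the cluster group with the nearest center.
        groups = [[] for _ in range(n)]
        for name in names:
            nearest = min(range(n), key=lambda i: euclidean(vectors[name], centers[i]))
            groups[nearest].append(name)
        # 3) Recompute the centers; an empty group keeps its previous center.
        new_centers = [mean_vector([vectors[m] for m in g]) if g else centers[i]
                       for i, g in enumerate(groups)]
        shift = max(euclidean(c, nc) for c, nc in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # convergence condition: the centers have stopped moving
            break
    return groups


def first_cluster_group(groups: List[List[str]], target_name: str) -> List[str]:
    """The cluster group that contains the target entity name."""
    return next(g for g in groups if target_name in g)
```

Calling `first_cluster_group(kmeans(vectors, n), target_name)` then yields the entity names clustered together with the target entity name. Because the initial centers are chosen at random, different seeds may give different groupings when the confidence vectors are not well separated.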
Wherein the first cluster group includes a target entity name and a nearest neighbor entity name of at least one candidate entity content.
For example, assume that the target entity content is "1 million" and that the first cluster group includes the target entity name "business" and the nearest neighbor entity name "business" of the candidate entity content "1 million". The position information of that candidate entity content, namely the start position value 21 and the end position value 22, is then determined as the labeling result of the target entity content "1 million".
In the embodiment of the application, the entity labeling device determines the entity type of the target entity according to the target entity name and the target entity content, and determines the position information of at least one candidate entity content in a preset document according to the entity type and the target entity content. It then determines the nearest neighbor entity name of each candidate entity content in the preset document according to that position information, calculates the confidence between every two entity names in the to-be-clustered set, and clusters the entity names in the to-be-clustered set according to these confidences to obtain a first cluster group comprising the target entity name and at least one nearest neighbor entity name. Finally, it determines the position information of the candidate entity content corresponding to each nearest neighbor entity name in the first cluster group as the labeling result of the target entity content, thereby improving the entity labeling quality and the labeling efficiency.
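Step S106 can be illustrated with a short Python sketch. The record layout (dictionaries with `start`, `end` and `nearest_name` keys), the function name `labeling_result`, and the entity names "revenue", "turnover" and "date" are hypothetical and chosen only for illustration; the position values 21 and 22 are taken from the example above.

```python
def labeling_result(candidates, first_cluster_group):
    """Keep the positions of candidates whose nearest neighbor name is in the first cluster group."""
    return [
        (c["start"], c["end"])
        for c in candidates
        if c["nearest_name"] in first_cluster_group
    ]


# Hypothetical data: two candidate occurrences of the same target entity content.
candidates = [
    {"start": 21, "end": 22, "nearest_name": "turnover"},
    {"start": 80, "end": 81, "nearest_name": "date"},
]
first_group = {"revenue", "turnover"}  # target entity name plus one nearest neighbor name
positions = labeling_result(candidates, first_group)
```

Only the first candidate is kept here, because its nearest neighbor entity name belongs to the same cluster group as the target entity name.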
Fig. 2 is a schematic flow chart of an entity labeling method according to an embodiment of the present application. As shown in fig. 2, this method embodiment includes the steps of:
S201, acquiring a target entity, wherein the target entity comprises a target entity name and target entity content.
S202, determining the entity type of the target entity according to the name and the content of the target entity.
Here, the specific implementation manner of step S201 to step S202 may refer to the description of step S101 to step S102 in the corresponding embodiment of fig. 1, which is not repeated here.
S203, if the entity type is the text type, determining the position information of at least one candidate entity content according to the difference degree between the target entity content and each entity content in the preset document.
In some possible embodiments, if the entity type is a text type, calculating a difference degree between the target entity content and each entity content in the preset document to obtain a plurality of difference degree values, and determining location information of at least one entity content corresponding to a difference degree value smaller than a preset difference degree threshold value in the plurality of difference degree values as location information of at least one candidate entity content.
Here, the text type may include a name type and a short string type, and the preset document may be a document in a readable text format, such as Word or TXT, obtained by converting a PDF document. The degree of difference between two entity contents may be the edit distance between them. The preset difference threshold can be adjusted according to the length of the entity content. For example, if the length of the entity content is less than or equal to 3, the preset difference threshold may be set to 0; if the length is greater than 3 and less than or equal to 8, the threshold may be set to 1; if the length is greater than 8 and less than 12, the threshold may be set to 2; if the length is greater than or equal to 12 and less than or equal to 16, the threshold may be set to 3. The preset difference thresholds for entity contents of other lengths may be obtained in the same manner. The location information may include a position start value and a position end value.
For example, the entity labeling device calculates that the entity content "thai resultant force share limited company" in the Word document differs from the target entity content "thai resultant force share limited company" by only one character, that is, the difference value (edit distance) between the two is 1. Since the length of the target entity content is 10, the preset difference threshold is 2, and the difference value 1 is smaller than that threshold, so the entity content is determined as a candidate entity content of the target entity content in the Word document. Since the first character and the last character of the candidate entity content are the 30th and 39th characters of the Word document respectively, the position start value and the position end value of the candidate entity content are 30 and 39 respectively.
S204, if the entity type is a non-text type, determining the position information of at least one candidate entity content in the preset document according to the target entity content and the target entity conversion content.
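The text-type matching described above can be sketched in Python as follows. The function names, the representation of document entity contents as `(content, start, end)` tuples, and the placeholder strings in the usage example are assumptions. Two further assumptions are flagged in the comments: lengths above 16 are not specified in the description, so the sketch reuses the threshold 3; and although the description speaks of a difference value smaller than the threshold, a threshold of 0 would then reject even exact matches, so the sketch compares with `<=`.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def difference_threshold(length):
    """Length-dependent threshold following the ranges given in the description."""
    if length <= 3:
        return 0
    if length <= 8:
        return 1
    if length < 12:
        return 2
    return 3  # 12..16 per the description; longer lengths are an assumption


def text_candidates(target, entities):
    """Return (start, end) of each document entity content close enough to target."""
    limit = difference_threshold(len(target))
    # Assumption: "<=" rather than "<", so that exact matches pass when limit is 0.
    return [(start, end) for content, start, end in entities
            if edit_distance(target, content) <= limit]


# Placeholder 10-character strings that differ in one character.
entities = [("ABCDEFGHIX", 30, 39), ("ZZZZZZZZZZ", 50, 59)]
matches = text_candidates("ABCDEFGHIJ", entities)
```

With these placeholder inputs, only the entity content at positions 30 to 39 qualifies, mirroring the worked example in which an edit distance of 1 falls under the threshold 2 for a target of length 10.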
In some possible embodiments, if the entity type is a non-text type, converting the target entity content according to the non-text type to obtain target entity conversion content, and determining location information of at least one entity content consistent with the target entity content and the target entity conversion content in the preset document as location information of at least one candidate entity content.
Here, the non-text type may include an amount type, a ratio type, a date type, and a currency type. For target entity content whose entity type is the amount type, the target entity content may be converted into different numerical units (such as ten, hundred, thousand, and the like) and into capitalized Chinese numerals; for target entity content whose entity type is the ratio type, it may be converted into the Chinese expression "… percent", the percentage form, and the ratio units "BP" or "BPS"; for target entity content whose entity type is the date type, it may be converted into fixed date expressions such as "xxxx/xx/xx", "xxxx.xx.xx" or "xxxx year xx month xx day", capitalized Chinese, and the like; for target entity content whose entity type is the currency type, it may be converted into the form "pure numeric string + currency unit". Table 1 lists the conversion styles for the non-text types.
TABLE 1
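The following Python sketch illustrates the conversion-and-match idea of step S204 for the date and ratio types only; it does not reproduce the contents of Table 1. The function names are hypothetical, the Chinese renderings "年/月/日" and "百分之" are assumed to correspond to the date and percent expressions mentioned above, and positions are reported as 1-based inclusive character indices, which is also an assumption. The BP conversion relies on 1 percent being equal to 100 basis points.

```python
def date_variants(year, month, day):
    """Surface forms of a date: slash, dot and Chinese year/month/day styles."""
    return [
        f"{year:04d}/{month:02d}/{day:02d}",
        f"{year:04d}.{month:02d}.{day:02d}",
        f"{year:04d}年{month:02d}月{day:02d}日",
    ]


def ratio_variants(percent):
    """Surface forms of a ratio given in percent, including basis points (BP/BPS)."""
    bp = percent * 100  # 1 percent = 100 basis points
    return [f"{percent:g}%", f"百分之{percent:g}", f"{bp:g}BP", f"{bp:g}BPS"]


def find_positions(document, variants):
    """Return sorted (start, end) positions, 1-based inclusive, of every variant occurrence."""
    hits = []
    for variant in variants:
        start = document.find(variant)
        while start != -1:
            hits.append((start + 1, start + len(variant)))
            start = document.find(variant, start + 1)
    return sorted(hits)


document = "Signed on 2020.11.19, due 2020/11/19."
date_hits = find_positions(document, date_variants(2020, 11, 19))
```

Both occurrences of the date are found even though they are written in different styles, which is the purpose of matching against the converted forms as well as the original form.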
S205, determining nearest neighbor entity names of each candidate entity content in a preset document according to the position information of at least one candidate entity content, and calculating the confidence between every two entity names in the to-be-clustered set.
S206, clustering each entity name in the to-be-clustered set according to the confidence between every two entity names to obtain a first clustered group.
S207, determining the position information of the candidate entity content corresponding to each nearest neighbor entity name in the first cluster group as the labeling result of the target entity content.
Here, the specific implementation manner of step S205 to step S207 may refer to the description of step S104 to step S106 in the corresponding embodiment of fig. 1, which is not repeated here.
In the embodiment of the application, the entity labeling device determines the position information of at least one candidate entity content in a preset document according to the entity type of the target entity and the target entity content, further determines the nearest neighbor entity name of each candidate entity content, clusters the nearest neighbor entity name of each candidate entity content and the target entity name according to the confidence coefficient vector of the entity name to obtain a first cluster group containing the target entity name, and determines the position information of the candidate entity content corresponding to the nearest neighbor entity name of all the candidate entity contents in the first cluster group as the labeling result of the target entity content, thereby improving the entity labeling quality and the labeling efficiency and reducing the labor cost.
Fig. 3 is a schematic structural diagram of an entity labeling device according to an embodiment of the present application. As shown in fig. 3, the entity labeling device 3 includes a target entity acquisition module 31, an entity type determination module 32, a location information determination module 33, a determination calculation module 34, a cluster determination module 35, and a labeling result determination module 36.
A target entity obtaining module 31, configured to obtain a target entity, where the target entity includes a target entity name and target entity content;
an entity type determining module 32, configured to determine an entity type of the target entity according to the target entity name and the target entity content;
a location information determining module 33, configured to determine location information of at least one candidate entity content in a preset document according to the entity type and the target entity content;
the determining and calculating module 34 is configured to determine, according to the location information of at least one candidate entity content, a nearest neighbor entity name of each candidate entity content in a preset document, calculate a confidence level between every two entity names in a to-be-clustered set, where the to-be-clustered set includes the nearest neighbor entity name and a target entity name of each candidate entity content;
The cluster group determining module 35 is configured to cluster each entity name in the to-be-clustered set according to the confidence level between every two entity names to obtain a first cluster group, where the first cluster group includes a target entity name and at least one nearest neighbor entity name;
the labeling result determining module 36 is configured to determine, as a labeling result of the target entity content, location information of candidate entity content corresponding to each nearest neighbor entity name in the first cluster group.
Optionally, the location information determining module 33 includes:
The difference calculating unit 331 is configured to calculate a difference between the target entity content and each entity content in the preset document to obtain a plurality of difference values if the entity type is a text type;
The first location determining unit 332 is configured to determine location information of at least one entity content corresponding to a difference value less than a preset difference threshold value among the plurality of difference values as location information of at least one candidate entity content.
Optionally, the location information determining module 33 further includes:
A content conversion unit 333, configured to convert the target entity content according to the non-text type to obtain target entity conversion content if the entity type is the non-text type;
The second location determining unit 334 is configured to determine location information of at least one entity content consistent with the target entity content and the target entity conversion content in the preset document as location information of at least one candidate entity content.
Optionally, the at least one candidate entity content includes a first candidate entity content, and the location information includes a location start value and a location end value;
The determination calculation module 34 includes: nearest neighbor name determination unit 341.
A nearest neighbor name determining unit 341, configured to determine a median between a position start value and a position end value of the first candidate entity content as an absolute position of the first candidate entity content;
And determining the entity name with the smallest distance between the absolute position in the preset document and the absolute position of the first candidate entity content as the nearest neighbor entity name of the first candidate entity content, and further obtaining the nearest neighbor entity name of each candidate entity content.
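A minimal Python sketch of the nearest neighbor determination follows. The function names and the `(name, start, end)` tuple layout are hypothetical, and the sketch assumes that the absolute position of an entity name is computed in the same way as that of a candidate entity content, namely as the midpoint of its start and end values.

```python
def absolute_position(start, end):
    """Midpoint between the position start value and the position end value."""
    return (start + end) / 2


def nearest_neighbor_name(candidate, entity_names):
    """candidate: (start, end); entity_names: list of (name, start, end) tuples."""
    cand_pos = absolute_position(*candidate)
    nearest = min(
        entity_names,
        key=lambda e: abs(absolute_position(e[1], e[2]) - cand_pos),
    )
    return nearest[0]


# Hypothetical entity names located in the preset document.
entity_names = [("revenue", 15, 18), ("date", 40, 41)]
name = nearest_neighbor_name((21, 22), entity_names)
```

Here the candidate at positions 21 to 22 has absolute position 21.5, which is closer to "revenue" (16.5) than to "date" (40.5).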
Optionally, the at least one nearest neighbor entity name comprises a first nearest neighbor entity name;
the determination calculation module 34 includes: the confidence calculation unit 342.
A confidence calculating unit 342, configured to obtain location information of a plurality of entity names consistent with the first nearest neighbor entity name in the preset document;
determining the nearest neighbor entity name of each entity name according to the position information of each entity name to obtain a nearest neighbor entity name set;
And determining the occurrence probability of the target entity name in the nearest neighbor entity name set as the confidence between the first nearest neighbor entity name and the target entity name, and further obtaining the confidence between every two entity names.
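The confidence calculation can be sketched in Python as follows. The function names and the `(name, start, end)` tuple layout are hypothetical. The sketch also assumes that the nearest neighbor of an occurrence is searched among all other entity name occurrences in the document, and that the occurrence probability is the fraction of nearest neighbors equal to the target entity name; the description does not fix either detail.

```python
def midpoint(start, end):
    """Absolute position of an entity name occurrence."""
    return (start + end) / 2


def confidence(first_nn_name, target_name, entity_names):
    """Share of occurrences of first_nn_name whose nearest neighbor is target_name."""
    neighbors = []
    for i, (name, start, end) in enumerate(entity_names):
        if name != first_nn_name:
            continue
        # Assumption: every other occurrence in the document is a possible neighbor.
        others = [e for j, e in enumerate(entity_names) if j != i]
        if not others:
            continue
        pos = midpoint(start, end)
        nearest = min(others, key=lambda e: abs(midpoint(e[1], e[2]) - pos))
        neighbors.append(nearest[0])
    if not neighbors:
        return 0.0
    return neighbors.count(target_name) / len(neighbors)


# Hypothetical occurrences of entity names in the preset document.
entity_names = [
    ("turnover", 10, 17),
    ("revenue", 20, 26),
    ("turnover", 50, 57),
    ("date", 60, 63),
]
p = confidence("turnover", "revenue", entity_names)
```

In this example "turnover" occurs twice; the nearest neighbor of the first occurrence is "revenue" and that of the second is "date", so the confidence between "turnover" and "revenue" is 0.5.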
Optionally, the cluster group determining module 35 includes:
A vector determining unit 351, configured to determine a confidence vector of each entity name according to the confidence between every two entity names;
The traversal allocation unit 352 is configured to traverse the distance between the confidence vector of each entity name and each initial cluster center, allocate the confidence vector of each entity name to the cluster group corresponding to the initial cluster center with the smallest distance, and further obtain n initial cluster groups;
A calculation dividing unit 353, configured to calculate a distance between a cluster center of each initial cluster group and an initial cluster center of each initial cluster group, and divide the set to be clustered into n cluster groups when the distance meets a convergence condition;
The cluster group determining unit 354 is configured to determine a cluster group including the target entity name from among the n cluster groups as a first cluster group.
Optionally, the entity type determining module 32 is configured to determine at least one preset entity type including the target entity content in the entity content ranges of the plurality of preset entity types as at least one candidate entity type; and determining the candidate entity type corresponding to the keyword set containing the target entity name in the keyword set of the at least one candidate entity type as the entity type of the target entity.
It will be appreciated that the entity labeling device 3 is configured to implement the steps performed by the entity labeling device in the embodiments of fig. 1 and 2. For the specific implementation and corresponding beneficial effects of the functional blocks included in the entity labeling device 3 of fig. 3, reference may be made to the foregoing description of the embodiments of fig. 1 and 2, which is not repeated here.
The entity labeling device 3 in the embodiment shown in fig. 3 may be implemented as the server 400 shown in fig. 4. Fig. 4 is a schematic structural diagram of a server provided in an embodiment of the present application. As shown in fig. 4, the server 400 may include one or more processors 401, a memory 402, and a transceiver 403, which are connected by a bus 404. The transceiver 403 is configured to receive or transmit data; the memory 402 is configured to store a computer program including program instructions; and the processor 401 is configured to execute the program instructions stored in the memory 402 to perform the following operations:
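The two-stage entity type determination can be sketched in Python as follows. Everything concrete in this sketch is hypothetical: the description does not say how an entity content range or a keyword set is represented, so the sketch models the content range as a regular expression and the keyword set as a set of English strings.

```python
import re

# Hypothetical preset entity types: a content range (regex) and a keyword set each.
ENTITY_TYPES = {
    "date": {
        "pattern": re.compile(r"\d{4}[/.]\d{1,2}[/.]\d{1,2}"),
        "keywords": {"date", "deadline"},
    },
    "ratio": {
        "pattern": re.compile(r"\d+(\.\d+)?%"),
        "keywords": {"rate", "ratio"},
    },
    "amount": {
        "pattern": re.compile(r"\d+(\.\d+)?"),
        "keywords": {"revenue", "amount"},
    },
}


def determine_entity_type(target_name, target_content):
    """Return the entity type of the target entity, or None if no type matches."""
    # Stage 1: candidate types are those whose content range covers the content.
    candidate_types = [
        type_name
        for type_name, spec in ENTITY_TYPES.items()
        if spec["pattern"].fullmatch(target_content)
    ]
    # Stage 2: keep the candidate whose keyword set contains the target entity name.
    for type_name in candidate_types:
        if target_name in ENTITY_TYPES[type_name]["keywords"]:
            return type_name
    return None
```

For instance, `determine_entity_type("rate", "3.5%")` yields `"ratio"` under these hypothetical presets, because only the ratio content range covers "3.5%" and its keyword set contains "rate".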
acquiring a target entity, wherein the target entity comprises a target entity name and target entity content;
Determining the entity type of the target entity according to the name and the content of the target entity;
determining the position information of at least one candidate entity content in a preset document according to the entity type and the target entity content;
Determining nearest neighbor entity names of each candidate entity content in a preset document according to the position information of at least one candidate entity content, and calculating confidence between every two entity names in a to-be-clustered set, wherein the to-be-clustered set comprises the nearest neighbor entity name and a target entity name of each candidate entity content;
Clustering each entity name in the to-be-clustered set according to the confidence coefficient between every two entity names to obtain a first clustered group, wherein the first clustered group comprises a target entity name and at least one nearest neighbor entity name;
And determining the position information of the candidate entity content corresponding to each nearest neighbor entity name in the first cluster group as the labeling result of the target entity content.
Optionally, the processor 401 determines, according to the entity type and the target entity content, location information of at least one candidate entity content in a preset document, and specifically performs the following operations:
if the entity type is the text type, calculating the difference degree between the target entity content and each entity content in the preset document to obtain a plurality of difference degree values;
And determining the position information of at least one entity content corresponding to the difference value smaller than the preset difference threshold value in the difference values as the position information of at least one candidate entity content.
Optionally, the processor 401 determines, according to the entity type and the target entity content, location information of at least one candidate entity content in a preset document, and specifically further performs the following operations:
if the entity type is a non-text type, converting the target entity content according to the non-text type to obtain target entity conversion content;
and determining the position information of at least one entity content consistent with the target entity content and the target entity conversion content in the preset document as the position information of at least one candidate entity content.
Optionally, the at least one candidate entity content includes a first candidate entity content, and the location information includes a location start value and a location end value;
The processor 401 determines, in a preset document, a nearest neighbor entity name of each candidate entity content according to the location information of at least one candidate entity content, and specifically performs the following operations:
Determining a median between a position start value and a position end value of the first candidate entity content as an absolute position of the first candidate entity content;
And determining the entity name with the smallest distance between the absolute position in the preset document and the absolute position of the first candidate entity content as the nearest neighbor entity name of the first candidate entity content, and further obtaining the nearest neighbor entity name of each candidate entity content.
Optionally, the at least one nearest neighbor entity name comprises a first nearest neighbor entity name;
the processor 401 calculates the confidence between every two entity names in the set to be clustered, and specifically performs the following operations:
acquiring position information of a plurality of entity names consistent with the first nearest neighbor entity name in a preset document;
determining the nearest neighbor entity name of each entity name according to the position information of each entity name to obtain a nearest neighbor entity name set;
And determining the occurrence probability of the target entity name in the nearest neighbor entity name set as the confidence between the first nearest neighbor entity name and the target entity name, and further obtaining the confidence between every two entity names.
Optionally, the processor 401 clusters each entity name in the to-be-clustered set according to the confidence between every two entity names to obtain a first clustered group, and specifically performs the following operations:
Determining a confidence vector of each entity name according to the confidence between every two entity names;
traversing the distance between the confidence coefficient vector of each entity name and each initial cluster center, and distributing the confidence coefficient vector of each entity name to the cluster group corresponding to the initial cluster center with the smallest distance, so as to obtain n initial cluster groups;
Calculating the distance between the cluster center of each initial cluster group and the initial cluster center of each initial cluster group, and dividing the set to be clustered into n cluster groups when the distance meets the convergence condition;
and determining a cluster group comprising the target entity name from the n cluster groups as a first cluster group.
Optionally, the processor 401 determines the entity type of the target entity according to the name of the target entity and the content of the target entity, and specifically performs the following operations:
Determining at least one preset entity type containing target entity content in the entity content range of the plurality of preset entity types as at least one candidate entity type;
And determining the candidate entity type corresponding to the keyword set containing the target entity name in the keyword set of the at least one candidate entity type as the entity type of the target entity.
In an embodiment of the present application, a computer storage medium is further provided, which may be used to store computer software instructions for the entity labeling device in the embodiment shown in fig. 3, where the computer software instructions include a program designed for the entity labeling device in the foregoing embodiment. The storage medium includes, but is not limited to, flash memory, a hard disk, and a solid state disk.
In an embodiment of the present application, there is further provided a computer program product which, when executed by a computing device, can perform the entity labeling method designed for the entity labeling device in the embodiment shown in fig. 3.
The terms "first," "second," "third," and "fourth" and the like in the description and in the claims and drawings are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
In the present application, "A and/or B" means one of the following: A; B; A and B. "At least one of …" refers to any combination of the listed items or any number of the listed items. For example, "at least one of A, B and C" refers to any one of the following seven cases: A; B; C; A and B; B and C; A and C; A, B and C.
Those skilled in the art will appreciate that all or part of the methods in the above embodiments may be implemented by a computer program stored on a computer readable storage medium, which, when executed, may include the steps of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The method and related apparatus provided in the embodiments of the present application are described with reference to the flowcharts and/or schematic structural diagrams provided in the embodiments of the present application. Each flow and/or block of the flowcharts and/or schematic structural diagrams, and combinations of flows and/or blocks therein, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or structural diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in the flowchart flow or flows and/or structural diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or structural diagram block or blocks.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims (10)

1. A method for labeling entities, comprising:
Acquiring a target entity, wherein the target entity comprises a target entity name and target entity content;
determining the entity type of the target entity according to the target entity name and the target entity content;
Determining the position information of at least one candidate entity content in a preset document according to the entity type and the target entity content;
Determining nearest neighbor entity names of each candidate entity content in the preset document according to the position information of the at least one candidate entity content, and calculating confidence between every two entity names in a set to be clustered, wherein the set to be clustered comprises the nearest neighbor entity names of each candidate entity content and the target entity names;
clustering each entity name in the to-be-clustered set according to the confidence between every two entity names to obtain a first clustered group, wherein the first clustered group comprises the target entity name and at least one nearest neighbor entity name;
and determining the position information of the candidate entity content corresponding to each nearest neighbor entity name in the first cluster group as a labeling result of the target entity content.
2. The method according to claim 1, wherein determining location information of at least one candidate entity content in a preset document according to the entity type and the target entity content comprises:
if the entity type is a text type, calculating the difference degree between the target entity content and each entity content in the preset document to obtain a plurality of difference degree values;
And determining the position information of at least one entity content corresponding to the difference value smaller than a preset difference threshold value in the difference values as the position information of the at least one candidate entity content.
3. The method according to claim 1, wherein determining location information of at least one candidate entity content in a preset document according to the entity type and the target entity content, further comprises:
If the entity type is a non-text type, converting the target entity content according to the non-text type to obtain target entity conversion content;
And determining the position information of at least one entity content consistent with the target entity content and the target entity conversion content in the preset document as the position information of the at least one candidate entity content.
4. The method of claim 1, wherein the at least one candidate entity content comprises a first candidate entity content, and the location information comprises a location start value and a location end value;
The determining, according to the location information of the at least one candidate entity content, the nearest neighbor entity name of each candidate entity content in the preset document includes:
Determining a median between a position start value and a position end value of the first candidate entity content as an absolute position of the first candidate entity content;
And determining the entity name with the smallest distance between the absolute position in the preset document and the absolute position of the first candidate entity content as the nearest neighbor entity name of the first candidate entity content, and further obtaining the nearest neighbor entity name of each candidate entity content.
5. The method of claim 1, wherein the at least one nearest neighbor entity name comprises a first nearest neighbor entity name;
the calculating the confidence coefficient between every two entity names in the set to be clustered comprises the following steps:
Acquiring position information of a plurality of entity names consistent with the first nearest neighbor entity name in the preset document;
Determining the nearest neighbor entity name of each entity name according to the position information of each entity name to obtain a nearest neighbor entity name set;
and determining the occurrence probability of the target entity name in the nearest neighbor entity name set as the confidence between the first nearest neighbor entity name and the target entity name, and further obtaining the confidence between every two entity names.
6. The method of claim 1, wherein the clustering each entity name in the set to be clustered according to the confidence between every two entity names to obtain a first cluster group comprises:
determining a confidence vector of each entity name according to the confidence between every two entity names;
computing the distance between the confidence vector of each entity name and each initial cluster center, and assigning the confidence vector of each entity name to the cluster group corresponding to the nearest initial cluster center, so as to obtain n initial cluster groups;
calculating, for each initial cluster group, the distance between its cluster center and its initial cluster center, and dividing the set to be clustered into n cluster groups when the distance meets a convergence condition; and
determining the cluster group that comprises the target entity name among the n cluster groups as the first cluster group.
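The clustering steps of this claim resemble a k-means iteration, which the following sketch illustrates in plain Python. Several details are assumptions rather than part of the claim: Euclidean distance, a self-confidence of 1.0 on the diagonal of the confidence matrix, a convergence condition expressed as a maximum center shift below `tol`, and the hypothetical function names.

```python
import math
from typing import Dict, List, Tuple


def confidence_vectors(names: List[str],
                       conf: Dict[Tuple[str, str], float]) -> Dict[str, List[float]]:
    """Build one vector per name; missing pairs default to 0.0 (assumption)."""
    return {a: [1.0 if a == b else conf.get((a, b), 0.0) for b in names] for a in names}


def kmeans(vectors: Dict[str, List[float]], init_centers: List[List[float]],
           tol: float = 1e-6, max_iter: int = 100) -> List[List[str]]:
    """Assign names to the nearest center and repeat until the centers stop moving."""
    centers = [list(c) for c in init_centers]
    groups: List[List[str]] = [[] for _ in centers]
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for name, vec in vectors.items():
            idx = min(range(len(centers)), key=lambda i: math.dist(vec, centers[i]))
            groups[idx].append(name)
        new_centers = []
        for group, old in zip(groups, centers):
            if group:
                new_centers.append([sum(vectors[n][d] for n in group) / len(group)
                                    for d in range(len(old))])
            else:
                new_centers.append(old)  # keep the center of an empty group
        shift = max(math.dist(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:  # assumed convergence condition
            break
    return groups


def first_cluster_group(groups: List[List[str]], target: str) -> List[str]:
    """Return the cluster group that contains the target entity name."""
    return next(g for g in groups if target in g)


# Hypothetical confidences: weight-like names co-occur, height-like names co-occur.
names = ["weight", "BW", "body weight", "height", "HT"]
conf = {("weight", "BW"): 0.9, ("BW", "weight"): 0.8,
        ("weight", "body weight"): 0.7, ("body weight", "weight"): 0.9,
        ("BW", "body weight"): 0.6, ("body weight", "BW"): 0.7,
        ("height", "HT"): 0.9, ("HT", "height"): 0.8}
vectors = confidence_vectors(names, conf)
groups = kmeans(vectors, [vectors["weight"], vectors["height"]])
cluster = first_cluster_group(groups, "weight")
```

Seeding the initial centers with the vectors of "weight" and "height" is only a convenience for the example; the claim does not say how the initial cluster centers are chosen.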
7. The method according to any one of claims 1-6, wherein the determining an entity type of the target entity according to the target entity name and the target entity content comprises:
determining, as at least one candidate entity type, at least one preset entity type, among a plurality of preset entity types, whose entity content range contains the target entity content; and
determining, as the entity type of the target entity, the candidate entity type, among the at least one candidate entity type, whose keyword set contains the target entity name.
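The two-stage type lookup of this claim can be sketched as a filter followed by a keyword match. The `EntityType` structure, the numeric ranges, and the keyword sets below are hypothetical illustrations; the claim only requires that each preset entity type has an entity content range and a keyword set.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional


@dataclass(frozen=True)
class EntityType:
    label: str
    contains: Callable[[str], bool]  # membership test for the entity content range
    keywords: FrozenSet[str]         # entity names associated with this type


def numeric_range(lo: float, hi: float) -> Callable[[str], bool]:
    """Build a content-range test for numeric content (an assumed range form)."""
    def contains(content: str) -> bool:
        try:
            value = float(content)
        except ValueError:
            return False
        return lo <= value <= hi
    return contains


def determine_entity_type(name: str, content: str,
                          preset_types: List[EntityType]) -> Optional[str]:
    """Filter types by content range, then pick the one whose keywords hold the name."""
    candidates = [t for t in preset_types if t.contains(content)]
    for entity_type in candidates:
        if name in entity_type.keywords:
            return entity_type.label
    return None  # no preset type matches both content and name


# Hypothetical preset types with overlapping content ranges.
preset_types = [
    EntityType("age", numeric_range(0, 150), frozenset({"age"})),
    EntityType("weight_kg", numeric_range(1, 500), frozenset({"weight", "BW"})),
]
label = determine_entity_type("weight", "65", preset_types)
```

The content "65" falls in both ranges, so both types become candidates; the entity name "weight" then selects `weight_kg`.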
8. An entity labeling apparatus, comprising:
a target entity acquisition module, configured to acquire a target entity, wherein the target entity comprises a target entity name and target entity content;
an entity type determining module, configured to determine the entity type of the target entity according to the target entity name and the target entity content;
a position information determining module, configured to determine the position information of at least one candidate entity content in a preset document according to the entity type and the target entity content;
a determining and calculating module, configured to determine the nearest neighbor entity name of each candidate entity content in the preset document according to the position information of the at least one candidate entity content, and to calculate the confidence between every two entity names in a set to be clustered, wherein the set to be clustered comprises the nearest neighbor entity name of each candidate entity content and the target entity name;
a cluster group determining module, configured to cluster each entity name in the set to be clustered according to the confidence between every two entity names to obtain a first cluster group, wherein the first cluster group comprises the target entity name and at least one nearest neighbor entity name; and
a labeling result determining module, configured to determine the position information of the candidate entity content corresponding to each nearest neighbor entity name in the first cluster group as the labeling result of the target entity content.
9. A server comprising a processor, a memory and a transceiver, the processor, the memory and the transceiver being interconnected, wherein the transceiver is configured to receive or transmit data, the memory is configured to store program code, and the processor is configured to invoke the program code to perform the entity labeling method of any of claims 1-7.
10. A storage medium storing a computer program, the computer program comprising program instructions; the program instructions, when executed by a processor, cause the processor to perform the entity labeling method of any of claims 1-7.
CN202011301554.7A 2020-11-19 2020-11-19 Entity labeling method and device, server and storage medium Active CN112328709B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011301554.7A CN112328709B (en) 2020-11-19 2020-11-19 Entity labeling method and device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011301554.7A CN112328709B (en) 2020-11-19 2020-11-19 Entity labeling method and device, server and storage medium

Publications (2)

Publication Number Publication Date
CN112328709A CN112328709A (en) 2021-02-05
CN112328709B true CN112328709B (en) 2024-07-09

Family

ID=74321631

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011301554.7A Active CN112328709B (en) 2020-11-19 2020-11-19 Entity labeling method and device, server and storage medium

Country Status (1)

Country Link
CN (1) CN112328709B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372060A (en) * 2016-08-31 2017-02-01 北京百度网讯科技有限公司 Search text labeling method and device
CN107798136A (en) * 2017-11-23 2018-03-13 北京百度网讯科技有限公司 Entity relation extraction method, apparatus and server based on deep learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933785B (en) * 2019-02-03 2023-06-20 北京百度网讯科技有限公司 Method, apparatus, device and medium for entity association
CN110399616A (en) * 2019-07-31 2019-11-01 国信优易数据有限公司 Name entity detection method, device, electronic equipment and readable storage medium storing program for executing
CN111444344B (en) * 2020-03-27 2022-10-25 腾讯科技(深圳)有限公司 Entity classification method, entity classification device, computer equipment and storage medium
CN111507400B (en) * 2020-04-16 2023-10-31 腾讯科技(深圳)有限公司 Application classification method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112328709A (en) 2021-02-05

Similar Documents

Publication Publication Date Title
CN111797210A (en) Information recommendation method, device and equipment based on user portrait and storage medium
CN112434535B (en) Element extraction method, device, equipment and storage medium based on multiple models
CN106980620B (en) Method and device for matching Chinese character strings
CN106909575B (en) Text clustering method and device
CN112162977B (en) MES-oriented mass data redundancy removing method and system
CN112241458B (en) Text knowledge structuring processing method, device, equipment and readable storage medium
CN113449084A (en) Relationship extraction method based on graph convolution
CN114528413B (en) Knowledge graph updating method, system and readable storage medium supported by crowdsourced marking
CN111062803A (en) Financial business query and review method and system
CN112784585A (en) Abstract extraction method and terminal for financial bulletin
CN114528418B (en) Text processing method, system and storage medium
CN112507170A (en) Data asset directory construction method based on intelligent decision and related equipment thereof
US20210397636A1 (en) Text object management system
CN110597977B (en) Data processing method, data processing device, computer equipment and storage medium
CN112328709B (en) Entity labeling method and device, server and storage medium
CN115687790B (en) Advertisement pushing method and system based on big data and cloud platform
CN109344388B (en) Method and device for identifying spam comments and computer-readable storage medium
CN115048682B (en) Safe storage method for land circulation information
CN107315807B (en) Talent recommendation method and device
CN115712722A (en) Clustering system, method, electronic device and storage medium for multi-language short message text
CN115033699A (en) Fund user classification method and device
CN114780649A (en) Method and device for identifying structured data entity type
CN109474703B (en) Personalized product combination pushing method, device and system
CN114117187A (en) Data query method and related device
CN113535125A (en) Financial demand item generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant