CN113392221B

CN113392221B - Method and related device for processing thin entity

Info

Publication number: CN113392221B
Application number: CN202011184275.7A
Authority: CN
Inventors: 杨石兵; 徐也; 沈卓; 荆宁
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-10-29
Filing date: 2020-10-29
Publication date: 2024-03-19
Anticipated expiration: 2040-10-29
Also published as: CN113392221A

Abstract

The embodiments of the application provide a method and a related device for processing thin entities. The method comprises: acquiring a first text sentence corresponding to a first thin entity and a second text sentence corresponding to a second thin entity, and splicing the first text sentence and the second text sentence to obtain a target text sentence; processing the target text sentence to obtain a text vector corresponding to the target text sentence, and determining a fusion score of the first thin entity and the second thin entity according to that text vector; and, if the fusion score is greater than or equal to a specified score threshold, determining that the first thin entity and the second thin entity are the same entity. By processing the text sentences of the first thin entity and the second thin entity in different splicing modes, the embodiments improve the fusion accuracy of thin entity pairs.

Description

Method and related device for processing thin entity

Technical Field

The present application relates to the field of artificial intelligence, and in particular, to a method for processing a thin entity, a device for processing a thin entity, and a thin entity processing apparatus.

Background

A knowledge graph (Knowledge Graph), which may also be referred to as knowledge domain visualization or knowledge domain mapping, may be regarded as a combination of "entity-relationship-entity" triples. Knowledge graph construction touches a wide range of subject fields, and a successfully constructed knowledge graph benefits many general applications, such as intelligent search, automatic question answering, personalized recommendation, and decision support.

Knowledge graph construction generally comprises three steps: information extraction, knowledge fusion, and knowledge processing. More specifically, the entity fusion technology involved in the knowledge fusion step is currently one of the most active research topics. At present, most fusion techniques target entities in general, mostly measure the matching degree of texts through similarity, and mostly depend on the rich attribute information carried by the entities, so fusion errors occur easily.

Disclosure of Invention

The embodiment of the application provides a method and a related device for processing thin entities, which can effectively improve the fusion accuracy of the thin entity pairs.

In one aspect, the application discloses a method for processing a thin entity, which includes:

acquiring a first text sentence corresponding to a first thin entity and a second text sentence corresponding to a second thin entity, and performing splicing processing on the first text sentence and the second text sentence to obtain a target text sentence;

processing the target text sentence to obtain a text vector corresponding to the target text sentence;

determining a fusion score of the first thin entity and the second thin entity according to the text vector corresponding to the target text sentence;

and if the fusion score is greater than or equal to a specified score threshold, determining that the first thin entity and the second thin entity are the same entity.

In one aspect, the application discloses a device for processing a thin entity, the device comprising:

the acquisition unit is used for acquiring a first text sentence corresponding to the first thin entity and a second text sentence corresponding to the second thin entity;

the processing unit is used for performing splicing processing on the first text sentence and the second text sentence to obtain a target text sentence;

the processing unit is further used for processing the target text sentence to obtain a text vector corresponding to the target text sentence;

a determining unit, configured to determine a fusion score of the first thin entity and the second thin entity according to a text vector corresponding to the target text sentence;

the determining unit is further configured to determine that the first thin entity and the second thin entity are the same entity if the fusion score is greater than or equal to a specified score threshold.

In one aspect, an embodiment of the present application discloses a thin entity processing apparatus, including a memory and a processor: the memory is used for storing a computer program; the processor runs the computer program to realize the processing method for the thin entity.

In one aspect, embodiments of the present application disclose a computer readable storage medium storing a computer program that, when executed by a processor, performs the above-described method of processing a thin entity.

In one aspect, embodiments of the present application disclose a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the above-described processing method for the thin entity.

According to the embodiments of the application, the thin entity processing device acquires a first text sentence corresponding to a first thin entity and a second text sentence corresponding to a second thin entity, and splices the first text sentence and the second text sentence to obtain a target text sentence. Using different splicing modes reduces erroneous fusion and missed recall when thin entity pairs are fused. Further, the thin entity processing device processes the target text sentence to obtain a text vector corresponding to the target text sentence, determines a fusion score of the first thin entity and the second thin entity according to that text vector, and uses the fusion score to decide whether the first thin entity and the second thin entity can be fused: if the fusion score is greater than or equal to a specified score threshold, the first thin entity and the second thin entity are determined to be the same entity. Processing the text sentences of the first thin entity and the second thin entity in different splicing modes thus improves the fusion accuracy of thin entity pairs. In addition, the embodiments of the application further include judging whether an entity to be detected is a thin entity, so that entities can be screened to improve the overall data quality of the knowledge graph.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a method for processing a thin entity according to an embodiment of the present disclosure;

FIG. 2 is a flow chart of a method of handling a thin entity according to an embodiment of the present disclosure;

FIG. 3 is an exemplary diagram of a thin entity fusion as disclosed in an embodiment of the present application;

FIG. 4 is a flow chart of another method of handling thin entities disclosed in an embodiment of the present application;

FIG. 5 is a schematic view of a processing device for thin entities according to an embodiment of the present application;

FIG. 6 is a schematic structural view of a thin entity processing apparatus according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

The method for processing the thin entity provided by the embodiment of the application further relates to:

Artificial intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of sensing, reasoning, and decision making.

Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

The present application relates to the natural language processing and machine learning technologies under artificial intelligence software technology. Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graph techniques, and the like. Machine learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specifically studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and how it reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence, and it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.

More particularly, the present application relates to knowledge graph technology under natural language processing technology. A knowledge graph (Knowledge Graph) is a series of different graphs showing the relationship between the knowledge development process and attributes; corresponding visualization means are then used to show the relationships between knowledge entities, or between knowledge entities and knowledge attributes. The construction of a knowledge graph can be mainly divided into four processes: data acquisition, information extraction, knowledge fusion, and knowledge processing.

In the embodiments of the application, the thin entity processing device acquires a first text sentence corresponding to a first thin entity and a second text sentence corresponding to a second thin entity, and splices the first text sentence and the second text sentence to obtain a target text sentence; this realizes the information extraction step of knowledge graph construction. The thin entity processing device then processes the target text sentence with a machine learning model to obtain a text vector corresponding to the target text sentence, determines a fusion score of the first thin entity and the second thin entity according to that text vector, and uses the fusion score to judge whether fusion can be carried out: if the fusion score is greater than or equal to a specified score threshold, the first thin entity and the second thin entity are determined to be the same entity. This embodies the knowledge fusion step of knowledge graph construction. In this embodiment, the text sentences of the first thin entity and the second thin entity are processed in different splicing modes, which improves the fusion accuracy of thin entity pairs.

Referring to FIG. 1, FIG. 1 is a schematic architecture diagram of a method for processing a thin entity. As shown in FIG. 1, the architecture 100 includes a thin entity processing device 101 and a storage platform 102. The thin entity processing device 101 is mainly used for determining whether an entity to be detected is a thin entity and for fusing a thin entity pair (a first thin entity and a second thin entity). The storage platform 102 is mainly used for storing various data information; in this application, the storage platform 102 mainly stores entity information from various data sources.

In one possible implementation manner, the thin entity processing device 101 obtains a first text sentence corresponding to a first thin entity and a second text sentence corresponding to a second thin entity from the storage platform 102, and splices the first text sentence and the second text sentence to obtain a target text sentence. The thin entity processing device 101 processes the target text sentence to obtain a text vector corresponding to the target text sentence, determines a fusion score of the first thin entity and the second thin entity according to that text vector, and uses the fusion score to decide whether the first thin entity and the second thin entity can be fused: if the fusion score is greater than or equal to a specified score threshold, the first thin entity and the second thin entity are determined to be the same entity. Processing the text sentences of the first thin entity and the second thin entity in different splicing modes improves the fusion accuracy of the thin entity pair.

In one possible implementation manner, the thin entity processing device 101 obtains at least two entities to be detected from the storage platform 102, and scores each of them according to a second scoring rule to obtain a scoring result corresponding to each entity to be detected. The thin entity processing device 101 determines a thin entity set from the at least two entities to be detected according to these scoring results, and acquires a first thin entity and a second thin entity from the thin entity set. In this embodiment, the thin entity processing device 101 mainly determines whether an entity to be detected is a thin entity, so that entities can be screened and the overall data quality of the knowledge graph improved.

The processing method for the thin entity can be applied to intelligent search, relational network construction, inconsistency detection, data anomaly analysis and the like.

The term "thin entity processing device" as used herein, such as the thin entity processing device 101, includes, but is not limited to, a user device, a handheld device with wireless communication capabilities, an in-vehicle device, a wearable device, or a computing device. Illustratively, the thin entity processing device may be a mobile phone, a tablet computer, or a computer with a wireless transceiver function. The thin entity processing device may also be a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a wireless terminal device in industrial control, a wireless terminal device in unmanned driving, a wireless terminal device in telemedicine, a wireless terminal device in a smart grid, a wireless terminal device in a smart city, a wireless terminal device in a smart home, and so on. Alternatively, the thin entity processing device may be a server corresponding to any of these devices. In the technical solution provided in the embodiments of the present application, the device that implements the functions of the thin entity processing device is described by taking the thin entity processing device as an example.

Referring to fig. 2, fig. 2 is a schematic flow chart of a method for processing a thin entity according to an embodiment of the present application, where the flow chart may specifically include the following steps:

S201, the thin entity processing device acquires a first text sentence corresponding to a first thin entity and a second text sentence corresponding to a second thin entity, and splices the first text sentence and the second text sentence to obtain a target text sentence.

Here, a thin entity refers to an entity whose carried attribute information is weak or difficult to identify; an entity refers to something in the real world that is stored in a knowledge graph, such as a person, place name, concept, medicine, or company.

In one possible implementation, before acquiring the first text sentence corresponding to the first thin entity and the second text sentence corresponding to the second thin entity, the thin entity processing device extracts the profile information corresponding to the first thin entity and the profile information corresponding to the second thin entity. The profile information may be taken from the information that the first thin entity and the second thin entity carry when they are obtained from a data source, or may be obtained by splicing the attribute information of the first thin entity and the second thin entity. Because the attribute information of a thin entity is weak, mining the rich semantic information in the profile information and attribute information carried by the entity can assist the fusion of thin entities. For example, when an entity to be detected is obtained from the Internet, some related information carried by the entity is obtained along with it, and this related information may be called profile information. After extracting the profile information corresponding to the first thin entity and the profile information corresponding to the second thin entity, the thin entity processing device cleans and sorts (preprocesses) the profile information, including truncation, keyword screening, filtering, and the like, to generate usable text sentences; the attribute information of the first thin entity and the attribute information of the second thin entity may also be spliced randomly to generate usable text sentences. After this preprocessing, a first text sentence corresponding to the first thin entity and a second text sentence corresponding to the second thin entity are obtained. In this application, the first text sentence and the second text sentence may be referred to as a pair of usable text sentences.

In one possible implementation manner, the thin entity processing device performs forward splicing on the first text sentence and the second text sentence, performs reverse splicing on the first text sentence and the second text sentence, and uses the text sentence obtained by forward splicing and the text sentence obtained by reverse splicing each as a target text sentence. Splicing the first text sentence and the second text sentence facilitates the subsequent processing of the target text sentence. Because the first text sentence and the second text sentence do not need to be spliced in any specific order, performing both forward splicing and reverse splicing reduces the influence of contextual or inferential relations between the sentences, thereby improving the accuracy of thin entity fusion.
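The preprocessing described above can be illustrated with a minimal Python sketch. The function names, the 128-character truncation length, the markup filter, and the "name: value" attribute format are all assumptions made for illustration; the application does not specify them. For reproducibility the sketch splices attributes in the order given rather than randomly.

```python
import re

MAX_CHARS = 128  # hypothetical truncation length, not specified in the application


def clean_profile(profile, stop_phrases=()):
    """Filter, screen, and truncate raw profile information (illustrative only)."""
    text = re.sub(r"<[^>]+>", " ", profile)    # filtering: drop markup remnants
    for phrase in stop_phrases:                # keyword screening: drop unwanted phrases
        text = text.replace(phrase, " ")
    text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace
    return text[:MAX_CHARS]                    # truncation


def attributes_to_sentence(attributes):
    """Splice attribute name/value pairs into one text sentence."""
    return "; ".join(f"{name}: {value}" for name, value in attributes.items())


first_sentence = clean_profile(
    "<b>Movie A</b>  is a   2019 film. Click to expand", ("Click to expand",)
)
second_sentence = attributes_to_sentence({"director": "X", "year": "2019"})
```

Either function yields a plain string, so profile-derived and attribute-derived sentences can be used interchangeably as the first or second text sentence.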
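Forward and reverse splicing can be sketched in a few lines of Python. The [SEP] separator follows the example given for FIG. 3 in this description; the function name and the single-space joining are assumptions for illustration.

```python
from typing import Tuple

SEP = "[SEP]"  # separator token, as in the FIG. 3 example


def splice_pair(first_sentence: str, second_sentence: str) -> Tuple[str, str]:
    """Return the forward-spliced and reverse-spliced target text sentences."""
    forward = f"{first_sentence} {SEP} {second_sentence} {SEP}"
    reverse = f"{second_sentence} {SEP} {first_sentence} {SEP}"
    return forward, reverse


forward, reverse = splice_pair("e1 profile information", "e2 profile information")
print(forward)  # e1 profile information [SEP] e2 profile information [SEP]
print(reverse)  # e2 profile information [SEP] e1 profile information [SEP]
```

Both strings are kept as target text sentences, so the model sees the pair in both orders.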

S202, the thin entity processing equipment processes the target text sentence to obtain a text vector corresponding to the target text sentence.

There is more than one target text sentence: the target text sentences may include the text sentence obtained by forward splicing and the text sentence obtained by reverse splicing. The thin entity processing device processes the target text sentences by invoking a model, and the model used can be changed flexibly.

In one possible implementation manner, the thin entity processing device invokes a pre-trained model, BERT (Bidirectional Encoder Representations from Transformers), to process the text sentence obtained by forward splicing and obtain a corresponding first text vector, and invokes BERT to process the text sentence obtained by reverse splicing and obtain a corresponding second text vector. The BERT model is built on a bidirectional Transformer encoder and is a method for pre-training language representations: a general language understanding model is trained on a large text corpus (Wikipedia), and the model is then used to perform the desired natural language processing (NLP) task. Applying the BERT model to the fusion of thin entities can improve the accuracy and recall of thin entity fusion.

S203, the thin entity processing device determines a fusion score of the first thin entity and the second thin entity according to the text vector corresponding to the target text sentence.

As described above, the text vectors corresponding to the target text sentences include a first text vector (corresponding to the forward-spliced text sentence) and a second text vector (corresponding to the reverse-spliced text sentence). Thus, in one possible implementation, the thin entity processing device fuses the first text vector and the second text vector to obtain a fused text vector. The fusion mode can be changed flexibly; it may be direct splicing or a bitwise AND operation. The fusion mode adopted by the application is to directly splice the first text vector and the second text vector. The fused text vector is then subjected to dimension transformation, during which training parameters can be introduced to transform the text vector. In one possible embodiment, the dimension transformation is performed by a fully connected network layer, and the usual processing is to reduce a high-dimensional vector to a low-dimensional vector. Finally, the dimension-transformed text vector is evaluated with a first scoring rule to obtain the fusion score of the first thin entity and the second thin entity. The fusion score is a specific numerical value in the range 0 to 1 and indicates the likelihood that the first thin entity and the second thin entity can be fused, that is, the probability that they can be fused successfully.
The first scoring rule may be a sigmoid function, whose output lies between 0 and 1. In a classification task, the output of the sigmoid function is generally interpreted as an event probability, and the input is judged to belong to the positive class when a certain probability condition is met. In this embodiment, when the value output by the sigmoid function reaches the specified threshold, it is determined that the thin entities can be fused.
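The scoring path of S203 (direct splicing of the two text vectors, a fully connected layer, then a sigmoid) can be sketched in pure Python. This is a minimal illustration under stated assumptions: the fully connected layer is assumed to reduce the fused vector directly to a single value, and `weights` and `bias` stand in for trained parameters that are not given in the application.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function with output in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fusion_score(first_vector, second_vector, weights, bias=0.0):
    """Splice two text vectors, apply one linear layer, and squash with sigmoid."""
    fused = list(first_vector) + list(second_vector)  # direct splicing
    if len(weights) != len(fused):
        raise ValueError("weights must match the length of the fused vector")
    # Fully connected layer (assumed here to reduce to a single dimension).
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return sigmoid(logit)


# Toy vectors and hypothetical parameters, for illustration only.
score = fusion_score([0.2, 0.4], [0.1, 0.3], weights=[1.0, 2.0, -1.0, 0.5], bias=0.1)
```

In practice the two text vectors would come from the pre-trained model and the parameters would be learned during training; the sketch only shows how the pieces connect.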

S204, if the fusion score is greater than or equal to the specified score threshold, the thin entity processing device determines that the first thin entity and the second thin entity are the same entity.

The specified score threshold is also a numerical value, and the numerical value can be set by a developer according to specific conditions or can be calculated according to historical data.

In one possible implementation manner, if the obtained fusion score is greater than or equal to the specified score threshold, the thin entity processing device may determine that the first thin entity and the second thin entity are the same entity, that is, the first thin entity and the second thin entity may be successfully fused.

Steps S201 to S204 may be illustrated by a specific example. FIG. 3 is an exemplary diagram of thin entity fusion disclosed in an embodiment of the present application, in which "e1" corresponds to the first thin entity, "e2" corresponds to the second thin entity, "e1 profile information [SEP]" corresponds to the first text sentence, and "e2 profile information [SEP]" corresponds to the second text sentence. Forward splicing and reverse splicing are performed on the first text sentence and the second text sentence to obtain the corresponding target text sentences: as shown in FIG. 3, "e1 profile information [SEP] e2 profile information [SEP]" is the target text sentence obtained by forward splicing, and "e2 profile information [SEP] e1 profile information [SEP]" is the target text sentence obtained by reverse splicing. The two target text sentences are processed by the pre-trained model to obtain a first text vector corresponding to the forward-spliced text sentence and a second text vector corresponding to the reverse-spliced text sentence. The first text vector and the second text vector are fused, dimension transformation is performed through a fully connected network, and the dimension-transformed vector is passed through a sigmoid function to obtain a probability value (the fusion score).
For example, if the output probability value is 0.9684 and the specified score threshold is 0.95, then since 0.9684 > 0.95, "e1" (the first thin entity) and "e2" (the second thin entity) can be fused, that is, "e1" and "e2" are the same entity. As another example, if the specified score threshold is 0.98, then since 0.9684 < 0.98, "e1" and "e2" cannot be fused, that is, "e1" and "e2" are not the same entity.
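The threshold decision of S204, using the numbers from the example above, can be written as a short Python sketch. The function name is hypothetical; the comparison is the "greater than or equal to" rule stated in S204.

```python
def is_same_entity(fusion_score: float, score_threshold: float) -> bool:
    """Return True when the fusion score reaches the specified score threshold."""
    return fusion_score >= score_threshold


print(is_same_entity(0.9684, 0.95))  # True: e1 and e2 can be fused
print(is_same_entity(0.9684, 0.98))  # False: e1 and e2 are not fused
```

The same score leads to different decisions under different thresholds, which is why the choice of threshold matters.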

In the embodiment of the application, the thin entity processing device acquires a first text sentence corresponding to a first thin entity and a second text sentence corresponding to a second thin entity, and splices the first text sentence and the second text sentence to obtain a target text sentence. Processing the first text sentence and the second text sentence in different splicing modes reduces erroneous fusion and missed recall when thin entity pairs are fused.

Referring to fig. 4, fig. 4 is a schematic flow chart of another method for processing a thin entity disclosed in an embodiment of the present application, where the schematic flow chart may be divided into two parts, one part is to determine whether an entity to be detected is a thin entity, and the other part is to determine whether the determined thin entity pair can be fused, and specifically the method flow may include the following steps:

S401, the thin entity processing equipment acquires at least two entities to be detected, and scores the at least two entities to be detected respectively according to a second scoring rule to obtain scoring results corresponding to the entities to be detected.

In one possible implementation, the thin entity processing device obtains at least two entities to be detected from different data sources. The number of entities to be detected must be greater than or equal to 2, so that the first thin entity and the second thin entity can be obtained subsequently. Obtaining the entities to be detected from different data sources means that what appears to be the same entity is drawn from different data sources. For example, entity A to be detected is obtained from data source 1, entity B to be detected is obtained from data source 2, and entity C to be detected is obtained from data source 3. To a human observer, entities A, B, and C are the same entity, but because they come from different data sources, it is necessary to determine whether they truly are the same entity. If they are, they can be fused, and the fused entity can then be added to the knowledge graph.

After obtaining the at least two entities to be detected, the thin entity processing device scores each of them according to a second scoring rule to obtain a scoring result corresponding to each entity to be detected. Specifically, the thin entity processing device determines a reference entity to be detected, which is any one of the at least two entities to be detected, and acquires N pieces of attribute information of the reference entity to be detected, where N is a positive integer. The thin entity processing device calculates a first weight value corresponding to each of the N pieces of attribute information according to a first calculation mode, calculates a second weight value corresponding to each piece of attribute information according to a second calculation mode, and calculates the scoring result of the reference entity to be detected according to the first weight values and the second weight values.

In one possible implementation, the second scoring rule may be a preset thin entity scoring formula used to evaluate the entity to be detected, as shown in formula (1):

where $s$ is the final scoring result of the entity to be detected, obtained by accumulating the weight scores of all attributes carried by the entity; $w_i$ is the weight score (first weight value) of the i-th attribute of the entity to be detected; and $w_{enhance}$ is the enhancement weight (second weight value) of the current attribute. $w_i$ is calculated by formula (2), and $w_{enhance}$ by formula (3).

The formula (2) is as follows:

$$w_i = w_{frequency} \cdot w_{importance} \cdot w_{model} \qquad (2)$$

where $w_i$ is the weight score of the i-th attribute of the entity to be detected. $w_{frequency}$ is the frequency with which the i-th attribute occurs in the category to which the current entity belongs; it can be calculated as the ratio of the number of entities in the category that carry the i-th attribute to the total number of entities in the category. For example, suppose the entity to be detected is movie A, its category is movie, and the i-th attribute is actor. If the movie category contains 1000 entities, of which 800 have the actor attribute, then $w_{frequency} = 800/1000 = 4/5$. $w_{importance}$ is the importance of the i-th attribute to the category to which the entity to be detected belongs; it can be calculated by taking the logarithm of the ratio of the total number of categories to the number of categories in which the attribute appears. For example, suppose there are 100 categories in total, movie A appears in several of them (such as movies, shows, and songs), and 4 categories contain the attribute actor; then $w_{importance} = \log(100/4)$. $w_{model}$ is an attribute weight enhancement that can be adjusted flexibly; in the present application, the attribute importance that is manually set in existing entity fusion methods is used for this term in the calculation.
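The worked example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `attribute_weight` and its parameter names are hypothetical, the natural logarithm is an assumption (the text does not state the base), and `w_model` defaults to 1.0 purely as a placeholder for the manually set importance.

```python
import math


def attribute_weight(entities_with_attr, entities_in_category,
                     total_categories, categories_with_attr,
                     w_model=1.0):
    """Formula (2): w_i = w_frequency * w_importance * w_model."""
    # Share of entities in the category that carry the attribute.
    w_frequency = entities_with_attr / entities_in_category
    # Log of (total categories / categories containing the attribute).
    # Assumption: natural logarithm; the source does not give the base.
    w_importance = math.log(total_categories / categories_with_attr)
    return w_frequency * w_importance * w_model


# Values from the movie/actor example: 800 of 1000 movie entities have
# the actor attribute, and 4 of 100 categories contain it.
w_i = attribute_weight(800, 1000, 100, 4)
print(w_i)
```

An attribute that appears in every category gets `w_importance = log(1) = 0`, so under this formula it contributes nothing to the score no matter how frequent it is.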

Equation (3) is as follows:

where $w_{enhance}$ is the enhancement weight of the attribute. It changes dynamically according to the information of the entity to be detected: if the attribute has 1 value, $w_{enhance} = 1$; if the attribute has D values, $w_{enhance}$ is given by formula (3). Here D is the number of values of the attribute. For example, if the attribute is actor and the actors are actor 1, actor 2, and actor 3, then D = 3.

S402, the thin entity processing equipment determines a thin entity set from at least two entities to be detected according to scoring results corresponding to the entities to be detected.

In a possible implementation, the scoring result of each entity to be detected is obtained in step S401 by formula (1). From these scoring results, the thin entity processing device determines the reference scoring results, that is, those greater than or equal to a first scoring threshold and less than or equal to a second scoring threshold. Each entity to be detected that corresponds to a reference scoring result is determined to be a thin entity, and these entities together form the thin entity set. The first scoring threshold and the second scoring threshold may be derived automatically from accumulated prior knowledge, or may be set by a developer.

In one possible implementation, if the scoring result of an entity to be detected is less than the first scoring threshold, its attribute information can be considered too thin to be identifiable, and adding it to the knowledge graph would reduce the graph's overall quality. Such an entity may be filtered out directly, or new information may be crawled from the internet to update its attribute information. If a scoring result is greater than the second scoring threshold, the corresponding entity can be handled by mature fusion techniques, which are not described in detail here.
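The threshold logic of step S402 can be sketched as a three-way split. The function name `triage`, the entity identifiers, and the threshold values in the usage lines are hypothetical; only the comparison rules (below the first threshold, between the two thresholds inclusive, above the second threshold) come from the text.

```python
def triage(scores, first_threshold, second_threshold):
    """Split entities by score into (thin, too_thin, rich) lists.

    scores maps an entity identifier to its scoring result s.
    """
    thin, too_thin, rich = [], [], []
    for entity_id, s in scores.items():
        if s < first_threshold:
            # Too thin to identify: filter out or re-crawl its attributes.
            too_thin.append(entity_id)
        elif s <= second_threshold:
            # Reference scoring result: the entity joins the thin entity set.
            thin.append(entity_id)
        else:
            # Rich enough for conventional entity fusion.
            rich.append(entity_id)
    return thin, too_thin, rich


# Hypothetical scores and thresholds, for illustration only.
scores = {"A": 0.5, "B": 2.0, "C": 7.5, "D": 1.0}
thin_set, too_thin, rich = triage(scores, first_threshold=1.0, second_threshold=5.0)
print(thin_set, too_thin, rich)
```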

Further, in one possible implementation, when determining whether an entity to be detected is a thin entity, the thin entity processing device may add a schema constraint to further improve the accuracy of thin entity identification.

S403, the thin entity processing device acquires a first thin entity and a second thin entity from the thin entity set.

In one possible implementation, after determining the thin entity set from the entities to be detected, the thin entity processing device obtains two thin entities from the set as the first thin entity and the second thin entity (which together form a thin entity pair). Note that in practice multiple such pairs may be obtained, mainly because there may be multiple data sources. Each time, one pair of thin entities undergoes fusion detection to determine whether the first thin entity and the second thin entity are the same entity. Thin entities found to be the same are fused, and the fused entity is added to the knowledge graph, enriching its information and improving its overall quality.
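One simple way to obtain the thin entity pairs is to enumerate every unordered pair in the thin entity set. This enumeration strategy is an assumption; the text only says that multiple pairs may be obtained and that one pair is checked at a time. The function name `candidate_pairs` is hypothetical.

```python
from itertools import combinations


def candidate_pairs(thin_entities):
    """Return every unordered (first, second) pair from the thin entity set."""
    return list(combinations(thin_entities, 2))


# Hypothetical thin entity set drawn from three data sources.
pairs = candidate_pairs(["A", "B", "C"])
for first_thin, second_thin in pairs:
    # Each pair would be passed to fusion detection (steps S404 to S407).
    print(first_thin, second_thin)
```

For n thin entities this yields n(n-1)/2 pairs, so a large set may need a cheaper pre-filter before pairwise fusion detection.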

S404, the thin entity processing equipment acquires a first text sentence corresponding to the first thin entity and a second text sentence corresponding to the second thin entity, and performs splicing processing on the first text sentence and the second text sentence to obtain a target text sentence.

S405, the thin entity processing equipment processes the target text sentence to obtain a text vector corresponding to the target text sentence.

S406, the thin entity processing device determines a fusion score of the first thin entity and the second thin entity according to the text vector corresponding to the target text sentence.

S407, if the fusion score is greater than or equal to the specified score threshold, the thin entity processing device determines that the first thin entity and the second thin entity are the same entity.

Steps S404 to S407 are the same as steps S201 to S204 in fig. 2 and are not described again here. Note that when fusing the first thin entity and the second thin entity, additional auxiliary features, such as feature information derived from the text sentences, may be added to improve fusion accuracy. The first scoring rule and the second scoring rule are different from each other and have no sequential relationship.

In this embodiment of the application, the thin entity processing device acquires at least two entities to be detected and scores them according to a second scoring rule to obtain a scoring result for each entity to be detected; the device then determines a thin entity set from the at least two entities to be detected according to those scoring results, and acquires the first thin entity and the second thin entity from the thin entity set. Combined with the embodiment of fig. 2, this improves the fusion accuracy for thin entity pairs.

Referring to fig. 5, fig. 5 is a schematic structural diagram of a processing apparatus for thin entities according to an embodiment of the present application, where the processing apparatus 50 for thin entities may include: an acquisition unit 501, a processing unit 502, a determination unit 503, mainly for:

an obtaining unit 501, configured to obtain a first text sentence corresponding to a first thin entity and a second text sentence corresponding to a second thin entity;

the processing unit 502 is configured to perform a concatenation process on the first text sentence and the second text sentence to obtain a target text sentence;

The processing unit 502 is further configured to process the target text sentence to obtain a text vector corresponding to the target text sentence;

a determining unit 503, configured to determine a fusion score of the first thin entity and the second thin entity according to a text vector corresponding to the target text sentence;

the determining unit 503 is further configured to determine that the first thin entity and the second thin entity are the same entity if the fusion score is greater than or equal to a specified score threshold.

In a possible implementation, when performing splicing processing on the first text sentence and the second text sentence to obtain the target text sentence, the processing unit 502 is specifically configured for:

forward splicing the first text sentence and the second text sentence, and backward splicing the first text sentence and the second text sentence;

and respectively taking the text sentences obtained by forward splicing and the text sentences obtained by reverse splicing as target text sentences.
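The forward and reverse splicing described above can be sketched as follows. The separator string `" [SEP] "` and the function name `splice_both_ways` are assumptions for illustration; this passage does not specify how the two sentences are delimited.

```python
def splice_both_ways(first_sentence, second_sentence, sep=" [SEP] "):
    """Return (forward, reverse) target text sentences.

    forward: first sentence followed by the second.
    reverse: second sentence followed by the first.
    The separator is a hypothetical choice, not taken from the source.
    """
    forward = first_sentence + sep + second_sentence
    reverse = second_sentence + sep + first_sentence
    return forward, reverse


# Hypothetical profile sentences for two thin entities.
forward, reverse = splice_both_ways("Movie A, directed by X.",
                                    "Film A, a 2010 drama.")
print(forward)
print(reverse)
```

Feeding both orders to the encoder means the result does not depend on which entity of the pair happens to be listed first.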

In a possible implementation, when processing the target text sentence to obtain the text vector corresponding to the target text sentence, the processing unit 502 is specifically configured for:

Processing the text sentences obtained by forward splicing to obtain first text vectors corresponding to the text sentences obtained by forward splicing;

and processing the text sentences obtained by the reverse splicing to obtain second text vectors corresponding to the text sentences obtained by the reverse splicing.

In a possible implementation, when determining the fusion score of the first thin entity and the second thin entity according to the text vector corresponding to the target text sentence, the determining unit 503 is specifically configured for:

performing fusion processing on the first text vector and the second text vector to obtain a fused text vector;

and carrying out dimension transformation on the fused text vectors, and obtaining the fusion score of the text vectors after the dimension transformation through a first scoring rule.
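A minimal sketch of these two steps in plain Python follows. Every concrete choice here is an assumption, since this passage does not specify them: the fusion is taken to be an element-wise mean, the dimension transformation a linear projection to a single number (with hypothetical `weights` and `bias`), and the first scoring rule a sigmoid.

```python
import math


def fuse(first_vector, second_vector):
    """Fuse two text vectors. Assumption: element-wise mean."""
    if len(first_vector) != len(second_vector):
        raise ValueError("text vectors must have the same length")
    return [(a + b) / 2.0 for a, b in zip(first_vector, second_vector)]


def fusion_score(first_vector, second_vector, weights, bias=0.0):
    """Map the fused vector to a score in (0, 1)."""
    fused = fuse(first_vector, second_vector)
    if len(weights) != len(fused):
        raise ValueError("weights must match the vector length")
    # Assumed dimension transformation: linear projection to a scalar.
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    # Assumed first scoring rule: sigmoid.
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical vectors, weights, and threshold, for illustration only.
score = fusion_score([1.0, 3.0], [3.0, 5.0], weights=[0.5, -0.25])
is_same_entity = score >= 0.5
print(score, is_same_entity)
```

In a real system the projection weights would be learned together with the text encoder; the fixed numbers above only make the sketch executable.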

In a possible implementation manner, the obtaining unit 501 is further configured to obtain at least two entities to be detected;

the processing unit 502 is further configured to score the at least two entities to be detected according to a second scoring rule, so as to obtain scoring results corresponding to the entities to be detected;

the determining unit 503 is further configured to determine a thin entity set from the at least two entities to be detected according to scoring results corresponding to the entities to be detected;

The obtaining unit 501 is further configured to obtain the first thin entity and the second thin entity from the thin entity set.

In a possible implementation, when determining the thin entity set from the at least two entities to be detected according to the scoring results corresponding to the respective entities to be detected, the determining unit 503 is specifically configured for:

determining each reference scoring result from scoring results corresponding to each entity to be detected, wherein the reference scoring result is larger than or equal to a first scoring threshold value and smaller than or equal to a second scoring threshold value;

determining the entity to be detected corresponding to each reference scoring result as a thin entity;

and determining a thin entity set according to the entities to be detected corresponding to the reference scoring results.

In a possible implementation, when scoring the at least two entities to be detected according to the second scoring rule to obtain the scoring results corresponding to the entities to be detected, the processing unit 502 is specifically configured for:

determining a reference entity to be detected, wherein the reference entity to be detected is any one of the at least two entities to be detected;

acquiring N attribute information of the reference entity to be detected, wherein N is a positive integer;

Calculating a first weight value corresponding to each attribute information in the N attribute information according to a first calculation mode, and calculating a second weight value corresponding to each attribute information according to a second calculation mode;

and calculating a scoring result of the reference entity to be detected according to the first weight value and the second weight value.
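The scoring steps above can be sketched as follows. This passage does not state how the first and second weight values are combined, so the sketch assumes that each attribute contributes the product of its two weight values and that the contributions are summed. That combining rule, the function name `score_entity`, and the sample numbers are all assumptions.

```python
def score_entity(attribute_weights):
    """Score a reference entity to be detected.

    attribute_weights: one (first_weight, second_weight) pair per
    attribute, N pairs in total.
    Assumption: the score is the sum over attributes of
    first_weight * second_weight.
    """
    return sum(first * second for first, second in attribute_weights)


# Hypothetical weights for an entity with N = 2 attributes.
s = score_entity([(2.0, 1.0), (0.5, 3.0)])
print(s)
```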

In this embodiment of the present application, the obtaining unit 501 obtains a first text sentence corresponding to a first thin entity and a second text sentence corresponding to a second thin entity, and the processing unit 502 performs splicing processing on the two text sentences to obtain a target text sentence. Using different splicing modes reduces erroneous fusion and missed recall when thin entity pairs are fused. The processing unit 502 then processes the target text sentence to obtain the corresponding text vector, and the determining unit 503 determines the fusion score of the first thin entity and the second thin entity from that text vector and uses the score to decide whether the two can be fused: if the fusion score is greater than or equal to the specified score threshold, the determining unit 503 determines that the first thin entity and the second thin entity are the same entity. Processing the text sentences of the two thin entities with different splicing modes thus improves the fusion accuracy for thin entity pairs.

Referring to fig. 6, fig. 6 is a schematic structural diagram of a thin entity processing device according to an embodiment of the present application. The thin entity processing device 60 includes at least a processor 601 and a memory 602, which may be connected by a bus or in another manner. The memory 602 may comprise a computer-readable storage medium and is used to store a computer program comprising computer instructions; the processor 601 executes the computer instructions stored in the memory 602. The processor 601 (or central processing unit, CPU) is the computing core and control core of the thin entity processing device 60. It is adapted to implement one or more computer instructions, and in particular to load and execute one or more computer instructions to carry out a corresponding method flow or function.

The present embodiment also provides a computer-readable storage medium (memory), which is a memory device in the thin entity processing device 60 for storing programs and data. It will be appreciated that the memory 602 may include both a built-in storage medium of the thin entity processing device 60 and an extended storage medium supported by the device. The computer-readable storage medium provides storage space that stores the operating system of the thin entity processing device 60. The storage space also stores one or more computer instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor 601. The memory 602 may be a high-speed RAM or a non-volatile memory, such as at least one disk memory; optionally, it may also be at least one computer-readable storage medium located remotely from the processor 601.

In one implementation, the thin entity processing device 60 may be the thin entity processing device 101 in the processing system for thin entities shown in FIG. 1; the memory 602 has stored therein first computer instructions; the first computer instructions stored in the memory 602 are loaded and executed by the processor 601 to implement the corresponding steps in the method embodiments shown in fig. 2 and 4; in particular implementations, the first computer instructions in memory 602 are loaded by processor 601 and perform the steps of:

acquiring a first text sentence corresponding to a first thin entity and a second text sentence corresponding to a second thin entity;

splicing the first text sentence and the second text sentence to obtain a target text sentence;

In a possible implementation, when performing splicing processing on the first text sentence and the second text sentence to obtain the target text sentence, the processor 601 is specifically configured to:

In a possible implementation, when processing the target text sentence to obtain the text vector corresponding to the target text sentence, the processor 601 is specifically configured to:

In a possible implementation, when determining the fusion score of the first thin entity and the second thin entity according to the text vector corresponding to the target text sentence, the processor 601 is specifically configured to:

In a possible implementation, the processor 601 is further configured to:

obtaining at least two entities to be detected, and scoring the at least two entities to be detected according to a second scoring rule to obtain scoring results corresponding to the entities to be detected;

determining a thin entity set from the at least two entities to be detected according to scoring results corresponding to the entities to be detected;

the first thin entity and the second thin entity are obtained from the thin entity set.

In a possible implementation, when determining the thin entity set from the at least two entities to be detected according to the scoring result corresponding to each entity to be detected, the processor 601 is specifically configured to:

In a possible implementation, when scoring the at least two entities to be detected according to the second scoring rule to obtain the scoring results corresponding to the entities to be detected, the processor 601 is specifically configured to:

In this embodiment of the present application, the processor 601 of the thin entity processing device obtains a first text sentence corresponding to a first thin entity and a second text sentence corresponding to a second thin entity, and performs splicing processing on the first text sentence and the second text sentence to obtain a target text sentence. Processing the two text sentences with different splicing modes reduces erroneous fusion and missed recall when thin entity pairs are fused.

According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device can execute the method in the embodiment corresponding to the flowcharts of fig. 2 and 4, and therefore, a detailed description will not be given here.

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed.

The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of processing a thin entity, the method comprising:

acquiring a first thin entity and a second thin entity from the thin entity set; the thin entity refers to an entity carrying weak attribute information or difficult to identify;

acquiring a first text sentence corresponding to the first thin entity and a second text sentence corresponding to the second thin entity;

Processing the text sentences obtained by forward splicing to obtain first text vectors corresponding to the text sentences obtained by forward splicing; processing the text sentences obtained by the reverse splicing to obtain second text vectors corresponding to the text sentences obtained by the reverse splicing;

determining a fusion score of the first thin entity and the second thin entity according to the first text vector and the second text vector;

2. The method of claim 1, wherein the determining a fusion score for the first thin entity and the second thin entity from the first text vector and the second text vector comprises:

3. The method according to claim 1 or 2, wherein the determining a set of thin entities from the at least two entities to be detected according to the scoring result corresponding to each entity to be detected comprises:

4. The method according to claim 1 or 2, wherein scoring the at least two to-be-detected entities according to the second scoring rule to obtain scoring results corresponding to the to-be-detected entities, respectively, includes:

5. The method of claim 1, wherein the obtaining a first text sentence corresponding to the first thin entity and a second text sentence corresponding to the second thin entity comprises:

extracting profile information corresponding to the first thin entity and profile information corresponding to the second thin entity;

preprocessing profile information corresponding to the first thin entity and profile information corresponding to the second thin entity to obtain a first text sentence corresponding to the first thin entity and a second text sentence corresponding to the second thin entity respectively.

6. A device for handling thin entities, the device comprising:

the acquisition unit is used for acquiring at least two entities to be detected;

the processing unit is used for scoring the at least two entities to be detected according to a second scoring rule to obtain scoring results corresponding to the entities to be detected;

the determining unit is used for determining a thin entity set from the at least two entities to be detected according to scoring results corresponding to the entities to be detected;

the acquisition unit is also used for acquiring a first thin entity and a second thin entity from the thin entity set; the thin entity refers to an entity carrying weak attribute information or difficult to identify;

the acquisition unit is further configured to acquire a first text sentence corresponding to the first thin entity and a second text sentence corresponding to the second thin entity;

the processing unit is further configured to forward splice the first text sentence and the second text sentence, and reverse splice the first text sentence and the second text sentence;

the processing unit is further used for processing the text sentences obtained by forward splicing to obtain first text vectors corresponding to the text sentences obtained by forward splicing, and processing the text sentences obtained by reverse splicing to obtain second text vectors corresponding to the text sentences obtained by reverse splicing;

the determining unit is further configured to determine a fusion score of the first thin entity and the second thin entity according to the first text vector and the second text vector;

7. A thin entity processing apparatus, the thin entity processing apparatus comprising:

a memory for storing a computer program;

a processor configured to run the computer program to implement the method of processing a thin entity according to any one of claims 1 to 5.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method of processing a thin entity according to any one of claims 1 to 5.

9. A computer program product, characterized in that it comprises computer instructions stored in a computer-readable storage medium, which when executed by a processor implement a method of processing a thin entity according to any of claims 1 to 5.