CN115827884A - Text processing method, text processing device, electronic equipment, medium and program product - Google Patents
- Publication number
- CN115827884A CN202210903474.1A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The embodiments of the present application disclose a text processing method, a text processing device, electronic equipment, a medium and a program product, which are applied to the technical field of natural language processing. The method comprises: obtaining a text sentence set; segmenting the text sentences in the text sentence set based on segmentation characters to obtain a plurality of sub-text sentences; determining at least one first relation tuple according to the plurality of sub-text sentences; determining at least one second relation tuple according to the plurality of sub-text sentences; and constructing a relation knowledge graph corresponding to the text sentence set according to the at least one first relation tuple and the at least one second relation tuple. With the method and the device, participle combinations having a specified relation in the text sentences can be acquired comprehensively, and the accuracy of the constructed knowledge graph is improved.
Description
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a text processing method, an apparatus, an electronic device, a medium, and a program product.
Background
With the continuous development of computer technology, entity combinations having a specified relationship can be extracted from text sentences through Natural Language Processing (NLP) and related technologies to construct triples, and knowledge graphs can be constructed based on the triples. For example, a participle combination having a hypernym-hyponym (superior-inferior) relationship may be extracted from the text, and a relation tuple may be constructed from that participle combination, so as to obtain a knowledge graph indicating the superior-inferior relationship. In some implementations, an extraction template corresponding to the specified relationship is constructed based on grammar, and two entities are extracted as an entity combination by using the extraction template. However, the language structure of text sentences is very flexible while an extraction template is relatively fixed, so the extraction result is not comprehensive enough and the constructed knowledge graph is not accurate enough.
Disclosure of Invention
The embodiment of the application provides a text processing method, a text processing device, an electronic device, a medium and a program product, which can comprehensively acquire word segmentation combinations with specified relations in text sentences and improve the accuracy of a constructed knowledge graph.
In one aspect, an embodiment of the present application provides a text processing method, where the method includes:
acquiring a text sentence set, and segmenting text sentences in the text sentence set based on segmentation characters to obtain a plurality of sub text sentences;
determining at least one first relational tuple from the plurality of sub-text sentences; a first relation tuple comprises two participles which are from the same sub-text sentence and have a specified association relation;
determining at least one second relational tuple from the plurality of sub-text sentences; a second relational tuple comprises two participles which are from different sub-text sentences and have a specified association relation;
and constructing a relation knowledge graph corresponding to the text sentence set according to the at least one first relation tuple and the at least one second relation tuple.
In one aspect, an embodiment of the present application provides a text processing apparatus, where the apparatus includes:
the acquisition module is used for acquiring the text sentence set and segmenting the text sentences in the text sentence set based on the segmentation characters to obtain a plurality of sub text sentences;
the processing module is used for determining at least one first relation tuple according to the plurality of sub-text sentences; a first relation tuple comprises two participles which are from the same sub-text sentence and have a specified association relation;
the processing module is further used for determining at least one second relation tuple according to the plurality of sub-text sentences; a second relational tuple comprises two participles which come from different sub-text sentences and have a specified association relation;
and the processing module is also used for constructing a relation knowledge graph corresponding to the text sentence set according to the at least one first relation tuple and the at least one second relation tuple.
In one aspect, an embodiment of the present application provides an electronic device, where the electronic device includes a processor and a memory, where the memory is used to store a computer program, and the computer program includes program instructions, and the processor is configured to call the program instructions to perform some or all of the steps in the foregoing method.
In one aspect, the present application provides a computer-readable storage medium, which stores a computer program, where the computer program includes program instructions, and the program instructions, when executed by a processor, are used to perform some or all of the steps of the above method.
Accordingly, according to an aspect of the present application, there is provided a computer program product or computer program comprising computer instructions which, when executed by a processor, perform some or all of the steps of the above method.
In the embodiments of the present application, a text sentence set can be obtained, and the text sentences in the text sentence set are segmented based on segmentation characters to obtain a plurality of sub-text sentences. At least one first relation tuple is determined according to the plurality of sub-text sentences, where a first relation tuple comprises two participles which are from the same sub-text sentence and have a specified association relationship. At least one second relation tuple is determined according to the plurality of sub-text sentences, where a second relation tuple comprises two participles which are from different sub-text sentences and have a specified association relationship. A relation knowledge graph corresponding to the text sentence set is then constructed according to the at least one first relation tuple and the at least one second relation tuple. In this way, relation tuples can be determined both from the content within one sub-text sentence and from the content across two sub-text sentences, so that relation tuples are determined from the text sentences at a finer granularity and more comprehensively, and the constructed knowledge graph is more accurate and reliable.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of an application architecture according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of a text processing method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a relational knowledge graph provided by an embodiment of the present application;
Fig. 4 is a schematic flowchart of a text processing method according to an embodiment of the present application;
Fig. 5a is a schematic view of a scenario for determining a first relation tuple according to an embodiment of the present application;
Fig. 5b is a schematic view of a scenario for determining a first relation tuple according to an embodiment of the present application;
Fig. 6a is a schematic diagram of a knowledge graph building framework provided by an embodiment of the present application;
Fig. 6b is a schematic diagram of a knowledge graph building framework provided by an embodiment of the present application;
Fig. 7a is a schematic view of a scenario in which a first relation tuple is determined based on a first text processing model according to an embodiment of the present application;
Fig. 7b is a schematic view of a scenario in which a second relation tuple is determined based on a second text processing model according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The text processing method provided by the embodiments of the present application is implemented in an electronic device, which may be a server or a terminal. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN (Content Delivery Network), and big data and artificial intelligence platforms. The terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, an aircraft, and the like, but is not limited thereto. The embodiments of the present application can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, intelligent transportation, and driving assistance.
Next, technical terms from the technical field to which the solutions of the embodiments of the present application may be applied are described:
1. Artificial intelligence:
The embodiments of the present application relate to the technical field of artificial intelligence, in particular to natural language processing, which is an important direction in computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Research in this field involves natural language, i.e. the language that people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graphs, and the like. For example, the determination of relation tuples and the construction of the relation knowledge graph in the technical solution of the present application can be realized by using natural language processing technology.
2. Knowledge graph:
A knowledge graph represents relationships between different entities and may be constructed from multiple triples. A triple indicates two entities and a particular relationship between them. According to the technical solution of the present application, two participles (entities) having a specified association relationship in a text sentence can be determined, and a corresponding relation tuple (triple) is constructed from the two participles. The specified association relationship may be a hypernym-hyponym (superior-inferior) relationship, a causal relationship, etc., and is not limited herein. For example, if it is determined from a text sentence that participle A and participle B have a superior-inferior relationship and A is the superior word of B, the constructed relation tuple may be recorded as [A, B, X]. This relation tuple indicates the two participles A (first participle) and B (second participle), and X indicates that A and B satisfy the specified association condition, e.g., A and B have a superior-inferior relationship and the first participle is the superior word of the second participle.
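As a minimal sketch of how triples of the form [A, B, X] could be assembled into a graph, the Python snippet below stores each triple as an edge in an adjacency map. The names `build_graph`, `hypernym_of` and `causes` are illustrative assumptions and do not come from the application; the application does not prescribe any storage layout.

```python
from collections import defaultdict

# Hypothetical triple layout following the [A, B, X] notation above:
# (first participle, second participle, relation marker).
Triple = tuple[str, str, str]


def build_graph(triples: list[Triple]) -> dict[str, set[tuple[str, str]]]:
    """Map each first participle to the (relation, second participle) edges leaving it."""
    graph: dict[str, set[tuple[str, str]]] = defaultdict(set)
    for first, second, relation in triples:
        graph[first].add((relation, second))
    return dict(graph)


# Illustrative triples; the relation markers are assumed labels for X.
triples = [
    ("actor", "Zhang San", "hypernym_of"),
    ("diabetes", "renal failure", "causes"),
]
graph = build_graph(triples)
```

Storing edges in a set means a triple extracted twice (for example from two different sentences) appears only once in the graph.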
In some embodiments, please refer to Fig. 1, which is a schematic diagram of an application architecture provided in an embodiment of the present application; the text processing method provided in the present application can be executed through this application architecture. As shown in Fig. 1, the electronic device may obtain a plurality of sub-text sentences, either by obtaining a text sentence set and segmenting the text sentences in it, or by obtaining sub-text sentences that were segmented in advance. The electronic device determines at least one first relation tuple according to the plurality of sub-text sentences, where the two participles in a first relation tuple come from the same sub-text sentence; that is, it determines whether each sub-text sentence contains two participles having the specified association relationship (for example, two associated participles are determined from sub-text sentence A, and two associated participles are determined from sub-text sentence B). The electronic device also determines at least one second relation tuple according to the plurality of sub-text sentences, where the two participles in a second relation tuple come from different sub-text sentences; that is, it determines whether two participles having the specified association relationship exist across each pair of sub-text sentences (for example, two associated participles are determined from sub-text sentence A and sub-text sentence B, one from sub-text sentence A and the other from sub-text sentence B). A relation knowledge graph is then constructed from the first relation tuples and the second relation tuples. In some embodiments, the first relation tuple may be derived by a first text processing model and the second relation tuple may be derived by a second text processing model.
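The distinction between the two kinds of tuple can be illustrated by enumerating candidate participle pairs. The sketch below is only an illustration under the assumption that each sub-text sentence has already been turned into a list of participles; the functions `intra_pairs` and `cross_pairs` are hypothetical, and deciding which candidates actually hold the specified relationship is left to the text processing models described in the application.

```python
from itertools import combinations, product


def intra_pairs(sub_sentences: list[list[str]]) -> list[tuple[str, str]]:
    """Candidate pairs whose two participles come from the same sub-text sentence."""
    return [pair for tokens in sub_sentences for pair in combinations(tokens, 2)]


def cross_pairs(sub_sentences: list[list[str]]) -> list[tuple[str, str]]:
    """Candidate pairs with one participle from each of two different sub-text sentences."""
    pairs: list[tuple[str, str]] = []
    for left, right in combinations(sub_sentences, 2):
        pairs.extend(product(left, right))
    return pairs


# Two assumed sub-text sentences, already split into participles.
sub_sentences = [["diabetes", "kidney"], ["renal failure"]]
same_sentence_candidates = intra_pairs(sub_sentences)
cross_sentence_candidates = cross_pairs(sub_sentences)
```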
It should be understood that fig. 1 merely illustrates a possible application architecture of the present application, and does not limit the specific architecture of the present application, that is, the present application may also provide other forms of application architectures.
Optionally, in some embodiments, the electronic device may execute the text processing method according to actual business requirements to improve the accuracy of the constructed knowledge graph. The technical solution can be applied to any type of knowledge graph construction scenario. For example, the text sentence set may consist of sentences from person profiles, and the knowledge graph may be constructed for the hypernym-hyponym (superior-inferior) relationships in those profiles. A relation tuple may then be a triple indicating that two participles (i.e., a participle combination) have a superior-inferior relationship, and the two participles may include a word representing the superior level and a word representing the inferior level. If the text sentence is "Zhang San is an actor", the participle representing the superior level is "actor" and the participle representing the inferior level is "Zhang San".
For another example, the text sentence set may consist of sentences from medical literature, and the knowledge graph may be constructed for the causal relationships in the medical literature. A relation tuple may then be a triple indicating that two participles (i.e., a participle combination) have a causal relationship, and the two participles may include a word representing a cause and a word representing an effect. In "diabetes causes renal failure", for example, the participle representing the cause is "diabetes" and the participle representing the effect is "renal failure". The electronic device can acquire a text sentence set and, according to the method provided by the technical solution of the present application, acquire from it participle combinations that come from the same sub-text sentence or from different sub-text sentences, so that a knowledge graph of the specified association relationship (such as a superior-inferior relationship or a causal relationship) can be constructed.
Optionally, data related to the present application, such as the text sentence set and the relation knowledge graph, may be stored in a database or in a blockchain, for example through a blockchain distributed system, which is not limited in the present application.
It should be noted that, in specific implementations of the present application, obtaining the text sentence set may involve data related to user information, for example when text corpora written by users are collected to construct the text sentence set. When the above embodiments are applied to a specific product or technology, user permission or consent must be obtained, and the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
It is to be understood that the foregoing scenarios are only examples, and do not constitute a limitation on application scenarios of the technical solutions provided in the embodiments of the present application, and the technical solutions of the present application may also be applied to other scenarios. For example, as can be known by those skilled in the art, with the evolution of system architecture and the emergence of new service scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
Based on the above description, the present application embodiment proposes a text processing method, which may be executed by the above-mentioned electronic device. Referring to fig. 2, fig. 2 is a schematic flowchart of a text processing method according to an embodiment of the present disclosure. As shown in fig. 2, a flow of the text processing method according to the embodiment of the present application may include the following steps:
S201, a text sentence set is obtained, and text sentences in the text sentence set are segmented based on segmentation characters to obtain a plurality of sub-text sentences.
In some embodiments, the text sentence set may include one or more text sentences, which may be text sentences in articles or paragraphs. For example, a text sentence can be a complete sentence. A sentence is the basic unit of language use; it is composed of words and phrases and expresses a complete meaning, and its end is generally indicated by an ending identifier such as a period, question mark, ellipsis or exclamation mark. In some embodiments, the obtained text corpus (e.g., a complete document or a paragraph of a document) may be divided to obtain the text sentence set. For example, the text corpus can be divided according to preset ending identifiers, and the resulting text sentences are used as the text sentence set. The text corpus can be data input by related business personnel, data extracted from websites (such as encyclopedias), data obtained from a domain-specific database (e.g., a medical domain database), or data obtained from a document library; the source of the text corpus is not limited here. The obtained text corpus may be any type of text data, such as medical text corpus data or text corpus data related to person profiles, which is not limited herein. In addition, the text corpus from which the text sentence set is built may be in any natural language, for example Chinese or English, which is not limited in the present application. In this way, knowledge graph construction for unstructured data (namely the text corpus) can be realized.
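Dividing a text corpus at ending identifiers can be sketched as follows. This is a minimal illustration only: the set `ENDING_IDENTIFIERS` and the function `split_into_sentences` are assumptions, since the application leaves the preset ending identifiers to be configured.

```python
import re

# Assumed ending identifiers (Chinese and English); the actual preset set is not specified.
ENDING_IDENTIFIERS = "。？！.?!…"


def split_into_sentences(corpus: str) -> list[str]:
    """Cut the corpus after each run of ending identifiers and drop empty pieces."""
    ends = re.escape(ENDING_IDENTIFIERS)
    # Each match is a run of ordinary characters followed by its ending identifiers, if any.
    pieces = re.findall(rf"[^{ends}]+[{ends}]*", corpus)
    return [piece.strip() for piece in pieces if piece.strip()]


sentences = split_into_sentences(
    "Zhang San is an actor. Diabetes causes renal failure! Is it cloudy?"
)
```

A split this simple also cuts at periods inside numbers or abbreviations, so a practical system would likely need extra rules or a dedicated sentence splitter.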
In some embodiments, the electronic device may preprocess the text sentence set and perform the following method on the preprocessed text sentence set. The preprocessing may include at least one of the following: part-of-speech analysis of the text sentences and screening of the text sentences. The processing procedure and principle are the same for every text sentence, so the preprocessing of one text sentence is taken as an example here.
In some embodiments, part-of-speech analysis may specifically consist of performing word segmentation and part-of-speech tagging on a text sentence to obtain an initial participle set, which includes a plurality of participles tagged with parts of speech, and then merging participles in the initial participle set according to a specified merging rule to obtain a target participle set corresponding to the text sentence. This avoids the problem that participles cannot express a complete meaning because the word segmentation granularity is too fine. For example, a text sentence may be segmented and part-of-speech tagged using texsquart (a kind of segmentation tool). The specified merging rule merges specified adjacent participles in the initial participle set, so it can be used to obtain participles that express more complete semantics in the text sentence.
The specified merging rule may include one or more of the following:
1) Merging adjacent nouns. For example, the adjacent participles "knowledge/NN" and "graph/NN" can be merged into "knowledge graph/NN".
2) Merging an adjacent modifier and noun. For example, the adjacent participles "limited/JJ" and "company/NN" can be merged into "limited company/NN".
3) Merging adjacent numeral or quantity words and nouns. For example, an ordinal prefix tagged NR, "one/CD" and "middle school/NN" can be merged into "first middle school/NN".
4) Merging adjacent enclosing identifiers with the words they enclose, where an enclosing identifier may be a book title mark, a double quotation mark, a bracket, etc. For example, an opening quotation mark tagged PU, "limited/JJ", "company/NN" and a closing quotation mark tagged PU can be merged into the quoted "limited company"/NN.
The participles in the initial participle set that were not merged (with their parts of speech) and the merged participles (with their parts of speech) together form the target participle set. Here NN denotes a noun, JJ a modifier, NR a proper noun, CD a numeral, and PU punctuation.
In addition, when adjacent participles are merged, whether the two participles can form one participle may further be judged according to a specified rule; for example, they are merged when they form a proper noun or a conventional word. The adjacent participles "knowledge" and "graph" can be merged into "knowledge graph", and the adjacent participles "XX" and "university" can be merged into "XX university", but the adjacent participles "actor" and "Zhang San" cannot be merged into one participle. The specified rule may be to merge participles tagged with specified parts of speech, for example merging adjacent nouns tagged "NN" as in rule 1) above. Alternatively, a common-sense lexicon for the relevant field may be constructed, and adjacent participles are merged if the word they form belongs to the lexicon. The judgment may also be made automatically through related technologies, such as building a model that determines whether two participles can be merged, which is not limited herein. The specified merging rule may further include other contents, which may be set by the relevant business personnel according to the actual application scenario.
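The first two merging rules (noun with noun, modifier with noun) can be sketched as a single left-to-right pass over tagged participles. This is a simplified illustration under stated assumptions: the input is a list of (word, tag) pairs, `MERGEABLE` and `merge_adjacent` are hypothetical names, words are joined with a space (for Chinese text the joiner would be an empty string), and no lexicon or model check of the kind mentioned above is performed.

```python
# Tag pairs (previous tag, current tag) that this sketch treats as mergeable:
# rule 1) noun + noun, rule 2) modifier + noun.
MERGEABLE = {("NN", "NN"), ("JJ", "NN")}


def merge_adjacent(tokens: list[tuple[str, str]], joiner: str = " ") -> list[tuple[str, str]]:
    """Merge each participle into its predecessor when their tags form a mergeable pair."""
    merged: list[tuple[str, str]] = []
    for word, tag in tokens:
        if merged and (merged[-1][1], tag) in MERGEABLE:
            previous_word, _ = merged.pop()
            # Both rules end in a noun, so the merged participle is tagged NN.
            merged.append((previous_word + joiner + word, "NN"))
        else:
            merged.append((word, tag))
    return merged


initial_set = [
    ("knowledge", "NN"), ("graph", "NN"), ("is", "VC"),
    ("limited", "JJ"), ("company", "NN"),
]
target_set = merge_adjacent(initial_set)
```

Because the pass compares each participle with the already merged predecessor, a run of three adjacent nouns collapses into one participle.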
In some embodiments, screening a text sentence may specifically consist of determining, based on a first filter vocabulary, whether to filter out or retain the text sentence, so as to implement filtering at the text sentence level. If the text sentence includes any filter word in the first filter vocabulary, the text sentence is filtered out of the text sentence set; if the text sentence does not include any filter word in the first filter vocabulary, the text sentence is retained in the text sentence set that is subsequently segmented. In other words, a text sentence containing any filter word in the first filter vocabulary is regarded as containing no participle combination with the specified association relationship, which improves the accuracy of the relation tuples determined subsequently. The first filter vocabulary can be set by related business personnel according to the actual scenario. For example, as shown in Table 1 below:
Filter word | Text sentence example |
In legend | Text sentence 1: {In legend}, the Capricorn is a goat |
Today | Text sentence 2: {Today} is a cloudy day |
Is like | Text sentence 3: The hill {is like} a Buddha statue |
Table 1 (braces mark the filter word matched in each text sentence)
Taking text sentence 2 as an example, if the filter word is "today", and text sentence 2 includes "today", it means that it is more likely that no relation tuple will be determined from text sentence 2, and text sentence 2 is filtered from the text sentence set.
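Sentence-level screening against a filter vocabulary can be sketched as below. The vocabulary contents mirror the examples in Table 1, while the function `screen_sentences` and the case-insensitive substring match are assumptions made for illustration; the application does not specify how a filter word is matched.

```python
# Assumed first filter vocabulary, modeled on the Table 1 examples.
FIRST_FILTER_VOCABULARY = ("in legend", "today", "is like")


def screen_sentences(
    sentences: list[str],
    filter_words: tuple[str, ...] = FIRST_FILTER_VOCABULARY,
) -> list[str]:
    """Keep only the sentences that contain none of the filter words."""
    kept = []
    for sentence in sentences:
        lowered = sentence.lower()
        if not any(word in lowered for word in filter_words):
            kept.append(sentence)
    return kept


retained = screen_sentences([
    "Today is a cloudy day",
    "Zhang San is an actor",
    "The hill is like a Buddha statue",
])
```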
In some embodiments, the segmentation characters (which may also be referred to as segmentation identifiers) may be symbols that divide a text sentence at a finer granularity than the ending identifiers, such as commas and enumeration commas, and may be set by the relevant business personnel according to the grammatical structure of the text sentences. The electronic device can therefore segment a text sentence based on the segmentation characters to obtain a plurality of sub-text sentences that are shorter and carry less semantic content. For example, each text sentence in the preprocessed text sentence set is segmented. Relation tuples for building the relation knowledge graph may subsequently be determined both within one sub-text sentence and across two sub-text sentences.
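Segmenting one text sentence into sub-text sentences can be sketched with a regular-expression split. The set `SEGMENTATION_CHARACTERS` and the function `split_into_sub_sentences` are assumptions for illustration, since the application leaves the segmentation characters to be configured.

```python
import re

# Assumed segmentation characters: English and Chinese commas, the enumeration comma, semicolons.
SEGMENTATION_CHARACTERS = ",，、;；"


def split_into_sub_sentences(sentence: str) -> list[str]:
    """Split a text sentence at every segmentation character and drop empty pieces."""
    parts = re.split(f"[{re.escape(SEGMENTATION_CHARACTERS)}]", sentence)
    return [part.strip() for part in parts if part.strip()]


sub_text_sentences = split_into_sub_sentences("Zhang San, born in XX, is an actor")
```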
S202, determining at least one first relation tuple according to the plurality of sub-text sentences.
In some embodiments, the electronic device may determine a first relation tuple from within a sub-text sentence; that is, the first relation tuple includes two participles (e.g., a first participle and a second participle) that come from the same sub-text sentence and have a specified association relationship, such as a hypernym-hyponym (superior-inferior) relationship. The first relation tuple is further used to indicate the specified association condition that the two participles satisfy. For example, the two participles in the first relation tuple have the specified association relationship and satisfy the condition that the former participle (the first participle) is the superior word of the latter participle (the second participle), or the condition that the former participle is the inferior word of the latter participle, and so on. It is to be understood that the first participle and the second participle here denote an order and a position in the first relation tuple.
In some embodiments, the process and principle of determining whether the first associated tuple exists in each sub-text sentence are the same, and here, any one of the sub-text sentences (the target sub-text sentence) is taken as an example for description, which may specifically be that each character in the target sub-text sentence is labeled, and the first associated tuple is determined based on each labeled character.
In some embodiments, the labeling process is label labeling for each character for a specified association. The tag can be set based on a specific designated association relation and is used for extracting the first participle and the second participle in the target sub text sentence. For example, the association relationship is specified as a top-bottom relationship, the tags may specifically include a first tag, a second tag, and a third tag, the first tag may be used to label characters forming a top word (e.g., a first participle), the second tag may be used to label characters forming a bottom word (e.g., a second participle), and the third tag may be used to label remaining characters, that is, characters without a top-bottom relationship. For another example, the specified relationship is a causal relationship, the tags may specifically include a first tag, a second tag, and a third tag, the first tag may be used to label characters constituting a cause word (e.g., a first participle), the second tag may be used to label characters constituting a result word (e.g., a second participle), and the third tag may be used to label remaining characters, that is, characters without causal relationship. Thus, the participles determined from the first label have a specified associative relationship with the participles determined from the second label.
In some embodiments, in each of the labeled characters, a continuous character labeled as a first label is determined as a first participle, a continuous character labeled as a second label is determined as a second participle, and a first relational tuple is constructed based on the first participle and the second participle. For example, the target sub-text sentence is "zhang san is an actor", wherein the "actor" and "member" characters are labeled as a first label and the "actor" and "three" characters are labeled as a second label, so that two consecutive characters of "actor" and "member" can be used as a first participle and two consecutive characters of "actor" and "three" can be used as a second participle, and a first relational tuple (e.g., [ actor, zhang san, X ]) can be constructed from the first participle and the second participle. Wherein, X is used to indicate that the first participle and the second participle have a superior-inferior relation, and the first participle is a superior word of the second participle (and/or the second participle is a inferior word of the first participle), that is, the specified association condition is satisfied.
In some embodiments, the labeling process for each character in the target sub-text sentence can be implemented by the first text processing model, and the specific process and the specific description of the first text processing model can be referred to the related description of the following embodiments.
S203, determining at least one second relation tuple according to the plurality of sub-text sentences.
In some embodiments, the electronic device can determine a second relational tuple from between every two sub-text sentences. For a description of the second relational tuple, reference may be made to the description of the first relational tuple above.
In some embodiments, the process and principle of determining whether a second relational tuple exists are the same for every two sub-text sentences. Taking two sub-text sentences (a first sub-text sentence and a second sub-text sentence) among the plurality of sub-text sentences as an example, the method may specifically include: obtaining a first candidate participle set corresponding to the first sub-text sentence and a second candidate participle set corresponding to the second sub-text sentence; constructing at least one candidate relation tuple according to the first candidate participle set and the second candidate participle set; and determining at least one second relational tuple from the at least one candidate relation tuple. The first candidate participle set comprises specified participles forming the first sub-text sentence, and the second candidate participle set comprises specified participles forming the second sub-text sentence. Each candidate relation tuple comprises one candidate participle from the first candidate participle set and one candidate participle from the second candidate participle set.
The first sub-text sentence and the second sub-text sentence may be any two sub-text sentences of the plurality of sub-text sentences, or may be two sub-text sentences obtained from the same text sentence. For example, the text sentence set includes a text sentence 1 and a text sentence 2; the text sentence 1 is divided to obtain a sub-text sentence 1 and a sub-text sentence 2, and the text sentence 2 is divided to obtain a sub-text sentence 3 and a sub-text sentence 4. The sub-text sentences 1 to 4 can be combined pairwise, and candidate relation tuples are determined according to the candidate participle sets corresponding to each combination; alternatively, candidate relation tuples may be determined only according to the candidate participle sets corresponding to the sub-text sentence 1 and the sub-text sentence 2, and according to the candidate participle sets corresponding to the sub-text sentence 3 and the sub-text sentence 4. It is understood that when a text sentence does not include any segmentation character, the text sentence is itself a sub-text sentence; in that case only a first relational tuple may be determined from it according to the process of step S202, and, when sub-text sentences are paired within the same text sentence, it has no other sub-text sentence with which a second relational tuple could be determined.
In some embodiments, the first candidate participle set and the second candidate participle set are obtained in the same manner. Taking the first candidate participle set as an example, participles meeting a specified part-of-speech condition may be obtained from the target participle set corresponding to the first sub-text sentence to serve as the first candidate participle set. The specified part-of-speech condition can be set by related business personnel according to the actual scene. For example, in a hypernym-hyponym relationship, since hypernyms and hyponyms are usually nouns, the specified part-of-speech condition may be that the participle is a noun. The target participle set corresponding to the first sub-text sentence is obtained from the target participle set corresponding to the text sentence.
In some embodiments, constructing at least one candidate relation tuple according to the first candidate participle set and the second candidate participle set may specifically be: taking each candidate participle in the first candidate participle set as a first participle and constructing a candidate relation tuple with each candidate participle in the second candidate participle set, and taking each candidate participle in the second candidate participle set as a first participle and constructing a candidate relation tuple with each candidate participle in the first candidate participle set. For example, the first candidate participle set comprises participle A and participle B, and the second candidate participle set comprises participle C and participle D. When the participles of the first candidate participle set are used as the first participle, the constructed candidate relation tuples are [A, C, X], [A, D, X], [B, C, X], [B, D, X]; when the participles of the second candidate participle set are used as the first participle, the constructed candidate relation tuples are [C, A, X], [C, B, X], [D, A, X], [D, B, X].
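The bidirectional pairing can be sketched in a few lines of Python. The function name `build_candidate_tuples` and the tuple representation are illustrative assumptions; only the pairing rule comes from the description above.

```python
from itertools import product

def build_candidate_tuples(first_set, second_set, relation="X"):
    """Pair every participle of one set with every participle of the other,
    once in each direction, so either side can act as the first participle."""
    forward = [(a, b, relation) for a, b in product(first_set, second_set)]
    backward = [(b, a, relation) for b, a in product(second_set, first_set)]
    return forward + backward

candidates = build_candidate_tuples(["A", "B"], ["C", "D"])
```

For sets of sizes m and n this yields 2·m·n candidates, which is why a later scoring step is needed to keep only those that satisfy the specified association condition.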
In some embodiments, the determining the at least one second relational tuple from the at least one candidate relational tuple may specifically be determining probabilities that two candidate participles in each candidate relational tuple satisfy a specified association condition, and determining the at least one second relational tuple according to a probability corresponding to each candidate relational tuple. The candidate relation tuple with the corresponding probability greater than the preset probability threshold may be used as the second relation tuple. The determination of the probability that two candidate participles in the candidate relation tuple satisfy the specified association condition may be implemented by the second text processing model, and the specific process and the specific description of the second text processing model may be referred to the related description of the following embodiments.
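As a rough illustration of the threshold step, the sketch below keeps the candidates whose probability exceeds a preset threshold. The scoring function is passed in as a parameter; `toy_scores` is a hypothetical stand-in for the second text processing model, and its probabilities are invented for the example.

```python
def select_second_tuples(candidates, score_fn, threshold=0.5):
    """Keep candidate tuples whose probability exceeds the preset threshold."""
    return [c for c in candidates if score_fn(c) > threshold]

# Hypothetical probabilities standing in for the model output.
toy_scores = {
    ("actor", "Zhang San", "X"): 0.93,
    ("Zhang San", "actor", "X"): 0.04,
}
selected = select_second_tuples(list(toy_scores), toy_scores.get, threshold=0.5)
```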
S204, a relational knowledge graph corresponding to the text sentence set is constructed according to the at least one first relational tuple and the at least one second relational tuple.
In some embodiments, a relational knowledge graph corresponding to the text sentence set can be constructed from the at least one first relational tuple and the at least one second relational tuple according to a specified construction mode. The specified construction mode may be that the participles in each first relational tuple or second relational tuple are used as nodes, and the nodes are connected in the direction from the second participle to the first participle (or from the first participle to the second participle) to form the relational knowledge graph.
For example, fig. 3 is a schematic diagram of a relational knowledge graph provided in the embodiment of the present application. The relational knowledge graph represents the hypernym-hyponym relationships contained in the text corpus corresponding to the text sentence set. Taking the relational tuple [actor, Zhang San, X] as an example, the participles "Zhang San" and "actor" are nodes, and the two nodes are connected in the direction from the second participle to the first participle, that is, the hyponym "Zhang San" points to the hypernym "actor".
Therefore, by accurately determining the relation tuples within one sub-text sentence and between two sub-text sentences, a more accurate, more reliable and higher-quality relational knowledge graph can be constructed. The content length of the text corpus used for constructing the relational knowledge graph is not limited, and the method can be applied to tuple determination for any association relationship in any language form; that is, it is flexible and general, has strong generalization capability, and is applicable to various application scenarios. Subsequently, things of the same kind can be aggregated through the constructed relational knowledge graph, and hypernyms of different granularities can be provided for each hyponym. For example, the hyponyms connected to one hypernym can be regarded as things of the same kind under that hypernym, and one hyponym may be connected, at different levels, to hypernyms of different dimensions, so that hypernyms of various granularities can be determined for that hyponym. In addition, a high-quality relational knowledge graph can improve the effect of downstream tasks; for example, it can play a better role in tasks such as assisting named entity recognition, or in related entity recommendation tasks such as labeling articles and works.
In the embodiment of the application, a text sentence set can be obtained, the text sentences in the text sentence set are segmented based on segmentation characters to obtain a plurality of sub-text sentences, at least one first relation tuple is determined according to the plurality of sub-text sentences, at least one second relation tuple is determined according to the plurality of sub-text sentences, and a relational knowledge graph corresponding to the text sentence set is constructed according to the at least one first relation tuple and the at least one second relation tuple. In this way, relation tuples can be determined both from the content within one sub-text sentence and from the content between two sub-text sentences, so that relation tuples are determined from the text sentences more finely and comprehensively, and the constructed knowledge graph is more accurate and reliable.
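One possible way to realize this construction mode is an adjacency map in which each second participle (hyponym) points to its first participle (hypernym). The sketch below is an assumption about representation, not the data structure of the embodiment; the tuple [person, actor, X] is a hypothetical second edge added only to show a multi-level lookup.

```python
from collections import defaultdict

def build_relation_graph(tuples):
    """Map each node to the set of nodes it points to (second -> first participle)."""
    graph = defaultdict(set)
    for first, second, _relation in tuples:
        graph[second].add(first)
        graph.setdefault(first, set())  # make sure the hypernym is also a node
    return dict(graph)

def hypernyms_at_all_levels(graph, word):
    """Collect every hypernym reachable from `word` by following edges upward."""
    seen, stack = set(), [word]
    while stack:
        for parent in graph.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

graph = build_relation_graph([("actor", "Zhang San", "X"), ("person", "actor", "X")])
levels = hypernyms_at_all_levels(graph, "Zhang San")
```

Following the edges upward from one hyponym returns hypernyms at several levels, which corresponds to providing hypernyms of different granularities for that hyponym.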
Referring to fig. 4, fig. 4 is a flowchart illustrating a text processing method according to an embodiment of the present application, where the method can be executed by the above-mentioned electronic device. As shown in fig. 4, the flow of the text processing method in the embodiment of the present application may include the following steps:
S401, a text sentence set is obtained, and text sentences in the text sentence set are segmented based on segmentation characters to obtain a plurality of sub-text sentences. For a specific implementation of step S401, reference may be made to the relevant description of the foregoing embodiments, which is not described herein again.
S402, determining at least one first relation tuple according to the plurality of sub-text sentences.
In some embodiments, the first relational tuple may be determined from within one sub-text sentence. For a specific description of the first relational tuple, reference may be made to the related description of the first relational tuple in the above embodiment, and details are not described herein again.
In some embodiments, taking any one of the plurality of sub-text sentences as the target sub-text sentence, determining at least one first relational tuple according to the target sub-text sentence may specifically be: performing labeling processing on each character in the target sub-text sentence, determining a first participle according to the characters labeled with the first label, determining a second participle according to the characters labeled with the second label, and constructing at least one first relational tuple based on the first participle and the second participle, where each first relational tuple includes one first participle and one second participle. After the labeling processing, in the target sub-text sentence, the participle determined according to the first label and the participle determined according to the second label have the specified association relationship.
In some embodiments, the annotation process may be a BIO annotation, B-begin, I-inside, O-outside, i.e., B may represent the starting character of a participle, I may represent a character in a participle other than the starting character, and O may represent the remaining characters other than the specified participle, i.e., not any type of character. Thus the first label may comprise B-N and I-N for representing the start character and the remaining character, respectively, constituting the first participle (N), and the second label may comprise B-M and I-M for representing the start character and the remaining character, respectively, constituting the second participle (M).
In some embodiments, determining the first participle from the characters labeled with the first label may be determining, as the first participle, the consecutive characters running from a character labeled as a starting character through the remaining characters labeled after it. For example, if the three consecutive characters of "TV drama" and the following character "is" are labeled B-N, I-N, I-N, O in that order, the first character is labeled as the starting character and the second and third characters are labeled as remaining characters and are consecutive, so the three characters can be determined as the first participle "TV drama". As another example, if the four consecutive characters of "cell phone" and "computer" are labeled B-N, I-N, B-N, I-N in order, the first character is labeled as a starting character and the second as a remaining character, so these two consecutive characters are determined as the first participle "cell phone"; the third character is again labeled as a starting character and the fourth as a remaining character, so they are determined as another first participle, "computer". The second participle is determined in the same manner as the first participle, which is not described herein again.
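The decoding of BIO labels into participles can be sketched as follows. The function name `decode_bio` is hypothetical, and the Chinese characters in the usage line are an assumed reading of the translated "cell phone" / "computer" example, used only to show character-level labels.

```python
def decode_bio(chars, labels, tag):
    """Collect each run B-<tag>, I-<tag>, ... as one participle."""
    words, current = [], []
    for ch, label in zip(chars, labels):
        if label == "B-" + tag:
            if current:                      # a new B closes the previous run
                words.append("".join(current))
            current = [ch]
        elif label == "I-" + tag and current:
            current.append(ch)
        else:                                # O, another tag, or a stray I
            if current:
                words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

# Assumed characters behind the translated example: 手机 (cell phone), 电脑 (computer).
first_participles = decode_bio(list("手机电脑"), ["B-N", "I-N", "B-N", "I-N"], "N")
```

Calling the same function with the tag `"M"` would extract the second participles, since both are determined in the same manner.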
In some embodiments, one or more first participles and one or more second participles may be determined in a target sub-text sentence, and a first relational tuple may be constructed from each first participle paired with each second participle. For example, if the hypernyms labeled in the target sub-text sentence are A and B and the labeled hyponyms are C and D, the constructed first relational tuples are [A, C, X], [B, C, X], [A, D, X], [B, D, X]. For another example, if the hypernyms labeled in the target sub-text sentence are A and B and the labeled hyponym is C, the constructed first relational tuples are [A, C, X], [B, C, X].
In the labeling processing, each character is labeled, and the two participles having the specified association relationship in the target sub-text sentence are then determined based on the labeling result; if one target sub-text sentence includes more than one first participle and/or second participle, a first relation tuple may be constructed incorrectly. For example, if the hypernyms labeled in the target sub-text sentence are A and B and the hyponyms are C and D, first relation tuples are constructed from A and C and from B and C, yet A may not be a hypernym of C, or A and C may not exhibit a hypernym-hyponym relationship in the semantics expressed by the target sub-text sentence. Determining the first relation tuple from finer-grained sub-text sentences reduces the possibility that one sub-text sentence contains multiple groups of participle combinations having the specified association relationship, and keeps the distance between the first participle and the second participle within a small range. Sub-text sentences generally have less content and shorter distances, whereas complete text sentences generally have more content and longer distances; compared with directly labeling complete text sentences, this reduces the labeling difficulty, improves the accuracy of the labeling processing, and thereby improves the accuracy of the determined first relation tuples.
For example, fig. 5a and fig. 5b are schematic diagrams of a scenario for determining a first relational tuple. The first label is used for labeling hypernyms and can comprise B-ENT and I-ENT; the second label is used for labeling hyponyms and can comprise B-CAT and I-CAT; and the third label is used for labeling the other characters and can be O. Suppose the text sentence is "The drama "ABC" is very good, and the actor Zhang San is in it". If the complete text sentence is labeled, the obtained labeling result may be as shown in fig. 5a: the hypernyms determined based on the first label include "drama" and "actor", and the hyponyms determined based on the second label include "ABC" and "Zhang San", so that 4 first relational tuples are determined, of which the first relational tuples 2 to 3 are wrong relational tuples under the semantics expressed by the text sentence. Therefore, the sub-text sentences obtained by segmentation can be labeled at a finer granularity, which reduces the probability that a determined first relational tuple is wrong because one sentence contains the participle combinations of several relational tuples. That is, the text sentence may be divided into a sub-text sentence 1 and a sub-text sentence 2 that are labeled separately, and the obtained labeling result may be as shown in fig. 5b, so that the first relational tuple determined based on the sub-text sentence 1 is [drama, "ABC", X], and the first relational tuple determined based on the sub-text sentence 2 is [actor, Zhang San, X].
In some embodiments, the annotation processing of the sub-text sentence to determine the at least one first relational tuple can be performed and determined based on a first text processing model. The first initial model can be trained to obtain a first text processing model, and the first text processing model can perform labeling processing on the input sub-text sentence and output a labeling result of each character in the sub-text sentence.
In some embodiments, the training of the first initial model to obtain the first text processing model may specifically be to obtain a sample set and a relationship extraction template, perform relationship tuple extraction on a plurality of sample text sentences in the sample set based on the relationship extraction template to obtain an initial relationship tuple set, screen the initial relationship tuple set to obtain at least one reference relationship tuple, and train the first initial model based on the plurality of sample text sentences and the at least one reference relationship tuple to obtain the first text processing model. The two sample participles of the reference relational tuple are from the same sample text sentence. Any reference relation tuple comprises two sample participles, the two sample participles have a specified association relation, and the two sample participles comprise a first sample participle and a second sample participle which meet specified association conditions, namely the first participle and the second participle in a corresponding first relation tuple (second relation tuple).
In some embodiments, the sample set may be segmented based on the collected sample text corpus. For example, the sample text corpus may be segmented according to the ending identifier to obtain a plurality of sample text sentences as a sample set. In addition, after the sample set is preprocessed, the relationship tuple extraction can be performed on the preprocessed sample set by using the relationship extraction template. The preprocessing process can be referred to the preprocessing process for the text sentence sets in the above embodiments.
In some embodiments, the relationship extraction template may be determined based on a syntactic structure and a specified associative relationship. The relation tuple extraction of the sample text sentence based on the relation extraction template can be determined according to the relation extraction template and a target participle set corresponding to the sample text sentence, for example, sample participles meeting the specified part of speech are obtained from the target participle set, whether the two sample participles meet the syntactic structure and semantic meaning indicated by the relation extraction template in the sample text sentence is determined, if yes, a first sample participle and a second sample participle are further determined according to the indication of the relation extraction template in the two sample participles to generate a corresponding relation tuple. For example, if the relationship extraction template indicates that the first sample segmentation is a hypernym and indicates that the second sample segmentation is a hyponym, a relationship tuple for the hypernym can be obtained according to the first sample segmentation and the second sample segmentation. The target participle set corresponding to the sample text sentence includes sample participles and parts of speech constituting the sample text sentence, and the determination manner of the target participle set may be the same as that of the target participle set corresponding to the text sentence in the above embodiment.
For example, taking the case where the specified association relationship is the hypernym-hyponym relationship, the relationship extraction templates for a Chinese corpus may be as shown in Table 2 below:
TABLE 2
Each relationship extraction template has an extraction precision. Taking the first template, described as "{ CAT } is one (one.) { ENT }", as an example, CAT in template 1 represents a hyponym and ENT represents a hypernym; for the sample text sentence "Zhang San is an actor", the first template yields "Zhang San" as the hyponym and "actor" as the hypernym.
It is understood that when the language form of the text corpus or the specified association relationship differs, the relationship extraction template may take various forms. The relationship extraction template may be set by related business personnel according to the actual scene, or may be generated automatically, for example, generated iteratively by techniques such as bootstrapping (an iterative pattern-learning technique in natural language processing). This is not limited herein.
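Template-based extraction can be illustrated with a regular expression. The pattern below is a hypothetical English stand-in for a single template and is not taken from Table 2; it also omits the part-of-speech check described above, so it is only a sketch of the matching step.

```python
import re

# Hypothetical English stand-in for one template: "<hyponym> is a/an <hypernym>".
TEMPLATE = re.compile(r"^(?P<hyponym>.+?) is an? (?P<hypernym>.+)$")

def extract_tuple(sentence, relation="X"):
    """Return [hypernym, hyponym, relation] as a tuple, or None if no match."""
    match = TEMPLATE.match(sentence.strip())
    if match is None:
        return None
    # The first participle is the hypernym, the second is the hyponym.
    return (match.group("hypernym"), match.group("hyponym"), relation)

extracted = extract_tuple("Zhang San is an actor")
```

Because a surface pattern can match sentences that do not express the relationship at all, tuples obtained this way are treated as an initial set that still needs screening.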
In some embodiments, after obtaining the initial relationship tuple set based on the relationship extraction template, the initial relationship tuple set may be screened according to a specified screening rule to obtain a reference relationship tuple with higher accuracy, so that a high-quality relationship knowledge graph is constructed, and a high-quality training sample may be provided for the first and second text processing models. It may specifically include at least one of: tuple screening based on the second filtering vocabulary and tuple screening based on the occurrence frequency of the initial relation tuples.
In some embodiments, the screening of the initial relationship tuple set based on the second filtering word table may be that, if any sample participle in an initial relationship tuple is a filter word in the second filtering word table, the initial relationship tuple is filtered out, and if neither sample participle in the initial relationship tuple is a filter word in the second filtering word table, the initial relationship tuple is retained as a reference relationship tuple, so as to implement word-level filtering. That is, an initial relationship tuple containing any filter word in the second filtering word table can be considered a wrong relationship tuple: depending on differences in syntactic structure and semantics, an extracted initial relationship tuple may not have the specified association relationship even if it conforms to the relationship extraction template. For example, the second filtering word table may be as shown in Table 3:
TABLE 3
Taking the sample text sentence 1 as an example, performing relationship tuple extraction on the sample text sentence 1 based on the relationship extraction template in the table 2 may extract an initial relationship tuple [ friend, us, X ], however, two sample participles in the initial relationship tuple do not have a superior-inferior relationship, so that the initial relationship tuple may be filtered by a filter word "us" in the second filter word table, so as to ensure accuracy of the at least one reference relationship tuple.
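The word-level screening can be sketched as a simple membership test. The one-entry filter word table below is a placeholder built from the example above, not the content of Table 3.

```python
def screen_by_filter_words(initial_tuples, filter_words):
    """Drop every tuple in which either sample participle is a filter word."""
    return [t for t in initial_tuples
            if t[0] not in filter_words and t[1] not in filter_words]

# Placeholder filter word table containing only the word from the example.
second_filter_words = {"us"}
kept = screen_by_filter_words(
    [("friend", "us", "X"), ("actor", "Zhang San", "X")], second_filter_words)
```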
In some embodiments, the second filtering vocabulary can be set by the relevant business personnel according to the actual scene. For example, a large number of relational tuples are collected through a relational extraction template, the high-frequency participles in the large number of relational tuples are counted and analyzed, obviously wrong high-frequency participles are determined, and the high-frequency participles are determined as a second filtering vocabulary.
In some embodiments, the screening of the initial relationship tuple set based on the occurrence frequency of the initial relationship tuples may be to count the number of occurrences (or the occurrence frequency) of each initial relationship tuple in the initial relationship tuple set, and to retain the initial relationship tuples whose number of occurrences (or occurrence frequency) is higher than a specified threshold as reference relationship tuples. Initial relationship tuples that occur more often may be considered more likely to be correct. The occurrence frequency of an initial relationship tuple x is p(x) = W(x) / W, where W(x) represents the number of times the initial relationship tuple x occurs in the initial relationship tuple set, and W represents the total number of tuples in the initial relationship tuple set.
In some embodiments, training the first initial model based on a plurality of sample text sentences and at least one reference relationship tuple to obtain the first text processing model may specifically be: segmenting the plurality of sample text sentences based on the segmentation characters to obtain a plurality of sample sub-text sentences; labeling each character in each sample sub-text sentence based on the at least one reference relationship tuple to obtain a first labeling result for the plurality of sample sub-text sentences; calling the first initial model to label each sample sub-text sentence to obtain a second labeling result for the plurality of sample sub-text sentences; and training the first initial model based on the second labeling result, with the first labeling result as the reference, to obtain the first text processing model. If one or more reference relationship tuples are obtained from a sample text sentence through the relationship extraction template and the screening, the two participles in such a reference relationship tuple may come from the same sample sub-text sentence obtained by dividing that sample text sentence, or from different sample sub-text sentences. Labeling the segmented sample sub-text sentences according to the reference relationship tuples shortens the sentences, reduces the possibility that one sentence contains several participles that could be combined, and improves the labeling accuracy.
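The frequency-based screening follows directly from p(x) = W(x) / W. The sketch below uses `collections.Counter`; the function name and the threshold value in the usage line are illustrative assumptions.

```python
from collections import Counter

def screen_by_frequency(initial_tuples, threshold):
    """Keep each distinct tuple x whose frequency W(x) / W exceeds `threshold`."""
    counts = Counter(initial_tuples)   # W(x) for every distinct tuple x
    total = len(initial_tuples)        # W, the number of tuples in the set
    return [x for x, w_x in counts.items() if w_x / total > threshold]

a = ("actor", "Zhang San", "X")
b = ("friend", "us", "X")
reference = screen_by_frequency([a, a, a, b], threshold=0.5)
```

Here p(a) = 3/4 and p(b) = 1/4, so only `a` passes a threshold of 0.5.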
In some embodiments, a labeling process (which may be called full-scale labeling) may be performed on the sample sub-text sentence by using each of the at least one reference relationship tuple, that is, a first labeling result of one sample sub-text sentence is determined based on all reference relationship tuples. Or labeling each corresponding sample sub-text based on the associated reference relation tuple of each sample sub-text (which may be called one-to-one labeling), that is, determining the first labeling result of the sample sub-text a based on the associated reference relation tuple a of the sample sub-text a. The associated reference relation tuple of the sample sub-text sentence may be a reference relation tuple corresponding to the sample text sentence to which the sample sub-text sentence belongs. For example, the sample text sentence a is divided into the sample sub-text sentence a and the sample sub-text sentence B, and a reference relation tuple corresponding to the sample text sentence a determined based on the relation extraction template and the screening process may be used as an associated reference relation tuple of the sample sub-text sentence a and the sample sub-text sentence B.
Therefore, taking one sample sub-text sentence (target sample sub-text) as an example, if the target sample sub-text includes the first sample participle and the second sample participle in any one reference relation tuple (indicated as a target relation tuple) among all reference relation tuples (or associated reference relation tuples), in the sample sub-text, the first character constituting the first sample participle is labeled as the first label, the second character constituting the second sample participle is labeled as the second label, and the rest characters (i.e., characters except the first character and the second character in the target sample sub-text) are labeled as the third label, so as to obtain the first labeling result of the sample sub-text. The first sample participle is a sample participle matched with the first label in the target relation tuple, and the second sample participle is a sample participle matched with the second label in the target relation tuple.
In some embodiments, the first label, the second label and the third label may be constructed based on the BIO labeling scheme. For example, the first label may include B-N and I-N, which respectively mark the start character and the remaining characters of the first sample participle (N); the second label may include B-M and I-M, which respectively mark the start character and the remaining characters of the second sample participle (M); and the third label may be O, which marks all other characters. For example, if the sample sub-text sentence is '"ABC" is a TV drama', the first labeling result obtained by labeling based on the reference relation tuple [TV drama, ABC, X] is: /'B-ENT' A/'I-ENT' B/'I-ENT' C/'I-ENT' is/'O' section/'O' electric/'B-CAT' video/'I-CAT' dramatic/'I-CAT'; the first labels B-ENT and I-ENT represent hypernyms and the second labels B-CAT and I-CAT represent hyponyms.
If a sample sub-text sentence does not include both the first sample participle and the second sample participle of any reference relation tuple, or has no associated reference relation tuple, all characters in the sample sub-text sentence are labeled O. If a sample sub-text sentence includes the first sample participles and second sample participles of a plurality of reference relation tuples, each first sample participle is labeled with the first label and each second sample participle is labeled with the second label in the sample sub-text sentence.
In some embodiments, the first sample participle and the second sample participle of each reference relation tuple may be text-matched against the sample sub-text sentence to determine whether the sample sub-text sentence includes them. Alternatively, the sample participles in the target participle set corresponding to the sample sub-text sentence may be matched against the first sample participle and the second sample participle of the reference relation tuple to make the same determination.
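The labeling rule described above can be sketched in a few lines. The following is a minimal illustration, assuming plain substring matching and the tag names N and M from the BIO example; the function names `bio_label` and `_tag_span` are hypothetical, and the relation slot X of each tuple is omitted.

```python
from typing import List, Tuple

# A reference relation tuple reduced to (first participle, second participle).
RelationTuple = Tuple[str, str]


def _tag_span(labels: List[str], text: str, word: str, tag: str) -> None:
    """Tag every still-unlabeled occurrence of `word` with B-<tag> / I-<tag>."""
    if not word:
        return
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        if all(label == "O" for label in labels[start:end]):
            labels[start] = f"B-{tag}"
            for i in range(start + 1, end):
                labels[i] = f"I-{tag}"
        start = text.find(word, end)


def bio_label(sub_sentence: str, reference_tuples: List[RelationTuple],
              first_tag: str = "N", second_tag: str = "M") -> List[str]:
    """Return one label per character of `sub_sentence`."""
    labels = ["O"] * len(sub_sentence)
    for first, second in reference_tuples:
        # A tuple contributes labels only if BOTH participles occur in the sentence.
        if first in sub_sentence and second in sub_sentence:
            _tag_span(labels, sub_sentence, first, first_tag)
            _tag_span(labels, sub_sentence, second, second_tag)
    return labels


labels = bio_label("Zhang San is an actor", [("actor", "Zhang San")])
```

Passing all reference relation tuples corresponds to full-scale labeling; passing only the tuples of the parent sample text sentence corresponds to one-to-one labeling.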
In some embodiments, the electronic device may invoke the first initial model to determine a second labeling result for each sample sub-text sentence, and train the first initial model with the second labeling result, taking the first labeling result as the real result (i.e., as the reference) that the first text processing model should output. That is, the prediction bias of the first initial model is determined from the first labeling result and the second labeling result, and the model parameters of the first initial model are corrected using the prediction bias until the model converges, so as to obtain the first text processing model. The first text processing model can accurately predict the labeling result of a sub-text sentence and thereby determine the first participle and the second participle in the sub-text sentence.
It can be understood that taking the first labeling result as a reference means that the first labeling result serves as the label (also called the real labeling result) of the sample sub-text sentence (the training data) when training the first initial model, and that the first labeling result is a high-quality training sample obtained by using the relation extraction template. Remote supervision, i.e., automatic labeling of the training data, can therefore be achieved through the reference relation tuples, which reduces the workload of manual labeling and improves labeling efficiency and accuracy. Training on the shorter sample sub-text sentences also reduces the modeling difficulty of the first text processing model.
The first text processing model may also be called a text-based sequence labeling model, and may be constructed based on any model type and topology of a deep neural network capable of implementing a labeling task, for example, a combination of a classical language processing model and a CRF (conditional random field), where the CRF serves as a decoder that outputs the labeling result of each character. The language processing model may be a BERT (Bidirectional Encoder Representations from Transformers) model, or another language model selected according to the limits on model memory occupation and the requirements on extraction speed and accuracy in the actual application scenario, such as an LSTM (Long Short-Term Memory) model, a gated convolutional neural network, a FastBERT model (an improved BERT model), a SpanBERT model (an improved BERT model), and the like.
S403, determining at least one second relational tuple according to the plurality of sub-text sentences.
In some embodiments, a second relation tuple may be determined across two sub-text sentences. For a specific description of the second relation tuple, refer to the related description of the second relation tuple in the above embodiment.
In some embodiments, the at least one second relation tuple may be determined as follows: a candidate participle set corresponding to each sub-text sentence is obtained, where the candidate participle set includes the specified participles constituting the corresponding sub-text sentence; at least one candidate relation tuple is constructed according to the candidate participle sets corresponding to the sub-text sentences; and at least one second relation tuple is determined from the at least one candidate relation tuple. Any candidate relation tuple includes two candidate participles from different candidate participle sets. It can be understood that when two participles having the specified association relationship lie in two different sub-text sentences, they are relatively far apart, which makes extraction difficult. Constructing candidate relation tuples across sub-text sentences therefore largely addresses two problems: relation tuples cannot be extracted accurately when extraction is performed directly on the two sub-text sentences, and the model has weak extraction capability for two participles that have the specified association relationship but are far apart in a text sentence.
In some embodiments, the candidate participle set may be obtained by determining, from the target participle set corresponding to a sub-text sentence, the participles that meet a specified part-of-speech condition, and using them as the candidate participle set corresponding to that sub-text sentence. For example, the specified part-of-speech condition may indicate that participles whose part of speech is noun are to be obtained. This is not limited herein.
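As a small illustration of the part-of-speech condition, the sketch below filters a part-of-speech-tagged participle list down to nouns. The tag value "n" and the `(participle, tag)` input format are assumptions made for the example; the actual tag set depends on whichever part-of-speech analyzer is used.

```python
from typing import FrozenSet, List, Tuple


def candidate_participles(tagged: List[Tuple[str, str]],
                          allowed_pos: FrozenSet[str] = frozenset({"n"})) -> List[str]:
    """Keep participles whose part of speech meets the condition, without repeats."""
    seen = set()
    result = []
    for word, pos in tagged:
        if pos in allowed_pos and word not in seen:
            seen.add(word)
            result.append(word)
    return result


# Hypothetical tagged target participle set of one sub-text sentence.
tagged_sub_sentence = [("football", "n"), ("is", "v"), ("popular", "a"), ("football", "n")]
candidates = candidate_participles(tagged_sub_sentence)
```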
In some embodiments, the determining at least one second relation tuple from the at least one candidate relation tuple may specifically be that a to-be-processed text sentence corresponding to each candidate relation tuple is generated based on the relation cue word, the to-be-processed text sentence corresponding to each candidate relation tuple is subjected to text processing, a relation probability of the to-be-processed text sentence corresponding to each candidate relation tuple is obtained, and the at least one second relation tuple is determined from the at least one candidate relation tuple according to the relation probability corresponding to each candidate relation tuple.
In some embodiments, at least one candidate relation tuple may be constructed for the candidate participle set corresponding to each two sub-sentences. Or at least one candidate relation tuple can be constructed for the candidate participle sets corresponding to every two sub-sentences belonging to the same text sentence. The specific process of constructing the candidate relational tuple can be referred to the related description of the above embodiment.
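The pairwise construction can be sketched as follows, assuming each candidate participle set is given as a list and that both orderings of a pair are enumerated (as in the ball game / football example later in this description). The name `build_candidate_tuples` is hypothetical, and the relation slot X is left out of the tuple.

```python
from itertools import combinations, product
from typing import List, Tuple


def build_candidate_tuples(candidate_sets: List[List[str]]) -> List[Tuple[str, str]]:
    """Enumerate (first participle, second participle) pairs across different sets."""
    seen = set()
    result = []
    # Every unordered pair of sub-text sentences (Si, Sj) with i < j.
    for set_i, set_j in combinations(candidate_sets, 2):
        for a, b in product(set_i, set_j):
            # Either participle may play the role of the first participle.
            for pair in ((a, b), (b, a)):
                if pair[0] != pair[1] and pair not in seen:
                    seen.add(pair)
                    result.append(pair)
    return result


pairs = build_candidate_tuples([["ball game"], ["football", "basketball"]])
```

To restrict construction to sub-text sentences of the same text sentence, the function would simply be called once per text sentence with only that sentence's candidate participle sets.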
In some embodiments, the relation cue word is used to connect the two participles in a candidate relation tuple, which not only forms a sentence with complete semantics but also shortens the distance between the two participles. The relation cue word may be set by the relevant service personnel, and may be "is", "includes", and the like. The connection rule is determined according to the specified association relationship. For example, if the specified association relationship is a hypernym-hyponym relationship, the connection rule is "[hyponym] is [hypernym]"; if the candidate relation tuple is [actor, Zhang San, X], where "actor" is the first participle (hypernym) and "Zhang San" is the second participle (hyponym), then "Zhang San is actor" is obtained by connecting them with the relation cue word. There may be one or more relation cue words, and one sub-text sentence may correspond to one or more text sentences to be processed.
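A minimal sketch of the connection rule "[hyponym] is [hypernym]" follows. It assumes the first participle of the tuple is the hypernym and the second is the hyponym, as in the example above; the function name and the space separator (which suits English glosses but would be empty for Chinese text) are assumptions.

```python
from typing import List, Sequence, Tuple


def build_prompt_sentences(candidate: Tuple[str, str],
                           cue_words: Sequence[str] = ("is",),
                           sep: str = " ") -> List[str]:
    """Generate one text sentence to be processed per relation cue word."""
    hypernym, hyponym = candidate
    return [sep.join((hyponym, cue, hypernym)) for cue in cue_words]


sentences = build_prompt_sentences(("actor", "Zhang San"))
```

With several cue words, one candidate relation tuple yields several text sentences to be processed, each of which can later receive its own relation probability.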
In some embodiments, the relationship probability may be used to indicate a probability that the to-be-processed text sentence contains two candidate participles having a specified association relationship and two candidate participles in a candidate relationship tuple corresponding to the to-be-processed text sentence satisfy a specified association condition. I.e. the higher the probability, the higher the probability that the corresponding candidate relation tuple is correct. A candidate relationship tuple having a relationship probability greater than a probability threshold may be determined as the second relationship tuple. In addition, when there are a plurality of text sentences to be processed corresponding to one candidate relationship tuple, each text sentence to be processed corresponds to one relationship probability, and the average relationship probability corresponding to each text sentence to be processed may be used as the final relationship probability, or each relationship probability is subjected to weighted summation, the result is used as the final relationship probability, and the final relationship probability is used to determine the second relationship tuple.
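The aggregation and thresholding described above can be sketched as follows. The function names are hypothetical, the default threshold of 0.9 is only the illustrative value used elsewhere in this description, and the weights (when given) are assumed to be chosen by the caller.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def final_relation_probability(probs: Sequence[float],
                               weights: Optional[Sequence[float]] = None) -> float:
    """Average the per-sentence probabilities, or take their weighted sum."""
    if not probs:
        raise ValueError("at least one relation probability is required")
    if weights is None:
        return sum(probs) / len(probs)
    if len(weights) != len(probs):
        raise ValueError("weights and probabilities must have the same length")
    return sum(p * w for p, w in zip(probs, weights))


def select_second_tuples(scored: Dict[Tuple[str, str], List[float]],
                         threshold: float = 0.9) -> List[Tuple[str, str]]:
    """Keep candidate tuples whose final probability exceeds the threshold."""
    return [t for t, probs in scored.items()
            if final_relation_probability(probs) > threshold]


kept = select_second_tuples({("ball game", "football"): [0.95, 0.97],
                             ("football", "ball game"): [0.10, 0.20]})
```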
In some embodiments, determining the relation probabilities of the candidate relation tuples in order to determine the at least one second relation tuple may be performed by the second text processing model. A second initial model can be trained to obtain the second text processing model, which performs relation prediction on an input text sentence to be processed and outputs the relation probability corresponding to the candidate relation tuple.
In some embodiments, the training of the second initial model to obtain the second text processing model may specifically be to obtain a sample set and a relationship extraction template, determine at least one reference relationship tuple corresponding to a plurality of sample text sentences in the sample set based on the relationship extraction template, and train the second initial model based on the at least one reference relationship tuple to obtain the second text processing model. The set of samples may be the same as or different from the set of samples used to train the first text-processing model. The specific process of determining the reference relationship tuple based on the relationship extraction template may refer to the process of determining the reference relationship tuple in the process of training the first text processing model, and details are not repeated here. The reference relation tuple is a high-quality training sample obtained by a relation extraction template. In addition, after the sample set is preprocessed, the relationship tuple extraction can be performed on the preprocessed sample set by using the relationship extraction template. The preprocessing process can be referred to the preprocessing process for the text sentence sets in the above embodiments.
In some embodiments, the electronic device may set a relationship label for each reference relationship tuple (for example, 1, indicating that the relationship tuple is a correct relationship tuple), call the second initial model to output a sample relationship probability of each reference relationship tuple, determine a prediction bias for the second initial model according to the relationship label and the sample relationship probability, and correct a model parameter based on the prediction bias to train to obtain the second text processing model. The output of the sample relationship probability of each reference relationship tuple may be that a to-be-processed text corresponding to each reference relationship tuple is generated based on the relationship cue word, and the second initial model is called to predict the to-be-processed text, so as to obtain the corresponding sample relationship probability.
In some embodiments, the electronic device may further train the second initial model based on the at least one reference relation tuple together with the other relation tuples, i.e., those initial relation tuples that were screened out and are not among the at least one reference relation tuple. For example, a relation label (for example, 1) may be set for each reference relation tuple and a relation label (for example, 0, indicating that the relation tuple is an erroneous relation tuple) may be set for each other relation tuple, and the second initial model may be trained based on the relation labels of the reference relation tuples and of the other relation tuples according to the above training process to obtain the second text processing model.
In some embodiments, the plurality of sample text sentences in the sample set includes a plurality of sample sub-text sentences segmented based on the segmented characters. Therefore, the specific way of obtaining the second text processing model through training may also be that at least one sample relation tuple is determined from a plurality of sample sub-text sentences, a relation label of each sample relation tuple is determined based on at least one reference relation tuple, a text sentence to be processed corresponding to each sample relation tuple is generated based on the relation cue word, the second initial model is invoked to perform text processing on the text sentence to be processed corresponding to each sample relation tuple, and the relation probability of the text sentence to be processed corresponding to each sample relation tuple is respectively generated; the relation probability indicates the probability that two sample participles in the sample relation tuples meet the specified association condition, and the second initial model is trained according to the relation probability corresponding to each sample relation tuple and the relation label thereof to obtain a second text processing model.
For example, a prediction bias of the second initial model may be determined according to the relation probability of each sample relation tuple and its relation label, and the model parameters of the second initial model are corrected based on the prediction bias until the model converges, so as to obtain the second text processing model. For the specific manner of generating a text sentence to be processed based on the relation cue word, refer to the manner of generating the text sentence to be processed corresponding to a candidate relation tuple, which is not described in detail here.
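The description does not specify how the prediction bias is computed. One common choice for a probability output with 0/1 relation labels is binary cross-entropy; the sketch below uses it purely as an assumption, and the function name is hypothetical.

```python
import math
from typing import Sequence


def binary_cross_entropy(probs: Sequence[float], labels: Sequence[int],
                         eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between relation probabilities and 0/1 labels."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equally long")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# A model that scores the correct tuple high and the wrong tuple low has a small bias.
good = binary_cross_entropy([0.9, 0.1], [1, 0])
bad = binary_cross_entropy([0.1, 0.9], [1, 0])
```

In an actual training loop this value would be minimized with a gradient-based optimizer until the model converges.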
In some embodiments, determining at least one sample relation tuple from the plurality of sample sub-text sentences may be done by obtaining a sample participle set corresponding to each sample sub-text sentence and constructing at least one sample relation tuple according to these sample participle sets. The sample participle set includes the specified sample participles constituting the corresponding sample sub-text sentence; for example, it includes the sample participles that meet a specified part-of-speech condition, where the specified part-of-speech condition may indicate that the sample participles in the sample sub-text sentence are nouns. Any sample relation tuple includes two sample participles from different sample participle sets. The sample relation tuples may be constructed in the same way as the candidate relation tuples; that is, the two sample participle sets from which the two sample participles come may correspond to any two of the plurality of sample sub-text sentences, or to any two sample sub-text sentences obtained by dividing the same sample text sentence.
In some embodiments, the relation label of each sample relation tuple is determined based on the at least one reference relation tuple as follows: if a sample relation tuple is identical to any reference relation tuple, its relation label is the first relation label, for example, 1; if the sample relation tuple is not identical to any reference relation tuple, its relation label is the second relation label, for example, 0.
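The labeling rule for sample relation tuples amounts to a membership test against the reference relation tuples. A minimal sketch, with tuples reduced to (first participle, second participle) pairs and a hypothetical function name:

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def label_sample_tuples(sample_tuples: Iterable[Pair],
                        reference_tuples: Iterable[Pair]) -> List[Tuple[Pair, int]]:
    """Assign label 1 to sample tuples found among the reference tuples, else 0."""
    reference = set(reference_tuples)
    return [(t, 1 if t in reference else 0) for t in sample_tuples]


labeled = label_sample_tuples(
    [("ball game", "football"), ("football", "ball game")],
    [("ball game", "football")],
)
```

Because the tuples are ordered, the reversed pair receives label 0, which gives the second initial model negative examples for the wrong direction of the relation.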
In some embodiments, because the first text processing model determines first relation tuples by labeling the characters in a sub-text sentence, more than one first participle or second participle may be labeled in one sub-text sentence, so that more than one first relation tuple is determined from that sub-text sentence. The first relation tuples can therefore be further verified by the second text processing model. For example, when two first participles and two second participles are labeled in one sub-text sentence, 4 first relation tuples are obtained; a text sentence to be processed can be generated for each first relation tuple according to the above method and input into the second text processing model in turn to obtain the relation probability of each first relation tuple, and the 4 first relation tuples are screened according to these relation probabilities. The verified first relation tuples and the second relation tuples can subsequently be used to construct the relational knowledge graph.
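To illustrate how several labeled participles in one sub-text sentence produce several first relation tuples, the sketch below decodes BIO labels into spans and takes the cross product of first and second participles. The tag names N and M and both function names are assumptions for the example.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def decode_spans(chars: Sequence[str], labels: Sequence[str]) -> Dict[str, List[str]]:
    """Group characters into spans per tag, following B-/I-/O labels."""
    spans: Dict[str, List[str]] = {}
    tag, buffer = None, []
    # The trailing ("", "O") sentinel flushes a span that ends at the last character.
    for ch, label in list(zip(chars, labels)) + [("", "O")]:
        if label.startswith("I-") and tag == label[2:]:
            buffer.append(ch)
            continue
        if tag is not None:
            spans.setdefault(tag, []).append("".join(buffer))
        if label.startswith("B-"):
            tag, buffer = label[2:], [ch]
        else:
            tag, buffer = None, []
    return spans


def first_relation_tuples(chars: Sequence[str], labels: Sequence[str],
                          first_tag: str = "N", second_tag: str = "M") -> List[Tuple[str, str]]:
    """Pair every labeled first participle with every labeled second participle."""
    spans = decode_spans(chars, labels)
    return list(product(spans.get(first_tag, []), spans.get(second_tag, [])))


chars = list("AB CD xy zw")
labels = ["B-N", "I-N", "O", "B-N", "I-N", "O", "B-M", "I-M", "O", "B-M", "I-M"]
tuples = first_relation_tuples(chars, labels)
```

Each of the resulting tuples could then be turned into a text sentence to be processed and scored by the second text processing model for screening.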
The second text processing model may also be referred to as a text-based prompt learning model, and may be constructed based on any model type and topology of a deep neural network capable of implementing the prediction task. For example, the second text processing model may include a BERT model and a probability prediction network layer, such as a linear network layer and a SOFTMAX layer (normalization layer). In the second text processing model, the BERT model performs feature processing on the text sentence to be processed to obtain text sentence features, and the linear network layer and the SOFTMAX layer process these text sentence features to output the relation probability corresponding to the text sentence to be processed. The BERT model may be replaced with other language models such as an LSTM model, a gated convolutional neural network, a FastBERT model, a SpanBERT model, and the like.
Therefore, the text sentence to be processed is constructed with the relation cue word, and the second text processing model predicts on the candidate relation tuple through that text sentence. The relation cue word can be understood as prompt information added to the model input, which makes the input better suited to the language model and improves the prediction. Because the relation cue word produces a simple short sentence, the drawbacks of extracting directly from a long sentence in which the two participles are far apart can be avoided. The method can thus predict relation tuples between two sub-text sentences by means of relatively simple, short-distance text sentences to be processed, which not only improves prediction efficiency and accuracy but also reduces the modeling difficulty of the model. The training data of the first initial model and the second initial model are labeled automatically by using the reference relation tuples determined by the relation extraction template, so that entity relations (i.e., the relations between the specified participles in the sub-text sentences) are labeled automatically by remote supervision (a form of unsupervised training) and the training corpus is constructed intelligently. This reduces manual participation and labor cost in the model training stage, improves training efficiency, and addresses the modeling difficulty that arises when a participle combination is far apart within one text sentence.
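The probability prediction layer can be illustrated without any deep learning framework. The sketch below assumes a two-class output in which index 1 means "the relation holds", and takes a plain list of floats as a stand-in for the text sentence feature produced by the language model; the weights, bias and function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax (normalization) over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relation_probability(features: Sequence[float],
                         weights: Sequence[Sequence[float]],
                         bias: Sequence[float]) -> float:
    """Linear layer followed by softmax; returns the probability of class 1."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)[1]


# Toy feature vector and parameters, not trained values.
p = relation_probability([1.0, 2.0], [[0.0, 0.0], [1.0, 1.0]], [0.0, 0.0])
```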
The first text processing model and the second text processing model are used for determining the relation tuples in the sub-text sentences and among the sub-text sentences, so that the determination efficiency and accuracy can be improved, the determination recall rate of the relation tuples can be improved, and the relation extraction and the automatic construction of the knowledge graph aiming at the unstructured text can be realized.
In some embodiments, if a large number of higher-quality relation tuples already exist, the template extraction module may be skipped and the neural network module may be trained directly with these relation tuples. For example, a sample set from the related fields is obtained, the sample sub-text sentences included in the sample set are labeled by using the relation tuples and the labeling results are used as labels, the first text processing model is obtained by training on the sample sub-text sentences, and the second text processing model is obtained by training directly on the relation tuples and their relation labels. In this case, the reference relation tuples used to train the first text processing model and the second text processing model do not need to be extracted from a sample set by using the relation extraction template.
S404, a relation extraction template is obtained, and relation tuples of the text sentences in the text sentence set are extracted based on the relation extraction template to obtain at least one third relation tuple.
In some embodiments, the electronic device may perform the construction of the relational knowledge graph directly based on (or after being filtered according to a specified filtering rule) the at least one first relational tuple and the at least one second relational tuple. Or, a relationship extraction template may be obtained, a relationship tuple is extracted based on the relationship extraction template to obtain at least one third relationship tuple, and the at least one first relationship tuple and the at least one second relationship tuple serve as a supplement of the at least one third relationship tuple and participate in the construction of the relationship knowledge graph, for example, the construction of the relationship knowledge graph is directly performed based on (or after being screened according to a specified screening rule) the first, second, and third relationship tuples.
Wherein, the specified filtering rule may be the filtering rule described in the above step S402. The obtained relationship extraction template may be a template used in training the first text processing model and the second text processing model. The electronic device may respectively perform relationship tuple extraction on each text sentence in the text sentence set based on the relationship extraction template to obtain at least one third relationship tuple, where the third relationship tuple includes two participles, and the two participles are from the same text sentence and may be from different sub-text sentences or the same sub-text sentence in the same text sentence. The manner of extracting the text sentence based on the relationship extraction template may be the same as the manner of extracting the sample text sentence based on the relationship extraction template, and is not described herein again.
S405, a relation knowledge graph corresponding to the text sentence set is constructed according to the at least one first relation tuple, the at least one second relation tuple and the at least one third relation tuple.
In some embodiments, the electronic device may construct a relational knowledge-graph directly based on the at least one first relational tuple, the at least one second relational tuple, and the at least one third relational tuple. Or the data processing can be performed on at least one first relation tuple, at least one second relation tuple and at least one third relation tuple to obtain a target relation tuple set, and a relation knowledge graph is constructed according to the target relation tuple set.
In some embodiments, the data processing may include at least one or more of: tuple screening processing and tuple deduplication processing. The specific process of the tuple screening process can be referred to the screening process described in the above step S402. The tuple deduplication process is used to delete duplicate relationship tuples of the at least one first relationship tuple, the at least one second relationship tuple, and the at least one third relationship tuple. Therefore, under the condition of sufficient text corpus, a high-quality relation knowledge graph can be quickly constructed in a short time through the method.
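Tuple screening and tuple deduplication over the three groups of relation tuples can be sketched as a single pass. The screening rule is passed in as a predicate because the specific rule is defined elsewhere in this description; the function name and the example rule are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]


def merge_relation_tuples(*groups: Iterable[Pair],
                          keep: Callable[[Pair], bool] = lambda t: True) -> List[Pair]:
    """Merge tuple groups in order, dropping duplicates and tuples that fail screening."""
    seen = set()
    merged = []
    for group in groups:
        for t in group:
            if t not in seen and keep(t):
                seen.add(t)
                merged.append(t)
    return merged


first = [("actor", "Zhang San")]
second = [("ball game", "football"), ("actor", "Zhang San")]
third = [("ball game", "football"), ("x", "x")]
# Example screening rule (an assumption): drop tuples whose two participles are equal.
target_set = merge_relation_tuples(first, second, third, keep=lambda t: t[0] != t[1])
```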
Based on the above description, the embodiments of the present application provide schematic diagrams of a knowledge graph construction framework, as shown in fig. 6a and fig. 6b. The framework may include a text preprocessing module, a template extraction module, a neural network module and a graph construction module; the neural network module may include a first text processing model, which may be a sequence labeling model (BERT + CRF), and a second text processing model, which may be a prompt learning model (BERT + linear network layer + SOFTMAX layer); the construction process of the knowledge graph comprises a training phase and an application phase:
as shown in fig. 6a, the application phase includes: 1) Obtaining a text corpus, segmenting the text corpus according to the ending identifier to obtain a text sentence set, and preprocessing the text sentences in the text sentence set in the text preprocessing module, where the preprocessing may include: text sentence part-of-speech analysis and text sentence screening. 2) If only the neural network module is used, the preprocessed text sentence S can be segmented based on the segmentation characters to obtain a plurality of sub-text sentences [S1, S2, ..., Sm], each character in each sub-text sentence is labeled by using the first text processing model, and at least one first relation tuple Ts is determined according to the character labeling result of each sub-text sentence;
an example of this process is shown in fig. 7a, which is a schematic view of a scene in which a first relation tuple is determined based on the first text processing model according to an embodiment of the present application: each sub-text sentence Si, i ∈ [1, m], is input into the first text processing model in turn to obtain a batch of extracted first relation tuples Ts; if a sub-text sentence Si is "Zhang is an actor", the BERT model performs feature processing on Si, the CRF outputs the labeling result of each character in Si, and the first relation tuple [actor, Zhang, X] is determined based on the labeling result;
meanwhile, the participles meeting the part-of-speech condition (such as nouns) in each sub-text sentence can be obtained based on the part-of-speech analysis result of that sub-text sentence and used as the candidate participle set Ni = [ni1, ni2, ..., nil] of each sub-text sentence Si; the candidate participle sets of every two sub-text sentences (Si, Sj, 1 ≤ i < j ≤ m) are combined pairwise to enumerate the candidate relation tuples [nik, njz, X] among all sub-text sentences; a text sentence to be processed is generated for each candidate relation tuple based on the relation cue word "is" and input into the second text processing model in turn, which outputs the relation probability of each candidate relation tuple, and the second relation tuples Tp are determined according to the relation probabilities;
an example of this process is shown in fig. 7b, which is a schematic view of a scene in which a second relation tuple is determined based on the second text processing model according to an embodiment of the present application: assume that the candidate participle set of Si includes: ball game; and the candidate participle set of Sj includes: football and basketball; combining them pairwise yields the candidate relation tuples L1-4 between Si and Sj: [ball game, football, X], [ball game, basketball, X], [football, ball game, X], [basketball, ball game, X]; the text sentences to be processed K1-4 are generated based on the relation cue word "is", such as [ball game, football, X] → "football is ball game"; the BERT model performs feature processing on each text sentence to be processed, and the linear network layer and the SOFTMAX layer output its relation probability, for example, the relation probability of K1 is 0.9; the candidate relation tuples whose relation probability is larger than a probability threshold (such as 0.9) can be determined as second relation tuples, such as [ball game, football, X] and [ball game, basketball, X];
the first relation tuple Ts and the second relation tuple Tp can be integrated to obtain a relation tuple output result of the neural network module, further, the relation tuple output result can be screened according to a specified screening rule, and a relation knowledge graph corresponding to the text sentence set is constructed according to the screened relation tuple output result.
3) Due to the diversity of grammatical structures, the relation extraction template can extract only a part of the relation tuples, and a large number of sub-text sentences that contain participle combinations do not match the structure of the relation extraction template.
As shown in table 4 below:
TABLE 4
Therefore, the template extraction module and the neural network module can be used at the same time: the at least one first relation tuple and the at least one second relation tuple obtained by the neural network module serve as a supplement to the at least one third relation tuple obtained by the template extraction module; the at least one first relation tuple, the at least one second relation tuple and the at least one third relation tuple are screened according to a specified screening rule, and the relational knowledge graph corresponding to the text sentence set is constructed from the screened result.
As shown in fig. 6b, the training phase is mainly a process of training the neural network module by using the template extraction module, and includes: 1) Obtaining a sample text corpus, segmenting it according to the ending identifier to obtain a sample set, and preprocessing the sample text sentences in the sample set in the text preprocessing module, where the preprocessing may include: text sentence part-of-speech analysis and text sentence screening. 2) Using the template extraction module to perform relation extraction on the preprocessed sample text sentences to obtain an initial relation tuple set, and screening the initial relation tuple set according to a specified screening rule to obtain at least one reference relation tuple. 3) Segmenting the preprocessed sample text sentences based on the segmentation characters to obtain a plurality of sample sub-text sentences, labeling each character in the plurality of sample sub-text sentences by using the at least one reference relation tuple to obtain a first labeling result, thereby achieving automatic labeling based on remote supervision, calling the first initial model to label the characters in each sample sub-text sentence to obtain a second labeling result, and training the first initial model according to the second labeling result with the first labeling result as the benchmark (i.e., the label) to obtain the first text processing model. 4) Generating a text sentence to be processed corresponding to each reference relation tuple based on the relation cue word, and training the second initial model according to the text sentence to be processed and its relation label (such as 1) to obtain the second text processing model.
In the embodiment of the application, a text sentence set can be obtained, text sentences in the text sentence set are segmented based on segmentation characters to obtain a plurality of sub-text sentences, at least one first relation tuple is determined according to the plurality of sub-text sentences, at least one second relation tuple is determined according to the plurality of sub-text sentences, a relation extraction template is obtained, relation tuple extraction is performed on the text sentences in the text sentence set based on the relation extraction template to obtain at least one third relation tuple, and a relation knowledge graph corresponding to the text sentence set is constructed according to the at least one first relation tuple, the at least one second relation tuple and the at least one third relation tuple. By the method, the plurality of relation tuples can be determined according to the content in one sub-text sentence and the content between two sub-text sentences, so that the relation tuples can be determined more finely and comprehensively according to the text sentences, and the constructed knowledge graph is more accurate and reliable.
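The segmentation step summarized above can be illustrated with a minimal Python sketch. The set of segmentation characters in `SEGMENTATION_CHARS` and the function name `split_into_sub_sentences` are assumptions made for illustration; this section does not enumerate the segmentation characters actually used.

```python
import re
from typing import List

# Assumed segmentation characters; the embodiment does not list them here.
SEGMENTATION_CHARS = ",;，；、"


def split_into_sub_sentences(sentence: str, seg_chars: str = SEGMENTATION_CHARS) -> List[str]:
    """Split one text sentence into sub-text sentences at segmentation characters."""
    pattern = "[" + re.escape(seg_chars) + "]"
    # Drop empty fragments produced by leading, trailing or repeated separators.
    return [part.strip() for part in re.split(pattern, sentence) if part.strip()]


sub_sentences = split_into_sub_sentences(
    "fever is a symptom of flu, rest is advised; drink water"
)
```

Each resulting sub-text sentence is then the unit from which first relation tuples (within one sub-text sentence) and second relation tuples (across two sub-text sentences) are determined.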
While the method of the embodiments of the present application has been described in detail above, to facilitate better implementation of the above-described aspects of the embodiments of the present application, the apparatus of the embodiments of the present application is provided below accordingly.
FIG. 8 is a schematic diagram of a text processing apparatus according to an exemplary embodiment. The text processing apparatus may be a computer program (including program code) running in the electronic device; for example, the text processing apparatus may be an application in the electronic device. The text processing apparatus may be used to perform some or all of the steps in the method embodiments shown in fig. 2 and 4. Referring to fig. 8, the text processing apparatus includes the following modules:
an obtaining module 801, configured to obtain a text sentence set, and segment a text sentence in the text sentence set based on a segmentation character to obtain a plurality of sub text sentences;
a processing module 802, configured to determine at least one first relational tuple according to the plurality of sub-text sentences; a first relation tuple comprises two participles which are from the same sub-text sentence and have a specified association relation;
the processing module 802 is further configured to determine at least one second relational tuple according to the plurality of sub-text sentences; a second relational tuple comprises two participles which are from different sub-text sentences and have a specified association relation;
the processing module 802 is further configured to construct a relational knowledge graph corresponding to the text sentence set according to the at least one first relational tuple and the at least one second relational tuple.
In some embodiments, the processing module 802, when configured to determine at least one first relational tuple from the plurality of sub-text sentences, is specifically configured to:
labeling each character in the target sub text sentence; the target sub-text sentence is any one of a plurality of sub-text sentences;
determining a first participle according to the characters marked as the first label, and determining a second participle according to the characters marked as the second label; the participles determined according to the first label and the participles determined according to the second label have a specified association relation;
constructing at least one first relational tuple based on the first participle and the second participle; any of the first relational tuples includes a first participle and a second participle.
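As a hedged illustration of the three steps above, the following Python sketch decodes per-character labels into participles and pairs them into first relation tuples. The label names "A", "B" and "O" (first, second and third label) and the helper names are assumptions; the labels themselves would come from the first text processing model, which is not reproduced here.

```python
from itertools import groupby, product
from typing import List, Tuple


def decode_participles(chars: str, labels: List[str], target: str) -> List[str]:
    """Join each run of consecutive characters carrying the target label."""
    spans = []
    for label, group in groupby(zip(chars, labels), key=lambda pair: pair[1]):
        if label == target:
            spans.append("".join(ch for ch, _ in group))
    return spans


def first_relation_tuples(
    chars: str, labels: List[str], first_label: str = "A", second_label: str = "B"
) -> List[Tuple[str, str]]:
    """Pair every first participle with every second participle of one sub-text sentence."""
    firsts = decode_participles(chars, labels, first_label)
    seconds = decode_participles(chars, labels, second_label)
    return list(product(firsts, seconds))


sub_sentence = "flu causes fever"
# Hypothetical model output: 3 characters "A", 8 characters "O", 5 characters "B".
predicted = ["A"] * 3 + ["O"] * 8 + ["B"] * 5
tuples = first_relation_tuples(sub_sentence, predicted)
```

In this sketch both participles of a tuple necessarily come from the same sub-text sentence, which matches the definition of a first relation tuple.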
In some embodiments, the processing module 802, when configured to determine at least one second relational tuple from the plurality of sub-text sentences, is specifically configured to:
respectively acquiring a candidate participle set corresponding to each sub-text sentence; the candidate participle set comprises specified participles forming the corresponding sub-text sentence;
constructing at least one candidate relation tuple according to the candidate participle set corresponding to each sub text sentence; any candidate relation tuple comprises two candidate participles, and the two candidate participles are from different candidate participle sets;
at least one second relational tuple is determined from the at least one candidate relational tuple.
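The candidate construction described above can be sketched as follows. This is a minimal illustration that assumes the candidate participle sets are already available as lists and that each pair is ordered with the participle from the earlier sub-text sentence first; the source does not state an ordering, and the function name is hypothetical.

```python
from itertools import combinations, product
from typing import List, Sequence, Tuple


def candidate_relation_tuples(candidate_sets: Sequence[Sequence[str]]) -> List[Tuple[str, str]]:
    """Pair participles drawn from two different candidate participle sets."""
    candidates: List[Tuple[str, str]] = []
    # combinations() never pairs a set with itself, so both participles of a
    # candidate tuple always come from different sub-text sentences.
    for left, right in combinations(candidate_sets, 2):
        candidates.extend(product(left, right))
    return candidates


candidate_sets = [["flu", "fever"], ["rest"], ["water"]]
candidates = candidate_relation_tuples(candidate_sets)
```

Because the number of candidates grows with the product of the set sizes, the subsequent selection step is what keeps the second relation tuples manageable.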
In some embodiments, the processing module 802, when being configured to determine at least one second relationship tuple from the at least one candidate relationship tuple, is specifically configured to:
generating a text sentence to be processed corresponding to each candidate relation tuple based on the relation cue words;
respectively carrying out text processing on the text sentence to be processed corresponding to each candidate relation tuple to obtain the relation probability of the text sentence to be processed corresponding to each candidate relation tuple; the relationship probability indicates a probability that two candidate participles in the candidate relationship tuple satisfy a specified association condition;
and determining at least one second relation tuple from the at least one candidate relation tuple according to the corresponding relation probability of each candidate relation tuple.
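A minimal sketch of this selection step is given below. The cue phrase "is related to", the threshold of 0.5, and the toy scoring function are all assumptions for illustration; in the embodiment the relation probability would be produced by the second text processing model, whose interface is not specified in this section.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]


def build_pending_sentence(pair: Pair, cue: str = "is related to") -> str:
    """Generate the text sentence to be processed from a candidate tuple and a cue phrase."""
    return f"{pair[0]} {cue} {pair[1]}"


def select_second_tuples(
    candidates: List[Pair], score: Callable[[str], float], threshold: float = 0.5
) -> List[Pair]:
    """Keep the candidate tuples whose relation probability reaches the threshold."""
    selected = []
    for pair in candidates:
        probability = score(build_pending_sentence(pair))
        if probability >= threshold:
            selected.append(pair)
    return selected


# Stand-in for the second text processing model (hypothetical scores).
TOY_SCORES: Dict[str, float] = {
    "flu is related to rest": 0.9,
    "fever is related to water": 0.2,
}


def toy_score(sentence: str) -> float:
    return TOY_SCORES.get(sentence, 0.0)


selected = select_second_tuples([("flu", "rest"), ("fever", "water")], toy_score)
```

Passing the scorer as a callable keeps the sketch independent of any particular model; a thresholded decision is only one possible way to select tuples "according to the relation probability".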
In some embodiments, the processing module 802 is further configured to:
obtaining a relation extraction template, and performing relation tuple extraction on text sentences in the text sentence set based on the relation extraction template to obtain at least one third relation tuple;
when the processing module 802 is configured to construct a relation knowledge graph corresponding to a text sentence set according to at least one first relation tuple and at least one second relation tuple, the processing module is specifically configured to:
performing data processing on at least one first relation tuple, at least one second relation tuple and at least one third relation tuple to obtain a target relation tuple set; the data processing includes at least one of: tuple screening processing and tuple deduplication processing;
and constructing a relation knowledge graph according to the target relation tuple set.
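The data processing and graph construction above can be sketched as follows. The screening rule used here (drop tuples with an empty participle or with two identical participles) is an assumption, since the specified screening rule is not given in this section, and the adjacency-map representation of the knowledge graph is likewise only illustrative.

```python
from itertools import chain
from typing import Callable, Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def default_screen(pair: Pair) -> bool:
    """Assumed screening rule: both participles are non-empty and distinct."""
    head, tail = pair
    return bool(head) and bool(tail) and head != tail


def build_target_tuple_set(
    groups: Iterable[Iterable[Pair]], screen: Callable[[Pair], bool] = default_screen
) -> List[Pair]:
    """Screen and deduplicate the first, second and third relation tuples."""
    seen: Set[Pair] = set()
    target: List[Pair] = []
    for pair in chain.from_iterable(groups):
        if not screen(pair) or pair in seen:
            continue
        seen.add(pair)
        target.append(pair)
    return target


def build_graph(target: Iterable[Pair]) -> Dict[str, Set[str]]:
    """Represent the relation knowledge graph as head -> set of related tails."""
    graph: Dict[str, Set[str]] = {}
    for head, tail in target:
        graph.setdefault(head, set()).add(tail)
    return graph


first = [("flu", "fever")]
second = [("flu", "rest"), ("flu", "flu")]
third = [("flu", "fever")]
target = build_target_tuple_set([first, second, third])
graph = build_graph(target)
```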
In some embodiments, the at least one first relational tuple is determined based on the first text-processing model; at least one second relational tuple is determined based on the second text processing model; the processing module 802 is further configured to:
acquiring a sample set and a relation extraction template;
respectively extracting relation tuples of a plurality of sample text sentences in the sample set based on a relation extraction template to obtain an initial relation tuple set, and screening the initial relation tuple set to obtain at least one reference relation tuple; any reference relation tuple comprises two sample participles, and the two sample participles have a specified association relation;
training the first initial model based on the plurality of sample text sentences and the at least one reference relation tuple to obtain a first text processing model;
and training the second initial model based on the at least one reference relation tuple to obtain a second text processing model.
In some embodiments, the plurality of sample text sentences includes a plurality of sample sub-text sentences segmented based on the segmentation characters; when configured to train the first initial model based on the plurality of sample text sentences and the at least one reference relation tuple to obtain the first text processing model, the processing module 802 is specifically configured to:
labeling each character in each sample sub text sentence based on at least one reference relation tuple to obtain a first labeling result aiming at the plurality of sample sub text sentences;
calling a first initial model to respectively label each sample sub-text sentence to obtain a second labeling result aiming at the plurality of sample sub-text sentences;
and training the first initial model based on the second labeling result by taking the first labeling result as a reference to obtain a first text processing model.
In some embodiments, any of the plurality of sample sub-text sentences is represented as a target sample sub-text sentence; the processing module 802 is specifically configured to, when configured to label each character in each sample sub-text sentence based on at least one reference relation tuple to obtain a first labeling result for a plurality of sample sub-text sentences:
if the target sample sub-text sentence comprises a first sample participle and a second sample participle, marking the first characters forming the first sample participle as a first label, marking the second characters forming the second sample participle as a second label, and marking the characters other than the first characters and the second characters in the target sample sub-text sentence as a third label;
the first sample participle is a sample participle matched with the first label in the target relation tuple, the second sample participle is a sample participle matched with the second label in the target relation tuple, and the target relation tuple is any one of the at least one reference relation tuple.
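The labeling rule above, which yields the first labeling result used as the training reference, can be sketched in Python. The label names "A", "B" and "O" and the function name are assumptions, only the first occurrence of each participle is labeled for simplicity, and returning `None` when either participle is absent is a choice of this sketch, since the rule as stated covers only the case in which both are present.

```python
from typing import List, Optional, Tuple


def label_sub_sentence(
    sub_sentence: str,
    reference_tuple: Tuple[str, str],
    first_label: str = "A",
    second_label: str = "B",
    third_label: str = "O",
) -> Optional[List[str]]:
    """Label each character of a sample sub-text sentence from one reference relation tuple."""
    first, second = reference_tuple
    start_first = sub_sentence.find(first)
    start_second = sub_sentence.find(second)
    if start_first < 0 or start_second < 0:
        return None  # the rule applies only when both sample participles occur
    # Start with the third label everywhere, then overwrite the two participles.
    labels = [third_label] * len(sub_sentence)
    labels[start_first:start_first + len(first)] = [first_label] * len(first)
    labels[start_second:start_second + len(second)] = [second_label] * len(second)
    return labels


reference_labels = label_sub_sentence("flu causes fever", ("flu", "fever"))
```

Because the labels are derived automatically from template-extracted reference tuples, no manual annotation is needed, which is the remote supervision idea mentioned in the training phase.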
According to an embodiment of the present application, the modules in the text processing apparatus shown in fig. 8 may be combined, individually or entirely, into one or several other modules, or one or more of the modules may be further split into multiple functionally smaller modules; either arrangement achieves the same operation without affecting the technical effects of the embodiments of the present application. The modules are divided based on logical functions; in practical applications, the function of one module may be realized by a plurality of modules, or the functions of a plurality of modules may be realized by one module. In other embodiments of the present application, the text processing apparatus may also include other modules, and in practical applications these functions may also be implemented with the assistance of other modules or through the cooperation of a plurality of modules. According to another embodiment of the present application, the text processing apparatus shown in fig. 8 may be constructed, and the text processing method of the embodiments of the present application implemented, by running a computer program (including program code) capable of executing the steps of the methods shown in fig. 2 and 4 on a general-purpose computing device, such as a computer, that includes processing elements and storage elements such as a central processing unit (CPU), a random access memory (RAM) and a read-only memory (ROM). The computer program may, for example, be recorded on a computer-readable recording medium, and loaded into and executed in the above-described computing device via the computer-readable recording medium.
In the embodiment of the application, an acquisition module acquires a text sentence set, and divides text sentences in the text sentence set based on dividing characters to obtain a plurality of sub text sentences; the processing module determines at least one first relational tuple according to the plurality of sub-text sentences; the processing module determines at least one second relational tuple according to the plurality of sub-text sentences; and the processing module constructs a relation knowledge graph corresponding to the text sentence set according to the at least one first relation tuple and the at least one second relation tuple. By the aid of the device, the relation tuples can be determined according to the content in one sub-text sentence and the content between two sub-text sentences, so that the relation tuples can be determined more comprehensively according to the text sentences with finer granularity, and the constructed knowledge graph is more accurate and reliable.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 9, the electronic device 900 includes at least one processor 901 and a memory 902. Optionally, the electronic device may further include a network interface. Data can be exchanged among the processor 901, the memory 902 and the network interface; the network interface is controlled by the processor 901 to send and receive messages; the memory 902 is used for storing a computer program that comprises program instructions; and the processor 901 is used for executing the program instructions stored in the memory 902. The processor 901 is configured to call the program instructions to execute the above method.
The memory 902 may include a volatile memory (volatile memory), such as a random-access memory (RAM); the memory 902 may also include a non-volatile memory (non-volatile memory), such as a flash memory (flash memory), a solid-state drive (SSD), etc.; the memory 902 may also comprise a combination of the above-described types of memory.
The processor 901 may be a Central Processing Unit (CPU). In one embodiment, processor 901 may also be a Graphics Processing Unit (GPU). The processor 901 may also be a combination of a CPU and a GPU.
In one possible embodiment, the memory 902 is used to store program instructions that the processor 901 can call to perform the following steps:
acquiring a text sentence set, and segmenting text sentences in the text sentence set based on segmentation characters to obtain a plurality of sub text sentences;
determining at least one first relational tuple from the plurality of sub-text sentences; a first relation tuple comprises two participles which are from the same sub-text sentence and have a specified association relation;
determining at least one second relational tuple according to the plurality of sub-text sentences; a second relational tuple comprises two participles which are from different sub-text sentences and have a specified association relation;
and constructing a relation knowledge graph corresponding to the text sentence set according to the at least one first relation tuple and the at least one second relation tuple.
In some embodiments, the processor 901, when being configured to determine at least one first relational tuple from the plurality of sub-textual sentences, is specifically configured to:
labeling each character in the target sub text sentence; the target sub-text sentence is any one of a plurality of sub-text sentences;
determining a first participle according to the characters marked as the first label, and determining a second participle according to the characters marked as the second label; the participles determined according to the first label and the participles determined according to the second label have a specified association relation;
constructing at least one first relational tuple based on the first participle and the second participle; any of the first relational tuples includes a first participle and a second participle.
In some embodiments, the processor 901, when configured to determine at least one second relational tuple from the plurality of sub-text sentences, is specifically configured to:
respectively acquiring a candidate participle set corresponding to each sub-text sentence; the candidate participle set comprises specified participles forming the corresponding sub-text sentence;
constructing at least one candidate relation tuple according to the candidate participle set corresponding to each sub text sentence; any candidate relation tuple comprises two candidate participles, and the two candidate participles are from different candidate participle sets;
at least one second relational tuple is determined from the at least one candidate relational tuple.
In some embodiments, the processor 901, when configured to determine at least one second relationship tuple from the at least one candidate relationship tuple, is specifically configured to:
generating a text sentence to be processed corresponding to each candidate relation tuple based on the relation cue words;
respectively carrying out text processing on the text sentence to be processed corresponding to each candidate relation tuple to obtain the relation probability of the text sentence to be processed corresponding to each candidate relation tuple; the relationship probability indicates a probability that two candidate participles in the candidate relationship tuple satisfy a specified association condition;
and determining at least one second relation tuple from the at least one candidate relation tuple according to the corresponding relation probability of each candidate relation tuple.
In some embodiments, the processor 901 is further configured to:
obtaining a relation extraction template, and performing relation tuple extraction on text sentences in the text sentence set based on the relation extraction template to obtain at least one third relation tuple;
when the processor 901 is configured to construct a relation knowledge graph corresponding to a text sentence set according to at least one first relation tuple and at least one second relation tuple, the processor is specifically configured to:
performing data processing on at least one first relation tuple, at least one second relation tuple and at least one third relation tuple to obtain a target relation tuple set; the data processing includes at least one of: tuple screening processing and tuple deduplication processing;
and constructing a relation knowledge graph according to the target relation tuple set.
In some embodiments, the at least one first relational tuple is determined based on the first text-processing model; at least one second relational tuple is determined based on the second text processing model; the processor 901 is further configured to:
acquiring a sample set and a relation extraction template;
respectively extracting relation tuples of a plurality of sample text sentences in the sample set based on a relation extraction template to obtain an initial relation tuple set, and screening the initial relation tuple set to obtain at least one reference relation tuple; any reference relation tuple comprises two sample participles, and the two sample participles have a specified association relation;
training the first initial model based on the plurality of sample text sentences and the at least one reference relation tuple to obtain a first text processing model;
and training the second initial model based on the at least one reference relation tuple to obtain a second text processing model.
In some embodiments, the plurality of sample text sentences includes a plurality of sample sub-text sentences segmented based on the segmentation characters; when configured to train the first initial model based on the plurality of sample text sentences and the at least one reference relation tuple to obtain the first text processing model, the processor 901 is specifically configured to:
labeling each character in each sample sub text sentence based on at least one reference relation tuple to obtain a first labeling result aiming at the plurality of sample sub text sentences;
calling a first initial model to respectively label each sample sub-text sentence to obtain a second labeling result aiming at the plurality of sample sub-text sentences;
and training the first initial model based on the second labeling result by taking the first labeling result as a reference to obtain a first text processing model.
In some embodiments, any of the plurality of sample sub-text sentences is represented as a target sample sub-text sentence; the processor 901 is specifically configured to, when configured to label each character in each sample sub-text sentence based on at least one reference relation tuple to obtain a first labeling result for a plurality of sample sub-text sentences:
if the target sample sub-text sentence comprises a first sample participle and a second sample participle, marking the first characters forming the first sample participle as a first label, marking the second characters forming the second sample participle as a second label, and marking the characters other than the first characters and the second characters in the target sample sub-text sentence as a third label;
the first sample participle is a sample participle matched with the first label in the target relation tuple, the second sample participle is a sample participle matched with the second label in the target relation tuple, and the target relation tuple is any one of the at least one reference relation tuple.
In the embodiment of the application, the processor can obtain the text sentence set, segment the text sentences in the text sentence set based on the segmentation characters to obtain a plurality of sub-text sentences, determine at least one first relation tuple according to the plurality of sub-text sentences, determine at least one second relation tuple according to the plurality of sub-text sentences, and construct a relation knowledge graph corresponding to the text sentence set according to the at least one first relation tuple and the at least one second relation tuple. In the scheme, the plurality of relation tuples can be determined according to the content in one sub-text sentence and the content between two sub-text sentences, so that the relation tuples can be determined more finely and comprehensively according to the text sentences, and the constructed knowledge graph is more accurate and reliable.
In a specific implementation, the above-described apparatus, processor, memory, and the like may perform the implementation described in the above-described method embodiment, and may also perform the implementation described in the embodiment of the present application, which is not described herein again.
Also provided in embodiments of the present application is a computer (readable) storage medium storing a computer program, where the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to perform some or all of the steps performed in the above-mentioned method embodiments. Alternatively, the computer storage media may be volatile or nonvolatile. The computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
Embodiments of the present application also provide a computer program product, which includes computer instructions (program instructions), and when the computer instructions are executed by a processor, the computer instructions can implement some or all of the steps of the text processing method. Alternatively, the computer instructions may be stored in a computer-readable storage medium, and a processor of a computer device such as an electronic device reads the program instructions from the computer-readable storage medium, and the processor executes the program instructions, so that the computer device executes the text processing method provided above.
Reference herein to "a plurality" means two or more. "And/or" describes the association relationship of the associated objects and indicates that three relationships are possible; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Those of ordinary skill in the art would appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in accordance with the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a network of computers, or another programmable device. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, tapes), optical media (e.g., DVDs), or semiconductor media (e.g., Solid State Disks (SSDs)), among others.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present disclosure, and all the changes or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (12)
1. A method of text processing, the method comprising:
acquiring a text sentence set, and segmenting text sentences in the text sentence set based on segmentation characters to obtain a plurality of sub text sentences;
determining at least one first relational tuple according to the plurality of sub-text sentences; a first relation tuple comprises two participles which are from the same sub-text sentence and have a specified association relation;
determining at least one second relational tuple according to the plurality of sub-text sentences; a second relational tuple comprises two participles which are from different sub-text sentences and have the specified association relation;
and constructing a relation knowledge graph corresponding to the text sentence set according to the at least one first relation tuple and the at least one second relation tuple.
2. The method of claim 1, wherein determining at least one first relational tuple from the plurality of sub-text sentences comprises:
labeling each character in the target sub text sentence; the target sub-text sentence is any one of the plurality of sub-text sentences;
determining a first participle according to the characters marked as the first label, and determining a second participle according to the characters marked as the second label; the participles determined according to the first label and the participles determined according to the second label have the specified association relation;
constructing the at least one first relational tuple based on the first participle and the second participle; any of the first relational tuples includes a first participle and a second participle.
3. The method of claim 1, wherein determining at least one second relational tuple from the plurality of sub-text sentences comprises:
respectively acquiring a candidate participle set corresponding to each sub-text sentence; the candidate participle set comprises specified participles forming the corresponding sub-text sentence;
constructing at least one candidate relation tuple according to the candidate participle set corresponding to each sub text sentence; any candidate relation tuple comprises two candidate participles, and the two candidate participles are from different candidate participle sets;
determining the at least one second relational tuple from the at least one candidate relational tuple.
4. The method of claim 3, wherein the determining the at least one second relational tuple from the at least one candidate relational tuple comprises:
generating a text sentence to be processed corresponding to each candidate relation tuple based on the relation cue words;
respectively performing text processing on the text sentence to be processed corresponding to each candidate relation tuple to obtain the relation probability of the text sentence to be processed corresponding to each candidate relation tuple; the relation probability is used for indicating the probability that two candidate participles in the candidate relation tuple meet the specified association condition;
determining the at least one second relation tuple from the at least one candidate relation tuple according to the relation probability corresponding to each candidate relation tuple.
5. The method of claim 1, further comprising:
acquiring a relation extraction template, and performing relation tuple extraction on the text sentences in the text sentence set based on the relation extraction template to obtain at least one third relation tuple;
the constructing a relational knowledge graph corresponding to the text sentence set according to the at least one first relational tuple and the at least one second relational tuple comprises:
performing data processing on the at least one first relation tuple, the at least one second relation tuple and the at least one third relation tuple to obtain a target relation tuple set; the data processing includes at least one of: tuple screening processing and tuple deduplication processing;
and constructing the relation knowledge graph according to the target relation tuple set.
6. The method of claim 1, wherein the at least one first relational tuple is determined based on a first text-processing model; the at least one second relational tuple is determined based on a second text processing model; the method further comprises the following steps:
acquiring a sample set and a relation extraction template;
respectively extracting relationship tuples of a plurality of sample text sentences in the sample set based on the relationship extraction template to obtain an initial relationship tuple set, and screening the initial relationship tuple set to obtain at least one reference relationship tuple; any reference relation tuple comprises two sample participles, and the two sample participles have the specified association relation;
training a first initial model based on the plurality of sample text sentences and the at least one reference relation tuple to obtain the first text processing model;
and training a second initial model based on the at least one reference relation tuple to obtain the second text processing model.
7. The method of claim 6, wherein the plurality of sample text sentences comprise a plurality of sample sub-text sentences segmented based on the segmentation characters; training a first initial model based on the plurality of sample text sentences and the at least one reference relationship tuple to obtain the first text processing model, including:
labeling each character in each sample sub text sentence based on the at least one reference relation tuple to obtain a first labeling result aiming at the plurality of sample sub text sentences;
calling the first initial model to label each sample sub text sentence respectively to obtain a second labeling result aiming at the plurality of sample sub text sentences;
and training the first initial model based on the second labeling result by taking the first labeling result as a reference to obtain the first text processing model.
8. The method of claim 7, wherein any of the plurality of sample sub-text sentences is represented as a target sample sub-text sentence; the labeling of each character in each sample sub-text sentence based on the at least one reference relation tuple to obtain the first labeling result for the plurality of sample sub-text sentences comprises:
if the target sample sub-text sentence comprises a first sample word segmentation and a second sample word segmentation, marking a first character forming the first sample word segmentation as a first label, marking a second character forming the second sample word segmentation as a second label, and marking characters except the first character and the second character in the target sample sub-text sentence as a third label;
the first sample participle is a sample participle matched with the first label in a target relation tuple, the second sample participle is a sample participle matched with the second label in the target relation tuple, and the target relation tuple is any one of the at least one reference relation tuple.
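The character-level labeling of claim 8 can be illustrated with a short sketch. The concrete label strings (`"E1"`, `"E2"`, `"O"`), the use of the first occurrence of each participle, and the English example sentence are assumptions for illustration; the claim only requires three distinct labels.

```python
from typing import List, Optional, Tuple

# Hypothetical label names for the first, second and third labels.
FIRST_LABEL, SECOND_LABEL, THIRD_LABEL = "E1", "E2", "O"


def label_characters(
    sub_sentence: str, relation_tuple: Tuple[str, str]
) -> Optional[List[str]]:
    """Label each character of a sub-sentence against one reference relation tuple.

    Returns None when the sub-sentence does not contain both sample participles.
    """
    first_word, second_word = relation_tuple
    first_pos = sub_sentence.find(first_word)    # first occurrence only (assumption)
    second_pos = sub_sentence.find(second_word)
    if first_pos < 0 or second_pos < 0:
        return None
    labels = [THIRD_LABEL] * len(sub_sentence)   # default: third label
    for i in range(first_pos, first_pos + len(first_word)):
        labels[i] = FIRST_LABEL
    for i in range(second_pos, second_pos + len(second_word)):
        labels[i] = SECOND_LABEL
    return labels


labels = label_characters("flu causes fever", ("flu", "fever"))
print(labels)
```

A label sequence produced this way would play the role of the first labeling result that the first initial model's output is compared against during training.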
9. A text processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a text sentence set and segmenting text sentences in the text sentence set based on segmentation characters to obtain a plurality of sub text sentences;
a processing module, configured to determine at least one first relational tuple according to the plurality of sub-text sentences; a first relation tuple comprises two participles which are from the same sub-text sentence and have a specified association relation;
the processing module is further configured to determine at least one second relational tuple according to the plurality of sub-text sentences; a second relational tuple comprises two participles which are from different sub-text sentences and have the specified association relation;
the processing module is further configured to construct a relation knowledge graph corresponding to the text sentence set according to the at least one first relation tuple and the at least one second relation tuple.
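The split between first relation tuples (both participles in the same sub-text sentence) and second relation tuples (participles in different sub-text sentences) can be sketched as candidate generation. This is only a sketch: the segmentation characters, the vocabulary lookup used in place of real word segmentation, and all names are assumptions, and the produced pairs are merely candidates. In the claims, deciding whether a pair actually has the specified association relation is left to the text processing models.

```python
import re
from itertools import combinations, product
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def split_sub_sentences(sentence: str, seg_chars: str = ",;，；") -> List[str]:
    """Segment a text sentence on the (assumed) segmentation characters."""
    parts = re.split("[" + re.escape(seg_chars) + "]", sentence)
    return [p.strip() for p in parts if p.strip()]


def find_participles(sub_sentence: str, vocabulary: Sequence[str]) -> List[str]:
    """Stand-in for word segmentation: keep vocabulary words that occur in the text."""
    return [w for w in vocabulary if w in sub_sentence]


def candidate_pairs(
    sub_sentences: Sequence[str], vocabulary: Sequence[str]
) -> Tuple[List[Pair], List[Pair]]:
    """Return (same-sub-sentence pairs, cross-sub-sentence pairs)."""
    words = [find_participles(s, vocabulary) for s in sub_sentences]
    intra = [pair for ws in words for pair in combinations(ws, 2)]
    cross = [pair for a, b in combinations(words, 2) for pair in product(a, b)]
    return intra, cross


vocab = ["influenza", "fever", "cough", "ibuprofen"]
subs = split_sub_sentences("influenza causes fever and cough, patients may take ibuprofen")
intra, cross = candidate_pairs(subs, vocab)
print(subs)
print(intra)
print(cross)
```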
10. An electronic device comprising a processor and a memory, wherein the memory is configured to store a computer program comprising program instructions, and wherein the processor is configured to invoke the program instructions to perform the method of any of claims 1-8.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-8.
12. A computer program product, characterized in that the computer program product comprises computer instructions which, when executed by a processor, implement the method according to any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210903474.1A CN115827884B (en) | 2022-07-27 | 2022-07-27 | Text processing method, text processing device, electronic equipment, medium and program product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115827884A (en) | 2023-03-21 |
CN115827884B (en) | 2024-08-23 |
Family
ID=85522975
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210903474.1A Active CN115827884B (en) | 2022-07-27 | 2022-07-27 | Text processing method, text processing device, electronic equipment, medium and program product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115827884B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107657063A (en) * | 2017-10-30 | 2018-02-02 | 合肥工业大学 | Construction method and device for a medical knowledge graph |
CN109062894A (en) * | 2018-07-19 | 2018-12-21 | 南京源成语义软件科技有限公司 | The automatic identification algorithm of Chinese natural language Entity Semantics relationship |
CN110390021A (en) * | 2019-06-13 | 2019-10-29 | 平安科技(深圳)有限公司 | Drug knowledge graph construction method, device, computer equipment and storage medium |
CN111709243A (en) * | 2020-06-19 | 2020-09-25 | 南京优慧信安科技有限公司 | Knowledge extraction method and device based on deep learning |
CN112149427A (en) * | 2020-10-12 | 2020-12-29 | 腾讯科技(深圳)有限公司 | Method for constructing verb phrase implication map and related equipment |
CN112507715A (en) * | 2020-11-30 | 2021-03-16 | 北京百度网讯科技有限公司 | Method, device, equipment and storage medium for determining incidence relation between entities |
CN113282762A (en) * | 2021-05-27 | 2021-08-20 | 深圳数联天下智能科技有限公司 | Knowledge graph construction method and device, electronic equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
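Ding Yiqi: "Research and Implementation of Knowledge Extraction for Domain Knowledge Graph Construction", China Master's Theses Full-text Database, Information Science and Technology, 15 January 2022 (2022-01-15), pages 138 - 3464 * |
Ouyang Dantong; Fan Qi: "Clause-Level Context-Aware Open Information Extraction Method", Journal of Jilin University (Engineering and Technology Edition), no. 05, 19 November 2017 (2017-11-19), pages 1563 - 1570 * |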
Also Published As
Publication number | Publication date |
---|---|
CN115827884B (en) | 2024-08-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112084337B (en) | Training method of text classification model, text classification method and equipment | |
US11922121B2 (en) | Method and apparatus for information extraction, electronic device, and storage medium | |
CN106776711B (en) | Chinese medical knowledge map construction method based on deep learning | |
Ghosh et al. | Fracking sarcasm using neural network | |
CN111967242B (en) | Text information extraction method, device and equipment | |
CN112528637B (en) | Text processing model training method, device, computer equipment and storage medium | |
KR102041621B1 (en) | System for providing artificial intelligence based dialogue type corpus analyze service, and building method therefor | |
RU2679988C1 (en) | Extracting information objects with the help of a classifier combination | |
Arumugam et al. | Hands-On Natural Language Processing with Python: A practical guide to applying deep learning architectures to your NLP applications | |
CN112100332A (en) | Word embedding expression learning method and device and text recall method and device | |
US10915756B2 (en) | Method and apparatus for determining (raw) video materials for news | |
CN111353303A (en) | Word vector construction method and device, electronic equipment and storage medium | |
CN112015928A (en) | Information extraction method and device of multimedia resource, electronic equipment and storage medium | |
US20230004830A1 (en) | AI-Based Cognitive Cloud Service | |
TW202032534A (en) | Voice recognition method and device, electronic device and storage medium | |
CN114661872A (en) | Beginner-oriented API self-adaptive recommendation method and system | |
CN113139043A (en) | Question and answer sample generation method and device, electronic equipment and storage medium | |
CN116384403A (en) | Multi-mode social media named entity recognition method based on scene graph | |
Fudholi et al. | BERT-based tourism Named Entity Recognition: making use of social media for travel recommendations | |
CN115827884B (en) | Text processing method, text processing device, electronic equipment, medium and program product | |
CN113591493B (en) | Translation model training method and translation model device | |
CN111949781B (en) | Intelligent interaction method and device based on natural sentence syntactic analysis | |
CN114416923A (en) | News entity linking method and system based on rich text characteristics | |
CN108595434B (en) | Syntax dependence method based on conditional random field and rule adjustment | |
CN113705194A (en) | Extraction method and electronic equipment for short |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40083093; Country of ref document: HK |
GR01 | Patent grant | ||