CN115146070A - Key value generation method, knowledge graph generation method, device, equipment and medium

Info

Publication number: CN115146070A
Application number: CN202210754072.XA
Authority: CN
Inventors: 王兆吉; 黄昉; 史亚冰; 蒋烨; 柴春光
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-06-28
Filing date: 2022-06-28
Publication date: 2022-10-04

Abstract

The disclosure provides a key value generation method, a knowledge graph generation method, an apparatus, an electronic device, and a storage medium. It relates to the technical field of computer data processing and artificial intelligence, and in particular to natural language processing and deep learning technologies. The specific implementation scheme is as follows: obtaining a target document according to a document to be processed; parsing the target document to obtain a target key-value pair; determining a target key-value pair type according to the target key-value pair; and obtaining a key value result for the document to be processed according to the target key-value pair and the target key-value pair type.

Description

Key value generation method, knowledge graph generation method, device, equipment and medium

Technical Field

The present disclosure relates to the field of computer data processing and artificial intelligence, and more particularly to artificial intelligence, natural language processing, and deep learning techniques. In particular, the present invention relates to a key value generation method, a knowledge graph generation method, an apparatus, an electronic device, and a storage medium.

Background

The knowledge extraction task is one of the tasks in the construction of the knowledge graph. Available knowledge units can be extracted from the natural language text by automated or semi-automated techniques to supplement the relationship of entity attributes to entities in the knowledge graph.

Knowledge units may be structured in the form of SPO triplets. An SPO triplet may include S (i.e., entities), P (entity attributes or relationships between entities), and O (entity attribute values or associated entities).

Disclosure of Invention

The disclosure provides a key value generation method, a knowledge graph generation method, an apparatus, an electronic device, and a storage medium.

According to an aspect of the present disclosure, there is provided a key value generation method, including: obtaining a target document according to a document to be processed; parsing the target document to obtain a target key-value pair; determining a target key-value pair type according to the target key-value pair; and obtaining a key value result for the document to be processed according to the target key-value pair and the target key-value pair type.

According to another aspect of the present disclosure, there is provided a knowledge-graph generating method, including: carrying out entity identification on the target document to obtain a target entity; generating a key value result by using the method according to the present disclosure; generating a knowledge unit according to the key value result and the target entity; and generating a knowledge graph according to the knowledge unit.

According to another aspect of the present disclosure, there is provided a key value generation apparatus including: the first acquisition module is used for acquiring a target document according to the document to be processed; the analysis module is used for analyzing the target document to obtain a target key value pair; the determining module is used for determining the type of the target key value pair according to the target key value pair; and the second acquisition module is used for acquiring a key value result aiming at the document to be processed according to the target key value pair and the type of the target key value pair.

According to another aspect of the present disclosure, there is provided a knowledge-graph generating apparatus including: the entity identification module is used for carrying out entity identification on the target document to obtain a target entity; a first generating module configured to generate a key-value result using the apparatus according to the present disclosure; a second generation module, configured to generate a knowledge unit according to the key value result and the target entity; and the third generation module is used for generating the knowledge graph according to the knowledge unit.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the method of the present disclosure.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to the present disclosure.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method as described in the present disclosure.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 schematically illustrates an exemplary system architecture to which a key value generation method, a knowledge graph generation method, and an apparatus may be applied, according to an embodiment of the present disclosure;

FIG. 2 schematically shows a flow chart of a key-value generation method according to an embodiment of the present disclosure;

FIG. 3A schematically illustrates an example schematic of a training process of a classification model in the case where a deep learning model includes a first language module, according to an embodiment of the disclosure;

FIG. 3B schematically illustrates an example schematic diagram of a training process of a classification model in the case where a deep learning model includes a second language module and a first feature fusion module, according to an embodiment of the disclosure;

FIG. 3C schematically illustrates an example schematic of a training process of a classification model in the case where a deep learning model includes a third language module and a first pre-training module, in accordance with an embodiment of the disclosure;

FIG. 3D schematically illustrates an example schematic diagram of a training process of a classification model in the case where a deep learning model includes a fourth language module and a third fusion module, according to an embodiment of the disclosure;

FIG. 4 schematically illustrates a flow diagram of a knowledge-graph generation method according to an embodiment of the present disclosure;

FIG. 5 schematically illustrates an example schematic diagram of generating a knowledge-graph according to an embodiment of the disclosure;

FIG. 6 schematically shows a block diagram of a key-value generation apparatus according to an embodiment of the present disclosure;

FIG. 7 schematically illustrates a block diagram of a knowledge-graph generating apparatus according to an embodiment of the present disclosure; and

FIG. 8 schematically illustrates a block diagram of an electronic device suitable for implementing a key-value generation method and a knowledge-graph generation method according to an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Key-value pairs play an important role in the construction of industry knowledge graphs. On the one hand, Key-Value (KV) is an important carrier of document knowledge data and can provide SPO knowledge to assist graph construction. On the other hand, building an industry ontology is costly, whereas KV extraction does not depend on the industry ontology and can serve as a high-quality source for it.

The goal of the KV extraction task is to parse a document with document parsing rules designed according to the distribution patterns of industry documents, so as to obtain KV candidate pairs, and then to filter the non-knowledge noise data contained in the KV candidate pairs by relying on manually customized rules. An entity extraction interface is then called, an entity association strategy is formulated and developed according to the industry document patterns, SPO candidate data is generated, and the data is corrected manually.

Because the KV extraction task gives no clear definition of the standards for KV and KV-class SPO, industry extraction strategies generalize poorly. When extraction tasks for multiple industries must be handled, different parsing rules, filtering rules, and entity association rules need to be configured for the knowledge of each industry, so cross-industry reusability is poor and development efficiency is low.

Therefore, the embodiments of the present disclosure provide a key value generation method. For example, a target document is obtained according to the document to be processed. The target document is parsed to obtain a target key-value pair. A target key-value pair type is determined according to the target key-value pair. A key value result for the document to be processed is obtained according to the target key-value pair and the target key-value pair type. Because the key-value pair type is determined, the standards for KV and KV-class SPO are clearly defined and the extraction target is clear. When multi-industry extraction tasks are faced, the key-value pair generation requirements of various industries can be supported without repeated development, which improves cross-industry reusability and development efficiency.

Fig. 1 schematically illustrates an exemplary system architecture to which a key value generation method, a knowledge graph generation method, and an apparatus may be applied according to an embodiment of the present disclosure.

It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios. For example, in another embodiment, an exemplary system architecture to which the key value generation method, the knowledge graph generation method, the key value generation apparatus, and the knowledge graph generation apparatus may be applied may include a terminal device, but the terminal device may implement the key value generation method, the knowledge graph generation method, the key value generation apparatus, and the knowledge graph generation apparatus, which are provided by the embodiment of the present disclosure, without interacting with a server.

As shown in FIG. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a knowledge reading application, a web browser application, a search application, an instant messaging tool, a mailbox client, and/or social platform software (by way of example only).

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a background management server (by way of example only) that supports the content browsed by the user using the terminal devices 101, 102, 103. The background management server may analyze and otherwise process received data such as a user request, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the terminal device.

It should be noted that the key value generation method and the knowledge graph generation method provided by the embodiments of the present disclosure may generally be executed by the terminal device 101, 102, or 103. Correspondingly, the key value generation apparatus and the knowledge graph generation apparatus provided by the embodiments of the present disclosure may also be disposed in the terminal device 101, 102, or 103.

Alternatively, the key value generation method and the knowledge graph generation method provided by the embodiments of the present disclosure may also generally be performed by the server 105. Accordingly, the key value generation apparatus and the knowledge graph generation apparatus provided by the embodiments of the present disclosure may generally be disposed in the server 105. The key value generation method and the knowledge graph generation method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the key value generation apparatus and the knowledge graph generation apparatus provided by the embodiments of the present disclosure may also be disposed in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.

For example, the terminal devices 101, 102, and 103 may obtain a document to be processed and derive a target document from it, and then send the target document to the server 105. The server 105 parses the target document to obtain a target key-value pair, determines a target key-value pair type according to the target key-value pair, and obtains a key value result for the document to be processed according to the target key-value pair and the target key-value pair type. Alternatively, the target document is parsed by a server or server cluster capable of communicating with the terminal devices 101, 102, 103 and/or the server 105, which finally obtains the key value result.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

It should be noted that the sequence numbers of the respective operations in the following methods are merely used as representations of the operations for description, and should not be construed as representing the execution order of the respective operations. The method need not be performed in the exact order shown, unless explicitly stated.

Fig. 2 schematically shows a flow chart of a key value generation method according to an embodiment of the present disclosure.

As shown in FIG. 2, the method 200 includes operations S210-S240.

In operation S210, a target document is obtained according to the document to be processed.

In operation S220, the target document is parsed to obtain a target key-value pair.

In operation S230, a target key-value pair type is determined according to the target key-value pair.

In operation S240, a key value result for the document to be processed is obtained according to the target key value pair and the type of the target key value pair.
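Operations S210 to S240 can be read as a four-stage pipeline. The following is only a minimal sketch of that flow under stated assumptions: the names `generate_key_values`, `KeyValueResult`, and the three `toy_*` stand-ins are hypothetical and do not come from the disclosure, and the toy classifier merely returns a fixed label where an actual embodiment would apply the trained classification model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

KeyValue = Tuple[str, str]


@dataclass(frozen=True)
class KeyValueResult:
    """One entry of the key value result: (K, V, key-value pair type)."""
    key: str
    value: str
    kv_type: str


def generate_key_values(
    document: str,
    preprocess: Callable[[str], str],
    parse: Callable[[str], List[KeyValue]],
    classify: Callable[[KeyValue], str],
) -> List[KeyValueResult]:
    target_document = preprocess(document)   # S210: obtain the target document
    pairs = parse(target_document)           # S220: parse target key-value pairs
    results = []
    for key, value in pairs:
        kv_type = classify((key, value))     # S230: determine the pair type
        results.append(KeyValueResult(key, value, kv_type))  # S240: collect result
    return results


# Toy stand-ins (hypothetical, for illustration only).
def toy_preprocess(document: str) -> str:
    # Strip surrounding whitespace and drop empty lines.
    return "\n".join(line.strip() for line in document.splitlines() if line.strip())


def toy_parse(document: str) -> List[KeyValue]:
    # Treat the first colon on a line as the key/value separator.
    pairs = []
    for line in document.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            pairs.append((key.strip(), value.strip()))
    return pairs


def toy_classify(pair: KeyValue) -> str:
    # A real system would call the trained classification model here.
    return "attribute-value"
```

Passing the stages in as callables keeps the flow independent of any one industry's parsing or classification rules, which is the reusability goal the disclosure describes.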

According to an embodiment of the present disclosure, the document to be processed may be a structured document, an unstructured document, or a semi-structured document. For example, a structured document may include document information managed in a relational database table. The unstructured document may include at least one of: DOC (Document), XML (Extensible Markup Language), DOCX, PDF (Portable Document Format), XLS, XLSX, CAJ (Chinese Academic Journals), and the like. The semi-structured document may include at least one of: log files, JSON (JavaScript Object Notation) documents, emails, and so on.

According to the embodiment of the disclosure, the target document may be a document obtained after preprocessing a document to be processed. The target document may also be a structured document, an unstructured document, or a semi-structured document.

According to embodiments of the present disclosure, key-value pairs may be semi-structured text representing key-value (i.e., KV) information. Compared with unstructured data and structured data, semi-structured data has a certain degree of structure at both the text level and the semantic level, its expression types are flexible and rich, its data format is more standardized, and it follows specific distribution patterns.

According to embodiments of the present disclosure, a key-value pair may be a text segment that can be expressed in a single line of text and has a strong format separator. At the text level, key-value pairs have various distribution formats and wide coverage across industry documents; they emphasize the key information of multiple parallel passages in a document by means of structural features and are an important source of knowledge. The distribution format can be expressed as Key [separator] Value, for example, list title-list item. At the semantic level, key-value pairs can match various binary relations (i.e., binary Schemas) that fit the Key-Value slots. Their structural features are pronounced and their industry semantics are weak, so key information is emphasized intuitively through structure, which helps users who lack industry background to understand the knowledge, and the Schema expression characteristics are universal across industries.

According to the embodiment of the disclosure, after the target document is obtained, a plurality of target key-value pairs can be identified from each independent paragraph statement in the target document by using a general key-value pair parsing strategy. The target key-value pair is in the form of semi-structured data.
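The Key [separator] Value format suggests a line-oriented parser. The sketch below is an assumption-laden illustration rather than the general parsing strategy of the disclosure: the separator set (ASCII colon, full-width colon, equals sign, tab), the 40-character key limit, and the function name `parse_key_value_pairs` are all hypothetical choices.

```python
import re
from typing import List, Tuple

# Hypothetical separator set; the source only speaks of "strong format separators".
# Group 1 is the key (at most 40 characters, no separator characters),
# group 2 is the value (must start with a non-space character).
_KV_LINE = re.compile(r"^\s*([^:：=\t]{1,40}?)\s*[:：=\t]\s*(\S.*?)\s*$")


def parse_key_value_pairs(document: str) -> List[Tuple[str, str]]:
    """Return (key, value) candidates found on individual lines of the document."""
    pairs = []
    for line in document.splitlines():
        match = _KV_LINE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs
```

Lines without a separator, or with an empty value, are skipped, so ordinary prose paragraphs produce no candidates. Only the first separator on a line splits the pair, which keeps a value such as a time of day intact.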

According to an embodiment of the present disclosure, the key-value pair type may be the binary Schema that a key-value pair matches at the semantic level, and is used for recording text pairs having different relationships. Key-value pair types may be classified into attribute-value, noun-interpretation, and non-KV classes, and the classification may follow a judgment standard. For example, the subclasses of the non-KV class may include at least one of: content-description (sequential), subject-view, date-event, and the like. Table 1 schematically shows a classification table of key-value pair types.

TABLE 1

According to the embodiment of the disclosure, after the target key-value pair type is determined, a key value result for the document to be processed can be generated by combining it with the target key-value pair. The key value result can represent the values of the various attributes of the document to be processed, and may be represented as (K, V, key-value pair type).

According to the embodiment of the disclosure, the target document is obtained according to the document to be processed, the target document is parsed to obtain the target key-value pair, the target key-value pair type is determined according to the target key-value pair, and the key value result for the document to be processed is obtained according to the target key-value pair and the target key-value pair type. Because the key-value pair type is determined, the standards for KV and KV-class SPO are clearly defined and the extraction target is clear. When multi-industry extraction tasks are faced, the key-value pair generation requirements of each industry can be supported without repeated development, which improves cross-industry reusability and development efficiency.

According to an embodiment of the present disclosure, operation S230 may include the following operations.

Feature extraction is performed on the target key-value pair to obtain a target key-value pair feature vector. The target key-value pair type is determined according to the target key-value pair feature vector.

According to an embodiment of the present disclosure, the key field of the target key-value pair may be used for storing an attribute, and the value field of the target key-value pair may be used for storing the value of that attribute. For example, the target key-value pair is "design unit, xx two companies", where the attribute is "design unit" and the value is "xx two companies".

According to the embodiment of the disclosure, feature extraction can be performed on the target key-value pair by using a deep learning network model, so as to obtain the target key-value pair feature vector. The target key-value pair feature vector can represent the relationship information between the attribute and the value of the target key-value pair, and the target key-value pair type is determined according to this relationship information.

According to the embodiment of the present disclosure, performing feature extraction on the target key-value pair to obtain the target key-value pair feature vector may include the following operations.

Object coding is performed on the objects in the target key-value pair to obtain a target object feature vector. Position coding is performed on the target key-value pair to obtain a target position feature vector. Segment coding is performed on the target key-value pair to obtain a target segment feature vector. The target key-value pair feature vector is obtained according to the target object feature vector, the target position feature vector, and the target segment feature vector.

According to the embodiment of the disclosure, the key and the value in a target key-value pair are spliced into one piece of text through [SEP], and the text is segmented and converted into a list of tokens. Each token can be regarded as an object in the target key-value pair, and object coding is performed on each object to obtain the target object feature vector. Position coding and segment coding are performed on the target key-value pair to obtain the target position feature vector and the target segment feature vector.

According to the embodiment of the disclosure, the target object feature vector, the target position feature vector and the target segment feature vector of the target key value pair are input into a language prediction model to obtain the target key value pair feature vector.
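The three encodings can be illustrated as three parallel index sequences, in the manner commonly used as input to Transformer-style language models. The sketch below is not the encoder of the disclosure; it assumes whitespace tokenization, a toy vocabulary containing `[UNK]` and `[SEP]`, and a convention in which the key and the separator belong to segment 0 and the value to segment 1. The function name `encode_key_value` is hypothetical.

```python
from typing import Dict, List, Tuple


def encode_key_value(
    key: str, value: str, vocab: Dict[str, int]
) -> Tuple[List[int], List[int], List[int]]:
    """Return (object ids, position ids, segment ids) for 'key [SEP] value'."""
    # Assumption: whitespace tokenization stands in for real word segmentation.
    key_tokens = key.split()
    value_tokens = value.split()
    tokens = key_tokens + ["[SEP]"] + value_tokens

    unknown_id = vocab["[UNK]"]
    # Object coding: one vocabulary index per token (object).
    object_ids = [vocab.get(token, unknown_id) for token in tokens]
    # Position coding: the index of each token in the spliced text.
    position_ids = list(range(len(tokens)))
    # Segment coding: 0 for the key and the separator, 1 for the value.
    segment_ids = [0] * (len(key_tokens) + 1) + [1] * len(value_tokens)
    return object_ids, position_ids, segment_ids
```

In a full model, each index sequence would be looked up in its own embedding table, and the resulting vectors would be combined before being passed to the language model.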

According to an embodiment of the present disclosure, determining the target key-value pair type according to the target key-value pair feature vector may include the following operations.

Expected feature data is created from the target key-value pair. Feature extraction is performed on the expected feature data to obtain an expected feature vector. The target key-value pair type is determined according to the target key-value pair feature vector and the expected feature vector.

According to embodiments of the present disclosure, expected feature data may be created according to the feature classes extracted for the target key-value pairs. The extracted feature categories may include object (i.e., token) granularity, part-of-speech granularity, and part-of-speech granularity. Expected feature data may be created according to recognition rules for multiple dimensions for each feature class.

According to an embodiment of the present disclosure, 54-dimensional expected feature data may be created according to recognition rules derived from bad-case analysis of a general validation set. For example, the 54-dimensional expected feature data may include 32-dimensional token granularity, 16-dimensional part-of-speech granularity, and 6-dimensional part-of-speech granularity. The expected feature data provides explicit guidance: through it, high-level semantic features (parts of speech and parts of speech) can be received directly. For example, Table 2 schematically shows a scheme table for creating expected feature data.

TABLE 2

It should be noted that the dimensions given above for creating the expected feature data are only exemplary and do not limit the number of dimensions in the present disclosure.

According to the embodiment of the disclosure, the expected feature data is obtained according to the scheme table for creating expected feature data, feature extraction is performed on the expected feature data to obtain the expected feature vector, and feature fusion is performed on the target key-value pair feature vector and the expected feature vector to obtain the target key-value pair type.

According to an embodiment of the present disclosure, performing feature extraction on the expected feature data to obtain an expected feature vector may include the following operations.

Dense coding is performed on the expected feature data to obtain a dense feature vector. Factorization is performed on the dense feature vector to obtain the expected feature vector.

According to an embodiment of the present disclosure, densely encoding the expected feature data may include the following. The 54-dimensional expected feature data is created, which may include expected feature data of 32-dimensional token granularity, expected feature data of 16-dimensional part-of-speech granularity, and expected feature data of 6-dimensional part-of-speech granularity. The 54-dimensional expected feature data can be split into 120-dimensional 0/1 features, which are then input and densely encoded to obtain the dense feature vector.

According to an embodiment of the present disclosure, factorizing the dense feature vector may include: inputting the dense feature vector into a factorization layer for factorization to obtain the expected feature vector.
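The disclosure does not spell out the internals of the dense coding or of the factorization layer. One plausible reading, shown here purely as an assumption, is a linear projection of the 0/1 features followed by a factorization-machine style second-order interaction. The function names, weight shapes, and the choice of a factorization-machine formulation are all hypothetical.

```python
from typing import List


def dense_encode(
    bits: List[int], weights: List[List[float]], bias: List[float]
) -> List[float]:
    """Project a 0/1 feature vector to a dense vector.

    out[j] = bias[j] + sum_i bits[i] * weights[i][j]
    """
    return [
        b + sum(x * row[j] for x, row in zip(bits, weights))
        for j, b in enumerate(bias)
    ]


def factorize(dense: List[float], factors: List[List[float]]) -> List[float]:
    """Assumed factorization-machine style interaction, one output per latent factor.

    factors[i][f] is the latent weight of input dimension i for factor f.
    For each f the result equals the sum over pairs i < j of
    (factors[i][f] * dense[i]) * (factors[j][f] * dense[j]),
    computed with the usual identity 0.5 * ((sum)^2 - sum of squares).
    """
    latent_size = len(factors[0])
    output = []
    for f in range(latent_size):
        total = sum(v[f] * x for v, x in zip(factors, dense))
        squares = sum((v[f] * x) ** 2 for v, x in zip(factors, dense))
        output.append(0.5 * (total * total - squares))
    return output
```

With the dimensions given in the text, `bits` would hold the 120 binary features; the width of the dense vector and the number of latent factors are free design parameters in this sketch.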

According to an embodiment of the present disclosure, determining the target key-value pair type according to the target key-value pair may include the following operations.

The target key-value pair is processed by using a classification model to obtain the target key-value pair type. The classification model can be obtained by training a deep learning model by using sample document data.

According to an embodiment of the present disclosure, the classification model may be a model that determines the target key-value pair type for the target document.

According to an embodiment of the present disclosure, the sample document data may be data for training the deep learning model, and may be structured data, semi-structured data, or unstructured data.

According to the embodiment of the disclosure, the target key value pair is input into the classification model, and the target key value pair type is obtained through encoding processing and predictive analysis of the target key value pair.

According to an embodiment of the present disclosure, a deep learning model may include a first language module.

According to an embodiment of the present disclosure, the classification model may be obtained by training the first language module according to the first sample classification result and the sample label value based on the first loss function. The first sample classification result is obtained by processing sample document data using the first language module.

According to an embodiment of the present disclosure, the sample label value may be the true value of the sample document data. The first sample classification result may be a sample classification prediction value obtained by processing the sample document data using the first language module.

According to an embodiment of the present disclosure, the first language module may be a first text classification model. Parameters of the first language module are adjusted based on the first sample classification result and the sample label value, and the first language module is trained based on the adjusted parameters until the first loss function meets a predetermined ending condition, at which point the training of the first language module is finished and the classification model is obtained.

According to the embodiment of the present disclosure, the predetermined ending condition may be that a maximum number of iterations is reached, or that a predetermined convergence condition is satisfied.
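The two ending conditions can be expressed as a small training driver. This is a sketch under assumptions: `step` is a hypothetical callable that performs one parameter update and returns the current loss, and convergence is assumed to mean that the loss changes by less than `tolerance` between consecutive iterations, which the disclosure does not specify.

```python
from typing import Callable


def train_until_done(
    step: Callable[[], float], max_iterations: int, tolerance: float
) -> int:
    """Run training steps until an ending condition holds.

    Stops when the loss change between consecutive steps falls below
    `tolerance` (assumed convergence condition) or when `max_iterations`
    steps have been run. Returns the number of steps performed.
    """
    previous_loss = None
    for iteration in range(1, max_iterations + 1):
        loss = step()  # one parameter update; returns the current loss
        if previous_loss is not None and abs(previous_loss - loss) < tolerance:
            return iteration
        previous_loss = loss
    return max_iterations
```

The same driver would apply unchanged to the second and third loss functions, since only the body of `step` differs between the training variants.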

According to an embodiment of the present disclosure, the deep learning model may include a second language module and a first feature fusion module.

According to an embodiment of the present disclosure, the classification model may be obtained by training the second language module and the first feature fusion module according to the second sample classification result and the sample label value based on the second loss function. The second sample classification result is obtained from the first expected sample feature vector and the first sample feature vector. The first expected sample feature vector is obtained by processing expected sample feature data by using the first feature fusion module. The expected sample feature data is created from the sample document data. The expected sample feature data includes feature data of at least one of an object granularity, a part-of-speech granularity, and a part-of-speech granularity. The first sample feature vector is obtained by processing the sample document data using the second language module.

According to an embodiment of the present disclosure, the second language module may be a second text classification module, and the first feature fusion module may be a first feature fusion optimization module.

According to an embodiment of the present disclosure, the second sample classification result may be a sample prediction classification result determined by a first sample feature vector obtained by processing expected sample data based on the second language module and a first expected sample feature vector obtained by processing the expected sample data based on the first feature fusion module. The sample tag value may be the true value of the sample document data.

According to embodiments of the present disclosure, the expected sample feature data may be created from the sample document data by identification rules derived from bad-case analysis of a general validation set. The expected sample feature data may be multi-dimensional and may include feature data of at least one of object granularity, part-of-speech granularity, and part-of-speech granularity.

According to an embodiment of the present disclosure, the parameters of the second language module and the first feature fusion module are adjusted based on the second sample classification result and the sample label value. The two modules are then trained with the adjusted parameters until the second loss function satisfies the predetermined ending condition, at which point their training is complete and the classification model is obtained.

According to an embodiment of the present disclosure, the first feature fusion module can fuse the expected sample feature data directly into the encoding result of the second language module and learn the optimal proportion between statistical features and semantic features, which improves the learning efficiency of the model. As a result, the statistical features and semantic features of the text data to be predicted are fused better during prediction, and prediction efficiency is improved.
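A minimal sketch of learning a proportion between semantic and statistical features is given below. The scalar sigmoid gate and the function name `fuse` are assumptions; the disclosure does not state how the fusion module combines the two vectors.

```python
import math


def fuse(semantic_vec, statistical_vec, gate_logit):
    """Blend two equal-length feature vectors with a single learnable gate."""
    if len(semantic_vec) != len(statistical_vec):
        raise ValueError("feature vectors must have the same length")
    # The gate is the proportion given to the semantic features (assumed design).
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [gate * s + (1.0 - gate) * t for s, t in zip(semantic_vec, statistical_vec)]
```

In a trained model `gate_logit` would be a parameter updated together with the language module; with `gate_logit = 0.0` both feature sources contribute equally.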

According to an embodiment of the present disclosure, a deep learning model may include a third language module and a first pre-training module.

According to an embodiment of the present disclosure, the classification model may be obtained by training a third language module according to a third sample classification result and a fourth sample classification result based on a third loss function. The third sample classification result is obtained by processing the sample document data using the third language module. The fourth sample classification result is obtained by processing the sample document data by using the first pre-training module.

According to an embodiment of the present disclosure, the third language module may be a third text classification module, and the first pre-training module may be a first pre-training language module. The first pre-training language module has a fixed weight parameter.

According to an embodiment of the present disclosure, the third language module represents general semantics well and can support MLM (Masked Language Model) prediction, but it tends to drift during training on large batches of data and thus over-fit the sample document data. Combining it with the first pre-training module allows the third language module to be trained on the classification task while retaining a language representation capability similar to that of the fixed-weight first pre-training module, which improves the generalization capability of the model.

According to an embodiment of the present disclosure, the third sample classification result may be a sample prediction classification result obtained by processing sample document data based on the third language module. The fourth sample classification result may be a sample classification result obtained by processing the sample document data based on the first pre-training language module.

According to an embodiment of the present disclosure, since the weight parameter of the first pre-training language module is fixed, only the parameters of the third language module are adjusted based on the third sample classification result and the fourth sample classification result. The third language module is then trained with the adjusted parameters until the third loss function satisfies the predetermined ending condition. The training of the third language module is then complete, and the classification model is obtained in combination with the first pre-training language module.
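The comparison between the trainable module and the fixed-weight module can be sketched as a consistency loss. Using a KL divergence over softmax outputs is an assumption; the disclosure names a third loss function without giving its form, and the function names here are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consistency_loss(student_logits, teacher_logits):
    """KL(teacher || student); only the student side would receive gradients."""
    p = softmax(teacher_logits)   # fixed-weight pre-training module
    q = softmax(student_logits)   # trainable language module
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when both modules produce the same distribution and grows as the trainable module drifts away from the fixed one.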

According to an embodiment of the present disclosure, the deep learning model may include a fourth language module, a second feature fusion module, and a second pre-training module.

According to an embodiment of the present disclosure, the classification model includes a fourth language module and a second feature fusion module obtained when model parameters of the fourth language module and the second feature fusion module are adjusted according to the output value until a predetermined end condition is satisfied.

According to an embodiment of the present disclosure, the output value may be determined according to the first output value and the second output value. The first output value is derived from the fifth sample classification result and the sample label value based on the fourth loss function. The second output value is obtained from the sixth sample classification result and the seventh sample classification result based on the fifth loss function. The fifth sample classification result is obtained according to the second expected sample feature vector and the second sample feature vector. The second expected sample feature vector is obtained by processing the expected sample feature data by using a second feature fusion module. Expected sample feature data is created from sample document data. The second sample feature vector is obtained by processing sample document data using a fourth language module. The sixth sample classification result is obtained by processing the sample document data using the fourth language module. The seventh sample classification result is obtained by processing the sample document data using the second pre-training module.

According to an embodiment of the present disclosure, the fourth language module may be a fourth text classification module. The second feature fusion module may be a second feature fusion optimization module. The second pre-training module may be a second pre-training language module. The second pre-training language module has a fixed weight parameter.

According to an embodiment of the present disclosure, the first output value may be an output value obtained by processing the sample document data based on the second feature fusion module and the fourth language module. The second output value may be an output value obtained by processing the sample document data based on the fourth language module and the second pre-training module.

According to an embodiment of the present disclosure, the fifth sample classification result may be a sample prediction classification result determined from the second expected sample feature vector, obtained by processing the expected sample feature data with the second feature fusion module, and the second sample feature vector, obtained by processing the sample document data with the fourth language module. The sample label value may be the true value of the sample document data.

According to an embodiment of the present disclosure, the sixth sample classification result may be a sample prediction classification result obtained by processing the sample document data with the fourth language module. The seventh sample classification result may be a sample classification result obtained by processing the sample document data with the second pre-training language module.

According to an embodiment of the present disclosure, the first output value is determined from the fifth sample classification result and the sample label value based on the fourth loss function. The second output value is determined from the sixth sample classification result and the seventh sample classification result based on the fifth loss function. The output value of the classification model is determined from the first output value and the second output value, and the model parameters of the fourth language module and the second feature fusion module are adjusted according to the output value. The fourth language module and the second feature fusion module are trained with the adjusted model parameters until the predetermined ending condition is satisfied, at which point their training is complete and the classification model is obtained.
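One possible way to write the combined objective is shown below. The weighted-sum form and the coefficient \lambda are assumptions, since the text only states that the output value is determined from the first and second output values. Here \hat{y}^{(5)}, \hat{y}^{(6)}, and \hat{y}^{(7)} denote the fifth, sixth, and seventh sample classification results, y the sample label value, and \mathcal{L}_4, \mathcal{L}_5 the fourth and fifth loss functions.

```latex
\begin{aligned}
O_1 &= \mathcal{L}_4\left(\hat{y}^{(5)},\, y\right), \\
O_2 &= \mathcal{L}_5\left(\hat{y}^{(6)},\, \hat{y}^{(7)}\right), \\
O   &= O_1 + \lambda\, O_2, \qquad \lambda \ge 0 .
\end{aligned}
```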

The training process of the classification model according to embodiments of the present disclosure is further described below with reference to FIGS. 3A to 3D in conjunction with specific embodiments.

Fig. 3A schematically illustrates an example schematic diagram of a training process of a classification model in a case where a deep learning model includes a first language module according to an embodiment of the disclosure.

As shown in FIG. 3A, in 300A, the classification model 302 may include a first language module 302_1. Sample text data 301 may be input into the first language module 302_1, resulting in a first sample classification result 303. The first sample classification result 303 and the sample label value 304 are input to a first loss function 305, resulting in a third output value 306. The model parameters of the first language module 302_1 may be adjusted according to the third output value 306 until a predetermined ending condition is met, resulting in the classification model 302.

Fig. 3B schematically illustrates an example schematic diagram of a training process of a classification model in a case where the deep learning model includes a second language module and a first feature fusion module according to an embodiment of the disclosure.

As shown in FIG. 3B, in 300B, the classification model 308 may include a second language module 308_1 and a first feature fusion module 308_2. The sample document data 307 may be input into the second language module 308_1, resulting in a first sample feature vector 309. The expected sample feature data 310 may be input to the first feature fusion module 308_2, resulting in a first expected sample feature vector 311. From the first sample feature vector 309 and the first expected sample feature vector 311, a first sample fused feature vector 312 is obtained. From the first sample fused feature vector 312, a second sample classification result 313 is obtained. The second sample classification result 313 and the sample label value 314 are input to a second loss function 315, resulting in a fourth output value 316. The model parameters of the second language module 308_1 and the first feature fusion module 308_2 may be adjusted according to the fourth output value 316 until a predetermined ending condition is met, resulting in the classification model 308.

Fig. 3C schematically illustrates an example schematic diagram of a training process of a classification model in a case where the deep learning model includes a third language module and a first pre-training module, according to an embodiment of the present disclosure.

As shown in FIG. 3C, in 300C, the classification model 318 may include a third language module 318_1. The sample document data 317 may be input into the third language module 318_1, resulting in a third sample classification result 319. The sample document data 317 is input into the first pre-training module 320 to obtain a fourth sample classification result 321. The third sample classification result 319 and the fourth sample classification result 321 are input into a third loss function 322, resulting in a fifth output value 323. The model parameters of the third language module 318_1 may be adjusted according to the fifth output value 323 until a predetermined ending condition is met, resulting in the classification model 318.

Fig. 3D schematically illustrates an example schematic diagram of a training process of a classification model in a case where the deep learning model includes a fourth language module, a second feature fusion module, and a second pre-training module according to an embodiment of the disclosure.

As shown in FIG. 3D, in 300D, the classification model 325 may include a fourth language module 325_1 and a second feature fusion module 325_2. The sample document data 324 may be input into the fourth language module 325_1, resulting in a second sample feature vector 326. A sixth sample classification result 327 may be obtained based on the second sample feature vector 326. The expected sample feature data 328 is input into the second feature fusion module 325_2, resulting in a second expected sample feature vector 329. A second sample fused feature vector 330 is obtained from the second sample feature vector 326 and the second expected sample feature vector 329. From the second sample fused feature vector 330, a fifth sample classification result 331 is obtained. The fifth sample classification result 331 and the sample label value 332 are input into the fourth loss function 333, resulting in a first output value 334.

The sample document data 324 may be input to a second pre-training module 335, resulting in a seventh sample classification result 336. The sixth sample classification result 327 and the seventh sample classification result 336 are input into a fifth loss function 337, resulting in a second output value 338. An output value 339 may be derived from the first output value 334 and the second output value 338. The model parameters of the fourth language module 325_1 and the second feature fusion module 325_2 may be adjusted according to the output value 339 until a predetermined ending condition is met, resulting in the classification model 325.

According to an embodiment of the present disclosure, the sample document data may include at least one of: a sample key-value pair and a non-sample key-value pair. The sample key-value pairs may include at least one of: a sample key-value pair formed by an attribute and an attribute value, and a sample key-value pair formed by a noun and its interpretation.

According to an embodiment of the present disclosure, the sample document data is used to train the classification model, and the choice of sample document data is crucial to the capability of the model. First, the labels of the selected sample document data should be as accurate as possible, so that large amounts of noisy data do not affect the model. Second, the distribution of the sample document data should match the distribution of the task target, with data as complete as possible and every pattern covered as far as possible. Finally, sample document data of the same type should not be excessively similar; that is, the data should be diverse so that over-fitting is avoided.

According to embodiments of the present disclosure, the sample document data may include sample key-value pairs and non-sample key-value pairs. The sample key-value pairs may include a sample key-value pair formed by an attribute and an attribute value, i.e., "attribute-value", and a sample key-value pair formed by a noun and its interpretation, i.e., "noun-interpretation". In addition, the sample document data may also include non-sample key-value pairs, i.e., "non-KV".

According to an embodiment of the present disclosure, when the "attribute-value" class sample document data is constructed, the task lacks annotated data and requires a large volume of data, so the data may be generated automatically from data sources with a high confidence level.

According to an embodiment of the present disclosure, constructing the attribute-value class sample document data may include the following operations.

First, the "attribute-value" basic data is derived from the information boxes (infoboxes) of web pages with a high Page View (PV) count. Investigation shows that most high-frequency infoboxes are high-quality, manually reviewed data, but the provider offers no field that directly distinguishes manually reviewed data from policy-generated data. The overall quality is therefore improved through a PV constraint, on the default assumption that infoboxes with a high PV are of higher quality.

Second, sample diversity is improved by increasing the industry coverage of the high-PV web pages. Based on a core-set concept system, the high-PV web pages should cover as many and as fine-grained concepts as possible. Sampling is therefore performed based on the KgIsA label, so that encyclopedia entries corresponding to concepts with fine granularity and a small data volume are retained with a higher probability.

Finally, the high-PV web pages are mostly encyclopedia entries describing general knowledge such as public figures and works, so the distribution of their infobox data is skewed and does not match the general-field and industry data distribution that the method is expected to handle. Equal sampling is therefore used when sampling keys, which increases the coverage of low-frequency keys.
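Equal sampling over keys can be sketched as follows. Interpreting it as a per-key cap is an assumption, and the function name, the `per_key` parameter, and the example keys are hypothetical.

```python
import random
from collections import defaultdict


def equal_sample_by_key(kv_pairs, per_key, seed=0):
    """Keep at most `per_key` pairs for every key so frequent keys do not dominate."""
    rng = random.Random(seed)
    grouped = defaultdict(list)
    for key, value in kv_pairs:
        grouped[key].append((key, value))
    sampled = []
    for key in sorted(grouped):
        group = grouped[key]
        if len(group) > per_key:
            # Down-sample high-frequency keys; low-frequency keys are kept whole.
            group = rng.sample(group, per_key)
        sampled.extend(group)
    return sampled
```

With a cap of 5, a key that occurs 100 times contributes 5 pairs and a key that occurs once still contributes 1, so the relative share of low-frequency keys rises.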

According to an embodiment of the present disclosure, when the "noun-explanation" class sample document data is constructed, machine labeling of the data is difficult, so the sample document data may be generated entirely from the logical relationship, without considering whether the data comes from a KV structure.

According to embodiments of the present disclosure, the selection of "noun-explanation" data sources favors definitions of objects, methods, and functions over explanations of word usage, accounts of a person's experience, and the like.

According to an embodiment of the present disclosure, when the "noun-explanation" class sample document data is constructed, encyclopedia web pages may be selected as the data source and screened based on a concept system. The encyclopedia abstract (i.e., bdbksummary) is an explanation of the encyclopedia title (i.e., bdbklemmitle), but the abstract is a combination of multiple paragraphs of text and does not conform to the data distribution of KV basic documents. For this reason, the noun-explanation rule mining results from encyclopedia text to encyclopedia entries may be used to generate the "noun-explanation" class sample document data.

According to an embodiment of the present disclosure, to keep the model from merely memorizing the high-frequency keys of the "attribute-value" class, "noun-explanation" class sample document data may also be constructed from the mining results of encyclopedia pages for key synonyms, which avoids over-fitting and enhances the generalization of the model.

According to an embodiment of the present disclosure, when the "non-KV" class sample document data is constructed, the non-KV class may be divided into subclasses, and a policy may be designed for each subclass to construct the sample document data. Data completeness should be guaranteed so that the signals of all subclasses reach the model, and data diversity should also be guaranteed so that over-fitting to a single pattern is avoided.

For example, table 3 schematically shows a configuration scheme table of each type of sample document data.

TABLE 3

According to an embodiment of the present disclosure, after the key-value pair class and non-key-value pair class sample document data has been constructed, a test set and a validation set may be constructed in the general field to evaluate the sample document data. The test set is built by sampling 2.5% of the sample document data and is used to measure the effect of model iteration; 2.5% of the sample document data is likewise sampled as the validation set, which is used to select the optimal model.
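The 2.5% / 2.5% sampling can be sketched as a shuffled split. Keeping the test and validation sets disjoint and using a fixed seed are assumptions; the function name is hypothetical.

```python
import random


def split_samples(samples, test_ratio=0.025, valid_ratio=0.025, seed=0):
    """Shuffle once, then carve out disjoint test and validation subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    n_valid = round(len(shuffled) * valid_ratio)
    test = shuffled[:n_test]                      # measures the effect of model iteration
    valid = shuffled[n_test:n_test + n_valid]     # used to select the optimal model
    train = shuffled[n_test + n_valid:]
    return train, valid, test
```

For 1000 samples this yields 950 training samples, 25 validation samples, and 25 test samples.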

For example, industry test data sets for the power industry, the financial industry, and the JG industry may be constructed to verify the generalization capability of the model. The industry test data sets for the power industry and the financial industry may be derived from real business documents; for example, the KV types and the SPOs may be parsed and labeled entirely by hand from the documents with high recall. The test data for the JG industry may be derived from encyclopedia web pages.

For example, table 4 schematically shows various industry test set data statistics tables.

Industry	Data source	Number of documents	Number of KV	Number of SPOs	diff S	diff P
Electric power	Nannet standard document	6	351	166	33	117
Finance	Chinese-to-foreign document	4	1381	937	329	179
JG	Encyclopedic	23	747	691	61	344

TABLE 4

According to an embodiment of the present disclosure, parsing the target document to obtain the target key-value pair may include the following operations.

Sentence division is performed on the target document to obtain a target sentence. Key-value pair division is performed on the target sentence to obtain a target key-value pair.

According to an embodiment of the present disclosure, the target document may be paragraph text that needs KV identification. The target sentence may be text that conforms to a sentence division policy.

According to an embodiment of the present disclosure, performing sentence division on the target document may include: segmenting the target document at sentence granularity based on the basic segmentation strategy, and determining the sentences that satisfy both the basic segmentation strategy and the sentence granularity segmentation strategy as target sentences.

According to an embodiment of the present disclosure, performing key-value pair division on the target sentence may include: dividing the target sentence into key-value pairs based on a KV granularity segmentation strategy, and determining the key-value pairs that satisfy the KV granularity strategy as target key-value pairs.

According to an embodiment of the present disclosure, performing sentence division on the target document to obtain the target sentence may include dividing the target document based on sentence division separators to obtain the target sentences.

According to an embodiment of the present disclosure, the sentence division separators include a first-level sentence division separator and a second-level sentence division separator.

According to an embodiment of the present disclosure, dividing the target document based on the sentence division separators to obtain the target sentences may include the following operations.

The target document is divided based on the first-level sentence division separator to obtain intermediate sentences. In the case where a second-level sentence division separator exists in an intermediate sentence, the intermediate sentence is further divided to obtain the target sentences.

According to embodiments of the present disclosure, sentence separators may be used to identify the location of word separators in the sentence text.

According to embodiments of the present disclosure, the first-level sentence division separator may be a period, an exclamation mark (full-width or half-width), or a question mark (full-width or half-width). The second-level sentence division separator may be applied when the segmented text contains 0 or more than 2 colons.

According to an embodiment of the present disclosure, the target document may be divided into sentences based on the basic segmentation strategy and the sentence granularity segmentation strategy. Under the basic segmentation strategy, two stacks are initialized before the target document is traversed. A symbol stack stores the left brackets encountered during the traversal; when a matching right bracket is encountered, the top element of the stack is removed. A character stack stores the characters encountered during the traversal; when a separator is encountered and the symbol stack is empty, the contents of the character stack are emitted as one segment.

According to an embodiment of the present disclosure, the text characters of the target document are traversed in order. If a left bracket is encountered, it is pushed onto the symbol stack. If a right bracket is encountered, it is matched against the top element of the symbol stack and that element is removed, so that left and right brackets are fully matched. If ordinary text is encountered, it is added to the character stack. If a separator is encountered and the symbol stack is empty, the contents of the character stack are joined and the result is added to the output set as an intermediate sentence. If a left bracket is never matched by a right bracket, an error is reported and processing exits.

According to an embodiment of the present disclosure, bracket matching receives special handling when the target document is divided into sentences: division is performed only when the left and right brackets are fully matched; otherwise an error is reported and processing exits. This avoids segmentation errors caused by noise that brackets introduce. For example, when "Boeing company (English name: boeing)" is segmented with the colon as the separator, handling the bracket matching yields the segments "English name" and "boeing"; without it, the text is cut into the erroneous segments "Boeing company (English name" and "boeing)".
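The two-stack traversal can be sketched as below. The bracket table and the function name are assumptions. The sketch only shows that a separator inside matched brackets does not trigger a split and that unmatched brackets raise an error; how the bracketed content itself is later divided into "English name" and "boeing" is not covered here.

```python
# Assumed bracket pairs; the disclosure does not enumerate them.
LEFT_TO_RIGHT = {"(": ")", "（": "）", "[": "]", "【": "】", "《": "》"}
RIGHT_BRACKETS = set(LEFT_TO_RIGHT.values())


def split_outside_brackets(text, separators):
    """Split `text` on `separators`, ignoring separators inside matched brackets."""
    symbol_stack = []   # unmatched left brackets seen so far
    char_stack = []     # characters of the segment currently being built
    pieces = []
    for ch in text:
        if ch in LEFT_TO_RIGHT:
            symbol_stack.append(ch)
            char_stack.append(ch)
        elif ch in RIGHT_BRACKETS:
            if not symbol_stack or LEFT_TO_RIGHT[symbol_stack[-1]] != ch:
                raise ValueError(f"unmatched right bracket {ch!r}")
            symbol_stack.pop()
            char_stack.append(ch)
        elif ch in separators and not symbol_stack:
            # Separator at top level: emit the character stack as one segment.
            pieces.append("".join(char_stack))
            char_stack = []
        else:
            char_stack.append(ch)
    if symbol_stack:
        raise ValueError(f"unmatched left bracket {symbol_stack[-1]!r}")
    pieces.append("".join(char_stack))
    return [p.strip() for p in pieces if p.strip()]
```

With the colon as separator, `"Boeing company (English name: boeing)"` is returned as a single segment instead of being cut into "Boeing company (English name" and "boeing)".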

According to an embodiment of the present disclosure, when the sentence granularity segmentation strategy is used, the basic segmentation strategy is called to divide the target document into intermediate sentences. In the case where a second-level sentence division separator exists in an intermediate sentence, it is determined whether the intermediate sentence contains 0 or more than 2 colons; if so, the intermediate sentence is divided based on the second-level sentence division separator to obtain the target sentences.

According to an embodiment of the present disclosure, performing key-value pair division on the target sentence to obtain the target key-value pair may include: dividing the target sentence based on a key-value pair division separator to obtain the target key-value pair, where the key-value pair division separator includes a colon.

According to embodiments of the present disclosure, the key-value pair division separator may be a half-width or full-width colon.

According to an embodiment of the present disclosure, dividing the target sentence into key-value pairs based on the key-value pair division separator may include: calling the basic segmentation strategy; dividing the target sentence in the case where its left and right brackets are fully matched; determining that the sentence text contains a half-width or full-width colon as a separator; and dividing the sentence text at KV granularity into KV two-tuples, which are determined as the target key-value pairs. In addition, owing to the sentence granularity segmentation strategy, a text sentence contains at most one KV two-tuple by default.
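The sentence division and key-value pair division steps can be chained as in the following sketch. Several choices are assumptions: the concrete first-level characters, the use of semicolons as second-level separators (the disclosure does not list them), the bracket depth counter used in place of an explicit stack, and the rule that a sentence yields a pair only when it splits into exactly two parts.

```python
FIRST_LEVEL = "。!！?？"   # assumed first-level sentence separators
SECOND_LEVEL = ";；"       # assumption: not listed in the disclosure
COLONS = ":："             # half-width and full-width colon


def split_top_level(text, separators):
    """Split on separators that are not nested inside round brackets."""
    depth, current, pieces = 0, [], []
    for ch in text:
        if ch in "(（":
            depth += 1
        elif ch in ")）":
            depth -= 1
            if depth < 0:
                raise ValueError("unmatched right bracket")
        if ch in separators and depth == 0:
            pieces.append("".join(current))
            current = []
        else:
            current.append(ch)
    if depth != 0:
        raise ValueError("unmatched left bracket")
    pieces.append("".join(current))
    return [p.strip() for p in pieces if p.strip()]


def divide_sentences(document):
    """First-level split, then a second-level split when 0 or more than 2 colons occur."""
    targets = []
    for sentence in split_top_level(document, FIRST_LEVEL):
        colon_count = sum(sentence.count(c) for c in COLONS)
        has_second = any(c in sentence for c in SECOND_LEVEL)
        if has_second and (colon_count == 0 or colon_count > 2):
            targets.extend(split_top_level(sentence, SECOND_LEVEL))
        else:
            targets.append(sentence)
    return targets


def divide_key_values(document):
    """Return at most one (key, value) tuple per target sentence."""
    pairs = []
    for sentence in divide_sentences(document):
        parts = split_top_level(sentence, COLONS)
        if len(parts) == 2:          # exactly one top-level colon
            pairs.append((parts[0], parts[1]))
    return pairs
```

A sentence such as "Rated voltage: 10kV; Rated current: 5A; Frequency: 50Hz" contains three colons, so it is split at the semicolons and produces three pairs; a sentence whose only colon sits inside brackets produces none.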

According to the embodiment of the disclosure, obtaining the target document according to the document to be processed may include the following operations.

A document processing interface corresponding to the document to be processed is called. The document to be processed is processed using the document processing interface to obtain a document interface class object. The target document is obtained from the document interface class object.

In accordance with an embodiment of the present disclosure, the document processing interface is a general interface provided when a document is preprocessed.

According to an embodiment of the present disclosure, the document to be processed is parsed using the document processing interface to obtain the document interface class object. The document interface class object may include a document representation class (i.e., the Document class) and a document paragraph representation class (i.e., the Node class).

According to an embodiment of the present disclosure, the Document class structure includes a meta dictionary that is empty by default, so that key information on which subsequent modules depend can be passed through. For example, table 5 schematically shows a document representation class structure attribute information table.

Attribute name	Type	Definition
title	str	Title name of document
root	Node	Document root node
nodes	list[Node]	Document full-volume node list
kvs	list[KVpair]	Full KV list of documents
meta	dict	Other information

TABLE 5
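The attributes in Table 5 could be expressed as Python dataclasses, as in the sketch below. The fields of `KVPair` and `Node` are hypothetical, since their structure is not given in the text above; only the `Document` attribute names and types follow Table 5.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class KVPair:
    # Hypothetical fields: the structure of a KV pair is not specified above.
    key: str
    value: str


@dataclass
class Node:
    # Hypothetical fields: the Node attributes are not reproduced above.
    text: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Document:
    title: str                                             # title name of the document
    root: Optional[Node] = None                            # document root node
    nodes: List[Node] = field(default_factory=list)        # full node list
    kvs: List[KVPair] = field(default_factory=list)        # full KV list
    meta: Dict[str, Any] = field(default_factory=dict)     # other information, empty by default
```

Using `default_factory` gives every `Document` its own empty `meta` dictionary, so a processing script can add keys without affecting other documents.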

According to an embodiment of the present disclosure, the Node class (one node per line break) is the basic element from which the document tree is built. For example, table 6 schematically shows an attribute information table of the Node class structure.

TABLE 6

According to the embodiment of the disclosure, the document to be processed is analyzed and processed by using the document processing interface, and after the document interface class object is obtained, the target document is determined based on the document interface class object.

According to the embodiment of the disclosure, processing the document to be processed by using the document processing interface to obtain the document interface class object may include the following operations.

The document to be processed is processed using the document processing script corresponding to the document to be processed to obtain the document interface class object.

According to the embodiment of the disclosure, the document to be processed and the document processing script corresponding to the document to be processed can be input into the document preprocessing module, and the document to be processed is converted into the interface class object by using the document processing script, so that the document interface class object is obtained.

According to an embodiment of the present disclosure, the document interface class object may include at least one of a Document class and a Node class. The Document class may be used to store the document object produced by the document processing interface. The Node class may be used as the basic element from which the document tree is built. Building the document tree hierarchy may include creating nodes and creating edges. All of these steps are completed by the document processing script.

According to embodiments of the present disclosure, a document processing script may return a Document class object as a unified interface, for two reasons. First, the processing flow for input documents of different types and structures is simplified: a user only needs to configure a hierarchical parsing script and pass it in as configuration. Second, the user has more freedom within the hierarchical parsing script, because the class design includes the meta field, which lets the user pass key fields directly from the script to subsequent modules without modifying code.
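The configuration-driven choice of a processing script might look like the following sketch. The registry `SCRIPTS`, the script `parse_plain_text`, the meta key `source_format`, and the simplified `Document` (nodes reduced to strings) are all hypothetical and serve only to illustrate the unified return type.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Document:
    # Simplified stand-in for the Document class: nodes are plain strings here.
    title: str
    nodes: List[str] = field(default_factory=list)
    meta: Dict[str, Any] = field(default_factory=dict)


def parse_plain_text(raw: str) -> Document:
    """Hypothetical script: first non-empty line is the title, the rest are nodes."""
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    doc = Document(title=lines[0] if lines else "", nodes=lines[1:])
    # Key fields for later modules travel in meta, so no downstream code changes.
    doc.meta["source_format"] = "plain_text"
    return doc


# The script to use is chosen by configuration rather than hard-coded.
SCRIPTS: Dict[str, Callable[[str], Document]] = {"plain_text": parse_plain_text}


def process_document(raw: str, script_name: str) -> Document:
    """Run the configured script; every script returns the same Document type."""
    if script_name not in SCRIPTS:
        raise ValueError(f"no processing script configured for {script_name!r}")
    return SCRIPTS[script_name](raw)
```

Supporting a new document type then amounts to registering one more script under a new name.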

FIG. 4 schematically shows a flow diagram of a knowledge-graph generation method according to an embodiment of the disclosure.

As shown in FIG. 4, the method 400 may include operations S410-S440.

In operation S410, entity identification is performed on the target document to obtain a target entity.

In operation S420, a key value result is generated by using the key value generation method.

In operation S430, a knowledge unit is generated according to the key-value result and the target entity.

In operation S440, a knowledge graph is generated from the knowledge units.

According to an embodiment of the present disclosure, the target entity may be the entity to which a key-value pair in the current document belongs. Performing entity identification on the target document may include identifying entities in the target document using an entity extraction module and outputting the target entity related to the key-value pair.

According to an embodiment of the present disclosure, the key value result may include the target key-value pair and the type of the target key-value pair. The flow for generating the key value result is described in detail above in connection with the key value generation method and is not repeated here.

According to embodiments of the present disclosure, a knowledge unit may be composed of a target entity and a key-value result, where the target entity is described, defined, or illustrated by a target key-value pair and a target key-value pair type stored in the key-value result.

According to the embodiment of the disclosure, after the target entity and the target key value result in the target document are obtained, the obtained key value result and the target entity are associated through the entity association module, the extraction of the SPO is completed, and the knowledge unit is generated. In this process, one target entity may correspond to multiple key-value pairs, or each target entity may correspond to one key-value pair.
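A knowledge unit of this kind can be sketched as a small data structure. This is an illustration under assumptions: the class names `KeyValueResult` and `KnowledgeUnit`, the method `to_spo`, the type label `"attribute"`, and the sample data are hypothetical, and the mapping of key to predicate and value to object is one plausible reading of the SPO extraction.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class KeyValueResult:
    key: str
    value: str
    kv_type: str  # hypothetical type label, e.g. "attribute"


@dataclass
class KnowledgeUnit:
    entity: str
    results: List[KeyValueResult]

    def to_spo(self) -> List[Tuple[str, str, str]]:
        # Subject = target entity, predicate = key, object = value.
        return [(self.entity, r.key, r.value) for r in self.results]


# One target entity associated with several key-value pairs.
unit = KnowledgeUnit(
    entity="Transformer T1",
    results=[
        KeyValueResult("rated voltage", "110 kV", "attribute"),
        KeyValueResult("cooling mode", "oil-immersed", "attribute"),
    ],
)
triples = unit.to_spo()
```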

According to the embodiment of the disclosure, the target entity is obtained by performing entity identification on the target document. The target document is parsed to obtain a target key value pair and a target key value pair type, and a key value result is generated from them. A knowledge unit is then generated according to the target entity and the key value result, so as to obtain a knowledge graph. Determining the types of the key value pairs supports the key value pair generation requirements of each industry, so the method can be reused on the documents to be processed of each industry, and associating the key value pairs with the target entity generates the basic SPO data.

According to the embodiment of the disclosure, performing entity identification on the target document to obtain the target entity may include the following operations.

An entity identification area and an entity identification strategy are determined according to the preset configuration information. According to the entity identification strategy, entity identification is performed on the entity identification area of the target document to obtain a target entity.

According to the embodiment of the disclosure, the entity extraction module can mark the entities related to key-value (KV) pairs. Because entity extraction cannot be generalized across industries, it is driven by preset configuration information, which supports basic function configurations such as caching, named entity models, and rule identification.

In accordance with an embodiment of the present disclosure, the predetermined configuration information may be information that determines an entity identification area and an entity identification policy. The entity identification area may be an area where target entity identification is performed on the target document. The entity identification policy may be a policy for target entity identification of a target document.

According to the embodiment of the disclosure, the entity identification is carried out on the target entity identification area of the target document based on the entity identification strategy, so as to obtain the target entity of the target entity identification area in the target document.

According to embodiments of the present disclosure, the entity identification policy may include at least one of: a named entity identification policy, a rule identification policy, and a blacklist identification policy.

According to embodiments of the present disclosure, entity recognition on industry documents typically relies on an NER (Named Entity Recognition) model trained on an industry corpus. The named entity identification policy may include a named entity identification scope and a named entity identification interface address. In addition, in some entity identification tasks the entities come from manual labeling results, so a cache configuration can be set, which may comprise a cache active area and a cache file address. For example, Table 7 schematically shows configuration items of the named entity identification policy.

TABLE 7

According to the embodiment of the disclosure, calling an NER interface or using a manually labeled cache requires a large accumulation of industry data, manual labeling, or training, and is therefore not suitable for the cold start stage of an industry. In contrast, the rule identification policy supports rapid output and correction of data through the configuration of matching rules, and can therefore be adapted to the industry cold start stage. For example, Table 8 schematically shows configuration items of the rule identification policy.

TABLE 8

According to the embodiment of the disclosure, the blacklist identification policy may support deleting erroneous entities or entities of erroneous categories through configured regular expression rules, so as to increase entity identification accuracy. For example, Table 9 schematically shows configuration items of the blacklist identification policy.

TABLE 9
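The three policies can be combined in a single configuration-driven pass. The following is a minimal sketch only: the configuration keys (`region`, `ner_cache`, `rules`, `blacklist`), the function `identify_entities`, and the sample title are hypothetical and are not the configuration items of Tables 7 to 9. The named entity policy is served here from an offline cache rather than from a real NER service.

```python
import re
from typing import Dict, List

TITLE = "Riverside substation circuit breaker report 2021"

# Hypothetical configuration; every key name is an assumption.
CONFIG = {
    "region": "title",
    "ner_cache": {TITLE: ["circuit breaker", "2021"]},
    "rules": {"exact": ["circuit breaker"], "suffix": ["substation"]},
    "blacklist": [r"^\d{4}$"],
}


def identify_entities(document: Dict[str, str], config: dict) -> List[str]:
    # Restrict identification to the configured area of the document.
    text = document[config["region"]]
    found: List[str] = []
    # 1. Named entity policy, here read from an offline cache.
    found.extend(config["ner_cache"].get(text, []))
    # 2. Rule policy: exact phrases, then words ending with a configured suffix.
    for phrase in config["rules"]["exact"]:
        if phrase in text:
            found.append(phrase)
    for suffix in config["rules"]["suffix"]:
        found.extend(re.findall(r"\b\w*" + re.escape(suffix) + r"\b", text))
    # 3. Blacklist policy: drop candidates matching any regular expression.
    patterns = [re.compile(p) for p in config["blacklist"]]
    kept = [e for e in found if not any(p.search(e) for p in patterns)]
    # Remove duplicates while preserving the order of first appearance.
    return list(dict.fromkeys(kept))


entities = identify_entities({"title": TITLE}, CONFIG)
```

In this sketch the cached candidate "2021" is removed by the blacklist, and the duplicate "circuit breaker" found by both the cache and the exact rule is reported once.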

According to an embodiment of the disclosure, generating a knowledge unit according to a key-value result and a target entity may include the following operations.

In the case where the target key-value pair type is determined to be the expected key-value pair type, the target key-value pair is associated with the target entity, based on the preset association order and the position of the target entity in the target document, to obtain a knowledge unit.

In accordance with embodiments of the present disclosure, the expected key-value pair type may be a key-value pair type associated with the target entity. The preset association order may be the order in which associations are made according to the location of the target entity in the target document. For example, a location may be a sentence, a paragraph title, a chapter title, and so forth. Each location has a corresponding association priority.

According to an embodiment of the present disclosure, associating the target key-value pair with the target entity may include realizing the association between the target key-value pair and the target entity through an entity association policy. The entity association policy supports configuration of the association order. For example, Table 10 schematically shows configuration items of the entity association policy.

TABLE 10
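As an illustration of a configurable association order, the sketch below walks the locations from highest to lowest priority and associates the key-value pair with the first entity found. The location names, the function `associate`, the type labels, and the sample data are hypothetical; the actual configuration items are those of the entity association policy.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

# Hypothetical location names, listed from highest to lowest priority.
ASSOCIATION_ORDER = ("user_area", "section_title", "document_title")


def associate(
    kv: Dict[str, str],
    entities_by_location: Dict[str, List[str]],
    expected_types: Set[str],
    order: Sequence[str] = ASSOCIATION_ORDER,
) -> Optional[Tuple[str, str, str]]:
    """Return an (entity, key, value) knowledge unit, or None if not associated."""
    # Only key-value pairs of an expected type are associated.
    if kv["type"] not in expected_types:
        return None
    # Walk the locations in priority order and take the first entity found.
    for location in order:
        candidates = entities_by_location.get(location, [])
        if candidates:
            return (candidates[0], kv["key"], kv["value"])
    return None


entities = {"document_title": ["Substation A"], "section_title": ["Transformer T1"]}
pair = {"key": "rated voltage", "value": "110 kV", "type": "attribute"}
unit = associate(pair, entities, expected_types={"attribute"})
skipped = associate({**pair, "type": "noise"}, entities, expected_types={"attribute"})
```

Here the section title outranks the document title, so the pair is attached to "Transformer T1"; a pair whose type is not expected is left unassociated.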

According to embodiments of the present disclosure, the configuration of the JG industry may be taken as an example. For the NER configuration, since the JG industry does not deploy an NER service, NER results are read in through an offline cache. To supplement entities missed by NER, two exact matching rules on the title (i.e., theme) area are set for entity extraction. To address NER accuracy, identification rules that remove bk (i.e., encyclopedia) prefixes are set for the value (i.e., value taking) area, and rules that exclude the time, troops and xx types are set for the type area. In the entity association stage, the priorities from low to high are the document title, the title hierarchy, and the user designated area.

According to the embodiment of the disclosure, the configuration of the power industry may be taken as an example. For the NER configuration, the NER result is obtained by calling an API interface. To supplement entities missed by NER, exact matching and suffix matching are set for the document title and the title hierarchy. In the entity association stage, the priorities from low to high are the document title, the title hierarchy, and the user designated area.

FIG. 5 schematically shows an example schematic diagram of generating a knowledge-graph according to an embodiment of the disclosure.

As shown in FIG. 5, in diagram 500, a document to be processed 502 is processed by a document processing script 501 to obtain a target document 503. The target document 503 is parsed to obtain target key-value pairs 504. From the target key-value pair 504, a target key-value pair type 505 is determined. A key-value result 506 is determined from the target key-value pair 504 and the target key-value pair type 505. Entity recognition is performed on the target document 503 to obtain a target entity 507. A knowledge unit 508 is generated based on the key-value result 506 and the target entity 507. From the knowledge unit 508, a knowledge graph 509 is generated.

In the technical solution of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good customs.

The above are only exemplary embodiments, and the disclosure is not limited thereto; other key value generation methods and knowledge graph generation methods known in the art may be included as long as the definition of the extraction target, the industry reusability, and the development efficiency can be improved.

Fig. 6 schematically shows a block diagram of a key-value generation apparatus according to an embodiment of the present disclosure.

As shown in fig. 6, the key value generation apparatus 600 may include a first obtaining module 610, a parsing module 620, a determining module 630, and a second obtaining module 640.

The first obtaining module 610 is configured to obtain a target document according to a document to be processed.

The parsing module 620 is configured to parse the target document to obtain a target key value pair.

The determining module 630 is configured to determine the type of the target key-value pair according to the target key-value pair.

The second obtaining module 640 is configured to obtain a key value result for the document to be processed according to the target key value pair and the type of the target key value pair.

According to an embodiment of the present disclosure, the determining module 630 may include an extracting sub-module and a first determining sub-module.

The extraction submodule is configured to perform feature extraction on the target key value pair to obtain a target key value pair feature vector.

The first determining submodule is configured to determine the type of the target key-value pair according to the target key-value pair feature vector.

According to an embodiment of the present disclosure, the extraction sub-module may include an object encoding unit, a position encoding unit, a fragment encoding unit, and an acquisition unit.

The object encoding unit is configured to perform object encoding on the objects in the target key value pair to obtain a target object feature vector.

The position encoding unit is configured to perform position encoding on the target key value pair to obtain a target position feature vector.

The fragment encoding unit is configured to perform fragment encoding on the target key value pair to obtain a target fragment feature vector.

The obtaining unit is configured to obtain the target key value pair feature vector according to the target object feature vector, the target position feature vector and the target fragment feature vector.
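How the three feature vectors might be combined can be sketched as follows. This is an assumption-laden illustration: the element-wise sum, the treatment of the key and the value as two fragments, the toy `embed` function standing in for learned embedding tables, and the names `encode_key_value`, `DIM` and `vocab` are all hypothetical; the disclosure states only that the feature vector is obtained according to the three component vectors.

```python
from typing import Dict, List

DIM = 4  # toy embedding width


def embed(index: int, table_id: int, dim: int = DIM) -> List[float]:
    """Deterministic stand-in for looking up a learned embedding table."""
    return [((index + 1) * (table_id + 1) * (d + 1) % 7) / 7.0 for d in range(dim)]


def encode_key_value(
    key_tokens: List[str], value_tokens: List[str], vocab: Dict[str, int]
) -> List[List[float]]:
    tokens = key_tokens + value_tokens
    # Fragment ids: 0 for tokens of the key, 1 for tokens of the value (assumed).
    fragments = [0] * len(key_tokens) + [1] * len(value_tokens)
    vectors = []
    for position, (token, fragment) in enumerate(zip(tokens, fragments)):
        obj = embed(vocab[token], table_id=0)   # object encoding
        pos = embed(position, table_id=1)       # position encoding
        frag = embed(fragment, table_id=2)      # fragment encoding
        # Assumed combination: element-wise sum of the three vectors.
        vectors.append([o + p + f for o, p, f in zip(obj, pos, frag)])
    return vectors


vocab = {"voltage": 0, "110": 1, "kV": 2}
vectors = encode_key_value(["voltage"], ["110", "kV"], vocab)
```

The result holds one vector per token, each of width `DIM`.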

According to an embodiment of the present disclosure, the first determination submodule may include a creation unit, an extraction unit, and a determination unit.

The creating unit is configured to create expected feature data according to the target key value pair.

The extraction unit is configured to perform feature extraction on the expected feature data to obtain an expected feature vector.

The determining unit is configured to determine the type of the target key value pair according to the target key value pair feature vector and the expected feature vector.

According to an embodiment of the present disclosure, the extraction unit may include a dense coding subunit and a decomposition subunit.

The dense coding subunit is configured to perform dense coding on the expected feature data to obtain dense feature vectors.

The decomposition subunit is configured to perform factorization on the dense feature vectors to obtain the expected feature vector.

According to an embodiment of the present disclosure, the determining module 630 may include a first processing sub-module.

The first processing submodule is configured to process the target key value pair by using the classification model to obtain the type of the target key value pair. The classification model is obtained by training a deep learning model by using sample document data.

According to an embodiment of the present disclosure, a deep learning model includes a first language module.

According to the embodiment of the disclosure, the classification model is obtained by training the first language module according to the first sample classification result and the sample label value based on the first loss function. The first sample classification result is obtained by processing sample document data using the first language module.

According to an embodiment of the present disclosure, a deep learning model includes a second language module and a first feature fusion module.

According to the embodiment of the disclosure, the classification model is obtained by training the second language module and the first feature fusion module according to the second sample classification result and the sample label value based on the second loss function. The second sample classification result is derived from the first expected sample feature vector and the first sample feature vector. The first expected sample feature vector is obtained by processing expected sample feature data by using the first feature fusion module. The expected sample feature data is created from the sample document data. The expected sample feature data includes feature data of at least one of an object granularity, a part-of-speech granularity, and a part-of-speech granularity. The first sample feature vector is obtained by processing the sample document data using the second language module.

According to an embodiment of the present disclosure, a deep learning model includes a third language module and a first pre-training module.

According to the embodiment of the disclosure, the classification model is obtained by training a third language module according to a third sample classification result and a fourth sample classification result based on a third loss function. The third sample classification result is obtained by processing the sample document data using the third language module. The fourth sample classification result is obtained by processing the sample document data by using the first pre-training module.
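One plausible reading of this arrangement is teacher-student training, in which the third language module learns to reproduce the classification results of the first pre-training module. The sketch below rests on that assumption: the third loss function is not specified here, so the Kullback-Leibler divergence between softmax outputs is used purely as an example, and the function names and logits are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float], teacher_logits: List[float]) -> float:
    """KL(teacher || student) over class probabilities (an assumed choice of loss)."""
    p = softmax(teacher_logits)   # result of the pre-training module
    q = softmax(student_logits)   # result of the language module being trained
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [2.0, 0.5, -1.0]
loss_same = distillation_loss(teacher, teacher)
loss_diff = distillation_loss([0.0, 0.0, 0.0], teacher)
```

The loss is zero when the two modules agree exactly and grows as their predicted class distributions diverge.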

According to an embodiment of the present disclosure, the deep learning model includes a fourth language module, a second feature fusion module, and a second pre-training module.

According to an embodiment of the present disclosure, the output value is determined according to the first output value and the second output value. The first output value is obtained from the fifth sample classification result and the sample label value based on the second and fourth loss functions. The second output value is obtained from the sixth sample classification result and the seventh sample classification result based on the third and fifth loss functions. The fifth sample classification result is obtained according to the second expected sample feature vector and the second sample feature vector. The second expected sample feature vector is obtained by processing the expected sample feature data by using a second feature fusion module. Expected sample feature data is created from sample document data. The second sample feature vector is obtained by processing the sample document data using the fourth language module. The sixth sample classification result is obtained by processing the sample document data using the fourth language module. And the seventh sample classification result is obtained by processing the sample document data by using the second pre-training module.

According to an embodiment of the present disclosure, the sample document data includes at least one of: a sample key-value pair and a non-sample key-value pair. The sample key-value pairs include at least one of: a sample key-value pair formed by an attribute and an attribute value, and a sample key-value pair formed by a noun and a noun explanation.

The parsing module 620 may include a first partitioning submodule and a second partitioning submodule according to an embodiment of the present disclosure.

The first partitioning submodule is configured to perform sentence division on the target document to obtain a target sentence.

The second partitioning submodule is configured to perform key-value pair division on the target sentence to obtain a target key-value pair.

According to an embodiment of the present disclosure, the first partitioning submodule may include a first dividing unit.

The first dividing unit is configured to perform sentence division on the target document based on sentence division delimiters to obtain the target sentence.

According to an embodiment of the present disclosure, the sentence division delimiter includes a first level sentence division delimiter and a second level sentence division delimiter.

According to an embodiment of the present disclosure, the first dividing unit may include a first dividing subunit and a second dividing subunit.

The first dividing subunit is configured to perform sentence division on the target document based on the first-level sentence division delimiter to obtain an intermediate sentence.

The second dividing subunit is configured to perform sentence division on the intermediate sentence, in the case where the second-level sentence division delimiter exists in the intermediate sentence, to obtain the target sentence.

According to an embodiment of the present disclosure, the second partitioning submodule may include a second dividing unit.

The second dividing unit is configured to perform key-value pair division on the target sentence based on a key-value pair division delimiter to obtain a target key-value pair. The key-value pair division delimiter includes a colon.
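The two-level sentence division followed by key-value pair division can be sketched as below. Only the colon delimiter is stated in the disclosure; the choice of a line break or "。" as the first-level delimiter and a semicolon as the second-level delimiter, together with the function names and the sample text, are assumptions made for illustration.

```python
import re
from typing import List, Tuple

FIRST_LEVEL = re.compile(r"[\n。]")   # assumed first-level sentence delimiters
SECOND_LEVEL = re.compile(r"[;；]")   # assumed second-level sentence delimiters
KV_DELIMITER = re.compile(r"[:：]")   # colon, half-width or full-width


def split_sentences(text: str) -> List[str]:
    sentences: List[str] = []
    for intermediate in FIRST_LEVEL.split(text):
        # Divide again only when a second-level delimiter is present.
        if SECOND_LEVEL.search(intermediate):
            parts = SECOND_LEVEL.split(intermediate)
        else:
            parts = [intermediate]
        sentences.extend(p.strip() for p in parts if p.strip())
    return sentences


def split_key_values(sentences: List[str]) -> List[Tuple[str, str]]:
    pairs: List[Tuple[str, str]] = []
    for sentence in sentences:
        # Split on the first colon only, so values may themselves contain colons.
        parts = KV_DELIMITER.split(sentence, maxsplit=1)
        if len(parts) == 2 and parts[0].strip() and parts[1].strip():
            pairs.append((parts[0].strip(), parts[1].strip()))
    return pairs


text = "Rated voltage: 110 kV; Cooling mode: oil-immersed\nThis line has no pair"
sentences = split_sentences(text)
pairs = split_key_values(sentences)
```

Sentences without a colon yield no candidate pair; whether a candidate is a genuine key-value pair is decided later by the type determination step.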

According to an embodiment of the present disclosure, the first obtaining module 610 may include a calling sub-module, a second processing sub-module, and an obtaining sub-module.

The calling submodule is configured to call the document processing interface corresponding to the document to be processed.

The second processing submodule is configured to process the document to be processed by using the document processing interface to obtain a document interface class object.

The obtaining submodule is configured to obtain the target document according to the document interface class object.

According to an embodiment of the present disclosure, the second processing submodule may include a processing unit.

The processing unit is configured to process the document to be processed by using the document processing script corresponding to the document to be processed to obtain the document interface class object.

FIG. 7 schematically shows a block diagram of a knowledge-graph generating apparatus according to an embodiment of the present disclosure.

As shown in fig. 7, the knowledge-graph generating apparatus 700 may include an entity identification module 710, a first generating module 720, a second generating module 730, and a third generating module 740.

The entity identification module 710 is configured to perform entity identification on the target document to obtain a target entity.

The first generating module 720 is configured to generate a key value result by using the key value generation apparatus.

The second generating module 730 is configured to generate a knowledge unit according to the key-value result and the target entity.

The third generating module 740 is configured to generate a knowledge graph according to the knowledge unit.

According to an embodiment of the present disclosure, the entity identification module 710 may include a second determination submodule and an identification submodule.

The second determining submodule is configured to determine the entity identification area and the entity identification strategy according to the preset configuration information.

The identification submodule is configured to perform entity identification on the entity identification area of the target document according to the entity identification strategy to obtain a target entity.

According to an embodiment of the present disclosure, the entity identification policy includes at least one of: a named entity identification policy, a rule identification policy, and a blacklist identification policy.

According to an embodiment of the present disclosure, the key-value result includes a target key-value pair and a target key-value pair type.

According to an embodiment of the present disclosure, the second generation module 730 may include an association submodule.

The association submodule is configured to, in the case where the target key-value pair type is determined to be the expected key-value pair type, associate the target key-value pair with the target entity, based on the preset association order and the position of the target entity in the target document, to obtain the knowledge unit.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.

According to an embodiment of the present disclosure, a non-transitory computer readable storage medium has stored thereon computer instructions for causing a computer to perform the method as described above.

According to an embodiment of the disclosure, a computer program product comprises a computer program which, when executed by a processor, implements the method as described above.

Fig. 8 schematically illustrates a block diagram of an electronic device suitable for implementing a key-value generation method and a knowledge-graph generation method according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 801 may be any of a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 executes the respective methods and processes described above, such as the key value generation method and the knowledge graph generation method. For example, in some embodiments, the key-value generation method and the knowledge-graph generation method may be implemented as computer software programs tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When loaded into the RAM 803 and executed by the computing unit 801, a computer program may perform one or more steps of the key-value generation method and the knowledge-graph generation method described above. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the key-value generation method and the knowledge-graph generation method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device for displaying information to a user (e.g., CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A key value generation method, comprising:

obtaining a target document according to the document to be processed;

analyzing the target document to obtain a target key value pair;

determining the type of the target key-value pair according to the target key-value pair; and

obtaining a key value result for the document to be processed according to the target key value pair and the type of the target key value pair.

2. The method of claim 1, wherein the determining the type of the target key value pair according to the target key value pair comprises:

performing feature extraction on the target key value pair to obtain a target key value pair feature vector; and

determining the type of the target key value pair according to the target key value pair feature vector.

3. The method of claim 2, wherein the performing feature extraction on the target key-value pair to obtain a target key-value pair feature vector comprises:

performing object encoding on an object in the target key value pair to obtain a target object feature vector;

performing position encoding on the target key value pair to obtain a target position feature vector;

performing segment encoding on the target key value pair to obtain a target segment feature vector; and

obtaining the target key value pair feature vector according to the target object feature vector, the target position feature vector and the target segment feature vector.
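By way of non-limiting illustration, the three encodings of claim 3 can be combined as in the following sketch. It assumes that an "object" is a token, that the key and the value form two segments, and that the three vectors are combined by element-wise addition in the manner of common Transformer input embeddings; the claim itself only states that the feature vector is obtained "according to" the three vectors. The function `embed` is a hypothetical, deterministic stand-in for learned lookup tables.

```python
from typing import Dict, List

DIM = 4  # toy embedding width (assumption)


def embed(index: int, salt: int) -> List[float]:
    """Deterministic toy embedding standing in for a learned lookup table."""
    return [((index + 1) * (salt + d + 1) % 7) / 7.0 for d in range(DIM)]


def encode_pair(
    key_tokens: List[str], value_tokens: List[str], vocab: Dict[str, int]
) -> List[List[float]]:
    tokens = key_tokens + value_tokens
    # segment 0 marks the key, segment 1 marks the value
    segments = [0] * len(key_tokens) + [1] * len(value_tokens)
    vectors = []
    for position, (token, segment) in enumerate(zip(tokens, segments)):
        obj = embed(vocab[token], salt=0)   # object (token) feature vector
        pos = embed(position, salt=10)      # position feature vector
        seg = embed(segment, salt=20)       # segment feature vector
        vectors.append([o + p + s for o, p, s in zip(obj, pos, seg)])
    return vectors


vocab = {"color": 0, "red": 1}
vectors = encode_pair(["color"], ["red"], vocab)
```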

4. The method of claim 2 or 3, wherein the determining the type of the target key value pair according to the target key value pair feature vector comprises:

creating expected feature data according to the target key value pair;

performing feature extraction on the expected feature data to obtain an expected feature vector; and

determining the type of the target key value pair according to the target key value pair feature vector and the expected feature vector.

5. The method of claim 4, wherein the performing feature extraction on the expected feature data to obtain an expected feature vector comprises:

performing dense encoding on the expected feature data to obtain a dense feature vector; and

performing factorization on the dense feature vector to obtain the expected feature vector.
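By way of non-limiting illustration, claim 5 can be sketched as follows. The claim does not specify the factorization; this sketch assumes a factorization-machine-style second-order interaction, in which each output dimension equals 0.5 * ((sum of v)^2 - sum of v^2) over the dense vectors v. The functions `dense_encode` and `factorize` are hypothetical names, and the toy lookup in `dense_encode` stands in for a learned embedding table.

```python
from typing import List


def dense_encode(feature_ids: List[int], dim: int = 3) -> List[List[float]]:
    """Map each sparse feature id to a dense vector (toy deterministic lookup)."""
    return [[((fid + 1) * (d + 2) % 5) / 5.0 for d in range(dim)] for fid in feature_ids]


def factorize(dense: List[List[float]]) -> List[float]:
    """Pool pairwise feature interactions per dimension (FM-style, an assumption)."""
    dim = len(dense[0])
    pooled = []
    for d in range(dim):
        total = sum(v[d] for v in dense)
        total_of_squares = sum(v[d] ** 2 for v in dense)
        # equals the sum over all feature pairs i < j of v_i[d] * v_j[d]
        pooled.append(0.5 * (total * total - total_of_squares))
    return pooled


expected_feature_vector = factorize(dense_encode([0, 3, 7]))
```

For two input vectors the pooled result reduces to their element-wise product, which gives a quick sanity check on the formula.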

6. The method of claim 2, wherein the determining a target key-value pair type from the target key-value pair comprises:

processing the target key value pair by using a classification model to obtain the type of the target key value pair, wherein the classification model is obtained by training a deep learning model by using sample document data.

7. The method of claim 6, wherein the deep learning model comprises a first language module;

the classification model is obtained by training the first language module according to a first sample classification result and a sample label value based on a first loss function;

the first sample classification result is obtained by processing the sample document data by the first language module.

8. The method of claim 6, wherein the deep learning model comprises a second language module and a first feature fusion module;

the classification model is obtained by training the second language module and the first feature fusion module according to a second sample classification result and a sample label value based on a second loss function;

the second sample classification result is obtained according to the first expected sample feature vector and the first sample feature vector;

the first expected sample feature vector is obtained by processing the expected sample feature data by using the first feature fusion module, the expected sample feature data is created according to the sample document data, and the expected sample feature data comprises feature data with at least one of object granularity, part-of-speech granularity and part-of-speech granularity;

the first sample feature vector is obtained by processing the sample document data by the second language module.

9. The method of claim 6, wherein the deep learning model comprises a third language module and a first pre-training module;

the classification model is obtained by training the third language module according to a third sample classification result and a fourth sample classification result based on a third loss function;

the third sample classification result is obtained by processing the sample document data by using the third language module;

the fourth sample classification result is obtained by processing the sample document data by the first pre-training module.

10. The method of claim 6, wherein the deep learning model comprises a fourth language module, a second feature fusion module, and a second pre-training module;

the classification model comprises the fourth language module and the second feature fusion module obtained by adjusting model parameters of the fourth language module and the second feature fusion module according to an output value until a predetermined end condition is met;

the output value is determined according to a first output value and a second output value;

the first output value is obtained according to a fifth sample classification result and a sample label value based on a fourth loss function;

the second output value is obtained according to a sixth sample classification result and a seventh sample classification result based on a fifth loss function;

the fifth sample classification result is obtained according to the second expected sample feature vector and the second sample feature vector;

the second expected sample feature vector is obtained by processing the expected sample feature data by using the second feature fusion module, and the expected sample feature data is created according to the sample document data;

the second sample feature vector is obtained by processing the sample document data by using the fourth language module;

the sixth sample classification result is obtained by processing the sample document data by the fourth language module;

the seventh sample classification result is obtained by processing the sample document data using the second pre-training module.
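By way of non-limiting illustration, the output value of claim 10 can be sketched as the combination of a supervised term and a term that compares the fourth language module with the second pre-training module. The claim does not name the loss functions or how the two output values are combined; this sketch assumes cross-entropy for the fourth loss function, KL divergence for the fifth loss function, and a weighted sum controlled by a hypothetical weight `alpha`.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], label: int) -> float:
    return -math.log(probs[label])


def kl_divergence(reference: List[float], candidate: List[float]) -> float:
    return sum(r * math.log(r / c) for r, c in zip(reference, candidate) if r > 0)


def output_value(
    fused_logits: List[float],        # fifth sample classification result
    language_logits: List[float],     # sixth result (fourth language module)
    pretrained_logits: List[float],   # seventh result (second pre-training module)
    label: int,                       # sample label value
    alpha: float = 0.5,               # hypothetical weighting between the two terms
) -> float:
    first = cross_entropy(softmax(fused_logits), label)
    second = kl_divergence(softmax(pretrained_logits), softmax(language_logits))
    return alpha * first + (1.0 - alpha) * second


loss = output_value([2.0, 0.5], [1.5, 0.2], [1.8, 0.1], label=0)
```

When the two modules produce identical distributions, the second term vanishes and only the supervised term drives the parameter update.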

11. The method according to any one of claims 6 to 10, wherein the sample document data includes at least one of: a sample key-value pair and a non-sample key-value pair, the sample key-value pair comprising at least one of: a sample key-value pair formed by an attribute and an attribute value, and a sample key-value pair formed by a noun and a noun interpretation.

12. The method according to any one of claims 1 to 11, wherein the parsing the target document to obtain a target key-value pair includes:

performing sentence division on the target document to obtain a target sentence; and

performing key-value pair division on the target sentence to obtain the target key-value pair.

13. The method of claim 12, wherein the performing sentence division on the target document to obtain a target sentence comprises:

performing sentence division on the target document based on a sentence division separator to obtain the target sentence.

14. The method of claim 13, wherein the sentence division separator comprises a first-level sentence division separator and a second-level sentence division separator;

wherein the performing sentence division on the target document based on the sentence division separator to obtain the target sentence comprises:

performing sentence division on the target document based on the first-level sentence division separator to obtain an intermediate sentence; and

in a case where the second-level sentence division separator exists in the intermediate sentence, performing sentence division on the intermediate sentence to obtain the target sentence.

15. The method according to any one of claims 12 to 14, wherein the performing key-value pair division on the target sentence to obtain the target key-value pair comprises:

performing key-value pair division on the target sentence based on a key-value pair division separator to obtain the target key-value pair, wherein the key-value pair division separator comprises a colon.
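By way of non-limiting illustration, the two-level sentence division and colon-based key-value pair division of claims 12 to 15 can be sketched as follows. Only the colon is named in the claims; the first-level separators (line break, full stop "。") and second-level separators (semicolons) used here are assumptions, as are the names `split_sentences` and `split_key_values`.

```python
import re
from typing import List, Tuple

FIRST_LEVEL = r"[\n。]"     # assumed first-level sentence division separators
SECOND_LEVEL = r"[;；]"     # assumed second-level sentence division separators
KV_SEPARATOR = r"[:：]"     # colon, in ASCII and full-width forms


def split_sentences(document: str) -> List[str]:
    sentences = []
    for intermediate in re.split(FIRST_LEVEL, document):
        # If no second-level separator is present, the intermediate
        # sentence passes through unchanged as a single part.
        parts = re.split(SECOND_LEVEL, intermediate)
        sentences.extend(p.strip() for p in parts if p.strip())
    return sentences


def split_key_values(sentences: List[str]) -> List[Tuple[str, str]]:
    pairs = []
    for sentence in sentences:
        # Split at the first colon only, so values such as "10:30" stay intact.
        parts = re.split(KV_SEPARATOR, sentence, maxsplit=1)
        if len(parts) == 2 and parts[0].strip() and parts[1].strip():
            pairs.append((parts[0].strip(), parts[1].strip()))
    return pairs


document = "Name: Widget; Color: red\nNo separator here\nSize：10 cm"
sentences = split_sentences(document)
pairs = split_key_values(sentences)
```

Sentences without a colon yield no candidate pair; whether a candidate is a genuine key-value pair is left to the type determination step.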

16. The method according to any one of claims 1 to 14, wherein the obtaining a target document according to a document to be processed comprises:

calling a document processing interface corresponding to the document to be processed;

processing the document to be processed by using the document processing interface to obtain a document interface class object; and

obtaining the target document according to the document interface class object.

17. The method of claim 16, wherein the processing the document to be processed by using the document processing interface to obtain a document interface class object comprises:

processing the document to be processed by using a document processing script corresponding to the document to be processed to obtain the document interface class object.
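By way of non-limiting illustration, claims 16 and 17 can be sketched as a registry that selects a processing routine by document type. The names `DocumentObject`, `PROCESSORS`, and `to_target_document` are hypothetical, selecting the routine by file extension is an assumption, and only plain-text and HTML inputs are handled in this sketch.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from pathlib import PurePath
from typing import Callable, Dict, List


@dataclass
class DocumentObject:
    """Hypothetical stand-in for the document interface class object."""
    paragraphs: List[str]


class _TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.chunks.append(data.strip())


def process_text(raw: str) -> DocumentObject:
    return DocumentObject([line.strip() for line in raw.splitlines() if line.strip()])


def process_html(raw: str) -> DocumentObject:
    collector = _TextCollector()
    collector.feed(raw)
    collector.close()
    return DocumentObject(collector.chunks)


# One processing routine per document type (assumption: keyed by extension).
PROCESSORS: Dict[str, Callable[[str], DocumentObject]] = {
    ".txt": process_text,
    ".html": process_html,
}


def to_target_document(filename: str, raw: str) -> str:
    processor = PROCESSORS[PurePath(filename).suffix.lower()]
    return "\n".join(processor(raw).paragraphs)


target = to_target_document("spec.html", "<p>Name: Widget</p><p>Color: red</p>")
```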

18. A method of knowledge-graph generation, comprising:

performing entity identification on a target document to obtain a target entity;

generating a key value result using the method of any one of claims 1-17;

generating a knowledge unit according to the key value result and the target entity; and

generating a knowledge graph according to the knowledge unit.

19. The method of claim 18, wherein the performing entity identification on a target document to obtain a target entity comprises:

determining an entity identification area and an entity identification strategy according to preset configuration information; and

performing, according to the entity identification strategy, entity identification on the entity identification area of the target document to obtain the target entity.

20. The method of claim 19, wherein the entity identification strategy comprises at least one of: a named entity identification strategy, a rule identification strategy, and a blacklist identification strategy.
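By way of non-limiting illustration, the configuration-driven entity identification of claims 19 and 20 can be sketched with a rule strategy followed by a blacklist strategy. The configuration keys, the regular expression, and the choice of the first lines of the document as the identification area are all assumptions; a named entity identification strategy would require a trained model and is omitted here.

```python
import re
from typing import Any, Dict, List

# Hypothetical preset configuration information.
CONFIG: Dict[str, Any] = {
    "region_lines": 2,                        # identification area: first N lines
    "pattern": r"\b[A-Z][a-z]+ [A-Z]\d+\b",   # rule strategy: e.g. "Widget X100"
    "blacklist": {"Table A1"},                # blacklist strategy: never an entity
}


def recognize_entities(document: str, config: Dict[str, Any]) -> List[str]:
    # Restrict recognition to the configured area of the document.
    region = "\n".join(document.splitlines()[: config["region_lines"]])
    # Rule strategy: collect every match of the configured pattern.
    candidates = re.findall(config["pattern"], region)
    # Blacklist strategy: drop candidates that are known non-entities.
    return [c for c in candidates if c not in config["blacklist"]]


document = "Widget X100 datasheet\nSee Table A1\nGadget Y200 appears later"
entities = recognize_entities(document, CONFIG)
```

In this example "Table A1" matches the rule but is removed by the blacklist, and "Gadget Y200" lies outside the configured area.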

21. The method of any of claims 18-20, wherein the key-value result includes a target key-value pair and a target key-value pair type;

wherein the generating a knowledge unit according to the key value result and the target entity comprises:

in a case where the target key-value pair type is determined to be an expected key-value pair type, associating the target key-value pair with the target entity based on a preset association order and the position of the target entity in the target document, to obtain the knowledge unit.
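By way of non-limiting illustration, the association step of claim 21 can be sketched as follows. The sketch assumes that the preset association order is "an entity precedes the key-value pairs that describe it", so each key-value pair of an expected type is attached to the nearest entity at or before its position. The name `build_knowledge_units` and the use of character offsets as positions are hypothetical.

```python
from typing import List, Set, Tuple

Entity = Tuple[str, int]              # (entity name, position in target document)
TypedPair = Tuple[str, str, str, int]  # (key, value, pair type, position)


def build_knowledge_units(
    entities: List[Entity], pairs: List[TypedPair], expected_types: Set[str]
) -> List[Tuple[str, str, str]]:
    ordered = sorted(entities, key=lambda entity: entity[1])
    units = []
    for key, value, pair_type, position in pairs:
        if pair_type not in expected_types:
            continue  # only expected key-value pair types become knowledge
        preceding = [name for name, pos in ordered if pos <= position]
        if preceding:
            # Assumed association order: attach to the nearest preceding entity.
            units.append((preceding[-1], key, value))
    return units


entities = [("Widget A", 0), ("Widget B", 50)]
pairs = [
    ("Color", "red", "attribute", 10),
    ("Note", "n/a", "other", 20),
    ("Color", "blue", "attribute", 60),
]
units = build_knowledge_units(entities, pairs, {"attribute"})
```

Each resulting triple (entity, key, value) can then serve as a knowledge unit from which the knowledge graph is generated.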

22. A key-value generation apparatus comprising:

a first acquisition module configured to obtain a target document according to a document to be processed;

a parsing module configured to parse the target document to obtain a target key value pair;

a determining module configured to determine the type of the target key value pair according to the target key value pair; and

a second acquisition module configured to obtain a key value result for the document to be processed according to the target key value pair and the type of the target key value pair.

23. A knowledge-graph generating apparatus comprising:

an entity identification module configured to perform entity identification on a target document to obtain a target entity;

a first generation module configured to generate a key value result using the apparatus of claim 22;

a second generation module configured to generate a knowledge unit according to the key value result and the target entity; and

a third generation module configured to generate a knowledge graph according to the knowledge unit.

24. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 21.

25. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method according to any one of claims 1-21.

26. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 21.