CN117725220A

CN117725220A - Method, server and storage medium for document characterization and document retrieval

Info

Publication number: CN117725220A
Application number: CN202311378060.2A
Authority: CN
Inventors: 张凯; 宋凯嵩; 康杨杨; 刘晓钟
Original assignee: Hangzhou AliCloud Feitian Information Technology Co Ltd
Current assignee: Hangzhou AliCloud Feitian Information Technology Co Ltd
Priority date: 2023-10-23
Filing date: 2023-10-23
Publication date: 2024-03-19

Abstract

The application provides a method for document characterization and document retrieval, a server and a storage medium. According to the method, a plurality of documents to be processed and document association graphs of the plurality of documents are obtained, wherein the document association graphs comprise nodes corresponding to the documents and edges representing association relations among the documents; the method comprises the steps of inputting content information of a plurality of documents and a document association graph into a document representation model, performing diagrammatical learning through the document representation model based on the document association graph initialized by using semantic representations of the documents, updating feature representations of the nodes, taking the updated feature representations of the nodes as document representations of corresponding documents, learning semantic information of single documents and related information among the documents based on the document association graph, improving quality of the document representations, and further improving accuracy of document retrieval, recommendation and classification based on the document representations.

Description

Method, server and storage medium for document characterization and document retrieval

技术领域Technical field

本申请涉及计算机技术，尤其涉及一种文档表征和文档检索的方法、服务器及存储介质。The present application relates to computer technology, and in particular to a method, server and storage medium for document characterization and document retrieval.

背景技术Background technique

在实际应用场景中很多文档并不是独立存在，比如含有超链接的文档、具有引用关系的科技论文、具有共同招/投标人的标书等等。现实应用中大多数与多文档场景有关，比如科技论文、标书等文档的检索、推荐、分类、多文档总结等场景中，均涉及到具有关联关系的多文档。In actual application scenarios, many documents do not exist independently, such as documents containing hyperlinks, scientific papers with citation relationships, tender documents with common bidders, etc. Most of the real-life applications are related to multi-document scenarios, such as retrieval, recommendation, classification, multi-document summary and other scenarios of scientific papers, tender documents and other documents, all involving multiple documents with associated relationships.

在传统的文档表征学习中，技术人员通常比较关注单文档中语义信息的获取，主要使用语言模型基于单文档在词、句子等层面的语义信息生成文档表示，无法学习到文档间的相关性信息，文档表示的质量较低，进而导致基于文档表示的检索、推荐、分类的精准度低。In traditional document representation learning, technicians usually focus on the acquisition of semantic information in a single document. They mainly use language models to generate document representations based on the semantic information of a single document at the word, sentence and other levels, and cannot learn the correlation information between documents. , the quality of document representation is low, which leads to low accuracy in retrieval, recommendation, and classification based on document representation.

发明内容Contents of the invention

本申请提供一种文档表征和文档检索的方法、服务器及存储介质，用以解决现有的文档表征方法获得的文档表示的质量较低，导致下游基于文档表示的检索、推荐、分类的精准度低的问题。This application provides a method, server and storage medium for document characterization and document retrieval to solve the problem of low quality of document representation obtained by existing document characterization methods, which results in the accuracy of downstream retrieval, recommendation and classification based on document representation. low question.

第一方面，本申请提供一种文档表征方法，包括：In the first aspect, this application provides a document characterization method, including:

获取待处理的多个文档，以及所述多个文档的文档关联图，所述文档关联图包括各所述文档对应的节点和表示文档间关联关系的边；将所述多个文档的内容信息和所述文档关联图输入文档表示模型，通过所述文档表示模型基于使用各所述文档的语义表示初始化的文档关联图进行图表征学习，更新各节点的特征表示；将更新后各节点的特征表示作为对应文档的文档表示。Obtain multiple documents to be processed and the document association graph of the multiple documents. The document association graph includes nodes corresponding to each of the documents and edges representing the association relationships between documents; convert the content information of the multiple documents The document association graph is input into a document representation model, and the document representation model performs graph representation learning based on the document association graph initialized using the semantic representation of each document, and updates the feature representation of each node; the updated features of each node are Represents a document representation as a corresponding document.

第二方面，本申请提供一种文档检索方法，包括：In the second aspect, this application provides a document retrieval method, including:

响应于文档检索请求，获取用户输入的查询信息；将所述查询信息映射为向量表示，将所述向量表示与文档检索库中各文档的文档表示进行相似度匹配，得到与所述查询信息匹配的至少一个目标文档；输出所述至少一个目标文档的信息；其中，文档检索库中各文档的文档表示是通过如下方式确定的：获取文档检索库中待表征的多个文档，以及所述多个文档的文档关联图，所述文档关联图包括各所述文档对应的节点和表示文档间关联关系的边；将所述多个文档和所述文档关联图输入文档表示模型，通过所述文档表示模型基于使用各所述文档的语义表示初始化的文档关联图进行图表征学习，更新各节点的特征表示；将更新后各节点的特征表示作为对应文档的文档表示。In response to the document retrieval request, obtain the query information input by the user; map the query information into a vector representation, perform similarity matching between the vector representation and the document representation of each document in the document retrieval database, and obtain a match with the query information at least one target document; output information of the at least one target document; wherein the document representation of each document in the document retrieval library is determined in the following manner: obtaining multiple documents to be represented in the document retrieval library, and the multiple documents A document association graph for each document, the document association graph including nodes corresponding to each of the documents and edges representing the association between documents; inputting the multiple documents and the document association graph into a document representation model, through the documents The representation model performs graph representation learning based on the document association graph initialized using the semantic representation of each document, and updates the feature representation of each node; the updated feature representation of each node is used as the document representation of the corresponding document.

第三方面，本申请提供一种服务器，包括：In a third aspect, this application provides a server, including:

至少一个处理器；以及与所述至少一个处理器通信连接的存储器；其中，所述存储器存储有可被所述至少一个处理器执行的指令，所述指令被所述至少一个处理器执行，以使所述服务器执行第一方面或第二方面所述的方法。at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to The server is caused to execute the method described in the first aspect or the second aspect.

第四方面，本申请提供一种计算机可读存储介质，所述计算机可读存储介质中存储有计算机执行指令，当处理器执行所述计算机执行指令时，实现如第一方面或第二方面所述的方法。In a fourth aspect, the present application provides a computer-readable storage medium. Computer-executable instructions are stored in the computer-readable storage medium. When the processor executes the computer-executable instructions, the implementation as described in the first aspect or the second aspect is achieved. method described.

本申请提供的文档表征和文档检索的方法、服务器及存储介质，通过获取待处理的多个文档，以及多个文档的文档关联图，文档关联图包括各文档对应的节点和表示文档间关联关系的边；将多个文档的内容信息和文档关联图输入文档表示模型，通过文档表示模型基于使用各所述文档的语义表示初始化的文档关联图进行图表征学习，更新各节点的特征表示，将更新后各节点的特征表示作为对应文档的文档表示，不仅可以学习到单文档的语义信息，还可以基于多文档的文档关联图学习到文档间的相关性信息，提升了文档表示的质量，进而可以提升基于文档表示的文档检索、推荐、分类的精准度。The document characterization and document retrieval methods, servers and storage media provided by this application obtain multiple documents to be processed and document association graphs of the multiple documents. The document association graph includes nodes corresponding to each document and represents the association between documents. edge; input the content information and document association graph of multiple documents into the document representation model, perform graph representation learning based on the document association graph initialized using the semantic representation of each document through the document representation model, update the feature representation of each node, and The updated feature representation of each node is used as the document representation of the corresponding document. Not only can the semantic information of a single document be learned, but also the correlation information between documents can be learned based on the document association graph of multiple documents, which improves the quality of the document representation and further It can improve the accuracy of document retrieval, recommendation, and classification based on document representation.

附图说明Description of the drawings

此处的附图被并入说明书中并构成本说明书的一部分，示出了符合本申请的实施例，并与说明书一起用于解释本申请的原理。The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

图1为本申请所适用的一示例系统架构的示意图；Figure 1 is a schematic diagram of an example system architecture applicable to this application;

图2为本申请所适用的另一示例系统架构的示意图；Figure 2 is a schematic diagram of another example system architecture applicable to this application;

图3为本申请一示例性实施例提供的文档表征方法的流程图；Figure 3 is a flow chart of a document characterization method provided by an exemplary embodiment of the present application;

图4为本申请一示例性实施例提供的文档表示模型的示例架构图；Figure 4 is an example architecture diagram of a document representation model provided by an exemplary embodiment of the present application;

图5为本申请一示例性实施例提供的语义信息学习模块的结构示例图；Figure 5 is an example structural diagram of a semantic information learning module provided by an exemplary embodiment of the present application;

图6为本申请一示例性实施例提供的文档表征模型的结构示例图；Figure 6 is an example structural diagram of a document representation model provided by an exemplary embodiment of the present application;

图7为本申请一示例性实施例提供的文本表示模型的训练方法流程图；Figure 7 is a flow chart of a training method for a text representation model provided by an exemplary embodiment of the present application;

图8为本申请一示例性实施例提供的文档表征模型的训练框架示例图；Figure 8 is an example diagram of the training framework of the document representation model provided by an exemplary embodiment of the present application;

图9为本申请一示例性实施例提供的一种文档表征方法的流程图；Figure 9 is a flow chart of a document characterization method provided by an exemplary embodiment of the present application;

图10为本申请另一示例性实施例提供的文档检索方法的流程图；Figure 10 is a flow chart of a document retrieval method provided by another exemplary embodiment of the present application;

图11为本申请实施例提供的一种服务器的结构示意图。Figure 11 is a schematic structural diagram of a server provided by an embodiment of the present application.

通过上述附图，已示出本申请明确的实施例，后文中将有更详细的描述。这些附图和文字描述并不是为了通过任何方式限制本申请构思的范围，而是通过参考特定实施例为本领域技术人员说明本申请的概念。Through the above-mentioned drawings, clear embodiments of the present application have been shown, which will be described in more detail below. These drawings and text descriptions are not intended to limit the scope of the present application's concepts in any way, but are intended to illustrate the application's concepts for those skilled in the art with reference to specific embodiments.

具体实施方式Detailed ways

这里将详细地对示例性实施例进行说明，其示例表示在附图中。下面的描述涉及附图时，除非另有表示，不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反，它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的装置和方法的例子。Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the appended claims.

需要说明的是，本申请所涉及的用户信息(包括但不限于用户设备信息、用户属性信息等)和数据(包括但不限于用于分析的数据、存储的数据、展示的数据等)，均为经用户授权或者经过各方充分授权的信息和数据，并且相关数据的收集、使用和处理需要遵守相关法律法规和标准，并提供有相应的操作入口，供用户选择授权或者拒绝。It should be noted that the user information (including but not limited to user device information, user attribute information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in this application are all It is information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data need to comply with relevant laws, regulations and standards, and a corresponding operation entrance is provided for the user to choose to authorize or refuse.

首先对本申请所涉及的名词进行解释：First, the terms involved in this application will be explained:

文档：计算机用于，一般将各类文字编辑软件产生的文件叫做文档。本实施例中，文档是指记录了文本内容的文件，文档数据包含文件内容，以及文件的创建人、创建时间、类型、用途、所属领域等相关信息。Documents: When computers are used, files generated by various text editing software are generally called documents. In this embodiment, a document refers to a file that records text content. The document data includes the file content, as well as relevant information such as the creator, creation time, type, purpose, and field of the file.

图数据(Graph)：也即图，是一个有序二元组，表示为G＝(V，E)，其中V＝{v₁,...,v_N}表示节点集，表示边集。图中的节点具有特征表示，边表示所连接的两个节点间的关联关系。Graph data (Graph): that is, the graph is an ordered tuple, expressed as G = (V, E), where V = {v ₁ ,..., v _N } represents the node set, Represents the edge set. The nodes in the graph have characteristic representations, and the edges represent the association between the two connected nodes.

表征学习(Representation Learning，简称RL)：是自然语言处理中的一个子任务，主要关注如何将文本中的语义信息等映射成可被用于下游任务的向量表示。Representation Learning (RL): It is a subtask in natural language processing. It mainly focuses on how to map semantic information in text into vector representations that can be used for downstream tasks.

图神经网络(Graph Neural Network，简称GNN)：是指使用神经网络来学习图结构数据，提取和发掘图结构数据中的特征和模式，满足聚类、分类、预测、分割、生成等图学习任务需求的算法总称，主要关注在如何将神经网络的学习范式应用在图数据结构上，进而学习图上节点和拓扑结构的信息。Graph Neural Network (GNN): refers to the use of neural networks to learn graph-structured data, extract and explore features and patterns in graph-structured data, and meet graph learning tasks such as clustering, classification, prediction, segmentation, and generation. The general name of the required algorithm mainly focuses on how to apply the learning paradigm of neural network to the graph data structure, and then learn the information of the nodes and topology structure on the graph.

沃瑟斯坦距离(Wasserstein Distance，简称WD)：是用于度量两个统计分布相似性的距离。Wasserstein Distance (WD): It is a distance used to measure the similarity of two statistical distributions.

图池化：是一种操作，用于将具有相似结构的图分解为小的节点集合。它可以通过最大池化或平均池化等全局池化操作来选出代表整张图表示的节点信息。池化操作会缩小数据量，通过某种算法减少节点数目，从而实现一层一层的抽取。全局池化只包含读出层，而读出层可以利用池化操作将整张图的表示读出。全局池化常用于图分类等任务中。Graph pooling: is an operation used to decompose a graph with a similar structure into a small collection of nodes. It can select node information that represents the entire graph through global pooling operations such as maximum pooling or average pooling. The pooling operation will reduce the amount of data and reduce the number of nodes through a certain algorithm, thereby achieving layer-by-layer extraction. Global pooling only includes the readout layer, and the readout layer can use the pooling operation to read out the representation of the entire graph. Global pooling is often used in tasks such as graph classification.

预训练大语言模型：对大规模语言模型(Large Language Model，简称LLM)进行预训练后得到的预训练模型。Pre-trained large language model: a pre-trained model obtained by pre-training a large language model (LLM).

在传统的文档表征学习中，技术人员通常比较关注单文档中语义信息的获取。但在实际应用场景中很多文档并不是独立存在，比如含有超链接的文档、具有引用关系的科技论文、具有共同招/投标人的标书等等。现实应用中大多数与多文档场景有关，比如科技论文、标书等文档的检索、推荐、分类、多文档总结等场景中，均涉及到具有关联关系的多文档。In traditional document representation learning, technicians usually focus on acquiring semantic information in a single document. However, in actual application scenarios, many documents do not exist independently, such as documents containing hyperlinks, scientific papers with citation relationships, tender documents with common bidders, etc. Most of the real-life applications are related to multi-document scenarios, such as retrieval, recommendation, classification, multi-document summary and other scenarios of scientific papers, tender documents and other documents, all involving multiple documents with associated relationships.

传统的文档表征学习方案，主要使用语言模型基于单文档在词、句子等层面的语义信息生成文档表示，无法学习到文档间的相关性信息，文档表示的质量较低，进而导致基于文档表示的检索、推荐、分类的精准度低。Traditional document representation learning solutions mainly use language models to generate document representations based on the semantic information of a single document at the word, sentence and other levels. However, the correlation information between documents cannot be learned, and the quality of the document representation is low, which in turn leads to document representation-based The accuracy of retrieval, recommendation, and classification is low.

针对上述技术问题，本申请提供一种文档表征方法，获取待处理的多个文档，以及多个文档的文档关联图，文档关联图包括各文档对应的节点和表示文档间关联关系的边；将多个文档的内容信息和文档关联图输入文档表示模型，通过文档表示模型基于各文档的内容信息生成各文档的语义表示，使用各文档的语义表示初始化文档关联图中各节点的特征表示，基于初始化后的文档关联图进行图表征学习，更新各节点的特征表示，将更新后各节点的特征表示作为对应文档的文档表示，不仅可以学习到单文档的语义信息，还可以基于多文档的文档关联图学习到文档间的相关性信息，提升了文档表示的质量，进而可以提升基于文档表示的文档检索、推荐、分类的精准度。In response to the above technical problems, this application provides a document characterization method to obtain multiple documents to be processed and the document association graph of the multiple documents. The document association graph includes nodes corresponding to each document and edges representing the association between documents; The content information and document association graph of multiple documents are input into the document representation model. The document representation model generates the semantic representation of each document based on the content information of each document. The semantic representation of each document is used to initialize the feature representation of each node in the document association graph. Based on The initialized document association graph is used for graph representation learning, the feature representation of each node is updated, and the updated feature representation of each node is used as the document representation of the corresponding document. Not only can the semantic information of a single document be learned, but also documents based on multiple documents can be learned. The association graph learns the correlation information between documents, which improves the quality of document representation, and thereby improves the accuracy of document retrieval, recommendation, and classification based on document representation.

图1为本申请所适用的一示例系统架构的示意图。如图1所示，该系统架构包括服务器和端侧设备。其中，服务器与端侧设备之间具有可通信的通信链路，能够实现服务器与端侧设备间的通信连接。Figure 1 is a schematic diagram of an example system architecture applicable to this application. As shown in Figure 1, the system architecture includes servers and end-side devices. Among them, there is a communicable communication link between the server and the end-side device, which can realize the communication connection between the server and the end-side device.

其中，服务器是部署在云端或本地的具有计算能力的设备，例如云集群等。服务器存储有训练好的文档表示模型，可基于文档表示模型实现文档表征功能。另外，服务器还可以负责实现文档表示模型的训练。Among them, the server is a device with computing capabilities deployed in the cloud or locally, such as a cloud cluster, etc. The server stores a trained document representation model and can implement the document representation function based on the document representation model. In addition, the server can also be responsible for implementing the training of the document representation model.

端侧设备可以是运行下游应用的电子设备，具体可以为具有网络通信功能、运算功能以及信息显示功能的硬件设备，其包括但不限于智能手机、平板电脑、台式电脑、本地服务器、云端服务器等。端侧设备运行下游应用时需要使用文档表示模型的文档表示能力。端侧设备运行的下游应用可以是实现文档检索、文档推荐、文档分类等至少一种文档处理任务的应用系统，例如标书搜索、标书分类、文献检索、文献分类等。在实现下游应用的功能时，需要使用文档表示模型的文档表示能力，基于给定文档集中的多个文档生成各个文档的高质量的文档表示，进一步地基于各个文档的文档表示，实现下游应用的文档处理任务。End-side devices can be electronic devices that run downstream applications. Specifically, they can be hardware devices with network communication functions, computing functions, and information display functions, including but not limited to smartphones, tablets, desktop computers, local servers, cloud servers, etc. . When the end-side device runs downstream applications, it needs to use the document representation capability of the document representation model. The downstream application run by the end-side device may be an application system that implements at least one document processing task such as document retrieval, document recommendation, and document classification, such as bid document search, bid document classification, document retrieval, document classification, etc. When implementing the functions of downstream applications, it is necessary to use the document representation capability of the document representation model to generate high-quality document representations of each document based on multiple documents in a given document set. Further, based on the document representation of each document, the downstream application can be implemented Document processing tasks.

基于图1所示的系统架构，端侧设备向服务器发送待处理的多个文档。服务器接收待处理的多个文档并构建多个文档的文档关联图，其中，档关联图包括各文档对应的节点和表示文档间关联关系的边。进一步地，服务器将多个文档的内容信息和文档关联图输入文档表示模型，通过文档表示模型基于使用各文档的语义表示初始化的文档关联图进行图表征学习，更新各节点的特征表示，将更新后各节点的特征表示作为对应文档的文档表示。服务器将多个文档的文档表示返回至端侧设备。Based on the system architecture shown in Figure 1, the end-side device sends multiple documents to be processed to the server. The server receives multiple documents to be processed and constructs a document association graph of the multiple documents. The document association graph includes nodes corresponding to each document and edges representing the association between documents. Further, the server inputs the content information and document association graphs of multiple documents into the document representation model, performs graph representation learning through the document representation model based on the document association graph initialized using the semantic representation of each document, updates the feature representation of each node, and updates the The feature representation of each node is then used as the document representation of the corresponding document. The server returns the document representations of the multiple documents to the end-side device.

端侧设备接收服务器返回的多个文档的文档表示，根据多个文档的文档表示继续执行下游应用的文档处理逻辑，实现下游应用的文档处理任务，获得文档处理结果。The end-side device receives the document representations of multiple documents returned by the server, continues to execute the document processing logic of the downstream application based on the document representations of the multiple documents, implements the document processing tasks of the downstream application, and obtains the document processing results.

在一示例场景中，以下游应用为文档分类系统为例，端侧设备在对给定的多个文档进行分类时，需要获得各个文档的文本表示，再基于各个文档的文档表示进行分类，获得文档分类结果。端侧设备向服务器发送待分类的文档集，该文档集包含多个文档，其中至少两个文档间具有关联关系。服务器接收端侧设备发送的文档集，并基于文档间的关联关系构建文档关联图。服务器将文档集中的多个文档的内容信息和文档关联图输入文档表示模型，通过文档表示模型基于使用各文档的语义表示初始化的文档关联图进行图表征学习，更新各节点的特征表示，将更新后各节点的特征表示作为对应文档的文档表示。服务器将文档集中多个文档的文档表示返回至端侧设备。端侧设备接收服务器返回的文档集中多个文档的文档表示，根据各个文档的文档表示对各个文档进行分类，得到各个文档的分类结果。In an example scenario, taking the downstream application as a document classification system as an example, when classifying multiple given documents, the end-side device needs to obtain the text representation of each document, and then classify based on the document representation of each document, obtaining Document classification results. The terminal device sends a document set to be classified to the server. The document set contains multiple documents, at least two of which have an associated relationship. The server receives the document set sent by the end-side device and constructs a document association graph based on the association between documents. The server inputs the content information and document association graph of multiple documents in the document set into the document representation model. The document representation model performs graph representation learning based on the document association graph initialized using the semantic representation of each document, updates the feature representation of each node, and updates the The feature representation of each node is then used as the document representation of the corresponding document. The server returns document representations of multiple documents in the document set to the end-side device. The end-side device receives the document representations of multiple documents in the document set returned by the server, classifies each document according to the document representation of each document, and obtains the classification result of each document.

基于图1所示的系统架构，还可以实现文档聚类、生成多文档的总结等其他文档处理任务，此处不做具体限定。例如，标书、文献的分类，标书、文献的聚类，指定多个文档的总结等。Based on the system architecture shown in Figure 1, other document processing tasks such as document clustering and generating summaries of multiple documents can also be implemented, which are not specifically limited here. For example, classification of bid documents and documents, clustering of bid documents and documents, summary of specified multiple documents, etc.

图2为本申请所适用的另一示例系统架构的示意图。如图2所示，该系统架构包括服务器和端侧设备。其中，服务器与端侧设备之间具有可通信的通信链路，能够实现服务器与端侧设备间的通信连接。FIG. 2 is a schematic diagram of another example system architecture applicable to this application. As shown in Figure 2, the system architecture includes servers and end-side devices. Among them, there is a communicable communication link between the server and the end-side device, which can realize the communication connection between the server and the end-side device.

其中，服务器是部署在云端或本地的具有计算能力的设备，例如云集群等。服务器存储有训练好的文档表示模型，可基于文档表示模型实现文档表征功能。服务器可以使用文档表示模型，对给定文档检索库中的多个文档进行表征，获得文档检索库中各文档的文档表示，并将各文档的文档表示存储到文档检索库中。进一步地，服务器可以基于包含各文档的文档表示的文档检索库，实现文档检索、推荐等任务。另外，服务器还可以负责实现文档表示模型的训练。Among them, the server is a device with computing capabilities deployed in the cloud or locally, such as a cloud cluster, etc. The server stores a trained document representation model and can implement the document representation function based on the document representation model. The server can use the document representation model to characterize multiple documents in a given document retrieval library, obtain the document representation of each document in the document retrieval library, and store the document representation of each document in the document retrieval library. Further, the server can implement tasks such as document retrieval and recommendation based on a document retrieval database containing document representations of each document. In addition, the server can also be responsible for implementing the training of the document representation model.

端侧设备可以是用户使用的电子设备，具体可以为具有网络通信功能、运算功能以及信息显示功能的硬件设备，其包括但不限于智能手机、平板电脑、台式电脑等。The end-side device can be an electronic device used by the user, and specifically can be a hardware device with network communication functions, computing functions, and information display functions, including but not limited to smartphones, tablet computers, desktop computers, etc.

基于图2所示的系统架构，用户通过端侧设备向服务器发送文档处理请求，文档处理请求包含用户相关的输入信息，包含如下至少一项：用户输入的查询信息、用户画像、用户历史行为(包括但不限于浏览、点击、搜索、收藏、引用、关注)相关文档信息。服务器接收用户相关的输入信息，将用户相关的输入信息映射为向量表示，将向量表示与文档检索库中各文档的文档表示进行相似度匹配，得到与输入信息匹配的至少一个目标文档，向端侧设备返回目标文档的信息。端侧设备接收服务器返回的目标文档的信息，向用户展示目标文档的信息。Based on the system architecture shown in Figure 2, the user sends a document processing request to the server through the end-side device. The document processing request contains user-related input information, including at least one of the following: user-input query information, user portrait, user historical behavior ( Including but not limited to browsing, clicking, searching, collecting, citing, following) related document information. The server receives user-related input information, maps the user-related input information into a vector representation, performs similarity matching between the vector representation and the document representation of each document in the document retrieval database, and obtains at least one target document that matches the input information, and sends it to the terminal The side device returns information about the target document. The terminal device receives the information of the target document returned by the server and displays the information of the target document to the user.

在一示例场景中，以文档检索场景为例，用户通过端侧设备向服务器发送文档检索请求，文档检索请求包含用户输入的查询信息。服务器获取用户输入的查询信息，将查询信息映射为向量表示，将向量表示与文档检索库中各文档的文档表示进行相似度匹配，得到与查询信息匹配的至少一个目标文档，得到文档检索结果。进一步地，服务器向端侧设备返回目标文档的信息。端侧设备接收服务器返回的目标文档的信息，向用户展示目标文档的信息，完成文档检索。其中，文档检索场景具体可以是标书检索、文献检索、各技术领域知识文档的检索等，此处不做具体限定。In an example scenario, taking the document retrieval scenario as an example, the user sends a document retrieval request to the server through the terminal device, and the document retrieval request includes the query information input by the user. The server obtains the query information input by the user, maps the query information into a vector representation, performs similarity matching between the vector representation and the document representation of each document in the document retrieval database, obtains at least one target document that matches the query information, and obtains the document retrieval results. Further, the server returns the information of the target document to the end-side device. The end-side device receives the information of the target document returned by the server, displays the information of the target document to the user, and completes the document retrieval. Among them, the document retrieval scenario can specifically include tender document retrieval, document retrieval, retrieval of knowledge documents in various technical fields, etc. There are no specific limitations here.

基于图2所示的系统架构，还可以实现文档推荐，根据用户行为相关的第一文档，在文档检索库中检索与第一文档的文档表示相似度较高的目标文档，推荐给用户，实现基于用户行为的个性化文档推荐。例如，标书推荐、文献推荐、各技术领域知识文档的推荐等，此处不做具体限定。Based on the system architecture shown in Figure 2, document recommendation can also be implemented. According to the first document related to the user's behavior, the target document with a higher similarity to the document representation of the first document is retrieved in the document retrieval library and recommended to the user. Personalized document recommendations based on user behavior. For example, recommendation of tender documents, recommendation of literature, recommendation of knowledge documents in various technical fields, etc. are not specifically limited here.

下面以具体地实施例对本申请的技术方案以及本申请的技术方案如何解决上述技术问题进行详细说明。下面这几个具体的实施例可以相互结合，对于相同或相似的概念或过程可能在某些实施例中不再赘述。下面将结合附图，对本申请的实施例进行描述。The technical solution of the present application and how the technical solution of the present application solves the above technical problems will be described in detail below with specific embodiments. The following specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The embodiments of the present application will be described below with reference to the accompanying drawings.

图3为本申请一示例性实施例提供的文档表征方法的流程图。本实施例的执行主体可以是运行有文档表示模型的服务器，具体可以为前述任一系统架构中的服务器。如图3所示，该方法具体步骤如下：Figure 3 is a flow chart of a document characterization method provided by an exemplary embodiment of the present application. The execution subject of this embodiment may be a server running the document representation model, and may specifically be a server in any of the aforementioned system architectures. As shown in Figure 3, the specific steps of this method are as follows:

步骤S301、获取待处理的多个文档，以及多个文档的文档关联图，文档关联图包括各文档对应的节点和表示文档间关联关系的边。Step S301: Obtain multiple documents to be processed and document association graphs of the multiple documents. The document association graph includes nodes corresponding to each document and edges representing association relationships between documents.

本实施例的方法适用于对具有一定关联性的多个文档进行多文档的表征学习，获得各个文档的高质量的文档表示，具体可以应用于文档分类、聚类、检索、推荐、多文档总结等应用场景。在应用于不同的应用场景时，待处理的多个文档可以不同。例如，待处理的多个文档可以是用户/下游应用给定的待分类或聚类的文档集中的文档、或者是特定文档检索库中的全部或部分文档。其中，待处理的文档的内容可以包含限不限于文本、图像、程序代码、超链接等信息。The method of this embodiment is suitable for performing multi-document representation learning on multiple documents with certain correlations to obtain high-quality document representation of each document. It can be specifically applied to document classification, clustering, retrieval, recommendation, and multi-document summary. and other application scenarios. When applied to different application scenarios, the multiple documents to be processed can be different. For example, the multiple documents to be processed may be documents in a document set to be classified or clustered given by the user/downstream application, or all or part of the documents in a specific document retrieval library. Among them, the content of the document to be processed can include but not limited to text, images, program codes, hyperlinks and other information.

文档间的关联关系表示文档间的相关性，具体可以是引用、同一作者、给同一目标对象、同一描述对象等。例如，论文类的文档间可以具有引用关系，标书类的文档可以属于同一投标人/招标人，知识类的文档可以描述同一物品等。不同的应用场景中，针对不同文档集，文档间的关联关系可以不同。对于具体应用场景的给定文档集中不同文档间的关联关系，可以采用数据分析、挖掘的方式确定，具体可以采用现有技术获得，此处不做具体限定。The association between documents represents the correlation between documents, which can be citations, the same author, the same target object, the same description object, etc. For example, there can be reference relationships between paper-type documents, tender-type documents can belong to the same bidder/tenderer, knowledge-type documents can describe the same item, etc. In different application scenarios, for different document sets, the relationships between documents can be different. The correlation between different documents in a given document set in a specific application scenario can be determined through data analysis and mining. The details can be obtained using existing technologies, which are not specifically limited here.

对于给定的多个文档，构建各个文档对应的节点，根据文档间的关联关系，在任意两个具有关联关系的文档对应节点之间建立边，得到多个文档的文档关联图。需要说明的是，同一文档关联图中可以包含一种或多种不同类型的边，不同类型的边表示不同的关联关系。For multiple given documents, nodes corresponding to each document are constructed. According to the association between documents, an edge is established between any two document corresponding nodes with an association relationship to obtain a document association graph of multiple documents. It should be noted that the same document association graph can contain one or more different types of edges, and different types of edges represent different association relationships.

该步骤中，可选地，端侧设备向服务器提供待处理的多个文档。服务器从端侧设备获取待处理的多个文档，分析多个文档间的关联关系，并基于多个文档间的关联关系构建多个文档的文档关联图。文档关联图包括各文档对应的节点和表示文档间关联关系的边。In this step, optionally, the end-side device provides multiple documents to be processed to the server. The server obtains multiple documents to be processed from the terminal device, analyzes the relationships between the multiple documents, and builds a document association graph of the multiple documents based on the relationships between the multiple documents. The document association graph includes nodes corresponding to each document and edges representing the association between documents.

可选地，端侧设备向服务器提供待处理的多个文档，以及多个文档的文档关联图。服务器接收端侧设备发送的多个文档以及多个文档的文档关联图。由端侧设备负责构建多个文档的文档关联图。Optionally, the end-side device provides multiple documents to be processed and document association graphs of the multiple documents to the server. The server receives multiple documents sent by the end-side device and the document association graph of the multiple documents. The end-side device is responsible for building a document association graph of multiple documents.

步骤S302、将多个文档的内容信息和文档关联图输入文档表示模型，通过文档表示模型基于使用各文档的语义表示初始化的文档关联图进行图表征学习，更新各节点的特征表示。Step S302: Input the content information and document association graphs of multiple documents into the document representation model, and use the document representation model to perform graph representation learning based on the document association graph initialized using the semantic representation of each document, and update the feature representation of each node.

本实施例中，服务器将各文档的内容信息及文档关联图输入文档表示模型，文档表示模型基于各文档的内容信息生成各文档的语义表示，使用基于单一文档表征得到的各文档的语义表示，初始化文档关联图中各节点的特征表示，对初始化后文档关联图进行图表征学习并更新各节点的特征表示，输得到更新后的文档关联图。这样，在基于单一文档表征得到的各文档的语义表示的基础上，通过图表征学习来学习文档关联图的拓扑结构信息，也即学习多个文档间的相关性信息，提升文档关联图中各个节点的特征表示的质量。In this embodiment, the server inputs the content information of each document and the document association graph into the document representation model. The document representation model generates the semantic representation of each document based on the content information of each document, and uses the semantic representation of each document obtained based on a single document representation. Initialize the feature representation of each node in the document association graph, perform graph representation learning on the initialized document association graph and update the feature representation of each node, and output the updated document association graph. In this way, based on the semantic representation of each document obtained based on a single document representation, the topological structure information of the document association graph is learned through graph representation learning, that is, the correlation information between multiple documents is learned, and each element in the document association graph is improved. The quality of a node’s feature representation.

其中，输入文档表示模型的文档的内容信息可以包含文档的完整内容信息，也可以是预先指定的文档的部分内容信息，例如文档正文、摘要、关键段落等，本实施例此处不做具体限定。The content information of the document input into the document representation model may include the complete content information of the document, or may be partial content information of a pre-specified document, such as the document body, abstract, key paragraphs, etc., which are not specifically limited in this embodiment. .

在一可选实施例中，服务器可以使用其他模型或算法获得各文档的语义表示，使用各文档的语义表示初始化多个文档的文档关联图，将初始化后的文档关联图输入文档表示模型，文档表示模型基于输入的初始化后的文档关联图进行图表征学习并更新各节点的特征表示，输得到更新后的文档关联图。In an optional embodiment, the server can use other models or algorithms to obtain the semantic representation of each document, use the semantic representation of each document to initialize the document association graph of multiple documents, and input the initialized document association graph into the document representation model. The representation model performs graph representation learning based on the input initialized document association graph and updates the feature representation of each node, and outputs an updated document association graph.

步骤S303、将更新后各节点的特征表示作为对应文档的文档表示。Step S303: Use the updated feature representation of each node as the document representation of the corresponding document.

将更新后文档关联图中各节点高质量的特征表示，作为节点对应文档的文档表示，即可得到各文档的高质量的文档表示。Using the high-quality feature representation of each node in the updated document association graph as the document representation of the document corresponding to the node, a high-quality document representation of each document can be obtained.

进一步地，基于各文档的高质量的文档表示，可以实现文档检索、推荐、分类、聚类、多文档总结等文档处理任务，从而提升文档处理的精准度。Furthermore, based on the high-quality document representation of each document, document processing tasks such as document retrieval, recommendation, classification, clustering, and multi-document summary can be realized, thereby improving the accuracy of document processing.

示例性地，以文档聚类为例，可以通过本实施例的方法获得待聚类的多个文档的高质量的文档表示，并基于获得的文档表示进行聚类，得到文档聚类结果，可以提升文档聚类的精准度。Illustratively, taking document clustering as an example, the method of this embodiment can be used to obtain high-quality document representations of multiple documents to be clustered, and clustering is performed based on the obtained document representations to obtain document clustering results. Improve the accuracy of document clustering.

示例性地，以文档检索为例，可以通过本实施例的方法来获得文档检索库中各个文档的高质量的文档表示。在进行文档检索时，将用户相关的输入信息的向量表示与文档检索库中高质量的文档表示进行相似度匹配，确定与输入信息相匹配的目标文档，得到文档检索结果，可以提升文档检索的精准度和召回率。Illustratively, taking document retrieval as an example, a high-quality document representation of each document in the document retrieval library can be obtained through the method of this embodiment. When performing document retrieval, the vector representation of user-related input information is matched with high-quality document representations in the document retrieval database to determine the target document that matches the input information and obtain the document retrieval results, which can improve the accuracy of document retrieval. degree and recall.

本实施例的方法，获取待处理的多个文档，以及多个文档的文档关联图，文档关联图包括各文档对应的节点和表示文档间关联关系的边，将多个文档的内容信息和文档关联图输入文档表示模型，通过文档表示模型基于各文档的内容信息生成各文档的语义表示，使用各文档的语义表示初始化文档关联图中各节点的特征表示，基于初始化后的文档关联图进行图表征学习，更新各节点的特征表示，将更新后各节点的特征表示作为对应文档的文档表示，不仅可以学习到单文档的语义信息，还可以基于多文档的文档关联图学习到文档间的相关性信息，提升了文档表示的质量，进而可以提升基于文档表示的文档检索、推荐、分类的精准度。The method of this embodiment obtains multiple documents to be processed and the document association graph of the multiple documents. The document association graph includes the nodes corresponding to each document and the edges representing the association relationships between documents, and combines the content information of the multiple documents with the document The association graph is input into the document representation model, and the semantic representation of each document is generated based on the content information of each document through the document representation model. The semantic representation of each document is used to initialize the feature representation of each node in the document association graph, and the graph is performed based on the initialized document association graph. Representation learning updates the feature representation of each node and uses the updated feature representation of each node as the document representation of the corresponding document. It can not only learn the semantic information of a single document, but also learn the correlation between documents based on the document association graph of multiple documents. This information improves the quality of document representation, thereby improving the accuracy of document retrieval, recommendation, and classification based on document representation.

图4为本申请一示例性实施例提供的文档表示模型的示例架构图，在一可选实施例中，如图4所示，前述方法实施例中，文档表示模型包括语义信息学习模块和相关性信息学习模块。其中，语义信息学习模块用于：对各文档的内容信息进行语义信息表征学习，得到各文档的语义表示。相关性信息学习模块用于：使用各文档的语义表示初始化文档关联图中各节点的特征表示，基于初始化后的文档关联图进行图表征学习，更新各节点的特征表示。Figure 4 is an example architecture diagram of a document representation model provided by an exemplary embodiment of the present application. In an optional embodiment, as shown in Figure 4, in the foregoing method embodiment, the document representation model includes a semantic information learning module and a related Sexual Information Learning Module. Among them, the semantic information learning module is used to perform semantic information representation learning on the content information of each document to obtain the semantic representation of each document. The correlation information learning module is used to: use the semantic representation of each document to initialize the feature representation of each node in the document association graph, perform graph representation learning based on the initialized document association graph, and update the feature representation of each node.

基于图4所示的模型架构，前述步骤S302具体采用如下方式实现：Based on the model architecture shown in Figure 4, the aforementioned step S302 is implemented in the following manner:

步骤S3021、将多个文档的内容信息输入语义信息学习模块，通过语义信息学习模块对各文档的内容信息进行语义信息表征学习，得到各文档的语义表示，并将各文档的语义表示输入相关性学习模块。Step S3021: Input the content information of multiple documents into the semantic information learning module, perform semantic information representation learning on the content information of each document through the semantic information learning module, obtain the semantic representation of each document, and input the semantic representation of each document into the correlation Learning modules.

本实施例中，通过语义信息学习模块分别对单一文档的内容信息进行语义信息表征学习，可以较好地学习到文档内词、句子等层级的语义信息，从而获得单一文档的高质量的语义表示。获得的单一文档的高质量的语义表示，会输出后续的相关性学习模块，用于初始化文档关联图中各节点的特征表示。In this embodiment, the semantic information representation learning is performed on the content information of a single document through the semantic information learning module. The semantic information at the word, sentence and other levels within the document can be better learned, thereby obtaining a high-quality semantic representation of a single document. . The obtained high-quality semantic representation of a single document will output the subsequent correlation learning module, which is used to initialize the feature representation of each node in the document correlation graph.

步骤S3022、将文档关联图输入相关性信息学习模块，通过相关性信息学习模块使用各文档的语义表示，初始化文档关联图中各节点的特征表示，并基于初始化后的文档关联图进行图表征学习，更新各节点的特征表示。Step S3022: Input the document association graph into the relevance information learning module, use the semantic representation of each document through the relevance information learning module to initialize the feature representation of each node in the document association graph, and perform graph representation learning based on the initialized document association graph. , update the feature representation of each node.

本实施例中，文档关联图作为相关性信息学习模块的输入，相关性信息学习模块使用语义信息学习模块输出的各文档的高质量的语义表示，初始化文档关联图中各节点的特征表示，使得初始化之后的文档关联图中节点的特征表示为对应文档的高质量的语义表示。进一步地，相关性信息学习模块基于初始化后的文档关联图进行图表征学习，更新各节点的特征表示，可以基于多文档的文档关联图学习到文档关联图的拓扑结构信息，而文档关联图的拓扑结构信息包含了文档间的相关性信息，因此更新节点的特征表示的过程中可以学习到文档间的相关性信息，使得更新后节点的特征表示不仅包含单一文档的语义信息，同时包含文档间的相关性信息，将更新后节点的特征表示作为对应文档的文档表示，提升了文档表示的质量。In this embodiment, the document association graph is used as the input of the relevance information learning module. The relevance information learning module uses the high-quality semantic representation of each document output by the semantic information learning module to initialize the feature representation of each node in the document association graph, such that The feature representation of the nodes in the document association graph after initialization is a high-quality semantic representation of the corresponding document. Furthermore, the correlation information learning module performs graph representation learning based on the initialized document association graph, updates the feature representation of each node, and can learn the topological structure information of the document association graph based on the multi-document document association graph, and the document association graph Topological structure information contains correlation information between documents, so the correlation information between documents can be learned in the process of updating the feature representation of nodes, so that the feature representation of the updated node not only contains the semantic information of a single document, but also contains the semantic information between documents. The correlation information of the updated node is used as the document representation of the corresponding document, which improves the quality of the document representation.

在另一可选实施例中，文档表示模型还可以仅包括相关性信息学习模块，该相关性信息学习模块采用图神经网络GNN实现。在进行多文档表征时，通过现有文档表征方法对多个文档的内容信息分别进行单文档的语义表示，得到各文档的语义表示，使用各文档的语义表示初始化多个文档的文档关联图中各节点的特征表示，得到初始化后的文档关联图，将初始化后的文档关联图输入相关性信息学习模块进行图表征学习，更新文档关联图中各节点的特征表示。In another optional embodiment, the document representation model may also include only a correlation information learning module, which is implemented using a graph neural network GNN. When performing multi-document representation, the existing document representation method is used to perform single-document semantic representation on the content information of multiple documents to obtain the semantic representation of each document. The semantic representation of each document is used to initialize the document association graph of multiple documents. The feature representation of each node is used to obtain the initialized document association graph. The initialized document association graph is input into the relevance information learning module for graph representation learning, and the feature representation of each node in the document association graph is updated.

图5为本申请一示例性实施例提供的语义信息学习模块的结构示例图，在前述图4所示模型架构的基础上，在一可选实施例中，如图5所示，语义信息学习模块包括：实体关系图构建模块、第一图神经网络、文本表示模型和语义表示融合模块。Figure 5 is an example structural diagram of a semantic information learning module provided by an exemplary embodiment of the present application. Based on the model architecture shown in Figure 4, in an optional embodiment, as shown in Figure 5, the semantic information learning module Modules include: entity relationship diagram building module, first graph neural network, text representation model and semantic representation fusion module.

其中，实体关系图构建模块用于：根据各文档的内容信息，构建各文档的实体关系图(Entity-Relationship Graph，简称E-R图)。第一图神经网络用于：对各文档的实体关系图分别进行图表征学习，得到各文档包含的实体的特征表示，将各文档包含的实体的特征表示融合，得到各文档的实体层级的语义表示。第一图神经网络可以采用图卷积神经网络(Graph Convolutional Network，简称GCN)、构造性图神经网络(Neural Network forGraphs，简称NN4G)、图注意力网络(Graph Attention Network，简称GAT)、图同构网络(Graph Isomorphism Network，简称GIN)或其他图神经网络实现，能够实现图数据的表征学习，本实施例此处不做具体限定。Among them, the entity-relationship graph building module is used to construct an entity-relationship graph (Entity-Relationship Graph, E-R graph for short) of each document based on the content information of each document. The first graph neural network is used to: perform graph representation learning on the entity relationship diagram of each document to obtain the feature representation of the entities contained in each document, and fuse the feature representations of the entities contained in each document to obtain the entity-level semantics of each document express. The first graph neural network can use Graph Convolutional Network (GCN for short), Neural Network for Graphs (NN4G for short), Graph Attention Network (GAT for short), graph synchronization Implementation of Graph Isomorphism Network (GIN for short) or other graph neural networks can realize representation learning of graph data. This embodiment is not specifically limited here.

其中，文档的实体关系图(E-R图)是以文档包含的命名实体为节点，以命名实体间的关系为边构建的图结构。具体可以基于任一文档的内容信息，抽取文档包含的命名实体以及命名实体间的关系，基于文档包含的命名实体创建对应节点，基于命名实体间的关系构建对应节点间的边，来构建文档的实体关系图。该实体关系图中节点的特征向量可以是对应命名实体的向量表示，具体可以根据命名实体的词向量确定。例如，对于任一命名实体进行分词处理。若分词后命名实体仅包含一个词，则将该词的词向量作为命名实体的向量表示，并作为该命名实体对应节点的特征表示。若该命名实体包含多个词，则将这多个词的词向量求均值，得到该命名实体的向量表示，并作为该命名实体对应节点的特征表示。另外，该实体关系图中节点的特征向量还可以通过随机初始化或其他方式确定，此处不做具体限定。Among them, the entity-relationship graph (E-R graph) of a document is a graph structure constructed with the named entities contained in the document as nodes and the relationships between named entities as edges. Specifically, based on the content information of any document, the named entities contained in the document and the relationship between the named entities can be extracted, corresponding nodes can be created based on the named entities contained in the document, and the edges between the corresponding nodes can be constructed based on the relationship between the named entities to construct the document. Entity relationship diagram. The feature vector of the node in the entity relationship graph may be a vector representation corresponding to the named entity, which may be specifically determined based on the word vector of the named entity. For example, perform word segmentation processing for any named entity. If the named entity contains only one word after word segmentation, the word vector of the word is used as the vector representation of the named entity and as the feature representation of the node corresponding to the named entity. If the named entity contains multiple words, the word vectors of the multiple words are averaged to obtain the vector representation of the named entity, which is used as the feature representation of the node corresponding to the named entity. In addition, the feature vectors of the nodes in the entity relationship diagram can also be determined through random initialization or other methods, which are not specifically limited here.

可选地，第一图神经网络可以包含图编码层和图池化层，图编码层用于对各文档的实体关系图分别进行图表征学习，对实体关系图中各节点的特征表示进行编码，更新实体关系图中节点的特征表示，输出更新后的实体关系图。更新后的实体关系图中各节点的特征表示作为对应实体的特征表示，即得到各文档包含的实体的特征表示。更新后的实体关系图输入至图池化层，图池化层用于对更新后的实体关系图进行图池化操作，来实现各节点的特征表示的融合，输出实体关系的图表示。将实体关系的图表示作为对应文档的语义表示，该语义表示是通过对文档包含的实体的特征表示融合得到的，是文档的实体层级的语义表示。Optionally, the first graph neural network may include a graph encoding layer and a graph pooling layer. The graph encoding layer is used to perform graph representation learning on the entity relationship graph of each document, and encode the feature representation of each node in the entity relationship graph. , update the feature representation of nodes in the entity relationship graph, and output the updated entity relationship graph. The characteristic representation of each node in the updated entity relationship graph is used as the characteristic representation of the corresponding entity, that is, the characteristic representation of the entities contained in each document is obtained. The updated entity relationship graph is input to the graph pooling layer. The graph pooling layer is used to perform graph pooling operations on the updated entity relationship graph to achieve the fusion of feature representations of each node and output a graph representation of the entity relationship. The graph representation of entity relationships is used as the semantic representation of the corresponding document. This semantic representation is obtained by fusing the feature representations of the entities contained in the document, and is a semantic representation of the entity level of the document.

可选地，第一图神经网络可以包含图编码层，图编码层用于对各文档的实体关系图分别进行图表征学习，对实体关系图中各节点的特征表示进行编码，更新实体关系图中节点的特征表示，输出更新后的实体关系图。更新后的实体关系图中各节点的特征表示作为对应实体的特征表示，即得到各文档包含的实体的特征表示。进一步地，将任一文档包含的实体的特征表示求平均或拼接，来实现各节点的特征表示的融合，得到文档的实体层级的语义表示。Optionally, the first graph neural network may include a graph coding layer. The graph coding layer is used to perform graph representation learning on the entity relationship graph of each document, encode the feature representation of each node in the entity relationship graph, and update the entity relationship graph. Feature representation of the nodes in the node, and output the updated entity relationship graph. The characteristic representation of each node in the updated entity relationship graph is used as the characteristic representation of the corresponding entity, that is, the characteristic representation of the entities contained in each document is obtained. Furthermore, the feature representations of the entities contained in any document are averaged or spliced to achieve the fusion of the feature representations of each node and obtain the semantic representation of the entity level of the document.

本实施例中，文本表示模型用于：对各文档的内容信息分别进行文本表征，获得各文档的文本层级的语义表示。该文本表示模型可以采用现有的任意一种文本表示模型/算法实现。示例性地，文本表示模型可以采用各类语言模型(Language Model，简称LM)实现，如基于Transformer的双向编码器表示(Bidirectional Encoder Representations fromTransformers，简称BERT)、大语言模型、预训练语言模型等。In this embodiment, the text representation model is used to perform text representation on the content information of each document to obtain a text-level semantic representation of each document. The text representation model can be implemented using any existing text representation model/algorithm. For example, the text representation model can be implemented using various types of language models (Language Model, LM for short), such as Transformer-based Bidirectional Encoder Representations from Transformers (BERT for short), large language models, pre-trained language models, etc.

语义表示融合模块用于：将各文档的实体层级的语义表示和文本层级的语义表示融合，得到各文档的语义表示。具体地，语义表示融合模块通过将各文档的实体层级的语义表示和文本层级的语义表示求平均，得到各文档的语义表示；或者，将各文档的实体层级的语义表示和文本层级的语义表示拼接，得到各文档的语义表示。The semantic representation fusion module is used to fuse the entity-level semantic representation of each document and the text-level semantic representation to obtain the semantic representation of each document. Specifically, the semantic representation fusion module obtains the semantic representation of each document by averaging the semantic representation of the entity level and the semantic representation of the text level of each document; or, the semantic representation of the entity level and the semantic representation of the text level of each document are obtained. Splicing to obtain the semantic representation of each document.

基于图5所示语义信息学习模块的结构示例，前述步骤S3021具体可以采用如下方式实现：Based on the structural example of the semantic information learning module shown in Figure 5, the aforementioned step S3021 can be implemented in the following manner:

将多个文档的内容信息输入实体关系图构建模块，通过实体关系图构建模块根据各文档的内容信息，构建各文档的实体关系图；将各文档的实体关系图输入第一图神经网络，通过第一图神经网络对各文档对应的实体关系图分别进行图表征学习，获得各文档的实体层级的语义表示；并将多个文档的内容信息输入文本表示模型，通过文本表示模型对各文档的内容信息分别进行文本表征，获得各文档的文本层级的语义表示；通过语义表示融合模块，将各文档的实体层级的语义表示和文本层级的语义表示融合，得到各文档的语义表示。各文档的语义表示输出至相关性学习模块。Input the content information of multiple documents into the entity relationship diagram building module, and use the entity relationship diagram building module to construct the entity relationship diagram of each document based on the content information of each document; input the entity relationship diagram of each document into the first graph neural network, and use The first graph neural network performs graph representation learning on the entity relationship diagram corresponding to each document to obtain the entity-level semantic representation of each document; inputs the content information of multiple documents into the text representation model, and uses the text representation model to The content information is represented as text respectively to obtain the text-level semantic representation of each document; through the semantic representation fusion module, the entity-level semantic representation of each document and the text-level semantic representation are fused to obtain the semantic representation of each document. The semantic representation of each document is output to the correlation learning module.

在另一可选实施例中，语义信息学习模块可以采用文本表示模型实现，能够提取单一文档包含内容文本的语义信息即可。示例性地，语义信息学习模块可以采用语言模型(Language Model，简称LM)实现。例如基于Transformer的双向编码器表示(BidirectionalEncoder Representations from Transformers，简称BERT)、大语言模型、预训练语言模型等实现。In another optional embodiment, the semantic information learning module can be implemented using a text representation model, and it only needs to be able to extract the semantic information of the content text contained in a single document. For example, the semantic information learning module can be implemented using a language model (Language Model, LM for short). For example, Transformer-based bidirectional encoder representations (BidirectionalEncoder Representations from Transformers, BERT for short), large language models, pre-trained language models, etc. are implemented.

前述步骤S3021中，将多个文档分别输入文本表示模型，通过文本表示模型对各文档分别进行文本表征，获得各文档的文本层级的语义表示，作为各文档的语义表示。In the aforementioned step S3021, multiple documents are respectively input into the text representation model, and text representation is performed on each document through the text representation model to obtain the text-level semantic representation of each document as the semantic representation of each document.

在另一可选实施例中，语义信息学习模块可以基于图神经网络GNN实现，通过基于单一文档的内容信息抽取文档包含的命名实体以及命名实体间的关系，构建文档的实体关系图。该实体关系图中的节点表示文档的内容信息包含的命名实体，节点的特征向量可以是对应命名实体的向量表示，实体关系图中的边表示命名实体间的关系。将文档的实体关系图输入语义信息学习模块的图神经网络GNN进行图表征学习，更新实体关系图中节点的特征向量，获得编码后的实体关系图，编码后的实体关系图中各节点的特征向量作为命名实体的编码向量。将文档包含的命名实体的编码向量进行整合，得到文档的语义表示。具体地，将文档包含的命名实体的编码向量进行整合，得到文档的语义表示时，可以将文档包含的各命名实体的编码向量的求平均，得到文档的语义表示。或者，对编码后的实体关系图进行图池化，读出整个实体关系图的图表示，作为对应文档的语义表示。In another optional embodiment, the semantic information learning module can be implemented based on the graph neural network GNN, and constructs the entity relationship graph of the document by extracting the named entities contained in the document and the relationships between the named entities based on the content information of a single document. The nodes in the entity relationship graph represent named entities contained in the content information of the document. The feature vectors of the nodes may be vector representations corresponding to the named entities. The edges in the entity relationship graph represent relationships between named entities. Input the entity relationship graph of the document into the graph neural network GNN of the semantic information learning module for graph representation learning, update the feature vectors of the nodes in the entity relationship graph, and obtain the encoded entity relationship graph. The features of each node in the encoded entity relationship graph are obtained. Vector as an encoding vector of named entities. Integrate the encoding vectors of named entities contained in the document to obtain the semantic representation of the document. Specifically, when the encoding vectors of the named entities contained in the document are integrated to obtain the semantic representation of the document, the encoding vectors of each named entity contained in the document can be averaged to obtain the semantic representation of the document. Or, perform graph pooling on the encoded entity-relationship graph, and read out the graph representation of the entire entity-relationship graph as the semantic representation of the corresponding document.

前述步骤S3021中，将各文档的实体关系图输入语义信息学习的图神经网络，通过图神经网络对各文档的实体关系图分别进行图表征学习，获得各文档的实体层级的语义表示，作为各文档的语义表示。In the aforementioned step S3021, the entity relationship graph of each document is input into the graph neural network for semantic information learning. The entity relationship graph of each document is separately graph representation learned through the graph neural network to obtain the semantic representation of the entity level of each document as each document. Semantic representation of the document.

前述任一方法实施例中，任一文档的实体关系图通过如下方式获得：对文档的内容文本进行命名实体识别(Named Entity Recognition，简称NER)和关系抽取(RelationExtraction，简称RE)，获得文档包含的实体(也即命名实体)，以及实体之间的关系。构建文档包含的各实体对应的节点，根据实体间的关系构建边，得到文档的实体关系图。其中，对文档的内容文本进行命名实体识别NER和关系抽取RE，可以采用现有的命名实体识别方法和关系抽取方法实现，此处不做具体限定。需要说明的是，文档的内容可以包含限不限于文本、图像、程序代码、超链接等信息。在对文档的内容文本进行命名实体识别和关系抽取时，可以基于文档内的文本内容(包含程序代码和超链接等)进行命名实体识别和关系抽取，获得文档包含的实体(也即命名实体)，以及实体之间的关系。或者，可以对文档中的图像进行图像识别，获得文档中图像包含的文本信息，将文档中图像包含的文本信息也作为文档的内容文本，根据文档内的文本内容以及图像包含的文本信息，命名实体识别和关系抽取，获得文档包含的实体(也即命名实体)，以及实体之间的关系。In any of the foregoing method embodiments, the entity relationship diagram of any document is obtained as follows: performing Named Entity Recognition (NER) and Relation Extraction (RE) on the content text of the document to obtain the content of the document. entities (that is, named entities), and the relationships between entities. Construct nodes corresponding to each entity contained in the document, construct edges based on the relationships between entities, and obtain the entity relationship graph of the document. Among them, named entity recognition NER and relationship extraction RE are performed on the content text of the document, which can be implemented using existing named entity recognition methods and relationship extraction methods, and are not specifically limited here. It should be noted that the content of the document can include but not limited to text, images, program code, hyperlinks and other information. When performing named entity recognition and relationship extraction on the content text of a document, named entity recognition and relationship extraction can be performed based on the text content in the document (including program code and hyperlinks, etc.) to obtain the entities contained in the document (that is, named entities) , and the relationships between entities. Alternatively, image recognition can be performed on the image in the document to obtain the text information contained in the image in the document. The text information contained in the image in the document is also used as the content text of the document. According to the text content in the document and the text information contained in the image, the name Entity recognition and relationship extraction are used to obtain the entities contained in the document (that is, named entities) and the relationships between entities.

在一可选实施例中，在前述图4所示模型架构的基础上，相关性信息学习模块包括：初始化模块和第二图神经网络。其中，初始化模块用于：使用各文档的语义表示，初始化文档关联图中各节点的特征表示。第二图神经网络用于：基于初始化后的文档关联图进行图表征学习，更新各节点的特征表示。In an optional embodiment, based on the model architecture shown in Figure 4, the correlation information learning module includes: an initialization module and a second graph neural network. Among them, the initialization module is used to: use the semantic representation of each document to initialize the feature representation of each node in the document association graph. The second graph neural network is used to perform graph representation learning based on the initialized document association graph and update the feature representation of each node.

本实施例中，前述步骤S3022具体可以采用如下方式实现：In this embodiment, the aforementioned step S3022 can be implemented in the following manner:

将各文档的语义表示和文档关联图输入初始化模块，通过初始化模块使用各文档的语义表示，初始化文档关联图中各节点的特征表示，将初始化后的文档关联图输入第二图神经网络，通过第二图神经网络基于初始化后的文档关联图进行图表征学习，更新各节点的特征表示。Input the semantic representation and document association graph of each document into the initialization module, use the semantic representation of each document through the initialization module to initialize the feature representation of each node in the document association graph, and input the initialized document association graph into the second graph neural network, through The second graph neural network performs graph representation learning based on the initialized document association graph and updates the feature representation of each node.

其中，第二图神经网络包含图编码层，图编码层用于对初始化后的文档关联图进行图表征学习，对文档关联图中各节点的特征表示进行编码，更新文档关联图中节点的特征表示，输出更新后的文档关联图。更新后的文档关联图中各节点的特征表示作为对应文档的文档表示，即得到各文档的文档表示。Among them, the second graph neural network includes a graph coding layer. The graph coding layer is used to perform graph representation learning on the initialized document association graph, encode the feature representation of each node in the document association graph, and update the features of the nodes in the document association graph. Indicates that the updated document association graph is output. The feature representation of each node in the updated document association graph is used as the document representation of the corresponding document, that is, the document representation of each document is obtained.

示例性地，第二图神经网络具体可以采用图卷积神经网络GCN、构造性图神经网络NN4G、图注意力网络GAT、图同构网络GIN或其他图神经网络实现，能够实现图数据的表征学习，本实施例此处不做具体限定。可选地，第二图神经网络与语义信息学习模块的第一图神经网络的图编码层，可以采用相同的结构实现，但不共享参数。Exemplarily, the second graph neural network can be implemented using a graph convolutional neural network GCN, a constructive graph neural network NN4G, a graph attention network GAT, a graph isomorphism network GIN or other graph neural networks, and can realize the representation of graph data. Learning is not specifically limited in this embodiment. Optionally, the second graph neural network and the graph coding layer of the first graph neural network of the semantic information learning module can be implemented using the same structure, but do not share parameters.

在另一可选实施例中，相关性信息学习模块可以仅包含第二图神经网络，在将文档关联图输入相关性信息学习模块之前，使用各文档的语义表示，初始化文档关联图中各节点的特征表示，得到初始化后的文档关联图，将初始化后的文档关联图输入相关性信息学习模块。In another optional embodiment, the relevance information learning module may only include the second graph neural network. Before inputting the document association graph into the relevance information learning module, each node in the document association graph is initialized using the semantic representation of each document. Feature representation, obtain the initialized document association graph, and input the initialized document association graph into the correlation information learning module.

示例性地，图6为本申请一示例性实施例提供的文档表征模型的结构示例图，如图6所示，文档表征模型包括语义信息学习模块和相关性信息学习模块。其中，语义信息学习模块包括：实体关系图构建模块、第一图神经网络、语言模型和语义表示融合模块。相关性信息学习模块包括初始化模块和第二图神经网络。在进行多文档表征时，将多个文档的内容信息输入语言模型，通过语言模型对各文档的内容信息分别进行文本表征，获得各文档的文本层级的语义表示，并输入语义表示融合模块；并将多个文档的内容信息输入实体关系图构建模块，通过实体关系图构建模块根据各文档的内容信息，构建各文档的实体关系图；将各个文档的实体关系图输入第一图神经网络，通过第一图神经网络对各文档的实体关系图分别进行图表征学习，得到各文档包含的实体的特征表示，将各文档包含的实体的特征表示融合，得到各文档的实体层级的语义表示，并输入语义表示融合模块。通过语义表示融合模块将各文档的实体层级的语义表示和文本层级的语义表示融合，得到各文档的语义表示。各文档的语义表示输入相关性信息学习模块的初始化模块，通过初始化模块使用各文档的语义表示，初始化文档关联图中各节点的特征表示，得到初始化后的文档关联图，并输入第二图神经网络。第二图神经网络基于初始化后的文档关联图进行图表征学习，更新各节点的特征表示；将更新后各节点的特征表示作为对应文档的文档表示。Exemplarily, FIG. 6 is an example structural diagram of a document representation model provided by an exemplary embodiment of the present application. As shown in FIG. 6 , the document representation model includes a semantic information learning module and a correlation information learning module. Among them, the semantic information learning module includes: entity relationship diagram building module, first graph neural network, language model and semantic representation fusion module. The correlation information learning module includes an initialization module and a second graph neural network. When performing multi-document representation, input the content information of multiple documents into the language model, perform text representation on the content information of each document through the language model, obtain the text-level semantic representation of each document, and input it into the semantic representation fusion module; and Input the content information of multiple documents into the entity relationship diagram building module, and use the entity relationship diagram building module to construct the entity relationship diagram of each document based on the content information of each document; input the entity relationship diagram of each document into the first graph neural network, and use The first graph neural network performs graph representation learning on the entity relationship diagram of each document, obtains the feature representation of the entities contained in each document, fuses the feature representations of the entities contained in each document, and obtains the semantic representation of the entity level of each document, and Enter the semantic representation fusion module. Through the semantic representation fusion module, the entity-level semantic representation of each document and the text-level semantic representation are fused to obtain the semantic representation of each document. The semantic representation of each document is input into the initialization module of the correlation information learning module. The semantic representation of each document is used through the initialization module to initialize the feature representation of each node in the document association graph to obtain the initialized document association graph and input it into the second graph neural network. network. The second graph neural network performs graph representation learning based on the initialized document association graph, updates the feature representation of each node, and uses the updated feature representation of each node as the document representation of the corresponding document.

图7为本申请一示例性实施例提供的文本表示模型的训练方法流程图。在一可选实施例中，如图7所示，前述方法实施例使用的文档表示模型的训练过程如下：Figure 7 is a flow chart of a training method for a text representation model provided by an exemplary embodiment of the present application. In an optional embodiment, as shown in Figure 7, the training process of the document representation model used in the foregoing method embodiment is as follows:

步骤S701、构建文档表示模型。Step S701: Construct a document representation model.

本实施例中，构建基于图4所示模型架构的文档表示模型，具体可以采用前述任一方法实施例中的模型架构，例如图4与其他可选实施例结合确定模型结构、或图6所示的模型结构，本实施例此处不做具体限定。In this embodiment, the document representation model based on the model architecture shown in Figure 4 can be constructed. Specifically, the model architecture in any of the aforementioned method embodiments can be used. For example, Figure 4 is combined with other optional embodiments to determine the model structure, or the model structure shown in Figure 6 The model structure shown is not specifically limited in this embodiment.

步骤S702、将用于训练的文档集中的多个文档样本，以及多个文档样本的文档关联图输入文档表示模型，通过文档表示模型对多个文档样本的文档关联图中的初始边进行掩码处理，并使用各文档样本的语义表示初始化多个文档样本的文档关联图中各节点的特征表示，获得初始化后的文档关联图，基于初始化后的文档关联图进行图表征学习，更新各节点的特征表示。Step S702: Input the multiple document samples in the document set used for training and the document association graphs of the multiple document samples into the document representation model, and use the document representation model to mask the initial edges in the document association graphs of the multiple document samples. Process, and use the semantic representation of each document sample to initialize the feature representation of each node in the document association graph of multiple document samples, obtain the initialized document association graph, perform graph representation learning based on the initialized document association graph, and update the Feature representation.

本实施例中，用于训练文本表示模型的训练数据包括多个文档集，任一文档集包含多个文档样本，文档集中的多个文档样本具有一定的相关性，其中部分文档样本间存在关联关系。其中，多个文档集可以通过对现有的文档表征训练数据中的多个文档进行分组得到，每一分组的多个文档组成一个文档集。各个文档间的关联关系是文档自有的特点，通过分析文档内容或文档相关属性信息(如作者、时间、用途、所属领域、来源等)确定，或者由人工进行标注得到。另外，文档集的划分可以由服务器根据文档所属邻域、来源等相关属性进行自动划分，或者服务器对文档进行随机划分，或者由相关技术人员人工划分实现，此处不做具体限定。In this embodiment, the training data used to train the text representation model includes multiple document sets. Each document set contains multiple document samples. The multiple document samples in the document set have a certain correlation, and some of the document samples are related. relation. Among them, multiple document sets can be obtained by grouping multiple documents in the existing document representation training data, and multiple documents in each group constitute a document set. The relationship between each document is the document's own characteristics, which can be determined by analyzing the document content or document-related attribute information (such as author, time, purpose, field, source, etc.), or obtained by manual annotation. In addition, the document set can be automatically divided by the server based on relevant attributes such as the neighborhood and source of the document, or the server can randomly divide the documents, or the relevant technical personnel can manually divide the document set. There is no specific limit here.

该步骤中，基于一个文档集中的多个文档样本，获取多个文档样本的文档关联图的过程，与前述步骤S301中获取多个文档的文档关联图的实现方式一致，具体参见前述实施例的相关内容，此处不再赘述。In this step, the process of obtaining document association graphs of multiple document samples based on multiple document samples in a document set is consistent with the implementation of obtaining document association graphs of multiple documents in the aforementioned step S301. For details, please refer to the aforementioned embodiments. The relevant content will not be repeated here.

该步骤S702的具体实现方式与前述步骤S302类似，不同之处在于，在初始化文档关联图的过程中，步骤S302中仅初始化各节点的特征表示，而步骤S702中不仅初始化各节点的特征表示，还对文档关联图中的初始边进行掩码处理，其他处理过程与步骤S302一致，具体参见前述实施例的相关内容，此处不再赘述。The specific implementation of step S702 is similar to the aforementioned step S302. The difference is that in the process of initializing the document association graph, only the feature representation of each node is initialized in step S302, while in step S702 not only the feature representation of each node is initialized. The initial edges in the document association graph are also masked, and other processing procedures are consistent with step S302. For details, please refer to the relevant content of the foregoing embodiments, which will not be described again here.

本实施例中，对文本表示模型的训练任务为对文档关联图中节点间边的预测任务，为了构建训练任务，在初始化多个文档样本的文档关联图的过程中，不仅使用各文档样本的语义表示，初始化多个文档样本的文档关联图中各节点的特征表示，还需要对多个文档样本的文档关联图中的初始边进行掩码处理，获得初始化后的文档关联图。在后续步骤S703中，根据更新后各节点的特征表示，预测各节点间的关联关系，也即预测各节点间是否存在边。In this embodiment, the training task of the text representation model is the task of predicting the edges between nodes in the document association graph. In order to construct the training task, in the process of initializing the document association graph of multiple document samples, not only the Semantic representation: initialize the feature representation of each node in the document association graph of multiple document samples. It is also necessary to mask the initial edges in the document association graph of multiple document samples to obtain the initialized document association graph. In subsequent step S703, based on the updated characteristic representation of each node, the association relationship between each node is predicted, that is, whether there is an edge between each node is predicted.

示例性地，一个图G可以表示为一个有序二元组，表示为G＝(V，E)，其中V＝{v₁,...,v_N}表示节点集，表示边集。图G可以存储为特征矩阵X∈R^N×d和邻接矩阵A∈R^N×N相关联。其中，特征矩阵X为图中节点的特征表示构成的矩阵，d为节点特征表示的维度。N为图G包含的节点数量。邻接矩阵A记录了图中任意两个节点间是否有边，A_ij＝1表示节点v_i和v_j之间有边，A_ij＝0表示节点v_i和v_j之间没有边。对多个文档样本的文档关联图中的初始边进行掩码处理，可以将邻接矩阵中的1使用指定掩码(如0或NULL)覆盖，其中，NULL表示不确定是否存在边。For example, a graph G can be represented as an ordered tuple, represented as G = (V, E), where V = {v ₁ ,..., v _N } represents the node set, Represents the edge set. A graph G can be stored as a feature matrix X∈R ^N×d associated with an adjacency matrix A∈R ^N×N . Among them, the feature matrix X is a matrix composed of feature representations of nodes in the graph, and d is the dimension of node feature representations. N is the number of nodes contained in graph G. The adjacency matrix A records whether there is an edge between any two nodes in the graph. A _ij =1 indicates that there is an edge between nodes _vi and v _j , and A _ij =0 indicates that there is no edge between nodes _vi and v _j . To mask the initial edges in the document association graph of multiple document samples, 1 in the adjacency matrix can be covered with a specified mask (such as 0 or NULL), where NULL indicates that it is uncertain whether an edge exists.

步骤S703、根据更新后各节点的特征表示，预测文档关联图中各节点间的关联关系。Step S703: Predict the association between nodes in the document association graph based on the updated feature representation of each node.

在获得更新后各节点的特征表示之后，可以将各节点的特征表示输入第一分类器，通过第一分类器对文档关联图中各节点间是否存在关联关系进行分类预测，确定文档关联图中各节点间的关联关系，得到文档关联图中各节点间的关联关系的预测结果。该预测结果包含文档关联图中各节点间的是否存在边(也即是否存在关联关系)。After obtaining the updated feature representation of each node, the feature representation of each node can be input into the first classifier, and the first classifier can be used to classify and predict whether there is an association relationship between each node in the document association graph, and determine whether there is an association relationship between the nodes in the document association graph. The association relationship between each node is used to obtain the prediction result of the association relationship between each node in the document association graph. The prediction result includes whether there are edges between nodes in the document association graph (that is, whether there is an association relationship).

其中，用于预测文档关联图中各节点间的关联关系的第一分类器可以采用多层感知机(Multi-Layer Perceptron，简称MLP)，或者采用其他分类模型或算法实现，如基于支持向量机、决策树的分类算法等，本实施例此处不做具体限定。Among them, the first classifier used to predict the association between nodes in the document association graph can use a multi-layer perceptron (MLP), or other classification models or algorithms, such as support vector machines. , decision tree classification algorithm, etc., there are no specific limitations here in this embodiment.

步骤S704、根据文档关联图中各节点间的关联关系的预测结果和文档关联图中的初始边，计算第一损失，根据第一损失更新文档表示模型的参数。Step S704: Calculate the first loss based on the prediction result of the association relationship between nodes in the document association graph and the initial edge in the document association graph, and update the parameters of the document representation model based on the first loss.

该步骤中，根据基于更新后各节点的特征表示，确定的文档关联图中各节点间的关联关系的预测结果，以及文档关联图中的初始边，计算交叉熵(Cross-Entropy，简称CE)损失，作为第一损失。进一步地，根据第一损失，通过反向传播的方式，更新文档表示模型的参数。具体可以采用梯度下降法或其他常用参数更新方法，来更新文档表示模型的参数。训练完成后可获得训练好的文档表示模型。In this step, cross-entropy (CE) is calculated based on the updated feature representation of each node, the determined prediction results of the association between nodes in the document association graph, and the initial edges in the document association graph. Loss, as the first loss. Further, according to the first loss, the parameters of the document representation model are updated through backpropagation. Specifically, the gradient descent method or other common parameter update methods can be used to update the parameters of the document representation model. After training is completed, the trained document representation model can be obtained.

需要说明的是，步骤S703中使用的第一分类器可以是预训练的第一分类器，该第一分类器的参数在文档表征模型的训练过程中保持不变；或者，在文档表征模型的训练过程中，基于第一损失更新该第一分类器和文档表示模型的参数，可以在训练过程中不断提升第一分类器的分类准确性，从而可以提升文档表示模型的训练效果。It should be noted that the first classifier used in step S703 may be a pre-trained first classifier whose parameters remain unchanged during the training process of the document representation model; or, in the training process of the document representation model, During the training process, updating the parameters of the first classifier and the document representation model based on the first loss can continuously improve the classification accuracy of the first classifier during the training process, thereby improving the training effect of the document representation model.

本实施例的方法，不仅对单文档进行语义表征，获得各文档的包含丰富语义信息的语义表示，而且对多文档间的关联关系进行建模，将文档作为节点，文档间的关联关系作为边，构建多个文档间的文档关联图。文档关联图的拓扑结构包含了文档的相关性信息。在文本表示模型的训练过程中，使用各文档的包含丰富语义信息的语义表示初始化文档关联图中节点的特征表示，并利用掩码策略将文档关联图中的边进行掩码处理后输入图神经网络进行图表征学习，更新节点的特征表示，根据更新后节点的特征表示来推测节点间的关联关系，并根据预测结果和文档关联图的初始边计算损失并更新模型参数，使得文档表示模型不仅具有学习文档语义信息的能力，同时具有学习文档间相关性信息的能力，可以提升多文档表征的质量。The method of this embodiment not only performs semantic representation on a single document and obtains a semantic representation containing rich semantic information for each document, but also models the association between multiple documents, using the document as a node and the association between documents as an edge. , build a document association graph between multiple documents. The topological structure of the document association graph contains the relevance information of the document. During the training process of the text representation model, the semantic representation of each document containing rich semantic information is used to initialize the feature representation of the nodes in the document association graph, and the masking strategy is used to mask the edges in the document association graph and then input into the graph neural The network performs graph representation learning, updates the feature representation of nodes, infers the association between nodes based on the updated feature representation, and calculates the loss and updates the model parameters based on the prediction results and the initial edges of the document association graph, making the document representation model not only It has the ability to learn document semantic information and also has the ability to learn correlation information between documents, which can improve the quality of multi-document representation.

在一可选实施例中，基于图5所示的语义信息学习模块的结构，语义信息学习模块包括：实体关系图构建模块、第一图神经网络、文本表示模型和语义表示融合模块。具体结构参见前述实施例的相关内容，此处不再赘述。前述步骤702中通过文档表示模型使用各文档样本的语义表示，具体采用如下方式实现：In an optional embodiment, based on the structure of the semantic information learning module shown in Figure 5, the semantic information learning module includes: an entity relationship diagram building module, a first graph neural network, a text representation model and a semantic representation fusion module. For the specific structure, please refer to the relevant content of the foregoing embodiments and will not be described again here. In the aforementioned step 702, the semantic representation of each document sample is used through the document representation model, which is specifically implemented in the following manner:

将文档集中的各文档样本的内容信息输入实体关系图构建模块，通过实体关系图构建模块根据各文档样本的内容信息，构建各文档样本的实体关系图；将文档集中的各文档样本的实体关系图输入第一图神经网络，通过第一图神经网络对各文档样本的实体关系图分别进行图表征学习，获得各文档样本的实体层级的语义表示；并将文档集的多个文档样本的关键内容信息输入文本表示模型，通过文本表示模型对各文档样本的关键内容信息分别进行文本表征，获得各文档样本的文本层级的语义表示；将各文档样本的实体层级的语义表示和文本层级的语义表示融合，得到各文档样本的语义表示。Input the content information of each document sample in the document set into the entity relationship diagram building module, and use the entity relationship diagram building module to construct the entity relationship diagram of each document sample based on the content information of each document sample; enter the entity relationship diagram of each document sample in the document set The graph is input into the first graph neural network, and graph representation learning is performed on the entity relationship diagram of each document sample through the first graph neural network to obtain the entity-level semantic representation of each document sample; and the key points of multiple document samples in the document set are obtained The content information is input into the text representation model, and the key content information of each document sample is separately represented as text through the text representation model to obtain the text-level semantic representation of each document sample; the entity-level semantic representation and text-level semantic representation of each document sample are Representation fusion is used to obtain the semantic representation of each document sample.

在理想情况下，实体层级的信息可以帮助文本层级的信息更多的关注在有用信息上，减少对无用信息的权重；而文本层级的信息可以充实实体层级的信息。但在实际训练过程中，申请人发现单独训练文本表示模型和第一图神经网络，会导致两个层级的信息很难平衡。Ideally, entity-level information can help text-level information focus more on useful information and reduce the weight of useless information; while text-level information can enrich entity-level information. However, during the actual training process, the applicant found that training the text representation model and the first graph neural network separately would make it difficult to balance the information at the two levels.

本实施例中，在文档表征模型的训练过程中，为了平衡文本表示模型和第一图神经网络间的学习，增加第二损失，来约束两个层级信息的学习。具体通过如下方式实现：In this embodiment, during the training process of the document representation model, in order to balance the learning between the text representation model and the first graph neural network, a second loss is added to constrain the learning of the two levels of information. Specifically, this is achieved in the following ways:

根据各文档样本的实体层级的语义表示，预测各文档样本的类别标签，得到第一预测结果，并根据各文档样本的文本层级的语义表示，预测各文档样本的类别标签，得到第二预测结果；计算第一预测结果和第二预测结果间的距离，得到第二损失；根据第二损失更新语义信息学习模块的参数。According to the entity-level semantic representation of each document sample, the category label of each document sample is predicted to obtain the first prediction result; and based on the text-level semantic representation of each document sample, the category label of each document sample is predicted to obtain the second prediction result. ; Calculate the distance between the first prediction result and the second prediction result to obtain the second loss; update the parameters of the semantic information learning module according to the second loss.

其中，文档样本的类别标签表示文档样本的类别，具体可以是样本文档所属领域、主题、类型等等，文档样本的实际类别标签通过预先标注确定。Among them, the category label of the document sample represents the category of the document sample, which can be the field, subject, type, etc. of the sample document. The actual category label of the document sample is determined through pre-annotation.

示例性地，根据各文档样本的实体层级的语义表示，预测各文档样本的类别标签时，可以使用第二分类器实现。将各文档样本的实体层级的语义表示输入第二分类器，通过第二分类器对文档样本的类别进行分类预测，得到第一预测结果。该第一预测结果包含各文档样本的类别标签的分布。其中，任一文档样本的类别标签的分布是一个离散分布，可以表示为维度为K的向量，K为类别标签的总数，向量中的1表示文档具有对应类别标签，向量中的0表示文档不具有对应类别标签。示例性地，第一预测结果可以是各文档样本的类别标签的分布拼接形成的完整离散分布。其中，第二分类器可以采用多层感知机(Multi-LayerPerceptron，简称MLP)，或者采用其他分类模型或算法实现，如基于支持向量机、决策树的分类算法等，本实施例此处不做具体限定。For example, when predicting the category label of each document sample based on the entity-level semantic representation of each document sample, a second classifier can be used. The entity-level semantic representation of each document sample is input into the second classifier, and the category of the document sample is classified and predicted by the second classifier to obtain the first prediction result. The first prediction result includes the distribution of category labels of each document sample. Among them, the distribution of category labels for any document sample is a discrete distribution, which can be expressed as a vector with dimension K, where K is the total number of category labels. 1 in the vector indicates that the document has the corresponding category label, and 0 in the vector indicates that the document does not Have corresponding category labels. For example, the first prediction result may be a complete discrete distribution formed by splicing distributions of category labels of each document sample. Among them, the second classifier can be implemented using a multi-layer perceptron (MLP) or other classification models or algorithms, such as classification algorithms based on support vector machines and decision trees, etc., which are not included in this embodiment. Specific limitations.

示例性地，根据各文档样本的文本层级的语义表示，预测各文档样本的类别标签时，可以使用第三分类器实现。将各文档样本的文本层级的语义表示输入第三分类器，通过第三分类器对文档样本的类别进行分类预测，得到第二预测结果。该第二结果包含各文档样本的类别标签的分布，与第一预测结果包含的信息项类似。其中，第三分类器可以采用多层感知机(Multi-Layer Perceptron，简称MLP)，或者采用其他分类模型或算法实现，如基于支持向量机、决策树的分类算法等，本实施例此处不做具体限定。For example, when predicting the category label of each document sample based on the text-level semantic representation of each document sample, a third classifier can be used. The text-level semantic representation of each document sample is input into a third classifier, and the category of the document sample is classified and predicted by the third classifier to obtain a second prediction result. The second result contains the distribution of category labels of each document sample, which is similar to the information items contained in the first prediction result. Among them, the third classifier can be implemented using a Multi-Layer Perceptron (MLP), or other classification models or algorithms, such as classification algorithms based on support vector machines, decision trees, etc., which are not included in this embodiment. Make specific limitations.

其中，第二分类器和第三分类器可以采用相同或不同的分类器实现，此处不做具体限定。另外，第二分类器、第三分类器可以是预训练的分类器，第二分类器和第三分类器的参数在文档表征模型的训练过程中保持不变。The second classifier and the third classifier can be implemented using the same or different classifiers, and are not specifically limited here. In addition, the second classifier and the third classifier may be pre-trained classifiers, and the parameters of the second classifier and the third classifier remain unchanged during the training process of the document representation model.

可选地，在文档表征模型的训练过程中，可以基于第一预测结果和文档样本的实际类别标签计算第一交叉熵损失，基于第一交叉熵损失更新第二分类器的参数；并基于第二预测结果和文档样本的实际类别标签计算第二交叉熵损失，基于第二交叉熵损失更新第三分类器的参数，从而在训练过程中不断提升这两个分类器的分类准确性，从而可以提升文档表示模型的训练效果。Optionally, during the training process of the document representation model, the first cross-entropy loss can be calculated based on the first prediction result and the actual category label of the document sample, and the parameters of the second classifier can be updated based on the first cross-entropy loss; and based on the first cross-entropy loss; Calculate the second cross-entropy loss based on the second prediction result and the actual category label of the document sample, and update the parameters of the third classifier based on the second cross-entropy loss, thereby continuously improving the classification accuracy of the two classifiers during the training process, so that the Improve the training effect of document representation model.

可选地，第一预测结果和第二预测结果间的距离，具体可以是第一预测结果和第二预测结果这两个离散分布间的沃瑟斯坦(Wasserstein)距离、KL(Kullback-Leibler)散度、曼哈顿距离等，或者是其他用于度量两个离散分布间相似性的信息，此处不做具体限定。在一优选实施方式中，计算第一预测结果和第二预测结果间的沃瑟斯坦(Wasserstein)距离，作为第二损失，使用沃瑟斯坦(Wasserstein)距离来平衡第一图神经网络和文本表示模型间的学习，可以提升模型训练的效果。Optionally, the distance between the first prediction result and the second prediction result may specifically be the Wasserstein distance or KL (Kullback-Leibler) between the two discrete distributions of the first prediction result and the second prediction result. Divergence, Manhattan distance, etc., or other information used to measure the similarity between two discrete distributions, are not specifically limited here. In a preferred embodiment, the Wasserstein distance between the first prediction result and the second prediction result is calculated, and as the second loss, the Wasserstein distance is used to balance the first graph neural network and the text representation. Learning between models can improve the effect of model training.

可选地，还可以根据第一预测结果和各文档样本的实际类别标签，计算第一交叉熵损失；和/或，根据第二预测结果和各文档样本的实际类别标签，计算第二交叉熵损失；根据第一交叉熵损失和/或第二交叉熵损失，确定第三损失，根据第二损失和第三损失来更新语义信息学习模块的参数，可以提升语义信息学习模块的性能，从而提升生成的文档的语义表示的质量。Optionally, the first cross-entropy loss can also be calculated based on the first prediction result and the actual category label of each document sample; and/or the second cross-entropy loss can be calculated based on the second prediction result and the actual category label of each document sample. Loss; determine the third loss based on the first cross entropy loss and/or the second cross entropy loss, and update the parameters of the semantic information learning module based on the second loss and the third loss, which can improve the performance of the semantic information learning module, thereby improving The quality of the semantic representation of the generated documents.

可选地，可以将第一交叉熵损失作为第三损失，或者将第二交叉熵损失作为第三损失，或者将第一交叉熵损失和第二交叉熵损失之和作为第三损失。Alternatively, the first cross-entropy loss may be used as the third loss, or the second cross-entropy loss may be used as the third loss, or the sum of the first cross-entropy loss and the second cross-entropy loss may be used as the third loss.

可选地，根据第二损失和第三损失更新语义信息学习模块的参数时，可以将第二损失和第三损失之和作为综合的语义学习损失，或者第二损失和第三损失加权求和作为综合的语义学习损失，根据综合的语义学习损失更新语义信息学习模块的参数。其中，第二损失和第三损失的权重系数可以根据实际应用场景的需要进行配置和调整，此处不做具体限定。Optionally, when updating the parameters of the semantic information learning module based on the second loss and the third loss, the sum of the second loss and the third loss can be used as the comprehensive semantic learning loss, or the weighted sum of the second loss and the third loss can be used As a comprehensive semantic learning loss, the parameters of the semantic information learning module are updated according to the comprehensive semantic learning loss. Among them, the weight coefficients of the second loss and the third loss can be configured and adjusted according to the needs of the actual application scenario, and are not specifically limited here.

可选地，计算得到第一损失、第二损失和第三损失之后，可以根据第一损失、第二损失和第三损失计算综合损失，根据综合损失更新整个文档表征模型的参数。例如，计算第一损失、第二损失和第三损失之和，作为综合损失；或者，对算第一损失、第二损失和第三损失加权求和，作为综合损失。其中第一损失、第二损失和第三损失的权重系数可以根据实际应用场景的需要进行配置和调整，此处不做具体限定。Optionally, after the first loss, the second loss and the third loss are calculated, the comprehensive loss can be calculated based on the first loss, the second loss and the third loss, and the parameters of the entire document representation model can be updated based on the comprehensive loss. For example, the sum of the first loss, the second loss and the third loss is calculated as the comprehensive loss; or the weighted sum of the first loss, the second loss and the third loss is calculated as the comprehensive loss. The weight coefficients of the first loss, the second loss and the third loss can be configured and adjusted according to the needs of the actual application scenario, and are not specifically limited here.

示例性地，图8为本申请一示例性实施例提供的文档表征模型的训练框架示例图，基于图6所示的文档表征模型，在训练过程中增加与第一图神经网络对应的第一分类器、与语言模型对应的第二分类器、与第二图神经网络对应的第三分类器。在文档表征模型的训练过程中，将各文档样本(如图8中所示的P_A...P_B)的实体关系图输入第一图神经网络进行图表征学习，得到各文档样本包含实体的特征表示(如图8中所示的e₁、e₂、e₅，e₅、e₆、e₉)，进一步可以确定各个文档样本的实体层级的语义表示，如图8中所示的其中上标E表示实体层级。将各个文档样本的实体层级的语义表示输入第一分类器进行分类预测，得到第一预测结果，如图8所示的μ^E。Exemplarily, Figure 8 is an example diagram of the training framework of the document representation model provided by an exemplary embodiment of the present application. Based on the document representation model shown in Figure 6, a first graph corresponding to the first graph neural network is added during the training process. a classifier, a second classifier corresponding to the language model, and a third classifier corresponding to the second graph neural network. During the training process of the document representation model, the entity relationship diagram of each document sample ( _PA ...P _B as shown in Figure 8) is input into the first graph neural network for graph representation learning, and it is obtained that each document sample contains entities The feature representation (e ₁ , e ₂ , e ₅ , e ₅ , e ₆ , e ₉ shown in Figure 8 ) can further determine the entity-level semantic representation of each document sample, as shown in Figure 8 The superscript E represents the entity level. The entity-level semantic representation of each document sample is input into the first classifier for classification prediction, and the first prediction result is obtained, as shown in μ ^E in Figure 8.

并且，将各文档样本的摘要信息输入语言模型进行文本表示，得到各文档样本的文本层级的语义表示，如图8中所示的其中上标T表示文本层级。将各个文档样本的文本层级的语义表示输入第二分类器进行分类预测，得到第二预测结果，如图8所示的μ^T。Moreover, the summary information of each document sample is input into the language model for text representation, and the text-level semantic representation of each document sample is obtained, as shown in Figure 8 The superscript T represents the text level. The text-level semantic representation of each document sample is input into the second classifier for classification prediction, and the second prediction result is obtained, as shown in Figure 8 as μ ^T .

通过计算第一预测结果μ^E和第二预测结果μ^T间的沃瑟斯坦(Wasserstein)距离d，作为第二损失来约束第一图神经网络和语言模型的学习，并结合实际类别标签计算交叉熵损失(图8中未示出)，再确定综合的语义信息学习损失l₁(图8中未示出)。By calculating the Wasserstein distance d between the first prediction result μ ^E and the second prediction result μ ^T , it is used as the second loss to constrain the learning of the first graph neural network and language model, and the intersection is calculated in combination with the actual category label Entropy loss (not shown in Figure 8), and then determine the comprehensive semantic information learning loss l ₁ (not shown in Figure 8).

将各文档的实体层级的语义表示和文本层级的语义表示融合，使用融合得到的各文档的语义表示初始化文档关联图中各节点的特征表示，并对文档关联图中的边进行掩码(图8所示文档关联图中的虚线表示边被掩码，实线表示实际存在边或预测存在边)得到初始化后的文档关联图。将初始化后的文档关联图输入第二图神经网络进行图表征学习，更新各节点的特征表示。通过第三分类器根据更新后各节点的特征表示，预测各节点间是否存在边，得到边预测结果。根据文档关联图中各节点间的关联关系的预测结果和文档关联图中的初始边，计算交叉熵损失，作为第一损失，如图8中所示的l₂。Fuse the entity-level semantic representation of each document with the text-level semantic representation, use the fused semantic representation of each document to initialize the feature representation of each node in the document association graph, and mask the edges in the document association graph (Fig. The dotted line in the document association graph shown in 8 indicates that the edge is masked, and the solid line indicates the actual existing edge or the predicted existing edge) to obtain the initialized document association graph. The initialized document association graph is input into the second graph neural network for graph representation learning, and the feature representation of each node is updated. The third classifier is used to predict whether there is an edge between each node based on the updated feature representation of each node, and the edge prediction result is obtained. According to the predicted results of the association between nodes in the document association graph and the initial edges in the document association graph, the cross-entropy loss is calculated as the first loss, as l ₂ shown in Figure 8.

其中，l₁用于更新第一图神经网络和语言模型的参数，l₂用于更新第二图神经网络的参数；或者，基于l₁+l₂更新一图神经网络、语言模型和第二图神经网络的参数。Among them, l ₁ is used to update the parameters of the first graph neural network and language model, and l ₂ is used to update the parameters of the second graph neural network; or, based on l ₁ + l ₂ , update the first graph neural network, language model and second graph neural network. Parameters of graph neural networks.

需要说明的是，当文档表征模型的结构不同时，在基于文档表征模型进行多文档表征或模型训练的处理流程会有所不同，具体参见前述实施例中对各种可选模型结构的说明，通过实施例的结合可以获知各种可能的实施方式，本实施例不再一一赘述。It should be noted that when the structure of the document representation model is different, the processing flow of multi-document representation or model training based on the document representation model will be different. For details, please refer to the description of various optional model structures in the aforementioned embodiments. Various possible implementation modes can be learned through the combination of embodiments, and the details will not be described one by one in this embodiment.

需要说明的是，前述模型训练方案中，对文档表示模型的语义信息学习模块和相关性信息学习模块进行联合训练获得。在其他实施例中，文档表示模型的语义信息学习模块和相关性信息学习模块可以分开训练，分别训练语义信息学习模块和相关性信息学习模块，将训练完成的语义信息学习模块和相关性信息学习模块组合，得到文档表示模型。在训练语义信息学习模块的过程中，无需计算第一损失，仅计算前述第二损失和第三损失，根据第二损失和第三损失，更新语义信息学习模块的参数，无需计算第一损失。在训练相关性信息学习模块的过程中，计算第一损失，根据第一损失更新文档表示模型的参数，无需计算第二损失和第三损失。各个损失的计算方式参见前述模型训练实施例中的相关内容，此处不再赘述。It should be noted that in the aforementioned model training scheme, the semantic information learning module and the correlation information learning module of the document representation model are jointly trained. In other embodiments, the semantic information learning module and the correlation information learning module of the document representation model can be trained separately, and the semantic information learning module and the correlation information learning module are trained separately, and the trained semantic information learning module and correlation information learning module are Modules are combined to obtain a document representation model. In the process of training the semantic information learning module, there is no need to calculate the first loss. Only the aforementioned second loss and third loss are calculated. According to the second loss and the third loss, the parameters of the semantic information learning module are updated without calculating the first loss. In the process of training the correlation information learning module, the first loss is calculated, and the parameters of the document representation model are updated according to the first loss, without the need to calculate the second loss and the third loss. For the calculation method of each loss, please refer to the relevant content in the aforementioned model training embodiment, and will not be described again here.

图9为本申请一示例性实施例提供的一种文档表征方法的流程图，本实施例基于图1所示的系统架构，本实施例的执行主体为图1所示系统架构中的服务器。如图9所示，该方法具体步骤如下：FIG. 9 is a flow chart of a document characterization method provided by an exemplary embodiment of the present application. This embodiment is based on the system architecture shown in FIG. 1 , and the execution subject of this embodiment is the server in the system architecture shown in FIG. 1 . As shown in Figure 9, the specific steps of this method are as follows:

步骤S901、接收端侧设备提交的文档集，文档集包含待处理的多个文档。Step S901: Receive a document set submitted by the end-side device. The document set contains multiple documents to be processed.

本实施例中，端侧设备在运行下游应用的过程中需要对文档集中的多个文档进行表征获得文档表示时，将文档集提交至服务器。服务器接收端侧设备提交的文档集，从而获取到文档集中包含的待处理的多个文档。In this embodiment, when the end-side device needs to characterize multiple documents in the document set to obtain the document representation during the process of running the downstream application, it submits the document set to the server. The server receives the document set submitted by the end-side device, thereby obtaining multiple documents to be processed contained in the document set.

示例性地，端侧设备向服务器提交文档集时，可以通过前端界面上传文档集；或者，端侧设备向服务器提交包含文档集的请求，使得服务器获取请求包含的数据并返回结果。例如，端侧设备通过调用服务器提供的文档表征服务的调用接口，向服务器发送包含文档集的调用请求。另外，还可以采用其他端侧向服务器提交文件的方式实现，此处不再赘述。For example, when the end-side device submits a document set to the server, it can upload the document set through the front-end interface; or, the end-side device submits a request containing the document set to the server, so that the server obtains the data contained in the request and returns the result. For example, the terminal device sends a calling request containing a document set to the server by calling the calling interface of the document representation service provided by the server. In addition, it can also be implemented by submitting files to the server from other terminals, which will not be described here.

步骤S902、构建文档集对应的文档关联图，文档关联图包括各文档对应的节点和表示文档间关联关系的边。Step S902: Construct a document association graph corresponding to the document set. The document association graph includes nodes corresponding to each document and edges representing association relationships between documents.

获取到包含待处理的多个文档的文档集之后，服务器可以自动构建文档集对应的文档关联图，也即构建文档集中多个文档的文档关联图。具体地，对于文档集包含的多个文档，构建各个文档对应的节点；根据文档间的关联关系，在任意两个具有关联关系的文档对应的节点之间建立边，得到文档集中多个文档的文档关联图。需要说明的是，同一文档关联图中可以包含一种或多种不同类型的边，不同类型的边表示不同的关联关系。After obtaining a document set containing multiple documents to be processed, the server can automatically build a document association graph corresponding to the document set, that is, a document association graph of multiple documents in the document set. Specifically, for multiple documents included in the document set, nodes corresponding to each document are constructed; according to the association between documents, an edge is established between the nodes corresponding to any two documents with an association relationship, and the number of documents in the document set is obtained. Document association diagram. It should be noted that the same document association graph can contain one or more different types of edges, and different types of edges represent different association relationships.

其中，文档间的关联关系表示文档间的相关性，具体可以是引用、同一作者、给同一目标对象、同一描述对象等。例如，论文类的文档间可以具有引用关系，标书类的文档可以属于同一投标人/招标人，知识类的文档可以描述同一物品等。不同的应用场景中，针对不同文档集，文档间的关联关系可以不同。对于具体应用场景的文档集中不同文档间的关联关系，可以采用数据分析、挖掘的方式确定，具体可以采用现有技术获得，此处不做具体限定。Among them, the association relationship between documents represents the correlation between documents, which can be specific references, the same author, the same target object, the same description object, etc. For example, there can be reference relationships between paper-type documents, tender-type documents can belong to the same bidder/tenderer, knowledge-type documents can describe the same item, etc. In different application scenarios, for different document sets, the relationships between documents can be different. The correlation between different documents in the document set of a specific application scenario can be determined through data analysis and mining. The details can be obtained using existing technologies, and are not specifically limited here.

本实施例中，通过步骤S901-S902，实现步骤S301中获取待处理的多个文档，以及多个文档的文档关联图的处理过程。在另一实施例中，文档集对应的文档关联图也可以由端侧设备构建，端侧设备向服务器提交文档集和文档集对应的文档关联图。In this embodiment, through steps S901-S902, the process of obtaining multiple documents to be processed and the document association graphs of the multiple documents in step S301 is implemented. In another embodiment, the document association graph corresponding to the document set can also be constructed by the end-side device, and the end-side device submits the document set and the document association graph corresponding to the document set to the server.

步骤S903、将多个文档的内容信息和文档关联图输入文档表示模型，通过文档表示模型基于各文档的内容信息生成各文档的语义表示，使用各文档的语义表示初始化文档关联图中各节点的特征表示，基于初始化后的文档关联图进行图表征学习，更新各节点的特征表示。Step S903: Input the content information and document association graphs of multiple documents into the document representation model, generate a semantic representation of each document based on the content information of each document through the document representation model, and use the semantic representation of each document to initialize the nodes in the document association graph. Feature representation, graph representation learning is performed based on the initialized document association graph, and the feature representation of each node is updated.

该步骤与前述步骤S302的具体实现方式一致，具体参见前述实施例的相关内容，此处不再赘述。This step is consistent with the specific implementation manner of the aforementioned step S302. For details, please refer to the relevant content of the aforementioned embodiments and will not be described again here.

步骤S904、将更新后各节点的特征表示作为对应文档的文档表示。Step S904: Use the updated feature representation of each node as the document representation of the corresponding document.

该步骤与前述步骤S303的具体实现方式一致，具体参见前述实施例的相关内容，此处不再赘述。This step is consistent with the specific implementation manner of the aforementioned step S303. For details, please refer to the relevant content of the aforementioned embodiments and will not be described again here.

步骤S905、向端侧设备发送文档集中各文档的文档表示。Step S905: Send the document representation of each document in the document set to the end-side device.

本实施例的方法，服务器接收端侧设备提交的文档集，构建文档集对应的文档关联图，并将多个文档的内容信息和文档关联图输入文档表示模型，通过文档表示模型基于各文档的内容信息生成各文档的语义表示，使用各文档的语义表示初始化文档关联图中各节点的特征表示，基于初始化后的文档关联图进行图表征学习，更新各节点的特征表示，可以实现端到端的多文档表征，无需端侧设备提供文档关联图等复杂的数据。In the method of this embodiment, the server receives a document set submitted by the end-side device, constructs a document association graph corresponding to the document set, and inputs the content information and document association graphs of multiple documents into the document representation model. The document representation model is based on the document representation of each document. The content information generates the semantic representation of each document, uses the semantic representation of each document to initialize the feature representation of each node in the document association graph, performs graph representation learning based on the initialized document association graph, and updates the feature representation of each node, which can achieve end-to-end Multi-document representation eliminates the need for end-side devices to provide complex data such as document association graphs.

本实施例中，通过文档表征获得的各文档的文档表示之后，服务器将文档集中各文档的文档表示返回至端侧设备。端侧设备接收到文档集中各文档的文档表示之后，可以根据多个文档的文档表示继续执行下游应用的文档处理逻辑，实现下游应用的文档处理任务，获得文档处理结果。In this embodiment, after obtaining the document representation of each document through document characterization, the server returns the document representation of each document in the document set to the end-side device. After receiving the document representation of each document in the document set, the end-side device can continue to execute the document processing logic of the downstream application based on the document representation of multiple documents, implement the document processing task of the downstream application, and obtain the document processing result.

示例性地，以文档分类场景为例，端侧设备可以将包含待分类的多个文档的文档集提交至服务器。服务器对文档集中的多个文档进行表征获得各个文档的文档表示，将各个文档的文档表示返回至端侧设备。端侧设备根据各文档的文档表示进行文档分类，得到文档分类结果。Illustratively, taking the document classification scenario as an example, the end-side device can submit a document set containing multiple documents to be classified to the server. The server characterizes multiple documents in the document set to obtain the document representation of each document, and returns the document representation of each document to the end-side device. The end-side device performs document classification based on the document representation of each document and obtains the document classification result.

示例性地，以文档检索场景为例，端侧设备可以将包含检索库中多个文档的文档集提交至服务器。服务器对文档集包含的多个文档进行表征获得检索库中多个文档的文档表示，将检索库中多个文档的文档表示返回至端侧设备。端侧设备更新/存储检索库中多个文档的文档表示。进一步地，端侧设备基于更换后的检索库进行文档检索。例如，端侧设备根据用户输入的查询信息，将查询信息映射为向量表示，将向量表示与检索库中各文档的文档表示进行相似度匹配，获得与查询信息匹配的一个或者多个文档，作为检索结果，并输出检索结果。Illustratively, taking the document retrieval scenario as an example, the terminal device can submit a document set containing multiple documents in the retrieval library to the server. The server characterizes multiple documents contained in the document set to obtain document representations of multiple documents in the retrieval database, and returns the document representations of multiple documents in the retrieval database to the terminal device. The end-side device updates/stores document representations of multiple documents in the retrieval library. Further, the end-side device performs document retrieval based on the replaced retrieval database. For example, the end-side device maps the query information into a vector representation based on the query information input by the user, performs similarity matching between the vector representation and the document representation of each document in the retrieval database, and obtains one or more documents that match the query information, as Search results and output the search results.

本实施例的方法应用于各下游应用时，可以由服务器在获得各文档的文档表示之后，根据多个文档的文档表示继续执行设定的处理逻辑，获得文档处理结果，并向端侧设备返回文档处理结果。When the method of this embodiment is applied to each downstream application, the server can continue to execute the set processing logic based on the document representations of multiple documents after obtaining the document representation of each document, obtain the document processing result, and return it to the end-side device. Document processing results.

示例性地，在文档分类场景中，端侧设备可以将包含待分类的多个文档的文档集提交至服务器。服务器对文档集中的多个文档进行表征获得各个文档的文档表示之后，服务器根据各文档的文档表示，对多个文档进行分类，得到文档分类结果，并向端侧设备发送文档分类结果。For example, in a document classification scenario, the end-side device may submit a document set containing multiple documents to be classified to the server. After the server characterizes multiple documents in the document set to obtain the document representation of each document, the server classifies the multiple documents according to the document representation of each document, obtains the document classification result, and sends the document classification result to the end-side device.

示例性地，在文档关联关系预测场景中，端侧设备可以将包含待处理的多个文档的文档集提交至服务器。服务器对文档集中的多个文档进行表征获得各个文档的文档表示之后，服务器根据各文档的文档表示，预测各文档间的关联关系，并向端侧设备发送各文档间的关联关系。或者，服务器对文档集中的多个文档进行表征获得各个文档的文档表示之后，向端侧设备返回文档集中各文档的文档表示。端侧设备根据各文档的文档表示，预测各文档间的关联关系，得到各文档间的关联关系。For example, in a document association relationship prediction scenario, the end-side device may submit a document set containing multiple documents to be processed to the server. After the server characterizes multiple documents in the document set to obtain the document representation of each document, the server predicts the association between each document based on the document representation of each document, and sends the association between each document to the end-side device. Alternatively, after the server characterizes multiple documents in the document set to obtain the document representation of each document, it returns the document representation of each document in the document set to the end-side device. The end-side device predicts the association between each document based on the document representation of each document, and obtains the association between each document.

示例性地，在文档检索场景下，服务器基于前述方法实施例的方法，对文档检索库中的多个文档进行文档表征，获得各文档的文档表示；进一步地，服务器根据多个文档的文档表示，更新文档检索库中各文档的文档表示。Exemplarily, in the document retrieval scenario, the server performs document representation on multiple documents in the document retrieval library based on the method of the aforementioned method embodiment, and obtains the document representation of each document; further, the server performs document representation on the multiple documents based on the document representation of the multiple documents. , update the document representation of each document in the document retrieval database.

进一步地，服务器还可以基于更新后的文档检索库，实现在线文档检索的处理。具体地，服务器接收端侧设备发送的文档检索请求，文档检索请求包含输入的查询信息。服务器将查询信息映射为向量表示，将向量表示与文档检索库中各文档的文档表示进行相似度匹配，得到与查询信息匹配的至少一个目标文档，向端侧设备发送至少一个目标文档。Furthermore, the server can also implement online document retrieval processing based on the updated document retrieval database. Specifically, the server receives a document retrieval request sent by the end-side device, and the document retrieval request includes the input query information. The server maps the query information into a vector representation, performs similarity matching between the vector representation and the document representation of each document in the document retrieval database, obtains at least one target document that matches the query information, and sends at least one target document to the end-side device.

图10为本申请另一示例性实施例提供的文档检索方法的流程图，本实施例的执行主体可以运行有文档检索系统的本地或云端的服务器，该服务器存储有文档检索库，文档检索库中多个文档的文档表示是通过前述文档表征实施例的方法表征得到的。如图10所示，该方法具体步骤如下：Figure 10 is a flow chart of a document retrieval method provided by another exemplary embodiment of the present application. The execution subject of this embodiment can run a local or cloud server with a document retrieval system. The server stores a document retrieval library. The document retrieval library The document representations of multiple documents in are obtained by characterizing the method of the foregoing document representation embodiment. As shown in Figure 10, the specific steps of this method are as follows:

步骤S1001、接收端侧设备发送的文档检索请求，文档检索请求包含输入的查询信息。Step S1001: Receive a document retrieval request sent by the end-side device. The document retrieval request includes the input query information.

其中，输入的查询信息可以包含如下至少一项：用户通过前端界面的输入框输入的关键词、用户从前端界面中选定的至少一个信息项(例如可选的关键词、筛选条件等)。The input query information may include at least one of the following: keywords entered by the user through the input box of the front-end interface, at least one information item selected by the user from the front-end interface (such as optional keywords, filtering conditions, etc.).

步骤S1002、将查询信息映射为向量表示，将向量表示与文档检索库中各文档的文档表示进行相似度匹配，得到与查询信息匹配的至少一个目标文档。Step S1002: Map the query information into a vector representation, perform similarity matching between the vector representation and the document representation of each document in the document retrieval database, and obtain at least one target document that matches the query information.

其中，将查询信息映射为向量表示，具体可以采用现有技术中任意一种将文本信息转换为向量表示的方法实现，此处不做具体限定。Among them, mapping the query information into a vector representation can be implemented by any method in the prior art for converting text information into a vector representation, which is not specifically limited here.

将向量表示与文档检索库中各文档的文档表示进行相似度匹配时，可以计算查询信息的向量表示与文档检索库中各文档的文档表示的相似度，选择相似度较高的至少一个文档，作为与查询信息匹配的目标文档。When similarity matching is performed between the vector representation and the document representation of each document in the document retrieval database, the similarity between the vector representation of the query information and the document representation of each document in the document retrieval database can be calculated, and at least one document with higher similarity is selected. As a target document that matches the query information.

可选地，可以按照相似度由高到低的顺序对文档检索库中的文档进行排序，根据排序结果选择排序靠前的预设数量的文档，作为与查询信息匹配的目标文档。其中预设数量可以根据实际应用场景的需要和经验值进行配置和调整，此处不做具体限定。Optionally, the documents in the document retrieval database can be sorted in order of similarity from high to low, and a preset number of top-ranked documents are selected according to the sorting results as target documents matching the query information. The preset number can be configured and adjusted according to the needs of actual application scenarios and experience values, and there is no specific limit here.

可选地，可以根据相似度，选择相似度大于预设相似度阈值的文档，作为与查询信息匹配的目标文档。其中预设相似度阈值可以根据实际应用场景的需要和经验值进行配置和调整，此处不做具体限定。Optionally, based on the similarity, documents whose similarity is greater than a preset similarity threshold can be selected as target documents matching the query information. The preset similarity threshold can be configured and adjusted according to the needs of actual application scenarios and experience values, and is not specifically limited here.

本实施例中，采用前述实施例的方案对文档检索库中的多个文档进行文档表征，获得各文档的文档表示。具体地，获取文档检索库中待表征的多个文档，以及多个文档的文档关联图，文档关联图包括各文档对应的节点和表示文档间关联关系的边；将多个文档和文档关联图输入文档表示模型，通过文档表示模型基于使用各文档的语义表示初始化的文档关联图进行图表征学习，更新各节点的特征表示；将更新后各节点的特征表示作为对应文档的文档表示。进一步地，根据多个文档的文档表示，更新文档检索库中各文档的文档表示。其中，对文档检索库中的多个文档进行表征，获得各文档的文档表示的具体实现过程，参见前述实施例中的相关内容，本实施例此处不再赘述。In this embodiment, the solution of the aforementioned embodiment is used to perform document characterization on multiple documents in the document retrieval database, and the document representation of each document is obtained. Specifically, multiple documents to be represented in the document retrieval database are obtained, as well as document association graphs of the multiple documents. The document association graph includes nodes corresponding to each document and edges representing the association relationships between documents; multiple documents and document association graphs are Input the document representation model, and use the document representation model to perform graph representation learning based on the document association graph initialized using the semantic representation of each document, and update the feature representation of each node; use the updated feature representation of each node as the document representation of the corresponding document. Further, based on the document representations of the multiple documents, the document representation of each document in the document retrieval database is updated. For the specific implementation process of characterizing multiple documents in the document retrieval database and obtaining the document representation of each document, please refer to the relevant content in the foregoing embodiments, which will not be described again in this embodiment.

步骤S1003、向端侧设备发送至少一个目标文档。Step S1003: Send at least one target document to the end-side device.

服务器将检索到的与查询信息匹配的至少一个目标文档返回至端侧设备，以使端侧设备向用户输出检索到的与查询信息匹配的至少一个目标文档的相关信息。另外，服务器可以根据检索系统的配置，将需要向用户输出的目标文档的一项或多项关键信息发送至端侧设备，使得端侧设备输出目标文档的一项或多项关键信息。The server returns the retrieved at least one target document that matches the query information to the end-side device, so that the end-side device outputs relevant information of the retrieved at least one target document that matches the query information to the user. In addition, according to the configuration of the retrieval system, the server can send one or more key information of the target document that needs to be output to the user to the end-side device, so that the end-side device outputs one or more key information of the target document.

本实施例提供了一种文档检索方法，通过采用前述实施例的方案对文档检索库中的多个文档进行文档表征，获得各文档的文档表示，根据多个文档的文档表示，更新文档检索库中各文档的文档表示，可以提升文档检索库中文档表示的质量，从而可以提升文档检索的准确性和召回率。This embodiment provides a document retrieval method. By using the solution of the previous embodiment to perform document characterization on multiple documents in the document retrieval library, the document representation of each document is obtained, and the document retrieval library is updated according to the document representation of the multiple documents. The document representation of each document in the document can improve the quality of document representation in the document retrieval library, thereby improving the accuracy and recall rate of document retrieval.

图11为本申请实施例提供的一种服务器的结构示意图。如图11所示，本实施例的服务器可以包括：至少一个处理器1101；以及与至少一个处理器通信连接的存储器1102。Figure 11 is a schematic structural diagram of a server provided by an embodiment of the present application. As shown in Figure 11, the server in this embodiment may include: at least one processor 1101; and a memory 1102 communicatively connected to the at least one processor.

其中，存储器1102存储有可被至少一个处理器1101执行的指令，指令被至少一个处理器1101执行，以使服务器执行如上述任一实施例的方法，其具体功能和所能实现的技术效果类似，此处不再赘述。Among them, the memory 1102 stores instructions that can be executed by at least one processor 1101. The instructions are executed by at least one processor 1101, so that the server performs the method of any of the above embodiments. Its specific functions and the technical effects that can be achieved are similar. , which will not be described again here.

可选地，存储器1102既可以是独立的，也可以跟处理器1101集成在一起。可选的，如图11所示，该服务器还包括：防火墙1103、负载均衡器1104、通信组件1105、电源组件1106等其它组件。图11中仅示意性给出部分组件，并不意味着服务器只包括图11所示组件。Optionally, the memory 1102 can be independent or integrated with the processor 1101. Optionally, as shown in Figure 11, the server also includes: a firewall 1103, a load balancer 1104, a communication component 1105, a power supply component 1106 and other other components. Only some components are schematically shown in Figure 11, which does not mean that the server only includes the components shown in Figure 11.

本申请实施例还提供一种计算机可读存储介质，计算机可读存储介质中存储有计算机执行指令，当处理器执行计算机执行指令时，实现前述任一实施例的方法，具体功能和所能实现的技术效果此处不再赘述。Embodiments of the present application also provide a computer-readable storage medium. Computer-executable instructions are stored in the computer-readable storage medium. When the processor executes the computer-executed instructions, the method of any of the foregoing embodiments is implemented. The specific functions and what can be achieved are The technical effects will not be repeated here.

本申请实施例还提供一种计算机程序产品，包括计算机程序，该计算机程序被处理器执行时实现前述任一实施例的方法。计算机程序存储在可读存储介质中，服务器的至少一个处理器可以从可读存储介质读取计算机程序，至少一个处理器执行计算机程序使得服务器执行上述任一方法实施例所提供的技术方案，具体功能和所能实现的技术效果此处不再赘述。An embodiment of the present application also provides a computer program product, including a computer program, which implements the method of any of the foregoing embodiments when executed by a processor. The computer program is stored in a readable storage medium. At least one processor of the server can read the computer program from the readable storage medium. At least one processor executes the computer program so that the server executes the technical solution provided by any of the above method embodiments. Specifically The functions and technical effects that can be achieved are not repeated here.

本申请实施例提供一种芯片，包括：处理模块与通信接口，该处理模块能执行前述方法实施例中服务器的技术方案。可选的，该芯片还包括存储模块(如，存储器)，存储模块用于存储指令，处理模块用于执行存储模块存储的指令，并且对存储模块中存储的指令的执行使得处理模块执行前述任一方法实施例所提供的技术方案。An embodiment of the present application provides a chip, which includes: a processing module and a communication interface. The processing module can execute the technical solution of the server in the foregoing method embodiment. Optionally, the chip also includes a storage module (such as a memory), the storage module is used to store instructions, the processing module is used to execute the instructions stored in the storage module, and the execution of the instructions stored in the storage module causes the processing module to perform any of the foregoing. A technical solution provided by a method embodiment.

上述以软件功能模块的形式实现的集成的模块，可以存储在一个计算机可读取存储介质中。上述软件功能模块存储在一个存储介质中，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)或处理器执行本申请各个实施例方法的部分步骤。The above integrated modules implemented in the form of software function modules can be stored in a computer-readable storage medium. The above-mentioned software function module is stored in a storage medium and includes a number of instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute some steps of the methods of various embodiments of the present application.

应理解，上述处理器可以是处理单元(Central Processing Unit，简称CPU)，还可以是其它通用处理器、数字信号处理器(Digital Signal Processor，简称DSP)、专用集成电路(Application Specific Integrated Circuit，简称ASIC)等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合申请所公开的方法的步骤可以直接体现为硬件处理器执行完成，或者用处理器中的硬件及软件模块组合执行完成。存储器可能包含高速随机存取存储器(Random Access Memory，简称RAM)，也可能还包括非易失性存储NVM，例如至少一个磁盘存储器，还可以为U盘、移动硬盘、只读存储器、磁盘或光盘等。It should be understood that the above-mentioned processor can be a processing unit (Central Processing Unit, CPU for short), or other general-purpose processor, Digital Signal Processor (Digital Signal Processor, DSP for short), Application Specific Integrated Circuit (Application Specific Integrated Circuit, for short) ASIC) etc. A general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc. The steps of the method disclosed in the application can be directly implemented by a hardware processor, or executed by a combination of hardware and software modules in the processor. The memory may include high-speed random access memory (RAM), or may also include non-volatile storage NVM, such as at least one disk memory, or it may be a USB flash drive, mobile hard disk, read-only memory, magnetic disk or optical disk wait.

上述存储器可以是对象存储(Object Storage Service，简称OSS)。The above-mentioned storage may be an object storage service (OSS).

上述存储器可以由任何类型的易失性或非易失性存储设备或者它们的组合实现，如静态随机存取存储器(SRAM)，电可擦除可编程只读存储器(EEPROM)，可擦除可编程只读存储器(EPROM)，可编程只读存储器(PROM)，只读存储器(ROM)，磁存储器，快闪存储器，磁盘或光盘。The above memory can be implemented by any type of volatile or non-volatile storage device or their combination, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable and programmable read-only memory (EEPROM). Programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.

上述通信组件被配置为便于通信组件所在设备和其他设备之间有线或无线方式的通信。通信组件所在设备可以接入基于通信标准的无线网络，如移动热点(WiFi)，第二代移动通信系统(2G)、第三代移动通信系统(3G)、第四代移动通信系统(4G)/长期演进(LTE)、第五代移动通信系统(5G)等移动通信网络，或它们的组合。在一个示例性实施例中，通信组件经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一个示例性实施例中，通信组件还包括近场通信(NFC)模块，以促进短程通信。例如，在NFC模块可基于射频识别(RFID)技术，红外数据协会(IrDA)技术，超宽带(UWB)技术，蓝牙(BT)技术和其他技术来实现。The above communication component is configured to facilitate wired or wireless communication between the device where the communication component is located and other devices. The device where the communication component is located can access wireless networks based on communication standards, such as mobile hotspots (WiFi), second-generation mobile communication systems (2G), third-generation mobile communication systems (3G), and fourth-generation mobile communication systems (4G). / Long Term Evolution (LTE), fifth generation mobile communication system (5G) and other mobile communication networks, or their combination. In an exemplary embodiment, the communication component receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.

上述电源组件，为电源组件所在设备的各种组件提供电力。电源组件可以包括电源管理系统，一个或多个电源，及其他与为电源组件所在设备生成、管理和分配电力相关联的组件。The above-mentioned power supply component provides power to various components of the device where the power supply component is located. A power component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to the device in which the power component resides.

上述存储介质可以是由任何类型的易失性或非易失性存储设备或者它们的组合实现，如静态随机存取存储器(SRAM)，电可擦除可编程只读存储器(EEPROM)，可擦除可编程只读存储器(EPROM)，可编程只读存储器(PROM)，只读存储器(ROM)，磁存储器，快闪存储器，磁盘或光盘。存储介质可以是通用或专用计算机能够存取的任何可用介质。The above storage medium can be implemented by any type of volatile or non-volatile storage device or their combination, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable Except programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. Storage media can be any available media that can be accessed by a general purpose or special purpose computer.

一种示例性的存储介质耦合至处理器，从而使处理器能够从该存储介质读取信息，且可向该存储介质写入信息。当然，存储介质也可以是处理器的组成部分。处理器和存储介质可以位于专用集成电路(Application Specific Integrated Circuits，简称ASIC)中。当然，处理器和存储介质也可以作为分立组件存在于电子设备或主控设备中。An exemplary storage medium is coupled to the processor such that the processor can read information from the storage medium and write information to the storage medium. Of course, the storage medium can also be an integral part of the processor. The processor and storage medium may be located in Application Specific Integrated Circuits (ASIC). Of course, the processor and the storage medium may also exist as discrete components in an electronic device or a host control device.

需要说明的是，在本文中，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括该要素的过程、方法、物品或者装置中还存在另外的相同要素。It should be noted that, in this document, the terms "comprising", "comprises" or any other variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or device that includes a series of elements not only includes those elements, It also includes other elements not expressly listed or inherent in the process, method, article or apparatus. Without further limitation, an element defined by the statement "comprises a..." does not exclude the presence of additional identical elements in a process, method, article or apparatus that includes that element.

上述本申请实施例的顺序仅仅为了描述，不代表实施例的优劣。另外，在上述实施例及附图中的描述的一些流程中，包含了按照特定顺序出现的多个操作，但是应该清楚了解，这些操作可以不按照其在本文中出现的顺序来执行或并行执行，仅仅是用于区分开各个不同的操作，序号本身不代表任何的执行顺序。另外，这些流程可以包括更多或更少的操作，并且这些操作可以按顺序执行或并行执行。需要说明的是，本文中的“第一”、“第二”等描述，是用于区分不同的消息、设备、模块等，不代表先后顺序，也不限定“第一”和“第二”是不同的类型。“多个”的含义是两个以上，除非另有明确具体的限定。The order of the above embodiments of the present application is only for description and does not represent the advantages and disadvantages of the embodiments. In addition, some of the processes described in the above embodiments and drawings include multiple operations that appear in a specific order, but it should be clearly understood that these operations may not be performed in the order in which they appear in this document or may be performed in parallel. , is only used to distinguish different operations, and the sequence number itself does not represent any execution order. Additionally, these processes may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that the descriptions such as "first" and "second" in this article are used to distinguish different messages, devices, modules, etc., and do not represent the order, nor do they limit "first" and "second" are different types. "Plural" means more than two, unless otherwise clearly and specifically limited.

通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中，包括若干指令用以使得一台终端设备(可以是手机，计算机，服务器，空调器，或者网络设备等)执行本申请各个实施例的方法。Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general hardware platform. Of course, it can also be implemented by hardware, but in many cases the former is better. implementation. Based on this understanding, the technical solution of the present application can be embodied in the form of a software product in essence or that contributes to the existing technology. The computer software product is stored in a storage medium (such as ROM/RAM, disk, CD), including several instructions to cause a terminal device (which can be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods of various embodiments of the present application.

本领域技术人员在考虑说明书及实践这里公开的发明后，将容易想到本申请的其它实施方案。本申请旨在涵盖本申请的任何变型、用途或者适应性变化，这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未公开的本技术领域中的公知常识或惯用技术手段。Other embodiments of the present application will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary technical means in the technical field that are not disclosed in this application. .

以上仅为本申请的优选实施例，并非因此限制本申请的专利范围，凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本申请的专利保护范围内。The above are only preferred embodiments of the present application, and are not intended to limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present application may be directly or indirectly used in other related technical fields. , are all equally included in the patent protection scope of this application.

Claims

1. A method of document characterization, comprising:

acquiring a plurality of documents to be processed and a document association graph of the plurality of documents, wherein the document association graph comprises nodes corresponding to the documents and edges representing association relations among the documents;

inputting the content information of the plurality of documents and the document association graph into a document representation model, performing graph feature learning based on the document association graph initialized by using the semantic representation of each document through the document representation model, and updating the feature representation of each node;

the characteristic representation of each node after updating is used as the document representation of the corresponding document.

2. The method of claim 1, wherein the document representation model comprises: a semantic information learning module and a relevance information learning module,

the semantic information learning module is used for: carrying out semantic information characterization learning on the content information of each document to obtain semantic representation of each document;

the correlation information learning module is used for: initializing the characteristic representation of each node in the document association graph by using the semantic representation of each document, and carrying out graph feature learning based on the initialized document association graph to update the characteristic representation of each node.

3. The method of claim 2, wherein the semantic information learning module comprises: an entity relation diagram construction module, a first graph neural network, a text representation model and a semantic representation fusion module,

the entity relation diagram construction module is used for: constructing an entity relation diagram of each document according to the content information of each document;

the first graph neural network is used for: respectively performing graph sign learning on the entity relation graph of each document to obtain feature representations of the entities contained in each document, and fusing the feature representations of the entities contained in each document to obtain semantic representations of the entity levels of each document;

the text representation model is used for: respectively carrying out text representation on the content information of each document to obtain semantic representation of a text level of each document;

the semantic representation fusion module is used for: and fusing the semantic representation of the entity level and the semantic representation of the text level of each document to obtain the semantic representation of each document.

4. The method of claim 2, wherein the correlation information learning module comprises: the module and the second graph neural network are initialized,

The initialization module is used for: initializing feature representations of nodes in the document association graph by using semantic representations of the documents;

the second graph neural network is used for: and performing graph feature learning based on the initialized document association graph, and updating the feature representation of each node.

5. The method of claim 2, wherein said inputting the content information of the plurality of documents and the document association graph into a document representation model, through the document representation model, performing graph feature learning based on a document association graph initialized using semantic representations of the respective documents, updating feature representations of the respective nodes, comprises:

inputting the content information of the plurality of documents into the semantic information learning module, performing semantic information characterization learning on the content information of each document through the semantic information learning module to obtain semantic representation of each document, and inputting the semantic representation of each document into the relativity learning module;

inputting the document association graph into the correlation information learning module, initializing the characteristic representation of each node in the document association graph by using the semantic representation of each document through the correlation information learning module, and carrying out graph feature learning based on the initialized document association graph to update the characteristic representation of each node.

6. The method of claim 5, wherein the semantic information learning module comprises: an entity relationship graph construction module, a first graph neural network and a text representation model,

inputting the content information of the plurality of documents into the semantic information learning module, performing semantic information characterization learning on the content information of each document through the semantic information learning module to obtain semantic representation of each document, wherein the semantic representation comprises the following steps:

inputting the content information of the plurality of documents into the entity relation diagram construction module, and constructing an entity relation diagram of each document according to the content information of each document through the entity relation diagram construction module;

inputting the entity relation graph of each document into the first graph neural network, and respectively performing graph sign learning on the entity relation graph of each document through the first graph neural network to obtain semantic representation of the entity level of each document;

inputting the content information of the documents into the text representation model, and respectively carrying out text representation on the key content information of each document through the text representation model to obtain semantic representation of the text level of each document;

And fusing the semantic representation of the entity level and the semantic representation of the text level of each document to obtain the semantic representation of each document.

7. The method of claim 5, wherein the correlation information learning module comprises: the module and the second graph neural network are initialized,

inputting the document association graph into the correlation information learning module, initializing feature representations of nodes in the document association graph by the correlation information learning module by using semantic representations of the documents, and performing graph feature learning based on the initialized document association graph to update the feature representations of the nodes, wherein the method comprises the following steps:

inputting the semantic representation of each document and the document association graph into the initialization module, and initializing the characteristic representation of each node in the document association graph by using the semantic representation of each document through the initialization module;

inputting the initialized document association diagram into the second diagram neural network, performing diagram feature learning based on the initialized document association diagram through the second diagram neural network, and updating the feature representation of each node.

8. The method of claim 2, wherein the training process of the document representation model comprises:

Inputting a plurality of document samples in a document set for training and document associated graphs of the plurality of document samples into a document representation model, masking initial edges in the document associated graphs of the plurality of document samples through the document representation model, initializing feature representations of nodes in the document associated graphs of the plurality of document samples by using semantic representations of the document samples to obtain an initialized document associated graph, performing graph feature learning based on the initialized document associated graph, and updating the feature representations of the nodes;

predicting the association relation among the nodes in the document association graph according to the updated characteristic representation of each node;

and calculating a first loss according to a predicted result of the association relation among the nodes in the document association diagram and an initial edge in the document association diagram, and updating parameters of a document representation model according to the first loss.

9. The method of claim 8, wherein the semantic information learning module comprises: an entity relation diagram construction module, a first graph neural network, a text representation model and a semantic representation fusion module,

the using of the semantic representation of each of the document samples by the document representation model includes:

Inputting the content information of each document sample in the document set into the entity relation diagram construction module, and constructing an entity relation diagram of each document sample according to the content information of each document sample by the entity relation diagram construction module;

inputting the entity relation diagram of each document sample in the document set into a first graph neural network, and respectively performing graph sign learning on the entity relation diagram of each document sample through the first graph neural network to obtain semantic representation of the entity level of each document sample;

inputting key content information of a plurality of document samples of the document set into the text representation model, and respectively carrying out text representation on the key content information of each document sample through the text representation model to obtain semantic representation of a text level of each document sample;

and fusing the semantic representation of the entity level and the semantic representation of the text level of each document sample to obtain the semantic representation of each document sample.

10. The method of claim 8, wherein the document represents a training process of a model, comprising:

predicting class labels of the document samples according to semantic representations of entity levels of the document samples to obtain first prediction results, and predicting class labels of the document samples according to semantic representations of text levels of the document samples to obtain second prediction results;

Calculating the distance between the first predicted result and the second predicted result to obtain a second loss; calculating cross entropy loss according to the first prediction result and/or the second prediction result and the actual category labels of the document samples to obtain third loss;

and updating parameters of the semantic information learning module according to the second loss and the third loss.

11. The method of any of claims 1-10, wherein the obtaining a plurality of documents to be processed, and a document association graph of the plurality of documents, comprises:

receiving a document set submitted by a terminal side device, wherein the document set comprises a plurality of documents to be processed;

and constructing a document association diagram corresponding to the document set, wherein the document association diagram comprises nodes corresponding to the documents and edges representing association relations among the documents.

12. The method of claim 11, wherein after the step of using the updated feature representation of each node as a document representation of the corresponding document, further comprises:

classifying the plurality of documents according to the document representation of each document to obtain a document classification result, and sending the document classification result to the terminal side equipment;

Or,

and predicting the association relation among the documents according to the document representation of the documents, and sending the association relation among the documents to the terminal side equipment.

13. The method of any of claims 1-10, wherein the obtaining a plurality of documents to be processed, and a document association graph of the plurality of documents, comprises:

acquiring a plurality of documents contained in a document retrieval library and a document association graph of the plurality of documents;

after the updated characteristic representation of each node is used as the document representation of the corresponding document, the method further comprises the following steps:

updating the document representation of each document in the document retrieval library according to the document representations of the plurality of documents.

14. A document retrieval method, comprising:

responding to a document retrieval request, and acquiring query information input by a user;

mapping the query information into vector representations, and performing similarity matching on the vector representations and document representations of all documents in a document retrieval library to obtain at least one target document matched with the query information;

outputting information of the at least one target document;

wherein the document representation of each document in the document retrieval library is determined by: acquiring a plurality of documents to be characterized in a document retrieval library and document association graphs of the plurality of documents, wherein the document association graphs comprise nodes corresponding to the documents and edges representing association relations among the documents; inputting the plurality of documents and the document association graph into a document representation model, performing graph sign learning based on the document association graph initialized by using the semantic representation of each document through the document representation model, and updating the feature representation of each node; the characteristic representation of each node after updating is used as the document representation of the corresponding document.

15. A server, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor to cause the server to perform the method of any one of claims 1-14.

16. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor implement the method of any of claims 1-14.