
CN116704202A - Visual relation detection method based on knowledge embedding - Google Patents

Visual relation detection method based on knowledge embedding

Info

Publication number: CN116704202A
Application number: CN202310746413.3A
Authority: CN (China)
Prior art keywords: target, features, knowledge, visual, node
Priority date: 2023-06-21
Filing date: 2023-06-21
Publication date: 2023-09-05
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 田玲, 张栗粽, 郑旭, 成达, 吴瀚宇, 陈俊宇
Current Assignee: University of Electronic Science and Technology of China
Original Assignee: University of Electronic Science and Technology of China
Application filed by University of Electronic Science and Technology of China
Priority to CN202310746413.3A
Publication of CN116704202A

Classifications

    • G06V10/40 Extraction of image or video features
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks, characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/098 Distributed learning, e.g. federated learning
    • G06N5/025 Extracting rules from data
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G06V10/764 Classification, e.g. of video objects, using pattern recognition or machine learning
    • G06V10/82 Image or video recognition or understanding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual relationship detection method based on knowledge embedding, comprising the following steps: inputting an image and detecting the targets' initial category prediction features, the spatial features of the target bounding boxes, and the visual features of the target pairs' joint regions; defining the types of prior knowledge and constructing a corresponding knowledge graph for each type; building the initial category prediction features into a graph structure of nodes and edges, and representing the knowledge graph as the adjacency matrix of that graph structure; obtaining interrelated target category features through the update mechanism of the gated graph neural network (GGNN); and deriving context information from the bounding-box spatial features, the target category features, and the joint-region visual features, then classifying the relationship of each target pair with a softmax function. The method addresses the problems of incomplete capture and understanding of image information and poor performance in complex scenes during visual relationship detection.

Description

A Visual Relationship Detection Method Based on Knowledge Embedding

Technical Field

The invention belongs to the technical field of visual relationship detection, and in particular relates to a visual relationship detection method based on knowledge embedding.

Background

After nearly 70 years of rapid development, artificial intelligence has had a profound impact on humanity in technology, the economy, and culture. Computer vision is an important research field within artificial intelligence; it allows computers to emulate the capabilities of the human visual system from image or video data. Basic computer vision tasks such as image classification, object detection, and semantic segmentation have achieved remarkable results, in some cases surpassing human performance. For more complex image understanding tasks, however, further progress is still needed. Visual relationship detection therefore serves as a bridge: on the basis of recognizing the visual content of an image, it detects the relationships between image objects, allowing the rich semantic information in the image to be understood and grasped in depth.

Visual relationship detection methods are a class of computer vision techniques for identifying relationships between objects in images. The earliest methods drew on strong neural-network models from the CV and NLP fields and combined them into a comprehensive model suited to joint image-and-semantics understanding in order to perform visual relationship detection.

These methods consist of three parts: an object detection module, an object feature extraction module, and a relationship modeling module. This can be realized in different ways; early approaches based on translation-embedding networks were used to explore visual relationship detection. In such approaches, the predicate, as in language translation, is represented by a vector transformation of the embedded features of the region connecting the subject and object bounding boxes, thereby expressing the relationship between entities. However, these methods tend to use the visual features of the objects extracted from the image independently and too simplistically, ignoring the important role of contextual information in visual detection.

Message-passing models based on recurrent neural networks were therefore introduced to exchange relationship information between nodes and edges. Subsequently, because "subject-relation" or "relation-object" substructures occur in large numbers among relationship triplets, global context extraction models based on recurrent neural networks were proposed. Such models typically extract image context with long short-term memory (LSTM) networks, but they ignore the inherent regularities between image objects and their relational connections, making it difficult to capture the key information accurately.

As described above, visual relationship detection has developed rapidly and produced notable results. With the continuous progress of deep learning, object detection algorithms represented by the Faster Region-based Convolutional Neural Network (Faster R-CNN) have achieved superior performance and are widely used in industry. However, because of the complexity of relationships in real-world scenes and the diversity of relationships within image scenes, accurately understanding the relationships between objects in visual images remains a great challenge. In relationship detection tasks involving semantic information, methods that rely only on image features may not achieve satisfactory results. Existing techniques either use image features alone or directly fuse semantic features for feature enhancement, which limits the number of relationships that can be detected and leaves ambiguity in detecting complex relationships. It is therefore particularly important to use knowledge embedding to improve the intelligent understanding and reasoning ability of the model.

Summary of the Invention

The invention provides a visual relationship detection method based on knowledge embedding, which addresses the problems of incomplete capture and understanding of image information and poor performance in complex scenes during visual relationship detection.

To solve the above technical problems, the technical solution of the present invention is a visual relationship detection method based on knowledge embedding, comprising the following steps:

S1. Input an image and detect the targets' initial category prediction features, the spatial features of the target bounding boxes, and the visual features of the target pairs' joint regions;

S2. Define the types of prior knowledge and construct a corresponding knowledge graph for each type;

S3. Build the targets' initial category prediction features into a graph structure represented by nodes and edges, and represent the knowledge graph as the adjacency matrix of this graph structure, expressing edge information through the adjacency matrix;

S4. Based on the update mechanism of the gated graph neural network (GGNN), jointly learn the graph-structure nodes and the adjacency matrix with the targets' initial category prediction features, obtaining interrelated target category features;

S5. Obtain context information from the spatial features of the target bounding boxes, the interrelated target category features, and the visual features of the target pairs' joint regions, then classify the relationship of each target pair with a softmax function to complete visual relationship detection.

The beneficial effects of the invention are as follows: by defining the types of prior knowledge, constructing different types of knowledge graphs, learning vision and knowledge jointly, and further learning the internal dependencies among various kinds of information, the invention addresses the problems of incomplete capture and understanding of image information and poor performance in complex scenes during visual relationship detection.
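To make the flow of steps S1 to S5 concrete, the following is a minimal sketch of how the stages could be wired together in PyTorch-style code. The module names (detector, ggnn, fusion, classifier) and their interfaces are illustrative assumptions, not components defined by the invention.

```python
import torch

def detect_relations(image, detector, ggnn, knowledge_graph, fusion, classifier):
    # S1: object detection yields initial class predictions, bounding-box
    #     spatial features and joint-region visual features for target pairs
    class_preds, box_feats, union_feats = detector(image)

    # S3/S4: propagate the class predictions over the knowledge graph
    #        (given as an adjacency matrix) with a gated graph network
    refined_classes = ggnn(class_preds, knowledge_graph)

    # S5: fuse spatial, semantic and joint-region features into pairwise
    #     context, then score every predicate with softmax
    pair_context = fusion(box_feats, refined_classes, union_feats)
    relation_scores = classifier(pair_context * union_feats)
    return torch.softmax(relation_scores, dim=-1)
```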

Further, the specific steps of step S1 are:

S11. Input the image and use a feature pyramid network (FPN) built on the deep residual network ResNet101 as the backbone of the Faster R-CNN region-based convolutional neural network to extract multi-scale fused visual features of the image;

S12. Pass the visual features through a region proposal network to obtain the set of targets of interest, and apply non-maximum suppression (NMS) to obtain candidate regions containing target information;

S13. Use the region feature aggregation method ROIAlign to represent the candidate-region features uniformly, and obtain the targets' initial category prediction features, the spatial features of the target bounding boxes, and the visual features of the target pairs' joint regions through fully connected layers and the softmax function.

Further, the specific steps of step S2 are:

S21. Define the types of prior knowledge by their source: knowledge extracted from the Visual Genome (VG) dataset is defined as statistical knowledge, and knowledge extracted from GloVe word embeddings is defined as language knowledge;

S22. From the statistical knowledge, compute the co-occurrence relationships between targets of different categories, derive a co-occurrence probability distribution table from these relationships, and apply a cosine similarity measure to the table to obtain the internal statistical knowledge graph;

S23. From the language knowledge, obtain a language prior-knowledge table in the form of text feature embeddings, and build the external language knowledge graph from this table via the cosine similarity measure;

S24. Construct the knowledge graph by weighted fusion of the internal statistical knowledge graph and the external language knowledge graph.

The beneficial effect of the above further scheme is as follows: the invention builds the knowledge graph from internal statistical knowledge and external language knowledge, which overcomes the limitations of traditional relationship detection methods in generalization ability and external language ambiguity, and thus resolves the problem of incomplete capture and understanding of image information.

Further, the cosine similarity measure in step S22 is expressed as:

cos(θi,j) = (δi · δj) / (‖δi‖ · ‖δj‖) = Σk=1..n δi,k δj,k / ( √(Σk=1..n δi,k²) · √(Σk=1..n δj,k²) )

where cos(θi,j) denotes the cosine similarity measure, δi denotes the co-occurrence vector of target i, δj denotes the co-occurrence vector of target j, θi,j denotes the similarity between target i and target j, and n denotes the total number of targets.

Further, the specific steps of step S4 are:

S41. Determine the number of node-feature propagation steps T from step S3 and control the propagation range and depth of node features, completing the definition of the gated graph neural network (GGNN) parameters;

S42. Compute the initial hidden state of each node, and at every time step t jointly learn the nodes of the graph structure with the corresponding adjacency matrix;

S43. Using the joint-learning information at time step t and the node hidden states of the previous time step, update the node hidden states according to the node-information propagation mechanism of the GGNN to obtain the final hidden state of each node;

S44. Fuse the final hidden state of each node with the targets' initial category prediction features to obtain the interrelated target category features produced by joint learning of vision and knowledge.

The beneficial effect of the above further scheme is as follows: by jointly learning the nodes of the graph structure and the corresponding adjacency matrix, the invention captures the semantic meaning in the image more comprehensively and increases the model's confidence in its target category predictions, thereby improving the accuracy of the visual relationship detection method.

Further, the initial hidden state of a node in step S42 is expressed as:

hm(0) = σ(FC(xi || pi))

where hm(0) denotes the hidden state of the m-th node at the initial moment, σ() denotes the corresponding activation function, FC() denotes a fully connected layer that maps the concatenated features to a low-dimensional vector, xi denotes the visual feature vector obtained by feature extraction for the i-th target, pi denotes the initial category prediction feature of the i-th target, and || denotes the concatenation operation between features;

the expression for joint learning in step S42 is:

am(t) = (Gs)mᵀ [h1(t-1); h2(t-1); …; hN(t-1)] + b

where am(t) denotes the joint-learning information of the m-th node at the t-th time step, N denotes the number of categories, hN(t-1) denotes the hidden state of the N-th node at time step t-1, Gs denotes the constructed knowledge graph, (Gs)m denotes the m-th column of the knowledge graph, and b denotes a hyperparameter;

the computation for updating the node hidden states in step S43 is:

zm(t) = σ(WzT am(t) + UzT hm(t-1))
rm(t) = σ(WT am(t) + UT hm(t-1))
h̃m(t) = tanh(WrT am(t) + UrT (rm(t) ⊙ hm(t-1)))
hm(t) = (1 - zm(t)) ⊙ hm(t-1) + zm(t) ⊙ h̃m(t)

where zm(t) denotes the gate of the m-th node that controls forgetting, σ() denotes the corresponding activation function, h(t-1) denotes the node hidden states at time step t-1, hm(t-1) denotes the hidden state of the m-th node at time step t-1, rm(t) denotes the gate of the m-th node that controls the generation of new information, h̃m(t) denotes the newly generated information of the m-th node, tanh() denotes the tanh activation function, WzT denotes the update-gate training parameter matrix, UzT denotes the update-gate bias parameter matrix, WT denotes the reset-gate training parameter matrix, UT denotes the reset-gate bias parameter matrix, WrT denotes the training parameter matrix for the new information, UrT denotes the bias parameter matrix for the new information, ⊙ denotes element-wise multiplication of vectors, and hm(t) denotes the hidden state of the m-th node;

in step S44, the interrelated target category features are obtained by aggregating each node's initial and learned final hidden states with the initial category prediction features, where hdk denotes the initial node state of region k, FC() denotes a fully connected layer, φ0() denotes the aggregation operation, pf denotes the interrelated target category features, and pi denotes the initial category prediction feature of the i-th target.

Further, the specific steps of step S5 are:

S51. Use the Transformer attention mechanism to fuse the spatial features of the target bounding boxes, the visual features of the target pairs' joint regions, and the interrelated target category features, obtaining the context information of each target pair;

S52. Fuse the visual features of the target pair's joint region with the context information of the target pair to obtain relational context features;

S53. Based on the relational context features, classify the relationship of each target pair with the softmax function, completing visual relationship detection.

Further, the context information of the target pair in step S51 is calculated as:

Q = Transformer([X, Ep, Ef]; Wz)

where Q denotes the context information of the target pair, i.e., the fused features, Wz denotes the training parameters, Ep denotes the embedding vector of the target classification prediction, Ef denotes the embedding vector of the target bounding-box coordinates, X denotes the visual features of the targets, and Transformer() denotes a deep learning model based on the attention mechanism.

Further, the calculation in step S53 for classifying the relationship of each target pair with the softmax function is:

rij = qij ⊙ ei ⊙ uij

p(rk | l, oi, oj) = softmax(rij)

where ei denotes the semantic features of the target pair, obtained (with rescaling) from the target category features of target i and target j, Wz denotes the training parameters, rij denotes the context features corresponding to target i and target j, qij denotes the context information of the target pair, uij denotes the visual features of the target pair's joint region, ⊙ denotes element-wise multiplication of vectors, p(rk | l, oi, oj) denotes the predicted probability of the relationship classification of the target pair, rk denotes the relationship between the target pair, l denotes the target image, oi denotes target i of the target pair, and oj denotes target j of the target pair.

The beneficial effect of the above further scheme is as follows: the invention uses the Transformer attention mechanism to further learn the internal dependencies among the various kinds of information and better capture the associations between features, improving the performance of the visual relationship detection method in complex scenes.

Description of Drawings

Fig. 1 is a flowchart of the visual relationship detection based on knowledge embedding according to the invention.

Fig. 2 is a schematic diagram of the graph structure representation according to the invention.

Detailed Description

Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principles of the invention, and it should be understood that the scope of protection of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can, based on the technical teachings disclosed by the invention, make various other specific modifications and combinations that do not depart from the essence of the invention, and these modifications and combinations remain within the scope of protection of the invention.

Embodiment

As shown in Fig. 1, the invention provides a visual relationship detection method based on knowledge embedding, comprising the following steps:

S1. Input an image and detect the targets' initial category prediction features, the spatial features of the target bounding boxes, and the visual features of the target pairs' joint regions;

S2. Define the types of prior knowledge and construct a corresponding knowledge graph for each type;

S3. Build the targets' initial category prediction features into a graph structure represented by nodes and edges, and represent the knowledge graph as the adjacency matrix of this graph structure, expressing edge information through the adjacency matrix;

S4. Based on the update mechanism of the gated graph neural network (GGNN), jointly learn the graph-structure nodes and the adjacency matrix with the targets' initial category prediction features, obtaining interrelated target category features;

S5. Obtain context information from the spatial features of the target bounding boxes, the interrelated target category features, and the visual features of the target pairs' joint regions, then classify the relationship of each target pair with a softmax function to complete visual relationship detection.

The specific steps of step S1 are:

S11. Input the image and use a feature pyramid network (FPN) built on the deep residual network ResNet101 as the backbone of the Faster R-CNN region-based convolutional neural network to extract multi-scale fused visual features of the image;

S12. Pass the visual features through a region proposal network to obtain the set of targets of interest, and apply non-maximum suppression (NMS) to obtain candidate regions containing target information;

S13. Use the region feature aggregation method ROIAlign to represent the candidate-region features uniformly, and obtain the targets' initial category prediction features, the spatial features of the target bounding boxes, and the visual features of the target pairs' joint regions through fully connected layers and the softmax function.

In this embodiment, visual feature extraction is first performed on the input image to obtain the targets' initial category prediction features, the spatial features of the target bounding boxes, and the visual features of the target pairs' joint regions needed for subsequent detection. A feature pyramid network (FPN) based on the deep residual network ResNet101 is used as the backbone of Faster R-CNN to obtain multi-scale fused visual features. With the help of the region proposal network, the target candidate set B = {b1, b2, ..., bn} is obtained; after sampling and non-maximum suppression (NMS) to remove duplicate targets, a series of candidate regions containing target information is obtained for subsequent visual relationship detection.

Subsequently, the region feature aggregation method ROIAlign divides each candidate region into k×k cells and represents the candidate-region features uniformly. Finally, the required visual features are obtained with fully connected layers and the softmax function; they include the targets' initial category prediction features, the spatial features of the target bounding boxes, and the visual features of the target pairs' joint regions.
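As a rough illustration of this feature-extraction stage, the sketch below uses torchvision's ResNet50-FPN Faster R-CNN (the patent specifies ResNet101, which torchvision does not ship as a pre-built FPN detector) purely to obtain NMS-filtered boxes and ROIAlign-pooled region features; the image normalization and feature-level handling are simplified assumptions about one possible realization, not the invention's exact pipeline.

```python
import torch
import torchvision
from torchvision.ops import roi_align

# Pre-trained detector: FPN backbone, region proposal network and NMS are built in.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 600, 800)                 # dummy input image in [0, 1]
with torch.no_grad():
    feats = detector.backbone(image.unsqueeze(0))   # multi-scale FPN feature maps
    det = detector([image])[0]                      # boxes, labels, scores after NMS

boxes = det["boxes"]                            # candidate regions B = {b1, ..., bn}
# ROIAlign on the stride-4 FPN level: a uniform k x k representation per candidate
pooled = roi_align(feats["0"], [boxes], output_size=(7, 7), spatial_scale=0.25)
print(boxes.shape, pooled.shape)                # spatial features / pooled region features
```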

The specific steps of step S2 are:

S21. Define the types of prior knowledge by their source: knowledge extracted from the Visual Genome (VG) dataset is defined as statistical knowledge, and knowledge extracted from GloVe word embeddings is defined as language knowledge;

S22. From the statistical knowledge, compute the co-occurrence relationships between targets of different categories, derive a co-occurrence probability distribution table from these relationships, and apply a cosine similarity measure to the table to obtain the internal statistical knowledge graph;

S23. From the language knowledge, obtain a language prior-knowledge table in the form of text feature embeddings, and build the external language knowledge graph from this table via the cosine similarity measure;

S24. Construct the knowledge graph by weighted fusion of the internal statistical knowledge graph and the external language knowledge graph.

In this embodiment, the knowledge types are first defined by dividing knowledge according to its source into statistical knowledge and language knowledge: knowledge extracted from the Visual Genome (VG) dataset is defined as statistical knowledge, and knowledge extracted from GloVe word embeddings is defined as language knowledge. The internal statistical knowledge graph is then obtained: assuming M categories need to be detected in total, the co-occurrence relationship between each target and every other target is counted, yielding an M×M co-occurrence probability distribution table Ts. A cosine similarity measure is then applied to Ts to evaluate the degree of similarity and correlation between target pairs, producing the statistical knowledge graph Gs.

Similarly, from the language knowledge, the language prior-knowledge table Tl is obtained in the form of text feature embeddings. The word vectors have 300 dimensions, and the semantic associations between targets are captured with the cosine distance measure to obtain the language knowledge graph Gl. Finally, the statistical knowledge graph and the language knowledge graph are fused with weights to obtain the final knowledge graph Ga.
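A compact sketch of how the statistical and language knowledge graphs described here could be assembled is shown below; the co-occurrence table and the GloVe embedding matrix are assumed to be available as NumPy arrays (random placeholders here), and the fusion weight alpha is an illustrative hyperparameter rather than a value fixed by the invention.

```python
import numpy as np

def cosine_similarity_graph(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity of row vectors -> M x M knowledge graph."""
    norm = np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-8
    unit = vectors / norm
    return unit @ unit.T

M = 151
Ts = np.random.rand(M, M)      # placeholder: M x M co-occurrence probability table from VG
Tl = np.random.rand(M, 300)    # placeholder: 300-d GloVe embeddings of the M category names

Gs = cosine_similarity_graph(Ts)        # internal statistical knowledge graph
Gl = cosine_similarity_graph(Tl)        # external language knowledge graph

alpha = 0.5                              # assumed fusion weight
Ga = alpha * Gs + (1.0 - alpha) * Gl     # weighted fusion into the final knowledge graph
```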

The cosine similarity measure in step S22 is expressed as:

cos(θi,j) = (δi · δj) / (‖δi‖ · ‖δj‖) = Σk=1..n δi,k δj,k / ( √(Σk=1..n δi,k²) · √(Σk=1..n δj,k²) )

where cos(θi,j) denotes the cosine similarity measure, δi denotes the co-occurrence vector of target i, δj denotes the co-occurrence vector of target j, θi,j denotes the similarity between target i and target j, and n denotes the total number of targets.

The specific steps of step S4 are:

S41. Determine the number of node-feature propagation steps T from step S3 and control the propagation range and depth of node features, completing the definition of the gated graph neural network (GGNN) parameters;

S42. Compute the initial hidden state of each node, and at every time step t jointly learn the nodes of the graph structure with the corresponding adjacency matrix;

S43. Using the joint-learning information at time step t and the node hidden states of the previous time step, update the node hidden states according to the node-information propagation mechanism of the GGNN to obtain the final hidden state of each node;

S44. Fuse the final hidden state of each node with the targets' initial category prediction features to obtain the interrelated target category features produced by joint learning of vision and knowledge.

The initial hidden state of a node in step S42 is expressed as:

hm(0) = σ(FC(xi || pi))

where hm(0) denotes the hidden state of the m-th node at the initial moment, σ() denotes the corresponding activation function, FC() denotes a fully connected layer that maps the concatenated features to a low-dimensional vector, xi denotes the visual feature vector obtained by feature extraction for the i-th target, pi denotes the initial category prediction feature of the i-th target, and || denotes the concatenation operation between features;

the expression for joint learning in step S42 is:

am(t) = (Gs)mᵀ [h1(t-1); h2(t-1); …; hN(t-1)] + b

where am(t) denotes the joint-learning information of the m-th node at the t-th time step, N denotes the number of categories, hN(t-1) denotes the hidden state of the N-th node at time step t-1, Gs denotes the constructed knowledge graph, (Gs)m denotes the m-th column of the knowledge graph, and b denotes a hyperparameter;

the computation for updating the node hidden states in step S43 is:

zm(t) = σ(WzT am(t) + UzT hm(t-1))
rm(t) = σ(WT am(t) + UT hm(t-1))
h̃m(t) = tanh(WrT am(t) + UrT (rm(t) ⊙ hm(t-1)))
hm(t) = (1 - zm(t)) ⊙ hm(t-1) + zm(t) ⊙ h̃m(t)

where zm(t) denotes the gate of the m-th node that controls forgetting, σ() denotes the corresponding activation function, h(t-1) denotes the node hidden states at time step t-1, hm(t-1) denotes the hidden state of the m-th node at time step t-1, rm(t) denotes the gate of the m-th node that controls the generation of new information, h̃m(t) denotes the newly generated information of the m-th node, tanh() denotes the tanh activation function, WzT denotes the update-gate training parameter matrix, UzT denotes the update-gate bias parameter matrix, WT denotes the reset-gate training parameter matrix, UT denotes the reset-gate bias parameter matrix, WrT denotes the training parameter matrix for the new information, UrT denotes the bias parameter matrix for the new information, ⊙ denotes element-wise multiplication of vectors, and hm(t) denotes the hidden state of the m-th node;

in step S44, the interrelated target category features are obtained by aggregating each node's initial and learned final hidden states with the initial category prediction features, where hdk denotes the initial node state of region k, FC() denotes a fully connected layer, φ0() denotes the aggregation operation, pf denotes the interrelated target category features, and pi denotes the initial category prediction feature of the i-th target.

As shown in Fig. 2, in this embodiment, the graph structure is constructed first and the initial hidden state of each node is computed with the above formula. At every time step t, the nodes of the graph structure and the corresponding adjacency matrix are learned jointly. For the i-th target, feature extraction yields the visual feature vector xi, which is mapped to a low-dimensional vector; the target's initial category prediction feature pi is added to form the final node representation for the image.

After node-information propagation is completed, the final hidden state of each region d is obtained. The final hidden state is fused with the initial target category prediction features to obtain fusion-enhanced visual-semantic features, i.e., the interrelated target category features.
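The propagation and fusion described in this embodiment can be sketched as a GRU-style update over the knowledge-graph adjacency matrix, as below; the feature dimension, the number of propagation steps and the final fusion layer are assumptions consistent with the description rather than the invention's exact parameterization.

```python
import torch
import torch.nn as nn

class GGNNRefine(nn.Module):
    """Refine per-node states by T propagation steps over the adjacency matrix Gs."""
    def __init__(self, dim: int, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.gru = nn.GRUCell(dim, dim)     # update gate, reset gate and candidate state
        self.out = nn.Linear(2 * dim, dim)  # fuse initial and final hidden states

    def forward(self, h0: torch.Tensor, Gs: torch.Tensor) -> torch.Tensor:
        # h0: (N, dim) initial node states; Gs: (N, N) knowledge-graph adjacency
        h = h0
        for _ in range(self.steps):
            a = Gs @ h                      # joint learning: aggregate neighbor states
            h = self.gru(a, h)              # gated update of each node's hidden state
        return self.out(torch.cat([h0, h], dim=-1))  # fused, interrelated features

# Usage: N object categories, 512-d node states built from [x_i || p_i]
refine = GGNNRefine(dim=512, steps=3)
p_f = refine(torch.randn(151, 512), torch.rand(151, 151))
```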

The specific steps of step S5 are:

S51. Use the Transformer attention mechanism to fuse the spatial features of the target bounding boxes, the visual features of the target pairs' joint regions, and the interrelated target category features, obtaining the context information of each target pair;

S52. Fuse the visual features of the target pair's joint region with the context information of the target pair to obtain relational context features;

S53. Based on the relational context features, classify the relationship of each target pair with the softmax function, completing visual relationship detection.

The context information of the target pair in step S51 is calculated as:

Q = Transformer([X, Ep, Ef]; Wz)

where Q denotes the context information of the target pair, i.e., the fused features, Wz denotes the training parameters, Ep denotes the embedding vector of the target classification prediction, Ef denotes the embedding vector of the target bounding-box coordinates, X denotes the visual features of the targets, and Transformer() denotes a deep learning model based on the attention mechanism.

The calculation in step S53 for classifying the relationship of each target pair with the softmax function is:

rij = qij ⊙ ei ⊙ uij

p(rk | l, oi, oj) = softmax(rij)

where ei denotes the semantic features of the target pair, obtained (with rescaling) from the target category features of target i and target j, Wz denotes the training parameters, rij denotes the context features corresponding to target i and target j, qij denotes the context information of the target pair, uij denotes the visual features of the target pair's joint region, ⊙ denotes element-wise multiplication of vectors, p(rk | l, oi, oj) denotes the predicted probability of the relationship classification of the target pair, rk denotes the relationship between the target pair, l denotes the target image, oi denotes target i of the target pair, and oj denotes target j of the target pair.

In this embodiment, the spatial features fi of the target bounding boxes obtained in step S1, the visual features uij of the target pairs' joint regions, and the interrelated target category features pf obtained in step S4 are extracted, and the Transformer attention mechanism is used to learn the dependencies between the different features and accurately capture their interactions, completing the feature fusion and yielding richer and more accurate context information.
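One way to realize the attention-based fusion Q = Transformer([X, Ep, Ef]; Wz) is sketched below with a standard TransformerEncoder; treating the three feature types as a length-3 token sequence per target and mean-pooling the output are assumptions about the layout, not details fixed by the invention.

```python
import torch
import torch.nn as nn

dim = 512
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)

n = 32                                    # number of detected targets
X  = torch.randn(n, dim)                  # visual features of the targets
Ep = torch.randn(n, dim)                  # embeddings of the refined class predictions p_f
Ef = torch.randn(n, dim)                  # embeddings of the bounding-box coordinates

tokens = torch.stack([X, Ep, Ef], dim=1)  # (n, 3, dim): one short sequence per target
Q = encoder(tokens).mean(dim=1)           # (n, dim): fused context information
```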

Finally, the visual features of the target pair's joint region are fused with the target pair's context information to obtain the relational context features. Each target pair's relationship is then classified according to the relational context features, completing visual relationship detection.
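Finally, the classification rij = qij ⊙ ei ⊙ uij followed by softmax can be sketched as below; how qij, ei and uij are gathered for a particular pair (i, j) is assumed, the linear projection onto predicate logits is an illustrative addition, and the predicate count of 50 follows the relationship classes commonly used with the Visual Genome dataset.

```python
import torch
import torch.nn as nn

num_predicates = 50
dim = 512
predicate_head = nn.Linear(dim, num_predicates)   # assumed projection to predicate logits

q_ij = torch.randn(dim)   # pairwise context information from the Transformer fusion
e_i  = torch.randn(dim)   # semantic features built from the pair's category features
u_ij = torch.randn(dim)   # visual features of the pair's joint (union) region

r_ij = q_ij * e_i * u_ij                              # element-wise product of the three cues
p_rel = torch.softmax(predicate_head(r_ij), dim=-1)   # p(r_k | l, o_i, o_j)
print(p_rel.argmax().item())                          # index of the most likely predicate
```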

Claims (9)

1. A visual relationship detection method based on knowledge embedding, characterized by comprising the following steps:

S1. Input an image and detect the targets' initial category prediction features, the spatial features of the target bounding boxes, and the visual features of the target pairs' joint regions;

S2. Define the types of prior knowledge and construct a corresponding knowledge graph for each type;

S3. Build the targets' initial category prediction features into a graph structure represented by nodes and edges, and represent the knowledge graph as the adjacency matrix of this graph structure, expressing edge information through the adjacency matrix;

S4. Based on the update mechanism of the gated graph neural network (GGNN), jointly learn the graph-structure nodes and the adjacency matrix with the targets' initial category prediction features, obtaining interrelated target category features;

S5. Obtain context information from the spatial features of the target bounding boxes, the interrelated target category features, and the visual features of the target pairs' joint regions, then classify the relationship of each target pair with a softmax function to complete visual relationship detection.

2. The visual relationship detection method based on knowledge embedding according to claim 1, characterized in that the specific steps of step S1 are:

S11. Input the image and use a feature pyramid network (FPN) built on the deep residual network ResNet101 as the backbone of the Faster R-CNN region-based convolutional neural network to extract multi-scale fused visual features of the image;

S12. Pass the visual features through a region proposal network to obtain the set of targets of interest, and apply non-maximum suppression (NMS) to obtain candidate regions containing target information;

S13. Use the region feature aggregation method ROIAlign to represent the candidate-region features uniformly, and obtain the targets' initial category prediction features, the spatial features of the target bounding boxes, and the visual features of the target pairs' joint regions through fully connected layers and the softmax function.

3. The visual relationship detection method based on knowledge embedding according to claim 1, characterized in that the specific steps of step S2 are:

S21. Define the types of prior knowledge by their source: knowledge extracted from the Visual Genome (VG) dataset is defined as statistical knowledge, and knowledge extracted from GloVe word embeddings is defined as language knowledge;

S22. From the statistical knowledge, compute the co-occurrence relationships between targets of different categories, derive a co-occurrence probability distribution table from these relationships, and apply a cosine similarity measure to the table to obtain the internal statistical knowledge graph;

S23. From the language knowledge, obtain a language prior-knowledge table in the form of text feature embeddings, and build the external language knowledge graph from this table via the cosine similarity measure;

S24. Construct the knowledge graph by weighted fusion of the internal statistical knowledge graph and the external language knowledge graph.

4. The visual relationship detection method based on knowledge embedding according to claim 3, characterized in that the cosine similarity measure in step S22 is expressed as:

cos(θi,j) = (δi · δj) / (‖δi‖ · ‖δj‖) = Σk=1..n δi,k δj,k / ( √(Σk=1..n δi,k²) · √(Σk=1..n δj,k²) )

where cos(θi,j) denotes the cosine similarity measure, δi denotes the co-occurrence vector of target i, δj denotes the co-occurrence vector of target j, θi,j denotes the similarity between target i and target j, and n denotes the total number of targets.

5. The visual relationship detection method based on knowledge embedding according to claim 1, characterized in that the specific steps of step S4 are:

S41. Determine the number of node-feature propagation steps T from step S3 and control the propagation range and depth of node features, completing the definition of the gated graph neural network (GGNN) parameters;

S42. Compute the initial hidden state of each node, and at every time step t jointly learn the nodes of the graph structure with the corresponding adjacency matrix;

S43. Using the joint-learning information at time step t and the node hidden states of the previous time step, update the node hidden states according to the node-information propagation mechanism of the GGNN to obtain the final hidden state of each node;

S44. Fuse the final hidden state of each node with the targets' initial category prediction features to obtain the interrelated target category features produced by joint learning of vision and knowledge.

6. The visual relationship detection method based on knowledge embedding according to claim 5, characterized in that the initial hidden state of a node in step S42 is expressed as:

hm(0) = σ(FC(xi || pi))

where hm(0) denotes the hidden state of the m-th node at the initial moment, σ() denotes the corresponding activation function, FC() denotes a fully connected layer that maps the concatenated features to a low-dimensional vector, xi denotes the visual feature vector obtained by feature extraction for the i-th target, pi denotes the initial category prediction feature of the i-th target, and || denotes the concatenation operation between features;

the expression for joint learning in step S42 is:

am(t) = (Gs)mᵀ [h1(t-1); h2(t-1); …; hN(t-1)] + b

where am(t) denotes the joint-learning information of the m-th node at the t-th time step, N denotes the number of categories, hN(t-1) denotes the hidden state of the N-th node at time step t-1, Gs denotes the constructed knowledge graph, (Gs)m denotes the m-th column of the knowledge graph, and b denotes a hyperparameter;

the computation for updating the node hidden states in step S43 is:

zm(t) = σ(WzT am(t) + UzT hm(t-1))
rm(t) = σ(WT am(t) + UT hm(t-1))
h̃m(t) = tanh(WrT am(t) + UrT (rm(t) ⊙ hm(t-1)))
hm(t) = (1 - zm(t)) ⊙ hm(t-1) + zm(t) ⊙ h̃m(t)

where zm(t) denotes the gate of the m-th node that controls forgetting, σ() denotes the corresponding activation function, h(t-1) denotes the node hidden states at time step t-1, hm(t-1) denotes the hidden state of the m-th node at time step t-1, rm(t) denotes the gate of the m-th node that controls the generation of new information, h̃m(t) denotes the newly generated information of the m-th node, tanh() denotes the tanh activation function, WzT denotes the update-gate training parameter matrix, UzT denotes the update-gate bias parameter matrix, WT denotes the reset-gate training parameter matrix, UT denotes the reset-gate bias parameter matrix, WrT denotes the training parameter matrix for the new information, UrT denotes the bias parameter matrix for the new information, ⊙ denotes element-wise multiplication of vectors, and hm(t) denotes the hidden state of the m-th node;

in step S44, the interrelated target category features are obtained by aggregating each node's initial and learned final hidden states with the initial category prediction features, where hdk denotes the initial node state of region k, FC() denotes a fully connected layer, φ0() denotes the aggregation operation, pf denotes the interrelated target category features, and pi denotes the initial category prediction feature of the i-th target.

7. The visual relationship detection method based on knowledge embedding according to claim 1, characterized in that the specific steps of step S5 are:

S51. Use the Transformer attention mechanism to fuse the spatial features of the target bounding boxes, the visual features of the target pairs' joint regions, and the interrelated target category features, obtaining the context information of each target pair;

S52. Fuse the visual features of the target pair's joint region with the context information of the target pair to obtain relational context features;

S53. Based on the relational context features, classify the relationship of each target pair with the softmax function, completing visual relationship detection.

8. The visual relationship detection method based on knowledge embedding according to claim 7, characterized in that the context information of the target pair in step S51 is calculated as:

Q = Transformer([X, Ep, Ef]; Wz)

where Q denotes the context information of the target pair, i.e., the fused features, Wz denotes the training parameters, Ep denotes the embedding vector of the target classification prediction, Ef denotes the embedding vector of the target bounding-box coordinates, X denotes the visual features of the targets, and Transformer() denotes a deep learning model based on the attention mechanism.

9. The visual relationship detection method based on knowledge embedding according to claim 7, characterized in that the calculation in step S53 for classifying the relationship of each target pair with the softmax function is:

rij = qij ⊙ ei ⊙ uij

p(rk | l, oi, oj) = softmax(rij)

where ei denotes the semantic features of the target pair, obtained (with rescaling) from the target category features of target i and target j, Wz denotes the training parameters, rij denotes the context features corresponding to target i and target j, qij denotes the context information of the target pair, uij denotes the visual features of the target pair's joint region, ⊙ denotes element-wise multiplication of vectors, p(rk | l, oi, oj) denotes the predicted probability of the relationship classification of the target pair, rk denotes the relationship between the target pair, l denotes the target image, oi denotes target i of the target pair, and oj denotes target j of the target pair.
CN202310746413.3A 2023-06-21 2023-06-21 Visual relation detection method based on knowledge embedding Pending CN116704202A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310746413.3A CN116704202A (en) 2023-06-21 2023-06-21 Visual relation detection method based on knowledge embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310746413.3A CN116704202A (en) 2023-06-21 2023-06-21 Visual relation detection method based on knowledge embedding

Publications (1)

Publication Number Publication Date
CN116704202A true CN116704202A (en) 2023-09-05

Family

ID=87827442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310746413.3A Pending CN116704202A (en) 2023-06-21 2023-06-21 Visual relation detection method based on knowledge embedding

Country Status (1)

Country Link
CN (1) CN116704202A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118395195A (en) * 2024-06-28 2024-07-26 浪潮电子信息产业股份有限公司 Model training method, video positioning method, system, equipment, product and medium
CN118968163A (en) * 2024-07-30 2024-11-15 贯文信息技术(苏州)有限公司 An artificial intelligence visual inspection method based on linear discrimination


Similar Documents

Publication Publication Date Title
Zheng et al. Improving visual reasoning through semantic representation
CN109948425B (en) A pedestrian search method and device based on structure-aware self-attention and online instance aggregation and matching
CN109858390B (en) Human skeleton behavior recognition method based on end-to-end spatiotemporal graph learning neural network
CN110059558B (en) A Real-time Detection Method of Orchard Obstacles Based on Improved SSD Network
CN109002834B (en) A fine-grained image classification method based on multimodal representation
CN107133569B (en) Monitoring video multi-granularity labeling method based on generalized multi-label learning
Tian et al. A relation-augmented embedded graph attention network for remote sensing object detection
CN109753897B (en) Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning
CN113378676A (en) Method for detecting figure interaction in image based on multi-feature fusion
CN106909938B (en) Perspective-independent behavior recognition method based on deep learning network
CN110728694B (en) Long-time visual target tracking method based on continuous learning
CN111325243B (en) Visual relationship detection method based on regional attention learning mechanism
CN114692732B (en) A method, system, device and storage medium for online label updating
CN110399518A (en) A Visual Question Answering Enhancement Method Based on Graph Convolution
CN116704202A (en) Visual relation detection method based on knowledge embedding
CN113033520A (en) Tree nematode disease wood identification method and system based on deep learning
CN111985367A (en) Pedestrian re-recognition feature extraction method based on multi-scale feature fusion
CN112434599A (en) Pedestrian re-identification method based on random shielding recovery of noise channel
CN112749738B (en) Zero sample object detection method for performing superclass reasoning by fusing context
Wang et al. A deep learning-based experiment on forest wildfire detection in machine vision course
CN111414845A (en) Method for solving polymorphic sentence video positioning task by using space-time graph reasoning network
CN114676852A (en) General countermeasure disturbance generation method based on correlation class activation mapping
Pan et al. Hybrid dilated faster RCNN for object detection
Liu et al. Iterative deep neighborhood: a deep learning model which involves both input data points and their neighbors
Wu et al. Hierarchical few-shot learning with feature fusion driven by data and knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination