
CN110647612A - Visual dialogue generation method based on a dual visual attention network - Google Patents

Visual dialogue generation method based on a dual visual attention network

Info

Publication number
CN110647612A
Authority
CN
China
Prior art keywords
attention
visual
feature
vector
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910881305.0A
Other languages
Chinese (zh)
Inventor
郭丹 (Guo Dan)
王辉 (Wang Hui)
汪萌 (Wang Meng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Polytechnic University
Original Assignee
Hefei Polytechnic University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Polytechnic University filed Critical Hefei Polytechnic University
Priority to CN201910881305.0A priority Critical patent/CN110647612A/en
Publication of CN110647612A publication Critical patent/CN110647612A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/31: Indexing; Data structures therefor; Storage structures
    • G06F 16/316: Indexing structures
    • G06F 16/33: Querying
    • G06F 16/332: Query formulation
    • G06F 16/3329: Natural language query formulation
    • G06F 16/50: Information retrieval of still image data
    • G06F 16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583: Retrieval characterised by using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a visual dialogue generation method based on a dual visual attention network, comprising the following steps: 1. preprocessing of the text input in the visual dialogue and construction of the word table; 2. feature extraction of the dialogue image and of the dialogue text; 3. attention processing of the dialogue-history information based on the current question; 4. independent attention processing of the two visual features; 5. cross attention processing between the two visual features; 6. optimization of the visual features; 7. multimodal semantic fusion and decoding to generate the answer feature sequence; 8. parameter optimization of the visual dialogue generation network model based on the dual visual attention network; 9. generation of the predicted answer. The invention provides the agent with more complete and more reasonable visual semantic information, as well as finer-grained textual semantic information, thereby improving the rationality and accuracy of the answers the agent predicts for a question.

Description

A visual dialogue generation method based on a dual visual attention network

Technical Field

The invention belongs to the technical field of computer vision and involves techniques such as pattern recognition, natural language processing, and artificial intelligence; specifically, it is a visual dialogue generation method based on a dual visual attention network.

Background Art

Visual dialogue is a form of human-computer interaction whose goal is to enable a machine agent and a human to hold a reasonable, correct natural conversation, in question-and-answer form, about a given image of an everyday scene. The key problem in visual dialogue is therefore how to make the agent correctly understand the multimodal semantic information composed of the image and the text, so that it can give reasonable answers to the questions posed by the human. Visual dialogue is currently one of the hot research topics in computer vision, and its application scenarios are very broad, including helping visually impaired people understand social media content or their daily environment, artificial-intelligence assistants, and robotics.

With the development of modern image processing technology and deep learning, visual dialogue technology has advanced greatly, but it still faces the following problems:

First, when processing text information, the agent lacks finer-grained learning of the text features.

For example, in 2017 Jiasen Lu et al. proposed a history-conditioned image attention method in the article "Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model", published at the top international Conference and Workshop on Neural Information Processing Systems (NIPS 2017). The method first applies sentence-level attention to the dialogue history and then, based on the processed text features, applies attention learning to the image features. However, when processing the text of the current question it considers only sentence-level semantics and ignores word-level semantics, whereas in an actual question usually only a few keywords are most relevant to the predicted answer. The method therefore has certain limitations in practical applications.

Second, existing methods extract features from the global image only, so the resulting visual semantic information is not precise enough.

For example, in 2018 Qi Wu et al. published "Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning" at the top international IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018). That article performs a series of mutual attention operations over the global visual features, the question, and the dialogue-history text features and fuses them into multimodal semantic features. The method effectively learns the semantic relations between the different features, but it considers only the global visual features, so that after the image has been attended to, visual information irrelevant to the question is often brought into focus; this redundant information interferes with the agent's answer prediction.

Summary of the Invention

To overcome the shortcomings of the prior art, the present invention proposes a visual dialogue generation method based on a dual visual attention network, in order to provide the agent with more complete and more reasonable visual semantic information, as well as finer-grained textual semantic information, thereby improving the rationality and accuracy of the answers the agent infers and generates for a question.

The present invention adopts the following technical scheme to solve the technical problem:

The visual dialogue generation method based on a dual visual attention network of the present invention is characterized in that it is carried out according to the following steps:

Step 1. Preprocessing of the text input in the visual dialogue and construction of the word table:

Step 1.1. Obtain a visual dialogue dataset; the dataset contains sentence texts and images.

Perform word segmentation on all sentence texts in the visual dialogue dataset to obtain the segmented words.

Step 1.2. From the segmented words, select all words whose frequency is greater than a threshold and build a word index table Voc; then one-hot encode each word in the index table Voc to obtain the one-hot vector table $O=[o_1,o_2,\dots,o_n,\dots,o_N]$, where $o_n$ is the one-hot encoding vector of the n-th word in the index table Voc and N is the number of words in the index table Voc;

Step 1.3. Randomly initialize a word embedding matrix $W_e \in \mathbb{R}^{d_w \times N}$, where $d_w$ is the dimension of the word vectors; use the word embedding matrix $W_e$ to map the encoding vector of each word in the one-hot vector table to the corresponding word vector, thereby obtaining the word vector table;
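The vocabulary construction and embedding lookup of Steps 1.1 to 1.3 can be illustrated with the following minimal PyTorch sketch; the toy corpus, the variable names, and the use of nn.Embedding in place of an explicit one-hot multiplication are assumptions made for illustration, not the patent's reference implementation:

```python
from collections import Counter

import torch
import torch.nn as nn

def build_vocab(tokenized_sentences, min_freq=4):
    """Keep all words whose frequency exceeds the threshold (Step 1.2)."""
    counter = Counter(w for s in tokenized_sentences for w in s)
    voc = {"<pad>": 0}  # blank token added for padding, as in the embodiment
    for w in sorted(w for w, c in counter.items() if c > min_freq):
        voc[w] = len(voc)
    return voc

corpus = [["is", "there", "a", "dog"], ["a", "dog", "is", "on", "the", "grass"]]
voc = build_vocab(corpus, min_freq=0)  # tiny toy corpus, so no real threshold
N, d_w = len(voc), 300                 # N words, d_w-dimensional word vectors

# W_e in R^{d_w x N}: nn.Embedding holds the same parameters as an N x d_w
# lookup table, and indexing it is equivalent to multiplying W_e by a
# one-hot vector (Step 1.3).
W_e = nn.Embedding(N, d_w, padding_idx=0)

ids = torch.tensor([[voc["is"], voc["there"], voc["a"], voc["dog"]]])
word_vectors = W_e(ids)                # shape (1, 4, d_w)
print(word_vectors.shape)
```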

Step 2. Feature extraction of the dialogue image and feature extraction of the dialogue text;

Step 2.1. From the visual dialogue dataset, take any image I together with its corresponding dialogue history $U=[u_1,u_2,\dots,u_t,\dots,u_T]$, current question $Q=[w_{Q,1},\dots,w_{Q,i},\dots,w_{Q,L_1}]$, and ground-truth answer label $A_{GT}$, forming the visual dialogue information D; here T is the total number of dialogue rounds in the history U, $u_t$ is the t-th round of the dialogue, $L_1$ is the sentence length of the current question Q, and $w_{Q,i}$ is the word vector in the word vector table corresponding to the i-th word of the current question Q;

Step 2.2. Use a convolutional neural network to extract features of image I in the visual dialogue information D, obtaining the global visual features $V^{(0)}=[v^{(0)}_1,v^{(0)}_2,\dots,v^{(0)}_m,\dots,v^{(0)}_M]$, where $v^{(0)}_m \in \mathbb{R}^{d_g}$ is the m-th region feature of $V^{(0)}$, M is the total number of spatial regions in $V^{(0)}$, and $d_g$ is the channel dimension of $V^{(0)}$;

Step 2.3. Use an object detection feature extractor to extract features of image I in the visual dialogue information D, obtaining the local visual features $R^{(0)}=[r^{(0)}_1,r^{(0)}_2,\dots,r^{(0)}_k,\dots,r^{(0)}_K]$, where $r^{(0)}_k \in \mathbb{R}^{d_r}$ is the k-th target-object feature of $R^{(0)}$, K is the total number of detected local target objects in $R^{(0)}$, and $d_r$ is the channel dimension of $R^{(0)}$;

Step 2.4. Use fully connected layers to map the global visual features and the local visual features into a space of the same dimension, obtaining the transformed global visual features $V=[v_1,v_2,\dots,v_m,\dots,v_M]$ and local visual features $R=[r_1,r_2,\dots,r_k,\dots,r_K]$, with $v_m, r_k \in \mathbb{R}^d$, where $v_m$ is the m-th region feature of the global visual features V, $r_k$ is the k-th target-object feature of the local visual features R, and d is the transformed channel dimension;
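A minimal sketch of the projection in Step 2.4, assuming the extractor outputs are already available as tensors; the concrete values of M, K, d_g, d_r, and d below are placeholders, not values fixed by the patent:

```python
import torch
import torch.nn as nn

M, d_g = 49, 512      # e.g. a 7x7 CNN feature map flattened into M regions
K, d_r = 36, 2048     # K detected objects from the detector
d = 512               # common channel dimension after projection

V0 = torch.randn(M, d_g)   # global visual features V^(0)
R0 = torch.randn(K, d_r)   # local visual features R^(0)

fc_v = nn.Linear(d_g, d)   # fully connected mapping for V^(0)
fc_r = nn.Linear(d_r, d)   # fully connected mapping for R^(0)

V = fc_v(V0)               # transformed global features, shape (M, d)
R = fc_r(R0)               # transformed local features,  shape (K, d)
```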

Step 2.5. Use a long short-term memory network (LSTM) to extract features from the current question Q, obtaining the hidden-state feature sequence $H_Q=[h_{Q,1},\dots,h_{Q,i},\dots,h_{Q,L_1}]$, and take the hidden-state feature $h_{Q,L_1}$ output at the last LSTM step as the sentence-level question feature vector $q \in \mathbb{R}^d$ of the current question Q, where $h_{Q,i}$ is the hidden-state feature output at the i-th LSTM step;

Step 2.6. Use an LSTM to extract features from the t-th round $u_t=[w_{t,1},\dots,w_{t,i},\dots,w_{t,L_2}]$ of the dialogue history U, obtaining the t-th hidden-state sequence $H_t=[h_{t,1},\dots,h_{t,i},\dots,h_{t,L_2}]$; take the hidden-state feature $h_{t,L_2}$ output at the last LSTM step as the sentence-level feature $h_t \in \mathbb{R}^d$ of the t-th round $u_t$. The overall dialogue-history features are then $H=[h_1,h_2,\dots,h_t,\dots,h_T]$, where $w_{t,i}$ is the word vector in the word vector table corresponding to the i-th word of $u_t$, $L_2$ is the sentence length of $u_t$, and $h_{t,i}$ is the hidden-state feature output at the i-th LSTM step;
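The sentence encoders of Steps 2.5 and 2.6 can be sketched as follows; the batch sizes, sequence lengths, and single-layer LSTM configuration are illustrative assumptions:

```python
import torch
import torch.nn as nn

d_w, d = 300, 512
lstm_q = nn.LSTM(d_w, d, batch_first=True)  # question encoder (Step 2.5)
lstm_u = nn.LSTM(d_w, d, batch_first=True)  # history encoder (Step 2.6)

# Current question Q: L1 word vectors; q is the last hidden state.
L1 = 16
Q = torch.randn(1, L1, d_w)                 # (batch, L1, d_w)
out_q, _ = lstm_q(Q)
q = out_q[:, -1, :].squeeze(0)              # sentence-level question vector, (d,)

# Each history round u_t is encoded the same way; stacking the last hidden
# states of the T rounds gives H = [h_1, ..., h_T].
T, L2 = 10, 25
U = torch.randn(T, L2, d_w)                 # T rounds, each padded to length L2
out_u, _ = lstm_u(U)
H = out_u[:, -1, :]                         # history features, (T, d)
```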

Step 3. Apply attention processing to the dialogue-history information based on the current question information;

Using Eq. (1), apply attention to the overall dialogue-history features $H=[h_1,h_2,\dots,h_t,\dots,h_T]$ to obtain the attended history feature vector $h_a \in \mathbb{R}^d$:

$$h_a = \alpha_h H^T \tag{1}$$

In Eq. (1), $\alpha_h \in \mathbb{R}^{1 \times T}$ denotes the attention distribution over the history features H, and:

$$\alpha_h = \mathrm{softmax}(P^T z_h) \tag{2}$$

In Eq. (2), $z_h \in \mathbb{R}^{d \times T}$ denotes the similarity matrix between the sentence-level question feature vector q and the history features H, $P \in \mathbb{R}^{d}$ denotes the trainable parameter of the similarity matrix $z_h$, and:

$$z_h = \tanh(W_q q + W_h H) \tag{3}$$

In Eq. (3), $W_q \in \mathbb{R}^{d \times d}$ denotes the trainable parameter corresponding to the sentence-level question feature vector q, and $W_h \in \mathbb{R}^{d \times d}$ denotes the trainable parameter corresponding to the history features H;
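A sketch of the history attention of Eqs. (1) to (3), with H stored row-wise so that the transposes of the text become plain matrix-vector products; shapes and initialization are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, T = 512, 10
W_q = nn.Linear(d, d, bias=False)   # trainable parameter for q   (Eq. 3)
W_h = nn.Linear(d, d, bias=False)   # trainable parameter for H   (Eq. 3)
P = nn.Parameter(torch.randn(d))    # trainable parameter of z_h  (Eq. 2)

q = torch.randn(d)                  # sentence-level question vector
H = torch.randn(T, d)               # one row per history round

z_h = torch.tanh(W_q(q).unsqueeze(0) + W_h(H))  # (T, d), Eq. (3)
alpha_h = F.softmax(z_h @ P, dim=0)             # (T,),  Eq. (2)
h_a = alpha_h @ H                               # (d,),  Eq. (1)
```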

Step 4. Independent attention processing of each of the two visual features;

Step 4.1. Using Eq. (4), apply attention to the global visual features $V=[v_1,v_2,\dots,v_m,\dots,v_M]$ to obtain the attended global visual feature vector $V' \in \mathbb{R}^d$:

$$V' = \alpha_{V1} V^T \tag{4}$$

In Eq. (4), $\alpha_{V1} \in \mathbb{R}^{1 \times M}$ denotes the attention distribution over the global visual features V, and:

$$\alpha_{V1} = \mathrm{softmax}(P_1^T z_{V1}) \tag{5}$$

In Eq. (5), $z_{V1} \in \mathbb{R}^{d \times M}$ denotes the similarity matrix between the sentence-level question feature vector q, the attended history feature vector $h_a$, and the global visual features V, $P_1 \in \mathbb{R}^{d}$ denotes the trainable parameter of the similarity matrix $z_{V1}$, and:

$$z_{V1} = \tanh(W_{q1} q + W_{h1} h_a + W_{V1} V) \tag{6}$$

In Eq. (6), $W_{q1}$, $W_{h1}$, and $W_{V1}$ (each in $\mathbb{R}^{d \times d}$) denote the trainable parameters corresponding to q, $h_a$, and V, respectively;

Step 4.2. Using Eq. (7), apply attention to the local visual features $R=[r_1,r_2,\dots,r_k,\dots,r_K]$ to obtain the attended local visual feature vector $R' \in \mathbb{R}^d$:

$$R' = \alpha_{R1} R^T \tag{7}$$

In Eq. (7), $\alpha_{R1} \in \mathbb{R}^{1 \times K}$ denotes the attention distribution over the local visual features R, and:

$$\alpha_{R1} = \mathrm{softmax}(P_2^T z_{R1}) \tag{8}$$

In Eq. (8), $z_{R1} \in \mathbb{R}^{d \times K}$ denotes the similarity matrix between the sentence-level question feature vector q, the attended history feature vector $h_a$, and the local visual features R, $P_2 \in \mathbb{R}^{d}$ denotes the trainable parameter of the similarity matrix $z_{R1}$, and:

$$z_{R1} = \tanh(W'_{q1} q + W'_{h1} h_a + W_{R1} R) \tag{9}$$

In Eq. (9), $W'_{q1}$, $W'_{h1}$, and $W_{R1}$ (each in $\mathbb{R}^{d \times d}$) denote the trainable parameters corresponding to q, $h_a$, and R, respectively;
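The independent attention of Step 4 follows the same pattern for both visual streams, differing only in the guiding vectors, so it can be sketched as one reusable module; the module name and its guide-list interface are assumptions made for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedAttention(nn.Module):
    """Attention over a visual feature set guided by context vectors.
    With guides (q, h_a) this follows Eqs. (4)-(9); Step 5 reuses it with
    the attended vector of the other stream as an extra guide."""

    def __init__(self, d, num_guides):
        super().__init__()
        self.W_feat = nn.Linear(d, d, bias=False)
        self.W_guides = nn.ModuleList(
            nn.Linear(d, d, bias=False) for _ in range(num_guides))
        self.P = nn.Parameter(torch.randn(d))

    def forward(self, feats, guides):
        # feats: (n, d); guides: list of (d,) context vectors
        z = self.W_feat(feats)                 # (n, d)
        for W, g in zip(self.W_guides, guides):
            z = z + W(g).unsqueeze(0)          # broadcast each guide
        z = torch.tanh(z)                      # similarity matrix
        alpha = F.softmax(z @ self.P, dim=0)   # attention distribution
        return alpha @ feats                   # attended vector, (d,)

d, M, K = 512, 49, 36
q, h_a = torch.randn(d), torch.randn(d)
V, R = torch.randn(M, d), torch.randn(K, d)

att_v = GuidedAttention(d, num_guides=2)       # Eqs. (4)-(6)
att_r = GuidedAttention(d, num_guides=2)       # Eqs. (7)-(9)
V1 = att_v(V, [q, h_a])                        # V'
R1 = att_r(R, [q, h_a])                        # R'
```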

Step 5. Cross attention processing between the two visual features;

Step 5.1. Using Eq. (10), apply dual-visual cross attention to the global visual features $V=[v_1,v_2,\dots,v_m,\dots,v_M]$ to obtain the further-attended global visual feature vector $V'' \in \mathbb{R}^d$:

$$V'' = \alpha_{V2} V^T \tag{10}$$

In Eq. (10), $\alpha_{V2} \in \mathbb{R}^{1 \times M}$ denotes the further attention distribution over the global visual features V, and:

$$\alpha_{V2} = \mathrm{softmax}(P_3^T z_{V2}) \tag{11}$$

In Eq. (11), $z_{V2} \in \mathbb{R}^{d \times M}$ denotes the similarity matrix between the sentence-level question feature vector q, the attended history feature vector $h_a$, the attended local visual feature vector $R'$, and the global visual features V, $P_3 \in \mathbb{R}^{d}$ denotes the trainable parameter of the similarity matrix $z_{V2}$, and:

$$z_{V2} = \tanh(W_{q2} q + W_{h2} h_a + W_{R2} R' + W_{V2} V) \tag{12}$$

In Eq. (12), $W_{q2}$, $W_{h2}$, $W_{R2}$, and $W_{V2}$ (each in $\mathbb{R}^{d \times d}$) denote the trainable parameters corresponding to q, $h_a$, $R'$, and V, respectively;

Step 5.2. Using Eq. (13), apply dual-visual cross attention to the local visual features $R=[r_1,r_2,\dots,r_k,\dots,r_K]$ to obtain the further-attended local visual feature vector $R'' \in \mathbb{R}^d$:

$$R'' = \alpha_{R2} R^T \tag{13}$$

In Eq. (13), $\alpha_{R2} \in \mathbb{R}^{1 \times K}$ denotes the further attention distribution over the local visual features R, and:

$$\alpha_{R2} = \mathrm{softmax}(P_4^T z_{R2}) \tag{14}$$

In Eq. (14), $z_{R2} \in \mathbb{R}^{d \times K}$ denotes the similarity matrix between the sentence-level question feature vector q, the attended history feature vector $h_a$, the attended global visual feature vector $V'$, and the local visual features R, $P_4 \in \mathbb{R}^{d}$ denotes the trainable parameter of the similarity matrix $z_{R2}$, and:

$$z_{R2} = \tanh(W'_{q2} q + W'_{h2} h_a + W'_{V2} V' + W'_{R2} R) \tag{15}$$

In Eq. (15), $W'_{q2}$, $W'_{h2}$, $W'_{V2}$, and $W'_{R2}$ (each in $\mathbb{R}^{d \times d}$) denote the trainable parameters corresponding to q, $h_a$, $V'$, and R, respectively;
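Continuing the sketch above, the cross attention of Step 5 only adds the attended vector of the other visual stream as a third guide:

```python
# Cross attention (Step 5): each visual stream is additionally guided by
# the attended vector of the other stream obtained in Step 4.
cross_v = GuidedAttention(d, num_guides=3)     # Eqs. (10)-(12)
cross_r = GuidedAttention(d, num_guides=3)     # Eqs. (13)-(15)
V2 = cross_v(V, [q, h_a, R1])                  # V''
R2 = cross_r(R, [q, h_a, V1])                  # R''
```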

Step 6. Optimization of the visual features;

Step 6.1. Using Eq. (16), apply word-level attention to the current question Q to obtain the attended word-level question feature vector $q_s \in \mathbb{R}^{d_w}$:

$$q_s = \alpha_q Q^T \tag{16}$$

In Eq. (16), $\alpha_q \in \mathbb{R}^{1 \times L_1}$ denotes the attention distribution over the current question Q, and:

$$\alpha_q = \mathrm{softmax}(P_5^T z_Q) \tag{17}$$

In Eq. (17), $z_Q \in \mathbb{R}^{d_w \times L_1}$ denotes the self-attention semantic matrix of the current question Q, $P_5 \in \mathbb{R}^{d_w}$ denotes the trainable parameter of the self-attention semantic matrix $z_Q$, and:

$$z_Q = \tanh(W_Q Q) \tag{18}$$

In Eq. (18), $W_Q \in \mathbb{R}^{d_w \times d_w}$ denotes the trainable parameter corresponding to the current question Q in the word-level attention processing;

Step 6.2. Using Eqs. (19) and (20), optimize the further-attended global visual feature vector $V''$ and local visual feature vector $R''$ respectively, obtaining the final global visual feature vector $\hat{v} \in \mathbb{R}^d$ and local visual feature vector $\hat{r} \in \mathbb{R}^d$:

$$\hat{v} = (W_s q_s) \odot V'' \tag{19}$$

$$\hat{r} = (W_s q_s) \odot R'' \tag{20}$$

In Eqs. (19) and (20), $W_s \in \mathbb{R}^{d \times d_w}$ denotes the trainable parameter corresponding to the word-level question feature vector $q_s$ in the visual feature optimization, and ⊙ denotes the element-wise (Hadamard) product;
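A sketch of the word-level question attention and the visual-feature optimization of Step 6; the exact form of Eqs. (19) and (20) is reconstructed here as a projected element-wise gate, since the original equation images are not recoverable, so that part is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_w, d, L1 = 300, 512, 16
W_Q = nn.Linear(d_w, d_w, bias=False)   # trainable parameter of Eq. (18)
P_5 = nn.Parameter(torch.randn(d_w))    # trainable parameter of Eq. (17)
W_s = nn.Linear(d_w, d, bias=False)     # assumed projection of q_s to dim d

Q = torch.randn(L1, d_w)                # word vectors of the current question
V2, R2 = torch.randn(d), torch.randn(d) # V'' and R'' from Step 5

z_Q = torch.tanh(W_Q(Q))                # (L1, d_w), Eq. (18)
alpha_q = F.softmax(z_Q @ P_5, dim=0)   # (L1,),    Eq. (17)
q_s = alpha_q @ Q                       # word-level question vector, Eq. (16)

# Assumed form of Eqs. (19)-(20): a projected element-wise gate.
v_hat = W_s(q_s) * V2                   # final global visual vector
r_hat = W_s(q_s) * R2                   # final local visual vector
```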

Step 7. Multimodal semantic fusion and decoding to generate the answer feature sequence;

Step 7.1. Concatenate the attended word-level question feature vector $q_s$, the attended history feature vector $h_a$, and the optimized global and local visual feature vectors $\hat{v}$ and $\hat{r}$ to obtain the multimodal feature vector $e_M \in \mathbb{R}^{d_M}$, where $d_M = 3d + d_w$ is the dimension of the multimodal feature vector; then map the multimodal feature vector $e_M$ through a fully connected layer to obtain the fused semantic feature vector $e \in \mathbb{R}^d$;

Step 7.2. Input the fused semantic feature vector e into a long short-term memory network (LSTM) to obtain the hidden-state feature sequence of the predicted answer $H_A=[h_{A,1},\dots,h_{A,i},\dots,h_{A,L_3}]$, where $h_{A,i}$ is the output of the i-th LSTM step and $L_3$ is the sentence length of the ground-truth answer label $A_{GT}$;

Step 7.3. Using a fully connected layer, map the hidden-state feature sequence $H_A$ of the predicted answer into the space of the same dimension as the one-hot vector table O, obtaining the set of predicted-answer word vectors $Y=[y_1,\dots,y_i,\dots,y_{L_3}]$, where $y_i$ is the mapped vector of the i-th word of the predicted answer, and the vector length equals the number of words;
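A sketch of the fusion and decoding of Step 7; feeding the fused vector e as the input at every decoder step is one simple arrangement assumed here, since the patent does not spell out the decoder wiring:

```python
import torch
import torch.nn as nn

d, d_w, N, L3 = 512, 300, 10000, 9
d_M = 3 * d + d_w                        # dimension of e_M (Step 7.1)

q_s = torch.randn(d_w)
h_a, v_hat, r_hat = torch.randn(d), torch.randn(d), torch.randn(d)

fuse = nn.Linear(d_M, d)                 # fully connected fusion layer
e = fuse(torch.cat([q_s, h_a, v_hat, r_hat]))  # fused semantic vector, (d,)

decoder = nn.LSTM(d, d, batch_first=True)
to_vocab = nn.Linear(d, N)               # map hidden states to vocab size

# Assumed wiring: feed e as the input of every one of the L3 decoder steps.
h_A, _ = decoder(e.view(1, 1, d).repeat(1, L3, 1))  # (1, L3, d)
Y = to_vocab(h_A)                        # predicted word vectors, (1, L3, N)
```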

Step 8. Parameter optimization of the visual dialogue generation network model based on the dual visual attention network;

Step 8.1. Using the word one-hot vector table O, build the vector set $\hat{Y}=[\hat{y}_1,\dots,\hat{y}_i,\dots,\hat{y}_{L_3}]$ for the words of the ground-truth answer label $A_{GT}$, where $\hat{y}_i$ is the mapped vector of the i-th word of $A_{GT}$, and the vector length equals the number of words;

Step 8.2. Using Eq. (21), compute the loss E between the predicted answer and the ground-truth answer $A_{GT}$:

$$E = -\sum_{i=1}^{L_3} \hat{y}_i^T \log\big(\mathrm{softmax}(y_i)\big) \tag{21}$$

Step 8.3. Optimize the loss E with stochastic gradient descent so that E is minimized, thereby obtaining the visual dialogue network model based on the dual visual attention network with optimal parameters;
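A sketch of the training objective of Step 8, reading Eq. (21) as the standard per-word cross-entropy between the predicted score vectors and the one-hot ground-truth words (an assumption, since the original equation image is not recoverable); in practice the optimizer would update all trainable parameters of the model rather than the toy tensor below:

```python
import torch
import torch.nn as nn

N, L3 = 10000, 9
Y = torch.randn(1, L3, N, requires_grad=True)  # predicted word score vectors
gt_ids = torch.randint(0, N, (1, L3))          # ground-truth word indices

# Eq. (21) read as per-word cross-entropy against one-hot targets
# (assumed but standard formulation for this setup).
loss = nn.CrossEntropyLoss()(Y.view(-1, N), gt_ids.view(-1))

optimizer = torch.optim.SGD([Y], lr=0.1)       # stochastic gradient descent
optimizer.zero_grad()
loss.backward()
optimizer.step()
```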

Step 9. Generation of the predicted answer;

For the set of predicted-answer word vectors $Y=[y_1,\dots,y_i,\dots,y_{L_3}]$, use a greedy decoding algorithm: for the i-th word, find the position of the maximum value in the mapped vector $y_i$ and look up the word at that position in the word index table Voc as the final predicted word for $y_i$; this yields the predicted answer corresponding to the word vector set Y, and the current question Q together with the predicted answer corresponding to Y constitutes the finally generated visual dialogue.
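Greedy decoding of Step 9 on a toy vocabulary; the id-to-word table below is a stand-in for the index table Voc of Step 1.2:

```python
import torch

# Toy stand-in for the word index table Voc of Step 1.2 (id -> word).
id2word = {0: "<pad>", 1: "yes", 2: "no", 3: "two", 4: "dogs"}

Y = torch.randn(3, len(id2word))   # one score vector y_i per answer word
ids = Y.argmax(dim=1)              # greedy choice: position of the maximum
answer = " ".join(id2word[int(i)] for i in ids if int(i) != 0)
print(answer)
```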

Compared with the prior art, the beneficial effects of the present invention are as follows:

1. Compared with previously studied visual dialogue techniques, the present invention extracts not only the visual features of the global image but also the visual features of local image objects. The global visual features contain more comprehensive visual semantic information, while the local visual features carry more specific visual semantics; the characteristics of both kinds of visual features are thus fully taken into account. Two-stage attention processing then learns the internal relations of each kind of visual feature and the mutual relations between them, making the two kinds of visual semantics complementary, so that the agent obtains more complete and more accurate visual semantic information.

2. The present invention processes text features at the sentence level and at the word level. It first extracts sentence-level features of the question and the dialogue history and applies attention to the history features; it then learns the relation between the two kinds of visual features based on the resulting sentence-level text features; finally, it applies word-level attention to the question features to capture the keyword features in the question that help with answer inference. This finer-grained text processing allows the invention to generate more accurate and reasonable answers in visual dialogue.

3. The present invention proposes a multimodal semantic fusion structure. The structure first uses the word-level question text features to optimize the two kinds of visual features separately, further highlighting the visual information related to the question keywords. It then concatenates the question features, the dialogue-history features, the global visual features, and the local visual features for learning and fusion; in this way the visual and text features influence one another through the multimodal semantic fusion network and help optimize the network parameters. Once the fusion network has captured both visual and textual semantics, the answer prediction of the agent improves considerably and the predicted results become more accurate.

Description of the Drawings

Figure 1 is a schematic diagram of the network model of the present invention;

Figure 2 is a schematic diagram of the dual visual attention processing of the present invention;

Figure 3 is a schematic diagram of the training of the network model of the present invention.

Detailed Description of the Embodiments

In this embodiment, as shown in Figure 1, a visual dialogue generation method based on a dual visual attention network is carried out according to the following steps:

Step 1. Preprocessing of the text input in the visual dialogue and construction of the word table:

Step 1.1. Obtain a visual dialogue dataset from the Internet. The main publicly available dataset is the VisDial dataset, collected by researchers at the Georgia Institute of Technology; it contains sentence texts and images.

Perform word segmentation on all sentence texts in the visual dialogue dataset to obtain the segmented words.

Step 1.2. From the segmented words, select all words whose frequency is greater than a threshold (the threshold can be set to 4) and build the word index table Voc. The word index table Voc is created as follows: the word table may contain words and punctuation marks; count the words and sort them, adding a blank (padding) token to support the optimized training procedure; then build the word-to-index correspondence table for all words in order. Next, one-hot encode each word in the index table Voc to obtain the one-hot vector table $O=[o_1,o_2,\dots,o_n,\dots,o_N]$, where $o_n$ is the one-hot encoding vector of the n-th word in the index table Voc and N is the number of words in the index table Voc;

Step 1.3. Randomly initialize a word embedding matrix $W_e \in \mathbb{R}^{d_w \times N}$, where $d_w$ is the dimension of the word vectors; use the word embedding matrix $W_e$ to map the encoding vector $o_n$ of the n-th word in the one-hot vector table to the n-th word vector $w_n = W_e o_n$, $w_n \in \mathbb{R}^{d_w}$, thereby obtaining the word vector table;

Step 2. Feature extraction of the dialogue image and feature extraction of the dialogue text;

Step 2.1. From the visual dialogue dataset, take any image I together with its corresponding dialogue history $U=[u_1,u_2,\dots,u_t,\dots,u_T]$, current question $Q=[w_{Q,1},\dots,w_{Q,i},\dots,w_{Q,L_1}]$, and ground-truth answer label $A_{GT}$, forming the visual dialogue information D; here T is the total number of dialogue rounds in the history U, $u_t$ is the t-th round of the dialogue, and $L_1$ is the sentence length of the current question Q. $L_1$ can be set to 16; sentences shorter than 16 words are padded with zero vectors up to length $L_1$. $w_{Q,i}$ is the word vector of the i-th word in the sentence;

Step 2.2. Use a convolutional neural network to extract features of image I in the visual dialogue information D, obtaining the global visual features $V^{(0)}=[v^{(0)}_1,\dots,v^{(0)}_m,\dots,v^{(0)}_M]$, where $v^{(0)}_m \in \mathbb{R}^{d_g}$ is the m-th region feature of $V^{(0)}$, M is the total number of spatial regions in $V^{(0)}$, and $d_g$ is the channel dimension of $V^{(0)}$. In this embodiment, a pretrained VGG convolutional neural network can be used to extract the global visual features of image I. VGG is a two-dimensional convolutional neural network that has been shown to have strong visual representation power, so a VGG pretrained on the COCO 2014 dataset is used as the global visual feature extractor of the experiments; this part of the network does not participate in the parameter update of the subsequent Step 8;

Step 2.3. Use an object detection feature extractor to extract features of image I in the visual dialogue information D, obtaining the local visual features $R^{(0)}=[r^{(0)}_1,\dots,r^{(0)}_k,\dots,r^{(0)}_K]$, where $r^{(0)}_k \in \mathbb{R}^{d_r}$ is the k-th target-object feature of $R^{(0)}$, K is the total number of detected local target objects in $R^{(0)}$, and $d_r$ is the channel dimension of $R^{(0)}$. In this embodiment, a pretrained Faster R-CNN object detection feature extractor can be used to extract the local visual features of image I. Local visual features extracted by Faster R-CNN have achieved excellent results on many visual tasks, so a Faster R-CNN pretrained on the Visual Genome dataset is used as the local visual feature extractor of the experiments; this part of the network does not participate in the parameter update of the subsequent Step 8;
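A sketch of the two feature extractors of Steps 2.2 and 2.3; the torchvision VGG-16 weights below are ImageNet ones rather than the COCO-pretrained network of the embodiment, and the detector features are shown as a precomputed placeholder tensor, since Faster R-CNN region features are commonly exported offline:

```python
import torch
import torchvision

# Global features: the conv feature map of a pretrained VGG-16, flattened
# into M = 7*7 region vectors of dimension d_g = 512. (Illustrative wiring
# only; the embodiment uses a VGG pretrained on COCO 2014.)
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()

image = torch.randn(1, 3, 224, 224)          # a preprocessed input image
with torch.no_grad():
    fmap = vgg(image)                        # (1, 512, 7, 7)
V0 = fmap.flatten(2).squeeze(0).t()          # (M, d_g) = (49, 512)

# Local features: in practice the K object features are exported once from
# a Faster R-CNN (e.g. pretrained on Visual Genome) and loaded as a tensor.
K, d_r = 36, 2048
R0 = torch.randn(K, d_r)                     # placeholder for detector output
```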

Step 2.4. Use fully connected layers to map the global visual features and the local visual features into a space of the same dimension, obtaining the transformed global visual features $V=[v_1,v_2,\dots,v_m,\dots,v_M]$ and local visual features $R=[r_1,r_2,\dots,r_k,\dots,r_K]$, with $v_m, r_k \in \mathbb{R}^d$, where $v_m$ is the m-th region feature of the global visual features V, $r_k$ is the k-th target-object feature of the local visual features R, and d is the transformed channel dimension;

Step 2.5. Use a long short-term memory network (LSTM) to extract features from the current question Q, obtaining the hidden-state feature sequence $H_Q=[h_{Q,1},\dots,h_{Q,i},\dots,h_{Q,L_1}]$, and take the hidden-state feature $h_{Q,L_1}$ output at the last LSTM step as the sentence-level question feature vector $q \in \mathbb{R}^d$ of the current question Q, where $h_{Q,i}$ is the hidden-state feature output at the i-th LSTM step;

Step 2.6. Use an LSTM to extract features from each round $u_t=[w_{t,1},\dots,w_{t,i},\dots,w_{t,L_2}]$ of the dialogue history U, obtaining the hidden-state sequence $H_t=[h_{t,1},\dots,h_{t,i},\dots,h_{t,L_2}]$; take the hidden-state feature $h_{t,L_2}$ output at the last LSTM step as the sentence-level feature $h_t \in \mathbb{R}^d$ of the round $u_t$. The overall dialogue-history features are then $H=[h_1,h_2,\dots,h_t,\dots,h_T]$, where $w_{t,i}$ is the word vector of the i-th word in $u_t$ and $L_2$ is the sentence length of $u_t$. $L_2$ can be set to 25; sentences shorter than 25 words are padded with zero vectors up to length $L_2$. $h_{t,i}$ is the hidden-state feature output at the i-th LSTM step;

Step 3. Apply attention processing to the dialogue-history information based on the current question information;

Using Eq. (1), apply attention to the overall dialogue-history features $H=[h_1,h_2,\dots,h_t,\dots,h_T]$ to obtain the attended history feature vector $h_a \in \mathbb{R}^d$:

$$h_a = \alpha_h H^T \tag{1}$$

In Eq. (1), $\alpha_h \in \mathbb{R}^{1 \times T}$ denotes the attention distribution over the history features H, and:

$$\alpha_h = \mathrm{softmax}(P^T z_h) \tag{2}$$

In Eq. (2), $z_h \in \mathbb{R}^{d \times T}$ denotes the similarity matrix between the sentence-level question feature vector q and the history features H, $P \in \mathbb{R}^{d}$ denotes the trainable parameter of the similarity matrix $z_h$, and:

$$z_h = \tanh(W_q q + W_h H) \tag{3}$$

In Eq. (3), $W_q \in \mathbb{R}^{d \times d}$ denotes the trainable parameter corresponding to the sentence-level question feature vector q, and $W_h \in \mathbb{R}^{d \times d}$ denotes the trainable parameter corresponding to the history features H;

Step 4. As shown in Figure 2, apply independent attention processing to each of the two visual features;

Step 4.1. Using Eq. (4), apply attention to the global visual features $V=[v_1,v_2,\dots,v_m,\dots,v_M]$ to obtain the attended global visual feature vector $V' \in \mathbb{R}^d$:

$$V' = \alpha_{V1} V^T \tag{4}$$

In Eq. (4), $\alpha_{V1} \in \mathbb{R}^{1 \times M}$ denotes the attention distribution over the global visual features V, and:

$$\alpha_{V1} = \mathrm{softmax}(P_1^T z_{V1}) \tag{5}$$

In Eq. (5), $z_{V1} \in \mathbb{R}^{d \times M}$ denotes the similarity matrix between the sentence-level question feature vector q, the attended history feature vector $h_a$, and the global visual features V, $P_1 \in \mathbb{R}^{d}$ denotes the trainable parameter of the similarity matrix $z_{V1}$, and:

$$z_{V1} = \tanh(W_{q1} q + W_{h1} h_a + W_{V1} V) \tag{6}$$

In Eq. (6), $W_{q1}$, $W_{h1}$, and $W_{V1}$ (each in $\mathbb{R}^{d \times d}$) denote the trainable parameters corresponding to q, $h_a$, and V, respectively;

Step 4.2. Using Eq. (7), apply attention to the local visual features $R=[r_1,r_2,\dots,r_k,\dots,r_K]$ to obtain the attended local visual feature vector $R' \in \mathbb{R}^d$:

$$R' = \alpha_{R1} R^T \tag{7}$$

In Eq. (7), $\alpha_{R1} \in \mathbb{R}^{1 \times K}$ denotes the attention distribution over the local visual features R, and:

$$\alpha_{R1} = \mathrm{softmax}(P_2^T z_{R1}) \tag{8}$$

In Eq. (8), $z_{R1} \in \mathbb{R}^{d \times K}$ denotes the similarity matrix between the sentence-level question feature vector q, the attended history feature vector $h_a$, and the local visual features R, $P_2 \in \mathbb{R}^{d}$ denotes the trainable parameter of the similarity matrix $z_{R1}$, and:

$$z_{R1} = \tanh(W'_{q1} q + W'_{h1} h_a + W_{R1} R) \tag{9}$$

In Eq. (9), $W'_{q1}$, $W'_{h1}$, and $W_{R1}$ (each in $\mathbb{R}^{d \times d}$) denote the trainable parameters corresponding to q, $h_a$, and R, respectively;

Step 5. As shown in Figure 2, apply cross attention processing between the two visual features;

Step 5.1. Using Eq. (10), apply dual-visual cross attention to the global visual features $V=[v_1,v_2,\dots,v_m,\dots,v_M]$ to obtain the further-attended global visual feature vector $V'' \in \mathbb{R}^d$:

$$V'' = \alpha_{V2} V^T \tag{10}$$

In Eq. (10), $\alpha_{V2} \in \mathbb{R}^{1 \times M}$ denotes the further attention distribution over the global visual features V, and:

$$\alpha_{V2} = \mathrm{softmax}(P_3^T z_{V2}) \tag{11}$$

In Eq. (11), $z_{V2} \in \mathbb{R}^{d \times M}$ denotes the similarity matrix between the sentence-level question feature vector q, the attended history feature vector $h_a$, the attended local visual feature vector $R'$, and the global visual features V, $P_3 \in \mathbb{R}^{d}$ denotes the trainable parameter of the similarity matrix $z_{V2}$, and:

$$z_{V2} = \tanh(W_{q2} q + W_{h2} h_a + W_{R2} R' + W_{V2} V) \tag{12}$$

In Eq. (12), $W_{q2}$, $W_{h2}$, $W_{R2}$, and $W_{V2}$ (each in $\mathbb{R}^{d \times d}$) denote the trainable parameters corresponding to q, $h_a$, $R'$, and V, respectively;

Step 5.2. Using Eq. (13), apply dual-visual cross attention to the local visual features $R=[r_1,r_2,\dots,r_k,\dots,r_K]$ to obtain the further-attended local visual feature vector $R'' \in \mathbb{R}^d$:

$$R'' = \alpha_{R2} R^T \tag{13}$$

In Eq. (13), $\alpha_{R2} \in \mathbb{R}^{1 \times K}$ denotes the further attention distribution over the local visual features R, and:

$$\alpha_{R2} = \mathrm{softmax}(P_4^T z_{R2}) \tag{14}$$

In Eq. (14), $z_{R2} \in \mathbb{R}^{d \times K}$ denotes the similarity matrix between the sentence-level question feature vector q, the attended history feature vector $h_a$, the attended global visual feature vector $V'$, and the local visual features R, $P_4 \in \mathbb{R}^{d}$ denotes the trainable parameter of the similarity matrix $z_{R2}$, and:

$$z_{R2} = \tanh(W'_{q2} q + W'_{h2} h_a + W'_{V2} V' + W'_{R2} R) \tag{15}$$

In Eq. (15), $W'_{q2}$, $W'_{h2}$, $W'_{V2}$, and $W'_{R2}$ (each in $\mathbb{R}^{d \times d}$) denote the trainable parameters corresponding to q, $h_a$, $V'$, and R, respectively;

Step 6. Optimization of the visual features;

Step 6.1. Using Eq. (16), apply word-level attention to the current question Q to obtain the attended word-level question feature vector $q_s \in \mathbb{R}^{d_w}$:

$$q_s = \alpha_q Q^T \tag{16}$$

In Eq. (16), $\alpha_q \in \mathbb{R}^{1 \times L_1}$ denotes the attention distribution over the current question Q, and:

$$\alpha_q = \mathrm{softmax}(P_5^T z_Q) \tag{17}$$

In Eq. (17), $z_Q \in \mathbb{R}^{d_w \times L_1}$ denotes the self-attention semantic matrix of the current question Q, $P_5 \in \mathbb{R}^{d_w}$ denotes the trainable parameter of the self-attention semantic matrix $z_Q$, and:

$$z_Q = \tanh(W_Q Q) \tag{18}$$

In Eq. (18), $W_Q \in \mathbb{R}^{d_w \times d_w}$ denotes the trainable parameter corresponding to the current question Q in the word-level attention processing;

Step 6.2. Using Eqs. (19) and (20), optimize the further-attended global visual feature vector $V''$ and local visual feature vector $R''$ respectively, obtaining the final global visual feature vector $\hat{v} \in \mathbb{R}^d$ and local visual feature vector $\hat{r} \in \mathbb{R}^d$:

$$\hat{v} = (W_s q_s) \odot V'' \tag{19}$$

$$\hat{r} = (W_s q_s) \odot R'' \tag{20}$$

In Eqs. (19) and (20), $W_s \in \mathbb{R}^{d \times d_w}$ denotes the trainable parameter corresponding to the word-level question feature vector $q_s$ in the visual feature optimization, and ⊙ denotes the element-wise (Hadamard) product;

Step 7. Multimodal semantic fusion and decoding to generate the answer feature sequence;

Step 7.1. Concatenate the attended word-level question feature vector $q_s$, the attended history feature vector $h_a$, and the optimized global and local visual feature vectors $\hat{v}$ and $\hat{r}$ to obtain the multimodal feature vector $e_M \in \mathbb{R}^{d_M}$, where $d_M = 3d + d_w$ is the dimension of the multimodal feature vector; then map the multimodal feature vector $e_M$ through a single fully connected layer to obtain the fused semantic feature vector $e \in \mathbb{R}^d$;

Step 7.2. Input the fused semantic feature vector e into a long short-term memory network (LSTM) to obtain the hidden-state feature sequence of the predicted answer $H_A=[h_{A,1},\dots,h_{A,i},\dots,h_{A,L_3}]$, where $h_{A,i}$ is the output of the i-th LSTM time step and $L_3$ is the sentence length of the ground-truth answer label $A_{GT}$; $L_3$ can be set to 9;

Step 7.3. Using a fully connected layer, map the hidden-state feature sequence $H_A$ of the predicted answer into the space of the same dimension as the one-hot vector table O, obtaining the set of predicted-answer word vectors $Y=[y_1,\dots,y_i,\dots,y_{L_3}]$, where $y_i$ is the mapped vector of the i-th word of the predicted answer, and the vector length equals the number of words;

Step 8. As shown in Figure 3, optimize the parameters of the visual dialogue generation network model based on the dual visual attention network;

Step 8.1. Using the word one-hot vector table O, build the vector set $\hat{Y}=[\hat{y}_1,\dots,\hat{y}_i,\dots,\hat{y}_{L_3}]$ for the words of the ground-truth answer label $A_{GT}$, where $\hat{y}_i$ is the mapped vector of the i-th word of $A_{GT}$, and the vector length equals the number of words;

Step 8.2. Use formula (21) to compute the loss cost E between the predicted answer and the ground-truth answer A_GT:

E = −Σ_{i=1}^{L_3} (y_i^gt)^T · log(softmax(y_i))   (21)

Step 8.3. Use stochastic gradient descent to optimize the loss cost E until it is minimized, thereby obtaining the visual dialogue network model based on the dual visual attention network with optimal parameters;
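
Steps 8.1-8.3 amount to a per-word cross-entropy against the one-hot ground truth, minimized with SGD. A minimal sketch, assuming formula (21) is the summed negative log-likelihood of the ground-truth word at each step and an illustrative learning rate:

```python
import torch
import torch.nn.functional as F

def loss_E(Y: torch.Tensor, gt_idx: torch.Tensor) -> torch.Tensor:
    """Cross-entropy reading of Eq. (21).
    Y: (batch, L3, N) predicted word vectors; gt_idx: (batch, L3) indices of
    the ground-truth words (the argmax positions of the one-hot y_i^gt)."""
    return F.cross_entropy(Y.flatten(0, 1), gt_idx.flatten(), reduction="sum")

# Illustrative optimization step (assuming 'model' aggregates all modules above):
# opt = torch.optim.SGD(model.parameters(), lr=0.01)
# opt.zero_grad(); E = loss_E(Y, gt_idx); E.backward(); opt.step()
```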

Step 9. Predicted-answer generation;

For the word-vector set Y = [y_1, y_2, ..., y_{L_3}] of the predicted answer, use a greedy decoding algorithm to obtain the position of the maximum value in the mapping vector y_i of the i-th word, and look up the word at that position in the word index table Voc as the final predicted word for y_i; this yields the predicted answer corresponding to the word-vector set Y, and the current question Q together with that predicted answer forms the finally generated visual dialogue.
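
A sketch of the step-9 greedy decoding, assuming the index table Voc is available as a Python list `voc` aligned with the vector positions:

```python
import torch

def greedy_decode(Y: torch.Tensor, voc: list) -> str:
    """Y: (L3, N) word vectors of one predicted answer. For each y_i, take
    the position of the maximum value and look the word up in Voc."""
    positions = Y.argmax(dim=-1)  # (L3,) positions of the per-step maxima
    return " ".join(voc[i] for i in positions.tolist())

# usage: answer = greedy_decode(Y[0], voc)  # decode the first sample of a batch
```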

Claims (1)

1. A visual dialogue generation method based on a dual visual attention network, characterized in that it is performed according to the following steps:

Step 1. Preprocessing of the text input of the visual dialogue and construction of the word list:

Step 1.1. Obtain a visual dialogue dataset containing sentence texts and images; perform word segmentation on all sentence texts in the visual dialogue dataset to obtain the segmented words;

Step 1.2. From the segmented words, select all words whose frequency is greater than a threshold and construct the word index table Voc; then apply one-hot encoding to every word in the index table Voc to obtain the one-hot vector table O = [o_1, o_2, ..., o_n, ..., o_N], where o_n denotes the one-hot encoding vector of the n-th word in the index table Voc and N is the number of words in the index table Voc;

Step 1.3. Randomly initialize a word embedding matrix W_e, where d_w denotes the dimension of a word vector; use the word embedding matrix W_e to map the encoding vector of every word in the one-hot vector table to the corresponding word vector, thereby obtaining the word vector table;

Step 2. Feature extraction of the dialogue image and of the dialogue text:

Step 2.1. Obtain from the visual dialogue dataset any image I and its visual dialogue information D, consisting of the corresponding dialogue history U = [u_1, u_2, ..., u_t, ..., u_T], the current question Q = [w_{Q,1}, w_{Q,2}, ..., w_{Q,L_1}] and the ground-truth answer label A_GT; where T is the total number of dialogue fragments in the dialogue history U, u_t denotes the t-th dialogue fragment, L_1 denotes the sentence length of the current question Q, and w_{Q,i} denotes the word vector in the word vector table corresponding to the i-th word of the current question Q;

Step 2.2. Use a convolutional neural network to extract features of the image I in the visual dialogue information D, obtaining the global visual feature V^(0) = [v_1^(0), v_2^(0), ..., v_m^(0), ..., v_M^(0)], where v_m^(0) denotes the m-th region feature in V^(0), M denotes the total number of spatial regions in V^(0), and d_g is the channel dimension of V^(0);

Step 2.3. Use an object detection feature extractor to extract features of the image I in the visual dialogue information D, obtaining the local visual feature R^(0) = [r_1^(0), r_2^(0), ..., r_k^(0), ..., r_K^(0)], where r_k^(0) denotes the k-th target object feature in R^(0), K denotes the total number of detected local target objects in R^(0), and d_r is the channel dimension of R^(0);

Step 2.4. Use fully connected operations to map the global visual feature and the local visual feature into a space of the same dimension, obtaining the transformed global visual feature V = [v_1, v_2, ..., v_m, ..., v_M] and local visual feature R = [r_1, r_2, ..., r_k, ..., r_K], where v_m denotes the m-th region feature in V, r_k denotes the k-th target object feature in R, and d is the transformed channel dimension;

Step 2.5. Use a long short-term memory network (LSTM) to extract features from the current question Q, obtaining a hidden-state feature sequence, and take the hidden-state feature output at the last step of the LSTM as the sentence-level question feature vector q of the current question Q, where h_{Q,i} denotes the hidden-state feature output at the i-th step of the LSTM;

Step 2.6. Use an LSTM to extract features from the t-th dialogue fragment u_t = [w_{t,1}, w_{t,2}, ..., w_{t,L_2}] of the dialogue history U, obtaining the t-th hidden-state sequence, and take the hidden-state feature output at the last step of the LSTM as the sentence-level feature h_t of u_t; the total dialogue history feature is then H = [h_1, h_2, ..., h_t, ..., h_T], where w_{t,i} denotes the word vector in the word vector table corresponding to the i-th word of u_t, L_2 is the sentence length of u_t, and h_{t,i} denotes the hidden-state feature output at the i-th step of the LSTM;

Step 3. Attention processing of the dialogue history based on the current question:

Use formula (1) to apply attention to the total dialogue history feature H = [h_1, h_2, ..., h_t, ..., h_T], obtaining the attended history feature vector h_a:

h_a = α_h H^T   (1)

In formula (1), α_h denotes the attention distribution weights over the dialogue history feature H, and has:

α_h = softmax(P^T z_h)   (2)

In formula (2), z_h denotes the similarity matrix between the sentence-level question feature vector q and the dialogue history feature H, P denotes the to-be-trained parameter of the similarity matrix z_h, and has:

z_h = tanh(W_q q + W_h H)   (3)

In formula (3), W_q denotes the to-be-trained parameter corresponding to the sentence-level question feature vector q and W_h denotes the to-be-trained parameter corresponding to the dialogue history feature H;

Step 4. Independent attention processing of the two visual features:

Step 4.1. Use formula (4) to apply attention to the global visual feature V = [v_1, v_2, ..., v_m, ..., v_M], obtaining the attended global visual feature vector V′:

V′ = α_V1 V^T   (4)

In formula (4), α_V1 denotes the attention distribution weights over the global visual feature V, and has:

α_V1 = softmax(P_V1^T z_V1)   (5)

In formula (5), z_V1 denotes the similarity matrix among the sentence-level question feature vector q, the attended history feature vector h_a and the global visual feature V, P_V1 denotes the to-be-trained parameter of the similarity matrix z_V1, and has:

z_V1 = tanh(W_q1 q + W_h1 h_a + W_V1 V)   (6)

In formula (6), W_q1, W_h1 and W_V1 denote the to-be-trained parameters corresponding to q, h_a and V, respectively;

Step 4.2. Use formula (7) to apply attention to the local visual feature R = [r_1, r_2, ..., r_k, ..., r_K], obtaining the attended local visual feature vector R′:

R′ = α_R1 R^T   (7)

In formula (7), α_R1 denotes the attention distribution weights over the local visual feature R, and has:

α_R1 = softmax(P_R1^T z_R1)   (8)

In formula (8), z_R1 denotes the similarity matrix among q, h_a and the local visual feature R, P_R1 denotes the to-be-trained parameter of the similarity matrix z_R1, and has:

z_R1 = tanh(W′_q1 q + W′_h1 h_a + W_R1 R)   (9)

In formula (9), W′_q1, W′_h1 and W_R1 denote the to-be-trained parameters corresponding to q, h_a and R, respectively;

Step 5. Cross attention processing between the two visual features:

Step 5.1. Use formula (10) to apply dual-visual cross attention to the global visual feature V = [v_1, v_2, ..., v_m, ..., v_M], obtaining the further-attended global visual feature vector V″:

V″ = α_V2 V^T   (10)

In formula (10), α_V2 denotes the further attention distribution weights over the global visual feature V, and has:

α_V2 = softmax(P_V2^T z_V2)   (11)

In formula (11), z_V2 denotes the similarity matrix among q, h_a, the attended local visual feature vector R′ and the global visual feature V, P_V2 denotes the to-be-trained parameter of the similarity matrix z_V2, and has:

z_V2 = tanh(W_q2 q + W_h2 h_a + W_R2 R′ + W_V2 V)   (12)

In formula (12), W_q2, W_h2, W_R2 and W_V2 denote the to-be-trained parameters corresponding to q, h_a, R′ and V, respectively;

Step 5.2. Use formula (13) to apply dual-visual cross attention to the local visual feature R = [r_1, r_2, ..., r_k, ..., r_K], obtaining the further-attended local visual feature vector R″:

R″ = α_R2 R^T   (13)

In formula (13), α_R2 denotes the further attention distribution weights over the local visual feature R, and has:

α_R2 = softmax(P_R2^T z_R2)   (14)

In formula (14), z_R2 denotes the similarity matrix among q, h_a, the attended global visual feature vector V′ and the local visual feature R, P_R2 denotes the to-be-trained parameter of the similarity matrix z_R2, and has:

z_R2 = tanh(W′_q2 q + W′_h2 h_a + W′_V2 V′ + W′_R2 R)   (15)

In formula (15), W′_q2, W′_h2, W′_V2 and W′_R2 denote the to-be-trained parameters corresponding to q, h_a, V′ and R, respectively;

Step 6. Optimization of the visual features:

Step 6.1. Use formula (16) to apply word-level attention to the current question Q, obtaining the attended word-level question feature vector q_s:

q_s = α_q Q^T   (16)

In formula (16), α_q denotes the attention distribution weights over the current question Q, and has:

α_q = softmax(P_Q^T z_Q)   (17)

In formula (17), z_Q denotes the self-attention semantic matrix of the current question Q, P_Q denotes the to-be-trained parameter of z_Q, and has:

z_Q = tanh(W_Q Q)   (18)

In formula (18), W_Q denotes the to-be-trained parameter corresponding to the current question Q in word-level attention processing;

Step 6.2. Use formulas (19) and (20) to optimize the further-attended global visual feature vector V″ and local visual feature vector R″, respectively, obtaining the final global visual feature vector v̂ and local visual feature vector r̂:

v̂ = V″ ⊙ tanh(W_s q_s)   (19)

r̂ = R″ ⊙ tanh(W_s q_s)   (20)

In formulas (19) and (20), W_s denotes the to-be-trained parameter corresponding to the word-level question feature vector q_s during visual feature optimization, and ⊙ denotes the element-wise product;

Step 7. Multimodal semantic fusion and decoding to generate the answer feature sequence:

Step 7.1. Concatenate the attended word-level question feature vector q_s, the attended history feature vector h_a, and the optimized global and local visual feature vectors v̂ and r̂ to obtain the multimodal feature vector e_M, where d_M = 3d + d_w is the dimension of the multimodal feature vector; then map e_M with a fully connected operation to obtain the fused semantic feature vector e;

Step 7.2. Feed the fused semantic feature vector e into an LSTM to obtain the hidden-state feature sequence of the predicted answer, where h_{A,i} is the output of the LSTM at the i-th step and L_3 is the sentence length of the ground-truth answer label A_GT;

Step 7.3. Use a fully connected operation to map the hidden-state feature sequence of the predicted answer into a space of the same dimension as the one-hot vector table O, obtaining the word-vector set of the predicted answer Y = [y_1, y_2, ..., y_{L_3}], where y_i denotes the mapping vector of the i-th word of the predicted answer and the vector length equals the number of words N;

Step 8. Parameter optimization of the visual dialogue generation network model based on the dual visual attention network:

Step 8.1. According to the word one-hot vector table O, construct the vector set Y^gt = [y_1^gt, y_2^gt, ..., y_{L_3}^gt] for the words in the ground-truth answer label A_GT, where y_i^gt denotes the mapping vector of the i-th word in A_GT and the vector length equals the number of words N;

Step 8.2. Use formula (21) to compute the loss cost E between the predicted answer and the ground-truth answer A_GT:

E = −Σ_{i=1}^{L_3} (y_i^gt)^T · log(softmax(y_i))   (21)

Step 8.3. Use stochastic gradient descent to optimize the loss cost E until it is minimized, thereby obtaining the visual dialogue network model based on the dual visual attention network with optimal parameters;

Step 9. Predicted-answer generation:

For the word-vector set Y = [y_1, y_2, ..., y_{L_3}] of the predicted answer, use a greedy decoding algorithm to obtain the position of the maximum value in the mapping vector y_i of the i-th word, and look up the word at that position in the word index table Voc as the final predicted word for y_i; this yields the predicted answer corresponding to the word-vector set Y, and the current question Q together with that predicted answer forms the finally generated visual dialogue.
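
The attention steps 3-5 of the claim share one pattern: a tanh similarity matrix over a feature sequence, shifted by projected guide vectors, then a softmax-weighted sum (formulas (1)-(15)). A generic sketch under assumed shapes follows; with guides (q) it plays the role of formulas (1)-(3), with (q, h_a) of the independent attention (4)-(9), and with (q, h_a, R′) or (q, h_a, V′) of the cross attention (10)-(15).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedAttention(nn.Module):
    """Illustrative sketch of the shared attention pattern of Eqs. (1)-(15).
    feats: a feature sequence (batch, n, feat_dim); guides: vectors such as
    q, h_a, V' or R' that condition the similarity matrix. The similarity
    dimension k is a free choice."""
    def __init__(self, feat_dim: int, guide_dims, k: int = 512):
        super().__init__()
        self.W_feat = nn.Linear(feat_dim, k, bias=False)
        self.W_guides = nn.ModuleList(nn.Linear(g, k, bias=False) for g in guide_dims)
        self.P = nn.Linear(k, 1, bias=False)

    def forward(self, feats, guides):
        z = self.W_feat(feats)                    # (batch, n, k)
        for W, g in zip(self.W_guides, guides):
            z = z + W(g).unsqueeze(1)             # broadcast each guide over n
        z = torch.tanh(z)                         # e.g. Eq. (3), (6), (12)
        alpha = F.softmax(self.P(z).squeeze(-1), dim=-1)         # e.g. Eq. (2)
        return torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)   # e.g. Eq. (1)
```

For instance, the history attention of formulas (1)-(3) would be GuidedAttention(feat_dim=d, guide_dims=(d,)) applied to H with the single guide q.
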
CN201910881305.0A 2019-09-18 2019-09-18 Visual conversation generation method based on double-visual attention network Pending CN110647612A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910881305.0A CN110647612A (en) 2019-09-18 2019-09-18 Visual conversation generation method based on double-visual attention network


Publications (1)

Publication Number Publication Date
CN110647612A true CN110647612A (en) 2020-01-03

Family

ID=68992004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910881305.0A Pending CN110647612A (en) 2019-09-18 2019-09-18 Visual conversation generation method based on double-visual attention network

Country Status (1)

Country Link
CN (1) CN110647612A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8160883B2 (en) * 2004-01-10 2012-04-17 Microsoft Corporation Focus tracking in dialogs
CN104077419A (en) * 2014-07-18 2014-10-01 合肥工业大学 Long inquiring image searching reordering algorithm based on semantic and visual information
US20170024645A1 (en) * 2015-06-01 2017-01-26 Salesforce.Com, Inc. Dynamic Memory Network
CN108877801A (en) * 2018-06-14 2018-11-23 南京云思创智信息科技有限公司 More wheel dialog semantics based on multi-modal Emotion identification system understand subsystem
CN110209789A (en) * 2019-05-29 2019-09-06 山东大学 A kind of multi-modal dialog system and method for user's attention guidance

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DANGUO等: "Dual Visual Attention Network for Visual Dialog", 《PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE》 *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114556443A (en) * 2020-01-15 2022-05-27 北京京东尚科信息技术有限公司 Multimedia data semantic analysis system and method using attention-based converged network
CN114556443B (en) * 2020-01-15 2025-01-07 北京京东尚科信息技术有限公司 Multimedia data semantic analysis system and method using attention-based fusion network
CN111967487A (en) * 2020-03-23 2020-11-20 同济大学 Incremental data enhancement method for visual question-answer model training and application
CN111967487B (en) * 2020-03-23 2022-09-20 同济大学 Incremental data enhancement method for visual question-answer model training and application
CN111967272A (en) * 2020-06-23 2020-11-20 合肥工业大学 Visual dialog generation system based on semantic alignment
CN111967272B (en) * 2020-06-23 2023-10-31 合肥工业大学 Visual dialogue generating system based on semantic alignment
CN111783475A (en) * 2020-07-28 2020-10-16 北京深睿博联科技有限责任公司 A Semantic Visual Localization Method and Device Based on Phrase Relation Propagation
CN111897939B (en) * 2020-08-12 2024-02-02 腾讯科技(深圳)有限公司 Visual dialogue method, training method, device and equipment for visual dialogue model
CN111897939A (en) * 2020-08-12 2020-11-06 腾讯科技(深圳)有限公司 Visual dialogue method, training device and training equipment of visual dialogue model
CN113010712A (en) * 2021-03-04 2021-06-22 天津大学 Visual question answering method based on multi-graph fusion
CN113010712B (en) * 2021-03-04 2022-12-02 天津大学 Visual question answering method based on multi-graph fusion
CN113177112A (en) * 2021-04-25 2021-07-27 天津大学 KR product fusion multi-mode information-based neural network visual dialogue model and method
CN113761153A (en) * 2021-05-19 2021-12-07 腾讯科技(深圳)有限公司 Question and answer processing method and device based on picture, readable medium and electronic equipment
CN113761153B (en) * 2021-05-19 2023-10-24 腾讯科技(深圳)有限公司 Picture-based question-answering processing method and device, readable medium and electronic equipment
CN113420606A (en) * 2021-05-31 2021-09-21 华南理工大学 Method for realizing autonomous navigation of robot based on natural language and machine vision
CN113420606B (en) * 2021-05-31 2022-06-14 华南理工大学 Method for realizing autonomous navigation of robot based on natural language and machine vision
CN113220859B (en) * 2021-06-01 2024-05-10 平安科技(深圳)有限公司 Question answering method and device based on image, computer equipment and storage medium
CN113220859A (en) * 2021-06-01 2021-08-06 平安科技(深圳)有限公司 Image-based question and answer method and device, computer equipment and storage medium
CN113435399A (en) * 2021-07-14 2021-09-24 电子科技大学 Multi-round visual dialogue method based on multi-level sequencing learning
CN113435399B (en) * 2021-07-14 2022-04-15 电子科技大学 Multi-round visual dialogue method based on multi-level sequencing learning
CN113515951A (en) * 2021-07-19 2021-10-19 同济大学 A story description generation method based on knowledge-augmented attention network and group-level semantics
CN113515951B (en) * 2021-07-19 2022-07-05 同济大学 A story description generation method based on knowledge-augmented attention network and group-level semantics
CN113553418A (en) * 2021-07-27 2021-10-26 天津大学 Visual dialog generation method and device based on multi-modal learning
CN113868451A (en) * 2021-09-02 2021-12-31 天津大学 Cross-modal social network conversation method and device based on context cascade perception
CN113868451B (en) 2024-06-11 Cross-modal conversation method and device for social network based on context cascade perception
CN113989300A (en) * 2021-10-29 2022-01-28 北京百度网讯科技有限公司 Method, device, electronic device and storage medium for lane line segmentation
CN114661874A (en) * 2022-03-07 2022-06-24 浙江理工大学 A visual question answering method based on multi-angle semantic understanding and adaptive dual-channel
CN114661874B (en) * 2022-03-07 2024-04-30 浙江理工大学 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels
CN114299510A (en) * 2022-03-08 2022-04-08 山东山大鸥玛软件股份有限公司 Handwritten English line recognition system
CN115098623B (en) * 2022-06-06 2024-12-10 中国船舶集团有限公司系统工程研究院 A physical training data feature extraction method based on BERT
CN115098623A (en) * 2022-06-06 2022-09-23 中国船舶集团有限公司系统工程研究院 Physical training data feature extraction method based on BERT
CN115422388B (en) * 2022-09-13 2024-07-26 四川省人工智能研究院(宜宾) Visual dialogue method and system
CN115422388A (en) * 2022-09-13 2022-12-02 四川省人工智能研究院(宜宾) Visual conversation method and system
US12223284B2 (en) 2022-09-13 2025-02-11 Sichuan Institute of Artificial Intelligence, Yibin, Sichuan, China Visual dialogue method and system
CN115277248A (en) * 2022-09-19 2022-11-01 南京聚铭网络科技有限公司 Network security alarm merging method, device and storage medium
CN116342332B (en) * 2023-05-31 2023-08-01 合肥工业大学 Auxiliary judging method, device, equipment and storage medium based on Internet
CN116342332A (en) * 2023-05-31 2023-06-27 合肥工业大学 Auxiliary judging method, device, equipment and storage medium based on Internet

Similar Documents

Publication Publication Date Title
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
US11631007B2 (en) Method and device for text-enhanced knowledge graph joint representation learning
CN110298037B (en) Text Recognition Approach Based on Convolutional Neural Network Matching with Enhanced Attention Mechanism
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN110502749B (en) Text relation extraction method based on double-layer attention mechanism and bidirectional GRU
CN108733792B (en) An Entity Relationship Extraction Method
CN110765775B (en) A Domain Adaptation Method for Named Entity Recognition Fusing Semantics and Label Differences
CN109992780B (en) Specific target emotion classification method based on deep neural network
CN110569508A (en) Emotional orientation classification method and system integrating part-of-speech and self-attention mechanism
CN108416065A (en) Image based on level neural network-sentence description generates system and method
CN112232053B (en) Text similarity computing system, method and storage medium based on multi-keyword pair matching
CN109753567A (en) A Text Classification Method Combining Title and Body Attention Mechanisms
CN112800190B (en) Joint prediction method of intent recognition and slot value filling based on Bert model
CN112163429B (en) Sentence correlation obtaining method, system and medium combining cyclic network and BERT
CN110888980A (en) Implicit discourse relation identification method based on knowledge-enhanced attention neural network
CN110781290A (en) Extraction method of structured text abstract of long chapter
CN110516530A (en) An image description method based on non-aligned multi-view feature enhancement
CN110909736A (en) An Image Description Method Based on Long Short-Term Memory Model and Object Detection Algorithm
CN115146057B (en) Interactive attention-based image-text fusion emotion recognition method for ecological area of supply chain
CN110874411A (en) A Cross-Domain Sentiment Classification System Based on Fusion of Attention Mechanisms
CN116524593A (en) A dynamic gesture recognition method, system, device and medium
CN114417851B (en) Emotion analysis method based on keyword weighted information
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN117764084A (en) Short text emotion analysis method based on multi-head attention mechanism and multi-model fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200103)