CN110647612A - Visual conversation generation method based on double-visual attention network - Google Patents
- Publication number
- CN110647612A (application CN201910881305.0A)
- Authority
- CN
- China
- Prior art keywords: attention, visual, feature, vector, word
- Prior art date: 2019-09-18
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata automatically derived from the content
Abstract
The invention discloses a visual dialogue generation method based on a dual visual attention network, comprising the following steps: 1. preprocessing of the text input in the visual dialogue and construction of a word list; 2. feature extraction of the dialogue images and of the dialogue text; 3. attention processing of the historical dialogue information based on the current question; 4. independent attention processing of each of the two visual features; 5. cross attention processing between the two visual features; 6. optimization of the visual features; 7. multimodal semantic fusion and decoding to generate the answer feature sequence; 8. parameter optimization of the visual dialogue generation network model based on the dual visual attention network; 9. generation of the predicted answer. The invention provides the agent with more complete and more reasonable visual semantic information, as well as finer-grained textual semantic information, thereby improving the rationality and accuracy of the answers that the agent predicts and generates for a question.
Description
Technical Field

The invention belongs to the technical field of computer vision and involves technologies such as pattern recognition, natural language processing and artificial intelligence; specifically, it is a visual dialogue generation method based on a dual visual attention network.
Background Art

Visual dialogue is a human-computer interaction task whose goal is to let a machine agent and a human hold a reasonable and correct natural conversation, in question-and-answer form, about a given everyday scene image. The key problem in visual dialogue is therefore how to let the agent correctly understand the multimodal semantic information composed of images and text so that it can give reasonable answers to the questions raised by humans. Visual dialogue is currently one of the hot research topics in computer vision, and its application scenarios are very broad, including helping visually impaired people understand social media content or their daily environment, artificial intelligence assistants, and robotics.

With the development of modern image processing technology and deep learning, visual dialogue technology has made great progress, but it still faces the following problems:

First, agents lack finer-grained learning of text features when processing textual information.

For example, in 2017 Jiasen Lu et al., in the paper "Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model", published at the top international conference Conference and Workshop on Neural Information Processing Systems (NIPS 2017), proposed an image attention method based on historical dialogue. The method first applies sentence-level attention to the historical dialogue and then performs attention learning on the image features based on the processed text features. However, when processing the text of the current question it considers only sentence-level semantics and ignores word-level semantics, whereas in a real question usually only a few keywords are most relevant to the predicted answer. The method therefore has certain limitations in practical applications.

Second, existing methods all extract features from the global image, so the visual semantic information is not precise enough.

For example, in 2018 Qi Wu et al. published "Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning" at the top international conference IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018). That paper performs a series of mutual attention operations on global visual features, question features and historical dialogue text features and fuses them into multimodal semantic features. The method effectively learns the semantic relations between different features, but it considers only global visual features, so after attending to the image it often focuses on visual information irrelevant to the question, and this redundant information interferes with the agent's answer prediction.
Summary of the Invention

To overcome the shortcomings of the prior art, the present invention proposes a visual dialogue generation method based on a dual visual attention network, in order to provide the agent with more complete and more reasonable visual semantic information, as well as finer-grained textual semantic information, thereby improving the rationality and accuracy of the answers the agent infers and generates for a question.

The present invention adopts the following technical scheme to solve the technical problem:

The visual dialogue generation method based on a dual visual attention network of the present invention is characterized by the following steps:
Step 1, preprocessing of the text input in the visual dialogue and construction of the word list:

Step 1.1, obtain a visual dialogue dataset containing sentence texts and images;

apply word segmentation to all sentence texts in the visual dialogue dataset to obtain the segmented words;

Step 1.2, select from the segmented words all words whose frequency is greater than a threshold and build the word index table Voc; then apply one-hot encoding to every word in the index table Voc to obtain the one-hot vector table O=[o_1, o_2, ..., o_n, ..., o_N], where o_n denotes the one-hot encoding vector of the n-th word in the index table Voc and N is the number of words in the index table Voc;

Step 1.3, randomly initialize a word embedding matrix W_e ∈ ℝ^{d_w×N}, where d_w is the dimension of the word vectors; use W_e to map the encoding vector of every word in the one-hot vector table to the corresponding word vector, thereby obtaining the word vector table;
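A minimal sketch of steps 1.1-1.3 in PyTorch follows; the whitespace tokenizer, the frequency threshold of 4 (taken from the embodiment below) and the variable names are illustrative assumptions, not the patent's reference implementation:

```python
from collections import Counter
import torch
import torch.nn as nn

def build_vocab(sentences, min_freq=4):
    # Count word frequencies over all whitespace-tokenized sentences.
    counter = Counter(w for s in sentences for w in s.lower().split())
    # Keep words above the frequency threshold; index 0 is a padding symbol.
    words = ["<pad>"] + [w for w, c in counter.items() if c > min_freq]
    voc = {w: i for i, w in enumerate(words)}            # word index table Voc
    return voc

sentences = ["is there a dog in the picture", "yes the dog is on the left"]
voc = build_vocab(sentences, min_freq=0)
N, d_w = len(voc), 300
# Randomly initialized word embedding matrix W_e; multiplying a one-hot
# vector o_n by W_e is equivalent to an nn.Embedding lookup.
embed = nn.Embedding(N, d_w, padding_idx=0)
word_ids = torch.tensor([voc[w] for w in "is there a dog".split()])
word_vectors = embed(word_ids)                           # (4, d_w) word vectors
```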
Step 2, feature extraction of the dialogue images and of the dialogue text;

Step 2.1, take from the visual dialogue dataset any image I together with its historical dialogue U=[u_1, u_2, ..., u_t, ..., u_T], the current question Q=[w_{Q,1}, ..., w_{Q,L_1}] and the ground-truth answer label A_GT, which together form the visual dialogue information D; here T is the total number of dialogue rounds in the historical dialogue U, u_t denotes the t-th round of the dialogue, L_1 denotes the sentence length of the current question Q, and w_{Q,i} denotes the word vector in the word vector table corresponding to the i-th word of the current question Q;

Step 2.2, use a convolutional neural network to extract features of image I in the visual dialogue information D, obtaining the global visual feature V^(0)=[v_1^(0), ..., v_m^(0), ..., v_M^(0)], where v_m^(0) denotes the m-th region feature in V^(0), M denotes the total number of spatial regions in V^(0), and d_g is the channel dimension of V^(0);

Step 2.3, use an object detection feature extractor to extract features of image I in the visual dialogue information D, obtaining the local visual feature R^(0)=[r_1^(0), ..., r_k^(0), ..., r_K^(0)], where r_k^(0) denotes the k-th target object feature in R^(0), K denotes the total number of detected local target objects in R^(0), and d_r is the channel dimension of R^(0);

Step 2.4, use fully connected layers to map the global visual feature and the local visual feature into a space of the same dimension, obtaining the transformed global visual feature V=[v_1, v_2, ..., v_m, ..., v_M] and local visual feature R=[r_1, r_2, ..., r_k, ..., r_K], where v_m denotes the m-th region feature in V, r_k denotes the k-th target object feature in R, and d is the channel dimension after transformation;
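The following sketch illustrates steps 2.2-2.4 under stated assumptions: a torchvision ResNet stands in for the patent's CNN backbone (the embodiment below uses VGG), and the K×d_r matrix of detector features is filled with random numbers because running Faster R-CNN would not fit in a short example:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Global visual features: a spatial grid of CNN activations (step 2.2).
backbone = models.resnet18(weights=None)
cnn = nn.Sequential(*list(backbone.children())[:-2])    # drop pooling + fc
image = torch.randn(1, 3, 224, 224)                     # image I
fmap = cnn(image)                                       # (1, d_g, 7, 7)
d_g = fmap.shape[1]
V0 = fmap.flatten(2).squeeze(0).t()                     # (M, d_g), M = 49 regions

# Local visual features: one d_r-dim vector per detected object (step 2.3).
K, d_r = 36, 2048
R0 = torch.randn(K, d_r)                                # stand-in for Faster R-CNN output

# Step 2.4: project both feature sets into a common d-dimensional space.
d = 512
fc_v, fc_r = nn.Linear(d_g, d), nn.Linear(d_r, d)
V = fc_v(V0)                                            # (M, d) global features
R = fc_r(R0)                                            # (K, d) local features
```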
Step 2.5, use a long short-term memory network LSTM to extract features from the current question Q, obtaining the hidden state feature sequence [h_{Q,1}, ..., h_{Q,L_1}], and take the hidden state feature h_{Q,L_1} output at the last step of the LSTM as the sentence-level question feature vector q of the current question Q, where h_{Q,i} denotes the hidden state feature output at the i-th step of the LSTM;

Step 2.6, use the long short-term memory network LSTM to extract features from the t-th round u_t=[w_{t,1}, ..., w_{t,L_2}] of the historical dialogue U, obtaining the t-th hidden state sequence [h_{t,1}, ..., h_{t,L_2}], and take the hidden state feature h_{t,L_2} output at the last step of the LSTM as the sentence-level feature h_t of the t-th round u_t, so that the total historical dialogue feature is H=[h_1, h_2, ..., h_t, ..., h_T], where w_{t,i} denotes the word vector in the word vector table corresponding to the i-th word of the t-th round u_t, L_2 is the sentence length of the t-th round u_t, and h_{t,i} denotes the hidden state feature output at the i-th step of the LSTM;
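A sketch of the sentence encoders of steps 2.5-2.6, assuming (as is common but not stated explicitly here) that the question and the history rounds share one LSTM; the hidden size d and the batch layout are illustrative:

```python
import torch
import torch.nn as nn

d_w, d = 300, 512
lstm = nn.LSTM(input_size=d_w, hidden_size=d, batch_first=True)

def encode(word_vectors):
    """word_vectors: (1, L, d_w) -> hidden state of the final step, shape (d,)."""
    outputs, (h_n, c_n) = lstm(word_vectors)
    return outputs[0, -1]

L1, L2, T = 16, 25, 10
q = encode(torch.randn(1, L1, d_w))                     # sentence-level question feature q
H = torch.stack([encode(torch.randn(1, L2, d_w))        # one feature h_t per dialogue round
                 for _ in range(T)], dim=1)             # (d, T) history feature matrix H
```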
Step 3, attention processing of the historical dialogue information based on the current question;

Use Eq. (1) to apply attention over the total historical dialogue feature H=[h_1, h_2, ..., h_t, ..., h_T], obtaining the attended history feature vector h_a:

h_a = α_h H^T   (1)

In Eq. (1), α_h ∈ ℝ^{1×T} denotes the attention distribution weights over the historical dialogue feature H, with:

α_h = softmax(P^T z_h)   (2)

In Eq. (2), z_h ∈ ℝ^{k×T} denotes the similarity matrix between the sentence-level question feature vector q and the historical dialogue feature H, and P ∈ ℝ^{k×1} denotes the trainable parameter of the similarity matrix z_h, k being the attention hidden dimension, with:

z_h = tanh(W_q q + W_h H)   (3)

In Eq. (3), W_q ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the sentence-level question feature vector q, and W_h ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the historical dialogue feature H;
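Equations (1)-(3) are the attention template that steps 4 and 5 reuse with additional guidance vectors. Below is a sketch of this guided attention as a small module; the hidden dimension k and the broadcasting of the guide terms over the T columns are assumptions consistent with the dimensions given above:

```python
import torch
import torch.nn as nn

class GuidedAttention(nn.Module):
    """Attend over a set of d-dim features guided by d-dim vectors (Eqs. 1-3)."""
    def __init__(self, d, k, n_guides):
        super().__init__()
        self.W_feat = nn.Linear(d, k, bias=False)                    # W_h term
        self.W_guides = nn.ModuleList(
            [nn.Linear(d, k, bias=False) for _ in range(n_guides)])  # W_q, ... terms
        self.P = nn.Linear(k, 1, bias=False)                         # P^T

    def forward(self, feats, guides):
        # feats: (T, d); guides: list of (d,) vectors broadcast over the T rows.
        z = self.W_feat(feats)                                       # (T, k)
        for W, g in zip(self.W_guides, guides):
            z = z + W(g).unsqueeze(0)                                # add guide term
        z = torch.tanh(z)                                            # Eq. (3)
        alpha = torch.softmax(self.P(z).squeeze(-1), dim=0)          # Eq. (2), (T,)
        return alpha @ feats                                         # Eq. (1), (d,)

d, k, T = 512, 256, 10
attn_h = GuidedAttention(d, k, n_guides=1)
q, H = torch.randn(d), torch.randn(T, d)
h_a = attn_h(H, [q])                                                 # attended history vector
```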
Step 4, independent attention processing of each of the two visual features;

Step 4.1, use Eq. (4) to apply attention over the global visual feature V=[v_1, v_2, ..., v_m, ..., v_M], obtaining the attended global visual feature vector V′:

V′ = α_V1 V^T   (4)

In Eq. (4), α_V1 ∈ ℝ^{1×M} denotes the attention distribution weights over the global visual feature V, with:

α_V1 = softmax(P^T z_V1)   (5)

In Eq. (5), z_V1 ∈ ℝ^{k×M} denotes the similarity matrix between the sentence-level question feature vector q, the attended history feature vector h_a and the global visual feature V, and P ∈ ℝ^{k×1} denotes the trainable parameter of the similarity matrix z_V1, with:

z_V1 = tanh(W_q1 q + W_h1 h_a + W_V1 V)   (6)

In Eq. (6), W_q1 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the sentence-level question feature vector q, W_h1 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the attended history feature vector h_a, and W_V1 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the global visual feature V;

Step 4.2, use Eq. (7) to apply attention over the local visual feature R=[r_1, r_2, ..., r_k, ..., r_K], obtaining the attended local visual feature vector R′:

R′ = α_R1 R^T   (7)

In Eq. (7), α_R1 ∈ ℝ^{1×K} denotes the attention distribution weights over the local visual feature R, with:

α_R1 = softmax(P^T z_R1)   (8)

In Eq. (8), z_R1 ∈ ℝ^{k×K} denotes the similarity matrix between the sentence-level question feature vector q, the attended history feature vector h_a and the local visual feature R, and P ∈ ℝ^{k×1} denotes the trainable parameter of the similarity matrix z_R1, with:

z_R1 = tanh(W′_q1 q + W′_h1 h_a + W_R1 R)   (9)

In Eq. (9), W′_q1 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the sentence-level question feature vector q, W′_h1 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the attended history feature vector h_a, and W_R1 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the local visual feature R;
Step 5, cross attention processing between the two visual features;

Step 5.1, use Eq. (10) to apply dual-visual cross attention over the global visual feature V=[v_1, v_2, ..., v_m, ..., v_M], obtaining the further-attended global visual feature vector V″:

V″ = α_V2 V^T   (10)

In Eq. (10), α_V2 ∈ ℝ^{1×M} denotes the further attention distribution weights over the global visual feature V, with:

α_V2 = softmax(P^T z_V2)   (11)

In Eq. (11), z_V2 ∈ ℝ^{k×M} denotes the similarity matrix between the sentence-level question feature vector q, the attended history feature vector h_a, the attended local visual feature vector R′ and the global visual feature V, and P ∈ ℝ^{k×1} denotes the trainable parameter of the similarity matrix z_V2, with:

z_V2 = tanh(W_q2 q + W_h2 h_a + W_R2 R′ + W_V2 V)   (12)

In Eq. (12), W_q2 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the sentence-level question feature vector q, W_h2 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the attended history feature vector h_a, W_R2 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the attended local visual feature vector R′, and W_V2 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the global visual feature V;

Step 5.2, use Eq. (13) to apply dual-visual cross attention over the local visual feature R=[r_1, r_2, ..., r_k, ..., r_K], obtaining the further-attended local visual feature vector R″:

R″ = α_R2 R^T   (13)

In Eq. (13), α_R2 ∈ ℝ^{1×K} denotes the further attention distribution weights over the local visual feature R, with:

α_R2 = softmax(P^T z_R2)   (14)

In Eq. (14), z_R2 ∈ ℝ^{k×K} denotes the similarity matrix between the sentence-level question feature vector q, the attended history feature vector h_a, the attended global visual feature vector V′ and the local visual feature R, and P ∈ ℝ^{k×1} denotes the trainable parameter of the similarity matrix z_R2, with:

z_R2 = tanh(W′_q2 q + W′_h2 h_a + W′_V2 V′ + W′_R2 R)   (15)

In Eq. (15), W′_q2 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the sentence-level question feature vector q, W′_h2 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the attended history feature vector h_a, W′_V2 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the attended global visual feature vector V′, and W′_R2 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the local visual feature R;
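Steps 4 and 5 are four instances of the GuidedAttention module sketched after Eq. (3), differing only in which vectors guide the attention; the wiring below is a sketch of that reuse (it assumes the GuidedAttention class from that sketch is in scope, and the sizes d, k, M, K are illustrative):

```python
import torch

d, k, M, K = 512, 256, 49, 36
q, h_a = torch.randn(d), torch.randn(d)
V, R = torch.randn(M, d), torch.randn(K, d)

# Step 4: each visual feature attended independently, guided by q and h_a.
attn_v1 = GuidedAttention(d, k, n_guides=2)
attn_r1 = GuidedAttention(d, k, n_guides=2)
V1 = attn_v1(V, [q, h_a])          # V'  (Eqs. 4-6)
R1 = attn_r1(R, [q, h_a])          # R'  (Eqs. 7-9)

# Step 5: cross attention, each stream additionally guided by the other
# stream's attended vector from step 4.
attn_v2 = GuidedAttention(d, k, n_guides=3)
attn_r2 = GuidedAttention(d, k, n_guides=3)
V2 = attn_v2(V, [q, h_a, R1])      # V'' (Eqs. 10-12)
R2 = attn_r2(R, [q, h_a, V1])      # R'' (Eqs. 13-15)
```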
Step 6, optimization of the visual features;

Step 6.1, use Eq. (16) to apply word-level attention to the current question Q, obtaining the attended word-level question feature vector q_s:

q_s = α_q Q^T   (16)

In Eq. (16), α_q ∈ ℝ^{1×L_1} denotes the attention distribution weights over the current question Q, with:

α_q = softmax(P_Q^T z_Q)   (17)

In Eq. (17), z_Q ∈ ℝ^{k×L_1} denotes the self-attention semantic matrix of the current question Q, and P_Q ∈ ℝ^{k×1} denotes the trainable parameter of the self-attention semantic matrix z_Q, with:

z_Q = tanh(W_Q Q)   (18)

In Eq. (18), W_Q ∈ ℝ^{k×d_w} denotes the trainable parameter corresponding to the current question Q in the word-level attention processing;

Step 6.2, use Eqs. (19) and (20) to optimize the further-attended global visual feature vector V″ and local visual feature vector R″ respectively, obtaining the final global visual feature vector v̂ and local visual feature vector r̂:

v̂ = V″ ⊙ (W_s q_s)   (19)

r̂ = R″ ⊙ (W_s q_s)   (20)

In Eqs. (19) and (20), W_s denotes the trainable parameter corresponding to the word-level question feature vector q_s in the visual feature optimization, mapping q_s into ℝ^d, and ⊙ denotes the element-wise product;
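A sketch of step 6 under the reconstruction above: word-level self-attention over the question (Eqs. 16-18) followed by a multiplicative gate on both visual vectors (Eqs. 19-20). The exact published form of Eqs. (19)-(20) is not recoverable from the text, so the gate below is an assumption consistent with the stated trainable parameter W_s and the ⊙ operator:

```python
import torch
import torch.nn as nn

d_w, d, k, L1 = 300, 512, 256, 16
Q = torch.randn(L1, d_w)                       # word vectors of the current question

# Word-level self-attention over the question (Eqs. 16-18).
W_Q = nn.Linear(d_w, k, bias=False)
P_Q = nn.Linear(k, 1, bias=False)
z_Q = torch.tanh(W_Q(Q))                                # (L1, k), Eq. (18)
alpha_q = torch.softmax(P_Q(z_Q).squeeze(-1), dim=0)    # (L1,),   Eq. (17)
q_s = alpha_q @ Q                                       # (d_w,),  Eq. (16)

# Gate both attended visual vectors with the word-level question feature
# (Eqs. 19-20, assumed form); W_s maps q_s into the visual feature space.
W_s = nn.Linear(d_w, d, bias=False)
V2, R2 = torch.randn(d), torch.randn(d)        # V'' and R'' from step 5
v_hat = V2 * W_s(q_s)                          # Eq. (19)
r_hat = R2 * W_s(q_s)                          # Eq. (20)
```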
Step 7, multimodal semantic fusion and decoding to generate the answer feature sequence;

Step 7.1, concatenate the attended word-level question feature vector q_s, the attended history feature vector h_a, the optimized global visual feature vector v̂ and the optimized local visual feature vector r̂ to obtain the multimodal feature vector e_M ∈ ℝ^{d_M}, where d_M = 3d + d_w is the dimension of the multimodal feature vector; then map e_M through a fully connected layer to obtain the fused semantic feature vector e;

Step 7.2, input the fused semantic feature vector e into the long short-term memory network LSTM to obtain the hidden state feature sequence [h_{A,1}, ..., h_{A,L_3}] of the predicted answer, where h_{A,i} is the output of the LSTM at the i-th step and L_3 is the sentence length of the ground-truth answer label A_GT;

Step 7.3, use a fully connected layer to map the hidden state feature sequence of the predicted answer into a space of the same dimension as the one-hot vector table O, obtaining the word vector set Y=[y_1, ..., y_{L_3}] of the predicted answer, where y_i denotes the mapping vector of the i-th word of the predicted answer, the vector length being equal to the number of words;
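A sketch of the fusion-and-decoding stage of step 7; feeding the fused vector e as the decoder input at every step is an assumption (the patent states only that e is input into the decoder LSTM), and all sizes are illustrative:

```python
import torch
import torch.nn as nn

d, d_w, N, L3 = 512, 300, 10000, 9
q_s, h_a = torch.randn(d_w), torch.randn(d)
v_hat, r_hat = torch.randn(d), torch.randn(d)

# Step 7.1: concatenate the four semantic vectors and fuse them.
e_M = torch.cat([q_s, h_a, v_hat, r_hat])      # (3d + d_w,) multimodal vector
fuse = nn.Linear(3 * d + d_w, d)
e = fuse(e_M)                                  # fused semantic feature vector e

# Step 7.2: the decoder LSTM produces one hidden state per answer word.
decoder = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)
steps = e.expand(1, L3, d)                     # same fused vector at each step
h_A, _ = decoder(steps)                        # (1, L3, d) hidden state sequence

# Step 7.3: project every hidden state onto the vocabulary dimension N.
to_vocab = nn.Linear(d, N)
Y = to_vocab(h_A).squeeze(0)                   # (L3, N) predicted word vectors
```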
Step 8, parameter optimization of the visual dialogue generation network model based on the dual visual attention network;

Step 8.1, according to the one-hot word vector table O, construct the vector set Y^gt=[y_1^gt, ..., y_{L_3}^gt] for the words of the ground-truth answer label A_GT, where y_i^gt denotes the mapping vector of the i-th word of A_GT, the vector length being equal to the number of words;

Step 8.2, use Eq. (21) to compute the loss cost E between the predicted answer and the ground-truth answer A_GT:

E = −Σ_{i=1}^{L_3} (y_i^gt)^T log(softmax(y_i))   (21)

Step 8.3, use stochastic gradient descent to optimize and solve the loss cost E so that E reaches its minimum, thereby obtaining the visual dialogue network model based on the dual visual attention network with optimal parameters;
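A sketch of the loss of Eq. (21) and one optimization step; Eq. (21) as reconstructed is the cross-entropy between the softmax of each predicted word vector y_i and its one-hot label, which nn.CrossEntropyLoss computes directly from logits, so the sketch uses it (the reduction='sum' choice matches the summation over the L_3 answer words, and optimizing Y directly rather than the full model is for illustration only):

```python
import torch
import torch.nn as nn

N, L3 = 10000, 9
Y = torch.randn(L3, N, requires_grad=True)     # predicted word vectors (logits)
gt_ids = torch.randint(0, N, (L3,))            # indices of the one-hot labels y_i^gt

criterion = nn.CrossEntropyLoss(reduction="sum")
E = criterion(Y, gt_ids)                       # loss cost E, Eq. (21)

optimizer = torch.optim.SGD([Y], lr=0.01)      # in practice: all model parameters
optimizer.zero_grad()
E.backward()
optimizer.step()                               # one stochastic gradient descent step
```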
Step 9, generation of the predicted answer;

apply a greedy decoding algorithm to the word vector set Y of the predicted answer to obtain, for the i-th word, the position of the maximum value in the mapping vector y_i; according to that position, look up the word at the corresponding position in the word index table Voc as the final predicted word of y_i, thereby obtaining the predicted answer corresponding to the word vector set Y; the current question Q together with the predicted answer corresponding to Y constitutes the finally generated visual dialogue.
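A sketch of the greedy decoding of step 9 over the (L_3, N) prediction matrix; the inverted vocabulary and the stop-at-padding rule are illustrative assumptions:

```python
import torch

def greedy_decode(Y, voc):
    """Y: (L3, N) predicted word vectors; voc: word-to-index table Voc."""
    id_to_word = {i: w for w, i in voc.items()}        # invert the index table
    word_ids = Y.argmax(dim=1).tolist()                # position of each row's maximum
    words = []
    for i in word_ids:
        if id_to_word[i] == "<pad>":                   # assumed stop symbol
            break
        words.append(id_to_word[i])
    return " ".join(words)

voc = {"<pad>": 0, "yes": 1, "no": 2, "a": 3, "dog": 4}
Y = torch.randn(9, len(voc))
print(greedy_decode(Y, voc))                           # predicted answer string
```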
Compared with the prior art, the beneficial effects of the present invention are:

1. Compared with previously studied visual dialogue techniques, the present invention extracts not only the visual features of the global image but also the visual features of local image objects. The global visual features carry more comprehensive visual semantic information, while the local visual features carry more specific visual semantics. The characteristics of the two kinds of visual features are thus fully exploited, and a two-stage attention process learns the internal relations within, and the mutual relations between, the two kinds of visual features, so that they complement each other semantically and the agent obtains more complete and more accurate visual semantic information.

2. The present invention processes text features at both the sentence level and the word level. It first extracts sentence-level features of the question and the historical dialogue and applies attention to the historical dialogue features; next, it learns the relations between the two kinds of visual features based on the obtained sentence-level text features; finally, it applies word-level attention to the question features to capture the keyword features in the question that help infer the answer. This finer-grained text processing enables the invention to generate more accurate and reasonable answers in a visual dialogue.

3. The present invention proposes a multimodal semantic fusion structure. The structure first uses the word-level question text features to optimize the two kinds of visual features separately, further highlighting the visual information related to the question keywords in the visual features. It then concatenates the question features, the historical dialogue features, the global visual features and the local visual features for learning and fusion; through the multimodal semantic fusion network, the visual and text features influence one another and help optimize the network parameters. Once the fusion network captures both visual and textual semantics, the agent's answer prediction improves markedly and the predicted results become more precise.
Brief Description of the Drawings

Fig. 1 is a schematic diagram of the network model of the present invention;

Fig. 2 is a schematic diagram of the dual visual attention processing of the present invention;

Fig. 3 is a schematic diagram of the training of the network model of the present invention.
Detailed Description of the Embodiments

In this embodiment, as shown in Fig. 1, a visual dialogue generation method based on a dual visual attention network proceeds as follows:

Step 1, preprocessing of the text input in the visual dialogue and construction of the word list:

Step 1.1, obtain a visual dialogue dataset from the Internet; the main publicly available dataset is currently the VisDial Dataset, collected by researchers at the Georgia Institute of Technology; the visual dialogue dataset contains sentence texts and images;

apply word segmentation to all sentence texts in the visual dialogue dataset to obtain the segmented words;

Step 1.2, select from the segmented words all words whose frequency is greater than a threshold, where the threshold may be set to 4, and build the word index table Voc. The method for creating the word index table Voc: the word list may contain words and punctuation marks; count the number of words and sort them, adding a blank (padding) symbol to suit the optimized training procedure; build the word-to-index correspondence table for all words in order; then apply one-hot encoding to every word in the index table Voc to obtain the one-hot vector table O=[o_1, o_2, ..., o_n, ..., o_N], where o_n denotes the one-hot encoding vector of the n-th word in the index table Voc and N is the number of words in the index table Voc;

Step 1.3, randomly initialize a word embedding matrix W_e ∈ ℝ^{d_w×N}, where d_w is the dimension of the word vectors; use W_e to map the encoding vector o_n of the n-th word in the one-hot vector table to the n-th word vector w_n, thereby obtaining the word vector table;
Step 2, feature extraction of the dialogue images and of the dialogue text;

Step 2.1, take from the visual dialogue dataset any image I together with its historical dialogue U=[u_1, u_2, ..., u_t, ..., u_T], the current question Q=[w_{Q,1}, ..., w_{Q,L_1}] and the ground-truth answer label A_GT, which together form the visual dialogue information D; here T is the total number of dialogue rounds in the historical dialogue U, u_t denotes the t-th round of the dialogue, and L_1 denotes the sentence length of the current question Q. The size of L_1 may be set to 16, and sentences shorter than 16 words are padded with zero vectors up to length L_1; w_{Q,i} denotes the word vector of the i-th word in the sentence;
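A small sketch of the fixed-length zero-padding described here; the pad length of 16 comes from the text, while the tensor layout is an assumption:

```python
import torch

def pad_to_length(word_vectors, target_len=16):
    """word_vectors: (L, d_w) with L <= target_len -> (target_len, d_w)."""
    L, d_w = word_vectors.shape
    if L < target_len:
        zeros = torch.zeros(target_len - L, d_w)   # zero-vector padding
        word_vectors = torch.cat([word_vectors, zeros], dim=0)
    return word_vectors

q = pad_to_length(torch.randn(7, 300), target_len=16)   # (16, 300)
```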
Step 2.2, use a convolutional neural network to extract features of image I in the visual dialogue information D, obtaining the global visual feature V^(0)=[v_1^(0), ..., v_m^(0), ..., v_M^(0)], where v_m^(0) denotes the m-th region feature in V^(0), M denotes the total number of spatial regions in V^(0), and d_g is the channel dimension of V^(0). In this embodiment, a pretrained VGG convolutional neural network may be used to extract the global visual features of image I. VGG is a two-dimensional convolutional neural network that has been shown to have strong visual representation power, so we use a VGG pretrained on the COCO 2014 dataset as the global visual feature extractor in the experiments, and this part of the network does not take part in the parameter update of the subsequent step 8;

Step 2.3, use an object detection feature extractor to extract features of image I in the visual dialogue information D, obtaining the local visual feature R^(0)=[r_1^(0), ..., r_k^(0), ..., r_K^(0)], where r_k^(0) denotes the k-th target object feature in R^(0), K denotes the total number of detected local target objects in R^(0), and d_r is the channel dimension of R^(0). In this embodiment, a pretrained Faster R-CNN object detection feature extractor may be used to extract the local visual features of image I. The local visual features extracted by Faster R-CNN have achieved excellent results on many visual tasks, so we use a Faster R-CNN pretrained on the Visual Genome dataset as the local visual feature extractor in the experiments, and this part of the network does not take part in the parameter update of the subsequent step 8;
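The statement that the pretrained extractors do not take part in the parameter update corresponds to freezing their weights; below is a minimal sketch of how that is commonly done in PyTorch, with the torchvision VGG-16 standing in for the embodiment's backbone:

```python
import torch
import torchvision.models as models

vgg = models.vgg16(weights=None)       # pretrained weights would be loaded here
for p in vgg.parameters():
    p.requires_grad = False            # excluded from the step-8 parameter update
vgg.eval()                             # inference mode for the frozen extractor

with torch.no_grad():                  # no gradients tracked through the backbone
    fmap = vgg.features(torch.randn(1, 3, 224, 224))   # (1, 512, 7, 7)
```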
Step 2.4, use fully connected layers to map the global visual feature and the local visual feature into a space of the same dimension, obtaining the transformed global visual feature V=[v_1, v_2, ..., v_m, ..., v_M] and local visual feature R=[r_1, r_2, ..., r_k, ..., r_K], where v_m denotes the m-th region feature in V, r_k denotes the k-th target object feature in R, and d is the channel dimension after transformation;

Step 2.5, use a long short-term memory network LSTM to extract features from the current question Q, obtaining the hidden state feature sequence [h_{Q,1}, ..., h_{Q,L_1}], and take the hidden state feature h_{Q,L_1} output at the last step of the LSTM as the sentence-level question feature vector q of the current question Q, where h_{Q,i} denotes the hidden state feature output at the i-th step of the LSTM;

Step 2.6, use the long short-term memory network LSTM to extract features from each round u_t=[w_{u,1}, ..., w_{u,L_2}] of the historical dialogue U, obtaining the hidden state sequence [h_{u,1}, ..., h_{u,L_2}]; take the hidden state feature h_{u,L_2} output at the last step of the LSTM as the sentence-level feature h_t of the round u_t, so that the total historical dialogue feature is H=[h_1, h_2, ..., h_t, ..., h_T], where w_{u,i} denotes the word vector of the i-th word in the round u_t and L_2 is the sentence length of u_t; the size of L_2 may be set to 25, and sentences shorter than 25 words are padded with zero vectors up to length L_2; h_{u,i} denotes the hidden state feature output at the i-th step of the LSTM;
Step 3, attention processing of the historical dialogue information based on the current question;

Use Eq. (1) to apply attention over the total historical dialogue feature H=[h_1, h_2, ..., h_t, ..., h_T], obtaining the attended history feature vector h_a:

h_a = α_h H^T   (1)

In Eq. (1), α_h ∈ ℝ^{1×T} denotes the attention distribution weights over the historical dialogue feature H, with:

α_h = softmax(P^T z_h)   (2)

In Eq. (2), z_h ∈ ℝ^{k×T} denotes the similarity matrix between the sentence-level question feature vector q and the historical dialogue feature H, and P ∈ ℝ^{k×1} denotes the trainable parameter of the similarity matrix z_h, with:

z_h = tanh(W_q q + W_h H)   (3)

In Eq. (3), W_q ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the sentence-level question feature vector q, and W_h ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the historical dialogue feature H;
Step 4, as shown in Fig. 2, apply independent attention processing to each of the two visual features;

Step 4.1, use Eq. (4) to apply attention over the global visual feature V=[v_1, v_2, ..., v_m, ..., v_M], obtaining the attended global visual feature vector V′:

V′ = α_V1 V^T   (4)

In Eq. (4), α_V1 ∈ ℝ^{1×M} denotes the attention distribution weights over the global visual feature V, with:

α_V1 = softmax(P^T z_V1)   (5)

In Eq. (5), z_V1 ∈ ℝ^{k×M} denotes the similarity matrix between the sentence-level question feature vector q, the attended history feature vector h_a and the global visual feature V, and P ∈ ℝ^{k×1} denotes the trainable parameter of the similarity matrix z_V1, with:

z_V1 = tanh(W_q1 q + W_h1 h_a + W_V1 V)   (6)

In Eq. (6), W_q1 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the sentence-level question feature vector q, W_h1 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the attended history feature vector h_a, and W_V1 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the global visual feature V;

Step 4.2, use Eq. (7) to apply attention over the local visual feature R=[r_1, r_2, ..., r_k, ..., r_K], obtaining the attended local visual feature vector R′:

R′ = α_R1 R^T   (7)

In Eq. (7), α_R1 ∈ ℝ^{1×K} denotes the attention distribution weights over the local visual feature R, with:

α_R1 = softmax(P^T z_R1)   (8)

In Eq. (8), z_R1 ∈ ℝ^{k×K} denotes the similarity matrix between the sentence-level question feature vector q, the attended history feature vector h_a and the local visual feature R, and P ∈ ℝ^{k×1} denotes the trainable parameter of the similarity matrix z_R1, with:

z_R1 = tanh(W′_q1 q + W′_h1 h_a + W_R1 R)   (9)

In Eq. (9), W′_q1 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the sentence-level question feature vector q, W′_h1 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the attended history feature vector h_a, and W_R1 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the local visual feature R;
Step 5, as shown in Fig. 2, apply cross attention processing between the two visual features;

Step 5.1, use Eq. (10) to apply dual-visual cross attention over the global visual feature V=[v_1, v_2, ..., v_m, ..., v_M], obtaining the further-attended global visual feature vector V″:

V″ = α_V2 V^T   (10)

In Eq. (10), α_V2 ∈ ℝ^{1×M} denotes the further attention distribution weights over the global visual feature V, with:

α_V2 = softmax(P^T z_V2)   (11)

In Eq. (11), z_V2 ∈ ℝ^{k×M} denotes the similarity matrix between the sentence-level question feature vector q, the attended history feature vector h_a, the attended local visual feature vector R′ and the global visual feature V, and P ∈ ℝ^{k×1} denotes the trainable parameter of the similarity matrix z_V2, with:

z_V2 = tanh(W_q2 q + W_h2 h_a + W_R2 R′ + W_V2 V)   (12)

In Eq. (12), W_q2 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the sentence-level question feature vector q, W_h2 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the attended history feature vector h_a, W_R2 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the attended local visual feature vector R′, and W_V2 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the global visual feature V;

Step 5.2, use Eq. (13) to apply dual-visual cross attention over the local visual feature R=[r_1, r_2, ..., r_k, ..., r_K], obtaining the further-attended local visual feature vector R″:

R″ = α_R2 R^T   (13)

In Eq. (13), α_R2 ∈ ℝ^{1×K} denotes the further attention distribution weights over the local visual feature R, with:

α_R2 = softmax(P^T z_R2)   (14)

In Eq. (14), z_R2 ∈ ℝ^{k×K} denotes the similarity matrix between the sentence-level question feature vector q, the attended history feature vector h_a, the attended global visual feature vector V′ and the local visual feature R, and P ∈ ℝ^{k×1} denotes the trainable parameter of the similarity matrix z_R2, with:

z_R2 = tanh(W′_q2 q + W′_h2 h_a + W′_V2 V′ + W′_R2 R)   (15)

In Eq. (15), W′_q2 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the sentence-level question feature vector q, W′_h2 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the attended history feature vector h_a, W′_V2 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the attended global visual feature vector V′, and W′_R2 ∈ ℝ^{k×d} denotes the trainable parameter corresponding to the local visual feature R;
Step 6, optimization of the visual features;

Step 6.1, use Eq. (16) to apply word-level attention to the current question Q, obtaining the attended word-level question feature vector q_s:

q_s = α_q Q^T   (16)

In Eq. (16), α_q ∈ ℝ^{1×L_1} denotes the attention distribution weights over the current question Q, with:

α_q = softmax(P_Q^T z_Q)   (17)

In Eq. (17), z_Q ∈ ℝ^{k×L_1} denotes the self-attention semantic matrix of the current question Q, and P_Q ∈ ℝ^{k×1} denotes the trainable parameter of the self-attention semantic matrix z_Q, with:

z_Q = tanh(W_Q Q)   (18)

In Eq. (18), W_Q ∈ ℝ^{k×d_w} denotes the trainable parameter corresponding to the current question Q in the word-level attention processing;

Step 6.2, use Eqs. (19) and (20) to optimize the further-attended global visual feature vector V″ and local visual feature vector R″ respectively, obtaining the final global visual feature vector v̂ and local visual feature vector r̂:

v̂ = V″ ⊙ (W_s q_s)   (19)

r̂ = R″ ⊙ (W_s q_s)   (20)

In Eqs. (19) and (20), W_s denotes the trainable parameter corresponding to the word-level question feature vector q_s in the visual feature optimization, mapping q_s into ℝ^d, and ⊙ denotes the element-wise product;
Step 7, multimodal semantic fusion and decoding to generate the answer feature sequence;

Step 7.1, concatenate the attended word-level question feature vector q_s, the attended history feature vector h_a, the optimized global visual feature vector v̂ and the optimized local visual feature vector r̂ to obtain the multimodal feature vector e_M ∈ ℝ^{d_M}, where d_M = 3d + d_w is the dimension of the multimodal feature vector; then map e_M through one fully connected layer to obtain the fused semantic feature vector e;

Step 7.2, input the fused semantic feature vector e into the long short-term memory network LSTM to obtain the hidden state feature sequence [h_{A,1}, ..., h_{A,L_3}] of the predicted answer, where h_{A,i} is the output of the LSTM at the i-th time step and L_3 is the sentence length of the ground-truth answer label A_GT; the size of L_3 may be set to 9;

Step 7.3, use a fully connected layer to map the hidden state feature sequence of the predicted answer into a space of the same dimension as the one-hot vector table O, obtaining the word vector set Y=[y_1, ..., y_{L_3}] of the predicted answer, where y_i denotes the mapping vector of the i-th word of the predicted answer, the vector length being equal to the number of words;
Step 8, as shown in Fig. 3, optimize the parameters of the visual dialogue generation network model based on the dual visual attention network;

Step 8.1, according to the one-hot word vector table O, construct the vector set Y^gt=[y_1^gt, ..., y_{L_3}^gt] for the words of the ground-truth answer label A_GT, where y_i^gt denotes the mapping vector of the i-th word of A_GT, the vector length being equal to the number of words;

Step 8.2, use Eq. (21) to compute the loss cost E between the predicted answer and the ground-truth answer A_GT:

E = −Σ_{i=1}^{L_3} (y_i^gt)^T log(softmax(y_i))   (21)

Step 8.3, use stochastic gradient descent to optimize and solve the loss cost E so that E reaches its minimum, thereby obtaining the visual dialogue network model based on the dual visual attention network with optimal parameters;
Step 9, generation of the predicted answer;

apply a greedy decoding algorithm to the word vector set Y of the predicted answer to obtain, for the i-th word, the position of the maximum value in the mapping vector y_i; according to that position, look up the word at the corresponding position in the word index table Voc as the final predicted word of y_i, thereby obtaining the predicted answer corresponding to the word vector set Y; the current question Q together with the predicted answer corresponding to Y constitutes the finally generated visual dialogue.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910881305.0A CN110647612A (en) | 2019-09-18 | 2019-09-18 | Visual conversation generation method based on double-visual attention network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910881305.0A CN110647612A (en) | 2019-09-18 | 2019-09-18 | Visual conversation generation method based on double-visual attention network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110647612A true CN110647612A (en) | 2020-01-03 |
Family
ID=68992004
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910881305.0A Pending CN110647612A (en) | 2019-09-18 | 2019-09-18 | Visual conversation generation method based on double-visual attention network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110647612A (en) |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111783475A (en) * | 2020-07-28 | 2020-10-16 | 北京深睿博联科技有限责任公司 | A Semantic Visual Localization Method and Device Based on Phrase Relation Propagation |
CN111897939A (en) * | 2020-08-12 | 2020-11-06 | 腾讯科技(深圳)有限公司 | Visual dialogue method, training device and training equipment of visual dialogue model |
CN111967487A (en) * | 2020-03-23 | 2020-11-20 | 同济大学 | Incremental data enhancement method for visual question-answer model training and application |
CN111967272A (en) * | 2020-06-23 | 2020-11-20 | 合肥工业大学 | Visual dialog generation system based on semantic alignment |
CN113010712A (en) * | 2021-03-04 | 2021-06-22 | 天津大学 | Visual question answering method based on multi-graph fusion |
CN113177112A (en) * | 2021-04-25 | 2021-07-27 | 天津大学 | KR product fusion multi-mode information-based neural network visual dialogue model and method |
CN113220859A (en) * | 2021-06-01 | 2021-08-06 | 平安科技(深圳)有限公司 | Image-based question and answer method and device, computer equipment and storage medium |
CN113420606A (en) * | 2021-05-31 | 2021-09-21 | 华南理工大学 | Method for realizing autonomous navigation of robot based on natural language and machine vision |
CN113435399A (en) * | 2021-07-14 | 2021-09-24 | 电子科技大学 | Multi-round visual dialogue method based on multi-level sequencing learning |
CN113515951A (en) * | 2021-07-19 | 2021-10-19 | 同济大学 | A story description generation method based on knowledge-augmented attention network and group-level semantics |
CN113553418A (en) * | 2021-07-27 | 2021-10-26 | 天津大学 | Visual dialog generation method and device based on multi-modal learning |
CN113761153A (en) * | 2021-05-19 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Question and answer processing method and device based on picture, readable medium and electronic equipment |
CN113868451A (en) * | 2021-09-02 | 2021-12-31 | 天津大学 | Cross-modal social network conversation method and device based on context cascade perception |
CN113989300A (en) * | 2021-10-29 | 2022-01-28 | 北京百度网讯科技有限公司 | Method, device, electronic device and storage medium for lane line segmentation |
CN114299510A (en) * | 2022-03-08 | 2022-04-08 | 山东山大鸥玛软件股份有限公司 | Handwritten English line recognition system |
CN114556443A (en) * | 2020-01-15 | 2022-05-27 | 北京京东尚科信息技术有限公司 | Multimedia data semantic analysis system and method using attention-based converged network |
CN114661874A (en) * | 2022-03-07 | 2022-06-24 | 浙江理工大学 | A visual question answering method based on multi-angle semantic understanding and adaptive dual-channel |
CN115098623A (en) * | 2022-06-06 | 2022-09-23 | 中国船舶集团有限公司系统工程研究院 | Physical training data feature extraction method based on BERT |
CN115277248A (en) * | 2022-09-19 | 2022-11-01 | 南京聚铭网络科技有限公司 | Network security alarm merging method, device and storage medium |
CN115422388A (en) * | 2022-09-13 | 2022-12-02 | 四川省人工智能研究院(宜宾) | Visual conversation method and system |
CN116342332A (en) * | 2023-05-31 | 2023-06-27 | 合肥工业大学 | Auxiliary judging method, device, equipment and storage medium based on Internet |
US12223284B2 (en) | 2022-09-13 | 2025-02-11 | Sichuan Institute of Artificial Intelligence, Yibin, Sichuan, China | Visual dialogue method and system |
Application Events

2019-09-18: CN application CN201910881305.0A filed; published as CN110647612A (legal status: pending)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8160883B2 (en) * | 2004-01-10 | 2012-04-17 | Microsoft Corporation | Focus tracking in dialogs |
CN104077419A (en) * | 2014-07-18 | 2014-10-01 | 合肥工业大学 | Re-ranking algorithm for long-query image search based on semantic and visual information |
US20170024645A1 (en) * | 2015-06-01 | 2017-01-26 | Salesforce.Com, Inc. | Dynamic Memory Network |
CN108877801A (en) * | 2018-06-14 | 2018-11-23 | 南京云思创智信息科技有限公司 | Multi-turn dialogue semantic understanding subsystem based on multi-modal emotion recognition system |
CN110209789A (en) * | 2019-05-29 | 2019-09-06 | 山东大学 | Multi-modal dialogue system and method for guiding user attention |
Non-Patent Citations (1)
Title |
---|
DAN GUO et al.: "Dual Visual Attention Network for Visual Dialog", Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence * |
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114556443A (en) * | 2020-01-15 | 2022-05-27 | 北京京东尚科信息技术有限公司 | Multimedia data semantic analysis system and method using attention-based fusion network |
CN114556443B (en) * | 2020-01-15 | 2025-01-07 | 北京京东尚科信息技术有限公司 | Multimedia data semantic analysis system and method using attention-based fusion network |
CN111967487A (en) * | 2020-03-23 | 2020-11-20 | 同济大学 | Incremental data augmentation method for visual question answering model training, and its application |
CN111967487B (en) * | 2020-03-23 | 2022-09-20 | 同济大学 | Incremental data augmentation method for visual question answering model training, and its application |
CN111967272A (en) * | 2020-06-23 | 2020-11-20 | 合肥工业大学 | Visual dialog generation system based on semantic alignment |
CN111967272B (en) * | 2020-06-23 | 2023-10-31 | 合肥工业大学 | Visual dialogue generation system based on semantic alignment |
CN111783475A (en) * | 2020-07-28 | 2020-10-16 | 北京深睿博联科技有限责任公司 | A Semantic Visual Localization Method and Device Based on Phrase Relation Propagation |
CN111897939B (en) * | 2020-08-12 | 2024-02-02 | 腾讯科技(深圳)有限公司 | Visual dialogue method, training method, device and equipment for visual dialogue model |
CN111897939A (en) * | 2020-08-12 | 2020-11-06 | 腾讯科技(深圳)有限公司 | Visual dialogue method, and training method, device and equipment for visual dialogue model |
CN113010712A (en) * | 2021-03-04 | 2021-06-22 | 天津大学 | Visual question answering method based on multi-graph fusion |
CN113010712B (en) * | 2021-03-04 | 2022-12-02 | 天津大学 | Visual question answering method based on multi-graph fusion |
CN113177112A (en) * | 2021-04-25 | 2021-07-27 | 天津大学 | Neural network visual dialogue model and method based on KR-product fusion of multi-modal information |
CN113761153A (en) * | 2021-05-19 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Picture-based question-answering processing method and device, readable medium and electronic equipment |
CN113761153B (en) * | 2021-05-19 | 2023-10-24 | 腾讯科技(深圳)有限公司 | Picture-based question-answering processing method and device, readable medium and electronic equipment |
CN113420606A (en) * | 2021-05-31 | 2021-09-21 | 华南理工大学 | Method for realizing autonomous navigation of robot based on natural language and machine vision |
CN113420606B (en) * | 2021-05-31 | 2022-06-14 | 华南理工大学 | Method for realizing autonomous navigation of robot based on natural language and machine vision |
CN113220859B (en) * | 2021-06-01 | 2024-05-10 | 平安科技(深圳)有限公司 | Question answering method and device based on image, computer equipment and storage medium |
CN113220859A (en) * | 2021-06-01 | 2021-08-06 | 平安科技(深圳)有限公司 | Image-based question and answer method and device, computer equipment and storage medium |
CN113435399A (en) * | 2021-07-14 | 2021-09-24 | 电子科技大学 | Multi-round visual dialogue method based on multi-level ranking learning |
CN113435399B (en) * | 2021-07-14 | 2022-04-15 | 电子科技大学 | Multi-round visual dialogue method based on multi-level ranking learning |
CN113515951A (en) * | 2021-07-19 | 2021-10-19 | 同济大学 | A story description generation method based on knowledge-augmented attention network and group-level semantics |
CN113515951B (en) * | 2021-07-19 | 2022-07-05 | 同济大学 | A story description generation method based on knowledge-augmented attention network and group-level semantics |
CN113553418A (en) * | 2021-07-27 | 2021-10-26 | 天津大学 | Visual dialog generation method and device based on multi-modal learning |
CN113868451A (en) * | 2021-09-02 | 2021-12-31 | 天津大学 | Cross-modal social network conversation method and device based on context cascade perception |
CN113868451B (en) * | 2021-09-02 | 2024-06-11 | 天津大学 | Cross-modal social network conversation method and device based on context cascade perception |
CN113989300A (en) * | 2021-10-29 | 2022-01-28 | 北京百度网讯科技有限公司 | Method, device, electronic device and storage medium for lane line segmentation |
CN114661874A (en) * | 2022-03-07 | 2022-06-24 | 浙江理工大学 | A visual question answering method based on multi-angle semantic understanding and adaptive dual-channel |
CN114661874B (en) * | 2022-03-07 | 2024-04-30 | 浙江理工大学 | Visual question answering method based on multi-angle semantic understanding and adaptive dual channels |
CN114299510A (en) * | 2022-03-08 | 2022-04-08 | 山东山大鸥玛软件股份有限公司 | Handwritten English line recognition system |
CN115098623B (en) * | 2022-06-06 | 2024-12-10 | 中国船舶集团有限公司系统工程研究院 | A physical training data feature extraction method based on BERT |
CN115098623A (en) * | 2022-06-06 | 2022-09-23 | 中国船舶集团有限公司系统工程研究院 | Physical training data feature extraction method based on BERT |
CN115422388B (en) * | 2022-09-13 | 2024-07-26 | 四川省人工智能研究院(宜宾) | Visual dialogue method and system |
CN115422388A (en) * | 2022-09-13 | 2022-12-02 | 四川省人工智能研究院(宜宾) | Visual conversation method and system |
US12223284B2 (en) | 2022-09-13 | 2025-02-11 | Sichuan Institute of Artificial Intelligence, Yibin, Sichuan, China | Visual dialogue method and system |
CN115277248A (en) * | 2022-09-19 | 2022-11-01 | 南京聚铭网络科技有限公司 | Network security alarm merging method, device and storage medium |
CN116342332B (en) * | 2023-05-31 | 2023-08-01 | 合肥工业大学 | Auxiliary judging method, device, equipment and storage medium based on Internet |
CN116342332A (en) * | 2023-05-31 | 2023-06-27 | 合肥工业大学 | Auxiliary judging method, device, equipment and storage medium based on Internet |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110647612A (en) | Visual conversation generation method based on double-visual attention network | |
CN110609891B (en) | Visual dialog generation method based on context awareness graph neural network | |
US11631007B2 (en) | Method and device for text-enhanced knowledge graph joint representation learning | |
CN110298037B (en) | Text Recognition Approach Based on Convolutional Neural Network Matching with Enhanced Attention Mechanism | |
CN110866117B (en) | Short text classification method based on semantic enhancement and multi-level label embedding | |
CN110502749B (en) | Text relation extraction method based on double-layer attention mechanism and bidirectional GRU | |
CN108733792B (en) | An Entity Relationship Extraction Method | |
CN110765775B (en) | A Domain Adaptation Method for Named Entity Recognition Fusing Semantics and Label Differences | |
CN109992780B (en) | Specific target emotion classification method based on deep neural network | |
CN110569508A (en) | Emotional orientation classification method and system integrating part-of-speech and self-attention mechanism | |
CN108416065A (en) | Image-sentence description generation system and method based on hierarchical neural network | |
CN112232053B (en) | Text similarity computing system, method and storage medium based on multi-keyword pair matching | |
CN109753567A (en) | A Text Classification Method Combining Title and Body Attention Mechanisms | |
CN112800190B (en) | Joint prediction method of intent recognition and slot value filling based on Bert model | |
CN112163429B (en) | Sentence relevance computation method, system and medium combining recurrent network and BERT | |
CN110888980A (en) | Implicit discourse relation identification method based on knowledge-enhanced attention neural network | |
CN110781290A (en) | Method for extracting structured text summaries from long documents | |
CN110516530A (en) | An image description method based on non-aligned multi-view feature enhancement | |
CN110909736A (en) | An Image Description Method Based on Long Short-Term Memory Model and Object Detection Algorithm | |
CN115146057B (en) | Interactive attention-based image-text fusion emotion recognition method for ecological area of supply chain | |
CN110874411A (en) | A Cross-Domain Sentiment Classification System Based on Fusion of Attention Mechanisms | |
CN116524593A (en) | A dynamic gesture recognition method, system, device and medium | |
CN114417851B (en) | Emotion analysis method based on keyword weighted information | |
CN111145914B (en) | Method and device for determining text entities in a lung cancer clinical disease-type database | |
CN117764084A (en) | Short text emotion analysis method based on multi-head attention mechanism and multi-model fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20200103 |