CN110647612A

CN110647612A - Visual conversation generation method based on double-visual attention network

Info

Publication number: CN110647612A
Application number: CN201910881305.0A
Authority: CN
Inventors: 郭丹; 王辉; 汪萌
Original assignee: Hefei Polytechnic University
Current assignee: Hefei Polytechnic University
Priority date: 2019-09-18
Filing date: 2019-09-18
Publication date: 2020-01-03

Abstract

The invention discloses a method for generating a visual dialogue based on a dual visual attention network, comprising the following steps: 1. Preprocessing of text input in the visual dialogue and construction of a word list; 2. Feature extraction of dialogue images and features of dialogue texts Extraction; 3. Attention processing of historical dialogue information based on current problem information; 4. Independent attention processing of dual visual features; 5. Attention processing of overlapping dual visual features; 6. Optimization processing of visual features; 7 , Multimodal semantic fusion and decoding to generate answer feature sequence; 8. Parameter optimization of visual dialogue generation network model based on dual visual attention network; 9. Predicted answer generation. The invention can provide more complete and more reasonable visual semantic information and finer-grained text semantic information for the intelligent body, thereby improving the rationality and accuracy of the answers predicted and generated by the intelligent body.

Description

A visual dialogue generation method based on dual visual attention network

技术领域technical field

本发明属于计算机视觉技术领域，涉及到模式识别、自然语言处理、人工智能等技术，具体地说是一种基于双视觉注意力网络的视觉对话生成方法。The invention belongs to the technical field of computer vision, and relates to technologies such as pattern recognition, natural language processing, artificial intelligence, etc., in particular to a visual dialogue generation method based on a dual visual attention network.

背景技术Background technique

视觉对话是一种人机交互方法，其目的是让机器智能体与人类能够对给定的日常场景图以问答的形式进行合理正确的自然对话。因此，如何让智能体正确的理解由图像、文本组成的多模态语义信息从而对人类提出的问题给出合理的回答是视觉对话中的关键。视觉对话目前也是计算机视觉领域热门研究课题之一，其应用场景也非常的广泛，包括：帮助视觉障碍的人群了解社交媒体内容或日常环境、人工智能助力、机器人应用等方面。Visual dialogue is a human-computer interaction method whose purpose is to enable machine agents and humans to have a reasonable and correct natural dialogue in the form of question and answer given a graph of everyday scenes. Therefore, how to make the agent correctly understand the multimodal semantic information composed of images and texts so as to give reasonable answers to the questions raised by humans is the key in visual dialogue. Visual dialogue is currently one of the hot research topics in the field of computer vision, and its application scenarios are also very wide, including: helping visually impaired people understand social media content or daily environment, artificial intelligence assistance, robot applications, etc.

随着现代图像处理技术和深度学习的发展，视觉对话技术也得到了巨大的发展，但是仍然面临以下几点问题：With the development of modern image processing technology and deep learning, visual dialogue technology has also been greatly developed, but it still faces the following problems:

一、智能体在处理文本信息时缺乏对文本特征进行更细粒度的学习。First, the agent lacks more fine-grained learning of text features when processing text information.

例如2017年，Jiasen Lu等作者在顶级国际会议Conference and Workshop onNeural Information Processing Systems(NIPS 2017)上发表的文章《Best ofBothWorlds:Transferring Knowledge from Discriminative Learning to a GenerativeVisual Dialog Model》中提出的基于历史对话的图像注意力方法，该方法首先对历史对话进行句子层面的注意力处理，然后基于处理后的文本特征对图像特征进行注意力学习，但是该方法在处理当前问题的文本信息时只考虑了句子层面的语义，没有考虑词层面的语义，而在实际提问的句子里面通常只有部分关键词是与预测的答案最相关的。因此，该方法在实际应用时会有一定的局限性。For example, in 2017, Jiasen Lu and other authors presented the image based on historical dialogue in the article "Best of BothWorlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model" published at the top international conference Conference and Workshop on Neural Information Processing Systems (NIPS 2017). Attention method, this method first performs sentence-level attention processing on historical dialogue, and then performs attention learning on image features based on processed text features, but this method only considers sentence-level information when processing the text information of the current problem. Semantics does not consider the semantics at the word level, and usually only some keywords are most relevant to the predicted answer in the sentence of the actual question. Therefore, this method has certain limitations in practical application.

二、现有方法都基于全局图像进行特征提取，导致视觉语义信息不够精确。Second, the existing methods are all based on the global image for feature extraction, resulting in inaccurate visual semantic information.

例如2018年，Qi Wu等作者在顶级国际会议IEEE Conference on ComputerVision and Pattern Recognition(CVPR 2018)上发表的《Are You Talking to Me？Reasoned Visual Dialog Generation throughAdversarial Learning》。这篇文章利用全局视觉特征、问题以及历史对话文本特征进行一系列的相互注意力处理并融合得到多模态语义特征，该方法有效的学习了不同特征之间的语义关系，但是该方法只考虑了全局视觉特征，使得在对图像进行注意力处理后经常会关注到一些问题无关的视觉信息，这些冗余信息会对智能体的答案预测造成干扰。For example, in 2018, Qi Wu and other authors published "Are You Talking to Me?" at the top international conference IEEE Conference on ComputerVision and Pattern Recognition (CVPR 2018). Reasoned Visual Dialog Generation through Adversarial Learning. This article uses global visual features, questions and historical dialogue text features to perform a series of mutual attention processing and fusion to obtain multimodal semantic features. This method effectively learns the semantic relationship between different features, but this method only considers The global visual features are used, so that some visual information irrelevant to the question is often paid attention to after the attention processing of the image, and these redundant information will interfere with the answer prediction of the agent.

发明内容SUMMARY OF THE INVENTION

本发明是为了克服现有技术存在的不足之处，提出一种基于双视觉注意力网络的视觉对话生成方法，以期能为智能体提供更完整、更合理的视觉语义信息，以及更细粒度的文本语义信息，从而提高智能体在对问题进行答案推理生成时的合理性和准确性。In order to overcome the shortcomings of the prior art, the present invention proposes a visual dialogue generation method based on a dual visual attention network, in order to provide the agent with more complete and more reasonable visual semantic information, as well as more fine-grained Text semantic information, so as to improve the rationality and accuracy of the agent in the reasoning and generation of the answer to the question.

本发明为解决技术问题采用如下技术方案：The present invention adopts the following technical scheme for solving the technical problem:

本发明一种基于双视觉注意力网络的视觉对话生成方法的特点是按如下步骤进行：The feature of a method for generating a visual dialogue based on a dual visual attention network of the present invention is to carry out the following steps:

步骤1、视觉对话中文本输入的预处理和单词表的构建：Step 1. Preprocessing of text input in visual dialogue and construction of word list:

步骤1.1、获取视觉对话数据集，所述视觉对话数据集中包含句子文本和图像；Step 1.1, obtain a visual dialogue dataset, which contains sentence text and images;

对所述视觉对话数据集中所有的句子文本进行分词处理，得到分割后的单词；Perform word segmentation on all sentence texts in the visual dialogue dataset to obtain segmented words;

步骤1.2、从分割后的单词中筛选出词频率大于阈值的所有单词，并构建单词索引表Voc；再对所述索引表Voc中的每一个单词进行one-hot编码，得到one-hot向量表O＝[o₁,o₂,...,o_n,...,o_N]，其中o_n表示索引表Voc中的第n个单词所对应的one-hot编码向量，N为索引表Voc中的单词个数；Step 1.2, screen out all words whose word frequency is greater than the threshold from the divided words, and construct a word index table Voc; then perform one-hot encoding on each word in the index table Voc to obtain a one-hot vector table O=[o ₁ ,o ₂ ,...,on ,...,o _N ], where o _n represents the one-hot encoding vector corresponding to the _nth word in the index table Voc, and N is the index table The number of words in the Voc;

步骤1.3、随机初始化一个词嵌入矩阵W_e，

其中d_w代表词向量的维度；利用词嵌入矩阵W_e将one-hot向量表中的每个单词的编码向量映射到相应的词向量上，从而得到词向量表；Step 1.3, randomly initialize a word embedding _{matrix We} ,

_Wherein d _w represents the dimension of the word vector; the word embedding matrix We is used to map the encoding vector of each word in the one-hot vector table to the corresponding word vector, thereby obtaining the word vector table;

步骤2、对话图像的特征提取以及对话文本的特征提取；Step 2, feature extraction of dialogue images and feature extraction of dialogue texts;

步骤2.1、从视觉对话数据集中获取任意一个图像I及其对应的历史对话U＝[u₁,u₂,...,u_t,...,u_T]、当前问题

和真实答案标签A_GT所组成的视觉对话信息D；其中T为历史对话U中的对话片段总数，u_t表示对话中的第t段对话，L₁表示当前问题Q的句子长度，w_Q,i表示当前问题Q中的第i个单词在所述词向量表中所对应的词向量；Step 2.1. Obtain any image I and its corresponding historical dialogue U=[u ₁ ,u ₂ ,...,u _t ,...,u _T ] from the visual dialogue dataset, the current problem

Visual dialogue information D composed of real answer labels A _GT ; where T is the total number of dialogue segments in the historical dialogue U, ut represents the _t -th dialogue in the dialogue, L ₁ represents the sentence length of the current question Q, w _{Q, i} represents the word vector corresponding to the i-th word in the current question Q in the word vector table;

步骤2.2、使用卷积神经网络提取视觉对话信息D中图像I的特征，得到全局视觉特征

其中

表示全局视觉特征V⁽⁰⁾中的第m个区域特征，M表示全局视觉特征V⁽⁰⁾中的总的空间区域数，d_g为全局视觉特征V⁽⁰⁾的通道维度；Step 2.2, use the convolutional neural network to extract the features of the image I in the visual dialogue information D to obtain the global visual features

in

represents the mth regional feature in the global visual feature V ⁽⁰⁾ ^, M represents the total number of spatial regions in the global visual feature V (0), and d _g is the channel dimension of the global visual feature V ⁽⁰⁾ ;

步骤2.3、使用目标检测特征提取器提取视觉对话信息D中图像I的特征，得到局部视觉特征其中表示局部视觉特征R⁽⁰⁾中的第k个目标对象特征，K表示局部视觉特征R⁽⁰⁾中的检测的局部目标对象总数，d_r为局部视觉特征R⁽⁰)的通道维度；Step 2.3. Use the target detection feature extractor to extract the features of the image I in the visual dialogue information D to obtain local visual features in represents the k-th target object feature in the local visual feature R ⁽⁰⁾ ^, K represents the total number of local target objects detected in the local visual feature R(0), and d _r is the channel dimension of the local visual feature R ⁽⁰ );

步骤2.4、利用全连接操作将全局视觉特征和局部视觉特征映射到同一维度的空间中，得到转换后的全局视觉特征V＝[v₁,v₂,...,v_m,...,v_M]，以及局部视觉特征R＝[r₁,r₂,...,r_k,...,r_K]，

其中v_m表示全局视觉特征V中的第m个区域特征，r_k表示局部视觉特征R中的第k个目标对象特征，d为转换后的通道维度；Step 2.4. Use the full connection operation to map the global visual feature and the local visual feature to the space of the same dimension, and obtain the transformed global visual feature V=[v ₁ ,v ₂ ,..., _vm ,..., v _M ], and local visual features R=[r ₁ ,r ₂ ,...,r _k ,...,r _K ],

where v _m represents the m-th regional feature in the global visual feature V, r _k represents the k-th target object feature in the local visual feature R, and d is the transformed channel dimension;

步骤2.5、使用长短期记忆网络LSTM对当前问题Q进行特征提取，得到隐状态特征序列

并取长短期记忆网络LSTM的最后一个步长输出的隐状态特征

作为当前问题Q的句子级问题特征向量q，

其中h_Q,i表示长短期记忆网络LSTM的第i个步长输出的隐状态特征；Step 2.5, use the long short-term memory network LSTM to extract the features of the current question Q, and obtain the hidden state feature sequence

And take the hidden state feature of the last step output of the long short-term memory network LSTM

As the sentence-level question feature vector q of the current question Q,

where h _Q,i represents the hidden state feature of the i-th step output of the long short-term memory network LSTM;

步骤2.6、使用长短期记忆网络LSTM对历史对话U中的第t段对话

进行特征提取，得到第t个隐状态序列

长短期记忆网络取LSTM的最后一个步长输出的隐状态特征

作为第t段对话u_t的句子级特征h_t，

则总的历史对话特征为H＝[h₁,h₂,...,h_t,...,h_T]，

其中w_t,i表示第t段对话u_t中第i个单词在所述词向量表中所对应的词向量，L₂为第t段对话u_t的句子长度，h_t,i表示长短期记忆网络LSTM的第i个步长输出的隐状态特征；Step 2.6, use the long short-term memory network LSTM to analyze the t-th dialogue in the historical dialogue U

Perform feature extraction to get the t-th hidden state sequence

The long short-term memory network takes the hidden state features of the last step output of LSTM

As the sentence-level feature h _t of the t-th dialogue u _t ,

Then the total historical dialogue feature is H=[h ₁ , h ₂ ,...,h _t ,...,h _T ],

Where w _t,i represents the word vector corresponding to the i-th word in the word vector table in the _t -th dialogue ut, L ₂ is the sentence length of the t-th dialogue _ut , h _t,i represents the long-term and short-term The hidden state feature of the i-th step output of the memory network LSTM;

步骤3、基于当前问题信息对历史对话信息进行注意力处理；Step 3. Perform attention processing on the historical dialogue information based on the current problem information;

利用式(1)对所述总的历史对话特征H＝[h₁,h₂,...,h_t,...,h_T]进行注意力处理，得到注意力关注的历史特征向量h_a，

Use formula (1) to perform attention processing on the total historical dialogue features H=[h ₁ , h ₂ ,...,h _t ,...,h _T ], and obtain the historical feature vector h that the attention pays attention to _a ,

h_a＝α^hH^T (1)h _a = α ^h H ^T (1)

式(1)中，

表示对历史对话特征H的注意力分布权重，并有：In formula (1),

represents the weight of the attention distribution on the historical dialogue feature H, and has:

α^h＝softmax(PTz^h) (2)α ^h =softmax(PTz ^h ) (2)

式(2)中，表示句子级问题特征向量q与历史对话特征H之间的相似度矩阵，表示相似度矩阵z^h的待训练参数，并有：In formula (2), represents the similarity matrix between the sentence-level question feature vector q and the historical dialogue feature H, Represents the parameters to be trained of the similarity matrix z ^h , and has:

z^h＝tanh(W_qq+W_hH) (3)z ^h =tanh(W _q q+W _h H) (3)

式(3)中，

表示句子级问题特征向量q对应的待训练参数，

表示历史对话特征H对应的待训练参数；In formula (3),

represents the parameters to be trained corresponding to the sentence-level question feature vector q,

Represents the parameters to be trained corresponding to the historical dialogue feature H;

步骤4、双视觉特征各自独立的注意力处理；Step 4. Independent attention processing of dual visual features;

步骤4.1、利用式(4)对全局视觉特征V＝[v₁,v₂,...,v_m,...,v_M]进行注意力处理，得到注意力关注的全局视觉特征向量V′， Step 4.1. Use formula (4) to perform attention processing on the global visual feature V=[v ₁ , v ₂ ,..., _vm ,...,v _M ], and obtain the global visual feature vector V that the attention pays attention to ',

V′＝α^V1V^T (4)V′=α ^V1 V ^T (4)

式(4)中，

表示对全局视觉特征V的注意力分布权重，并有：In formula (4),

Represents the attention distribution weight to the global visual feature V, and has:

式(5)中，

表示句子级问题特征向量q、注意力关注的历史特征向量h_a以及全局视觉特征V之间的相似度矩阵

表示相似度矩阵z^V1的待训练参数，并有：In formula (5),

Represents the similarity matrix between the sentence-level question feature vector _q , the historical feature vector ha of attention attention, and the global visual feature V

Represents the parameters to be trained of the similarity matrix z ^V1 , and has:

z^V1＝tanh(W_q1q+W_h1h_a+W_V1V) (6)z ^V1 = tanh(W _q1 q+W _h1 h _a +W _V1 V) (6)

式(6)中，

表示句子级问题特征向量q对应的待训练参数，表示注意力关注的历史特征向量h_a对应的待训练参数，

表示全局视觉特征V对应的待训练参数；In formula (6),

represents the parameters to be trained corresponding to the sentence-level question feature vector q, The parameters to be trained corresponding to the historical feature vector _ha representing the attention of the attention,

Indicates the parameters to be trained corresponding to the global visual feature V;

步骤4.2、利用式(7)对局部视觉特征R＝[r₁,r₂,...,r_k,...,r_K]进行注意力处理，得到注意力关注的局部视觉特征向量R′，

Step 4.2. Use the formula (7) to perform attention processing on the local visual features R=[r ₁ , r ₂ ,...,r _k ,...,r _K ], and obtain the local visual feature vector R that the attention pays attention to ',

R′＝α^R1R^T (7)R′=α ^R1 R ^T (7)

式(7)中，

表示对局部视觉特征R的注意力分布权重，并有：In formula (7),

Represents the weight of the attention distribution to the local visual feature R, and has:

式(8)中，

表示句子级问题特征向量q、注意力关注的历史特征向量h_a以及局部视觉特征R之间的相似度矩阵，

表示相似度矩阵z^V1的待训练参数，并有：In formula (8),

represents the similarity matrix between the sentence-level question feature vector _q , the historical feature vector ha of attention attention, and the local visual feature R,

z^R1＝tanh(W′_q1q+W′_h1h_a+W_R1R) (9)z ^R1 = tanh(W′ _q1 q+W′ _h1 h _a +W _R1 R) (9)

式(9)中，

表示句子级问题特征向量q对应的待训练参数，

表示注意力关注的历史特征向量h_a对应的待训练参数，

表示局部视觉特征R对应的待训练参数；In formula (9),

The parameters to be trained corresponding to the historical feature vector _ha representing the attention of the attention,

Represents the parameters to be trained corresponding to the local visual feature R;

步骤5、双视觉特征相互交叉的注意力处理；Step 5. Attention processing of the intersection of dual visual features;

步骤5.1、利用式(10)对全局视觉特征V＝[v₁,v₂,...,v_m,...,v_M]进行双视觉交叉注意力处理，得到进一步注意力关注的全局视觉特征向量V″,

Step 5.1. Use formula (10) to perform bi-visual cross-attention processing on the global visual feature V=[v ₁ , v ₂ ,...,v _m ,...,v _M ] to obtain the global attention paid by further attention The visual feature vector V″,

V″＝α^V2V^T (10)V″=α ^V2 V ^T (10)

式(10)中，

表示对全局视觉特征V的进一步注意力分布权重，并有：In formula (10),

represents the further attention distribution weights to the global visual feature V, and has:

式(11)中，

表示句子级问题特征向量q、注意力关注的历史特征向量h_a、注意力关注的局部视觉特征向量R′以及全局视觉特征V之间的相似度矩阵，表示相似度矩阵z^V2的待训练参数，并有：In formula (11),

represents the similarity matrix between the sentence-level question feature vector _q , the historical feature vector ha of attention, the local visual feature vector R′ of attention, and the global visual feature V, Represents the parameters to be trained of the similarity matrix z ^V2 , and has:

z^V2＝tanh(W_q2q+W_h2h_a+W_R2R′+W_V2V) (12)z ^V2 = tanh(W _q2 q+W _h2 h _a +W _R2 R′+W _V2 V) (12)

式(12)中，

表示句子级问题特征向量q对应的待训练参数，

表示注意力关注的历史特征向量h_a对应的待训练参数，

表示注意力关注的局部视觉特征向量R′对应的待训练参数，

表示全局视觉特征V对应的待训练参数；In formula (12),

The parameters to be trained corresponding to the local visual feature vector R′ representing the attention of the attention,

步骤5.2、利用式(13)对局部视觉特征R＝[r₁,r₂,...,r_k,...,r_K]进行双视觉交叉注意力处理，得到进一步注意力关注的局部视觉特征向量R″,

Step 5.2. Use formula (13) to perform bi-visual cross-attention processing on the local visual features R=[r ₁ , r ₂ ,...,r _k ,...,r _K ] to obtain the local attention paid by further attention The visual feature vector R″,

R″＝α^R2R^T (13)R″=α ^R2 R ^T (13)

式(13)中，

表示对局部视觉特征R的进一步注意力分布权重，并有：In formula (13),

represents further attention distribution weights on local visual features R, and has:

式(14)中，

表示句子级问题特征向量q、注意力关注的历史特征向量h_a、注意力关注的全部视觉特征向量V′以及局部视觉特征R之间的相似度矩阵，表示相似度矩阵z^R2的待训练参数，并有：In formula (14),

Represents the similarity matrix between the sentence-level question feature vector _q , the historical feature vector ha of attention, the total visual feature vector V′ of attention, and the local visual feature R, Represents the parameters to be trained of the similarity matrix z ^R2 , and has:

z^R2＝tanh(W′_q2q+W′_h2h_a+W′_V2V′+W′_R2R) (15)z ^R2 = tanh(W′ _q2 q+W′ _h2 h _a +W′ _V2 V′+W′ _R2 R) (15)

式(15)中，

表示句子级问题特征向量q对应的待训练参数，

表示注意力关注的历史特征向量h_a对应的待训练参数，示注意力关注的全部视觉特征向量V′对应的待训练参数，

表示局部视觉特征R对应的待训练参数；In formula (15),

The parameters to be trained corresponding to the historical feature vector _ha representing the attention of the attention, The parameters to be trained corresponding to all visual feature vectors V′ that show attention,

步骤6、视觉特征的优化处理；Step 6, optimization processing of visual features;

步骤6.1、利用式(16)对当前问题Q进行词级别注意力处理，得到注意力关注的词级问题特征向量q^s，

Step 6.1. Use Equation (16) to perform word-level attention processing on the current question Q, and obtain the word-level question feature vector q ^s of which the attention is concerned,

q^s＝α^qQ^T (16)q ^s = α ^q Q ^T (16)

式(16)中，

表示对当前问题Q的注意力分布权重，并有：In formula (16),

Represents the weight of the attention distribution on the current question Q, and has:

式(14)中，

表示当前问题Q的自注意力语义矩阵，

表示自注意力语义矩阵z^Q的待训练参数，并有：In formula (14),

represents the self-attention semantic matrix of the current question Q,

represents the to-be-trained parameters of the self-attention semantic matrix z ^Q , and has:

z^Q＝tanh(W_QQ) (18)z ^Q = tanh(W _Q Q) (18)

式(18)中，

表示词级别注意力处理时当前问题Q对应的待训练参数；In formula (18),

Indicates the parameters to be trained corresponding to the current question Q during word-level attention processing;

步骤6.2、利用式(19)和式(20)对进一步注意力关注的全局视觉特征向量V″和局部视觉特征向量R″分别进行优化处理，并得到最终的全局视觉特征向量

和局部视觉特征向量

Step 6.2. Use formula (19) and formula (20) to optimize the global visual feature vector V″ and local visual feature vector R″ for further attention, and obtain the final global visual feature vector

and local visual feature vectors

式(19)和式(20)中，

表示视觉特征优化处理时词级问题特征向量q^s对应的待训练参数，⊙表示点乘运算；In formula (19) and formula (20),

Represents the parameter to be trained corresponding to the feature vector q ^s of the word-level problem during visual feature optimization processing, and ⊙ represents the dot product operation;

步骤7、多模态语义融合及解码生成答案特征序列；Step 7. Multimodal semantic fusion and decoding to generate an answer feature sequence;

步骤7.1、将所述注意力关注的词级问题特征向量q^s，注意力关注的历史特征向量h_a，优化后的全局视觉特征向量

和局部视觉特征向量

进行拼接后得到多模态特征向量e_M，

其中d_M＝3d+d_w代表多模态特征向量的维度；再利用全连接操作对所述多模态特征向量e_M进行映射，得到融合语义特征向量e,

Step 7.1. Combine the word-level question feature vector _q ^s of the attention, the historical feature vector ha of the attention, and the optimized global visual feature vector

and local visual feature vectors

After splicing, the multimodal feature vector e _M is obtained,

where d _M =3d+d _w represents the dimension of the multi-modal feature vector; then use the full connection operation to map the multi-modal feature vector e _M to obtain the fusion semantic feature vector e,

步骤7.2、将所述融合语义特征向量e输入到长短期记忆网络LSTM中，得到预测答案的隐状态特征序列

其中h_A,i为长短期记忆网络LSTM的第i个步长的输出，L₃为真实答案标签A_GT的句子长度；Step 7.2. Input the fused semantic feature vector e into the long short-term memory network LSTM to obtain the hidden state feature sequence of the predicted answer

Where h _A,i is the output of the i-th step size of the long short-term memory network LSTM, and L ₃ is the sentence length of the true answer label A _GT ;

步骤7.3、利用全连接操作将所述预测答案的隐状态特征序列

映射到与所述one-hot向量表O同一维度的空间中，得到预测答案的单词向量集合

其中y_i表示预测答案中第i个单词的映射向量，且向量长度与单词个数相同；Step 7.3. Use the full connection operation to convert the hidden state feature sequence of the predicted answer

Map to the space of the same dimension as the one-hot vector table O, and get the word vector set of the predicted answer

where y _i represents the mapping vector of the ith word in the predicted answer, and the length of the vector is the same as the number of words;

步骤8、基于双视觉注意力网络的视觉对话生成网络模型的参数优化；Step 8, parameter optimization of the visual dialogue generation network model based on the dual visual attention network;

步骤8.1、根据所述单词one-hot向量表O对真实答案标签A_GT中的单词构建向量集合

其中表示真实答案标签A_GT中第i个单词的映射向量，且向量长度与单词个数相同；Step 8.1. According to the word one-hot vector table O, construct a vector set for the words in the true answer label A _GT

in Represents the mapping vector of the ith word in the true answer label A _GT , and the length of the vector is the same as the number of words;

步骤8.2利用式(21)计算预测答案与真实答案A_GT之间的损失代价E：Step 8.2 Use equation (21) to calculate the loss cost E between the predicted answer and the real answer A _GT :

步骤8.3、利用随机梯度下降法将所述损失代价E进行优化求解，使损失代价E达到最小，从而得到最优参数的基于双视觉注意力网络的视觉对话网络模型；Step 8.3, using the stochastic gradient descent method to optimize and solve the loss cost E to minimize the loss cost E, thereby obtaining a visual dialogue network model based on the dual visual attention network with optimal parameters;

步骤9、预测答案生成；Step 9. Predict the answer generation;

对所述预测答案的单词向量集合

使用贪心解码算法得到第i个单词的映射向量y_i中最大值所对应的位置，并根据最大值所对应的位置在单词索引表Voc中查找相应位置的单词作为第i个单词的映射向量y_i最终的预测单词，进而得到单词向量集合Y所对应的预测答案，并以所述当前问题Q和单词向量集合Y所对应的预测答案作为最终生成的视觉对话。the set of word vectors for the predicted answer

Use the greedy decoding algorithm to obtain the position corresponding to the maximum value in the mapping vector y _i of the ith word, and look up the word at the corresponding position in the word index table Voc according to the position corresponding to the maximum value as the mapping vector y of the ith word _i final predicted word, and then the predicted answer corresponding to the word vector set Y is obtained, and the predicted answer corresponding to the current question Q and the word vector set Y is used as the final generated visual dialogue.

与已有技术相比，本发明的有益效果体现在：Compared with the prior art, the beneficial effects of the present invention are embodied in:

1、和以往研究的视觉对话技术相比，本发明不仅提取了全局图像的视觉特征，还提取了局部图像对象的视觉特征，全局视觉特征包含的视觉语义信息更加全面，而局部视觉特征包含的视觉语义更加具体，从而充分考虑了两种视觉特征的特点，并通过两阶注意力处理来分别学习两种视觉特征的内部关系和相互关系，进而形成视觉语义互补，使得智能体能获得更完整、更准确的视觉语义信息。1. Compared with the visual dialogue technology studied in the past, the present invention not only extracts the visual features of the global image, but also extracts the visual features of the local image objects. The global visual features contain more comprehensive visual semantic information, while the local visual features contain more comprehensive visual semantic information. The visual semantics is more specific, which fully considers the characteristics of the two visual features, and learns the internal and interrelationships of the two visual features through two-order attention processing, thereby forming visual semantic complementarity, so that the agent can obtain more complete, More accurate visual semantic information.

2、本发明从句子层面和词层面分别处理文本特征，在处理时首先对问题和历史对话进行句子层面的特征提取并对历史对话特征进行注意力处理；接着，基于所得句子层面的文本特征来学习两种视觉特征的关系；之后，本发明从词层面对问题特征进行注意力处理，以捕获问题中有助于答案推测的关键词特征，这种更细粒度的文本处理方法使得本发明在视觉对话中可以生成更准确合理的答案。2. The present invention processes the text features from the sentence level and the word level respectively. When processing, firstly perform sentence-level feature extraction on questions and historical dialogues and perform attention processing on historical dialogue features; then, based on the obtained sentence-level text features. The relationship between two visual features is learned; after that, the present invention performs attention processing on the question features from the word level to capture the keyword features in the question that are helpful for answer inference. This finer-grained text processing method enables the present invention to More accurate and reasonable answers can be generated in visual dialogue.

3、本发明提出了一种多模态语义融合结构，该结构首先利用词层面的问题文本特征对两种视觉特征分别进行优化，以进一步突出视觉特征中与问题关键词相关的视觉信息。接着，通过拼接问题特征、历史对话特征、全局视觉特征和局部视觉特征，并进行学习与融合，进一步的，各视觉特征和文本特征可以通过多模态语义融合网络互相产生影响，并辅助优化网络的参数，融合网络同时获取了视觉语义和文本语义之后，智能体的答案预测生成效果也有了很大的提升，预测的结果也更精确。3. The present invention proposes a multi-modal semantic fusion structure, which firstly optimizes two visual features by using the question text features at the word level, so as to further highlight the visual information related to the question keywords in the visual features. Then, by splicing question features, historical dialogue features, global visual features and local visual features, and learning and merging, further, each visual feature and text feature can interact with each other through the multimodal semantic fusion network, and assist in optimizing the network. After the fusion network obtains visual semantics and textual semantics at the same time, the agent's answer prediction generation effect is also greatly improved, and the prediction results are more accurate.

附图说明Description of drawings

图1为本发明的网络模型示意图；1 is a schematic diagram of a network model of the present invention;

图2为本发明的双视觉注意力处理示意图；2 is a schematic diagram of dual visual attention processing of the present invention;

图3为本发明网络模型训练示意图。FIG. 3 is a schematic diagram of the training of the network model of the present invention.

具体实施方式Detailed ways

在本实施例中，如图1所示，一种基于双视觉注意力网络的视觉对话生成方法是按如下步骤进行：In this embodiment, as shown in Figure 1, a method for generating a visual dialogue based on a dual visual attention network is performed as follows:

步骤1.1、从网上获取视觉对话数据集，目前公开的数据集主要有VisDialDataset，该数据集由乔治亚理工学院的相关研究员收集而成，视觉对话数据集中包含句子文本和图像；Step 1.1. Obtain the visual dialogue data set from the Internet. The currently public data sets mainly include VisDialDataset, which was collected by relevant researchers of Georgia Institute of Technology. The visual dialogue data set contains sentence text and images;

对视觉对话数据集中所有的句子文本进行分词处理，得到分割后的单词；Perform word segmentation on all sentence texts in the visual dialogue dataset to obtain segmented words;

步骤1.2、从分割后的单词中筛选出词频率大于阈值的所有单词，阈值的大小可设置为4，并构建单词索引表Voc；创建单词索引表Voc的方法：单词表可以包含单词、标点符号；统计单词的个数并对单词进行排序，其中为了满足优化的训练过程，添加了一个空白符。对所有单词按照顺序构建单词与序号的对应表；再对索引表Voc中的每一个单词进行one-hot编码，得到one-hot向量表O＝[o₁,o₂,...,o_n,...,o_N]，其中o_n表示索引表Voc中的第n个单词所对应的one-hot编码向量，N为索引表Voc中的单词个数；Step 1.2. Screen out all words whose word frequency is greater than the threshold from the divided words. The size of the threshold can be set to 4, and the word index table Voc is constructed; the method of creating the word index table Voc: the word table can contain words, punctuation marks ; Count the number of words and sort the words, in which a blank is added to meet the optimized training process. Construct the correspondence table of words and serial numbers for all words in order; then perform one-hot encoding on each word in the index table Voc to obtain a one-hot _vector table O=[o ₁ ,o ₂ ,...,on ,...,o _N ], where o _n represents the one-hot encoding vector corresponding to the nth word in the index table Voc, and N is the number of words in the index table Voc;

步骤1.3、随机初始化一个词嵌入矩阵W_e，

其中d_w代表词向量的维度；利用词嵌入矩阵W_e将one-hot向量表中的第n个单词的编码向量o_n映射到第n个词向量w_n，

从而得到词向量表；Step 1.3, randomly initialize a word embedding _{matrix We} ,

where d _w represents the dimension of the word vector; the word embedding matrix We is used to map the encoding vector on of the _nth word in the one-hot vector table to the _nth word vector w _n ,

Thereby, the word vector table is obtained;

和真实答案标签A_GT所组成的视觉对话信息D；其中T为历史对话U中的对话片段总数，u_t表示对话中的第t段对话，L₁表示当前问题Q的句子长度，L₁的大小可设置为16，对于句子长度小于16的句子会用零向量进行填充，填充至其长度为L₁，w_Q,i表示句子中的第i个单词的词向量；Step 2.1. Obtain any image I and its corresponding historical dialogue U=[u ₁ ,u ₂ ,...,u _t ,...,u _T ] from the visual dialogue dataset, the current problem

Visual dialogue information D composed of real answer labels A _GT ; where T is the total number of dialogue segments in the historical dialogue U, ut represents the _t -th dialogue in the dialogue, L ₁ represents the sentence length of the current question Q, and L ₁ ’s The size can be set to 16. For sentences with a sentence length less than 16, it will be filled with a zero vector until its length is L ₁ , w _Q,i represents the word vector of the i-th word in the sentence;

其中

表示全局视觉特征V⁽⁰⁾中的第m个区域特征，M表示全局视觉特征V⁽⁰⁾中的总的空间区域数，d_g为全局视觉特征V⁽⁰⁾的通道维度；本实施例中，可以采用预训练的VGG卷积神经网络对图像I的全局视觉特征进行特征提取；VGG是一种二维卷积神经网络，它被证明了有很强的视觉信息表达能力，因此我们使用在COCO2014数据集上预训练过的VGG作为实验的全局视觉特征提取器，并且这一部分的网络不参与后续步骤8的参数更新部分；Step 2.2, use the convolutional neural network to extract the features of the image I in the visual dialogue information D to obtain the global visual features

in

represents the mth regional feature in the global visual feature V ⁽⁰⁾ ^, M represents the total number of spatial regions in the global visual feature V(0), and d _g is the channel dimension of the global visual feature V ⁽⁰⁾ ; this embodiment , the pre-trained VGG convolutional neural network can be used to extract the global visual features of the image I; VGG is a two-dimensional convolutional neural network, which has been proved to have a strong ability to express visual information, so we use The VGG pre-trained on the COCO2014 dataset is used as the global visual feature extractor of the experiment, and this part of the network does not participate in the parameter update part of the subsequent step 8;

步骤2.3、使用目标检测特征提取器提取视觉对话信息D中图像I的特征，得到局部视觉特征

其中

表示局部视觉特征R⁽⁰⁾中的第k个目标对象特征，K表示局部视觉特征R⁽⁰⁾中的检测的局部目标对象总数，d_r为局部视觉特征R⁽⁰⁾的通道维度；本实施例中，可以采用预训练的Faster-RCNN目标检测特征提取器对图像I的局部视觉特征进行特征提取；Faster-RCNN所提取的局部视觉特征在许多视觉任务上都取得了优异的效果，因此我们使用在Visual Genome数据集上预训练过的Faster-RCNN作为实验的局部视觉特征提取器，并且这一部分的网络不参与后续步骤8的参数更新部分；Step 2.3. Use the target detection feature extractor to extract the features of the image I in the visual dialogue information D to obtain local visual features

in

Represents the k-th target object feature in the local visual feature R ⁽⁰⁾ ^, K represents the total number of local target objects detected in the local visual feature R(0), and d _r is the channel dimension of the local visual feature R ⁽⁰⁾ ; this In the embodiment, the pre-trained Faster-RCNN target detection feature extractor can be used to perform feature extraction on the local visual features of image I; the local visual features extracted by Faster-RCNN have achieved excellent results in many visual tasks, so We use the Faster-RCNN pre-trained on the Visual Genome dataset as the local visual feature extractor for the experiment, and this part of the network does not participate in the parameter update part of the subsequent step 8;

步骤2.4、利用全连接操作将全局视觉特征和局部视觉特征映射到同一维度的空间中，得到转换后的全局视觉特征V＝[v₁,v₂,...,v_m,...,v_M]，

以及局部视觉特征R＝[r₁,r₂,...,r_k,...,r_K]，

其中v_m表示全局视觉特征V中的第m个区域特征，r_k表示局部视觉特征R中的第k个目标对象特征，d为转换后的通道维度；Step 2.4. Use the full connection operation to map the global visual feature and the local visual feature to the space of the same dimension, and obtain the transformed global visual feature V=[v ₁ ,v ₂ ,..., _vm ,..., v _M ],

and local visual features R=[r ₁ ,r ₂ ,...,r _k ,...,r _K ],

并取LSTM的最后一个步长输出的隐状态特征作为当前问题Q的句子级问题特征向量q，

其中h_Q,i表示LSTM第i个步长输出的隐状态特征；Step 2.5, use the long short-term memory network LSTM to extract the features of the current question Q, and obtain the hidden state feature sequence

And take the hidden state feature of the last step output of LSTM As the sentence-level question feature vector q of the current question Q,

where h _Q,i represents the hidden state feature of the ith step output of LSTM;

步骤2.6、使用长短期记忆网络LSTM对历史对话U中的每一段对话

进行特征提取，得到隐状态序列取LSTM的最后一个步长输出的隐状态特征

作为对话u_t的句子级特征h_t，

总的历史对话特征为H＝[h₁,h₂,...,h_t,...,h_T]，

其中w_u,i表示对话u_t中第i个单词的词向量，L₂为对话u_t的句子长度，L₂的大小可设置为25，对于句子长度小于25的句子会用零向量进行填充，填充至其长度为L₂，h_u,i表示LSTM第i个步长输出的隐状态特征；Step 2.6, use the long short-term memory network LSTM to analyze each dialogue in the historical dialogue U

Perform feature extraction to get the hidden state sequence Take the hidden state feature of the last step output of the LSTM

As the sentence-level feature _ht of the dialogue _ut ,

The total historical dialogue features are H=[h ₁ , h ₂ ,...,h _t ,...,h _T ],

Where w _u,i represents the word vector of the i-th word in the dialogue _ut , L ₂ is the sentence length of the dialogue u _t , the size of L ₂ can be set to 25, and the sentences with a sentence length less than 25 will be filled with zero vectors , padded to its length L ₂ , h _u,i represents the hidden state feature of the ith step output of LSTM;

利用式(1)对总的历史对话特征H＝[h₁,h₂,...,h_t,...,h_T]进行注意力处理，得到注意力关注的历史特征向量h_a, Use formula (1) to perform attention processing on the total historical dialogue features H=[h ₁ , h ₂ ,...,h _t , _... ,h _T ], and obtain the historical feature vector ha of attention,

h_a＝α^hH (1)h _a = α ^h H (1)

式(1)中，

表示对历史对话特征H的注意力分布权重，并有：In formula (1),

α^h＝softmax(PTz^h) (2)α ^h =softmax(PTz ^h ) (2)

式(2)中，

表示句子级问题特征向量q与历史对话特征H之间的相似度矩阵，

表示相似度矩阵z^h的待训练参数，并有：In formula (2),

represents the similarity matrix between the sentence-level question feature vector q and the historical dialogue feature H,

Represents the parameters to be trained of the similarity matrix z ^h , and has:

z^h＝tanh(W_qq+W_hH) (3)z ^h =tanh(W _q q+W _h H) (3)

式(3)中，

表示句子级问题特征向量q对应的待训练参数，

表示历史对话特征H对应的待训练参数；In formula (3),

步骤4、如图2所示，对双视觉特征进行各自独立的注意力处理；Step 4, as shown in Figure 2, perform independent attention processing on the dual visual features;

步骤4.1、利用式(4)对全局视觉特征V＝[v₁,v₂,...,v_m,...,v_M]进行注意力处理，得到注意力关注的全局视觉特征向量V′,

Step 4.1. Use formula (4) to perform attention processing on the global visual feature V=[v ₁ , v ₂ ,..., _vm ,...,v _M ], and obtain the global visual feature vector V that the attention pays attention to ′,

V′＝α^V1V^T (4)V′=α ^V1 V ^T (4)

式(4)中，

表示对全局视觉特征V的注意力分布权重，并有：In formula (4),

式(5)中，

表示句子级问题特征向量q、注意力关注的历史特征向量h_a以及全局视觉特征V之间的相似度矩阵，

表示相似度矩阵z^V1的待训练参数，并有：In formula (5),

represents the similarity matrix between the sentence-level question feature vector _q , the historical feature vector ha of attention attention, and the global visual feature V,

式(6)中，

表示句子级问题特征向量q对应的待训练参数，

表示注意力关注的历史特征向量h_a对应的待训练参数，

表示全局视觉特征V对应的待训练参数；In formula (6),

步骤4.2、利用式(7)对局部视觉特征R＝[r₁,r₂,...,r_k,...,r_K]进行注意力处理，得到注意力关注的局部视觉特征向量R′,

Step 4.2. Use the formula (7) to perform attention processing on the local visual features R=[r ₁ , r ₂ ,...,r _k ,...,r _K ], and obtain the local visual feature vector R that the attention pays attention to ′,

R′＝α^R1R^T (7)R′=α ^R1 R ^T (7)

式(7)中，

表示对局部视觉特征R的注意力分布权重，并有：In formula (7),

式(8)中，表示句子级问题特征向量q、注意力关注的历史特征向量h_a以及局部视觉特征R之间的相似度矩阵，表示相似度矩阵z^V1的待训练参数，并有：In formula (8), represents the similarity matrix between the sentence-level question feature vector _q , the historical feature vector ha of attention attention, and the local visual feature R, Represents the parameters to be trained of the similarity matrix z ^V1 , and has:

式(9)中，表示句子级问题特征向量q对应的待训练参数，

表示注意力关注的历史特征向量h_a对应的待训练参数，表示局部视觉特征R对应的待训练参数；In formula (9), represents the parameters to be trained corresponding to the sentence-level question feature vector q,

The parameters to be trained corresponding to the historical feature vector _ha representing the attention of the attention, Represents the parameters to be trained corresponding to the local visual feature R;

步骤5、如图2所示，对双视觉特征进行相互交叉的注意力处理；Step 5. As shown in Figure 2, perform cross-over attention processing on the dual visual features;

V″＝α^V2V^T (10)V″=α ^V2 V ^T (10)

式(10)中，

式(11)中，

式(12)中，

表示句子级问题特征向量q对应的待训练参数，

表示注意力关注的历史特征向量h_a对应的待训练参数，

表示注意力关注的局部视觉特征向量R′对应的待训练参数，

表示全局视觉特征V对应的待训练参数；In formula (12),

步骤5.2、利用式(13)对局部视觉特征R＝[r₁,r₂,...,r_k,...,r_K]进行双视觉交叉注意力处理，得到进一步注意力关注的局部视觉特征向量R″, Step 5.2. Use formula (13) to perform bi-visual cross-attention processing on the local visual features R=[r ₁ , r ₂ ,...,r _k ,...,r _K ] to obtain the local attention paid by further attention The visual feature vector R″,

R″＝α^R2R^T (13)R″=α ^R2 R ^T (13)

式(13)中，表示对局部视觉特征R的进一步注意力分布权重，并有：In formula (13), represents further attention distribution weights on local visual features R, and has:

式(14)中，

表示句子级问题特征向量q、注意力关注的历史特征向量h_a、注意力关注的全部视觉特征向量V′以及局部视觉特征R之间的相似度矩阵，

表示相似度矩阵z^R2的待训练参数，并有：In formula (14),

Represents the similarity matrix between the sentence-level question feature vector _q , the historical feature vector ha of attention, the total visual feature vector V′ of attention, and the local visual feature R,

Represents the parameters to be trained of the similarity matrix z ^R2 , and has:

式(15)中，

表示句子级问题特征向量q对应的待训练参数，

表示注意力关注的历史特征向量h_a对应的待训练参数，

示注意力关注的全部视觉特征向量V′对应的待训练参数，

表示局部视觉特征R对应的待训练参数；In formula (15),

The parameters to be trained corresponding to all visual feature vectors V′ that show attention,

q^s＝α^qQ (16)q ^s = α ^q Q (16)

式(16)中，

表示对当前问题Q的注意力分布权重，并有：In formula (16),

式(14)中，

表示当前问题Q的自注意力语义矩阵，

表示自注意力语义矩阵z^Q的待训练参数，并有：In formula (14),

represents the self-attention semantic matrix of the current question Q,

z^Q＝tanh(W_QQ) (18)z ^Q = tanh(W _Q Q) (18)

式(18)中，

步骤6.2、利用式(19)、(20)对进一步注意力关注的全局视觉特征向量V″和局部视觉特征向量R″分别进行优化处理，并得到最终的全局视觉特征向量

和局部视觉特征向量

Step 6.2. Use equations (19) and (20) to optimize the global visual feature vector V″ and local visual feature vector R″ for further attention, and obtain the final global visual feature vector

and local visual feature vectors

式(19)和式(20)中，

步骤7.1、将注意力关注的词级问题特征向量q^s，注意力关注的历史特征向量h_a，优化后的全局视觉特征向量

和局部视觉特征向量

进行拼接后得到多模态特征向量e_M，

其中d_M＝3d+d_w代表多模态特征向量的维度；再利用一层全连接操作对多模态特征向量e_M进行映射，得到融合语义特征向量e, Step 7.1. Combine the word-level problem feature vector _q ^s of attention, the historical feature vector ha of attention, and the optimized global visual feature vector

and local visual feature vectors

After splicing, the multimodal feature vector e _M is obtained,

where d _M = 3d+d _w represents the dimension of the multimodal feature vector; then use a layer of full connection operation to map the multimodal feature vector e _M to obtain the fusion semantic feature vector e,

步骤7.2、将融合语义特征向量e输入到长短期记忆网络LSTM中，得到预测答案的隐状态特征序列

其中h_A,i为LSTM的第i个时间步长的输出，L₃为真实答案标签A_GT的句子长度，L₃的大小可设置为9；Step 7.2. Input the fused semantic feature vector e into the long short-term memory network LSTM to obtain the hidden state feature sequence of the predicted answer

Where h _{A, i} is the output of the ith time step of LSTM, L ₃ is the sentence length of the true answer label A _GT , and the size of L ₃ can be set to 9;

步骤7.3、利用全连接操作将预测答案的隐状态特征序列

映射到与one-hot向量表O同一维度的空间中，得到预测答案的单词向量集合

其中y_i表示预测答案中第i个单词的映射向量，且向量长度与单词个数相同；Step 7.3. Use the full connection operation to predict the hidden state feature sequence of the answer

步骤8、如图3所示，对基于双视觉注意力网络的视觉对话生成网络模型的参数进行优化；Step 8, as shown in Figure 3, optimize the parameters of the visual dialogue generation network model based on the dual visual attention network;

步骤8.1、根据单词one-hot向量表O对真实答案标签A_GT中的单词构建向量集合

其中

表示真实答案标签A_GT中第i个单词的映射向量，且向量长度与单词个数相同；Step 8.1. According to the word one-hot vector table O, construct a vector set for the words in the true answer label A _GT

in

Represents the mapping vector of the ith word in the true answer label A _GT , and the length of the vector is the same as the number of words;

步骤8.3、利用随机梯度下降法将损失代价E进行优化求解，使损失代价E达到最小，从而得到最优参数的基于双视觉注意力网络的视觉对话网络模型；Step 8.3, using the stochastic gradient descent method to optimize the loss cost E to minimize the loss cost E, thereby obtaining a visual dialogue network model based on the dual visual attention network with optimal parameters;

步骤9、预测答案生成；Step 9. Predict the answer generation;

对预测答案的单词向量集合

使用贪心解码算法得到第i个单词的映射向量y_i中最大值所对应的位置，并根据最大值所对应的位置在单词索引表Voc中查找相应位置的单词作为第i个单词的映射向量y_i最终的预测单词，进而得到单词向量集合Y所对应的预测答案，并以当前问题Q和单词向量集合Y所对应的预测答案作为最终生成的视觉对话。collection of word vectors for predicted answers

Use the greedy decoding algorithm to obtain the position corresponding to the maximum value in the mapping vector y _i of the ith word, and look up the word at the corresponding position in the word index table Voc according to the position corresponding to the maximum value as the mapping vector y of the ith word _i finally predict the word, and then obtain the predicted answer corresponding to the word vector set Y, and use the predicted answer corresponding to the current question Q and the word vector set Y as the final generated visual dialogue.

Claims

1. a visual dialogue generation method based on dual visual attention network is characterized in that carrying out as follows:

Step 1. Preprocessing of text input in visual dialogue and construction of word list:

Step 1.1, obtain a visual dialogue dataset, which contains sentence text and images;

Perform word segmentation on all sentence texts in the visual dialogue dataset to obtain segmented words;

Step 1.2, screen out all words whose word frequency is greater than the threshold from the divided words, and construct a word index table Voc; then perform one-hot encoding on each word in the index table Voc to obtain a one-hot vector table O=[o ₁ ,o ₂ ,...,on ,...,o _N ], where o _n represents the one-hot encoding vector corresponding to the _nth word in the index table Voc, and N is the index table The number of words in the Voc;

Step 1.3, randomly initialize a word embedding _{matrix We} ,

Step 2, feature extraction of dialogue images and feature extraction of dialogue texts;

Step 2.1. Obtain any image I and its corresponding historical dialogue U=[u ₁ ,u ₂ ,...,u _t ,...,u _T ] from the visual dialogue dataset, the current problem

Step 2.2, use the convolutional neural network to extract the features of the image I in the visual dialogue information D to obtain the global visual features

in

Step 2.3. Use the target detection feature extractor to extract the features of the image I in the visual dialogue information D to obtain local visual features

in

Represents the k-th target object feature in the local visual feature R ⁽⁰⁾ ^, K represents the total number of local target objects detected in the local visual feature R(0), and d _r is the channel dimension of the local visual feature R ⁽⁰⁾ ;

Step 2.4. Use the full connection operation to map the global visual feature and the local visual feature to the space of the same dimension, and obtain the transformed global visual feature V=[v ₁ ,v ₂ ,..., _vm ,..., v _M ],

and local visual features R=[r ₁ ,r ₂ ,...,r _k ,...,r _K ], where v _m represents the m-th regional feature in the global visual feature V, r _k represents the k-th target object feature in the local visual feature R, and d is the transformed channel dimension;

Step 2.5, use the long short-term memory network LSTM to extract the features of the current question Q, and obtain the hidden state feature sequence And take the hidden state feature of the last step output of the long short-term memory network LSTM

As the sentence-level question feature vector q of the current question Q,

Step 2.6, use the long short-term memory network LSTM to analyze the t-th dialogue in the historical dialogue U

Perform feature extraction to get the t-th hidden state sequence

The long short-term memory network takes the hidden state features of the last step output of LSTM As the sentence-level feature h _t of the t-th dialogue u _t ,

Step 3. Perform attention processing on the historical dialogue information based on the current problem information;

h _a = α ^h H ^T (1)

In formula (1),

α ^h =softmax(P ^T z ^h ) (2)

In formula (2),

z ^h =tanh(W _q q+W _h H) (3)

In formula (3), represents the parameters to be trained corresponding to the sentence-level question feature vector q,

Step 4. Independent attention processing of dual visual features;

Step 4.1. Use formula (4) to perform attention processing on the global visual feature V=[v ₁ , v ₂ ,..., _vm ,...,v _M ], and obtain the global visual feature vector V that the attention pays attention to ',

V′=α ^V1 V ^T (4)

In formula (4),

In formula (5),

z ^V1 = tanh(W _q1 q+W _h1 h _a +W _V1 V) (6)

In formula (6),

The parameters to be trained corresponding to the historical feature vector _ha representing the attention of the attention, Indicates the parameters to be trained corresponding to the global visual feature V;

R′=α ^R1 R ^T (7)

In formula (7),

In formula (8),

z ^R1 = tanh(W′ _q1 q+W′ _h1 h _a +W _R1 R) (9)

In formula (9),

Step 5. Attention processing of the intersection of dual visual features;

V″=α ^V2 V ^T (10)

In formula (10),

In formula (11),

represents the similarity matrix between the sentence-level question feature vector _q , the historical feature vector ha of attention, the local visual feature vector R′ of attention, and the global visual feature V,

Represents the parameters to be trained of the similarity matrix z ^V2 , and has:

z ^V2 = tanh(W _q2 q+W _h2 h _a +W _R2 R′+W _V2 V) (12)

In formula (12),

R″=α ^R2 R ^T (13)

In formula (13),

In formula (14),

z ^R2 = tanh(W′ _q2 q+W′ _h2 h _a +W′ _V2 V′+W′ _R2 R) (15)

In formula (15),

Step 6, optimization processing of visual features;

q ^s = α ^q Q ^T (16)

In formula (16),

In formula (14), represents the self-attention semantic matrix of the current question Q,

z ^Q = tanh(W _Q Q) (18)

In formula (18), Indicates the parameters to be trained corresponding to the current question Q during word-level attention processing;

and local visual feature vectors

In formula (19) and formula (20),

Step 7. Multimodal semantic fusion and decoding to generate an answer feature sequence;

and local visual feature vectors

After splicing, the multimodal feature vector e _M is obtained,

Step 7.2. Input the fused semantic feature vector e into the long short-term memory network LSTM to obtain the hidden state feature sequence of the predicted answer Where h _A,i is the output of the i-th step size of the long short-term memory network LSTM, and L ₃ is the sentence length of the true answer label A _GT ;

Step 7.3. Use the full connection operation to convert the hidden state feature sequence of the predicted answer Map to the space of the same dimension as the one-hot vector table O, and get the word vector set of the predicted answer

Step 8, parameter optimization of the visual dialogue generation network model based on the dual visual attention network;

Step 8.1. According to the word one-hot vector table O, construct a vector set for the words in the true answer label A _GT

Step 8.2 Use equation (21) to calculate the loss cost E between the predicted answer and the real answer A _GT :

Step 8.3, using the stochastic gradient descent method to optimize and solve the loss cost E to minimize the loss cost E, thereby obtaining a visual dialogue network model based on the dual visual attention network with optimal parameters;

Step 9. Prediction answer generation

the set of word vectors for the predicted answer