CN116204850B - Multi-mode emotion analysis method based on dynamic gradient and multi-view collaborative attention - Google Patents
- Publication number: CN116204850B
- Application number: CN202310257670.0A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F16/906—Clustering; Classification (details of database functions independent of the retrieved data types)
- G06F16/35—Clustering; Classification (information retrieval of unstructured textual data)
- G06F16/65—Clustering; Classification (information retrieval of audio data)
- G06F16/75—Clustering; Classification (information retrieval of video data)
- G06N3/08—Learning methods (neural networks; computing arrangements based on biological models)
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of emotion analysis, and particularly discloses a multi-mode emotion analysis method based on dynamic gradient and multi-view collaborative attention, which comprises the following steps: acquiring multi-mode data with emotion information; inputting the multi-modal data into a multi-modal emotion analysis model to obtain emotion analysis results, wherein training of the multi-modal emotion analysis model comprises: performing modal representation learning on the training set data to obtain shallow learning characteristics; carrying out feature fusion processing on the shallow learning features to obtain deep fusion features, and carrying out dynamic gradient adjustment processing to obtain gradient adjustment parameters; model prediction is carried out according to deep fusion characteristics so as to obtain training results; and repeating the training process according to the verification set data and the test set data to obtain the multi-mode emotion analysis model. The multi-mode emotion analysis method based on dynamic gradient and multi-view collaborative attention provided by the invention can pay attention to interaction among modes and balance among modes when emotion analysis is carried out.
Description
Technical Field
The invention relates to the technical field of emotion analysis, in particular to a multi-mode emotion analysis method based on dynamic gradient and multi-view collaborative attention.
Background
In the big data age, data can bring great value. Performing emotion analysis on data from the Internet can play an important role in many fields. In the field of electronic commerce, emotion analysis of consumer comments makes it possible to quickly learn the market's feedback on a commodity, providing technical support for merchants' business management and for scientific government supervision. In the field of human-computer interaction, a robot that understands a person's emotion and intention can react more appropriately and thus serve the person better.
The study object of traditional emotion analysis is mainly single-mode data, especially text data. For example, in the prior art, emotion classification of texts is effectively realized by constructing an emotion dictionary containing basic emotion words and scene emotion words. As another example, three machine learning methods (naive Bayes, maximum entropy classification and support vector machines) reach an accuracy of 82.9% on the emotion classification task of a movie review dataset. Other work uses a Long Short-Term Memory network (LSTM) and a Convolutional Neural Network (CNN) as base models and proposes a C-LSTM model, which captures not only the local features of phrases but also the global and temporal semantics of sentences, obtaining excellent performance on text emotion analysis tasks.
However, with the advent of the big data age, the forms in which people express emotion are becoming more and more diverse, and the limitations of single-mode emotion analysis are becoming increasingly obvious. Many studies have demonstrated that single-mode emotion analysis achieves lower emotion recognition accuracy than multi-mode emotion analysis, especially for complex emotions. Yet a multi-modal model that optimizes a unified learning objective for all modes under a joint training strategy may in some cases perform worse than a single-modal model, which runs contrary to general intuition and is easily overlooked. Studies have found that the modalities tend to converge at different speeds, resulting in an inconsistent-convergence problem. A common solution is to assist modality training with an additional pre-trained model, but this way of focusing on feature learning within each modality tends to oversimplify the interactions between modalities. Conversely, if only the interaction between modalities is considered, the imbalance between modalities is easily ignored.
Therefore, how to pay attention to both interaction and balance between modalities in emotion analysis is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The invention provides a multi-mode emotion analysis method based on dynamic gradient and multi-view collaborative attention, which solves the problem in the related art that emotion analysis cannot attend to both the interaction and the balance between modes.
As one aspect of the present invention, there is provided a multi-modal emotion analysis method based on dynamic gradient and multi-view collaborative attention, including:
acquiring multi-modal data with emotion information, wherein the multi-modal data comprises text, audio and video;
inputting the multi-modal data into a multi-modal emotion analysis model to obtain emotion analysis results, wherein training of the multi-modal emotion analysis model comprises:
performing modal representation learning on the training set data to obtain shallow learning characteristics;
carrying out feature fusion processing on the shallow learning features to obtain deep fusion features, and carrying out dynamic gradient adjustment processing to obtain gradient adjustment parameters, wherein the gradient adjustment parameters are used for assisting the modal representation learning to update the obtained shallow learning features;
model prediction is carried out according to the deep fusion characteristics so as to obtain training results;
repeating the training process according to the verification set data and the test set data to obtain a multi-mode emotion analysis model;
wherein the training set data, the verification set data and the test set data all comprise multi-modal data with emotion information.
Further, performing modal representation learning on the training set data to obtain shallow learning features, including:
Extracting features of the text training set data through a pre-training BERT model to obtain shallow learning features of the text;
and carrying out feature extraction on the audio training set data and the video training set data through the sLSTM model to obtain the audio shallow learning features and the video shallow learning features.
Further, performing feature fusion processing on the shallow learning features to obtain deep fusion features, and performing dynamic gradient adjustment processing to obtain gradient adjustment parameters, including:
carrying out feature fusion processing on the shallow learning features through a multi-view collaborative attention network and an LSTM to obtain deep fusion features;
and realizing gradient adjustment on the shallow learning features through a dynamic gradient adjustment strategy, and obtaining gradient adjustment parameters.
Further, the shallow learning feature is subjected to feature fusion processing through a multi-view collaborative attention network and an LSTM, and deep fusion features are obtained, including:
carrying out multi-view collaborative attention processing on shallow learning features of each two modes to obtain features of the two modes based on the attention of the other party;
and obtaining the deep fusion features by processing, through the LSTM, the features of each mode based on the attention of other modes together with the shallow learning features of that mode.
Further, the shallow learning features of each two modes are subjected to multi-view collaborative attention processing, and the features of the two modes based on the attention of the opposite party are obtained, including:
projecting shallow learning features of one mode in every two modes to three coding spaces through a nonlinear projection layer;
interacting the shallow learning features of the other mode in every two modes with the three coding spaces, and calculating the shared similarity matrix between the shallow learning features of the other mode and the three coding spaces as well as the attention maps of the two modes;
averaging the attention maps of every two modes toward each other to obtain a final attention map;
and obtaining, according to the final attention map, the features of every two modes based on each other's attention.
Further, obtaining the deep fusion features by processing, through the LSTM, the features of each mode based on the attention of other modes together with the shallow learning features of that mode, includes:
splicing according to the features of each mode based on the attention of other modes and the shallow learning features of the mode to obtain fusion features of each mode based on multi-view collaborative attention;
The fusion features of each modality based on multi-view collaborative attention are input to the LSTM to obtain deep fusion features of all modalities.
Further, implementing gradient adjustment on the shallow learning features through a dynamic gradient adjustment strategy, and obtaining gradient adjustment parameters, including:
processing the shallow features of each mode through the corresponding classifier to obtain an approximate prediction result and a loss value for each mode;
and carrying out gradient adjustment according to different strategies and the approximate prediction result and loss value of each mode so as to obtain gradient adjustment parameters.
Further, gradient adjustment is performed according to different strategies and the approximate prediction result and loss value of each mode to obtain gradient adjustment parameters, including:
calculating according to the first strategy and the approximate prediction result of each mode to obtain a contribution degree penalty coefficient;
calculating according to the second strategy and the loss value of each mode to obtain a learning speed penalty coefficient;
wherein the contribution degree penalty coefficient and the learning speed penalty coefficient are both used to update parameters of the feature extractor for each modality;
the first strategy is used for monitoring the contribution degree of each mode to the learning target, and the second strategy is used for monitoring the learning speed of each mode.
Further, calculating a contribution degree penalty coefficient according to the first strategy and the approximate prediction result of each mode includes:
obtaining the approximate prediction result of each mode;
and determining the parameter training optimization speed of the feature extractor corresponding to each mode according to the effect of the mode's approximate prediction result, wherein the effect of the approximate prediction result is inversely proportional to the parameter training optimization speed of the feature extractor.
Further, according to the second strategy and the loss value of each mode, calculating to obtain a learning speed penalty coefficient, including:
and saving the loss value of each mode in real time, and setting a learning speed penalty coefficient for modes with the convergence speed smaller than a preset convergence threshold in the iteration times.
According to the multi-mode emotion analysis method based on dynamic gradient and multi-view collaborative attention, the learning of the characteristics of each mode is assisted through a dynamic gradient mechanism, the learned characteristics of each mode are subjected to characteristic fusion through the multi-view collaborative attention, and emotion prediction is performed by combining with the single-mode characteristics, so that interaction among modes and balance among modes can be focused during emotion analysis, and the multi-mode emotion analysis result is effectively improved.
Drawings
The accompanying drawings are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification, illustrate the invention and together with the description serve to explain, without limitation, the invention.
FIG. 1 is a flow chart of a multi-modal emotion analysis method based on dynamic gradient and multi-view collaborative attention provided by the invention.
Fig. 2 is a flowchart for obtaining shallow learning features according to the present invention.
Fig. 3 is a schematic structural diagram of the sLSTM provided by the present invention.
FIG. 4 is a flowchart of a method for multi-modal emotion analysis based on dynamic gradient and multi-view collaborative attention provided by the present invention.
Fig. 5 is a flow chart of feature fusion provided by the present invention.
Fig. 6 is a flowchart for obtaining deep fusion features according to the present invention.
Fig. 7 is a flowchart for obtaining gradient adjustment parameters according to the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the invention herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In this embodiment, a method for multi-modal emotion analysis based on dynamic gradient and multi-view collaborative attention is provided, and fig. 1 is a flowchart of the method for multi-modal emotion analysis based on dynamic gradient and multi-view collaborative attention provided in an embodiment of the present invention, as shown in fig. 1, including:
s100, acquiring multi-modal data with emotion information, wherein the multi-modal data comprises texts, audios and videos;
In the embodiment of the invention, data carrying emotion information in different forms, mainly text, audio and video, are acquired. Preliminary processing of the data of each modality is performed using the multi-modality processing framework MMSA-FET. Audio data are processed using librosa, yielding 12 Mel-frequency cepstral coefficients (MFCCs) and other low-level acoustic features. Video data are processed using OpenFace, which extracts basic and advanced facial action units. Text data are processed using BERT to obtain contextual word embeddings of the text.
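For illustration only, the following is a minimal standalone sketch of the kind of per-modality preprocessing described above, using librosa for MFCCs and a pre-trained BERT for contextual word embeddings; the sampling rate, model name and feature sizes are assumptions, and the patent itself relies on the MMSA-FET framework together with librosa, OpenFace and BERT.

```python
# Hypothetical sketch of per-modality preprocessing; not the patent's exact pipeline.
import librosa
import torch
from transformers import BertTokenizer, BertModel

def extract_audio_features(wav_path: str) -> torch.Tensor:
    """12 MFCCs per frame, as an example of low-level acoustic features."""
    y, sr = librosa.load(wav_path, sr=16000)               # assumed sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)     # shape: (12, frames)
    return torch.from_numpy(mfcc).T                         # (frames, 12)

def extract_text_features(sentence: str) -> torch.Tensor:
    """Contextual word embeddings from a pre-trained BERT model."""
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
    bert = BertModel.from_pretrained("bert-base-uncased")
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state.squeeze(0)              # (tokens, 768)

# Video features (facial action units) would come from the OpenFace toolkit,
# which is an external command-line tool rather than a Python library.
```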
S200, inputting the multi-modal data into a multi-modal emotion analysis model to obtain an emotion analysis result, wherein training of the multi-modal emotion analysis model comprises the following steps:
s210, performing modal representation learning on training set data to obtain shallow learning characteristics;
in the embodiment of the invention, shallow features are initially extracted from the features of each mode, and the gradient descent speed of the extractor is dynamically adjusted by monitoring the difference of contribution of the shallow features to a learning target and the learning speed, so that the optimization of each mode is adaptively controlled.
Specifically, as shown in fig. 2, performing modal representation learning on the training set data to obtain shallow learning features includes:
s211, extracting features of the text training set data through a pre-training BERT model to obtain shallow text learning features;
The dataset is defined as D = {X_i, Y_i}, i = 1, 2, ..., N, where N is the length of the dataset. The common multi-modal forms are text, audio and video; the initial features obtained by preprocessing the data of these three forms are denoted X_m, the sequence length is denoted L, and the initial feature dimension is denoted D_m, where m ∈ {t, a, v}. Each X_i in the dataset consists of the data of these three modalities, i.e. X_i = {X_t, X_a, X_v}.
S212, extracting features of the audio training set data and the video training set data through a sLSTM model to obtain the shallow learning features of the audio and the shallow learning features of the video.
In the embodiment of the invention, the labels in the CMU-MOSI and CMU-MOSEI datasets are continuous emotion intensities; in the emotion analysis regression task Y_i ∈ [-3, 3], where -3 represents a strong negative emotion and 3 represents a strong positive emotion. The regression task can be converted into a classification task by partitioning the continuous emotion values.
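As a hedged illustration of this regression-to-classification conversion, the sketch below binarizes the continuous labels and also shows a 7-class rounding variant; the exact partitioning used in the experiments is not specified in the text, so the thresholds here are assumptions.

```python
import numpy as np

def to_class_labels(y, num_classes: int = 2) -> np.ndarray:
    """Convert continuous emotion intensities in [-3, 3] into class labels.

    Binary case: non-negative vs. negative; the exact partitioning used in the
    patent's experiments is not spelled out, so this is only an illustrative choice.
    """
    y = np.asarray(y, dtype=np.float32)
    if num_classes == 2:
        return (y >= 0).astype(np.int64)            # 1 = non-negative, 0 = negative
    # 7-class variant often used with CMU-MOSI/MOSEI: round to {-3, ..., 3}
    return (np.clip(np.round(y), -3, 3) + 3).astype(np.int64)

print(to_class_labels([-2.4, 0.0, 1.8]))             # -> [0 1 1]
print(to_class_labels([-2.4, 0.0, 1.8], 7))          # -> [1 3 5]
```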
The modality representation learning layer is to learn the internal feature representation from each modality initial feature. The unified feature extractor may be defined as:
R_m = SubNet(θ_m, X_m),
where SubNet(θ_m, ·) is the feature extraction network of each modality, and d_m denotes the feature dimension of each mode's shallow learning features after feature extraction.
In the embodiment of the invention, the text feature extractor is a pre-trained BERT model, and the audio and video feature extractor is a stacked Long Short-Term Memory network (Stacked Long Short-Term Memory, sLSTM). An LSTM can better capture long-distance dependencies in a sequence by using gate structures to let information pass selectively. As shown in fig. 3, the sLSTM is composed of a plurality of LSTM layers, each comprising a plurality of interconnected LSTM cells; the first layer takes the initial features as input, and each subsequent LSTM layer takes the hidden states of the previous layer as input. The embodiment of the invention stacks two LSTM layers. Compared with a single LSTM, the hidden states in the sLSTM layers can be propagated both through time and to the next layer, which increases the depth of the network and thus the feature representation capability of the feature extractor.
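The following is a minimal sketch of such a two-layer stacked LSTM feature extractor in PyTorch; the batch size, sequence length and feature dimensions are placeholder assumptions, and the full sequence of top-layer hidden states is returned as the shallow feature.

```python
import torch
import torch.nn as nn

class SLSTMExtractor(nn.Module):
    """Two stacked LSTM layers: the hidden states of the first layer are fed to
    the second layer, and the top layer's hidden-state sequence is used as the
    shallow learning feature R_m."""
    def __init__(self, input_dim: int, hidden_dim: int, num_layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim) -> R_m: (batch, seq_len, hidden_dim)
        output, _ = self.lstm(x)
        return output

# Example with assumed dimensions for the audio modality (12 MFCC features):
audio = torch.randn(8, 50, 12)                    # (batch, frames, features)
extractor = SLSTMExtractor(input_dim=12, hidden_dim=64)
print(extractor(audio).shape)                     # torch.Size([8, 50, 64])
```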
S220, carrying out feature fusion processing on the shallow learning features to obtain deep fusion features, and carrying out dynamic gradient adjustment processing to obtain gradient adjustment parameters, wherein the gradient adjustment parameters are used for assisting the modal representation learning to update the obtained shallow learning features;
in an embodiment of the present invention, as shown in fig. 4 and fig. 5, performing feature fusion processing on the shallow learning features to obtain deep fusion features, and performing dynamic gradient adjustment processing to obtain gradient adjustment parameters, where the steps include:
S221, realizing feature fusion processing on the shallow learning features through a multi-view collaborative attention network and an LSTM to obtain deep fusion features;
in the embodiment of the invention, a multi-view collaborative attention mechanism is used to fuse the shallow learning features of different modes from different views. Based on the collaborative attention mechanism, the correlation information between different modes can be captured by calculating a shared similarity matrix between feature vectors, which is then applied to the original feature vectors to realize mutual interaction between the features; in addition, the interaction between any two modalities is considered by creating multiple views, and the resulting attention maps are fused by averaging.
Specifically, as shown in fig. 6, performing feature fusion processing on the shallow learning features to obtain deep fusion features, and performing dynamic gradient adjustment processing to obtain gradient adjustment parameters, where the gradient adjustment parameters are used to assist the modal representation learning to update the obtained shallow learning features may include:
s221a, carrying out multi-view collaborative attention processing on shallow learning features of each two modes to obtain features of the two modes based on the attention of the other party;
In the embodiment of the invention, the method specifically comprises the following steps:
(1) Projecting shallow learning features of one mode in every two modes to three coding spaces through a nonlinear projection layer;
it can be understood herein that, since the embodiment of the present invention has three modes, such as mode a, mode B and mode C, when the shallow learning features of mode a and mode B are fused, the shallow learning features of mode a can be projected to three encoding spaces through the nonlinear projection layer.
First, h nonlinear projection layers are used to map the encoded text features R_t into the encoding space of the video features R_v,
where W^(i) and b^(i) denote the trainable weights and biases of the i-th nonlinear projection layer, respectively, yielding the projected text representation R_t^(i).
It should be appreciated that linearly projecting the features into three spaces and performing the interaction in each space creates multiple attention maps. In the following, the bimodal fusion between the text and video modalities is taken as an example.
(2) Interacting the shallow learning characteristics of the other mode in each two modes with the three coding spaces, and calculating a shared similarity matrix of the shallow learning characteristics of the other mode in each two modes and the three coding spaces and an attention diagram of each two modes;
In the embodiment of the present invention, taking the foregoing description as an example, the shallow learning features of modality B are interacted with the three encoding spaces, and the shared similarity matrix between the shallow learning features of modality B and the three encoding spaces, as well as the attention maps between modality A and modality B, are calculated.
Specifically, a shared similarity matrix is computed that reflects the relationship between the hidden vectors of the text modality and the video modality. Normalizing the shared similarity matrix along the row direction yields an attention map A_v that generates attention over the text for each time step in the video; normalizing it along the column direction yields an attention map A_t that generates attention over the video for each word in the text.
(3) The attention maps of every two modalities toward each other are averaged to obtain the final attention map;
In the embodiment of the present invention, again taking the foregoing description as an example, the multiple attention maps are averaged.
(4) The features of every two modalities based on each other's attention are obtained according to the final attention map.
Based on the attention map A_v, which generates attention over the text for each time step in the video, the attention context of the text features is computed; the context of the video features is computed based on the attention map that generates attention over the video for each word in the text. A nonlinear projection layer is then used to map the attention context of the text features back into the space of the initial text features R_t, obtaining the text feature context based on video attention; the video feature context based on text attention is obtained in the same way.
The collaborative attention between the text and audio features, and the multi-view collaborative attention between the audio and video features, are obtained in the same way.
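The exact projection and normalization formulas are not reproduced in this text, so the following is a minimal sketch of one plausible reading of the multi-view collaborative attention between two modalities (three views, average fusion); the class name, dimensions and the choice to average the attention-weighted contexts rather than the raw attention maps are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewCoAttention(nn.Module):
    """Co-attention between two modalities over h projection views (here h = 3)."""
    def __init__(self, dim_a: int, dim_b: int, num_views: int = 3):
        super().__init__()
        # One nonlinear projection per view, mapping modality A into B's space.
        self.projections = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim_a, dim_b), nn.Tanh()) for _ in range(num_views)]
        )

    def forward(self, feat_a, feat_b):
        # feat_a: (batch, La, dim_a), feat_b: (batch, Lb, dim_b)
        ctx_a, ctx_b = [], []
        for proj in self.projections:
            a_proj = proj(feat_a)                                # (batch, La, dim_b)
            sim = torch.bmm(a_proj, feat_b.transpose(1, 2))      # shared similarity matrix (La, Lb)
            attn_over_b = F.softmax(sim, dim=-1)                 # row-wise normalization
            attn_over_a = F.softmax(sim, dim=1)                  # column-wise normalization
            ctx_a.append(torch.bmm(attn_over_b, feat_b))                       # A's context based on B
            ctx_b.append(torch.bmm(attn_over_a.transpose(1, 2), a_proj))       # B's context based on A
        # Multi-view fusion by averaging the attention-weighted contexts.
        return torch.stack(ctx_a).mean(0), torch.stack(ctx_b).mean(0)

# Example with assumed dimensions (text 768-d, video 64-d):
text, video = torch.randn(8, 40, 768), torch.randn(8, 50, 64)
coatt = MultiViewCoAttention(dim_a=768, dim_b=64)
c_tv, c_vt = coatt(text, video)
print(c_tv.shape, c_vt.shape)                    # (8, 40, 64) (8, 50, 64)
```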
S221b, obtaining the deep fusion features by processing, through the LSTM, the features of each mode based on the attention of other modes together with the shallow learning features of that mode.
In the embodiment of the invention, the method specifically comprises the following steps:
splicing according to the features of each mode based on the attention of other modes and the shallow learning features of the mode to obtain fusion features of each mode based on multi-view collaborative attention;
the fusion features of each modality based on multi-view collaborative attention are input to the LSTM to obtain deep fusion features of all modalities.
The feature contexts obtained under the different attentions and the shallow features are concatenated (stitched) horizontally. Temporal information is then fused into the features by encoding them with the LSTM, realizing the deep fusion of the different modes and obtaining the deep fusion features of each mode. The original shallow learning features of each mode are included to avoid losing too much information in the process of seeking commonalities,
where d_h denotes the hidden-layer feature dimension of the LSTM.
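A minimal sketch of this concatenate-then-encode fusion step is given below, assuming each mode contributes its shallow features plus two attention-based contexts of equal length; all names and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Concatenate a mode's shallow features with its attention-based contexts
    from the other two modes, then encode the result with an LSTM."""
    def __init__(self, feat_dim: int, ctx_dim: int, hidden_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + 2 * ctx_dim, hidden_dim, batch_first=True)

    def forward(self, shallow, ctx_from_m2, ctx_from_m3):
        # shallow: (batch, L, feat_dim); contexts: (batch, L, ctx_dim)
        fused = torch.cat([shallow, ctx_from_m2, ctx_from_m3], dim=-1)  # horizontal stitching
        _, (h_n, _) = self.lstm(fused)
        return h_n[-1]          # deep fusion feature of this mode: (batch, hidden_dim)

# Example with assumed dimensions for the text mode:
shallow_t = torch.randn(8, 40, 768)
ctx_tv, ctx_ta = torch.randn(8, 40, 64), torch.randn(8, 40, 64)
fusion_t = ModalityFusion(feat_dim=768, ctx_dim=64, hidden_dim=128)
print(fusion_t(shallow_t, ctx_tv, ctx_ta).shape)     # torch.Size([8, 128])
```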
S222, realizing gradient adjustment on the shallow learning features through a dynamic gradient adjustment strategy, and obtaining gradient adjustment parameters.
In the embodiment of the invention, in order to balance the performance of the different modes so that each mode can reach its optimal state and the overall multi-modal feature performance is improved, the dynamic gradient mechanism provided by the embodiment of the invention adjusts the optimization process of each mode: the gradient of each mode's feature extractor is adaptively adjusted by monitoring the difference in each mode's contribution to the learning target and each mode's learning speed.
Specifically, as shown in fig. 7, implementing gradient adjustment on the shallow learning features through a dynamic gradient adjustment strategy, and obtaining gradient adjustment parameters, includes:
S222a, processing the shallow features of each mode through the corresponding classifier to obtain an approximate prediction result and a loss value for each mode;
S222b, carrying out gradient adjustment according to different strategies and the approximate prediction result and loss value of each mode so as to obtain gradient adjustment parameters.
In the embodiment of the present invention, specifically, gradient adjustment is performed according to different strategies and the approximate prediction result and loss value of each mode to obtain gradient adjustment parameters, including:
calculating according to the first strategy and the approximate prediction result of each mode to obtain a contribution degree penalty coefficient;
calculating according to the second strategy and the loss value of each mode to obtain a learning speed penalty coefficient;
wherein the contribution degree penalty coefficient and the learning speed penalty coefficient are both used to update parameters of the feature extractor for each modality;
the first strategy is used for monitoring the contribution degree of each mode to the learning target, and the second strategy is used for monitoring the learning speed of each mode.
Further specifically, calculating a contribution degree penalty coefficient according to the first strategy and the approximate prediction result of each mode includes:
obtaining the approximate prediction result of each mode;
and determining the parameter training optimization speed of the feature extractor corresponding to each mode according to the effect of the mode's approximate prediction result, wherein the effect of the approximate prediction result is inversely proportional to the parameter training optimization speed of the feature extractor.
This is specifically understood to mean that the parameters of the feature extractor corresponding to a mode with a high contribution degree, that is, a mode whose approximate prediction result is good, are given an appropriate gradient penalty during training, so that its optimization speed is reduced.
By means of the first strategy, a mode with better performance can be prevented from dominating the optimization process of the whole model and thereby compressing the optimization space of the other modes. The gradient change rate of the well-performing mode is appropriately slowed down and its training time prolonged, providing convergence room for the other modes.
Further specifically, the calculating according to the second strategy and the loss value of each mode, to obtain the learning speed penalty coefficient, includes:
and saving the loss value of each mode in real time, and setting a learning speed penalty coefficient for modes with the convergence speed smaller than a preset convergence threshold in the iteration times.
This is specifically understood to mean that, while the contribution of each mode to the learning target is monitored, the loss value of each mode is also saved, and if a mode converges too slowly or its loss grows within a certain number of iterations, that mode is constrained through the learning speed penalty coefficient.
Through this second strategy, the gradient penalty is increased for modes that show a tendency to overfit. This is necessary because monitoring the contribution degree gives every mode sufficient room to converge, but gaps in feature dimension and convergence speed between modes remain: the well-performing modes are suppressed and their training time prolonged, so the other modes easily overfit.
In the embodiment of the invention, the shallow learning features of the different modes obtained by the mode representation learning layer are input into the feature fusion layer and, at the same time, into an additional classifier. The classification effect of the shallow learning features of each mode is observed through accuracy, and a contribution degree penalty coefficient is designed to represent the classification effect of the shallow learning features of mode m at the i-th iteration.
In the coefficient's expression, a linear classifier of mode m yields the approximate predictive power of the single mode m at the i-th iteration, from which the contribution penalty coefficient of mode m to the learning target is computed. Owing to the characteristics of the tanh function, the better the classification effect, the higher the penalty coefficient.
In addition, the learning speed of each mode is monitored, and a gradient penalty is applied to modes with an overfitting tendency by watching how each mode's loss value changes during training. A common practice for addressing overfitting is to apply regularization or early stopping to the loss value of the model as a whole. However, such conventional practices only constrain the model as a whole and cannot be refined to each mode. Therefore, the embodiment of the invention sets a learning speed penalty coefficient: while the contribution of each mode to the learning target is monitored, the loss value of each mode is saved, and if a mode converges too slowly or its loss grows within a certain number of iterations, that mode is constrained through the learning speed penalty coefficient.
In the coefficient's expression, the loss value of the single mode m at the i-th iteration and the batch interval number I are used to compute the learning speed penalty coefficient of mode m, which estimates whether mode m may be overfitting from the change of its learning speed over I iterations. Owing to the characteristics of the sigmoid function, when the model converges normally, the larger the reduction of the loss value, the smaller the learning speed penalty coefficient.
The dynamic gradient mechanism integrates the two strategies, and the feature extractor parameters of mode m at iteration t are updated as follows:
where λ and μ are hyper-parameters controlling the modulation degree and Loss denotes the loss value of the whole model. Through the two penalty coefficients, the optimization of the better-performing modes is slowed, the weaker modes escape their limited optimization effort and obtain sufficient training, and the problem of single-mode overfitting in a multi-mode model is alleviated.
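The exact penalty expressions are not reproduced in this text, so the sketch below only illustrates one plausible implementation consistent with the description: a tanh-shaped contribution penalty, a sigmoid-shaped learning-speed penalty, and a gradient rescaling of each mode's feature extractor controlled by λ and μ; the argument scalings and the way the two coefficients are combined are assumptions.

```python
import torch

def contribution_penalty(unimodal_acc: float) -> float:
    """Contribution penalty: the better a mode's approximate prediction
    performance, the larger the penalty (slower optimization). tanh form as
    stated in the text; the exact argument scaling is an assumption."""
    return torch.tanh(torch.tensor(unimodal_acc)).item()

def speed_penalty(loss_now: float, loss_i_steps_ago: float) -> float:
    """Learning-speed penalty: a large loss drop over the last I iterations
    (normal convergence) gives a small penalty. sigmoid form as stated in the
    text; the exact argument is an assumption."""
    return torch.sigmoid(torch.tensor(-(loss_i_steps_ago - loss_now))).item()

def modulate_gradients(extractor, k_contrib: float, k_speed: float,
                       lam: float = 0.1, mu: float = 0.1) -> None:
    """Scale one mode's feature-extractor gradients before the optimizer step,
    combining the two penalty coefficients."""
    scale = 1.0 - lam * k_contrib - mu * k_speed     # assumed combination of the penalties
    for p in extractor.parameters():
        if p.grad is not None:
            p.grad.mul_(scale)

# Typical use inside a training step (after loss.backward(), before optimizer.step()):
# modulate_gradients(text_extractor,
#                    contribution_penalty(acc_text),
#                    speed_penalty(loss_text_now, loss_text_prev))
```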
S230, carrying out model prediction according to the deep fusion characteristics to obtain training results;
S240, repeating the training process according to the verification set data and the test set data to obtain a multi-mode emotion analysis model;
wherein the training set data, the verification set data and the test set data all comprise multi-modal data with emotion information.
In the embodiment of the invention, the multi-modal emotion analysis model outputs emotion analysis results for the data, including feedback on whether the data expresses positive or negative emotion, the mean absolute error (MAE) and Pearson correlation coefficient for the regression task, and the classification accuracy and F1 score for the classification task.
In addition, in the embodiment of the present invention, the experimental parameter settings are shown in table 1, and the division of the training, verification and test sets on the different datasets is shown in table 2. The training strategy for the experiments is as follows: after training on the training set, the model is evaluated on the verification set and the model with the best verification performance is saved; if that best result is not exceeded within 8 consecutive iterations, training stops and the saved model is tested on the test set, giving the experimental result.
Table 1 experimental parameter set table
Table 2 data set partitioning table
It should be noted that the dataset CMU-MOSI contains 93 YouTube videos from 89 speakers; each video shows only one person and expresses opinions about movies. Each video contains several segments, and the emotion label of each segment is a number in [-3, 3]: greater than 0 is positive emotion, less than 0 is negative emotion, and equal to 0 is neutral. Positive and negative emotions are graded from weak to strong, with 1, 2 and 3 indicating the emotion intensity.
CMU-MOSEI has a higher number of utterances than CMU-MOSI, and its samples, speakers and topics are more diverse. The dataset contains 23453 annotated video clips from 5000 videos, 1000 different speakers and 250 different topics.
The experimental results of the different models on the CMU-MOSI and CMU-MOSEI datasets are shown in tables 3 and 4. The model provided by the embodiment of the invention is superior to the other reference models in accuracy and F1 on the classification tasks; in particular, compared with the advanced MISA model, accuracy and F1 are improved by 1.07 and 1.11 percentage points respectively on the CMU-MOSI dataset, and by 1.86 and 1.72 percentage points respectively on the CMU-MOSEI dataset, achieving competitive performance. On the CMU-MOSEI dataset the improvement over the other models is larger, because MOSEI is more complex and diverse; the multi-modal emotion analysis model of the embodiment of the invention uses stacked LSTMs in feature representation learning and deepens the model, thereby increasing its learning ability when dealing with more complex and larger data.
The multi-modal emotion analysis model provided by the embodiment of the invention also obtains good performance on regression tasks, exceeds other reference models on the CMU-MOSEI data set, and has an effect 0.054 lower than the pearson correlation coefficient of MISA on the CMU-MOSI data set. This is because the MISA captures the commonality across modalities while mapping the features of each modality to private space, capturing a specific representation of each modality.
TABLE 3 results of experiments on CMU-MOSI datasets
TABLE 4 experimental results on CMU-MOSEI dataset
In summary, the multi-modal emotion analysis method based on dynamic gradient and multi-view collaborative attention provided by the invention helps to learn the characteristics of each mode through a dynamic gradient mechanism, performs characteristic fusion on the learned characteristics of each mode through multi-view collaborative attention, and performs emotion prediction by combining with single-mode characteristics, so that interaction and balance between modes can be focused during emotion analysis, and the multi-modal emotion analysis result is effectively improved.
In the embodiment of the invention, the deep features of each mode under the attention of different modes are obtained through the feature fusion layer based on the multi-view collaborative attention. And splicing the deep features of each mode, taking an average value according to the row, and inputting the average value into a last layer of linear classifier to obtain a predicted value of multi-mode emotion analysis. Specifically, the method can be expressed as:
f(X_i) = W_f · mean(H_agg^LAV) + b,
where H_agg^LAV denotes the result of concatenating the fused features of the modalities along the second dimension.
In order to verify the effectiveness of the two modules of the dynamic gradient mechanism and the multi-view collaborative attention mechanism provided by the embodiment of the invention, the two modules are removed respectively to determine the influence of the two modules on the overall effect of the model. The experimental results on the CMU-MOSI dataset are shown in Table 5, and after the dynamic gradient mechanism is removed, the MAE of the regression task rises, and the pearson correlation coefficient and the F1 value and the accuracy of the classification task are reduced to a small extent. And removing a multi-view collaborative attention mechanism, and directly splicing to predict emotion after independently encoding the characteristics of each mode. After removal, the effect of the model is obviously reduced, which shows that the addition of the multi-view collaborative attention mechanism can effectively learn multi-mode fusion characteristics and improve the performance of the model compared with direct splicing fusion. The reduction amplitude of the multi-view collaborative attention is higher than that of the dynamic gradient, which is probably because the effect of the dynamic gradient is only to help balance the convergence of each mode, and each mode has the basis of feature learning before the dynamic gradient is added, so the improvement effect of the dynamic gradient mechanism is not better than the feature fusion of the multi-view collaborative attention. As can be seen from the above experimental results, removing any module in the model reduces the performance of the model, and fully verifies the necessity of each module of the model provided by the embodiment of the invention to achieve the best experimental effect.
Table 5 ablation experimental results
In order to verify the feature fusion effect of the proposed model, emotion analysis experiments with single-mode, bimodal and trimodal inputs in different combinations were designed and carried out on the CMU-MOSI dataset. In the single-mode experiments, text, audio and video are each input into the model alone; the feature processing of the single mode is unchanged, and the dynamic gradient mechanism has no effect in this setting. In the bimodal experiments, two modes pass through the feature representation learning layer into the feature fusion layer, and the output emotion features are used for emotion prediction. The trimodal experiment is similar to the bimodal one: after the modes are fused in pairs, the fusion features are concatenated for emotion prediction. Since the bimodal and trimodal settings are affected by the dynamic gradient mechanism, a comparison with and without the mechanism was added. The experimental results are shown in table 6 (the "×" in table 6 indicates that the dynamic gradient adjustment mechanism is added). The data in the table show that trimodal emotion classification achieves the best effect, bimodal is second, and single-mode performs worst, which demonstrates the necessity of multi-modal information: multi-modal emotion analysis is superior to single-mode emotion analysis. Among the single-mode experiments, the text mode performs best and carries the most salient emotion features, but it is still inferior to the bimodal and trimodal settings, which shows that the modes are complementary and that mode fusion is important in a multi-modal model; it also confirms the effectiveness of the multi-view collaborative attention mechanism for feature fusion: the more modes involved, the better the correlations among the data can be captured and the better the emotion analysis effect.
Table 6 results of different modality combinations experiments on CMU-MOSI dataset
In order to further analyze the influence of the multi-view collaborative attention, the number of views, and the multi-view fusion strategy on the emotion analysis effect of the model, three-view and two-view comparison experiments were added on the basis of the trimodal experiment; the two-view setting linearly projects the features into two spaces and performs the interaction in each space. Meanwhile, in order to prove the effectiveness of averaging multiple attention maps as the multi-view fusion strategy, a concatenation-based fusion strategy was designed for comparison. The concatenation experiment linearly projects the features into three spaces, performs the collaborative attention operation in each, and concatenates the output features of each mode based on the attention of other modes to predict emotion. The experimental results are shown in table 7. They indicate that the accuracy of two views is slightly lower than that of three views, because three views can capture more varied information. The accuracy of three views drops sharply when the concatenation strategy is chosen, possibly because noise exists in the concatenated fusion features and affects the effect of the model.
TABLE 7 comparative experimental results on CMU-MOSI dataset
In order to verify the effect of using the pre-trained BERT model and the sLSTM in the modal representation learning layer, a comparison experiment was designed in which an ordinary LSTM is adopted for the audio and video modes; the BERT pre-trained model is kept for text, because BERT adopts a Transformer encoder structure as its feature extractor and is trained with the MLM masked-language-model objective for word-sense understanding, giving it strong semantic information extraction capability. The experimental results are shown in table 7: the accuracy when using the LSTM is clearly lower, by 4.88 percentage points, than when using the sLSTM, which demonstrates the feature representation capability of the stacked LSTM. The hidden states in the stacked LSTM layers can be propagated through time and also passed to the next layer; this hierarchical structure can represent more complex temporal patterns and thus capture information at different scales, and a deeper network structure outperforms a shallower one on some tasks.
In addition, from the series of comparative experiments in table 6 on the bimodal and trimodal settings with and without the dynamic gradient mechanism, it can be seen that using the dynamic gradient mechanism yields a small improvement over not using it, and that the fewer the modes, the larger the improvement, possibly because the penalty coefficient is affected by the number of modes when the contribution degree of the different modes is calculated. The dynamic gradient strategy works by monitoring the difference in each mode's contribution to the learning target and each mode's learning speed, and the modulation-degree hyper-parameters of the two strategies are selected by grid search. In order to explore the effectiveness of the two strategies of the dynamic gradient mechanism, an ablation experiment on the dynamic gradient strategy was designed. The experimental results are shown in table 7. It can be observed that using only one strategy is less effective than not using the dynamic gradient mechanism at all, and that monitoring only the learning speed reduces accuracy by 1.75 percentage points compared with monitoring only the contribution difference; this is because monitoring only the learning speed can impose too large a penalty when a mode occasionally converges or fails to converge, making the learning of that mode unbalanced. The two strategies cooperate and balance each other, and only together do they achieve the effect of the dynamic gradient mechanism.
In summary, the multi-modal emotion analysis method based on dynamic gradient and multi-view collaborative attention provided by the embodiment of the invention has the following advantages:
1) The multi-mode emotion analysis model integrating the dynamic gradient mechanism and the multi-view collaborative attention mechanism effectively addresses two problems: single-mode emotion analysis ignores the multi-dimensional nature of human emotion, and the feature fusion schemes adopted by most existing studies fail to properly balance intra-mode feature extraction and inter-mode feature interaction.
2) By monitoring the difference of contribution degree of each mode to a learning target in the optimization process of each mode in model training, the convergence speed of each mode is dynamically controlled, so that the characteristics in each mode are fully learned, and the problem of optimization unbalance easily occurring in multi-mode emotion analysis can be effectively solved.
3) Through the multi-view collaborative attention mechanism, different views are constructed between every two modes for bidirectional interaction and long-distance context information of every two modes is learned; by attending to the features of the other mode, the important parts of a mode's own features are emphasized, enriching the feature information of each mode and effectively solving the problem of poor interactivity between different modes in existing multi-mode emotion analysis methods.
It is to be understood that the above embodiments are merely illustrative of the application of the principles of the present invention, but not in limitation thereof. Various modifications and improvements may be made by those skilled in the art without departing from the spirit and substance of the invention, and are also considered to be within the scope of the invention.
Claims (7)
1. A multi-modal emotion analysis method based on dynamic gradients and multi-view collaborative attention, comprising:
acquiring multi-modal data with emotion information, wherein the multi-modal data comprises text, audio and video;
inputting the multi-modal data into a multi-modal emotion analysis model to obtain emotion analysis results, wherein training of the multi-modal emotion analysis model comprises:
performing modal representation learning on the training set data to obtain shallow learning characteristics;
carrying out feature fusion processing on the shallow learning features to obtain deep fusion features, and carrying out dynamic gradient adjustment processing to obtain gradient adjustment parameters, wherein the gradient adjustment parameters are used for assisting the modal representation learning to update the obtained shallow learning features;
Model prediction is carried out according to the deep fusion characteristics so as to obtain training results;
repeating the training process according to the verification set data and the test set data to obtain a multi-mode emotion analysis model;
wherein the training set data, the verification set data and the test set data all comprise multi-modal data with emotion information;
performing modal representation learning on the training set data to obtain shallow learning features, including:
extracting features of the text training set data through a pre-training BERT model to obtain shallow learning features of the text;
carrying out feature extraction on the audio training set data and the video training set data through a sLSTM model to obtain audio shallow learning features and video shallow learning features;
the performing feature fusion processing on the shallow learning features to obtain the deep fusion features and performing dynamic gradient adjustment processing to obtain the gradient adjustment parameters comprises:
performing the feature fusion processing on the shallow learning features through a multi-view collaborative attention network and an LSTM (long short-term memory) network to obtain the deep fusion features;
implementing gradient adjustment on the shallow learning features through a dynamic gradient adjustment strategy to obtain the gradient adjustment parameters;
the performing the feature fusion processing on the shallow learning features through the multi-view collaborative attention network and the LSTM to obtain the deep fusion features comprises:
obtaining, from the shallow learning features and through the multi-view collaborative attention network, the features of each modality based on the attention of the other modalities;
splicing the features of each modality based on the attention of the other modalities with the shallow learning features of the current modality to obtain the multi-view-collaborative-attention-based fusion features of each modality;
and inputting the multi-view-collaborative-attention-based fusion features of each modality into the LSTM to obtain the deep fusion features of all modalities;
the implementing gradient adjustment on the shallow learning features through the dynamic gradient adjustment strategy to obtain the gradient adjustment parameters comprises: obtaining a contribution degree penalty coefficient by monitoring the contribution degree of each modality to the learning target, obtaining a learning speed penalty coefficient by monitoring the learning speed of each modality, and adaptively adjusting the gradient of each modality's feature extractor according to the contribution degree penalty coefficient and the learning speed penalty coefficient;
wherein the expression of the contribution degree penalty coefficient is defined in terms of the linear classifier of modality m, the approximate predictive power of the single modality m at the i-th iteration, the contribution degree penalty coefficient of modality m towards the learning target, and N, the length of the data set;
and the expression of the learning speed penalty coefficient is defined in terms of the loss value of the single modality m at the i-th iteration, I, the number of batch intervals, and the learning speed penalty coefficient of modality m.
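As a minimal sketch of the dynamic gradient adjustment described in claim 1, the snippet below down-weights the gradients of the feature extractor of any modality whose approximate unimodal prediction already dominates. The exact penalty expressions are defined in the claims by formulas that are not reproduced here, so the tanh-based mapping, the ratio-to-mean heuristic, the parameter `alpha`, and the function names `contribution_penalty` and `apply_gradient_adjustment` are illustrative assumptions rather than the patented formulas.

```python
import math

def contribution_penalty(approx_scores, alpha=0.1):
    """Hypothetical contribution-degree penalty: modalities whose approximate
    unimodal prediction score exceeds the average of all modalities receive a
    coefficient below 1, so their feature extractors are optimised more slowly
    (better prediction effect -> slower optimisation, cf. claim 6)."""
    mean_score = sum(approx_scores.values()) / len(approx_scores)
    coeffs = {}
    for modality, score in approx_scores.items():
        ratio = score / (mean_score + 1e-8)
        coeffs[modality] = 1.0 - math.tanh(alpha * max(ratio - 1.0, 0.0))
    return coeffs

def apply_gradient_adjustment(extractors, coeffs):
    """Scale the gradients of each modality's feature extractor in place;
    intended to be called between loss.backward() and optimizer.step()."""
    for modality, module in extractors.items():
        for p in module.parameters():
            if p.grad is not None:
                p.grad.mul_(coeffs[modality])
```

In a training loop, `approx_scores` would come from the per-modality linear classifiers applied to the shallow learning features, and `extractors` would map each modality name to its BERT or sLSTM feature extractor.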
2. The multi-modal emotion analysis method based on dynamic gradient and multi-view collaborative attention according to claim 1, wherein performing the feature fusion processing on the shallow learning features through the multi-view collaborative attention network and the LSTM to obtain the deep fusion features comprises:
performing multi-view collaborative attention processing on the shallow learning features of every two modalities to obtain the features of the two modalities based on each other's attention;
and obtaining the deep fusion features through the LSTM according to the features of each modality based on the attention of the other modalities and the shallow learning features of the current modality.
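A minimal sketch of the fusion step of claims 1 and 2, assuming three modalities: each modality's shallow sequence features are spliced with the two views obtained by attending to the other modalities and summarised by an LSTM whose final hidden state serves as that modality's deep fusion feature. The class name `CoAttentionFusion` and the dimensions are assumptions for illustration, not the patent's implementation.

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Concatenate a modality's shallow features with the features it obtains
    by attending to the other two modalities, then run an LSTM over the
    concatenated sequence and keep the last hidden state as the deep fusion
    feature of that modality."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        # shallow features + two attended views -> 3 * feat_dim per time step
        self.lstm = nn.LSTM(input_size=3 * feat_dim, hidden_size=hidden_dim,
                            batch_first=True)

    def forward(self, shallow, attended_1, attended_2):
        # all inputs: (batch, seq_len, feat_dim)
        fused = torch.cat([shallow, attended_1, attended_2], dim=-1)
        _, (h_n, _) = self.lstm(fused)
        return h_n[-1]  # (batch, hidden_dim)
```

Running this module once per modality and concatenating the three outputs would yield the deep fusion features of all modalities that are passed to the prediction head.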
3. The multi-modal emotion analysis method based on dynamic gradient and multi-view collaborative attention according to claim 2, wherein performing the multi-view collaborative attention processing on the shallow learning features of every two modalities to obtain the features of the two modalities based on each other's attention comprises:
projecting the shallow learning features of one modality of every two modalities into three coding spaces through a nonlinear projection layer;
interacting the shallow learning features of the other modality of every two modalities with the three coding spaces, and calculating a shared similarity matrix between the shallow learning features of the other modality and the three coding spaces, as well as the attention maps of the two modalities towards each other;
averaging the attention maps of every two modalities towards each other to obtain a final attention map;
and obtaining the features of every two modalities based on each other's attention according to the final attention map.
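A minimal sketch of the pairwise multi-view collaborative attention of claim 3, assuming the three coding spaces are three nonlinear projections of one modality, the per-space similarity matrices are averaged into a final attention map, and each modality then attends to the other. The class name `PairwiseCoAttention`, the projection sizes, and the choice of value projections are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseCoAttention(nn.Module):
    """Bidirectional co-attention between two modalities: modality X is sent
    through three nonlinear coding spaces, modality Y interacts with each of
    them to form similarity matrices, the resulting maps are averaged into a
    final attention map, and each modality attends to the other's features."""
    def __init__(self, dim_x, dim_y, dim_proj):
        super().__init__()
        # Three nonlinear coding spaces for modality X (cf. claim 3).
        self.proj_x = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim_x, dim_proj), nn.Tanh()) for _ in range(3)])
        self.proj_y = nn.Sequential(nn.Linear(dim_y, dim_proj), nn.Tanh())

    def forward(self, x, y):
        # x: (batch, len_x, dim_x); y: (batch, len_y, dim_y)
        y_p = self.proj_y(y)
        x_p = [proj(x) for proj in self.proj_x]
        # One shared similarity matrix per coding space: (batch, len_y, len_x).
        sims = [torch.bmm(y_p, xp.transpose(1, 2)) for xp in x_p]
        # Average over the three coding spaces to obtain the final attention map.
        final_map = torch.stack(sims, dim=0).mean(dim=0)
        attn_y_over_x = F.softmax(final_map, dim=-1)
        attn_x_over_y = F.softmax(final_map.transpose(1, 2), dim=-1)
        # Features of each modality based on the attention of the other one;
        # using the first coding space as the value of X is an assumption.
        x_attended = torch.bmm(attn_x_over_y, y_p)      # (batch, len_x, dim_proj)
        y_attended = torch.bmm(attn_y_over_x, x_p[0])   # (batch, len_y, dim_proj)
        return x_attended, y_attended
```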
4. The method of claim 1, wherein implementing the gradient adjustment on the shallow learning features through the dynamic gradient adjustment strategy and obtaining the gradient adjustment parameters comprises:
processing the shallow learning features of each modality through the corresponding classifier to obtain an approximate prediction result and a loss value of each modality;
and carrying out gradient adjustment according to different strategies together with the approximate prediction result and the loss value of each modality to obtain the gradient adjustment parameters.
5. The method of claim 4, wherein carrying out the gradient adjustment according to the different strategies together with the approximate prediction result and the loss value of each modality to obtain the gradient adjustment parameters comprises:
calculating according to a first strategy and the approximate prediction result of each modality to obtain the contribution degree penalty coefficient;
calculating according to a second strategy and the loss value of each modality to obtain the learning speed penalty coefficient;
wherein the contribution degree penalty coefficient and the learning speed penalty coefficient are both used to update the parameters of the feature extractor of each modality;
the first strategy is used for monitoring the contribution degree of each modality to the learning target, and the second strategy is used for monitoring the learning speed of each modality.
6. The multi-modal emotion analysis method based on dynamic gradient and multi-view collaborative attention according to claim 5, wherein calculating according to the first strategy and the approximate prediction result of each modality to obtain the contribution degree penalty coefficient comprises:
obtaining the approximate prediction result of each modality;
and determining the parameter training optimization speed of the feature extractor corresponding to the current modality according to the effect of the approximate prediction result of each modality, wherein the effect of the approximate prediction result is inversely proportional to the parameter training optimization speed of the feature extractor.
7. The multi-modal emotion analysis method based on dynamic gradient and multi-view collaborative attention according to claim 5, wherein calculating according to the second strategy and the loss value of each modality to obtain the learning speed penalty coefficient comprises:
saving the loss value of each modality in real time, and setting the learning speed penalty coefficient for the modalities whose convergence speed within the given number of iterations is smaller than a preset convergence threshold.
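A minimal sketch of the loss bookkeeping of claims 5 and 7, following the claim wording that a learning speed penalty coefficient is set for modalities whose convergence speed over a number of iterations falls below a preset convergence threshold. The class name `LearningSpeedMonitor`, the interval-averaged definition of convergence speed, and the fixed penalty value are assumptions for illustration.

```python
from collections import deque

class LearningSpeedMonitor:
    """Stores the unimodal loss values of each modality in real time and,
    once two full batch intervals are available, estimates each modality's
    convergence speed as the relative drop of the interval-averaged loss.
    Modalities whose speed is below the preset threshold are assigned the
    learning speed penalty coefficient; the others keep a coefficient of 1."""
    def __init__(self, modalities, interval=10, threshold=0.05, penalty=0.5):
        self.interval = interval
        self.threshold = threshold
        self.penalty = penalty
        self.history = {m: deque(maxlen=2 * interval) for m in modalities}

    def update(self, unimodal_losses):
        # unimodal_losses: dict mapping modality name -> scalar loss value
        for m, value in unimodal_losses.items():
            self.history[m].append(float(value))

    def coefficients(self):
        coeffs = {}
        for m, hist in self.history.items():
            if len(hist) < 2 * self.interval:
                coeffs[m] = 1.0
                continue
            values = list(hist)
            previous = sum(values[:self.interval]) / self.interval
            recent = sum(values[self.interval:]) / self.interval
            speed = (previous - recent) / (previous + 1e-8)
            coeffs[m] = self.penalty if speed < self.threshold else 1.0
        return coeffs
```

The coefficients returned here would be combined with the contribution degree penalty coefficients (see the sketch after claim 1) when scaling the gradients of the per-modality feature extractors.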
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310257670.0A CN116204850B (en) | 2023-03-14 | 2023-03-14 | Multi-mode emotion analysis method based on dynamic gradient and multi-view collaborative attention |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116204850A CN116204850A (en) | 2023-06-02 |
CN116204850B true CN116204850B (en) | 2023-11-03 |
Family
ID=86511199
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310257670.0A Active CN116204850B (en) | 2023-03-14 | 2023-03-14 | Multi-mode emotion analysis method based on dynamic gradient and multi-view collaborative attention |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116204850B (en) |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11405337B2 (en) * | 2020-09-23 | 2022-08-02 | Capital One Services, Llc | Systems and methods for generating dynamic conversational responses using ensemble prediction based on a plurality of machine learning models |
US20220292827A1 (en) * | 2021-03-09 | 2022-09-15 | The Research Foundation For The State University Of New York | Interactive video surveillance as an edge service using unsupervised feature queries |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109409222A (en) * | 2018-09-20 | 2019-03-01 | 中国地质大学(武汉) | A kind of multi-angle of view facial expression recognizing method based on mobile terminal |
CN112347932A (en) * | 2020-11-06 | 2021-02-09 | 天津大学 | Point cloud-multi-view fused three-dimensional model identification method |
CN112784798A (en) * | 2021-02-01 | 2021-05-11 | 东南大学 | Multi-modal emotion recognition method based on feature-time attention mechanism |
CN113420807A (en) * | 2021-06-22 | 2021-09-21 | 哈尔滨理工大学 | Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method |
CN113627544A (en) * | 2021-08-16 | 2021-11-09 | 沈阳工业大学 | Machine tool milling cutter state identification method based on multi-source heterogeneous data fusion |
CN114170461A (en) * | 2021-12-02 | 2022-03-11 | 匀熵教育科技(无锡)有限公司 | Teacher-student framework image classification method containing noise labels based on feature space reorganization |
CN115238731A (en) * | 2022-06-13 | 2022-10-25 | 重庆邮电大学 | Emotion identification method based on convolution recurrent neural network and multi-head self-attention |
CN115034227A (en) * | 2022-06-28 | 2022-09-09 | 西安交通大学 | Progressive multi-task emotion analysis method based on multi-mode mutual attention fusion |
CN115496226A (en) * | 2022-09-29 | 2022-12-20 | 中国电信股份有限公司 | Multi-modal emotion analysis method, device, equipment and storage based on gradient adjustment |
CN115577161A (en) * | 2022-10-14 | 2023-01-06 | 徐州达希能源技术有限公司 | Multi-mode emotion analysis model fusing emotion resources |
Non-Patent Citations (5)
Title |
---|
Deep Cross-modal Hashing Based on Semantic Consistent Ranking; Xiaoqing Liu et al.; IEEE Transactions on Multimedia; full text *
Learning to Balance the Learning Rates Between Various Modalities via Adaptive Tracking Factor; Ya Sun et al.; IEEE Signal Processing Letters, vol. 28; full text *
Nonlinear Dynamic System Identification of ARX Model for Speech Signal Identification; Rakesh Kumar Pattanaik et al.; Computer Systems Science and Engineering, vol. 46, no. 1; full text *
Hierarchical interactive fusion multi-modal sentiment analysis based on attention mechanism; Li Wenxue et al.; Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), vol. 35, no. 1; full text *
Multi-label text classification using label combination and fused attention; Wu Xinke et al.; Computer Engineering and Applications, vol. 59, no. 6; full text *
Also Published As
Publication number | Publication date |
---|---|
CN116204850A (en) | 2023-06-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111275085B (en) | Online short video multi-modal emotion recognition method based on attention fusion | |
Wang et al. | Application of convolutional neural network in natural language processing | |
CN110083705B (en) | Multi-hop attention depth model, method, storage medium and terminal for target emotion classification | |
Liu et al. | Topic-aware contrastive learning for abstractive dialogue summarization | |
Mishra et al. | The understanding of deep learning: A comprehensive review | |
CN114398961A (en) | Visual question-answering method based on multi-mode depth feature fusion and model thereof | |
CN112000818A (en) | Cross-media retrieval method and electronic device for texts and images | |
Zhang | Voice keyword retrieval method using attention mechanism and multimodal information fusion | |
Guo et al. | Implicit discourse relation recognition via a BiLSTM-CNN architecture with dynamic chunk-based max pooling | |
CN114140885A (en) | Emotion analysis model generation method and device, electronic equipment and storage medium | |
CN114428850A (en) | Text retrieval matching method and system | |
Ruwa et al. | Mood-aware visual question answering | |
Li et al. | MIA-Net: Multi-modal interactive attention network for multi-modal affective analysis | |
Goutsu et al. | Classification of multi-class daily human motion using discriminative body parts and sentence descriptions | |
Goutsu et al. | Linguistic descriptions of human motion with generative adversarial seq2seq learning | |
Zhang et al. | AdaMoW: Multimodal sentiment analysis based on adaptive modality-specific weight fusion network | |
Sun et al. | Rumour detection technology based on the BiGRU_capsule network | |
AlBadawy et al. | Joint discrete and continuous emotion prediction using ensemble and end-to-end approaches | |
Liu et al. | A multimodal approach for multiple-relation extraction in videos | |
CN116663523B (en) | Semantic text similarity calculation method for multi-angle enhanced network | |
Huang et al. | Knowledge distilled pre-training model for vision-language-navigation | |
CN116204850B (en) | Multi-mode emotion analysis method based on dynamic gradient and multi-view collaborative attention | |
CN114757310B (en) | Emotion recognition model and training method, device, equipment and readable storage medium thereof | |
CN113779938B (en) | System and method for generating coherent stories based on visual and theme cooperative attention | |
Agarwal et al. | From multimodal to unimodal attention in transformers using knowledge distillation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |