CN113158727A - Bimodal fusion emotion recognition method based on video and voice information - Google Patents
Bimodal fusion emotion recognition method based on video and voice information
- Publication number: CN113158727A
- Application number: CN202011613947.1A
- Authority: CN (China)
- Prior art keywords: voice, emotion, features, information, feature
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/174 — Facial expression recognition (G06V40/16 Human faces; G06V40/10 Human or animal bodies; G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data)
- G06V40/176 — Dynamic expression (same hierarchy as G06V40/174)
- G06F18/24 — Classification techniques (G06F18/20 Analysing; G06F18/00 Pattern recognition)
- G06F18/25 — Fusion techniques (G06F18/20 Analysing; G06F18/00 Pattern recognition)
- G06N3/08 — Learning methods (G06N3/02 Neural networks; G06N3/00 Computing arrangements based on biological models)
- G10L25/63 — Speech or voice analysis techniques specially adapted for estimating an emotional state (G10L25/51 for comparison or discrimination; G10L25/48 specially adapted for particular use; G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00)
Abstract
The invention provides a bimodal fusion emotion recognition method based on video information and voice information. Face image features and voice features are extracted from the video and voice information, the face image feature vectors and voice feature vectors are normalized, and the processed features are fed into a Bi-GRU network for training; the input features of the two single-modality sub-networks are then used to compute the weight of the state information at each moment. The input features of the two single-modality sub-networks are fused into a multi-modal joint feature vector, which is used as the input of a pre-trained deep neural network containing an emotion classifier. The classifier outputs emotion evaluation information of different categories, so that the user's emotion evaluation is more objective and of greater reference value, and a more accurate emotion recognition result is obtained.
Description
Technical Field
The application relates to the field of emotion recognition, in particular to a bimodal fusion emotion recognition method based on video and voice information.
Background
In general, the way humans naturally communicate and express emotions is multimodal: we can express emotions verbally or visually. When emotion is conveyed mainly through tone of voice, the audio data carries the principal cues for emotion recognition; when emotion is conveyed mainly through the face, most of the cues needed to mine emotion reside in the face images. Exploiting multimodal information such as facial expressions, speech intonation and language content together is therefore an interesting and challenging problem.
Emotion computing in the traditional research direction has mainly been studied for a single modality, such as speech emotion, video motion or face images. Although these conventional single-modality emotion recognition approaches have made significant progress in their respective fields and are used in practical applications, human emotional expression is complex and diverse; judging a person's emotion from a single form of expression yields a one-sided, unconvincing result and loses a great deal of valuable emotional information.
With the deepening development of artificial intelligence technology in the information era, more and more attention is being paid to research on emotion computing. However, human emotion is complex and variable, and the accuracy of judging emotional characteristics from a single kind of information alone is low; the present invention is proposed to improve this accuracy.
Disclosure of Invention
The invention aims to provide a bimodal fusion emotion recognition method based on video and voice information, which makes full use of bimodal fusion to obtain richer emotion information and can judge the emotional state from the user's voice and facial expression information.
The technical solution for realizing the purpose of the invention is as follows: a bimodal fusion emotion recognition method based on video information and voice information comprises the following steps:
Step one, video information and voice information are acquired through the camera and microphone of external equipment, feature extraction is performed on the video information and the voice information, and facial image features and voice features are extracted respectively.
Step two, after the video information and the voice information are obtained, the method further comprises: preprocessing the video information and the voice information.
Step three, the preprocessing of the initial video information specifically comprises the following steps (an illustrative code sketch follows the list):
1) acquiring the video file to be processed and parsing it to obtain video frames;
2) generating a histogram for each video frame based on its pixel information and determining the frame's definition; clustering the video frames according to the histogram and an edge detection operator to obtain at least one class; filtering out repeated video frames within each class and video frames whose definition is below a definition threshold;
3) performing face detection, alignment, rotation and resizing on the remaining video frames with a convolutional-neural-network-based method to obtain face images.
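A minimal sketch of this video preprocessing stage is given below, assuming OpenCV is available. The Laplacian-variance sharpness measure, the histogram-correlation duplicate test and the Haar-cascade face detector are illustrative stand-ins for the histogram/edge-operator clustering and the CNN-based detector named above; the thresholds and the frame-sampling step are assumptions.

```python
import cv2

def extract_frames(video_path, step=5):
    """Parse the video file into frames, keeping every `step`-th frame."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

def sharpness(frame):
    """Sharpness proxy: variance of the Laplacian of the grey-scale frame."""
    grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(grey, cv2.CV_64F).var()

def filter_frames(frames, sharp_thresh=50.0, dup_thresh=0.95):
    """Drop blurred frames and near-duplicates (high colour-histogram correlation)."""
    kept, prev_hist = [], None
    for frame in frames:
        if sharpness(frame) < sharp_thresh:
            continue
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None and \
                cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) > dup_thresh:
            continue  # near-duplicate of the previously kept frame
        kept.append(frame)
        prev_hist = hist
    return kept

def crop_faces(frames, size=(224, 224)):
    """Detect, crop and resize faces; a Haar cascade stands in for the CNN detector."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = []
    for frame in frames:
        grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in detector.detectMultiScale(grey, 1.1, 5):
            faces.append(cv2.resize(frame[y:y + h, x:x + w], size))
    return faces
```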
Step four, the preprocessing of the initial voice information specifically comprises the following steps (a code sketch follows the list):
1) pre-emphasizing the collected voice information to flatten the frequency spectrum of the signal;
2) framing and windowing the collected voice signal to obtain voice analysis frames;
3) applying a short-time Fourier transform to the voice analysis frames to obtain a voice spectrogram.
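A minimal sketch of steps 1)–3) with NumPy and SciPy is shown below; the pre-emphasis coefficient (0.97) and the 25 ms / 10 ms frame length and shift are common defaults assumed here, not values specified by the method.

```python
import numpy as np
from scipy.signal import stft

def speech_spectrogram(signal, sr=16000, alpha=0.97,
                       frame_len=0.025, frame_shift=0.010):
    """Pre-emphasis -> framing and Hamming windowing -> short-time Fourier transform."""
    # 1) Pre-emphasis flattens the spectrum: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # 2) + 3) The STFT's nperseg/noverlap realise the framing; a Hamming window is applied per frame
    nperseg = int(sr * frame_len)
    noverlap = nperseg - int(sr * frame_shift)
    _, _, spec = stft(emphasized, fs=sr, window="hamming",
                      nperseg=nperseg, noverlap=noverlap)

    # Log-magnitude spectrogram used as the two-dimensional speech "image"
    return np.log(np.abs(spec) + 1e-8)
```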
Step five, the extraction of the facial image features specifically comprises the following steps (an illustrative code sketch follows the list):
1) preparing a pre-trained deep convolutional neural network model, which comprises a pooling layer, a fully connected layer, a dropout layer in front of the fully connected layer and a softmax layer behind the fully connected layer;
2) inputting the face image into the pre-trained image feature model for processing; the feature vector output by the fully connected layer of the model is the face image feature vector.
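The sketch below illustrates this layout in PyTorch. The backbone, the feature dimension and the number of emotion classes are assumptions, since the method does not fix them; the point is that the output of the fully connected layer (with dropout in front and softmax behind) is taken as the face image feature vector.

```python
import torch
import torch.nn as nn

class FaceFeatureNet(nn.Module):
    """Conv backbone -> pooling -> dropout -> fully connected layer -> softmax classifier."""
    def __init__(self, n_classes=7, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(               # assumed small conv stack
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4),                  # pooling layer
        )
        self.dropout = nn.Dropout(0.5)                # dropout layer in front of the FC layer
        self.fc = nn.Linear(64 * 4 * 4, feat_dim)     # fully connected layer -> feature vector
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, x):
        h = self.backbone(x).flatten(1)
        feat = self.fc(self.dropout(h))               # face image feature vector
        logits = self.classifier(feat)                # softmax layer sits behind the FC layer
        return feat, logits

# Usage: feed a cropped 3 x 224 x 224 face and keep `feat` as the modality feature
model = FaceFeatureNet()
feat, logits = model(torch.randn(1, 3, 224, 224))
probs = torch.softmax(logits, dim=1)
```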
Step six, the extraction of the voice features specifically comprises the following steps (see the code sketch after the list):
1) pre-emphasizing the voice digital signal by using a first-order high-pass FIR digital filter, and performing frame processing on the pre-emphasized voice data by using a short-time analysis technology to obtain a voice characteristic parameter time sequence;
2) windowing the voice characteristic parameter time sequence by using a Hamming window function to obtain voice windowing data, and performing endpoint detection on the voice windowing data by using a double-threshold comparison method to obtain preprocessed voice data;
3) carrying out short-time Fourier transform on the preprocessed voice data to obtain a voice spectrogram;
4) feeding the voice spectrogram into a pre-trained AlexNet network to extract the voice feature data;
5) performing Correlation Feature Selection (CFS) on the extracted feature data and filtering out irrelevant features that have little correlation with the category label, to obtain the final voice features.
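A sketch of steps 4)–5) is given below, assuming a recent torchvision whose pre-trained AlexNet keeps its fourth convolution layer at index 8 of the `features` block. The correlation filter shown keeps features whose absolute correlation with the emotion label exceeds a threshold; it is a simplified stand-in for a full CFS subset search.

```python
import numpy as np
import torch
import torch.nn.functional as F
import torchvision.models as models

# Pre-trained AlexNet; slicing up to index 10 keeps Conv1..Conv4 and their activations
alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
conv4_extractor = alexnet.features[:10]
conv4_extractor.eval()

def speech_features(spectrogram):
    """Spectrogram (H x W array) -> flattened AlexNet Conv4 activations."""
    x = torch.tensor(spectrogram, dtype=torch.float32)
    x = x.unsqueeze(0).repeat(3, 1, 1).unsqueeze(0)          # 1 x 3 x H x W
    x = F.interpolate(x, size=(224, 224), mode="bilinear",
                      align_corners=False)                    # AlexNet input size
    with torch.no_grad():
        fmap = conv4_extractor(x)
    return fmap.flatten().numpy()

def select_by_correlation(X, y, threshold=0.1):
    """Simplified CFS-style filter: drop features weakly correlated with the label."""
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if col.std() == 0:
            continue                                          # constant feature, uninformative
        if abs(np.corrcoef(col, y)[0, 1]) >= threshold:
            keep.append(j)
    return X[:, keep], keep
```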
Step seven, normalization processing is performed on the extracted face image features and voice features.
Step eight, the normalized feature vectors are respectively fed into a Bi-GRU network for training, features are extracted through the network's max pooling and average pooling layers, and the correlation among the multi-modal state information and the attention distribution of each modality at each moment are then calculated.
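A minimal Bi-GRU encoder for one modality is sketched below in PyTorch; the hidden size and the use of one encoder per modality are assumptions. The per-moment outputs are the state information used in step nine, and the concatenated max/average pooling gives the modality's summary features.

```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """Bi-GRU over a normalized feature sequence, followed by max and average pooling over time."""
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                         # x: (batch, time, in_dim)
        states, _ = self.gru(x)                   # (batch, time, 2 * hidden): state info per moment
        max_pool = states.max(dim=1).values
        avg_pool = states.mean(dim=1)
        pooled = torch.cat([max_pool, avg_pool], dim=1)
        return states, pooled

# One encoder per single-modality sub-network (feature dimensions are assumed)
video_encoder = BiGRUEncoder(in_dim=256)
audio_encoder = BiGRUEncoder(in_dim=256)
```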
Step nine, calculating the correlation among the multi-modal state information and the attention distribution of each modality at each moment specifically comprises the following steps (a code sketch follows the list):
1) Because the state information between the modalities is taken into account, the weights attend to the state information of both modalities simultaneously, i.e. to the correlation between the state information of the two single-modality sub-networks at each moment. The correlation s_i is calculated as

s_i = V · tanh(w_v · h_v^i + w_a · h_a^i + b_1) + b_2

where h_v^i is the state information output by the Bi-GRU network of the video-modality sub-network at moment i and w_v is its correlation weight; h_a^i is the state information output by the Bi-GRU network of the speech-modality sub-network at moment i and w_a is its correlation weight; b_1 is the correlation bias, b_2 is the fusion bias, V is the multimodal fusion weight, and tanh is the activation function.
2) From the multimodal state information s_i, the attention distribution at each moment in the multi-modality, i.e. the weight a_i corresponding to the state information, is calculated as

a_i = softmax(s_i) = exp(s_i) / Σ_j exp(s_j)

where softmax is the normalized exponential function.
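The sketch below implements the two formulas above in PyTorch, assuming the two Bi-GRU state sequences have been aligned to the same number of moments; the attention dimension is an assumption.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """s_i = V . tanh(W_v h_v^i + W_a h_a^i + b_1) + b_2 ;  a_i = softmax(s_i) over the moments."""
    def __init__(self, video_dim, audio_dim, att_dim=128):
        super().__init__()
        self.W_v = nn.Linear(video_dim, att_dim, bias=False)   # correlation weight of the video states
        self.W_a = nn.Linear(audio_dim, att_dim, bias=False)   # correlation weight of the audio states
        self.b1 = nn.Parameter(torch.zeros(att_dim))           # correlation bias
        self.V = nn.Linear(att_dim, 1, bias=False)             # multimodal fusion weight
        self.b2 = nn.Parameter(torch.zeros(1))                 # fusion bias

    def forward(self, h_v, h_a):
        # h_v: (batch, T, video_dim) video states; h_a: (batch, T, audio_dim) speech states
        s = self.V(torch.tanh(self.W_v(h_v) + self.W_a(h_a) + self.b1)) + self.b2   # (batch, T, 1)
        a = torch.softmax(s, dim=1)                            # attention distribution over moments
        # Attention-weighted summaries of each modality's state sequence
        return (a * h_v).sum(dim=1), (a * h_a).sum(dim=1), a
```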
Step ten, feature fusion is performed on the extracted facial expression features and voice features to obtain a joint feature vector, comprising the following steps (an illustrative code sketch follows the list):
1) reducing the dimensionality of the fused feature vector with the PCA method packaged in the sklearn tool library;
2) normalizing the feature vector obtained after dimensionality reduction to obtain the joint feature vector of the two channels.
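A sketch of this fusion step with scikit-learn is shown below; the number of retained principal components is an assumption, and min–max scaling is used for the normalization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

def fuse_features(video_feats, audio_feats, n_components=128):
    """Concatenate the two modality features, reduce with PCA, then normalize to [0, 1]."""
    fused = np.concatenate([video_feats, audio_feats], axis=1)   # (samples, d_v + d_a)
    reduced = PCA(n_components=n_components).fit_transform(fused)
    return MinMaxScaler().fit_transform(reduced)                 # joint feature vectors
```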
Step eleven, the joint feature vector is input into a pre-trained deep convolutional neural network that contains an emotion classifier, which acquires emotion evaluation information and judges the user's emotion.
Step twelve, the training of the pre-trained deep convolutional network comprises the following steps (a code sketch follows the list):
1) acquiring a face image open source emotion data set and a voice open source emotion data set, and acquiring face image emotion sample data and voice emotion sample data from the face image emotion data set and the voice emotion data set;
2) enhancing the face emotion sample data, extracting face image feature data and performing feature selection on it to obtain the face image feature data; performing a short-time Fourier transform on the voice emotion sample data to obtain a voice spectrogram, extracting voice feature data with an AlexNet network and performing feature selection on it to obtain the voice feature data;
3) respectively carrying out feature fusion on feature vectors with the same emotion label in the face image feature data and the voice feature data to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
4) the joint feature vector data set is trained using a temporal recurrent neural network.
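A sketch of step 4) is given below: a temporal recurrent network (a GRU is used here) followed by an emotion classification head is trained with cross-entropy on batches of joint-feature-vector sequences and their emotion labels. The sequence layout, hidden size and number of emotion classes are assumptions.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """Temporal recurrent network over joint feature vector sequences plus a softmax emotion head."""
    def __init__(self, joint_dim=128, hidden=64, n_emotions=7):
        super().__init__()
        self.rnn = nn.GRU(joint_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_emotions)

    def forward(self, x):                          # x: (batch, time, joint_dim)
        _, h_n = self.rnn(x)
        return self.head(h_n[-1])                  # emotion logits (softmax applied in the loss)

def train(model, loader, epochs=10, lr=1e-3):
    """Standard cross-entropy training loop over (joint_sequence, emotion_label) batches."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    return model
```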
Advantageous effects: compared with the prior art, the technical scheme of the invention has the following beneficial technical effects:
the invention realizes the bimodal fusion of video information and voice information, and is an innovation of emotion recognition. Emotion recognition based on video and voice information is more persuasive than a single information recognition result;
in addition, feature fusion is selected in the aspect of a bimodal fusion mode, complementary information among different modalities and mutual influence among the complementary information are effectively fused, and the obtained combined feature result can more comprehensively display the emotional state of a user;
in addition, an idea is provided for other dual-mode fusion or multi-mode fusion, and various functions can be continuously improved subsequently, so that the traditional single-mode emotion recognition technology can achieve the purpose of upgrading and updating.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic flow chart of a bimodal fusion emotion recognition method based on video and voice information provided by the present invention;
FIG. 2 is a schematic view of a video information processing flow provided by the present invention;
fig. 3 is a schematic diagram of a processing flow of voice information provided by the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
The invention is further illustrated by the following examples and figures of the specification.
Example 1
A bimodal fusion emotion recognition method based on video and voice information, as shown in FIG. 1, includes the following steps:
1) collecting voice signals and face images: carrying out non-contact acquisition on natural voice and a human face image by using a microphone and a camera;
the camera refers to a CMOS digital camera, and outputs an electric signal to be directly amplified and converted into a digital signal;
the microphone is a digital MEMS microphone and outputs 1/2-period pulse density modulation digital signals;
2) signal preprocessing: respectively preprocessing signals of two modes, namely video signals and voice signals, so that the signals meet the input requirements of corresponding models of different modes;
3) and (3) emotion feature extraction: respectively extracting the features of the face image signal and the voice signal preprocessed in the step 2) to obtain corresponding feature vectors;
4) training feature vectors and calculating correlations and attention distributions: normalizing the feature vectors obtained in step 3), then respectively feeding the normalized feature vectors into a Bi-GRU network for training, and further extracting features through the network's max pooling and average pooling layers to calculate the correlation among the multi-modal state information and the attention distribution of each modality at each moment;
5) and (3) emotional characteristic fusion: performing feature fusion on the facial image and the voice feature vectors extracted in the step 3) by adopting a corresponding method;
6) judging the emotion: inputting the fusion characteristics of the step 5) into a pre-trained deep convolutional neural network, wherein the deep convolutional neural network comprises an emotion classifier, acquiring emotion evaluation information through the emotion classifier, and judging emotion according to the emotion evaluation information of the user.
Example 2
The video information processing flow, as shown in fig. 2, includes the following steps:
1) acquiring a video file to be processed; analyzing the video file to obtain a video frame; filtering the video frame based on the pixel information of the video frame, and taking the video frame obtained after filtering as the image of the face emotion to be recognized;
2) generating a histogram corresponding to the video frame and determining the definition of the video frame based on the pixel information of the video frame; clustering the video frames according to the histogram and the edge detection operator to obtain at least one class; filtering repeated video frames in each class and video frames with the definition smaller than a definition threshold value;
3) based on the filtered video frame, carrying out face detection, alignment, rotation and size adjustment on the video frame by adopting a method based on a convolutional neural network to obtain a face image;
4) based on the face image, inputting the face image into an image feature extraction model obtained by pre-training for processing, and determining a feature vector output by a full connection layer in the image feature extraction model as the image feature vector;
5) and normalizing the acquired human face image features, then transmitting the normalized human face image features into a Bi-GRU network for training, and further extracting the features through a maximum pooling layer and an average pooling layer of the network.
Example 3
The voice information processing flow, as shown in fig. 3, includes the following steps:
1) acquiring a human body voice signal by using a digital MEMS (micro electro mechanical system) microphone, pre-emphasizing the human body voice signal by using a first-order high-pass FIR (finite impulse response) digital filter, and outputting pre-emphasized voice data;
2) performing frame processing on the pre-emphasized voice data by using a short-time analysis technology to obtain a voice characteristic parameter time sequence;
3) windowing the voice characteristic parameter time sequence by using a Hamming window function to obtain voice windowed data;
4) Carrying out endpoint detection on the voice windowing data by using a double-threshold comparison method to obtain preprocessed voice data;
5) carrying out short-time Fourier transform on the preprocessed voice data to obtain a voice spectrogram;
6) inputting the spectrogram into a pre-trained AlexNet network, and taking the voice feature data from the fourth convolutional layer (Conv4);
7) and performing feature selection on the feature data to obtain final voice features.
8) And normalizing the acquired voice features, transmitting the normalized voice features into a Bi-GRU network for training, and further extracting the features through a maximum pooling layer and an average pooling layer of the network.
Claims (9)
1. A bimodal fusion emotion recognition method based on video and voice information, characterized by comprising the following steps:
Step 1: acquiring the face information and voice information of a user whose emotion is to be recognized through the camera and microphone of external equipment, inputting the face information and the voice information into a pre-trained feature extraction network, and extracting face image features and voice features respectively;
Step 2: normalizing the extracted face image features and voice features, then feeding them into a Bi-GRU network for training, and calculating the correlation and the attention distribution of each modality at each moment from the input features of the two single-modality sub-networks;
Step 3: performing feature fusion on the extracted face image features and voice features to obtain a joint feature vector, the joint feature vector being obtained by fusing face image features and voice features that carry the same emotion label and performing dimensionality reduction and normalization;
Step 4: inputting the fused features into a pre-trained deep neural network that comprises an emotion classifier, the classifier acquiring emotion evaluation information of different types and finally evaluating the user's emotion.
2. The method according to claim 1, wherein the video information is face image information.
3. The bimodal fusion emotion recognition method based on video and voice information as claimed in claim 1, wherein obtaining the facial image information and extracting the facial image features comprises the following steps:
Step 1: acquiring the video file to be processed; parsing the video file to obtain video frames; filtering the video frames based on their pixel information, and taking the filtered video frames as the images of the facial emotion to be recognized;
Step 2: generating a histogram for each video frame and determining the frame's definition based on its pixel information; clustering the video frames according to the histogram and an edge detection operator to obtain at least one class; filtering out repeated video frames within each class and video frames whose definition is below a definition threshold;
Step 3: based on the filtered video frames, performing face detection, alignment, rotation and resizing with a convolutional-neural-network-based method to obtain face images;
Step 4: inputting the face image into an image feature extraction model obtained by pre-training and processing it, and determining the feature vector output by the fully connected layer of the image feature extraction model as the image feature vector, wherein the image feature extraction model is obtained by training a preset deep convolutional neural network model comprising a pooling layer, a fully connected layer, a dropout layer in front of the fully connected layer and a softmax layer behind the fully connected layer.
4. The method according to claim 1, wherein the voice information is obtained and the voice features are extracted by means of a pre-trained AlexNet network.
5. The extraction of the voice features according to claim 4, comprising the following steps:
Step 1: acquiring the original voice signal of the human body with a microphone and preprocessing the voice signal to obtain a spectrogram;
Step 2: inputting the spectrogram into the pre-trained AlexNet network; after the input layer, the first and second convolution and pooling layers and the third convolution layer, the voice features are taken from the fourth convolution layer (Conv4), with ReLU used as the activation function at the output of each convolution layer.
6. The method according to claim 5, wherein the extracted voice features are further selected as follows: Correlation Feature Selection (CFS) is used to measure the similarity between features, and irrelevant features having little correlation with the category label are discarded. The evaluation criterion is the standard CFS merit:

Merit_S = (k · r̄_cf) / sqrt(k + k(k − 1) · r̄_ff)

where k is the number of features in the candidate subset S, r̄_cf is the average correlation between the features and the category label, and r̄_ff is the average correlation among the features.
7. The bimodal fusion emotion recognition method based on video and voice information as claimed in claim 1, wherein normalizing the facial image features and the voice features, feeding them into the Bi-GRU network for training, and calculating the correlation and attention distribution comprises the following steps:
Step 1: finding the maximum value of the facial image features and of the voice features, and dividing all feature vectors by the maximum value of the corresponding modality so that the values fall within 0–1, which speeds up network training and convergence;
Step 2: the Bi-GRU network combines the model architectures of the GRU and the BRNN; the normalized feature vectors are respectively fed into the network for training, features are extracted through the network's max pooling and average pooling layers, and the correlation among the multi-modal state information and the attention distribution of each modality at each moment are then calculated.
8. The bimodal fusion emotion recognition method based on video and voice information as claimed in claim 1, wherein performing feature fusion on the facial image features and the voice features to obtain the joint feature vector comprises the following steps:
Step 1: reducing the dimensionality of the fused feature vector with the PCA method packaged in the sklearn tool library;
Step 2: normalizing the feature vector obtained after dimensionality reduction to obtain the joint feature vector of the two channels.
9. The bimodal fusion emotion recognition method based on video and voice information as claimed in claim 1, wherein the training of the pre-trained deep convolutional network comprises the following steps:
Step 1: acquiring a face image open-source emotion data set and a voice open-source emotion data set, and obtaining face image emotion sample data and voice emotion sample data from them;
Step 2: enhancing the face emotion sample data, extracting face image feature data and performing feature selection on it to obtain the face image features; performing a short-time Fourier transform on the voice emotion sample data to obtain voice spectrograms, extracting voice feature data with the AlexNet network and performing feature selection on it to obtain the voice features;
Step 3: performing feature fusion on the feature vectors that share the same emotion label in the face image feature data and the voice feature data to obtain joint feature vectors corresponding to the different emotion labels, which serve as the joint-feature-vector training data set of the character emotion recognition model;
Step 4: training the joint feature vector data set with a temporal recurrent neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011613947.1A CN113158727A (en) | 2020-12-31 | 2020-12-31 | Bimodal fusion emotion recognition method based on video and voice information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113158727A true CN113158727A (en) | 2021-07-23 |
Family
ID=76878273
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011613947.1A Pending CN113158727A (en) | 2020-12-31 | 2020-12-31 | Bimodal fusion emotion recognition method based on video and voice information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113158727A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188343A (en) * | 2019-04-22 | 2019-08-30 | 浙江工业大学 | Multi-modal emotion identification method based on fusion attention network |
CN111242155A (en) * | 2019-10-08 | 2020-06-05 | 台州学院 | Bimodal emotion recognition method based on multimode deep learning |
CN111339913A (en) * | 2020-02-24 | 2020-06-26 | 湖南快乐阳光互动娱乐传媒有限公司 | Method and device for recognizing emotion of character in video |
CN111563422A (en) * | 2020-04-17 | 2020-08-21 | 五邑大学 | Service evaluation obtaining method and device based on bimodal emotion recognition network |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113807249B (en) * | 2021-09-17 | 2024-01-12 | 广州大学 | Emotion recognition method, system, device and medium based on multi-mode feature fusion |
CN113807249A (en) * | 2021-09-17 | 2021-12-17 | 广州大学 | Multi-mode feature fusion based emotion recognition method, system, device and medium |
CN114022192A (en) * | 2021-10-20 | 2022-02-08 | 百融云创科技股份有限公司 | Data modeling method and system based on intelligent marketing scene |
CN113920568A (en) * | 2021-11-02 | 2022-01-11 | 中电万维信息技术有限责任公司 | Face and human body posture emotion recognition method based on video image |
CN113920568B (en) * | 2021-11-02 | 2024-07-02 | 中电万维信息技术有限责任公司 | Face and human body posture emotion recognition method based on video image |
CN113742599A (en) * | 2021-11-05 | 2021-12-03 | 太平金融科技服务(上海)有限公司深圳分公司 | Content recommendation method, device, equipment and computer readable storage medium |
CN114399818A (en) * | 2022-01-05 | 2022-04-26 | 广东电网有限责任公司 | Multi-mode face emotion recognition method and device |
CN114582000A (en) * | 2022-03-18 | 2022-06-03 | 南京工业大学 | Multimode elderly emotion recognition fusion model based on video image facial expressions and voice and establishment method thereof |
CN114973490A (en) * | 2022-05-26 | 2022-08-30 | 南京大学 | Monitoring and early warning system based on face recognition |
CN115100329A (en) * | 2022-06-27 | 2022-09-23 | 太原理工大学 | Multi-mode driving-based emotion controllable facial animation generation method |
CN115496226A (en) * | 2022-09-29 | 2022-12-20 | 中国电信股份有限公司 | Multi-modal emotion analysis method, device, equipment and storage based on gradient adjustment |
CN115424108B (en) * | 2022-11-08 | 2023-03-28 | 四川大学 | Cognitive dysfunction evaluation method based on audio-visual fusion perception |
CN115424108A (en) * | 2022-11-08 | 2022-12-02 | 四川大学 | Cognitive dysfunction evaluation method based on audio-visual fusion perception |
CN117349792A (en) * | 2023-10-25 | 2024-01-05 | 中国人民解放军空军军医大学 | Emotion recognition method based on facial features and voice features |
CN117349792B (en) * | 2023-10-25 | 2024-06-07 | 中国人民解放军空军军医大学 | Emotion recognition method based on facial features and voice features |
CN117152668A (en) * | 2023-10-30 | 2023-12-01 | 成都方顷科技有限公司 | Intelligent logistics implementation method, device and equipment based on Internet of things |
CN117152668B (en) * | 2023-10-30 | 2024-02-06 | 成都方顷科技有限公司 | Intelligent logistics implementation method, device and equipment based on Internet of things |
CN117312992A (en) * | 2023-11-30 | 2023-12-29 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Emotion recognition method and system for fusion of multi-view face features and audio features |
CN117312992B (en) * | 2023-11-30 | 2024-03-12 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Emotion recognition method and system for fusion of multi-view face features and audio features |
CN118055300A (en) * | 2024-04-10 | 2024-05-17 | 深圳云天畅想信息科技有限公司 | Cloud video generation method and device based on large model and computer equipment |
CN118380020A (en) * | 2024-06-21 | 2024-07-23 | 吉林大学 | Method for identifying emotion change of interrogation object based on multiple modes |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |