WO2021128741A1 - Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium - Google Patents
- Publication number
- WO2021128741A1 (PCT/CN2020/094338)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- text
- emotion
- feature
- recognition result
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/05—Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
Definitions
- the embodiments of the present application relate to the field of artificial intelligence technology, and in particular to a method, device, computer equipment, and storage medium for analyzing voice mood fluctuations.
- mood swing analysis is used in more and more business scenarios, such as the mood swings of both parties when a customer service staff talks with a customer.
- in the prior art, mood fluctuation in audio is generally analyzed from the acoustic signal alone, for example from intonation and from changes in the frequency and amplitude of the sound waves.
- the inventor realized that this analysis approach is one-dimensional; moreover, the audio signals of different people differ, so analyzing emotion using only the acoustic signal has low accuracy.
- the embodiments of the present application provide a voice mood fluctuation analysis method, device, computer equipment, and computer-readable storage medium, which address the problem of low accuracy in analyzing mood fluctuations.
- a method for analyzing voice mood fluctuations including:
- a voice mood fluctuation analysis device including:
- the first voice feature acquisition module used to acquire the first audio feature and the first text feature of the voice data to be tested;
- the second voice feature extraction module used to extract the second audio feature from the first audio feature based on the audio feature extraction network in the pre-trained audio recognition model, and to extract the second text feature from the first text feature based on the text feature extraction network in the pre-trained text recognition model;
- Voice feature recognition module used to recognize the second audio feature to obtain an audio emotion recognition result; recognize the second text feature to obtain a text emotion recognition result;
- Recognition result acquisition module used to perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and send the emotion recognition result to the associated terminal.
- a computer device includes a memory, a processor, and computer-readable instructions stored on the memory and capable of running on the processor.
- when the processor executes the computer-readable instructions, the following steps are implemented:
- one or more readable storage media storing computer-readable instructions.
- when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
- the voice mood fluctuation analysis method, device, computer equipment, and computer-readable storage medium provided by the embodiments of the present invention analyze voice emotion through two channels.
- in addition to analyzing emotion from the acoustic prosody of the audio, the speaker's emotion is further judged from the spoken content, thereby improving the accuracy of emotion analysis.
- FIG. 1 is a flow chart of the steps of the method for analyzing voice mood fluctuations according to the first embodiment of this application;
- Figure 2 is a specific flow chart of obtaining the voice data to be tested
- Figure 3 is a specific flow chart of extracting the first audio feature in the voice data to be tested
- Figure 4 is a specific flow chart of extracting the first text feature in the voice data to be tested
- Figure 5 is a specific flow chart of identifying the second audio feature and obtaining an audio emotion recognition result
- Figure 6 is a specific flow chart of recognizing the second text feature and obtaining text emotion recognition results
- FIG. 7 is a specific flow chart of performing fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and sending the emotion recognition result to the associated terminal.
- FIG. 8 is a schematic diagram of the program modules of the second embodiment of the speech mood fluctuation analysis device of this application.
- FIG. 9 is a schematic diagram of the hardware structure of the third embodiment of the computer equipment of this application.
- FIG. 1 shows a flowchart of the steps of a method for analyzing voice mood fluctuations according to an embodiment of the present application. It can be understood that the flowchart in this method embodiment is not used to limit the order of execution of the steps.
- the following is an exemplary description that takes a computer device as the execution subject, with the details as follows:
- the voice mood fluctuation analysis method of the embodiment of the present application further includes:
- the acquiring voice data to be tested further includes:
- the voice data includes online voice data and offline voice data
- the online voice data refers to the voice data obtained in real time during a call
- the offline voice data refers to the voice data from the call stored in the background of the system
- the voice data to be tested is a recording file in wav format.
- S110B Separate and process the voice data to obtain voice data to be tested, where the voice data to be tested includes multiple pieces of first user voice data and second user voice data.
- specifically, after the voice data is obtained, the voice data to be tested is divided into multiple segments of first user voice data and second user voice data according to the silent parts of the call; endpoint detection and voice separation techniques are used to remove the silent parts of the call.
- based on a set duration threshold for the silence between utterances, the start and end points of each utterance are marked, and the audio is cut at those time points to obtain multiple short audio clips.
- a voiceprint recognition tool marks the speaker identity and speaking time of each short audio clip and distinguishes them by number.
- the duration threshold is determined according to empirical values. As an example, the duration threshold of this solution is 0.25 to 0.3 seconds.
- the number includes, but is not limited to, the customer service agent's employee ID, the agent's landline number, and the customer's mobile phone number.
- the voiceprint recognition tool is the LIUM_SpkDiarization toolkit
- the first user's voice data and the second user's voice data are distinguished through the LIUM_SpkDiarization toolkit: each speaker turn is labeled with a start time, an end time, and a speaker number, with the first speaker to talk treated as the first user (speaker 1) and the second as the second user (speaker 2); a sketch of the segmentation step is given below.
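- the patent performs this step with endpoint detection and the LIUM_SpkDiarization toolkit; purely as an illustration, the following is a minimal sketch of silence-based splitting under assumed tooling (pydub) and an assumed energy threshold, with the 250 ms minimum silence length taken from the 0.25-0.3 s threshold mentioned above.

```python
# Illustrative sketch only: silence-based splitting of a call recording into
# short clips, assuming the pydub library as a stand-in for the endpoint
# detection step; speaker labelling would still come from a diarization tool
# such as LIUM_SpkDiarization.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

call = AudioSegment.from_wav("call.wav")         # wav recording, as in the patent

turns = detect_nonsilent(
    call,
    min_silence_len=250,               # 0.25 s pause threshold (patent: 0.25-0.3 s)
    silence_thresh=call.dBFS - 16,     # placeholder energy threshold
)

clips = []
for i, (start_ms, end_ms) in enumerate(turns):
    clips.append({
        "start": start_ms / 1000,
        "end": end_ms / 1000,
        # naive alternating assignment; a real system uses voiceprint diarization
        "speaker": 1 if i % 2 == 0 else 2,
        "audio": call[start_ms:end_ms],
    })
```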
- the acquiring the first audio feature of the voice data to be tested further includes:
- S100A1 Perform framing and windowing on the voice data to be tested to obtain voice analysis frames
- the voice data signal has short-term stability, and the voice data signal can be subjected to framing processing to obtain multiple audio frames.
- the audio frame refers to a collection of N sampling points. In this embodiment, N is 256 or 512, covering 20-30ms. After obtaining multiple audio frames, each audio frame is multiplied by a Hamming window to increase the continuity between the left and right ends of the frame, yielding the voice analysis frames.
- S100B1 Perform Fourier transform on the speech analysis frame to obtain a corresponding frequency spectrum
- because the characteristics of the voice data signal are difficult to observe in the time domain, it is necessary to convert the signal into an energy distribution in the frequency domain; the voice analysis frames are subjected to a Fourier transform to obtain the frequency spectrum of each voice analysis frame.
- S100C1 Pass the spectrum through a mel filter bank to obtain a mel spectrum
- S100D1 Perform cepstrum analysis on the Mel spectrum to obtain the first audio feature of the voice data to be measured.
- cepstrum analysis is performed on the Mel spectrum to obtain 36 1024-dimensional audio vectors, and the audio vectors are the first audio features of the voice data to be tested.
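- steps S100A1 to S100D1 correspond closely to a standard MFCC pipeline; as an illustration only, a minimal sketch using librosa (an assumed library, not named in the patent) is shown below, where the 512-sample frame matches the N given above but the coefficient count is an assumption and the patent's 36 x 1024 vector shape is not reproduced.

```python
# Illustrative sketch only: a standard MFCC pipeline approximating steps
# S100A1-S100D1 (framing, Hamming window, Fourier transform, mel filter bank,
# cepstrum analysis); the 36 x 1024 output shape used by the patent is not
# reproduced here.
import librosa

y, sr = librosa.load("clip.wav", sr=None)   # one short audio clip to be tested
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=36,          # 36 coefficients, loosely mirroring the 36 audio vectors
    n_fft=512,          # frame of N = 512 samples (roughly 20-30 ms of audio)
    hop_length=256,
    window="hamming",   # Hamming window applied before the Fourier transform
)
# mfcc has shape (36, number_of_frames); the patent's 1024-dimensional vectors
# would come from its own, unspecified pooling or projection of such features.
print(mfcc.shape)
```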
- the acquiring the first text feature of the voice data to be tested further includes:
- S100A2 Convert the voice data to be tested into text
- a voice dictation interface is used to convert the multiple pieces of first user voice data and second user voice data into text.
- the dictation interface is a voice dictation interface of iFLYTEK.
- S100B2 Perform word segmentation processing on the text to obtain L segmented words, where L is a natural number greater than 0;
- the word segmentation process is completed by a dictionary word segmentation algorithm.
- the dictionary word segmentation algorithm includes, but is not limited to, the forward maximum matching method, the reverse maximum matching method, and the bidirectional maximum matching method; segmentation can also be based on hidden Markov models (HMM), CRF, SVM, or deep learning algorithms.
- S100C2 Perform word vector mapping on each of the L segmented words to obtain a d-dimensional word vector matrix corresponding to the L segmented words, where d is a natural number greater than 0; the d-dimensional word vector matrix is the first text feature of the voice data to be tested.
- the 128-dimensional word vector of each word segmentation is obtained through models such as word2vec.
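- a minimal sketch of the segmentation and word-vector mapping follows; jieba and gensim are assumed libraries (the patent names only dictionary-based segmentation and word2vec), the 128-dimensional size follows the example above, and a real system would train word2vec on a large corpus rather than a single transcript.

```python
# Illustrative sketch only: dictionary-based word segmentation followed by a
# 128-dimensional word2vec mapping; jieba and gensim are assumed libraries.
import jieba
import numpy as np
from gensim.models import Word2Vec

transcript = "这个服务太让人失望了"        # text converted from one audio clip
tokens = jieba.lcut(transcript)            # the L segmented words

w2v = Word2Vec(sentences=[tokens], vector_size=128, window=5, min_count=1)

# d-dimensional word vector matrix (L x 128) used as the first text feature.
word_matrix = np.stack([w2v.wv[tok] for tok in tokens])
print(word_matrix.shape)                   # (L, 128)
```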
- the second audio feature and the second text feature are lower-dimensional semantic feature vectors extracted from the first audio feature and the first text feature by the feature extraction networks of the emotion recognition models, and they focus more on words that express emotion.
- by extracting the second audio feature and the second text feature, the learning ability of the model can be better, and the accuracy of the final classification can be higher.
- S104 Recognize the second audio feature to obtain an audio emotion recognition result; recognize the second text feature to obtain a text emotion recognition result.
- the audio recognition result is obtained by inputting the audio feature into an audio recognition model
- the text emotion recognition result is obtained by inputting the text feature into a text recognition model
- the audio recognition model and the text emotion recognition model each include a feature extraction network and a classification network, where the feature extraction network is used to extract semantic feature vectors with fewer dimensions from the first audio feature and the first text feature (that is, the second audio feature and the second text feature), and the classification network is used to output the confidence of each preset emotion category, where the preset emotion categories can be defined according to business requirements, for example, positive, negative, etc.
- the text emotion recognition model is a deep neural network model that includes an Embedding layer and a long short-term memory recurrent layer (LSTM, Long Short-Term Memory), and the audio emotion recognition model is a neural network model that includes a self-attention layer and a bidirectional long short-term memory layer (a forward LSTM and a backward LSTM).
- the long short-term memory network is used to handle sequence dependencies across long spans and is suitable for tasks that involve dependencies across long texts.
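- the patent names the layer types but not their sizes or other hyperparameters; the sketch below shows one plausible Keras rendering of the two models under those assumptions, with an Embedding + LSTM classifier for text, a self-attention + bidirectional LSTM classifier for audio, and two output classes for the positive/negative confidences.

```python
# Illustrative sketch only: plausible shapes for the two classifiers described
# above; layer sizes, vocabulary size, and head counts are assumptions, since
# the patent names only the layer types.
from tensorflow.keras import layers, models

# Text model: Embedding layer + LSTM recurrent layer + softmax over emotion classes.
text_model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=128),    # 128-d word vectors
    layers.LSTM(64),
    layers.Dense(2, activation="softmax"),                 # positive / negative
])

# Audio model: self-attention over frame features + bidirectional LSTM.
frames = layers.Input(shape=(None, 36))                    # sequence of audio frame features
attended = layers.MultiHeadAttention(num_heads=2, key_dim=36)(frames, frames)
encoded = layers.Bidirectional(layers.LSTM(64))(attended)  # forward + backward LSTM
probs = layers.Dense(2, activation="softmax")(encoded)
audio_model = models.Model(frames, probs)
```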
- the embodiment of the present application further includes training the audio recognition model and the text recognition model, and the training process includes:
- the methods for obtaining the voice data of the training set and the verification set include, but are not limited to, recordings from the company's internal call center, customer service recordings provided by a partner company, customer service recordings provided by the customer, and customer service recordings purchased directly from a data platform.
- the recording data is obtained by directly purchasing from the data platform, and the recording data includes a training set and a verification set.
- the labeling process is: manually labeling the pause time points of each recording to obtain multiple short audio fragments (conversation fragments) from each recording, and labeling each short audio fragment with an emotional tendency (i.e., positive emotion or negative emotion).
- the audio annotation tool audio-annotator is used to mark the start and end time points and emotions of the audio clip.
- the process of separating the training set and the verification set is: randomly shuffle all the labeled audio clip samples and then divide them into two data sets at a ratio of 4:1; the larger part is used for model training (the training set) and the smaller part is used for model verification (the verification set).
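- a minimal sketch of this 4:1 shuffle-and-split follows, using scikit-learn purely as an assumed convenience; the patent requires only a random shuffle and an 80/20 division.

```python
# Illustrative sketch only: random 4:1 division of the labeled audio clips into
# a training set and a verification set, using scikit-learn as a convenience.
from sklearn.model_selection import train_test_split

# (clip_path, label) pairs produced by the annotation step; paths are examples.
samples = [
    ("clip_001.wav", "negative"),
    ("clip_002.wav", "positive"),
    ("clip_003.wav", "negative"),
    ("clip_004.wav", "positive"),
    ("clip_005.wav", "negative"),
]

train_set, verification_set = train_test_split(
    samples, test_size=0.2, shuffle=True, random_state=42)
```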
- the speech emotion recognition model and the text emotion recognition model are tested to determine the accuracy of the speech emotion recognition model and the text emotion recognition model.
- the recognizing the second audio feature and obtaining the audio emotion recognition result further includes:
- S104A1 Based on the audio classification network in the pre-trained audio recognition model, identify the second audio feature, and obtain multiple audio emotion classifications and first confidence levels corresponding to the audio emotion classifications;
- the extracted second audio feature is input into the audio classification network in the audio recognition model, and the classification network layer analyzes the second audio feature to obtain multiple audio emotion classifications corresponding to the second audio feature and the first confidence corresponding to each audio emotion classification.
- for example, the first confidence of "positive emotion" is 0.3, and the first confidence of "negative emotion" is 0.7.
- S104B1 Select the audio emotion classification with the highest first confidence as the target audio emotion classification, and the corresponding first confidence is the target audio emotion classification parameter.
- the target audio emotion is classified as "negative emotion", and the target audio emotion classification parameter is 0.7.
- S104C1 Perform numerical mapping on the target audio emotion classification vector parameter to obtain an audio emotion recognition result.
- numerical mapping refers to mapping the output emotion category to a specific value, so as to facilitate subsequent observation of emotion fluctuations.
- the emotion category is mapped to a specific number through a functional relationship. For example, after obtaining the first confidence of each preset emotion category for the voice data to be tested, the emotion category with the highest confidence is selected, and the target audio emotion classification parameter X corresponding to that category is put through the audio emotion recognition result formula to calculate the final output audio emotion recognition result Y.
- the final output audio emotion recognition result is 0.85.
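- the patent refers to an "audio emotion recognition result formula" without stating it; one mapping consistent with the worked numbers in the text (a negative confidence of 0.7 giving 0.85, and 0.8 giving 0.9 in the text channel) is Y = 0.5 + X/2 for a negative classification and Y = 0.5 - X/2 for a positive one. The sketch below uses that inferred mapping and should not be read as the patent's actual formula.

```python
# Illustrative sketch only: select the highest-confidence class and map it to a
# value in [0, 1]. The linear mapping below is inferred from the worked numbers
# in the text (0.7 -> 0.85, 0.8 -> 0.9); the patent does not state its formula.
def emotion_value(confidences):
    """confidences: dict mapping an emotion class to its classifier confidence."""
    target = max(confidences, key=confidences.get)
    x = confidences[target]
    if target == "negative":
        return 0.5 + x / 2      # closer to 1 means more negative
    return 0.5 - x / 2          # closer to 0 means more positive

print(emotion_value({"positive": 0.3, "negative": 0.7}))   # 0.85, as in the example
```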
- recognizing the second text feature and obtaining a text emotion recognition result further includes:
- S104A2 Recognize the second text feature based on the text classification network in the pre-trained text recognition model, and obtain the second confidence level corresponding to the multiple text emotion classification vectors.
- the extracted second text feature is input into the text classification network in the text recognition model, and the classification network layer analyzes the second text feature to obtain multiple text emotion classifications corresponding to the second text feature and the second confidence corresponding to each text emotion classification.
- for example, the second confidence of "positive emotion" is 0.2, and the second confidence of "negative emotion" is 0.8.
- S104B2 Select the text emotion classification with the highest second confidence as the target text emotion classification, and the corresponding second confidence is the target text emotion classification parameter.
- the target text emotion is classified as "negative emotion", and the target text emotion classification parameter is 0.8.
- S104C2 Perform numerical mapping on the target text emotion classification vector parameter to obtain a text emotion recognition result.
- the final output text emotion recognition result is 0.9.
- S106 Perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain an emotion recognition result, and send the emotion recognition result to an associated terminal.
- step S106 may further include:
- the two emotion values of the same audio clip are processed by a numerical weighting method.
- the emotion value is a floating-point number between 0 and 1: the closer to 1, the more negative the emotion; conversely, the closer to 0, the more positive the emotion.
- the weight of the emotion value obtained by the speech emotion recognition channel is 0.7; the weight of the emotion value obtained by the text emotion recognition channel is 0.3.
- the final output sentiment value is 0.865.
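- with the weights above, the fusion reduces to a weighted average; a minimal sketch follows, reproducing the 0.865 value from the 0.85 audio result and the 0.9 text result.

```python
# Illustrative sketch only: weighted fusion of the two emotion values obtained
# for the same audio clip, using the 0.7 / 0.3 weights given above.
AUDIO_WEIGHT = 0.7   # weight of the speech (acoustic) emotion channel
TEXT_WEIGHT = 0.3    # weight of the text emotion channel

def fuse(audio_value, text_value):
    return AUDIO_WEIGHT * audio_value + TEXT_WEIGHT * text_value

print(fuse(0.85, 0.9))   # ~0.865, matching the example above
```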
- each segment of the speech to be tested is numbered in chronological order and an emotion value heat map is drawn from the per-segment emotion values; the heat map is used to cluster the emotions of each time period.
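- a minimal sketch of the per-segment heat map follows, using seaborn and matplotlib as assumed plotting libraries; the patent does not specify a visualization tool, only that segments are numbered chronologically and a heat map is drawn for each speaker.

```python
# Illustrative sketch only: plot the fused emotion value of each chronologically
# numbered segment as a heat map; seaborn/matplotlib are assumed plotting
# libraries, and the values below are placeholders.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

segment_values = [0.21, 0.35, 0.62, 0.865, 0.74]   # one speaker's segments in time order

sns.heatmap(
    np.array(segment_values).reshape(1, -1),
    vmin=0.0, vmax=1.0, cmap="coolwarm", annot=True,
    xticklabels=[f"seg {i + 1}" for i in range(len(segment_values))],
    yticklabels=["speaker 1"],
)
plt.title("Emotion value per dialogue segment (1 = negative, 0 = positive)")
plt.show()
```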
- S106C Send the first emotion value heat map and the second emotion value heat map to the associated terminal.
- the associated terminal includes a first user terminal and a second user terminal.
- in addition to the customer terminal and the customer service terminal, the associated terminal also includes the customer service quality supervision and management terminal and the customer service agent's superior, so that the service quality of the customer service can be supervised and corrected.
- the embodiment of this application uses dual-channel analysis of voice emotions.
- in addition to analyzing voice emotion through the acoustic prosody of the audio, the speaker's emotion is further judged through the content of the speech, thereby improving the accuracy of emotion analysis.
- combined with dialogue separation technology, the emotion value of each segment of dialogue is analyzed and judged, so as to obtain the speaker's emotion at each time period of the complete conversation and then analyze the speaker's emotional fluctuations; this provides a concrete reference and help for customer service quality inspection, makes the evaluation results more objective, and ultimately helps companies improve the quality of customer service and the customer experience.
- FIG. 8 shows a schematic diagram of the program modules of the voice mood fluctuation analysis device of the present application.
- the speech mood fluctuation analysis device 20 may include or be divided into one or more program modules, and the one or more program modules are stored in a storage medium and executed by one or more processors to complete this application and realize the above-mentioned voice mood fluctuation analysis method.
- the program module referred to in the embodiment of the present application refers to an instruction segment of a series of computer-readable instructions that can complete a specific function, and is more suitable for describing the execution process of the voice mood fluctuation analysis device 20 in the storage medium than the program itself. The following description will specifically introduce the functions of each program module in this embodiment:
- the first voice feature acquiring module 200 is configured to acquire the first audio feature and the first text feature of the voice data to be tested.
- first voice feature acquisition module 200 is also used for:
- the voice data is separated and processed to obtain the voice data to be tested, and the voice data to be tested includes multiple pieces of first user voice data and second user voice data.
- the second voice feature extraction module 202 used to extract the second audio feature in the first audio feature based on the audio feature extraction network in the pre-trained audio recognition model; based on the text in the pre-trained text recognition model The feature extraction network extracts the second text feature in the first text feature.
- the second speech feature extraction module 202 is also used for:
- the word vector mapping is performed on the L word segmentation respectively to obtain a d-dimensional word vector matrix corresponding to the L word segmentation, where d is a natural number greater than 0, and the d-dimensional word vector matrix is the first text feature of the voice data to be tested.
- the voice feature recognition module 204 used to recognize the second audio feature to obtain an audio emotion recognition result; recognize the second text feature to obtain a text emotion recognition result.
- voice feature recognition module 204 is also used for:
- the audio emotion classification with the highest first confidence is selected as the target audio emotion classification, and the corresponding first confidence is the target audio emotion classification parameter;
- voice feature recognition module 204 is also used for:
- the recognition result acquisition module 206 is configured to perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and send the emotion recognition result to the associated terminal.
- the recognition result acquisition module 206 is also used for:
- the first emotion value heat map and the second emotion value heat map are sent to the associated terminal.
- the computer device 2 is a device that can automatically perform numerical calculation and/or information processing according to pre-set or stored instructions.
- the computer device 2 may be a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of multiple servers).
- the computer device 2 at least includes, but is not limited to, a memory 21, a processor 22, and a network interface 23 that can communicate with each other through a system bus, as well as computer-readable instructions stored in the memory 21 and runnable on the processor 22.
- the computer-readable instructions may be the computer-readable instructions corresponding to the voice mood fluctuation analysis device 20.
- the memory 21 includes at least one type of computer-readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory ( RAM), static random access memory (SRAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
- the memory 21 may be an internal storage unit of the computer device 2, for example, a hard disk or a memory of the computer device 2.
- the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a smart media card (SMC), and a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc.
- the memory 21 may also include both the internal storage unit of the computer device 2 and its external storage device.
- the memory 21 is generally used to store the operating system and the various application software installed in the computer device 2, for example, the program code of the voice mood fluctuation analysis device 20 in the second embodiment.
- the memory 21 can also be used to temporarily store various types of data that have been output or will be output.
- the processor 22 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
- the processor 22 is generally used to control the overall operation of the computer device 2.
- the processor 22 is used to run the program code or process data stored in the memory 21, for example, to run the voice mood fluctuation analysis device 20 to implement the voice mood fluctuation analysis method of the first, second, third or fourth embodiment.
- the network interface 23 may include a wireless network interface or a wired network interface, and the network interface 23 is generally used to establish a communication connection between the computer device 2 and other electronic system devices.
- the network interface 23 is used to connect the computer device 2 with an external terminal through a network, and establish a data transmission channel and a communication connection between the computer device 2 and the external terminal.
- the network may be an intranet, the Internet, a global system of mobile communication (GSM), a wideband code division multiple access (WCDMA), a 4G network, 5G network, Bluetooth (Bluetooth), Wi-Fi and other wireless or wired networks.
- FIG. 9 only shows the computer device 2 with components 20-23, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.
- the voice mood fluctuation analysis device 20 stored in the memory 21 can also be divided into one or more program modules, and the one or more program modules are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to complete this application.
- FIG. 8 shows a schematic diagram of the program modules of the second embodiment of the voice emotion fluctuation analysis device 20.
- the voice emotion fluctuation analysis device 20 can be divided into the first voice feature acquisition module 200, the second voice feature extraction module 202, the voice feature recognition module 204, and the recognition result acquisition module 206.
- the program module referred to in the present application refers to a series of computer-readable instruction segments that can complete specific functions, and is more suitable than the program itself for describing the execution process of the voice mood fluctuation analysis device 20 in the computer device 2.
- the specific functions of the program module, the first voice feature acquisition module 200 and the recognition result acquisition module 206 have been described in detail in the second embodiment, and will not be repeated here.
- This embodiment also provides a computer-readable storage medium, such as a flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, server, application marketplace, etc., on which computer-readable instructions are stored; when the computer-readable instructions are executed by the processor, they realize the corresponding functions of the voice mood fluctuation analysis method.
- the computer-readable storage medium of this embodiment is used to store the voice mood fluctuation analysis device 20, and when executed by a processor, the voice mood fluctuation analysis method of the first, second, third or fourth embodiment is implemented.
- one or more readable storage media storing computer-readable instructions are provided.
- the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by one or more processors, the one or more processors implement the voice mood fluctuation analysis method in the foregoing embodiments; to avoid repetition, details are not described here again.
- the readable storage medium in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
- a person of ordinary skill in the art can understand that all or part of the processes in the above-mentioned method embodiments can be implemented by computer-readable instructions instructing the relevant hardware; the computer-readable instructions can be stored in a non-volatile readable storage medium or in a volatile readable storage medium, and when the computer-readable instructions are executed, they may include the processes of the above-mentioned method embodiments.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Signal Processing (AREA)
- Hospice & Palliative Care (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Child & Adolescent Psychology (AREA)
- Artificial Intelligence (AREA)
- Quality & Reliability (AREA)
- Machine Translation (AREA)
Abstract
A voice emotion fluctuation analysis method and apparatus, and a computer device and a storage medium, relating to the technical field of artificial intelligence. The method comprises: obtaining a first audio feature and a first text feature of voice data to be tested (S100); extracting a second audio feature in the first audio feature on the basis of an audio feature extraction network in a pretrained audio recognition model, and extracting a second text feature in the first text feature on the basis of a text feature extraction network in a pretrained text recognition model (S102); recognizing the second audio feature to obtain an audio emotion recognition result, and recognizing the second text feature to obtain a text emotion recognition result (S104); and performing fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal (S106). The present invention can effectively improve the accuracy of emotion fluctuation analysis.
Description
This application is based on the Chinese invention application with application number 201911341679.X, titled "Speech Mood Fluctuation Analysis Method and Apparatus", filed on December 24, 2019, and claims its priority.
The embodiments of the present application relate to the field of artificial intelligence technology, and in particular to a voice mood fluctuation analysis method, device, computer equipment, and storage medium.
With the development of artificial intelligence technology, mood fluctuation analysis is used in more and more business scenarios, for example the mood fluctuations of both parties when a customer service agent talks with a customer. In the prior art, mood fluctuation in audio is generally analyzed from the acoustic signal alone, such as intonation and changes in the frequency and amplitude of the sound waves. The inventor realized that this analysis approach is one-dimensional; moreover, the audio signals of different people differ, so analyzing emotion using only the acoustic signal has low accuracy.
Summary of the invention
In view of this, the embodiments of the present application provide a voice mood fluctuation analysis method, device, computer equipment, and computer-readable storage medium, which address the problem of low accuracy in analyzing mood fluctuations.
A method for analyzing voice mood fluctuations, including:
Acquire the first audio feature and the first text feature of the voice data to be tested;
Extract the second audio feature from the first audio feature based on the audio feature extraction network in the pre-trained audio recognition model; extract the second text feature from the first text feature based on the text feature extraction network in the pre-trained text recognition model;
Recognize the second audio feature to obtain an audio emotion recognition result; recognize the second text feature to obtain a text emotion recognition result;
Perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and send the emotion recognition result to the associated terminal.
A voice mood fluctuation analysis device, including:
The first voice feature acquisition module: used to acquire the first audio feature and the first text feature of the voice data to be tested;
The second voice feature extraction module: used to extract the second audio feature from the first audio feature based on the audio feature extraction network in the pre-trained audio recognition model, and to extract the second text feature from the first text feature based on the text feature extraction network in the pre-trained text recognition model;
The voice feature recognition module: used to recognize the second audio feature to obtain an audio emotion recognition result, and to recognize the second text feature to obtain a text emotion recognition result;
The recognition result acquisition module: used to perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and to send the emotion recognition result to the associated terminal.
A computer device, including a memory, a processor, and computer-readable instructions stored on the memory and capable of running on the processor; when the processor executes the computer-readable instructions, the following steps are implemented:
Acquire the first audio feature and the first text feature of the voice data to be tested;
Extract the second audio feature from the first audio feature based on the audio feature extraction network in the pre-trained audio recognition model; extract the second text feature from the first text feature based on the text feature extraction network in the pre-trained text recognition model;
Recognize the second audio feature to obtain an audio emotion recognition result; recognize the second text feature to obtain a text emotion recognition result;
Perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and send the emotion recognition result to the associated terminal.
One or more readable storage media storing computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
Acquire the first audio feature and the first text feature of the voice data to be tested;
Extract the second audio feature from the first audio feature based on the audio feature extraction network in the pre-trained audio recognition model; extract the second text feature from the first text feature based on the text feature extraction network in the pre-trained text recognition model;
Recognize the second audio feature to obtain an audio emotion recognition result; recognize the second text feature to obtain a text emotion recognition result;
Perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and send the emotion recognition result to the associated terminal.
The voice mood fluctuation analysis method, device, computer equipment, and computer-readable storage medium provided by the embodiments of the present invention analyze voice emotion through two channels: in addition to analyzing emotion from the acoustic prosody of the audio, the speaker's emotion is further judged from the spoken content, thereby improving the accuracy of emotion analysis.
FIG. 1 is a flow chart of the steps of the voice mood fluctuation analysis method according to the first embodiment of this application;
FIG. 2 is a specific flow chart of acquiring the voice data to be tested;
FIG. 3 is a specific flow chart of extracting the first audio feature from the voice data to be tested;
FIG. 4 is a specific flow chart of extracting the first text feature from the voice data to be tested;
FIG. 5 is a specific flow chart of recognizing the second audio feature and obtaining an audio emotion recognition result;
FIG. 6 is a specific flow chart of recognizing the second text feature and obtaining a text emotion recognition result;
FIG. 7 is a specific flow chart of performing fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and sending the emotion recognition result to the associated terminal;
FIG. 8 is a schematic diagram of the program modules of the second embodiment of the voice mood fluctuation analysis device of this application;
FIG. 9 is a schematic diagram of the hardware structure of the third embodiment of the computer equipment of this application.
In order to make the purpose, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application and are not intended to limit it. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
The technical solutions of the various embodiments can be combined with each other, but only on the basis that they can be realized by a person of ordinary skill in the art; when a combination of technical solutions is contradictory or cannot be realized, it should be considered that such a combination does not exist and is not within the scope of protection claimed by this application.
Embodiment One
Please refer to FIG. 1, which shows a flowchart of the steps of a voice mood fluctuation analysis method according to an embodiment of the present application. It can be understood that the flowchart in this method embodiment is not used to limit the order in which the steps are executed. The following is an exemplary description that takes a computer device as the execution subject, with the details as follows:
S100: Acquire the first audio feature and the first text feature of the voice data to be tested.
Referring to FIG. 2, the voice mood fluctuation analysis method of the embodiment of the present application further includes:
S110: Acquire voice data to be tested.
The acquiring of voice data to be tested further includes:
S110A: Obtain offline or online voice data.
Specifically, the voice data includes online voice data and offline voice data; the online voice data refers to voice data obtained in real time during a call, and the offline voice data refers to call voice data stored in the background of the system. The voice data to be tested is a recording file in wav format.
S110B: Separate and process the voice data to obtain the voice data to be tested, where the voice data to be tested includes multiple segments of first user voice data and second user voice data.
Specifically, after the voice data is obtained, the voice data to be tested is divided into multiple segments of first user voice data and second user voice data according to the silent parts of the call. Endpoint detection and voice separation techniques are used to remove the silent parts of the call; based on a set duration threshold for the silence between utterances, the start and end points of each utterance are marked, and the audio is cut at those time points to obtain multiple short audio clips. A voiceprint recognition tool marks the speaker identity and speaking time of each short audio clip and distinguishes them by number. The duration threshold is determined from empirical values; as an example, the duration threshold of this solution is 0.25 to 0.3 seconds.
The number includes, but is not limited to, the customer service agent's employee ID, the agent's landline number, and the customer's mobile phone number.
Specifically, the voiceprint recognition tool is the LIUM_SpkDiarization toolkit, and the first user's voice data and the second user's voice data are distinguished through the LIUM_SpkDiarization toolkit, for example as follows:
[Table 1]
start_time | end_time | speaker
0 | 3 | 1
4 | 8 | 2
8.3 | 12.5 | 1
We consider that the first person to speak is the first user (speaker 1 in the table), and the second is naturally the second user (speaker 2 in the table).
Referring to FIG. 3, the acquiring of the first audio feature of the voice data to be tested further includes:
S100A1: Perform framing and windowing on the voice data to be tested to obtain voice analysis frames.
Specifically, the voice data signal has short-term stationarity, so it can be divided into frames to obtain multiple audio frames; an audio frame is a collection of N sampling points. In this embodiment, N is 256 or 512, covering 20-30 ms. After obtaining the multiple audio frames, each audio frame is multiplied by a Hamming window to increase the continuity between the left and right ends of the frame, yielding the voice analysis frames.
S100B1: Perform a Fourier transform on the voice analysis frames to obtain the corresponding frequency spectra.
Specifically, because the characteristics of the voice data signal are difficult to observe in the time domain, it is necessary to convert the signal into an energy distribution in the frequency domain; the voice analysis frames are subjected to a Fourier transform to obtain the frequency spectrum of each voice analysis frame.
S100C1: Pass the spectrum through a mel filter bank to obtain a mel spectrum.
S100D1: Perform cepstrum analysis on the mel spectrum to obtain the first audio feature of the voice data to be tested.
Specifically, cepstrum analysis is performed on the mel spectrum to obtain 36 1024-dimensional audio vectors, and these audio vectors are the first audio feature of the voice data to be tested.
Referring to FIG. 4, the acquiring of the first text feature of the voice data to be tested further includes:
S100A2: Convert the voice data to be tested into text.
Specifically, a voice dictation interface is used to convert the multiple segments of first user voice data and second user voice data into text. As an embodiment, the dictation interface is the iFLYTEK voice dictation interface.
S100B2: Perform word segmentation processing on the text to obtain L segmented words, where L is a natural number greater than 0.
Specifically, the word segmentation processing is completed by a dictionary word segmentation algorithm; the dictionary word segmentation algorithm includes, but is not limited to, the forward maximum matching method, the reverse maximum matching method, and the bidirectional maximum matching method, and segmentation can also be based on hidden Markov models (HMM), CRF, SVM, or deep learning algorithms.
S100C2: Perform word vector mapping on each of the L segmented words to obtain a d-dimensional word vector matrix corresponding to the L segmented words, where d is a natural number greater than 0; the d-dimensional word vector matrix is the first text feature of the voice data to be tested.
Specifically, the 128-dimensional word vector of each segmented word is obtained through models such as word2vec.
S102: Extract the second audio feature from the first audio feature based on the audio feature extraction network in the pre-trained audio recognition model; extract the second text feature from the first text feature based on the text feature extraction network in the pre-trained text recognition model.
Specifically, the second audio feature and the second text feature are lower-dimensional semantic feature vectors extracted from the first audio feature and the first text feature by the feature extraction networks of the emotion recognition models, and they focus more on words that express emotion; by extracting the second audio feature and the second text feature, the learning ability of the model can be better and the accuracy of the final classification can be higher.
S104: Recognize the second audio feature to obtain an audio emotion recognition result; recognize the second text feature to obtain a text emotion recognition result.
Specifically, the audio emotion recognition result is obtained by inputting the audio feature into an audio recognition model, and the text emotion recognition result is obtained by inputting the text feature into a text recognition model. Specifically, the audio recognition model and the text emotion recognition model each include a feature extraction network and a classification network, where the feature extraction network is used to extract semantic feature vectors with fewer dimensions from the first audio feature and the first text feature (that is, the second audio feature and the second text feature), and the classification network is used to output the confidence of each preset emotion category, where the preset emotion categories can be defined according to business requirements, for example, positive, negative, etc. The text emotion recognition model is a deep neural network model that includes an Embedding layer and a long short-term memory recurrent layer (LSTM, Long Short-Term Memory), and the audio emotion recognition model is a neural network model that includes a self-attention layer and a bidirectional long short-term memory layer (a forward LSTM and a backward LSTM).
The long short-term memory network is used to handle sequence dependencies across long spans and is suitable for tasks that involve dependencies across long texts.
进一步地,本申请的实施例还包括,对所述音频识别模型和所述文字识别模型进行训练,所述训练过程包括:Further, the embodiment of the present application further includes training the audio recognition model and the text recognition model, and the training process includes:
获取与所述目标领域对应的训练集及校验集;Acquiring a training set and a verification set corresponding to the target field;
所述获取与目标领域对应的训练集和校验集包括以下步骤:The obtaining the training set and the verification set corresponding to the target field includes the following steps:
获取训练集和校验集的语音数据;Obtain the voice data of the training set and the verification set;
Specifically, the voice data of the training set and the verification set may be obtained from, among other sources, recordings of the company's internal call centre, customer service recordings provided by partner companies, customer service recordings provided by clients, or customer service recordings purchased directly from a data platform. In this embodiment the recordings are purchased directly from a data platform, and the recordings cover both the training set and the verification set.
对所述录音数据的情感类型进行标注;Mark the emotion type of the recording data;
Specifically, the annotation process is as follows: the pause time points of each recording are marked manually, yielding multiple short audio clips (dialogue segments) per recording, and each short clip is labelled with its emotional tendency (positive or negative). In this embodiment, the audio annotation tool audio-annotator is used to mark the start and end time points of each clip and its emotion label.
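For illustration only, one way to hold the manually annotated clips (start and end time points plus an emotion label) is sketched below; the field names and structure are assumptions, not a format prescribed by audio-annotator or by this application.

```python
# Illustrative data structure for labelled dialogue segments (assumed field names).
from dataclasses import dataclass

@dataclass
class LabelledClip:
    recording_id: str
    start_s: float      # clip start time in seconds
    end_s: float        # clip end time in seconds
    emotion: str        # "positive" or "negative"

clips = [
    LabelledClip("call_0001", 0.0, 6.3, "positive"),
    LabelledClip("call_0001", 6.3, 14.8, "negative"),
]
print(len(clips))
```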
分离训练集和校验集;Separate training set and verification set;
Specifically, the training set and the verification set are separated as follows: all labelled audio clip samples are randomly shuffled and then divided into two data sets at a ratio of 4:1; the larger part is used for model training (the training set) and the smaller part is used for model verification (the verification set).
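A minimal sketch of this 4:1 split, assuming the labelled clips are held as simple (file, label) pairs:

```python
# Randomly shuffle the labelled samples and split them 4:1 into train / verification sets.
import random

def split_samples(samples, seed=42):
    samples = list(samples)
    random.Random(seed).shuffle(samples)     # randomly scramble the labelled clips
    cut = int(len(samples) * 4 / 5)          # 4:1 ratio
    return samples[:cut], samples[cut:]      # (training set, verification set)

train_set, dev_set = split_samples([("clip_%03d.wav" % i, i % 2) for i in range(100)])
print(len(train_set), len(dev_set))          # 80 20
```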
基于训练集的情感类型,对所述语音情绪识别模型和文字情绪识别模型进行调整;Adjust the voice emotion recognition model and text emotion recognition model based on the emotion type of the training set;
The verification set is used to test the speech emotion recognition model and the text emotion recognition model in order to determine their accuracy.
参阅图5,所述识别所述第二音频特征,获取音频情绪识别结果进一步包括:Referring to FIG. 5, the recognizing the second audio feature and obtaining the audio emotion recognition result further includes:
S104A1:基于预先训练好的音频识别模型中的音频分类网络,识别所述第二音频特征,获取多个音频情绪分类及音频情绪分类对应的第一置信度;S104A1: Based on the audio classification network in the pre-trained audio recognition model, identify the second audio feature, and obtain multiple audio emotion classifications and first confidence levels corresponding to the audio emotion classifications;
The extracted second audio feature is input into the audio classification network of the audio recognition model; the classification network analyses the second audio feature and outputs the audio emotion classifications corresponding to the second audio feature together with a first confidence for each classification. For example, the first confidence of "positive emotion" is 0.3 and the first confidence of "negative emotion" is 0.7.
S104B1:选取第一置信度最高的音频情绪分类为目标音频情绪分类,对应的第一置信度为目标音频情绪分类参数。S104B1: Select the audio emotion classification with the highest first confidence as the target audio emotion classification, and the corresponding first confidence is the target audio emotion classification parameter.
对应的,目标音频情绪分类为“消极情绪”,目标音频情绪分类参数为0.7。Correspondingly, the target audio emotion is classified as "negative emotion", and the target audio emotion classification parameter is 0.7.
S104C1:对所述目标音频情绪分类向量参数进行数值映射,得到音频情绪识别结果。S104C1: Perform numerical mapping on the target audio emotion classification vector parameter to obtain an audio emotion recognition result.
Numerical mapping means mapping the output emotion category to a concrete value, which makes it easier to observe emotion fluctuations later. In one implementation, the emotion category is mapped to a number through a fixed functional relationship: after the first confidences of the preset emotion categories of the voice data to be tested are obtained, the target audio emotion classification parameter X of the category with the highest confidence is selected, and the final audio emotion recognition result Y is computed with the following formula.
In this embodiment the numerical mapping is: when the recognized emotion category is "positive", Y = 0.5X; when the recognized emotion category is "negative", Y = 0.5(1 + X). The final audio emotion recognition result is therefore a floating-point number between 0 and 1.
具体的,最终输出的音频情绪识别结果为0.85。Specifically, the final output audio emotion recognition result is 0.85.
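A minimal sketch of the numerical mapping just described; the dictionary-based interface and label strings are assumptions, but the formula and the 0.85 value follow the example above.

```python
# Map the top-confidence class and its confidence X to a single score Y in [0, 1]:
# Y = 0.5 * X for "positive", Y = 0.5 * (1 + X) for "negative".
def emotion_score(confidences: dict) -> float:
    label, x = max(confidences.items(), key=lambda kv: kv[1])   # target class and parameter X
    return 0.5 * x if label == "positive" else 0.5 * (1.0 + x)

print(emotion_score({"positive": 0.3, "negative": 0.7}))  # 0.85, matching the audio result above
```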
参阅图6,识别所述第二文字特征,获取文字情绪识别结果进一步包括:Referring to FIG. 6, recognizing the second text feature and obtaining a text emotion recognition result further includes:
S104A2:基于预先训练好的文字识别模型中的文字分类网络,识别所述第二文字特征,获取多个文字情绪分类向量对应的第二置信度。S104A2: Recognize the second text feature based on the text classification network in the pre-trained text recognition model, and obtain the second confidence level corresponding to the multiple text emotion classification vectors.
The extracted second text feature is input into the text classification network of the text recognition model; the classification network analyses the second text feature and outputs the text emotion classifications corresponding to the second text feature together with a second confidence for each classification. For example, the second confidence of "positive emotion" is 0.2 and the second confidence of "negative emotion" is 0.8.
S104B2: Select the text emotion classification with the highest second confidence as the target text emotion classification; the corresponding second confidence is the target text emotion classification parameter.
对应的,目标文字情绪分类为“消极情绪”,目标文字情绪分类参数为0.8。Correspondingly, the target text emotion is classified as "negative emotion", and the target text emotion classification parameter is 0.8.
S104C2:对所述目标文字情绪分类向量参数进行数值映射,得到文字情绪识别结果。S104C2: Perform numerical mapping on the target text emotion classification vector parameter to obtain a text emotion recognition result.
具体的,最终输出的文字情绪识别结果为0.9。Specifically, the final output text emotion recognition result is 0.9.
S106,对所述音频情绪识别结果和文字情绪识别结果进行融合处理,得到情绪识别结果,并将所述情绪识别结果发送至关联终端。S106: Perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain an emotion recognition result, and send the emotion recognition result to an associated terminal.
参阅图7,所述步骤S106可进一步包括:Referring to FIG. 7, the step S106 may further include:
S106A: Weight the audio emotion recognition result and the text emotion recognition result of each segment of the first user's voice data to obtain a first emotion value, and weight the audio emotion recognition result and the text emotion recognition result of each segment of the second user's voice data to obtain a second emotion value.
具体的,用数值加权的方法处理同一音频片段的两种情绪值,所述情绪值为数值为0到1之间的浮点数,越接近于1则情绪越偏向于消极,反之,越接近于0则情绪越偏向于积极。Specifically, the two emotion values of the same audio clip are processed by a numerical weighting method. The emotion value is a floating-point number between 0 and 1. The closer to 1, the more negative the emotion is. On the contrary, the closer to 0 means more positive emotions.
作为一个实施例,语音情绪识别通道得到的情绪值的权重为0.7;文字情绪识别通道得到的情绪值的权重为0.3。As an embodiment, the weight of the emotion value obtained by the speech emotion recognition channel is 0.7; the weight of the emotion value obtained by the text emotion recognition channel is 0.3.
以上述实施方式为例进行进一步说明,最终的输出的情绪值为0.865。Taking the foregoing embodiment as an example for further description, the final output sentiment value is 0.865.
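A minimal sketch of this weighted fusion, assuming a plain weighted sum with audio weight 0.7 and text weight 0.3 as in the embodiment above:

```python
# Fuse the two channel scores for one segment (audio weight 0.7, text weight 0.3).
def fuse(audio_score: float, text_score: float, w_audio: float = 0.7) -> float:
    return w_audio * audio_score + (1.0 - w_audio) * text_score

print(fuse(0.85, 0.9))   # 0.865, the fused emotion value from the example above
```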
S106B,根据所述第一情绪值生成第一情绪值热图及根据所述第二情绪值生成第二情绪值热图;S106B, generating a first emotion value heat map according to the first emotion value and generating a second emotion value heat map according to the second emotion value;
Specifically, each segment of the voice data to be tested is numbered in time order and an emotion value heat map is drawn; the heat map is used to cluster the emotions of each time period.
Specifically, the heatmap function of Python's seaborn library is used to draw the heat map of the emotion values, with different colours representing different emotions; larger emotion values are shown in darker colours.
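A hedged sketch of this plotting step; the colour map, the example scores and the single-row layout are assumptions, with only the use of seaborn's heatmap taken from the text.

```python
# Per-segment emotion values for one speaker laid out in time order and rendered as a heat map.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

segment_scores = [0.12, 0.35, 0.62, 0.865, 0.7, 0.2]      # emotion value per dialogue segment
data = np.array(segment_scores).reshape(1, -1)             # one row = one speaker over time

ax = sns.heatmap(data, vmin=0.0, vmax=1.0, cmap="Reds", annot=True,
                 xticklabels=[f"seg{i+1}" for i in range(len(segment_scores))],
                 yticklabels=["speaker"])
ax.set_title("Emotion value per time segment (closer to 1 = more negative)")
plt.savefig("emotion_heatmap.png")
```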
S106C,将所述第一情绪值热图和第二情绪值热图发送至关联终端。S106C: Send the first emotion value heat map and the second emotion value heat map to the associated terminal.
Specifically, the associated terminals include a first user terminal and a second user terminal. As an embodiment, when the first user and the second user are a customer and a customer service agent respectively, the associated terminals also include, besides the customer terminal and the agent terminal, the customer service quality supervision terminal and the agent's supervisor, so that the service quality of the agent can be supervised and corrected.
The embodiment of this application analyses voice emotion through two channels: in addition to analysing emotion from the acoustic prosody of the audio, the speaker's emotion is further judged from the spoken content, which improves the accuracy of the emotion analysis. Combined with dialogue separation technology, the emotion value of every dialogue segment is analysed and judged, so the speaker's emotion in each time period of the complete call is obtained and the speaker's emotion fluctuations can then be analysed. This provides a concrete reference and assistance for customer service quality inspection, makes the evaluation results more objective, and ultimately helps enterprises improve customer service quality and customer experience.
Embodiment Two
Please continue to refer to FIG. 8, which is a schematic diagram of the program modules of the voice emotion fluctuation analysis device of the present application. In this embodiment, the voice emotion fluctuation analysis device 20 may include, or be divided into, one or more program modules that are stored in a storage medium and executed by one or more processors to complete the present application and implement the voice emotion fluctuation analysis method described above. A program module referred to in the embodiments of the present application is a series of computer-readable instruction segments capable of completing specific functions, and is better suited than the program itself for describing the execution of the voice emotion fluctuation analysis device 20 in the storage medium. The following description introduces the functions of each program module of this embodiment:
第一语音特征获取模块200,用于获取待测语音数据的第一音频特征和第一文字特征。The first voice feature acquiring module 200 is configured to acquire the first audio feature and the first text feature of the voice data to be tested.
进一步地,第一语音特征获取模块200还用于:Further, the first voice feature acquisition module 200 is also used for:
获取离线或者在线的待测语音数据;Obtain offline or online voice data to be tested;
对所述语音数据进行分离处理得到待测语音数据,所述待测语音数据包括多段第一用户语音数据和第二用户语音数据。The voice data is separated and processed to obtain the voice data to be tested, and the voice data to be tested includes multiple pieces of first user voice data and second user voice data.
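One simple way to obtain per-speaker voice data is sketched below under the assumption that the call recording is dual-channel with one speaker per channel; the text does not prescribe a particular separation algorithm, so this is illustrative only.

```python
# Assumed dual-channel recording: customer (first user) on channel 0, agent (second user) on channel 1.
import soundfile as sf

audio, sr = sf.read("call_0001.wav")          # shape (num_samples, 2) for a stereo recording
first_user = audio[:, 0]                       # e.g. customer channel
second_user = audio[:, 1]                      # e.g. agent channel

sf.write("call_0001_user1.wav", first_user, sr)
sf.write("call_0001_user2.wav", second_user, sr)
```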
The second voice feature extraction module 202 is configured to extract the second audio feature from the first audio feature based on the audio feature extraction network of the pre-trained audio recognition model, and to extract the second text feature from the first text feature based on the text feature extraction network of the pre-trained text recognition model.
进一步地,第二语音特征提取模块202还用于:Further, the second speech feature extraction module 202 is also used for:
对所述待测语音数据进行分帧加窗处理,获得语音分析帧;Perform frame and window processing on the voice data to be tested to obtain voice analysis frames;
对所述语音分析帧进行傅里叶变换得到对应的频谱;Performing Fourier transform on the speech analysis frame to obtain a corresponding frequency spectrum;
将所述频谱经过梅尔滤波器组得到梅尔频谱;Passing the spectrum through the mel filter bank to obtain a mel spectrum;
将所述梅尔频谱进行倒谱分析,获得所述待测语音数据的第一音频特征。Perform cepstrum analysis on the Mel spectrum to obtain the first audio feature of the voice data to be measured.
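A hedged sketch of this feature pipeline using librosa; the file name, sample rate, FFT size and number of mel bands and coefficients are assumptions, while the framing/windowing, FFT, mel filter bank and cepstral steps mirror the sequence above.

```python
# First-audio-feature pipeline: framing + windowing + FFT (STFT), mel filter bank,
# then cepstral analysis (log + DCT) yielding MFCC-style coefficients per frame.
import librosa
import numpy as np

y, sr = librosa.load("segment.wav", sr=16000)                                     # assumed path / rate

power = np.abs(librosa.stft(y, n_fft=512, hop_length=256, window="hann")) ** 2   # framed, windowed, FFT
mel = librosa.feature.melspectrogram(S=power, sr=sr, n_mels=40)                   # mel filter bank
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=13)                # cepstral analysis

print(mfcc.shape)   # (13, num_frames): first audio feature per analysis frame
```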
进一步地,第二语音特征提取模块202还用于:Further, the second speech feature extraction module 202 is also used for:
将所述待测语音数据转换为文字;Converting the voice data to be tested into text;
对所述文字进行分词处理,得到L个分词,其中L为大于0的自然数;Perform word segmentation processing on the text to obtain L word segmentation, where L is a natural number greater than 0;
对所述L个分词分别进行词向量映射,以获取L个分词对应的d维词向量矩阵,其中d为大于0的自然数,所述d维词向量矩阵为待测语音数据的第一文字特征。The word vector mapping is performed on the L word segmentation respectively to obtain a d-dimensional word vector matrix corresponding to the L word segmentation, where d is a natural number greater than 0, and the d-dimensional word vector matrix is the first text feature of the voice data to be tested.
语音特征识别模块204:用于识别所述第二音频特征,获取音频情绪识别结果;识别所述第二文字特征,获取文字情绪识别结果。The voice feature recognition module 204: used to recognize the second audio feature to obtain an audio emotion recognition result; recognize the second text feature to obtain a text emotion recognition result.
进一步地,语音特征识别模块204还用于:Further, the voice feature recognition module 204 is also used for:
基于预先训练好的音频识别模型中的音频分类网络,识别所述第二音频特征,获取多个音频情绪分类向量对应的第一置信度;Recognize the second audio feature based on the audio classification network in the pre-trained audio recognition model, and obtain the first confidence corresponding to the multiple audio emotion classification vectors;
选取第一置信度最高的音频情绪分类为目标音频情绪分类,对应的第一置信度为目标音频情绪分类参数;The audio emotion classification with the highest first confidence is selected as the target audio emotion classification, and the corresponding first confidence is the target audio emotion classification parameter;
对所述目标音频情绪分类向量参数进行数值映射,得到音频情绪识别结果。Perform numerical mapping on the target audio emotion classification vector parameters to obtain audio emotion recognition results.
进一步地,语音特征识别模块204还用于:Further, the voice feature recognition module 204 is also used for:
基于预先训练好的文字识别模型中的文字分类网络,识别所述第二文字特征,获取多个文字情绪分类向量对应的第二置信度;Based on the text classification network in the pre-trained text recognition model, recognize the second text feature, and obtain the second confidence level corresponding to the multiple text emotion classification vectors;
选取第二置信度最高的音频情绪分类为目标文字情绪分类,对应的第二置信度为目标文字情绪分类参数;Select the audio emotion classification with the highest second confidence as the target text emotion classification, and the corresponding second confidence is the target text emotion classification parameter;
对所述目标文字情绪分类向量参数进行数值映射,得到文字情绪识别结果。Perform numerical mapping on the target text emotion classification vector parameters to obtain text emotion recognition results.
识别结果获取模块206:用于对所述音频情绪识别结果和文字情绪识别结果进行融合处理,得到情绪识别结果,并将所述情绪识别结果发送至关联终端。The recognition result acquisition module 206 is configured to perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and send the emotion recognition result to the associated terminal.
Further, the recognition result acquisition module 206 is also configured to:
weight the audio emotion recognition result and the text emotion recognition result of each segment of the first user's voice data to obtain a first emotion value, and weight the audio emotion recognition result and the text emotion recognition result of each segment of the second user's voice data to obtain a second emotion value;
根据所述第一情绪值生成第一情绪值热图及根据所述第二情绪值生成第二情绪值热图;Generating a first emotion value heat map according to the first emotion value and generating a second emotion value heat map according to the second emotion value;
将所述第一情绪值热图和第二情绪值热图发送至关联终端。The first emotion value heat map and the second emotion value heat map are sent to the associated terminal.
Embodiment Three
Refer to FIG. 9, which is a schematic diagram of the hardware architecture of the computer device of Embodiment Three of the present application. In this embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. The computer device 2 may be a rack server, a blade server, a tower server or a cabinet server (including an independent server, or a server cluster composed of multiple servers). As shown in FIG. 9, the computer device 2 at least includes, but is not limited to, a memory 21, a processor 22 and a network interface 23 that can communicate with each other through a system bus, as well as computer-readable instructions that are stored in the memory 21 and can run on the processor 22; these computer-readable instructions may be the computer-readable instructions corresponding to the voice emotion fluctuation analysis device 20.
本实施例中,存储器21至少包括一种类型的计算机可读存储介质,所述可读存 储介质包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘等。在一些实施例中,存储器21可以是计算机设备2的内部存储单元,例如该计算机设备2的硬盘或内存。在另一些实施例中,存储器21也可以是计算机设备2的外部存储设备,例如该计算机设备2上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。当然,存储器21还可以既包括计算机设备2的内部存储单元也包括其外部存储设备。本实施例中,存储器21通常用于存储安装于计算机设备2的操作系统装置和各类应用软件,例如实施例二的语音情绪波动分析装置20的程序代码等。此外,存储器21还可以用于暂时地存储已经输出或者将要输出的各类数据。In this embodiment, the memory 21 includes at least one type of computer-readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory ( RAM), static random access memory (SRAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), magnetic memory, magnetic disks, optical disks, etc. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, for example, a hard disk or a memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a smart media card (SMC), and a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc. Of course, the memory 21 may also include both the internal storage unit of the computer device 2 and its external storage device. In this embodiment, the memory 21 is generally used to store operating system devices and various application software installed in the computer equipment 2, for example, the program code of the voice mood fluctuation analysis device 20 in the second embodiment. In addition, the memory 21 can also be used to temporarily store various types of data that have been output or will be output.
处理器22在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器22通常用于控制计算机设备2的总体操作。本实施例中,处理器22用于运行存储器21中存储的程序代码或者处理数据,例如运行语音情绪波动分析装置20,以实现实施例一、二、三或四的语音情绪波动分析方法。In some embodiments, the processor 22 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips. The processor 22 is generally used to control the overall operation of the computer device 2. In this embodiment, the processor 22 is used to run the program code or process data stored in the memory 21, for example, to run the voice mood fluctuation analysis device 20 to implement the voice mood fluctuation analysis method of the first, second, third or fourth embodiment.
所述网络接口23可包括无线网络接口或有线网络接口,该网络接口23通常用于在所述计算机设备2与其他电子系统装置之间建立通信连接。例如,所述网络接口23用于通过网络将所述计算机设备2与外部终端相连,在所述计算机设备2与外部终端之间的建立数据传输通道和通信连接等。所述网络可以是企业内部网(Intranet)、互联网(Internet)、全球移动通讯系统系统(Global System of Mobile communication,GSM)、宽带码分多址(Wideband Code Division Multiple Access,WCDMA)、4G网络、5G网络、蓝牙(Bluetooth)、Wi-Fi等无线或有线网络。The network interface 23 may include a wireless network interface or a wired network interface, and the network interface 23 is generally used to establish a communication connection between the computer device 2 and other electronic system devices. For example, the network interface 23 is used to connect the computer device 2 with an external terminal through a network, and establish a data transmission channel and a communication connection between the computer device 2 and the external terminal. The network may be an intranet, the Internet, a global system of mobile communication (GSM), a wideband code division multiple access (WCDMA), a 4G network, 5G network, Bluetooth (Bluetooth), Wi-Fi and other wireless or wired networks.
需要指出的是,图9仅示出了具有部件20-23的计算机设备2,但是应理解的是,并不要求实施所有示出的部件,可以替代的实施更多或者更少的部件。It should be pointed out that FIG. 9 only shows the computer device 2 with components 20-23, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.
In this embodiment, the voice emotion fluctuation analysis device 20 stored in the memory 21 may also be divided into one or more program modules, and the one or more program modules are stored in the memory 21 and executed by one or more processors (the processor 22 in this embodiment) to complete the present application.
For example, FIG. 8 shows a schematic diagram of the program modules of the voice emotion fluctuation analysis device 20 of Embodiment Two. In that embodiment, the voice emotion fluctuation analysis device 20 may be divided into the first voice feature acquisition module 200, the second voice feature extraction module 202, the voice feature recognition module 204 and the recognition result acquisition module 206. A program module referred to in this application is a series of computer-readable instruction segments capable of completing specific functions, and is better suited than a program for describing the execution of the voice emotion fluctuation analysis device 20 in the computer device 2. The specific functions of the program modules 200 to 206 have been described in detail in Embodiment Two and are not repeated here.
Embodiment Four
This embodiment also provides a computer-readable storage medium, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, server or application store, on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the corresponding functions of the voice emotion fluctuation analysis method are realized. The computer-readable storage medium of this embodiment is used to store the voice emotion fluctuation analysis device 20, and when executed by a processor it implements the voice emotion fluctuation analysis method of Embodiment One, Two, Three or Four.
In an embodiment, one or more readable storage media storing computer-readable instructions are provided; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the voice emotion fluctuation analysis method of the foregoing embodiments, which is not repeated here to avoid redundancy. The readable storage media in this embodiment include non-volatile readable storage media and volatile readable storage media. Those of ordinary skill in the art can understand that all or part of the processes of the above method embodiments can be implemented by computer-readable instructions instructing the relevant hardware; the computer-readable instructions may be stored in a non-volatile or volatile readable storage medium, and when executed may include the processes of the above method embodiments.
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。The serial numbers of the foregoing embodiments of the present application are for description only, and do not represent the superiority or inferiority of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
The above are only preferred embodiments of this application and do not limit the patent scope of this application. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of this application, whether used directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of this application.
Claims (20)
- A voice emotion fluctuation analysis method, comprising:
obtaining a first audio feature and a first text feature of voice data to be tested;
extracting a second audio feature from the first audio feature based on an audio feature extraction network of a pre-trained audio recognition model, and extracting a second text feature from the first text feature based on a text feature extraction network of a pre-trained text recognition model;
recognizing the second audio feature to obtain an audio emotion recognition result, and recognizing the second text feature to obtain a text emotion recognition result;
performing fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal.
- 根据权利要求1所述的语音情绪波动分析方法,其中,所述获取待测语音数据的第一音频特征和第一文字特征包括:The method for analyzing voice mood fluctuations according to claim 1, wherein said obtaining the first audio feature and the first text feature of the voice data to be tested comprises:对所述待测语音数据进行分帧加窗处理,获得语音分析帧;Perform frame and window processing on the voice data to be tested to obtain voice analysis frames;对所述语音分析帧进行傅里叶变换得到对应的频谱;Performing Fourier transform on the speech analysis frame to obtain a corresponding frequency spectrum;将所述频谱经过梅尔滤波器组得到梅尔频谱;Passing the spectrum through the mel filter bank to obtain a mel spectrum;将所述梅尔频谱进行倒谱分析,获得所述待测语音数据的第一音频特征。Perform cepstrum analysis on the Mel spectrum to obtain the first audio feature of the voice data to be measured.
- 根据权利要求2所述的语音情绪波动分析方法,其中,所述识别所述第二音频特征,获取音频情绪识别结果;识别所述第二文字特征,获取文字情绪识别结果包括:The method for analyzing voice emotion fluctuations according to claim 2, wherein said recognizing said second audio feature to obtain an audio emotion recognition result; recognizing said second text feature and obtaining a text emotion recognition result comprises:基于预先训练好的音频识别模型中的音频分类网络,识别所述第二音频特征,获取多个音频情绪分类向量对应的第一置信度;Recognize the second audio feature based on the audio classification network in the pre-trained audio recognition model, and obtain the first confidence corresponding to the multiple audio emotion classification vectors;选取第一置信度最高的音频情绪分类为目标音频情绪分类,对应的第一置信度为目标音频情绪分类参数;The audio emotion classification with the highest first confidence is selected as the target audio emotion classification, and the corresponding first confidence is the target audio emotion classification parameter;对所述目标音频情绪分类向量参数进行数值映射,得到音频情绪识别结果。Perform numerical mapping on the target audio emotion classification vector parameters to obtain audio emotion recognition results.
- 根据权利要求1所述的语音情绪波动分析方法,其中,所述获取待 测语音数据的第一音频特征和第一文字特征还包括:The voice mood fluctuation analysis method according to claim 1, wherein said obtaining the first audio feature and the first text feature of the voice data to be tested further comprises:将所述待测语音数据转换为文字;Converting the voice data to be tested into text;对所述文字进行分词处理,得到L个分词,其中L为大于0的自然数;Perform word segmentation processing on the text to obtain L word segmentation, where L is a natural number greater than 0;对所述L个分词分别进行词向量映射,以获取L个分词对应的d维词向量矩阵,其中d为大于0的自然数,所述d维词向量矩阵为待测语音数据的第一文字特征。The word vector mapping is performed on the L word segmentation respectively to obtain a d-dimensional word vector matrix corresponding to the L word segmentation, where d is a natural number greater than 0, and the d-dimensional word vector matrix is the first text feature of the voice data to be tested.
- 根据权利要求4所述的语音情绪波动分析方法,其中,所述识别所述第二音频特征,获取音频情绪识别结果;识别所述第二文字特征,获取文字情绪识别结果包括:The method for analyzing voice emotion fluctuations according to claim 4, wherein said recognizing said second audio feature to obtain an audio emotion recognition result; recognizing said second text feature and obtaining a text emotion recognition result comprises:基于预先训练好的文字识别模型中的文字分类网络,识别所述第二文字特征,获取多个文字情绪分类向量对应的第二置信度;Based on the text classification network in the pre-trained text recognition model, recognize the second text feature, and obtain the second confidence level corresponding to the multiple text emotion classification vectors;选取第二置信度最高的音频情绪分类为目标文字情绪分类,对应的第二置信度为目标文字情绪分类参数;Select the audio emotion classification with the highest second confidence as the target text emotion classification, and the corresponding second confidence is the target text emotion classification parameter;对所述目标文字情绪分类向量参数进行数值映射,得到文字情绪识别结果。Perform numerical mapping on the target text emotion classification vector parameters to obtain text emotion recognition results.
- 根据权利要求1所述的语音情绪波动分析方法,其中,所述方法还包括:The method for analyzing voice mood fluctuations according to claim 1, wherein the method further comprises:获取离线或者在线的待测语音数据;Obtain offline or online voice data to be tested;对所述语音数据进行分离处理得到待测语音数据,所述待测语音数据包括多段第一用户语音数据和第二用户语音数据。The voice data is separated and processed to obtain the voice data to be tested, and the voice data to be tested includes multiple pieces of first user voice data and second user voice data.
- 根据权利要求6所述的语音情绪波动分析方法,其中,所述对所述音频情绪识别结果和文字情绪识别结果进行融合处理,得到情绪识别结果,并将所述情绪识别结果发送至关联终端包括:The method for analyzing voice emotion fluctuations according to claim 6, wherein the fusion processing of the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and sending the emotion recognition result to the associated terminal comprises :对每段第一用户的语音数据的音频情绪识别结果和文字情绪识别结果进行加权处理,得到第一情绪值,对每段第二用户的语音数据的音频情绪识别结果和文字情绪识别结果进行加权处理,得到第二情绪值;Perform weighting processing on the audio emotion recognition results and text emotion recognition results of each segment of the first user’s voice data to obtain the first emotion value, and weight the audio emotion recognition results and text emotion recognition results of each segment of the second user’s voice data Processing, get the second emotional value;根据所述第一情绪值生成第一情绪值热图及根据所述第二情绪值生成第二情绪值热图;Generating a first emotion value heat map according to the first emotion value and generating a second emotion value heat map according to the second emotion value;将所述第一情绪值热图和第二情绪值热图发送至关联终端。The first emotion value heat map and the second emotion value heat map are sent to the associated terminal.
- A voice emotion fluctuation analysis apparatus, comprising:
a first voice feature acquisition module, configured to obtain a first audio feature and a first text feature of voice data to be tested;
a second voice feature extraction module, configured to extract a second audio feature from the first audio feature based on an audio feature extraction network of a pre-trained audio recognition model, and to extract a second text feature from the first text feature based on a text feature extraction network of a pre-trained text recognition model;
a voice feature recognition module, configured to recognize the second audio feature to obtain an audio emotion recognition result, and to recognize the second text feature to obtain a text emotion recognition result; and
a recognition result acquisition module, configured to perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain an emotion recognition result, and to send the emotion recognition result to an associated terminal.
- A computer device, comprising a memory, a processor and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
obtaining a first audio feature and a first text feature of voice data to be tested;
extracting a second audio feature from the first audio feature based on an audio feature extraction network of a pre-trained audio recognition model, and extracting a second text feature from the first text feature based on a text feature extraction network of a pre-trained text recognition model;
recognizing the second audio feature to obtain an audio emotion recognition result, and recognizing the second text feature to obtain a text emotion recognition result;
performing fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal.
- 根据权利要求9所述的计算机设备,其中,所述识别所述第二音频 特征,获取音频情绪识别结果;识别所述第二文字特征,获取文字情绪识别结果包括:The computer device according to claim 9, wherein said recognizing said second audio feature to obtain an audio emotion recognition result; recognizing said second text feature and obtaining a text emotion recognition result comprises:基于预先训练好的音频识别模型中的音频分类网络,识别所述第二音频特征,获取多个音频情绪分类向量对应的第一置信度;Recognize the second audio feature based on the audio classification network in the pre-trained audio recognition model, and obtain the first confidence corresponding to the multiple audio emotion classification vectors;选取第一置信度最高的音频情绪分类为目标音频情绪分类,对应的第一置信度为目标音频情绪分类参数;The audio emotion classification with the highest first confidence is selected as the target audio emotion classification, and the corresponding first confidence is the target audio emotion classification parameter;对所述目标音频情绪分类向量参数进行数值映射,得到音频情绪识别结果。Perform numerical mapping on the target audio emotion classification vector parameters to obtain audio emotion recognition results.
- 根据权利要求9所述的计算机设备,其中,所述获取待测语音数据的第一音频特征和第一文字特征还包括:The computer device according to claim 9, wherein said acquiring the first audio feature and the first text feature of the voice data to be tested further comprises:将所述待测语音数据转换为文字;Converting the voice data to be tested into text;对所述文字进行分词处理,得到L个分词,其中L为大于0的自然数;Perform word segmentation processing on the text to obtain L word segmentation, where L is a natural number greater than 0;对所述L个分词分别进行词向量映射,以获取L个分词对应的d维词向量矩阵,其中d为大于0的自然数,所述d维词向量矩阵为待测语音数据的第一文字特征。The word vector mapping is performed on the L word segmentation respectively to obtain a d-dimensional word vector matrix corresponding to the L word segmentation, where d is a natural number greater than 0, and the d-dimensional word vector matrix is the first text feature of the voice data to be tested.
- 根据权利要求11所述的计算机设备,其中,所述识别所述第二音频特征,获取音频情绪识别结果;识别所述第二文字特征,获取文字情绪识别结果包括:The computer device according to claim 11, wherein said recognizing said second audio feature to obtain an audio emotion recognition result; recognizing said second text feature and obtaining a text emotion recognition result comprises:基于预先训练好的文字识别模型中的文字分类网络,识别所述第二文字特征,获取多个文字情绪分类向量对应的第二置信度;Based on the text classification network in the pre-trained text recognition model, recognize the second text feature, and obtain the second confidence level corresponding to the multiple text emotion classification vectors;选取第二置信度最高的音频情绪分类为目标文字情绪分类,对应的第二置信度为目标文字情绪分类参数;Select the audio emotion classification with the highest second confidence as the target text emotion classification, and the corresponding second confidence is the target text emotion classification parameter;对所述目标文字情绪分类向量参数进行数值映射,得到文字情绪识别结果。Perform numerical mapping on the target text emotion classification vector parameters to obtain text emotion recognition results.
- 根据权利要求9所述的计算机设备,其中,所述方法还包括:The computer device according to claim 9, wherein the method further comprises:获取离线或者在线的待测语音数据;Obtain offline or online voice data to be tested;对所述语音数据进行分离处理得到待测语音数据,所述待测语音 数据包括多段第一用户语音数据和第二用户语音数据。The voice data is separated and processed to obtain the voice data to be tested, and the voice data to be tested includes multiple pieces of first user voice data and second user voice data.
- 根据权利要求13所述的计算机设备,其中,所述对所述音频情绪识别结果和文字情绪识别结果进行融合处理,得到情绪识别结果,并将所述情绪识别结果发送至关联终端包括:The computer device according to claim 13, wherein the fusion processing of the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and sending the emotion recognition result to the associated terminal comprises:对每段第一用户的语音数据的音频情绪识别结果和文字情绪识别结果进行加权处理,得到第一情绪值,对每段第二用户的语音数据的音频情绪识别结果和文字情绪识别结果进行加权处理,得到第二情绪值;Perform weighting processing on the audio emotion recognition results and text emotion recognition results of each segment of the first user’s voice data to obtain the first emotion value, and weight the audio emotion recognition results and text emotion recognition results of each segment of the second user’s voice data Processing, get the second emotional value;根据所述第一情绪值生成第一情绪值热图及根据所述第二情绪值生成第二情绪值热图;Generating a first emotion value heat map according to the first emotion value and generating a second emotion value heat map according to the second emotion value;将所述第一情绪值热图和第二情绪值热图发送至关联终端。The first emotion value heat map and the second emotion value heat map are sent to the associated terminal.
- One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining a first audio feature and a first text feature of voice data to be tested;
extracting a second audio feature from the first audio feature based on an audio feature extraction network of a pre-trained audio recognition model, and extracting a second text feature from the first text feature based on a text feature extraction network of a pre-trained text recognition model;
recognizing the second audio feature to obtain an audio emotion recognition result, and recognizing the second text feature to obtain a text emotion recognition result;
performing fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal.
- 根据权利要求15所述的可读存储介质,其中,所述识别所述第二音频特征,获取音频情绪识别结果;识别所述第二文字特征,获取文字情绪识别结果包括:The readable storage medium according to claim 15, wherein said recognizing said second audio feature to obtain an audio emotion recognition result; recognizing said second text feature and obtaining a text emotion recognition result comprises:基于预先训练好的音频识别模型中的音频分类网络,识别所述第 二音频特征,获取多个音频情绪分类向量对应的第一置信度;Recognize the second audio feature based on the audio classification network in the pre-trained audio recognition model, and obtain the first confidence corresponding to the multiple audio emotion classification vectors;选取第一置信度最高的音频情绪分类为目标音频情绪分类,对应的第一置信度为目标音频情绪分类参数;The audio emotion classification with the highest first confidence is selected as the target audio emotion classification, and the corresponding first confidence is the target audio emotion classification parameter;对所述目标音频情绪分类向量参数进行数值映射,得到音频情绪识别结果。Perform numerical mapping on the target audio emotion classification vector parameters to obtain audio emotion recognition results.
- 根据权利要求15所述的可读存储介质,其中,所述获取待测语音数据的第一音频特征和第一文字特征还包括:The readable storage medium according to claim 15, wherein said obtaining the first audio feature and the first text feature of the voice data to be tested further comprises:将所述待测语音数据转换为文字;Converting the voice data to be tested into text;对所述文字进行分词处理,得到L个分词,其中L为大于0的自然数;Perform word segmentation processing on the text to obtain L word segmentation, where L is a natural number greater than 0;对所述L个分词分别进行词向量映射,以获取L个分词对应的d维词向量矩阵,其中d为大于0的自然数,所述d维词向量矩阵为待测语音数据的第一文字特征。The word vector mapping is performed on the L word segmentation respectively to obtain a d-dimensional word vector matrix corresponding to the L word segmentation, where d is a natural number greater than 0, and the d-dimensional word vector matrix is the first text feature of the voice data to be tested.
- 根据权利要求17所述的可读存储介质,其中,所述识别所述第二音频特征,获取音频情绪识别结果;识别所述第二文字特征,获取文字情绪识别结果包括:The readable storage medium according to claim 17, wherein said recognizing said second audio feature to obtain an audio emotion recognition result; recognizing said second text feature and obtaining a text emotion recognition result comprises:基于预先训练好的文字识别模型中的文字分类网络,识别所述第二文字特征,获取多个文字情绪分类向量对应的第二置信度;Based on the text classification network in the pre-trained text recognition model, recognize the second text feature, and obtain the second confidence level corresponding to the multiple text emotion classification vectors;选取第二置信度最高的音频情绪分类为目标文字情绪分类,对应的第二置信度为目标文字情绪分类参数;Select the audio emotion classification with the highest second confidence as the target text emotion classification, and the corresponding second confidence is the target text emotion classification parameter;对所述目标文字情绪分类向量参数进行数值映射,得到文字情绪识别结果。Perform numerical mapping on the target text emotion classification vector parameters to obtain text emotion recognition results.
- 根据权利要求15所述的可读存储介质,其中,所述方法还包括:The readable storage medium according to claim 15, wherein the method further comprises:获取离线或者在线的待测语音数据;Obtain offline or online voice data to be tested;对所述语音数据进行分离处理得到待测语音数据,所述待测语音数据包括多段第一用户语音数据和第二用户语音数据。The voice data is separated and processed to obtain the voice data to be tested, and the voice data to be tested includes multiple pieces of first user voice data and second user voice data.
- 根据权利要求19所述的可读存储介质,其中,所述对所述音频情绪识别结果和文字情绪识别结果进行融合处理,得到情绪识别结 果,并将所述情绪识别结果发送至关联终端包括:The readable storage medium according to claim 19, wherein the fusion processing of the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and sending the emotion recognition result to the associated terminal comprises:对每段第一用户的语音数据的音频情绪识别结果和文字情绪识别结果进行加权处理,得到第一情绪值,对每段第二用户的语音数据的音频情绪识别结果和文字情绪识别结果进行加权处理,得到第二情绪值;Perform weighting processing on the audio emotion recognition results and text emotion recognition results of each segment of the first user’s voice data to obtain the first emotion value, and weight the audio emotion recognition results and text emotion recognition results of each segment of the second user’s voice data Processing, get the second emotional value;根据所述第一情绪值生成第一情绪值热图及根据所述第二情绪值生成第二情绪值热图;Generating a first emotion value heat map according to the first emotion value and generating a second emotion value heat map according to the second emotion value;将所述第一情绪值热图和第二情绪值热图发送至关联终端。The first emotion value heat map and the second emotion value heat map are sent to the associated terminal.
Applications Claiming Priority (2)
- CN201911341679.XA (priority date 2019-12-24, filed 2019-12-24): Voice emotion fluctuation analysis method and device
- CN201911341679.X (priority date 2019-12-24)

Publications (1)
- WO2021128741A1 (published 2021-07-01)

Family ID: 70317032

Family Applications (1)
- PCT/CN2020/094338 (priority date 2019-12-24, filed 2020-06-04): Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium

Country Status (2)
- CN: CN111081279A
- WO: WO2021128741A1
Cited By (1)
- CN114373455A (published 2022-04-19, 北京声智科技有限公司): Emotion recognition method and device, electronic equipment and storage medium
Families Citing this family (15)
- CN111081279A (2020-04-28, 深圳壹账通智能科技有限公司): Voice emotion fluctuation analysis method and device
- CN111739559B (2023-02-28, 北京捷通华声科技股份有限公司): Speech early warning method, device, equipment and storage medium
- CN111916112A (2020-11-10, 浙江百应科技有限公司): Emotion recognition method based on voice and characters
- CN111938674A (2020-11-17, 南京宇乂科技有限公司): Emotion recognition control system for conversation
- CN112215927B (2023-06-23, 腾讯科技(深圳)有限公司): Face video synthesis method, device, equipment and medium
- CN112100337B (2024-03-05, 平安科技(深圳)有限公司): Emotion recognition method and device in interactive dialogue
- CN112527994A (2021-03-19, 平安银行股份有限公司): Emotion analysis method, emotion analysis device, emotion analysis equipment and readable storage medium
- CN112837702A (2021-05-25, 萨孚凯信息系统(无锡)有限公司): Voice emotion distributed system and voice signal processing method
- CN112911072A (2021-06-04, 携程旅游网络技术(上海)有限公司): Call center volume identification method and device, electronic equipment and storage medium
- CN113053409B (2024-04-12, 科大讯飞股份有限公司): Audio evaluation method and device
- CN113129927B (2023-04-07, 平安科技(深圳)有限公司): Voice emotion recognition method, device, equipment and storage medium
- CN114049902B (2023-04-07, 广东万丈金数信息技术股份有限公司): Aricloud-based recording uploading identification and emotion analysis method and system
- CN117333913A (2024-01-02, 上海哔哩哔哩科技有限公司): Method and device for identifying emotion categories, storage medium and electronic equipment
- CN115430155A (2022-12-06, 北京中科心研科技有限公司): Team cooperation capability assessment method and system based on audio analysis
- CN117688344B (2024-05-07, 北京大学): Multi-mode fine granularity trend analysis method and system based on large model
Citations (6)
- CN108305642A (2018-07-20, 腾讯科技(深圳)有限公司): The determination method and apparatus of emotion information
- CN108305641A (2018-07-20, 腾讯科技(深圳)有限公司): The determination method and apparatus of emotion information
- CN108305643A (2018-07-20, 腾讯科技(深圳)有限公司): The determination method and apparatus of emotion information
- US20190325897A1 (2019-10-24, International Business Machines Corporation): Quantifying customer care utilizing emotional assessments
- CN110390956A (2019-10-29, 龙马智芯(珠海横琴)科技有限公司): Emotion recognition network model, method and electronic equipment
- CN111081279A (2020-04-28, 深圳壹账通智能科技有限公司): Voice emotion fluctuation analysis method and device
Family Cites Families (2)
- CN102779510B (2013-12-18, 东南大学): Speech emotion recognition method based on feature space self-adaptive projection
- CN106228977B (2019-07-19, 合肥工业大学): Multi-mode fusion song emotion recognition method based on deep learning
Also Published As
- CN111081279A (published 2020-04-28)
Legal Events
- 121: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20904401; Country of ref document: EP; Kind code of ref document: A1)
- NENP: Non-entry into the national phase (Ref country code: DE)
- 32PN: Public notification in the EP bulletin as the address of the addressee cannot be established (noting of loss of rights pursuant to Rule 112(1) EPC, EPO Form 1205A dated 31.10.2022)
- 122: PCT application non-entry into the European phase (Ref document number: 20904401; Country of ref document: EP; Kind code of ref document: A1)