WO2021128741A1 - Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium - Google Patents
- Publication number
- WO2021128741A1 (PCT/CN2020/094338)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- text
- emotion
- feature
- recognition result
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/05—Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
Definitions
- the embodiments of the present application relate to the field of artificial intelligence technology, and in particular to a method, device, computer equipment, and storage medium for analyzing voice mood fluctuations.
- mood swing analysis is used in more and more business scenarios, such as the mood swings of both parties when a customer service staff talks with a customer.
- in the prior art, mood fluctuation in audio is generally analyzed from the acoustic signal alone, for example from intonation and from changes in the frequency and amplitude of the sound waves.
- the inventor realized that this analysis approach is one-dimensional; moreover, the audio signals of different people differ, so analyzing emotion using only the acoustic signal has low accuracy.
- the embodiments of the present application provide a voice mood fluctuation analysis method, device, computer equipment, and computer-readable storage medium, which address the problem of low accuracy in analyzing mood fluctuations.
- a method for analyzing voice mood fluctuations including:
- a voice mood fluctuation analysis device including:
- the first voice feature acquisition module used to acquire the first audio feature and the first text feature of the voice data to be tested;
- the second voice feature extraction module used to extract the second audio feature from the first audio feature based on the audio feature extraction network in the pre-trained audio recognition model, and to extract the second text feature from the first text feature based on the text feature extraction network in the pre-trained text recognition model;
- Voice feature recognition module used to recognize the second audio feature to obtain an audio emotion recognition result; recognize the second text feature to obtain a text emotion recognition result;
- Recognition result acquisition module used to perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and send the emotion recognition result to the associated terminal.
- a computer device includes a memory, a processor, and computer-readable instructions stored on the memory and capable of running on the processor.
- when the processor executes the computer-readable instructions, the following steps are implemented:
- one or more readable storage media storing computer-readable instructions.
- when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
- the voice mood fluctuation analysis method, device, computer equipment, and computer-readable storage medium provided by the embodiments of the present invention analyze voice emotion through two channels.
- in addition to analyzing emotion from the acoustic prosody of the audio, the speaker's emotion is further judged from the spoken content, thereby improving the accuracy of emotion analysis.
- FIG. 1 is a flow chart of the steps of the method for analyzing voice mood fluctuations according to the first embodiment of this application;
- Figure 2 is a specific flow chart of obtaining the voice data to be tested
- Figure 3 is a specific flow chart of extracting the first audio feature in the voice data to be tested
- Figure 4 is a specific flow chart of extracting the first text feature in the voice data to be tested
- Figure 5 is a specific flow chart of identifying the second audio feature and obtaining an audio emotion recognition result
- Figure 6 is a specific flow chart of recognizing the second text feature and obtaining text emotion recognition results
- FIG. 7 is a specific flow chart of performing fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and sending the emotion recognition result to the associated terminal.
- FIG. 8 is a schematic diagram of the program modules of the second embodiment of the speech mood fluctuation analysis device of this application.
- FIG. 9 is a schematic diagram of the hardware structure of the third embodiment of the computer equipment of this application.
- FIG. 1 shows a flowchart of the steps of a method for analyzing voice mood fluctuations according to an embodiment of the present application. It can be understood that the flowchart in this method embodiment is not used to limit the order of execution of the steps.
- the following is an exemplary description that takes a computer device as the execution subject, with the details as follows:
- the voice mood fluctuation analysis method of the embodiment of the present application further includes:
- the acquiring voice data to be tested further includes:
- the voice data includes online voice data and offline voice data
- the online voice data refers to the voice data obtained in real time during a call
- the offline voice data refers to the voice data from the call stored in the background of the system
- the voice data to be tested is a recording file in wav format.
- S110B Separate and process the voice data to obtain voice data to be tested, where the voice data to be tested includes multiple pieces of first user voice data and second user voice data.
- specifically, after the voice data is obtained, the voice data to be tested is divided into multiple segments of first user voice data and second user voice data according to the silent parts of the call; endpoint detection and voice separation techniques are used to remove the silent parts of the call.
- based on a set duration threshold for the silence between utterances, the start and end points of each utterance are marked, and the audio is cut at those time points to obtain multiple short audio clips.
- a voiceprint recognition tool marks the speaker identity and speaking time of each short audio clip and distinguishes them by number.
- the duration threshold is determined according to empirical values. As an example, the duration threshold of this solution is 0.25 to 0.3 seconds.
- the number includes, but is not limited to, the customer service agent's employee ID, the agent's landline number, and the customer's mobile phone number.
- the voiceprint recognition tool is the LIUM_SpkDiarization toolkit
- the first user's voice data and the second user's voice data are distinguished through the LIUM_SpkDiarization toolkit: each speaker turn is labeled with a start time, an end time, and a speaker number, with the first speaker to talk treated as the first user (speaker 1) and the second as the second user (speaker 2); a sketch of the segmentation step is given below.
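- the patent performs this step with endpoint detection and the LIUM_SpkDiarization toolkit; purely as an illustration, the following is a minimal sketch of silence-based splitting under assumed tooling (pydub) and an assumed energy threshold, with the 250 ms minimum silence length taken from the 0.25-0.3 s threshold mentioned above.

```python
# Illustrative sketch only: silence-based splitting of a call recording into
# short clips, assuming the pydub library as a stand-in for the endpoint
# detection step; speaker labelling would still come from a diarization tool
# such as LIUM_SpkDiarization.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

call = AudioSegment.from_wav("call.wav")         # wav recording, as in the patent

turns = detect_nonsilent(
    call,
    min_silence_len=250,               # 0.25 s pause threshold (patent: 0.25-0.3 s)
    silence_thresh=call.dBFS - 16,     # placeholder energy threshold
)

clips = []
for i, (start_ms, end_ms) in enumerate(turns):
    clips.append({
        "start": start_ms / 1000,
        "end": end_ms / 1000,
        # naive alternating assignment; a real system uses voiceprint diarization
        "speaker": 1 if i % 2 == 0 else 2,
        "audio": call[start_ms:end_ms],
    })
```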
- the acquiring the first audio feature of the voice data to be tested further includes:
- S100A1 Perform framing and windowing on the voice data to be tested to obtain voice analysis frames
- the voice data signal has short-term stability, and the voice data signal can be subjected to framing processing to obtain multiple audio frames.
- the audio frame refers to a collection of N sampling points. In this embodiment, N is 256 or 512, covering 20-30ms. After obtaining multiple audio frames, each audio frame is multiplied by a Hamming window to increase the continuity between the left and right ends of the frame, yielding the voice analysis frames.
- S100B1 Perform Fourier transform on the speech analysis frame to obtain a corresponding frequency spectrum
- because the characteristics of the voice data signal are difficult to observe in the time domain, it is necessary to convert the signal into an energy distribution in the frequency domain; the voice analysis frames are subjected to a Fourier transform to obtain the frequency spectrum of each voice analysis frame.
- S100C1 Pass the spectrum through a mel filter bank to obtain a mel spectrum
- S100D1 Perform cepstrum analysis on the Mel spectrum to obtain the first audio feature of the voice data to be measured.
- cepstrum analysis is performed on the Mel spectrum to obtain 36 1024-dimensional audio vectors, and the audio vectors are the first audio features of the voice data to be tested.
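- steps S100A1 to S100D1 correspond closely to a standard MFCC pipeline; as an illustration only, a minimal sketch using librosa (an assumed library, not named in the patent) is shown below, where the 512-sample frame matches the N given above but the coefficient count is an assumption and the patent's 36 x 1024 vector shape is not reproduced.

```python
# Illustrative sketch only: a standard MFCC pipeline approximating steps
# S100A1-S100D1 (framing, Hamming window, Fourier transform, mel filter bank,
# cepstrum analysis); the 36 x 1024 output shape used by the patent is not
# reproduced here.
import librosa

y, sr = librosa.load("clip.wav", sr=None)   # one short audio clip to be tested
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=36,          # 36 coefficients, loosely mirroring the 36 audio vectors
    n_fft=512,          # frame of N = 512 samples (roughly 20-30 ms of audio)
    hop_length=256,
    window="hamming",   # Hamming window applied before the Fourier transform
)
# mfcc has shape (36, number_of_frames); the patent's 1024-dimensional vectors
# would come from its own, unspecified pooling or projection of such features.
print(mfcc.shape)
```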
- the acquiring the first text feature of the voice data to be tested further includes:
- S100A2 Convert the voice data to be tested into text
- a voice dictation interface is used to convert the multiple pieces of first user voice data and second user voice data into text.
- the dictation interface is a voice dictation interface of iFLYTEK.
- S100B2 Perform word segmentation processing on the text to obtain L segmented words, where L is a natural number greater than 0;
- the word segmentation process is completed by a dictionary word segmentation algorithm.
- the dictionary word segmentation algorithm includes, but is not limited to, the forward maximum matching method, the reverse maximum matching method, and the bidirectional maximum matching method; segmentation can also be based on hidden Markov models (HMM), CRF, SVM, or deep learning algorithms.
- S100C2 Perform word vector mapping on each of the L segmented words to obtain a d-dimensional word vector matrix corresponding to the L segmented words, where d is a natural number greater than 0; the d-dimensional word vector matrix is the first text feature of the voice data to be tested.
- the 128-dimensional word vector of each word segmentation is obtained through models such as word2vec.
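- a minimal sketch of the segmentation and word-vector mapping follows; jieba and gensim are assumed libraries (the patent names only dictionary-based segmentation and word2vec), the 128-dimensional size follows the example above, and a real system would train word2vec on a large corpus rather than a single transcript.

```python
# Illustrative sketch only: dictionary-based word segmentation followed by a
# 128-dimensional word2vec mapping; jieba and gensim are assumed libraries.
import jieba
import numpy as np
from gensim.models import Word2Vec

transcript = "这个服务太让人失望了"        # text converted from one audio clip
tokens = jieba.lcut(transcript)            # the L segmented words

w2v = Word2Vec(sentences=[tokens], vector_size=128, window=5, min_count=1)

# d-dimensional word vector matrix (L x 128) used as the first text feature.
word_matrix = np.stack([w2v.wv[tok] for tok in tokens])
print(word_matrix.shape)                   # (L, 128)
```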
- the second audio feature and the second text feature are lower-dimensional semantic feature vectors extracted from the first audio feature and the first text feature by the feature extraction networks of the emotion recognition models, and they focus more on words that express emotion.
- by extracting the second audio feature and the second text feature, the learning ability of the model can be better, and the accuracy of the final classification can be higher.
- S104 Recognize the second audio feature to obtain an audio emotion recognition result; recognize the second text feature to obtain a text emotion recognition result.
- the audio recognition result is obtained by inputting the audio feature into an audio recognition model
- the text emotion recognition result is obtained by inputting the text feature into a text recognition model
- the audio recognition model and the text emotion recognition model each include a feature extraction network and a classification network, where the feature extraction network is used to extract semantic feature vectors with fewer dimensions from the first audio feature and the first text feature (that is, the second audio feature and the second text feature), and the classification network is used to output the confidence of each preset emotion category, where the preset emotion categories can be defined according to business requirements, for example, positive, negative, etc.
- the text emotion recognition model is a deep neural network model that includes an Embedding layer and a long short-term memory recurrent layer (LSTM, Long Short-Term Memory), and the audio emotion recognition model is a neural network model that includes a self-attention layer and a bidirectional long short-term memory layer (a forward LSTM and a backward LSTM).
- the long short-term memory network is used to handle sequence dependencies across long spans and is suitable for tasks that involve dependencies across long texts.
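- the patent names the layer types but not their sizes or other hyperparameters; the sketch below shows one plausible Keras rendering of the two models under those assumptions, with an Embedding + LSTM classifier for text, a self-attention + bidirectional LSTM classifier for audio, and two output classes for the positive/negative confidences.

```python
# Illustrative sketch only: plausible shapes for the two classifiers described
# above; layer sizes, vocabulary size, and head counts are assumptions, since
# the patent names only the layer types.
from tensorflow.keras import layers, models

# Text model: Embedding layer + LSTM recurrent layer + softmax over emotion classes.
text_model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=128),    # 128-d word vectors
    layers.LSTM(64),
    layers.Dense(2, activation="softmax"),                 # positive / negative
])

# Audio model: self-attention over frame features + bidirectional LSTM.
frames = layers.Input(shape=(None, 36))                    # sequence of audio frame features
attended = layers.MultiHeadAttention(num_heads=2, key_dim=36)(frames, frames)
encoded = layers.Bidirectional(layers.LSTM(64))(attended)  # forward + backward LSTM
probs = layers.Dense(2, activation="softmax")(encoded)
audio_model = models.Model(frames, probs)
```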
- the embodiment of the present application further includes training the audio recognition model and the text recognition model, and the training process includes:
- the methods for obtaining the voice data of the training set and the verification set include, but are not limited to, recordings from the company's internal call center, customer service recordings provided by a partner company, customer service recordings provided by the customer, and customer service recordings purchased directly from a data platform.
- the recording data is obtained by directly purchasing from the data platform, and the recording data includes a training set and a verification set.
- the labeling process is: manually labeling the pause time points of each recording to obtain multiple short audio fragments (conversation fragments) from each recording, and labeling each short audio fragment with an emotional tendency (i.e., positive emotion or negative emotion).
- the audio annotation tool audio-annotator is used to mark the start and end time points and emotions of the audio clip.
- the process of separating the training set and the verification set is: randomly shuffle all the labeled audio clip samples and then divide them into two data sets at a ratio of 4:1; the larger part is used for model training (the training set) and the smaller part is used for model verification (the verification set).
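- a minimal sketch of this 4:1 shuffle-and-split follows, using scikit-learn purely as an assumed convenience; the patent requires only a random shuffle and an 80/20 division.

```python
# Illustrative sketch only: random 4:1 division of the labeled audio clips into
# a training set and a verification set, using scikit-learn as a convenience.
from sklearn.model_selection import train_test_split

# (clip_path, label) pairs produced by the annotation step; paths are examples.
samples = [
    ("clip_001.wav", "negative"),
    ("clip_002.wav", "positive"),
    ("clip_003.wav", "negative"),
    ("clip_004.wav", "positive"),
    ("clip_005.wav", "negative"),
]

train_set, verification_set = train_test_split(
    samples, test_size=0.2, shuffle=True, random_state=42)
```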
- the speech emotion recognition model and the text emotion recognition model are tested to determine the accuracy of the speech emotion recognition model and the text emotion recognition model.
- the recognizing the second audio feature and obtaining the audio emotion recognition result further includes:
- S104A1 Based on the audio classification network in the pre-trained audio recognition model, identify the second audio feature, and obtain multiple audio emotion classifications and first confidence levels corresponding to the audio emotion classifications;
- the extracted second audio feature is input into the audio classification network in the audio recognition model, and the classification network layer analyzes the second audio feature to obtain multiple audio emotion classifications corresponding to the second audio feature and the first confidence corresponding to each audio emotion classification.
- for example, the first confidence of "positive emotion" is 0.3, and the first confidence of "negative emotion" is 0.7.
- S104B1 Select the audio emotion classification with the highest first confidence as the target audio emotion classification, and the corresponding first confidence is the target audio emotion classification parameter.
- the target audio emotion is classified as "negative emotion", and the target audio emotion classification parameter is 0.7.
- S104C1 Perform numerical mapping on the target audio emotion classification vector parameter to obtain an audio emotion recognition result.
- numerical mapping refers to mapping the output emotion category to a specific value, so as to facilitate subsequent observation of emotion fluctuations.
- the emotion category is mapped to a specific number through a functional relationship. For example, after obtaining the first confidence of each preset emotion category for the voice data to be tested, the emotion category with the highest confidence is selected, and the target audio emotion classification parameter X corresponding to that category is put through the audio emotion recognition result formula to calculate the final output audio emotion recognition result Y.
- the final output audio emotion recognition result is 0.85.
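- the patent refers to an "audio emotion recognition result formula" without stating it; one mapping consistent with the worked numbers in the text (a negative confidence of 0.7 giving 0.85, and 0.8 giving 0.9 in the text channel) is Y = 0.5 + X/2 for a negative classification and Y = 0.5 - X/2 for a positive one. The sketch below uses that inferred mapping and should not be read as the patent's actual formula.

```python
# Illustrative sketch only: select the highest-confidence class and map it to a
# value in [0, 1]. The linear mapping below is inferred from the worked numbers
# in the text (0.7 -> 0.85, 0.8 -> 0.9); the patent does not state its formula.
def emotion_value(confidences):
    """confidences: dict mapping an emotion class to its classifier confidence."""
    target = max(confidences, key=confidences.get)
    x = confidences[target]
    if target == "negative":
        return 0.5 + x / 2      # closer to 1 means more negative
    return 0.5 - x / 2          # closer to 0 means more positive

print(emotion_value({"positive": 0.3, "negative": 0.7}))   # 0.85, as in the example
```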
- recognizing the second text feature and obtaining a text emotion recognition result further includes:
- S104A2 Recognize the second text feature based on the text classification network in the pre-trained text recognition model, and obtain the second confidence level corresponding to the multiple text emotion classification vectors.
- the extracted second text feature is input into the text classification network in the text recognition model, and the classification network layer analyzes the second text feature to obtain multiple text emotion classifications corresponding to the second text feature and the second confidence corresponding to each text emotion classification.
- for example, the second confidence of "positive emotion" is 0.2, and the second confidence of "negative emotion" is 0.8.
- S104B2 Select the text emotion classification with the highest second confidence as the target text emotion classification, and the corresponding second confidence is the target text emotion classification parameter.
- the target text emotion is classified as "negative emotion", and the target text emotion classification parameter is 0.8.
- S104C2 Perform numerical mapping on the target text emotion classification vector parameter to obtain a text emotion recognition result.
- the final output text emotion recognition result is 0.9.
- S106 Perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain an emotion recognition result, and send the emotion recognition result to an associated terminal.
- step S106 may further include:
- the two emotion values of the same audio clip are processed by a numerical weighting method.
- the emotion value is a floating-point number between 0 and 1: the closer to 1, the more negative the emotion; conversely, the closer to 0, the more positive the emotion.
- the weight of the emotion value obtained by the speech emotion recognition channel is 0.7; the weight of the emotion value obtained by the text emotion recognition channel is 0.3.
- the final output sentiment value is 0.865.
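- with the weights above, the fusion reduces to a weighted average; a minimal sketch follows, reproducing the 0.865 value from the 0.85 audio result and the 0.9 text result.

```python
# Illustrative sketch only: weighted fusion of the two emotion values obtained
# for the same audio clip, using the 0.7 / 0.3 weights given above.
AUDIO_WEIGHT = 0.7   # weight of the speech (acoustic) emotion channel
TEXT_WEIGHT = 0.3    # weight of the text emotion channel

def fuse(audio_value, text_value):
    return AUDIO_WEIGHT * audio_value + TEXT_WEIGHT * text_value

print(fuse(0.85, 0.9))   # ~0.865, matching the example above
```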
- each segment of the speech to be tested is numbered in chronological order and an emotion value heat map is drawn from the per-segment emotion values; the heat map is used to cluster the emotions of each time period.
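- a minimal sketch of the per-segment heat map follows, using seaborn and matplotlib as assumed plotting libraries; the patent does not specify a visualization tool, only that segments are numbered chronologically and a heat map is drawn for each speaker.

```python
# Illustrative sketch only: plot the fused emotion value of each chronologically
# numbered segment as a heat map; seaborn/matplotlib are assumed plotting
# libraries, and the values below are placeholders.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

segment_values = [0.21, 0.35, 0.62, 0.865, 0.74]   # one speaker's segments in time order

sns.heatmap(
    np.array(segment_values).reshape(1, -1),
    vmin=0.0, vmax=1.0, cmap="coolwarm", annot=True,
    xticklabels=[f"seg {i + 1}" for i in range(len(segment_values))],
    yticklabels=["speaker 1"],
)
plt.title("Emotion value per dialogue segment (1 = negative, 0 = positive)")
plt.show()
```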
- S106C Send the first emotion value heat map and the second emotion value heat map to the associated terminal.
- the associated terminal includes a first user terminal and a second user terminal.
- in addition to the customer terminal and the customer service terminal, the associated terminal also includes the customer service quality supervision and management terminal and the customer service agent's superior, so that the service quality of the customer service can be supervised and corrected.
- the embodiment of this application uses dual-channel analysis of voice emotions.
- in addition to analyzing voice emotion through the acoustic prosody of the audio, the speaker's emotion is further judged through the content of the speech, thereby improving the accuracy of emotion analysis.
- combined with dialogue separation technology, the emotion value of each segment of dialogue is analyzed and judged, so as to obtain the speaker's emotion at each time period of the complete conversation and then analyze the speaker's emotional fluctuations; this provides a concrete reference and help for customer service quality inspection, makes the evaluation results more objective, and ultimately helps companies improve the quality of customer service and the customer experience.
- FIG. 8 shows a schematic diagram of the program modules of the voice mood fluctuation analysis device of the present application.
- the speech mood fluctuation analysis device 20 may include or be divided into one or more program modules, and the one or more program modules are stored in a storage medium and executed by one or more processors to complete this application and realize the above-mentioned voice mood fluctuation analysis method.
- the program module referred to in the embodiment of the present application refers to an instruction segment of a series of computer-readable instructions that can complete a specific function, and is more suitable for describing the execution process of the voice mood fluctuation analysis device 20 in the storage medium than the program itself. The following description will specifically introduce the functions of each program module in this embodiment:
- the first voice feature acquiring module 200 is configured to acquire the first audio feature and the first text feature of the voice data to be tested.
- first voice feature acquisition module 200 is also used for:
- the voice data is separated and processed to obtain the voice data to be tested, and the voice data to be tested includes multiple pieces of first user voice data and second user voice data.
- the second voice feature extraction module 202 used to extract the second audio feature in the first audio feature based on the audio feature extraction network in the pre-trained audio recognition model; based on the text in the pre-trained text recognition model The feature extraction network extracts the second text feature in the first text feature.
- the second speech feature extraction module 202 is also used for:
- the word vector mapping is performed on the L word segmentation respectively to obtain a d-dimensional word vector matrix corresponding to the L word segmentation, where d is a natural number greater than 0, and the d-dimensional word vector matrix is the first text feature of the voice data to be tested.
- the voice feature recognition module 204 used to recognize the second audio feature to obtain an audio emotion recognition result; recognize the second text feature to obtain a text emotion recognition result.
- voice feature recognition module 204 is also used for:
- the audio emotion classification with the highest first confidence is selected as the target audio emotion classification, and the corresponding first confidence is the target audio emotion classification parameter;
- voice feature recognition module 204 is also used for:
- the recognition result acquisition module 206 is configured to perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and send the emotion recognition result to the associated terminal.
- the recognition result acquisition module 206 is also used for:
- the first emotion value heat map and the second emotion value heat map are sent to the associated terminal.
- the computer device 2 is a device that can automatically perform numerical calculation and/or information processing according to pre-set or stored instructions.
- the computer device 2 may be a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of multiple servers).
- the computer device 2 at least includes, but is not limited to, a memory 21, a processor 22, and a network interface 23 that can communicate with each other through a system bus, as well as computer-readable instructions stored in the memory 21 and runnable on the processor 22.
- the computer-readable instructions may be the computer-readable instructions corresponding to the voice mood fluctuation analysis device 20.
- the memory 21 includes at least one type of computer-readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory ( RAM), static random access memory (SRAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
- the memory 21 may be an internal storage unit of the computer device 2, for example, a hard disk or a memory of the computer device 2.
- the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a smart media card (SMC), and a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc.
- the memory 21 may also include both the internal storage unit of the computer device 2 and its external storage device.
- the memory 21 is generally used to store the operating system and the various application software installed in the computer device 2, for example, the program code of the voice mood fluctuation analysis device 20 in the second embodiment.
- the memory 21 can also be used to temporarily store various types of data that have been output or will be output.
- the processor 22 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
- the processor 22 is generally used to control the overall operation of the computer device 2.
- the processor 22 is used to run the program code or process data stored in the memory 21, for example, to run the voice mood fluctuation analysis device 20 to implement the voice mood fluctuation analysis method of the first, second, third or fourth embodiment.
- the network interface 23 may include a wireless network interface or a wired network interface, and the network interface 23 is generally used to establish a communication connection between the computer device 2 and other electronic system devices.
- the network interface 23 is used to connect the computer device 2 with an external terminal through a network, and establish a data transmission channel and a communication connection between the computer device 2 and the external terminal.
- the network may be an intranet, the Internet, a global system of mobile communication (GSM), a wideband code division multiple access (WCDMA), a 4G network, 5G network, Bluetooth (Bluetooth), Wi-Fi and other wireless or wired networks.
- FIG. 9 only shows the computer device 2 with components 20-23, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.
- the voice mood fluctuation analysis device 20 stored in the memory 21 can also be divided into one or more program modules, and the one or more program modules are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to complete this application.
- FIG. 8 shows a schematic diagram of the program modules of the second embodiment of the voice emotion fluctuation analysis device 20.
- the voice emotion fluctuation analysis device 20 can be divided into the first voice feature acquisition module 200, the second voice feature extraction module 202, the voice feature recognition module 204, and the recognition result acquisition module 206.
- the program module referred to in the present application refers to a series of computer-readable instruction segments that can complete specific functions, and is more suitable than the program itself for describing the execution process of the voice mood fluctuation analysis device 20 in the computer device 2.
- the specific functions of the program module, the first voice feature acquisition module 200 and the recognition result acquisition module 206 have been described in detail in the second embodiment, and will not be repeated here.
- This embodiment also provides a computer-readable storage medium, such as a flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, server, application marketplace, etc., on which computer-readable instructions are stored; when the computer-readable instructions are executed by the processor, they realize the corresponding functions of the voice mood fluctuation analysis method.
- the computer-readable storage medium of this embodiment is used to store the voice mood fluctuation analysis device 20, and when executed by a processor, the voice mood fluctuation analysis method of the first, second, third or fourth embodiment is implemented.
- one or more readable storage media storing computer-readable instructions are provided.
- the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by one or more processors, the one or more processors implement the voice mood fluctuation analysis method in the foregoing embodiments; to avoid repetition, details are not described here again.
- the readable storage medium in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
- a person of ordinary skill in the art can understand that all or part of the processes in the above-mentioned method embodiments can be implemented by computer-readable instructions instructing the relevant hardware; the computer-readable instructions can be stored in a non-volatile readable storage medium or in a volatile readable storage medium, and when the computer-readable instructions are executed, they may include the processes of the above-mentioned method embodiments.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Signal Processing (AREA)
- Hospice & Palliative Care (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Child & Adolescent Psychology (AREA)
- Artificial Intelligence (AREA)
- Quality & Reliability (AREA)
- Machine Translation (AREA)
Abstract
A voice emotion fluctuation analysis method and apparatus, and a computer device and a storage medium, relating to the technical field of artificial intelligence. The method comprises: obtaining a first audio feature and a first text feature of voice data to be tested (S100); extracting a second audio feature in the first audio feature on the basis of an audio feature extraction network in a pretrained audio recognition model, and extracting a second text feature in the first text feature on the basis of a text feature extraction network in a pretrained text recognition model (S102); recognizing the second audio feature to obtain an audio emotion recognition result, and recognizing the second text feature to obtain a text emotion recognition result (S104); and performing fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal (S106). The present invention can effectively improve the accuracy of emotion fluctuation analysis.
Description
This application is based on the Chinese invention application with application number 201911341679.X, titled "Speech Mood Fluctuation Analysis Method and Apparatus", filed on December 24, 2019, and claims its priority.
The embodiments of the present application relate to the field of artificial intelligence technology, and in particular to a voice mood fluctuation analysis method, device, computer equipment, and storage medium.
With the development of artificial intelligence technology, mood fluctuation analysis is used in more and more business scenarios, for example the mood fluctuations of both parties when a customer service agent talks with a customer. In the prior art, mood fluctuation in audio is generally analyzed from the acoustic signal alone, such as intonation and changes in the frequency and amplitude of the sound waves. The inventor realized that this analysis approach is one-dimensional; moreover, the audio signals of different people differ, so analyzing emotion using only the acoustic signal has low accuracy.
Summary of the invention
In view of this, the embodiments of the present application provide a voice mood fluctuation analysis method, device, computer equipment, and computer-readable storage medium, which address the problem of low accuracy in analyzing mood fluctuations.
A method for analyzing voice mood fluctuations, including:
Acquire the first audio feature and the first text feature of the voice data to be tested;
Extract the second audio feature from the first audio feature based on the audio feature extraction network in the pre-trained audio recognition model; extract the second text feature from the first text feature based on the text feature extraction network in the pre-trained text recognition model;
Recognize the second audio feature to obtain an audio emotion recognition result; recognize the second text feature to obtain a text emotion recognition result;
Perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and send the emotion recognition result to the associated terminal.
A voice mood fluctuation analysis device, including:
The first voice feature acquisition module: used to acquire the first audio feature and the first text feature of the voice data to be tested;
The second voice feature extraction module: used to extract the second audio feature from the first audio feature based on the audio feature extraction network in the pre-trained audio recognition model, and to extract the second text feature from the first text feature based on the text feature extraction network in the pre-trained text recognition model;
The voice feature recognition module: used to recognize the second audio feature to obtain an audio emotion recognition result, and to recognize the second text feature to obtain a text emotion recognition result;
The recognition result acquisition module: used to perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and to send the emotion recognition result to the associated terminal.
A computer device, including a memory, a processor, and computer-readable instructions stored on the memory and capable of running on the processor; when the processor executes the computer-readable instructions, the following steps are implemented:
Acquire the first audio feature and the first text feature of the voice data to be tested;
Extract the second audio feature from the first audio feature based on the audio feature extraction network in the pre-trained audio recognition model; extract the second text feature from the first text feature based on the text feature extraction network in the pre-trained text recognition model;
Recognize the second audio feature to obtain an audio emotion recognition result; recognize the second text feature to obtain a text emotion recognition result;
Perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and send the emotion recognition result to the associated terminal.
One or more readable storage media storing computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
Acquire the first audio feature and the first text feature of the voice data to be tested;
Extract the second audio feature from the first audio feature based on the audio feature extraction network in the pre-trained audio recognition model; extract the second text feature from the first text feature based on the text feature extraction network in the pre-trained text recognition model;
Recognize the second audio feature to obtain an audio emotion recognition result; recognize the second text feature to obtain a text emotion recognition result;
Perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and send the emotion recognition result to the associated terminal.
The voice mood fluctuation analysis method, device, computer equipment, and computer-readable storage medium provided by the embodiments of the present invention analyze voice emotion through two channels: in addition to analyzing emotion from the acoustic prosody of the audio, the speaker's emotion is further judged from the spoken content, thereby improving the accuracy of emotion analysis.
FIG. 1 is a flow chart of the steps of the voice mood fluctuation analysis method according to the first embodiment of this application;
FIG. 2 is a specific flow chart of acquiring the voice data to be tested;
FIG. 3 is a specific flow chart of extracting the first audio feature from the voice data to be tested;
FIG. 4 is a specific flow chart of extracting the first text feature from the voice data to be tested;
FIG. 5 is a specific flow chart of recognizing the second audio feature and obtaining an audio emotion recognition result;
FIG. 6 is a specific flow chart of recognizing the second text feature and obtaining a text emotion recognition result;
FIG. 7 is a specific flow chart of performing fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and sending the emotion recognition result to the associated terminal;
FIG. 8 is a schematic diagram of the program modules of the second embodiment of the voice mood fluctuation analysis device of this application;
FIG. 9 is a schematic diagram of the hardware structure of the third embodiment of the computer equipment of this application.
In order to make the purpose, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application and are not intended to limit it. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
The technical solutions of the various embodiments can be combined with each other, but only on the basis that they can be realized by a person of ordinary skill in the art; when a combination of technical solutions is contradictory or cannot be realized, it should be considered that such a combination does not exist and is not within the scope of protection claimed by this application.
Embodiment One
Please refer to FIG. 1, which shows a flowchart of the steps of a voice mood fluctuation analysis method according to an embodiment of the present application. It can be understood that the flowchart in this method embodiment is not used to limit the order in which the steps are executed. The following is an exemplary description that takes a computer device as the execution subject, with the details as follows:
S100: Acquire the first audio feature and the first text feature of the voice data to be tested.
Referring to FIG. 2, the voice mood fluctuation analysis method of the embodiment of the present application further includes:
S110: Acquire voice data to be tested.
The acquiring of voice data to be tested further includes:
S110A: Obtain offline or online voice data.
Specifically, the voice data includes online voice data and offline voice data; the online voice data refers to voice data obtained in real time during a call, and the offline voice data refers to call voice data stored in the background of the system. The voice data to be tested is a recording file in wav format.
S110B: Separate and process the voice data to obtain the voice data to be tested, where the voice data to be tested includes multiple segments of first user voice data and second user voice data.
Specifically, after the voice data is obtained, the voice data to be tested is divided into multiple segments of first user voice data and second user voice data according to the silent parts of the call. Endpoint detection and voice separation techniques are used to remove the silent parts of the call; based on a set duration threshold for the silence between utterances, the start and end points of each utterance are marked, and the audio is cut at those time points to obtain multiple short audio clips. A voiceprint recognition tool marks the speaker identity and speaking time of each short audio clip and distinguishes them by number. The duration threshold is determined from empirical values; as an example, the duration threshold of this solution is 0.25 to 0.3 seconds.
The number includes, but is not limited to, the customer service agent's employee ID, the agent's landline number, and the customer's mobile phone number.
Specifically, the voiceprint recognition tool is the LIUM_SpkDiarization toolkit, and the first user's voice data and the second user's voice data are distinguished through the LIUM_SpkDiarization toolkit, for example as follows:
[Table 1]
start_time | end_time | speaker
0 | 3 | 1
4 | 8 | 2
8.3 | 12.5 | 1
We consider that the first person to speak is the first user (speaker 1 in the table), and the second is naturally the second user (speaker 2 in the table).
Referring to FIG. 3, the acquiring of the first audio feature of the voice data to be tested further includes:
S100A1: Perform framing and windowing on the voice data to be tested to obtain voice analysis frames.
Specifically, the voice data signal has short-term stationarity, so it can be divided into frames to obtain multiple audio frames; an audio frame is a collection of N sampling points. In this embodiment, N is 256 or 512, covering 20-30 ms. After obtaining the multiple audio frames, each audio frame is multiplied by a Hamming window to increase the continuity between the left and right ends of the frame, yielding the voice analysis frames.
S100B1: Perform a Fourier transform on the voice analysis frames to obtain the corresponding frequency spectra.
Specifically, because the characteristics of the voice data signal are difficult to observe in the time domain, it is necessary to convert the signal into an energy distribution in the frequency domain; the voice analysis frames are subjected to a Fourier transform to obtain the frequency spectrum of each voice analysis frame.
S100C1: Pass the spectrum through a mel filter bank to obtain a mel spectrum.
S100D1: Perform cepstrum analysis on the mel spectrum to obtain the first audio feature of the voice data to be tested.
Specifically, cepstrum analysis is performed on the mel spectrum to obtain 36 1024-dimensional audio vectors, and these audio vectors are the first audio feature of the voice data to be tested.
Referring to FIG. 4, the acquiring of the first text feature of the voice data to be tested further includes:
S100A2: Convert the voice data to be tested into text.
Specifically, a voice dictation interface is used to convert the multiple segments of first user voice data and second user voice data into text. As an embodiment, the dictation interface is the iFLYTEK voice dictation interface.
S100B2: Perform word segmentation processing on the text to obtain L segmented words, where L is a natural number greater than 0.
Specifically, the word segmentation processing is completed by a dictionary word segmentation algorithm; the dictionary word segmentation algorithm includes, but is not limited to, the forward maximum matching method, the reverse maximum matching method, and the bidirectional maximum matching method, and segmentation can also be based on hidden Markov models (HMM), CRF, SVM, or deep learning algorithms.
S100C2: Perform word vector mapping on each of the L segmented words to obtain a d-dimensional word vector matrix corresponding to the L segmented words, where d is a natural number greater than 0; the d-dimensional word vector matrix is the first text feature of the voice data to be tested.
Specifically, the 128-dimensional word vector of each segmented word is obtained through models such as word2vec.
S102: Extract the second audio feature from the first audio feature based on the audio feature extraction network in the pre-trained audio recognition model; extract the second text feature from the first text feature based on the text feature extraction network in the pre-trained text recognition model.
Specifically, the second audio feature and the second text feature are lower-dimensional semantic feature vectors extracted from the first audio feature and the first text feature by the feature extraction networks of the emotion recognition models, and they focus more on words that express emotion; by extracting the second audio feature and the second text feature, the learning ability of the model can be better and the accuracy of the final classification can be higher.
S104: Recognize the second audio feature to obtain an audio emotion recognition result; recognize the second text feature to obtain a text emotion recognition result.
Specifically, the audio emotion recognition result is obtained by inputting the audio feature into an audio recognition model, and the text emotion recognition result is obtained by inputting the text feature into a text recognition model. Specifically, the audio recognition model and the text emotion recognition model each include a feature extraction network and a classification network, where the feature extraction network is used to extract semantic feature vectors with fewer dimensions from the first audio feature and the first text feature (that is, the second audio feature and the second text feature), and the classification network is used to output the confidence of each preset emotion category, where the preset emotion categories can be defined according to business requirements, for example, positive, negative, etc. The text emotion recognition model is a deep neural network model that includes an Embedding layer and a long short-term memory recurrent layer (LSTM, Long Short-Term Memory), and the audio emotion recognition model is a neural network model that includes a self-attention layer and a bidirectional long short-term memory layer (a forward LSTM and a backward LSTM).
The long short-term memory network is used to handle sequence dependencies across long spans and is suitable for tasks that involve dependencies across long texts.
进一步地,本申请的实施例还包括,对所述音频识别模型和所述文字识别模型进行训练,所述训练过程包括:Further, the embodiment of the present application further includes training the audio recognition model and the text recognition model, and the training process includes:
获取与所述目标领域对应的训练集及校验集;Acquiring a training set and a verification set corresponding to the target field;
所述获取与目标领域对应的训练集和校验集包括以下步骤:The obtaining the training set and the verification set corresponding to the target field includes the following steps:
获取训练集和校验集的语音数据;Obtain the voice data of the training set and the verification set;
Specifically, the voice data of the training set and the verification set may be obtained from, among other sources, recordings of the company's internal call centre, customer service recordings provided by partner companies, customer service recordings provided by clients, or customer service recordings purchased directly from a data platform. In this embodiment the recordings are purchased directly from a data platform, and the recordings cover both the training set and the verification set.
对所述录音数据的情感类型进行标注;Mark the emotion type of the recording data;
Specifically, the annotation process is as follows: the pause time points of each recording are marked manually, yielding multiple short audio clips (dialogue segments) per recording, and each short clip is labelled with its emotional tendency (positive or negative). In this embodiment, the audio annotation tool audio-annotator is used to mark the start and end time points of each clip and its emotion label.
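For illustration only, one way to hold the manually annotated clips (start and end time points plus an emotion label) is sketched below; the field names and structure are assumptions, not a format prescribed by audio-annotator or by this application.

```python
# Illustrative data structure for labelled dialogue segments (assumed field names).
from dataclasses import dataclass

@dataclass
class LabelledClip:
    recording_id: str
    start_s: float      # clip start time in seconds
    end_s: float        # clip end time in seconds
    emotion: str        # "positive" or "negative"

clips = [
    LabelledClip("call_0001", 0.0, 6.3, "positive"),
    LabelledClip("call_0001", 6.3, 14.8, "negative"),
]
print(len(clips))
```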
分离训练集和校验集;Separate training set and verification set;
Specifically, the training set and the verification set are separated as follows: all labelled audio clip samples are randomly shuffled and then divided into two data sets at a ratio of 4:1; the larger part is used for model training (the training set) and the smaller part is used for model verification (the verification set).
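A minimal sketch of this 4:1 split, assuming the labelled clips are held as simple (file, label) pairs:

```python
# Randomly shuffle the labelled samples and split them 4:1 into train / verification sets.
import random

def split_samples(samples, seed=42):
    samples = list(samples)
    random.Random(seed).shuffle(samples)     # randomly scramble the labelled clips
    cut = int(len(samples) * 4 / 5)          # 4:1 ratio
    return samples[:cut], samples[cut:]      # (training set, verification set)

train_set, dev_set = split_samples([("clip_%03d.wav" % i, i % 2) for i in range(100)])
print(len(train_set), len(dev_set))          # 80 20
```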
基于训练集的情感类型,对所述语音情绪识别模型和文字情绪识别模型进行调整;Adjust the voice emotion recognition model and text emotion recognition model based on the emotion type of the training set;
The verification set is used to test the speech emotion recognition model and the text emotion recognition model in order to determine their accuracy.
参阅图5,所述识别所述第二音频特征,获取音频情绪识别结果进一步包括:Referring to FIG. 5, the recognizing the second audio feature and obtaining the audio emotion recognition result further includes:
S104A1:基于预先训练好的音频识别模型中的音频分类网络,识别所述第二音频特征,获取多个音频情绪分类及音频情绪分类对应的第一置信度;S104A1: Based on the audio classification network in the pre-trained audio recognition model, identify the second audio feature, and obtain multiple audio emotion classifications and first confidence levels corresponding to the audio emotion classifications;
The extracted second audio feature is input into the audio classification network of the audio recognition model; the classification network analyses the second audio feature and outputs the audio emotion classifications corresponding to the second audio feature together with a first confidence for each classification. For example, the first confidence of "positive emotion" is 0.3 and the first confidence of "negative emotion" is 0.7.
S104B1:选取第一置信度最高的音频情绪分类为目标音频情绪分类,对应的第一置信度为目标音频情绪分类参数。S104B1: Select the audio emotion classification with the highest first confidence as the target audio emotion classification, and the corresponding first confidence is the target audio emotion classification parameter.
对应的,目标音频情绪分类为“消极情绪”,目标音频情绪分类参数为0.7。Correspondingly, the target audio emotion is classified as "negative emotion", and the target audio emotion classification parameter is 0.7.
S104C1:对所述目标音频情绪分类向量参数进行数值映射,得到音频情绪识别结果。S104C1: Perform numerical mapping on the target audio emotion classification vector parameter to obtain an audio emotion recognition result.
Numerical mapping means mapping the output emotion category to a concrete value, which makes it easier to observe emotion fluctuations later. In one implementation, the emotion category is mapped to a number through a fixed functional relationship: after the first confidences of the preset emotion categories of the voice data to be tested are obtained, the target audio emotion classification parameter X of the category with the highest confidence is selected, and the final audio emotion recognition result Y is computed with the following formula.
In this embodiment the numerical mapping is: when the recognized emotion category is "positive", Y = 0.5X; when the recognized emotion category is "negative", Y = 0.5(1 + X). The final audio emotion recognition result is therefore a floating-point number between 0 and 1.
具体的,最终输出的音频情绪识别结果为0.85。Specifically, the final output audio emotion recognition result is 0.85.
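A minimal sketch of the numerical mapping just described; the dictionary-based interface and label strings are assumptions, but the formula and the 0.85 value follow the example above.

```python
# Map the top-confidence class and its confidence X to a single score Y in [0, 1]:
# Y = 0.5 * X for "positive", Y = 0.5 * (1 + X) for "negative".
def emotion_score(confidences: dict) -> float:
    label, x = max(confidences.items(), key=lambda kv: kv[1])   # target class and parameter X
    return 0.5 * x if label == "positive" else 0.5 * (1.0 + x)

print(emotion_score({"positive": 0.3, "negative": 0.7}))  # 0.85, matching the audio result above
```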
参阅图6,识别所述第二文字特征,获取文字情绪识别结果进一步包括:Referring to FIG. 6, recognizing the second text feature and obtaining a text emotion recognition result further includes:
S104A2:基于预先训练好的文字识别模型中的文字分类网络,识别所述第二文字特征,获取多个文字情绪分类向量对应的第二置信度。S104A2: Recognize the second text feature based on the text classification network in the pre-trained text recognition model, and obtain the second confidence level corresponding to the multiple text emotion classification vectors.
The extracted second text feature is input into the text classification network of the text recognition model; the classification network analyses the second text feature and outputs the text emotion classifications corresponding to the second text feature together with a second confidence for each classification. For example, the second confidence of "positive emotion" is 0.2 and the second confidence of "negative emotion" is 0.8.
S104B2: Select the text emotion classification with the highest second confidence as the target text emotion classification; the corresponding second confidence is the target text emotion classification parameter.
对应的,目标文字情绪分类为“消极情绪”,目标文字情绪分类参数为0.8。Correspondingly, the target text emotion is classified as "negative emotion", and the target text emotion classification parameter is 0.8.
S104C2:对所述目标文字情绪分类向量参数进行数值映射,得到文字情绪识别结果。S104C2: Perform numerical mapping on the target text emotion classification vector parameter to obtain a text emotion recognition result.
具体的,最终输出的文字情绪识别结果为0.9。Specifically, the final output text emotion recognition result is 0.9.
S106,对所述音频情绪识别结果和文字情绪识别结果进行融合处理,得到情绪识别结果,并将所述情绪识别结果发送至关联终端。S106: Perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain an emotion recognition result, and send the emotion recognition result to an associated terminal.
参阅图7,所述步骤S106可进一步包括:Referring to FIG. 7, the step S106 may further include:
S106A: Weight the audio emotion recognition result and the text emotion recognition result of each segment of the first user's voice data to obtain a first emotion value, and weight the audio emotion recognition result and the text emotion recognition result of each segment of the second user's voice data to obtain a second emotion value.
具体的,用数值加权的方法处理同一音频片段的两种情绪值,所述情绪值为数值为0到1之间的浮点数,越接近于1则情绪越偏向于消极,反之,越接近于0则情绪越偏向于积极。Specifically, the two emotion values of the same audio clip are processed by a numerical weighting method. The emotion value is a floating-point number between 0 and 1. The closer to 1, the more negative the emotion is. On the contrary, the closer to 0 means more positive emotions.
作为一个实施例,语音情绪识别通道得到的情绪值的权重为0.7;文字情绪识别通道得到的情绪值的权重为0.3。As an embodiment, the weight of the emotion value obtained by the speech emotion recognition channel is 0.7; the weight of the emotion value obtained by the text emotion recognition channel is 0.3.
以上述实施方式为例进行进一步说明,最终的输出的情绪值为0.865。Taking the foregoing embodiment as an example for further description, the final output sentiment value is 0.865.
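A minimal sketch of this weighted fusion, assuming a plain weighted sum with audio weight 0.7 and text weight 0.3 as in the embodiment above:

```python
# Fuse the two channel scores for one segment (audio weight 0.7, text weight 0.3).
def fuse(audio_score: float, text_score: float, w_audio: float = 0.7) -> float:
    return w_audio * audio_score + (1.0 - w_audio) * text_score

print(fuse(0.85, 0.9))   # 0.865, the fused emotion value from the example above
```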
S106B,根据所述第一情绪值生成第一情绪值热图及根据所述第二情绪值生成第二情绪值热图;S106B, generating a first emotion value heat map according to the first emotion value and generating a second emotion value heat map according to the second emotion value;
Specifically, each segment of the voice data to be tested is numbered in time order and an emotion value heat map is drawn; the heat map is used to cluster the emotions of each time period.
Specifically, the heatmap function of Python's seaborn library is used to draw the heat map of the emotion values, with different colours representing different emotions; larger emotion values are shown in darker colours.
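A hedged sketch of this plotting step; the colour map, the example scores and the single-row layout are assumptions, with only the use of seaborn's heatmap taken from the text.

```python
# Per-segment emotion values for one speaker laid out in time order and rendered as a heat map.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

segment_scores = [0.12, 0.35, 0.62, 0.865, 0.7, 0.2]      # emotion value per dialogue segment
data = np.array(segment_scores).reshape(1, -1)             # one row = one speaker over time

ax = sns.heatmap(data, vmin=0.0, vmax=1.0, cmap="Reds", annot=True,
                 xticklabels=[f"seg{i+1}" for i in range(len(segment_scores))],
                 yticklabels=["speaker"])
ax.set_title("Emotion value per time segment (closer to 1 = more negative)")
plt.savefig("emotion_heatmap.png")
```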
S106C,将所述第一情绪值热图和第二情绪值热图发送至关联终端。S106C: Send the first emotion value heat map and the second emotion value heat map to the associated terminal.
Specifically, the associated terminals include a first user terminal and a second user terminal. As an embodiment, when the first user and the second user are a customer and a customer service agent respectively, the associated terminals also include, besides the customer terminal and the agent terminal, the customer service quality supervision terminal and the agent's supervisor, so that the service quality of the agent can be supervised and corrected.
The embodiment of this application analyses voice emotion through two channels: in addition to analysing emotion from the acoustic prosody of the audio, the speaker's emotion is further judged from the spoken content, which improves the accuracy of the emotion analysis. Combined with dialogue separation technology, the emotion value of every dialogue segment is analysed and judged, so the speaker's emotion in each time period of the complete call is obtained and the speaker's emotion fluctuations can then be analysed. This provides a concrete reference and assistance for customer service quality inspection, makes the evaluation results more objective, and ultimately helps enterprises improve customer service quality and customer experience.
Embodiment Two
Please continue to refer to FIG. 8, which is a schematic diagram of the program modules of the voice emotion fluctuation analysis device of the present application. In this embodiment, the voice emotion fluctuation analysis device 20 may include, or be divided into, one or more program modules that are stored in a storage medium and executed by one or more processors to complete the present application and implement the voice emotion fluctuation analysis method described above. A program module referred to in the embodiments of the present application is a series of computer-readable instruction segments capable of completing specific functions, and is better suited than the program itself for describing the execution of the voice emotion fluctuation analysis device 20 in the storage medium. The following description introduces the functions of each program module of this embodiment:
第一语音特征获取模块200,用于获取待测语音数据的第一音频特征和第一文字特征。The first voice feature acquiring module 200 is configured to acquire the first audio feature and the first text feature of the voice data to be tested.
进一步地,第一语音特征获取模块200还用于:Further, the first voice feature acquisition module 200 is also used for:
获取离线或者在线的待测语音数据;Obtain offline or online voice data to be tested;
对所述语音数据进行分离处理得到待测语音数据,所述待测语音数据包括多段第一用户语音数据和第二用户语音数据。The voice data is separated and processed to obtain the voice data to be tested, and the voice data to be tested includes multiple pieces of first user voice data and second user voice data.
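One simple way to obtain per-speaker voice data is sketched below under the assumption that the call recording is dual-channel with one speaker per channel; the text does not prescribe a particular separation algorithm, so this is illustrative only.

```python
# Assumed dual-channel recording: customer (first user) on channel 0, agent (second user) on channel 1.
import soundfile as sf

audio, sr = sf.read("call_0001.wav")          # shape (num_samples, 2) for a stereo recording
first_user = audio[:, 0]                       # e.g. customer channel
second_user = audio[:, 1]                      # e.g. agent channel

sf.write("call_0001_user1.wav", first_user, sr)
sf.write("call_0001_user2.wav", second_user, sr)
```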
The second voice feature extraction module 202 is configured to extract the second audio feature from the first audio feature based on the audio feature extraction network of the pre-trained audio recognition model, and to extract the second text feature from the first text feature based on the text feature extraction network of the pre-trained text recognition model.
进一步地,第二语音特征提取模块202还用于:Further, the second speech feature extraction module 202 is also used for:
对所述待测语音数据进行分帧加窗处理,获得语音分析帧;Perform frame and window processing on the voice data to be tested to obtain voice analysis frames;
对所述语音分析帧进行傅里叶变换得到对应的频谱;Performing Fourier transform on the speech analysis frame to obtain a corresponding frequency spectrum;
将所述频谱经过梅尔滤波器组得到梅尔频谱;Passing the spectrum through the mel filter bank to obtain a mel spectrum;
将所述梅尔频谱进行倒谱分析,获得所述待测语音数据的第一音频特征。Perform cepstrum analysis on the Mel spectrum to obtain the first audio feature of the voice data to be measured.
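A hedged sketch of this feature pipeline using librosa; the file name, sample rate, FFT size and number of mel bands and coefficients are assumptions, while the framing/windowing, FFT, mel filter bank and cepstral steps mirror the sequence above.

```python
# First-audio-feature pipeline: framing + windowing + FFT (STFT), mel filter bank,
# then cepstral analysis (log + DCT) yielding MFCC-style coefficients per frame.
import librosa
import numpy as np

y, sr = librosa.load("segment.wav", sr=16000)                                     # assumed path / rate

power = np.abs(librosa.stft(y, n_fft=512, hop_length=256, window="hann")) ** 2   # framed, windowed, FFT
mel = librosa.feature.melspectrogram(S=power, sr=sr, n_mels=40)                   # mel filter bank
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=13)                # cepstral analysis

print(mfcc.shape)   # (13, num_frames): first audio feature per analysis frame
```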
进一步地,第二语音特征提取模块202还用于:Further, the second speech feature extraction module 202 is also used for:
将所述待测语音数据转换为文字;Converting the voice data to be tested into text;
对所述文字进行分词处理,得到L个分词,其中L为大于0的自然数;Perform word segmentation processing on the text to obtain L word segmentation, where L is a natural number greater than 0;
对所述L个分词分别进行词向量映射,以获取L个分词对应的d维词向量矩阵,其中d为大于0的自然数,所述d维词向量矩阵为待测语音数据的第一文字特征。The word vector mapping is performed on the L word segmentation respectively to obtain a d-dimensional word vector matrix corresponding to the L word segmentation, where d is a natural number greater than 0, and the d-dimensional word vector matrix is the first text feature of the voice data to be tested.
语音特征识别模块204:用于识别所述第二音频特征,获取音频情绪识别结果;识别所述第二文字特征,获取文字情绪识别结果。The voice feature recognition module 204: used to recognize the second audio feature to obtain an audio emotion recognition result; recognize the second text feature to obtain a text emotion recognition result.
进一步地,语音特征识别模块204还用于:Further, the voice feature recognition module 204 is also used for:
基于预先训练好的音频识别模型中的音频分类网络,识别所述第二音频特征,获取多个音频情绪分类向量对应的第一置信度;Recognize the second audio feature based on the audio classification network in the pre-trained audio recognition model, and obtain the first confidence corresponding to the multiple audio emotion classification vectors;
选取第一置信度最高的音频情绪分类为目标音频情绪分类,对应的第一置信度为目标音频情绪分类参数;The audio emotion classification with the highest first confidence is selected as the target audio emotion classification, and the corresponding first confidence is the target audio emotion classification parameter;
对所述目标音频情绪分类向量参数进行数值映射,得到音频情绪识别结果。Perform numerical mapping on the target audio emotion classification vector parameters to obtain audio emotion recognition results.
进一步地,语音特征识别模块204还用于:Further, the voice feature recognition module 204 is also used for:
基于预先训练好的文字识别模型中的文字分类网络,识别所述第二文字特征,获取多个文字情绪分类向量对应的第二置信度;Based on the text classification network in the pre-trained text recognition model, recognize the second text feature, and obtain the second confidence level corresponding to the multiple text emotion classification vectors;
选取第二置信度最高的音频情绪分类为目标文字情绪分类,对应的第二置信度为目标文字情绪分类参数;Select the audio emotion classification with the highest second confidence as the target text emotion classification, and the corresponding second confidence is the target text emotion classification parameter;
对所述目标文字情绪分类向量参数进行数值映射,得到文字情绪识别结果。Perform numerical mapping on the target text emotion classification vector parameters to obtain text emotion recognition results.
识别结果获取模块206:用于对所述音频情绪识别结果和文字情绪识别结果进行融合处理,得到情绪识别结果,并将所述情绪识别结果发送至关联终端。The recognition result acquisition module 206 is configured to perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and send the emotion recognition result to the associated terminal.
Further, the recognition result acquisition module 206 is also configured to:
weight the audio emotion recognition result and the text emotion recognition result of each segment of the first user's voice data to obtain a first emotion value, and weight the audio emotion recognition result and the text emotion recognition result of each segment of the second user's voice data to obtain a second emotion value;
根据所述第一情绪值生成第一情绪值热图及根据所述第二情绪值生成第二情绪值热图;Generating a first emotion value heat map according to the first emotion value and generating a second emotion value heat map according to the second emotion value;
将所述第一情绪值热图和第二情绪值热图发送至关联终端。The first emotion value heat map and the second emotion value heat map are sent to the associated terminal.
Embodiment Three
Refer to FIG. 9, which is a schematic diagram of the hardware architecture of the computer device of Embodiment Three of the present application. In this embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. The computer device 2 may be a rack server, a blade server, a tower server or a cabinet server (including an independent server, or a server cluster composed of multiple servers). As shown in FIG. 9, the computer device 2 at least includes, but is not limited to, a memory 21, a processor 22 and a network interface 23 that can communicate with each other through a system bus, as well as computer-readable instructions that are stored in the memory 21 and can run on the processor 22; these computer-readable instructions may be the computer-readable instructions corresponding to the voice emotion fluctuation analysis device 20.
本实施例中,存储器21至少包括一种类型的计算机可读存储介质,所述可读存 储介质包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘等。在一些实施例中,存储器21可以是计算机设备2的内部存储单元,例如该计算机设备2的硬盘或内存。在另一些实施例中,存储器21也可以是计算机设备2的外部存储设备,例如该计算机设备2上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。当然,存储器21还可以既包括计算机设备2的内部存储单元也包括其外部存储设备。本实施例中,存储器21通常用于存储安装于计算机设备2的操作系统装置和各类应用软件,例如实施例二的语音情绪波动分析装置20的程序代码等。此外,存储器21还可以用于暂时地存储已经输出或者将要输出的各类数据。In this embodiment, the memory 21 includes at least one type of computer-readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory ( RAM), static random access memory (SRAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), magnetic memory, magnetic disks, optical disks, etc. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, for example, a hard disk or a memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a smart media card (SMC), and a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc. Of course, the memory 21 may also include both the internal storage unit of the computer device 2 and its external storage device. In this embodiment, the memory 21 is generally used to store operating system devices and various application software installed in the computer equipment 2, for example, the program code of the voice mood fluctuation analysis device 20 in the second embodiment. In addition, the memory 21 can also be used to temporarily store various types of data that have been output or will be output.
处理器22在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器22通常用于控制计算机设备2的总体操作。本实施例中,处理器22用于运行存储器21中存储的程序代码或者处理数据,例如运行语音情绪波动分析装置20,以实现实施例一、二、三或四的语音情绪波动分析方法。In some embodiments, the processor 22 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips. The processor 22 is generally used to control the overall operation of the computer device 2. In this embodiment, the processor 22 is used to run the program code or process data stored in the memory 21, for example, to run the voice mood fluctuation analysis device 20 to implement the voice mood fluctuation analysis method of the first, second, third or fourth embodiment.
所述网络接口23可包括无线网络接口或有线网络接口,该网络接口23通常用于在所述计算机设备2与其他电子系统装置之间建立通信连接。例如,所述网络接口23用于通过网络将所述计算机设备2与外部终端相连,在所述计算机设备2与外部终端之间的建立数据传输通道和通信连接等。所述网络可以是企业内部网(Intranet)、互联网(Internet)、全球移动通讯系统系统(Global System of Mobile communication,GSM)、宽带码分多址(Wideband Code Division Multiple Access,WCDMA)、4G网络、5G网络、蓝牙(Bluetooth)、Wi-Fi等无线或有线网络。The network interface 23 may include a wireless network interface or a wired network interface, and the network interface 23 is generally used to establish a communication connection between the computer device 2 and other electronic system devices. For example, the network interface 23 is used to connect the computer device 2 with an external terminal through a network, and establish a data transmission channel and a communication connection between the computer device 2 and the external terminal. The network may be an intranet, the Internet, a global system of mobile communication (GSM), a wideband code division multiple access (WCDMA), a 4G network, 5G network, Bluetooth (Bluetooth), Wi-Fi and other wireless or wired networks.
需要指出的是,图9仅示出了具有部件20-23的计算机设备2,但是应理解的是,并不要求实施所有示出的部件,可以替代的实施更多或者更少的部件。It should be pointed out that FIG. 9 only shows the computer device 2 with components 20-23, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.
In this embodiment, the voice emotion fluctuation analysis device 20 stored in the memory 21 may also be divided into one or more program modules, and the one or more program modules are stored in the memory 21 and executed by one or more processors (the processor 22 in this embodiment) to complete the present application.
For example, FIG. 8 shows a schematic diagram of the program modules of the voice emotion fluctuation analysis device 20 of Embodiment Two. In that embodiment, the voice emotion fluctuation analysis device 20 may be divided into the first voice feature acquisition module 200, the second voice feature extraction module 202, the voice feature recognition module 204 and the recognition result acquisition module 206. A program module referred to in this application is a series of computer-readable instruction segments capable of completing specific functions, and is better suited than a program for describing the execution of the voice emotion fluctuation analysis device 20 in the computer device 2. The specific functions of the program modules 200 to 206 have been described in detail in Embodiment Two and are not repeated here.
Embodiment Four
This embodiment also provides a computer-readable storage medium, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, server or application store, on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the corresponding functions of the voice emotion fluctuation analysis method are realized. The computer-readable storage medium of this embodiment is used to store the voice emotion fluctuation analysis device 20, and when executed by a processor it implements the voice emotion fluctuation analysis method of Embodiment One, Two, Three or Four.
In an embodiment, one or more readable storage media storing computer-readable instructions are provided; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the voice emotion fluctuation analysis method of the foregoing embodiments, which is not repeated here to avoid redundancy. The readable storage media in this embodiment include non-volatile readable storage media and volatile readable storage media. Those of ordinary skill in the art can understand that all or part of the processes of the above method embodiments can be implemented by computer-readable instructions instructing the relevant hardware; the computer-readable instructions may be stored in a non-volatile or volatile readable storage medium, and when executed may include the processes of the above method embodiments.
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。The serial numbers of the foregoing embodiments of the present application are for description only, and do not represent the superiority or inferiority of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
The above are only preferred embodiments of this application and do not limit the patent scope of this application. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of this application, whether used directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of this application.
Claims (20)
- A voice emotion fluctuation analysis method, comprising:
obtaining a first audio feature and a first text feature of voice data to be tested;
extracting a second audio feature from the first audio feature based on an audio feature extraction network of a pre-trained audio recognition model, and extracting a second text feature from the first text feature based on a text feature extraction network of a pre-trained text recognition model;
recognizing the second audio feature to obtain an audio emotion recognition result, and recognizing the second text feature to obtain a text emotion recognition result;
performing fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal.
- 根据权利要求1所述的语音情绪波动分析方法,其中,所述获取待测语音数据的第一音频特征和第一文字特征包括:The method for analyzing voice mood fluctuations according to claim 1, wherein said obtaining the first audio feature and the first text feature of the voice data to be tested comprises:对所述待测语音数据进行分帧加窗处理,获得语音分析帧;Perform frame and window processing on the voice data to be tested to obtain voice analysis frames;对所述语音分析帧进行傅里叶变换得到对应的频谱;Performing Fourier transform on the speech analysis frame to obtain a corresponding frequency spectrum;将所述频谱经过梅尔滤波器组得到梅尔频谱;Passing the spectrum through the mel filter bank to obtain a mel spectrum;将所述梅尔频谱进行倒谱分析,获得所述待测语音数据的第一音频特征。Perform cepstrum analysis on the Mel spectrum to obtain the first audio feature of the voice data to be measured.
- 根据权利要求2所述的语音情绪波动分析方法,其中,所述识别所述第二音频特征,获取音频情绪识别结果;识别所述第二文字特征,获取文字情绪识别结果包括:The method for analyzing voice emotion fluctuations according to claim 2, wherein said recognizing said second audio feature to obtain an audio emotion recognition result; recognizing said second text feature and obtaining a text emotion recognition result comprises:基于预先训练好的音频识别模型中的音频分类网络,识别所述第二音频特征,获取多个音频情绪分类向量对应的第一置信度;Recognize the second audio feature based on the audio classification network in the pre-trained audio recognition model, and obtain the first confidence corresponding to the multiple audio emotion classification vectors;选取第一置信度最高的音频情绪分类为目标音频情绪分类,对应的第一置信度为目标音频情绪分类参数;The audio emotion classification with the highest first confidence is selected as the target audio emotion classification, and the corresponding first confidence is the target audio emotion classification parameter;对所述目标音频情绪分类向量参数进行数值映射,得到音频情绪识别结果。Perform numerical mapping on the target audio emotion classification vector parameters to obtain audio emotion recognition results.
- 根据权利要求1所述的语音情绪波动分析方法,其中,所述获取待 测语音数据的第一音频特征和第一文字特征还包括:The voice mood fluctuation analysis method according to claim 1, wherein said obtaining the first audio feature and the first text feature of the voice data to be tested further comprises:将所述待测语音数据转换为文字;Converting the voice data to be tested into text;对所述文字进行分词处理,得到L个分词,其中L为大于0的自然数;Perform word segmentation processing on the text to obtain L word segmentation, where L is a natural number greater than 0;对所述L个分词分别进行词向量映射,以获取L个分词对应的d维词向量矩阵,其中d为大于0的自然数,所述d维词向量矩阵为待测语音数据的第一文字特征。The word vector mapping is performed on the L word segmentation respectively to obtain a d-dimensional word vector matrix corresponding to the L word segmentation, where d is a natural number greater than 0, and the d-dimensional word vector matrix is the first text feature of the voice data to be tested.
- 根据权利要求4所述的语音情绪波动分析方法,其中,所述识别所述第二音频特征,获取音频情绪识别结果;识别所述第二文字特征,获取文字情绪识别结果包括:The method for analyzing voice emotion fluctuations according to claim 4, wherein said recognizing said second audio feature to obtain an audio emotion recognition result; recognizing said second text feature and obtaining a text emotion recognition result comprises:基于预先训练好的文字识别模型中的文字分类网络,识别所述第二文字特征,获取多个文字情绪分类向量对应的第二置信度;Based on the text classification network in the pre-trained text recognition model, recognize the second text feature, and obtain the second confidence level corresponding to the multiple text emotion classification vectors;选取第二置信度最高的音频情绪分类为目标文字情绪分类,对应的第二置信度为目标文字情绪分类参数;Select the audio emotion classification with the highest second confidence as the target text emotion classification, and the corresponding second confidence is the target text emotion classification parameter;对所述目标文字情绪分类向量参数进行数值映射,得到文字情绪识别结果。Perform numerical mapping on the target text emotion classification vector parameters to obtain text emotion recognition results.
- 根据权利要求1所述的语音情绪波动分析方法,其中,所述方法还包括:The method for analyzing voice mood fluctuations according to claim 1, wherein the method further comprises:获取离线或者在线的待测语音数据;Obtain offline or online voice data to be tested;对所述语音数据进行分离处理得到待测语音数据,所述待测语音数据包括多段第一用户语音数据和第二用户语音数据。The voice data is separated and processed to obtain the voice data to be tested, and the voice data to be tested includes multiple pieces of first user voice data and second user voice data.
- 根据权利要求6所述的语音情绪波动分析方法,其中,所述对所述音频情绪识别结果和文字情绪识别结果进行融合处理,得到情绪识别结果,并将所述情绪识别结果发送至关联终端包括:The method for analyzing voice emotion fluctuations according to claim 6, wherein the fusion processing of the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and sending the emotion recognition result to the associated terminal comprises :对每段第一用户的语音数据的音频情绪识别结果和文字情绪识别结果进行加权处理,得到第一情绪值,对每段第二用户的语音数据的音频情绪识别结果和文字情绪识别结果进行加权处理,得到第二情绪值;Perform weighting processing on the audio emotion recognition results and text emotion recognition results of each segment of the first user’s voice data to obtain the first emotion value, and weight the audio emotion recognition results and text emotion recognition results of each segment of the second user’s voice data Processing, get the second emotional value;根据所述第一情绪值生成第一情绪值热图及根据所述第二情绪值生成第二情绪值热图;Generating a first emotion value heat map according to the first emotion value and generating a second emotion value heat map according to the second emotion value;将所述第一情绪值热图和第二情绪值热图发送至关联终端。The first emotion value heat map and the second emotion value heat map are sent to the associated terminal.
- A voice emotion fluctuation analysis apparatus, comprising:
a first voice feature acquisition module, configured to obtain a first audio feature and a first text feature of voice data to be tested;
a second voice feature extraction module, configured to extract a second audio feature from the first audio feature based on an audio feature extraction network of a pre-trained audio recognition model, and to extract a second text feature from the first text feature based on a text feature extraction network of a pre-trained text recognition model;
a voice feature recognition module, configured to recognize the second audio feature to obtain an audio emotion recognition result, and to recognize the second text feature to obtain a text emotion recognition result; and
a recognition result acquisition module, configured to perform fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain an emotion recognition result, and to send the emotion recognition result to an associated terminal.
- A computer device, comprising a memory, a processor and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
obtaining a first audio feature and a first text feature of voice data to be tested;
extracting a second audio feature from the first audio feature based on an audio feature extraction network of a pre-trained audio recognition model, and extracting a second text feature from the first text feature based on a text feature extraction network of a pre-trained text recognition model;
recognizing the second audio feature to obtain an audio emotion recognition result, and recognizing the second text feature to obtain a text emotion recognition result;
performing fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal.
- 根据权利要求9所述的计算机设备,其中,所述识别所述第二音频 特征,获取音频情绪识别结果;识别所述第二文字特征,获取文字情绪识别结果包括:The computer device according to claim 9, wherein said recognizing said second audio feature to obtain an audio emotion recognition result; recognizing said second text feature and obtaining a text emotion recognition result comprises:基于预先训练好的音频识别模型中的音频分类网络,识别所述第二音频特征,获取多个音频情绪分类向量对应的第一置信度;Recognize the second audio feature based on the audio classification network in the pre-trained audio recognition model, and obtain the first confidence corresponding to the multiple audio emotion classification vectors;选取第一置信度最高的音频情绪分类为目标音频情绪分类,对应的第一置信度为目标音频情绪分类参数;The audio emotion classification with the highest first confidence is selected as the target audio emotion classification, and the corresponding first confidence is the target audio emotion classification parameter;对所述目标音频情绪分类向量参数进行数值映射,得到音频情绪识别结果。Perform numerical mapping on the target audio emotion classification vector parameters to obtain audio emotion recognition results.
- 根据权利要求9所述的计算机设备,其中,所述获取待测语音数据的第一音频特征和第一文字特征还包括:The computer device according to claim 9, wherein said acquiring the first audio feature and the first text feature of the voice data to be tested further comprises:将所述待测语音数据转换为文字;Converting the voice data to be tested into text;对所述文字进行分词处理,得到L个分词,其中L为大于0的自然数;Perform word segmentation processing on the text to obtain L word segmentation, where L is a natural number greater than 0;对所述L个分词分别进行词向量映射,以获取L个分词对应的d维词向量矩阵,其中d为大于0的自然数,所述d维词向量矩阵为待测语音数据的第一文字特征。The word vector mapping is performed on the L word segmentation respectively to obtain a d-dimensional word vector matrix corresponding to the L word segmentation, where d is a natural number greater than 0, and the d-dimensional word vector matrix is the first text feature of the voice data to be tested.
- 根据权利要求11所述的计算机设备,其中,所述识别所述第二音频特征,获取音频情绪识别结果;识别所述第二文字特征,获取文字情绪识别结果包括:The computer device according to claim 11, wherein said recognizing said second audio feature to obtain an audio emotion recognition result; recognizing said second text feature and obtaining a text emotion recognition result comprises:基于预先训练好的文字识别模型中的文字分类网络,识别所述第二文字特征,获取多个文字情绪分类向量对应的第二置信度;Based on the text classification network in the pre-trained text recognition model, recognize the second text feature, and obtain the second confidence level corresponding to the multiple text emotion classification vectors;选取第二置信度最高的音频情绪分类为目标文字情绪分类,对应的第二置信度为目标文字情绪分类参数;Select the audio emotion classification with the highest second confidence as the target text emotion classification, and the corresponding second confidence is the target text emotion classification parameter;对所述目标文字情绪分类向量参数进行数值映射,得到文字情绪识别结果。Perform numerical mapping on the target text emotion classification vector parameters to obtain text emotion recognition results.
- 根据权利要求9所述的计算机设备,其中,所述方法还包括:The computer device according to claim 9, wherein the method further comprises:获取离线或者在线的待测语音数据;Obtain offline or online voice data to be tested;对所述语音数据进行分离处理得到待测语音数据,所述待测语音 数据包括多段第一用户语音数据和第二用户语音数据。The voice data is separated and processed to obtain the voice data to be tested, and the voice data to be tested includes multiple pieces of first user voice data and second user voice data.
- 根据权利要求13所述的计算机设备,其中,所述对所述音频情绪识别结果和文字情绪识别结果进行融合处理,得到情绪识别结果,并将所述情绪识别结果发送至关联终端包括:The computer device according to claim 13, wherein the fusion processing of the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and sending the emotion recognition result to the associated terminal comprises:对每段第一用户的语音数据的音频情绪识别结果和文字情绪识别结果进行加权处理,得到第一情绪值,对每段第二用户的语音数据的音频情绪识别结果和文字情绪识别结果进行加权处理,得到第二情绪值;Perform weighting processing on the audio emotion recognition results and text emotion recognition results of each segment of the first user’s voice data to obtain the first emotion value, and weight the audio emotion recognition results and text emotion recognition results of each segment of the second user’s voice data Processing, get the second emotional value;根据所述第一情绪值生成第一情绪值热图及根据所述第二情绪值生成第二情绪值热图;Generating a first emotion value heat map according to the first emotion value and generating a second emotion value heat map according to the second emotion value;将所述第一情绪值热图和第二情绪值热图发送至关联终端。The first emotion value heat map and the second emotion value heat map are sent to the associated terminal.
- One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining a first audio feature and a first text feature of voice data to be tested;
extracting a second audio feature from the first audio feature based on an audio feature extraction network of a pre-trained audio recognition model, and extracting a second text feature from the first text feature based on a text feature extraction network of a pre-trained text recognition model;
recognizing the second audio feature to obtain an audio emotion recognition result, and recognizing the second text feature to obtain a text emotion recognition result;
performing fusion processing on the audio emotion recognition result and the text emotion recognition result to obtain an emotion recognition result, and sending the emotion recognition result to an associated terminal.
- 根据权利要求15所述的可读存储介质,其中,所述识别所述第二音频特征,获取音频情绪识别结果;识别所述第二文字特征,获取文字情绪识别结果包括:The readable storage medium according to claim 15, wherein said recognizing said second audio feature to obtain an audio emotion recognition result; recognizing said second text feature and obtaining a text emotion recognition result comprises:基于预先训练好的音频识别模型中的音频分类网络,识别所述第 二音频特征,获取多个音频情绪分类向量对应的第一置信度;Recognize the second audio feature based on the audio classification network in the pre-trained audio recognition model, and obtain the first confidence corresponding to the multiple audio emotion classification vectors;选取第一置信度最高的音频情绪分类为目标音频情绪分类,对应的第一置信度为目标音频情绪分类参数;The audio emotion classification with the highest first confidence is selected as the target audio emotion classification, and the corresponding first confidence is the target audio emotion classification parameter;对所述目标音频情绪分类向量参数进行数值映射,得到音频情绪识别结果。Perform numerical mapping on the target audio emotion classification vector parameters to obtain audio emotion recognition results.
- 根据权利要求15所述的可读存储介质,其中,所述获取待测语音数据的第一音频特征和第一文字特征还包括:The readable storage medium according to claim 15, wherein said obtaining the first audio feature and the first text feature of the voice data to be tested further comprises:将所述待测语音数据转换为文字;Converting the voice data to be tested into text;对所述文字进行分词处理,得到L个分词,其中L为大于0的自然数;Perform word segmentation processing on the text to obtain L word segmentation, where L is a natural number greater than 0;对所述L个分词分别进行词向量映射,以获取L个分词对应的d维词向量矩阵,其中d为大于0的自然数,所述d维词向量矩阵为待测语音数据的第一文字特征。The word vector mapping is performed on the L word segmentation respectively to obtain a d-dimensional word vector matrix corresponding to the L word segmentation, where d is a natural number greater than 0, and the d-dimensional word vector matrix is the first text feature of the voice data to be tested.
- 根据权利要求17所述的可读存储介质,其中,所述识别所述第二音频特征,获取音频情绪识别结果;识别所述第二文字特征,获取文字情绪识别结果包括:The readable storage medium according to claim 17, wherein said recognizing said second audio feature to obtain an audio emotion recognition result; recognizing said second text feature and obtaining a text emotion recognition result comprises:基于预先训练好的文字识别模型中的文字分类网络,识别所述第二文字特征,获取多个文字情绪分类向量对应的第二置信度;Based on the text classification network in the pre-trained text recognition model, recognize the second text feature, and obtain the second confidence level corresponding to the multiple text emotion classification vectors;选取第二置信度最高的音频情绪分类为目标文字情绪分类,对应的第二置信度为目标文字情绪分类参数;Select the audio emotion classification with the highest second confidence as the target text emotion classification, and the corresponding second confidence is the target text emotion classification parameter;对所述目标文字情绪分类向量参数进行数值映射,得到文字情绪识别结果。Perform numerical mapping on the target text emotion classification vector parameters to obtain text emotion recognition results.
- 根据权利要求15所述的可读存储介质,其中,所述方法还包括:The readable storage medium according to claim 15, wherein the method further comprises:获取离线或者在线的待测语音数据;Obtain offline or online voice data to be tested;对所述语音数据进行分离处理得到待测语音数据,所述待测语音数据包括多段第一用户语音数据和第二用户语音数据。The voice data is separated and processed to obtain the voice data to be tested, and the voice data to be tested includes multiple pieces of first user voice data and second user voice data.
- 根据权利要求19所述的可读存储介质,其中,所述对所述音频情绪识别结果和文字情绪识别结果进行融合处理,得到情绪识别结 果,并将所述情绪识别结果发送至关联终端包括:The readable storage medium according to claim 19, wherein the fusion processing of the audio emotion recognition result and the text emotion recognition result to obtain the emotion recognition result, and sending the emotion recognition result to the associated terminal comprises:对每段第一用户的语音数据的音频情绪识别结果和文字情绪识别结果进行加权处理,得到第一情绪值,对每段第二用户的语音数据的音频情绪识别结果和文字情绪识别结果进行加权处理,得到第二情绪值;Perform weighting processing on the audio emotion recognition results and text emotion recognition results of each segment of the first user’s voice data to obtain the first emotion value, and weight the audio emotion recognition results and text emotion recognition results of each segment of the second user’s voice data Processing, get the second emotional value;根据所述第一情绪值生成第一情绪值热图及根据所述第二情绪值生成第二情绪值热图;Generating a first emotion value heat map according to the first emotion value and generating a second emotion value heat map according to the second emotion value;将所述第一情绪值热图和第二情绪值热图发送至关联终端。The first emotion value heat map and the second emotion value heat map are sent to the associated terminal.
Applications Claiming Priority (2)
- CN201911341679.XA (priority date 2019-12-24, filed 2019-12-24): Voice emotion fluctuation analysis method and device
- CN201911341679.X (priority date 2019-12-24)

Publications (1)
- WO2021128741A1 (published 2021-07-01)

Family ID: 70317032

Family Applications (1)
- PCT/CN2020/094338 (priority date 2019-12-24, filed 2020-06-04): Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium

Country Status (2)
- CN: CN111081279A
- WO: WO2021128741A1
Cited By (1)
- CN114373455A (published 2022-04-19, 北京声智科技有限公司): Emotion recognition method and device, electronic equipment and storage medium
Families Citing this family (15)
- CN111081279A (2020-04-28, 深圳壹账通智能科技有限公司): Voice emotion fluctuation analysis method and device
- CN111739559B (2023-02-28, 北京捷通华声科技股份有限公司): Speech early warning method, device, equipment and storage medium
- CN111916112A (2020-11-10, 浙江百应科技有限公司): Emotion recognition method based on voice and characters
- CN111938674A (2020-11-17, 南京宇乂科技有限公司): Emotion recognition control system for conversation
- CN112215927B (2023-06-23, 腾讯科技(深圳)有限公司): Face video synthesis method, device, equipment and medium
- CN112100337B (2024-03-05, 平安科技(深圳)有限公司): Emotion recognition method and device in interactive dialogue
- CN112527994A (2021-03-19, 平安银行股份有限公司): Emotion analysis method, emotion analysis device, emotion analysis equipment and readable storage medium
- CN112837702A (2021-05-25, 萨孚凯信息系统(无锡)有限公司): Voice emotion distributed system and voice signal processing method
- CN112911072A (2021-06-04, 携程旅游网络技术(上海)有限公司): Call center volume identification method and device, electronic equipment and storage medium
- CN113053409B (2024-04-12, 科大讯飞股份有限公司): Audio evaluation method and device
- CN113129927B (2023-04-07, 平安科技(深圳)有限公司): Voice emotion recognition method, device, equipment and storage medium
- CN114049902B (2023-04-07, 广东万丈金数信息技术股份有限公司): Aricloud-based recording uploading identification and emotion analysis method and system
- CN117333913A (2024-01-02, 上海哔哩哔哩科技有限公司): Method and device for identifying emotion categories, storage medium and electronic equipment
- CN115430155A (2022-12-06, 北京中科心研科技有限公司): Team cooperation capability assessment method and system based on audio analysis
- CN117688344B (2024-05-07, 北京大学): Multi-mode fine granularity trend analysis method and system based on large model
Citations (6)
- CN108305642A (2018-07-20, 腾讯科技(深圳)有限公司): The determination method and apparatus of emotion information
- CN108305641A (2018-07-20, 腾讯科技(深圳)有限公司): The determination method and apparatus of emotion information
- CN108305643A (2018-07-20, 腾讯科技(深圳)有限公司): The determination method and apparatus of emotion information
- US20190325897A1 (2019-10-24, International Business Machines Corporation): Quantifying customer care utilizing emotional assessments
- CN110390956A (2019-10-29, 龙马智芯(珠海横琴)科技有限公司): Emotion recognition network model, method and electronic equipment
- CN111081279A (2020-04-28, 深圳壹账通智能科技有限公司): Voice emotion fluctuation analysis method and device
Family Cites Families (2)
- CN102779510B (2013-12-18, 东南大学): Speech emotion recognition method based on feature space self-adaptive projection
- CN106228977B (2019-07-19, 合肥工业大学): Multi-mode fusion song emotion recognition method based on deep learning
Also Published As
- CN111081279A (published 2020-04-28)
Legal Events
- 121: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20904401; Country of ref document: EP; Kind code of ref document: A1)
- NENP: Non-entry into the national phase (Ref country code: DE)
- 32PN: Public notification in the EP bulletin as the address of the addressee cannot be established (noting of loss of rights pursuant to Rule 112(1) EPC, EPO Form 1205A dated 31.10.2022)
- 122: PCT application non-entry into the European phase (Ref document number: 20904401; Country of ref document: EP; Kind code of ref document: A1)