CN109243492A - Speech emotion recognition system and recognition method - Google Patents
Speech emotion recognition system and recognition method
- Publication number
- CN109243492A (application CN201811263371.3A)
- Authority
- CN
- China
- Prior art keywords
- module
- feature
- voice
- extraction module
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The present invention discloses a speech emotion recognition system comprising a speech preprocessing module, an emotional feature extraction module, and an emotion analysis module. The input of the speech preprocessing module receives voice data; the output of the speech preprocessing module is connected to the input of the emotional feature extraction module; the output of the emotional feature extraction module is connected to the input of the emotion analysis module; and the output of the emotion analysis module delivers the analysis and recognition result. The speech preprocessing module processes the voice data to obtain a speech signal and passes it to the emotional feature extraction module, which extracts the acoustic parameters closely associated with emotion in the speech signal; these are finally fed to the emotion analysis module, which completes the emotion judgment. The present invention also proposes a speech emotion recognition method. The invention adds a detection means to telephone-fraud systems, enables multi-dimensional analysis of voice data, and improves the detection accuracy of the system by 5%.
Description
Technical field
The present invention relates to a speech emotion recognition system and recognition method, and belongs to the technical field of speech analysis.
Background art
At present, existing telephone-fraud prevention and interception systems rest mainly on three technical routes: early warning based on signaling data, fraudulent-call warning based on comparison against known harmful recordings, and natural-person fraudulent-call warning based on intelligent speech technology. These routes share the following problems: simple signaling analysis and recording comparison yield little usable characteristic information, and it is difficult to balance accuracy against comprehensiveness. In addition, subject analysis of speech requires accumulating audio of a certain duration, which places high demands on the system's speech access and processing capacity, so operating such a system online is costly.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the deficiencies of the existing technologies by providing a speech emotion recognition system and recognition method. Through mood and emotion recognition processing, the system identifies abnormal emotional features of the target speaker, which helps assess abnormal behavior and intent in phone calls and effectively assists the early-warning detection of fraudulent calls. It overcomes the technical defect of conventional solutions, in which basic intent understanding obtains only the literal message and cannot mine the abnormal information brought about by changes in mood and emotion.
To solve the above technical problems, the present invention provides a speech emotion recognition system, characterized by comprising a speech preprocessing module, an emotional feature extraction module, and an emotion analysis module. The input of the speech preprocessing module receives voice data; the output of the speech preprocessing module is connected to the input of the emotional feature extraction module; the output of the emotional feature extraction module is connected to the input of the emotion analysis module; and the output of the emotion analysis module delivers the analysis and recognition result. The speech preprocessing module processes the voice data to obtain a speech signal and passes it to the emotional feature extraction module, which extracts the acoustic parameters closely associated with emotion in the speech signal; these are finally fed to the emotion analysis module, which completes the emotion judgment.
As a preferred embodiment, the emotional feature extraction module comprises a characteristic parameter extraction module and a characteristic parameter selection and processing module; the output of the characteristic parameter extraction module is connected to the input of the characteristic parameter selection and processing module.
As a preferred embodiment, the characteristic parameter extraction module comprises, connected in sequence, a temporal feature extraction module, a fundamental frequency feature extraction module, a voiced/unvoiced sound judgment module, a speech rate extraction module, and a formant extraction module. The temporal feature extraction module extracts the short-time energy feature of the speech signal; the fundamental frequency feature extraction module extracts the fundamental frequency (pitch) feature; the voiced/unvoiced sound judgment module extracts the zero-crossing rate feature; the speech rate extraction module extracts the speech rate feature; and the formant extraction module extracts the formant features.
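The patent does not specify how these parameters are computed, so the following Python sketch is illustrative only: the use of librosa, the frame sizes, the onset-density speech-rate proxy, and the LPC-root formant estimate are all assumptions rather than the patent's method.

```python
# Hypothetical feature extraction sketch (not from the patent); assumes librosa is available.
import numpy as np
import librosa

def extract_frame_features(y, sr, frame_length=2048, hop_length=512):
    """Extract the five parameter types named above from one speech signal."""
    # Short-time energy, approximated here by per-frame RMS.
    energy = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]
    # Zero-crossing rate, the cue used by the voiced/unvoiced judgment.
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length, hop_length=hop_length)[0]
    # Fundamental frequency (pitch) via the YIN estimator.
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr, frame_length=frame_length, hop_length=hop_length)
    # Crude speech-rate proxy: onset density per second (the patent does not define its measure).
    onsets = librosa.onset.onset_detect(y=y, sr=sr, hop_length=hop_length)
    speech_rate = len(onsets) / (len(y) / sr)
    # Formant estimates from the angles of LPC roots, computed once over the whole signal
    # for brevity (a per-frame computation would be more faithful).
    lpc = librosa.lpc(y, order=8)
    roots = [r for r in np.roots(lpc) if np.imag(r) > 0]
    formants = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)[:3]
    n = min(len(energy), len(zcr), len(f0))
    return {"energy": energy[:n], "zcr": zcr[:n], "f0": f0[:n],
            "speech_rate": speech_rate, "formants": formants}
```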
As a preferred embodiment, the characteristic parameter selection and processing module completes data conversion and transfer: it performs selection processing on the individual characteristic parameters extracted by the characteristic parameter extraction module, such as the short-time energy, zero-crossing rate, fundamental frequency, speech rate, and formant features, and brings the final characteristic parameters together. Each sound segment of each speech signal forms one feature vector, and the feature vectors ultimately form a feature vector set, which constitutes the classifier training input file used for the training or recognition of the emotion analysis module.
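A minimal sketch of this selection-and-assembly step follows; the per-segment statistics, the hypothetical `extract_frame_features` helper from the previous sketch, and the `.npz` file format are all assumptions, since the patent does not define the layout of the classifier training input file.

```python
# Hypothetical assembly of the feature vector set; statistics and file format are assumed.
import numpy as np

def segment_feature_vector(feats):
    """Collapse the frame-level features of one sound segment into one fixed-length vector."""
    stats = []
    for key in ("energy", "zcr", "f0"):
        x = np.asarray(feats[key], dtype=float)
        stats.extend([x.mean(), x.std(), x.max(), x.min()])  # simple per-segment statistics
    stats.append(feats["speech_rate"])
    stats.extend((list(feats["formants"]) + [0.0] * 3)[:3])  # pad/truncate to three formant slots
    return np.array(stats)

def build_training_input(segments, labels, path="classifier_train_input.npz"):
    """Stack one vector per segment into a feature vector set and save the training input file."""
    X = np.vstack([segment_feature_vector(f) for f in segments])
    y = np.asarray(labels)
    np.savez(path, X=X, y=y)
    return X, y
```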
As a preferred embodiment, the emotion analysis module comprises a classifier module. On the basis of successfully extracting the characteristic parameters of a voice file, the classifier module predicts, by machine-learning methods, the mood category to which the recording belongs.
As a preferred embodiment, the classifier module builds on a deep neural network combined with a contribution-analysis-based PCA algorithm, yielding a deep-neural-network speech mood recognition model based on PCA contribution analysis. PCA contribution analysis extracts the principal components of the class features that carry the speech mood information and feeds them to the deep neural network as input for network training, which effectively reduces redundant parameters, improves training efficiency, and realizes mood classification.
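The patent names this combination but no architecture, so the sketch below is only one plausible reading: scikit-learn's PCA keeps the high-contribution components and a small MLP stands in for the deep neural network; the 95% component threshold, layer sizes, and iteration budget are assumptions.

```python
# Hypothetical PCA + deep-neural-network pipeline; all hyperparameters are assumptions.
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def make_emotion_model():
    """PCA keeps components up to 95% cumulative contribution; an MLP classifies moods."""
    return make_pipeline(
        StandardScaler(),                              # acoustic features differ widely in scale
        PCA(n_components=0.95),                        # contribution-based dimensionality reduction
        MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500),
    )
```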
The present invention also proposes a speech emotion recognition method, characterized by specifically comprising the following steps: a deep-neural-network speech mood recognition model training step, and a deep-neural-network speech mood recognition model prediction step.
As a preferred embodiment, the deep-neural-network speech mood recognition model training step specifically comprises: a labeled speech mood database is fed into the characteristic parameter extraction module, which processes it to obtain the individual characteristic parameters such as the short-time energy, zero-crossing rate, fundamental frequency, speech rate, and formant features; the characteristic parameter selection and processing module then performs selection processing and brings the final characteristic parameters together, each sound segment of each speech signal forming one feature vector; the feature vectors ultimately form a feature vector set constituting the classifier training input file, which is fed to the classifier module of the emotion analysis module for training, yielding the deep-neural-network speech mood recognition model.
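Under the same assumptions as the earlier sketches, the training step could be wired together as follows; the database layout (one labeled recording per file) and the helpers `extract_frame_features`, `build_training_input`, and `make_emotion_model` are hypothetical, not from the patent.

```python
# Hypothetical end-to-end training step tying the earlier sketches together.
import librosa

def train_emotion_model(labeled_wavs):
    """labeled_wavs: iterable of (wav_path, mood_label) pairs from the labeled mood database."""
    segments, labels = [], []
    for path, label in labeled_wavs:
        y, sr = librosa.load(path, sr=None)             # keep the native sample rate
        segments.append(extract_frame_features(y, sr))  # one segment per file, for simplicity
        labels.append(label)
    X, y = build_training_input(segments, labels)       # forms the classifier training input
    model = make_emotion_model()
    model.fit(X, y)                                     # network training on the feature set
    return model
```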
As a preferred embodiment, the deep-neural-network speech emotion model prediction step specifically comprises: a speech mood database of unknown classification is fed into the characteristic parameter extraction module, which processes it to obtain the individual characteristic parameters such as the short-time energy, zero-crossing rate, fundamental frequency, speech rate, and formant features; the characteristic parameter selection and processing module then performs selection processing and brings the final characteristic parameters together, each sound segment of each speech signal forming one feature vector; the feature vectors ultimately form a feature vector set constituting the classifier input file; the classifier module of the emotion analysis module then uses the deep-neural-network speech mood recognition model obtained in the training step to predict the mood category to which the speech signal belongs, and outputs the emotion recognition dimension result.
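A matching prediction sketch under the same hypothetical helpers; the point the sketch makes explicit is that recordings of unknown classification pass through exactly the same extraction and selection pipeline as the training data.

```python
# Hypothetical prediction step; reuses the training-time feature pipeline unchanged.
import numpy as np
import librosa

def predict_moods(model, wav_paths):
    """Return the predicted mood category for each recording of unknown classification."""
    segments = []
    for path in wav_paths:
        y, sr = librosa.load(path, sr=None)
        segments.append(extract_frame_features(y, sr))
    X = np.vstack([segment_feature_vector(f) for f in segments])
    return model.predict(X)
```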
Advantageous effects of the invention: through mood and emotion recognition processing, the speech emotion recognition system and recognition method proposed by the present invention identify the abnormal emotional features of the target speaker, which helps assess abnormal behavior and intent in phone calls and effectively assists the early-warning detection of fraudulent calls. They overcome the technical defect of conventional solutions, in which basic intent understanding obtains only the literal message and cannot mine the abnormal information brought about by changes in mood and emotion; they add a detection means to telephone-fraud systems, enable multi-dimensional analysis of voice data, and improve the detection accuracy of the system by 5%.
Brief description of the drawings
Fig. 1 is a structural block diagram of the speech emotion recognition system of the invention.
Fig. 2 is a flowchart of the speech emotion recognition method of the invention.
Fig. 3 is a structural block diagram of the emotional characteristic parameter extraction module of the invention.
Fig. 4 is a flowchart of the deep-neural-network speech mood recognition model training step of the invention.
Fig. 5 is a flowchart of the deep-neural-network speech emotion model prediction step of the invention.
Specific embodiment
The invention is further described below in conjunction with the accompanying drawings. The following embodiments are only intended to illustrate the technical solution of the present invention clearly and are not intended to limit its protection scope.
Fig. 1 shows a structural block diagram of the speech emotion recognition system of the invention. The present invention provides a speech emotion recognition system characterized by comprising a speech preprocessing module, an emotional feature extraction module, and an emotion analysis module. The input of the speech preprocessing module receives voice data; the output of the speech preprocessing module is connected to the input of the emotional feature extraction module; the output of the emotional feature extraction module is connected to the input of the emotion analysis module; and the output of the emotion analysis module delivers the analysis and recognition result. The speech preprocessing module processes the voice data to obtain a speech signal and passes it to the emotional feature extraction module, which extracts the acoustic parameters closely associated with emotion in the speech signal; these are finally fed to the emotion analysis module, which completes the emotion judgment.
As a preferred embodiment, the emotional feature extraction module comprises a characteristic parameter extraction module and a characteristic parameter selection and processing module; the output of the characteristic parameter extraction module is connected to the input of the characteristic parameter selection and processing module.
Fig. 3 shows a structural block diagram of the emotional characteristic parameter extraction module of the invention. As a preferred embodiment, the characteristic parameter extraction module comprises, connected in sequence, a temporal feature extraction module, a fundamental frequency feature extraction module, a voiced/unvoiced sound judgment module, a speech rate extraction module, and a formant extraction module. The temporal feature extraction module extracts the short-time energy feature of the speech signal; the fundamental frequency feature extraction module extracts the fundamental frequency feature; the voiced/unvoiced sound judgment module extracts the zero-crossing rate feature; the speech rate extraction module extracts the speech rate feature; and the formant extraction module extracts the formant features.
As a preferred embodiment, the characteristic parameter selection and processing module completes data conversion and transfer: it performs selection processing on the individual characteristic parameters extracted by the characteristic parameter extraction module, such as the short-time energy, zero-crossing rate, fundamental frequency, speech rate, and formant features, and brings the final characteristic parameters together. Each sound segment of each speech signal forms one feature vector, and the feature vectors ultimately form a feature vector set, which constitutes the classifier training input file used for the training or recognition of the emotion analysis module.
As a preferred embodiment, the emotion analysis module comprises a classifier module. On the basis of successfully extracting the characteristic parameters of a voice file, the classifier module predicts, by machine-learning methods, the mood category to which the recording belongs.
As a preferred embodiment, the classifier module builds on a deep neural network combined with a contribution-analysis-based PCA algorithm, yielding a deep-neural-network speech mood recognition model based on PCA contribution analysis. PCA contribution analysis extracts the principal components of the class features that carry the speech mood information and feeds them to the deep neural network as input for network training, which effectively reduces redundant parameters, improves training efficiency, and realizes mood classification.
Fig. 2 shows a flowchart of the speech emotion recognition method of the invention. The present invention also proposes a speech emotion recognition method, characterized by specifically comprising the following steps: a deep-neural-network speech mood recognition model training step, and a deep-neural-network speech mood recognition model prediction step.
Fig. 4 shows a flowchart of the deep-neural-network speech mood recognition model training step of the invention. As a preferred embodiment, the training step specifically comprises: a labeled speech mood database is fed into the characteristic parameter extraction module, which processes it to obtain the individual characteristic parameters such as the short-time energy, zero-crossing rate, fundamental frequency, speech rate, and formant features; the characteristic parameter selection and processing module then performs selection processing and brings the final characteristic parameters together, each sound segment of each speech signal forming one feature vector; the feature vectors ultimately form a feature vector set constituting the classifier training input file, which is fed to the classifier module of the emotion analysis module for training, yielding the deep-neural-network speech mood recognition model.
Fig. 5 shows a flowchart of the deep-neural-network speech emotion model prediction step of the invention. As a preferred embodiment, the prediction step specifically comprises: a speech mood database of unknown classification is fed into the characteristic parameter extraction module, which processes it to obtain the individual characteristic parameters such as the short-time energy, zero-crossing rate, fundamental frequency, speech rate, and formant features; the characteristic parameter selection and processing module then performs selection processing and brings the final characteristic parameters together, each sound segment of each speech signal forming one feature vector; the feature vectors ultimately form a feature vector set constituting the classifier input file; the classifier module of the emotion analysis module then uses the deep-neural-network speech mood recognition model obtained in the training step to predict the mood category to which the speech signal belongs, and outputs the emotion recognition dimension result.
It should be noted that there are many classes of raw speech features available for emotion recognition, and at the outset it may be impossible to say which speech characteristic parameters accurately reflect changes in human emotion, so as many parameters as possible that can characterize emotional change are usually extracted for mood recognition. Although these characteristic parameters all reflect mood changes to varying degrees, some of them are correlated with one another, and the information they carry overlaps to a certain extent; such overlapping parameters are redundant characteristic parameters. In addition, some characteristic parameters may have little or even no direct association with the mood to be classified; these are useless parameters. Both redundant and useless characteristic parameters can increase the complexity of the whole system and may even hurt the recognition efficiency of the classifier.
For these reasons, before mood classification of the signal is carried out, the correlation among the extracted characteristic parameters needs to be eliminated and the useless parameters removed. This requires an appropriate method for selecting, from the many extracted parameters, the effective parameters with a significant contribution, i.e. for performing feature selection.
Research on parameter selection in mood recognition is relatively mature; the parameter selection methods currently used by many scholars include linear discriminant analysis (LDA), principal component analysis (PCA), fuzzy entropy methods, suboptimal search methods, and linear regression models. Among these, PCA is at present the most common feature selection and dimensionality reduction method. With the goal of losing as little important information as possible, it linearly combines the many extracted initial parameters into a small number of parameters; these transformed parameters are mutually uncorrelated and are called the principal components of the original characteristic parameters. They contain most of the information of the original parameters in fewer dimensions, which is their advantage over the initial parameters. On this basis, the PCA method is used to perform contribution analysis on the above class features and realize the dimensionality reduction.
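As a concrete illustration of the contribution analysis just described (the patent gives no formula), the per-component contribution can be read off scikit-learn's explained variance ratios; the 95% threshold below is an assumed choice, not a figure from the source.

```python
# Hypothetical contribution analysis: rank principal components by explained variance ratio.
import numpy as np
from sklearn.decomposition import PCA

def contribution_analysis(X, threshold=0.95):
    """Keep the leading components whose cumulative contribution first reaches the threshold."""
    pca = PCA().fit(X)
    contributions = pca.explained_variance_ratio_            # per-component contribution
    cumulative = np.cumsum(contributions)
    k = int(np.searchsorted(cumulative, threshold)) + 1      # smallest k reaching the threshold
    print(f"keeping {k} of {X.shape[1]} components ({cumulative[k - 1]:.1%} of variance)")
    return PCA(n_components=k).fit_transform(X)
```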
The classifier module is the core module of the mood recognition system: on the basis of successfully extracting the characteristic parameters of a voice file, it predicts, by machine-learning methods, the mood category to which the recording belongs. Before the module can predict successfully, it must first be trained. On the basis of a deep neural network, and in combination with the contribution-analysis-based PCA algorithm, this scheme proposes a deep-neural-network speech emotion model based on PCA contribution analysis: PCA extracts the principal components of the class features that carry the speech mood information as the input of the deep neural network, and the network training effectively reduces redundant parameters, improves training efficiency, and realizes mood classification.
The above is only a preferred embodiment of the present invention. It should be noted that persons of ordinary skill in the art may make several improvements and variations without departing from the technical principles of the invention, and these improvements and variations should also be regarded as falling within the protection scope of the present invention.
Claims (9)
1. A speech emotion recognition system, characterized by comprising a speech preprocessing module, an emotional feature extraction module, and an emotion analysis module, wherein the input of the speech preprocessing module receives voice data; the output of the speech preprocessing module is connected to the input of the emotional feature extraction module; the output of the emotional feature extraction module is connected to the input of the emotion analysis module; and the output of the emotion analysis module delivers the analysis and recognition result; the speech preprocessing module processes the voice data to obtain a speech signal and passes it to the emotional feature extraction module, which extracts the acoustic parameters closely associated with emotion in the speech signal; these are finally fed to the emotion analysis module, which completes the emotion judgment.
2. The speech emotion recognition system according to claim 1, characterized in that the emotional feature extraction module comprises a characteristic parameter extraction module and a characteristic parameter selection and processing module, the output of the characteristic parameter extraction module being connected to the input of the characteristic parameter selection and processing module.
3. The speech emotion recognition system according to claim 2, characterized in that the characteristic parameter extraction module comprises, connected in sequence, a temporal feature extraction module, a fundamental frequency feature extraction module, a voiced/unvoiced sound judgment module, a speech rate extraction module, and a formant extraction module; the temporal feature extraction module extracts the short-time energy feature of the speech signal, the fundamental frequency feature extraction module extracts the fundamental frequency feature, the voiced/unvoiced sound judgment module extracts the zero-crossing rate feature, the speech rate extraction module extracts the speech rate feature, and the formant extraction module extracts the formant features.
4. The speech emotion recognition system according to claim 2, characterized in that the characteristic parameter selection and processing module completes data conversion and transfer: it performs selection processing on the individual characteristic parameters extracted by the characteristic parameter extraction module, such as the short-time energy, zero-crossing rate, fundamental frequency, speech rate, and formant features, and brings the final characteristic parameters together, each sound segment of each speech signal forming one feature vector, the feature vectors ultimately forming a feature vector set that constitutes the classifier training input file used for the training or recognition of the emotion analysis module.
5. The speech emotion recognition system according to claim 1, characterized in that the emotion analysis module comprises a classifier module which, on the basis of successfully extracting the characteristic parameters of a voice file, predicts by machine-learning methods the mood category to which the recording belongs.
6. The speech emotion recognition system according to claim 5, characterized in that the classifier module builds on a deep neural network combined with a contribution-analysis-based PCA algorithm to form a deep-neural-network speech mood recognition model based on PCA contribution analysis; PCA contribution analysis extracts the principal components of the class features that carry the speech mood information as the input of the deep neural network, and the network training effectively reduces redundant parameters, improves training efficiency, and realizes mood classification.
7. A recognition method based on the speech emotion recognition system according to claim 1, characterized by specifically comprising the following steps: a deep-neural-network speech mood recognition model training step; and a deep-neural-network speech mood recognition model prediction step.
8. The speech emotion recognition method according to claim 7, characterized in that the deep-neural-network speech mood recognition model training step specifically comprises: feeding a labeled speech mood database into the characteristic parameter extraction module, which processes it to obtain the individual characteristic parameters such as the short-time energy, zero-crossing rate, fundamental frequency, speech rate, and formant features; the characteristic parameter selection and processing module then performs selection processing and brings the final characteristic parameters together, each sound segment of each speech signal forming one feature vector and the feature vectors ultimately forming a feature vector set that constitutes the classifier training input file; the classifier module of the emotion analysis module is trained on this file, yielding the deep-neural-network speech mood recognition model.
9. The speech emotion recognition method according to claim 8, characterized in that the deep-neural-network speech emotion model prediction step specifically comprises: feeding a speech mood database of unknown classification into the characteristic parameter extraction module, which processes it to obtain the individual characteristic parameters such as the short-time energy, zero-crossing rate, fundamental frequency, speech rate, and formant features; the characteristic parameter selection and processing module then performs selection processing and brings the final characteristic parameters together, each sound segment of each speech signal forming one feature vector and the feature vectors ultimately forming a feature vector set that constitutes the classifier input file; the classifier module of the emotion analysis module uses the deep-neural-network speech mood recognition model obtained in the training step to predict the mood category to which the speech signal belongs, and outputs the emotion recognition dimension result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811263371.3A CN109243492A (en) | 2018-10-28 | 2018-10-28 | Speech emotion recognition system and recognition method
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811263371.3A CN109243492A (en) | 2018-10-28 | 2018-10-28 | Speech emotion recognition system and recognition method
Publications (1)
Publication Number | Publication Date |
---|---|
CN109243492A true CN109243492A (en) | 2019-01-18 |
Family
ID=65078554
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811263371.3A Pending CN109243492A (en) | 2018-10-28 | 2018-10-28 | A kind of speech emotion recognition system and recognition methods |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109243492A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140257820A1 (en) * | 2013-03-10 | 2014-09-11 | Nice-Systems Ltd | Method and apparatus for real time emotion detection in audio interactions |
CN104200814A (en) * | 2014-08-15 | 2014-12-10 | 浙江大学 | Speech emotion recognition method based on semantic cells |
CN107705807A (en) * | 2017-08-24 | 2018-02-16 | 平安科技(深圳)有限公司 | Voice quality detecting method, device, equipment and storage medium based on Emotion identification |
CN108039181A (en) * | 2017-11-02 | 2018-05-15 | 北京捷通华声科技股份有限公司 | The emotion information analysis method and device of a kind of voice signal |
CN108305639A (en) * | 2018-05-11 | 2018-07-20 | 南京邮电大学 | Speech-emotion recognition method, computer readable storage medium, terminal |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020211820A1 (en) * | 2019-04-16 | 2020-10-22 | 华为技术有限公司 | Method and device for speech emotion recognition |
US11900959B2 (en) | 2019-04-16 | 2024-02-13 | Huawei Technologies Co., Ltd. | Speech emotion recognition method and apparatus |
CN110379441A (en) * | 2019-07-01 | 2019-10-25 | 特斯联(北京)科技有限公司 | A kind of voice service method and system based on countering type smart network |
CN110349586A (en) * | 2019-07-23 | 2019-10-18 | 北京邮电大学 | Telecommunication fraud detection method and device |
CN110349586B (en) * | 2019-07-23 | 2022-05-13 | 北京邮电大学 | Telecommunication fraud detection method and device |
CN112735431A (en) * | 2020-12-29 | 2021-04-30 | 三星电子(中国)研发中心 | Model training method and device and artificial intelligence dialogue recognition method and device |
CN112735431B (en) * | 2020-12-29 | 2023-12-22 | 三星电子(中国)研发中心 | Model training method and device and artificial intelligent dialogue recognition method and device |
CN112735479A (en) * | 2021-03-31 | 2021-04-30 | 南方电网数字电网研究院有限公司 | Speech emotion recognition method and device, computer equipment and storage medium |
CN112735479B (en) * | 2021-03-31 | 2021-07-06 | 南方电网数字电网研究院有限公司 | Speech emotion recognition method and device, computer equipment and storage medium |
CN113314151A (en) * | 2021-05-26 | 2021-08-27 | 中国工商银行股份有限公司 | Voice information processing method and device, electronic equipment and storage medium |
CN113314103A (en) * | 2021-05-31 | 2021-08-27 | 中国工商银行股份有限公司 | Illegal information identification method and device based on real-time speech emotion analysis |
CN113314103B (en) * | 2021-05-31 | 2023-03-03 | 中国工商银行股份有限公司 | Illegal information identification method and device based on real-time speech emotion analysis |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190118 |