CN114299916A - Speech enhancement method, computer device, and storage medium - Google Patents
- Publication number
- CN114299916A (application CN202111677651.0A)
- Authority
- CN
- China
- Prior art keywords
- signal
- voice
- acoustic characteristic
- target
- characteristic signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The application relates to the technical field of speech processing, and discloses a speech enhancement method, a computer device, and a storage medium. A speech signal is acquired in a target scene, a first acoustic feature signal extracted from the speech signal is input into a trained speech enhancement model for enhancement, and after a first target acoustic feature signal is obtained, it is synthesized with the phase of the first acoustic feature signal to obtain a target speech signal. Because the trained speech enhancement model performs enhancement on the first acoustic feature signal, stationary noise and non-stationary noise can be suppressed or eliminated, and impulse noise can be suppressed or eliminated at the same time.
Description
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech enhancement method, a computer device, and a storage medium.
Background
With the rapid development of speech processing technology, speech enhancement has been widely applied in fields such as live audio/video streaming, human-computer interaction, and conference calls. Speech enhancement refers to the process of extracting speech that is as clean as possible from a noisy signal. The usual approach is frequency-domain single-channel noise reduction, which suppresses or eliminates stationary noise and various non-stationary noises by means of a mask; it does not, however, suppress or eliminate impulse noise.
In practical application scenarios, mixing of various sounds is unavoidable. In particular, in teacher-resource-sharing scenarios, classrooms must be equipped with a variety of audio/video devices so that multiple sites can attend the same class simultaneously. During teaching, the sound of the teacher's footsteps, of desks and chairs moving, of clothing rubbing against the microphone, of car horns outside the classroom, and so on are inevitably picked up by the microphone as impulse noise, degrading the audio quality received by the remote classrooms.
Existing speech enhancement methods therefore cannot be applied to scenes with impulse noise, and suffer from poor adaptability.
Disclosure of Invention
The application provides a speech enhancement method, a computer device, and a storage medium that can suppress or eliminate stationary noise, non-stationary noise, and impulse noise at the same time, so that the method can be applied to scenes with impulse noise and its adaptability is improved.
In a first aspect, the present application provides a speech enhancement method, comprising:
acquiring a speech signal in a target scene;
extracting a first acoustic feature signal from the speech signal, and inputting the first acoustic feature signal into a trained speech enhancement model for enhancement processing to obtain a first target acoustic feature signal;
and synthesizing the first target acoustic feature signal with the phase of the first acoustic feature signal to obtain a target speech signal.
In a second aspect, the present application further provides a computer device, comprising:
a memory and a processor;
the memory is used for storing a computer program;
the processor is adapted to execute the computer program and to carry out the steps of the speech enhancement method according to the first aspect when executing the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the steps of the speech enhancement method according to the first aspect above.
The application discloses a voice enhancement method, computer equipment and a storage medium, wherein a voice signal in a target scene is extracted, a first acoustic characteristic signal in the voice signal is input into a trained voice enhancement model for voice enhancement, after a first target acoustic characteristic signal is obtained, the first target acoustic characteristic signal and the phase of the first acoustic characteristic signal are synthesized, and a target voice signal is obtained; the trained voice enhancement model carries out voice enhancement on the first acoustic characteristic information, so that stationary noise and non-stationary noise can be inhibited or eliminated, and impact noise can be inhibited or eliminated at the same time. Therefore, the voice enhancement method provided by the embodiment of the application can be applied to scenes with impact noise, and the quality of voice is improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart diagram of a speech enhancement method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating a training process of a speech enhancement model provided by an embodiment of the present application;
FIG. 3 is a diagram of an application scenario of the speech enhancement method;
FIG. 4 is a schematic structural diagram of a short-time memory module provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a speech enhancement model provided by an embodiment of the present application;
fig. 6 is a schematic block diagram of a structure of a computer device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiments of the present application provide a speech enhancement method, a computer device, and a storage medium. The speech enhancement method can be used in speech enhancement scenes with impulse noise, for example scenes with classroom impulse noise, and can effectively improve speech quality.
For example, the speech enhancement method provided by the embodiments of the application can be applied on a terminal or a server. A speech signal is acquired in a target scene, a first acoustic feature signal in the speech signal is input into a trained speech enhancement model for enhancement to obtain a first target acoustic feature signal, and the first target acoustic feature signal is then synthesized with the phase of the first acoustic feature signal to obtain a target speech signal. The trained speech enhancement model comprises a preset number of FSMN modules stacked to form a network with a stacked structure, and the loss function of this network comprises a frequency-domain loss part, a signal-constraint loss part, and a cross-domain constraint loss part. Because the trained model performs enhancement on the first acoustic feature signal, stationary noise and non-stationary noise can be suppressed or eliminated, and impulse noise can be suppressed or eliminated at the same time. The method can therefore be applied to scenes with impulse noise, improving speech quality.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flow chart of a speech enhancement method according to an embodiment of the present application. The voice enhancement method can be realized by a terminal or a server, wherein the terminal can be a handheld terminal, a personal computer, a notebook computer, wearable intelligent equipment or a robot and the like; the server may be a single server or a cluster of servers, which may be local servers, cloud servers, etc.
As shown in fig. 1, the speech enhancement method provided in this embodiment specifically includes: step S101 to step S103. The details are as follows:
S101, acquiring a speech signal in a target scene.
The target scene is an application scene with impulse noise, such as live audio/video streaming, human-computer interaction, or an online conference call, in which various noises are picked up by the microphone and form impulse noise. Impulse noise is noise whose energy rises sharply, relative to stationary noise, within a short time (usually tens to hundreds of milliseconds).
S102, extracting a first acoustic feature signal from the speech signal, and inputting the first acoustic feature signal into a trained speech enhancement model for speech enhancement to obtain a first target acoustic feature signal.
The trained speech enhancement model comprises a preset number of Feedforward Sequential Memory Network (FSMN) modules, stacked to form a network with a stacked structure; the loss function of this network comprises a frequency-domain loss part, a signal-constraint loss part, and a cross-domain constraint loss part. Because an FSMN module has a short-time memory function, adding FSMN modules to the speech enhancement model makes effective use of that short-time memory, meets the requirement of real-time noise processing, and achieves a good noise-reduction effect. By adopting a loss function that comprises a frequency-domain loss part, a signal-constraint loss part, and a cross-domain constraint loss part, the model can focus effectively on the impulse-noise portion, and its denoised output is more robust while speech quality is preserved.
Specifically, the frequency-domain loss part is adjusted using a data-sample-distribution modification parameter that has a translational linear (affine) mapping relationship with the energy value of the impulse noise. The signal-constraint loss part retains the value of the loss function of the stacked network when the expected output value is greater than the actual output value, and sets the value of the loss function to zero when the expected output value is less than or equal to the actual output value. The cross-domain constraint loss part applies a frequency-domain-to-time-domain constraint when the frequency-domain signal of the stacked network is transformed into a time-domain signal. As a result, the output better matches human auditory perception, the parameter count and complexity of the speech enhancement model can be reduced, and the model gains a certain dereverberation capability, meeting the needs of real reverberant scenes.
It should be understood that, before the first acoustic feature signal is input into the trained speech enhancement model for speech enhancement, the method further includes a step of training the speech enhancement model. This training step may be completed before or after the speech signal in the target scene is acquired; that is, the training step and the acquisition step are independent of each other and may be executed in either order.
Illustratively, as shown in fig. 2, fig. 2 is a schematic diagram of a training process of a speech enhancement model provided by an embodiment of the present application. It should be understood that the training process of the speech enhancement model provided by the embodiment of the present application may be implemented by a terminal or a server. Specifically, when the speech enhancement method described in fig. 1 is implemented by a terminal, if the computing power of the terminal is limited, the training process of the speech enhancement model may be implemented by a server.
Correspondingly, as shown in fig. 3, fig. 3 is a schematic view of an application scenario of the speech enhancement method. In this embodiment, the speech enhancement method is implemented by the terminal 301, an audio/video device in a classroom used to let multiple sites attend class simultaneously. Because the computing capacity of such audio/video devices is limited, training of the speech enhancement model can be completed by the server 302, which is in communication connection with the devices; the server 302 sends the trained speech enhancement model to the terminal 301, and the terminal 301 performs speech enhancement based on it. It should be understood that, when the speech enhancement method described in fig. 1 is implemented by the terminal and the terminal's computing power is sufficient, the training process can also be implemented by the terminal itself. The server may be a local server or a cloud server.
In addition, when the speech enhancement method described in fig. 1 is implemented by a local server, the training process of the speech enhancement model may be implemented by a cloud server in addition to the local server. The embodiments of the present application are not particularly limited.
As can be seen from fig. 2, the training steps of the speech enhancement model provided in the embodiment of the present application include: step S201 to step S204. The details are as follows:
S201, acquiring noisy speech signals and clean speech signals in a plurality of preset scenes.
Specifically, noisy audio signals in a plurality of preset scenes may be acquired as the noisy speech signals. The clean speech signals are audio signals recorded when there is no noise in those preset scenes.
S202, extracting a second acoustic feature signal from the noisy speech signal, and inputting the second acoustic feature signal into a preset network model to predict a second target acoustic feature signal.
Extracting the second acoustic feature signal from the noisy speech signal comprises: performing a short-time Fourier transform (STFT) on the noisy audio signal, or processing it in other ways such as a wavelet transform, to extract the second acoustic feature signal. The second acoustic feature signal includes, but is not limited to, an amplitude feature signal.
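By way of illustration only, the following sketch shows how such an amplitude feature signal and its phase might be extracted; the function name, FFT size, hop length, and window are assumptions of this sketch, not values fixed by the application.

```python
import torch

def extract_features(wave: torch.Tensor, n_fft: int = 512, hop: int = 256):
    """Extract the amplitude feature signal and phase via the STFT.

    n_fft and hop are illustrative assumptions; the application does not
    specify STFT parameters.
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    magnitude = spec.abs()  # amplitude feature fed to the enhancement model
    phase = spec.angle()    # phase, kept aside for re-synthesis (step S103)
    return magnitude, phase
```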
The preset network model may be the network model with the stacked structure. The second acoustic feature signal is input into this stacked network to compute an early-reverberation (also called early-reflection) convolved clean speech signal, which is taken as the second target acoustic feature signal.
Illustratively, inputting the second acoustic feature signal into the stacked network for second-target-acoustic-feature prediction comprises: generating a first sequence of the second acoustic feature signal, inputting the first sequence into the stacked network for analysis, and acquiring a second sequence of predicted acoustic feature signals output by any FSMN module in the network; and, after splicing the first sequence and the second sequence, inputting the result into the next FSMN module adjacent to that FSMN module for training, until the second target acoustic feature signal is obtained.
Specifically, an FSMN module comprises a network layer and a short-time memory module in which the first sequence of acoustic feature signals is stored. The first sequence comprises a current frame and at least one historical frame, or a current frame and at least one future frame; the historical frames temporally precede the current frame, the future frames temporally follow it, and the current frame is the frame analyzed by the network layer.
Exemplarily, as shown in fig. 4, fig. 4 is a schematic structural diagram of a short-time memory module provided in an embodiment of the present application. As can be seen from fig. 4, the output of a layer in the FSMN module is denoted h_t^l, where the superscript l indicates the l-th layer and the subscript t indicates that the input acoustic feature is the t-th frame. The output h_t^l is spliced with a certain number of historical output features (or future features) and passed through a convolution (or a similar operation) to obtain the memory output h̃_t^l; then h_t^l and h̃_t^l are merged and fed into the next layer of the network.
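The following is a minimal sketch of such a layer, assuming the memory is realized as a depthwise 1-D convolution over a fixed number of past and future frames; the tap counts, the activation, and merging by addition are assumptions of the sketch, not details fixed by the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FSMNBlock(nn.Module):
    """One FSMN layer: a projection producing h_t^l plus a short-time memory
    that mixes a few past and future frames into h~_t^l; the two are merged
    (here by addition) and fed to the next layer."""

    def __init__(self, dim: int, past: int = 5, future: int = 1):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.past, self.future = past, future
        # Depthwise temporal convolution = learned per-dimension tap weights.
        self.memory = nn.Conv1d(dim, dim, kernel_size=past + future + 1,
                                groups=dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        h = torch.relu(self.proj(x))                        # h_t^l
        z = F.pad(h.transpose(1, 2), (self.past, self.future))
        m = self.memory(z).transpose(1, 2)                  # h~_t^l
        return h + m                                        # input to layer l+1

# A network with a stacked structure is then simply a stack of such blocks:
# model = nn.Sequential(*[FSMNBlock(257) for _ in range(6)])
```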
It should be understood that, owing to the characteristics of impulse noise, the speech enhancement model must pay particular attention to short-term memory so that noise can be suppressed within a short time. The network structure of the model therefore builds on the FSMN module: it uses the short-time memory of a limited number of historical frames, can introduce part of the information of future frames into that memory to improve performance, and stacks multiple FSMN modules, giving the model better impulse-noise suppression.
In some embodiments, inputting the first sequence into the stacked network for analysis comprises: inputting the first sequence into the network; analyzing the current frame in the first sequence through the network layer of any FSMN module; and, after splicing the analysis result with at least one future frame or at least one historical frame in the short-time memory module, computing the predicted acoustic feature signal of the current frame.
S203, iteratively updating parameters of the network model based on the clean speech signal and the predicted second target acoustic feature.
The loss function Loss of the network model with the stacked structure, used as the target optimization function, may be expressed as:

Loss = α*Freq_mse + β*Freq_sdr + γ*Temp_mae

where α, β, and γ are loss adjustment factors and may be any real numbers; Freq_mse is the frequency-domain loss part, Freq_sdr is the signal-constraint loss part, and Temp_mae is the cross-domain constraint loss part.
Specifically, during the iterative update of the parameters of the stacked network, a training weight is designed and applied to the frequency-domain loss part of the loss function, so that the model pays more attention to this part during training and impulse noise is effectively suppressed. The frequency-domain loss part may be expressed as:

Freq_mse = weight * mse(pred, label)

where pred is the target speech signal predicted by the model and label is the expected output; w_min and w_max are the lower and upper limits of weight; F_N is the noise acoustic feature after normalization (mean subtracted, divided by the variance); and α_w and β_w are, respectively, the translation and the linear mapping of an affine transformation applied to F_N. During the iterative update of Freq_mse, according to the characteristics of the impulse noise, α_w is adjusted first so that the center of the range of weight lies near the middle of [w_min, w_max]; β_w is then adjusted so that most values of weight fall between w_min and w_max. This preserves the distribution characteristics of the impulse noise to the greatest extent: where the impulse-noise energy is large, weight is close to w_max, and where the energy is small, weight is close to w_min, so that the trained speech enhancement model suppresses the impulse noise.
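One possible reading of this weight in code, assuming it is a clipped affine function of the normalized noise features; the function name and all parameter values (alpha_w, beta_w, w_min, w_max) are hypothetical:

```python
import torch

def freq_mse(pred, label, noise_feat,
             alpha_w=0.5, beta_w=0.25, w_min=0.1, w_max=1.0):
    # F_N: noise acoustic features normalized to zero mean, unit variance.
    f_n = (noise_feat - noise_feat.mean()) / (noise_feat.std() + 1e-8)
    # Affine map (translation alpha_w, scale beta_w) clipped to [w_min, w_max],
    # so regions with high impulse-noise energy get weights near w_max.
    weight = torch.clamp(alpha_w + beta_w * f_n, w_min, w_max)
    return (weight * (pred - label) ** 2).mean()
```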
Furthermore, Freq_sdr restrains the speech enhancement model from suppressing the speech signal itself, reducing damage to it. Where impulse noise and speech are mixed, the amplitude of the impulse noise is generally larger, and that high-amplitude noise must be suppressed; the signal-constraint loss keeps the model from suppressing the speech signal along with it, effectively preventing loss of speech. Specifically, Freq_sdr can be expressed as:

Freq_sdr = mse(max(label - pred, 0), 0)

When label, the expected output speech value, is larger than pred, the value predicted by the speech enhancement model, the speech signal has been damaged and the value of Freq_sdr is retained; when label is less than or equal to pred, the speech signal is not damaged and the value of Freq_sdr is 0.
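Under this reading, the signal-constraint loss penalizes only the frames where the expected output exceeds the prediction; a minimal sketch, with the function name assumed:

```python
import torch

def freq_sdr(pred, label):
    # Positive only where label > pred, i.e. where speech was over-suppressed;
    # zero elsewhere, so undamaged speech contributes no loss.
    damage = torch.clamp(label - pred, min=0.0)
    return (damage ** 2).mean()
```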
Furthermore, Temp_mae, the cross-domain constraint loss part of the loss function, constrains the speech enhancement model from the frequency domain to the time domain after the inverse STFT and overlap-add, so that its noise-suppression effect is better. Specifically, Temp_mae can be expressed as:

Temp_mae = mae(istft(noisy * pred), istft(noisy * label))

where mae denotes the mean absolute error, noisy is the noisy signal, label is the expected output speech signal, pred is the speech signal predicted by the speech enhancement model, and istft is the inverse short-time Fourier transform.
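A sketch of the cross-domain constraint and of combining the three loss parts; the ISTFT parameters and the factors alpha, beta, gamma are assumptions, and freq_mse and freq_sdr refer to the sketches above:

```python
import torch

def temp_mae(noisy_spec, pred, label, n_fft=512, hop=256):
    # Apply the predicted / expected masks to the complex noisy spectrum,
    # return to the time domain, and compare waveforms (mean absolute error).
    window = torch.hann_window(n_fft, device=pred.device)
    est = torch.istft(noisy_spec * pred, n_fft, hop_length=hop, window=window)
    ref = torch.istft(noisy_spec * label, n_fft, hop_length=hop, window=window)
    return (est - ref).abs().mean()

def total_loss(pred, label, noisy_spec, noise_feat,
               alpha=1.0, beta=1.0, gamma=0.2):
    return (alpha * freq_mse(pred, label, noise_feat)
            + beta * freq_sdr(pred, label)
            + gamma * temp_mae(noisy_spec, pred, label))
```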
S204, according to preset model training conditions, determining that the training of the network type is finished, and obtaining the voice enhancement model.
The preset model training condition comprises determining, according to the value of the loss function or the number of iterations, that the training of the stacked network is finished, thereby obtaining the speech enhancement model. Specifically, determining that the training of the network model is finished according to a preset model training condition comprises:
if the value of the loss function is smaller than a preset loss-function threshold, or the number of iterations of the loss function is larger than a preset iteration threshold, determining that the training of the network model with the stacked structure is finished, and obtaining the speech enhancement model. Illustratively, as shown in fig. 5, fig. 5 is a schematic structural diagram of a speech enhancement model provided in an embodiment of the present application.
Through the above analysis, the speech enhancement model provided by this embodiment stacks multiple FSMN modules according to the characteristics of impulse noise, exploiting the advantage of the FSMN's short-time memory to obtain a better model for removing impulse noise. In the training loss, the weight adjusts the frequency-domain loss part according to the impulse-noise characteristics; to guarantee speech quality, the signal-constraint loss part is applied; and the cross-domain constraint from frequency domain to time domain further improves the model's performance. Considering the reverberation of real scenes, the expected output of the model is changed to early-reverberation-convolved clean speech, giving the model a certain dereverberation capability and matching the auditory perception of human ears.
S103, synthesizing the first target acoustic feature signal with the phase of the first acoustic feature signal to obtain a target speech signal.
In some embodiments, synthesizing the first target acoustic feature signal with the phase of the first acoustic feature signal to obtain the target speech signal comprises: multiplying the first target acoustic feature signal by the first acoustic feature signal to obtain the magnitude spectrum of the denoised speech signal; and synthesizing the magnitude spectrum with the phase of the first acoustic feature signal to obtain the target speech signal.
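A sketch of this synthesis step, under the assumption that the model output acts as a mask on the noisy magnitude; the STFT parameters match the extraction sketch above and are likewise assumptions:

```python
import torch

def synthesize(mask, magnitude, phase, n_fft=512, hop=256):
    # Magnitude spectrum of the denoised speech = predicted mask * noisy
    # magnitude; torch.polar recombines it with the original phase.
    enhanced = torch.polar(mask * magnitude, phase)
    window = torch.hann_window(n_fft)
    return torch.istft(enhanced, n_fft, hop_length=hop, window=window)
```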
As can be seen from the above analysis, in the speech enhancement method provided by the embodiments of the application, a speech signal is acquired in a target scene, a first acoustic feature signal in the speech signal is input into a trained speech enhancement model for enhancement to obtain a first target acoustic feature signal, and the first target acoustic feature signal is then synthesized with the phase of the first acoustic feature signal to obtain a target speech signal. The trained speech enhancement model comprises a preset number of FSMN modules stacked into a network with a stacked structure, whose loss function comprises a frequency-domain loss part, a signal-constraint loss part, and a cross-domain constraint loss part. Because the model performs enhancement on the first acoustic feature signal, stationary noise and non-stationary noise can be suppressed or eliminated, and impulse noise can be suppressed or eliminated at the same time. The method can therefore be applied to scenes with impulse noise, improving speech quality.
Referring to fig. 6, fig. 6 is a schematic block diagram of a structure of a computer device according to an embodiment of the present application. The computer device includes a processor, a memory, and a network interface connected by a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions that, when executed, cause a processor to perform any of the speech enhancement methods.
The processor provides computing and control capability and supports the operation of the entire computer device.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by a processor, causes the processor to perform any of the speech enhancement methods.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the configuration shown in fig. 6 is a block diagram of only a portion of the configuration associated with the present application and does not constitute a limitation on the terminal to which the present application is applied, and that a particular terminal may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the Processor may be a Central Processing Unit (CPU), and the Processor may be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring a speech signal in a target scene;
extracting a first acoustic feature signal from the speech signal, and inputting the first acoustic feature signal into a trained speech enhancement model for speech enhancement to obtain a first target acoustic feature signal;
and synthesizing the first target acoustic feature signal with the phase of the first acoustic feature signal to obtain a target speech signal.
In some embodiments, before the speech signal is input into a preset speech enhancement model for speech enhancement, the method further includes:
acquiring noisy speech signals and clean speech signals in a plurality of preset scenes;
extracting a second acoustic feature signal from the noisy speech signal, and inputting the second acoustic feature signal into a preset network model to predict a second target acoustic feature signal;
iteratively updating parameters of the network model based on the clean speech signal and the predicted second target acoustic feature signal;
and determining, according to preset model training conditions, that the training of the network model with the stacked structure is finished, obtaining the speech enhancement model.
In some embodiments, inputting the second acoustic feature signal into a preset network model for second-target-acoustic-feature-signal prediction includes:
generating a first sequence of the second acoustic feature signal, inputting the first sequence into the network model for analysis, and acquiring a second sequence of predicted acoustic feature signals output by any FSMN module in the network model;
and, after splicing the first sequence and the second sequence, inputting the result into the next FSMN module adjacent to that FSMN module for training, until the second target acoustic feature signal is obtained.
In some embodiments, the FSMN module comprises a network layer and a short-time memory module in which the first sequence of acoustic feature signals is stored;
wherein the first sequence comprises a current frame and at least one historical frame, or a current frame and at least one future frame; the at least one historical frame temporally precedes the current frame, the at least one future frame temporally follows the current frame, and the current frame is the frame analyzed by the network layer.
In some embodiments, inputting the first sequence into the network model for analysis includes:
inputting the first sequence into the network model;
analyzing the current frame in the first sequence through the network layer of any FSMN module;
and, after splicing the analysis result with at least one future frame or at least one historical frame in the short-time memory module, computing the predicted acoustic feature signal of the current frame.
In some embodiments, the frequency-domain loss part is adjusted using a data-sample-distribution modification parameter that has a translational linear mapping relationship with the energy value of the impulse noise.
In some embodiments, the signal-constraint loss part is configured to retain the value of the loss function of the network model with the stacked structure when the expected output value of the network model is greater than the actual output value, and to set the value of the loss function to zero when the expected output value is less than or equal to the actual output value.
In some embodiments, the cross-domain constraint loss part is used to apply a cross-domain constraint from the frequency domain to the time domain when the frequency-domain signal of the network model is transformed into a time-domain signal.
In some embodiments, synthesizing the first target acoustic feature signal with the phase of the first acoustic feature signal to obtain the target speech signal includes:
multiplying the first target acoustic feature signal by the first acoustic feature signal to obtain the magnitude spectrum of the denoised speech signal;
and synthesizing the magnitude spectrum with the phase of the first acoustic feature signal to obtain the target speech signal.
Embodiments of the present application further provide a computer-readable storage medium storing a computer program; the computer program includes program instructions which, when executed by a processor, cause the processor to implement the speech enhancement method provided in any embodiment of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (11)
1. A method of speech enhancement, the method comprising:
acquiring a speech signal in a target scene;
extracting a first acoustic feature signal from the speech signal, and inputting the first acoustic feature signal into a trained speech enhancement model for speech enhancement to obtain a first target acoustic feature signal;
and synthesizing the first target acoustic feature signal with the phase of the first acoustic feature signal to obtain a target speech signal.
2. The method according to claim 1, further comprising, before the speech signal is input into a preset speech enhancement model for speech enhancement:
acquiring noisy speech signals and clean speech signals in a plurality of preset scenes;
extracting a second acoustic feature signal from the noisy speech signal, and inputting the second acoustic feature signal into a preset network model to predict a second target acoustic feature signal;
iteratively updating parameters of the network model based on the clean speech signal and the predicted second target acoustic feature signal;
and determining, according to preset model training conditions, that the training of the network model is finished, obtaining the speech enhancement model.
3. The method of claim 2, wherein inputting the second acoustic feature signal into a preset network model for second-target-acoustic-feature-signal prediction comprises:
generating a first sequence of the second acoustic feature signal, inputting the first sequence into the network model for analysis, and acquiring a second sequence of predicted acoustic feature signals output by any FSMN module in the network model;
after splicing the first sequence and the second sequence, inputting the result into the next FSMN module adjacent to that FSMN module for training;
and obtaining the second target acoustic feature signal according to the preset model training condition.
4. The method of claim 3, wherein the FSMN module comprises a network layer and a short-time memory module, the short-time memory module having the first sequence of acoustic feature signals stored therein;
wherein the first sequence comprises a current frame and at least one historical frame, or a current frame and at least one future frame; the at least one historical frame temporally precedes the current frame, the at least one future frame temporally follows the current frame, and the current frame is the frame analyzed by the network layer.
5. The method of claim 4, wherein inputting the first sequence into the network model for analysis comprises:
inputting the first sequence into the network model;
analyzing the current frame in the first sequence through the network layer of any FSMN module;
and, after splicing the analysis result with at least one future frame or at least one historical frame in the short-time memory module, computing the predicted acoustic feature signal of the current frame.
6. The method of claim 1, wherein the frequency-domain loss part is adjusted using a data-sample-distribution modification parameter that has a translational linear mapping relationship with the energy value of the impulse noise.
7. The method of claim 1, wherein the signal-constraint loss part is configured to retain the value of the loss function of the network model when the expected output value is greater than the predicted output value, and to set the value of the loss function to zero when the expected output value is less than or equal to the predicted output value.
8. The method of claim 1, wherein the cross-domain constraint loss part is used to apply a cross-domain constraint from the frequency domain to the time domain when the frequency-domain signal of the network model is transformed into a time-domain signal.
9. The method of claim 1, wherein synthesizing the first target acoustic feature signal with the phase of the first acoustic feature signal to obtain a target speech signal comprises:
multiplying the first target acoustic feature signal by the first acoustic feature signal to obtain the magnitude spectrum of the denoised speech signal;
and synthesizing the magnitude spectrum with the phase of the first acoustic feature signal to obtain the target speech signal.
10. A computer device, comprising:
a memory and a processor;
the memory is used for storing a computer program;
the processor for executing the computer program and for realizing the steps of the speech enhancement method according to any of claims 1 to 9 when executing the computer program.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the steps of the speech enhancement method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111677651.0A CN114299916A (en) | 2021-12-31 | 2021-12-31 | Speech enhancement method, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111677651.0A CN114299916A (en) | 2021-12-31 | 2021-12-31 | Speech enhancement method, computer device, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114299916A (en) | 2022-04-08
Family
ID: 80975805
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111677651.0A Pending CN114299916A (en) | 2021-12-31 | 2021-12-31 | Speech enhancement method, computer device, and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114299916A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115002743A (en) * | 2022-06-28 | 2022-09-02 | 广西东信易通科技有限公司 | Median scene call enhancement system based on machine learning algorithm |
- 2021-12-31: application CN202111677651.0A filed in CN; published as CN114299916A, status Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |