
CN113160844A - Speech enhancement method and system based on noise background classification - Google Patents


Info

Publication number
CN113160844A
CN113160844A (application CN202110459982.0A)
Authority
CN
China
Prior art keywords
voice
speech
noise
processed
generator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110459982.0A
Other languages
Chinese (zh)
Inventor
李晔
冯涛
张鹏
李姝
汪付强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Computer Science Center National Super Computing Center in Jinan filed Critical Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202110459982.0A priority Critical patent/CN113160844A/en
Publication of CN113160844A publication Critical patent/CN113160844A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS › G10: MUSICAL INSTRUMENTS; ACOUSTICS › G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/0208: Noise filtering (under G10L21/00, speech or voice signal processing techniques to produce another audible or non-audible signal in order to modify its quality or intelligibility; G10L21/02, speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L15/02: Feature extraction for speech recognition; selection of recognition unit (under G10L15/00, speech recognition)
    • G10L15/063: Training (under G10L15/06, creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L15/08: Speech classification or search
    • G10L25/18: Extracted parameters being spectral information of each sub-band (under G10L25/03, speech or voice analysis techniques characterised by the type of extracted parameters)
    • G10L25/24: Extracted parameters being the cepstrum
    • G10L2015/0631: Creating reference templates; clustering
    • G10L2021/02087: Noise filtering, the noise being separate speech, e.g. cocktail party

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a speech enhancement method and system based on noise background classification, comprising the following steps: acquiring a voice signal to be processed; carrying out feature extraction on the voice signal to be processed; inputting the extracted features into a trained classifier to obtain a noise background label of the voice to be processed; selecting a trained generator corresponding to the noise background label; and inputting the voice signal to be processed into the selected trained generator to obtain an enhanced voice signal. The method extracts Mel-frequency cepstral coefficients from the noisy speech and inputs them into a classifier to classify the noise background; the classified speech is then enhanced by the generative adversarial network dedicated to that noise background within the same model.

Description

Speech enhancement method and system based on noise background classification
Technical Field
The present invention relates to the field of speech signal processing technologies, and in particular, to a speech enhancement method and system based on noise background classification.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Speech is the most direct and efficient tool for exchanging information between people, and also a medium for communication between people and machines. However, both kinds of communication are always affected by noise; the type of noise differs across scenes, and different noises affect the useful speech information differently. For example, conversations in a car are corrupted mainly by engine noise, horn sounds and the like; the noise in a coffee shop is mostly the conversation of other guests; the noise in a machine room is mostly the fan noise of running computers. A single method therefore may not enhance speech well across multiple scenes, and how to make one speech enhancement method achieve good results in different scenes has become an urgent technical problem for those skilled in the art.
At present, most speech enhancement methods target one specific background noise and achieve only a mediocre result when they encounter other types of noise background, so a speech enhancement method that handles a variety of noise scenes is urgently needed.
Disclosure of Invention
In order to remedy the defects of the prior art, the invention provides a speech enhancement method and system based on noise background classification; different noise scenes are distinguished, so that speech enhancement for a given scene is performed by a network dedicated to that scene within the same model, achieving a better enhancement effect.
In a first aspect, the present invention provides a speech enhancement method based on noise background classification;
the speech enhancement method based on the noise background classification comprises the following steps:
acquiring a voice signal to be processed;
carrying out feature extraction on a voice signal to be processed;
inputting the extracted features into a trained classifier to obtain a noise background label of the voice to be processed;
selecting a trained generator corresponding to the noise background label;
and inputting the voice signal to be processed into the selected trained generator to obtain an enhanced voice signal.
In a second aspect, the present invention provides a speech enhancement system based on noise background classification;
a speech enhancement system based on noise background classification, comprising:
an acquisition module configured to: acquiring a voice signal to be processed;
a feature extraction module configured to: carrying out feature extraction on a voice signal to be processed;
a classification module configured to: inputting the extracted features into a trained classifier to obtain a noise background label of the voice to be processed;
a selection module configured to: selecting a trained generator corresponding to the noise background label;
an enhancement module configured to: and inputting the voice signal to be processed into the selected trained generator to obtain an enhanced voice signal.
In a third aspect, the present invention further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present invention also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
the invention fully considers the problem that most voice enhancement methods in the field of voice enhancement can not obtain good effect when voice enhancement is carried out under multiple scenes, selects a Mel frequency cepstrum coefficient for extracting noisy voice to input into a classifier to classify noise backgrounds, and uses a confrontation network generated aiming at the noise backgrounds in the same model to realize voice enhancement on the well-classified voice.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention, and do not limit the invention.
FIG. 1 is a flow chart of the method of the first embodiment.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and "comprising", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Interpretation of terms:
Mel-Frequency Cepstral Coefficient (MFCC);
Generative Adversarial Network (GAN).
Example one
The embodiment provides a speech enhancement method based on noise background classification;
as shown in fig. 1, the speech enhancement method based on noise background classification includes:
S101: acquiring a voice signal to be processed;
S102: carrying out feature extraction on the voice signal to be processed;
S103: inputting the extracted features into a trained classifier to obtain a noise background label of the voice to be processed;
S104: selecting a trained generator corresponding to the noise background label;
S105: inputting the voice signal to be processed into the selected trained generator to obtain an enhanced voice signal.
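As a reading aid, the following is a minimal Python sketch of how S101 to S105 could be wired together; `extract_mfcc`, `classifier` and the `generators` dictionary are hypothetical names (the patent prescribes the steps, not an implementation), and concrete versions of these pieces are sketched in the later sections.

```python
# Hypothetical end-to-end pipeline for S101-S105; all names are placeholders.
import numpy as np

def enhance(noisy_wav: np.ndarray, sr: int, classifier, generators: dict) -> np.ndarray:
    features = extract_mfcc(noisy_wav, sr)                                   # S102
    label = int(np.argmax(classifier.predict(features[np.newaxis, ..., np.newaxis])))  # S103
    generator = generators[label]                                            # S104
    enhanced = generator.predict(noisy_wav[np.newaxis, :, np.newaxis])       # S105
    return enhanced[0, :, 0]
```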
Further, S102, carrying out feature extraction on the voice signal to be processed, specifically comprises:
and extracting the Mel frequency cepstrum coefficient characteristics of the voice signal to be processed.
Further, step S103, inputting the extracted features into a trained classifier to obtain a noise background label of the voice to be processed; wherein the training of the classifier comprises the following steps:
constructing a first training set, wherein the first training set consists of voice-signal features with known noise background labels;
inputting the training set into the classifier and training it;
stopping the training when the loss function of the classifier reaches its minimum value or the set number of iterations is reached, so as to obtain the trained classifier.
Illustratively, the building of the data set includes:
for the clean speech data set, THCHS30 was chosen for use, THCHS30 is an open chinese speech database published by the university of qinghua speech and language technology Center (CSLT).
For the noise backgrounds, noise recorded in five scenes is selected: a coffee shop, a moving car, a running subway, a server machine room, and a cafeteria.
The THCHS-30 data set is divided equally into six parts of 5 hours each. Five of these parts of clean speech are mixed programmatically with the noise of the five different scenes at various signal-to-noise ratios to form noisy speech for the training set; the remaining part of clean speech is divided into five portions and mixed with the noise of the five scenes at various signal-to-noise ratios to form noisy speech for the test set.
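The patent only says the mixing is done by a program; the following is one plausible sketch of mixing clean speech with scene noise at a target signal-to-noise ratio.

```python
# Hedged sketch: scale the noise so that the clean/noise power ratio hits snr_db.
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = np.resize(noise, clean.shape)        # loop or trim the noise to the clean length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12        # avoid division by zero
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```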
When the noisy voice files are synthesized, each training-set file appends a code for its current noise background type to the end of the file name.
Extracting information from the noisy speech:
MFCC features are extracted from the noisy speech of each scene, the label code of its noise type is read, each MFCC feature is stored with its corresponding label as a pair in an array A, and the order of array A is shuffled.
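A sketch of building and shuffling array A under the naming scheme above (label code appended to the file name); `load_wav` is a hypothetical loader injected by the caller.

```python
# Build array A of (MFCC feature, label) pairs and disturb its order.
import os
import random

def build_array_a(wav_paths, load_wav, extract_mfcc):
    a = []
    for path in wav_paths:
        wav, sr = load_wav(path)
        label = os.path.splitext(os.path.basename(path))[0][-1]  # type code at the end of the name
        a.append((extract_mfcc(wav, sr), label))
    random.shuffle(a)                                            # shuffle the order of array A
    return a
```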
Further, the classifier is a convolutional neural network.
Or,
further, the specific structure of the classifier comprises:
the device comprises a first convolution layer, a first activation function layer, a first maximum pooling layer, a second convolution layer, a second activation function layer and a second maximum pooling layer which are connected in sequence.
The first and second convolution layers have the same number of convolution kernels; the first convolution layer has 32 convolution kernels, each with a 5 × 5 sampling window.
The following is exemplary: the first layer of the classifier is a convolution layer with 32 convolution kernels, each with a 5 × 5 sampling window; a ReLU activation function follows each convolution layer, and a max-pooling layer is applied. A second convolution layer with the same configuration as the first is added, again followed by a ReLU activation and a max-pooling layer. The output of the second max-pooling layer is flattened to one dimension and input into a fully connected layer, which produces the prediction result of the classifier.
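Rendered in tf.keras, the exemplary classifier could look as follows; the input shape and the five-class softmax output are assumptions drawn from this embodiment (five noise scenes), not values fixed by the patent.

```python
# Two 32-kernel 5x5 convolutions, each with ReLU and max-pooling, then
# flatten and a fully connected prediction layer, as described above.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(input_shape=(13, 100, 1), num_classes=5):
    return models.Sequential([
        layers.Conv2D(32, (5, 5), activation="relu", padding="same",
                      input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (5, 5), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(num_classes, activation="softmax"),
    ])
```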
Further, the known noise background labels, for example, include: a coffee shop, a moving car, a running subway, a server machine room, and/or a cafeteria.
Illustratively, the classifier is trained as follows: the shuffled noisy-speech array A is input into the classifier, which checks its predicted noise background label against the label stored in array A; when comparison over a large batch of files shows a large error, an AdamOptimizer optimizer back-propagates and adjusts the parameters of each layer to reduce the error. After 150 training iterations the model learns to predict the noise background label with an accuracy above 98%.
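The corresponding training call might look like the following; `train_features` and `train_labels` stand for the shuffled contents of array A, and the batch size is an illustrative assumption.

```python
# Adam optimization against the stored labels for 150 iterations, as described.
model = build_classifier()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_features, train_labels, epochs=150, batch_size=64)
```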
Further, S104: selecting a trained generator corresponding to the noise background label; wherein the training of the generator specifically comprises the following steps:
S1041: constructing a second training set; the second training set comprises: noise-free speech signals and noisy speech signals with known noise background labels; the noisy speech signal of a known noise background label is obtained by adding the background noise of the corresponding label to the noise-free speech signal;
S1042: repeating the steps of initializing the discriminator, initializing the generator, and optimizing the weights;
on the first execution, the discriminator initialization step and the generator initialization step both assign weights using normally distributed random numbers;
on subsequent executions, the discriminator initialization step and the generator initialization step use the weights produced by the optimizer in the previous weight-optimization step;
S1043: judging whether the amount of data trained so far is greater than a set value, and repeating the training until the set amount of training is reached; after training, the weights of the last weight-optimization step are stored, and the trained generator is obtained.
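A skeleton of one repetition of steps S1042/S1043 for a single noise background is sketched below; the concrete loss terms are assumptions following the usual GAN recipe, since the patent only names "the loss value of the generator" and "the loss value of the discriminator". `build_generator` and `build_discriminator` are sketched further below.

```python
# Hedged sketch of one GAN training step for one noise background.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.Adam(1e-4)   # the description names an Adam optimizer
d_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(generator, discriminator, noisy, clean):
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        enhanced = generator(noisy, training=True)
        d_real = discriminator(clean, training=True)      # target: 1 (noiseless speech)
        d_fake = discriminator(enhanced, training=True)   # target: 0 (enhanced speech)
        d_loss = (bce(tf.ones_like(d_real), d_real)
                  + bce(tf.zeros_like(d_fake), d_fake))
        # Generator loss: fool the discriminator and stay close to the clean signal.
        g_loss = (bce(tf.ones_like(d_fake), d_fake)
                  + tf.reduce_mean(tf.abs(enhanced - clean)))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    return g_loss, d_loss
```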
Further, the discriminator initialization step specifically comprises:
on the first execution, the weights are assigned using normally distributed random numbers; the preprocessed noiseless speech is input into the discriminator, and the discriminator outputs 1, indicating that the input is noiseless speech;
on subsequent executions, the weights produced by the optimizer in the previous weight-optimization step are used; the noiseless speech and the noisy speech processed by the generator are input into the discriminator, which outputs a discrimination result.
Further, the generator initialization step specifically comprises:
on the first execution, the weights are assigned using normally distributed random numbers; the preprocessed noisy speech is input into the generator, compressed by the encoding structure and then de-compressed by the decoding structure, while speech features of the noisy speech are passed from the encoding structure to the decoding structure through skip connections to guide the decoding structure in generating enhanced speech;
on subsequent executions, the weights produced by the optimizer in the previous weight-optimization step are used; the preprocessed noisy speech is processed in the same way: compressed by the encoding structure, de-compressed by the decoding structure, with skip connections carrying speech features from the encoder to the decoder to guide the generation of enhanced speech.
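The mirror-symmetric encoder/decoder generator with skip connections, and the matching convolutional discriminator, could be sketched as follows; the depths, filter counts, kernel sizes and input length are illustrative assumptions, not values fixed by the patent.

```python
# Encoding structure compresses, decoding structure de-compresses; skip
# connections carry encoder features into the decoder, as described above.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_generator(length=16384):
    inp = tf.keras.Input(shape=(length, 1))
    x, skips = inp, []
    for filters in (16, 32, 64):                              # encoding structure
        skips.append(x)
        x = layers.Conv1D(filters, 31, strides=2, padding="same", activation="relu")(x)
    for filters, skip in zip((32, 16, 8), reversed(skips)):   # decoding structure
        x = layers.Conv1DTranspose(filters, 31, strides=2, padding="same", activation="relu")(x)
        x = layers.Concatenate()([x, skip])                   # skip connection
    out = layers.Conv1D(1, 31, padding="same", activation="tanh")(x)
    return Model(inp, out)

def build_discriminator(length=16384):
    return tf.keras.Sequential([
        layers.Conv1D(16, 31, strides=2, padding="same", activation="relu",
                      input_shape=(length, 1)),
        layers.Conv1D(32, 31, strides=2, padding="same", activation="relu"),
        layers.Conv1D(64, 31, strides=2, padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(1, activation="sigmoid"),                # 1 = noiseless, 0 = enhanced
    ])
```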
Further, optimizing the weight value; the method specifically comprises the following steps:
the AdamaOptizer optimizer in the generation countermeasure network updates the weights of convolution kernels of each coding structure and decoding structure in the generator through gradient descent according to the loss value of the generator and the loss value of the discriminator which are obtained by the enhanced voice and the noiseless voice, so as to generate enhanced voice which is more similar to the noiseless voice; the optimizer also updates the weights in the discriminator to enhance the ability of the discriminator to recognize the enhanced speech.
Further, the second training set is constructed by selecting a suitable speech data set and multiple noise backgrounds, and synthesizing noise-type-labelled training data at different signal-to-noise ratios from the clean speech and the various noises.
Illustratively, the generator is composed of a plurality of convolution layers and a plurality of deconvolution layers; the convolution layers are called the encoding structure and the deconvolution layers the decoding structure. The two are mirror-symmetric, and skip-connection structures are added between the convolution and deconvolution layers.
The discriminator is composed of a plurality of convolution layers whose structure is the same as that of the convolution layers in the generator.
Further, step S105, inputting the voice signal to be processed into the selected trained generator to obtain an enhanced voice signal, specifically comprises:
inputting the voice signal to be processed into the selected trained generator, where it passes through the encoding and decoding processing in sequence to obtain the enhanced voice signal.
The invention classifies the noise background by extracting Mel-Frequency Cepstral Coefficients (MFCC) from the noisy speech and inputting them into a convolutional neural network, and enhances the classified speech using the Generative Adversarial Network (GAN) model dedicated to that noise background within the same overall model.
A plurality of speech enhancement networks are constructed, one per noise background; the input noise type of each speech enhancement network is fixed, and each network accepts only noisy speech of its corresponding type.
Noisy speech from an unknown scene is input into the model; the classifier classifies it, and the corresponding speech enhancement network produces the enhanced speech.
Illustratively, the speech enhancement model uses multiple Generative Adversarial Networks (GANs): five identical GANs are constructed, each with the same structure, and each consisting of one generator and one discriminator.
Illustratively, input-data processing for the generative adversarial networks in the training phase:
The noisy speech, the noiseless speech, and the scene-type label of the noisy speech for each of the five noise backgrounds are stored in a TFRecord file.
In the TFRecord file, the noisy speech is keyed as noise, the noiseless speech as clean, and the scene-type label of the noisy speech as label; according to the label, the noisy and noiseless speech are routed to the corresponding generative adversarial network.
The noisy and noiseless speech are preprocessed before input and divided into batches, one batch being 150 sampling points within one second.
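A sketch of packing one noisy/clean pair with its scene label into the TFRecord file under the noise / clean / label keys named above; `noisy_wav` and `clean_wav` stand for preprocessed sample arrays and are placeholders.

```python
# Write one training example into the TFRecord file.
import tensorflow as tf

def make_example(noisy_bytes: bytes, clean_bytes: bytes, label: int) -> tf.train.Example:
    return tf.train.Example(features=tf.train.Features(feature={
        "noise": tf.train.Feature(bytes_list=tf.train.BytesList(value=[noisy_bytes])),
        "clean": tf.train.Feature(bytes_list=tf.train.BytesList(value=[clean_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))

with tf.io.TFRecordWriter("train.tfrecord") as writer:
    example = make_example(noisy_wav.tobytes(), clean_wav.tobytes(), label=0)
    writer.write(example.SerializeToString())
```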
The five generative adversarial networks perform the speech enhancement operation identically; only the noisy and noiseless speech they receive differ. The following takes as an example the generative adversarial network for the coffee-shop noise background, whose noise code is set to A.
Illustratively, discriminator initialization within the generative adversarial network in the training phase:
the weights of the convolution kernels of the convolution layers in the discriminator are initialized using a random number that generates a normal distribution, the preprocessed noiseless speech is input to the discriminator, and the discriminator outputs 1, indicating that such input is noiseless speech.
Illustratively, generator initialization within the generative adversarial network in the training phase:
the weights of the convolution kernels of the coding structure and the decoding structure in the generator are initialized by using the random numbers which generate the normal distribution. The pre-processed noisy speech is input into a generator, the pre-processed noisy speech is firstly compressed by a coding structure, then is subjected to inverse compression by a decoding structure, and the speech characteristics in the noisy speech are sent into the decoding structure from the coding structure through jumping connection to guide the decoding structure to generate enhanced speech.
Illustratively, the weight-optimization stage of the training phase:
after the two stages of the initialization of the discriminator and the initialization of the generator are completed, the enhanced voice generated by the generator is input into the discriminator, and because the input of the discriminator is the noiseless voice in the initialization stage, the enhanced voice and the noiseless voice have larger difference at the moment, the discriminator can output 0 which represents that the input is the enhanced voice.
The AdamOptimizer optimizer in the generative adversarial network updates the weights of the convolution kernels of each encoding and decoding structure in the generator, according to the generator loss and the discriminator loss computed from the enhanced speech and the noiseless speech, so that enhanced speech closer to noiseless speech is generated; the optimizer also updates the weights in the discriminator to strengthen its ability to recognize enhanced speech.
The files of the test set are input into the classifier, which automatically classifies them and attaches a noise background label. According to the label attached by the classifier, each noisy file is input into the GAN network that handles that noise type; the GAN network denoises the noisy speech at one-second intervals, and after all noisy speech has been processed, the processed segments are concatenated to obtain the enhanced speech.
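The test-time flow above could be sketched as follows; segmenting by `sr` samples realizes the one-second interval, and the zero-padding of the final, shorter segment is an implementation assumption.

```python
# Classify once, denoise in one-second segments, then concatenate.
import numpy as np

def enhance_file(noisy_wav: np.ndarray, sr: int, classifier, generators: dict) -> np.ndarray:
    mfcc = extract_mfcc(noisy_wav, sr)
    label = int(np.argmax(classifier.predict(mfcc[np.newaxis, ..., np.newaxis])))
    generator = generators[label]
    segments = []
    for start in range(0, len(noisy_wav), sr):       # one-second intervals
        seg = noisy_wav[start:start + sr]
        seg = np.pad(seg, (0, sr - len(seg)))        # pad the last, shorter segment
        segments.append(generator.predict(seg[np.newaxis, :, np.newaxis])[0, :, 0])
    return np.concatenate(segments)[: len(noisy_wav)]
```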
The innovation of the invention lies in: extracting the Mel-frequency cepstral coefficients of the noisy speech, inputting them into a classifier to classify the noise background, and enhancing the classified speech with the generative adversarial network dedicated to that noise background within the model.
The invention provides a speech enhancement method based on noise background classification: the Mel-frequency cepstral coefficients of the noisy speech are input into a classifier to classify the noise background, and speech enhancement is realized using the generative adversarial network dedicated to that noise background within the model. Compared with other speech enhancement methods, the method generalizes better and achieves better results in different noise scenes.
Example two
The embodiment provides a speech enhancement system based on noise background classification;
a speech enhancement system based on noise background classification, comprising:
an acquisition module configured to: acquiring a voice signal to be processed;
a feature extraction module configured to: carrying out feature extraction on a voice signal to be processed;
a classification module configured to: inputting the extracted features into a trained classifier to obtain a noise background label of the voice to be processed;
a selection module configured to: selecting a trained generator corresponding to the noise background label;
an enhancement module configured to: and inputting the voice signal to be processed into the selected trained generator to obtain an enhanced voice signal.
It should be noted here that the above acquisition module, feature extraction module, classification module, selection module and enhancement module correspond to steps S101 to S105 of the first embodiment; the examples and application scenarios realized by these modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the first embodiment. It should also be noted that the modules described above, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
Example three
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method of the first embodiment may be implemented directly by a hardware processor, or by a combination of hardware and software modules within the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The speech enhancement method based on the noise background classification is characterized by comprising the following steps:
acquiring a voice signal to be processed;
carrying out feature extraction on a voice signal to be processed;
inputting the extracted features into a trained classifier to obtain a noise background label of the voice to be processed;
selecting a trained generator corresponding to the noise background label;
and inputting the voice signal to be processed into the selected trained generator to obtain an enhanced voice signal.
2. The speech enhancement method based on noise background classification according to claim 1, characterized by performing feature extraction on the speech signal to be processed; the method specifically comprises the following steps:
and extracting the Mel frequency cepstrum coefficient characteristics of the voice signal to be processed.
3. The speech enhancement method based on noise background classification according to claim 1, wherein the extracted features are input into a trained classifier to obtain a noise background label of the speech to be processed; wherein, the classifier after training, the training step includes:
constructing a first training set, wherein the first training set is a voice signal characteristic of a known noise background label;
inputting the training set into a classifier, and training the classifier;
and when the loss function of the classifier obtains the minimum value or the training reaches the iteration times, stopping the training to obtain the trained classifier.
4. The noise background classification-based speech enhancement method of claim 1, wherein a trained generator corresponding to the label is selected based on the noise background label, and wherein the training of the generator specifically comprises:
(1) constructing a second training set; the second training set comprises: noise-free speech signals and noisy speech signals with known noise background labels; the noisy speech signal of a known noise background label is obtained by adding the background noise of the corresponding label to the noise-free speech signal;
(2) repeating the steps of initializing the discriminator, initializing the generator, and optimizing the weights;
on the first execution, the discriminator initialization step and the generator initialization step both assign weights using normally distributed random numbers;
on subsequent executions, the discriminator initialization step and the generator initialization step use the weights produced by the optimizer in the previous weight-optimization step;
(3) judging whether the amount of data trained so far is greater than a set value, and repeating the training until the set amount of training is reached; after training, the weights of the last weight-optimization step are stored, and the trained generator is obtained.
5. The noise background classification-based speech enhancement method according to claim 4, wherein the discriminator initialization step specifically comprises:
on the first execution, the weights are assigned using normally distributed random numbers; the preprocessed noiseless speech is input into the discriminator, and the discriminator outputs 1, indicating that the input is noiseless speech;
on subsequent executions, the weights produced by the optimizer in the previous weight-optimization step are used; the noiseless speech and the noisy speech processed by the generator are input into the discriminator, which outputs a discrimination result.
6. The noise background classification-based speech enhancement method of claim 4, wherein the generator initialization step specifically comprises:
on the first execution, the weights are assigned using normally distributed random numbers; the preprocessed noisy speech is input into the generator, compressed by the encoding structure and then de-compressed by the decoding structure, while speech features of the noisy speech are passed from the encoding structure to the decoding structure through skip connections to guide the decoding structure in generating enhanced speech;
on subsequent executions, the weights produced by the optimizer in the previous weight-optimization step are used; the preprocessed noisy speech is processed in the same way: compressed by the encoding structure, de-compressed by the decoding structure, with skip connections carrying speech features from the encoder to the decoder to guide the generation of enhanced speech.
7. The speech enhancement method according to claim 4, wherein the weight-optimization step specifically comprises:
the optimizer in the generative adversarial network updates the weights of the convolution kernels of each encoding and decoding structure in the generator through gradient descent, according to the generator loss and the discriminator loss computed from the enhanced speech and the noiseless speech, so that the generator produces enhanced speech closer to noiseless speech; the optimizer also updates the weights in the discriminator to strengthen its ability to recognize enhanced speech.
8. A speech enhancement system based on noise background classification, comprising:
an acquisition module configured to: acquiring a voice signal to be processed;
a feature extraction module configured to: carrying out feature extraction on a voice signal to be processed;
a classification module configured to: inputting the extracted features into a trained classifier to obtain a noise background label of the voice to be processed;
a selection module configured to: selecting a trained generator corresponding to the noise background label;
an enhancement module configured to: and inputting the voice signal to be processed into the selected trained generator to obtain an enhanced voice signal.
9. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of the preceding claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 7.
CN202110459982.0A 2021-04-27 2021-04-27 Speech enhancement method and system based on noise background classification Pending CN113160844A (en)

Priority Applications (1)

Application Number: CN202110459982.0A; Priority Date: 2021-04-27; Filing Date: 2021-04-27; Title: Speech enhancement method and system based on noise background classification


Publications (1)

Publication Number: CN113160844A; Publication Date: 2021-07-23

Family ID: 76871861

Family Applications (1)

Application Number: CN202110459982.0A; Status: Pending; Publication: CN113160844A (en); Priority Date: 2021-04-27; Filing Date: 2021-04-27; Title: Speech enhancement method and system based on noise background classification

Country Status (1)

Country: CN; Publication: CN113160844A (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104795064A (en) * 2015-03-30 2015-07-22 福州大学 Recognition method for sound event under scene of low signal to noise ratio
CN108806708A (en) * 2018-06-13 2018-11-13 中国电子科技集团公司第三研究所 Voice de-noising method based on Computational auditory scene analysis and generation confrontation network model
CN109285538A (en) * 2018-09-19 2019-01-29 宁波大学 A kind of mobile phone source title method under the additive noise environment based on normal Q transform domain
CN109448702A (en) * 2018-10-30 2019-03-08 上海力声特医学科技有限公司 Artificial cochlea's auditory scene recognition methods
CN109859767A (en) * 2019-03-06 2019-06-07 哈尔滨工业大学(深圳) A kind of environment self-adaption neural network noise-reduction method, system and storage medium for digital deaf-aid
EP3716270A1 (en) * 2019-03-29 2020-09-30 Goodix Technology (HK) Company Limited Speech processing system and method therefor
CN110164472A (en) * 2019-04-19 2019-08-23 天津大学 Noise classification method based on convolutional neural networks
US20200335086A1 (en) * 2019-04-19 2020-10-22 Behavioral Signal Technologies, Inc. Speech data augmentation
CN112446242A (en) * 2019-08-29 2021-03-05 北京三星通信技术研究有限公司 Acoustic scene classification method and device and corresponding equipment
CN110600054A (en) * 2019-09-06 2019-12-20 南京工程学院 Sound scene classification method based on network model fusion
CN111785288A (en) * 2020-06-30 2020-10-16 北京嘀嘀无限科技发展有限公司 Voice enhancement method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Zaiwen: "Intelligent Soft Measurement and Control Methods for Water Environment Systems" (《水环境系统智能化软测量与控制方法》), China Light Industry Press, 31 March 2013 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114267372A (en) * 2021-12-31 2022-04-01 思必驰科技股份有限公司 Voice noise reduction method, system, electronic device and storage medium
CN116597855A (en) * 2023-07-18 2023-08-15 深圳市则成电子股份有限公司 Adaptive noise reduction method and device and computer equipment
CN116597855B (en) * 2023-07-18 2023-09-29 深圳市则成电子股份有限公司 Adaptive noise reduction method and device and computer equipment

Similar Documents

Publication Publication Date Title
CA2498015C (en) Combining active and semi-supervised learning for spoken language understanding
CN113470662A (en) Generating and using text-to-speech data for keyword spotting systems and speaker adaptation in speech recognition systems
CN113160844A (en) Speech enhancement method and system based on noise background classification
US20030061037A1 (en) Method and apparatus for identifying noise environments from noisy signals
CN101154380B (en) Method and device for registration and validation of speaker's authentication
CN111081230B (en) Speech recognition method and device
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
JP2013539558A (en) Parameter speech synthesis method and system
CN112735482A (en) Endpoint detection method and system based on combined deep neural network
CN1199488A (en) Pattern recognition
CN104299623A (en) Automated confirmation and disambiguation modules in voice applications
KR20150145024A (en) Terminal and server of speaker-adaptation speech-recognition system and method for operating the system
CN113611293B (en) Mongolian data set expansion method
CN114664318A (en) Voice enhancement method and system based on generation countermeasure network
JP2020071482A (en) Word sound separation method, word sound separation model training method and computer readable medium
KR102241364B1 (en) Apparatus and method for determining user stress using speech signal
CN108369803B (en) Method for forming an excitation signal for a parametric speech synthesis system based on a glottal pulse model
CN112750445A (en) Voice conversion method, device and system and storage medium
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
CN118675536A (en) Real-time packet loss concealment using deep-drawn networks
CN117373431A (en) Audio synthesis method, training method, device, equipment and storage medium
CN110633735B (en) Progressive depth convolution network image identification method and device based on wavelet transformation
CN115376498A (en) Speech recognition method, model training method, device, medium, and electronic apparatus
Alashban et al. Speaker gender classification in mono-language and cross-language using BLSTM network
JP2017520016A (en) Excitation signal generation method of glottal pulse model based on parametric speech synthesis system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination