WO2022116487A1 - Speech processing method, apparatus, device, and medium based on a generative adversarial network - Google Patents
Speech processing method, apparatus, device, and medium based on a generative adversarial network
- Publication number
- WO2022116487A1 (application PCT/CN2021/096660)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- noise
- adversarial network
- generative adversarial
- target
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Definitions
- the present application relates to the technical field of speech processing, and in particular, to a method, apparatus, device, and medium for speech processing based on generative adversarial networks.
- Speech processing includes steps such as speech enhancement (Speech Enhancement) and speech endpoint detection (Voice Activity Detection).
- speech enhancement is to remove the background noise mixed in the speech signal. By removing the background noise, a clearer speech signal can be obtained, which is beneficial to obtain better performance for subsequent tasks.
- voice endpoint detection is to obtain the starting and ending points of speech. By eliminating non-speech, subsequent computation can be reduced and the robustness and accuracy of downstream speech systems can be improved. However, excessive background noise in real environments poses great challenges to speech processing.
- the existing method is to input the to-be-processed speech carrying background noise into a generative adversarial network, use the discriminator in the generative adversarial network to discriminate the to-be-processed speech, and then train on the discrimination results to remove the background noise.
- the inventor realized that, because this method discriminates the to-be-processed speech directly, large errors in the discrimination results easily arise, so that the noise-removal effect of the final speech processing is not obvious enough and the accuracy of speech processing is low. A method that can improve the accuracy of speech processing is urgently needed.
- the purpose of the embodiments of the present application is to propose a voice processing method, apparatus, device and medium based on a generative adversarial network, so as to improve the accuracy of voice processing.
- an embodiment of the present application provides a method for speech processing based on a generative adversarial network, which includes:
- the speech signals to be spliced are spliced according to the cutting order mark to obtain a reconstructed speech signal.
- an embodiment of the present application further provides a voice processing device based on a generative adversarial network, which includes:
- the to-be-processed speech segment acquisition module is used to acquire the to-be-processed speech segment, cuts the to-be-processed speech segment according to the preset length, and marks the cutting order to obtain the cutting speech segment and the cutting order mark;
- a cut speech segment input module for inputting the cut speech segment into a trained generative adversarial network to obtain the noise-reduced speech signal and the corresponding speech endpoint information of the noise-reduced speech signal;
- a voice signal module to be spliced for combining the noise-reduced voice signal with the corresponding voice endpoint information to form a voice signal to be spliced
- the reshaped voice signal acquisition module is used for splicing the voice signals to be spliced according to the cutting order mark to obtain the reshaped voice signal.
- an embodiment of the present application further provides a computer device, which includes a memory and a processor, where computer-readable instructions are stored in the memory, and the processor implements the following steps when executing the computer-readable instructions:
- the speech signals to be spliced are spliced according to the cutting order mark to obtain a reconstructed speech signal.
- embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the processor executes the following steps:
- the speech signals to be spliced are spliced according to the cutting order mark to obtain a reconstructed speech signal.
- the embodiments of the present application provide a method, apparatus, device, and medium for speech processing based on a generative adversarial network.
- by combining the noise-reduced speech signal obtained after speech enhancement with the voice endpoint information obtained after voice detection, a reconstructed speech signal that has undergone both speech enhancement and endpoint detection is obtained, which facilitates voice judgment on the reconstructed speech signal and effectively improves the accuracy of speech processing.
- FIG. 1 is a schematic diagram of an application environment of a voice processing method based on a generative adversarial network provided by an embodiment of the present application;
- FIG. 2 is a flowchart of an implementation of a method for speech processing based on a generative adversarial network provided according to an embodiment of the present application
- FIG. 3 is a flowchart of an implementation of a sub-process in the voice processing method based on generative adversarial networks provided by an embodiment of the present application;
- Fig. 4 is another implementation flowchart of the sub-process in the speech processing method based on generative adversarial network provided by the embodiment of the present application;
- Fig. 5 is another implementation flowchart of the sub-process in the speech processing method based on generative adversarial network provided by the embodiment of the present application;
- Fig. 6 is another implementation flowchart of the sub-process in the speech processing method based on generative adversarial network provided by the embodiment of the present application;
- Fig. 7 is another implementation flowchart of the sub-process in the speech processing method based on generative adversarial network provided by the embodiment of the present application;
- FIG. 8 is another implementation flowchart of the sub-process in the speech processing method based on the generative adversarial network provided by the embodiment of the present application;
- FIG. 9 is a schematic diagram of a voice processing apparatus based on a generative adversarial network provided by an embodiment of the present application.
- FIG. 10 is a schematic diagram of a computer device provided by an embodiment of the present application.
- the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
- the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
- the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
- the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
- Various communication client applications may be installed on the terminal devices 101 , 102 and 103 , such as web browser applications, search applications, instant communication tools, and the like.
- the terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like.
- the server 105 may be a server that provides various services, such as a background server that provides support for the pages displayed on the terminal devices 101 , 102 , and 103 .
- the generative adversarial network-based speech processing method provided by the embodiments of the present application is generally executed by a server, and accordingly, a generative adversarial network-based speech processing apparatus is generally configured in the server.
- the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative; there can be any number of terminal devices, networks, and servers according to implementation needs.
- FIG. 2 shows a specific implementation manner of a speech processing method based on a generative adversarial network.
- the method of the present application is not limited to the flow sequence shown in FIG. 2, and the method includes the following steps:
- S1 Acquire the to-be-processed speech segment, cut the to-be-processed speech segment according to a preset length, and mark the cutting order to obtain the cutting speech segment and the cutting order mark.
- the server first obtains the to-be-processed speech segment, and cuts the to-be-processed speech segment according to the preset length in the order from the speech start stage to the end of the speech. At the same time, the cutting order is marked to obtain the cutting speech segment and the cutting order mark.
- the cutting sequence mark is a mark corresponding to each cut speech segment when the speech segment to be processed is cut.
- for example, if the duration of the speech segment to be processed is 500 seconds, the server cuts it into 2-second segments from the start of the speech to the end of the speech and marks the cutting order, obtaining 250 cut speech segments and the 250 corresponding cutting order marks. For instance, the cut speech segment from 0 to 2 seconds of the to-be-processed speech has a cutting order mark of 1.
- the preset length is set according to the actual situation, and is not limited here. In a specific embodiment, the preset length is 2 seconds.
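- as a rough illustration of step S1, the following Python sketch cuts a waveform into fixed-length segments and records a cutting order mark for each segment; the 2-second segment length, the 16 kHz sampling rate, and the function name are assumptions for illustration rather than details taken from this application.

```python
import numpy as np

def cut_with_order_marks(waveform: np.ndarray, sample_rate: int = 16000,
                         segment_seconds: float = 2.0):
    """Cut a 1-D waveform into fixed-length segments and mark the cutting order."""
    seg_len = int(segment_seconds * sample_rate)
    segments, order_marks = [], []
    for mark, start in enumerate(range(0, len(waveform), seg_len), start=1):
        segments.append(waveform[start:start + seg_len])
        order_marks.append(mark)   # mark 1 covers 0-2 s, mark 2 covers 2-4 s, ...
    return segments, order_marks

# 500 s of audio at 16 kHz yields 250 segments with marks 1..250
segments, marks = cut_with_order_marks(np.zeros(500 * 16000))
```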
- S2 Input the cut speech segment into the trained generative adversarial network to obtain the noise-reduced speech signal and the speech endpoint information corresponding to the noise-reduced speech signal.
- specifically, the server inputs the cut speech segments into the trained generative adversarial network and performs speech enhancement on them through the generator of the trained generative adversarial network to produce enhanced speech signals, i.e., the noise-reduced speech signals; in practice, each noise-reduced speech signal consists of the sample points corresponding to each cut speech segment. The enhanced speech signals are then input into the discriminator of the trained generative adversarial network, which judges the noise-reduced speech signals and outputs the voice endpoint information corresponding to each noise-reduced speech signal, that is, the probability value of whether the noise-reduced speech signal is a real speech signal.
- here, the noise-reduced speech signal is the enhanced speech sample signal obtained after the cut speech segment undergoes speech enhancement; the voice endpoint information is the probability value of whether the corresponding noise-reduced speech signal is a real speech signal, and this probability value is used to judge whether it is real speech, that is, the judgment result is real speech or non-real speech.
- a generative adversarial network (GAN) is a deep learning model. The model produces fairly good outputs through mutual game learning between (at least) two modules in its framework: a generative model and a discriminative model.
- the generative model corresponds to the generator of the generative adversarial network in this application and is used to output the noise-reduced speech signal obtained after the cut speech signal is enhanced; the discriminative model corresponds to the discriminator of the generative adversarial network in this application and is used to output the discrimination result.
- S3 Combine the noise-reduced speech signal with the corresponding voice endpoint information to form a speech signal to be spliced.
- specifically, the noise-reduced speech signal is the enhanced speech signal obtained after the cut speech segment undergoes speech enhancement, and the voice endpoint information is the probability value of whether the corresponding noise-reduced speech signal is a real speech signal, which is used to judge whether it is real speech or non-speech. Since in fact each noise-reduced speech signal consists of the sample points corresponding to each cut speech segment, and the voice endpoint information is the probability value corresponding to each sample point, the noise-reduced speech signal is combined with the voice endpoint information to form the speech signal to be spliced.
- thus each speech signal to be spliced has been speech-enhanced and carries a probability value of whether it is real speech; that is, each speech signal to be spliced undergoes both speech enhancement and voice detection.
- S4 Splicing the speech signals to be spliced according to the cutting order marks to obtain a reconstructed speech signal.
- the speech signals to be spliced are spliced from the beginning of each segment of speech to the end of the speech, and are spliced according to the cutting sequence marks to obtain the reconstructed speech signal.
- the reconstructed speech signal has passed speech enhancement and speech endpoint detection, and is combined with speech enhancement and speech endpoint detection to achieve the purpose of removing noise and improving the accuracy of speech processing.
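- the steps S2 to S4 can be summarized in a compact sketch, under the assumption that the trained generator and discriminator are available as callables (`generator` and `discriminator` are placeholder names, and attaching a per-segment probability to each enhanced segment is one possible reading of the combination described in S3):

```python
import numpy as np

def process_segments(segments, marks, generator, discriminator):
    """S2-S4: enhance each segment, attach its endpoint probability, splice in order."""
    to_splice = []
    for seg, mark in zip(segments, marks):
        denoised = generator(seg)                 # noise-reduced speech signal (S2)
        endpoint_prob = discriminator(denoised)   # probability of being real speech (S2)
        to_splice.append((mark, denoised, endpoint_prob))   # combined signal (S3)
    to_splice.sort(key=lambda item: item[0])      # ascending cutting order marks (S4)
    reconstructed = np.concatenate([d for _, d, _ in to_splice])
    endpoint_info = [p for _, _, p in to_splice]
    return reconstructed, endpoint_info
```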
- in this embodiment, the to-be-processed speech segment is acquired and cut according to the preset length, and the cutting order is marked to obtain the cut speech segments and the cutting order marks; the cut speech segments are input into the trained generative adversarial network to obtain the noise-reduced speech signals and the corresponding voice endpoint information; the noise-reduced speech signals are combined with the corresponding voice endpoint information to form the speech signals to be spliced; and the speech signals to be spliced are spliced according to the cutting order marks to obtain the reconstructed speech signal. By combining the speech-enhanced, noise-reduced speech signal with the voice endpoint information obtained from voice detection, a reconstructed speech signal that has undergone both speech enhancement and endpoint detection is obtained, which facilitates voice judgment on the reconstructed speech signal and effectively improves the accuracy of speech processing.
- FIG. 3 shows a specific implementation before step S1. This embodiment includes:
- S2A Acquire a preset noise voice signal and a target voice signal, and cut the noise voice signal and the target voice signal according to a preset length to obtain a noise voice segment and a target voice segment.
- the noise speech signal and the target speech signal are first obtained, and then input into the generative adversarial network for training.
- it should be noted that the preset length used to cut the noise speech signal and the target speech signal may differ from the preset length used to cut the to-be-processed speech segment in step S1, but the best implementation effect is obtained when the two preset lengths are set to the same length.
- in addition, when cutting the noise speech signal and the target speech signal, overlapping parts are set between the noise speech segments and between the target speech segments, whereas the to-be-processed speech segment does not need to be cut with overlap. This is because setting overlap during model training of the generative adversarial network increases the training data and enables the model to learn better network parameters, while during processing of the to-be-processed speech each sample point only needs to be processed once to complete the speech processing task.
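- a minimal sketch of cutting a training signal into overlapping windows follows; the 50% overlap ratio is an assumption for illustration, since the text only states that overlapping parts are set.

```python
def cut_overlapping(waveform, seg_len, overlap=0.5):
    """Cut a training signal into fixed-length windows that overlap their neighbors."""
    hop = max(1, int(seg_len * (1.0 - overlap)))
    return [waveform[start:start + seg_len]
            for start in range(0, len(waveform) - seg_len + 1, hop)]
```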
- S2B Extract noise speech segments and target speech segments as training data by random sampling without replacement.
- specifically, the noise speech segments and target speech segments are drawn by random sampling without replacement, which ensures that the sampled noise speech segments and target speech segments are not repeated and benefits the model training of the generative adversarial network.
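- sampling without replacement can be realized by shuffling the segment indices once per pass, as in the minimal sketch below, which assumes the noise segments and target segments are index-aligned pairs.

```python
import random

def sample_without_replacement(noise_segments, target_segments, batch_size):
    """Yield (noise, target) batches; each pair is used at most once per pass."""
    order = list(range(len(noise_segments)))
    random.shuffle(order)                       # random extraction
    for i in range(0, len(order), batch_size):  # no index is revisited: no replacement
        idx = order[i:i + batch_size]
        yield [noise_segments[j] for j in idx], [target_segments[j] for j in idx]
```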
- S2C Input the training data into the generative adversarial network, generate the observed speech segment and the discrimination result, and calculate the loss function value according to the observed speech segment and the discrimination result to obtain the target loss.
- specifically, since the training data includes the randomly sampled noise speech segments and target speech segments, the noise speech segments are input into the generator of the generative adversarial network to generate observed speech segments, and the noise speech segments and target speech segments are then input into the discriminator of the generative adversarial network to obtain their respective discrimination results. The loss function values are then calculated from the observed speech segments and the discrimination results to obtain the target loss.
- here, the discrimination result is obtained by inputting the training data into the discriminator of the generative adversarial network: the discriminator discriminates the training data to determine whether each piece of training data is real speech or noise; if it is real speech, the discrimination result is 1, and if it is noise, the discrimination result is 0. Since the training data is not a single item, the discrimination results contain a large number of 1s and 0s, which facilitates calculating the loss function value of the discrimination results.
- S2D Update the parameters of the generative adversarial network according to the target loss, and get the trained generative adversarial network.
- specifically, the target loss obtained in step S2C is used to correspondingly update the generator parameters and discriminator parameters of the generative adversarial network, finally yielding a trained generative adversarial network.
- in this embodiment, a preset noise speech signal and a target speech signal are acquired and cut according to a preset length to obtain noise speech segments and target speech segments; the noise speech segments and target speech segments are then extracted as training data by random sampling without replacement; the training data is input into the generative adversarial network to generate observed speech segments and discrimination results, and the loss function values are calculated from the observed speech segments and discrimination results to obtain the target loss; finally, the parameters of the generative adversarial network are updated according to the target loss to obtain the trained generative adversarial network. Training the generative adversarial network on the noise speech segments and target speech segments in this way facilitates the subsequent output of the noise-reduced speech signal and its corresponding voice endpoint information, thereby improving the accuracy of speech processing.
- FIG. 4 shows a specific implementation of step S2C.
- in step S2C, the training data is input into the generative adversarial network, the observed speech segments and discrimination results are generated, and the loss function values are calculated from the observed speech segments and the discrimination results to obtain the target loss. The specific implementation process is described in detail as follows:
- S2C1 Input the noisy speech segment in the training data into the generator of the generative adversarial network, generate the observed speech segment, and calculate the loss function value of the observed speech segment and the target speech segment in the training data to obtain the first loss value.
- specifically, by inputting the noise speech segment in the training data into the generator of the generative adversarial network, the detected and speech-enhanced speech signal, that is, the observed speech segment, can be obtained. Then, through the loss function, the loss function value between the observed speech segment and the target speech segment in the training data is calculated to obtain the first loss value. The first loss value measures the degree of deviation between the observed speech segment and the target speech segment: the larger the first loss value, the more dissimilar the observed speech segment and the target speech segment are, i.e., the greater the deviation between the two. In this application, the generative adversarial network is trained so that it can distinguish noise from real speech to the greatest extent; that is, the larger the obtained first loss value, the closer the training of the generative adversarial network is to completion. Therefore, the first loss value is used in subsequent steps to update the generator parameters.
- S2C2 Input the noisy speech segment in the training data into the discriminator of the generative adversarial network to obtain the first discriminant result, and calculate the loss function value of the first discriminant result to obtain the second loss value.
- specifically, the discriminator discriminates the noise speech segments in the training data to determine whether each noise speech segment is real speech or noise; if it is real speech, the first discrimination result is 1, and if it is noise, the first discrimination result is 0. Since the noise speech segments in the training data are not a single item, the first discrimination results contain a large number of 1s and 0s, and the loss function value of the first discrimination results is calculated to obtain the second loss value.
- S2C3 Input the target speech segment in the training data into the discriminator of the generative adversarial network to obtain the second discriminant result, and calculate the loss function value of the second discriminant result to obtain the third loss value.
- specifically, the discriminator discriminates the target speech segments in the training data to determine whether each target speech segment is real speech or noise; if it is real speech, the second discrimination result is 1, and if it is noise, the second discrimination result is 0. Since the target speech segments in the training data are not a single item, the second discrimination results contain a large number of 1s and 0s, and the loss function value of the second discrimination results is calculated to obtain the third loss value.
- S2C4 Use the first loss value, the second loss value, and the third loss value as the target loss.
- specifically, the first loss value, the second loss value, and the third loss value are used as the target loss for subsequently updating the parameters of the generative adversarial network.
- in this embodiment, the noise speech segments in the training data are input into the generator of the generative adversarial network to generate the observed speech segments, and the loss function value between the observed speech segments and the target speech segments in the training data is calculated to obtain the first loss value; the noise speech segments in the training data are input into the discriminator of the generative adversarial network to obtain the first discrimination result, and its loss function value is calculated to obtain the second loss value; the target speech segments in the training data are input into the discriminator of the generative adversarial network to obtain the second discrimination result, and its loss function value is calculated to obtain the third loss value; and the first, second, and third loss values are taken as the target loss. Calculating the loss function values from different data in this way facilitates the subsequent update of the parameters of the generative adversarial network, thereby improving the accuracy of speech processing.
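- the application does not name concrete loss functions; the sketch below uses an L1 distance for the first loss value and binary cross-entropy for the second and third loss values purely as illustrative stand-ins, and all function names are assumptions.

```python
import numpy as np

def bce(prob, label, eps=1e-7):
    """Binary cross-entropy of predicted probabilities against a constant label."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(np.mean(-(label * np.log(prob) + (1 - label) * np.log(1 - prob))))

def target_loss(noise_seg, target_seg, generator, discriminator):
    observed = generator(noise_seg)                        # S2C1: observed speech segment
    loss1 = float(np.mean(np.abs(observed - target_seg)))  # first loss: observed vs. target
    loss2 = bce(discriminator(noise_seg), 0.0)             # S2C2: noise segments labeled 0
    loss3 = bce(discriminator(target_seg), 1.0)            # S2C3: target segments labeled 1
    return loss1, loss2, loss3                              # S2C4: the target loss
```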
- FIG. 5 shows a specific implementation of step S2D.
- in step S2D, the parameters of the generative adversarial network are updated according to the target loss to obtain the trained generative adversarial network. The specific implementation process is described in detail as follows:
- S2D1 According to the first loss value, update the generator parameters of the generative adversarial network.
- specifically, since the first loss value is obtained from the observed speech segment generated by the generator, by calculating the loss function value between the observed speech segment and the target speech segment in the training data, the generator parameters of the generative adversarial network are updated according to the first loss value. This facilitates updating the parameters of the generative adversarial network.
- S2D2 According to the second loss value and the third loss value, update the discriminator parameters of the generative adversarial network.
- specifically, since both the second loss value and the third loss value are calculated from the discrimination results produced by the discriminator, using them to update the discriminator parameters of the generative adversarial network facilitates updating the parameters of the generative adversarial network.
- S2D3 When the first loss value reaches a preset threshold, stop updating the parameters of the generative adversarial network to obtain the trained generative adversarial network.
- specifically, the network parameters of the generative adversarial network are updated through the first loss value, the second loss value, and the third loss value. If the first loss value has not reached the preset threshold, the first, second, and third loss values are regenerated according to steps S2C1 to S2C3 above and the network parameters are updated again, until the first loss value reaches the preset threshold, which indicates that the trained generative adversarial network is fully capable of recognizing the noise speech signal and the target speech signal. Therefore, when the first loss value reaches the preset threshold, updating of the parameters of the generative adversarial network is stopped and the trained generative adversarial network is obtained.
- the preset threshold is set according to the actual situation, and is not limited here. In a specific embodiment, the preset threshold is 0.95.
- in this embodiment, the generator parameters of the generative adversarial network are updated according to the first loss value, the discriminator parameters of the generative adversarial network are updated according to the second loss value and the third loss value, and when the first loss value reaches the preset threshold, updating of the parameters of the generative adversarial network is stopped and the trained generative adversarial network is obtained. Updating the generative adversarial network in this way helps improve the accuracy of speech processing.
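- a minimal training-loop sketch tying S2A to S2D together is given below; it reuses the `sample_without_replacement` and `target_loss` sketches above, and `update_generator` and `update_discriminator` are placeholder helpers assumed to apply gradient steps to the respective parameters. Stopping once the first loss value reaches the threshold follows the description above; all other choices are illustrative assumptions.

```python
def train_gan(noise_segments, target_segments, generator, discriminator,
              update_generator, update_discriminator,
              batch_size=16, threshold=0.95, max_epochs=100):
    """S2A-S2D: alternate loss computation and parameter updates until the first loss reaches the threshold."""
    for _ in range(max_epochs):
        for noise_batch, target_batch in sample_without_replacement(
                noise_segments, target_segments, batch_size):
            for noise_seg, target_seg in zip(noise_batch, target_batch):
                loss1, loss2, loss3 = target_loss(noise_seg, target_seg,
                                                  generator, discriminator)
                update_generator(loss1)                 # S2D1
                update_discriminator(loss2 + loss3)     # S2D2
                if loss1 >= threshold:                  # S2D3: stop at the preset threshold
                    return generator, discriminator
    return generator, discriminator
```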
- FIG. 6 shows a specific implementation of step S2.
- in step S2, the cut speech segments are input into the trained generative adversarial network to obtain the noise-reduced speech signal and the voice endpoint information corresponding to the noise-reduced speech signal. The specific implementation process is described in detail as follows:
- S21 Input the cut speech segment into the trained generative adversarial network, and generate sequence matrix features for the cut speech segment through the encoder-decoder model of the generator.
- specifically, the encoder-decoder model provides encoding and decoding functions: the encoder converts the input sequence into a dense vector of fixed dimensions, and the decoding stage generates the target output from this encoded state. In this embodiment, the encoder-decoder model of the generator first generates a sequence of dense vectors for the cut speech segment and then converts this sequence of dense vectors into sequence matrix features in matrix form.
- here, the sequence matrix features are generated by the encoder-decoder model of the generator after encoding and decoding the cut speech segment, and they represent the feature information of the cut speech segment. For example, a sequence feature Y containing feature information y1, y2, y3, and y4 is written as Y = {y1, y2, y3, y4}.
- S22 Combine sequence matrix features of the same size by means of skip connections to obtain a target feature.
- specifically, as the network depth of the generative adversarial network model increases during training, gradient explosion and gradient vanishing are likely to occur, which is unfavorable for training the model. Skip connections are therefore introduced to establish transmission channels between shallow-layer network information and deep-layer network information, combining sequence matrix features of the same size to obtain the target feature and to address gradient explosion and gradient vanishing.
- sequence matrix features of the same size refer to the sequence matrix features with the same width and height.
- further, the entire network of the generator is constructed with convolutional neural networks. A convolutional neural network is a kind of feedforward neural network that includes convolutional computation and has a deep structure, and it is one of the representative algorithms of deep learning. Convolutional neural networks have representation learning ability and can perform shift-invariant classification of input information according to their hierarchical structure.
- here, the skip connection combines sequence matrix features of the same size by establishing transmission channels between shallow-layer network information and deep-layer network information, so as to solve the problems of gradient explosion and gradient vanishing in training the generative adversarial network model.
- S23 Input the target feature into the fully connected layer network of the generator to obtain the noise-reduced speech signal.
- specifically, since the entire network of the generator is constructed with convolutional neural networks, the network also includes a two-layer fully connected network whose input signals are the hidden vector and the last layer of the encoder-decoder model, that is, the input is the target feature. This fully connected layer is used to generate the voice endpoint result; that is, in this embodiment, the noise-reduced speech signal is obtained by inputting the target feature into the fully connected layer network of the generator, thereby obtaining the enhanced speech signal.
- S24 Input the noise-reduced speech signal into the trained discriminator of the generative adversarial network to obtain speech endpoint information corresponding to the noise-reduced speech signal.
- specifically, the discriminator of the trained generative adversarial network discriminates the noise-reduced speech signals and outputs the voice endpoint information corresponding to each noise-reduced speech signal, from which the probability that each noise-reduced speech signal is real speech is judged.
- in this embodiment, the cut speech segments are input into the trained generative adversarial network; sequence matrix features are generated for the cut speech segments through the encoder-decoder model of the generator; sequence matrix features of the same size are combined by means of skip connections to obtain the target feature; the target feature is input into the fully connected layer network of the generator to obtain the noise-reduced speech signal; and the noise-reduced speech signal is then input into the discriminator of the trained generative adversarial network to obtain the voice endpoint information corresponding to the noise-reduced speech signal. This realizes the acquisition of the noise-reduced speech signal and its corresponding voice endpoint information, which facilitates the subsequent reconstruction of the speech signal and thereby improves the accuracy of speech processing.
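- the text above describes the generator as a convolutional encoder-decoder with skip connections followed by a two-layer fully connected head, and the discriminator as a network that outputs a real-speech probability. The PyTorch sketch below is one plausible reading of that description; the layer sizes, kernel widths, toy segment length, and overall topology are illustrative assumptions rather than the architecture actually claimed.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Conv encoder-decoder with a skip connection and a 2-layer fully connected head (S21-S23)."""
    def __init__(self, seg_len: int = 1024):
        super().__init__()
        self.enc1 = nn.Conv1d(1, 16, kernel_size=32, stride=2, padding=15)
        self.enc2 = nn.Conv1d(16, 32, kernel_size=32, stride=2, padding=15)
        self.dec2 = nn.ConvTranspose1d(32, 16, kernel_size=32, stride=2, padding=15)
        self.dec1 = nn.ConvTranspose1d(32, 1, kernel_size=32, stride=2, padding=15)
        self.fc = nn.Sequential(nn.Linear(seg_len, seg_len), nn.Tanh(),
                                nn.Linear(seg_len, seg_len))

    def forward(self, x):                                # x: (batch, 1, seg_len)
        e1 = torch.relu(self.enc1(x))                    # shallow feature
        e2 = torch.relu(self.enc2(e1))                   # deep feature
        d2 = torch.relu(self.dec2(e2))                   # same size as e1
        d1 = self.dec1(torch.cat([d2, e1], dim=1))       # skip connection (S22)
        return self.fc(d1.squeeze(1))                    # fully connected head (S23)

class Discriminator(nn.Module):
    """Outputs the probability that a segment is real speech (S24)."""
    def __init__(self, seg_len: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(seg_len, 256), nn.ReLU(),
                                 nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, x):                                # x: (batch, seg_len)
        return self.net(x)
```

- a quick shape check under these assumptions: `Generator()(torch.zeros(2, 1, 1024)).shape` is `(2, 1024)`, which the discriminator then maps to a `(2, 1)` probability.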
- FIG. 7 shows a specific implementation of step S22.
- in step S22, the sequence matrix features of the same size are combined by means of skip connections to obtain the target feature. The specific implementation process is described in detail as follows:
- S221 Traverse the sequence matrix features and use sequence matrix features of the same size as target matrices, wherein the width and height of a target matrix are the same.
- specifically, in order to solve the problems of gradient explosion and gradient vanishing during training of the generative adversarial network model, the sequence matrix features of the shallow network layers and the deep network layers are traversed to obtain sequence matrix features of the same size, which are used as the target matrices. The width and height of a target matrix are the same.
- S222 Combine the target matrices by means of skip connections to obtain the target feature.
- specifically, by means of skip connections, transmission channels between the shallow network layers and the deep network layers of the fully connected network are established, and the target matrices are combined to obtain the target feature.
- in this embodiment, the sequence matrix features are traversed to obtain sequence matrix features of the same size, which are used as the target matrices, and the target matrices are combined by means of skip connections to obtain the target feature. This solves the problems of gradient explosion and gradient vanishing in the training process of the generative adversarial network model, which benefits the training of the generative adversarial network and in turn helps improve the accuracy of speech processing.
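- a small sketch of S221-S222: index the shallow-layer feature maps by size, traverse the deep-layer feature maps, and combine each matching pair (concatenation along the channel axis is one common way to combine features of equal width and height; the dictionary bookkeeping is an illustrative assumption).

```python
import torch

def combine_same_size(shallow_feats, deep_feats):
    """Pair shallow/deep feature maps of equal (height, width) and concatenate them."""
    by_size = {f.shape[-2:]: f for f in shallow_feats}   # index shallow features by size
    combined = []
    for deep in deep_feats:                              # traverse deep-layer features
        shallow = by_size.get(deep.shape[-2:])
        if shallow is not None:                          # same width and height: target matrices
            combined.append(torch.cat([shallow, deep], dim=1))   # skip-connection combination
    return combined
```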
- FIG. 8 shows a specific implementation of step S4. In step S4, the speech signals to be spliced are spliced according to the cutting order marks to obtain the reconstructed speech signal. The specific implementation process is described in detail as follows:
- S41 Arrange the speech signals to be spliced according to the order of cutting order marks from small to large to obtain a speech sequence.
- the speech signals to be spliced are arranged according to the cutting order marks in ascending order to obtain a speech sequence.
- S42 According to the speech sequence, splicing the head and tail of the speech signal to be spliced to obtain a reconstructed speech signal.
- specifically, the head and tail of adjacent speech signals to be spliced are joined to form a complete reconstructed speech signal. The reconstructed speech signal has undergone speech enhancement and voice endpoint detection, and by combining speech enhancement with voice endpoint detection, the purposes of removing noise and improving the accuracy of speech processing are achieved.
- in this embodiment, the speech signals to be spliced are arranged according to the cutting order marks in ascending order to obtain the speech sequence, and the speech signals to be spliced are joined head to tail according to the speech sequence to obtain the reconstructed speech signal. This achieves the purpose of speech processing, and the reconstructed speech signal has the characteristics of both speech enhancement and voice endpoint detection, which benefits the accuracy of speech processing.
- the above-mentioned speech segment to be processed may also be stored in a node of a blockchain.
- the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
- the present application provides an embodiment of a speech processing apparatus based on a generative adversarial network.
- the apparatus embodiment corresponds to the method embodiment shown in FIG. 2.
- the device can be specifically applied to various electronic devices.
- the speech processing apparatus based on the generative adversarial network of this embodiment includes: a to-be-processed speech segment acquisition module 51, a cut speech segment input module 52, a to-be-spliced speech signal module 53, and a reconstructed speech signal acquisition module 54, wherein:
- the to-be-processed speech segment acquisition module 51 is used to acquire the to-be-processed speech segment, cuts the to-be-processed speech segment according to the preset length, and marks the cutting order to obtain the cutting speech segment and the cutting order mark;
- the cut speech segment input module 52 is used to input the cut speech segment into the trained generation adversarial network to obtain the speech endpoint information corresponding to the noise-reduced speech signal and the noise-reduced speech signal;
- the voice signal module 53 to be spliced is used to combine the noise-reduced voice signal with the corresponding voice endpoint information to form the voice signal to be spliced;
- the reshaped voice signal acquisition module 54 is used for splicing the voice signals to be spliced according to the cutting order marks to obtain the reshaped voice signal.
- the speech processing device based on the generative adversarial network further includes:
- the voice cutting module is used to obtain the preset noise voice signal and the target voice signal, and cut the noise voice signal and the target voice signal according to the preset length to obtain the noise voice segment and the target voice segment;
- the training data acquisition module is used to extract the noise speech segment and the target speech segment as training data according to the method of random extraction and no replacement;
- the target loss acquisition module is used to input the training data into the generative adversarial network, generate the observed speech segment and the discrimination result, and calculate the loss function value according to the observed speech segment and the discrimination result to obtain the target loss;
- the parameter update module is used to update the parameters of the generative adversarial network according to the target loss, and obtain the trained generative adversarial network.
- the target loss acquisition module includes:
- the first loss value calculation unit is used to input the noisy speech segment in the training data into the generator of the generative adversarial network, generate the observed speech segment, and calculate the loss function value of the observed speech segment and the target speech segment in the training data, get the first loss value;
- the second loss value calculation unit is used to input the noise speech segment in the training data into the discriminator of the generative adversarial network, obtain the first discrimination result, and calculate the loss function value of the first discrimination result to obtain the second loss value;
- the third loss value calculation unit is used to input the target speech segment in the training data into the discriminator of the generative adversarial network to obtain the second discrimination result, and calculate the loss function value of the second discrimination result to obtain the third loss value;
- the target loss defining unit is used for taking the first loss value, the second loss value and the third loss value as the target loss.
- parameter update module includes:
- a generator parameter updating unit configured to update the generator parameters of the generative adversarial network according to the first loss value
- the discriminator parameter updating unit is used to update the discriminator parameters of the generated adversarial network according to the second loss value and the third loss value;
- the update stop unit is configured to stop updating the parameters of the generative adversarial network when the first loss value reaches a preset threshold, so as to obtain a trained generative adversarial network.
- cut speech segment input module 52 includes:
- the sequence matrix feature unit is used to input the cut speech segment into the trained generative adversarial network, and generate sequence matrix features for the cut speech segment through the encoder-decoding model of the generator;
- the target feature acquisition unit is used to combine the sequence matrix features of the same size in the manner of skip connection to obtain the target feature
- the noise-reduced speech signal unit is used to input the target feature into the fully connected layer network of the generator to obtain the noise-reduced speech signal;
- the voice endpoint information unit is used for inputting the noise-reduced voice signal into the trained discriminator of the generative adversarial network to obtain voice endpoint information corresponding to the noise-reduced voice signal.
- the target feature acquisition unit includes:
- the target matrix acquisition subunit is used to traverse the sequence matrix features and obtain the sequence matrix features of the same size as the target matrix, wherein the width and height of the target matrix are consistent;
- the target feature acquisition sub-unit is used to combine target matrices by means of skip connection to obtain target features.
- the reconstructed speech signal acquisition module 54 includes:
- the phonetic sequence acquisition unit is used for arranging the voice signals to be spliced according to the order of the cutting order mark from small to large to obtain the phonetic sequence;
- the speech signal reshaping unit is used for splicing the head and tail of the speech signal to be spliced according to the speech sequence to obtain the reshaped speech signal.
- the above-mentioned speech segment to be processed may also be stored in a node of a blockchain.
- FIG. 10 is a block diagram of the basic structure of a computer device according to this embodiment.
- the computer device 6 includes a memory 61 , a processor 62 , and a network interface 63 connected to each other through a system bus. It should be pointed out that the figure only shows the computer device 6 with three components, the memory 61, the processor 62, and the network interface 63, but it should be understood that it is not required to implement all the shown components, and alternative implementations are possible. More or fewer components.
- the computer device here is a device that can automatically perform numerical calculation and/or information processing according to pre-set or stored instructions, and its hardware includes but is not limited to microprocessors, special-purpose Integrated circuit (Application Specific Integrated Circuit, ASIC), programmable gate array (Field-Programmable Gate Array, FPGA), digital processor (Digital Signal Processor, DSP), embedded equipment, etc.
- the memory 61 stores computer-readable instructions, and when the processor 62 executes the computer-readable instructions, all steps of any embodiment of the above-mentioned generative adversarial network-based speech processing method can be implemented.
- the computer equipment may be a desktop computer, a notebook computer, a palmtop computer, and a cloud server and other computing equipment.
- Computer devices can interact with users through keyboards, mice, remote controls, touchpads, or voice-activated devices.
- the memory 61 includes at least one type of readable storage medium, the computer-readable storage medium may be non-volatile or volatile, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type storage (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), Magnetic memory, magnetic disk, optical disk, etc.
- the memory 61 may be an internal storage unit of the computer device 6 , such as a hard disk or a memory of the computer device 6 .
- the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
- the memory 61 may also include both the internal storage unit of the computer device 6 and its external storage device.
- the memory 61 is generally used to store the operating system and various application software installed on the computer device 6, such as computer-readable instructions based on the generative adversarial network-based voice processing method.
- the memory 61 can also be used to temporarily store various types of data that have been output or will be output.
- the processor 62 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
- the processor 62 is typically used to control the overall operation of the computer device 6 .
- the processor 62 is configured to execute computer-readable instructions stored in the memory 61 or process data, for example, computer-readable instructions for executing a voice processing method based on a generative adversarial network.
- the network interface 63 may comprise a wireless network interface or a wired network interface, and the network interface 63 is typically used to establish a communication connection between the computer device 6 and other electronic devices.
- the present application also provides another embodiment, which is to provide a computer-readable storage medium storing computer-readable instructions, where the computer-readable instructions can be executed by at least one processor to cause the at least one processor to perform all steps of any of the above embodiments of the generative adversarial network-based speech processing method.
- the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation.
- the technical solution of the present application can be embodied in the form of a software product in essence or in a part that contributes to the prior art, and the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, CD-ROM), including several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods of the various embodiments of the present application.
- the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
- a blockchain is essentially a decentralized database, a chain of data blocks associated with one another by cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
- the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Telephonic Communication Services (AREA)
Abstract
Disclosed are a speech processing method, apparatus, device, and medium based on a generative adversarial network, relating to the technical field of speech processing. The method comprises: acquiring a to-be-processed speech segment, cutting the to-be-processed speech segment according to a preset length, and marking the cutting order to obtain cut speech segments and cutting order marks (S1); inputting the cut speech segments into a trained generative adversarial network to obtain noise-reduced speech signals and voice endpoint information corresponding to the noise-reduced speech signals (S2); combining the noise-reduced speech signals with the corresponding voice endpoint information to form speech signals to be spliced (S3); and splicing the speech signals to be spliced according to the cutting order marks to obtain a reconstructed speech signal (S4). Blockchain technology is also involved: the to-be-processed speech segment is stored in a blockchain. By combining the noise-reduced speech signal with the voice endpoint information, the accuracy of speech processing is effectively improved.
Description
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on December 1, 2020, with application number 202011387380.0 and invention title "Speech processing method, apparatus, device and medium based on generative adversarial network", the entire contents of which are incorporated herein by reference.
The present application relates to the technical field of speech processing, and in particular to a speech processing method, apparatus, device, and medium based on a generative adversarial network.
Speech processing includes steps such as speech enhancement and voice endpoint detection (Voice Activity Detection). Speech enhancement aims to remove the background noise mixed into the speech signal; by removing the background noise, a clearer speech signal can be obtained, which helps subsequent tasks achieve better performance. Voice endpoint detection aims to obtain the starting and ending points of speech; by eliminating non-speech, subsequent computation can be reduced and the robustness and accuracy of downstream speech systems can be improved. However, excessive background noise in real environments poses great challenges to speech processing.
To solve the problem of excessive background noise in real environments, the existing method inputs the to-be-processed speech carrying background noise into a generative adversarial network, uses the discriminator in the generative adversarial network to discriminate the to-be-processed speech, and then trains on the discrimination results to remove the background noise. However, the inventor realized that because this method discriminates the to-be-processed speech directly, large errors in the discrimination results easily arise, so that the noise-removal effect of the final speech processing is not obvious enough and the accuracy of speech processing is low. A method that can improve the accuracy of speech processing is urgently needed.
SUMMARY OF THE INVENTION
The purpose of the embodiments of the present application is to propose a speech processing method, apparatus, device, and medium based on a generative adversarial network, so as to improve the accuracy of speech processing.
In a first aspect, an embodiment of the present application provides a speech processing method based on a generative adversarial network, comprising:
acquiring a to-be-processed speech segment, cutting the to-be-processed speech segment according to a preset length, and marking the cutting order to obtain cut speech segments and cutting order marks;
inputting the cut speech segments into a trained generative adversarial network to obtain a noise-reduced speech signal and voice endpoint information corresponding to the noise-reduced speech signal;
combining the noise-reduced speech signal with the corresponding voice endpoint information to form a speech signal to be spliced;
splicing the speech signals to be spliced according to the cutting order marks to obtain a reconstructed speech signal.
In a second aspect, an embodiment of the present application further provides a speech processing apparatus based on a generative adversarial network, comprising:
a to-be-processed speech segment acquisition module, configured to acquire a to-be-processed speech segment, cut the to-be-processed speech segment according to a preset length, and mark the cutting order to obtain cut speech segments and cutting order marks;
a cut speech segment input module, configured to input the cut speech segments into a trained generative adversarial network to obtain a noise-reduced speech signal and voice endpoint information corresponding to the noise-reduced speech signal;
a to-be-spliced speech signal module, configured to combine the noise-reduced speech signal with the corresponding voice endpoint information to form a speech signal to be spliced;
a reconstructed speech signal acquisition module, configured to splice the speech signals to be spliced according to the cutting order marks to obtain a reconstructed speech signal.
In a third aspect, an embodiment of the present application further provides a computer device, comprising a memory and a processor, where the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
acquiring a to-be-processed speech segment, cutting the to-be-processed speech segment according to a preset length, and marking the cutting order to obtain cut speech segments and cutting order marks;
inputting the cut speech segments into a trained generative adversarial network to obtain a noise-reduced speech signal and voice endpoint information corresponding to the noise-reduced speech signal;
combining the noise-reduced speech signal with the corresponding voice endpoint information to form a speech signal to be spliced;
splicing the speech signals to be spliced according to the cutting order marks to obtain a reconstructed speech signal.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing computer-readable instructions, where the computer-readable instructions, when executed by a processor, cause the processor to execute the following steps:
acquiring a to-be-processed speech segment, cutting the to-be-processed speech segment according to a preset length, and marking the cutting order to obtain cut speech segments and cutting order marks;
inputting the cut speech segments into a trained generative adversarial network to obtain a noise-reduced speech signal and voice endpoint information corresponding to the noise-reduced speech signal;
combining the noise-reduced speech signal with the corresponding voice endpoint information to form a speech signal to be spliced;
splicing the speech signals to be spliced according to the cutting order marks to obtain a reconstructed speech signal.
The embodiments of the present application provide a speech processing method, apparatus, device, and medium based on a generative adversarial network. By combining the noise-reduced speech signal obtained after speech enhancement with the voice endpoint information obtained after voice detection, the embodiments obtain a reconstructed speech signal that has undergone both speech enhancement and endpoint detection, which facilitates voice judgment on the reconstructed speech signal and effectively improves the accuracy of speech processing.
To explain the solutions in the present application more clearly, the drawings needed for describing the embodiments of the present application are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of the application environment of the speech processing method based on a generative adversarial network provided by an embodiment of the present application;
FIG. 2 is an implementation flowchart of the speech processing method based on a generative adversarial network provided by an embodiment of the present application;
FIG. 3 is an implementation flowchart of a sub-process in the speech processing method based on a generative adversarial network provided by an embodiment of the present application;
FIG. 4 is another implementation flowchart of a sub-process in the speech processing method based on a generative adversarial network provided by an embodiment of the present application;
FIG. 5 is another implementation flowchart of a sub-process in the speech processing method based on a generative adversarial network provided by an embodiment of the present application;
FIG. 6 is another implementation flowchart of a sub-process in the speech processing method based on a generative adversarial network provided by an embodiment of the present application;
FIG. 7 is another implementation flowchart of a sub-process in the speech processing method based on a generative adversarial network provided by an embodiment of the present application;
FIG. 8 is another implementation flowchart of a sub-process in the speech processing method based on a generative adversarial network provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of the speech processing apparatus based on a generative adversarial network provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of the computer device provided by an embodiment of the present application.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field to which the present application belongs; the terms used herein in the specification of the application are only for the purpose of describing specific embodiments and are not intended to limit the present application; the terms "including" and "having" and any variations thereof in the specification, claims, and above description of the drawings of the present application are intended to cover non-exclusive inclusion. The terms "first", "second", and the like in the specification, claims, or above drawings of the present application are used to distinguish different objects, not to describe a specific order.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearance of this phrase in various places in the specification does not necessarily all refer to the same embodiment, nor is it an independent or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings.
The present application is described in detail below with reference to the drawings and embodiments.
Referring to FIG. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is a medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
Users may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as web browser applications, search applications, and instant messaging tools.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like.
The server 105 may be a server that provides various services, for example a background server that provides support for the pages displayed on the terminal devices 101, 102, 103.
It should be noted that the speech processing method based on a generative adversarial network provided by the embodiments of the present application is generally executed by the server, and accordingly, the speech processing apparatus based on a generative adversarial network is generally configured in the server.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation needs.
Referring to FIG. 2, FIG. 2 shows a specific implementation of the speech processing method based on a generative adversarial network.
It should be noted that, as long as substantially the same result is obtained, the method of the present application is not limited to the flow sequence shown in FIG. 2. The method includes the following steps:
S1: Acquire a to-be-processed speech segment, cut the to-be-processed speech segment according to a preset length, and mark the cutting order to obtain cut speech segments and cutting order marks.
Specifically, when the to-be-processed speech segment needs to be processed, the server first acquires the to-be-processed speech segment and cuts it according to the preset length in order from the start of the speech to the end of the speech, marking the cutting order at the same time, thereby obtaining the cut speech segments and the cutting order marks.
Here, a cutting order mark is the mark corresponding to each cut speech segment when the to-be-processed speech segment is cut.
For example, if the duration of the to-be-processed speech segment is 500 seconds, the server cuts it into 2-second segments from the start of the speech to the end of the speech and marks the cutting order, obtaining 250 cut speech segments and the 250 corresponding cutting order marks. For instance, the cut speech segment from 0 to 2 seconds of the to-be-processed speech has a cutting order mark of 1.
It should be noted that the preset length is set according to the actual situation and is not limited here. In a specific embodiment, the preset length is 2 seconds.
S2: Input the cut speech segments into a trained generative adversarial network to obtain noise-reduced speech signals and voice endpoint information corresponding to the noise-reduced speech signals.
Specifically, the server inputs the cut speech segments into the trained generative adversarial network and performs speech enhancement on them through the generator of the trained generative adversarial network to produce enhanced speech signals, i.e., the noise-reduced speech signals; in practice, each noise-reduced speech signal consists of the sample points corresponding to each cut speech segment. The enhanced speech signals are then input into the discriminator of the trained generative adversarial network, which judges the noise-reduced speech signals and outputs the voice endpoint information corresponding to the noise-reduced speech signals, that is, the probability value of whether the noise-reduced speech signal is a real speech signal.
Here, the noise-reduced speech signal is the enhanced speech sample signal obtained after the cut speech segment undergoes speech enhancement; the voice endpoint information is the probability value of whether the corresponding noise-reduced speech signal is a real speech signal, and this probability value is used to judge whether it is real speech, that is, the judgment result is real speech or non-real speech.
A generative adversarial network (GAN) is a deep learning model. The model produces fairly good outputs through mutual game learning between (at least) two modules in its framework: a generative model and a discriminative model. The generative model here corresponds to the generator of the generative adversarial network in this application and is used to output the noise-reduced speech signal obtained after the cut speech signal is enhanced; the discriminative model here corresponds to the discriminator of the generative adversarial network in this application and is used to output the discrimination result.
S3: Combine the noise-reduced speech signals with the corresponding voice endpoint information to form speech signals to be spliced.
Specifically, the noise-reduced speech signal is the enhanced speech signal obtained after the cut speech segment undergoes speech enhancement, and the voice endpoint information is the probability value of whether the corresponding noise-reduced speech signal is a real speech signal, which is used to judge whether it is real speech or non-speech. Since in fact each noise-reduced speech signal consists of the sample points corresponding to each cut speech segment, and the voice endpoint information is the probability value corresponding to each sample point, the noise-reduced speech signal is combined with the voice endpoint information to form the speech signal to be spliced. In other words, each speech signal to be spliced has been speech-enhanced and carries a probability value of whether it is real speech; that is, each speech signal to be spliced undergoes both speech enhancement and voice detection.
S4: Splice the speech signals to be spliced according to the cutting order marks to obtain a reconstructed speech signal.
Specifically, the speech signals to be spliced are joined, from the start of each speech segment to the end of the speech, according to the cutting order marks to obtain the reconstructed speech signal. The reconstructed speech signal has undergone speech enhancement and voice endpoint detection, and by combining speech enhancement with voice endpoint detection, the purposes of removing noise and improving the accuracy of speech processing are achieved.
In this embodiment, the to-be-processed speech segment is acquired and cut according to the preset length, and the cutting order is marked to obtain the cut speech segments and cutting order marks; the cut speech segments are input into the trained generative adversarial network to obtain the noise-reduced speech signals and the corresponding voice endpoint information; the noise-reduced speech signals are combined with the corresponding voice endpoint information to form the speech signals to be spliced; and the speech signals to be spliced are spliced according to the cutting order marks to obtain the reconstructed speech signal. By combining the speech-enhanced, noise-reduced speech signal with the voice endpoint information obtained from voice detection, a reconstructed speech signal that has undergone both speech enhancement and endpoint detection is obtained, which facilitates voice judgment on the reconstructed speech signal and effectively improves the accuracy of speech processing. Referring to FIG. 3, FIG. 3 shows a specific implementation before step S1, and this embodiment includes:
S2A: Acquire a preset noise speech signal and a target speech signal, and cut the noise speech signal and the target speech signal according to a preset length to obtain noise speech segments and target speech segments.
Specifically, during training of the generative adversarial network, the noise speech signal and the target speech signal are first acquired and then input into the generative adversarial network for training.
It should be noted that the preset length used to cut the noise speech signal and the target speech signal may differ from the preset length used to cut the to-be-processed speech segment in step S1, but the best implementation effect is obtained when the two preset lengths are set to the same length. In addition, when cutting the noise speech signal and the target speech signal, overlapping parts are set between the noise speech segments and between the target speech segments, whereas the to-be-processed speech segment does not need to be cut with overlap. This is because setting overlap during model training of the generative adversarial network increases the training data and enables the model to learn better network parameters, while during processing of the to-be-processed speech each sample point only needs to be processed once to complete the speech processing task.
S2B: Extract noise speech segments and target speech segments as training data by random sampling without replacement.
Specifically, the noise speech segments and target speech segments are drawn by random sampling without replacement, which ensures that the sampled noise speech segments and target speech segments are not repeated and benefits the model training of the generative adversarial network.
S2C: Input the training data into the generative adversarial network to generate observed speech segments and discrimination results, and calculate loss function values from the observed speech segments and the discrimination results to obtain the target loss.
Specifically, since the training data includes the randomly sampled noise speech segments and target speech segments, the noise speech segments are input into the generator of the generative adversarial network to generate observed speech segments, and the noise speech segments and target speech segments are then input into the discriminator of the generative adversarial network to obtain their respective discrimination results. The loss function values are then calculated from the observed speech segments and the discrimination results to obtain the target loss.
Here, the discrimination result is obtained by inputting the training data into the discriminator of the generative adversarial network: the discriminator discriminates the training data to determine whether each piece of training data is real speech or noise; if it is real speech, the discrimination result is 1, and if it is noise, the discrimination result is 0. Since the training data is not a single item, the discrimination results contain a large number of 1s and 0s, which facilitates calculating the loss function value of the discrimination results.
S2D: Update the parameters of the generative adversarial network according to the target loss to obtain a trained generative adversarial network.
Specifically, the target loss obtained in step S2C is used to correspondingly update the generator parameters and discriminator parameters of the generative adversarial network, finally yielding the trained generative adversarial network.
In this embodiment, a preset noise speech signal and a target speech signal are acquired and cut according to a preset length to obtain noise speech segments and target speech segments; the noise speech segments and target speech segments are then extracted as training data by random sampling without replacement; the training data is input into the generative adversarial network to generate observed speech segments and discrimination results, and the loss function values are calculated from the observed speech segments and discrimination results to obtain the target loss; finally, the parameters of the generative adversarial network are updated according to the target loss to obtain the trained generative adversarial network. Training the generative adversarial network on the noise speech segments and target speech segments in this way facilitates the subsequent output of the noise-reduced speech signal and its corresponding voice endpoint information, thereby improving the accuracy of speech processing.
请参阅图4,图4示出了步骤S2C的一种具体实施方式,步骤S2C中将训练数据输入生成对抗网络中,生成观测语音段和判别结果,并根据观测语音段与判别结果,计算损失函数值,得到目标损失的具体实现过程,详叙如下:
S2C1:将训练数据中的噪音语音段输入到生成对抗网络的生成器中,生成观测语音段,并计算观测语音段和训练数据中的目标语音段的损失函数值,得到第一损失值。
具体的,将训练数据中的噪音语音段输入到生成对抗网络的生成器中,能够得到被检测和被语音增强的语音信号,即观测语音段。再通过损失函数,计算观测语音段和训练数据中的目标语音段的损失函数值,得到第一损失值,通过第一损失值能够判别观测语音段与目标语音段的偏离程度,第一损失值值越大,说明观测语音段和目标语音段越不相似,也即两者的偏离程度越大;在本申请中,对生成对抗网络进行训练,使其能够最大程度辨别噪音和真实语音,也即得到的第一损失值越大,说明对生成对抗网络的训练越接近完成。所以,第一损失值在后续步骤中用来更新生成器参数。
S2C2:将训练数据中的噪音语音段输入到生成对抗网络的判别器中,得到第一判别结果,并计算第一判别结果的损失函数值,得到第二损失值。
具体的,通过判别器对训练数据中的噪音语音段进行判别,得到每个训练数据中的噪音语音段是真实语音或是噪音,若是真实语音,则第一判别结果为1,若是噪音,则第一判别结果为0。由于训练数据的噪音语音段不是单一的,所以第一判别结果中存在大量的判别结果1和0,从而进行计算第一判别结果的损失函数值,得到第二损失值。
S2C3:将训练数据中的目标语音段输入到生成对抗网络的判别器中,得到第二判别结果,并计算第二判别结果的损失函数值,得到第三损失值。
具体的,通过判别器对训练数据中的目标语音段进行判别,得到每个训练数据中的目标语音段是真实语音或是噪音,若是真实语音,则第二判别结果为1,若是噪音,则第二判别结果为0。由于训练数据的目标语音段不是单一的,所以第二判别结果中存在大量的判别结果1和0,从而进行计算第二判别结果的损失函数值,得到第三损失值。
S2C4: Take the first loss value, the second loss value and the third loss value as the target loss.
Specifically, the first loss value, the second loss value and the third loss value are taken as the target loss and used to update the parameters of the generative adversarial network in the subsequent steps.
In this embodiment, the noise speech segments in the training data are input into the generator of the generative adversarial network to generate observed speech segments, and the loss function value between the observed speech segments and the target speech segments in the training data is computed to obtain the first loss value; the noise speech segments in the training data are input into the discriminator to obtain first discrimination results, whose loss function value gives the second loss value; the target speech segments in the training data are input into the discriminator to obtain second discrimination results, whose loss function value gives the third loss value; and the first, second and third loss values are taken as the target loss. Computing loss function values from different data in this way facilitates the subsequent update of the parameters of the generative adversarial network, thereby improving the accuracy of speech processing.
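As a hedged illustration of steps S2C1 to S2C4, the following PyTorch sketch computes the three loss values. The choice of L1 loss for the generator output and binary cross-entropy for the discriminator outputs, as well as all names, are assumptions for illustration; the disclosure only specifies which signals each loss value compares.

```python
import torch
import torch.nn.functional as F

def target_loss(generator, discriminator, noise_seg, target_seg):
    """Compute the first, second and third loss values of steps S2C1-S2C3."""
    observed = generator(noise_seg)                     # observed speech segment
    first_loss = F.l1_loss(observed, target_seg)        # generator output vs. target

    prob_noise = discriminator(noise_seg)               # should be judged 0 (noise)
    second_loss = F.binary_cross_entropy(prob_noise, torch.zeros_like(prob_noise))

    prob_target = discriminator(target_seg)             # should be judged 1 (real speech)
    third_loss = F.binary_cross_entropy(prob_target, torch.ones_like(prob_target))

    return first_loss, second_loss, third_loss
```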
Referring to FIG. 5, FIG. 5 shows a specific implementation of step S2D, namely the specific process in step S2D of updating the parameters of the generative adversarial network according to the target loss to obtain the trained generative adversarial network, described in detail as follows:
S2D1: Update the generator parameters of the generative adversarial network according to the first loss value.
Specifically, since the first loss value is obtained by having the generator generate observed speech segments and computing the loss function value between the observed speech segments and the target speech segments in the training data, the generator parameters of the generative adversarial network are updated according to the first loss value, which benefits the update of the parameters of the generative adversarial network.
S2D2: Update the discriminator parameters of the generative adversarial network according to the second loss value and the third loss value.
Specifically, since the second loss value and the third loss value are both computed from judgment results produced by the discriminator, using them to update the discriminator parameters benefits the update of the parameters of the generative adversarial network.
S2D3: When the first loss value reaches the preset threshold, stop updating the parameters of the generative adversarial network to obtain the trained generative adversarial network.
Specifically, the network parameters of the generative adversarial network are updated with the first loss value, the second loss value and the third loss value. If the first loss value has not reached the preset threshold, steps S2C1 to S2C3 above are repeated to produce new first, second and third loss values and the network parameters are updated again, until the first loss value reaches the preset threshold, which indicates that the trained generative adversarial network is fully capable of recognizing the noise speech signal and the target speech signal. Therefore, when the first loss value reaches the preset threshold, updating of the parameters of the generative adversarial network is stopped and the trained generative adversarial network is obtained.
It should be noted that the preset threshold is set according to the actual situation and is not limited here. In one specific embodiment, the preset threshold is 0.95.
In this embodiment, the generator parameters of the generative adversarial network are updated according to the first loss value, the discriminator parameters are updated according to the second loss value and the third loss value, and when the first loss value reaches the preset threshold, updating of the parameters is stopped and the trained generative adversarial network is obtained. This update of the generative adversarial network helps improve the accuracy of speech processing.
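A minimal training loop following steps S2D1 to S2D3 might look like the sketch below, reusing the target_loss function sketched above. The optimizer, learning rate and epoch limit are assumptions; the stopping rule follows the disclosure in that updating stops once the first loss value reaches the preset threshold.

```python
import torch

def train_gan(generator, discriminator, data_loader, threshold=0.95,
              lr=1e-4, max_epochs=100):
    """Update the GAN parameters from the three loss values until the first
    loss value reaches the preset threshold (step S2D3)."""
    g_opt = torch.optim.Adam(generator.parameters(), lr=lr)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=lr)

    for _ in range(max_epochs):
        for noise_seg, target_seg in data_loader:
            first, second, third = target_loss(generator, discriminator,
                                               noise_seg, target_seg)

            # S2D2: update discriminator parameters from the second and third loss values
            d_opt.zero_grad()
            (second + third).backward()
            d_opt.step()

            # S2D1: update generator parameters from the first loss value
            g_opt.zero_grad()
            first.backward()
            g_opt.step()

            # S2D3: stop updating once the first loss value reaches the preset threshold
            if first.item() >= threshold:
                return generator, discriminator
    return generator, discriminator
```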
Referring to FIG. 6, FIG. 6 shows a specific implementation of step S2, namely the specific process in step S2 of inputting the cut speech segments into the trained generative adversarial network to obtain denoised speech signals and the speech endpoint information corresponding to the denoised speech signals, described in detail as follows:
S21: Input the cut speech segments into the trained generative adversarial network, and generate sequence matrix features from the cut speech segments through the encoder-decoder model of the generator.
Specifically, the encoder-decoder model (Encoder-decoder) provides encoding and decoding functions: in encoding, an encoder converts the input sequence into a fixed-dimensional dense vector, and in the decoding stage the target output is generated from this state. In this embodiment, the encoder-decoder model of the generator first generates a sequence of dense vectors from the cut speech segments and then converts this sequence of dense vectors into sequence matrix features in matrix form.
Here, the sequence matrix features are generated by encoding and decoding the cut speech segments with the encoder-decoder model of the generator and represent the feature information of the cut speech segments. For example, a sequence feature Y containing the feature information y1, y2, y3 and y4 is written as Y = {y1, y2, y3, y4}.
S22: Combine sequence matrix features of equal size by means of skip connections to obtain the target features.
Specifically, as the network depth of the generative adversarial network model increases during training, gradient explosion and gradient vanishing easily occur, which is harmful to training the model. Skip connections are therefore introduced to establish transfer channels between shallow-layer network information and deep-layer network information, combining sequence matrix features of equal size to obtain the target features and resolving gradient explosion and gradient vanishing.
Here, sequence matrix features of equal size are sequence matrix features with identical width and height.
Further, the entire network of the generator is built from convolutional neural networks. A convolutional neural network is a class of feed-forward neural network with a deep structure that includes convolution computations and is one of the representative algorithms of deep learning. Convolutional neural networks have representation-learning ability and can classify input information in a translation-invariant manner according to their hierarchical structure.
Here, skip connections (Skip Connection) establish transfer channels between shallow-layer network information and deep-layer network information and combine sequence matrix features of equal size, solving the gradient explosion and gradient vanishing problems that arise when training the generative adversarial network model.
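The following PyTorch sketch shows a minimal 1-D encoder-decoder generator in which feature maps of equal size are combined through a skip connection. The number of layers, channel counts, kernel sizes and the concatenation-based combination are assumptions for illustration, not the network disclosed in this application.

```python
import torch
import torch.nn as nn

class SkipConnectionGenerator(nn.Module):
    """Minimal encoder-decoder generator: a shallow encoder feature map is
    combined with the equal-size decoder feature map via a skip connection."""

    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv1d(1, 16, kernel_size=32, stride=2, padding=15)
        self.enc2 = nn.Conv1d(16, 32, kernel_size=32, stride=2, padding=15)
        self.dec2 = nn.ConvTranspose1d(32, 16, kernel_size=32, stride=2, padding=15)
        self.dec1 = nn.ConvTranspose1d(32, 1, kernel_size=32, stride=2, padding=15)
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, segment_len)
        e1 = self.act(self.enc1(x))       # shallow feature map
        e2 = self.act(self.enc2(e1))      # deep feature map
        d2 = self.act(self.dec2(e2))      # same size as e1
        d2 = torch.cat([d2, e1], dim=1)   # skip connection: combine equal-size features
        return torch.tanh(self.dec1(d2))  # denoised segment, same length as the input

g = SkipConnectionGenerator()
denoised = g(torch.randn(4, 1, 32000))
```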
S23: Input the target features into the fully connected layer network of the generator to obtain the denoised speech signals.
Specifically, since the entire network of the generator is built from convolutional neural networks, this network also contains a two-layer fully connected network whose inputs are the hidden vector and the last layer of the encoder-decoder model, i.e. its input is the target features. This fully connected layer is used to produce speech endpoint results; in this embodiment, inputting the target features into the fully connected layer network of the generator yields the denoised speech signals, thereby obtaining the enhanced speech signals.
S24: Input the denoised speech signals into the discriminator of the trained generative adversarial network to obtain the speech endpoint information corresponding to the denoised speech signals.
Specifically, the discriminator of the trained generative adversarial network evaluates the denoised speech signals and outputs the speech endpoint information corresponding to each denoised speech signal, from which the probability that each denoised speech signal is real speech is determined.
In this embodiment, the cut speech segments are input into the trained generative adversarial network; the encoder-decoder model of the generator generates sequence matrix features from the cut speech segments; sequence matrix features of equal size are combined by skip connections to obtain the target features; the target features are input into the fully connected layer network of the generator to obtain the denoised speech signals; and the denoised speech signals are input into the discriminator of the trained generative adversarial network to obtain the corresponding speech endpoint information. This accomplishes the acquisition of the denoised speech signals and their corresponding speech endpoint information, facilitating the subsequent reshaping of the speech signal and improving the accuracy of speech processing.
Referring to FIG. 7, FIG. 7 shows a specific implementation of step S22, namely the specific process in step S22 of combining sequence matrix features of equal size by skip connections to obtain the target features, described in detail as follows:
S221: Traverse the sequence matrix features and obtain the sequence matrix features of equal size as target matrices, where the width and height of the target matrices are identical.
Specifically, to solve the gradient explosion and gradient vanishing problems that arise during training of the generative adversarial network model, the sequence matrix features of the shallow and deep network layers are traversed and those of equal size are obtained as target matrices. The width and height of these target matrices are identical.
S222: Combine the target matrices by means of skip connections to obtain the target features.
Specifically, skip connections establish transfer channels between the shallow and deep layers of the network, and the target matrices are combined to obtain the target features.
In this embodiment, the sequence matrix features are traversed to obtain sequence matrix features of equal size as target matrices, and the target matrices are combined by skip connections to obtain the target features. This solves the gradient explosion and gradient vanishing problems that arise during training of the generative adversarial network model, benefits the training of the generative adversarial network and thus helps improve the accuracy of speech processing.
Referring to FIG. 8, FIG. 8 shows a specific implementation of step S4, namely the specific process in step S4 of splicing the speech signals to be spliced according to the cutting-order markers to obtain the reshaped speech signal, described in detail as follows:
S41: Arrange the speech signals to be spliced in ascending order of the cutting-order markers to obtain a speech sequence.
Specifically, since the cutting-order markers are assigned from the beginning of the speech to be processed to its end, the speech signals to be spliced are arranged in ascending order of the cutting-order markers, yielding the speech sequence.
S42: Splice the speech signals to be spliced end to end according to the speech sequence to obtain the reshaped speech signal.
Specifically, the speech signals to be spliced are concatenated end to end to form the complete reshaped speech signal. This reshaped speech signal has undergone both speech enhancement and speech endpoint detection, and by combining the two, noise is removed and the accuracy of speech processing is improved.
In this embodiment, the speech signals to be spliced are arranged in ascending order of the cutting-order markers to obtain the speech sequence, and are then concatenated end to end according to the speech sequence to obtain the reshaped speech signal. This accomplishes the purpose of speech processing, and the reshaped speech signal possesses the characteristics of both speech enhancement and speech endpoint detection, which benefits the accuracy of speech processing.
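A minimal sketch of this splicing step, assuming each speech signal to be spliced is carried as an (order marker, samples) pair as in the earlier sketches (the data layout is an assumption, not part of the disclosure):

```python
import numpy as np

def reshape_signal(signals_to_splice):
    """Sort the segments by their cutting-order marker and concatenate them
    end to end into the reshaped speech signal."""
    ordered = sorted(signals_to_splice, key=lambda item: item[0])
    return np.concatenate([samples for _, samples in ordered])
```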
It should be emphasized that, to further ensure the privacy and security of the above speech segment to be processed, the speech segment to be processed may also be stored in a node of a blockchain.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing relevant hardware; the computer-readable instructions may be stored in a computer-readable storage medium, and when executed may include the processes of the embodiments of the above methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM).
Referring to FIG. 9, as an implementation of the method shown in FIG. 2 above, this application provides an embodiment of a speech processing apparatus based on a generative adversarial network. This apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied to various electronic devices.
As shown in FIG. 9, the speech processing apparatus based on a generative adversarial network of this embodiment includes a to-be-processed speech segment acquisition module 51, a cut speech segment input module 52, a to-be-spliced speech signal module 53 and a reshaped speech signal acquisition module 54, where:
the to-be-processed speech segment acquisition module 51 is configured to obtain the speech segment to be processed, cut it according to the preset length and mark the cutting order, obtaining cut speech segments and cutting-order markers;
the cut speech segment input module 52 is configured to input the cut speech segments into the trained generative adversarial network to obtain denoised speech signals and the speech endpoint information corresponding to the denoised speech signals;
the to-be-spliced speech signal module 53 is configured to combine the denoised speech signals with the corresponding speech endpoint information to form speech signals to be spliced;
the reshaped speech signal acquisition module 54 is configured to splice the speech signals to be spliced according to the cutting-order markers to obtain the reshaped speech signal.
Further, before the cut speech segment input module 52, the speech processing apparatus based on a generative adversarial network further includes:
a speech cutting module, configured to obtain the preset noise speech signal and target speech signal and cut the noise speech signal and the target speech signal according to the preset length to obtain noise speech segments and target speech segments;
a training data acquisition module, configured to draw noise speech segments and target speech segments by random sampling without replacement as training data;
a target loss acquisition module, configured to input the training data into the generative adversarial network to generate observed speech segments and discrimination results, and to compute loss function values from the observed speech segments and the discrimination results to obtain the target loss;
a parameter update module, configured to update the parameters of the generative adversarial network according to the target loss to obtain the trained generative adversarial network.
Further, the target loss acquisition module includes:
a first loss value calculation unit, configured to input the noise speech segments in the training data into the generator of the generative adversarial network, generate observed speech segments and compute the loss function value between the observed speech segments and the target speech segments in the training data to obtain the first loss value;
a second loss value calculation unit, configured to input the noise speech segments in the training data into the discriminator of the generative adversarial network, obtain first discrimination results and compute the loss function value of the first discrimination results to obtain the second loss value;
a third loss value calculation unit, configured to input the target speech segments in the training data into the discriminator of the generative adversarial network, obtain second discrimination results and compute the loss function value of the second discrimination results to obtain the third loss value;
a target loss definition unit, configured to take the first loss value, the second loss value and the third loss value as the target loss.
Further, the parameter update module includes:
a generator parameter update unit, configured to update the generator parameters of the generative adversarial network according to the first loss value;
a discriminator parameter update unit, configured to update the discriminator parameters of the generative adversarial network according to the second loss value and the third loss value;
an update stopping unit, configured to stop updating the parameters of the generative adversarial network when the first loss value reaches the preset threshold, obtaining the trained generative adversarial network.
Further, the cut speech segment input module 52 includes:
a sequence matrix feature unit, configured to input the cut speech segments into the trained generative adversarial network and generate sequence matrix features from the cut speech segments through the encoder-decoder model of the generator;
a target feature acquisition unit, configured to combine sequence matrix features of equal size by skip connections to obtain the target features;
a denoised speech signal unit, configured to input the target features into the fully connected layer network of the generator to obtain the denoised speech signals;
a speech endpoint information unit, configured to input the denoised speech signals into the discriminator of the trained generative adversarial network to obtain the speech endpoint information corresponding to the denoised speech signals.
Further, the target feature acquisition unit includes:
a target matrix acquisition subunit, configured to traverse the sequence matrix features and obtain the sequence matrix features of equal size as target matrices, where the width and height of the target matrices are identical;
a target feature acquisition subunit, configured to combine the target matrices by skip connections to obtain the target features.
Further, the reshaped speech signal acquisition module 54 includes:
a speech sequence acquisition unit, configured to arrange the speech signals to be spliced in ascending order of the cutting-order markers to obtain a speech sequence;
a speech signal reshaping unit, configured to splice the speech signals to be spliced end to end according to the speech sequence to obtain the reshaped speech signal.
It should be emphasized that, to further ensure the privacy and security of the above speech segment to be processed, the speech segment to be processed may also be stored in a node of a blockchain.
To solve the above technical problem, an embodiment of this application further provides a computer device. Referring to FIG. 10, FIG. 10 is a basic structural block diagram of the computer device of this embodiment.
The computer device 6 includes a memory 61, a processor 62 and a network interface 63 that are communicatively connected to one another through a system bus. It should be pointed out that the figure only shows the computer device 6 with the three components memory 61, processor 62 and network interface 63, but it should be understood that not all of the shown components are required; more or fewer components may be implemented instead. A person skilled in the art can understand that the computer device here is a device that can automatically perform numerical computation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to microprocessors, application specific integrated circuits (Application Specific Integrated Circuit, ASIC), programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices and the like.
The memory 61 stores computer-readable instructions, and when the processor 62 executes the computer-readable instructions, all steps of any embodiment of the above speech processing method based on a generative adversarial network can be implemented.
The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer or a cloud server. The computer device may interact with the user through a keyboard, a mouse, a remote control, a touch pad, a voice control device or the like.
The memory 61 includes at least one type of readable storage medium; the computer-readable storage medium may be non-volatile or volatile, and the readable storage medium includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disc and the like. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, for example the hard disk or internal memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, for example a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash card (Flash Card) provided on the computer device 6. Of course, the memory 61 may also include both the internal storage unit of the computer device 6 and its external storage device. In this embodiment, the memory 61 is generally used to store the operating system and various application software installed on the computer device 6, for example the computer-readable instructions of the speech processing method based on a generative adversarial network. In addition, the memory 61 may also be used to temporarily store various data that has been output or is to be output.
In some embodiments, the processor 62 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 62 is generally used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is used to run the computer-readable instructions or process the data stored in the memory 61, for example to run the computer-readable instructions of a speech processing method based on a generative adversarial network.
The network interface 63 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 6 and other electronic devices.
This application further provides another implementation, namely a computer-readable storage medium storing computer-readable instructions; the computer-readable instructions can be executed by at least one processor, so that the at least one processor executes all steps of any embodiment of the above speech processing method based on a generative adversarial network.
From the description of the above implementations, a person skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to execute the methods of the embodiments of this application.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain (Blockchain) is essentially a decentralized database and is a chain of data blocks generated in association using cryptographic methods; each data block contains information about a batch of network transactions and is used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer and the like.
Obviously, the embodiments described above are only some rather than all of the embodiments of this application; the accompanying drawings show preferred embodiments of this application but do not limit its patent scope. This application may be implemented in many different forms; on the contrary, these embodiments are provided so that the disclosure of this application will be understood more thoroughly and comprehensively. Although this application has been described in detail with reference to the foregoing embodiments, a person skilled in the art can still modify the technical solutions described in the foregoing specific embodiments or make equivalent replacements of some of their technical features. Any equivalent structure made using the contents of the specification and drawings of this application and applied directly or indirectly in other related technical fields falls likewise within the scope of patent protection of this application.
Claims (20)
- A speech processing method based on a generative adversarial network, comprising: obtaining a speech segment to be processed, cutting the speech segment to be processed according to a preset length, and marking the cutting order, to obtain cut speech segments and cutting-order markers; inputting the cut speech segments into a trained generative adversarial network to obtain denoised speech signals and speech endpoint information corresponding to the denoised speech signals; combining the denoised speech signals with the corresponding speech endpoint information to form speech signals to be spliced; and splicing the speech signals to be spliced according to the cutting-order markers to obtain a reshaped speech signal.
- The speech processing method based on a generative adversarial network according to claim 1, wherein, before the inputting the cut speech segments into the trained generative adversarial network to obtain denoised speech signals and speech endpoint information corresponding to the denoised speech signals, the method further comprises: obtaining a preset noise speech signal and a preset target speech signal, and cutting the noise speech signal and the target speech signal according to the preset length to obtain noise speech segments and target speech segments; drawing noise speech segments and target speech segments by random sampling without replacement as training data; inputting the training data into a generative adversarial network to generate observed speech segments and discrimination results, and computing loss function values according to the observed speech segments and the discrimination results to obtain a target loss; and updating parameters of the generative adversarial network according to the target loss to obtain the trained generative adversarial network.
- The speech processing method based on a generative adversarial network according to claim 2, wherein the inputting the training data into the generative adversarial network to generate observed speech segments and discrimination results, and computing loss function values according to the observed speech segments and the discrimination results to obtain the target loss comprises: inputting the noise speech segments in the training data into a generator of the generative adversarial network to generate the observed speech segments, and computing a loss function value between the observed speech segments and the target speech segments in the training data to obtain a first loss value; inputting the noise speech segments in the training data into a discriminator of the generative adversarial network to obtain first discrimination results, and computing a loss function value of the first discrimination results to obtain a second loss value; inputting the target speech segments in the training data into the discriminator of the generative adversarial network to obtain second discrimination results, and computing a loss function value of the second discrimination results to obtain a third loss value; and taking the first loss value, the second loss value and the third loss value as the target loss.
- The speech processing method based on a generative adversarial network according to claim 3, wherein the updating the parameters of the generative adversarial network according to the target loss to obtain the trained generative adversarial network comprises: updating generator parameters of the generative adversarial network according to the first loss value; updating discriminator parameters of the generative adversarial network according to the second loss value and the third loss value; and when the first loss value reaches a preset threshold, stopping updating the parameters of the generative adversarial network to obtain the trained generative adversarial network.
- The speech processing method based on a generative adversarial network according to claim 1, wherein the inputting the cut speech segments into the trained generative adversarial network to obtain denoised speech signals and speech endpoint information corresponding to the denoised speech signals comprises: inputting the cut speech segments into the trained generative adversarial network, and generating sequence matrix features from the cut speech segments through an encoder-decoder model of the generator; combining sequence matrix features of equal size by means of skip connections to obtain target features; inputting the target features into a fully connected layer network of the generator to obtain the denoised speech signals; and inputting the denoised speech signals into the discriminator of the trained generative adversarial network to obtain the speech endpoint information corresponding to the denoised speech signals.
- The speech processing method based on a generative adversarial network according to claim 5, wherein the combining sequence matrix features of equal size by means of skip connections to obtain target features comprises: traversing the sequence matrix features and obtaining the sequence matrix features of equal size as target matrices, wherein the width and height of the target matrices are identical; and combining the target matrices by means of skip connections to obtain the target features.
- The speech processing method based on a generative adversarial network according to claim 1, wherein the splicing the speech signals to be spliced according to the cutting-order markers to obtain the reshaped speech signal comprises: arranging the speech signals to be spliced in ascending order of the cutting-order markers to obtain a speech sequence; and splicing the speech signals to be spliced end to end according to the speech sequence to obtain the reshaped speech signal.
- A speech processing apparatus based on a generative adversarial network, comprising: a to-be-processed speech segment acquisition module, configured to obtain a speech segment to be processed, cut the speech segment to be processed according to a preset length and mark the cutting order to obtain cut speech segments and cutting-order markers; a cut speech segment input module, configured to input the cut speech segments into a trained generative adversarial network to obtain denoised speech signals and speech endpoint information corresponding to the denoised speech signals; a to-be-spliced speech signal module, configured to combine the denoised speech signals with the corresponding speech endpoint information to form speech signals to be spliced; and a reshaped speech signal acquisition module, configured to splice the speech signals to be spliced according to the cutting-order markers to obtain a reshaped speech signal.
- A computer device, comprising a memory and a processor, the memory storing computer-readable instructions, wherein the processor, when executing the computer-readable instructions, implements the following steps: obtaining a speech segment to be processed, cutting the speech segment to be processed according to a preset length, and marking the cutting order, to obtain cut speech segments and cutting-order markers; inputting the cut speech segments into a trained generative adversarial network to obtain denoised speech signals and speech endpoint information corresponding to the denoised speech signals; combining the denoised speech signals with the corresponding speech endpoint information to form speech signals to be spliced; and splicing the speech signals to be spliced according to the cutting-order markers to obtain a reshaped speech signal.
- The computer device according to claim 9, wherein, before the inputting the cut speech segments into the trained generative adversarial network to obtain denoised speech signals and speech endpoint information corresponding to the denoised speech signals, the computer device further performs: obtaining a preset noise speech signal and a preset target speech signal, and cutting the noise speech signal and the target speech signal according to the preset length to obtain noise speech segments and target speech segments; drawing noise speech segments and target speech segments by random sampling without replacement as training data; inputting the training data into a generative adversarial network to generate observed speech segments and discrimination results, and computing loss function values according to the observed speech segments and the discrimination results to obtain a target loss; and updating parameters of the generative adversarial network according to the target loss to obtain the trained generative adversarial network.
- The computer device according to claim 10, wherein the inputting the training data into the generative adversarial network to generate observed speech segments and discrimination results, and computing loss function values according to the observed speech segments and the discrimination results to obtain the target loss comprises: inputting the noise speech segments in the training data into a generator of the generative adversarial network to generate the observed speech segments, and computing a loss function value between the observed speech segments and the target speech segments in the training data to obtain a first loss value; inputting the noise speech segments in the training data into a discriminator of the generative adversarial network to obtain first discrimination results, and computing a loss function value of the first discrimination results to obtain a second loss value; inputting the target speech segments in the training data into the discriminator of the generative adversarial network to obtain second discrimination results, and computing a loss function value of the second discrimination results to obtain a third loss value; and taking the first loss value, the second loss value and the third loss value as the target loss.
- The computer device according to claim 11, wherein the updating the parameters of the generative adversarial network according to the target loss to obtain the trained generative adversarial network comprises: updating generator parameters of the generative adversarial network according to the first loss value; updating discriminator parameters of the generative adversarial network according to the second loss value and the third loss value; and when the first loss value reaches a preset threshold, stopping updating the parameters of the generative adversarial network to obtain the trained generative adversarial network.
- The computer device according to claim 9, wherein the inputting the cut speech segments into the trained generative adversarial network to obtain denoised speech signals and speech endpoint information corresponding to the denoised speech signals comprises: inputting the cut speech segments into the trained generative adversarial network, and generating sequence matrix features from the cut speech segments through an encoder-decoder model of the generator; combining sequence matrix features of equal size by means of skip connections to obtain target features; inputting the target features into a fully connected layer network of the generator to obtain the denoised speech signals; and inputting the denoised speech signals into the discriminator of the trained generative adversarial network to obtain the speech endpoint information corresponding to the denoised speech signals.
- The computer device according to claim 13, wherein the combining sequence matrix features of equal size by means of skip connections to obtain target features comprises: traversing the sequence matrix features and obtaining the sequence matrix features of equal size as target matrices, wherein the width and height of the target matrices are identical; and combining the target matrices by means of skip connections to obtain the target features.
- The computer device according to claim 9, wherein the splicing the speech signals to be spliced according to the cutting-order markers to obtain the reshaped speech signal comprises: arranging the speech signals to be spliced in ascending order of the cutting-order markers to obtain a speech sequence; and splicing the speech signals to be spliced end to end according to the speech sequence to obtain the reshaped speech signal.
- A computer-readable storage medium, wherein the computer-readable storage medium stores computer-readable instructions which, when executed by a processor, cause the processor to perform the following steps: obtaining a speech segment to be processed, cutting the speech segment to be processed according to a preset length, and marking the cutting order, to obtain cut speech segments and cutting-order markers; inputting the cut speech segments into a trained generative adversarial network to obtain denoised speech signals and speech endpoint information corresponding to the denoised speech signals; combining the denoised speech signals with the corresponding speech endpoint information to form speech signals to be spliced; and splicing the speech signals to be spliced according to the cutting-order markers to obtain a reshaped speech signal.
- The computer-readable storage medium according to claim 16, wherein, before the inputting the cut speech segments into the trained generative adversarial network to obtain denoised speech signals and speech endpoint information corresponding to the denoised speech signals, the computer-readable instructions further cause the processor to perform: obtaining a preset noise speech signal and a preset target speech signal, and cutting the noise speech signal and the target speech signal according to the preset length to obtain noise speech segments and target speech segments; drawing noise speech segments and target speech segments by random sampling without replacement as training data; inputting the training data into a generative adversarial network to generate observed speech segments and discrimination results, and computing loss function values according to the observed speech segments and the discrimination results to obtain a target loss; and updating parameters of the generative adversarial network according to the target loss to obtain the trained generative adversarial network.
- The computer-readable storage medium according to claim 17, wherein the inputting the training data into the generative adversarial network to generate observed speech segments and discrimination results, and computing loss function values according to the observed speech segments and the discrimination results to obtain the target loss comprises: inputting the noise speech segments in the training data into a generator of the generative adversarial network to generate the observed speech segments, and computing a loss function value between the observed speech segments and the target speech segments in the training data to obtain a first loss value; inputting the noise speech segments in the training data into a discriminator of the generative adversarial network to obtain first discrimination results, and computing a loss function value of the first discrimination results to obtain a second loss value; inputting the target speech segments in the training data into the discriminator of the generative adversarial network to obtain second discrimination results, and computing a loss function value of the second discrimination results to obtain a third loss value; and taking the first loss value, the second loss value and the third loss value as the target loss.
- The computer-readable storage medium according to claim 18, wherein the updating the parameters of the generative adversarial network according to the target loss to obtain the trained generative adversarial network comprises: updating generator parameters of the generative adversarial network according to the first loss value; updating discriminator parameters of the generative adversarial network according to the second loss value and the third loss value; and when the first loss value reaches a preset threshold, stopping updating the parameters of the generative adversarial network to obtain the trained generative adversarial network.
- The computer-readable storage medium according to claim 16, wherein the inputting the cut speech segments into the trained generative adversarial network to obtain denoised speech signals and speech endpoint information corresponding to the denoised speech signals comprises: inputting the cut speech segments into the trained generative adversarial network, and generating sequence matrix features from the cut speech segments through an encoder-decoder model of the generator; combining sequence matrix features of equal size by means of skip connections to obtain target features; inputting the target features into a fully connected layer network of the generator to obtain the denoised speech signals; and inputting the denoised speech signals into the discriminator of the trained generative adversarial network to obtain the speech endpoint information corresponding to the denoised speech signals.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011387380.0 | 2020-12-01 | ||
CN202011387380.0A CN112397057B (zh) | 2020-12-01 | 2020-12-01 | 基于生成对抗网络的语音处理方法、装置、设备及介质 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022116487A1 true WO2022116487A1 (zh) | 2022-06-09 |
Family
ID=74604174
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/096660 WO2022116487A1 (zh) | 2020-12-01 | 2021-05-28 | 基于生成对抗网络的语音处理方法、装置、设备及介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112397057B (zh) |
WO (1) | WO2022116487A1 (zh) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112397057B (zh) * | 2020-12-01 | 2024-07-02 | 平安科技(深圳)有限公司 | 基于生成对抗网络的语音处理方法、装置、设备及介质 |
CN112992168B (zh) * | 2021-02-26 | 2024-04-19 | 平安科技(深圳)有限公司 | 语音降噪器训练方法、装置、计算机设备和存储介质 |
CN113096673B (zh) * | 2021-03-30 | 2022-09-30 | 山东省计算中心(国家超级计算济南中心) | 基于生成对抗网络的语音处理方法及系统 |
CN114842824A (zh) * | 2022-05-26 | 2022-08-02 | 深圳市华冠智联科技有限公司 | 对室内环境噪音消音的方法、装置、设备及介质 |
CN116631427B (zh) * | 2023-07-24 | 2023-09-29 | 美智纵横科技有限责任公司 | 降噪模型的训练方法、降噪处理方法、装置及芯片 |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106782507B (zh) * | 2016-12-19 | 2018-03-06 | 平安科技(深圳)有限公司 | 语音分割的方法及装置 |
CN106611598B (zh) * | 2016-12-28 | 2019-08-02 | 上海智臻智能网络科技股份有限公司 | 一种vad动态参数调整方法和装置 |
CN107293289B (zh) * | 2017-06-13 | 2020-05-29 | 南京医科大学 | 一种基于深度卷积生成对抗网络的语音生成方法 |
CN107799126B (zh) * | 2017-10-16 | 2020-10-16 | 苏州狗尾草智能科技有限公司 | 基于有监督机器学习的语音端点检测方法及装置 |
CN109887494B (zh) * | 2017-12-01 | 2022-08-16 | 腾讯科技(深圳)有限公司 | 重构语音信号的方法和装置 |
CN110875060A (zh) * | 2018-08-31 | 2020-03-10 | 阿里巴巴集团控股有限公司 | 语音信号处理方法、装置、系统、设备和存储介质 |
CN111383651A (zh) * | 2018-12-29 | 2020-07-07 | Tcl集团股份有限公司 | 一种语音降噪方法、装置及终端设备 |
CN110136727B (zh) * | 2019-04-16 | 2024-04-16 | 平安科技(深圳)有限公司 | 基于说话内容的说话者身份识别方法、装置及存储介质 |
CN110619885B (zh) * | 2019-08-15 | 2022-02-11 | 西北工业大学 | 基于深度完全卷积神经网络的生成对抗网络语音增强方法 |
CN110689879B (zh) * | 2019-10-10 | 2022-02-25 | 中国科学院自动化研究所 | 端到端语音转写模型的训练方法、系统、装置 |
CN111261147B (zh) * | 2020-01-20 | 2022-10-11 | 浙江工业大学 | 一种面向语音识别系统的音乐嵌入攻击防御方法 |
CN111341294B (zh) * | 2020-02-28 | 2023-04-18 | 电子科技大学 | 将文本转换为指定风格语音的方法 |
CN111445900A (zh) * | 2020-03-11 | 2020-07-24 | 平安科技(深圳)有限公司 | 一种语音识别的前端处理方法、装置及终端设备 |
CN111986659B (zh) * | 2020-07-16 | 2024-08-06 | 百度在线网络技术(北京)有限公司 | 建立音频生成模型的方法以及装置 |
- 2020-12-01: CN application CN202011387380.0A, published as CN112397057B (zh), status: Active
- 2021-05-28: PCT application PCT/CN2021/096660, published as WO2022116487A1 (zh), status: Application Filing
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190130221A1 (en) * | 2017-11-02 | 2019-05-02 | Royal Bank Of Canada | Method and device for generative adversarial network training |
CN108257592A (zh) * | 2018-01-11 | 2018-07-06 | 广州势必可赢网络科技有限公司 | 一种基于长短期记忆模型的人声分割方法及系统 |
CN109218629A (zh) * | 2018-09-14 | 2019-01-15 | 三星电子(中国)研发中心 | 视频生成方法、存储介质和装置 |
CN109147810A (zh) * | 2018-09-30 | 2019-01-04 | 百度在线网络技术(北京)有限公司 | 建立语音增强网络的方法、装置、设备和计算机存储介质 |
CN110600017A (zh) * | 2019-09-12 | 2019-12-20 | 腾讯科技(深圳)有限公司 | 语音处理模型的训练方法、语音识别方法、系统及装置 |
CN112397057A (zh) * | 2020-12-01 | 2021-02-23 | 平安科技(深圳)有限公司 | 基于生成对抗网络的语音处理方法、装置、设备及介质 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115171710A (zh) * | 2022-07-08 | 2022-10-11 | 山东省计算中心(国家超级计算济南中心) | 基于多角度判别的生成对抗网络的语音增强方法及系统 |
Also Published As
Publication number | Publication date |
---|---|
CN112397057A (zh) | 2021-02-23 |
CN112397057B (zh) | 2024-07-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022116487A1 (zh) | 基于生成对抗网络的语音处理方法、装置、设备及介质 | |
CN111814466A (zh) | 基于机器阅读理解的信息抽取方法、及其相关设备 | |
CN112528637B (zh) | 文本处理模型训练方法、装置、计算机设备和存储介质 | |
CN112685565A (zh) | 基于多模态信息融合的文本分类方法、及其相关设备 | |
WO2022142011A1 (zh) | 一种地址识别方法、装置、计算机设备及存储介质 | |
CN110288980A (zh) | 语音识别方法、模型的训练方法、装置、设备及存储介质 | |
CN112466314A (zh) | 情感语音数据转换方法、装置、计算机设备及存储介质 | |
CN111883140A (zh) | 基于知识图谱和声纹识别的认证方法、装置、设备及介质 | |
WO2023159746A1 (zh) | 基于图像分割的图像抠图方法、装置、计算机设备及介质 | |
CN111027291A (zh) | 文本中标点符号添加、模型训练方法、装置及电子设备 | |
WO2021159669A1 (zh) | 系统安全登录方法、装置、计算机设备和存储介质 | |
US10468031B2 (en) | Diarization driven by meta-information identified in discussion content | |
CN112468658A (zh) | 语音质量检测方法、装置、计算机设备及存储介质 | |
CN112632244A (zh) | 一种人机通话的优化方法、装置、计算机设备及存储介质 | |
CN115438149A (zh) | 一种端到端模型训练方法、装置、计算机设备及存储介质 | |
CN111382403A (zh) | 用户行为识别模型的训练方法、装置、设备及存储介质 | |
CN112084779A (zh) | 用于语义识别的实体获取方法、装置、设备及存储介质 | |
TWI818427B (zh) | 使用基於文本的說話者變更檢測的說話者劃分糾正方法及系統 | |
CN113948090B (zh) | 语音检测方法、会话记录产品及计算机存储介质 | |
CN111899747B (zh) | 用于合成音频的方法和装置 | |
CN113436633A (zh) | 说话人识别方法、装置、计算机设备及存储介质 | |
CN112908339B (zh) | 一种会议环节定位方法、装置、定位设备及可读存储介质 | |
EP3989219B1 (en) | Method for detecting an audio adversarial attack with respect to a voice command processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium | |
CN112071331B (zh) | 语音文件修复方法、装置、计算机设备及存储介质 | |
CN114783423A (zh) | 基于语速调整的语音切分方法、装置、计算机设备及介质 |
Legal Events

Code | Title | Description
---|---|---
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21899530; Country of ref document: EP; Kind code of ref document: A1
NENP | Non-entry into the national phase | Ref country code: DE
122 | Ep: pct application non-entry in european phase | Ref document number: 21899530; Country of ref document: EP; Kind code of ref document: A1