Detailed Description
In order to make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a training method of a speech recognition model for children according to an embodiment of the present invention, which includes the following steps:
S11: acquiring training data, wherein the training data comprises child speech training data, hard labels corresponding to the child speech training data, and random noise data;
S12: obtaining an unconditional generative adversarial network trained together with a baseline acoustic model;
S13: inputting the random noise data into the unconditional generative adversarial network to obtain noise-enhanced acoustic features;
S14: inputting the noise-enhanced acoustic features into the baseline acoustic model to obtain a posterior-probability soft label for each frame of the noise-enhanced acoustic features;
S15: training a child speech-enhanced acoustic recognition model using at least the noise-enhanced acoustic features with the soft labels and the child speech training data with the hard labels as sample training data.
In this embodiment, to address the insufficient amount of child speech data, conventional data augmentation methods that perturb the speech are usually considered. Generating new data directly from random variables with a generative model, by contrast, is a novel attempt at this task and is operationally more difficult than traditional data augmentation.
For step S11, the existing child speech training data, the probability hard labels corresponding to the child speech training data, and generated random noise data are acquired.
As an embodiment, the random noise data comprises randomly distributed samples. In this embodiment, the random noise data is not environmental "noise" as commonly understood, but a random variable, for example a 100-dimensional feature vector sampled from a Gaussian distribution with mean 0 and variance 1. Samples drawn from this random distribution serve as the noise data for training the data enhancement model. Such noise data is input to a generator, which outputs enhanced data that itself contains no "noise".
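As a sketch of this sampling step (assuming PyTorch, in which the experiments below are implemented; the 100-dimensional size is the example from the text):

```python
import torch

batch_size, noise_dim = 64, 100          # 100-dim example from the text
z = torch.randn(batch_size, noise_dim)   # samples from N(0, 1): mean 0, variance 1
```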
For step S12, an unconditional generative adversarial network is obtained through training together with the baseline acoustic model. Making predictions with the baseline is simple and easy to understand, and the baseline model also provides the lowest acceptable standard of performance. A generative adversarial network requires a "Generator": a neural network or, more simply, a function. A set of vectors is input, and the generator produces a set of target matrices (when child speech is to be generated, the matrices correspond to the phoneme inventory of the child speech). The goal is to make the generator's ability to fabricate samples as strong as possible, so that the discriminating network can no longer judge whether a sample is real or fake.
There is also a "Discriminator": its purpose is to discriminate whether a segment of child speech comes from the set of real samples or the set of fake samples. If the input is a real sample, the network output is close to 1; if the input is a fake sample, the network output is close to 0. In this way good discrimination is achieved. In the unconditional generative adversarial network trained by this method, the discriminator is no longer needed once generation is performed; only the generator is used.
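As an illustration of this real-vs-fake game, below is a minimal sketch of the standard discriminator and generator losses; `D` and `G` are assumed PyTorch modules, where `D` ends in a sigmoid (outputs in (0, 1)) and `fake = G(z)`:

```python
import torch
import torch.nn.functional as F

def d_loss(D, real, fake):
    # Discriminator: push D(real) toward 1 and D(fake) toward 0.
    return (F.binary_cross_entropy(D(real), torch.ones(real.size(0), 1))
            + F.binary_cross_entropy(D(fake.detach()), torch.zeros(fake.size(0), 1)))

def g_loss(D, fake):
    # Generator: fool D, i.e. push D(fake) toward 1.
    return F.binary_cross_entropy(D(fake), torch.ones(fake.size(0), 1))
```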
For step S13, the random noise data is input into the unconditional generative adversarial network trained in step S12, and through the generator of this network a larger quantity of noise-enhanced acoustic features is obtained. The noise-enhanced acoustic features may be a matrix whose first dimension is time and whose second dimension is the acoustic feature dimension.
For step S14, the noise-enhanced acoustic features are input into the baseline acoustic model, which determines a posterior-probability soft label for each frame of the noise-enhanced features.
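A minimal sketch of extracting such per-frame soft labels, assuming the baseline acoustic model is a PyTorch module that outputs one logit vector over acoustic states per frame (`baseline_model` and the shapes are illustrative):

```python
import torch

@torch.no_grad()
def soft_labels(baseline_model, feats):
    # feats: (num_frames, feat_dim) noise-enhanced acoustic features.
    logits = baseline_model(feats)        # (num_frames, num_states)
    return torch.softmax(logits, dim=-1)  # per-frame posterior P(S | O_t)
```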
For step S15, a child speech-enhanced acoustic recognition model is trained for data enhancement, using at least the noise-enhanced acoustic features with their corresponding soft labels and the child speech training data with their corresponding hard labels as training samples. Because the noise-enhanced acoustic features and their soft labels are added, and the added noise is generated randomly, each random draw of the input noise changes, so the generated data covers many different variants; the child speech-enhanced acoustic recognition model therefore learns from a larger quantity and a richer variety of child speech data. For example, the sound 'a' can be pronounced in many different ways. More specifically, different people, and even the same person at different times, pronounce 'a' differently. Taking this characteristic into account, the random-noise-enhanced features together with the child speech training data simulate scenarios in which the content ('a') is the same but the pronunciations differ, thereby varying the nature of the pronunciation itself; the generated child speech data contains no added "noise" and is more diversified. This clear improvement cannot be achieved by the existing methods that simply add noise. Training on richer and more varied child speech data improves the recognition performance of the child speech recognition model.
According to this embodiment, when child speech training data is limited, the unconditional generative adversarial network determines enhanced acoustic features from randomly distributed samples at no collection cost. The child speech-enhanced acoustic recognition model is trained on these enhanced acoustic features together with the limited child speech training data, which varies the pronunciation essence of the limited training data, lets the model learn more diversified child speech, and yields a child speech recognition model with higher recognition accuracy.
As an implementation manner, in this embodiment, the method further includes:
training a conditional generative adversarial network based on the child speech training data and the hard labels corresponding to the child speech training data, and acquiring enhanced acoustic features determined by a generator of the conditional generative adversarial network and condition labels corresponding to the enhanced acoustic features;
training a child speech-enhanced acoustic recognition model using at least the noise-enhanced acoustic features and the soft labels and the child speech training data and the hard labels as sample training data comprises:
training a child speech-enhanced acoustic recognition model using at least the noise-enhanced acoustic features with the soft labels, the child speech training data with the hard labels, and the enhanced acoustic features with the condition labels as sample training data.
In this embodiment, a conditional generative adversarial network may be trained based on the child speech data and the hard labels corresponding to the child speech training data; compared with using the unconditional generative adversarial network, both a generator and a discriminator are required here. Continuous training of the generator and the discriminator improves the data enhancement effect of the generator. The trained conditional generative adversarial network then determines enhanced acoustic features and the condition labels corresponding to those features.
When the random-noise-enhanced acoustic features and the soft labels are combined with the enhanced acoustic features and their corresponding condition labels, the recognition performance of the child speech recognition model can be further improved under these conditions.
In cases where the unconditional generative adversarial network cannot be trained using the baseline acoustic model, the conditional generative adversarial network can instead be used: the enhanced acoustic features with their condition labels, together with the child speech training data and the hard labels, serve as training samples for the child speech-enhanced acoustic recognition model, which relatively improves the recognition performance of the child speech recognition model.
As an embodiment, the types of the unconditional generative adversarial network and the conditional generative adversarial network include the Wasserstein generative adversarial network.
Early GAN training is unstable, and many studies have proposed new training criteria to improve the stability and convergence of GAN training. Recently, the Wasserstein GAN (WGAN) and the improved WGAN with gradient penalty (WGAN-GP) train using the Wasserstein distance between the two distributions. The Wasserstein distance (also known as the earth mover's distance) is used for the discriminator, in the form of a gradient penalty loss, because under mild assumptions it is continuous and differentiable almost everywhere. Specifically, WGAN-GP uses the following objective function:

$$L = \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\big[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2\big], \quad \hat{x} = \alpha x + (1 - \alpha)\tilde{x},$$

where α is a random number between 0 and 1, x ∼ P_r is a real sample, and x̃ ∼ P_g is a generated sample. The gradient penalty (GP) term pushes the gradient norm of D toward 1. This formulation provides a more stable GAN training process.
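A sketch of the gradient penalty term in PyTorch, following the standard WGAN-GP recipe and assuming the features are flat vectors:

```python
import torch

def gradient_penalty(D, real, fake, device="cpu"):
    # Interpolate x_hat = alpha * x + (1 - alpha) * x_tilde, one alpha per sample.
    alpha = torch.rand(real.size(0), 1, device=device).expand_as(real)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    d_out = D(x_hat)
    grads = torch.autograd.grad(
        outputs=d_out, inputs=x_hat,
        grad_outputs=torch.ones_like(d_out),
        create_graph=True, retain_graph=True, only_inputs=True,
    )[0]
    # Penalize deviation of the gradient norm from 1.
    return ((grads.view(grads.size(0), -1).norm(2, dim=1) - 1) ** 2).mean()
```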
To further describe the method: a GAN (Generative Adversarial Network) exploits the adversarial learning process between two models, a generator (G) and a discriminator (D), and has recently driven advances in a variety of generation tasks. The whole process can be seen as a competition between the discriminator D and the generator G: the purpose of the generator G is to convert Gaussian noise z into pseudo samples G(z) such that these samples cannot be distinguished from real samples, while the discriminator D is trained to distinguish fake samples from real samples. The objective function of the discriminator is

$$\max_D \; \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))],$$

that is, the discriminator D is trained to predict the validity of each datum, 1 (real) or 0 (fake). The objective function for the generator G is

$$\max_G \; \mathbb{E}_{z \sim P_z}[\log D(G(z))],$$

so that G aims to generate samples that the discriminator classifies as real.
In order to embed condition information into the GAN during training, the conditional GAN (cGAN) extends the original GAN by utilizing condition information y in both the generator and the discriminator. By integrating such condition information, the cGAN can generate data under desired conditions. The objective function of the cGAN can be written as

$$\min_G \max_D \; \mathbb{E}_{x \sim P_r}[\log D(x \mid y)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z \mid y) \mid y))].$$
To embed the condition information into the GAN training process, the method treats the condition information as a projection between the discriminator's intermediate features and the condition labels.
Based on the generative model described in the above steps, the method investigates two types of data enhancement frameworks, namely the unconditional GAN and the conditional GAN (cGAN). Both are implemented at the frame level, i.e., on feature maps of the child speech spectrum extracted from the speech waveform. In particular, filter-bank (FBANK) features are used as the input to the discriminator and as the output of the generator. The basic unit of input and generation is a context sequence of frames, whose concatenation spans approximately one syllable. Given these K-dimensional FBANK features, m of them are stacked to form an m × K matrix. In the following experiments, K is set to 40 and m is set to 20.
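A sketch of this frame stacking; the text does not say whether the m-frame units overlap, so non-overlapping chunks are assumed here:

```python
import torch

K, m = 40, 20  # FBANK dimension and frames per generation unit (from the text)

def stack_frames(fbank):
    # Cut a (T, K) FBANK matrix into (T // m, m, K) units, dropping the ragged
    # tail; each (m, K) unit spans roughly one syllable.
    T = (fbank.size(0) // m) * m
    return fbank[:T].reshape(-1, m, K)

units = stack_frames(torch.rand(1000, K))  # toy input -> shape (50, 20, 40)
```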
Fig. 2 shows a block diagram of the child ASR data enhancement framework. Using the original real data, different generative models are first trained to generate additional augmented data. For the unconditional generation experiments, because the generated data has no labels, an unsupervised learning strategy was developed in which a posterior probability is produced for each frame by the baseline acoustic model; these posteriors are referred to hereinafter as soft labels. Assuming that a well-trained GAN model makes the distribution of the augmented data highly similar to that of the real data, the KL (Kullback-Leibler, relative entropy) divergence is used as the training criterion for the acoustic model, from which the following optimization function can be derived:
$$\mathcal{L} = \sum_{O_t \in D_r} \mathrm{KL}\big(P_{\mathrm{ref}}(S \mid O_t)\,\|\,P_{\mathrm{aug}}(S \mid O_t)\big) + \sum_{O_t \in D_g} \mathrm{KL}\big(P_{\mathrm{baseline}}(S \mid O_t)\,\|\,P_{\mathrm{aug}}(S \mid O_t)\big),$$

where O_t is an input frame, S is an acoustic state, and P_ref is the original reference label. The generated dataset and the real speech dataset are denoted D_g and D_r, respectively. The posterior probabilities from the baseline acoustic model and the enhanced acoustic model are denoted P_baseline(S | O_t) and P_aug(S | O_t); the former are known as soft labels.
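A hedged sketch of this joint criterion in PyTorch: hard reference labels on real data plus KL divergence to the baseline's soft labels on generated data (`aug_model` and all shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def enhanced_model_loss(aug_model, real_feats, ref_labels, gen_feats, soft_labels):
    # Real data D_r: cross-entropy against the reference labels P_ref.
    ce = F.cross_entropy(aug_model(real_feats), ref_labels)
    # Generated data D_g: KL(P_baseline || P_aug) against the soft labels.
    log_p_aug = F.log_softmax(aug_model(gen_feats), dim=-1)
    kl = F.kl_div(log_p_aug, soft_labels, reduction="batchmean")
    return ce + kl
```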
For the conditional cGAN, the acoustic state (i.e., the clustered senone label in the pre-trained acoustic system) is used as the specific condition information to guide network training and data generation. For the generator, the state information is prepared as a one-hot vector concatenated to the input. For the discriminator, the inner product between the embedded condition vector and the feature vector is taken to introduce the condition information into the model. Whereas the input for generation in the unconditional GAN is just a random noise vector, in the cGAN these acoustic states can also be used directly as the labels of the enhancement data produced by the generator. Using these data and labels, a new enhanced acoustic model can be obtained by joint training on the real data and the generated data.
The above method was evaluated on three types of data sets: (1) a 100-hour manually transcribed Mandarin Chinese adult corpus, comprising 120k utterances with an average duration of 3 seconds; (2) a 40-hour manually transcribed Mandarin Chinese children's corpus, comprising 47k utterances; (3) a test set containing four child speech subsets totaling 16k utterances and two adult speech subsets totaling 8k utterances. The child test data contains 4 different subsets (A, B, C, D) sampled from different environments, and the adult test data contains 2 subsets (A, B). There are significant differences among these data sets (including collection devices and domains).
A Gaussian mixture model based hidden Markov model (GMM-HMM) was first built with the Kaldi toolkit using a standard recipe, consisting of 9663 clustered states trained with maximum likelihood estimation. Using the trained GMM-HMM model, state-level labels were derived by forced alignment of the 100 hours of real adult speech and 40 hours of real child speech. All DNN (deep neural network) acoustic models were built in Kaldi using the cross-entropy criterion and the ASGD (asynchronous stochastic gradient descent) based BP (back-propagation) algorithm. 95% of the training data was used for training and the remaining 5% for validation. The standard test pipeline of the Kaldi recipe was used for decoding and scoring.
The baseline model in the experiments comprises 5 hidden layers, each with 2048 units and a ReLU activation after each layer; the input layer has 1320 units, due to the use of 40-dimensional filter-bank features with Δ and ΔΔ and a context expansion of 5 frames on each side (a sketch of this network follows the observations below). The output layer consists of 9663 units corresponding to the GMM-HMM clustered states. For better comparison, two baseline models with two experimental settings (B-01 and B-02) were trained with the same architecture but different training sets: B-01 is trained only on child speech, while B-02 is trained on both child and adult speech. The WERs (word error rates) of the two baseline models are listed in the comparison of acoustic modeling with different training data shown schematically in Fig. 3. It can be observed that:
(1) When the system is built with only limited child speech, performance is very poor for both children and adults.
(2) Adding more adult data greatly improves accuracy on adult speech, but the impact on child speech remains limited (and in some cases it is even harmful). As in conventional ASR, child speech is more difficult to recognize than adult speech.
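A sketch of this baseline network in plain PyTorch (the paper builds it in Kaldi; only the layer sizes below are taken from the text):

```python
import torch.nn as nn

def make_baseline_dnn(in_dim=1320, hidden=2048, n_layers=5, out_dim=9663):
    # 1320 = 40 FBANK x 3 (static + delta + delta-delta) x 11 frames (5 per side);
    # output = 9663 logits over the GMM-HMM clustered states.
    layers, d = [], in_dim
    for _ in range(n_layers):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)
```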
All GAN models used here for data enhancement are implemented in PyTorch (a Python-based scientific computing package that can use GPU computing power as a deep learning platform, providing great computational flexibility and speed). For the unconditional GANs (G-01 and G-03), the method uses a 4-layer fully connected discriminator with ReLU activations (800 → 1024 → 768 → 256 → 1). The generator uses a mirrored structure that also contains four fully connected layers and a sigmoid output. In G-02, convolutional layers were introduced to better model the structural configuration: the discriminator has three convolutional layers with channels {128, 256, 512} and strides {(1, 2), (3, 3), (3, 3)}, a Leaky ReLU activation after each layer, and finally a fully connected layer. Mirroring the discriminator, the generator has one fully connected layer to project the incoming random noise, followed by three transposed convolutional layers that generate the feature map. The input to the generator is 256-dimensional random noise sampled from a centered isotropic multivariate Gaussian.
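A sketch of the fully connected G-01/G-03 networks on flattened 20 × 40 units (800 dimensions); the discriminator sizes are from the text, while the generator's intermediate sizes are an assumption mirroring the discriminator:

```python
import torch.nn as nn

FEAT_DIM, NOISE_DIM = 800, 256  # 20 x 40 flattened FBANK unit; 256-d noise

# Discriminator from the text: 800 -> 1024 -> 768 -> 256 -> 1.
D = nn.Sequential(
    nn.Linear(FEAT_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, 768), nn.ReLU(),
    nn.Linear(768, 256), nn.ReLU(),
    nn.Linear(256, 1),
)

# Generator: four fully connected layers with a sigmoid output (sizes assumed).
G = nn.Sequential(
    nn.Linear(NOISE_DIM, 256), nn.ReLU(),
    nn.Linear(256, 768), nn.ReLU(),
    nn.Linear(768, 1024), nn.ReLU(),
    nn.Linear(1024, FEAT_DIM), nn.Sigmoid(),
)
```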
Based on the G-01 and G-03 architectures, the cGAN models of the method take the condition information as an additional input in the form of a single input vector and project it through a fully connected layer to a 256-dimensional vector V_c. At the same time, the original feature input is projected through another fully connected layer to a 256-dimensional vector V_f. The rest of the network then uses V_f to compute the adversarial loss, while the inner product of V_c and V_f serves as the conditional loss.
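A sketch of this projection-style conditioning; the class name and the way the two terms are combined into one output are assumptions, only the 256-dimensional projections and the inner product are from the text:

```python
import torch.nn as nn

class ProjectionDiscriminator(nn.Module):
    def __init__(self, feat_dim=800, n_states=9663, proj_dim=256):
        super().__init__()
        self.cond_proj = nn.Linear(n_states, proj_dim)  # condition one-hot -> V_c
        self.feat_proj = nn.Linear(feat_dim, proj_dim)  # features -> V_f
        self.adv_head = nn.Sequential(nn.ReLU(), nn.Linear(proj_dim, 1))

    def forward(self, feats, cond_onehot):
        v_f = self.feat_proj(feats)                  # V_f: adversarial branch
        v_c = self.cond_proj(cond_onehot)            # V_c: embedded condition
        adv = self.adv_head(v_f)                     # adversarial score from V_f
        cond = (v_c * v_f).sum(dim=1, keepdim=True)  # conditional term <V_c, V_f>
        return adv + cond
```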
In the CG-01 experiment, the condition information was used directly as the label for cGAN training. For CG-02, a linear combination of soft labels and conditional hard labels is used as the training label, controlled by a hyper-parameter β ∈ [0, 1]. More specifically, the new label is derived as follows:

$$p_{\mathrm{comb}} = \beta\, p_{\mathrm{baseline}}(s \mid o_t) + (1 - \beta)\, p_{\mathrm{condition}}$$
During training, the discriminator D is updated 5 times, after which the generator G is updated once in each mini-batch cycle. The gradient penalty parameter λ is set to 10. The networks are trained with Adam, and the mini-batch size is set to 64.
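A sketch of this schedule reusing the `G`, `D`, and `gradient_penalty` sketches above; the Adam learning rate and betas are assumptions (only the optimizer choice, batch size 64, λ = 10, and the 5:1 update ratio are stated):

```python
import torch

# Toy stand-in for the real FBANK units: (64, 800) batches.
real_loader = torch.utils.data.DataLoader(torch.rand(6400, 800), batch_size=64)

opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.9))  # lr/betas assumed
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
n_critic, lam = 5, 10.0  # 5 discriminator updates per generator update; GP weight

for step, real in enumerate(real_loader):
    fake = G(torch.randn(real.size(0), 256)).detach()
    d_loss = D(fake).mean() - D(real).mean() + lam * gradient_penalty(D, real, fake)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    if (step + 1) % n_critic == 0:  # then one generator update
        g_loss = -D(G(torch.randn(real.size(0), 256))).mean()
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```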
Visualization of generated data: to better understand the feature samples generated by the GAN, the feature maps of real child speech samples are visualized in Fig. 4 and compared with samples produced at different points of the model training process. As the figure shows, with more training time and as the model converges, the quality of the features generated from the same noise vector gradually improves, and the final generated samples are highly similar to real child feature samples. Comparing different units in the generated feature maps shows that a well-converged model can generate diverse features from different random noise.
Exploring generative models: to investigate how generative models affect the quality of the generated features and the improvement they bring to the ASR system, generative models with different network architectures were first tested using the basic GAN or the conditional GAN. A comparison of acoustic modeling with different training data using the proposed method is shown schematically in Fig. 3. (1) Setting #1, with only child speech in the original training set: after adding the same amount of generated child data as real data, G-01 significantly reduces the relative WER on the child test sets compared to B-01. Using convolutional layers in the GAN model of G-02 further improves the results. Under this data-limited setting, the generated child speech is also very helpful for recognizing adult speech. (2) Setting #2, with adult and child speech in the original training set: clearly, the WER on adult speech is greatly reduced, but performance on child speech degrades; adult data appears to be of little use for recognizing children. With the proposed GAN- or cGAN-based child data expansion method, a large WER reduction is still obtained on the child test set, consistent with the observation in Setting #1. Moreover, the generated child data is also helpful for adult speech, giving a slight improvement. Data generated by both the GAN (G-03) and the cGAN (CG-01, CG-02) improves the acoustic models, producing better results on the child test set; this indicates that both soft labels and condition labels can successfully guide model training, and that their combination achieves even better performance.
Exploring the amount of data: the experiments also explored whether the amount of augmented data has a large impact on acoustic modeling. In the second setup, with a fixed 40 hours of child speech + 100 hours of adult speech, systems using different amounts of augmented child data (from 20 to 80 hours) were compared; a schematic comparison of the average word error rates of these systems is shown in Fig. 5. The WER first decreases as the amount of generated data increases, but as it approaches the amount of real child data used to train the generative model, the improvement approaches saturation.
In summary, the unsupervised and combined frameworks can produce powerful generative models from limited child data and labels. Experiments with various model settings show that introducing GAN-generated augmented data can significantly enhance a child ASR system. The resulting system reduces the WER on child speech by more than 20%, and the newly generated GAN-based child speech can even improve adult speech recognition under certain conditions.
Fig. 6 is a schematic structural diagram of a training system for a child speech recognition model according to an embodiment of the present invention, which can execute the training method for a child speech recognition model according to any of the above embodiments and is configured in a terminal.
The training system for the child speech recognition model provided by this embodiment comprises: a data acquisition program module 11, an adversarial network generation program module 12, an acoustic feature determination program module 13, a label determination program module 14, and a recognition model training program module 15.
The data acquisition program module 11 is configured to acquire training data, where the training data includes child speech training data, hard labels corresponding to the child speech training data, and random noise data; the adversarial network generation program module 12 is configured to obtain the unconditional generative adversarial network trained with the baseline acoustic model; the acoustic feature determination program module 13 is configured to input the random noise data into the unconditional generative adversarial network to obtain noise-enhanced acoustic features; the label determination program module 14 is configured to input the noise-enhanced acoustic features into the baseline acoustic model and obtain a posterior-probability soft label for each frame of the noise-enhanced acoustic features; and the recognition model training program module 15 is configured to train the child speech-enhanced acoustic recognition model using at least the noise-enhanced acoustic features with the soft labels and the child speech training data with the hard labels as sample training data.
Further, the adversarial network generation program module is also configured to: train a conditional generative adversarial network based on the child speech training data and the hard labels corresponding to the child speech training data;
the label determination program module is also configured to: acquire enhanced acoustic features determined by a generator of the conditional generative adversarial network and condition labels corresponding to the enhanced acoustic features;
the recognition model training program module is also configured to:
train a child speech-enhanced acoustic recognition model using at least the noise-enhanced acoustic features with the soft labels, the child speech training data with the hard labels, and the enhanced acoustic features with the condition labels as sample training data.
Further, the random noise data comprises randomly distributed samples.
Further, the types of the unconditional generative adversarial network and the conditional generative adversarial network include the Wasserstein generative adversarial network.
An embodiment of the invention further provides a non-volatile computer storage medium storing computer-executable instructions that can execute the training method of the child speech recognition model in any of the above method embodiments.
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
acquiring training data, wherein the training data comprises child speech training data, hard labels corresponding to the child speech training data, and random noise data;
obtaining an unconditional generative adversarial network trained together with a baseline acoustic model;
inputting the random noise data into the unconditional generative adversarial network to obtain noise-enhanced acoustic features;
inputting the noise-enhanced acoustic features into the baseline acoustic model to obtain a posterior-probability soft label for each frame of the noise-enhanced acoustic features;
training a child speech-enhanced acoustic recognition model using at least the noise-enhanced acoustic features with the soft labels and the child speech training data with the hard labels as sample training data.
The non-volatile computer-readable storage medium may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the training method of the child speech recognition model in any of the method embodiments described above.
The non-volatile computer-readable storage medium may include a program storage area and a data storage area, where the program storage area may store an operating system and application programs required for at least one function, and the data storage area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, comprising: at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the training method of the child speech recognition model according to any embodiment of the invention.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: these devices are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices: these devices can display and play multimedia content, and include audio and video players, handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Other electronic devices with data processing capabilities.
In this document, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.