CN112364993A - Model joint training method and device, computer equipment and storage medium - Google Patents
- Publication number
- CN112364993A (application CN202110044163.XA)
- Authority
- CN
- China
- Prior art keywords
- network
- feature matrix
- model
- training
- dimensional feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The application provides a model joint training method and device, a computer device, and a storage medium, comprising the following steps: constructing a first acoustic feature matrix of the audio training data; inputting the first acoustic feature matrix into a coding network to obtain a first high-dimensional feature matrix; inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix; inputting the second acoustic feature matrix into the coding network to obtain a second high-dimensional feature matrix; and respectively inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjusting network parameters of the coding network, the decoding network, and the classification network based on a back propagation algorithm to obtain a trained wake-up model and a trained noise reduction model. In this application, the second acoustic feature matrix output by the decoding network increases the amount of training-sample data, and the wake-up model and the noise reduction model are trained jointly, so the models perform better than when trained separately, training is fast, and training cost is low.
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a model joint training method and apparatus, a computer device, and a storage medium.
Background
Currently, wake-up models and noise reduction models are generally trained on a collected clean speech data set and a collected noise data set. During training, data augmentation that simulates real scenes is applied to increase the diversity of the training data and to improve the models' noise robustness in real scenes.
Obtaining a good noise reduction model requires substantially more training data than training a wake-up model. When the training data is limited, i.e., only noisy speech or only a small amount of clean speech is available, a good noise reduction model cannot be obtained; the performance obtainable by directly training the wake-up model is then relatively limited and difficult to improve further.
Disclosure of Invention
The application mainly aims to provide a model joint training method and device, a computer device, and a storage medium, so as to overcome the poor performance of models trained when little training data is available.
In order to achieve the above object, the present application provides a model joint training method, including the following steps:
constructing a first acoustic feature matrix of the audio training data;
inputting the first acoustic feature matrix into a coding network to obtain a first high-dimensional feature matrix;
inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
inputting the second acoustic feature matrix into the coding network to obtain a second high-dimensional feature matrix;
inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and adjusting network parameters of the coding network, the decoding network, and the classification network based on a back propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; the coding network and the classification network form the wake-up model, and the coding network and the decoding network form the noise reduction model.
Further, the classification network comprises a fully connected layer and a softmax function, and the loss function used is a cross-entropy loss function.
Further, the step of inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjusting network parameters of the coding network, the decoding network, and the classification network based on a back propagation algorithm to obtain a trained wake-up model and a trained noise reduction model includes:
after the first high-dimensional feature matrix and the second high-dimensional feature matrix are input into the fully connected layer for calculation, calculating a cross-entropy loss value based on the loss function;
adjusting the network parameters of the coding network, the decoding network, and the classification network using a gradient descent back propagation algorithm to minimize the cross-entropy loss value;
and after iterative training, when the cross-entropy loss value no longer decreases, the model converges, and the trained wake-up model and noise reduction model are obtained.
Further, the audio training data comprises positive sample audio and negative sample audio;
before the step of constructing the first acoustic feature matrix of the audio training data, the method comprises:
acquiring noise speech as the negative sample audio;
acquiring clean wake-up speech; the clean wake-up speech is noise-free speech carrying a wake-up word;
and mixing the clean wake-up speech and the noise speech at a preset signal-to-noise ratio to obtain noisy wake-up speech as the positive sample audio.
The application also provides a model joint training device, including:
a construction unit, used for constructing a first acoustic feature matrix of audio training data;
the first coding unit is used for inputting the first acoustic feature matrix into a coding network to obtain a first high-dimensional feature matrix;
the decoding unit is used for inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
the second coding unit is used for inputting the second acoustic feature matrix into the coding network to obtain a second high-dimensional feature matrix;
a training unit, used for respectively inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjusting network parameters of the coding network, the decoding network, and the classification network based on a back propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; the coding network and the classification network form the wake-up model, and the coding network and the decoding network form the noise reduction model.
Further, the classification network comprises a fully connected layer and a softmax function, and the loss function used is a cross-entropy loss function.
Further, the training unit is specifically configured to:
after the first high-dimensional feature matrix and the second high-dimensional feature matrix are input into the fully connected layer for calculation, calculating a cross-entropy loss value based on the loss function;
adjusting the network parameters of the coding network, the decoding network, and the classification network using a gradient descent back propagation algorithm to minimize the cross-entropy loss value;
and after iterative training, when the cross-entropy loss value no longer decreases, the model converges, and the trained wake-up model and noise reduction model are obtained.
Further, the audio training data comprises positive sample audio and negative sample audio; the device further comprises:
a first acquisition unit, used for acquiring noise speech as the negative sample audio;
a second acquisition unit, used for acquiring clean wake-up speech; the clean wake-up speech is noise-free speech carrying a wake-up word;
and a mixing unit, used for mixing the clean wake-up speech and the noise speech at a preset signal-to-noise ratio to obtain noisy wake-up speech as the positive sample audio.
The present application further provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of any one of the above methods when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the above.
The model joint training method and device, computer device, and storage medium provided by the application comprise: constructing a first acoustic feature matrix of the audio training data; inputting the first acoustic feature matrix into a coding network to obtain a first high-dimensional feature matrix; inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix; inputting the second acoustic feature matrix into the coding network to obtain a second high-dimensional feature matrix; and respectively inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjusting network parameters of the coding network, the decoding network, and the classification network based on a back propagation algorithm to obtain a trained wake-up model and a trained noise reduction model. In the application, the decoding network outputs the second acoustic feature matrix, which increases the amount of training-sample data, and the wake-up model and the noise reduction model are trained jointly; because the two models share one coding network, both gain the ability to extract target information more accurately from noisy audio, so they perform better than when trained separately, training is fast, and training cost is low.
Drawings
FIG. 1 is a schematic diagram illustrating the steps of a model joint training method according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating the principle of a model joint training method according to an embodiment of the present application;
FIG. 3 is a schematic block diagram of a model joint training device according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of the structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to FIG. 1, an embodiment of the present application provides a model joint training method, including the following steps:
step S1, constructing a first acoustic feature matrix of the audio training data;
step S2, inputting the first acoustic feature matrix into a coding network to obtain a first high-dimensional feature matrix;
step S3, inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
step S4, inputting the second acoustic feature matrix into the coding network to obtain a second high-dimensional feature matrix;
step S5, respectively inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjusting network parameters of the coding network, the decoding network, and the classification network based on a back propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; the coding network and the classification network form the wake-up model, and the coding network and the decoding network form the noise reduction model.
In this embodiment, the model joint training method is applied to scenarios with little training data and improves the performance of the trained models. The method obtains a noise reduction model with a certain noise reduction effect while training the wake-up model, providing a feasible scheme for constructing a noise reduction model when training data is insufficient. Here, little training data means that clean wake-up speech is scarce or absent.
Specifically, as described in step S1 above, the audio training data is audio data, typically noisy wake-up speech data, and the audio data is labeled with corresponding labels for training the neural network models. Before being input into the neural network models for training, a first acoustic feature matrix of the audio training data needs to be constructed; a linear transformation network can generally be adopted to extract the feature matrix.
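For illustration, the following is a minimal sketch of step S1 in Python with PyTorch and torchaudio (the patent does not prescribe a framework); the choice of log-mel filterbank features and all parameter values (sample rate, window sizes, number of mel bins) are assumptions for the example, since the patent only requires some linear transformation of the audio:

```python
import torch
import torchaudio

def build_acoustic_feature_matrix(waveform: torch.Tensor,
                                  sample_rate: int = 16000,
                                  n_mels: int = 40) -> torch.Tensor:
    """Turn a mono waveform of shape (1, samples) into a (frames, n_mels) matrix."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=n_mels)
    feats = mel(waveform)                    # (1, n_mels, frames)
    feats = torch.log(feats + 1e-6)          # log compression for numerical stability
    return feats.squeeze(0).transpose(0, 1)  # first acoustic feature matrix: (frames, n_mels)
```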
As described in step S2 above, the coding network (kws_net) is a neural network for extracting a high-dimensional feature matrix from audio: it takes the acoustic feature matrix of the audio as input and outputs a feature matrix in a high-dimensional space.
As described in step S3 above, the decoding network (decode_net) is a neural network for decoding a high-dimensional feature matrix back into an acoustic feature matrix; after passing through the decoding network, a new acoustic feature matrix, namely the second acoustic feature matrix, is generated. It can be understood that, in this embodiment, as shown in FIG. 2, the coding network serves as the common part of the noise reduction model and the wake-up model: when processing input data, the coding network of the wake-up model mainly extracts information related to the speech content in the noisy sound, while the coding network of the noise reduction model mainly separates out the target sound features, from which the target speech is then generated. What the two have in common is that both need to extract the feature information of the target speech; therefore, the coding network of the noise reduction model retains the speech information when processing noisy audio, and the audio generated by the decoding network can still trigger a wake-up when passed through the wake-up network.
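A sketch of these two networks follows; the patent allows the coding and decoding networks to be any combination of DNN, CNN, and RNN, so the small GRU encoder and linear decoder below are illustrative assumptions, not the patent's fixed architecture:

```python
import torch.nn as nn

class CodingNetwork(nn.Module):
    """Coding network (kws_net): acoustic features -> high-dimensional features."""
    def __init__(self, feat_dim: int = 40, hidden_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, x):        # x: (batch, frames, feat_dim)
        h, _ = self.rnn(x)       # h: (batch, frames, hidden_dim)
        return h

class DecodingNetwork(nn.Module):
    """Decoding network (decode_net): high-dimensional features -> acoustic features."""
    def __init__(self, hidden_dim: int = 128, feat_dim: int = 40):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, h):        # h: (batch, frames, hidden_dim)
        return self.proj(h)      # second acoustic feature matrix
```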
As described in step S4, the audio generated by the decoding network can still trigger a wake-up after passing through the wake-up network; therefore, the second acoustic feature matrix obtained through the decoding network can also be used as training data and input into the coding network to obtain the second high-dimensional feature matrix.
As stated in step S5, the first high-dimensional feature matrix and the second high-dimensional feature matrix are respectively input into the classification network, and the network parameters (network weights) of the coding network, the decoding network, and the classification network are continuously adjusted based on a back propagation algorithm to obtain the trained wake-up model and noise reduction model.
In this embodiment, the coding network and the classification network form the wake-up model, and the coding network and the decoding network form the noise reduction model; after the coding network, the decoding network, and the classification network are iteratively trained to convergence, the wake-up model and the noise reduction model are obtained. In this embodiment, the decoding network outputs the second acoustic feature matrix, which increases the amount of training-sample data, and the wake-up model and the noise reduction model are trained jointly; because the two models share one coding network, both gain the ability to extract target information more accurately from noisy audio, so they perform better than when trained separately, training is fast, and training cost is low.
In summary, the model joint training method of this embodiment is applicable to scenarios with insufficient training data, i.e., when there is not enough data to build a separate noise reduction model to help improve the wake-up model: the wake-up network and the noise reduction network share one coding network, and the two networks are jointly trained directly on the noisy wake-up speech training data and the noise data set. In addition, while the loss function of the wake-up model trains the wake-up model, it trains the noise reduction model along the way; the resulting noise reduction model has a certain noise reduction effect even when no clean speech is available, providing a feasible scheme for constructing a noise reduction model when data is insufficient.
In an embodiment, the coding network comprises any one or more neural networks such as a DNN, a CNN, or an RNN; all of these networks can implement encoding of the acoustic feature matrix, which is not limited herein.
In an embodiment, the decoding network comprises any one or more neural networks such as a DNN, a CNN, or an RNN; all of these networks can implement decoding of the high-dimensional feature matrix, which is not limited herein.
In an embodiment, the classification network comprises a fully connected layer and a softmax function, and the loss function used is a cross-entropy loss function.
In this embodiment, the classification network of the wake-up model is a general classification model whose target is a class label. For the decoding network, since the audio it outputs is fed into the coding network again as a sample, the target is still a class label. The joint training of the two networks therefore has only one loss function, namely the cross-entropy loss commonly used by general classification models, with the formula:
Total_loss = ce_loss
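A minimal sketch of such a classification network and loss follows, again assuming PyTorch; note that nn.CrossEntropyLoss already folds the softmax into the loss, so the explicit softmax is only needed at inference time, and the frame-pooling step and the class count are assumptions made for the example:

```python
import torch.nn as nn

class ClassificationNetwork(nn.Module):
    """Fully connected layer; the softmax is applied implicitly by the loss below."""
    def __init__(self, hidden_dim: int = 128, num_classes: int = 2):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, h):              # h: (batch, frames, hidden_dim)
        return self.fc(h.mean(dim=1))  # pool over frames -> (batch, num_classes) logits

# Total_loss = ce_loss: a single cross-entropy loss drives the joint training.
ce_loss_fn = nn.CrossEntropyLoss()
```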
In an embodiment, the step of inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjusting network parameters of the coding network, the decoding network, and the classification network based on a back propagation algorithm to obtain a trained wake-up model and a trained noise reduction model includes:
after the first high-dimensional feature matrix and the second high-dimensional feature matrix are input into the fully connected layer for calculation, calculating a cross-entropy loss value based on the loss function;
adjusting the network parameters of the coding network, the decoding network, and the classification network using a gradient descent back propagation algorithm to minimize the cross-entropy loss value;
and after iterative training, when the cross-entropy loss value no longer decreases, the model converges, and the trained wake-up model and noise reduction model are obtained.
In the iterative training process of this embodiment, the classification result is predicted, and the cross-entropy loss value between the predicted classification result and the true label is calculated by the loss function. A gradient descent back propagation algorithm is then used to continuously adjust the network parameters of the coding network, the decoding network, and the classification network, i.e., the network weights, so as to minimize the cross-entropy loss value calculated by the loss function; when the cross-entropy loss value no longer decreases, the model has converged, and the trained wake-up model and noise reduction model are obtained.
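Putting the pieces together, the following sketch shows one joint training step over a batch, assuming the modules and ce_loss_fn sketched above; the optimizer, learning rate, and the decision to sum (rather than average) the two cross-entropy terms are assumptions, since the patent only specifies gradient descent back propagation on a single cross-entropy loss:

```python
import torch

encoder = CodingNetwork()
decoder = DecodingNetwork()
classifier = ClassificationNetwork()
params = (list(encoder.parameters()) + list(decoder.parameters())
          + list(classifier.parameters()))
optimizer = torch.optim.SGD(params, lr=0.01)  # plain gradient descent

def train_step(feats_1: torch.Tensor, labels: torch.Tensor) -> float:
    """One joint update; feats_1: (batch, frames, feat_dim), labels: (batch,)."""
    high_1 = encoder(feats_1)   # first high-dimensional feature matrix
    feats_2 = decoder(high_1)   # second acoustic feature matrix (extra samples)
    high_2 = encoder(feats_2)   # second high-dimensional feature matrix

    # Both branches are scored against the same class labels with one loss.
    loss = ce_loss_fn(classifier(high_1), labels) \
         + ce_loss_fn(classifier(high_2), labels)

    optimizer.zero_grad()
    loss.backward()             # back propagation through all three networks
    optimizer.step()
    return loss.item()
```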
In an embodiment, the audio training data comprises positive sample audio and negative sample audio;
before the step of constructing the first acoustic feature matrix of the audio training data, the method comprises:
acquiring noise speech as the negative sample audio;
and acquiring noisy wake-up speech as the positive sample audio.
In an embodiment, the step of acquiring noisy wake-up speech as the positive sample audio includes:
acquiring clean wake-up speech; the clean wake-up speech is noise-free speech carrying a wake-up word;
and mixing the clean wake-up speech and the noise speech to obtain noisy wake-up speech as the positive sample audio.
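A sketch of this mixing step follows; per the disclosure, the clean wake-up speech and the noise speech are mixed at a preset signal-to-noise ratio, and the 10 dB default below is an illustrative value, not one given in the patent:

```python
import torch

def mix_at_snr(clean: torch.Tensor, noise: torch.Tensor,
               snr_db: float = 10.0) -> torch.Tensor:
    """Mix equal-length 1-D waveforms so the result has the preset SNR."""
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean()
    # Scale the noise so that 10 * log10(clean_power / scaled_noise_power) == snr_db.
    scale = torch.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise  # noisy wake-up speech (positive sample)
```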
Referring to FIG. 3, an embodiment of the present application further provides a model joint training device, including:
a construction unit 10, configured to construct a first acoustic feature matrix of the audio training data;
a first coding unit 20, configured to input the first acoustic feature matrix into a coding network to obtain a first high-dimensional feature matrix;
a decoding unit 30, configured to input the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
a second coding unit 40, configured to input the second acoustic feature matrix into the coding network to obtain a second high-dimensional feature matrix;
a training unit 50, configured to respectively input the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjust network parameters of the coding network, the decoding network, and the classification network based on a back propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; the coding network and the classification network form the wake-up model, and the coding network and the decoding network form the noise reduction model.
In this embodiment, the model joint training device is applied to scenarios with little training data to improve the performance of the trained models. The device obtains a noise reduction model with a certain noise reduction effect while training the wake-up model, providing a feasible scheme for constructing a noise reduction model when training data is insufficient. Here, little training data means that clean wake-up speech is scarce or absent.
Specifically, as described for the construction unit 10 above, the audio training data is audio data, typically noisy wake-up speech data, and the audio data is labeled with corresponding labels for training the neural network models. Before being input into the neural network models for training, a first acoustic feature matrix of the audio training data needs to be constructed; a linear transformation network can generally be adopted to extract the feature matrix.
As described for the first coding unit 20, the coding network (kws_net) is a neural network for extracting a high-dimensional feature matrix from audio: it takes the acoustic feature matrix of the audio as input and outputs a feature matrix in a high-dimensional space.
As described for the decoding unit 30, the decoding network (decode_net) is a neural network for decoding a high-dimensional feature matrix back into an acoustic feature matrix; after passing through the decoding network, a new acoustic feature matrix, namely the second acoustic feature matrix, is generated. It can be understood that, in this embodiment, as shown in FIG. 2, the coding network serves as the common part of the noise reduction model and the wake-up model: when processing input data, the coding network of the wake-up model mainly extracts information related to the speech content in the noisy sound, while the coding network of the noise reduction model mainly separates out the target sound features, from which the target speech is then generated. What the two have in common is that both need to extract the feature information of the target speech; therefore, the coding network of the noise reduction model retains the speech information when processing noisy audio, and the audio generated by the decoding network can still trigger a wake-up when passed through the wake-up network.
As described for the second coding unit 40, the audio generated by the decoding network can still trigger a wake-up after passing through the wake-up network; therefore, the second acoustic feature matrix obtained through the decoding network can also be used as training data and input into the coding network to obtain the second high-dimensional feature matrix.
As stated for the training unit 50, the first high-dimensional feature matrix and the second high-dimensional feature matrix are respectively input into the classification network, and the network parameters (network weights) of the coding network, the decoding network, and the classification network are continuously adjusted based on a back propagation algorithm to obtain the trained wake-up model and noise reduction model.
In this embodiment, the coding network and the classification network form the wake-up model, and the coding network and the decoding network form the noise reduction model; after the coding network, the decoding network, and the classification network are iteratively trained to convergence, the wake-up model and the noise reduction model are obtained. In this embodiment, the decoding network outputs the second acoustic feature matrix, which increases the amount of training-sample data, and the wake-up model and the noise reduction model are trained jointly; because the two models share one coding network, both gain the ability to extract target information more accurately from noisy audio, so they perform better than when trained separately, training is fast, and training cost is low.
To sum up, the model joint training device of this embodiment is applicable to scenarios with insufficient training data, i.e., when there is not enough data to build a separate noise reduction model to help improve the wake-up model: the wake-up network and the noise reduction network share one coding network, and the two networks are jointly trained directly on the noisy wake-up speech training data and the noise data set. Through this training mode, the coding network gains the ability to accurately extract target information from noisy speech, so the models perform better than when trained alone, training is fast, and training cost is low. In addition, while the loss function of the wake-up model trains the wake-up model, it trains the noise reduction model along the way; the resulting noise reduction model has a certain noise reduction effect even when no clean speech is available, providing a feasible scheme for constructing a noise reduction model when data is insufficient.
In one embodiment, the coding network comprises any one or more neural networks such as a DNN, a CNN, or an RNN.
In one embodiment, the decoding network comprises any one or more neural networks such as a DNN, a CNN, or an RNN.
In an embodiment, the classification network comprises a fully connected layer and a softmax function, and the loss function used is a cross-entropy loss function.
In an embodiment, the training unit 50 is specifically configured to:
after the first high-dimensional feature matrix and the second high-dimensional feature matrix are input into the fully connected layer for calculation, calculating a cross-entropy loss value based on the loss function;
adjusting the network parameters of the coding network, the decoding network, and the classification network using a gradient descent back propagation algorithm to minimize the cross-entropy loss value;
and after iterative training, when the cross-entropy loss value no longer decreases, the model converges, and the trained wake-up model and noise reduction model are obtained.
In an embodiment, the audio training data comprises positive sample audio and negative sample audio;
the model joint training device further comprises:
a first acquisition unit, configured to acquire noise speech as the negative sample audio;
and a second acquisition unit, configured to acquire noisy wake-up speech as the positive sample audio.
In an embodiment, the second obtaining unit is specifically configured to:
acquiring clean wake-up speech; the clean wake-up speech is noise-free speech carrying a wake-up word;
and mixing the clean wake-up speech and the noise speech to obtain noisy wake-up speech as the positive sample audio.
In this embodiment, please refer to the method described in the above embodiment for specific implementation of each unit in the model joint training apparatus, which is not described herein again.
Referring to FIG. 4, a computer device is also provided in an embodiment of the present application; the computer device may be a server, and its internal structure may be as shown in FIG. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computation and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing models and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a model joint training method.
Those skilled in the art will appreciate that the architecture shown in FIG. 4 is only a block diagram of part of the structure related to the present solution and does not limit the scope of the present solution as applied to computer devices.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements a model joint training method. It is to be understood that the computer-readable storage medium in this embodiment may be a volatile or a non-volatile readable storage medium.
In summary, the model joint training method and device, computer device, and storage medium provided in the embodiments of the present application include: constructing a first acoustic feature matrix of the audio training data; inputting the first acoustic feature matrix into a coding network to obtain a first high-dimensional feature matrix; inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix; inputting the second acoustic feature matrix into the coding network to obtain a second high-dimensional feature matrix; and respectively inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjusting network parameters of the coding network, the decoding network, and the classification network based on a back propagation algorithm to obtain a trained wake-up model and a trained noise reduction model. In the application, the decoding network outputs the second acoustic feature matrix, which increases the amount of training-sample data, and the wake-up model and the noise reduction model are trained jointly; because the two models share one coding network, both gain the ability to extract target information more accurately from noisy audio, so they perform better than when trained separately, training is fast, and training cost is low.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing related hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but also other elements not expressly listed or inherent to such a process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, apparatus, article, or method that includes the element.
The above description is only for the preferred embodiment of the present application and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.
Claims (10)
1. A model joint training method is characterized by comprising the following steps:
constructing a first acoustic feature matrix of the audio training data;
inputting the first acoustic feature matrix into a coding network to obtain a first high-dimensional feature matrix;
inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
inputting the second acoustic feature matrix into the coding network to obtain a second high-dimensional feature matrix;
inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and adjusting network parameters of the coding network, the decoding network, and the classification network based on a back propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; wherein the coding network and the classification network form the wake-up model, and the coding network and the decoding network form the noise reduction model.
2. The model joint training method of claim 1, wherein the classification network comprises a fully connected layer and a softmax function, and the loss function used is a cross-entropy loss function.
3. The model joint training method according to claim 2, wherein the step of inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjusting the network parameters of the coding network, the decoding network, and the classification network based on a back propagation algorithm to obtain a trained wake-up model and a trained noise reduction model comprises:
after the first high-dimensional feature matrix and the second high-dimensional feature matrix are input into the fully connected layer for calculation, calculating a cross-entropy loss value based on the loss function;
adjusting the network parameters of the coding network, the decoding network, and the classification network using a gradient descent back propagation algorithm to minimize the cross-entropy loss value;
and after iterative training, when the cross-entropy loss value no longer decreases, the model converges, and the trained wake-up model and noise reduction model are obtained.
4. The model joint training method of claim 1, wherein the audio training data comprises positive sample audio and negative sample audio;
before the step of constructing the first acoustic feature matrix of the audio training data, the method comprises:
acquiring noise speech as the negative sample audio;
acquiring clean wake-up speech; the clean wake-up speech is noise-free speech carrying a wake-up word;
and mixing the clean wake-up speech and the noise speech at a preset signal-to-noise ratio to obtain noisy wake-up speech as the positive sample audio.
5. A model joint training device, comprising:
a construction unit, used for constructing a first acoustic feature matrix of audio training data;
the first coding unit is used for inputting the first acoustic feature matrix into a coding network to obtain a first high-dimensional feature matrix;
the decoding unit is used for inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
the second coding unit is used for inputting the second acoustic feature matrix into the coding network to obtain a second high-dimensional feature matrix;
a training unit, used for respectively inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjusting network parameters of the coding network, the decoding network, and the classification network based on a back propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; the coding network and the classification network form the wake-up model, and the coding network and the decoding network form the noise reduction model.
6. The model joint training device of claim 5, wherein the classification network comprises a fully connected layer and a softmax function, and the loss function used is a cross-entropy loss function.
7. The model joint training device of claim 6, wherein the training unit is specifically configured to:
after the first high-dimensional feature matrix and the second high-dimensional feature matrix are input into the fully connected layer for calculation, calculating a cross-entropy loss value based on the loss function;
adjusting the network parameters of the coding network, the decoding network, and the classification network using a gradient descent back propagation algorithm to minimize the cross-entropy loss value;
and after iterative training, when the cross-entropy loss value no longer decreases, the model converges, and the trained wake-up model and noise reduction model are obtained.
8. The model joint training device of claim 5, wherein the audio training data comprises positive sample audio and negative sample audio; the device further comprises:
a first acquisition unit, used for acquiring noise speech as the negative sample audio;
a second acquisition unit, used for acquiring clean wake-up speech; the clean wake-up speech is noise-free speech carrying a wake-up word;
and a mixing unit, used for mixing the clean wake-up speech and the noise speech at a preset signal-to-noise ratio to obtain noisy wake-up speech as the positive sample audio.
9. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any of claims 1 to 4.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110044163.XA (granted as CN112364993B) | 2021-01-13 | 2021-01-13 | Model joint training method and device, computer equipment and storage medium
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110044163.XA (granted as CN112364993B) | 2021-01-13 | 2021-01-13 | Model joint training method and device, computer equipment and storage medium
Publications (2)
Publication Number | Publication Date |
---|---|
CN112364993A (en) | 2021-02-12
CN112364993B (en) | 2021-04-30
Family
ID=74534933
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110044163.XA (Active, granted as CN112364993B) | Model joint training method and device, computer equipment and storage medium | 2021-01-13 | 2021-01-13
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112364993B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114512136A (en) * | 2022-03-18 | 2022-05-17 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Model training method, audio processing method, device, apparatus, storage medium, and program |
CN116074150A (en) * | 2023-03-02 | 2023-05-05 | Guangdong Haobote Technology Co., Ltd. | Switch control method and device for smart home, and smart home |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107463953A (en) * | 2017-07-21 | 2017-12-12 | Shanghai Jiao Tong University | Image classification method and system based on quality embedding with noisy labels |
CN109977212A (en) * | 2019-03-28 | 2019-07-05 | Graduate School at Shenzhen, Tsinghua University | Reply content generation method for dialogue robot and terminal device |
CN110009025A (en) * | 2019-03-27 | 2019-07-12 | Henan University of Technology | Semi-supervised additive noise autoencoder for speech lie detection |
CN110503981A (en) * | 2019-08-26 | 2019-11-26 | Suzhou Keda Technology Co., Ltd. | No-reference objective audio quality evaluation method, device and storage medium |
CN110619885A (en) * | 2019-08-15 | 2019-12-27 | Northwestern Polytechnical University | Speech enhancement method using generative adversarial network based on deep fully convolutional neural network |
- 2021-01-13: CN application CN202110044163.XA filed; granted as CN112364993B (Active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107463953A (en) * | 2017-07-21 | 2017-12-12 | Shanghai Jiao Tong University | Image classification method and system based on quality embedding with noisy labels |
CN110009025A (en) * | 2019-03-27 | 2019-07-12 | Henan University of Technology | Semi-supervised additive noise autoencoder for speech lie detection |
CN109977212A (en) * | 2019-03-28 | 2019-07-05 | Graduate School at Shenzhen, Tsinghua University | Reply content generation method for dialogue robot and terminal device |
CN110619885A (en) * | 2019-08-15 | 2019-12-27 | Northwestern Polytechnical University | Speech enhancement method using generative adversarial network based on deep fully convolutional neural network |
CN110503981A (en) * | 2019-08-26 | 2019-11-26 | Suzhou Keda Technology Co., Ltd. | No-reference objective audio quality evaluation method, device and storage medium |
Non-Patent Citations (2)
Title |
---|
MUN S et al.: "Deep neural network based learning and transferring mid-level audio features for acoustic scene classification", 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
XIA Qing et al.: "Research progress of digital geometry processing and analysis based on deep learning", Journal of Computer Research and Development *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114512136A (en) * | 2022-03-18 | 2022-05-17 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Model training method, audio processing method, device, apparatus, storage medium, and program |
CN114512136B (en) | 2022-03-18 | 2023-09-26 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Model training method, audio processing method, device, equipment, storage medium and program |
CN116074150A (en) * | 2023-03-02 | 2023-05-05 | Guangdong Haobote Technology Co., Ltd. | Switch control method and device for smart home, and smart home |
CN116074150B (en) | 2023-03-02 | 2023-06-09 | Guangdong Haobote Technology Co., Ltd. | Switch control method and device for smart home, and smart home |
Also Published As
Publication number | Publication date |
---|---|
CN112364993B (en) | 2021-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Matsubara et al. | Head network distillation: Splitting distilled deep neural networks for resource-constrained edge computing systems | |
CN112435656B (en) | Model training method, voice recognition method, device, equipment and storage medium | |
CN112712813B (en) | Voice processing method, device, equipment and storage medium | |
CN109523014B (en) | News comment automatic generation method and system based on generative adversarial network model | |
CN110119447B (en) | Self-coding neural network processing method, device, computer equipment and storage medium | |
CN112214604A (en) | Training method of text classification model, text classification method, device and equipment | |
CN112364993B (en) | Model joint training method and device, computer equipment and storage medium | |
CN112331183B (en) | Non-parallel corpus voice conversion method and system based on autoregressive network | |
CN111428771B (en) | Video scene classification method and device and computer-readable storage medium | |
CN111583911B (en) | Speech recognition method, device, terminal and medium based on label smoothing | |
CN112735389A (en) | Voice training method, device and equipment based on deep learning and storage medium | |
CN112365885A (en) | Training method and device of wake-up model and computer equipment | |
CN110069611B (en) | Topic-enhanced chat robot reply generation method and device | |
CN111598213A (en) | Network training method, data identification method, device, equipment and medium | |
CN113128232A (en) | Named entity recognition method based on ALBERT and multi-word information embedding | |
CN107463928A (en) | Word sequence error correction algorithm, system and equipment based on OCR and bidirectional LSTM | |
CN112149651A (en) | Facial expression recognition method, device and equipment based on deep learning | |
CN114360502A (en) | Processing method of voice recognition model, voice recognition method and device | |
CN113052257A (en) | Deep reinforcement learning method and device based on visual converter | |
CN113626610A (en) | Knowledge graph embedding method and device, computer equipment and storage medium | |
CN110955765A (en) | Corpus construction method and apparatus of intelligent assistant, computer device and storage medium | |
WO2022246986A1 (en) | Data processing method, apparatus and device, and computer-readable storage medium | |
WO2022121188A1 (en) | Keyword detection method and apparatus, device and storage medium | |
Naik et al. | Indian monsoon rainfall classification and prediction using robust back propagation artificial neural network | |
CN109033413B (en) | Neural network-based demand document and service document matching method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right | ||
Denomination of invention: Model joint training method, device, computer equipment, and storage medium
Granted publication date: 20210430
Pledgee: Shenzhen Shunshui Incubation Management Co.,Ltd.
Pledgor: SHENZHEN YOUJIE ZHIXIN TECHNOLOGY Co.,Ltd.
Registration number: Y2024980029366