
CN113129868B - Method for obtaining speech recognition model, speech recognition method and corresponding device - Google Patents

Method for obtaining speech recognition model, speech recognition method and corresponding device

Info

Publication number
CN113129868B
Authority
CN
China
Prior art keywords
frame
sequence
speech recognition
recognition model
training
Prior art date
Legal status
Active
Application number
CN202110270662.0A
Other languages
Chinese (zh)
Other versions
CN113129868A (en)
Inventor
梁鸣心
付晓寅
白锦峰
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110270662.0A priority Critical patent/CN113129868B/en
Publication of CN113129868A publication Critical patent/CN113129868A/en
Application granted granted Critical
Publication of CN113129868B publication Critical patent/CN113129868B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L15/26 Speech to text systems
    • G10L15/28 Constructional details of speech recognition systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure discloses a method for obtaining a speech recognition model, a speech recognition method and corresponding devices, and relates to artificial intelligence technologies such as intelligent speech and deep learning. The specific implementation scheme is as follows: acquiring training data, wherein the training data comprises a speech frame sequence and a text label corresponding to the speech frame sequence, and the speech frame sequence comprises more than one speech frame; performing frame splicing processing on each frame in the speech frame sequence to obtain a frame splicing sequence; down-sampling the frame splicing sequence to obtain a frame skipping sequence; training by utilizing the frame splicing sequence and the corresponding text labels to obtain a first speech recognition model; and training by utilizing the frame skipping sequence and the corresponding text labels based on the first speech recognition model to obtain a second speech recognition model, wherein the second speech recognition model is used for speech recognition. The method and the device can effectively reduce the amount of computation of speech recognition.

Description

Method for obtaining speech recognition model, speech recognition method and corresponding device
Technical Field
The present disclosure relates to the field of computer application technology, and more particularly to the field of artificial intelligence techniques such as intelligent speech and deep learning.
Background
Automatic speech recognition is an important component of human-computer interaction systems, and current mainstream speech recognition solutions are based on deep learning. Deep-learning-based speech recognition models have high requirements on computing resources, and the amount of computation is a key factor determining the model size and the real-time rate. In particular, the offline side has more strict requirements on computing resources, so compressing the amount of computation has become an urgent problem to be solved.
Disclosure of Invention
The present disclosure provides a method of obtaining a speech recognition model, a method of speech recognition and a corresponding apparatus, so as to reduce the amount of computation.
According to a first aspect of the present disclosure, there is provided a method of obtaining a speech recognition model, comprising:
acquiring training data, wherein the training data comprises a voice frame sequence and a text label corresponding to the voice frame sequence, and the voice frame sequence comprises more than one voice frame;
performing frame splicing processing on each frame in the voice frame sequence to obtain a frame splicing sequence;
down-sampling the frame splicing sequence to obtain a frame skipping sequence;
training by utilizing the frame splicing sequence and the corresponding text labels to obtain a first speech recognition model;
and training by utilizing the frame skipping sequence and the corresponding text labels to obtain a second speech recognition model based on the first speech recognition model.
According to a second aspect of the present disclosure, there is provided a method of speech recognition, comprising:
acquiring a voice frame sequence;
performing frame splicing processing on each frame in the voice frame sequence to obtain a frame splicing sequence;
down-sampling the frame splicing sequence to obtain a frame skipping sequence;
inputting the frame skipping sequence into a second voice recognition model, and acquiring a text recognition result output by the second voice recognition model;
wherein the second speech recognition model is pre-trained by the method of obtaining a speech recognition model as described above.
According to a third aspect of the present disclosure, there is provided an apparatus for obtaining a speech recognition model, comprising:
the device comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring training data, the training data comprises a voice frame sequence and a text label corresponding to the voice frame sequence, and the voice frame sequence comprises more than one voice frame;
the first frame splicing unit is used for splicing frames of each frame in the voice frame sequence to obtain a spliced frame sequence;
the first frame skipping unit is used for performing down sampling on the splicing frame sequence to obtain a frame skipping sequence;
the model training unit is used for training by utilizing the splicing frame sequence and the corresponding text labels to obtain a first voice recognition model; and training by utilizing the frame skipping sequence and the corresponding text labels to obtain a second speech recognition model based on the first speech recognition model.
According to a fourth aspect of the present disclosure, there is provided an apparatus for speech recognition, comprising:
a second obtaining unit configured to obtain a sequence of voice frames;
the second frame splicing unit is used for splicing frames of each frame in the voice frame sequence to obtain a spliced frame sequence;
the second frame skipping unit is used for performing down sampling on the splicing frame sequence to obtain a frame skipping sequence;
the result acquisition unit is used for inputting the frame skipping sequence into a second voice recognition model and acquiring a text recognition result output by the second voice recognition model;
wherein the second speech recognition model is pre-trained by the apparatus for obtaining speech recognition models as described above.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
According to the technical scheme, the frame skipping sequence is obtained by performing frame splicing processing on the voice frame sequence and then performing down-sampling, so that the frame rate to be processed by the voice recognition model is effectively reduced, and the calculation amount of voice recognition is reduced.
And on the basis of a model obtained by training through a frame splicing sequence, a mode of obtaining a voice recognition model by training through a frame skipping sequence is utilized, so that the stability of model training can be ensured, and the performance loss is reduced.
It should be understood that what is described in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a method for obtaining a speech recognition model according to an embodiment of the present disclosure;
fig. 2 is an example diagram of a splicing frame sequence and a frame skipping sequence provided by an embodiment of the present disclosure;
FIG. 3a is a schematic diagram of training a first speech recognition model according to an embodiment of the present disclosure;
FIG. 3b is a schematic diagram of training a third speech recognition model according to an embodiment of the present disclosure;
FIG. 3c is a schematic diagram of training a second speech recognition model according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of a speech recognition method provided by an embodiment of the present disclosure;
FIG. 5 is a block diagram of an apparatus for obtaining a speech recognition model according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of a speech recognition device provided in an embodiment of the present disclosure;
FIG. 7 is a block diagram of an electronic device used to implement an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In speech recognition, short-time spectral analysis is usually required to extract speech features, and the spectral analysis is usually performed with a fixed frame shift. The frame shift refers to the displacement between two adjacent frames, and the frame rate is the number of frames per second; for example, with a frame shift of 10 ms, the frame rate is 100 frames per second. The frame rate determines the temporal resolution of the features and the number of computations per second, so the key to compressing the amount of computation is to reduce the frame rate. The method and apparatus provided by the present disclosure are based on this idea of reducing the frame rate. For speech recognition, the key and basis is the speech recognition model; the process of obtaining the speech recognition model and the process of speech recognition are described in detail below with reference to embodiments.
Fig. 1 is a flowchart of a method for obtaining a speech recognition model according to an embodiment of the present disclosure. The method may be executed by an apparatus for obtaining the speech recognition model, which may be located at a server or at a computer terminal with relatively strong computing power. The apparatus may be in the form of an application, or a functional unit such as a plug-in or Software Development Kit (SDK) within an application, which is not particularly limited in the embodiments of the present disclosure. As shown in fig. 1, the method mainly comprises the following steps:
in 101, training data is obtained, the training data including a speech frame sequence and a text label corresponding thereto, the speech frame sequence including more than one speech frame.
The training data used in the present disclosure may be a sequence of speech frames that has been annotated with text. The speech frame sequence consists of more than one speech frame obtained by framing speech. Framing adopts overlapped segmentation so that the speech frames transition smoothly and maintain continuity. The difference between the positions of two adjacent frames (e.g., the difference between their start positions) is called the frame shift. The framing process may employ currently mature techniques, which are not described in detail herein.
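As a small illustration of this overlapped framing, the following sketch assumes a 16 kHz waveform, a 25 ms frame length and a 10 ms frame shift; these values are common defaults and are not taken from the disclosure:

```python
import numpy as np

def frame_signal(wave: np.ndarray, sr: int = 16000,
                 frame_len_ms: float = 25.0, frame_shift_ms: float = 10.0) -> np.ndarray:
    """Split a waveform into overlapping frames (overlapped segmentation)."""
    frame_len = int(sr * frame_len_ms / 1000)      # e.g. 400 samples per frame
    frame_shift = int(sr * frame_shift_ms / 1000)  # e.g. 160 samples -> 100 frames per second
    num_frames = 1 + max(0, (len(wave) - frame_len) // frame_shift)
    frames = np.stack([wave[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(num_frames)])
    return frames  # shape: (num_frames, frame_len)
```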
At 102, a frame splicing process is performed on each frame in the speech frame sequence to obtain a frame splicing sequence.
If the speech frame sequence is directly down-sampled in order to reduce the frame rate, the short-time stationarity of the speech features is damaged, the trained model easily diverges, and the performance loss is large. In order to ensure model performance and training stability, in the present disclosure each frame in the speech frame sequence is first subjected to frame splicing, and the resulting frame splicing sequence is then down-sampled.
The frame splicing process may include: merging each frame in the speech frame sequence with the m preceding frames and n following frames adjacent to it, respectively, to obtain each frame in the frame splicing sequence, where m and n are preset positive integers. Too large a value of m may introduce redundancy, and too large a value of n may introduce large latency, while smaller values of m and n lead to a larger amount of computation; therefore, m and n need to be set reasonably according to the requirements of the actual scenario. Empirical or experimental values may be used; for example, both m and n may be 2.
As shown in FIG. 2, frame splicing is performed on the speech frame sequence. For the 1st speech frame, it is merged with its 2 preceding and 2 following frames; since the 2 preceding positions are blank, speech frames 1-3 are merged to form the 1st frame of the frame splicing sequence, denoted X1. For the 2nd speech frame, speech frames 1-4 are merged (with 1 blank position in front) to form the 2nd frame of the frame splicing sequence, denoted X2. For the 3rd speech frame, speech frames 1-5 are merged to form the 3rd frame, denoted X3. By analogy, the frame splicing sequence {X1, X2, X3, …, XT} is obtained, where T is the number of speech frames in the speech frame sequence.
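A minimal sketch of this frame splicing step, assuming per-frame acoustic feature vectors, m = n = 2 as in the example above, and zero vectors standing in for the blank boundary positions:

```python
import numpy as np

def splice_frames(features: np.ndarray, m: int = 2, n: int = 2) -> np.ndarray:
    """Merge each frame with its m preceding and n following frames.

    features: (T, D) per-frame acoustic features; returns (T, (m + 1 + n) * D).
    Boundary positions with no neighbor are padded with zeros ("blank" frames).
    """
    T, D = features.shape
    padded = np.pad(features, ((m, n), (0, 0)), mode="constant")
    # Row t of the result is the concatenation of frames t-m, ..., t, ..., t+n.
    spliced = np.concatenate([padded[i:i + T] for i in range(m + 1 + n)], axis=1)
    return spliced
```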
At 103, the sequence of mosaic frames is down-sampled to obtain a sequence of frame skipping.
The sampling rate adopted when down-sampling the frame splicing sequence can also be set reasonably according to the requirements of the actual scenario, and empirical or experimental values may be used. Taking a sampling rate of 1/2 as an example, as shown in fig. 2, one frame is kept out of every 2 frames, i.e., the 1st, 3rd, 5th, … frames are sampled from the frame splicing sequence to form the frame skipping sequence. Compared with directly down-sampling the speech frames, splicing frames first and then down-sampling greatly reduces the loss of feature information. As shown in fig. 2, the frame shift of the original speech frames is 10 ms; although the frame shift increases from 10 ms to 20 ms and the frame rate drops to 1/2, frames in the frame skipping sequence still overlap to a certain extent, which preserves the continuity between features.
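Down-sampling the frame splicing sequence then reduces to a simple stride over the spliced frames; a sketch with the 1/2 sampling rate of the example above, continuing the hypothetical splice_frames helper:

```python
def skip_frames(spliced, rate: int = 2):
    """Keep one spliced frame out of every `rate` frames (the 1st, 3rd, 5th, ...)."""
    return spliced[::rate]

# With a 10 ms frame shift, keeping every other spliced frame raises the effective
# shift to 20 ms and halves the frame rate, while the m/n context kept by splicing
# preserves overlap, and hence continuity, between the remaining frames.
```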
At 104, a first speech recognition model is trained using the sequence of spellings and the corresponding text labels.
The speech recognition model referred to in the present disclosure adopts an encoder-decoder structure, such as the SMLTA (Streaming Multi-Layer Truncated Attention) or LAS (Listen, Attend and Spell) system.
The encoder and the decoder may be composed of LSTM (Long Short-Term Memory) networks. The input to the encoder consists of acoustic features, such as Fbank (filter bank) or MFCC (Mel-frequency cepstral coefficient) features, extracted from each frame. The encoder obtains a hidden vector for each frame from the acoustic features of that frame; the hidden vector sequence formed by these hidden vectors is input into the decoder, which outputs text. The text is expressed in terms of acoustic modeling units, which may be, for example, words, syllables, or phones.
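As a rough, deliberately simplified illustration of such a structure (no attention mechanism or streaming truncation, so this is not the disclosure's actual SMLTA/LAS implementation), an LSTM encoder that maps per-frame acoustic features to hidden vectors and an LSTM decoder that emits scores over modeling units might be sketched in PyTorch as follows; all names and dimensions here are assumptions for illustration:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, num_layers: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, feat_dim) -> hidden vectors h: (batch, T, hidden_dim)
        h, _ = self.lstm(feats)
        return h

class Decoder(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, T', hidden_dim) -> per-step scores over acoustic modeling units
        d, _ = self.lstm(h)
        return self.out(d)  # (batch, T', vocab_size)
```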
At 105, based on the first speech recognition model, a second speech recognition model is obtained by training with a frame skipping sequence and a corresponding text label.
If the speech recognition model is trained directly using the frame skipping sequence after the above step 103, the model training may be unstable, difficult to converge and have performance that is greatly different from the model trained with the original speech frame sequence. Therefore, in this embodiment, a multi-stage training manner is adopted, that is, the frame splicing sequence is used to perform model training in step 104, and then the frame skipping sequence is further used to perform training, so that the finally obtained model can smoothly converge and reduce the performance loss.
The above steps 104 and 105 are described in detail with reference to the following embodiments.
First, in step 104, a preliminary model, referred to herein as the "first speech recognition model", can be obtained by training with the frame splicing sequence and the corresponding text labels. The specific training procedure is shown in FIG. 3a: the acoustic features {X1, X2, X3, …, XT} of the frame splicing sequence are used as the input of the encoder, the hidden vector sequence {h1, h2, h3, …, hT} output by the encoder is used as the input of the decoder, the corresponding text label is used as the target output of the decoder, and the encoder and the decoder are trained to obtain the first speech recognition model.
The training process described above uses a training goal that minimizes the difference between the recognition result output by the decoder and the text label. A loss function can be constructed based on the training target, and the values of the loss function are used for updating the model parameters.
As one of the preferred embodiments, two loss functions can be designed:
the first Loss function may take the form of, for example, Loss Ce (Cross entropy), whose goal is to maximize the posterior probability of the model output text labels, for example:
P(L|X) = Π P(lt | X, l1:t-1)
where X denotes the input sequence of the model, L denotes the output sequence of the model, and L = [l1, l2, …, lN].
The second loss function may be implemented, for example, as Loss CTC (Connectionist Temporal Classification), whose goal is to maximize the path probability of the text labels output by the model; a training task using Loss CTC requires the addition of a fully connected layer. The hidden vector sequence is input into the fully connected layer to obtain spike predictions, whose result is {y1, y2, y3, …, yM}; these do not require alignment at every frame, only consistency on the path (i.e., in the resulting text sequence). That is, alignment at every frame is not required as long as the predicted text result is correct.
For example: p (Y | X) ═ Σπ∈φ(Y')P(π|X)
Where Y represents the sequence of model outputs and φ (Y') represents all possible text sequences of outputs.
The first loss function, such as Loss Ce, operates on each frame of data, minimizing the training error per frame. The second loss function, such as Loss CTC, is sequence-based: it does not need to consider the alignment of each frame with the labels; as long as the path is correct, the training error of the entire sequence is minimized.
In the process of training the encoder and the decoder, the model parameters of the encoder and the decoder may be updated as a whole using the first loss function, and only the model parameters of the decoder may be updated using the second loss function. The first speech recognition model is obtained after the training end condition is reached. The training end condition may be, for example, convergence of the loss function, the value of the loss function being less than or equal to a preset loss function threshold, the number of iterations reaching a preset threshold, or the like.
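A rough sketch of how the two losses might be computed together, assuming the simplified Encoder/Decoder above plus a hypothetical fully connected CTC branch fc_ctc (e.g., nn.Linear from the hidden dimension to the vocabulary size); per-frame targets for the cross-entropy branch are assumed here because the text describes Loss Ce as a per-frame criterion:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ce_loss = nn.CrossEntropyLoss()   # first loss: per-frame cross entropy
ctc_loss = nn.CTCLoss(blank=0)    # second loss: path probability over the whole sequence

def training_step(encoder, decoder, fc_ctc, feats, frame_labels, token_labels, token_lens):
    h = encoder(feats)                                   # (B, T, H) hidden vectors
    logits = decoder(h)                                  # (B, T, V) per-frame predictions
    loss_ce = ce_loss(logits.reshape(-1, logits.size(-1)),
                      frame_labels.reshape(-1))          # frame_labels: (B, T) LongTensor

    ctc_logp = F.log_softmax(fc_ctc(h), dim=-1)          # (B, T, V) spike predictions
    in_lens = torch.full((feats.size(0),), h.size(1), dtype=torch.long)
    loss_ctc = ctc_loss(ctc_logp.transpose(0, 1),        # CTCLoss expects (T, B, V)
                        token_labels, in_lens, token_lens)
    return loss_ce, loss_ctc
```

Which sub-network each loss actually updates, as described in the surrounding text, can then be controlled by building an optimizer only over the corresponding parameter group, or by detaching the gradients of the other part.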
In an implementation manner of step 105, the second speech recognition model may be obtained by training with a frame skipping sequence and a corresponding text label directly based on the first speech recognition model. However, as a preferred embodiment, a two-stage training mode may be adopted: the first stage performs frame skipping training of the decoder, and the second stage performs frame skipping training of the encoder and the decoder.
When frame skipping training of a decoder is carried out in the first stage, based on a first speech recognition model, acoustic features of a splicing frame sequence are used as input of an encoder, and a frame skipping hidden vector sequence is obtained by down-sampling a hidden vector sequence output by the encoder; and taking the frame skipping hidden vector sequence as the input of a decoder, taking the corresponding text label as the target output of the decoder, and continuing to train the encoder and the decoder to obtain a third speech recognition model.
As shown in fig. 3b, the encoder and decoder pre-trained at this stage are those of the first speech recognition model obtained in step 104. During training at this stage, the acoustic features {X1, X2, X3, …, XT} of the frame splicing sequence are used as the input of the encoder, which outputs the hidden vector sequence {h1, h2, h3, …, hT}. The hidden vector sequence is down-sampled, for example at a rate of 1/2, to obtain the frame-skipping hidden vector sequence {h1, h3, h5, …, hT-1}. The frame-skipping hidden vector sequence is input into the decoder, which outputs the text sequence {l1, l2, …, lN}. In this training process, the input and output of the encoder are unchanged compared with the training process of step 104, and mainly the decoder portion is trained.
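A minimal sketch of this first-stage forward pass, reusing the simplified modules above (the 1/2 down-sampling rate is only an example):

```python
def stage1_forward(encoder, decoder, spliced_feats, rate: int = 2):
    h = encoder(spliced_feats)    # full-rate hidden vectors {h1, h2, ..., hT}
    h_skip = h[:, ::rate, :]      # frame-skipping hidden vectors {h1, h3, h5, ...}
    return decoder(h_skip)        # the decoder is trained on the down-sampled sequence
```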
A first Loss function such as Loss Ce may also be employed to update the model parameters of the encoder and decoder with the goal of maximizing the a posteriori probability of the model output text labels.
A second loss function, such as Loss CTC, is employed only to update the model parameters of the encoder. Its goal is to maximize the path probability of the text labels output by the model, and a training task using Loss CTC requires the addition of a fully connected layer. The hidden vector sequence is input into the fully connected layer to obtain spike predictions, whose result is {y1, y2, y3, …, yM}; these do not require alignment at every frame, only consistency on the path (i.e., in the resulting text sequence). That is, alignment at every frame is not required as long as the predicted text result is correct.
During this first stage of training, the Loss CTC is used to maintain the performance of the encoder, while the decoder is the part mainly being trained. The third speech recognition model is obtained after the training end condition is reached. The training end condition may be, for example, convergence of the loss function, the value of the loss function being less than or equal to a preset loss function threshold, the number of iterations reaching a preset threshold, or the like.
In the second stage, when frame skipping training of the encoder and the decoder is performed, based on the third speech recognition model, the acoustic features of the frame skipping sequence are used as the input of the encoder, the hidden vector sequence output by the encoder is used as the input of the decoder, the corresponding text labels are used as the target output of the decoder, and the encoder and the decoder are continuously trained to obtain the second speech recognition model.
As shown in fig. 3c, before this stage of training the encoder and decoder are those of the third speech recognition model obtained in the first stage. During training at this stage, the acoustic features {X1, X3, X5, …, XT-1} of the frame skipping sequence are used as the input of the encoder, which outputs the hidden vector sequence {h1, h3, h5, …, hT-1}. The hidden vector sequence is input into the decoder, which outputs the text sequence {l1, l2, …, lN}. In this training process, the input and output of the decoder are unchanged compared with the first training stage, and mainly the encoder part is trained.
A first loss function such as Loss Ce may likewise be used to update the model parameters of the encoder and decoder, and a second loss function such as Loss CTC to update only the model parameters of the encoder. In this second stage of training, Loss Ce is used to preserve the performance of the decoder, while the encoder is mainly trained, assisted by the Loss CTC. The second speech recognition model is obtained after the training end condition is reached. The training end condition may be, for example, convergence of the loss function, the value of the loss function being less than or equal to a preset loss function threshold, the number of iterations reaching a preset threshold, or the like.
It can be seen that in the above two-stage training mode, in each training stage the input sequence of one of the encoder and the decoder is kept unchanged so as to guide the training of the other part. This "transitional" training mode can ensure the stability of model training and minimize the performance loss of the resulting model.
The resulting second speech recognition model comprises the encoder and the decoder. The fully connected layer involved in the Loss CTC during training is only used to assist training of the speech recognition model, so the second speech recognition model may not include the fully connected layer in an actual speech recognition scenario.
On the basis of the second speech recognition model obtained by the above training, speech recognition can be performed using the second speech recognition model. Fig. 4 is a flowchart of a speech recognition method according to an embodiment of the present disclosure. The method may be executed by a speech recognition apparatus, which may be located at a server or at a computer terminal with relatively strong computing power. The apparatus may be in the form of an application, or a plug-in, SDK, or other functional unit within an application, which is not particularly limited in this disclosure. As shown in fig. 4, the method may include the following steps:
in 401, a sequence of speech frames is obtained.
The speech frame sequence involved in this step may be more than one speech frame obtained after framing the speech to be recognized.
At 402, a frame splicing process is performed on each frame in the speech frame sequence to obtain a frame splicing sequence.
At 403, the sequence of mosaic frames is down-sampled to obtain a sequence of frame jumps.
The specific manner of the frame splicing processing and the frame skipping processing may refer to the description about steps 102 and 103 in the embodiment shown in fig. 1, and is not described herein again. The frame splicing processing and frame skipping processing adopted in the speech recognition process need to be consistent with the mode adopted by the speech recognition model in the training process.
At 404, the frame skipping sequence is input into the second speech recognition model, and a text recognition result output by the second speech recognition model is obtained.
In this step, after the acoustic features of the frame skipping sequence are input into the encoder of the second speech recognition model, the encoder outputs a hidden vector sequence. The hidden vector sequence is used as the input of the decoder of the second speech recognition model, and the decoder outputs the text recognition result of the speech frame sequence. The text recognition result is expressed in terms of acoustic modeling units, which may be, for example, words, syllables, or phones.
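Putting the pieces together, the recognition path could be sketched as follows; frame_signal, splice_frames and skip_frames are the hypothetical helpers sketched in the training description above, extract_features stands for any per-frame Fbank/MFCC front end (an assumption, not an API from the disclosure), and the greedy readout at the end is only for illustration:

```python
import torch

def recognize(wave, encoder, decoder, sr=16000, m=2, n=2, rate=2):
    feats = extract_features(frame_signal(wave, sr))     # (T, D) per-frame acoustic features
    spliced = splice_frames(feats, m, n)                 # frame splicing sequence
    skipped = skip_frames(spliced, rate)                 # frame skipping sequence
    x = torch.from_numpy(skipped).float().unsqueeze(0)   # (1, T', D')
    with torch.no_grad():
        h = encoder(x)            # hidden vector sequence from the second model's encoder
        logits = decoder(h)       # per-step scores over acoustic modeling units
    return logits.argmax(dim=-1)  # greedy readout of the modeling units
```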
In the speech recognition process, the frame skipping sequence has a lower frame rate than the original speech frame sequence, so the amount of computation in speech recognition is effectively reduced. Although frame splicing increases the input feature dimension, this increase only affects the first-layer network of the encoder; because the time dimension is sufficiently compressed, the total amount of computation of the encoder and the decoder is effectively reduced.
For online speech recognition services, lowering the frame rate saves computing resources and reduces response time. For offline speech recognition, lowering the frame rate makes it possible to accommodate models with more parameters, improve recognition accuracy, reduce power consumption, and so on. Therefore, in a speech recognition system, low-frame-rate optimization plays a significant role in user experience, service cost and product performance.
The above is a detailed description of the method provided by the present disclosure, and the following is a detailed description of the apparatus provided by the present disclosure with reference to the embodiments.
Fig. 5 is a block diagram of an apparatus for obtaining a speech recognition model according to an embodiment of the disclosure, and as shown in fig. 5, the apparatus 500 may include: a first obtaining unit 510, a first frame splicing unit 520, a first frame skipping unit 530, and a model training unit 540. The main functions of each component unit are as follows:
a first obtaining unit 510, configured to obtain training data, where the training data includes a speech frame sequence and a text label corresponding to the speech frame sequence, and the speech frame sequence includes more than one speech frame.
The first frame splicing unit 520 is configured to perform frame splicing processing on each frame in the speech frame sequence to obtain a frame splicing sequence.
Specifically, the first frame splicing unit 520 may merge each frame in the speech frame sequence with the m preceding frames and n following frames adjacent to it, respectively, to obtain each frame in the frame splicing sequence, where m and n are preset positive integers. Too large a value of m may introduce redundancy, and too large a value of n may introduce large latency, while smaller values of m and n lead to a larger amount of computation; therefore, m and n need to be set reasonably according to the requirements of the actual scenario. Empirical or experimental values may be used; for example, both m and n may be 2.
A first frame skipping unit 530, configured to down-sample the sequence of the splicing frame to obtain a frame skipping sequence.
The sampling rate adopted when the sequence of the splicing frames is subjected to down-sampling can also be reasonably set according to the requirements of an actual scene, and empirical values or experimental values can be adopted.
A model training unit 540, configured to train to obtain a first speech recognition model by using the frame splicing sequence and the corresponding text labels; and training by utilizing the frame skipping sequence and the corresponding text label based on the first voice recognition model to obtain a second voice recognition model.
As one of the realizable manners, the model training unit 540 may include: a first training subunit 541, a second training subunit 542, and a third training subunit 543.
The first training subunit 541 is configured to train the encoder and the decoder to obtain a first speech recognition model including the encoder and the decoder, where the acoustic features of the frame splicing sequence are used as input of the encoder, a hidden vector sequence output by the encoder is used as input of the decoder, and a corresponding text label is used as target output of the decoder.
Wherein the encoder and decoder may be comprised of an LSTM network. The input to the encoder is actually an acoustic feature such as Fbank, MFCC, etc. extracted from each frame. The decoder outputs the recognition result text. The text embodies acoustic modeling units, which may be, for example, words, syllables, phones (phones), etc.
And the second training subunit 542 is configured to, based on the first speech recognition model, take the acoustic features of the frame splicing sequence as input of the encoder, perform down-sampling on the hidden vector sequence output by the encoder, take the frame skipping hidden vector sequence obtained after the down-sampling as input of the decoder, take the corresponding text label as target output of the decoder, and continue to train the encoder and the decoder to obtain a third speech recognition model.
And a third training subunit 543, configured to train, based on the third speech recognition model, by using the frame skipping sequence and the corresponding text label, to obtain a second speech recognition model.
As one of the realizable manners, the third training subunit 543 may use, based on the third speech recognition model, the acoustic feature of the frame skipping sequence as an input of the encoder, use the hidden vector sequence output by the encoder as an input of the decoder, use the corresponding text label as a target output of the decoder, and continue to train the encoder and the decoder to obtain the second speech recognition model.
As one of the preferred embodiments, two loss functions can be designed: the first loss function may take the form of, for example, Loss Ce (cross entropy), whose goal is to maximize the posterior probability of the text labels output by the model; the second loss function may take the form of, for example, Loss CTC, whose goal is to maximize the path probability of the text labels output by the model, and a training task using Loss CTC requires the addition of a fully connected layer. The hidden vector sequence is input into the fully connected layer to obtain spike predictions, which do not require alignment at every frame, only consistency on the path (i.e., in the resulting text sequence). That is, alignment at every frame is not required as long as the predicted text result is correct.
The first training subunit 541, the second training subunit 542, and the third training subunit 543 included in the model training unit 540 may, when training the encoder and the decoder, determine a first loss function and a second loss function, update the model parameters of the encoder and the decoder with the first loss function, and update the model parameters of the decoder with the second loss function.
Fig. 6 is a block diagram of a speech recognition apparatus provided in an embodiment of the present disclosure, and as shown in fig. 6, the apparatus 600 may include: a second obtaining unit 601, a second frame splicing unit 602, a second frame skipping unit 603, and a result obtaining unit 604. The main functions of each component unit are as follows:
a second obtaining unit 601, configured to obtain a sequence of voice frames.
The speech frame sequence related in this embodiment may be more than one speech frame obtained by framing the speech to be recognized.
The second frame splicing unit 602 is configured to perform frame splicing processing on each frame in the speech frame sequence to obtain a frame splicing sequence.
A second frame skipping unit 603, configured to down-sample the splicing frame sequence to obtain a frame skipping sequence.
The specific manner of the frame splicing processing and the frame skipping processing may refer to the related description in the embodiment of the apparatus shown in fig. 5, and is not described herein again. The frame splicing processing and frame skipping processing adopted in the speech recognition process need to be consistent with the mode adopted by the speech recognition model in the training process.
And a result obtaining unit 604, configured to input the frame skipping sequence into the second speech recognition model, and obtain a text recognition result output by the second speech recognition model. Wherein the second speech recognition model is pre-trained by the apparatus for obtaining speech recognition models as shown in fig. 5.
After the acoustic features of the frame skipping sequence are input into the encoder of the second speech recognition model, the encoder outputs a hidden vector sequence. The hidden vector sequence is used as the input of the decoder of the second speech recognition model, and the decoder outputs the text recognition result of the speech frame sequence. The text recognition result is expressed in terms of acoustic modeling units, which may be, for example, words, syllables, or phones.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
As shown in fig. 7, is a block diagram of an electronic device according to an embodiment of the disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM)702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 701 executes the respective methods and processes described above, such as a method of acquiring a speech recognition model or a speech recognition method. For example, in some embodiments, the method of obtaining a speech recognition model or the speech recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708.
In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the method of obtaining a speech recognition model or the speech recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the method of obtaining a speech recognition model or the speech recognition method.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that addresses the defects of high management difficulty and weak service expansibility in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (18)

1. A method of obtaining a speech recognition model, comprising:
acquiring training data, wherein the training data comprises a voice frame sequence and a text label corresponding to the voice frame sequence, and the voice frame sequence comprises more than one voice frame;
performing frame splicing processing on each frame in the voice frame sequence to obtain a frame splicing sequence;
down-sampling the frame splicing sequence to obtain a frame skipping sequence;
training by utilizing the frame splicing sequence and the corresponding text labels to obtain a first speech recognition model;
and training by utilizing the frame skipping sequence and the corresponding text labels to obtain a second speech recognition model based on the first speech recognition model, so as to obtain the second speech recognition model by utilizing the multi-stage training of the splicing frame sequence and the frame skipping sequence.
2. The method of claim 1, wherein the frame splicing processing of each frame in the sequence of speech frames to obtain a sequence of spliced frames comprises:
and respectively merging each frame in the voice frame sequence with the first m frames and the last n frames adjacent to each frame to obtain each frame in the splicing frame sequence, wherein m and n are preset positive integers.
3. The method of claim 1, wherein the training with the sequence of spellings and the corresponding text labels to derive a first speech recognition model comprises:
and taking the acoustic features of the splicing frame sequence as the input of an encoder, taking the implicit vector sequence output by the encoder as the input of a decoder, taking the corresponding text labels as the target output of the decoder, training the encoder and the decoder, and obtaining a first speech recognition model comprising the encoder and the decoder.
4. The method of claim 1, wherein training with the sequence of frame hops and corresponding text labels to derive a second speech recognition model based on the first speech recognition model comprises:
based on the first speech recognition model, taking the acoustic features of the splicing frame sequence as the input of an encoder, performing down-sampling on a hidden vector sequence output by the encoder, taking a frame-skipping hidden vector sequence obtained after down-sampling as the input of a decoder, taking a corresponding text label as the target output of the decoder, and continuing to train the encoder and the decoder to obtain a third speech recognition model;
and training by utilizing the frame skipping sequence and the corresponding text labels to obtain a second speech recognition model based on the third speech recognition model.
5. The method of claim 4, wherein training a second speech recognition model using the sequence of frame hops and corresponding text labels based on the third speech recognition model comprises:
and based on the third speech recognition model, taking the acoustic features of the frame skipping sequence as the input of the encoder, taking the hidden vector sequence output by the encoder as the input of the decoder, taking the corresponding text label as the target output of the decoder, and continuing to train the encoder and the decoder to obtain the second speech recognition model.
6. The method of claim 3, 4 or 5, wherein the training the encoder and decoder comprises:
determining a first loss function and a second loss function, wherein the first loss function aims at maximizing the posterior probability of outputting the text labels, and the second loss function aims at maximizing the sum of the path probabilities of outputting the text labels;
updating model parameters of the encoder and decoder with the first loss function, and updating model parameters of the decoder with the second loss function.
7. The method of claim 6, wherein the first loss function is a cross entropy loss function and the second loss function is a Connectionist Temporal Classification (CTC) loss function.
8. A method of speech recognition, comprising:
acquiring a voice frame sequence;
performing frame splicing processing on each frame in the voice frame sequence to obtain a frame splicing sequence;
down-sampling the frame splicing sequence to obtain a frame skipping sequence;
inputting the frame skipping sequence into a second voice recognition model, and acquiring a text recognition result output by the second voice recognition model;
wherein the second speech recognition model is pre-trained by the method of any one of claims 1 to 6.
9. An apparatus for obtaining a speech recognition model, comprising:
the device comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring training data, the training data comprises a voice frame sequence and a text label corresponding to the voice frame sequence, and the voice frame sequence comprises more than one voice frame;
the first frame splicing unit is used for splicing frames of each frame in the voice frame sequence to obtain a spliced frame sequence;
the first frame skipping unit is used for performing down sampling on the splicing frame sequence to obtain a frame skipping sequence;
the model training unit is used for training by utilizing the splicing frame sequence and the corresponding text labels to obtain a first voice recognition model; and training by utilizing the frame skipping sequence and the corresponding text labels to obtain a second speech recognition model based on the first speech recognition model, so as to obtain the second speech recognition model by utilizing the multi-stage training of the splicing frame sequence and the frame skipping sequence.
10. The apparatus according to claim 9, wherein the first frame splicing unit is specifically configured to merge each frame in the sequence of speech frames with its adjacent first m frames and last n frames, respectively, to obtain each frame in the sequence of frame splicing, where m and n are preset positive integers.
11. The apparatus of claim 9, wherein the model training unit comprises: and the first training subunit is used for taking the acoustic features of the splicing frame sequence as the input of an encoder, taking a hidden vector sequence output by the encoder as the input of a decoder, taking a corresponding text label as the target output of the decoder, and training the encoder and the decoder to obtain a first speech recognition model comprising the encoder and the decoder.
12. The apparatus of claim 9, wherein the model training unit comprises:
a second training subunit configured to, based on the first speech recognition model, use the acoustic features of the spliced frame sequence as the input of the encoder, down-sample the hidden vector sequence output by the encoder, use the frame skipping hidden vector sequence obtained by the down-sampling as the input of the decoder, use the corresponding text label as the target output of the decoder, and continue training the encoder and the decoder to obtain a third speech recognition model;
a third training subunit configured to, based on the third speech recognition model, perform training by using the frame skipping sequence and the corresponding text label to obtain a second speech recognition model.
13. The apparatus according to claim 12, wherein the third training subunit is specifically configured to, based on the third speech recognition model, use the acoustic features of the frame skipping sequence as the input of the encoder, use a hidden vector sequence output by the encoder as the input of the decoder, use a corresponding text label as the target output of the decoder, and continue training the encoder and the decoder to obtain the second speech recognition model.
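To make the staged arrangement of claims 11 to 13 concrete, the following minimal sketch contrasts where the down-sampling sits in the third and the second speech recognition models; encoder, decoder, the tensor shapes and the skip factor are assumed PyTorch-style placeholders, not the patented implementation.

# Illustrative sketch only; modules, shapes and the skip factor are hypothetical.
def forward_third_model(encoder, decoder, spliced_feats, labels, skip=3):
    """Intermediate stage (third speech recognition model): the encoder still reads the
    full spliced frame sequence, but its hidden vector sequence is down-sampled before
    being handed to the decoder."""
    hidden = encoder(spliced_feats)            # (batch, T, hidden_dim)
    skipped_hidden = hidden[:, ::skip, :]      # frame-skipping hidden vector sequence
    return decoder(skipped_hidden, labels)

def forward_second_model(encoder, decoder, skipped_feats, labels):
    """Final stage (second speech recognition model): the frame skipping has moved to the
    input side, so the encoder itself only reads the down-sampled spliced frames."""
    hidden = encoder(skipped_feats)            # (batch, T // skip, hidden_dim)
    return decoder(hidden, labels)

One reading of the intermediate stage is that it lets the decoder adapt to a frame-skipping hidden vector sequence while the encoder still sees every spliced frame, easing the final switch to frame-skipping input.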
14. The apparatus according to claim 11, 12 or 13, wherein the subunits comprised in the model training unit are, when training the encoder and the decoder, specifically configured to:
determining a first loss function and a second loss function, wherein the first loss function aims at maximizing the posterior probability of outputting the text labels, and the second loss function aims at maximizing the sum of the path probabilities of outputting the text labels;
updating model parameters of the encoder and decoder with the first loss function, and updating model parameters of the decoder with the second loss function.
15. The apparatus of claim 14, wherein the first loss function is a cross entropy loss function and the second loss function is a connectionist temporal classification (CTC) loss function.
16. An apparatus for speech recognition, comprising:
a second acquisition unit configured to acquire a speech frame sequence;
a second frame splicing unit configured to perform frame splicing on each frame in the speech frame sequence to obtain a spliced frame sequence;
a second frame skipping unit configured to down-sample the spliced frame sequence to obtain a frame skipping sequence;
a result acquisition unit configured to input the frame skipping sequence into a second speech recognition model and acquire a text recognition result output by the second speech recognition model;
wherein the second speech recognition model is pre-trained by the apparatus of any one of claims 9 to 15.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.

Priority Applications (1)

Application Number: CN202110270662.0A
Priority Date: 2021-03-12; Filing Date: 2021-03-12
Title: Method for obtaining speech recognition model, speech recognition method and corresponding device

Publications (2)

Publication Number / Publication Date
CN113129868A (en): 2021-07-16
CN113129868B (en): 2022-02-25

Family

Family ID: 76773083

Family Applications (1)

Application Number: CN202110270662.0A; Status: Active; Publication: CN113129868B (en)
Title: Method for obtaining speech recognition model, speech recognition method and corresponding device

Country Status (1)

Country: CN; Publications: 1 (CN113129868B (en))

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113793599B (en) * 2021-09-15 2023-09-29 北京百度网讯科技有限公司 Training method of voice recognition model, voice recognition method and device
CN113870846B (en) * 2021-09-27 2024-05-31 平安科技(深圳)有限公司 Speech recognition method, device and storage medium based on artificial intelligence
CN113889088B (en) * 2021-09-28 2022-07-15 北京百度网讯科技有限公司 Method and device for training speech recognition model, electronic equipment and storage medium
CN113689846B (en) * 2021-10-27 2022-02-08 深圳市友杰智新科技有限公司 Speech recognition model training method, device, computer equipment and storage medium
CN114242113B (en) * 2021-12-16 2023-08-08 北京百度网讯科技有限公司 Voice detection method, training device and electronic equipment
CN115101041A (en) * 2022-05-09 2022-09-23 北京百度网讯科技有限公司 Method and device for training speech synthesis and speech synthesis model
CN115910044B (en) * 2023-01-10 2023-06-30 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US7610205B2 (en) * 2002-02-12 2009-10-27 Dolby Laboratories Licensing Corporation High quality time-scaling and pitch-scaling of audio signals
CN105869624B (en) * 2016-03-29 2019-05-10 腾讯科技(深圳)有限公司 The construction method and device of tone decoding network in spoken digit recognition
CN109309764B (en) * 2017-07-28 2021-09-03 北京搜狗科技发展有限公司 Audio data processing method and device, electronic equipment and storage medium

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
CN110428818A (en) * 2019-08-09 2019-11-08 中国科学院自动化研究所 The multilingual speech recognition modeling of low-resource, audio recognition method
CN110648658A (en) * 2019-09-06 2020-01-03 北京达佳互联信息技术有限公司 Method and device for generating voice recognition model and electronic equipment
CN111415667A (en) * 2020-03-25 2020-07-14 极限元(杭州)智能科技股份有限公司 Stream-type end-to-end speech recognition model training and decoding method
CN111696526A (en) * 2020-06-22 2020-09-22 北京达佳互联信息技术有限公司 Method for generating voice recognition model, voice recognition method and device
CN111916067A (en) * 2020-07-27 2020-11-10 腾讯科技(深圳)有限公司 Training method and device of voice recognition model, electronic equipment and storage medium
CN112071308A (en) * 2020-09-11 2020-12-11 中山大学 Awakening word training method based on speech synthesis data enhancement
CN112183120A (en) * 2020-09-18 2021-01-05 北京字节跳动网络技术有限公司 Speech translation method, device, equipment and storage medium

Non-Patent Citations (2)

Piotr Kozierski et al., "Polish whispery speech recognition — Minimum sampling frequency", 2017 22nd International Conference on Methods and Models in Automation and Robotics (MMAR), 2017-09-21, entire document *
刘伟波 (Liu Weibo) et al., "基于双微阵列与卷积神经网络的语音识别方法" (Speech recognition method based on dual microphone arrays and a convolutional neural network), 计算机应用 (Journal of Computer Applications), 2019-11-30, entire document *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant