
CN113129868B - Method for obtaining speech recognition model, speech recognition method and corresponding device - Google Patents

Method for obtaining speech recognition model, speech recognition method and corresponding device

Info

Publication number
CN113129868B
Authority
CN
China
Prior art keywords
frame
sequence
speech recognition
recognition model
training
Prior art date
Legal status
Active
Application number
CN202110270662.0A
Other languages
Chinese (zh)
Other versions
CN113129868A (en)
Inventor
梁鸣心
付晓寅
白锦峰
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110270662.0A priority Critical patent/CN113129868B/en
Publication of CN113129868A publication Critical patent/CN113129868A/en
Application granted granted Critical
Publication of CN113129868B publication Critical patent/CN113129868B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L15/26 Speech to text systems
    • G10L15/28 Constructional details of speech recognition systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure discloses a method for obtaining a speech recognition model, a speech recognition method and corresponding devices, and relates to artificial intelligence technologies such as intelligent speech and deep learning. The specific implementation scheme is as follows: acquiring training data, wherein the training data comprises a speech frame sequence and a text label corresponding to the speech frame sequence, and the speech frame sequence comprises more than one speech frame; performing frame splicing processing on each frame in the speech frame sequence to obtain a frame splicing sequence; down-sampling the frame splicing sequence to obtain a frame skipping sequence; training by utilizing the frame splicing sequence and the corresponding text labels to obtain a first speech recognition model; and training by utilizing the frame skipping sequence and the corresponding text labels based on the first speech recognition model to obtain a second speech recognition model, wherein the second speech recognition model is used for speech recognition. The method and the device can effectively reduce the amount of computation of speech recognition.

Description

Method for obtaining speech recognition model, speech recognition method and corresponding device
Technical Field
The present disclosure relates to the field of computer application technology, and more particularly to the field of artificial intelligence techniques such as intelligent speech and deep learning.
Background
Automatic speech recognition is an important component of human-computer interaction systems, and current mainstream speech recognition solutions are based on deep learning. Deep-learning-based speech recognition models have high requirements on computing resources, and the amount of computation is a key factor determining the model size and the real-time rate. In particular, the offline side has more strict requirements on computing resources, so compressing the amount of computation has become an urgent problem to be solved.
Disclosure of Invention
The present disclosure provides a method of obtaining a speech recognition model, a method of speech recognition and a corresponding apparatus, so as to reduce the amount of computation.
According to a first aspect of the present disclosure, there is provided a method of obtaining a speech recognition model, comprising:
acquiring training data, wherein the training data comprises a voice frame sequence and a text label corresponding to the voice frame sequence, and the voice frame sequence comprises more than one voice frame;
performing frame splicing processing on each frame in the voice frame sequence to obtain a frame splicing sequence;
down-sampling the frame splicing sequence to obtain a frame skipping sequence;
training by utilizing the frame splicing sequence and the corresponding text labels to obtain a first speech recognition model;
and training by utilizing the frame skipping sequence and the corresponding text labels to obtain a second speech recognition model based on the first speech recognition model.
According to a second aspect of the present disclosure, there is provided a method of speech recognition, comprising:
acquiring a voice frame sequence;
performing frame splicing processing on each frame in the voice frame sequence to obtain a frame splicing sequence;
down-sampling the frame splicing sequence to obtain a frame skipping sequence;
inputting the frame skipping sequence into a second voice recognition model, and acquiring a text recognition result output by the second voice recognition model;
wherein the second speech recognition model is pre-trained by the method of obtaining a speech recognition model as described above.
According to a third aspect of the present disclosure, there is provided an apparatus for obtaining a speech recognition model, comprising:
the device comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring training data, the training data comprises a voice frame sequence and a text label corresponding to the voice frame sequence, and the voice frame sequence comprises more than one voice frame;
the first frame splicing unit is used for splicing frames of each frame in the voice frame sequence to obtain a spliced frame sequence;
the first frame skipping unit is used for performing down sampling on the splicing frame sequence to obtain a frame skipping sequence;
the model training unit is used for training by utilizing the splicing frame sequence and the corresponding text labels to obtain a first voice recognition model; and training by utilizing the frame skipping sequence and the corresponding text labels to obtain a second speech recognition model based on the first speech recognition model.
According to a fourth aspect of the present disclosure, there is provided an apparatus for speech recognition, comprising:
a second obtaining unit configured to obtain a sequence of voice frames;
the second frame splicing unit is used for splicing frames of each frame in the voice frame sequence to obtain a spliced frame sequence;
the second frame skipping unit is used for performing down sampling on the splicing frame sequence to obtain a frame skipping sequence;
the result acquisition unit is used for inputting the frame skipping sequence into a second voice recognition model and acquiring a text recognition result output by the second voice recognition model;
wherein the second speech recognition model is pre-trained by the apparatus for obtaining speech recognition models as described above.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
According to the technical scheme, the frame skipping sequence is obtained by performing frame splicing processing on the voice frame sequence and then performing down-sampling, so that the frame rate to be processed by the voice recognition model is effectively reduced, and the calculation amount of voice recognition is reduced.
And on the basis of a model obtained by training through a frame splicing sequence, a mode of obtaining a voice recognition model by training through a frame skipping sequence is utilized, so that the stability of model training can be ensured, and the performance loss is reduced.
It should be understood that what is described in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a method for obtaining a speech recognition model according to an embodiment of the present disclosure;
fig. 2 is an example diagram of a splicing frame sequence and a frame skipping sequence provided by an embodiment of the present disclosure;
FIG. 3a is a schematic diagram of training a first speech recognition model according to an embodiment of the present disclosure;
FIG. 3b is a schematic diagram of training a third speech recognition model according to an embodiment of the present disclosure;
FIG. 3c is a schematic diagram of training a second speech recognition model according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of a speech recognition method provided by an embodiment of the present disclosure;
FIG. 5 is a block diagram of an apparatus for obtaining a speech recognition model according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of a speech recognition device provided in an embodiment of the present disclosure;
FIG. 7 is a block diagram of an electronic device used to implement an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In speech recognition, short-time spectral analysis is usually required to extract speech features, and the spectral analysis is usually performed with a fixed frame shift. The frame shift refers to the displacement between two adjacent frames, and the frame rate is the number of frames per second; for example, with a frame shift of 10 ms, the frame rate is 100 frames per second. The frame rate determines the temporal resolution of the features and the number of computations per second, so the key to compressing the amount of computation is to reduce the frame rate. The method and apparatus provided by the present disclosure are based on this idea of reducing the frame rate. For speech recognition, the key and basis is the speech recognition model; the process of obtaining the speech recognition model and the process of speech recognition are described in detail below with reference to embodiments.
Fig. 1 is a flowchart of a method for obtaining a speech recognition model according to an embodiment of the present disclosure. The method may be executed by an apparatus for obtaining the speech recognition model, which may be located at a server or at a computer terminal with relatively strong computing power. The apparatus may be in the form of an application, or a functional unit such as a plug-in or Software Development Kit (SDK) within an application, which is not particularly limited in the embodiments of the present disclosure. As shown in fig. 1, the method mainly comprises the following steps:
in 101, training data is obtained, the training data including a speech frame sequence and a text label corresponding thereto, the speech frame sequence including more than one speech frame.
The training data used in the present disclosure may be a sequence of speech frames that has been annotated with text. The speech frame sequence consists of more than one speech frame obtained by framing speech. Framing adopts overlapped segmentation so that the speech frames transition smoothly and maintain continuity. The difference between the positions of two adjacent frames (e.g., the difference between their start positions) is called the frame shift. The framing process may employ currently mature techniques, which are not described in detail herein.
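As a small illustration of this overlapped framing, the following sketch assumes a 16 kHz waveform, a 25 ms frame length and a 10 ms frame shift; these values are common defaults and are not taken from the disclosure:

```python
import numpy as np

def frame_signal(wave: np.ndarray, sr: int = 16000,
                 frame_len_ms: float = 25.0, frame_shift_ms: float = 10.0) -> np.ndarray:
    """Split a waveform into overlapping frames (overlapped segmentation)."""
    frame_len = int(sr * frame_len_ms / 1000)      # e.g. 400 samples per frame
    frame_shift = int(sr * frame_shift_ms / 1000)  # e.g. 160 samples -> 100 frames per second
    num_frames = 1 + max(0, (len(wave) - frame_len) // frame_shift)
    frames = np.stack([wave[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(num_frames)])
    return frames  # shape: (num_frames, frame_len)
```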
At 102, a frame splicing process is performed on each frame in the speech frame sequence to obtain a frame splicing sequence.
If the speech frame sequence is directly down-sampled in order to reduce the frame rate, the short-time stationarity of the speech features is damaged, the trained model easily diverges, and the performance loss is large. In order to ensure model performance and training stability, in the present disclosure each frame in the speech frame sequence is first subjected to frame splicing, and the resulting frame splicing sequence is then down-sampled.
The frame splicing process may include: merging each frame in the speech frame sequence with the m preceding frames and n following frames adjacent to it, respectively, to obtain each frame in the frame splicing sequence, where m and n are preset positive integers. Too large a value of m may introduce redundancy, and too large a value of n may introduce large latency, while smaller values of m and n lead to a larger amount of computation; therefore, m and n need to be set reasonably according to the requirements of the actual scenario. Empirical or experimental values may be used; for example, both m and n may be 2.
As shown in FIG. 2, frame splicing is performed on the speech frame sequence. For the 1st speech frame, it is merged with its 2 preceding and 2 following frames; since the 2 preceding positions are blank, speech frames 1-3 are merged to form the 1st frame of the frame splicing sequence, denoted X1. For the 2nd speech frame, speech frames 1-4 are merged (with 1 blank position in front) to form the 2nd frame of the frame splicing sequence, denoted X2. For the 3rd speech frame, speech frames 1-5 are merged to form the 3rd frame, denoted X3. By analogy, the frame splicing sequence {X1, X2, X3, …, XT} is obtained, where T is the number of speech frames in the speech frame sequence.
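A minimal sketch of this frame splicing step, assuming per-frame acoustic feature vectors, m = n = 2 as in the example above, and zero vectors standing in for the blank boundary positions:

```python
import numpy as np

def splice_frames(features: np.ndarray, m: int = 2, n: int = 2) -> np.ndarray:
    """Merge each frame with its m preceding and n following frames.

    features: (T, D) per-frame acoustic features; returns (T, (m + 1 + n) * D).
    Boundary positions with no neighbor are padded with zeros ("blank" frames).
    """
    T, D = features.shape
    padded = np.pad(features, ((m, n), (0, 0)), mode="constant")
    # Row t of the result is the concatenation of frames t-m, ..., t, ..., t+n.
    spliced = np.concatenate([padded[i:i + T] for i in range(m + 1 + n)], axis=1)
    return spliced
```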
At 103, the sequence of mosaic frames is down-sampled to obtain a sequence of frame skipping.
The sampling rate adopted when down-sampling the frame splicing sequence can also be set reasonably according to the requirements of the actual scenario, and empirical or experimental values may be used. Taking a sampling rate of 1/2 as an example, as shown in fig. 2, one frame is kept out of every 2 frames, i.e., the 1st, 3rd, 5th, … frames are sampled from the frame splicing sequence to form the frame skipping sequence. Compared with directly down-sampling the speech frames, splicing frames first and then down-sampling greatly reduces the loss of feature information. As shown in fig. 2, the frame shift of the original speech frames is 10 ms; although the frame shift increases from 10 ms to 20 ms and the frame rate drops to 1/2, frames in the frame skipping sequence still overlap to a certain extent, which preserves the continuity between features.
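Down-sampling the frame splicing sequence then reduces to a simple stride over the spliced frames; a sketch with the 1/2 sampling rate of the example above, continuing the hypothetical splice_frames helper:

```python
def skip_frames(spliced, rate: int = 2):
    """Keep one spliced frame out of every `rate` frames (the 1st, 3rd, 5th, ...)."""
    return spliced[::rate]

# With a 10 ms frame shift, keeping every other spliced frame raises the effective
# shift to 20 ms and halves the frame rate, while the m/n context kept by splicing
# preserves overlap, and hence continuity, between the remaining frames.
```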
At 104, a first speech recognition model is trained using the sequence of spellings and the corresponding text labels.
The speech recognition model referred to in the present disclosure adopts an encoder-decoder structure, such as the SMLTA (Streaming Multi-Layer Truncated Attention) or LAS (Listen, Attend and Spell) system.
The encoder and the decoder may be composed of LSTM (Long Short-Term Memory) networks. The input to the encoder consists of acoustic features, such as Fbank (filter bank) or MFCC (Mel-frequency cepstral coefficient) features, extracted from each frame. The encoder obtains a hidden vector for each frame from the acoustic features of that frame; the hidden vector sequence formed by these hidden vectors is input into the decoder, which outputs text. The text is expressed in terms of acoustic modeling units, which may be, for example, words, syllables, or phones.
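As a rough, deliberately simplified illustration of such a structure (no attention mechanism or streaming truncation, so this is not the disclosure's actual SMLTA/LAS implementation), an LSTM encoder that maps per-frame acoustic features to hidden vectors and an LSTM decoder that emits scores over modeling units might be sketched in PyTorch as follows; all names and dimensions here are assumptions for illustration:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, num_layers: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, feat_dim) -> hidden vectors h: (batch, T, hidden_dim)
        h, _ = self.lstm(feats)
        return h

class Decoder(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, T', hidden_dim) -> per-step scores over acoustic modeling units
        d, _ = self.lstm(h)
        return self.out(d)  # (batch, T', vocab_size)
```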
At 105, based on the first speech recognition model, a second speech recognition model is obtained by training with a frame skipping sequence and a corresponding text label.
If the speech recognition model is trained directly using the frame skipping sequence after the above step 103, the model training may be unstable, difficult to converge and have performance that is greatly different from the model trained with the original speech frame sequence. Therefore, in this embodiment, a multi-stage training manner is adopted, that is, the frame splicing sequence is used to perform model training in step 104, and then the frame skipping sequence is further used to perform training, so that the finally obtained model can smoothly converge and reduce the performance loss.
The above steps 104 and 105 are described in detail with reference to the following embodiments.
First, in step 104, a preliminary model, referred to herein as the "first speech recognition model", can be obtained by training with the frame splicing sequence and the corresponding text labels. The specific training procedure is shown in FIG. 3a: the acoustic features {X1, X2, X3, …, XT} of the frame splicing sequence are used as the input of the encoder, the hidden vector sequence {h1, h2, h3, …, hT} output by the encoder is used as the input of the decoder, the corresponding text label is used as the target output of the decoder, and the encoder and the decoder are trained to obtain the first speech recognition model.
The training process described above uses a training goal that minimizes the difference between the recognition result output by the decoder and the text label. A loss function can be constructed based on the training target, and the values of the loss function are used for updating the model parameters.
As one of the preferred embodiments, two loss functions can be designed:
the first Loss function may take the form of, for example, Loss Ce (Cross entropy), whose goal is to maximize the posterior probability of the model output text labels, for example:
P(L|X) = Π P(lt | X, l1:t-1)
where X denotes the input sequence of the model, L denotes the output sequence of the model, and L = [l1, l2, …, lN].
The second loss function may be implemented, for example, as Loss CTC (Connectionist Temporal Classification), whose goal is to maximize the path probability of the text labels output by the model; a training task using Loss CTC requires the addition of a fully connected layer. The hidden vector sequence is input into the fully connected layer to obtain spike predictions, whose result is {y1, y2, y3, …, yM}; these do not require alignment at every frame, only consistency on the path (i.e., in the resulting text sequence). That is, alignment at every frame is not required as long as the predicted text result is correct.
For example: p (Y | X) ═ Σπ∈φ(Y')P(π|X)
Where Y represents the sequence of model outputs and φ (Y') represents all possible text sequences of outputs.
The first loss function, such as Loss Ce, operates on each frame of data, minimizing the training error per frame. The second loss function, such as Loss CTC, is sequence-based: it does not need to consider the alignment of each frame with the labels; as long as the path is correct, the training error of the entire sequence is minimized.
In the process of training the encoder and the decoder, the model parameters of the encoder and the decoder may be updated as a whole using the first loss function, and only the model parameters of the decoder may be updated using the second loss function. The first speech recognition model is obtained after the training end condition is reached. The training end condition may be, for example, convergence of the loss function, the value of the loss function being less than or equal to a preset loss function threshold, the number of iterations reaching a preset threshold, or the like.
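A rough sketch of how the two losses might be computed together, assuming the simplified Encoder/Decoder above plus a hypothetical fully connected CTC branch fc_ctc (e.g., nn.Linear from the hidden dimension to the vocabulary size); per-frame targets for the cross-entropy branch are assumed here because the text describes Loss Ce as a per-frame criterion:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ce_loss = nn.CrossEntropyLoss()   # first loss: per-frame cross entropy
ctc_loss = nn.CTCLoss(blank=0)    # second loss: path probability over the whole sequence

def training_step(encoder, decoder, fc_ctc, feats, frame_labels, token_labels, token_lens):
    h = encoder(feats)                                   # (B, T, H) hidden vectors
    logits = decoder(h)                                  # (B, T, V) per-frame predictions
    loss_ce = ce_loss(logits.reshape(-1, logits.size(-1)),
                      frame_labels.reshape(-1))          # frame_labels: (B, T) LongTensor

    ctc_logp = F.log_softmax(fc_ctc(h), dim=-1)          # (B, T, V) spike predictions
    in_lens = torch.full((feats.size(0),), h.size(1), dtype=torch.long)
    loss_ctc = ctc_loss(ctc_logp.transpose(0, 1),        # CTCLoss expects (T, B, V)
                        token_labels, in_lens, token_lens)
    return loss_ce, loss_ctc
```

Which sub-network each loss actually updates, as described in the surrounding text, can then be controlled by building an optimizer only over the corresponding parameter group, or by detaching the gradients of the other part.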
In an implementation manner of step 105, the second speech recognition model may be obtained by training with a frame skipping sequence and a corresponding text label directly based on the first speech recognition model. However, as a preferred embodiment, a two-stage training mode may be adopted: the first stage performs frame skipping training of the decoder, and the second stage performs frame skipping training of the encoder and the decoder.
When frame skipping training of a decoder is carried out in the first stage, based on a first speech recognition model, acoustic features of a splicing frame sequence are used as input of an encoder, and a frame skipping hidden vector sequence is obtained by down-sampling a hidden vector sequence output by the encoder; and taking the frame skipping hidden vector sequence as the input of a decoder, taking the corresponding text label as the target output of the decoder, and continuing to train the encoder and the decoder to obtain a third speech recognition model.
As shown in fig. 3b, the encoder and decoder pre-trained at this stage are those of the first speech recognition model obtained in step 104. During training at this stage, the acoustic features {X1, X2, X3, …, XT} of the frame splicing sequence are used as the input of the encoder, which outputs the hidden vector sequence {h1, h2, h3, …, hT}. The hidden vector sequence is down-sampled, for example at a rate of 1/2, to obtain the frame-skipping hidden vector sequence {h1, h3, h5, …, hT-1}. The frame-skipping hidden vector sequence is input into the decoder, which outputs the text sequence {l1, l2, …, lN}. In this training process, the input and output of the encoder are unchanged compared with the training process of step 104, and mainly the decoder portion is trained.
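A minimal sketch of this first-stage forward pass, reusing the simplified modules above (the 1/2 down-sampling rate is only an example):

```python
def stage1_forward(encoder, decoder, spliced_feats, rate: int = 2):
    h = encoder(spliced_feats)    # full-rate hidden vectors {h1, h2, ..., hT}
    h_skip = h[:, ::rate, :]      # frame-skipping hidden vectors {h1, h3, h5, ...}
    return decoder(h_skip)        # the decoder is trained on the down-sampled sequence
```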
A first Loss function such as Loss Ce may also be employed to update the model parameters of the encoder and decoder with the goal of maximizing the a posteriori probability of the model output text labels.
A second loss function, such as Loss CTC, is employed only to update the model parameters of the encoder. Its goal is to maximize the path probability of the text labels output by the model, and a training task using Loss CTC requires the addition of a fully connected layer. The hidden vector sequence is input into the fully connected layer to obtain spike predictions, whose result is {y1, y2, y3, …, yM}; these do not require alignment at every frame, only consistency on the path (i.e., in the resulting text sequence). That is, alignment at every frame is not required as long as the predicted text result is correct.
During this first stage of training, the Loss CTC is used to maintain the performance of the encoder, while the decoder is the part mainly being trained. The third speech recognition model is obtained after the training end condition is reached. The training end condition may be, for example, convergence of the loss function, the value of the loss function being less than or equal to a preset loss function threshold, the number of iterations reaching a preset threshold, or the like.
In the second stage, when frame skipping training of the encoder and the decoder is performed, based on the third speech recognition model, the acoustic features of the frame skipping sequence are used as the input of the encoder, the hidden vector sequence output by the encoder is used as the input of the decoder, the corresponding text labels are used as the target output of the decoder, and the encoder and the decoder are continuously trained to obtain the second speech recognition model.
As shown in fig. 3c, before this stage of training the encoder and decoder are those of the third speech recognition model obtained in the first stage. During training at this stage, the acoustic features {X1, X3, X5, …, XT-1} of the frame skipping sequence are used as the input of the encoder, which outputs the hidden vector sequence {h1, h3, h5, …, hT-1}. The hidden vector sequence is input into the decoder, which outputs the text sequence {l1, l2, …, lN}. In this training process, the input and output of the decoder are unchanged compared with the first training stage, and mainly the encoder part is trained.
A first loss function such as Loss Ce may likewise be used to update the model parameters of the encoder and decoder, and a second loss function such as Loss CTC to update only the model parameters of the encoder. In this second stage of training, Loss Ce is used to preserve the performance of the decoder, while the encoder is mainly trained, assisted by the Loss CTC. The second speech recognition model is obtained after the training end condition is reached. The training end condition may be, for example, convergence of the loss function, the value of the loss function being less than or equal to a preset loss function threshold, the number of iterations reaching a preset threshold, or the like.
It can be seen that in the above two-stage training mode, in each training stage the input sequence of one of the encoder and the decoder is kept unchanged so as to guide the training of the other part. This "transitional" training mode can ensure the stability of model training and minimize the performance loss of the resulting model.
The resulting second speech recognition model comprises the encoder and the decoder. The fully connected layer involved in the Loss CTC during training is only used to assist training of the speech recognition model, so the second speech recognition model may not include the fully connected layer in an actual speech recognition scenario.
On the basis of the second speech recognition model obtained by the above training, speech recognition can be performed using the second speech recognition model. Fig. 4 is a flowchart of a speech recognition method according to an embodiment of the present disclosure. The method may be executed by a speech recognition apparatus, which may be located at a server or at a computer terminal with relatively strong computing power. The apparatus may be in the form of an application, or a plug-in, SDK, or other functional unit within an application, which is not particularly limited in this disclosure. As shown in fig. 4, the method may include the following steps:
in 401, a sequence of speech frames is obtained.
The speech frame sequence involved in this step may be more than one speech frame obtained after framing the speech to be recognized.
At 402, a frame splicing process is performed on each frame in the speech frame sequence to obtain a frame splicing sequence.
At 403, the sequence of mosaic frames is down-sampled to obtain a sequence of frame jumps.
The specific manner of the frame splicing processing and the frame skipping processing may refer to the description about steps 102 and 103 in the embodiment shown in fig. 1, and is not described herein again. The frame splicing processing and frame skipping processing adopted in the speech recognition process need to be consistent with the mode adopted by the speech recognition model in the training process.
At 404, the frame skipping sequence is input into the second speech recognition model, and a text recognition result output by the second speech recognition model is obtained.
In this step, after the acoustic features of the frame skipping sequence are input into the encoder of the second speech recognition model, the encoder outputs a hidden vector sequence. The hidden vector sequence is used as the input of the decoder of the second speech recognition model, and the decoder outputs the text recognition result of the speech frame sequence. The text recognition result is expressed in terms of acoustic modeling units, which may be, for example, words, syllables, or phones.
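Putting the pieces together, the recognition path could be sketched as follows; frame_signal, splice_frames and skip_frames are the hypothetical helpers sketched in the training description above, extract_features stands for any per-frame Fbank/MFCC front end (an assumption, not an API from the disclosure), and the greedy readout at the end is only for illustration:

```python
import torch

def recognize(wave, encoder, decoder, sr=16000, m=2, n=2, rate=2):
    feats = extract_features(frame_signal(wave, sr))     # (T, D) per-frame acoustic features
    spliced = splice_frames(feats, m, n)                 # frame splicing sequence
    skipped = skip_frames(spliced, rate)                 # frame skipping sequence
    x = torch.from_numpy(skipped).float().unsqueeze(0)   # (1, T', D')
    with torch.no_grad():
        h = encoder(x)            # hidden vector sequence from the second model's encoder
        logits = decoder(h)       # per-step scores over acoustic modeling units
    return logits.argmax(dim=-1)  # greedy readout of the modeling units
```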
In the speech recognition process, the frame skipping sequence has a lower frame rate than the original speech frame sequence, so the amount of computation in speech recognition is effectively reduced. Although frame splicing increases the input feature dimension, this increase only affects the first-layer network of the encoder; because the time dimension is sufficiently compressed, the total amount of computation of the encoder and the decoder is effectively reduced.
For online speech recognition services, lowering the frame rate saves computing resources and reduces response time. For offline speech recognition, lowering the frame rate makes it possible to accommodate models with more parameters, improve recognition accuracy, reduce power consumption, and so on. Therefore, in a speech recognition system, low-frame-rate optimization plays a significant role in user experience, service cost and product performance.
The above is a detailed description of the method provided by the present disclosure, and the following is a detailed description of the apparatus provided by the present disclosure with reference to the embodiments.
Fig. 5 is a block diagram of an apparatus for obtaining a speech recognition model according to an embodiment of the disclosure, and as shown in fig. 5, the apparatus 500 may include: a first obtaining unit 510, a first frame splicing unit 520, a first frame skipping unit 530, and a model training unit 540. The main functions of each component unit are as follows:
a first obtaining unit 510, configured to obtain training data, where the training data includes a speech frame sequence and a text label corresponding to the speech frame sequence, and the speech frame sequence includes more than one speech frame.
The first frame splicing unit 520 is configured to perform frame splicing processing on each frame in the speech frame sequence to obtain a frame splicing sequence.
Specifically, the first frame splicing unit 520 may merge each frame in the speech frame sequence with the m preceding frames and n following frames adjacent to it, respectively, to obtain each frame in the frame splicing sequence, where m and n are preset positive integers. Too large a value of m may introduce redundancy, and too large a value of n may introduce large latency, while smaller values of m and n lead to a larger amount of computation; therefore, m and n need to be set reasonably according to the requirements of the actual scenario. Empirical or experimental values may be used; for example, both m and n may be 2.
A first frame skipping unit 530, configured to down-sample the sequence of the splicing frame to obtain a frame skipping sequence.
The sampling rate adopted when the sequence of the splicing frames is subjected to down-sampling can also be reasonably set according to the requirements of an actual scene, and empirical values or experimental values can be adopted.
A model training unit 540, configured to train to obtain a first speech recognition model by using the frame splicing sequence and the corresponding text labels; and training by utilizing the frame skipping sequence and the corresponding text label based on the first voice recognition model to obtain a second voice recognition model.
As one of the realizable manners, the model training unit 540 may include: a first training subunit 541, a second training subunit 542, and a third training subunit 543.
The first training subunit 541 is configured to train the encoder and the decoder to obtain a first speech recognition model including the encoder and the decoder, where the acoustic features of the frame splicing sequence are used as input of the encoder, a hidden vector sequence output by the encoder is used as input of the decoder, and a corresponding text label is used as target output of the decoder.
Wherein the encoder and decoder may be comprised of an LSTM network. The input to the encoder is actually an acoustic feature such as Fbank, MFCC, etc. extracted from each frame. The decoder outputs the recognition result text. The text embodies acoustic modeling units, which may be, for example, words, syllables, phones (phones), etc.
And the second training subunit 542 is configured to, based on the first speech recognition model, take the acoustic features of the frame splicing sequence as input of the encoder, perform down-sampling on the hidden vector sequence output by the encoder, take the frame skipping hidden vector sequence obtained after the down-sampling as input of the decoder, take the corresponding text label as target output of the decoder, and continue to train the encoder and the decoder to obtain a third speech recognition model.
And a third training subunit 543, configured to train, based on the third speech recognition model, by using the frame skipping sequence and the corresponding text label, to obtain a second speech recognition model.
As one of the realizable manners, the third training subunit 543 may use, based on the third speech recognition model, the acoustic feature of the frame skipping sequence as an input of the encoder, use the hidden vector sequence output by the encoder as an input of the decoder, use the corresponding text label as a target output of the decoder, and continue to train the encoder and the decoder to obtain the second speech recognition model.
As one of the preferred embodiments, two loss functions can be designed: the first loss function may take the form of, for example, Loss Ce (cross entropy), whose goal is to maximize the posterior probability of the text labels output by the model; the second loss function may take the form of, for example, Loss CTC, whose goal is to maximize the path probability of the text labels output by the model, and a training task using Loss CTC requires the addition of a fully connected layer. The hidden vector sequence is input into the fully connected layer to obtain spike predictions, which do not require alignment at every frame, only consistency on the path (i.e., in the resulting text sequence). That is, alignment at every frame is not required as long as the predicted text result is correct.
The first training subunit 541, the second training subunit 542, and the third training subunit 543 included in the model training unit 540 may, when training the encoder and the decoder, determine a first loss function and a second loss function, update the model parameters of the encoder and the decoder with the first loss function, and update the model parameters of the decoder with the second loss function.
Fig. 6 is a block diagram of a speech recognition apparatus provided in an embodiment of the present disclosure, and as shown in fig. 6, the apparatus 600 may include: a second obtaining unit 601, a second frame splicing unit 602, a second frame skipping unit 603, and a result obtaining unit 604. The main functions of each component unit are as follows:
a second obtaining unit 601, configured to obtain a sequence of voice frames.
The speech frame sequence related in this embodiment may be more than one speech frame obtained by framing the speech to be recognized.
The second frame splicing unit 602 is configured to perform frame splicing processing on each frame in the speech frame sequence to obtain a frame splicing sequence.
A second frame skipping unit 603, configured to down-sample the splicing frame sequence to obtain a frame skipping sequence.
The specific manner of the frame splicing processing and the frame skipping processing may refer to the related description in the embodiment of the apparatus shown in fig. 5, and is not described herein again. The frame splicing processing and frame skipping processing adopted in the speech recognition process need to be consistent with the mode adopted by the speech recognition model in the training process.
And a result obtaining unit 604, configured to input the frame skipping sequence into the second speech recognition model, and obtain a text recognition result output by the second speech recognition model. Wherein the second speech recognition model is pre-trained by the apparatus for obtaining speech recognition models as shown in fig. 5.
After the acoustic features of the frame skipping sequence are input into the encoder of the second speech recognition model, the encoder outputs a hidden vector sequence. The hidden vector sequence is used as the input of the decoder of the second speech recognition model, and the decoder outputs the text recognition result of the speech frame sequence. The text recognition result is expressed in terms of acoustic modeling units, which may be, for example, words, syllables, or phones.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
As shown in fig. 7, is a block diagram of an electronic device according to an embodiment of the disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM)702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 701 executes the respective methods and processes described above, such as a method of acquiring a speech recognition model or a speech recognition method. For example, in some embodiments, the method of obtaining a speech recognition model or the speech recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708.
In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the method of obtaining a speech recognition model or the speech recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the method of obtaining a speech recognition model or the speech recognition method.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that addresses the defects of high management difficulty and weak service expansibility in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (18)

1. A method of obtaining a speech recognition model, comprising:
acquiring training data, wherein the training data comprises a voice frame sequence and a text label corresponding to the voice frame sequence, and the voice frame sequence comprises more than one voice frame;
performing frame splicing processing on each frame in the voice frame sequence to obtain a frame splicing sequence;
down-sampling the frame splicing sequence to obtain a frame skipping sequence;
training by utilizing the frame splicing sequence and the corresponding text labels to obtain a first speech recognition model;
and training by utilizing the frame skipping sequence and the corresponding text labels to obtain a second speech recognition model based on the first speech recognition model, so as to obtain the second speech recognition model by utilizing the multi-stage training of the splicing frame sequence and the frame skipping sequence.
2. The method of claim 1, wherein the frame splicing processing of each frame in the sequence of speech frames to obtain a sequence of spliced frames comprises:
and respectively merging each frame in the voice frame sequence with the first m frames and the last n frames adjacent to each frame to obtain each frame in the splicing frame sequence, wherein m and n are preset positive integers.
3. The method of claim 1, wherein the training with the sequence of spellings and the corresponding text labels to derive a first speech recognition model comprises:
and taking the acoustic features of the splicing frame sequence as the input of an encoder, taking the implicit vector sequence output by the encoder as the input of a decoder, taking the corresponding text labels as the target output of the decoder, training the encoder and the decoder, and obtaining a first speech recognition model comprising the encoder and the decoder.
4. The method of claim 1, wherein training with the sequence of frame hops and corresponding text labels to derive a second speech recognition model based on the first speech recognition model comprises:
based on the first speech recognition model, taking the acoustic features of the splicing frame sequence as the input of an encoder, performing down-sampling on a hidden vector sequence output by the encoder, taking a frame-skipping hidden vector sequence obtained after down-sampling as the input of a decoder, taking a corresponding text label as the target output of the decoder, and continuing to train the encoder and the decoder to obtain a third speech recognition model;
and training by utilizing the frame skipping sequence and the corresponding text labels to obtain a second speech recognition model based on the third speech recognition model.
5. The method of claim 4, wherein training a second speech recognition model using the sequence of frame hops and corresponding text labels based on the third speech recognition model comprises:
and based on the third speech recognition model, taking the acoustic features of the frame skipping sequence as the input of the encoder, taking the hidden vector sequence output by the encoder as the input of the decoder, taking the corresponding text label as the target output of the decoder, and continuing to train the encoder and the decoder to obtain the second speech recognition model.
6. The method of claim 3, 4 or 5, wherein the training the encoder and decoder comprises:
determining a first loss function and a second loss function, wherein the first loss function aims at maximizing the posterior probability of outputting the text labels, and the second loss function aims at maximizing the sum of the path probabilities of outputting the text labels;
updating model parameters of the encoder and decoder with the first loss function, and updating model parameters of the decoder with the second loss function.
7. The method of claim 6, wherein the first loss function is a cross entropy loss function and the second loss function is a Connectionist Temporal Classification (CTC) loss function.
8. A method of speech recognition, comprising:
acquiring a voice frame sequence;
performing frame splicing processing on each frame in the voice frame sequence to obtain a frame splicing sequence;
down-sampling the frame splicing sequence to obtain a frame skipping sequence;
inputting the frame skipping sequence into a second voice recognition model, and acquiring a text recognition result output by the second voice recognition model;
wherein the second speech recognition model is pre-trained by the method of any one of claims 1 to 6.
9. An apparatus for obtaining a speech recognition model, comprising:
the device comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring training data, the training data comprises a voice frame sequence and a text label corresponding to the voice frame sequence, and the voice frame sequence comprises more than one voice frame;
the first frame splicing unit is used for splicing frames of each frame in the voice frame sequence to obtain a spliced frame sequence;
the first frame skipping unit is used for performing down sampling on the splicing frame sequence to obtain a frame skipping sequence;
the model training unit is used for training by utilizing the splicing frame sequence and the corresponding text labels to obtain a first voice recognition model; and training by utilizing the frame skipping sequence and the corresponding text labels to obtain a second speech recognition model based on the first speech recognition model, so as to obtain the second speech recognition model by utilizing the multi-stage training of the splicing frame sequence and the frame skipping sequence.
10. The apparatus according to claim 9, wherein the first frame splicing unit is specifically configured to merge each frame in the sequence of speech frames with its adjacent first m frames and last n frames, respectively, to obtain each frame in the sequence of frame splicing, where m and n are preset positive integers.
11. The apparatus of claim 9, wherein the model training unit comprises: and the first training subunit is used for taking the acoustic features of the splicing frame sequence as the input of an encoder, taking a hidden vector sequence output by the encoder as the input of a decoder, taking a corresponding text label as the target output of the decoder, and training the encoder and the decoder to obtain a first speech recognition model comprising the encoder and the decoder.
12. The apparatus of claim 9, wherein the model training unit comprises:
a second training subunit configured to, based on the first speech recognition model, use the acoustic features of the spliced frame sequence as the input of the encoder, down-sample the hidden vector sequence output by the encoder, use the frame skipping hidden vector sequence obtained by the down-sampling as the input of the decoder, use the corresponding text label as the target output of the decoder, and continue training the encoder and the decoder to obtain a third speech recognition model;
a third training subunit configured to, based on the third speech recognition model, perform training by using the frame skipping sequence and the corresponding text label to obtain a second speech recognition model.
13. The apparatus according to claim 12, wherein the third training subunit is specifically configured to, based on the third speech recognition model, use the acoustic features of the frame skipping sequence as the input of the encoder, use a hidden vector sequence output by the encoder as the input of the decoder, use a corresponding text label as the target output of the decoder, and continue training the encoder and the decoder to obtain the second speech recognition model.
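To make the staged arrangement of claims 11 to 13 concrete, the following minimal sketch contrasts where the down-sampling sits in the third and the second speech recognition models; encoder, decoder, the tensor shapes and the skip factor are assumed PyTorch-style placeholders, not the patented implementation.

# Illustrative sketch only; modules, shapes and the skip factor are hypothetical.
def forward_third_model(encoder, decoder, spliced_feats, labels, skip=3):
    """Intermediate stage (third speech recognition model): the encoder still reads the
    full spliced frame sequence, but its hidden vector sequence is down-sampled before
    being handed to the decoder."""
    hidden = encoder(spliced_feats)            # (batch, T, hidden_dim)
    skipped_hidden = hidden[:, ::skip, :]      # frame-skipping hidden vector sequence
    return decoder(skipped_hidden, labels)

def forward_second_model(encoder, decoder, skipped_feats, labels):
    """Final stage (second speech recognition model): the frame skipping has moved to the
    input side, so the encoder itself only reads the down-sampled spliced frames."""
    hidden = encoder(skipped_feats)            # (batch, T // skip, hidden_dim)
    return decoder(hidden, labels)

One reading of the intermediate stage is that it lets the decoder adapt to a frame-skipping hidden vector sequence while the encoder still sees every spliced frame, easing the final switch to frame-skipping input.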
14. The apparatus according to claim 11, 12 or 13, wherein the subunits comprised in the model training unit are, when training the encoder and the decoder, specifically configured to:
determining a first loss function and a second loss function, wherein the first loss function aims at maximizing the posterior probability of outputting the text labels, and the second loss function aims at maximizing the sum of the path probabilities of outputting the text labels;
updating model parameters of the encoder and decoder with the first loss function, and updating model parameters of the decoder with the second loss function.
15. The apparatus of claim 14, wherein the first loss function is a cross entropy loss function and the second loss function is a connectionist temporal classification (CTC) loss function.
16. An apparatus for speech recognition, comprising:
a second acquisition unit configured to acquire a speech frame sequence;
a second frame splicing unit configured to perform frame splicing on each frame in the speech frame sequence to obtain a spliced frame sequence;
a second frame skipping unit configured to down-sample the spliced frame sequence to obtain a frame skipping sequence;
a result acquisition unit configured to input the frame skipping sequence into a second speech recognition model and acquire a text recognition result output by the second speech recognition model;
wherein the second speech recognition model is pre-trained by the apparatus of any one of claims 9 to 15.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.

Priority Applications (1)

Application Number: CN202110270662.0A
Priority Date: 2021-03-12; Filing Date: 2021-03-12
Title: Method for obtaining speech recognition model, speech recognition method and corresponding device

Publications (2)

Publication Number / Publication Date
CN113129868A (en): 2021-07-16
CN113129868B (en): 2022-02-25

Family

Family ID: 76773083

Family Applications (1)

Application Number: CN202110270662.0A; Status: Active; Publication: CN113129868B (en)
Title: Method for obtaining speech recognition model, speech recognition method and corresponding device

Country Status (1)

Country: CN; Publications: 1 (CN113129868B (en))

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113793599B (en) * 2021-09-15 2023-09-29 北京百度网讯科技有限公司 Training method of voice recognition model, voice recognition method and device
CN113870846B (en) * 2021-09-27 2024-05-31 平安科技(深圳)有限公司 Speech recognition method, device and storage medium based on artificial intelligence
CN113889088B (en) * 2021-09-28 2022-07-15 北京百度网讯科技有限公司 Method and device for training speech recognition model, electronic equipment and storage medium
CN113689846B (en) * 2021-10-27 2022-02-08 深圳市友杰智新科技有限公司 Speech recognition model training method, device, computer equipment and storage medium
CN114242113B (en) * 2021-12-16 2023-08-08 北京百度网讯科技有限公司 Voice detection method, training device and electronic equipment
CN115101041A (en) * 2022-05-09 2022-09-23 北京百度网讯科技有限公司 Method and device for training speech synthesis and speech synthesis model
CN115910044B (en) * 2023-01-10 2023-06-30 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US7610205B2 (en) * 2002-02-12 2009-10-27 Dolby Laboratories Licensing Corporation High quality time-scaling and pitch-scaling of audio signals
CN105869624B (en) * 2016-03-29 2019-05-10 腾讯科技(深圳)有限公司 The construction method and device of tone decoding network in spoken digit recognition
CN109309764B (en) * 2017-07-28 2021-09-03 北京搜狗科技发展有限公司 Audio data processing method and device, electronic equipment and storage medium

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
CN110428818A (en) * 2019-08-09 2019-11-08 中国科学院自动化研究所 The multilingual speech recognition modeling of low-resource, audio recognition method
CN110648658A (en) * 2019-09-06 2020-01-03 北京达佳互联信息技术有限公司 Method and device for generating voice recognition model and electronic equipment
CN111415667A (en) * 2020-03-25 2020-07-14 极限元(杭州)智能科技股份有限公司 Stream-type end-to-end speech recognition model training and decoding method
CN111696526A (en) * 2020-06-22 2020-09-22 北京达佳互联信息技术有限公司 Method for generating voice recognition model, voice recognition method and device
CN111916067A (en) * 2020-07-27 2020-11-10 腾讯科技(深圳)有限公司 Training method and device of voice recognition model, electronic equipment and storage medium
CN112071308A (en) * 2020-09-11 2020-12-11 中山大学 Awakening word training method based on speech synthesis data enhancement
CN112183120A (en) * 2020-09-18 2021-01-05 北京字节跳动网络技术有限公司 Speech translation method, device, equipment and storage medium

Non-Patent Citations (2)

Piotr Kozierski et al., "Polish whispery speech recognition — Minimum sampling frequency", 2017 22nd International Conference on Methods and Models in Automation and Robotics (MMAR), 2017-09-21, entire document *
刘伟波 (Liu Weibo) et al., "基于双微阵列与卷积神经网络的语音识别方法" (Speech recognition method based on dual microphone arrays and a convolutional neural network), 计算机应用 (Journal of Computer Applications), 2019-11-30, entire document *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant