CN111128191B - Online end-to-end voice transcription method and system - Google Patents
Online end-to-end voice transcription method and system
- Publication number
- CN111128191B (application CN201911415035.0A)
- Authority
- CN
- China
- Prior art keywords
- layer
- network
- output
- Chinese character
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
All classifications fall under G—Physics; G10—Musical instruments; acoustics; G10L—Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding:
- G10L15/26 — Speech recognition; Speech to text systems
- G10L15/16 — Speech recognition; Speech classification or search using artificial neural networks
- G10L15/30 — Speech recognition; Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
- G10L25/03 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
- G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique, using neural networks
Abstract
Description
Technical Field
The present invention relates to the technical field of speech transcription, and in particular to an online end-to-end speech transcription method and system.
Background Art
Speech transcription technology, which converts input audio into text, is an important technology and a key research topic in the field of human-computer interaction.
Traditional speech transcription technology comprises an acoustic model, a pronunciation dictionary and a language model, and relies on weighted finite-state transducers to build a complex decoding network that converts acoustic feature sequences into text sequences. Emerging end-to-end speech transcription technology uses a single neural network model to convert acoustic features directly into text sequences, which greatly simplifies the decoding flow of the transcription process. However, current high-performance end-to-end speech transcription must wait for the complete audio input before it can begin converting it into a text sequence, which prevents end-to-end speech transcription technology from being applied to online, real-time transcription tasks.
Summary of the Invention
In view of this, the embodiments of the present application provide an online end-to-end speech transcription method and system, which overcome the problem that existing end-to-end speech transcription technology cannot be applied to real-time online transcription tasks. By improving the encoder-decoder based end-to-end speech transcription technology, the encoder and the decoder no longer depend on the complete audio before they can begin converting it into a text sequence.
In a first aspect, the present application provides an online end-to-end speech transcription method, comprising:
acquiring an audio file and extracting acoustic features from the audio file;
performing a nonlinear transformation and downsampling on the acoustic features and outputting a first feature sequence;
dividing the first feature sequence into blocks, inputting each block of the feature sequence into an encoder in turn, and outputting multiple sets of second feature sequences;
modeling the second feature sequences, outputting multiple sets of Chinese character sequences, and scoring the multiple sets of Chinese character sequences;
taking the Chinese character sequence with the highest score as the final transcription result.
Optionally, acquiring an audio file and extracting acoustic features from the audio file comprises:
extracting log-mel spectrum features from the acquired audio file as frame-level acoustic features.
Optionally, the encoder is an online encoder based on the self-attention mechanism;
the encoder is a stack of 12 identical sub-modules, each sub-module consisting, in order, of a self-attention network, a residual network, a layer normalization network, a fully connected network, a residual network and a layer normalization network.
Optionally, processing the second feature sequences, outputting multiple sets of Chinese character sequences and scoring the multiple sets of Chinese character sequences comprises:
building an online decoder based on the self-attention mechanism, the decoder modeling the second feature sequences and scoring the multiple sets of output Chinese character sequences;
the decoder is a stack of 6 identical sub-modules, each sub-module consisting of a self-attention network layer, a residual network layer, a layer normalization layer, a truncated attention network layer, a residual network layer, a layer normalization layer, a fully connected layer, a residual network layer and a layer normalization layer.
Optionally, the decoder modeling the second feature sequences and scoring the multiple sets of output Chinese character sequences comprises:
passing the multiple sets of second feature sequences through the 6 sub-modules of the decoder in turn, and feeding the output features of the layer normalization network of the last sub-module into a Chinese character classifier;
the Chinese character classifier outputting multiple groups of Chinese characters and a score for each group of Chinese characters;
feeding each of the top ten Chinese characters back into the decoder to output the next Chinese character, until the decoder outputs the end symbol.
In a second aspect, the present application provides an online end-to-end speech transcription system, comprising:
an acquisition unit: used to acquire audio and extract acoustic features from the audio;
a processing unit: used to perform a nonlinear transformation and downsampling on the acoustic features extracted by the acquisition unit and output a first feature sequence, and to divide the first feature sequence into blocks, input each block of the feature sequence into an encoder in turn and output multiple sets of second feature sequences;
the processing unit is further used to model the second feature sequences, output multiple sets of Chinese character sequences and score the multiple sets of Chinese character sequences;
an output unit: used to take the Chinese character sequence with the highest score among the Chinese character sequences output by the processing unit as the final transcription result, and output it.
Optionally, extracting acoustic features from the audio comprises:
extracting log-mel spectrum features from the acquired audio file as frame-level acoustic features.
Optionally, the encoder is an online encoder based on the self-attention mechanism;
the encoder is a stack of 12 identical sub-modules, each sub-module consisting, in order, of a self-attention network, a residual network, a layer normalization network, a fully connected network, a residual network and a layer normalization network.
Optionally, processing the second feature sequences, outputting multiple sets of Chinese character sequences and scoring the multiple sets of Chinese character sequences comprises:
building an online decoder based on the self-attention mechanism, the decoder modeling the second feature sequences and scoring the multiple sets of output Chinese character sequences;
the decoder is a stack of 6 identical sub-modules, each sub-module consisting of a self-attention network layer, a residual network layer, a layer normalization layer, a truncated attention network layer, a residual network layer, a layer normalization layer, a fully connected layer, a residual network layer and a layer normalization layer.
Optionally, the decoder modeling the second feature sequences and scoring the multiple sets of output Chinese character sequences comprises:
passing the multiple sets of second feature sequences through the 6 sub-modules of the decoder in turn, and feeding the output features of the layer normalization network of the last sub-module into a Chinese character classifier;
the Chinese character classifier outputting multiple groups of Chinese characters and a score for each group of Chinese characters;
feeding each of the top ten Chinese characters back into the decoder to output the next Chinese character, until the decoder outputs the end symbol.
The embodiments of the present application provide an online end-to-end speech transcription method and system. In one embodiment, log-mel spectrum features are extracted from the audio as frame-level acoustic features; a front-end neural network is built to apply a nonlinear transformation and downsampling to the log-mel spectrum features; an online encoder based on the self-attention mechanism is built to model the output feature sequence of the front-end neural network and output a new set of feature sequences; an online decoder based on the self-attention mechanism is built to model the feature sequences output by the encoder and output multiple sets of Chinese character sequences; and a beam search algorithm is used to find the Chinese character sequence with the highest score, which is taken as the final transcription result. By modifying the structure of the encoder so that it processes the audio in blocks, and modifying the structure of the decoder so that it outputs Chinese characters on the basis of truncated audio, text is transcribed while the audio is still being input.
Brief Description of the Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of an online end-to-end speech transcription system according to the present application;
Fig. 2 is a flowchart of an online end-to-end speech transcription method according to the present application;
Fig. 3 is a flowchart of the processing applied by the online encoder based on the self-attention mechanism to the feature sequence input into it;
Fig. 4 is a flowchart of the processing applied by the online decoder based on the self-attention mechanism to the feature sequence input into it.
Detailed Description of the Embodiments
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
Fig. 1 is a schematic structural diagram of an online end-to-end speech transcription system according to the present application. Referring to Fig. 1, the online end-to-end speech transcription system in this embodiment of the application includes an acquisition unit 101, a processing unit 102 and an output unit 103.
The acquisition unit 101 is used to collect an audio signal and pass the collected audio signal through a high-pass filter for pre-emphasis, which boosts the high-frequency part of the audio signal.
The audio signal that has passed through the high-pass filter is split into frames of 25 ms with a frame shift of 10 ms. Each frame is windowed, the window function being a Hamming window. A fast Fourier transform is then applied to each frame to obtain its spectrum, from which the energy spectrum of each frame is obtained. Further, the energy passing through each mel filter is computed from the energy spectrum of each frame and the logarithm is taken to obtain the log-mel spectrum; the number of mel filters is 80, so each frame yields an 80-dimensional log-mel spectrum feature.
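For illustration only, a minimal NumPy sketch of this feature-extraction front end is given below; the 16 kHz sample rate, the 512-point FFT size and the triangular mel filterbank construction are assumptions not fixed by the patent text, while the 0.97 pre-emphasis, the 25 ms/10 ms framing, the Hamming window and the 80 filters follow the description above.

```python
import numpy as np

def log_mel_features(signal, sr=16000, n_mels=80, frame_ms=25, shift_ms=10, n_fft=512):
    """Sketch of the log-mel front end described above (sr and n_fft are assumptions)."""
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])        # H(z) = 1 - 0.97 z^(-1)
    frame_len, frame_shift = sr * frame_ms // 1000, sr * shift_ms // 1000
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)                          # framing + windowing
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2                       # energy spectrum per frame
    low, high = 0.0, 2595 * np.log10(1 + (sr / 2) / 700)                  # mel-scale endpoints
    hz = 700 * (10 ** (np.linspace(low, high, n_mels + 2) / 2595) - 1)    # filter edge frequencies
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)         # rising slope
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)         # falling slope
    return np.log(power @ fbank.T + 1e-10)                                # (n_frames, 80) log-mel features
```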
The processing unit 102 includes a first processing unit 1021, a second processing unit 1022 and a third processing unit 1023.
The first processing unit 1021 is used to build the front-end neural network. The front-end neural network contains two two-dimensional convolutional layers, one linear layer and one positional encoding layer. The convolutional layers have a kernel size of 3, a stride of 2 and 256 kernels; after the two two-dimensional convolutional layers, the length of the feature sequence becomes one quarter of its original length. The linear layer projects the output features of the convolutional layers to 256 dimensions, and the positional encoding layer adds the 256-dimensional positional features to the output features of the linear layer.
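A minimal PyTorch sketch of this front end (convolutional subsampling plus linear projection) follows; the padding scheme and the way the frequency axis is flattened before the linear layer are assumptions, and the positional encoding of formulas (5) and (6) below is added afterwards (see the sketch following those formulas).

```python
import torch
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    """Sketch of the front end: two Conv2d layers (kernel 3, stride 2, 256 channels) that
    shorten the frame sequence to roughly one quarter of its length, then a linear
    projection to 256 dimensions. Padding=1 is an assumption."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        out = lambda n: (n - 1) // 2 + 1              # size after one stride-2 conv with padding=1
        self.proj = nn.Linear(256 * out(out(n_mels)), d_model)

    def forward(self, x):                             # x: (batch, frames, n_mels) log-mel features
        y = self.conv(x.unsqueeze(1))                 # (batch, 256, ~frames/4, ~n_mels/4)
        y = y.permute(0, 2, 1, 3).flatten(2)          # merge channels and mel bins per frame
        return self.proj(y)                           # (batch, ~frames/4, 256)
```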
The log-mel spectrum feature sequence extracted by the acquisition unit 101 is input into the front-end neural network for nonlinear transformation and downsampling, with a downsampling rate of one quarter.
The second processing unit 1022 is used to build the online encoder based on the self-attention mechanism. The encoder is a stack of 12 identical sub-modules, each sub-module consisting, in order, of a self-attention network, a residual network, a layer normalization network, a fully connected network, a residual network and a layer normalization network.
The feature sequence output by the front-end neural network is divided into blocks, and each block of the feature sequence is input into the online encoder in turn, outputting multiple sets of new feature sequences. The feature sequence output from the online encoder has the same length as the feature sequence input into the online encoder.
The third processing unit 1023 is used to build the online decoder based on the self-attention mechanism. The decoder is a stack of 6 identical sub-modules, each sub-module consisting, in order, of a self-attention network, a residual network, a layer normalization network, a truncated attention network, a residual network, a layer normalization network, a fully connected network, a residual network and a layer normalization network.
Before the output features of the online encoder are input into the online decoder, a start symbol must also be supplied to the online decoder as its starting point. The word embedding feature of the start symbol is added to the positional feature, and the result is input into the online decoder based on the self-attention mechanism.
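A small sketch of how that first decoder input can be prepared is given below; the vocabulary size and the id assigned to the start symbol are assumptions, and the positional feature used is position 0 of formulas (5) and (6) below.

```python
import torch
import torch.nn as nn

# Hedged sketch of the decoder's first input: the start symbol ("<sos>") is looked up in the
# character embedding table and its positional feature is added. vocab_size and sos_id are
# placeholders, not values from the patent.
vocab_size, d_model, sos_id = 5000, 256, 1
embed = nn.Embedding(vocab_size, d_model)
pe0 = torch.zeros(d_model)
pe0[1::2] = 1.0                                         # position 0: sin(0) = 0, cos(0) = 1
decoder_input = embed(torch.tensor([[sos_id]])) + pe0   # (1, 1, 256) input to the first sub-module
```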
The output unit 103 is used to take the Chinese character sequence with the highest score among the Chinese character sequences output by the processing unit 102 as the final transcription result, and output it.
In a possible embodiment, the output unit 103 uses a beam search algorithm to control the number of Chinese characters output by the third processing unit 1023 at each step, sorts them by score from high to low, and feeds each of the top ten Chinese characters back into the decoder to output the next Chinese character, until the decoder outputs the end symbol. At the end, the Chinese character sequence with the highest score is taken as the final transcription result.
Fig. 2 is a flowchart of an online end-to-end speech transcription method according to the present application. Referring to Fig. 2, the online end-to-end speech transcription method includes steps S201 to S205:
Step S201: acquire an audio file and extract acoustic features from the audio file.
Log-mel spectrum features are extracted from the audio as frame-level acoustic features. Specifically, the acquired audio file is pre-emphasized to boost its high-frequency part, i.e. the speech signal in the audio file is passed through a high-pass filter:
H(z) = 1 - 0.97z^(-1)    (1)
The audio in the audio file is split into frames and windowed, with a frame length of 25 ms, a frame shift of 10 ms and a Hamming window as the window function.
A fast Fourier transform is applied to each frame to obtain its spectrum, which is further processed to obtain the energy spectrum of each frame.
The energy passing through each mel filter is computed from the energy spectrum of each frame and the logarithm is taken, giving the log-mel spectrum; the number of mel filters is 80, so each frame yields an 80-dimensional log-mel spectrum feature.
Step S202: perform a nonlinear transformation and downsampling on the acoustic features and output a first feature sequence.
The extracted acoustic features are input into the front-end neural network, which applies a nonlinear transformation and downsampling to the acoustic features and outputs the first feature sequence; the downsampling rate is one quarter.
In a possible embodiment, the constructed front-end neural network contains two two-dimensional convolutional layers, one linear layer and one positional encoding layer. The convolutional layers have a kernel size of 3, a stride of 2 and 256 kernels; after the two two-dimensional convolutional layers, the length of the feature sequence becomes one quarter of its original length. The linear layer projects the output features of the convolutional layers to 256 dimensions, and the positional encoding layer adds the 256-dimensional positional features to the output features of the linear layer. The overall computation is as follows:
Y = ReLU(Conv(ReLU(Conv(X))))    (2)
z_i = Linear(y_i) + p_i    (3)
where Conv(·) denotes a convolutional layer and ReLU(·) denotes the activation function, whose expression is:
ReLU(x) = max(0, x)    (4)
Linear(·) denotes the linear layer; X and Y denote the log-mel spectrum feature sequence and the output feature sequence of the two two-dimensional convolutional layers, respectively; y_i, p_i and z_i denote the i-th output feature of the two two-dimensional convolutional layers, the i-th positional feature and the i-th output feature of the front-end neural network, respectively. Each dimension of the positional feature is computed as:
p_(i,2k+1) = sin(i / 10000^(k/128))    (5)
p_(i,2k+2) = cos(i / 10000^(k/128))    (6)
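A small sketch of how the positional features of formulas (5) and (6) can be tabulated is shown below; 0-based indexing is used here, whereas the formulas count dimensions from 1.

```python
import torch

def sinusoidal_positions(max_len, d_model=256):
    """Positional features of formulas (5)-(6) for d_model = 256: even (0-based) dimensions
    hold sin(i / 10000^(k/128)) and odd dimensions hold cos(i / 10000^(k/128))."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)                      # index i
    div = torch.pow(10000.0, torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe                                                                          # added to z_i
```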
Step S203: divide the first feature sequence into blocks, input each block of the feature sequence into the encoder in turn and output multiple sets of second feature sequences.
In a possible embodiment, each block of the feature sequence is input into the online encoder based on the self-attention mechanism. The encoder is a stack of 12 identical sub-modules, each sub-module consisting, in order, of a self-attention network layer, a residual network layer, a layer normalization layer, a fully connected layer, a residual network layer and a layer normalization layer; the output features of every layer are 256-dimensional.
The first feature sequence is divided into blocks, each block of the feature sequence having a length of 64. Each block of the feature sequence is input into the encoder for processing; the processing flow is shown in Fig. 3 and includes steps S2031 to S2038:
Step S2031: input each block of the feature sequence into the self-attention network of the first module of the encoder, computed as follows:
SAN(Q, K, V) = [head_1, …, head_4] W^O    (7)
where W_i^Q, W_i^K, W_i^V and W^O denote parameter matrices; Q denotes one block of the feature sequence with 64 future input features concatenated on its right; K denotes the same block of the feature sequence with 64 historical input features concatenated on its left and 64 future input features concatenated on its right; and V is identical to K.
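The following is a hedged PyTorch sketch of this chunk-level self-attention; scaled dot-product attention with four 64-dimensional heads is assumed (the patent's formula (8) is not reproduced in this text), and dropping the look-ahead positions from the output is likewise an assumption.

```python
import torch
import torch.nn as nn

class ChunkSelfAttention(nn.Module):
    """Sketch of the encoder self-attention of formula (7) over one chunk: the query is the
    current chunk plus 64 future frames, the key/value additionally prepend 64 history frames,
    and V is identical to K. Standard multi-head attention is assumed."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, chunk, history, future):            # each: (batch, length, 256)
        q = torch.cat([chunk, future], dim=1)             # chunk + 64 future frames
        kv = torch.cat([history, chunk, future], dim=1)   # 64 history frames + chunk + 64 future frames
        out, _ = self.attn(q, kv, kv)                     # V identical to K
        return out[:, : chunk.size(1)]                    # keep current-chunk positions (an assumption)
```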
Step S2032: add the input and output features of the self-attention network to form the output features of the residual network.
Step S2033: input the output features of the residual network into the layer normalization network.
In the layer normalization network, the mean μ and the variance var are computed for each frame of the input feature h, each dimension of h is normalized and linearly transformed through the model parameters w_i and b_i, and a new feature sequence is output.
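Functionally this is standard layer normalization; a short sketch under that assumption (with an assumed epsilon for numerical stability) is:

```python
import torch
import torch.nn as nn

class FrameLayerNorm(nn.Module):
    """Layer normalization as described for step S2033: per-frame mean and variance over the
    256 feature dimensions, followed by a learned per-dimension scale w and bias b."""
    def __init__(self, d_model=256, eps=1e-6):
        super().__init__()
        self.w = nn.Parameter(torch.ones(d_model))
        self.b = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, h):                                    # h: (..., 256)
        mu = h.mean(dim=-1, keepdim=True)                    # per-frame mean
        var = h.var(dim=-1, unbiased=False, keepdim=True)    # per-frame variance
        return self.w * (h - mu) / torch.sqrt(var + self.eps) + self.b
```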
Step S2034: input the output features of the layer normalization network into the fully connected network, whose computation formula is:
F(x) = max(0, xW_1 + b_1)W_2 + b_2    (12)
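A direct sketch of formula (12) follows; the inner width (2048 here) is an assumption, as the description only fixes the 256-dimensional input and output.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Fully connected network of formula (12): F(x) = max(0, x W1 + b1) W2 + b2."""
    def __init__(self, d_model=256, d_inner=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_inner)
        self.w2 = nn.Linear(d_inner, d_model)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))
```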
Step S2035: add the input and output features of the fully connected network to form the output features of the residual network.
Step S2036: input the output features of the residual network into the layer normalization network; the computation formula is the same as in step S2033.
Step S2037: determine whether the current module is the last sub-module; if the current module is the last sub-module, end the encoder computation flow; otherwise, perform step S2038.
Step S2038: save the output features of the layer normalization network as the input of the self-attention network of the next sub-module, and perform step S2032.
The saved output features of the layer normalization network also serve as the historical frames to be concatenated when the next block of features is processed, and the output features are input into the next sub-module.
In a possible embodiment, the first feature sequence input into the online encoder and the second feature sequence output by the online encoder have the same length.
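Putting the chunking together, a hedged sketch of the streaming pass over the encoder is shown below. In the description above the historical frames are cached per sub-module from the layer normalization outputs (step S2038); this sketch simplifies that to caching at the encoder input, and `encoder` stands for the whole 12-sub-module stack.

```python
import torch

def encode_stream(encoder, feats, chunk_len=64, ctx=64):
    """Sketch of step S203: split the first feature sequence into blocks of 64 frames, encode
    each block with 64 history frames and up to 64 look-ahead frames, and concatenate the
    outputs, so the second feature sequence keeps the length of the first."""
    outputs, history = [], feats[:, :0]                        # empty history before the first chunk
    for start in range(0, feats.size(1), chunk_len):
        chunk = feats[:, start : start + chunk_len]
        future = feats[:, start + chunk_len : start + chunk_len + ctx]
        out = encoder(chunk, history, future)                  # one pass through the 12 sub-modules
        outputs.append(out)
        history = chunk[:, -ctx:]                              # last frames become the next history
    return torch.cat(outputs, dim=1)                           # same length as the input sequence
```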
Step S204: process the second feature sequences, output multiple sets of Chinese character sequences and score the multiple sets of Chinese character sequences.
The second feature sequences output by the online encoder are modeled by the online decoder. The online decoder based on the self-attention mechanism is a stack of 6 identical sub-modules, each sub-module consisting, in order, of a self-attention network layer, a residual network layer, a layer normalization layer, a truncated attention network layer, a residual network layer, a layer normalization layer, a fully connected layer, a residual network layer and a layer normalization layer; the output features of every layer are 256-dimensional.
The second feature sequences output by the online encoder are processed by the online decoder. The processing flow is shown in Fig. 4 and includes the following steps:
Step S2041: input the feature sequence formed by adding the word embedding feature of the start symbol to the positional feature, together with the feature sequence output by the decoder, into the self-attention network. The computation in the self-attention network is the same as in formulas (7) and (8), where Q denotes one input feature, K denotes the feature sequence formed by the current input feature and all historical input features, and V is identical to K.
Step S2042: add the input and output features of the self-attention network to form the output features of the residual network.
Step S2043: input the output features of the residual network into the layer normalization network; the computation formula is the same as in step S2033.
Step S2044: input the i-th output feature q_i of the layer normalization network into the truncated attention network, which computes a value p_ij for each output feature k_j of the encoder in turn, for j = 1, 2, ….
When some j satisfies p_ij > 0.5 and j is greater than the previous truncation point, it is established as the truncation point of the current truncated attention network, and the output feature is computed.
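The patent's formulas (13) and (14) are not reproduced in this text, so the sketch below only illustrates the idea of the truncated attention under explicit assumptions: p_ij is taken here as a sigmoid of a scaled dot product between the decoder query and each encoder feature, and the output is assumed to be ordinary soft attention over the encoder features up to the truncation point.

```python
import torch

def truncated_attention(q, enc, prev_trunc):
    """Hedged sketch of step S2044. q: (d,) decoder query; enc: (T, d) encoder features;
    prev_trunc: previous truncation point (-1 before the first character)."""
    d = q.size(-1)
    p = torch.sigmoid((enc @ q) / d ** 0.5)                   # assumed form of p_ij, j = 1, 2, ...
    trunc = max(prev_trunc, 0)
    for j in range(prev_trunc + 1, enc.size(0)):
        if p[j] > 0.5:                                        # first j past the previous truncation point
            trunc = j
            break
    ctx = enc[: trunc + 1]                                    # encoder features up to the truncation point
    alpha = torch.softmax((ctx @ q) / d ** 0.5, dim=0)        # assumed soft attention over that prefix
    return alpha @ ctx, trunc                                 # output feature and new truncation point
```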
Step S2045: add the input and output features of the truncated attention network to form the output features of the residual network.
Step S2046: input the output features of the residual network into the layer normalization network; the computation formula is the same as in step S2033.
Step S2047: input the output features of the layer normalization network into the fully connected network; the computation formula is the same as in step S2034.
Step S2048: add the input and output features of the fully connected network to form the output features of the residual network, and input the output features of the residual network into the layer normalization network; the computation formula is the same as in step S2033.
Step S2049: determine whether the current module is the last sub-module; if the current module is the last sub-module, perform step S20411; otherwise, perform step S20410.
Step S20410: save the output features of the layer normalization network as the input of the self-attention network of the next sub-module, and perform step S2042.
Step S20411: input the output features of the layer normalization network into the Chinese character classifier, which outputs multiple groups of Chinese characters and their corresponding scores; the word embedding of each Chinese character is added to the positional feature and input into the decoder to predict the next Chinese character.
Step S205: take the Chinese character sequence with the highest score as the final transcription result.
A beam search algorithm is used to control the number of Chinese characters output by the decoder at each step and to sort them by score from high to low; each of the top ten Chinese characters is fed back into the decoder to output the next Chinese character, until the decoder outputs the end symbol. At the end, the Chinese character sequence with the highest score is taken as the final transcription result.
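A hedged sketch of this decoding loop is given below; `decoder` stands for the 6-sub-module online decoder together with the Chinese character classifier and is assumed to map a partial character sequence and the encoder features to a vector of log scores over the vocabulary, and the symbol ids are placeholders.

```python
import torch

def beam_search(decoder, enc_out, sos_id, eos_id, beam=10, max_len=100):
    """Sketch of step S205: keep the top ten hypotheses at every step, feed each back into the
    decoder to predict the next character, stop a hypothesis when it emits the end symbol, and
    return the highest-scoring character sequence."""
    hyps = [([sos_id], 0.0)]                                            # (character sequence, score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in hyps:
            log_probs = decoder(torch.tensor([seq]), enc_out)[0, -1]    # scores for the next character
            top_p, top_i = log_probs.topk(beam)
            for p, i in zip(top_p.tolist(), top_i.tolist()):
                candidates.append((seq + [i], score + p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        hyps = []
        for seq, score in candidates[:beam]:                            # keep the top ten
            (finished if seq[-1] == eos_id else hyps).append((seq, score))
        if not hyps:                                                    # every hypothesis has ended
            break
    best = max(finished or hyps, key=lambda c: c[1])
    return best[0][1:]                                                  # drop the start symbol
```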
Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in the present invention may be implemented in hardware, software, firmware or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the objectives, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit the scope of protection of the present invention; any modification, equivalent replacement, improvement or the like made on the basis of the technical solution of the present invention shall fall within the scope of protection of the present invention.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911415035.0A CN111128191B (en) | 2019-12-31 | 2019-12-31 | Online end-to-end voice transcription method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911415035.0A CN111128191B (en) | 2019-12-31 | 2019-12-31 | Online end-to-end voice transcription method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111128191A CN111128191A (en) | 2020-05-08 |
CN111128191B true CN111128191B (en) | 2023-03-28 |
Family
ID=70506638
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911415035.0A Active CN111128191B (en) | 2019-12-31 | 2019-12-31 | Online end-to-end voice transcription method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111128191B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111833886B (en) * | 2020-07-27 | 2021-03-23 | 中国科学院声学研究所 | Fully connected multi-scale residual network and its method for voiceprint recognition |
CN115019785A (en) * | 2022-05-24 | 2022-09-06 | 中国科学院自动化研究所 | Streaming voice recognition method and device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101493812A (en) * | 2009-03-06 | 2009-07-29 | 中国科学院软件研究所 | Tone-character conversion method |
CN107590135A (en) * | 2016-07-07 | 2018-01-16 | 三星电子株式会社 | Automatic translating method, equipment and system |
CN108932941A (en) * | 2017-10-13 | 2018-12-04 | 北京猎户星空科技有限公司 | Audio recognition method, device and computer equipment, storage medium and program product |
CN109146066A (en) * | 2018-11-01 | 2019-01-04 | 重庆邮电大学 | A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition |
CN110473529A (en) * | 2019-09-09 | 2019-11-19 | 极限元(杭州)智能科技股份有限公司 | A kind of Streaming voice transcription system based on from attention mechanism |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018071389A1 (en) * | 2016-10-10 | 2018-04-19 | Google Llc | Very deep convolutional neural networks for end-to-end speech recognition |
US20180330718A1 (en) * | 2017-05-11 | 2018-11-15 | Mitsubishi Electric Research Laboratories, Inc. | System and Method for End-to-End speech recognition |
- 2019-12-31: CN application CN201911415035.0A granted as patent CN111128191B (en), status Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101493812A (en) * | 2009-03-06 | 2009-07-29 | 中国科学院软件研究所 | Tone-character conversion method |
CN107590135A (en) * | 2016-07-07 | 2018-01-16 | 三星电子株式会社 | Automatic translating method, equipment and system |
CN108932941A (en) * | 2017-10-13 | 2018-12-04 | 北京猎户星空科技有限公司 | Audio recognition method, device and computer equipment, storage medium and program product |
CN109146066A (en) * | 2018-11-01 | 2019-01-04 | 重庆邮电大学 | A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition |
CN110473529A (en) * | 2019-09-09 | 2019-11-19 | 极限元(杭州)智能科技股份有限公司 | A kind of Streaming voice transcription system based on from attention mechanism |
Non-Patent Citations (1)
Title |
---|
Vaswani, Ashish et al.; "Attention Is All You Need"; 31st Conference on Neural Information Processing Systems; 2017-12-09; pp. 1-11 *
Also Published As
Publication number | Publication date |
---|---|
CN111128191A (en) | 2020-05-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11948066B2 (en) | Processing sequences using convolutional neural networks | |
CN112489635B (en) | Multi-mode emotion recognition method based on attention enhancement mechanism | |
CN111477221B (en) | Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network | |
CN111627418B (en) | Training method, synthesizing method, system, device and medium for speech synthesis model | |
Du et al. | Aishell-2: Transforming mandarin asr research into industrial scale | |
CN110992987B (en) | Parallel feature extraction system and method for general specific voice in voice signal | |
CN111048082B (en) | Improved end-to-end speech recognition method | |
CN112466326B (en) | A Speech Emotion Feature Extraction Method Based on Transformer Model Encoder | |
CN109829058B (en) | Classification recognition method for improving dialect recognition accuracy based on multitask learning | |
CN103474069B (en) | For merging the method and system of the recognition result of multiple speech recognition system | |
US20230127787A1 (en) | Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium | |
CN109448707A (en) | Speech recognition method and device, device and medium | |
CN111339278B (en) | Method and device for generating training speech generating model and method and device for generating answer speech | |
CN113327585B (en) | Automatic voice recognition method based on deep neural network | |
CN114333852A (en) | Multi-speaker voice and human voice separation method, terminal device and storage medium | |
CN111128191B (en) | Online end-to-end voice transcription method and system | |
US11810233B2 (en) | End-to-end virtual object animation generation method and apparatus, storage medium, and terminal | |
JP2022551678A (en) | Speech recognition method, device and electronic equipment | |
CN117672268A (en) | Multi-mode voice emotion recognition method based on relative entropy alignment fusion | |
CN113450758A (en) | Speech synthesis method, apparatus, device and medium | |
WO2019184942A1 (en) | Audio exchanging method and system employing linguistic semantics, and coding graph | |
CN113450761A (en) | Parallel speech synthesis method and device based on variational self-encoder | |
CN115240702B (en) | Voice separation method based on voiceprint characteristics | |
CN116312617A (en) | Voice conversion method, device, electronic equipment and storage medium | |
CN114373443A (en) | Speech synthesis method and apparatus, computing device, storage medium, and program product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||